Architecture Deep Research: the questions that keep coming up

After the last post on ADR, the same handful of questions kept coming back. Here are the ones I keep answering.

"Why can't a coding agent just plan the architecture? Isn't that what a planner step is for?"

Two reasons.

The first is scope. Coding-agent planning operates inside one session with one context window. It can reason carefully about what is in that window. It cannot reason about what is not: the current deployment topology, the data your warehouse already has, the migration the infra team started last sprint, the vendor your legal team flagged in March, the postmortem from the incident that quietly killed your prior queue choice. Architecture decisions depend on those facts. A planner step does not go acquire them. A research loop does.

The second reason is incentive shape. A coding agent is rewarded for completion. If you ask it for an architecture, it produces one. The shape of "produce a plan" punishes saying "I do not have enough evidence." ADR is shaped the opposite way — it produces a research report on the option space, not a pick, so "the evidence is thin on this axis" becomes a first-class section in the report (under "What the evidence does not show" per candidate) instead of being smuggled into a confident answer.

"Why deep research at all? The frontier models have read the whole web. They already know."

They know the surface.

Training data is heavily biased toward popular content. There is more "Why I love Kafka" than "Why we migrated off Kafka." More README files than ARCHITECTURE.md files. More launch blog posts than postmortems. A frontier model can recite the canonical pitch for any well-known architecture pattern. That is a different skill from telling you which pattern your specific product should bet on.

Live research swaps the inventory. Instead of averaging over what was popular when the model trained, it pulls the current state of the projects and papers that match this domain, this week. A repo that quietly stopped getting commits, a paper whose measured results came in much lower than its abstract claimed, a candidate that everyone loved two years ago but whose maintainers have since moved on: those facts only show up if you go look.

"When we prototyped deep research, the agent just scraped README files. That is not enough. What changes?"

This is the real failure mode of off-the-shelf research agents pointed at code.

The README is the marketing surface of an open-source project. It tells you what the project wants to be. It does not tell you what is hard about owning it. The architecture evidence lives elsewhere:

ARCHITECTURE.md or docs/architecture/ — when it exists, the authors already wrote down the trade-offs they accepted.
The top-level directory layout — packages/, services/, cmd/, internal/, worker/, proto/. Five seconds of scanning tells you the actual decomposition.
Recent closed issues filtered for migration / perf / incident / regression keywords — this is where production reality leaks out.
Last push, contributor count, license, topics — is this alive, is it usable, is anyone left.
The papers cited in the README — what is the theoretical backing, and does that paper actually report what the README claims it reports.

ADR's inspectGithubRepo reads all of those, not just the README. Each evidence item carries a repo_digest so a reviewer can audit exactly what the kernel saw. Papers go through digestPaper, which extracts structured {problem, methodology, datasets, baselines, headline_results, measured_results, limitations} — and marks abstract-only digests so a paper's headline cannot be quoted as a measured result.
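The abstract-only guard is easier to see in code. What follows is a minimal sketch, not ADR's implementation: the field names follow the list above, while the PaperDigest class, the abstract_only flag name, and the quotable_measurements helper are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PaperDigest:
    """Hypothetical digest shape; field names mirror the list in the text."""
    problem: str
    methodology: str
    datasets: list[str] = field(default_factory=list)
    baselines: list[str] = field(default_factory=list)
    headline_results: list[str] = field(default_factory=list)
    measured_results: list[str] = field(default_factory=list)
    limitations: list[str] = field(default_factory=list)
    abstract_only: bool = False  # assumed name for the "abstract-only" marker


def quotable_measurements(digest: PaperDigest) -> list[str]:
    """Return the results that may be cited as measured.

    An abstract-only digest yields nothing, because its headline claims
    were never checked against the body of the paper.
    """
    if digest.abstract_only:
        return []
    return list(digest.measured_results)
```

The point of the design is that headline_results and measured_results are separate fields, so a synthesis step has to choose which one it is quoting.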

The shallow-search failure mode is the default. Going one layer deeper is most of the work.
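One layer deeper can be as small as the closed-issue filter from the list above. This is a rough sketch under stated assumptions: issues are plain dicts with the state, title, and body keys that GitHub's REST API returns for issues, and the matching rule (a case-insensitive prefix match on the four keywords) is my choice, not necessarily what inspectGithubRepo does.

```python
import re

# Keywords come from the list above; the regex matching rule is an assumption.
KEYWORDS = ("migration", "perf", "incident", "regression")
PATTERN = re.compile(r"\b(" + "|".join(KEYWORDS) + r")", re.IGNORECASE)


def production_signals(issues):
    """Keep closed issues whose title or body mentions one of the keywords."""
    hits = []
    for issue in issues:
        if issue.get("state") != "closed":
            continue
        # The body of an issue can be null, so fall back to an empty string.
        text = f"{issue.get('title') or ''}\n{issue.get('body') or ''}"
        if PATTERN.search(text):
            hits.append(issue)
    return hits
```

Because the pattern anchors only at the start of a word, "perf" also catches "performance"; tighten it if that is too noisy for a given repo.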

"How does ADR connect to internal context that is not on the public web?"

Architecture decisions are about your context: the data you actually hold, the services you actually run, the vendor your legal team flagged in March. None of that is on the public web. ADR has three doors for it.

The first is the product brief. ADR takes a structured input file — a PRD, an architecture brief, or whatever you write — and treats it as the source of truth for product context. The brief is where the constraints that are not on GitHub live: data volumes, query shapes, compliance envelope, team maturity, operating budget. The strategic context matrix is derived from this input, and the comparison-matrix axes come out of it.

The second is the private MCP corpus. Set ADR_MCP_SERVER_URL and the search machinery can hit a read-only remote MCP corpus through OpenAI's hosted MCP tool. Use ADR_SEARCH_PROVIDER=mcp or ADR_PRIVATE_MCP_ONLY=1 to force private-corpus search. This is where your internal architecture wiki, prior decision records, and incident postmortems can live without leaking to public search providers.
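To make the three variables concrete, here is a small sketch of how a launcher script might read them. Only the variable names and the "force private-corpus search" rule come from the text above. The private_search_config helper, its return shape, and the decision to raise when search is forced without a server URL are assumptions, not ADR's actual behavior.

```python
import os


def private_search_config(env=None):
    """Resolve private-corpus settings from ADR_* environment variables."""
    env = os.environ if env is None else env
    url = env.get("ADR_MCP_SERVER_URL", "").strip()
    # Either setting forces private-corpus search, per the description above.
    forced = (
        env.get("ADR_SEARCH_PROVIDER", "").lower() == "mcp"
        or env.get("ADR_PRIVATE_MCP_ONLY") == "1"
    )
    if forced and not url:
        # Assumed policy: fail loudly rather than fall back to public search.
        raise ValueError("private search forced but ADR_MCP_SERVER_URL is unset")
    return {"mcp_url": url or None, "private_only": forced}
```

Failing fast here is a deliberate choice for the sketch: a silent fallback to a public provider is exactly the leak the private corpus is meant to prevent.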

The third is durable agent memory. Once an ADR run finishes, the memory pack lands on your Architect bee. From that point forward, future runs ask the memory before they ask the web. The internal context is no longer something you re-explain on every run.

"Why does the Beevibe philosophy help here? Couldn't ADR be a standalone tool?"

It could be. We tried that shape first. It is worse, for a specific reason.

A standalone architecture researcher is a one-shot agent. You give it a brief, it produces an artifact, you read the artifact, the agent goes away. The artifact does not know who you are. The next run is just as cold as the first. Whatever the Architect figured out about your team's constraints lives in a folder on a laptop, not in a system anyone else queries.

Beevibe gives ADR four things a standalone tool cannot:

A named specialist. Architect is an Agent row with a hierarchy level, a parent, and a review policy. It is addressable. Other specialists can ask it. Humans can review what it shipped.
Durable memory. The Architect bee remembers what your team decided, what got rejected, what assumptions changed. Memory is scoped to your team and persists across runs.
Mesh escalation. When an architecture call depends on a constraint Architect does not own — a deployment limit, a compliance line, a data-ownership question — it routes to the specialist that does. The constraint comes back as a memory fact, not a guess.
Self-hosted, your keys. ADR runs inside Beevibe, which runs inside your infrastructure. The product brief, the private MCP corpus, the durable memory, and the decision records all stay where your code already is.

The Beevibe philosophy is not "agents are smarter together." It is more specific: research that does not accumulate is half a product. The mesh, the memory, and the self-hosted runtime are what turn one ADR run into an architecture practice the team can keep.

"How does ADR help teams accumulate knowledge across decisions instead of re-deciding from scratch every time?"

Two mechanisms.

First, the Beevibe mesh handoff. Every ADR run produces a memory pack the Architect specialist writes to durable memory: selected topology, rejected alternatives with reasons, citation IDs, failure modes, forbidden topologies. Future runs query that memory before going to the web. If your team has already decided "no new queue technology this fiscal year," the next ADR run sees that and weighs it accordingly. If a candidate that was rejected six months ago has shipped a major fix, the run surfaces the change so you can re-evaluate explicitly.
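A sketch of what "query memory before going to the web" could look like. The MemoryPack fields mirror the list above, but the class, the triage_candidates helper, and the three bucket names are hypothetical, chosen only to illustrate the flow.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MemoryPack:
    """Hypothetical memory pack; fields follow the list in the text."""
    selected_topology: str
    rejected_alternatives: dict[str, str] = field(default_factory=dict)  # candidate -> reason
    citation_ids: list[str] = field(default_factory=list)
    failure_modes: list[str] = field(default_factory=list)
    forbidden_topologies: list[str] = field(default_factory=list)


def triage_candidates(candidates: list[str], pack: MemoryPack) -> dict:
    """Sort candidates using prior decisions before any web search runs."""
    result = {"forbidden": [], "previously_rejected": {}, "needs_research": []}
    for name in candidates:
        if name in pack.forbidden_topologies:
            result["forbidden"].append(name)
        elif name in pack.rejected_alternatives:
            # Kept, not dropped: the old reason is surfaced for re-evaluation.
            result["previously_rejected"][name] = pack.rejected_alternatives[name]
        else:
            result["needs_research"].append(name)
    return result
```

Previously rejected candidates stay visible along with their original reason, which is what lets a later run notice that the reason no longer holds.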

Second, the supersede flow. When implementation evidence contradicts an earlier architecture spec, you do not just patch the code. You run a superseding ADR. The new run writes supersedes.json, appends a supersession section to the architecture decision record, and pulls the prior decision's evidence forward. The decision history becomes a versioned artifact that mirrors the codebase. Future architects on your team — human or agent — can walk the chain.
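Walking the chain can be sketched in a few lines. Only the supersedes.json filename comes from the text. The one-directory-per-run layout, the "supersedes" and "reason" keys, and both function names are assumptions for illustration; the real file may be shaped differently.

```python
from __future__ import annotations

import json
from pathlib import Path


def write_supersedes(run_dir: Path, prior_run: str, reason: str) -> Path:
    """Record which earlier run this run replaces (hypothetical schema)."""
    path = run_dir / "supersedes.json"
    path.write_text(json.dumps({"supersedes": prior_run, "reason": reason}, indent=2))
    return path


def decision_chain(runs_root: Path, run_id: str) -> list[str]:
    """Follow supersedes.json links from the newest run back to the first."""
    chain: list[str] = []
    seen: set[str] = set()
    current = run_id
    while current and current not in seen:  # the seen set guards against cycles
        seen.add(current)
        chain.append(current)
        link = runs_root / current / "supersedes.json"
        if not link.exists():
            break
        current = json.loads(link.read_text()).get("supersedes")
    return chain
```

Each link stores only its immediate predecessor, so the full history is reconstructed by traversal, much as a commit log is.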

Architecture decisions start to compound the way code does. The team's accumulated calls become a private corpus that shapes the next research run.

"What does ADR not do yet?"

A few real gaps.

It does not negotiate with humans mid-run. The current loop is research → synthesize → emit artifact. If a reviewer disagrees with a single cell of the comparison matrix, the path is to write a critique input and re-run, not to interrupt the run live.

It does not yet learn across organizations. Memory is scoped to your Architect bee. The community-maintained capability registry handles cross-team OSS reuse; architecture-level cross-team learning is harder, and we have not shipped it.

Its research is bounded by the live search providers you wire up: Brave, Tavily, Serper, SearXNG, or OpenAI's hosted web_search. A private corpus through MCP is supported, but you have to bring it. ADR cannot read what is behind a login wall it does not have credentials for.

And the obvious one — it is a research loop, not a code generator. The output is the artifact set and the execution handoff. Implementation is downstream, and the coding agent on the other side still has to do its job.

The repo is open: github.com/beevibe-ai/architecture-deep-research.

Next post: the cross-team learning problem, and what shipping a shared architecture memory would look like.
