Engineering · May 2026 · 6 min read

Architecture Deep Research: the layer before the coding agent

Post 03 in the Beevibe series.

A founder I know said it over lunch last week. Half the AI panel circuit is saying some version of it: "AI is going to replace software engineers. Not system architects. Architecture is taste. Architecture is judgment. You can't automate that."

I nodded. Then on the walk back to the office I started to wonder. If writing the code is getting automated, and designing the system is the last bastion of human judgment, isn't that just because nobody has tried hard enough yet? Architects look irreplaceable for the same reason coders looked irreplaceable five years ago. The work has not been broken down into a loop you can run.

That is the engineering question. Not "can AI replace architects" but "what would it take to build the loop?"

Quietly, I have been running a version of the loop by hand for months. When I hit the design phase of anything non-trivial, I open Gemini and ask it to dig up how the well-known open-source projects, the architects with track records, and the papers in the area have solved this kind of problem. I read what it brings back. Then I hand Claude Code the direction and let it execute the implementation.

When I describe this workflow, the pushback I get is always the same: Gemini is worse than Claude at coding. Maybe it is. But the Gemini half of the loop is not coding. It is reading what the strong open-source projects and the published architects already worked out and reporting back honestly. That is a different job, and the choice of model is getting argued on the wrong axis.

Step back and the bigger picture is hard to ignore. This is the first moment in history when reading what the strong open-source projects, the working papers, and the architects with track records actually figured out costs minutes instead of months. The raw material has never been more accessible. Most teams skip past it anyway. We hand a one-line product brief to a coding agent and hope the model fills in the architecture from training data.

What teams lean on instead is the gut of the senior engineer in the room. The gut is good. It was earned. But it was earned in the world before agents existed. What is cheap, what is expensive, what is risky, what is worth owning: all of it has shifted. A pre-AI gut making post-AI architecture calls produces a slow drift that nobody notices until the second migration.

The strangest part: we already have deep research products for almost everything else. Markets. Legal. Medicine. Competitive teardown. The pattern shipped first for the easier problems. Architecture, where the cost of being wrong is highest and lives in production the longest, is still done in a chat box.

And the chat box default falls apart on its own. I asked a coding agent to design and build a deep research feature. It scaffolded a worker, a queue, a vector store, and a streaming endpoint inside the first hour. The code compiled. The tests passed. The architecture was wrong.

The worker had no place to live in our deployment. The queue duplicated a service the infra team was already in the middle of deleting. The vector store assumed a corpus we did not have. The streaming endpoint forced a client contract our mobile team was not going to ship for another release.

None of that came from bad code. It came from a missing decision one layer up. The agent picked the easiest local implementation path before anyone had picked the right architecture family — because nothing in the loop forced that decision to happen first.

That gap is what Architecture Deep Research is for.

What ADR is

ADR is a Beevibe open-source project: a live, evidence-only research loop for strategic architecture decisions. The repo is at github.com/beevibe-ai/architecture-deep-research.

It does not write product code. It answers one question:

Given this product, domain, data shape, compliance envelope, team maturity, and operating budget, which architecture family should we bet on before a coding agent writes the first file?

The output is a structured research report for a human to read, plus a parallel set of audit-ready artifacts a coding agent can consume: research-report.json with per-candidate sections, a knowledge map with citations, an evidence pool, citation and claim audits, and an evaluation pack. The posture is the same as OpenAI Deep Research, Perplexity Deep Research, and Gemini Deep Research, but the scope is narrower: architectural decisions, not topics.

Maps the space, doesn't pick

The piece that makes ADR different from a generic deep research agent is what it refuses to do. ADR does not pick a winner. The decision is yours; the kernel maps what is available and shows the evidence depth behind each section so you can weight it.

A general deep research agent collects sources, summarizes them, and tells you the answer. The summarizer averages over whatever the web says.

ADR keeps every architecture candidate the evidence pool surfaced. Each one gets its own section in the report — what the evidence shows, what the evidence does not show, strong and weak axes, and the citations behind every claim. Candidates with one well-cited claim are tagged thin. Candidates with five or more corroborating claims are thick. You read the depth tag and weight the section accordingly.
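The depth tag can be sketched in a few lines. This is an illustration under stated assumptions, not ADR's actual code: the `depthTag` function and the `"unspecified"` value are hypothetical. The text above names only the two ends of the scale, so the sketch leaves the middle unlabeled rather than guess.

```typescript
// Hypothetical sketch of the depth tag; names are not taken from the ADR repo.
type DepthTag = "thin" | "thick" | "unspecified";

// Only the two ends are defined in the post: one well-cited claim is "thin",
// five or more corroborating claims is "thick". Any other count (including
// two to four claims) stays "unspecified" here instead of being guessed.
function depthTag(corroboratingClaims: number): DepthTag {
  if (corroboratingClaims >= 5) return "thick";
  if (corroboratingClaims === 1) return "thin";
  return "unspecified";
}

// Tag a few example claim counts.
const exampleTags = [1, 3, 5, 8].map(depthTag);
```

The point of the tag is that it travels with the section, so a reader can discount a thin candidate without opening the evidence pool.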

If no candidates surface at all, the mode is deferred. That is a first-class result, not a fallback. ADR will say "the evidence pool is too thin to map this decision space" instead of inventing confidence.
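One way to make "deferred" a first-class result is to model it as its own branch of the return type, so callers cannot forget to handle it. The sketch below assumes hypothetical names throughout (`CandidateSection`, `MapResult`, `mapDecisionSpace`, and the `"mapped"` mode); only the `deferred` mode and its message come from the text above.

```typescript
// Hypothetical shapes; field and function names are assumptions for illustration.
interface CandidateSection {
  candidate: string;
  citationIds: string[];
}

type MapResult =
  | { mode: "mapped"; sections: CandidateSection[] }
  | { mode: "deferred"; reason: string };

// An empty candidate list yields an explicit "deferred" result instead of a
// report padded with invented confidence.
function mapDecisionSpace(sections: CandidateSection[]): MapResult {
  if (sections.length === 0) {
    return {
      mode: "deferred",
      reason: "the evidence pool is too thin to map this decision space",
    };
  }
  return { mode: "mapped", sections };
}
```

Because the two modes are separate union members, any consumer that reads `sections` has to check `mode` first.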

Five spaces, five different ways of reading

ADR doesn't crawl URLs and squash them into 1,600 characters of summary. Each URL gets routed to a reader that knows what to extract from that source class — because what a GitHub repo tells you is structurally different from what a paper, a blog post, an HN thread, or your own repo tells you.

Open source code. When a research task surfaces a GitHub URL, ADR reads the repository: README, ARCHITECTURE.md, top-level directory layout, stars, last push, license, topics, and recent closed issues filtered for failure-mode keywords. The evidence item carries a repo_digest so a reviewer can audit exactly what the model saw. Source type: mature_oss. With GITHUB_TOKEN set, the rate limit jumps from 60 to 5,000 calls an hour.
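The closed-issue filter might look roughly like this. The keyword list, the `ClosedIssue` shape, and the `failureModeIssues` function are all assumptions made for illustration; the post does not enumerate the keywords ADR uses.

```typescript
// Hypothetical keyword list and issue shape; neither is taken from the ADR repo.
const FAILURE_KEYWORDS = ["crash", "data loss", "memory leak", "timeout", "regression"];

interface ClosedIssue {
  title: string;
  closedAt: string; // ISO 8601 date
}

// Keep only closed issues whose title mentions a failure-mode keyword.
function failureModeIssues(issues: ClosedIssue[]): ClosedIssue[] {
  return issues.filter((issue) => {
    const title = issue.title.toLowerCase();
    return FAILURE_KEYWORDS.some((keyword) => title.indexOf(keyword) !== -1);
  });
}
```

A filter this simple will miss incidents described in other words, which is one reason to keep the raw digest auditable.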

Papers and benchmarks. When a URL points to arXiv, OpenReview, ACL, ACM, IEEE, or bioRxiv, the digest is structured: problem, methodology, datasets, baselines, headline results, measured results, ablations, limitations, conflicts of interest. Source type: paper_or_benchmark. Abstract-only digests are marked so a paper's headline claim cannot be quoted as a measured result.
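The abstract-only guard can be expressed as a small gate on the digest. This is a sketch with hypothetical field names (`abstractOnly`, `headlineResults`, `measuredResults`) and a hypothetical `quotableAsMeasured` function, not ADR's schema.

```typescript
// Hypothetical digest shape; field names are assumptions, not ADR's schema.
interface PaperDigest {
  title: string;
  abstractOnly: boolean;     // true when only the abstract was readable
  headlineResults: string[]; // claims as stated by the authors
  measuredResults: string[]; // results read from the full text
}

// Only a full-text digest may contribute results that are quoted as measured.
function quotableAsMeasured(digest: PaperDigest): string[] {
  return digest.abstractOnly ? [] : digest.measuredResults;
}
```

Headline claims from an abstract-only digest stay available as context; they just never reach the report labeled as measurements.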

Engineering writeups. Blog posts where teams describe their own architecture — Memgraph's engineering blog, Weaviate's deep-dives, Vercel's "how we built X" posts. Source type: engineering_writeup. These are the closest thing the public web has to ARCHITECTURE.md for closed-source products, and the report surfaces them when they're available.

Community discussion. Reddit threads, HN comment chains, Twitter, Stack Exchange. Source type: community_discussion, with the platform and the subreddit or story_id captured where available. These are the "we tried X, switched to Y, here's why" posts that capture what teams actually picked and what they regretted. The citation auditor uses a softer rule for community sources (≥60% significant-token overlap with the source excerpt) because discussion threads paraphrase. The synthesizer frames their claims as practitioner signal ("r/LocalLLaMA users report X"), not as architectural fact, so the reader knows how much confidence to place in them.
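A significant-token overlap check could be sketched as follows. Only the 60% threshold comes from the text above. The stopword list, the tokenizer, and the function names are assumptions, and ADR's definition of a "significant" token may differ.

```typescript
// Hypothetical stopword list; what counts as "significant" is an assumption.
const STOPWORDS = ["a", "an", "and", "the", "of", "to", "in", "is", "it", "for",
  "on", "we", "that", "with", "from", "because"];

// Lowercase, split on non-alphanumerics, drop stopwords and duplicates.
function significantTokens(text: string): string[] {
  const words: string[] = text.toLowerCase().match(/[a-z0-9]+/g) || [];
  const unique: string[] = [];
  for (const word of words) {
    if (STOPWORDS.indexOf(word) === -1 && unique.indexOf(word) === -1) {
      unique.push(word);
    }
  }
  return unique;
}

// Share of the claim's significant tokens that also appear in the excerpt.
function overlapRatio(claim: string, excerpt: string): number {
  const claimTokens = significantTokens(claim);
  if (claimTokens.length === 0) return 0;
  const excerptTokens = significantTokens(excerpt);
  const shared = claimTokens.filter((token) => excerptTokens.indexOf(token) !== -1);
  return shared.length / claimTokens.length;
}

// The softer community rule: at least 60% overlap with the source excerpt.
function passesCommunityAudit(claim: string, excerpt: string): boolean {
  return overlapRatio(claim, excerpt) >= 0.6;
}
```

Measuring overlap from the claim's side means a short, faithful paraphrase of a long thread can pass, while a claim that adds new specifics the thread never mentioned will not.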

Your own repo. The discover stage runs before deep research. It scans the user's repo for stack signals (existing dependencies, deploy shape, CODEOWNERS), patterns the team already follows, and antipatterns the team has explicitly rejected: TODO/FIXME hot spots, ARCHITECTURE.md "Rejected alternatives" sections, past architecture decision records in docs/adr/, and removed dependencies in git history. These flow into the evidence pool as private_corpus claims and shape the matrix. The same kernel that reads other people's GitHub repos reads yours first.
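The routing across these five source classes can be illustrated with a small classifier. The five `source_type` values are the ones named above; everything else is an assumption: the host lists, the `isUserRepo` flag, the helper names, and in particular the fallback that treats any unrecognized URL as an engineering writeup.

```typescript
// Hypothetical router; only the five source_type values come from the post.
type SourceType =
  | "mature_oss"
  | "paper_or_benchmark"
  | "engineering_writeup"
  | "community_discussion"
  | "private_corpus";

// Assumed host lists for illustration.
const PAPER_HOSTS = ["arxiv.org", "openreview.net", "aclanthology.org",
  "dl.acm.org", "ieeexplore.ieee.org", "biorxiv.org"];
const COMMUNITY_HOSTS = ["reddit.com", "news.ycombinator.com", "twitter.com",
  "stackexchange.com"];

// Extract the host with a regex so the sketch needs no runtime-specific URL type.
function hostOf(url: string): string {
  const match = /^https?:\/\/([^\/:?#]+)/i.exec(url);
  return match ? match[1].toLowerCase().replace(/^www\./, "") : "";
}

// True when host equals a known host or is a subdomain of one.
function matchesHost(host: string, known: string[]): boolean {
  return known.some((k) => host === k || host.slice(-(k.length + 1)) === "." + k);
}

function routeSource(url: string, isUserRepo: boolean = false): SourceType {
  if (isUserRepo) return "private_corpus";
  const host = hostOf(url);
  if (host === "github.com") return "mature_oss";
  if (matchesHost(host, PAPER_HOSTS)) return "paper_or_benchmark";
  if (matchesHost(host, COMMUNITY_HOSTS)) return "community_discussion";
  // Assumption: anything else is treated as a team's own writeup.
  return "engineering_writeup";
}
```

For example, `routeSource("https://old.reddit.com/r/LocalLLaMA/")` lands in `community_discussion`, so the excerpt would go to the reader that frames claims as practitioner signal.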

Reading peers two different ways

Picking architectures doesn't happen in the abstract. Teams look at three to five already-shipped comparable products and ask "what did they do, and does it apply to us?" ADR enumerates those peers (for example Logseq, Neo4j, Memgraph, ArangoDB, Roam Research, Mem.ai) and reads each one. But it reads them differently depending on what's actually available to read.

Open-source products with public source code get an evidence_strategy: "architecture" tag, and the research planner queries their GitHub, docs, and engineering blog. Closed-source consumer products like Obsidian, Roam Research, and Mem.ai get evidence_strategy: "adoption", and the research planner routes through community signal: Reddit threads about user experience, HN comments on architecture takes, "we tried X and switched to Y" migration write-ups, and reverse-engineering posts. Peers tagged with both strategies get both query sets.
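The two strategies can be sketched as two query generators that the planner combines per peer. The `"architecture"` and `"adoption"` values come from the text above; the `"both"` value, the `Peer` shape, and every query template here are hypothetical.

```typescript
// Hypothetical planner sketch; query templates are assumptions for illustration.
type EvidenceStrategy = "architecture" | "adoption" | "both";

interface Peer {
  name: string;
  evidence_strategy: EvidenceStrategy;
}

// Queries aimed at what the project publishes about itself.
function architectureQueries(peer: string): string[] {
  return [
    `${peer} github architecture`,
    `${peer} docs internals`,
    `${peer} engineering blog`,
  ];
}

// Queries aimed at what users and practitioners say about the product.
function adoptionQueries(peer: string): string[] {
  return [
    `${peer} reddit user experience`,
    `${peer} hacker news architecture`,
    `switched from ${peer} migration`,
    `${peer} reverse engineering`,
  ];
}

// A peer tagged "both" receives both query sets.
function planPeerQueries(peer: Peer): string[] {
  const arch = peer.evidence_strategy !== "adoption" ? architectureQueries(peer.name) : [];
  const adopt = peer.evidence_strategy !== "architecture" ? adoptionQueries(peer.name) : [];
  return arch.concat(adopt);
}
```

Keeping the two generators separate makes it easy to see which queries produced which kind of evidence when the pool is audited later.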

The point is that adoption signal isn't worse evidence than architecture documentation — it's different evidence. A million teams using Obsidian for their personal knowledge graph tells you something Neo4j's docs can't. ADR keeps both kinds in the evidence pool, frames each appropriately in the report, and lets them earn separate matrix axes (ecosystem_traction, integration_breadth, practitioner_pain_points) so the cross-cutting tradeoffs surface what each source class is good at telling you.

Why this is deeper than general deep research

General deep research agents — OpenAI Deep Research, Perplexity Deep Research, Gemini Deep Research — optimize for breadth on a topic. They cover the space and produce a long-form essay. ADR optimizes for decision-relevance within a narrower scope: every claim has to land in a candidate's section or on a matrix axis. The evidence pool isn't a corpus to summarize; it's a corpus to organize against options.

The deeper move is the source-class routing. A general agent treats every URL as a page to summarize. ADR treats a GitHub URL as a repository to read, a paper as a methodology to dissect, an HN thread as practitioner sentiment to surface as such, and your own repo as the constraint set the decision actually sits inside. The evidence pool a reviewer audits afterwards looks like a research file, not a clipping pile — and the report's per-candidate sections cite each claim back to a source whose type is on the record.

The comparison matrix is the input to synthesis

Before the synthesizer writes the report, ADR builds comparison-matrix.json: rows are candidates from the knowledge map, columns are axes derived from the product brief's query shapes, risk invariants, and operational envelope. Each cell carries a verdict — strong, mixed, weak, or no_evidence — and the citation IDs that back it.
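A possible shape for the matrix is shown below. The four verdict values are the ones listed above; the field names, the `emptyMatrix` and `setCell` helpers, and the sample candidates and axes are assumptions, as is the choice to initialize every cell to `no_evidence`.

```typescript
// Hypothetical shape for comparison-matrix.json; field names are assumptions.
type Verdict = "strong" | "mixed" | "weak" | "no_evidence";

interface MatrixCell {
  verdict: Verdict;
  citationIds: string[];
}

interface ComparisonMatrix {
  candidates: string[]; // rows
  axes: string[];       // columns
  cells: Record<string, Record<string, MatrixCell>>; // cells[candidate][axis]
}

// Assumption: start every cell at "no_evidence" so gaps are visible, not implied.
function emptyMatrix(candidates: string[], axes: string[]): ComparisonMatrix {
  const cells: Record<string, Record<string, MatrixCell>> = {};
  for (const candidate of candidates) {
    const row: Record<string, MatrixCell> = {};
    for (const axis of axes) {
      row[axis] = { verdict: "no_evidence", citationIds: [] };
    }
    cells[candidate] = row;
  }
  return { candidates, axes, cells };
}

function setCell(matrix: ComparisonMatrix, candidate: string, axis: string, cell: MatrixCell): void {
  matrix.cells[candidate][axis] = cell;
}

// Sample data with made-up candidates, axes, and citation IDs.
const sampleMatrix = emptyMatrix(["graph_db", "relational_with_edges"], ["query_latency", "ops_burden"]);
setCell(sampleMatrix, "graph_db", "query_latency", { verdict: "strong", citationIds: ["c12", "c31"] });
```

Carrying citation IDs in every cell is what lets a reviewer trace a verdict back to the evidence pool.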

The synthesizer reads the matrix, not the raw evidence pool. The raw pool is the audit trail. The matrix becomes the cross-cutting tradeoffs table in the final report — axes where candidates land at different verdicts, with strong and weak candidates listed per axis.
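Deriving the tradeoffs table from the matrix might look like this. It is a sketch under assumptions: the matrix is keyed by axis and then by candidate for convenience, and `TradeoffRow` and `crossCuttingTradeoffs` are hypothetical names.

```typescript
// Hypothetical sketch; the matrix here is keyed as matrix[axis][candidate].
type Verdict = "strong" | "mixed" | "weak" | "no_evidence";

interface TradeoffRow {
  axis: string;
  strong: string[];
  weak: string[];
}

// Keep only axes where candidates land on different verdicts, and list the
// strong and weak candidates for each of those axes.
function crossCuttingTradeoffs(matrix: Record<string, Record<string, Verdict>>): TradeoffRow[] {
  const rows: TradeoffRow[] = [];
  for (const axis of Object.keys(matrix)) {
    const byCandidate = matrix[axis];
    const candidates = Object.keys(byCandidate);
    const distinct = candidates
      .map((c) => byCandidate[c])
      .filter((v, i, all) => all.indexOf(v) === i);
    if (distinct.length < 2) continue; // every candidate agrees: no tradeoff
    rows.push({
      axis,
      strong: candidates.filter((c) => byCandidate[c] === "strong"),
      weak: candidates.filter((c) => byCandidate[c] === "weak"),
    });
  }
  return rows;
}

// Made-up sample: only query_latency separates the two candidates.
const tradeoffs = crossCuttingTradeoffs({
  query_latency: { graph_db: "strong", relational: "weak" },
  ops_burden: { graph_db: "mixed", relational: "mixed" },
});
```

In this sample, `ops_burden` drops out because both candidates share a verdict, which keeps the table focused on axes that can change the decision.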

If the matrix has empty cells or weak coverage, ADR runs an adversarial cycle. For each candidate, the planner generates "find the strongest case against this candidate" tasks. Production incidents. Latency stories. Ecosystem decline. Migration regrets. The matrix is rebuilt after each adversarial cycle and the synthesizer re-runs over the new state.
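The trigger and the task generation can be sketched together. The four angles are the ones named above; the task wording, the `AdversarialTask` shape, and the function names are hypothetical. As a simplification, the sketch treats only `no_evidence` cells as a coverage gap, since the post does not define a threshold for "weak coverage".

```typescript
// Hypothetical sketch; the matrix here is keyed as matrix[candidate][axis].
type Verdict = "strong" | "mixed" | "weak" | "no_evidence";

// Angles named in the post; the query wording below is an assumption.
const ADVERSARIAL_ANGLES = ["production incidents", "latency stories",
  "ecosystem decline", "migration regrets"];

interface AdversarialTask {
  candidate: string;
  query: string;
}

// Simplification: a gap is any cell with no evidence behind it.
function hasCoverageGaps(matrix: Record<string, Record<string, Verdict>>): boolean {
  return Object.keys(matrix).some((candidate) =>
    Object.keys(matrix[candidate]).some((axis) => matrix[candidate][axis] === "no_evidence")
  );
}

// One "strongest case against" task per candidate and angle, only when gaps exist.
function adversarialTasks(matrix: Record<string, Record<string, Verdict>>): AdversarialTask[] {
  if (!hasCoverageGaps(matrix)) return [];
  const tasks: AdversarialTask[] = [];
  for (const candidate of Object.keys(matrix)) {
    for (const angle of ADVERSARIAL_ANGLES) {
      tasks.push({ candidate, query: `strongest case against ${candidate}: ${angle}` });
    }
  }
  return tasks;
}

// Made-up sample with one empty cell.
const gappyMatrix: Record<string, Record<string, Verdict>> = {
  graph_db: { query_latency: "strong", ops_burden: "no_evidence" },
  relational: { query_latency: "weak", ops_burden: "mixed" },
};
const plannedTasks = adversarialTasks(gappyMatrix);
```

Note that tasks are generated for every candidate, not only the one with the gap, so a well-covered favorite still gets searched for counter-evidence.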

This is the move that surfaces real architecture pressure. A candidate that is popular in blog posts but quietly broken in production lands on the weak side of the matrix when the adversarial search finds the receipts — and the report's what_evidence_does_not_show section calls it out.

One kernel, three runtimes

The kernel is framework-neutral. The same loop runs under three CLIs:

  • npm run adr — OpenAI-compatible (OpenAI, Azure, vLLM, LM Studio, llamafile, Ollama).
  • npm run adr:langgraph — full LangGraph StateGraph; LLM via LangChain's initChatModel, which means any provider it supports (Anthropic, Google, Bedrock, Mistral, Groq, DeepSeek, Ollama).
  • npm run adr:adk — Google ADK with Gemini as the LLM.

All three produce the same artifact set. The choice of runtime is operational, not strategic.

A run starts with one of four live search providers — Brave, Tavily, Serper, or a self-hosted SearXNG — or falls back to OpenAI's hosted web_search if only an OpenAI key is set. There is no offline mode and no deterministic mock research. If no live provider is configured, the kernel fails fast.
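The selection and fail-fast behavior can be illustrated as below. The environment variable names for the four live providers and the priority order are assumptions; check the repo for the real configuration keys. `selectSearchProvider` is likewise a hypothetical name.

```typescript
type SearchProvider = "brave" | "tavily" | "serper" | "searxng" | "openai_web_search";

// Hypothetical environment variable names and priority order.
const LIVE_PROVIDER_KEYS: Array<[SearchProvider, string]> = [
  ["brave", "BRAVE_API_KEY"],
  ["tavily", "TAVILY_API_KEY"],
  ["serper", "SERPER_API_KEY"],
  ["searxng", "SEARXNG_URL"],
];

// Pick the first configured live provider, fall back to OpenAI's hosted
// web_search when only an OpenAI key is present, and otherwise fail fast.
function selectSearchProvider(env: Record<string, string | undefined>): SearchProvider {
  for (const [provider, key] of LIVE_PROVIDER_KEYS) {
    if (env[key]) return provider;
  }
  if (env["OPENAI_API_KEY"]) return "openai_web_search";
  throw new Error("No live search provider configured; refusing to run without live evidence.");
}
```

Throwing here, instead of returning a mock provider, matches the stance that there is no offline mode: a run without live evidence should not start at all.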

The Beevibe handoff

ADR ships with a Beevibe mesh adapter. The Architect specialist is a normal Agent row at team or org level, with a review_policy that can require human sign-off before its output reaches implementation.

After you read the report and pick an option, adr handoff <out_dir> --option <name> generates execution-handoff.json and agent-guardrails.md scoped to that one candidate: required invariants, forbidden topologies, artifact paths, evaluation suite name, and the memory facts the Architect bee writes to durable memory. Handoff is lazy: the kernel does not generate it by default. The report is what you decide from; the contract is what you ask for after.
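To make the contract concrete, here is one way the handoff could be typed and checked on the coding agent's side. The field names mirror the list above but are assumptions about the schema, and `violatedTopologies` plus the sample values are hypothetical.

```typescript
// Hypothetical shape for execution-handoff.json; field names are assumptions.
interface ExecutionHandoff {
  option: string;
  requiredInvariants: string[];
  forbiddenTopologies: string[];
  artifactPaths: string[];
  evaluationSuite: string;
  memoryFacts: string[];
}

// Return the planned components that the handoff explicitly forbids.
function violatedTopologies(handoff: ExecutionHandoff, plannedComponents: string[]): string[] {
  return plannedComponents.filter((c) => handoff.forbiddenTopologies.indexOf(c) !== -1);
}

// Made-up sample contract for a single chosen option.
const sampleHandoff: ExecutionHandoff = {
  option: "relational_with_edges",
  requiredInvariants: ["single source of truth for graph edges"],
  forbiddenTopologies: ["standalone_queue", "separate_vector_store"],
  artifactPaths: ["research-report.json", "comparison-matrix.json"],
  evaluationSuite: "example-eval-suite",
  memoryFacts: ["chose relational_with_edges over graph_db"],
};

const violations = violatedTopologies(sampleHandoff, ["api_worker", "standalone_queue"]);
```

A check like this is the kind of guardrail that would have flagged the duplicate queue from the opening story before any code was written.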

The handoff is also where ADR stops. It does not pick libraries, write migrations, or commit code. The coding agent on the other side does — but under explicit architecture constraints, with an evaluation pack ready to test the boundary.

When implementation evidence contradicts the spec, the next move is not a code patch. It is a superseding ADR run. The new run writes supersedes.json and appends a supersession section to the architecture decision record. Architecture changes are versioned the same way code is.

Why this slot in the series

Two earlier posts in this series argued that AI is making engineers individually faster without making teams smarter, and that agents keep reinventing the wheel because nothing in their training rewards looking for an off-the-shelf tool. ADR is the same problem one layer up.

Coding agents are excellent execution engines. They edit files, run tests, and iterate quickly. The failure mode is one layer higher — picking the wrong architecture family before the first file is written. ADR is the deep research layer that sits there.


The repo is open: github.com/beevibe-ai/architecture-deep-research

Next post: what changes when the agent doing the research has durable memory of every architecture decision your team has already made.