Engineering · May 2026 · 8 min read

We dogfooded ADR.
It refused to commit.

Post 06 in the Beevibe series.

The first version of Architecture Deep Research had a strong opinion built in: refuse to bluff. No static pattern library. No offline mode. If the evidence is thin, return requires_human_architecture_review instead of pretending confidence.

Then we ran it on a real Beevibe decision: picking a vector store for agent memory. The setup was self-hosted and multi-tenant, with Postgres already in the stack and a single-container Docker Compose deploy. We expected pgvector to come out as the answer in under two minutes. Instead, the pipeline ran for five minutes, produced a defensible-looking 9KB ADR.md, and concluded:

"No single recommendation. All 5 options have genuine tradeoffs that depend on team-side constraints ADR cannot know."

That's the failure mode we hadn't planned for. The tool we built to refuse bluffing was now refusing to engage. It looked balanced; it was actually dishonest.

The run that broke it

The Beevibe stack already runs Postgres. The PRD answered the clarification questions explicitly: self-hosted only, must fit the existing Docker Compose, low single-digit millions of vectors over the next 12 months, multi-tenant via filtered queries. Under those constraints, four of five candidates eliminate themselves:

  • Pinecone — cloud-only managed service. Self-hosted is a hard requirement. Out.
  • Weaviate / Milvus — self-hostable, but force customers to provision a separate service alongside the existing Postgres container. Violates "must fit existing Docker Compose."
  • Faiss — a library, not a database. We'd build the persistence and multi-tenancy layer ourselves. Not v1 material.
  • pgvector — Postgres extension. Runs inside our existing :5433. Sub-100ms p95 at our scale. Multi-tenant via WHERE tenant_id =. Obvious pick.

ADR ran the full pipeline. It found 62 evidence items, verified 29 of 30 citations, built a 13×14 matrix with 141 of its 182 cells empty, and raised 7 critique issues (0 high-severity). It correctly identified the strengths of pgvector ("Pgvector achieves sub-100ms p95 latency..."). And it still landed at mode: ranked_options, recommendation: null.

Here's the diagnosis that came back from one of our power users, who'd been beta-testing it:

"The product is rewarding intellectual hedging over making the actual call its evidence supports. The right answer was reachable in 30 seconds of human reasoning from the discover-stage output alone."

That stung because it was right. The pipeline went wide instead of deep. It collected breadth that didn't change the conclusion, and refused to draw the conclusion the evidence pointed at.

What was actually wrong

Three structural bugs, all the same kind of mistake: treating user-stated constraints as decoration instead of as filters.

1. Hard constraints didn't filter the candidate pool

"Self-hosted only" should eliminate Pinecone. Not give it a "weak on on_prem_self_host" cell buried in row 12 of the matrix. ADR was happily putting Pinecone in ranked_options with a footnote — exactly the kind of false-balance behavior the matrix was supposed to fix.
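
A minimal sketch of what "constraints as filters" could look like, in Python. It is illustrative only, not ADR's actual code: the Candidate class, the trait names (self_hosted, fits_existing_compose, is_database), and apply_must_haves are hypothetical, and the trait sets are hand-written from the eliminations described earlier in this post.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    name: str
    # Hypothetical capability flags, hand-written for illustration.
    traits: set = field(default_factory=set)


def apply_must_haves(candidates, must_haves):
    """Split the pool into survivors and eliminated candidates with reasons."""
    survivors, eliminated = [], {}
    for candidate in candidates:
        missing = [m for m in must_haves if m not in candidate.traits]
        if missing:
            eliminated[candidate.name] = missing  # removed, not footnoted
        else:
            survivors.append(candidate.name)
    return survivors, eliminated


MUST_HAVES = ["self_hosted", "fits_existing_compose", "is_database"]

POOL = [
    Candidate("Pinecone", {"is_database"}),
    Candidate("Weaviate", {"self_hosted", "is_database"}),
    Candidate("Milvus", {"self_hosted", "is_database"}),
    Candidate("Faiss", {"self_hosted", "fits_existing_compose"}),
    Candidate("pgvector", {"self_hosted", "fits_existing_compose", "is_database"}),
]

survivors, eliminated = apply_must_haves(POOL, MUST_HAVES)
print(survivors)
print(eliminated)
```

The point of the sketch is the shape of the output: a failed hard constraint removes the candidate and records why, instead of producing a weak cell somewhere in the matrix.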

2. Refusing to recommend when the field narrowed

If the user already told you their constraints, and the constraints eliminate four of five options, the answer is the one left. "All options have genuine tradeoffs" is structurally true only when there are 3+ viable options and the dominance pattern is genuinely ambiguous. Saying it when one option is the last one standing is hedging, not nuance.

3. Discovered context didn't influence ranking

ADR's discover stage scanned the Beevibe repo and surfaced the existing Postgres dependency. Installing a Postgres extension is materially cheaper than adding Weaviate, Milvus, or Pinecone alongside it. That's a huge prior toward pgvector. But it sat in discovered-constraints.json as context-padding, never shaped a matrix axis, never influenced scoring.

What we shipped (three rounds)

We tried three rounds of fixes. The first two stayed inside the original frame: make ADR commit when committing is the honest call. The third broke the frame.

Round 1 — commitment fixes

Hard constraints became filters instead of decoration. A new stage parsed "self-hosted only" / "must fit Docker Compose" into must_have entries, batched a per-candidate verdict LLM call, and eliminated failures before the matrix got built. A commitment threshold forced the synthesizer to commit when the field narrowed: 1 survivor → recommend it; 2 survivors with a clear net-strong lead → recommend the leader; otherwise stay at ranked_options. The discover stage's stack signals drove a real fits_existing_stack matrix axis instead of sitting as context-padding.
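
The commitment threshold above can be sketched roughly as follows. This is an assumption-heavy illustration, not the shipped synthesizer: choose_mode, the net_strong scores (strong cells minus weak cells per candidate), and the lead_margin default are all hypothetical. The post only says the lead had to be "clear".

```python
def choose_mode(survivors, net_strong, lead_margin=2):
    """Decide whether to commit, given the candidates that passed the filters.

    survivors:   list of candidate names
    net_strong:  dict of name -> (strong cells - weak cells), hypothetical score
    lead_margin: assumed definition of a "clear" lead
    """
    if len(survivors) == 1:
        # One survivor: recommend it.
        return {"mode": "recommended", "recommendation": survivors[0]}
    if len(survivors) == 2:
        leader, runner_up = sorted(
            survivors, key=lambda name: net_strong[name], reverse=True
        )
        if net_strong[leader] - net_strong[runner_up] >= lead_margin:
            # Two survivors with a clear net-strong lead: recommend the leader.
            return {"mode": "recommended", "recommendation": leader}
    # Everything else stays at ranked_options. In this sketch that includes
    # the zero-survivor case, which the post does not describe.
    return {"mode": "ranked_options", "recommendation": None}


print(choose_mode(["pgvector"], {"pgvector": 6}))
print(choose_mode(["pgvector", "Weaviate"], {"pgvector": 6, "Weaviate": 5}))
```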

This worked. The same pgvector run that had hedged now committed cleanly: "Pinecone fails self-hosted, Weaviate/Milvus violate Docker Compose, Faiss is a library not a database, pgvector wins."

Round 2 — wider exploration

After living with round 1 for a week we kept hitting cases where the constraint extractor over-fired or under-fired and the user couldn't see why. Constraints were doing too much work invisibly: a single must_have mislabel and the matrix saw three candidates instead of fifteen. So we pivoted. Constraints became annotations on the report instead of filters on the pool. The clarification gate stopped halting runs. The kernel started mapping the option space wider, keeping a candidate like Pinecone in the report with a footnote noting it didn't fit the deploy shape, instead of removing it entirely.

This worked too, in the sense that it removed the hidden-narrowing failure mode. But every run now produced a 1-candidate "ranked options" report — the candidate pool was still being narrowed, just by a different stage (the promotion gate that required ≥2 cited sources from official_docs / mature_oss / paper_or_benchmark). The hedging came back wearing different clothes.
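
To see how a promotion gate like this narrows the pool on its own, here is a small sketch. The three qualifying source types and the threshold of two come from the description above. The promote function and the citation data are made up for illustration.

```python
QUALIFYING_SOURCE_TYPES = {"official_docs", "mature_oss", "paper_or_benchmark"}


def promote(citations_by_candidate, min_sources=2):
    """Return candidates backed by at least min_sources qualifying citations."""
    promoted = []
    for name, citations in citations_by_candidate.items():
        qualifying = [
            c for c in citations if c["source_type"] in QUALIFYING_SOURCE_TYPES
        ]
        if len(qualifying) >= min_sources:
            promoted.append(name)
    return promoted


# Illustrative data only: not taken from a real ADR run.
citations = {
    "pgvector": [
        {"source_type": "official_docs"},
        {"source_type": "mature_oss"},
        {"source_type": "community_discussion"},
    ],
    "Weaviate": [
        {"source_type": "official_docs"},
        {"source_type": "community_discussion"},
    ],
    "Faiss": [{"source_type": "community_discussion"}],
}

print(promote(citations))
```

With this made-up data, three candidates enter the gate and one leaves it, so the "ranked options" report has a single entry even though no constraint filter ran.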

Round 3 — the real reframe

The deeper bug was treating ADR as a decision engine at all.

The kernel can't actually pick. The decision depends on team context ADR doesn't see — existing infrastructure, hiring plans, vendor relationships, budget envelope, what your senior engineer prefers to operate. A research kernel that tries to pick will always be either over-committing (forcing a winner the evidence doesn't fully support) or under-committing (refusing to engage with what its evidence supports). Both are dishonest.

So we pivoted ADR to a research-report engine. Every architecture candidate the evidence surfaced gets a section in the report: what the evidence shows, what the evidence does not show, strong and weak axes, citations. Evidence depth is shown per candidate (thick / medium / thin) so the reader can weight each section accordingly. There's no mode=recommended anymore. There's an executive summary, an option-space overview, per-candidate sections, cross-cutting tradeoffs across the matrix axes, and a list of open questions the evidence pool didn't resolve. You read the report and decide. It's the same posture as OpenAI Deep Research, Perplexity, or Gemini Deep Research, with a narrower scope (architectural decisions, not topics).
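
One way to picture the per-candidate depth label is a bucketing over citation counts. This is a sketch under loud assumptions: the post does not say how thick / medium / thin are computed, so the thresholds (5 and 2) and the function names here are invented purely for illustration.

```python
from collections import Counter


def depth_label(citation_count, thick_at=5, medium_at=2):
    """Bucket a candidate by citation count. Thresholds are assumptions."""
    if citation_count >= thick_at:
        return "thick"
    if citation_count >= medium_at:
        return "medium"
    return "thin"


def depth_summary(citation_counts):
    """Summarize the pool as 'N candidates, X thick / Y medium / Z thin'."""
    labels = Counter(depth_label(n) for n in citation_counts.values())
    return (
        f"{len(citation_counts)} candidates, "
        f"{labels['thick']} thick / {labels['medium']} medium / {labels['thin']} thin"
    )


# Hypothetical counts, not from a real run.
print(depth_summary({"pgvector": 6, "Weaviate": 3, "Faiss": 1}))
```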

The implementation contract (agent-guardrails.md and execution-handoff.json) became lazy: it is no longer generated by default. After you read the report and pick an option, you run adr handoff <out_dir> --option <name> and the kernel scopes the contract to that one candidate. The report is what you decide from; the contract is what you ask for after.

The peer products feature

The biggest user-driven addition was independent of the depth fixes. Real users picking architectures don't reason in the abstract — they look at 3-5 already-shipped similar products and ask "what did they do, and does it apply to us?"

ADR now does that. --include-peers on discover finds 3-5 named comparable products with their GitHub URLs, ranks them by stars and recency (dead repos dropped, closed-source products like Notion and Linear kept), and writes peers.json. Deep-research picks it up automatically and adds one targeted research task per peer (for example, "how does Onyx handle vector storage?"), hitting that peer's repo, ARCHITECTURE.md, docs, and engineering blog.
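
The ranking step can be sketched like this. Everything below is an assumption made for illustration: the Peer fields, the one-year cutoff for a "dead" repo, the choice to sort by stars first and recency second, and the choice to list closed-source peers after the open-source ones. The commit dates are invented.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Peer:
    name: str
    stars: Optional[int] = None         # None for closed-source products
    last_commit: Optional[date] = None  # None for closed-source products


def rank_peers(peers, today, dead_after_days=365, limit=5):
    """Drop dead repos, keep closed-source peers, rank the rest."""
    open_source, closed_source = [], []
    for peer in peers:
        if peer.stars is None or peer.last_commit is None:
            closed_source.append(peer)  # no repo signal, kept anyway
            continue
        age_days = (today - peer.last_commit).days
        if age_days > dead_after_days:
            continue  # dead repo, dropped
        open_source.append((peer, age_days))
    # More stars first; among equal stars, the more recently active repo first.
    open_source.sort(key=lambda pair: (-pair[0].stars, pair[1]))
    ranked = [peer for peer, _ in open_source] + closed_source
    return ranked[:limit]


peers = [
    Peer("Onyx", 12_000, date(2026, 4, 20)),
    Peer("AnythingLLM", 22_000, date(2026, 4, 1)),
    Peer("Mem.ai"),
    Peer("abandoned-example", 30_000, date(2023, 1, 1)),
]
ranked_names = [p.name for p in rank_peers(peers, today=date(2026, 5, 1))]
print(ranked_names)
```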

The peer findings flow into the evidence pool as regular citations. If three of five peers ship with pgvector in their default deploy, the matrix sees three citations backing pgvector and the report's pgvector section reads "thick evidence: Onyx, AnythingLLM, and Open WebUI each ship pgvector as the default vector store in their reference deployments." Concrete reasoning, real receipts.

One thing the peer feature kept teaching us: closed-source consumer products (Obsidian, Roam Research, Mem.ai) carry real adoption signal teams weigh — even though their architecture isn't publicly documented. The current kernel tags each peer with an evidence_strategy (architecture / adoption / both). Architecture-strategy peers get researched via engineering blogs and docs; adoption-strategy peers route through Reddit, HN, Twitter, and migration write-ups via a separate adoption_research_planner. The synthesizer frames adoption-source claims as practitioner signal ("r/LocalLLaMA users report X") rather than architectural fact. Reddit, HN, Twitter, and Stack Exchange URLs land as source_type: community_discussion with a relaxed citation rule that allows paraphrased quotes — because discussion threads paraphrase, and the audit needs to honor that.
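
A rough sketch of the routing and tagging described above. The names adoption_research_planner, evidence_strategy, and community_discussion come from this post. The rest is hypothetical: the host list, the "unclassified" fallback, the architecture_research_planner name, the rule labels, and the assumption that non-community sources need an exact quote.

```python
from urllib.parse import urlparse

# Assumed host list for Reddit, HN, Twitter, and Stack Exchange.
COMMUNITY_HOSTS = (
    "reddit.com",
    "news.ycombinator.com",
    "twitter.com",
    "x.com",
    "stackexchange.com",
)


def classify_source(url):
    """Tag community platforms as community_discussion; leave the rest untagged."""
    host = (urlparse(url).hostname or "").lower()
    for known in COMMUNITY_HOSTS:
        if host == known or host.endswith("." + known):
            return "community_discussion"
    return "unclassified"  # hypothetical placeholder for all other types


def citation_rule(source_type):
    """Community threads may be paraphrased; other sources are assumed strict."""
    if source_type == "community_discussion":
        return "paraphrase_allowed"
    return "exact_quote"


def planners_for(evidence_strategy):
    """Map a peer's evidence_strategy to the planners that research it."""
    routes = {
        "architecture": ["architecture_research_planner"],  # hypothetical name
        "adoption": ["adoption_research_planner"],
        "both": ["architecture_research_planner", "adoption_research_planner"],
    }
    return routes[evidence_strategy]


print(classify_source("https://www.reddit.com/r/LocalLLaMA/comments/abc"))
print(planners_for("adoption"))
```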

Streaming as a live research log

The last shift was in the chat UX. Events used to be counters: "extracted 3 claims, score 0.8." Useful for the engineer watching the kernel logs; useless for a chat surface. Every event now carries concrete content the user can read:

✓  Extracted 10 context notes
  • [deployment] Self-hosted is the primary deploy model
    └ from: "self-hosted is the primary deploy model"
  • [integration] Must fit existing Docker Compose stack
    └ from: "fits the existing Docker Compose"

🤝  Found 5 peer products (3 architecture, 2 adoption)
  • Onyx (★12k, Python) [architecture] — Self-hosted agent runtime
  • AnythingLLM (★22k, JavaScript) [architecture] — Self-hosted RAG / chat OS
  • Mem.ai [adoption] — Closed-source consumer; community signal via HN + Reddit

      ⤓ fetching "Architecture · Onyx Docs" (official_docs)
      ✓ fetched (8.1KB, http_fetch_ok)
        └ "Onyx uses Postgres with pgvector as the default vector store.
           The schema separates per-tenant indexes via tenant_id partition
           columns..."
      ✓ extracted 3 claims (score 0.82)
        • [pgvector / supports] Onyx ships with pgvector as the default
          vector store across all deployments.
        • [self_hosted / supports] Default deployment is single-container
          Docker Compose.

      ⤓ fetching "Mem.ai vs Notion vs Obsidian for AI agents - r/LocalLLaMA"
        (community_discussion, platform=reddit, subreddit=LocalLLaMA)
      ✓ extracted 2 claims (community signal, soft-match audit)

✓  Synthesis done — research report written (8 candidates, 4 thick / 3 medium / 1 thin)
  Executive summary, per-candidate sections, cross-cutting tradeoffs,
  open questions. No recommendation — read the report and decide.

Same event count, same wall-clock time, but the chat surface is now a live research log. The user can read what ADR is finding as it finds it, not what it found in summary form after the run completes.
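
As a sketch of the difference between a counter event and a content-carrying event, here is one possible event shape and a renderer for it. The field names (icon, title, items, tag, text, quote) and render_event are assumptions for illustration, not the kernel's actual event schema.

```python
def render_event(event):
    """Render one event as readable log lines instead of a bare counter."""
    lines = [f"{event['icon']}  {event['title']}"]
    for item in event.get("items", []):
        lines.append(f"  • [{item['tag']}] {item['text']}")
        quote = item.get("quote")
        if quote:
            # Show the source phrase the note was extracted from.
            lines.append(f'    └ from: "{quote}"')
    return "\n".join(lines)


event = {
    "icon": "✓",
    "title": "Extracted 2 context notes",
    "items": [
        {
            "tag": "deployment",
            "text": "Self-hosted is the primary deploy model",
            "quote": "self-hosted is the primary deploy model",
        },
        {
            "tag": "integration",
            "text": "Must fit existing Docker Compose stack",
            "quote": "fits the existing Docker Compose",
        },
    ],
}

rendered = render_event(event)
print(rendered)
```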

The lesson

The dogfood failure pointed at the wrong fix. We patched the hedging at first, then patched the patches, then realized the whole shape was wrong.

A decision system trying to be a research system gets pulled toward forcing answers it can't fully support. A research system trying to be a decision system gets pulled toward hedging. The way out wasn't a better commitment threshold — it was admitting what the kernel could actually do honestly.

ADR researches; the human decides. Every cited candidate gets a section. Evidence depth is shown so the reader weights each section. Cross-cutting tradeoffs are surfaced as their own table. Open questions are listed at the end. The reader walks away with the information they need to commit — and the commit is theirs, not the kernel's. Once we accepted that shape, the hedging-vs-committing tension dissolved. There's nothing left for the kernel to hedge or commit on.

"It depends" is sometimes the only honest answer. The better honest answer is showing the reader what it depends on.


Try it

Three ways in. All share the same kernel and produce the same artifacts.

Claude Code plugin — recommended for normal users:

claude plugin marketplace add beevibe-ai/architecture-deep-research
claude plugin install adr

Then in any session: /adr:doctor (one-time setup), then /adr:decide to run the full pipeline. The slash command asks if you want peer products. Recommended: yes.

MCP server for Cursor, Codex, or any MCP host:

npm install -g github:beevibe-ai/architecture-deep-research
adr-doctor setup

CLI for scripts and CI:

adr deep-research --discover-first --include-peers \
  --repo . \
  --domain "agent-native OS" \
  --decision "vector store for agent memory" \
  --out .adr-runs/vector-store

The repository is open-source under Apache-2.0 at github.com/beevibe-ai/architecture-deep-research. Issues and pull requests welcome.