All posts
Engineering · May 2026 · 9 min read

Your agent reinvents the wheel.
The fix isn't a better prompt.

Post 02 in the Beevibe series.

Last month I gave our team agent a task: record a 30-second demo of a UI flow. Standard ask. The agent picked up the task, spawned a Browser Automation specialist, and started writing a Playwright script from scratch.

There are dozens of working demo-recorder libraries on GitHub. The agent knew about Playwright. It just didn't think to look for an off-the-shelf tool. It went straight to "let me scaffold from scratch."

A human teammate would have asked "is there already a tool for this?" first. The agent didn't, because nothing in its training rewards that question. What rewards an agent is shipping code. So it ships code, even when the answer was a pip install away.

This is the most expensive AI behavior we've watched up close. Not hallucination. Not getting things wrong. Defaulting to invention when discovery would have been faster.

The shape of the problem

Once you start looking, it's everywhere:

  • Task: "extract tables from this PDF." Agent: writes a custom PDF parser. There are five mature libraries.
  • Task: "transcribe this audio." Agent: opens the Whisper API docs and re-writes the boilerplate. The repo has a one-line CLI.
  • Task: "convert this video to MP3." Agent: spawns a Python ffmpeg subprocess. There's already a wrapper.

Each individual decision is fine. None of them are wrong. But over a week, the team accumulates forty hand-rolled scripts that are each 80% of three open-source projects, badly remembered.

We've been calling this "prompt overloading." The agent has the right knowledge in its weights — it knows pdfplumber exists. It just doesn't activate that knowledge under the pressure of "do this task now."

What we tried first, and why it didn't stick

For weeks we tried to fix this with prompts. Add a directive: "before scaffolding, search for an existing tool." Add a sub-rule: "if the goal smells like extract, convert, transcribe, scrape — call find_repo first." Add a banned-phrase list to stop the agent from saying "let me write a quick wrapper."

Some of these worked once. None of them stuck.

The reason is structural. By the time the agent is deep in a task, every rule we added at the top of the system prompt has been pushed out of its attention by 30,000 tokens of recent tool calls and intermediate reasoning. A rule in the system prompt is not load-bearing.

We learned this the hard way. The agent kept ignoring our directives. Each time we tightened the language. Each tightening worked for a day, then broke. The pattern is unmistakable now — if you need an agent to do something reliably, you can't put the trigger in the prompt.

So where do you put it?

The fix is in the UI, not the prompt

You take the action out of the agent's hands.

That's the principle we ended up with. When a behavior has to happen reliably — registering a capability, attaching an artifact, surfacing a related tool — the system, not the agent, does it. The UI exposes the action; the human clicks; the system writes the row. The agent's job becomes "identify the candidates," not "execute the next step."

We applied this principle three times last week, in what we're now calling the capability network:

1. Discovery is server-side, not prompt-side. find_repo is an MCP tool the agent calls explicitly. It returns ranked candidates from four sources: the team's saved skills, a community registry of proven recipes, GitHub trending (real star velocity, not lifetime stars), and a live GitHub search. Embedding-based semantic match across the lot. The agent doesn't have to reason about "should I look here or there" — it calls one tool and gets the answer.

2. Trying a repo is a button, not an agent decision. Every time find_repo returns a candidate, the chat UI auto-renders a Try button next to it. Click it and we spin up a fresh Docker sandbox, clone the repo, run a child agent against the goal, and stream the transcript to a Playground page. The agent that did find_repo isn't involved in the trial. The user is.

3. Persistence is the system's job. When a sandbox run succeeds and produces an artifact, the Playground shows a Save panel. Click it and the artifact body becomes a SKILL.md in skills/learned/<name>/. The DB row gets written. Claude Code's skill auto-discovery picks it up on the next session. The agent never has to remember to "save this as a capability" — by the time the agent's context window is exhausted, the click has already happened.

What the loop looks like end-to-end

A real example. I asked the team agent: "find me some GEO or SEO repos."

  • The agent called find_repo. Eight candidates came back — two from trending, six from github search, ranked by embedding similarity to the goal.
  • The chat surface auto-rendered them as cards. Each card showed owner/name, star count, language, source tier, and a Try button. Even when the agent emitted markdown links instead of the structured card directive, we auto-promote bare GitHub URLs into cards — so the Try button is reachable regardless of the agent's formatting choices.
  • I clicked Try on the most interesting one. Browser navigated to /capabilities/runs/<id>. A Docker container booted in two seconds (cached python:3.12-slim), cloned the repo, the child agent read the README, listed the directory, and wrote a structured markdown overview of the repo's contents.
  • 30 seconds later the Playground showed: status = Succeeded, 1 artifact exported, the agent's wrap-up summary inline.
  • I clicked Save. A modal popped with a pre-filled name (slugged from the URL) and a goal pattern (first sentence of my original query, with the noisy "Context: …" trailer auto-stripped). Hit Save.
  • The artifact body became skills/learned/<name>/SKILL.md. Future agent sessions auto-discover it as a slash command, and find_repo scores this repo at +50 (the highest tier) for any future goal whose embedding is close.

Total UI surface I touched: two buttons, one form.

Total time the agent's prompt had to "remember" to do anything: zero.

The hard part nobody warned us about

We hit dozens of small failure modes that don't fit the clean story above. Worth being honest about them, because every one of them taught us the same lesson in a slightly different shape:

  • The agent kept hallucinating UI states. It would emit a chip prompt, then narrate in the next sentence "looks like the prompt got dismissed — let me just propose a default and let you redirect." There is no "dismiss" state in our chat UI. The agent invented one to give itself an exit. We had to put a literal phrase blacklist in the system prompt to make it stop.
  • The agent loaded its own tools, loudly. Every sandbox session opened with five ToolSearch calls — Claude Code's deferred schema lookup. They cluttered the live transcript with empty results. We added a pass that collapses consecutive ToolSearches into one Loaded N tool schemas line.
  • GitHub's search ranks by GitHub's algorithm, not ours. A 7k-star trending repo we knew was relevant didn't appear in find_repo's results because GitHub's keyword search ranked it 30th and we sliced at top-20. We added a separate "trending injection" pass: any repo in our daily/weekly trending snapshot whose embedding is within 0.35 cosine of the goal gets surfaced, regardless of whether GitHub returned it.
  • Keyword matching was the wrong tool. Our first relevance gate was STOP_WORDS + keyword overlap. It broke the second I queried with phrasing it hadn't seen — "find me some geo or seo repos" turned every trending repo whose description contained the word "find" into a positive match. We replaced the regex with cosine similarity on text-embedding-3-small. Stop words went away. The query just works.
  • Agents lie about why they succeeded. When an agent finishes a use_repo run, its terminal message is usually a verbose summary. Sometimes it says "Exported. Artifact: foo.md — covers…" and sometimes just "Exported." If we naively pulled the last sentence as a default goal pattern, the user saw "Exported" pre-filled in the Save form. We stopped pulling the agent's wrap-up and started deriving the pattern from the original goal, with a one-click swap if the user prefers the agent's narrative.

Each of these felt like a small detour at the time. Looking at them together, the pattern is the same: anytime we tried to encode a behavior in the prompt, drift broke it. Anytime we encoded the behavior in code or UI, it stuck.

What this is, and what it isn't

This isn't a marketplace. We're not curating which repos are "good." We're not building a paid extension store. There's no licensing layer between you and the open-source ecosystem.

It's a thin pipe: agent → ranked candidates → Docker sandbox → artifact → optional save. Every layer is read-only against GitHub. The trust boundary is the sandbox. The agent never runs arbitrary code on your host filesystem.

It's also opt-in. There's a toggle in /settings (default ON for now; we're watching how teams react). Flip it off and the entire surface vanishes — find_repo and use_repo come off your agents' tool list, the Capabilities nav entry disappears, the chat repo cards stop rendering. Same architectural principle: when behavior needs to be reliable, it lives in code, not in a prompt.

The deeper lesson

The thing I keep wanting to tell people who are building agent products:

The prompt is where you tell the agent what to consider. The UI is where you let the human decide. Anything that has to happen reliably belongs in code between them, not in either of those two.

We learned this writing four different versions of "before scaffolding, search for an existing tool" and watching the agent ignore each one. We finally believed it after we stopped editing the prompt and started shipping buttons.

The capability network is one application of that lesson. There will be more.


The repo is open: github.com/beevibe-ai/beevibe

Next post: deep research on system design agents — why useful architecture work needs memory, specialists, and visible disagreement.