The Infrastructure Stack for Self-Improving AI Agents

What does a self-improving AI agent actually need underneath it?

A self-improving agent is not a smarter model. It is a feedback loop. The infrastructure has to capture every conversation, understand what users were actually trying to do, diagnose where the agent failed them, propose a concrete fix, and put a human in front of that fix before it ships. Five layers. Miss one and the loop breaks.

Most teams I talk to think “self-improving” means fine-tuning, or some agent that rewrites its own prompt at runtime. It almost never does in production. The real loop is slower, more boring, and far more reliable: the system watches what happens, figures out why, and proposes changes you review. The intelligence is in the pipeline between your agent and your codebase, not in the agent gaining sentience.

Here is the reference architecture, layer by layer, and where teams usually drop the ball.

The five-layer reference stack

Layer	Job	What it produces	Common failure
1. Capture	Instrument every conversation	Structured traces of every turn, tool call, latency, error	Sampling, dropping payloads, logging text but not structure
2. Understand	Extract intents and signals from raw conversations	Tagged intents: bug report, feature request, churn risk, setup friction	Generic dashboards that count messages but not meaning
3. Diagnose	Connect signals to root cause	“Users stall at step 3 because the agent asks for an API key it already has”	Stopping at “what” and never reaching “why”
4. Act	Turn diagnoses into proposed changes	Pull requests against prompts, harness code, configs	Insights that sit in a dashboard nobody acts on
5. Review	Human in the loop approves or rejects	Merged, attributable changes with a paper trail	Auto-applying changes and hoping for the best

The stack is sequential. You cannot diagnose what you did not capture. You cannot act on a diagnosis you never made. So let’s walk through each one.

Layer 1: Capture

This is the foundation and the one teams cheap out on the most. You need a trace of every conversation, not a sample. Not “1 percent for cost reasons.” Every one. Sampling hurts most exactly where you need data: failures are a small share of traffic, so a 1 percent sample leaves you with a handful of them, too few to cluster or diagnose.
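To put rough numbers on what sampling costs you, here is a small back-of-the-envelope sketch. The traffic volume and failure rate are made-up values for illustration, not measurements from any real system.

```python
# Hypothetical numbers, chosen only to illustrate the argument.
conversations = 10_000   # conversations in some period (assumed)
failure_rate = 0.02      # share of conversations that go wrong (assumed)
sample_rate = 0.01       # "1 percent for cost reasons"

failures = conversations * failure_rate
expected_captured = failures * sample_rate

# Probability that one specific failing conversation left no trace at all.
p_missed = 1 - sample_rate

print(f"failures that happened: {failures:.0f}")
print(f"failures you can expect to have a trace for: {expected_captured:.0f}")
print(f"chance any given failure was never recorded: {p_missed:.0%}")
```

Under these assumptions, a couple of hundred failures shrink to a couple of traces, which is not enough to form a cluster in the later layers.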

Use OpenTelemetry if you can. It is the closest thing the agent world has to a standard, it is framework agnostic, and it means your capture layer is not locked to one vendor. Instrument the full turn: user message, system prompt version, tool calls and their results, model used, latency, token counts, and the final response. The system prompt version matters more than people expect. When you ship a fix in layer 4, you want to prove the numbers actually moved, and you can only compare before and after if every trace records which prompt version produced it.

One thing I learned the hard way: capture the structure, not just the transcript. A wall of text is searchable but not analyzable. If you log which tool failed and what arguments it got, layer 3 has something to work with. If you log only the human-readable chat, you are doing forensics by hand at 2am.
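As a rough illustration of “structure, not just the transcript,” a per-turn record might look like the sketch below. The class and field names (TurnTrace, ToolCall, system_prompt_version, and so on) are assumptions made for this example, not a standard schema. In an OpenTelemetry setup the same fields would typically be carried as span attributes.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class ToolCall:
    name: str
    arguments: dict
    ok: bool
    error: Optional[str] = None


@dataclass
class TurnTrace:
    # Hypothetical schema: one record per conversation turn.
    conversation_id: str
    turn_index: int
    system_prompt_version: str
    model: str
    user_message: str
    response: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    tool_calls: list = field(default_factory=list)

    def failed_tools(self) -> list:
        """Names of tools that errored during this turn."""
        return [call.name for call in self.tool_calls if not call.ok]


turn = TurnTrace(
    conversation_id="c-123",
    turn_index=3,
    system_prompt_version="v14",
    model="example-model",
    user_message="Connect my account",
    response="Please paste your API key.",
    latency_ms=840.0,
    input_tokens=512,
    output_tokens=38,
    tool_calls=[ToolCall("lookup_api_key", {"user_id": "u-9"}, ok=False, error="timeout")],
)

# Serialize the whole turn, nested tool calls included, for the trace store.
record = json.dumps(asdict(turn))
print(turn.failed_tools())
```

With a record like this, “which tool failed, with what arguments, under which prompt version” becomes a query instead of a reading exercise.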

Layer 2: Understand

Raw traces are noise until something labels them. This is where most “observability” tools stop and where self-improvement actually starts.

The unit that matters here is the intent. Not sentiment, not a thumbs up, not message count. What was the user trying to do, and did the agent let them do it? In production you see a handful of recurring intents that predict everything: bug report, feature request, churn risk, setup friction, confusion, and a long tail of product-specific ones.

The trap is using a fixed taxonomy. Your product is not someone else’s product. A setup-friction intent for a dev tool (“the agent keeps asking me to re-auth”) looks nothing like setup friction for a consumer app. The understand layer has to generate intents that fit your product from your actual conversations, then tag every trace against them. Generic categories give you generic, useless insights.
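A minimal sketch of the tagging half of this layer, assuming the product-specific intents have already been generated somehow. The intent names and phrases below are invented for a hypothetical dev tool, and plain substring matching is only a stand-in for whatever model actually assigns intents in a real system.

```python
# Hypothetical intents for one product. In practice these would be derived
# from your own conversations, not hand-written like this.
INTENT_PHRASES = {
    "setup_friction": ["re-auth", "api key", "log in again"],
    "bug_report": ["error", "crash", "broken"],
    "churn_risk": ["cancel", "switching to", "giving up"],
}


def tag_intents(user_messages):
    """Return the sorted list of intents whose phrases appear in the conversation."""
    text = " ".join(user_messages).lower()
    return sorted(
        intent
        for intent, phrases in INTENT_PHRASES.items()
        if any(phrase in text for phrase in phrases)
    )


print(tag_intents(["The agent keeps asking me to re-auth", "I'm giving up"]))
```

The point of the sketch is the shape of the output: every trace ends up with zero or more intents drawn from a taxonomy that belongs to your product.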

This is the layer Agnost AI was built around. It reads every conversation and auto-generates the custom intents for your product instead of making you define them by hand, which is the part nobody has time for.

Layer 3: Diagnose

Knowing that 12 percent of conversations carry a churn-risk signal is interesting. Knowing why is the whole game.

Diagnosis means connecting an intent cluster to a root cause in your system. Example from a real pattern: users hit a churn-risk signal right after turn 3, and when you read the cluster, the agent is repeatedly asking for context the user already provided two turns earlier. The cause is not the model being dumb. It is a system prompt that does not tell the agent to carry forward earlier answers. That is a one-line fix, but you only find it by reading the cluster, not the aggregate.

Good diagnosis is specific enough to act on. “Users are frustrated” is not a diagnosis. “Users abandon during onboarding because the agent requests an API key that is already in their environment” is. The difference is whether layer 4 can write code against it.
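One way to narrow down where to start reading a cluster is to count which turn and which failing tool dominate it. The sketch below assumes traces have been flattened to (intent, turn_index, failed_tool) tuples. That shape, the tool names, and the data are all hypothetical.

```python
from collections import Counter

# Hypothetical flattened traces: (intent, turn_index, failed_tool or None).
traces = [
    ("churn_risk", 3, "lookup_api_key"),
    ("churn_risk", 3, "lookup_api_key"),
    ("churn_risk", 5, None),
    ("setup_friction", 3, "lookup_api_key"),
    ("churn_risk", 3, "lookup_api_key"),
    ("bug_report", 1, "search_docs"),
]


def dominant_failure(records, intent):
    """Most common (turn_index, failed_tool) pair within one intent cluster."""
    cluster = [(turn, tool) for tagged, turn, tool in records if tagged == intent]
    if not cluster:
        return None
    (turn, tool), count = Counter(cluster).most_common(1)[0]
    return {
        "intent": intent,
        "turn_index": turn,
        "failed_tool": tool,
        "share": count / len(cluster),
    }


print(dominant_failure(traces, "churn_risk"))
```

A count like this points at turn 3 and one tool. It does not produce the root cause by itself, and someone or something still has to read those conversations to see why.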

Layer 4: Act

Here is the line between an analytics product and infrastructure for self-improvement: does the system propose the change, or does it just tell you about the problem?

A dashboard that surfaces “setup friction is up 30 percent” is leaving the actual work to you. The act layer should produce a concrete artifact: a pull request against your system prompt, your agent harness, or your config (W&B configs included if that is where your model setup lives). A real diff you can read in seconds.

This is the part that sounds scary and is actually the safest design choice. A PR is reviewable, attributable, and revertible. It is git. You already trust this workflow for the rest of your code. Extending it to agent behavior means agent changes get the same scrutiny as everything else, instead of living in some separate “AI ops” surface with no version history.
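To make “a real diff you can read in seconds” concrete, here is a minimal sketch that renders a proposed prompt change as a unified diff using Python’s standard difflib module. The file path and the prompt text are assumptions for the example, and opening the actual pull request is left to whatever git tooling you already use.

```python
import difflib


def propose_prompt_diff(current, proposed, path="prompts/system.txt"):
    """Render a proposed prompt change as a unified diff (path is hypothetical)."""
    return "".join(
        difflib.unified_diff(
            current.splitlines(keepends=True),
            proposed.splitlines(keepends=True),
            fromfile=f"a/{path}",
            tofile=f"b/{path}",
        )
    )


current = "You are a setup assistant.\nAsk the user for any missing details.\n"
proposed = current + (
    "Do not ask again for information the user already gave earlier in the conversation.\n"
)

diff = propose_prompt_diff(current, proposed)
print(diff)
```

The output is the same format a reviewer already reads every day, which is the whole argument for routing agent changes through it.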

Layer 5: Review and merge

Never auto-apply. I know the demo of an agent that fixes itself end to end with no human looks magical. In production it is how you ship a regression to every user at once with nobody having read it.

The human-in-the-loop step is not a limitation of the stack, it is the point. A senior engineer reading a proposed prompt change takes 30 seconds and catches the 1 in 20 suggestion that is subtly wrong. You keep the speed of automation for the 19 good ones and the judgment of a human for the one bad one. Merge it and the change is attributable: who approved it, what data drove it, what it was supposed to fix. Next cycle, layer 1 captures whether it worked.
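In practice the gate is your existing code review and branch protection, but the rule itself is small enough to sketch. The ProposedChange record and apply_change function below are hypothetical names for this example. They show a change that carries its diagnosis and evidence with it and cannot be applied until a named reviewer has approved it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProposedChange:
    # Hypothetical record: the change plus the paper trail behind it.
    diff: str
    diagnosis: str
    evidence_trace_ids: list
    approved_by: Optional[str] = None

    def approve(self, reviewer):
        self.approved_by = reviewer


def apply_change(change):
    """Refuse to apply anything a human has not signed off on."""
    if change.approved_by is None:
        raise PermissionError("change has not been reviewed; refusing to apply")
    return f"applied, approved by {change.approved_by}"


change = ProposedChange(
    diff="+Do not ask again for information the user already gave.\n",
    diagnosis="Agent re-requests context the user provided two turns earlier",
    evidence_trace_ids=["c-123", "c-456"],
)
change.approve("senior-engineer")
print(apply_change(change))
```

Because the approval, the diagnosis, and the trace ids travel together, the next cycle can check the outcome against what the change was supposed to fix.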

Build vs buy across the stack

You can build all five layers yourself. Plenty of teams start to. Here is the honest breakdown of where the effort goes:

Capture is the most buildable. OpenTelemetry plus a trace store gets you far. Budget a sprint.
Understand is where it gets hard. Auto-generating product-specific intents from raw conversations and keeping them current as your product changes is a real ML and infra problem, not a weekend project.
Diagnose and act are the parts almost nobody finishes. Most homegrown stacks die as a dashboard. The pipeline from intent cluster to a mergeable PR is the hardest and most valuable stretch.
Review you already have. It’s your existing code review process.

The pattern I see: teams build layers 1 and 2, get a nice dashboard, and stall. The dashboard tells them things are bad but never closes the loop. Six months later the insights are stale and nobody opens it. The loop is only worth building if you build all of it.

FAQ

Is a self-improving agent the same as fine-tuning?

No. Fine-tuning changes model weights and is one possible action in layer 4, but most production self-improvement happens through prompt, harness, and config changes that are cheaper, faster, and easier to review. The loop is about diagnosing and fixing behavior, not retraining a model on a schedule.

Do I need OpenTelemetry specifically for the capture layer?

You do not strictly need it, but it is the most portable choice. It is framework and vendor agnostic, so your capture layer is not coupled to one tool. If you already have structured tracing that records prompts, tool calls, and outcomes per turn, that works too. The requirement is full coverage and structure, not a specific protocol.

Why keep a human in the loop if the system can propose fixes automatically?

Because auto-applying agent changes ships regressions to everyone at once with no review. A proposed change as a pull request gives you the speed of automation and the judgment of a reviewer. The human step costs seconds per change and catches the occasional subtly wrong suggestion before it reaches users.

The five layers are not optional add-ons, they are one loop, and a loop with a gap is just a dashboard. If you want the capture, understand, diagnose, and act layers running in about two minutes instead of two quarters, that is the gap Agnost AI is built to fill, and you review and merge every change yourself.