The Self-Improving Agent Playbook: From First Conversation to Merged PR

A self-improving agent is not an agent that retrains itself in the dark. It is a closed loop: every conversation gets read, real problems get named as intents, the worst ones get diagnosed, and a concrete fix lands as a pull request you review and merge. The agent gets better because the loop runs, not because a model woke up smarter.

This is the full playbook. Nine stages, from the first conversation a user has with your agent to the first merged PR that fixes what that conversation exposed. Each stage has a thing to do and a pitfall that will quietly wreck the loop if you ignore it.

Why most “self-improving” setups never actually improve

Here is the uncomfortable truth from production. The model is almost never the bottleneck. The bottleneck is that nobody reads the conversations, nobody names the patterns, and nobody turns those patterns into a change anyone can review.

Teams test agents anecdotally, ship, and then wait for a Slack message that says “the bot is being weird again.” That is not a feedback loop. That is a complaint queue.

Most published “self-improving” approaches also lean on objectively verifiable outcomes. Code compiles or it does not. A math proof checks or it does not. Great. But your support agent, your onboarding agent, your sales agent? Success there is fuzzy, multi-turn, and tangled up with whether the user churned, stalled, or refused to upgrade. You cannot grade that with a unit test. You have to read what actually happened and act on it.

So the playbook below is built for the fuzzy case, which is most real agents.

The nine stages at a glance

Stage	What you do	The pitfall to avoid
1. Connect and instrument	Wire the agent so every call is captured	Sampling traffic and missing the rare, expensive failures
2. Capture every conversation	Store full multi-turn context, not just last turn	Logging the final response only, losing the why
3. Auto-generate custom intents	Let the system name what users are actually doing	Reusing a generic intent taxonomy that fits no product
4. Track intents live	Watch intent volume and outcomes over time	Treating intents as static labels instead of trends
5. Diagnose the failure	Tie a bad outcome to a root cause	Blaming “the model” instead of the prompt or harness
6. Draft a concrete change	Write the smallest fix that addresses the cause	Rewriting the whole system prompt in one shot
7. Open it as a reviewable PR	Put the change where humans review code	Editing prompts live in a dashboard with no diff
8. Review and merge	A human approves, then it ships	Auto-merging changes nobody read
9. Measure on the next cohort	Check the fix on fresh traffic	Declaring victory on the same data you diagnosed

Now the detail.

Stage 1: Connect and instrument the agent

What to do

Get a trace of every agent interaction flowing somewhere queryable. The bar to start is low: a 3-line SDK drop-in or an OpenTelemetry exporter is enough to begin emitting traces. Teams overthink this part. It should take about two minutes, not a quarter.

You want the model calls, the tool calls, the system prompt version in effect, and a stable conversation ID that ties turns together. That is the raw material for everything downstream.
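As a rough illustration of that raw material, here is a minimal sketch of one trace record. The class name, field names, and sample values are all hypothetical, not any particular SDK's schema; the point is only that each captured step carries the conversation ID and the prompt version alongside the call itself.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class TraceEvent:
    """One captured step in a conversation. All field names are illustrative."""
    conversation_id: str      # stable ID that ties turns together
    turn: int                 # position of this step within the conversation
    kind: str                 # "model_call" or "tool_call"
    prompt_version: str       # system prompt version in effect for this step
    payload: Dict[str, Any]   # inputs and outputs of the call
    timestamp: float = field(default_factory=time.time)


def to_json_line(event: TraceEvent) -> str:
    """Serialize an event as one JSON line, ready to append to a queryable store."""
    return json.dumps(asdict(event), sort_keys=True)


# Hypothetical example values, for illustration only.
event = TraceEvent(
    conversation_id="conv-123",
    turn=3,
    kind="tool_call",
    prompt_version="onboarding-v7",
    payload={"tool": "build_connection_string", "args": {"engine": "postgres"}},
)
line = to_json_line(event)
print(line)
```

Whatever schema you end up with, the two fields worth insisting on early are the conversation ID and the prompt version, because every later stage joins on them.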

The pitfall

Sampling. Someone will suggest capturing 10% of traffic to save money. The failures you most need to find are rare and expensive: the enterprise trial that stalled, the one churn conversation that represents fifty silent ones. Sample those away and your loop is blind exactly where it matters. Capture everything until you have a real reason not to.
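The arithmetic behind that warning is short. Assuming each conversation is sampled independently at a fixed rate (a simplifying assumption, as is the function name), the chance that a sample contains none of your rare failures is (1 - rate) raised to the number of failures:

```python
def chance_of_missing_all(sample_rate: float, rare_failures: int) -> float:
    """Probability that independent sampling captures none of the rare failures."""
    return (1.0 - sample_rate) ** rare_failures


# How often a 10% sample contains zero examples of a failure
# that occurred 1, 5, or 20 times in the period.
for count in (1, 5, 20):
    missed = chance_of_missing_all(0.10, count)
    print(f"{count} failures: missed entirely {missed:.0%} of the time")
```

At a 10% sample, a failure that happened five times is missed entirely about 59% of the time, and one that happened twenty times is still missed about 12% of the time. Even when it is caught, you usually hold one or two examples, which is too few to name a pattern.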

Stage 2: Capture every conversation, not just the last turn

What to do

Store the full multi-turn thread. The reason a user gave up is rarely in the final message. It is in turn three, where the agent misread intent, and turns four through seven, where the user tried to recover and failed.

Context is dynamic and messy in production. Users carry multiple goals into one session. If you only keep the last response, you have thrown away the evidence.
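A minimal sketch of what "store the full thread" can mean in practice, using an in-memory SQLite table. The table layout, function names, and sample messages are assumptions made for illustration; any store that keeps every turn keyed by conversation ID does the same job.

```python
import sqlite3

# In-memory database for the sketch; a real setup would use a persistent store.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE turns (
        conversation_id TEXT    NOT NULL,
        turn_index      INTEGER NOT NULL,
        role            TEXT    NOT NULL,
        text            TEXT    NOT NULL,
        PRIMARY KEY (conversation_id, turn_index)
    )
    """
)


def record_turn(conversation_id: str, turn_index: int, role: str, text: str) -> None:
    """Append one turn; earlier turns are never overwritten or dropped."""
    conn.execute(
        "INSERT INTO turns VALUES (?, ?, ?, ?)",
        (conversation_id, turn_index, role, text),
    )


def load_thread(conversation_id: str) -> list:
    """Return every turn of a conversation, in order, as (index, role, text)."""
    rows = conn.execute(
        "SELECT turn_index, role, text FROM turns "
        "WHERE conversation_id = ? ORDER BY turn_index",
        (conversation_id,),
    )
    return rows.fetchall()


# Hypothetical conversation, for illustration only.
record_turn("conv-123", 1, "user", "Help me connect my Postgres database.")
record_turn("conv-123", 2, "agent", "Here is your connection string.")
record_turn("conv-123", 3, "user", "That failed.")
print(load_thread("conv-123"))
```

The design point is that the last turn is just one row among many. Nothing in the write path decides which turns are worth keeping.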

The pitfall

“Context rot.” As threads get long, accuracy drops even with huge context windows, because the signal gets buried. So capture the whole thread, but do not feed the whole raw thread back into every analysis. Keep it complete in storage, summarize intelligently in use.

Stage 3: Auto-generate custom intents for your product

What to do

This is the stage that separates a real loop from a log viewer. Read the conversations and let the system generate intents that are specific to your product: not “support request” but “user could not connect their Postgres source,” not “feedback” but “churn risk: pricing objection after seat limit hit.”

Generic taxonomies fit nobody. The intents that matter are things like bug reports, feature requests, setup friction, upgrade hesitation, and churn risk, expressed in the actual language of your users and your domain.

This is exactly the job Agnost AI does: it connects to your agent, reads every conversation, and auto-generates custom intents for your product so you are not hand-labeling thousands of threads.

The pitfall

Forcing a fixed, off-the-shelf intent list because it feels tidy. Your product is not generic, so your intents should not be either. Let them emerge from the data, then prune.

Stage 4: Track those intents live

What to do

Intents are not labels you assign once. They are trends you watch. Track volume and outcome per intent over time. “Setup friction: webhook config” jumping 3x after a release is a strong sign the release broke something. The intents that correlate most with churn are your priority list, already ranked for you.

This is where the loop starts paying off before you have changed a single line. You can see WHY users churn, stall, or refuse to upgrade, in aggregate, instead of guessing from the loudest complaint.

The pitfall

Treating intents as static. A churn-risk intent that was 2% of traffic last month and is 9% this month is the story. Snapshot thinking hides it.
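The 2% to 9% comparison is easy to automate. The sketch below assumes you already have one intent label per conversation for each period; the function names and the 2x threshold are illustrative choices, not a recommended standard.

```python
from collections import Counter
from typing import Dict, List


def intent_share(intents: List[str]) -> Dict[str, float]:
    """Fraction of conversations carrying each intent label."""
    total = len(intents)
    if total == 0:
        return {}
    return {intent: count / total for intent, count in Counter(intents).items()}


def rising_intents(
    previous: List[str], current: List[str], min_ratio: float = 2.0
) -> Dict[str, float]:
    """Intents whose share of traffic grew by at least min_ratio between periods.

    Intents that did not appear at all in the previous period are also returned.
    """
    before, now = intent_share(previous), intent_share(current)
    rising = {}
    for intent, share in now.items():
        baseline = before.get(intent, 0.0)
        if baseline == 0.0 or share / baseline >= min_ratio:
            rising[intent] = share
    return rising


# Last month: churn risk on 2 of 100 conversations. This month: 9 of 100.
last_month = ["churn risk"] * 2 + ["other"] * 98
this_month = ["churn risk"] * 9 + ["other"] * 91
print(rising_intents(last_month, this_month))
```

Comparing shares rather than raw counts matters here: total traffic can grow between periods, and a count that doubles alongside traffic is not a trend.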

Stage 5: Diagnose the failure to a root cause

What to do

Pick the top-ranked painful intent and trace it to a cause. Failures usually live in one of four places: the agent misidentified intent, it called the wrong tool, the tool returned bad data, or it had the right data and answered wrong. Find which one.

The diagnosis should name a specific lever. “The system prompt never tells the agent to ask for the database region before generating the connection string, so it guesses and fails” is a diagnosis. “The bot is bad at setup” is a complaint.
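One way to keep a diagnosis honest is to walk the four places in a fixed order and stop at the first one that fails. The sketch below assumes a reviewer (or an automated check) has already answered three yes/no questions about a conversation; the names and the ordering are illustrative assumptions.

```python
from enum import Enum


class FailureSite(Enum):
    INTENT_MISREAD = "agent misidentified the intent"
    WRONG_TOOL = "agent called the wrong tool"
    BAD_TOOL_DATA = "tool returned bad data"
    WRONG_ANSWER = "agent had the right data and answered wrong"


def locate_failure(
    intent_correct: bool, tool_correct: bool, tool_data_correct: bool
) -> FailureSite:
    """Return the earliest point in the pipeline where a failed conversation broke.

    Checks run upstream to downstream, because a misread intent makes the
    later checks meaningless. Only call this for conversations already known
    to have failed.
    """
    if not intent_correct:
        return FailureSite.INTENT_MISREAD
    if not tool_correct:
        return FailureSite.WRONG_TOOL
    if not tool_data_correct:
        return FailureSite.BAD_TOOL_DATA
    return FailureSite.WRONG_ANSWER


# Intent and tool choice were right, but the tool returned bad data.
site = locate_failure(intent_correct=True, tool_correct=True, tool_data_correct=False)
print(site.value)
```

The site alone is not the diagnosis. It narrows where to look for the specific lever, such as the missing instruction in the system prompt.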

The pitfall

Stopping at “the model isn’t smart enough.” That is almost never the real cause, and it is the one you cannot fix this week. The fixable cause is usually in the prompt, the tool wiring, or the harness.

Stage 6: Draft a concrete change

What to do

Write the smallest change that addresses the diagnosed cause. One clarifying question added to the system prompt. One tool description tightened. One config value adjusted in your W&B setup. Small changes are reviewable, reversible, and measurable.

The pitfall

The full rewrite. When you rewrite the whole system prompt to fix one issue, you cannot tell what helped, you introduce three new regressions, and the next cohort’s data becomes uninterpretable. Resist it.

Stage 7: Open the change as a reviewable PR

What to do

Put the change where your team already reviews changes: a pull request, with a diff, a description of the intent it addresses, and the evidence behind it. The PR can target your system prompts, your agent harness, or your W&B configs.

This is the design choice that makes the loop trustworthy. A prompt change in a PR is a diff with a blame trail. A prompt change typed into a dashboard text box is an outage waiting to happen with no way to roll back cleanly.
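To make "a diff with a blame trail" concrete, here is a small sketch that renders a one-instruction prompt change as a unified diff using Python's standard difflib module. The prompt wording and the file path are invented for illustration.

```python
import difflib

# Hypothetical prompt text before and after the change.
old_prompt = """You are the onboarding agent.
When a user wants to connect a database, generate the connection string.
"""

new_prompt = """You are the onboarding agent.
When a user wants to connect a database, ask which region it is in first.
Then generate the connection string.
"""

# The file path is an assumption; use wherever your prompts actually live.
diff = "".join(
    difflib.unified_diff(
        old_prompt.splitlines(keepends=True),
        new_prompt.splitlines(keepends=True),
        fromfile="prompts/onboarding.txt",
        tofile="prompts/onboarding.txt",
    )
)
print(diff)
```

A reviewer sees exactly one removed line and two added lines. That is the property a dashboard text box cannot give you: the change is small enough to read and precise enough to revert.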

Agnost AI does this part for you: once it diagnoses an issue, it opens a PR against your prompts, harness, or configs with the change and the conversations that justify it. You did not write the fix from scratch, and you did not lose control of it either.

The pitfall

Editing prompts live in a UI with no version history. It feels fast. It is the reason nobody can answer “what changed right before quality dropped.”

Stage 8: Review and merge

What to do

A human reads the PR, checks the change against the evidence, and merges. This is non-negotiable. The agent does not self-improve by merging its own changes unsupervised. It self-improves because the loop produces small, reviewed, mergeable changes fast enough to keep up with reality.

The pitfall

Auto-merge. The moment changes ship without a human reading them, you have traded a slow problem (drift) for a fast one (silent regressions at scale). Keep the human on the merge button.

Stage 9: Measure on the next cohort, then repeat

What to do

After merge, watch the same intent on fresh traffic. Did “setup friction: webhook config” drop? Did the churn-risk correlation weaken? You are measuring the fix against users who were not part of the diagnosis, which is the only honest test.

Then go back to Stage 4 and pick the next intent. The loop is the product. One pass is a fix. A running loop is a self-improving agent.

The pitfall

Grading the fix on the same conversations you used to diagnose it. Of course it looks better there: you tuned to it. Always measure forward.
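Measuring forward can be enforced in code rather than left to discipline. This sketch assumes each conversation is a dict with an ID, a start timestamp, and a list of intent labels; those field names, the function names, and the sample data are hypothetical.

```python
from typing import Dict, List, Set


def intent_rate(conversations: List[Dict], intent: str) -> float:
    """Share of conversations that carry the given intent."""
    if not conversations:
        return 0.0
    hits = sum(1 for conv in conversations if intent in conv["intents"])
    return hits / len(conversations)


def forward_rate(
    conversations: List[Dict], intent: str, merged_at: float, diagnosed_ids: Set[str]
) -> float:
    """Intent rate on fresh traffic only.

    Keeps conversations that started at or after the merge and were not
    among the ones used to diagnose the issue.
    """
    fresh = [
        conv
        for conv in conversations
        if conv["started_at"] >= merged_at and conv["id"] not in diagnosed_ids
    ]
    return intent_rate(fresh, intent)


# Tiny illustrative dataset: the fix merged at t=100.
conversations = [
    {"id": "a", "started_at": 50, "intents": ["setup friction"]},
    {"id": "b", "started_at": 60, "intents": ["setup friction"]},
    {"id": "c", "started_at": 120, "intents": []},
    {"id": "d", "started_at": 130, "intents": ["setup friction"]},
    {"id": "e", "started_at": 140, "intents": []},
    {"id": "f", "started_at": 150, "intents": []},
]
rate = forward_rate(conversations, "setup friction", merged_at=100, diagnosed_ids={"a", "b"})
print(rate)
```

Filtering on both the merge time and the diagnosed IDs is deliberate. The time cut keeps pre-fix traffic out, and the ID cut protects you if a diagnosed conversation was reopened after the merge.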

A full example: one issue, first conversation to merged PR

Here is the loop running end to end on one realistic, illustrative issue.

First conversation. A trial user asks the onboarding agent to connect their Postgres database. The agent generates a connection string, the user pastes it, it fails. They try twice more, then go quiet. Session ends, no upgrade.
Captured. The full seven-turn thread is stored with the system prompt version and the tool calls, not just the cheerful final “let me know if you need anything else.”
Intent generated. The system reads it and similar threads and names a custom intent: “setup friction: Postgres connection string region mismatch.”
Tracked live. That intent is 6% of new-trial conversations this week, up from near zero, and it correlates strongly with trials that never convert.
Diagnosed. Root cause: the system prompt instructs the agent to build the connection string but never tells it to ask which region the database is in, so it defaults wrong.
Change drafted. Add one instruction: ask for the database region before generating the string, and validate it against the supported list.
PR opened. A diff against the system prompt lands, titled with the intent and linking the conversations that justify it.
Reviewed and merged. An engineer reads the three sample threads, agrees, merges.
Measured on the next cohort. Over the following week, the intent drops from 6% to under 1% on fresh trials, and trial-to-paid for that segment ticks up.

That is one turn of the loop. The agent did not get smarter on its own. The infrastructure around it turned a silent churn into a merged, measurable fix.

FAQ

Is a self-improving agent the same as an agent that retrains itself?

No. Retraining is one possible action and usually the wrong first one. A self-improving agent in production improves mostly through small, reviewed changes to prompts, tool wiring, and configs, driven by what real conversations reveal. The loop is the mechanism, not the model.

Does this require objectively verifiable outcomes like code or math?

That is where naive self-improvement works best, but most real agents are fuzzy and multi-turn. This playbook is built for that case: you read conversations, name intents, and measure trends like churn correlation and intent volume instead of pass/fail grades.

Do I have to give up control to make my agent self-improving?

The opposite. Every change shows up as a reviewable PR with the evidence attached, and a human merges it. You keep the merge button. The system does the reading, naming, diagnosing, and drafting that you do not have time to do by hand.

Agnost AI is the infrastructure that runs this loop for you: it reads every conversation, auto-generates your custom intents, tracks why users churn or stall, and opens the PR against your prompts, harness, and configs. You review and merge. It is free to start and takes about two minutes with a 3-line SDK or OpenTelemetry.