The short version
Your agent’s conversations are the highest-signal bug tracker you own, and almost nobody mines them. To turn conversations into code, you run a five-step pipeline: capture every transcript, cluster them into recurring intents, rank those intents by impact, translate the winning pattern into a concrete prompt/code/config change, then open it as a reviewable PR. The hard part isn’t the fix. It’s connecting “47 users got stuck at the same step” to “here’s the exact diff that unsticks them.”
Most teams skip this loop entirely. They read maybe 20 transcripts on a Friday, get a vibe, and tweak a system prompt from memory. Then they wonder why the same complaints keep showing up in the next batch.
Why are conversations better than your bug tracker?
Because users don’t file tickets. They just leave, or they tell your agent and move on.
When someone types “wait, where do I even put the API key” into your chat agent, that is a bug report. A real one, with timestamp, full context, and the exact phrasing of a confused human. Your Jira board will never see it. Your agent already did.
We see this constantly in production data. The signal hiding in raw transcripts is denser than any survey or support queue, for three reasons:
- Volume. A mid-size agent handles thousands of conversations a week. Your support inbox sees a fraction of that, heavily skewed toward angry power users.
- Honesty. Nobody performs for a chatbot. People type what they actually mean, typos and all.
- Context. The transcript shows you the exact step before the frustration. A ticket says “checkout is broken.” The conversation shows the user asked the same question three times because the agent kept referencing a button that no longer exists.
The problem was never a lack of data. It’s that conversations are unstructured, high-volume, and nobody has time to read 4,000 of them. So the signal rots.
What does the conversation-to-PR pipeline actually look like?
Five steps. You can run all of them by hand. It just doesn’t scale past a few hundred conversations a week.
1. Capture everything
Log the full transcript, not just the final answer. You want every turn, tool call, latency, and the metadata that lets you slice later (plan tier, account age, which feature they were using). If you’re only storing the last LLM response, you’ve already thrown away the part that matters: the back-and-forth where the user got confused.
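As a rough illustration, here is one way a captured record could be shaped if you log conversations as JSON lines. Every field name (`plan_tier`, `account_age_days`, `feature`, `latency_ms`, `tool_name`) is a hypothetical choice for this sketch, not a required schema; use whatever your stack already emits.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class Turn:
    role: str                          # "user", "assistant", or "tool"
    content: str
    latency_ms: Optional[int] = None   # time to produce this turn, if measured
    tool_name: Optional[str] = None    # set only when the turn is a tool call


@dataclass
class Transcript:
    conversation_id: str
    plan_tier: str            # hypothetical slicing metadata
    account_age_days: int
    feature: str
    turns: list[Turn] = field(default_factory=list)

    def to_jsonl(self) -> str:
        """Serialize the whole conversation, every turn included, as one JSON line."""
        return json.dumps(asdict(self))


t = Transcript("c-001", plan_tier="free", account_age_days=2, feature="webhooks")
t.turns.append(Turn("user", "why is my webhook not firing"))
t.turns.append(Turn("assistant", "Check your endpoint logs.", latency_ms=840))
line = t.to_jsonl()
```

The point of the shape is that the full back-and-forth and the slicing metadata travel together, so later steps never have to join against a second table to find out who was confused.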
2. Cluster into intents
This is where it gets real. A single transcript is an anecdote. A hundred transcripts saying the same thing in different words is a roadmap.
The goal is to group raw conversations into recurring intents: bug report, feature request, churn risk, setup friction, pricing confusion, and whatever else is specific to your product. Not generic sentiment scores. Actual patterns like “users on the free plan keep asking how to invite teammates and the agent tells them it’s a paid feature without saying which plan.”
Doing this manually means tagging transcripts and praying your taxonomy holds up. Doing it with embeddings plus an LLM pass gets you most of the way, but you still have to babysit the clusters so they don’t collapse into mush.
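To make the grouping idea concrete, here is a toy sketch that uses only the standard library. It swaps real embeddings for plain word counts and uses a greedy single-pass grouping, so treat `embed`, `cluster`, and the `0.5` threshold as illustrative assumptions rather than a recommended setup. Naming each cluster, which is where the LLM pass would come in, is not shown.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(messages: list[str], threshold: float = 0.5) -> list[list[str]]:
    """Greedy grouping: a message joins the first cluster whose seed is similar enough."""
    clusters: list[tuple[Counter, list[str]]] = []
    for msg in messages:
        vec = embed(msg)
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(msg)
                break
        else:
            # No existing cluster matched, so this message seeds a new one.
            clusters.append((vec, [msg]))
    return [members for _, members in clusters]


messages = [
    "why is my webhook not firing",
    "my webhook is not firing, why",
    "where do I put the API key",
    "where do I even put the API key",
]
groups = cluster(messages)  # two groups: webhook questions, API key questions
```

Word counts only catch near-identical phrasing. Real embeddings are what let “it never triggers” land in the same group as “not firing”, and the threshold is the knob you end up babysitting.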
3. Rank by impact
You will find 30 recurring intents. You can fix maybe three this sprint. So rank them. A rough scoring model that works:
| Factor | Why it matters |
|---|---|
| Frequency | How many conversations hit this pattern |
| Stage | Friction at onboarding hurts more than friction in a power-user flow |
| Revenue exposure | Is this blocking upgrades or putting paying accounts at risk |
| Fix cost | A prompt tweak is cheap. A new tool integration is not |
Multiply frequency by revenue exposure, weight the result up for early-stage friction like onboarding, divide by fix cost, and you’ve got a crude but honest priority list. The point is to stop fixing whatever the loudest person in standup remembers.
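A minimal sketch of that scoring model follows. The 1-to-5 scales, the `stage_weight` multiplier for the Stage row of the table, and every input number except the 61 are assumptions made up for illustration; calibrate them against your own data.

```python
from dataclasses import dataclass


@dataclass
class Intent:
    name: str
    frequency: int             # conversations that hit this pattern
    revenue_exposure: float    # hypothetical 1-5 scale
    fix_cost: float            # hypothetical 1-5 scale, 1 = prompt tweak
    stage_weight: float = 1.0  # assumption: above 1.0 for onboarding friction


def priority(intent: Intent) -> float:
    """Frequency times revenue exposure, weighted by stage, divided by fix cost."""
    return (
        intent.frequency * intent.revenue_exposure * intent.stage_weight
        / intent.fix_cost
    )


# Illustrative inputs only.
intents = [
    Intent("export to CSV request", frequency=35, revenue_exposure=1, fix_cost=5),
    Intent("webhook not firing", frequency=61, revenue_exposure=4, fix_cost=1,
           stage_weight=1.5),
    Intent("invite teammates on free plan", frequency=120, revenue_exposure=2,
           fix_cost=3),
]
ranked = sorted(intents, key=priority, reverse=True)
```

Note that the most frequent intent does not come out on top here. A cheap fix to an onboarding blocker outranks a pattern with twice the volume, which is exactly the kind of call that is hard to make from memory.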
4. Translate the pattern into a change
Here’s the step everyone underestimates. “Users are confused about API key setup” is not a fix. It’s a feeling. The fix is a specific diff.
That recurring intent might translate into:
- A system prompt edit so the agent always asks “are you on the dashboard or the CLI?” before giving setup instructions.
- A harness change that injects the user’s current plan into context so the agent stops recommending features they can’t access.
- A config change (retrieval weights, a W&B config, tool descriptions) so the agent stops hallucinating that deprecated button.
The translation is the craft. You’re going from a cluster of complaints to one concrete, testable change.
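As one example of what such a change can look like in code, here is a sketch of the harness option from the list above: injecting the user’s plan into the system prompt. `BASE_PROMPT`, `PLAN_FEATURES`, and `build_system_prompt` are hypothetical names, and the plan-to-feature mapping is invented for the example.

```python
BASE_PROMPT = "You are a support agent for a developer tool."

# Hypothetical mapping of plan tier to features the user can actually access.
PLAN_FEATURES = {
    "free": ["webhooks", "api keys"],
    "team": ["webhooks", "api keys", "teammate invites"],
}


def build_system_prompt(plan_tier: str) -> str:
    """Append the user's plan and its available features to the base prompt."""
    features = PLAN_FEATURES.get(plan_tier, [])
    return (
        f"{BASE_PROMPT}\n"
        f"The user is on the {plan_tier} plan.\n"
        f"Features available on this plan: {', '.join(features) or 'unknown'}.\n"
        "Do not recommend features that are not listed."
    )


prompt = build_system_prompt("free")
```

Because the change is a small function with clear inputs and outputs, it is testable: you can assert that a free-plan prompt never mentions teammate invites before anyone ships it.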
5. Open a reviewable PR
Don’t push straight to prod. A change to a system prompt is a code change and deserves the same scrutiny: a diff, a description of the pattern it fixes, the conversations that motivated it, and a human review before merge.
This is the discipline most teams skip, and it’s why their prompts turn into 800-line junk drawers nobody dares touch.
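One way to keep that discipline cheap is to generate the PR description from the cluster itself. The sketch below assumes a hypothetical `render_pr_body` helper and made-up conversation IDs; it only builds the text and does not call any Git hosting API.

```python
def render_pr_body(
    pattern: str, conversation_ids: list[str], before: str, after: str
) -> str:
    """Build a markdown PR description from an intent cluster and a proposed change."""
    links = "\n".join(f"- {cid}" for cid in conversation_ids)
    return (
        f"## Pattern\n{pattern}\n\n"
        f"## Motivating conversations ({len(conversation_ids)})\n{links}\n\n"
        f"## Before\n{before}\n\n"
        f"## After\n{after}\n"
    )


body = render_pr_body(
    pattern="Users are confused about API key setup.",
    conversation_ids=["c-001", "c-017"],  # hypothetical IDs
    before="Give API key setup instructions immediately.",
    after='Ask "are you on the dashboard or the CLI?" before giving setup instructions.',
)
```

A reviewer who opens this PR sees the pattern, the evidence, and the diff in one place, which is what makes a prompt edit reviewable like any other code change.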
A worked example, start to finish
Say you run a developer-tool agent. Here’s the loop in practice.
Capture. Over a week you log 3,200 conversations.
Cluster. An intent surfaces: 61 conversations where users ask some variant of “why is my webhook not firing.” The agent’s answer is technically correct but assumes the user already enabled webhooks in settings. They hadn’t. The agent never checks.
Rank. 61 conversations, mostly from trial accounts in their first three days, several of whom churned within 48 hours. High frequency, onboarding stage, revenue exposure (these are prospects), low fix cost. This one jumps the queue.
Translate. The fix is a system prompt change. Before giving webhook debugging steps, the agent should ask whether webhooks are enabled and link the exact settings path. Concretely:
When a user reports a webhook is not firing:
- Walk them through verifying their endpoint and checking logs.
+ First confirm webhooks are enabled at Settings > Integrations > Webhooks.
+ If the user is unsure, link them there before any debugging steps.
+ Only after enablement is confirmed, walk through endpoint and log checks.
PR. Open it with a title like “Fix webhook setup friction (61 conversations, trial churn).” Body links the cluster, shows before/after, and tags whoever owns the agent. Someone reviews it, sanity-checks that it won’t break the power-user path, and merges. Next week you watch that intent’s volume drop. If it doesn’t, you reopen.
That’s the whole game. Conversations to code, with a human in the loop on the merge.
Where does Agnost AI fit?
Running this manually breaks somewhere around a few hundred conversations a week. The clustering eats your week, the ranking gets political, and the translation step quietly stops happening.
Agnost AI is the infrastructure for self-improving agents that runs this exact pipeline for you. You connect your agent in about two minutes with a three-line SDK or OpenTelemetry, and it works with any LLM and any framework. It reads every conversation, auto-generates the custom intents for your product, tracks them live so you can see why users churn or stall, and then opens pull requests against your system prompts, agent harness, and W&B configs. You review and merge. The loop above stops being a Friday afternoon chore and becomes the default.
To be clear about scope: this is about turning real conversation signal into shippable changes, not about scoring your agent against a static test set.
FAQ
Do I need a huge volume of conversations for this to work?
No, but the value compounds with volume. At a few hundred conversations a week you can eyeball clusters by hand. Past a couple thousand, manual reading falls apart and you need tooling to cluster and rank, otherwise the highest-impact patterns hide in the noise.
Isn’t editing system prompts based on conversations risky?
It’s risky if you push blind. That’s exactly why the last step is a PR, not a direct deploy. A diff, the motivating conversations, and a human review catch the change that fixes one cohort while breaking another. Treat prompt changes like any other code change and the risk drops to normal.
Can’t I just read transcripts myself?
You can, and you should occasionally, for taste. But reading is sampling, and samples lie. You’ll over-index on the last vivid complaint and miss the quiet pattern across 60 conversations that’s actually costing you upgrades. Clustering exists because human reading doesn’t scale and isn’t statistically honest.
If your agent is talking to thousands of users and that signal is just sitting in a logs table, you’re leaving your best roadmap on the floor. Agnost AI turns those conversations into reviewable pull requests so the fixes actually ship, and it’s free to start with no sales call.