To improve an AI agent in production, you don’t start with the model. You start with the conversations. The fastest path from “our agent feels worse than the demo” to “our agent is measurably better” is a loop: instrument real chats, read the signals, turn them into a prioritized backlog of prompt and harness changes, ship them as reviewable diffs, then check whether the live cohorts actually got better.
That’s the whole job. Everything below is how to run it without drowning.
Why production is a different game than launch
The demo passed. The launch went fine. And then real users showed up and did things your test cases never imagined.
Here’s the thing nobody tells you: a launched agent isn’t a finished product, it’s a hypothesis. Users will phrase requests in ways you didn’t anticipate, paste in formats you didn’t handle, and abandon flows at the exact step you assumed was obvious. The PM who owns the agent post-launch isn’t maintaining it. They’re running an ongoing investigation into why real conversations go sideways.
The mistake I see most often: teams treat the agent like a static feature. They ship it, move the team to the next thing, and only look again when complaints pile up or a churn number spikes. By then you’re doing forensics on a cold trail.
What should a PM instrument first?
Don’t instrument everything. You’ll bury yourself. Start with the signals that map directly to a decision you can act on this week.
The four I’d wire up before anything else:
- Completion vs. abandonment per intent. Did the user get what they came for, or did they bail? Tracked by what they were trying to do, not by raw turn count.
- Turn-depth to resolution. How many back-and-forths before the agent actually helped. Resolution at turn 2 is healthy. Resolution at turn 9 means something upstream is broken.
- Escalation and handoff rate. How often the agent punts to a human, a fallback, or a dead “I can’t help with that.”
- Negative sentiment and repair attempts. Users repeating themselves, rephrasing, or getting frustrated is your earliest churn signal, and it shows up turns before anyone fills out a survey.
Notice what’s missing: token counts, p99 latency, raw log volume. Those matter for your infra team. They tell a PM almost nothing about whether the agent is good. Latency dashboards answer “is it up.” You need to answer “is it working.”
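To make those four signals concrete, here is a minimal sketch of how they could be computed per intent. Everything about the data shape is an assumption: the `Conversation` fields are hypothetical names, not a real logging schema, and abandonment is treated simply as one minus the completion rate.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Conversation:
    # All field names are illustrative, not a real logging schema.
    intent: str           # what the user was trying to do
    turns: int            # user/agent exchanges in the conversation
    resolved: bool        # did the user get what they came for?
    escalated: bool       # handed off to a human or a fallback
    repair_attempts: int  # times the user repeated or rephrased a request


def signals_by_intent(conversations):
    """Return the four starter signals for each intent."""
    grouped = defaultdict(list)
    for convo in conversations:
        grouped[convo.intent].append(convo)

    report = {}
    for intent, convos in grouped.items():
        total = len(convos)
        resolved = [c for c in convos if c.resolved]
        report[intent] = {
            "conversations": total,
            "completion_rate": len(resolved) / total,
            # Turn depth only counts conversations that reached resolution.
            "avg_turns_to_resolution": (
                sum(c.turns for c in resolved) / len(resolved) if resolved else None
            ),
            "escalation_rate": sum(c.escalated for c in convos) / total,
            "repair_rate": sum(c.repair_attempts > 0 for c in convos) / total,
        }
    return report


sample = [
    Conversation("setup", turns=2, resolved=True, escalated=False, repair_attempts=0),
    Conversation("setup", turns=9, resolved=False, escalated=True, repair_attempts=3),
    Conversation("billing", turns=4, resolved=True, escalated=False, repair_attempts=1),
]
report = signals_by_intent(sample)
for intent, metrics in report.items():
    print(intent, metrics)
```

The point of keying everything by intent is that a healthy blended average can hide one intent that is quietly failing.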
How do you read conversation signals without going insane?
This is where most teams stall out. You can’t read every conversation. At any real volume, manual review is a part-time job that produces anecdotes, not decisions.
The unlock is grouping conversations by intent, not by keyword or topic. What was the user actually trying to do, and where did it break? When you cluster real chats by intent, patterns that were invisible in a flat log become obvious:
- A spike in “setup friction” conversations every time someone mentions API keys
- A cluster of “feature request” intents that’s really one missing capability asked 40 different ways
- A “churn risk” pattern where users on the trial plan stall at the exact same configuration step
The trap to avoid: building a fixed taxonomy upfront and forcing every conversation into it. Your users don’t care about your categories. The useful intents are the ones that emerge from the actual conversations, and they shift as your product changes. A rigid label set goes stale in a month.
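Once conversations carry an intent label, finding where they break is mostly counting. The sketch below assumes the labeling has already happened upstream (however you do it) and that each record notes the last step the user reached. The tuple layout, step names, and `min_count` threshold are all hypothetical.

```python
from collections import Counter

# Each record: (intent, plan, last_step_reached, abandoned). Hypothetical data.
records = [
    ("setup friction", "trial", "configure_api_key", True),
    ("setup friction", "trial", "configure_api_key", True),
    ("setup friction", "trial", "invite_team", True),
    ("setup friction", "paid", "configure_api_key", False),
    ("feature request", "paid", "export", False),
]


def stall_points(records, min_count=2):
    """Count where abandoned conversations stopped, per (intent, plan, step)."""
    counts = Counter(
        (intent, plan, step)
        for intent, plan, step, abandoned in records
        if abandoned
    )
    # Keep only combinations that repeat; a single stall is an anecdote.
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


hotspots = stall_points(records)
for (intent, plan, step), n in hotspots:
    print(f"{n} abandoned '{intent}' conversations on the {plan} plan stopped at {step}")
```

Grouping by intent and plan together is what lets a pattern like "trial users stall at the same configuration step" stand out from a flat log.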
This is exactly the problem Agnost AI was built to handle. It reads every conversation your agent has and auto-generates custom intents for your product (bug reports, feature requests, churn risk, setup friction, and more), then tracks them live so the why behind a stall surfaces on its own instead of you grepping logs at midnight.
Turning signals into a prioritized backlog
A signal isn’t a task. “Users are confused during onboarding” is a vibe. “When users paste a Slack-formatted message, the agent strips the formatting and asks them to retype it; 38% of those conversations are abandoned” is a ticket.
Score each candidate fix on three things and stop overthinking it:
| Factor | Question | Weight |
|---|---|---|
| Frequency | How many real conversations hit this? | High |
| Severity | Does it cause abandonment, churn, or just friction? | High |
| Fix cost | Prompt tweak, harness change, or net-new capability? | Medium |
A high-frequency, high-severity problem that’s a one-line system prompt edit goes to the top of the list, obviously. The interesting calls are the medium ones. A rare but brutal failure (agent confidently gives wrong billing info) usually beats a common but mild annoyance.
One rule I’d tattoo on the backlog: every item must name the mechanism. Not “improve onboarding” but “the system prompt doesn’t tell the agent to handle pasted config; add an instruction and a few-shot example.” If you can’t name the mechanism, you don’t understand the problem yet. Go read five more conversations.
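If it helps to see the scoring as arithmetic, here is one rough way to turn the three factors into a sortable number. The weights, the 1-to-3 scales, and the two example candidates are all assumptions for illustration; treat the output as a sorting aid, not a formula to defend in a meeting.

```python
from dataclasses import dataclass

# Illustrative weights: frequency and severity count double, fix cost counts once.
WEIGHTS = {"frequency": 2.0, "severity": 2.0, "fix_cost": 1.0}
SEVERITY = {"friction": 1, "abandonment": 2, "churn": 3}
FIX_COST = {"prompt tweak": 1, "harness change": 2, "net-new capability": 3}


@dataclass
class Candidate:
    mechanism: str                 # the named cause, not a vague goal
    share_of_conversations: float  # 0.0 to 1.0
    severity: str                  # key into SEVERITY
    fix: str                       # key into FIX_COST

    def score(self):
        if not self.mechanism.strip():
            raise ValueError("name the mechanism before scoring")
        frequency = self.share_of_conversations * 3  # rescale to 0..3
        cheapness = 4 - FIX_COST[self.fix]           # cheaper fix scores higher
        return (WEIGHTS["frequency"] * frequency
                + WEIGHTS["severity"] * SEVERITY[self.severity]
                + WEIGHTS["fix_cost"] * cheapness)


backlog = sorted(
    [
        Candidate("System prompt has no instruction for pasted config",
                  share_of_conversations=0.30, severity="friction", fix="prompt tweak"),
        Candidate("Billing tool results are not passed into context, so the agent guesses",
                  share_of_conversations=0.02, severity="churn", fix="harness change"),
    ],
    key=lambda c: c.score(),
    reverse=True,
)
for item in backlog:
    print(f"{item.score():5.2f}  {item.mechanism}")
```

With these made-up numbers, the rare but brutal billing failure outranks the common but mild formatting annoyance, which matches the judgment call described above. An empty mechanism refuses to score at all.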
How should fixes ship: prompts, harness, or config?
Most agent improvements are not model swaps. They’re small, surgical changes in three places:
- System prompts. The biggest lever and the most underused. Most “the model is dumb” complaints are actually “we never told the model how to handle this.” Specific instructions and a couple of grounded examples fix more than people expect.
- The agent harness. Tool definitions, retrieval logic, routing, retry behavior, what context gets passed where. A lot of “bad answers” are really “the agent never got the right context to begin with.”
- Configs (including your W&B setup). The scaffolding around the agent: parameters, environments, the plumbing that decides what runs and when.
Whatever you change, ship it as a reviewable diff. This is the part teams skip and regret. A prompt edited live in a dashboard with no version history is a future incident. When a change is a pull request, engineering can review it, you can roll it back, and six weeks later you can answer “why is the prompt phrased this way” instead of shrugging.
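As a small illustration of what "reviewable" means in practice, the sketch below renders a prompt change as a unified diff using Python's standard `difflib`. The prompt text, the "Acme" product name, and the file labels are made up; in a real setup the prompt would live in a file under version control and the diff would come from the pull request itself.

```python
import difflib

# Hypothetical before/after versions of a system prompt.
old_prompt = """You are a support agent for Acme.
Answer questions about setup and billing.
"""

new_prompt = """You are a support agent for Acme.
Answer questions about setup and billing.
If the user pastes a config file, parse it as-is; never ask them to retype it.
"""

diff = "".join(
    difflib.unified_diff(
        old_prompt.splitlines(keepends=True),
        new_prompt.splitlines(keepends=True),
        fromfile="prompts/system.txt@v1",  # illustrative labels only
        tofile="prompts/system.txt@v2",
    )
)
print(diff)
```

A one-line addition like this is easy for a reviewer to reason about, easy to attribute a metric change to, and easy to revert.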
This is the workflow Agnost AI leans into: it opens pull requests against your system prompts, agent harness, and W&B configs to fix what it finds in the conversation data, and you review and merge. Same git flow your team already trusts. Nothing changes without a human approving the diff.
Working with engineering without becoming the bottleneck
The friction between PMs and engineers on agent work is almost always about specificity. Engineers don’t want “make the agent better.” They want a diff they can reason about.
Give them three things per item: the conversations that show the problem, the proposed mechanism, and the metric you expect to move. That’s it. When the proposed change arrives as a PR with the failing examples linked, the review is fast because everyone’s looking at the same evidence. No re-litigating whether the problem is real.
And resist the urge to batch fifteen changes into one heroic deploy. Small diffs, shipped often, attributed cleanly. If three things change at once and the numbers move, you’ll never know which one did it.
How do you know the agent actually improved?
Measure on live cohorts, not on your gut and not on a frozen test set. The question is never “is the prompt better in the abstract.” It’s “did the cohort of users who hit this change have better outcomes than the cohort before it.”
Concretely:
- Pick the one intent your change targets (say, “setup friction”).
- Track completion and abandonment for that intent before and after the change.
- Compare matched cohorts, not all traffic. New-user setup friction shouldn’t be diluted by returning power users who never see onboarding.
- Give it enough volume to mean something. A change that “fixed it” across 12 conversations hasn’t proven anything yet.
If the targeted metric moves and nothing else regresses, keep the change. If it doesn’t move, revert it and go back to the conversations. The discipline is killing your own changes when the data says they didn’t work. That’s hard for everyone, PMs included.
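The before/after comparison can be sketched as a one-sided two-proportion z-test on completion rate for the targeted intent. This is one reasonable way to check a lift, not the only one. The minimum cohort size of 200 and the 0.05 threshold are assumptions to tune, and the function deliberately ignores the "nothing else regresses" check, which you would still run separately.

```python
import math


def compare_cohorts(before_done, before_total, after_done, after_total,
                    min_per_cohort=200, alpha=0.05):
    """One-sided two-proportion z-test on completion rate for one intent."""
    if min(before_total, after_total) < min_per_cohort:
        return "wait"  # not enough volume to mean anything yet

    p_before = before_done / before_total
    p_after = after_done / after_total
    pooled = (before_done + after_done) / (before_total + after_total)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / before_total + 1 / after_total))
    if std_err == 0:
        return "revert"  # both cohorts at 0% or 100%: no measurable effect

    z = (p_after - p_before) / std_err
    # One-sided p-value: chance of a lift this large if nothing really changed.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return "keep" if p_value < alpha else "revert"


print(compare_cohorts(7, 12, 11, 12))        # tiny cohorts
print(compare_cohorts(310, 500, 365, 500))   # 62% -> 73% completion
print(compare_cohorts(310, 500, 315, 500))   # 62% -> 63% completion
```

The three calls cover the cases from the list above: too little volume to decide, a lift large enough to keep, and a change small enough that it is indistinguishable from noise and should be reverted.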
A weekly cadence that actually holds up
The teams that improve their agents fastest run a tight, boring loop. Boring is the point.
- Monday: Review the week’s emergent intents and their volume trends. What’s new? What’s spiking?
- Tuesday: Pick 2 to 4 problems. Write tickets with mechanism + target metric. Pull the supporting conversations.
- Wednesday/Thursday: Engineering ships the diffs as PRs. PM reviews against the linked examples.
- Friday: Check last week’s merged changes against their target metrics on live cohorts. Keep, tune, or revert.
Two to four problems a week, shipped and measured, compound into a dramatically better agent over a quarter. Trying to fix forty at once gets you zero.
The roles: who owns what
| Role | Owns |
|---|---|
| PM | Reading intents, prioritizing the backlog, defining target metrics, keep/revert calls |
| Engineering | Implementing fixes as reviewable diffs, harness and config changes |
| Data / Analyst | Cohort setup, before/after measurement, guarding against regressions |
| Support / CX | Surfacing qualitative pain that hasn’t hit the metrics yet |
If your team is small and that’s several hats on one person, fine. The point is that each responsibility is named, not that you need four people.
FAQ
How often should I update my agent’s system prompt in production?
As often as the conversation data justifies, but always through a reviewable diff with version history. Weekly is a healthy default. The risk isn’t editing too often, it’s editing live with no record and no way to measure or roll back.
What’s the single most important metric to improve an AI agent in production?
Completion rate per intent. It ties directly to whether users got what they came for, and you can attribute movement in it to specific changes. Latency and token counts matter for infra, not for judging whether the agent is good.
Do I need a separate test set to measure improvements?
No. Measure on live cohorts, comparing matched groups of real users before and after a change. A frozen test set tells you about yesterday’s problems. Your production traffic is the only ground truth that keeps up with how users actually behave.
If you’d rather not stitch this loop together by hand, that’s the gap Agnost AI fills: it reads every conversation, surfaces the intents behind churn and stalls, and opens the pull requests to fix them so your agent improves on a schedule instead of by luck. You stay in control of every merge, and the agent gets better every week.