
How to Improve Your AI Agent's System Prompt From Real Conversations

A practical guide to system prompt optimization driven by real production conversations: collect failures, cluster patterns, ship reviewable diffs.

The short answer

System prompt optimization works best when it stops being a vibes exercise. Instead of staring at your prompt and adding another “be concise” line because something felt off, you mine your real production conversations for failures and friction, cluster them into recurring patterns, then translate each pattern into one specific prompt change you can test and ship as a reviewable diff. That is the whole loop, and the rest of this post is how to actually run it.

Most teams I talk to iterate on prompts the same way: someone notices a bad reply in a Slack screenshot, opens the prompt, and tweaks a sentence. Three weeks later nobody remembers why that sentence is there, and the prompt is 1,400 tokens of accumulated superstition. There is a better way, and it does not require a research team.

Why guesswork prompts go stale

Here is the thing nobody tells you when you ship your first agent: your prompt is fine for the demo and wrong for production. The demo covers the three happy paths you tested. Production covers the 200 weird things real users actually do.

When you edit prompts from gut feeling, you optimize for the last bad example you saw, not the distribution of what’s actually breaking. You fix the loud failure (the angry user who emailed you) and ignore the quiet one (the 8% of users who silently abandon during setup). The loud failure is usually rare. The quiet one is usually killing your retention.

Conversation-driven optimization flips this. You let the volume of real failures tell you what to fix first, in order of impact, instead of fixing whatever’s freshest in your memory.

Step 1: Collect real failures and friction, not just thumbs-down

Thumbs-up/thumbs-down feedback is a trap. Most users never click it, and the ones who do are a biased sample. You need signal from the conversations themselves.

What to actually capture:

  • Hard failures: the agent says it can’t do something it should, hallucinates a feature, loops, or hands off to a human unnecessarily.
  • Friction: the user has to repeat themselves, rephrase, or correct the agent before it understands.
  • Abandonment: the conversation just stops mid-task. No goodbye, no resolution. This is the most underrated signal there is.
  • Escalation language: “this isn’t what I asked,” “no, I meant,” “forget it.” These phrases are gold.

You don’t need fancy tooling to start. Pull a week of transcripts, read 50 of them with coffee, and tag what went wrong. Reading raw conversations is humbling and it’s the single highest-leverage thing a founder can do for their agent. Do it before you automate anything.
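Once you’ve done that manual pass, a first-cut tagger for the signals above might look something like the sketch below. The `Transcript` shape, the tag names, and the heuristics are all assumptions for illustration; your logs will look different, and hard failures like hallucinated features still need a human read.

```python
import re
from dataclasses import dataclass, field

# Hypothetical transcript shape: (speaker, text) turns plus a resolved flag.
# Adapt this to whatever your logging actually stores.
@dataclass
class Transcript:
    turns: list[tuple[str, str]]
    resolved: bool = False
    tags: set[str] = field(default_factory=set)

# Escalation phrases from the list above; extend with what you see in your traffic.
ESCALATION = re.compile(
    r"this isn[’']t what i asked|no, i meant|forget it", re.IGNORECASE
)

def tag(t: Transcript) -> set[str]:
    """Attach rough failure tags to a transcript and return them."""
    user_msgs = [text for speaker, text in t.turns if speaker == "user"]
    if any(ESCALATION.search(m) for m in user_msgs):
        t.tags.add("escalation_language")
    # Friction heuristic (crude): the user sends the same message twice.
    normalized = [m.strip().lower() for m in user_msgs]
    if len(set(normalized)) < len(normalized):
        t.tags.add("friction_repeat")
    # Abandonment heuristic: unresolved, and the agent had the last word.
    if not t.resolved and t.turns and t.turns[-1][0] == "agent":
        t.tags.add("abandonment")
    return t.tags
```

Treat the output as a shortlist of conversations to read, not as a verdict. The heuristics will miss rephrasings and flag some harmless chats.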

Step 2: Cluster failures into patterns (intents)

One bad conversation is an anecdote. Forty bad conversations that all fail the same way are a pattern, and patterns are what you fix.

The goal here is to group your tagged failures into a handful of recurring intents. Not “user was confused” (too vague) but specific, named buckets like:

Pattern | What it looks like | Rough frequency
--- | --- | ---
Setup friction | User stalls during onboarding, asks the same config question twice | 22% of failed convos
Refused valid request | Agent says it can’t do X even though it can | 11%
Over-apologizing | Agent burns three sentences apologizing instead of acting | 9%
Wrong escalation | Hands off to human when it had the answer | 7%

Now you have a prioritized list. Setup friction at 22% is where you start, not over-apologizing at 9%, even if the apologies annoy you personally.

This clustering step is the part people skip, and it’s exactly why their prompt edits feel random. Without it, you’re fixing by recency. With it, you’re fixing by impact.
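Turning tags into a ranked list is mostly counting. Here is a minimal sketch, assuming each failed conversation has been given exactly one pattern label; the function name and the sample labels are made up for illustration and are not real data.

```python
from collections import Counter

def rank_patterns(tagged_failures: list[str]) -> list[tuple[str, int, float]]:
    """Return (pattern, count, share of failed convos), highest count first."""
    counts = Counter(tagged_failures)
    total = len(tagged_failures)
    return [(name, n, n / total) for name, n in counts.most_common()]

# Illustrative labels only, one per failed conversation.
labels = ["setup_friction"] * 4 + ["refused_valid_request"] * 2 + ["over_apologizing"]
for name, n, share in rank_patterns(labels):
    print(f"{name:<24}{n:>3}  {share:.0%}")
```

The top row is your next prompt change. If a conversation can carry several labels, divide by the number of failed conversations instead of the number of labels.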

Step 3: Translate one pattern into one specific change

Now the actual prompt engineering. The rule: one pattern, one targeted change. Don’t rewrite the whole prompt, because then you can’t tell which change worked.

Let’s take the “refused valid request” pattern. Say your support agent keeps telling users it can’t issue refunds, but it actually can for orders under $50.

Before:

You are a helpful support assistant. Be polite and professional.
If you are unsure whether you can perform an action, tell the
user you'll connect them with a human.

That last line is the culprit. “Unsure” is doing way too much work, so the model defaults to punting.

After:

You are a support assistant for Acme. You CAN issue refunds for
orders under $50 without approval. Do it directly when asked and
confirm the amount. Only escalate to a human for refunds of $50 or more,
disputed charges, or account deletion. When you escalate, say
exactly why.

See what changed? We replaced a vague hedge with explicit capability boundaries and a concrete threshold. The model now knows the line. Same approach works for setup friction (add the exact happy-path steps), over-apologizing (instruct it to act first, apologize at most once), and wrong escalation (define the escalation criteria precisely).

Concrete beats polite every time. “Be helpful” is noise. “Issue refunds under $50 directly” is a decision the model can follow.
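One way to check that a boundary is really explicit is to see whether you can write it down as a tiny function. The sketch below encodes the refund rules from the “After” prompt as a reference you could compare agent behavior against. The request names and the `expected_action` helper are hypothetical, and it assumes an order of exactly $50 escalates.

```python
# Boundaries copied from the "After" prompt; all names here are hypothetical.
REFUND_LIMIT = 50.00
ALWAYS_ESCALATE = {"disputed_charge", "account_deletion"}

def expected_action(request: str, amount: float = 0.0) -> str:
    """Return what the agent should do for a request: 'handle' or 'escalate'."""
    if request in ALWAYS_ESCALATE:
        return "escalate"
    if request == "refund":
        # Assumption: exactly $50 escalates, since only "under $50" is pre-approved.
        return "handle" if amount < REFUND_LIMIT else "escalate"
    # The prompt says to escalate only for the cases above.
    return "handle"
```

If you can’t write this function for a rule in your prompt, the model probably can’t follow the rule either.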

Step 4: Test the change before you trust it

Do not ship a prompt change because it looks right. Prompts have side effects. The refund fix might make the agent too eager and start refunding things it shouldn’t.

You don’t need a heavy setup. A lightweight check:

  1. Pull 15 to 20 real conversations that triggered the pattern.
  2. Run the new prompt against those same inputs.
  3. Eyeball: did it fix the failure without breaking the good cases?
  4. Keep a small set of “golden” conversations that should always pass, and rerun them on every change to catch regressions.

The point is comparison against real traffic, not synthetic test cases you made up. Your made-up cases reflect your assumptions. Production conversations reflect reality, and reality is what’s churning your users.
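The lightweight check above can be scripted in a few lines once eyeballing gets tedious. This is a rough sketch, not a real eval framework: `agent` stands in for however you call your model (a function taking a system prompt and a user message and returning a reply), and each case pairs a real user message with a pass/fail check you write yourself. Both are assumptions.

```python
from typing import Callable

# Hypothetical types: adapt to your own agent client and transcript format.
Agent = Callable[[str, str], str]          # (system_prompt, user_message) -> reply
Case = tuple[str, Callable[[str], bool]]   # (user_message, check on the reply)

def replay(agent: Agent, prompt: str, cases: list[Case]) -> list[bool]:
    """Run every case against one prompt and record pass/fail."""
    return [passes(agent(prompt, message)) for message, passes in cases]

def evaluate(agent: Agent, old_prompt: str, new_prompt: str,
             pattern_cases: list[Case], golden_cases: list[Case]) -> dict[str, int]:
    """Compare old vs new prompt on the pattern, then rerun the golden set."""
    before = replay(agent, old_prompt, pattern_cases)
    after = replay(agent, new_prompt, pattern_cases)
    golden = replay(agent, new_prompt, golden_cases)
    return {
        "fixed": sum(a and not b for b, a in zip(before, after)),
        "still_failing": after.count(False),
        # Golden cases are expected to pass, so any failure counts as a regression.
        "golden_regressions": golden.count(False),
    }
```

Model output varies from run to run, so a single pass is a smoke test at best. Rerun a few times before trusting a borderline result.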

Step 5: Ship it as a reviewable diff

Treat your system prompt like code, because it is code. It should live in version control, change through pull requests, and carry a description of what failure pattern it fixes and what data backs it.

A good prompt PR description reads like: “Fixes the refused-refund pattern (11% of failed support convos this week). Added explicit refund authority under $50 and tightened escalation criteria. Tested against 18 real transcripts, fixed 16, no regressions on golden set.”

That description means the next person (or you, in two months) knows why the line exists. No more archaeology. No more “I think Dave added this, don’t touch it.”
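Your version control system will render the diff in the PR for you, but it helps to preview one locally before you open it. A small sketch using Python’s standard `difflib`; the file path is a hypothetical placeholder, and the two prompt strings are shortened stand-ins for the before and after versions.

```python
import difflib

def prompt_diff(old: str, new: str, path: str = "prompts/support.txt") -> str:
    """Unified diff of two prompt versions; `path` is a made-up file name."""
    lines = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(lines)

# Shortened stand-ins for the before/after prompts.
old = "You are a helpful support assistant.\nIf you are unsure, hand off to a human.\n"
new = "You are a support assistant for Acme.\nYou CAN issue refunds for orders under $50.\n"
print(prompt_diff(old, new))
```

If the diff touches more lines than the pattern you set out to fix, that’s a hint the change isn’t “one pattern, one diff” anymore.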

Where Agnost AI fits

Everything above is doable by hand, and you should do it by hand at least once so you understand your own failure modes. But reading transcripts every week and clustering them manually doesn’t scale past a certain volume.

This is the loop Agnost AI runs as infrastructure underneath your agent. It connects to your agent, reads every conversation, and auto-generates custom intents for your product (setup friction, refused requests, churn risk, and whatever else shows up in your specific traffic). It tracks those intents live so you can see why users stall or won’t upgrade, then opens pull requests against your system prompt and agent harness to fix what it found. You review the diff and merge, exactly like the manual flow, just without the weekly transcript-reading marathon. Works with any LLM and any framework.

A quick checklist

Pin this somewhere:

  • Read 50 raw transcripts yourself before automating anything
  • Tag failures by type, not just “good/bad”
  • Cluster into named patterns with frequencies
  • Fix the highest-frequency pattern first, not the freshest one
  • One pattern, one specific prompt change
  • Replace vague hedges (“if unsure”) with explicit boundaries
  • Test against real conversations that triggered the pattern
  • Keep a golden set to catch regressions
  • Ship as a PR with a description tied to data

FAQ

How often should I update my system prompt?

Tie it to data, not the calendar. If a new failure pattern crosses a meaningful share of your conversations, fix it now. If nothing’s breaking, leave the prompt alone. Constant tweaking with no signal behind it is how prompts rot. In practice, teams with real traffic find something worth fixing every week or two.

Should I rewrite the whole prompt or make small changes?

Small, targeted changes, almost always. One pattern, one diff. Big rewrites feel productive but they make it impossible to tell what actually helped, and they reintroduce bugs you already fixed. The only time a full rewrite makes sense is when the prompt has become 1,500 tokens of conflicting instructions, at which point you rebuild from your documented patterns.

How do I know a prompt change actually worked?

Compare against real production conversations, before and after, on the specific pattern you targeted. Did the failure rate for that pattern drop? Did your golden-set cases stay passing? If you can’t measure it against real traffic, you’re guessing, and guessing is the thing we’re trying to get away from.

Closing

The teams whose agents keep getting better aren’t smarter prompt writers. They just close the loop between what users actually do and what the prompt says. Read your conversations, fix by pattern, ship reviewable diffs.

If you’d rather have that loop run continuously instead of carving out transcript-reading time every week, that’s the whole reason Agnost AI exists. Either way, let your real conversations drive the prompt, not your gut.