
How to Tune Your Agent Harness and Config From Production Signals

Agent harness tuning guide: use production conversation signals to fix wrong tool calls, bad retrieval, and premature give-ups as reviewable diffs.

Agent harness tuning means changing the parts of your agent that aren’t the system prompt: tool definitions, retrieval settings, routing logic, retry policy, guardrails, and memory. Most production failures live in these knobs, not in your prompt wording. The fastest way to find which knob is broken is to read real conversations, group the failures by cause, and change one config at a time as a reviewable diff.

If you’ve been editing your system prompt over and over and the agent still calls the wrong tool, this post is for you.

Why the system prompt is the wrong place to start

Teams obsess over the system prompt because it’s the most visible part of the agent. It’s a text file. You can read it. It feels like the brain.

But the prompt is maybe 30% of observed behavior. The rest comes from the harness and config: what tools exist and how they’re described, how many chunks retrieval returns and at what similarity threshold, when the router hands off to a sub-agent, how many times a failed tool call retries before the agent gives up, and how much conversation history actually makes it into context.

A concrete example. We worked with a team whose support agent kept “forgetting” account details mid-conversation. They rewrote the prompt four times telling it to “always reference prior context.” Didn’t help. The real cause was a memory config that truncated history at 6 turns to save tokens. No amount of prompt wording fixes a turn that isn’t in the context window.

That’s the pattern. The prompt tells the agent what to do. The harness decides what the agent can do. When they disagree, the harness wins.

What counts as a harness or config knob?

Here’s the map most teams are missing. These are the knobs that actually move behavior:

| Knob | What it controls | Common failure when it’s wrong |
| --- | --- | --- |
| Tool definitions | Names, descriptions, argument schemas | Agent picks the wrong tool or fills bad args |
| Retrieval settings | top-k, similarity threshold, reranking | Hallucinated answers, “I couldn’t find that” on docs that exist |
| Routing | When to hand off to a sub-agent or model | Simple queries hit the expensive path, hard ones hit the cheap one |
| Retry policy | Retries on tool/API failure before giving up | Premature “I can’t help with that” |
| Guardrails | Input/output filters, refusal triggers | Over-refusal on benign requests, or unsafe passes |
| Memory | History window, summarization, what persists | Agent loses context, repeats questions |
| Model config | Temperature, max tokens, the W&B configs you version | Truncated outputs, inconsistent behavior across runs |

You don’t tune these by guessing. You tune them by reading what actually went wrong in production.

How do you find which knob is causing failures?

You need to go from “the agent feels bad” to “the agent called search_orders when it should have called lookup_subscription, 41 times this week, mostly on billing questions.” That second sentence points at a specific knob. The first one points at nothing.

Three steps:

1. Group conversations by failure cause, not by topic

Topic clustering tells you people asked about billing. Cause clustering tells you the agent gave up on billing because retrieval returned nothing. Only the second one is actionable.

This is the part teams skip because it’s tedious to do by hand. You’re reading transcripts and tagging them: “wrong tool,” “bad retrieval,” “premature give-up,” “lost context.” Across a few thousand conversations, you can’t eyeball this. This is exactly the gap Agnost AI fills: it reads every conversation, auto-generates intents for your product (setup friction, churn risk, bug reports, and so on), and tracks them live so you see why users stall instead of just that they did.
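To make the topic-versus-cause distinction concrete, here is a minimal sketch. It assumes each conversation has already been tagged with a topic and a failure cause; the records and tag names are made up for illustration and are not output from any real tool.

```python
from collections import Counter

# Hypothetical tagged conversations: (topic, failure_cause).
# In practice the tags come from manual review or an automated labeler.
conversations = [
    ("billing", "bad_retrieval"),
    ("billing", "wrong_tool"),
    ("billing", "bad_retrieval"),
    ("shipping", "premature_give_up"),
    ("account", "lost_context"),
    ("billing", "bad_retrieval"),
]

by_topic = Counter(topic for topic, _ in conversations)
by_cause = Counter(cause for _, cause in conversations)

# Topic counts say where users were; cause counts say what broke.
print("top topic:", by_topic.most_common(1))
print("top cause:", by_cause.most_common(1))
```

The same six conversations give two different answers: the topic view says “billing,” while the cause view says “retrieval,” and only the second names something you can change.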

2. Map each cause to a knob

Once you have failure clusters, the mapping is usually obvious:

  • Wrong tool calls point at tool definitions. Vague descriptions, overlapping tools, or argument schemas the model misreads.
  • “I couldn’t find that” on real docs points at retrieval. top-k too low, threshold too strict, or no reranking.
  • Premature give-ups point at retry policy or guardrails firing on benign input.
  • Repeated questions point at memory window or summarization.
  • Truncated or cut-off answers point at max tokens in your model config.
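The mapping above is small enough to write down as a lookup table. This is a sketch only: the cause tags and knob names are hypothetical labels chosen for this example, not a standard vocabulary.

```python
# Hypothetical lookup from failure-cause tag to the knob(s) to inspect first.
CAUSE_TO_KNOBS = {
    "wrong_tool": ["tool_definitions"],
    "bad_retrieval": ["retrieval"],
    "premature_give_up": ["retry_policy", "guardrails"],
    "repeated_question": ["memory"],
    "truncated_answer": ["model_config.max_tokens"],
}


def knobs_to_inspect(cause: str) -> list[str]:
    """Return the knobs to look at for a cause, or an empty list if unmapped."""
    return CAUSE_TO_KNOBS.get(cause, [])


print(knobs_to_inspect("premature_give_up"))
```

An unmapped cause returns an empty list rather than a guess, which is a useful prompt to go back and read those transcripts by hand.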

3. Change one knob, ship it as a diff, watch the same cluster

Don’t change five things at once. Change the retrieval threshold from 0.82 to 0.75, ship it, and watch the “couldn’t find that” cluster over the next few days. If it shrinks, keep it. If it doesn’t, you learned the threshold wasn’t the problem, which is also useful.

The “as a diff” part matters. Harness changes are code and config changes. They should go through review like any other change, not get hot-edited in a dashboard where nobody can see who changed what or roll it back.
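“Watch the same cluster” can be as simple as comparing the cluster’s share of traffic before and after the change. A minimal sketch, with made-up counts and a hypothetical helper name:

```python
def cluster_rate(cluster_count: int, total_conversations: int) -> float:
    """Share of conversations that landed in the failure cluster."""
    if total_conversations == 0:
        return 0.0
    return cluster_count / total_conversations


# Made-up numbers for illustration only.
before = cluster_rate(cluster_count=41, total_conversations=1000)
after = cluster_rate(cluster_count=12, total_conversations=1100)

relative_change = (after - before) / before
print(f"before={before:.3f} after={after:.3f} change={relative_change:+.0%}")
```

Comparing rates rather than raw counts keeps a change in traffic volume from looking like a change in agent behavior.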

Concrete examples per knob

Tool definitions: the wrong-tool problem

Symptom in transcripts: the agent calls get_user, then get_user again, then gives a generic answer when it should have called get_invoice.

Usual cause: two tools with descriptions that overlap. get_user says “fetches user account information” and get_invoice says “fetches user billing information.” The model can’t tell them apart on a billing question.

Fix: tighten the descriptions and the trigger conditions. Make get_invoice say “use this for any question about charges, payments, refunds, or invoices.” That’s a one-line diff to the tool schema, and it’s reviewable.
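One rough way to spot overlapping descriptions before the model trips on them is a word-overlap check. The sketch below uses Jaccard similarity on word sets. That is a crude lexical proxy chosen for this example, not how a model actually picks tools, and the helper name is hypothetical.

```python
def description_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the word sets of two tool descriptions."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


# Descriptions before and after the one-line fix described above.
before = {
    "get_user": "fetches user account information",
    "get_invoice": "fetches user billing information",
}
after = {
    "get_user": "fetches user account information",
    "get_invoice": "use this for any question about charges, payments, refunds, or invoices",
}

overlap_before = description_overlap(before["get_user"], before["get_invoice"])
overlap_after = description_overlap(after["get_user"], after["get_invoice"])
print(overlap_before, overlap_after)
```

A high score between two tools is a hint to read both descriptions side by side; it does not prove the model confuses them.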

Retrieval: the threshold problem

Symptom: agent says it can’t find information that’s clearly in your docs.

We’ve seen teams set a similarity threshold so high that legitimate matches get filtered out. They set it strict to avoid hallucination, then over-corrected into uselessness. The fix is usually raising top-k, adding a reranker, or loosening the threshold and letting the model decide relevance. Each is a config change you can test against the exact cluster of failed retrieval conversations.
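Here is a small sketch of how a strict threshold produces “couldn’t find that” on documents that exist. The function, document names, and scores are all hypothetical; a real retriever would supply the scores.

```python
def select_chunks(scored_chunks, threshold, top_k):
    """Keep chunks scoring at or above threshold, best first, at most top_k."""
    kept = [(doc, score) for doc, score in scored_chunks if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


# Hypothetical similarity scores for a billing question.
scored = [("refund-policy", 0.79), ("invoice-faq", 0.77), ("shipping-times", 0.41)]

strict = select_chunks(scored, threshold=0.82, top_k=3)
loose = select_chunks(scored, threshold=0.75, top_k=3)

print("strict:", strict)
print("loose:", loose)
```

With the strict setting the two relevant documents score just under the cutoff and nothing reaches the model; the looser setting passes both while still dropping the unrelated one.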

Retry policy: the premature give-up

Symptom: agent says “I’m having trouble, please try again later” when the underlying API was just slow.

Cause: retries set to 0 or 1 with no backoff. One timeout and the agent bails. Bumping to 3 retries with exponential backoff often kills an entire cluster of false give-ups. Small diff, big behavior change.
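A minimal sketch of retries with exponential backoff, assuming the tool call signals a timeout by raising TimeoutError. The function name, defaults, and the simulated API are illustrative assumptions, not a specific framework’s retry API.

```python
import time


def call_with_retries(fn, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Call fn, retrying on TimeoutError with exponential backoff.

    max_retries counts retries after the first attempt, so fn runs at most
    max_retries + 1 times. Delays are base_delay, then 2x, then 4x, and so on.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TimeoutError:
            if attempt == max_retries:
                raise  # out of retries: let the caller decide how to give up
            sleep(base_delay * (2 ** attempt))


# Simulated API that times out twice, then succeeds.
calls = {"n": 0}


def flaky_api():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("slow upstream")
    return "ok"


delays = []
# Record the delays instead of actually sleeping, to keep the example fast.
result = call_with_retries(flaky_api, sleep=delays.append)
print(result, delays)
```

With max_retries set to 0 or 1, the same simulated API would surface as a failure to the user. In production you would likely also add jitter and cap the total wait.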

Memory: the context-loss problem

The truncation example from earlier. If transcripts show the agent re-asking for info the user already gave, your history window is too short or your summarizer is dropping key facts. Widen the window or change what the summarizer prioritizes.
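The failure mode is easy to reproduce in a few lines. This sketch assumes a simple “keep the last N turns” window; the function name and the conversation are hypothetical.

```python
def build_context(history, max_turns):
    """Keep only the most recent max_turns turns, dropping older ones."""
    # Guard max_turns <= 0 explicitly: history[-0:] would return everything.
    return history[-max_turns:] if max_turns > 0 else []


# A 10-turn conversation where the user states their account ID in turn 2.
history = [f"turn {i}" for i in range(1, 11)]
history[1] = "user: my account ID is A-123"

short_window = build_context(history, max_turns=6)
wide_window = build_context(history, max_turns=10)

print(any("A-123" in turn for turn in short_window))
print(any("A-123" in turn for turn in wide_window))
```

With a 6-turn window the account ID never reaches the model, so no instruction to “reference prior context” can recover it.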

Where W&B configs fit

If you version your agent configs in Weights & Biases, your tool schemas, retrieval params, model settings, and routing rules already live as tracked artifacts. That’s good. It means a harness change is a config diff with a version history.

The missing piece is the loop back from production. A W&B config tells you what the agent’s settings are. It doesn’t tell you which of those settings is causing the “couldn’t find that” cluster you saw last Tuesday. You still need production conversation signals to know which knob to turn.

This is the workflow Agnost AI is built around: it reads production conversations, finds the failure clusters, and opens pull requests against your system prompts, your agent harness, and your W&B configs to fix what it finds. You review the diff and merge, or you don’t. Nothing changes in prod without a human approving it.

How to change harness knobs safely

A few rules that have saved us from shipping regressions:

  1. One knob per change. If you tune retrieval and retries together and behavior improves, you don’t know which one did it.
  2. Tie every change to a cluster. “We changed top-k because of these 38 failed conversations” is a reviewable rationale. “We changed top-k because it felt low” is not.
  3. Ship as a diff, not a dashboard edit. Config changes that bypass code review are how agents drift silently.
  4. Watch the same cluster after shipping. Did the failures you targeted actually go down? That’s your signal, not vibes.
  5. Keep prompt changes and harness changes separate. Mixing them makes attribution impossible.
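Rule 1 can be checked mechanically before review. The sketch below compares two config versions and lists the top-level knobs that differ; the config shape and key names are assumptions made for this example.

```python
def changed_knobs(old: dict, new: dict) -> list[str]:
    """Top-level config keys whose values differ between two versions."""
    keys = set(old) | set(new)
    return sorted(k for k in keys if old.get(k) != new.get(k))


# Hypothetical config versions.
old_cfg = {"retrieval": {"top_k": 5, "threshold": 0.82}, "retry": {"max_retries": 1}}
one_knob = {"retrieval": {"top_k": 5, "threshold": 0.75}, "retry": {"max_retries": 1}}
two_knobs = {"retrieval": {"top_k": 5, "threshold": 0.75}, "retry": {"max_retries": 3}}

print(changed_knobs(old_cfg, one_knob))
print(changed_knobs(old_cfg, two_knobs))
```

A check like this could run in CI and flag any change that touches more than one knob, so the reviewer can ask for it to be split.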

FAQ

What’s the difference between agent harness tuning and prompt engineering?

Prompt engineering changes the instructions you give the model. Harness tuning changes the machinery around the model: tools, retrieval, routing, retries, guardrails, and memory. The prompt sets intent; the harness sets capability. Most production failures are capability problems, so they’re harness problems, not prompt problems.

How do I know whether a failure is a config issue or a model issue?

Read the transcript and ask what the agent could have done. If the right tool didn’t exist or retrieval returned nothing, it’s a config issue: no model can use a tool it wasn’t given or cite a document it never saw. If the right tool was available with good data and the agent still chose wrong, that’s closer to a prompt or model problem. Grouping failures by cause makes this split obvious.

Should agent config changes go through code review?

Yes. Tool schemas, retrieval params, retry policy, and routing are all behavior-defining code. Hot-editing them in a dashboard with no diff, no author, and no rollback is how agents regress without anyone noticing. Treat config changes like any other production change: a reviewable diff with a clear rationale.


If you’re tired of rewriting the prompt to fix problems that live in the harness, Agnost AI reads your production conversations, finds which knob is actually broken, and opens the pull request to fix it. You stay in control: review the diff, merge what makes sense, and never ship the rest.