Agent Experience Score: A Single Number for How Well Your AI Agent Is Performing

The AX Score is a composite metric that rolls up Task Completion Rate, Path Efficiency, Trust Retention, and Recovery Rate into one number that tells you exactly how your agent is performing in production.

Every week, somewhere in a product meeting, someone asks: “So how is the agent actually doing?”

And somebody pulls up a dashboard with 14 charts on it. Latency P95. Task completion rate. CSAT. Average turns per session. Error rate. Escalation rate. Token cost per session. And everyone stares at it for 30 seconds, someone says “it looks… okay?” and the conversation moves on without a real answer.

That’s decision fatigue in action. And it’s killing your ability to actually improve your agent in production.

You need one number. Not because agents are simple. They’re not. But because your team needs a north star that everyone can orient around without a statistics degree. Something that tells you, at a glance, whether your agent is getting better or worse, and by how much.

That’s what the Agent Experience Score is for.

Person overwhelmed staring at multiple computer screens meme

^ your PM trying to read 14 metric charts and extract a single takeaway about agent performance


Why your existing single metrics are failing you

Before we get into how to construct an AX Score, let’s talk about why every existing single metric falls apart when you lean on it as a north star.

Task Completion Rate is the one most teams default to. And it makes sense intuitively. Did the agent do the thing or not? The problem: it says nothing about efficiency. An agent that completes every task through 23 back-and-forth turns and three escalations has a perfect completion rate. It also has a terrible user experience that nobody is going to pay for long-term.

Session length / turn count is the other common proxy. Longer conversations mean engagement, right? Wrong. We covered this in detail in the Frustration Index post, but the short version: a long session is equally likely to mean a user is stuck in a loop as it is to mean they’re getting deep value. Session length without quality context is noise.

CSAT scores miss most of your users. Response rates of 5-15% mean your feedback data is coming from people who had a strong reaction. The vast majority who had a mediocre interaction and quietly didn’t come back? Invisible.

Latency is an infrastructure metric, not a quality metric. Fast wrong answers are worse than slow correct ones.

Each of these metrics is useful in isolation. None of them is sufficient as the one number your whole team can track. The AX Score is built to be that number by combining the dimensions that actually matter.


How to construct an AX Score

The AX Score is a weighted composite of four components. Each one captures a distinct dimension of agent performance. Together they describe the full picture of whether your agent is delivering real value in production.

Component 1: Task Completion Rate (weight: 30%)

This is still the baseline. Did the agent accomplish what the user came for?

The key is measuring this behaviorally, not by asking users to rate it. Behavioral proxies for task completion: a user action that follows the conversation (code accepted, form submitted, item purchased, ticket closed), a positive closing signal in the conversation, absence of immediate re-query on the same topic.
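As a sketch, these behavioral proxies might combine into a per-session completion flag like this. The field names (`user_action`, `positive_close`, `requeried_same_topic`) are hypothetical stand-ins for whatever your own logging emits:

```python
# Hypothetical session records: each dict holds the behavioral
# signals that fired. Field names are illustrative, not a real schema.
def is_completed(session: dict) -> bool:
    """A session counts as completed if a downstream user action
    or a positive closing signal fired, and the user did not
    immediately re-query the same topic."""
    positive = session.get("user_action") or session.get("positive_close")
    return bool(positive) and not session.get("requeried_same_topic", False)

def task_completion_rate(sessions: list[dict]) -> float:
    """Percentage of sessions flagged as completed (0-100)."""
    if not sessions:
        return 0.0
    return 100.0 * sum(is_completed(s) for s in sessions) / len(sessions)
```

The re-query check is what keeps this honest: a "completed" session that the user immediately re-opens on the same topic shouldn't count.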

Target range: 70-90% depending on use case complexity. Below 70%, you have a fundamental capability problem. Above 90% on complex tasks should make you suspicious: check whether users are giving up before the task gets hard.

Component 2: Path Efficiency (weight: 25%)

This is the dimension task completion rate misses entirely.

Path efficiency measures how well your agent gets users to their goal relative to the optimal path. Take the median number of turns required for successful task completion, compare it to the actual turn count for each session.

An agent with 90% task completion that takes an average of 14 turns per task, when 5 is the median for successful flows, has a path efficiency problem. Users are getting there, but it’s painful. That kind of friction is the first thing to erode when users evaluate whether the product is worth their time.

Score this as: (median optimal turns / actual turns) * 100, capped at 100. An agent consistently running at 60-70 on this dimension is slow and likely frustrating users even when tasks complete.
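The scoring rule above translates directly into code. A minimal sketch, with the cap applied so sessions that beat the optimal median score 100 rather than inflating past it:

```python
def path_efficiency(optimal_median_turns: float, actual_turns: float) -> float:
    """(median optimal turns / actual turns) * 100, capped at 100.
    Sessions that finish in fewer turns than the optimal median
    score 100, not more."""
    if actual_turns <= 0:
        return 0.0
    return min(100.0, 100.0 * optimal_median_turns / actual_turns)
```

For the example above, a 14-turn session against a 5-turn optimal median scores roughly 36, squarely in "painful" territory.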

Component 3: Trust Retention (weight: 25%)

This one is trickier to compute but critical to include.

Trust Retention measures whether users who encounter an agent error, an unexpected response, or a limitation continue engaging with the product. It captures how gracefully your agent handles the moments where it fails.

An agent that handles failures poorly doesn’t just lose that single session. It updates the user’s mental model. “This thing is unreliable.” That cognitive update persists. Trust Retention operationalizes this. Track: when users encounter a failed resolution, an error state, or a confusing response, what percentage continue the conversation or return within 48 hours versus abandoning entirely?
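One way this tracking might look in code, assuming your event log flags failure moments and records return timestamps. The field names (`continued_in_session`, `failed_at`, `returned_at`) are illustrative:

```python
from datetime import datetime, timedelta

def trust_retention(failure_events: list[dict], window_hours: int = 48) -> float:
    """Of sessions that hit an error, failed resolution, or confusing
    response, the percentage where the user continued the conversation
    or returned within the window (0-100)."""
    if not failure_events:
        return 100.0  # no failures observed, so nothing eroded trust
    retained = 0
    for ev in failure_events:
        if ev.get("continued_in_session"):
            retained += 1
        elif ev.get("returned_at") and ev.get("failed_at"):
            # Returned within the 48-hour window counts as retained.
            if ev["returned_at"] - ev["failed_at"] <= timedelta(hours=window_hours):
                retained += 1
    return 100.0 * retained / len(failure_events)
```

The 48-hour window is a starting point; tune it to your product's natural usage cadence.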

High Trust Retention means your agent fails gracefully enough that users give it another shot. Low Trust Retention means each failure compounds. This single component often explains more variance in 90-day retention than task completion rate does.

Component 4: Recovery Rate (weight: 20%)

Recovery Rate asks a specific question: when the agent gets it wrong, how often does it get back on track without human escalation?

This is particularly important for multi-turn agents and workflow agents where partial failures mid-task are common. An agent with a 65% task completion rate but a 90% recovery rate on partial failures is a fundamentally different product than an agent with the same completion rate and a 30% recovery rate. One is a capable agent hitting hard problems. The other is a brittle system that breaks and stays broken.

Track this as: percentage of conversations where the agent initially failed (by proxy signal) but ultimately reached a successful resolution before session end or escalation.
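A sketch of that calculation, assuming each conversation carries flags for an initial failure signal, eventual resolution, and escalation (the flag names are hypothetical):

```python
def recovery_rate(sessions: list[dict]) -> float:
    """Among sessions where the agent initially failed (by proxy
    signal), the percentage that reached successful resolution
    before session end or escalation (0-100)."""
    failed = [s for s in sessions if s.get("initial_failure")]
    if not failed:
        return 100.0  # no partial failures to recover from
    recovered = [s for s in failed
                 if s.get("resolved") and not s.get("escalated")]
    return 100.0 * len(recovered) / len(failed)
```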


Computing the composite

The formula is straightforward:

AX Score = (TCR * 0.30) + (PE * 0.25) + (TR * 0.25) + (RR * 0.20)

Normalize each component to 0-100 before weighting. Output is a 0-100 score.
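In code, the composite is a few lines, with a guard to catch components that weren't normalized first:

```python
# Default weights from the formula above; override per use case.
WEIGHTS = {"tcr": 0.30, "pe": 0.25, "tr": 0.25, "rr": 0.20}

def ax_score(tcr: float, pe: float, tr: float, rr: float,
             weights: dict = WEIGHTS) -> float:
    """Weighted composite of the four components, each already
    normalized to 0-100. Output is a 0-100 score."""
    components = {"tcr": tcr, "pe": pe, "tr": tr, "rr": rr}
    for name, value in components.items():
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be normalized to 0-100")
    return sum(components[k] * w for k, w in weights.items())
```

For example, an agent with 90% task completion but weak Path Efficiency (36), middling Trust Retention (50), and a 60% Recovery Rate lands at an AX Score of 60.5, despite the headline completion number.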

The weights above are starting points, not gospel. A coding assistant should probably weight Path Efficiency higher (slow autocomplete is unusable autocomplete). A customer support agent should weight Trust Retention higher (because losing trust in support has downstream brand implications beyond just the product). Calibrate the weights to what matters most for your specific use case.

Surprised Pikachu face meme

^ founders realizing their “90% task completion rate” agent has an AX Score of 58 because Path Efficiency and Trust Retention are in the basement


What does a good AX Score actually look like?

This question always comes up. Here’s what we see across the products instrumented through Agnost AI, broken down by agent category.

Coding assistants: Median AX Score sits around 68-72 for teams in early production. Top-quartile products are running 80+. Below 60 is a red flag, usually path efficiency dragging everything down (too many iteration cycles before the code is usable).

AI companions: Lower absolute scores are normal here because the task definition is inherently fuzzy. Median around 62-67. What matters more in this category is Trust Retention specifically, because companion users form deeper relationships and a single trust-breaking interaction can cause permanent churn.

Workflow agents (automating multi-step tasks): Most volatile category. You’ll see scores ranging from 45 to 85 on similar-looking products. Recovery Rate is the key differentiator here. The products in the top quartile almost always have exceptional recovery logic, they’re designed to fail gracefully at each step rather than catastrophically all at once.

Customer support agents: Most mature benchmarks in this category. Good products run 72-80. The teams who’ve invested in intent-specific routing (routing different user needs to specialized sub-agents or prompts) tend to cluster in the top quartile. Generic one-size-fits-all support agents sit in the middle.

Use these as rough orientation points, not hard targets. Your industry, user sophistication, and task complexity all affect where your absolute score lands. The trend matters more.


Trend over time beats absolute score every time

Honestly, the single most useful thing about an AX Score isn’t the number. It’s the direction and velocity of change.

An AX Score of 71 that’s been climbing from 62 over the last six weeks is a completely different signal than an AX Score of 71 that dropped from 78 last month. Same number. Completely different product story.

The teams who get the most value from this metric are the ones who track it weekly and connect the trend to specific product events. New model version deployed: what happened to AX Score in the 72 hours after? New prompt shipped: did it move the score in the right direction for the intent categories you targeted? Onboarding change: did it affect Trust Retention for new users specifically?

This is how AX Score becomes a feedback loop rather than just a reporting metric.


The early warning nobody expects: AX Score drops precede churn

Here’s the finding that consistently catches people off guard when they start tracking this.

An AX Score decline often shows up in the data 2 to 4 weeks before that cohort’s churn shows up in your retention chart.

The mechanic makes sense once you think about it. Users don’t typically cancel the same week their agent experience degrades. There’s a lag. They give it another shot. Then one more. Then they mentally file it as “not really working for me.” Then they stop engaging. Then, eventually, they cancel or let the subscription lapse. The AX Score captures the degradation at step one. The retention chart captures the cancellation at the final step. That gap is your intervention window.

Across customers we work with at Agnost AI, the pattern is consistent enough that we treat a sustained AX Score decline (more than 5 points over two consecutive weeks) as a churn risk flag for that cohort, even when retention looks normal. It almost always presages a bad retention month.


The trap: gaming the AX Score by optimizing components in isolation

This is the mistake that kills the usefulness of any composite metric. You have to watch for it actively.

The failure mode: your team sees Task Completion Rate is dragging the AX Score down. So they optimize for completion rate specifically. They make the agent more aggressive about declaring tasks “complete.” They lower the bar for what counts as a successful resolution. Completion rate goes up. AX Score goes up. Actual user outcomes don’t change, and might get worse.

This is exactly how teams end up with a great-looking north star metric and a churning user base.

The safeguard is to track each component independently alongside the composite. If the composite is rising but one component is declining, that’s a problem even if the weighted output looks fine. And set floor thresholds on each component: if any single component drops below a certain floor (say, Trust Retention below 40 or Recovery Rate below 35), flag it as a product health issue regardless of the composite score.
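A sketch of what that safeguard might look like in your reporting layer. The Trust Retention and Recovery Rate floors come from the text above; the other two floors are illustrative placeholders you'd calibrate yourself:

```python
# TR and RR floors per the safeguard above; TCR and PE floors
# are hypothetical placeholders, not recommendations.
FLOORS = {"tcr": 50, "pe": 40, "tr": 40, "rr": 35}
WEIGHTS = {"tcr": 0.30, "pe": 0.25, "tr": 0.25, "rr": 0.20}

def ax_report(tcr: float, pe: float, tr: float, rr: float) -> dict:
    """Return the composite plus any components below their floor,
    so a rising composite can't hide a collapsing component."""
    components = {"tcr": tcr, "pe": pe, "tr": tr, "rr": rr}
    composite = sum(components[k] * w for k, w in WEIGHTS.items())
    breaches = [k for k, v in components.items() if v < FLOORS[k]]
    return {"ax_score": round(composite, 1), "floor_breaches": breaches}
```

A non-empty `floor_breaches` list is the product-health flag, regardless of how healthy `ax_score` itself looks.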

A good AX Score should emerge from genuinely improving all four dimensions. If it doesn’t, the score is being gamed, intentionally or not.

Clown applying makeup meme getting ready

^ optimizing Task Completion Rate in isolation and watching the AX Score go up while user experience quietly gets worse


How to start tracking this today

The minimum viable version of an AX Score doesn’t require building a custom analytics pipeline. You need four things:

Task completion signals (behavioral proxies you probably already have), turn count per session for the path efficiency calculation, a session-level “user returned or continued after failure” flag for Trust Retention, and a conversation-level flag for whether partial failures were recovered before escalation.

If you’re already logging conversation data with turn-level metadata, you’re probably 80% of the way to the raw data you need. The work is in the analysis layer, aggregating signals, weighting them correctly, and tracking the composite over time.

Where most teams hit the wall isn’t data collection. It’s the aggregation and visualization layer. Building a dashboard that shows AX Score trends by user cohort, by agent version, by intent category, and flags sudden drops is not a weekend project if you’re starting from scratch.

This is one of the core things Agnost AI is built for. We track the component signals natively across your conversation data, surface the composite AX Score in real time, and alert you when trends are moving in the wrong direction before they show up in your churn numbers. The per-cohort and per-intent views are built in because that’s where the actionable insights live. Take a look at agnost.ai if you want to stop building this from scratch.


Wrapping it up

Your agent deserves a real performance metric. Not 14 charts that require a statistics degree to interpret. Not a single proxy that misses three dimensions of what makes an agent actually work.

The AX Score gives your whole team, technical and non-technical, a shared number to own. When it goes up, you’re shipping better. When it drops, you know before your users tell you (or more accurately, before they stop telling you anything at all and just leave).

Build the composite. Track the trend. Watch it against every product change you ship. And set a floor on each component so gaming the composite isn’t an option.

The teams winning on agent quality right now aren’t the ones with the best benchmarks on eval sets. They’re the ones who know exactly how their agent is performing in production, across every session, every cohort, every week.

That’s the compounding advantage. Start tracking it.

Hackerman meme coding confidently at multiple screens

^ you, six weeks after launching AX Score tracking, watching the trend line go up after each sprint


TL;DR: Task completion rate, session length, and CSAT all fail as north star metrics for agent quality. The AX Score combines Task Completion Rate (30%), Path Efficiency (25%), Trust Retention (25%), and Recovery Rate (20%) into a single actionable number. An AX Score drop of 5+ points over two weeks almost always precedes a bad retention month by 2-4 weeks. Track the trend, not just the absolute.

Reading Time: ~9 min