Intent Resolution Rate: The Metric That Ties AI Quality Directly to Revenue
Most teams building conversational AI products are optimizing for the wrong thing.
They’re watching response latency. They’re tracking CSAT scores. They’re running evals on test sets. They’re A/B testing prompts against human preference benchmarks. All useful. None of them answer the actual question: did your AI do the job the user came for?
That’s what Intent Resolution Rate measures. And after watching data across millions of agent conversations, I’m convinced it’s the most important number in any AI product’s analytics stack. Not because it’s conceptually interesting, but because it’s the metric most directly connected to whether users stick around and pay you.
Let me break down what it actually is, how to measure it in production (three ways, pick the one that matches where you are), and why improving it by 10 points is simultaneously a product decision, an engineering decision, and a revenue decision.

^ every AI PM who’s been tracking CSAT instead of whether the AI actually resolved anything
What Intent Resolution Rate Actually Means
Here’s the definition: Intent Resolution Rate is the percentage of conversations in which the user’s stated or implied goal was successfully addressed.
Simple. But the implementation is what trips people up, because “resolution” is not the same as “response.”
Every LLM gives a response. That’s a solved problem. The model will generate tokens until it hits a stop sequence. It will sound confident. It will be grammatically correct. It might even be beautifully formatted. And it can still completely fail to resolve what the user actually wanted.
Think about the last time you asked a customer support bot something and it came back with a three-paragraph answer that technically addressed the words you typed but had nothing to do with your actual problem. Response: given. Intent: unresolved.
This gap is the thing IRR is designed to measure.
The reason IRR is hard to track is that intent is almost always implicit. Users don’t submit a form that says “my intent was X, rating: resolved/unresolved.” They just… behave. They rephrase their question. They abandon the session. They come back and try again five minutes later. They copy the output and use it, or they close the tab. You have to read those behavioral signals and infer what they’re telling you about whether resolution happened.
That inference problem is solvable. Here’s how.
How to Measure IRR in Production (Three Methods)
Method 1: Proxy Signals (Start Here)
This is where most teams should begin. You don’t need any infrastructure changes. You need to look at the behavioral data you already have and identify the signals that correlate with failure.
The core failure signals are:
Rephrasing. User sends a message, gets a response, then sends a semantically similar message. This almost always means the first response missed the mark. If you can detect when a user’s second message is a variation of their first, you’ve found a non-resolution event.
Immediate re-query. User ends a conversation and starts a new one within 3-5 minutes on a similar topic. Strong signal that the first conversation didn’t resolve anything.
Drop-off without action. The conversation ends abruptly, no closing message, no positive signal, no follow-up action in the product. Not conclusive on its own, but when combined with turn count and timing, it tells a story.
Escalation. For products with human fallback, did the user route to a human after the AI’s response? That’s a clean non-resolution signal.
Conversely, resolution signals look like: positive closing message, a follow-up question that builds on the previous answer (not restates it), a subsequent action in the product that suggests the task was completed.
Your baseline IRR from proxy signals will not be perfect. But it’ll be directionally correct within a conversation or two. And it gives you a number to move, which is what matters.
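The failure signals above can be combined into a first-pass scorer. This is a minimal sketch under stated assumptions: the conversation schema (`user_messages`, `escalated`, `requeried_within_5m`) is hypothetical, and Jaccard token overlap stands in for the embedding similarity you’d likely use in production to detect rephrasing.

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercased tokens -- a cheap stand-in
    for the semantic/embedding similarity you'd use in production."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def detect_rephrase(user_messages: list[str], threshold: float = 0.6) -> bool:
    """Flag a conversation if any user message is a near-duplicate of the
    one before it -- the 'rephrasing' non-resolution signal."""
    return any(
        token_overlap(prev, curr) >= threshold
        for prev, curr in zip(user_messages, user_messages[1:])
    )

def proxy_irr(conversations: list[dict]) -> float:
    """Share of conversations that show none of the failure signals.
    Each dict (illustrative schema): user_messages, escalated,
    requeried_within_5m."""
    if not conversations:
        return 0.0
    resolved = sum(
        1 for c in conversations
        if not c["escalated"]
        and not c["requeried_within_5m"]
        and not detect_rephrase(c["user_messages"])
    )
    return resolved / len(conversations)
```

In practice you would tune the rephrase threshold against whatever explicit feedback you have, rather than trusting a fixed 0.6.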
Method 2: LLM-as-Judge (The Scale Play)
Once you’re processing more than a few hundred conversations a day, the proxy approach starts missing nuance. This is where LLM-as-judge comes in, and it works better than most people expect.
The idea: take a sample of completed conversations and run them through a separate LLM pass with a structured prompt asking it to evaluate whether the user’s intent was resolved. The judge gets the full conversation transcript, a description of the product context, and a rubric.
Research on LLM-as-judge evaluation shows these pipelines can match human rater agreement at over 80%, which is roughly the same agreement rate you’d get between two humans doing the same rating task. That’s good enough for a production metric, especially when you’re running it at scale.
The critical detail: your judge prompt needs to be anchored to your specific product context, not generic quality. “Was this a good response?” is not the right question. “Given that the user appears to have been trying to [infer goal], did the AI’s response successfully address that goal?” is much better.
We’ve seen teams get this pipeline from zero to running in a weekend. The cost at scale is trivial compared to the signal value.
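A judge pass can be sketched roughly as follows. The prompt wording and the `call_llm` callable are illustrative assumptions, not a specific provider API; the point is the anchoring to product context and an inferred goal, exactly as described above.

```python
import json

# Hypothetical judge prompt -- anchored to product context and an
# inferred goal, not generic "was this a good response?" quality.
JUDGE_PROMPT = """You are evaluating a conversation from {product_context}.

Transcript:
{transcript}

First, infer the user's goal in one sentence. Then decide whether the
assistant successfully addressed that goal. "Resolved" means the user
could act on the answer without having to ask again.

Reply with JSON only: {{"inferred_goal": "...", "resolved": true or false}}"""

def judge_conversation(transcript: str, product_context: str, call_llm) -> bool:
    """call_llm: any callable that takes a prompt string and returns the
    model's text completion (wrap your provider's client here)."""
    prompt = JUDGE_PROMPT.format(
        product_context=product_context, transcript=transcript
    )
    verdict = json.loads(call_llm(prompt))
    return bool(verdict["resolved"])
```

Sampling a few hundred conversations a day through this and averaging the verdicts gives you a judge-based IRR you can trend over time.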
Method 3: Explicit Resolution Signals (The High-Fidelity Play)
If you want the cleanest signal, build it into the UX. This doesn’t have to be intrusive.
A thumbs up/down at the end of a conversation works. A simple “was this helpful?” prompt at conversation close works. For task-completion products, a checkpoint at the point of task completion works even better.
The thing people get wrong here is treating this as the ONLY signal. Explicit feedback has a response rate problem. Typically 5-15% of users actually rate anything. That’s not enough for statistical confidence on granular segments.
The right approach is to use explicit signals as a calibration layer on top of your proxy signals. When you do have explicit negative feedback, use it to validate whether your proxy detection would have caught the same conversation. This is how you tune the sensitivity of your automated detection and make the whole system more accurate over time.
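The calibration step can be made concrete by treating explicit thumbs-down as ground truth and scoring your proxy detector against it. A minimal sketch, assuming a hypothetical record schema (`proxy_flagged`, `explicit_negative`, where `None` means the user never rated):

```python
def calibrate(records: list[dict]) -> tuple[float, float]:
    """Precision/recall of proxy non-resolution detection, scored only
    on the minority of conversations that have explicit feedback."""
    rated = [r for r in records if r["explicit_negative"] is not None]
    tp = sum(r["proxy_flagged"] and r["explicit_negative"] for r in rated)
    fp = sum(r["proxy_flagged"] and not r["explicit_negative"] for r in rated)
    fn = sum(not r["proxy_flagged"] and r["explicit_negative"] for r in rated)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Low precision means your proxy thresholds are too aggressive; low recall means they’re missing failures the users themselves flagged. Tune accordingly.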

^ teams building elaborate eval frameworks on test sets while IRR in production is sitting at 58%
What IRR Actually Looks Like by Product Category
One of the biggest mistakes I see is teams applying a single IRR benchmark without accounting for what “resolution” means in their specific product context. It’s not the same across categories.
Customer support bots. Resolution is relatively clean here: the user either got their answer or they didn’t. A good IRR for a well-built support bot is 70-85%. Below 60%, users are escalating to human agents or just churning. Above 85% is excellent and usually correlates with strong NPS. These numbers come from what we see across customers using Agnost for support-oriented products.
AI coding assistants. Resolution maps pretty directly to code acceptance. Did the user take the suggestion? Did they implement it? Code acceptance rate is your IRR proxy. Teams running GitHub Copilot-style products track this natively. If you’re building your own coding assistant, this should be your primary quality signal, not user satisfaction surveys.
AI companions. This is the hard one. “Resolution” for an open-ended companion conversation is genuinely ambiguous: the conversation is the product, and there’s no discrete task-completion moment. Here, session IRR is a useful reframe: did this conversation lead the user to return? If a user had a conversation and came back within 48 hours, that’s a reasonable proxy for a resolved/satisfying interaction. If they had a conversation and churned, something went wrong.
AI tutors. Resolution = demonstrated understanding. Did the user move on to the next concept? Did they pass the checkpoint? Did they stop asking the same type of question? These are your signals. The absence of them (user stuck on the same topic across multiple sessions, asking the same foundational question repeatedly) is your non-resolution flag.
The point is: define resolution in terms of the job your product is supposed to do. Then measure whether it’s doing that job.
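For the companion-style case, the return-within-48-hours reframe reduces to a small computation over session timestamps. A sketch, assuming you have per-user session start times (the schema here is hypothetical):

```python
from datetime import datetime, timedelta

def session_irr(sessions: dict[str, list[datetime]], window_hours: int = 48) -> float:
    """sessions: user id -> chronologically sorted session start times.
    A session counts as 'resolved' if the same user returned within the
    window. The most recent session has no follow-up yet, so it's excluded."""
    resolved = total = 0
    for starts in sessions.values():
        for prev, nxt in zip(starts, starts[1:]):
            total += 1
            resolved += (nxt - prev) <= timedelta(hours=window_hours)
    return resolved / total if total else 0.0
```

The exclusion of each user’s latest session matters: counting it as unresolved would bias the metric downward for active users.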
Why IRR Maps Directly to Revenue
Here’s the mechanic that makes IRR the revenue metric.
Users who have high-IRR conversations develop a mental model: “this thing works for me.” That mental model drives habit formation, which drives retention, which drives upgrade decisions. They’re not upgrading because they saw a feature announcement. They’re upgrading because the product has earned trust through a track record of successful resolutions.
The reverse is also true, and it’s faster. Users who hit consecutive low-IRR conversations, conversations where the AI clearly missed what they were asking for, tend to churn within two weeks. Not because they had one bad experience. Because two or three bad experiences back-to-back update their mental model: “this thing doesn’t understand me.” And once that update happens, recovery is very difficult.
Across our customer base at Agnost, the pattern is consistent: cohorts with average IRR above 75% show meaningfully higher 90-day retention than cohorts below 65%. The gap compounds over time because retained users generate more high-IRR conversations, which reinforces retention further.
The business case for improving IRR is straightforward. Moving your IRR from 65% to 75% isn’t just a quality improvement; it’s a retention improvement and, downstream, a revenue improvement. This is why IRR should be on every AI team’s weekly dashboard next to NRR, not buried in an engineering eval doc that only gets reviewed during sprint retros.

^ founders realizing their prompt engineering work is actually a revenue decision
The Mistakes Teams Make When Tracking IRR
Substituting CSAT. CSAT has two problems: response rate and bias. You’ll get responses from 5-15% of users, and those users skew toward the people who had a strong reaction in either direction. The 70% of users who had a mediocre experience and quietly didn’t come back? They’re invisible in your CSAT data. IRR, measured through behavioral proxies, captures everyone.
Using test set accuracy. The lab-to-production gap is real and it’s consistently underestimated. A model that scores 90% on your eval set might perform at 65% IRR in production because real user inputs are messier, more ambiguous, and more varied than anything your test set anticipated. Production IRR is the ground truth. Test set accuracy is at best a directional signal.
Averaging across all conversations. This one kills the utility of the metric. “Our IRR is 71%” tells you almost nothing. “Our IRR is 71% overall, 84% for how-to questions, 62% for troubleshooting workflows, and 43% for multi-step task assistance” tells you exactly where to focus.
The fix is to segment by intent category. The gap between your best-performing intent category and your worst is your product roadmap. Every low-IRR category is an argument for a new capability, a prompt change, or a UI intervention. The metric becomes actionable the moment you stop averaging it.
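The segmented view is a one-pass aggregation once each conversation carries an intent label (from your judge pipeline or a classifier). A minimal sketch:

```python
from collections import defaultdict

def irr_by_category(conversations: list[tuple[str, bool]]) -> dict[str, float]:
    """conversations: (intent_category, resolved) pairs.
    Returns per-category resolution rate."""
    totals = defaultdict(lambda: [0, 0])  # category -> [resolved, total]
    for category, resolved in conversations:
        totals[category][0] += int(resolved)
        totals[category][1] += 1
    return {c: r / t for c, (r, t) in totals.items()}
```

Sorting the result ascending by rate is, quite literally, a ranked list of where the product is failing.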
How to Use IRR to Drive Actual Product Decisions
Once you have IRR segmented by intent category, the roadmap writes itself. You look at which categories have the lowest resolution rates, and those become your priority. Not based on gut. Not based on whoever shouted loudest in the last sprint planning meeting. Based on where the AI is failing its users most consistently.
The second use case is prompt engineering measurement. This is where IRR becomes genuinely powerful for technical teams. When you’re testing a new prompt version, you shouldn’t be asking “did human reviewers prefer version B?” You should be asking “did version B improve IRR on the intent categories where version A was weakest?” That’s the experiment that actually matters for the product.
IRR turns prompt engineering from an art into a measurable function. Prompt change proposed, experiment run, IRR before and after compared. Ship if it moves in the right direction. This is how you build a culture of iterative AI quality improvement instead of just vibes-based prompt tweaking.
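To ship on “IRR moved in the right direction” rather than noise, the before/after comparison should carry a significance check. A sketch using a one-sided two-proportion z-test (standard-library only; thresholds and naming are illustrative):

```python
from math import erf, sqrt

def irr_lift_significant(resolved_a: int, total_a: int,
                         resolved_b: int, total_b: int,
                         alpha: float = 0.05) -> bool:
    """One-sided two-proportion z-test: is version B's IRR
    significantly higher than version A's?"""
    p_a, p_b = resolved_a / total_a, resolved_b / total_b
    p_pool = (resolved_a + resolved_b) / (total_a + total_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / total_a + 1 / total_b))
    if se == 0:
        return False
    z = (p_b - p_a) / se
    # one-sided p-value from the standard normal CDF
    p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))
    return p_value < alpha
```

A 10-point lift on a thousand conversations per arm clears this bar easily; the same lift on ten conversations per arm does not, which is exactly the discipline that separates measurement from vibes.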
The third use is segmentation by user cohort. IRR across your entire user base is interesting. IRR broken down by user segment is where you find your beachhead. Whose problems is your AI already solving really well? That’s your most defensible early market. Who has low IRR? Either they’re the wrong users for this version of the product, or there’s a product gap to close.
What Your IRR Dashboard Should Actually Show
At minimum, a weekly IRR review should include:
Overall IRR by week (trend matters more than the absolute number).
IRR by top 5-10 intent categories.
IRR by user segment or cohort.
IRR comparison across any active prompt or model versions.
If you want to connect it directly to revenue, add a layer that segments IRR by subscription tier. High-IRR free users are your conversion pipeline. Low-IRR paid users are your churn risk. That view alone will change how you prioritize product work.
The tooling problem right now is that building this from scratch requires combining conversation data, behavioral signals, and a judge pipeline, and then maintaining all of it as your product evolves. Most teams either don’t have the bandwidth or they build it once and it slowly goes stale.
This is exactly the problem Agnost is built to solve. IRR tracking is native to the platform, you’re not bolting a separate eval layer onto your conversation logs after the fact. You get resolution signals, LLM-as-judge scoring, and per-intent-category breakdowns out of the box, connected directly to the user and cohort data you need to take action.
Wrapping It Up
Here’s the uncomfortable truth: most AI teams are flying blind on the metric that matters most.
You know your latency. You know your token costs. You probably have some CSAT data. But do you know what percentage of conversations your AI actually resolves successfully, segmented by what users were trying to do? Can you tell whether your last prompt update improved resolution rates for the categories where you’re weakest?
If the answer is no, you’re optimizing blindfolded. You’re making product decisions based on proxies that are two steps removed from whether the product is actually doing its job.
IRR is not a perfect metric. Measuring intent resolution in production is an inference problem and your instrumentation will have noise. But a directionally correct signal that’s connected to user outcomes is infinitely more useful than a clean metric that tells you nothing about whether your AI is working.
The teams I see moving fastest on AI quality are the ones who’ve made this shift. They’re not debating benchmark scores. They’re watching IRR by intent category, running prompt experiments against it, and shipping changes when they see it move. That’s the compounding advantage that’s very hard to catch up to once a competitor has it.
If you’re tired of flying blind on this, Agnost gives you IRR measurement built natively into your conversation analytics, so you can stop inferring quality from proxies and start actually measuring it.

^ you, after your first week of tracking IRR by intent category and finally knowing what to fix
TL;DR: Intent Resolution Rate measures whether your AI actually solved what users came for. It’s the most direct proxy for retention and revenue. Track it with behavioral signals, sharpen it with LLM-as-judge, segment it by intent category, and watch your roadmap write itself.
Reading Time: ~9 min