The 6 Metrics Every AI-Native Product Should Track (And How to Define Them)
Your D7 retention is 34%. Your DAU/MAU ratio looks healthy. Average session length is up 12% week-over-week.
And your best users are quietly leaving.
Here’s what happened to one team we worked with closely. They were building an AI coding assistant. By every traditional metric, the product looked great. High engagement, decent retention, growing usage. Then they dug into the actual conversations. Turns out, a massive chunk of their “engaged” users were stuck in loops, asking the same question three different ways, getting plausible-sounding but wrong answers, and eventually giving up. The product looked fine in the dashboard. In the conversations, it was broken.
This is the trap with AI-native products. The standard analytics playbook was designed for apps where users consume content, complete forms, click buttons, and navigate flows. You can instrument all of that. Funnels make sense. Session length means something.
But your core loop is a conversation. And conversations don’t fit neatly into pageview-era metrics.
The six metrics below are the ones we’ve found actually tell you whether your AI product is working. Not whether people are opening it, but whether it’s doing the job they hired it to do.

^ every founder who looks at their Mixpanel dashboard and ignores what’s happening in the conversations
1. Intent Resolution Rate (IRR): The Only Metric That Actually Correlates With Revenue
IRR is simple to define and hard to measure. It’s the percentage of conversations where the user’s stated or inferred intent was successfully resolved.
If someone opens your AI tutor and asks “explain how recursion works,” did they get an explanation that satisfied them? If someone opens your customer support bot and says “I need to cancel my subscription,” did they actually get their subscription canceled?
This is the single most important metric in a conversational product. Everything else is downstream of it. High IRR means your AI is doing its job. Low IRR means users leave frustrated, don’t come back, and tell their friends.
The hard part: most teams can’t measure IRR directly because they don’t have ground truth labels. So you proxy for it.
Good proxies:
- Conversation ended without a follow-up clarification request
- User sent a message that signals completion (“thanks”, “perfect”, “got it”, “that worked”)
- No repetition of the same intent in the same session
- Session did not end in abandonment after a long AI response
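The proxy list above can be sketched as a simple scorer. Everything here is an illustrative assumption rather than a fixed formula: the `Conversation` shape, the completion-phrase list, and the order in which signals are checked would all come from your own data.

```python
# Sketch of an IRR proxy scorer. The Conversation structure, phrase list,
# and signal precedence are illustrative assumptions, not a standard API.
from dataclasses import dataclass

COMPLETION_PHRASES = {"thanks", "perfect", "got it", "that worked"}

@dataclass
class Conversation:
    user_messages: list  # user messages, in order
    abandoned_after_long_response: bool = False

def normalize(msg: str) -> str:
    return msg.lower().strip(" .!?")

def proxy_resolved(conv: Conversation) -> bool:
    """Return True if the conversation looks resolved by proxy signals."""
    msgs = [normalize(m) for m in conv.user_messages]
    # Completion signal: any message that reads as a satisfaction marker.
    if any(any(p in m for p in COMPLETION_PHRASES) for m in msgs):
        return True
    # Repetition signal: the same normalized intent sent twice counts
    # against resolution.
    if len(set(msgs)) < len(msgs):
        return False
    # Abandonment after a long AI response also counts against resolution.
    if conv.abandoned_after_long_response:
        return False
    return True

def intent_resolution_rate(convs) -> float:
    """Share of conversations that look resolved, 0.0-1.0."""
    return sum(proxy_resolved(c) for c in convs) / len(convs)
```

In practice you would tune the phrase list per product and treat exact-match repetition as a placeholder for a real similarity check.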
At Agnost, we track IRR signals across millions of conversations and the pattern is consistent: products with IRR above 70% see significantly better 30-day retention than products below 55%, even when their DAU numbers look similar. The gap shows up in the conversations before it shows up in your retention curve, typically by two to three weeks.
A good IRR to aim for: 65-75%+ depending on task complexity. If you’re below 50%, you have an AI quality problem, not a growth problem.
2. Conversation Depth: Not All Deep Conversations Are Good
Conversation depth is the average number of turns per conversation, where one turn equals one user message plus one AI response.
Here’s the thing nobody tells you about conversation depth: it’s directionally meaningless unless you segment by outcome.
A 12-turn conversation could mean your AI companion helped a user work through something genuinely difficult and they were deeply engaged. Or it could mean the user asked the same thing five times, got confused responses, tried rephrasing, got more confused, and eventually gave up.
Same metric. Completely opposite signal.
The way to make depth useful is to split every conversation into two buckets: resolved (IRR = 1) and abandoned (IRR = 0), then track average depth separately for each. What you’re looking for:
- Resolved conversations trending longer = users are going deeper because they’re getting value. This is good. Your AI is helping people explore, not just answer.
- Abandoned conversations trending longer = users are stuck. Long conversations with no resolution are the clearest signal of AI friction you have.
- Resolved conversations getting shorter over time = your AI is getting more efficient. Also good, usually.
For context, across AI coding assistants we’ve seen in Agnost’s data, the average successful conversation runs 4-7 turns. Abandoned conversations that lasted more than 10 turns almost never convert to re-engagement within 7 days. The user is done.
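The split described above is a few lines of code. This is a minimal sketch; the `(turns, resolved)` tuple shape is an illustrative assumption for however you store conversation outcomes.

```python
# Minimal sketch: average conversation depth split by outcome.
# The (turns, resolved) pair shape is an illustrative assumption.
def depth_by_outcome(conversations):
    """conversations: iterable of (turns, resolved) pairs.
    Returns (avg_depth_resolved, avg_depth_abandoned);
    None for an empty bucket."""
    resolved = [t for t, ok in conversations if ok]
    abandoned = [t for t, ok in conversations if not ok]

    def avg(xs):
        return sum(xs) / len(xs) if xs else None

    return avg(resolved), avg(abandoned)
```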

^ me when founders see their avg conversation depth at 9 turns and think it means high engagement
3. Frustration Index: Quantifying Friction Before It Becomes Churn
Frustration Index is a composite signal. You can’t measure frustration directly, but you can build a pretty accurate proxy from behavioral signals that show up in conversations before users consciously decide to leave.
The signals to combine:
Message repetition — user sends a substantively similar message to one they sent earlier in the same conversation. Weight this heavily. Rephrasing is almost always frustration.
Short follow-ups after long responses — AI sends 400 words. User replies “no that’s not what I meant.” This is a mismatch between AI output and user intent.
Clarification requests — “what do you mean by X”, “can you explain that differently”, “I don’t understand” type messages. Some of this is normal in a learning context. A lot of it is a problem.
Session abandonment after substantive AI response — user gets a response that seems complete, then just… leaves. No resolution signal, no follow-up. The AI thought it answered. The user disagreed by walking away.
You score each conversation 0-100 using a weighted combination of these signals. Then you track Frustration Index per user over time.
The pattern we consistently see: users whose Frustration Index climbs two sessions in a row are 3x more likely to churn within 14 days than users with stable or declining scores. It’s one of the strongest leading indicators of churn we’ve found, and it’s entirely invisible in traditional analytics.
One quick operational note: don’t try to build this as a single perfect formula out of the gate. Start with message repetition only. That alone will tell you more than anything else in your current stack.
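A repetition-only starting point might look like the sketch below. The 0.8 similarity threshold and the use of plain string similarity (rather than semantic similarity) are assumptions to tune against your own conversation data.

```python
# Repetition-only Frustration Index starting point. The 0.8 threshold
# and string-level similarity are illustrative assumptions.
from difflib import SequenceMatcher

def is_similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Crude 'substantively similar' check via character-level ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def repetition_count(user_messages) -> int:
    """Count messages that are substantively similar to an earlier
    message in the same conversation."""
    repeats = 0
    for i, msg in enumerate(user_messages):
        if any(is_similar(msg, earlier) for earlier in user_messages[:i]):
            repeats += 1
    return repeats

def frustration_index(user_messages) -> float:
    """Repetition-only score, 0-100: share of messages that repeat an
    earlier intent, scaled to 100."""
    if not user_messages:
        return 0.0
    return 100.0 * repetition_count(user_messages) / len(user_messages)
```

Swapping `SequenceMatcher` for embedding similarity is the obvious upgrade once the basic signal proves useful.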
4. Activation Conversation: The Single Highest-Leverage Discovery You Can Make
In traditional products, you find your activation moment by identifying the action that most correlates with long-term retention. For Twitter it was following 30 people. For Dropbox it was putting a file in a folder.
For AI products, your activation moment is a conversation type.
Every AI product we’ve looked at has a specific conversation pattern, usually identifiable by topic or intent, that predicts whether a new user will come back. Not a pageview, not a feature used, not a session length threshold. A conversation.
For an AI coding assistant, it might be: “users who have a conversation where they debug a specific error with the AI in their first week retain at 2x the rate of users who only use it for code generation.”
For an AI tutor, it might be: “users who have a back-and-forth concept exploration conversation (6+ turns, resolved) in their first three days retain at 3x the rate of users who only ask factual questions.”
For an AI companion, it might be something about emotional context or continuity of a specific conversation thread.
You find yours by segmenting your retained users (90-day active) from your churned users and working backwards through their first-week conversations. What did retained users talk about? What conversation patterns show up in retained cohorts but not churned ones?
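The working-backwards analysis above can be framed as a lift comparison: for each first-week conversation pattern, how much more common is it among retained users than churned ones? The pattern labels below are illustrative; in practice they would come from topic or intent classification of real conversations.

```python
# Hedged sketch of the cohort comparison behind activation-conversation
# discovery. Pattern labels are illustrative assumptions.
from collections import Counter

def pattern_lift(retained_patterns, churned_patterns):
    """Each argument: list of per-user lists of first-week conversation
    patterns. Returns, per pattern, the share of retained users who had
    it minus the share of churned users who had it. Large positive gaps
    are activation-conversation candidates."""
    r_counts = Counter(p for pats in retained_patterns for p in set(pats))
    c_counts = Counter(p for pats in churned_patterns for p in set(pats))
    patterns = set(r_counts) | set(c_counts)
    return {
        p: r_counts[p] / len(retained_patterns)
           - c_counts[p] / len(churned_patterns)
        for p in patterns
    }
```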
This is not a small insight. Finding your activation conversation is one of the highest-leverage things you can do in the first year of building an AI product, because it tells you what to push new users toward in onboarding. Most teams are optimizing their onboarding around UI flows. You should be optimizing it around getting new users into their activation conversation as fast as possible.
5. Conversation Health Score: The Leading Indicator of Churn
Conversation Health Score is a per-user rolling metric. You calculate it as a weighted average of their last N conversations (we usually use N=5 or N=7), scored on IRR, conversation depth relative to outcome, and Frustration Index.
The resulting number is a 0-100 score that represents how well the product is serving that specific user right now.
Here’s why this matters: churn in AI products is rarely a sudden event. It’s a gradual decay of value. Users who are about to churn almost always have a degraded Conversation Health Score 2-3 weeks before they actually leave. The conversations get shorter, more frustrated, less resolved.
If you’re only looking at aggregate metrics, you miss all of this. You see the user is still active (technically true) while their individual experience is quietly falling apart.
The operational use case for Conversation Health Score is churn intervention. Set an alert for any user whose score drops more than 15 points over two consecutive weeks. That user is at risk. You now have a 2-3 week window to do something about it before they’re gone.
What do you do? Depends on your product. An in-app message. A push notification. A triggered email. Showing them a feature they haven’t used. Whatever your intervention playbook is, the Conversation Health Score tells you exactly when to run it.
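The score-plus-alert mechanics can be sketched as follows. N=5 and the 15-point two-week drop come from the text; the 60/40 weighting of resolution versus frustration and the per-conversation input shape are illustrative assumptions.

```python
# Sketch of a per-user Conversation Health Score and drop alert.
# The 0.6/0.4 weights and input shape are illustrative assumptions;
# N=5 and the 15-point threshold follow the text.
def health_score(conversations, n=5):
    """conversations: list of (resolved: bool, frustration: float 0-100),
    most recent last. Returns a 0-100 rolling score over the last n."""
    recent = conversations[-n:]
    if not recent:
        return None
    scores = [
        0.6 * (100.0 if resolved else 0.0) + 0.4 * (100.0 - frustration)
        for resolved, frustration in recent
    ]
    return sum(scores) / len(scores)

def at_risk(weekly_scores, drop=15.0):
    """True if the score fell more than `drop` points across any
    two-consecutive-week window."""
    return any(
        earlier - later > drop
        for earlier, later in zip(weekly_scores, weekly_scores[2:])
    )
```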
This is actually one of the core metrics we built Agnost around. Not because it’s the flashiest thing, but because it’s the most operationally useful signal we could give teams building AI products. Knowing which users are in trouble before they churn, not after, is the difference between proactive and reactive product management.

^ your customer success team when you tell them you can now see which users are about to churn 2 weeks early
6. Silence Before Churn: Reading the Behavioral Obituary
This one is a lagging indicator, which means it’s not going to help you save a user who’s already on their way out. But it’s enormously useful for building and refining your churn model.
Silence Before Churn is the pattern of what users stop doing in conversations before they cancel.
Here’s what it typically looks like in the data:
- Message length decreases — users who were writing detailed, contextual prompts start sending short, one-line queries. The effort they’re willing to put in drops.
- Follow-up rate drops — they stop having multi-turn conversations. Single-message sessions increase.
- Session frequency dips before it goes to zero — there’s usually a period of irregular, low-effort usage before full churn. People don’t usually go from daily active to gone overnight.
- Topic narrowing — users stop exploring. They stick to one or two narrow use cases, usually the most basic ones.
- No resolution-positive signals — they stop sending messages like “that worked” or “perfect.” The satisfaction markers disappear from their conversations.
You build this by doing a cohort analysis of churned users and pulling their last 30 days of conversation data. What changed in days 0-7 before cancellation? What about 7-14 days before? You’ll find patterns specific to your product.
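The windowed comparison amounts to subtracting averaged signals across the two periods. The field names below are illustrative assumptions for whatever behavioral signals you aggregate.

```python
# Sketch of the churned-cohort window comparison: change in averaged
# behavioral signals between an earlier and a later pre-churn window.
# Signal names are illustrative assumptions.
def decay_deltas(early_window, late_window):
    """Each window: dict of averaged signals (e.g. avg message length,
    follow-up rate). Returns late minus early; negative values = decay."""
    return {k: late_window[k] - early_window[k] for k in early_window}
```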
Once you have those patterns, you encode them into your Conversation Health Score (which is why these metrics form a system, not a list). The Silence Before Churn tells you what decay looks like. The Health Score operationalizes it as a real-time alert.
How These Metrics Connect to Each Other
Look, these aren’t six independent metrics you add to a dashboard and check in different windows. They’re a system.
IRR is the foundation. Everything starts with whether the AI is actually resolving intent.
Conversation Depth gives IRR context. High depth, high IRR = deep engagement. High depth, low IRR = stuck users.
Frustration Index is the early warning layer. It catches degradation in experience before it shows up in IRR or retention numbers.
Activation Conversation is what you optimize toward in the first-use experience. The whole funnel from acquisition to activation changes when you know which conversation pattern predicts retention.
Conversation Health Score aggregates IRR, depth, and frustration signals into a single per-user number, giving you an operationalizable churn prediction tool.
Silence Before Churn is the post-mortem that makes your Health Score smarter over time. It’s how you learn what early decay actually looks like in your specific product.
The dependency chain goes: IRR and Depth, then Frustration Index, then Activation Conversation shapes onboarding, Health Score operationalizes everything, and Silence Before Churn continuously improves the model.
If you only have bandwidth to track two of these right now, start with IRR and Frustration Index. Those two will tell you more about the health of your AI product than all of your current analytics combined.
The Practical Problem
Most teams we talk to hit the same wall: they’re using analytics tools that were built for clicks, not conversations. The event structure doesn’t map. The funnel metaphor breaks down. You end up with 40 custom events that don’t tell a coherent story.
This is the exact problem we built Agnost to solve. Agnost is an analytics platform designed specifically for AI-native products, with native support for conversation-level tracking, IRR proxies, Frustration Index, Health Scoring, and the churn signals discussed in this post. If you’re building a conversational AI product and you’re tired of hacking together conversation analytics on top of tools that weren’t designed for it, it’s worth a look.

^ you, after finally having actual visibility into what’s happening in your AI product’s conversations
Wrapping it up
The teams winning at AI-native products in 2026 aren’t the ones with the best models. They’re the ones who actually understand what’s happening in their conversations.
The metrics above aren’t theoretical. They’re the patterns we’ve seen separate products that retain from products that churn. Build the measurement system. Fix what it shows you. Repeat.
Your model is probably better than you think. Your visibility into how it’s performing almost certainly isn’t.
TL;DR: DAU and session length don’t tell you if your AI product is working. Track Intent Resolution Rate, Conversation Depth (segmented by outcome), Frustration Index, Activation Conversation, Conversation Health Score, and Silence Before Churn. These six metrics form a system that tells you whether your AI is actually doing its job.
Reading Time: ~9 min