Why Your Best Users and Your Worst Users Look Identical in Your Dashboard

A power user and a frustrated user can have the same session count, same average session length, and same return rate. Standard analytics can't tell them apart. Conversation analytics can.

Meet User A. Opens your AI product four times a week. Spends about eight minutes per session. Has been a customer for three months. In that time they’ve asked 50 complex questions, gotten 47 of them resolved cleanly, told three colleagues to sign up, and is actively exploring your premium tier.

Now meet User B. Opens your AI product four times a week. Spends about eight minutes per session. Has been a customer for three months. In that time they’ve asked the same five questions over and over, gotten incomplete or wrong answers each time, and is currently on a free trial of your closest competitor.

Pull up your analytics dashboard. Find them.

You can’t. Because in your dashboard, they look exactly the same.

Two Spidermen pointing at each other

^ your “power users” segment and your “about to churn” segment, probably

This isn’t a quirk of your specific tooling. It’s a structural limitation that almost every team building a conversational AI product runs into eventually, usually after they’ve already made several expensive product decisions based on the wrong read of their user base.


Why standard metrics can’t see the difference

The session-based metrics we inherited from the web analytics world (DAU, session count, average session length, return rate) all measure the same thing: whether the user showed up and how long they stayed. They don’t measure what happened while they were there.

That worked fine when “using a product” meant clicking through screens and submitting forms. State transitions are easy to track. Did the user reach the checkout page? Did they complete the registration flow? Each step either happened or it didn’t.

Conversations don’t have that structure. There’s no checkout step. There’s no “intent resolved” event in your event log. There’s just a stream of messages that could represent deep, growing engagement or increasingly desperate repetition, and from the outside they look identical.

Here’s the part that makes it worse: frustrated users don’t immediately drop their usage when things aren’t working. They do the opposite. They ramp it up. They rephrase. They come back and try again. They spend more time in the product trying to extract value they’re not getting.

Early-stage churn often looks like increased engagement in your event logs. That’s not an edge case. That’s the pattern.

Dog sitting in burning room saying "this is fine"

^ every AI PM who’s been measuring “engagement” without measuring whether anyone’s actually getting what they came for


The consequences are worse than you probably think

If you can’t distinguish your genuinely engaged power users from your pre-churn users who look like power users, the downstream effects compound fast.

Your segmentation is wrong. The “highly engaged” segment you’re analyzing to understand what drives success is contaminated with users who are engaged because they’re stuck, not because they’re thriving. Any behavioral patterns you extract from that segment (to inform onboarding, feature prioritization, or your ICP) reflect both populations at once.

Your A/B tests are noisy. When you run an experiment on your “active users” cohort and see a flat result, you don’t know whether you’re looking at a neutral effect or two opposite effects that cancelled each other out: the intervention helping your real power users and hurting the pre-churn group, or vice versa.

Your retention interventions land on the wrong people. If you’re sending “tips for power users” to your most active segment, a meaningful chunk of those emails is going to users who are about to leave. Not because you targeted wrong, but because your targeting metric can’t see what’s actually happening.

And the most painful one: you think product-market fit is stronger than it is. Your engagement numbers look healthy. Retention looks okay. Then Q2 churn hits and it’s worse than anything your dashboard suggested. You’re not surprised because the product got worse. You’re surprised because you didn’t know the product was already failing a whole segment of users who were still showing up in your active cohort.


What actually separates them

The signals that distinguish a power user from a pre-churn user don’t live in event logs. They live in the conversations.

Intent Resolution Rate at the user level. User A resolves 90%+ of their intents. User B resolves maybe 30%. You can’t get this number from session data alone. You need to know whether the user actually got what they came for, which requires analyzing what happened inside the conversation, not just that a session occurred.

Conversation trajectory over time. User A’s conversations are getting deeper and more complex each week. They’re giving more context, asking more sophisticated follow-up questions, using the product for higher-stakes tasks. User B’s conversations are getting shorter and more repetitive. The complexity is shrinking, not growing. This trend is invisible in aggregate session metrics but completely legible in conversation data.

Frustration signals. Rephrasing the same intent two, three, four times in a single session. Short clipped replies after receiving an AI response. Polite exits at low turn counts. These are behavioral signatures of a user who is not getting what they need. They’re invisible to event-based tracking and obvious in conversation-level analysis.
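These frustration signatures are cheap to approximate. Here’s a minimal sketch, assuming you can pull each session’s user messages as plain strings; the similarity and length thresholds are illustrative, not standards, and would need tuning against sessions you’ve labeled by hand:

```python
from difflib import SequenceMatcher

# Hypothetical thresholds -- tune against labeled sessions.
REPHRASE_SIMILARITY = 0.6   # how alike two consecutive user messages must be to count as a rephrase
SHORT_REPLY_TOKENS = 4      # replies this short after an AI response read as clipped

def count_rephrases(user_messages: list[str]) -> int:
    """Count consecutive user messages that restate the previous one."""
    rephrases = 0
    for prev, curr in zip(user_messages, user_messages[1:]):
        if SequenceMatcher(None, prev.lower(), curr.lower()).ratio() >= REPHRASE_SIMILARITY:
            rephrases += 1
    return rephrases

def is_short_clipped(reply: str) -> bool:
    """Flag terse user replies like 'ok' or 'no, that's not it'."""
    return len(reply.split()) <= SHORT_REPLY_TOKENS
```

In production you’d likely swap the character-level `SequenceMatcher` for embedding similarity, since users rephrase semantically, not lexically, but the shape of the detector stays the same.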

Session resolution rate. Not just how many sessions, but how many sessions where the user got something done. User A’s sessions have resolution markers: follow-up actions in the product, positive closing messages, subsequent return for a different intent type (suggesting the first one was resolved). User B’s sessions just end.

Across millions of agent conversations we’ve tracked at Agnost, the IRR gap between users who look equally engaged is often 40-50 percentage points. The engagement metrics told you the same story. The conversation data told you two completely different ones.

Surprised Pikachu

^ founders seeing user-level IRR data for the first time and realizing their “top users” segment is half pre-churn


How to actually separate these segments

The fix isn’t adding more events to your event log. It’s building a user quality score from conversation data and using that to split what your standard metrics treat as a single population.

The components of a useful quality score:

IRR per user, trailing 30 days. What percentage of this user’s conversations ended in a successful resolution? If you don’t have LLM-as-judge scoring yet, proxy signals work: positive follow-up after a response, no immediate rephrase, subsequent product action after the conversation. Imperfect but directionally correct.
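If you’re starting with proxy signals, the logic can be as simple as this sketch; the closer phrases, the flags, and the `proxy_resolved` helper are all hypothetical stand-ins, not a standard API:

```python
# Hypothetical positive closers -- extend from your own conversation data.
POSITIVE_CLOSERS = ("thanks", "perfect", "got it", "that works")

def proxy_resolved(last_user_message: str,
                   rephrased_immediately: bool,
                   product_action_followed: bool) -> bool:
    """Directional stand-in for LLM-as-judge resolution scoring."""
    closing_positive = any(p in last_user_message.lower() for p in POSITIVE_CLOSERS)
    return (closing_positive or product_action_followed) and not rephrased_immediately

def proxy_irr(conversations: list[tuple[str, bool, bool]]) -> float:
    """Share of a user's conversations resolved, per the proxy above."""
    if not conversations:
        return 0.0
    return sum(proxy_resolved(*c) for c in conversations) / len(conversations)
```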

Conversation depth trend. Is this user’s average conversation depth (turn count, message complexity, topic variety) going up, flat, or down over the last 4-6 weeks? Increasing depth is a power user signal. Decreasing depth is a quiet downgrade signal: the user has stopped trusting the AI with anything important.

Frustration index. Aggregate of rephrase rate, short-exit rate, and polite-exit-at-low-turn-count rate per user. A frustration index trending up over two weeks is one of the strongest pre-churn leading indicators we’ve found. It often precedes actual churn by 3-4 weeks, which is exactly the window you need to intervene.

Session resolution rate. Sessions that ended with a positive signal vs. sessions that ended abruptly or with a soft quit indicator.
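Putting the four components together, a per-user quality score might look like this sketch. The weights are illustrative assumptions, not a formula from the post; calibrate them against your own observed churn:

```python
from dataclasses import dataclass

def trend_slope(weekly_values: list[float]) -> float:
    """Least-squares slope of a per-week series, e.g. weekly avg conversation depth."""
    n = len(weekly_values)
    if n < 2:
        return 0.0
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(weekly_values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, weekly_values))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

@dataclass
class UserQuality:
    irr: float                 # resolved conversations / total, trailing 30 days
    depth_slope: float         # trend_slope() of weekly avg turn count
    frustration_index: float   # 0..1 aggregate of rephrase / short-exit rates
    session_resolution: float  # sessions ending with a positive signal / total

    def score(self) -> float:
        # Illustrative weights -- calibrate against observed churn.
        depth_component = max(-1.0, min(1.0, self.depth_slope))  # clamp the slope to [-1, 1]
        return (0.4 * self.irr
                + 0.2 * (depth_component + 1) / 2   # rescale to 0..1
                + 0.2 * (1 - self.frustration_index)
                + 0.2 * self.session_resolution)
```

Even with made-up weights, the ordering is what matters: a user like User A scores far above a user like User B, which is exactly the separation the session metrics couldn’t give you.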

Run this scoring on your active users segment and look at the distribution. It’s almost always bimodal. There’s a healthy cluster with high IRR, growing depth, low frustration. And there’s an at-risk cluster with low IRR, flattening or declining depth, elevated frustration. Both clusters have the same session counts and session lengths. Without conversation-level scoring, they’re invisible to you as separate populations.
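A crude way to surface that bimodal split without any ML dependencies is a one-dimensional 2-means pass over the quality scores. This is a sketch, not the only way to cluster; with real data you’d sanity-check it against a histogram first:

```python
def split_bimodal(scores: list[float], iterations: int = 20) -> tuple[list[float], list[float]]:
    """Naive 1-D 2-means: split user quality scores into (at_risk, healthy) clusters."""
    lo, hi = min(scores), max(scores)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        low_cluster = [s for s in scores if s < mid]
        high_cluster = [s for s in scores if s >= mid]
        if not low_cluster or not high_cluster:
            break  # distribution isn't separable -- don't force a split
        lo = sum(low_cluster) / len(low_cluster)    # move centroids to cluster means
        hi = sum(high_cluster) / len(high_cluster)
    mid = (lo + hi) / 2
    return ([s for s in scores if s < mid], [s for s in scores if s >= mid])
```

If the two returned clusters have similar means, your distribution isn’t actually bimodal and the split shouldn’t drive decisions; the gap between cluster means is the thing to watch.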

Now you have two products to manage. One is a retention and expansion play: go deep on what’s working for the healthy cluster, understand their use cases, build the features that deepen their engagement further, figure out who else looks like them. The other is a rescue operation: identify the at-risk cluster before they churn, reach out with targeted interventions, figure out what intents they’re failing on and whether that’s a product problem or a positioning problem (are they just wrong-fit users who you shouldn’t be acquiring in the first place?).

Both decisions are better than operating on the assumption that your engaged users are a single homogeneous group, which is what every standard analytics tool implicitly tells you.


The tooling reality

Building this scoring in-house is doable. It’s also a few weeks of eng work that’ll need ongoing maintenance as your product evolves. You need semantic analysis on message content, LLM-as-judge for intent resolution scoring, a way to track per-user trends over time rather than aggregate cohort metrics, and a UI that makes the bimodal distribution visible rather than averaging it away.

A lot of teams I talk to say they’ll get to this “once things settle down.” Those are usually the same teams that hit a churn surprise six months later and can’t explain it.

This is exactly the gap Agnost was built to close. User-level IRR tracking, conversation depth trends, frustration indexing, and the per-user quality score that lets you split your engaged user segment into the two populations that are actually in there. It’s built natively into the conversation analytics layer, not bolted on as a separate eval tool you have to maintain.

If you’ve got a growing user base and you’re starting to wonder whether your engagement metrics are telling you the whole story, they’re probably not. The data you actually need is in your conversations. You just need something that knows how to read it.


Wrapping it up

Here’s what I keep coming back to: the identical-data problem isn’t a dashboarding problem you can solve by adding more charts. It’s a fundamental mismatch between the measurement paradigm you’re using and the product you’re building.

Event-based analytics were designed for products where success is a state transition. Conversational AI products don’t have state transitions in that sense. Success is qualitative. It lives inside the exchange. And the only way to see it is to actually analyze the exchange, not just log that it happened.

Your best users and your worst users are in the same segment right now. They have the same session counts, the same return rates, the same engagement scores. But their relationship with your product is completely different. One of them is building a habit. The other is running out of patience.

The sooner you can tell them apart, the sooner you can do something about both.

Hackerman meme coding at multiple screens confidently

^ you, after running the bimodal user quality analysis and finally seeing your engagement data honestly


TL;DR: Your engaged users segment contains two completely different populations that look identical in standard metrics. Power users and pre-churn users have the same session counts and return rates because frustrated users ramp up usage trying to extract value they’re not getting. The only way to split them is conversation-level scoring: IRR per user, conversation depth trend, and frustration index. Once you do, you’ll find a bimodal distribution with two completely different product strategies hiding inside what looked like one healthy cohort.

Reading Time: ~8 min