
Setting Up Your First Conversation Health Dashboard (A Step-by-Step Guide for AI Product Teams)

Learn how to build a Conversation Health Dashboard for your AI product: the 5 views you actually need, how to instrument for it, and the weekly review ritual that turns data into better decisions.

You shipped. You have users. You have data. And every Monday morning you open your analytics dashboard and stare at a wall of session counts, DAU/MAU ratios, and funnel drop-off rates trying to answer one very specific question:

Is the AI actually working?

And the dashboard just… doesn’t answer that.

This is the gap nobody warns you about. In a conversation-first product, the “thing that happened” isn’t a click or a page view. It’s a conversation. And conversations have health. They succeed or fail. They frustrate or delight. None of that shows up in your standard analytics tool.

So you’re flying blind. Most teams don’t realize how blind they are until churn starts moving in the wrong direction.

Dog sitting in a burning room saying this is fine

^ every AI product team’s Monday morning with their current analytics setup

This post is the guide I wish I’d had six months in. Here’s exactly what a real Conversation Health Dashboard should contain, how to build one, and how to use it every week.


Why hooking up standard analytics isn’t enough

Let me be clear first: I’m not dismissing event analytics. You still want it. Retention curves, activation funnels, feature adoption rates. You still need all of that. But event analytics tracks what users DO. Not what conversations ACCOMPLISH. There’s a real difference.

A user who sent 20 messages in a session looks great in your engagement metrics. But what if 15 of those messages were them rephrasing the same question because the AI kept misunderstanding them? That session is a product failure dressed up as strong engagement.

Standard analytics tools tell you the user stayed. They don’t tell you if the AI helped. And for a product where the core value proposition IS the AI’s ability to resolve what a user is trying to do, that’s a massive blind spot.

The gap: traditional analytics tracks behavior. Conversation health analytics tracks outcomes.

We’ve seen this pattern repeat across the teams we work with at Agnost. Teams with strong event analytics and zero conversation-level visibility consistently underestimate their failure rates, because frustrated users don’t always bounce immediately. They try again. They rephrase. They give the AI one more shot. And then they quietly stop coming back.

By the time churn shows up in your retention chart, you’ve already lost them.


The 5 views every Conversation Health Dashboard needs

Here’s the opinionated framework. Five views. No more, no less. Each has a specific job.

View 1: Weekly Pulse (the Monday morning check)

Three numbers. That’s it.

  • Intent Resolution Rate this week vs last week. What percent of conversations ended with the user’s intent being resolved? Up or down?
  • Frustration Index trend. Are users showing signs of frustration (repetition, rephrasing, short angry responses, abandonment mid-conversation) more or less than last week?
  • New user activation rate. Of users who signed up in the past 7 days, what percent had at least one successful conversation in their first session?

Under each number, one sentence of interpretation. Not a paragraph. Not a report. One sentence that tells you whether to dig deeper.

This view answers a single question in under 5 minutes: is the product healthier or sicker than it was last week? If the answer is clearly “healthier,” you go back to building. If something moved the wrong direction, that’s when you open the next view.
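The Weekly Pulse math is deliberately simple. Here's a minimal sketch of the first two numbers, assuming each conversation record carries a week label, an outcome, and a frustration flag (all field names and values are illustrative assumptions, not a prescribed schema):

```python
# Hypothetical scored conversations; field names are assumptions.
conversations = [
    {"week": "2024-W01", "outcome": "resolved", "frustrated": False},
    {"week": "2024-W01", "outcome": "abandoned", "frustrated": True},
    {"week": "2024-W02", "outcome": "resolved", "frustrated": False},
    {"week": "2024-W02", "outcome": "resolved", "frustrated": False},
    {"week": "2024-W02", "outcome": "frustrated", "frustrated": True},
]

def weekly_pulse(convs, this_week, last_week):
    """Intent Resolution Rate and Frustration Index, week over week."""
    def rates(week):
        rows = [c for c in convs if c["week"] == week]
        n = len(rows) or 1
        irr = sum(c["outcome"] == "resolved" for c in rows) / n
        frustration = sum(c["frustrated"] for c in rows) / n
        return irr, frustration

    irr_now, fr_now = rates(this_week)
    irr_prev, fr_prev = rates(last_week)
    return {
        "irr": irr_now, "irr_delta": irr_now - irr_prev,
        "frustration": fr_now, "frustration_delta": fr_now - fr_prev,
    }

pulse = weekly_pulse(conversations, "2024-W02", "2024-W01")
```

The deltas, not the absolute numbers, are what you read on Monday: a positive `irr_delta` and a negative `frustration_delta` means close the tab and go build.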

View 2: Conversation Quality Distribution

A histogram. X-axis: conversation quality bucket (resolved, partially resolved, frustrated, abandoned). Y-axis: count. Segmented by conversation category or intent type.

This is where you see the shape of your product’s performance, not just the average.

A flat aggregate IRR of, say, 68% sounds fine. But what if that number is actually 91% resolved for one intent category and 22% for another? The average hides a product-threatening failure in a specific use case.

Segment this view by whatever your natural intent taxonomy looks like. In a coding assistant, that might be “debugging,” “code generation,” “explanation,” “architecture advice.” In an AI support agent, it’s your top ticket categories. In an AI companion, it might be conversation type: emotional support vs. task help vs. casual chat.

The insight this view gives you: where exactly is the AI excelling and where is it failing? Not on average. By topic.
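Computing the segmented view is a grouped count followed by a per-category rate. A sketch with `collections.Counter`, using made-up category and outcome labels:

```python
from collections import Counter

# Hypothetical (category, outcome) pairs; labels are assumptions.
scored = [
    ("debugging", "resolved"), ("debugging", "resolved"),
    ("debugging", "frustrated"),
    ("architecture", "abandoned"), ("architecture", "frustrated"),
]

# Count outcomes per intent category...
dist = Counter(scored)

# ...then derive a per-category Intent Resolution Rate.
categories = {cat for cat, _ in scored}
irr_by_category = {
    cat: dist[(cat, "resolved")]
    / sum(v for (c, _), v in dist.items() if c == cat)
    for cat in categories
}
```

Even in this toy data the aggregate hides the story: "debugging" resolves two-thirds of the time while "architecture" resolves nothing, which is exactly the failure pattern a flat average would bury.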

View 3: User Journey (cohort view)

This is the one most teams never build. And it’s probably the most important for understanding long-term product health.

Conversation Health Score over time, segmented by signup cohort.

Are users’ conversations getting better as they use the product more? Are they learning to prompt better? Is the AI learning to handle their preferences? Or does conversation quality plateau after week 2 and never improve?

A healthy AI product should show a positive slope here. Users who’ve been around longer should have higher-quality conversations than new users. If they don’t, you have a problem. The AI relationship isn’t deepening. You’re getting breadth of usage without depth of value.

This view is your monthly gut-check on whether you’re building something sticky or something disposable.
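The "positive slope" gut-check can be made literal with a least-squares fit of health score against tenure. A sketch, assuming you can produce (weeks since signup, score) pairs per conversation (the sample data is invented):

```python
# Hypothetical records: (weeks since signup, Conversation Health Score 0-1).
records = [(0, 0.55), (0, 0.60), (2, 0.70), (2, 0.72), (4, 0.80)]

def tenure_slope(recs):
    """Least-squares slope of health score vs. weeks of tenure.
    Positive slope: conversations improve as users stick around."""
    n = len(recs)
    mx = sum(x for x, _ in recs) / n
    my = sum(y for _, y in recs) / n
    cov = sum((x - mx) * (y - my) for x, y in recs)
    var = sum((x - mx) ** 2 for x, _ in recs)
    return cov / var

slope = tenure_slope(records)
```

A flat or negative slope over a full cohort is the "breadth without depth" signal described above, and it's worth checking per cohort, not just globally, so a strong early cohort can't mask a weak recent one.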

View 4: Failure Mode Drill-down

Here’s where the dashboard turns into a prioritization tool.

Top 10 intent categories with the lowest Intent Resolution Rate this week. With sample conversations from each category right there in the view, not linked to some separate logs tool. Actually embedded.

You should be able to open this view and read real conversations without leaving the dashboard. That’s the key design principle. If it takes more than two clicks to go from “low IRR in category X” to “what those conversations actually look like,” you’ll stop using it. Everyone does.

This view is your weekly product backlog input. Fix the top 3 failure categories. Assign owners. Move on.
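A minimal sketch of the drill-down query: rank categories by resolution rate ascending and embed failing transcripts directly in the result, so the view needs no link-out (record shape and the placeholder transcripts are assumptions):

```python
from collections import defaultdict

# Hypothetical scored conversations; field names are assumptions.
convs = [
    {"category": "billing", "resolved": False, "transcript": "sample A"},
    {"category": "billing", "resolved": False, "transcript": "sample B"},
    {"category": "setup", "resolved": True, "transcript": "sample C"},
    {"category": "setup", "resolved": False, "transcript": "sample D"},
]

def failure_drilldown(rows, top_n=10, samples=3):
    """Lowest-IRR categories with real transcripts embedded in the view."""
    by_cat = defaultdict(list)
    for c in rows:
        by_cat[c["category"]].append(c)
    ranked = sorted(
        by_cat.items(),
        key=lambda kv: sum(c["resolved"] for c in kv[1]) / len(kv[1]),
    )
    return [
        {
            "category": cat,
            "irr": sum(c["resolved"] for c in group) / len(group),
            # Embed failing transcripts right here, not behind a logs tool.
            "samples": [c["transcript"] for c in group
                        if not c["resolved"]][:samples],
        }
        for cat, group in ranked[:top_n]
    ]

worst = failure_drilldown(convs)
```

The `samples` field is the design principle from above made concrete: zero clicks from "low IRR in category X" to reading what those conversations look like.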

View 5: Churn Early Warning

Users whose Conversation Health Score has declined for two or more consecutive weeks. Segmented by plan or tier.

This is your intervention list. These are real users, with names and emails, who are drifting toward churn before they’ve made the conscious decision to leave. A declining Conversation Health Score is a leading indicator, usually by 2-3 weeks.

Segmenting by plan matters here. A paid user showing two consecutive weeks of declining conversation quality is a very different fire to put out than a free user on week 3.

This view feeds directly into your CS or success team’s weekly outreach. Or, if you’re early stage, it’s the list you personally email on Fridays.
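The "two or more consecutive weeks of decline" rule is easy to express directly. A sketch, assuming you can pull an ordered list of weekly scores per user (user IDs and scores are invented):

```python
# Hypothetical weekly Conversation Health Scores per user, oldest first.
weekly_scores = {
    "user_a": [0.8, 0.7, 0.6],   # two consecutive declines -> at risk
    "user_b": [0.6, 0.7, 0.65],  # only one decline -> not flagged
}

def at_risk(scores_by_user, weeks=2):
    """Flag users whose score declined for `weeks` consecutive weeks."""
    flagged = []
    for user, scores in scores_by_user.items():
        declines = [a > b for a, b in zip(scores, scores[1:])]
        # Only the most recent `weeks` transitions matter.
        if len(declines) >= weeks and all(declines[-weeks:]):
            flagged.append(user)
    return flagged

risk_list = at_risk(weekly_scores)
```

Join `risk_list` against plan tier before it reaches the outreach list, since (as noted above) a declining paid user and a declining week-3 free user call for different responses.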

Surprised Pikachu face

^ most teams seeing their churn early warning list for the first time after building this view


How to instrument for this (what you actually need to log)

The lazy version first, because it gets you 80% of the way there.

Log: conversation ID, user ID, turn count, timestamps per turn, and the raw text of each turn. That’s your baseline.
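The baseline log can be as simple as one JSON line per turn. A sketch of the lazy version (the field names and the list-as-sink are illustrative; in practice the sink is a file, queue, or table):

```python
import json
import time
import uuid

def log_turn(conversation_id, user_id, role, text, sink):
    """Append one conversation turn as a JSON line: the lazy baseline of
    conversation ID, user ID, timestamp, and raw text per turn."""
    sink.append(json.dumps({
        "conversation_id": conversation_id,
        "user_id": user_id,
        "role": role,          # "user" or "assistant"
        "ts": time.time(),     # per-turn timestamp
        "text": text,
    }))

log = []
conv_id = str(uuid.uuid4())
log_turn(conv_id, "user_123", "user", "How do I reset my API key?", log)
log_turn(conv_id, "user_123", "assistant", "Open Settings, then...", log)
```

Turn count falls out of grouping by `conversation_id`, so it doesn't need its own field.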

From there, run an LLM evaluation layer on completed conversations. After each conversation ends, send the full transcript to a judge model and ask it to score: was the user’s intent resolved? Did the user show signs of frustration? What category of intent was this? This is the “log everything, classify after the fact” approach, and it works.
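A sketch of that judge layer. The prompt wording, the JSON schema, and `call_llm` are all assumptions: `call_llm` stands in for whatever wrapper you have around your model provider, stubbed out here so the flow is visible end to end:

```python
import json

def build_judge_prompt(transcript):
    """Ask the judge model for a structured verdict. The wording and
    schema here are illustrative, not a recommended prompt."""
    return (
        "Read the conversation transcript and reply with a JSON object "
        "with keys intent_resolved (bool), frustrated (bool), "
        "category (string).\n\nTranscript:\n" + transcript
    )

def score_conversation(transcript, call_llm):
    """Run after the conversation ends: send the full transcript to a
    judge model. `call_llm` is a hypothetical model-API wrapper."""
    return json.loads(call_llm(build_judge_prompt(transcript)))

# Stubbed judge for illustration; in production this is a real model call.
def fake_judge(prompt):
    return ('{"intent_resolved": true, "frustrated": false, '
            '"category": "debugging"}')

scores = score_conversation("User: my build fails\nAI: check the log",
                            fake_judge)
```

Because scoring happens after the fact, you can re-run it over historical transcripts whenever the judge prompt or taxonomy changes.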

The more rigorous version adds explicit UX signals: thumbs up/down after responses, a task completion checkpoint (“Did that help?”), a session-end survey. These ground your LLM-judged scores in actual user feedback, not inference.

The trap most teams fall into: trying to build the rigorous version first, getting stuck on feedback UX design, and shipping nothing. Start lazy. Get scores running. Ship the dashboard. Layer in explicit signals once you know what you’re optimizing for.

One important note on the LLM-as-judge layer: calibrate it against real conversations periodically. Have someone manually review 50 conversations a week and check whether the scores match human judgment. If they’re diverging, the judge prompt needs work.
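The weekly calibration check reduces to an agreement rate between judge scores and human labels on the same conversations. A sketch with invented sample labels (the 90% threshold in the comment is a placeholder, not a recommendation from the text):

```python
# Hypothetical weekly calibration sample: judge verdicts vs. human review
# of the same conversations (values are made up).
judge_labels = [True, True, False, True, False, True]
human_labels = [True, False, False, True, False, True]

def agreement(judge, human):
    """Fraction of conversations where the judge matches human judgment."""
    assert len(judge) == len(human)
    return sum(j == h for j, h in zip(judge, human)) / len(judge)

# If this drifts low (e.g. under ~0.9), the judge prompt needs work.
calibration = agreement(judge_labels, human_labels)
```

Tracking `calibration` on the dashboard itself, as described below, keeps the rest of the numbers honest.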

At Agnost, we track this calibration score as a dashboard health metric. Your conversation health numbers are only as trustworthy as the evaluation layer behind them.


The weekly review ritual that actually matters

Building the dashboard is 20% of the work. The ritual is 80%.

A dashboard nobody looks at systematically is worse than no dashboard. It gives you false confidence that someone is watching.

Monday morning (5 minutes, solo). Open the Weekly Pulse. Read the three numbers. Trending right? Close the tab and go build. Something moved wrong? Add it to the team sync agenda.

Weekly team sync (15-20 minutes). Open the Failure Mode drill-down together. Read actual conversations out loud as a group. This is the part people skip, and it’s the most important part. There’s something about reading a real user’s frustrating conversation in front of your team that builds product empathy in a way no metric can replicate. Assign owners to the top 3 failure categories. Those owners report back next week.

Monthly review (30-45 minutes). Pull up the User Journey cohort view. Is conversation quality improving with tenure? If not, that’s a strategic problem, not a tactical one. The core product isn’t getting stickier and you need to understand why before going further on growth.

The discipline of this ritual is what turns conversation data into product decisions. Without it, the dashboard is decoration. With it, every metric has an owner and every trend has a response.


Common mistakes when building this

Tracking too many metrics. I’ve seen teams build dashboards with 40+ conversation metrics. They look impressive and get checked never. Pick 5. Know them deeply.

Not segmenting. A flat average Intent Resolution Rate hides everything. If your overall IRR is 70% but the entire failure is concentrated in one intent category, the average is actively misleading you. Always segment before drawing conclusions.

Building the dashboard without the review ritual. The dashboard exists. It lives at some URL. Nobody has a calendar block to look at it. Six months later someone asks “wait, don’t we have a dashboard for this?” Yeah. You do. You’ve never used it.

Optimizing for the metrics instead of the users. Goodhart’s Law is real. If IRR becomes a performance target, people find ways to game it. Build in a qualitative check (actual human conversation review) that keeps the numbers honest.

Person applying clown makeup progressively

^ teams that built a beautiful dashboard and then never scheduled a review meeting


What a healthy dashboard looks like after 90 days

Here’s the benchmark to aim for. After 90 days of running this dashboard with a consistent review ritual, you should be able to say yes to all of the following:

  • Intent Resolution Rate is flat or improving week-over-week
  • Frustration Index is declining
  • The User Journey cohort view shows conversation quality improving with user tenure (positive slope)
  • The Failure Mode list has fewer categories than it did on day 1, and the ones remaining have owners and timelines
  • You can point to at least 3 product decisions you made this quarter that were directly informed by a specific dashboard trend

That last one is the real test. If you can’t name the decisions, the dashboard is decorative. If you can, it’s running the product.

The teams we see doing this well at Agnost typically hit meaningful inflection points around weeks 6-8. Not because the data suddenly gets better, but because the review ritual becomes a habit and the team starts making tighter, faster product decisions. The data was always there. The discipline is what changed.


Wrapping it up

Look, I know the instinct is to wait until things are “more stable” before investing in this infrastructure. You’re focused on shipping features, not building internal tooling. The dashboard can come later.

But here’s what I keep seeing: the teams that build Conversation Health Dashboards early don’t slow down. They ship FASTER, because they have a clear weekly answer to “what should we fix next.” They’re not debating priorities in a vacuum. The data tells them.

Your AI product’s health is a real, measurable thing. Users are either getting their problems solved or they’re not. Conversations are either improving with product tenure or they’re stagnating. You either know this or you don’t.

Build the dashboard. Run the ritual. Give yourself the unfair advantage of actually knowing what’s going on inside your product.

Hackerman typing intensely at a glowing computer

^ you, 90 days after building this dashboard and actually knowing your product better than anyone in the room


If you’d rather not build this from scratch, this is exactly what Agnost tracks out of the box. Connect your AI product and get your Conversation Health Dashboard running in under 10 minutes: Intent Resolution Rate, Frustration Index, cohort views, failure mode drill-downs, the whole thing. Check it out at agnost.ai.


TL;DR: Standard analytics tells you what users did. Conversation Health Dashboards tell you whether the AI worked. Build 5 views, run a weekly ritual, give every metric an owner. The teams that do this ship better products faster.
