We Dug Into Claude Code’s Source Code. Anthropic Built a Full Frustration Detection System.
Most people using Claude Code have no idea how much Anthropic is watching the experience.
Not in a creepy way. In the “we actually care if this thing is working” way. Because buried in Claude Code’s source is a surprisingly sophisticated stack of frustration detection, satisfaction labeling, and behavioral improvement loops. It’s the kind of instrumentation most agent-builders would love to have. And the wild part is, none of it is available to you when you deploy Claude (or any LLM) in your own product.
Let me show you what they actually built.

^ me opening useIssueFlagBanner.ts for the first time
The regex layer: frustration detection at the message level
The first thing that caught my eye was useIssueFlagBanner.ts. It’s a hook that scans every incoming user message against a list of correction-tone phrases. If a match fires, a banner pops up offering to file a GitHub issue.
The phrases it’s scanning for:
"no, that's wrong"
"not what I asked"
"why did you"
"try again"
"undo that"
"that's not right"
"you misunderstood"
This is a blunt instrument, but it’s effective. You don’t need ML to catch “why did you do that” — that’s a frustrated user every single time. Regex gets you 80% of the signal for basically zero latency cost.
The smarter move here isn’t just showing a banner. It’s logging that signal. How often are those phrases firing? Which intents trigger them most? What percentage of sessions contain at least one correction-tone message? That’s your product health scorecard.
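A minimal sketch of what a detector plus per-session logging could look like, using the phrase list above. The function names and regex shapes here are my illustration, not Anthropic’s actual code:

```typescript
// Correction-tone phrases from the list above, as case-insensitive regexes.
const CORRECTION_PHRASES: RegExp[] = [
  /no,? that'?s wrong/i,
  /not what i asked/i,
  /why did you/i,
  /try again/i,
  /undo that/i,
  /that'?s not right/i,
  /you misunderstood/i,
];

// Does a single message read like the user is correcting the agent?
function isCorrectionTone(message: string): boolean {
  return CORRECTION_PHRASES.some((re) => re.test(message));
}

// The "log the signal" half: fraction of a session's messages that match.
function correctionRate(messages: string[]): number {
  if (messages.length === 0) return 0;
  return messages.filter(isCorrectionTone).length / messages.length;
}
```

Trend `correctionRate` per session over time and you have the health scorecard described above, for the cost of a few regex tests per message.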
The analytics event layer: is_negative flags on every prompt
userPromptKeywords.ts is where it gets more interesting. Anthropic fires a tengu_input_prompt analytics event for every user message — and one of the properties on that event is is_negative: true if the message contains frustrated language.
The negative keyword list includes things like:
"wtf"
"this sucks"
"fucking broken"
"terrible"
"useless"
"what is this"
So Anthropic is tracking, at scale, what percentage of Claude Code prompts are coming in with negative sentiment. Across their entire user base. Every day.
Think about what that data lets you do. You can see if a new model version caused frustrated-prompt rates to spike. You can segment by task type and see which ones generate the most negative input. You can correlate frustrated-prompt rate with 30-day retention and build a leading indicator.
If you’re building an AI product and you don’t have an equivalent of this running, you’re flying blind.
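The equivalent is a few lines of event-enrichment code. Here’s a hedged sketch using the keyword list above — the event name and field names are assumptions modeled on the `tengu_input_prompt` event described, not its real schema:

```typescript
// Negative-sentiment keywords from the list above (substring match).
const NEGATIVE_KEYWORDS = [
  "wtf", "this sucks", "fucking broken", "terrible", "useless", "what is this",
];

interface PromptEvent {
  event: string;        // hypothetical event name, not Anthropic's
  prompt_length: number;
  is_negative: boolean; // the flag that makes the event useful
}

// Enrich every prompt analytics event with an is_negative flag.
function buildPromptEvent(message: string): PromptEvent {
  const lower = message.toLowerCase();
  return {
    event: "input_prompt",
    prompt_length: message.length,
    is_negative: NEGATIVE_KEYWORDS.some((kw) => lower.includes(kw)),
  };
}
```

Aggregate `is_negative` by day, model version, and task type, and you get the spike detection and retention correlation described above.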
The survey layer: 0.5% sampling with session upload
Claude Code runs a periodic feedback survey shown to 0.5% of sessions. Three options: bad, fine, good. Simple.
But the interesting part is what happens when you rate it “bad”. The system can upload the full conversation transcript to Anthropic’s API for human review. That’s a qualitative signal collection loop, not just an NPS score.
The 0.5% sample rate is smart. High enough to get statistical signal. Low enough not to annoy users. And the bad-rating transcript upload means your qualitative review queue is pre-filtered to the sessions that actually went wrong.
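The whole mechanism fits in two small functions. This is a sketch under the assumptions stated in the text (0.5% sampling, bad-rating transcript upload); the `upload` callback stands in for whatever review-queue API you have:

```typescript
// Show the survey to ~0.5% of sessions. The random source is injectable
// so the behavior is testable.
function shouldShowSurvey(rand: () => number = Math.random): boolean {
  return rand() < 0.005;
}

type Rating = "bad" | "fine" | "good";

// Only "bad" sessions ship their transcript to the human review queue,
// so the queue arrives pre-filtered to sessions that went wrong.
function handleRating(
  rating: Rating,
  transcript: string[],
  upload: (t: string[]) => void,
): void {
  if (rating === "bad") upload(transcript);
}
```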

^ every PM who’s only collecting NPS scores and wondering why they can’t diagnose what’s wrong
The LLM layer: session-level satisfaction labeling
This is the one that made me stop and re-read it twice.
Claude Code has an internal /insights command that triggers a second LLM call. That call reads the entire session transcript and labels the session with one of five satisfaction states:
frustrated
dissatisfied
likely_satisfied
satisfied
happy
And then it goes further, categorizing the friction type if the session was negative:
misunderstood_request
wrong_approach
buggy_code
user_rejected_action
excessive_changes
That last list is gold. “Excessive changes” and “wrong approach” are completely different failure modes that require completely different fixes. One is a calibration problem. The other might be a context window problem or a system prompt problem. But if you’re not labeling sessions this way, you can’t even ask the right questions.
This is what production agent analytics actually looks like. Not just “session duration” and “message count” — but labeled session outcomes with friction category breakdowns.
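One way to wire this up is a prompt builder plus a strict parser for the model’s reply, using the two taxonomies above. Everything beyond the label names is my assumption — the prompt wording and JSON shape are illustrative, not what `/insights` actually sends:

```typescript
type Satisfaction =
  | "frustrated" | "dissatisfied" | "likely_satisfied" | "satisfied" | "happy";
type Friction =
  | "misunderstood_request" | "wrong_approach" | "buggy_code"
  | "user_rejected_action" | "excessive_changes";

interface SessionLabel {
  satisfaction: Satisfaction;
  friction?: Friction; // only expected when the session was negative
}

const SATISFACTIONS = new Set<string>([
  "frustrated", "dissatisfied", "likely_satisfied", "satisfied", "happy",
]);
const FRICTIONS = new Set<string>([
  "misunderstood_request", "wrong_approach", "buggy_code",
  "user_rejected_action", "excessive_changes",
]);

// Prompt for the second LLM call that reads the transcript.
function buildLabelPrompt(transcript: string): string {
  return (
    "Label this session. Reply with JSON: " +
    '{"satisfaction": one of frustrated|dissatisfied|likely_satisfied|satisfied|happy, ' +
    '"friction": (only if negative) one of misunderstood_request|wrong_approach|' +
    "buggy_code|user_rejected_action|excessive_changes}\n\n" +
    transcript
  );
}

// Validate the model's reply against the taxonomy before it enters analytics.
function parseSessionLabel(raw: string): SessionLabel {
  const obj = JSON.parse(raw);
  if (!SATISFACTIONS.has(obj.satisfaction)) {
    throw new Error(`unknown satisfaction: ${obj.satisfaction}`);
  }
  if (obj.friction !== undefined && !FRICTIONS.has(obj.friction)) {
    throw new Error(`unknown friction: ${obj.friction}`);
  }
  return obj as SessionLabel;
}
```

The strict parser matters: a labeling pipeline that silently accepts typo’d categories will quietly corrupt your friction breakdowns.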
We built something very close to this at Agnost, and it’s one of the features our customers use most. Looking at your top 10% frustrated sessions by friction category tells you more in 10 minutes than a week of log diving.
The behavior loops: tool rejection hints and skill improvement
Two more things worth calling out.
When a user cancels or rejects a Claude action, the system injects a hint into the context: “The user’s next message may contain a correction… consider saving that to memory.” It’s a subtle behavior nudge that turns a rejection event into a learning moment.
And every 5 turns, a side-channel LLM call scans the conversation for explicit user corrections (“no, do X instead”, “always use Y”) and proposes updates to Claude’s behavior. This is an automated skill improvement loop running silently in the background.
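The cadence logic for a loop like that is trivial — which is part of the point. A sketch, with the function name and the check-before-scan shape being my own framing:

```typescript
// Fire the background correction scan every N turns (5, per the text).
function shouldScanForCorrections(turn: number, cadence = 5): boolean {
  return turn > 0 && turn % cadence === 0;
}
```

The expensive part is the side-channel LLM call itself; gating it to every fifth turn keeps the cost of the improvement loop negligible.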
The ESC key press is also tracked. How many times did the user interrupt Claude mid-response? High interruption counts are a really clean signal that something’s off — either the model is going in the wrong direction or responses are too long.
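Interruption tracking is the cheapest signal of the bunch. A minimal sketch — the class name and threshold are illustrative defaults, not anything from Claude Code’s source:

```typescript
// Count mid-response cancellations (e.g. ESC presses) per session and
// flag sessions that cross a friction threshold.
class InterruptionTracker {
  private count = 0;

  recordInterrupt(): void {
    this.count += 1;
  }

  get interruptions(): number {
    return this.count;
  }

  // Threshold of 3 is an arbitrary starting point; tune it on your data.
  isHighFriction(threshold = 3): boolean {
    return this.count >= threshold;
  }
}
```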
The gap that affects everyone building on top of LLMs
Here’s the thing though.
All of this instrumentation lives inside Claude Code. If you’re building your own agent — a customer support bot, a coding assistant, a research tool, whatever — you get none of it. You get the raw API. You get tokens in, tokens out. You get latency if you instrument it yourself.
Anthropic built this feedback stack for their own product. They can see when Claude Code is frustrating users at scale. They can see which friction categories are trending up after a model change. They can route bad sessions for human review automatically.
You can’t. Unless you build it yourself.
That’s the gap Agnost closes. Frustration signal detection, satisfaction scoring, session-level friction labeling, correction pattern tracking — across all your agents, without building any of this infrastructure yourself. We process millions of agent calls through Agnost, and the patterns we see across customers are remarkably consistent: the teams that instrument for frustration catch problems 2 to 3 weeks before they show up in churn data.
The teams that don’t are reverse-engineering lost users with nothing but message logs.
What to actually do with this
If you’re building an AI agent product today, here’s the minimum viable version of what Anthropic shipped in Claude Code:
Regex layer: Scan every user message for correction-tone phrases. Log the match rate per session. Trend it over time.
Negative sentiment flags: Add an is_negative property to your prompt analytics events. Track what percentage of your prompts are coming in frustrated. This number should go down as your product matures. If it’s not, something is wrong.
Session satisfaction labeling: After each session ends, run a cheap LLM call to label the session outcome. You don’t need a complex taxonomy to start — just good, neutral, frustrated gets you 80% of the value.
Interruption tracking: If your agent does multi-step work, track how often users cancel mid-execution. High rates here mean your agent is making poor planning decisions.
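The four pieces above collapse naturally into one per-session scorecard. A sketch of the shape — the field names, thresholds, and outcome heuristic here are illustrative defaults, not a prescription:

```typescript
interface SessionScorecard {
  correctionMatches: number; // regex layer
  negativePrompts: number;   // is_negative flags
  interruptions: number;     // mid-execution cancels
  outcome: "good" | "neutral" | "frustrated";
}

// Crude fallback heuristic for when you haven't wired up the LLM labeler
// yet: bucket the session by total friction events.
function scoreSession(
  correctionMatches: number,
  negativePrompts: number,
  interruptions: number,
): SessionScorecard {
  const frictionEvents = correctionMatches + negativePrompts + interruptions;
  const outcome =
    frictionEvents === 0 ? "good" : frictionEvents <= 2 ? "neutral" : "frustrated";
  return { correctionMatches, negativePrompts, interruptions, outcome };
}
```

Once the LLM labeler exists, swap it in for the heuristic and keep the same scorecard shape downstream.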
None of this is rocket science. Anthropic just did it, and now you can see exactly what they built.

^ you, after instrumenting your agent with actual frustration detection instead of just session counts
Wrapping up
The thing that stuck with me after going through this source code isn’t any single feature. It’s the philosophy. Anthropic treated Claude Code as a product that needed to earn user trust on every session, not just the first one. They built a whole measurement stack around that belief.
If you’re building AI products and you’re only measuring whether the API call succeeded, you’re measuring the wrong thing. The question isn’t “did the agent respond?” It’s “did the user get what they actually needed, and how did they feel about it?”
Anthropic knows the answer for Claude Code users. Do you know it for yours?
If you want the same visibility for your agents without building it from scratch, that’s exactly what Agnost is for. Check it out.
TL;DR: Anthropic built regex frustration detection, negative sentiment analytics, LLM-powered session satisfaction labeling, and automated skill improvement loops into Claude Code. Agent builders deploying LLMs in their own products have none of this by default. Now you know what to build — or just use Agnost.