Your Voice AI Agent Thinks Every Call Went Well. It's Wrong.

QA scores say your voice AI is performing. Sentiment says callers are happy. A case study running four analytics pipelines on the same calls tells a very different story, and shows what your current metrics are missing.

A caller threatens legal action over suspected fraud. Their voice is steady, controlled. Your QA platform scores the call “positive quality”: agent followed the script, latency was low, all tools fired correctly. Meanwhile, the LLM reading the transcript flags deep frustration. The acoustic model hears a calm voice. Which one is right?

All of them. And that’s the problem.

This is the gap nobody talks about when shipping voice AI agents. You instrument the technical layer, set up QA, check the sentiment score, see “positive,” and ship the next feature. But you have no idea how your callers actually felt. At scale, that matters a lot more than whether your TTS latency was under 400ms.


What a customer’s four-pipeline experiment revealed

One of our customers, a team running a voice AI agent across multilingual call center workflows, wanted to pressure-test their analytics setup. Were they actually measuring caller experience, or just call quality?

They ran four independent analysis pipelines across 11 real call recordings: an LLM (Claude) doing text analysis, Deepgram’s acoustic model, Hume AI’s hybrid text-plus-prosody pipeline, and the QA platform they’d been using in production.

The result was uncomfortable.

The QA platform rated every single call as “positive quality.” Every one. Across all 11 calls, across 7 languages, across wildly different caller situations: billing disputes, investment fraud victims, aggressive sales calls, someone threatening legal action. Positive. Positive. Positive.

The other three pipelines told a completely different story. They flagged conflict, frustration, or behavioral anomalies in all 11 calls. The average agreement score across the four pipelines was 0.38 out of 1.0.
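The article doesn't say how the customer computed that agreement score, but one simple way to get a number like it is average pairwise agreement: for each call, count how many of the six pipeline pairs assigned the same label. A hedged sketch (pipeline names and labels are illustrative, not the customer's actual data):

```python
from itertools import combinations

# Hypothetical per-call labels from four analysis pipelines; the real
# experiment's labels and exact scoring method are not published.
PIPELINES = ["llm_text", "acoustic", "hybrid", "qa_platform"]

def pairwise_agreement(labels: dict[str, str]) -> float:
    """Fraction of pipeline pairs that assigned the same label to a call."""
    pairs = list(combinations(PIPELINES, 2))
    matches = sum(labels[a] == labels[b] for a, b in pairs)
    return matches / len(pairs)

# Example: QA says positive while the other three diverge.
call = {
    "llm_text": "frustrated",
    "acoustic": "neutral",
    "hybrid": "frustrated",
    "qa_platform": "positive",
}
print(round(pairwise_agreement(call), 2))  # 1 matching pair out of 6 -> 0.17
```

Averaging this over all calls gives a single agreement figure; a low value means the pipelines are systematically seeing different things, which is exactly the signal this experiment surfaced.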

Same calls. Completely different conclusions. That’s not measurement noise.


The QA score is measuring the wrong thing

QA platforms measure operational quality. Did the agent follow the script? Were there hallucinations? Did the tool calls resolve correctly? Was latency low? That's the full picture of what a QA score is telling you.

It’s not telling you whether your caller was satisfied, or calm, or quietly furious and already planning to churn. A call can be technically flawless while the caller is three days away from disputing the charge. The QA platform won’t flag that. It wasn’t built to.


Why different signals catch completely different things

Text analysis catches explicit frustration: “this is unacceptable,” “I want to speak to a manager,” “I’m considering legal action.” An LLM reading the transcript will surface these. But text misses everything that isn’t said directly. A caller staying polite while seething? Text often won’t catch that.

Acoustic analysis catches what voice reveals that words don’t: pitch, tempo, energy, pace. But it has real blind spots. In this customer’s experiment, the acoustic model returned “neutral” for almost every non-English call. That’s a known limitation of most speech analytics tools; their sentiment models are trained heavily on English and degrade significantly on other languages. If you’re running voice AI at any international scale, leaning on acoustic signals alone will give you a systematically distorted picture.
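One practical response to that blind spot is to trust acoustic sentiment less on languages the model handles poorly, rather than treating all scores equally. A minimal sketch, assuming per-language confidence weights you would calibrate yourself (the weights and language codes here are illustrative, not any vendor's published coverage):

```python
# Discount acoustic sentiment for languages where the acoustic model
# is weakly trained. Weights are illustrative assumptions.
ACOUSTIC_CONFIDENCE = {"en": 1.0, "es": 0.6, "fr": 0.5}
DEFAULT_CONFIDENCE = 0.3  # treat unlisted languages as low-trust

def blended_sentiment(text_score: float, acoustic_score: float, lang: str) -> float:
    """Blend text and acoustic sentiment (both in -1..1), weighting
    acoustics by how much the model can be trusted for this language."""
    w = ACOUSTIC_CONFIDENCE.get(lang, DEFAULT_CONFIDENCE)
    return (text_score + w * acoustic_score) / (1 + w)

# A French caller: text is strongly negative, acoustics read "neutral".
# The blend leans on the text instead of averaging the neutral reading in.
print(round(blended_sentiment(-0.8, 0.0, "fr"), 2))  # -0.53
```

A plain average of the two signals would have pulled this call halfway toward "neutral"; down-weighting the unreliable channel keeps the frustration visible.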

The behavioral layer is where things get more interesting.


Venting, suppressed frustration, and open anger

When you look at text and acoustic signals together, patterns show up that no single metric can surface. Three patterns showed up repeatedly across this customer's calls.

Venting is when a caller uses frustrated language but sounds calm. The person who's been dealing with a billing issue for two weeks, has called before, and is now stating their frustration plainly but professionally. Not raising their voice. The acoustic model registers neutral. But the transcript is unambiguous about where this caller stands. Four out of 11 calls showed this pattern, including a French caller who had lost money to an investment scam and described the situation in controlled, detailed terms while the content of what they said made the distress unmistakable.

Suppressed frustration is a milder mismatch, but persistent. One call in the set involved a customer being pitched a pharmaceutical product they repeatedly declined. Voice stayed neutral throughout. Text was clearly resistant. The QA platform scored it positive quality. The reality was a caller who had been worn down and was refusing with decreasing patience: a real escalation risk that looked completely fine by every surface metric.

Openly negative is the one case where text and voice both agreed on frustration. A caller disputing what they believed was investment fraud, discussing potential legal action. Both the linguistic analysis and the acoustic model flagged genuine distress. This call had the lowest agreement score in the dataset (0.20 out of 1.0), meaning the pipelines disagreed on it more than on any other call. With text and voice aligned on distress, the outlier was the QA platform: it still scored the call positive.

None of these patterns show up if you’re only looking at one signal, and all three represent callers who are either already gone or on their way out.
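The three patterns reduce to a simple rule over the two signals. A hypothetical sketch, with thresholds and labels chosen purely for illustration (the customer's actual pipelines are more involved than a pair of cutoffs):

```python
def behavioral_pattern(text_score: float, acoustic_score: float) -> str:
    """Classify the text/acoustic delta into the patterns above.
    Scores run from -1 (negative) to 1 (positive); the -0.2 and -0.6
    thresholds are illustrative, not calibrated values."""
    text_neg = text_score < -0.2
    voice_neg = acoustic_score < -0.2
    if text_neg and voice_neg:
        return "openly_negative"         # both signals agree on frustration
    if text_score < -0.6 and not voice_neg:
        return "venting"                 # strong frustration in words, calm voice
    if text_neg and not voice_neg:
        return "suppressed_frustration"  # mild resistance, neutral voice
    return "aligned"                     # no text/acoustic mismatch

print(behavioral_pattern(-0.8, 0.1))   # venting
print(behavioral_pattern(-0.4, 0.0))   # suppressed_frustration
print(behavioral_pattern(-0.7, -0.5))  # openly_negative
```

The point of the sketch is structural: each pattern is defined by the relationship between the two scores, so neither score alone can produce it.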


Four signals, four different questions

The experiment makes the separation between signal types pretty clear. Text tells you what the caller said: explicit frustration, stated complaints, things like "I want to escalate this" or "this has happened three times now." Acoustic tells you how they sounded: tone, pace, pitch, the stuff words don't carry. The delta between the two is where the behavioral patterns live: a caller who sounds fine but is saying very frustrated things, or one whose voice is tense while the words stay polite. And QA tells you whether the agent technically worked.

All four are real. None of them are the same. Most teams in production are running one, maybe two of these, and treating that as the full picture.

Running only QA on your voice AI is like measuring a restaurant by whether orders came out on time. True, trackable, and almost entirely beside the point if the food wasn’t good.


What this means for your voice AI setup

What this customer found isn’t a special case. Teams running voice AI across any volume of calls are almost certainly looking at a QA score that says “fine” while a portion of their callers are frustrated, suppressing it, or venting to an agent that has no idea. The only way to know is to measure across multiple signals.

That’s what Agnost does. Instead of picking one signal and hoping it generalizes, Agnost gives you the full picture at the conversation level: text sentiment, acoustic patterns, behavioral flags, and technical QA in one place, across every language your agents operate in. You can see which calls produced venting patterns. You can see where your agent gave a technically correct response that still landed badly. You can catch suppressed frustration before that customer churns, not two weeks later when they’ve already left.

The teams pulling ahead in voice AI aren’t the ones with the most sophisticated agents. They’re the ones with enough visibility into their calls to actually iterate on what’s going wrong. Without that, you’re optimizing on QA scores and hoping for the best.

See how Agnost works


The data is already there

Every call your voice agent handles is generating text, audio, behavioral signals, and operational metrics. The question isn’t whether the information exists. The question is whether you’re actually reading it, or whether you’ve set up a QA tool, watched it return positive across the board, and moved on.

This customer’s experiment was a forcing function. They ran the same calls through four different lenses, got four different answers, and that gap told them more about their callers than months of QA scores had. You don’t necessarily need four pipelines. But you do need more than one.


Want to see what this looks like across your own voice agent calls? Talk to us.

