
How to Instrument Your AI Agent With OpenTelemetry in 2 Minutes

A practical guide to instrumenting your AI agent with OpenTelemetry: which spans, attributes, and conventions to emit for prompts, tool calls, and outcomes.

You instrument an AI agent with OpenTelemetry by wrapping each user turn, model call, and tool call in spans, then attaching the GenAI semantic-convention attributes (model name, token counts, prompts, tool inputs and outputs) to those spans. Point the SDK at an OTLP endpoint and your traces start flowing. The whole thing takes about two minutes once the SDK is in your dependency tree.

The harder part isn’t wiring it up. It’s deciding what to emit so the traces are actually useful later, when you’re trying to figure out why an agent stalled on turn 7 or quietly stopped calling a tool it should have. This guide covers the spans and attributes that matter, the GenAI semantic conventions, and a few things production teams learn the expensive way.

Why OpenTelemetry instead of a vendor SDK?

Because your agent will outlive whatever observability vendor you pick this quarter.

OpenTelemetry (OTel) gives you a vendor-neutral wire format (OTLP) and a stable set of semantic conventions. You instrument once. After that you can fan the same traces out to Jaeger for local debugging, a managed backend for production, and a data layer for understanding behavior, all without touching your agent code again.

The OTel project also ships an official set of GenAI semantic conventions specifically for LLM and agent workloads. That means the span names and attribute keys for “an LLM was called” or “a tool ran” are standardized. Tools downstream know how to read them without you writing a custom parser per backend. The GenAI conventions are still marked as in development, so names can change between releases; check which version your SDK and backend target.

What spans should an LLM agent emit?

Think in terms of the unit of work, not the unit of code. A single user message often kicks off a chain: the agent reasons, calls a tool, gets a result, reasons again, then responds. Each of those is a span, and they nest.

A reasonable span hierarchy for one agent turn:

  • agent.turn (root span for the turn): the full request/response cycle for one user message
    • gen_ai.chat: a single LLM call (one round-trip to the model)
    • tool.execute: one tool/function invocation
    • gen_ai.chat: the follow-up model call after the tool returned
  • agent.session (optional parent): groups every turn in a conversation under one trace or links them via a shared session ID

The session-level grouping is the one people skip, and then regret. Per-turn traces tell you a single call was slow. Session-level grouping tells you the user asked the same question three different ways before giving up. One of those is an observability problem. The other is a product problem, and it’s usually the more expensive one.

Which attributes actually matter?

Here’s a minimal but useful set, aligned with the GenAI conventions. Names follow the gen_ai.* namespace.

| Attribute | Goes on | Why you want it |
|---|---|---|
| gen_ai.system | LLM span | Which provider (openai, anthropic, etc.) |
| gen_ai.request.model | LLM span | Model + version, for regression hunting |
| gen_ai.usage.input_tokens | LLM span | Cost and context-bloat tracking |
| gen_ai.usage.output_tokens | LLM span | Cost, verbosity drift |
| gen_ai.prompt | LLM span (event) | The actual input. The thing you’ll reread at 2am |
| gen_ai.completion | LLM span (event) | What the model actually said |
| gen_ai.tool.name | tool span | Which tool, for failure-rate-by-tool |
| session.id | all spans | Stitch a conversation together |
| user.id | all spans | Per-user behavior, churn analysis |

A note on prompts and completions: in the current conventions these are usually emitted as span events or as opt-in body content, not always as plain attributes, partly for size and partly for privacy reasons. Whatever your SDK version does, capture them. An agent trace without the actual prompt and completion is a stack trace with the error message redacted. Technically present, practically useless.
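Capturing content does not have to mean capturing it unconditionally. Here is a minimal sketch of a helper that builds the attributes for a prompt or completion event, with an opt-out switch, a size cap, and a crude email scrub. The environment variable AGENT_CAPTURE_CONTENT, the 4000-character limit, and the content.* attribute names are assumptions made for this example, not part of OpenTelemetry or the GenAI conventions.

```python
import os
import re

# Hypothetical switch and limit; neither is an OpenTelemetry setting.
CAPTURE_CONTENT = os.environ.get("AGENT_CAPTURE_CONTENT", "true").lower() == "true"
MAX_CHARS = 4000

# Deliberately simple pattern; real PII scrubbing needs more than this.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def content_attributes(text, capture=CAPTURE_CONTENT, max_chars=MAX_CHARS):
    """Build the attribute dict for a prompt or completion span event."""
    if not capture:
        # Keep the size so the trace still shows that content existed.
        return {"content.length": len(text), "content.captured": False}
    scrubbed = EMAIL.sub("[email]", text)
    return {
        "content": scrubbed[:max_chars],
        "content.length": len(text),
        "content.captured": True,
        "content.truncated": len(scrubbed) > max_chars,
    }
```

A span event would then be emitted as, for example, llm.add_event("gen_ai.completion", content_attributes(resp.text)), which keeps the redaction and size policy in one place.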

Show me the code

Python, using the OTel SDK. This is the manual version so you can see what’s happening. Auto-instrumentation libraries do most of this for you, but you should know what they’re emitting.

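# Requires the opentelemetry-sdk and opentelemetry-exporter-otlp-proto-http packages.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# The batch processor queues finished spans and exports them in the background.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="https://your-collector/v1/traces"))
)
# Register the provider globally, then get a tracer named after this component.
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("my-agent")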

That’s the three-ish lines of setup. Now the part that makes the data worth having: wrapping an actual turn.

def handle_turn(user_msg, session_id, user_id):
    # Root span for the turn; spans opened inside it become its children.
    with tracer.start_as_current_span("agent.turn") as turn:
        turn.set_attribute("session.id", session_id)
        turn.set_attribute("user.id", user_id)
        turn.add_event("user.message", {"text": user_msg})

        # One span per model round-trip.
        with tracer.start_as_current_span("gen_ai.chat") as llm:
            llm.set_attribute("gen_ai.system", "openai")
            llm.set_attribute("gen_ai.request.model", "gpt-4.1")
            llm.add_event("gen_ai.prompt", {"content": user_msg})
            resp = call_model(user_msg)  # your model call
            # Token counts are only known once the response is back.
            llm.set_attribute("gen_ai.usage.input_tokens", resp.usage.input)
            llm.set_attribute("gen_ai.usage.output_tokens", resp.usage.output)
            llm.add_event("gen_ai.completion", {"content": resp.text})

        # Sibling of the model span, still a child of agent.turn.
        if resp.tool_call:
            with tracer.start_as_current_span("tool.execute") as tool:
                tool.set_attribute("gen_ai.tool.name", resp.tool_call.name)
                result = run_tool(resp.tool_call)
                tool.set_attribute("tool.success", result.ok)

        # Product-level label: what the turn achieved, not how long it took.
        turn.set_attribute("agent.outcome", classify_outcome(resp))
        return resp

The one line that punches above its weight is agent.outcome. Latency and tokens tell you what the machine did. Outcome tells you whether the user got what they came for. Did the turn resolve the request, escalate to a human, hit a tool error, or just trail off? You can compute it cheaply (a small classifier, a heuristic, or a deferred label) and it turns a pile of spans into something you can actually reason about.
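To make the heuristic option concrete, here is a minimal sketch of an outcome classifier. The TurnResult fields and the four labels are assumptions for illustration; your agent will expose different signals, and a deferred label or small classifier can replace the rules later without changing the attribute name.

```python
from dataclasses import dataclass


@dataclass
class TurnResult:
    """Hypothetical summary of one turn; adapt the fields to your agent."""
    text: str
    tool_failed: bool = False
    handed_off: bool = False


def classify_outcome(result: TurnResult) -> str:
    """Map a finished turn to one coarse, low-cardinality outcome label."""
    if result.tool_failed:
        return "tool_error"
    if result.handed_off:
        return "escalated"
    if not result.text.strip():
        return "no_response"
    # "answered" only means the agent replied; whether the request was
    # truly resolved usually needs a later signal from the user.
    return "answered"
```

Keeping the label set small matters more than getting the rules perfect: a handful of stable values can be grouped and charted, while free-form strings cannot.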

If you’d rather not hand-roll any of this, the OpenTelemetry ecosystem has auto-instrumentation for OpenAI, Anthropic, LangChain, and friends. Add the package, enable its instrumentor, and the model-call spans show up automatically (the exact span names depend on the library and version). You still own the agent.turn and agent.outcome semantics, because only you know what “success” means for your product.

Common mistakes that make traces useless

A few patterns I see over and over:

  1. Capturing latency but not content. You’ll know the call took 4 seconds. You won’t know it returned garbage. Capture prompts and completions.
  2. No session ID. Every turn is an island. You can’t see the user fighting your agent across five messages.
  3. No outcome label. Without it, you can debug performance but not behavior. Most agent failures are behavior failures.
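  4. Sampling out the interesting traces. Head-based sampling decides before a trace has finished, so a rate that drops most “normal” traffic also drops most of the slow, failing, weird sessions you most want. Use tail-based sampling, which decides once the trace is complete, so you keep errors and outliers instead of throwing them away.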
  5. Logging PII into spans by accident. Prompts contain user data. Scrub or redact at the SDK boundary, set retention deliberately, and decide this before legal asks.
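Item 4 comes down to a keep-or-drop decision made once a trace is complete, when its errors, duration, and outcome are known. Below is a minimal sketch of such a policy. The TraceSummary record, the thresholds, and the "answered" label are all made up for the example, and in practice this logic usually lives in a collector-side tail-sampling stage, not in agent code.

```python
import random
from dataclasses import dataclass


@dataclass
class TraceSummary:
    """Hypothetical summary of a finished trace."""
    duration_ms: float
    has_error: bool
    outcome: str


def keep_trace(trace, slow_ms=3000.0, baseline_rate=0.05, rng=random.random):
    """Decide, after the trace has finished, whether to keep it."""
    if trace.has_error:
        return True  # always keep failures
    if trace.duration_ms >= slow_ms:
        return True  # always keep slow outliers
    if trace.outcome != "answered":
        return True  # always keep turns that did not end normally
    # Keep only a small random share of ordinary traffic.
    return rng() < baseline_rate
```

The rng parameter exists so the random branch can be tested deterministically; the default behaves like ordinary probabilistic sampling for the unremarkable traces.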

From traces to a self-improving agent

Once conversations and traces flow over OTLP, you have the substrate for something bigger than dashboards. The same spans that help you debug latency also describe, in structured form, every place a user got stuck, asked for a feature, or hit a setup snag.

That’s where this stops being plumbing and starts being a feedback loop. Agnost AI reads those conversations over the OpenTelemetry data you’re already emitting, auto-generates custom intents for your product (bug reports, feature requests, churn risk, setup friction), and tracks them live to surface why users stall or won’t upgrade. Then it opens pull requests against your system prompts, agent harness, and W&B configs to fix what it found. You review and merge. Because it’s standard OTel, there’s no separate instrumentation to maintain: the traces you built for debugging become the input to actually improving the agent.

The point of instrumenting well isn’t pretty waterfall charts. It’s making the agent’s behavior legible enough that you, and your tooling, can change it on purpose.

FAQ

Do I need OpenTelemetry if my LLM provider already gives me a dashboard?

Provider dashboards show you that provider’s calls. They don’t see your tool executions, your retries, your routing across multiple models, or your business outcome. OTel captures the whole turn end to end, in a format you own.

Will instrumenting add latency to my agent?

Negligible, if you use the batch span processor (which exports asynchronously off the hot path). The expensive part is network export, and batching makes that a background concern. Don’t use the simple/synchronous processor in production.

What’s the difference between the GenAI semantic conventions and just making up my own attributes?

You can make up your own, and for product-specific things like agent.outcome you should. But for standard concepts (model name, token usage, tool calls), the gen_ai.* conventions mean downstream tools parse your data without custom adapters. Standard where you can, custom where you must.

If you’re already emitting OpenTelemetry traces, you’re most of the way to closing the loop between what users do and what your agent ships next. Agnost AI plugs into that same data to turn it into reviewed pull requests, and it’s free to start (no credit card, no sales call) whenever you want to see it on your own agent.