Apr 11, 2026
LLM Observability: Logs, Traces, and Metrics That Actually Matter
Engineering


LLM systems are not just one model behind an endpoint. They are workflows: retrieval, tools, business logic, safety layers, caches, and multiple models glued together. You do not find failures with a couple of dashboards. You find it with traces, structured logs, and metrics designed for this stack.
Daniel Brooks · November 2, 2025 · 16 min read

Most teams building with LLMs know they need "observability."
Very few can answer a basic question: Show me, for yesterday, three real examples where observability changed a decision. If the only thing you can show is token counts and latency percentiles, you do not have observability. You have billing telemetry.

LLM systems are not just one model behind an endpoint. They are workflows: retrieval, tools, business logic, safety layers, caches, and multiple models glued together. Failures are usually not "the model is down." They are "the system is confidently wrong, quietly unsafe, or surprisingly expensive." You do not find that with a couple of dashboards. You find it with traces, structured logs, and metrics designed for this stack.

## WHAT MAKES LLM OBSERVABILITY DIFFERENT

Classical web apps care about:

  • Request rate

  • Latency
  • Error codes
  • Resource usage

LLM systems inherit all of that and add new axes.

**Same endpoint, wildly different cost.** Two requests that look similar at the HTTP level can differ by orders of magnitude in tokens consumed, retrieval load, or number of tool calls. One user's "simple question" might trigger a full RAG pipeline and three API calls; another's might hit a cache.

**Same model, wildly different risk.** An LLM answering internal dev questions on staging is not the same as an LLM drafting messages to regulators. The infrastructure is identical; the acceptable failure modes are not.

**Same output, different quality.** Two answers can both be fluent and plausible; one is correct and grounded, the other is a hallucinated blend. Existing infra metrics do not see that.

Observability has to surface:

  • What the system did for each request (the path taken)
  • What it cost (tokens, tool calls, retrieval fan-out)
  • How risky it was (safety filters, policy hits, external actions)
  • Whether the result was any good (quality and correctness signals)

You cannot bolt that on at the end. You design for it.

## THE CORE UNIT: A TRACE, NOT A MODEL CALL

The basic unit of observability for LLM systems is not "one OpenAI call" or "one model.generate." It is a trace: one end-to-end user request, with every step along the way, labeled and linked. For a typical LLM-backed workflow, a trace might include:

  • User request and metadata (tenant, feature, device, region)
  • Pre-processing steps (classification, routing decisions)
  • Retrieval queries and the documents returned
  • Model calls (prompt, parameters, response summary)
  • Tool calls (arguments, results, errors)
  • Safety filters and their decisions
  • Post-processing and final response

If you cannot reconstruct this path for an arbitrary request ID after the fact, you are debugging blind. A good trace has a few properties:

  • Correlation ID that flows through every component
  • Start and end timestamps per step
  • Structured fields, not free-form logs (you want to query, not grep)
  • Enough context to reason about behavior without violating privacy

Do not log entire prompts and outputs for everything forever. You do not need to hoard raw text to see that a specific workflow is leaking tokens or hammering a tool. Capture summaries and sample raw content selectively for deep dives.

## FOUR CLASSES OF SIGNALS YOU NEED

Once you have traces, you can layer metrics and logs. The signals that matter cluster into four groups.

### 1. Health and performance

You still care about the basics:

  • Latency per endpoint and per key step (retrieval, model, tools)
  • Error rates: timeouts, 4xx/5xx from downstream APIs
  • Resource usage: tokens, CPU, memory, queue depths

But you want them broken down by:

  • Tenant / customer segment
  • Feature (chat vs summarization vs extraction)
  • Model and configuration (provider, version, temperature, context length)

One flat latency histogram for "/chat" is useless when half your traffic is lightweight autocomplete and the other half is 100k-token RAG.

### 2. Cost and efficiency

If you do not monitor cost per unit of value, your margin is guesswork. Raw "spend per day" is not enough. You need:

  • Input tokens, output tokens, and total tokens per request
  • Distribution of context lengths (p50, p95, max)
  • Cache hit rates if you use response or embedding caches
  • Tool call counts and latency per tool type
  • Retrieval fan-out (how many documents / chunks pulled per query)

Then you aggregate by something meaningful:

  • Cost per resolved ticket
  • Cost per document processed
  • Cost per code change merged
  • Cost per user session

You want to see, for example:

  • This tenant uses 3x more tokens per ticket than others. Why?
  • This feature's average model cost is stable, but p99 tokens per request doubled after we changed RAG.
  • This agent workflow makes 8 tool calls on average and 40 in the worst 1% of traces.

### 3. Safety and policy

Safety is not just a content filter bolted on. It is an observable dimension. You should log:

  • How often safety filters fire (input and output)
  • Which categories they flag (hate, self-harm, sexual content, violence, etc.)
  • How often you override or bypass safety layers (for admins, for tests)
  • How many requests hit "refusal" paths vs "answer" paths
  • Tool calls blocked by policy (forbidden actions, out-of-scope requests)

Over time, this tells you:

  • Whether a model or configuration change made safety filters busier or lazier
  • Whether certain tenants, features, or regions see more safety triggers
  • Whether your "refusal rate" spikes for legitimate tasks in some languages or domains

These are signals for both security and user experience. A sudden increase in blocked content in one region may indicate abuse attempts, or it may indicate a misfire in your filters after a change.

### 4. Quality and correctness

The hardest signals are about quality. You cannot afford to have humans rate everything, but you need some structured view. Depending on your product, quality signals might include:

  • User edits: how heavily users edit generated drafts before sending
  • Task outcomes: ticket resolved, code merged, document accepted
  • Explicit feedback: thumbs up/down, ratings, flags
  • Automatic checks: schema conformance, presence of required fields, match against known answers for a subset of requests

You do not need a perfect scalar "quality" metric. You need proxies that correlate with what you care about. Examples:

  • In support automation, measure "human takeover rate" and "post-edit distance" between AI suggestions and final messages.
  • In code assistance, measure "compiles without errors" and "tests pass" for suggestions that were actually used.
  • In document QA, measure "exact or partial match" against known correct spans for a sampled dataset.

Quality is domain-specific. Generic LM benchmarks tell you almost nothing about your deployment.

## DESIGNING YOUR LOGGING SCHEMA

LLM observability fails when everyone logs whatever they feel like. You get a junk drawer of JSON blobs and no ability to group or compare. Define a basic schema for every model call and trace step. Include:

  • trace_id
  • span_id and parent_span_id (for hierarchy)
  • timestamp
  • service / component name
  • model provider, model name, model version
  • prompt_type (system, user, tool, etc.)
  • token_in, token_out
  • latency_ms
  • outcome (success, error, partial)
  • error_type if any (timeout, validation, provider_error, policy_block)

Augment with domain fields:

  • tenant_id / customer_id
  • feature_name
  • task_type (classification, summarization, chat, planning, etc.)
  • risk_level (low, medium, high, derived from your governance stack)

For retrieval:

  • index_name
  • k (requested) and k_effective (after filtering)
  • doc_ids returned (hashes or IDs)
  • per-doc score

For tools:

  • tool_name
  • arguments_summary (truncated)
  • result_summary (truncated)
  • external_error_type if call failed

The point is standardization. When you add a new feature or model, you plug into existing dashboards and queries. You do not reinvent logging shapes every time.

## SAMPLING: YOU CAN'T KEEP EVERYTHING

Full logging of all prompts, contexts, and outputs for every request is a privacy and cost nightmare. You need sampling strategies. Good patterns:

  • Always log structured metadata for every request (tokens, timings, ids).
  • Sample raw text and full traces at a low rate globally (for example, 1 in 1000).
  • Oversample high-risk categories (regulated workflows, tool-using agents, admin actions).
  • Oversample new features and newly deployed models heavily for a while, then reduce if stable.
  • Provide per-tenant log controls: some customers will forbid content logging entirely; others will allow it for better debugging.

Your goal: enough exemplars to investigate problems and train evals, without creating an unnecessary liability and storage bill. Dashboards should be built primarily on structured metadata, not full text. You rarely need the exact prompt to see that one tenant is hitting your 128k context limit on every second request.

## CONNECTING OBSERVABILITY TO EVALS

Observability and evaluation should not live in different universes.

**From evals to monitoring.** The failure cases you identify in offline evals (jailbreaks, grounding errors, formatting failures) should become queries in your logs. For example:

  • If evals showed the model often forgets required disclaimers, monitor for responses missing those in production.
  • If evals showed prompt-injection success for certain patterns, log and search for those patterns in retrieved docs.

**From monitoring to evals.** When you see weird patterns in logs or traces, turn them into evals:

  • A specific prompt that produces a bad answer becomes a test case in your regression suite.
  • A class of RAG failures (wrong policy citations) becomes a targeted eval set for retrieval and answer grounding.

The loop looks like:

  1. Evals reveal fragile behavior.
  2. You deploy mitigations and guardrails.
  3. Observability tells you whether those mitigations hold under real traffic.
  4. New real-world failures feed back into eval sets.

Without that loop, "observability" becomes passive monitoring. You watch graphs drift and occasionally tweak prompts.

## TOOLS VS HOMEGROWN

You can roll your own tracing and logging infrastructure with OpenTelemetry, a time-series database, and some log storage. You can also use dedicated LLM observability products that understand traces, prompts, and model metadata out of the box. The decision is not religious. It hinges on:

  • How many different LLM workflows you have
  • How deeply you want to customize metrics and dashboards
  • Your internal expertise in observability vs your desire to offload it

What you cannot outsource is deciding what you care about. No tool will magically know that "cost per KYC file processed under Policy X in Region Y" is the unit that matters for your business. You have to surface that and wire it into whatever system you use.

## COMMON FAILURE MODES IN OBSERVABILITY

A few patterns show up repeatedly.

**Only infra metrics.** You track CPU, memory, and p95 latency and call it a day. Meanwhile, cost per task doubles and the model starts hallucinating more often after a prompt change. None of your dashboards tell you.

**No per-tenant or per-feature splits.** You aggregate everything. One big graph hides that a single large customer is seeing 10x error rates because their data is weird, or that one feature is driving most of your cost while adding little value.

**Unstructured logging.** You dump prompts and outputs into logs as free text without consistent fields. Queries become painful. Correlations are guesswork. Adding a new provider breaks half your tooling.

**No link to business outcomes.** You stare at token charts and safety counts, but you cannot say whether any of it affects conversions, resolution time, churn, or anything else the business cares about. Observability sits in a silo.

**No ownership.** Everyone can see the dashboards, but no one is on the hook. Incidents get noticed in screenshots, not alerts. Metrics are "interesting" but do not drive roadmaps.

## THE MINIMUM DISCIPLINE THAT WORKS

You do not need a massive platform team to get value. You need a small set of habits.

  • Every user-facing LLM feature emits traces with a common schema and correlation IDs.
  • You maintain a small set of canonical dashboards: health, cost, safety, quality proxies, each sliced by tenant and feature.
  • You define a handful of SLO-like targets: max acceptable latency, max acceptable cost per unit, max refusal rate, max safety hit rate before investigation.
  • You set alerts on these, with clear owners.
  • Every serious incident produces at least one new query or dashboard tile.

The rest can grow over time. If you do nothing else, at least connect token-level metrics to business units: "tokens per resolved ticket," "tokens per page summarized," "tokens per assisted sale." That one step turns an abstract infra bill into something product and finance can reason about.

## WHY THIS MATTERS

LLM systems fail differently than traditional software. They do not crash; they mislead. They do not always throw errors; they quietly drift. They can be "up" from an SRE perspective and completely broken from a user or regulator perspective. Without good observability, you will find out about that drift from screenshots on social media, not from your own dashboards.

Observability is not a luxury. It is how you keep control of systems that are, by design, probabilistic, data-dependent, and constantly evolving. If you want to run those systems in production, you either see them clearly or you fly blind.
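To make the "tokens per resolved ticket" idea concrete, here is a minimal sketch in Python. The record fields mirror the schema discussed above (trace_id, tenant_id, token counts); the `SpanRecord` class, the example data, and the idea of passing in resolved ticket IDs from a ticketing system are illustrative assumptions, not any particular vendor's API.

```python
from dataclasses import dataclass
from collections import defaultdict

# Hypothetical structured log record, mirroring the schema section above.
@dataclass
class SpanRecord:
    trace_id: str
    component: str          # "retrieval", "model", "tool", ...
    tenant_id: str
    feature_name: str
    token_in: int = 0
    token_out: int = 0
    latency_ms: float = 0.0
    outcome: str = "success"  # success | error | partial

def tokens_per_resolved_ticket(spans, resolved_trace_ids):
    """Average total tokens per trace, restricted to resolved tickets.

    In a real system, resolved_trace_ids would come from your ticketing
    system (the business outcome), not from the observability logs.
    """
    tokens_by_trace = defaultdict(int)
    for s in spans:
        tokens_by_trace[s.trace_id] += s.token_in + s.token_out
    resolved = [tokens_by_trace[t] for t in resolved_trace_ids
                if t in tokens_by_trace]
    if not resolved:
        return 0.0
    return sum(resolved) / len(resolved)

# Illustrative data: two traces for the same tenant and feature.
spans = [
    SpanRecord("t1", "model", "acme", "support_chat", token_in=900, token_out=100),
    SpanRecord("t1", "retrieval", "acme", "support_chat"),
    SpanRecord("t2", "model", "acme", "support_chat", token_in=1800, token_out=200),
]
print(tokens_per_resolved_ticket(spans, {"t1", "t2"}))  # 1500.0
```

The same aggregation, grouped additionally by tenant_id or feature_name, is what turns a raw token bill into the per-segment dashboards argued for above.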


Keywords: LLM, Observability, Monitoring, Metrics, Traces, DevOps
