Observability for AI products is the same as observability for any other product, plus two extra dimensions: how much each request cost, and whether the output was actually good. Without both, an AI feature can quietly degrade in production and you can't tell why. This post covers the observability stack we use on AI-product engagements and why each piece matters.
The stack
Request lifecycle
What each stream captures
Three streams. Application traces flow to your APM (Datadog, Honeycomb). LLM-specific events flow to an LLM observability tool (LangSmith, Helicone, Langfuse). Output samples flow to a quality review process.
Application-level traces
These are the same traces you'd have for any web service. OpenTelemetry, instrumented through your framework, sent to Datadog or Honeycomb or Grafana Tempo.
The trace shows the HTTP request in, database queries, external API calls (including the LLM provider), cache hits and misses, background jobs spawned, and the HTTP response out with status and latency.
For AI features specifically, the LLM call shows up as a child span. The parent trace correlates it to the request.
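As a concrete sketch, here's what that looks like with the OpenTelemetry Python API. The span names, attributes, and the call_llm helper are illustrative placeholders, not a convention your APM requires.

```python
# Minimal sketch: the LLM call recorded as a child span inside the request trace.
# Span names, attributes, and call_llm are illustrative placeholders.
from opentelemetry import trace

tracer = trace.get_tracer("ai-feature")

def summarise_inbox(user_id: str, messages: list[str]) -> str:
    # Feature-level span; your framework's HTTP instrumentation usually
    # creates the request span above this one automatically.
    with tracer.start_as_current_span("summarise_inbox") as span:
        span.set_attribute("user.id", user_id)

        # Child span for the provider call, so its latency and errors
        # show up in context inside the same trace.
        with tracer.start_as_current_span("llm.chat_completion") as llm_span:
            llm_span.set_attribute("llm.model", "provider-model-name")
            response_text = call_llm(messages)  # your provider client wrapper
            llm_span.set_attribute("llm.response_chars", len(response_text))

        return response_text
```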
LLM-specific observability
The application trace knows that an LLM call happened. It doesn't know what prompt was sent, what the response was, which model was used, what the cost was, or what the eval score is.
LLM observability tools (LangSmith, Helicone, Langfuse) capture all of this. You wrap your LLM client in the tool's SDK. Every call now logs prompt, response, model, tokens in/out, cost, latency, and custom metadata. A dashboard shows trends, error rates, slow calls, expensive calls.
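Rather than show any one SDK's API, here's the shape of what gets logged per call; record_llm_call and the response/usage fields are placeholders for whichever tool and provider client you use.

```python
# Illustrative only: the fields an LLM observability tool records per call.
# record_llm_call and the response/usage attributes are placeholders for
# your chosen SDK and your provider client's response shape.
import time

def traced_completion(client, model: str, prompt: str, feature: str, request_id: str):
    start = time.monotonic()
    response = client.complete(model=model, prompt=prompt)  # provider call
    latency_ms = (time.monotonic() - start) * 1000

    record_llm_call(                 # placeholder for the SDK's logging call
        request_id=request_id,       # ties the record back to the APM trace
        feature=feature,             # enables cost-per-feature breakdowns
        model=model,
        prompt=prompt,
        response=response.text,
        tokens_in=response.usage.input_tokens,
        tokens_out=response.usage.output_tokens,
        latency_ms=latency_ms,
    )
    return response
```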
Pick one. Don't run two LLM observability tools; the overhead isn't worth it.
Cost tracking
Cost is the category that catches teams off guard most often. Plan for it carefully.
What to track:
- Dollars per request. Surface this in the LLM tool.
- Dollars per feature. Tag every call with a feature name. The "summarise inbox" feature should have its own cost line.
- Dollars per customer. For per-seat pricing or enterprise plans, you want cost-per-customer to detect outliers.
- Daily total. With an alert if it doubles in 24 hours.
- Token-cache hit rate. If you're using prompt caching (Anthropic, OpenAI), track how often it's working.
Cost overruns in AI products usually look like one of two things: a prompt change that accidentally triggers chain-of-thought reasoning on every call and doubles the token count, or a customer using a feature in an unexpected way that fires 100 expensive calls per session. Both are detectable in minutes if you have cost-per-feature and cost-per-customer dashboards.
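As a sketch of how those dashboards get their numbers, the snippet below rolls tagged call logs up into cost per feature and per customer. The per-token prices are placeholders, not real rates.

```python
# Sketch: roll tagged LLM call logs up into cost-per-feature and
# cost-per-customer. The per-1K-token prices are placeholders; use your
# provider's actual rates.
from collections import defaultdict

PRICE_PER_1K_TOKENS = {"input": 0.003, "output": 0.015}  # illustrative only

def call_cost(tokens_in: int, tokens_out: int) -> float:
    return (tokens_in / 1000) * PRICE_PER_1K_TOKENS["input"] + (
        tokens_out / 1000
    ) * PRICE_PER_1K_TOKENS["output"]

def cost_breakdown(calls: list[dict]) -> tuple[dict, dict]:
    """Each call dict carries the feature and customer tags logged at call time."""
    by_feature = defaultdict(float)
    by_customer = defaultdict(float)
    for call in calls:
        cost = call_cost(call["tokens_in"], call["tokens_out"])
        by_feature[call["feature"]] += cost
        by_customer[call["customer_id"]] += cost
    return dict(by_feature), dict(by_customer)
```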
Output quality sampling
This is the dimension that distinguishes AI observability from regular observability.
You can have great latency, low cost, and silently degrading output quality. The user complains. You can't tell why.
The pattern:
- Sample 1 to 5 percent of production AI outputs.
- Score each one with an LLM-as-judge or send to a human reviewer.
- Track the score over time.
- Alert if the rolling 7-day score drops by a meaningful amount.
Two ways to do the scoring:
- LLM-as-judge is cheap and scales but introduces its own noise. Useful for trending.
- Human review is the ground truth. 30 to 100 samples per day reviewed by a senior team member.
Most teams do both. LLM-as-judge for the live dashboard, human review for the rolling weekly check.
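A minimal sketch of that loop, assuming a hypothetical judge_output LLM-as-judge call and a store_quality_sample sink; the sample rate and routing threshold are illustrative.

```python
# Sketch of the sampling-and-scoring loop. judge_output and store_quality_sample
# are placeholders for your LLM-as-judge call and your metrics/review store.
import random

SAMPLE_RATE = 0.02  # 2 percent, inside the 1-5 percent range above

def maybe_sample_for_review(request_id: str, prompt: str, output: str) -> None:
    if random.random() > SAMPLE_RATE:
        return
    score = judge_output(prompt, output)      # e.g. a 1-5 rubric score
    store_quality_sample(
        request_id=request_id,                # links back to the other streams
        score=score,                          # feeds the rolling 7-day average
        needs_human_review=(score <= 2),      # route low scores to the human queue
    )
```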
Connecting the streams
The three streams (application traces, LLM observability, quality sampling) connect via a shared request_id or trace_id. Everything you log includes it.
When an incident happens, the application trace shows what failed, the LLM observability tool shows the prompt and response, and the quality sample (if this request was sampled) shows the score.
Cross-tool linking is the unlock; LLM observability tools often have a "link to APM trace" feature for exactly this.
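One way to do the stamping, assuming OpenTelemetry is your tracing layer; the metadata keys are whatever your LLM observability tool accepts.

```python
# Sketch: stamp the active OpenTelemetry trace ID onto the LLM call's metadata
# so the APM trace and the LLM observability record share an ID.
from opentelemetry import trace

def current_trace_id() -> str:
    ctx = trace.get_current_span().get_span_context()
    return format(ctx.trace_id, "032x")  # hex form, as APMs display it

# Pass alongside every LLM call; key names depend on your observability tool.
llm_call_metadata = {"trace_id": current_trace_id(), "feature": "summarise_inbox"}
```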
Alerts that matter
Resist the temptation to alert on everything. Page on:
- Error rate spike. Standard APM alert.
- Cost spike. Daily total above 2x the rolling 7-day average (sketched in code below).
- Quality drop. Sampled quality score down meaningfully.
- Model provider outage. Errors specifically from your LLM provider.
- P95 latency spike. AI features get slower as prompts grow. Catch it before users do.
Five alerts. Routable. Not 50.
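The cost-spike check from the list, as a sketch; the daily totals would come from your LLM observability tool's cost export.

```python
# Sketch of the cost-spike alert: page when today's spend exceeds 2x the
# rolling 7-day average.
def cost_spike(daily_totals: list[float]) -> bool:
    """daily_totals: the last 8 days of spend, oldest first (today last)."""
    today = daily_totals[-1]
    baseline = sum(daily_totals[-8:-1]) / 7
    if baseline == 0:
        return today > 0  # any spend on a previously idle feature is worth a look
    return today > 2 * baseline
```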
What to watch in the first 90 days of a new AI feature
- Cost per request. Often higher than your pre-launch estimates.
- Cost per feature. Often dominated by one feature with a long prompt.
- Token usage distribution. A long tail of expensive calls is normal, but watch how far it stretches.
- Failure rate. Some prompts trigger model refusals or content-policy violations.
- Latency P95 vs P50. AI calls have a heavy tail (see the sketch after this list).
- User feedback signals (thumbs up/down, clicks on regenerate).
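A quick way to eyeball that tail from a batch of latency samples, as a sketch; what counts as too heavy is yours to calibrate.

```python
# Sketch: compare P50 and P95 from latency samples to see how heavy the tail is.
import statistics

def latency_tail(latencies_ms: list[float]) -> tuple[float, float, float]:
    cuts = statistics.quantiles(latencies_ms, n=100)  # 99 percentile cut points
    p50, p95 = cuts[49], cuts[94]
    return p50, p95, p95 / p50  # the ratio is how much slower the worst 5 percent are
```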
Common mistakes
- No cost-per-feature tagging. You can't tell which feature is expensive.
- Sampling at 100 percent. Expensive and pointless. 1 to 5 percent is enough to see the trend.
- No quality sampling. You'll find out about degradation from customers.
- LLM observability without app observability. You see slow AI calls but not why the surrounding request was slow.
- Choosing alerts based on what's easy to alert on. Pick alerts based on what actually predicts user pain.
How Hashorn helps with AI observability
Hashorn provides MLOps, DevOps, and CI/CD services that include the observability stack above. We wire LLM observability into the application traces, set up cost monitoring, build the quality sampling pipeline, and tune the alerts. For AI-first product teams, this is part of every engagement, not a separate workstream.
Conclusion
AI product observability in 2026 is three streams (application traces, LLM-level cost and usage, output quality) connected by a shared trace ID and watched through five well-tuned alerts. Skip any of the three and you're flying partially blind. Build them once; every AI feature you ship over the next two years is smoother to operate because of it.
Need help building AI-powered software, QA automation, or secure cloud systems?
Talk to Hashorn's engineering team. Dedicated senior engineers, QA, and security with same-week ramp.