LLM Observability: Tracing, Logging, and Debugging AI Applications

How to monitor, debug, and optimize LLM applications in production with proper observability tooling

Posted by Perivitta on March 12, 2026 · 19 mins read



Introduction

When you first deploy an LLM application, everything seems to work. Users get responses, the system runs, and the logs show successful API calls.

Then production happens. Users report inconsistent answers. Costs spike unexpectedly. Latency degrades during peak hours. Debugging becomes a nightmare because you cannot see what the model actually received or why it generated a particular response.

Traditional logging and monitoring tools are not built for LLM applications. You need observability specifically designed for tracking prompts, model behavior, retrieval quality, and token usage.

This is where LLM observability comes in.

Observability for LLMs means instrumenting your application to capture traces, logs, and metrics at every step of the execution pipeline. This allows you to debug issues, monitor performance, and optimize costs.

This post explains what LLM observability is, why it matters, and how to implement it using tools like LangSmith, LangFuse, and custom instrumentation.


Why Traditional Monitoring Is Not Enough

Traditional application monitoring focuses on metrics like latency, throughput, error rates, and resource utilization.

These metrics are still important for LLM applications, but they do not capture the behavior that matters most:

  • What prompt was sent to the model?
  • What context was retrieved from the database?
  • What was the model's response?
  • How many tokens were used?
  • Did the retrieval step fail?
  • Why did the model hallucinate?

Without visibility into these aspects, debugging is guesswork.

A user might report that the chatbot gave a wrong answer, but without seeing the exact prompt and retrieved context, you cannot reproduce or fix the issue.


What Is LLM Observability?

LLM observability is the practice of instrumenting your application to capture detailed execution traces that show:

  • The full sequence of steps in a request (retrieval, prompt construction, model inference, post-processing).
  • The inputs and outputs at each step.
  • Metadata like token counts, latency, and costs.
  • Errors and exceptions at each layer.

This goes beyond simple logging. It creates a structured trace that shows exactly how your application processed a request from start to finish.


Core Components of LLM Observability

1. Tracing

Tracing captures the execution flow of a request across multiple components.

For example, a RAG application might have the following trace:

  • User query received.
  • Query embedding generated.
  • Vector search executed.
  • Top 5 documents retrieved.
  • Prompt constructed with retrieved context.
  • LLM inference called.
  • Response returned to user.

Each step is logged with timestamps, inputs, outputs, and metadata.
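This step structure can be captured with a small helper that records inputs, outputs, and timing per step. A minimal sketch (the step names and fields are illustrative, not any specific SDK's API):

```python
import time

class TraceStep:
    """One step in a request trace: name, inputs, outputs, timing."""
    def __init__(self, name, inputs):
        self.name = name
        self.inputs = inputs
        self.outputs = None
        self.start = time.time()
        self.end = None

    def finish(self, outputs):
        # Record the step's outputs and completion time
        self.outputs = outputs
        self.end = time.time()
        return self

trace = []

step = TraceStep("vector_search", {"query": "reset password"})
# ... run the vector search here ...
trace.append(step.finish({"doc_ids": [4, 17, 23], "top_k": 5}))

step = TraceStep("llm_inference", {"prompt_tokens": 412})
# ... call the model here ...
trace.append(step.finish({"completion_tokens": 96}))

# Latency breakdown per step
for s in trace:
    print(f"{s.name}: {1000 * (s.end - s.start):.1f} ms")
```

A real trace collector adds nesting (parent/child spans) and unique IDs on top of this shape.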

2. Logging

Logging stores the raw data at each step: prompts, responses, retrieved documents, errors.

Unlike traditional logs, LLM logs must capture high-dimensional data like embeddings, long-form text, and nested structures.

3. Metrics

Key metrics for LLM applications include:

  • Token usage: Input tokens, output tokens, total tokens per request.
  • Cost: API cost per request, total daily cost.
  • Latency: Time to first token (TTFT), total response time.
  • Retrieval quality: Number of documents retrieved, relevance scores.
  • Error rates: Failed API calls, timeouts, rate limit hits.
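Token counts translate directly into cost, so the two metrics are usually computed together. A sketch of per-request cost accounting (the per-1K-token prices below are illustrative placeholders, not current pricing):

```python
# Per-1K-token prices (illustrative values; check your provider's price sheet)
PRICES = {
    "gpt-4": {"input": 0.03, "output": 0.06},
}

def request_cost(model, input_tokens, output_tokens):
    """Compute the API cost of a single request from its token counts."""
    p = PRICES[model]
    return (input_tokens / 1000) * p["input"] + (output_tokens / 1000) * p["output"]

cost = request_cost("gpt-4", input_tokens=1200, output_tokens=300)
print(f"${cost:.4f}")  # 1.2 * 0.03 + 0.3 * 0.06 = $0.0540
```

Aggregating this per user, session, and endpoint is what makes cost spikes attributable.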

4. User Feedback

Capturing thumbs up/down feedback or explicit user corrections helps identify where the system fails.

This feedback can be linked to specific traces for debugging.
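One way to make that link is to store the trace ID alongside each rating. A minimal sketch (the `trace_id` format and field names are hypothetical):

```python
import time

feedback_log = []

def record_feedback(trace_id, score, comment=None):
    """Attach user feedback (e.g. thumbs up = 1, down = -1) to a trace ID."""
    feedback_log.append({
        "trace_id": trace_id,
        "score": score,
        "comment": comment,
        "timestamp": time.time(),
    })

record_feedback("trace-8f3a", score=-1, comment="Answer cited the wrong policy")

# Later: pull every trace with negative feedback for debugging
bad_traces = [f["trace_id"] for f in feedback_log if f["score"] < 0]
print(bad_traces)
```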


The Observability Stack for LLMs

A production LLM observability stack typically includes:

  • Trace collector: Captures execution traces (LangSmith, LangFuse, OpenTelemetry).
  • Log storage: Stores prompts, responses, and metadata (PostgreSQL, S3, specialized trace databases).
  • Metrics dashboard: Visualizes cost, latency, and usage (Grafana, Datadog, custom dashboards).
  • Alerting: Notifies on anomalies like cost spikes or high error rates (PagerDuty, Slack).

LangSmith: End-to-End Tracing for LangChain

LangSmith is a dedicated observability platform for LangChain applications.

What LangSmith Provides

  • Automatic tracing of LangChain chains and agents.
  • Visualization of execution steps with inputs and outputs.
  • Cost and token tracking per trace.
  • Debugging tools to replay and inspect failed requests.
  • Dataset management for evaluation and testing.

Setting Up LangSmith

import os
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate

# Enable LangSmith tracing via environment variables
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-langsmith-api-key"
os.environ["LANGCHAIN_PROJECT"] = "my-production-app"

# Define a simple chain (LCEL style; LLMChain is deprecated in recent LangChain)
llm = ChatOpenAI(model="gpt-4")
prompt = PromptTemplate.from_template("Translate {text} to French.")
chain = prompt | llm

# Run the chain (automatically traced)
result = chain.invoke({"text": "Hello, how are you?"})
print(result.content)

Every execution is automatically sent to LangSmith with full trace details.

Viewing Traces in LangSmith

The LangSmith dashboard shows:

  • A waterfall view of execution steps.
  • Inputs and outputs at each step.
  • Token counts and API costs.
  • Latency breakdown by component.

You can filter traces by user, session, or time range, and drill down into specific failures.


LangFuse: Open-Source LLM Observability

LangFuse is an open-source alternative to LangSmith.

It supports LangChain, LlamaIndex, and custom LLM applications.

Key Features

  • Self-hosted or cloud-hosted options.
  • Tracing for any Python LLM application.
  • Cost tracking and token usage analytics.
  • Prompt versioning and experimentation.
  • Integration with evaluation frameworks.

Example: Using LangFuse

from langfuse import Langfuse
 
langfuse = Langfuse(
    public_key="your-public-key",
    secret_key="your-secret-key"
)
 
# Start a trace
trace = langfuse.trace(name="user-query")
 
# Log a generation (field names follow the LangFuse v2 Python SDK;
# check the docs for your SDK version)
generation = trace.generation(
    name="gpt4-response",
    model="gpt-4",
    input="What is the capital of France?",
    output="The capital of France is Paris.",
    usage={"input": 10, "output": 8}
)

# Attach the final output to the trace
trace.update(output="The capital of France is Paris.")

LangFuse provides a UI similar to LangSmith for exploring traces and debugging issues.


Custom Instrumentation with OpenTelemetry

For more control, you can build custom tracing using OpenTelemetry.

Why OpenTelemetry?

  • Vendor-neutral standard for distributed tracing.
  • Integrates with existing observability stacks (Jaeger, Zipkin, Datadog).
  • Flexible and extensible.

Example: Tracing an LLM Call

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor
 
# Set up tracing
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)
 
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)
 
# Trace an LLM call
with tracer.start_as_current_span("llm-inference") as span:
    span.set_attribute("model", "gpt-4")
    span.set_attribute("input_tokens", 120)
    span.set_attribute("output_tokens", 80)
    
    # Simulate LLM call
    response = "The answer is 42."
    
    span.set_attribute("response", response)

This creates a trace that can be exported to any OpenTelemetry-compatible backend.


What to Monitor in Production

Cost Tracking

Track API costs per user, session, and endpoint. Set alerts for unexpected cost spikes.

Example: If daily costs exceed $500, trigger an alert.
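A minimal version of such a check (the $500 threshold matches the example above; wiring the message into Slack or PagerDuty is left as a stub):

```python
DAILY_COST_LIMIT = 500.0  # alert threshold in dollars

def check_cost_alert(daily_cost, limit=DAILY_COST_LIMIT):
    """Return an alert message if daily spend exceeds the limit, else None."""
    if daily_cost > limit:
        return f"Cost alert: ${daily_cost:.2f} today exceeds the ${limit:.2f} limit"
    return None

alert = check_cost_alert(612.40)
if alert:
    # In production, post `alert` to a Slack webhook or PagerDuty instead
    print(alert)
```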

Latency Monitoring

Monitor latency at each pipeline stage: retrieval, embedding, inference, post-processing.

Identify bottlenecks and optimize accordingly.

Token Usage

Track average input and output tokens per request. Detect if prompts are growing unexpectedly due to bugs.

Retrieval Quality

Log retrieval scores and document counts. If relevance scores drop, it may indicate embedding drift or index issues.
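A simple drift check compares recent relevance scores against a recorded baseline. A sketch (the baseline and tolerance values are illustrative):

```python
def retrieval_drift(scores, baseline_mean, tolerance=0.1):
    """Flag drift if the mean relevance score drops more than `tolerance` below baseline."""
    mean = sum(scores) / len(scores)
    return mean < baseline_mean - tolerance, mean

# Recent top-document scores vs. a baseline measured at deploy time
drifted, mean = retrieval_drift([0.52, 0.48, 0.55, 0.43], baseline_mean=0.72)
print(drifted, round(mean, 3))  # drift detected: mean 0.495 is well below 0.72
```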

Error Rates

Monitor API errors, timeouts, and rate limits. Set up retries and fallback mechanisms.
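A common retry pattern is exponential backoff with jitter. A minimal sketch (the `flaky` function below simulates a rate-limited API for illustration):

```python
import time
import random

def call_with_retries(fn, max_attempts=3, base_delay=1.0):
    """Retry a flaky call with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error (or fall back)
            # Back off: base, 2x base, 4x base ... plus jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

# Simulated call that fails twice, then succeeds
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("rate limited")
    return "ok"

print(call_with_retries(flaky, base_delay=0.01))
```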


Debugging with Traces

When a user reports an issue, traces allow you to:

  • Replay the exact request with the same inputs.
  • Inspect what context was retrieved.
  • See the final prompt sent to the model.
  • Identify where in the pipeline the failure occurred.

This turns debugging from guesswork into a systematic process.


Privacy and Security Considerations

Logging full prompts and responses can expose sensitive user data.

Best Practices

  • Sanitize or redact PII (personally identifiable information) before logging.
  • Encrypt logs at rest and in transit.
  • Limit access to trace data to authorized personnel.
  • Implement data retention policies to delete old traces.

Some observability platforms offer automatic PII detection and redaction.
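For a custom pipeline, regex-based redaction is a simple first pass before log writes. A sketch (these patterns are illustrative and will miss many PII formats; dedicated detection tools are more robust):

```python
import re

# Illustrative patterns; production systems typically use dedicated PII detectors
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

def redact_pii(text):
    """Replace emails and phone numbers with placeholders before logging."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

prompt = "My email is jane.doe@example.com and my phone is 555-123-4567."
print(redact_pii(prompt))  # My email is [EMAIL] and my phone is [PHONE].
```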


Evaluation and Testing

Observability platforms often integrate with evaluation frameworks.

You can:

  • Create test datasets from production traces.
  • Run regression tests when updating prompts or models.
  • Compare model performance across versions.

This ensures that changes do not degrade quality.


Comparison: Observability Tools

Tool          | Type        | Best For              | Pricing            | Key Features
LangSmith     | Managed     | LangChain apps        | Paid               | Auto-tracing, datasets, evals
LangFuse      | Open-source | Any LLM app           | Free (self-hosted) | Flexible, prompt versioning
Helicone      | Managed     | OpenAI apps           | Freemium           | Proxy-based, cost tracking
OpenTelemetry | Framework   | Custom integrations   | Free               | Vendor-neutral, extensible
Datadog       | Managed     | Enterprise monitoring | Paid               | Full-stack, APM integration

Building a Custom Observability Layer

If you want full control, you can build a custom observability layer.

Components

  • Trace storage: PostgreSQL or a time-series database.
  • Logging middleware: Intercepts LLM calls and logs details.
  • Dashboard: Streamlit, Grafana, or a custom React app.
  • Alerting: Slack webhooks or email notifications.

Example: Simple Logging Middleware

import time
import json
 
class LLMLogger:
    def __init__(self, log_file="llm_logs.jsonl"):
        self.log_file = log_file
    
    def log_request(self, prompt, response, metadata):
        log_entry = {
            "timestamp": time.time(),
            "prompt": prompt,
            "response": response,
            "metadata": metadata
        }
        
        with open(self.log_file, "a") as f:
            f.write(json.dumps(log_entry) + "\n")
 
# Usage
logger = LLMLogger()
 
prompt = "What is the capital of France?"
response = "The capital of France is Paris."
metadata = {"model": "gpt-4", "tokens": 18, "cost": 0.0003}
 
logger.log_request(prompt, response, metadata)

This simple logger writes each request to a JSONL file for analysis.
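Once entries accumulate, the JSONL file can be analyzed offline. A sketch that aggregates cost and token usage (it first writes two sample entries in the same shape the logger above produces):

```python
import json

# Write two sample entries (same shape as the LLMLogger output)
with open("llm_logs.jsonl", "w") as f:
    for meta in ({"model": "gpt-4", "tokens": 18, "cost": 0.0003},
                 {"model": "gpt-4", "tokens": 42, "cost": 0.0011}):
        f.write(json.dumps({"prompt": "...", "response": "...", "metadata": meta}) + "\n")

def summarize_logs(log_file="llm_logs.jsonl"):
    """Aggregate request count, total tokens, and total cost across the log."""
    total_cost, total_tokens, count = 0.0, 0, 0
    with open(log_file) as f:
        for line in f:
            meta = json.loads(line).get("metadata", {})
            total_cost += meta.get("cost", 0.0)
            total_tokens += meta.get("tokens", 0)
            count += 1
    return {"requests": count, "total_tokens": total_tokens, "total_cost": total_cost}

summary = summarize_logs()
print(summary)
```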


Common Pitfalls

Over-Logging

Logging every intermediate step can generate massive amounts of data. Log selectively and implement sampling for high-traffic systems.
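Sampling can be as simple as logging a fixed fraction of requests. A minimal sketch (a production setup would typically also force-log all errors regardless of the sample rate):

```python
import random

class SampledLogger:
    """Log only a fraction of requests to bound storage on high-traffic systems."""
    def __init__(self, sample_rate=0.1, seed=None):
        self.sample_rate = sample_rate
        self.rng = random.Random(seed)
        self.logged = []

    def maybe_log(self, entry):
        # Keep roughly `sample_rate` of all entries
        if self.rng.random() < self.sample_rate:
            self.logged.append(entry)
            return True
        return False

logger = SampledLogger(sample_rate=0.1, seed=42)
kept = sum(logger.maybe_log({"request_id": i}) for i in range(10_000))
print(kept)  # roughly 1,000 of 10,000 requests
```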

Ignoring Privacy

Always sanitize logs to avoid exposing user data. Treat logs as sensitive.

Not Setting Alerts

Logging is useless if no one monitors it. Set up automated alerts for critical issues.


Future of LLM Observability

Emerging trends include:

  • Real-time anomaly detection: Automatically flag unusual behavior in production.
  • AI-powered root cause analysis: Using LLMs to analyze traces and suggest fixes.
  • Integrated evaluation: Running evals in production to catch regressions early.

Conclusion

Observability is not optional for production LLM applications. Without it, you are flying blind.

Whether you use LangSmith, LangFuse, OpenTelemetry, or build a custom solution, the key is to instrument your application to capture every step of execution.

This visibility allows you to debug faster, optimize costs, and build confidence in your system's reliability.

If your LLM application is already in production and you do not have observability, this should be your next priority.


Key Takeaways

  • Traditional monitoring does not capture LLM-specific behavior like prompts and token usage.
  • LLM observability requires tracing execution flows with inputs, outputs, and metadata.
  • LangSmith and LangFuse are popular observability platforms for LLM apps.
  • Track key metrics: cost, latency, token usage, retrieval quality, and error rates.
  • Traces enable systematic debugging by replaying exact requests.
  • Always sanitize logs to protect user privacy.
  • Set up alerts for cost spikes, high error rates, and latency degradation.
