LLM Observability: Tracing, Logging, and Debugging AI Applications

Introduction

When you build a traditional web application and something breaks, you check the logs. You see the exact request that came in, the SQL query that ran, the error that was thrown. You can reproduce the bug and fix it.

LLM applications do not work this way. When a user complains that the chatbot gave a wrong answer, "checking the logs" tells you almost nothing — just that an API call happened and a response was returned. You cannot see what the exact prompt was, what documents were retrieved from the database, whether the model received corrupted context, or why it chose to hallucinate a particular fact.

This is the problem that LLM observability solves. Observability means instrumenting your application so you can see exactly what happened at every step: what prompt was sent, what context was retrieved, how many tokens were used, how long each part took, and what the model responded. Without this visibility, debugging is pure guesswork.

This article explains what LLM observability is, which components you need, and how to set it up using practical tools like LangSmith, LangFuse, and OpenTelemetry.

Why Traditional Monitoring Is Not Enough

Traditional application monitoring tracks metrics like latency, throughput, error rates, and server resource usage. These are still important, but they only tell you that something is slow or broken — not why.

For LLM applications, the questions that matter most are:

What exact prompt was sent to the model?
What documents were retrieved from the vector database?
What was the model's complete response?
How many tokens were used, and what did that cost?
Did the retrieval step return relevant results?
Where in the pipeline did a failure originate?

Suppose a user reports that the chatbot gave a wrong answer. Latency and error-rate dashboards will show a healthy request, because the API call itself succeeded. Without the exact prompt and the retrieved context, you cannot reproduce the issue, let alone fix it.

What Is LLM Observability?

LLM observability is the practice of instrumenting your application to capture traces — structured records of everything that happened during a request, from start to finish.

A trace is like a detailed receipt for a single request. It breaks the request down into individual spans, each representing one step in the pipeline (e.g., "embed the query", "search the vector database", "call the LLM"). Each span records its inputs, outputs, timing, and any metadata.

For a RAG application, a typical trace might look like:

User query received at 10:00:00.000
Query embedding generated (12ms)
Vector search executed, returned 5 documents (45ms)
Prompt constructed with retrieved context
LLM inference called (1,240ms)
Response returned to user at 10:00:01.297

Figure: A trace through an LLM pipeline is a directed acyclic graph: retrieval feeds into prompt construction, which feeds into inference, which feeds into post-processing, with no cycles. Instrumenting each node with timestamps, inputs, outputs, and token counts is what transforms a sequence of API calls into a debuggable observability record. Source: David W. / Wikimedia Commons (Public Domain)
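To make the trace-and-span idea concrete, here is a minimal sketch of a recorder built only on the standard library. The `Trace` and `Span` classes, the stage names, and the attribute keys are illustrative assumptions, not the data model of any particular tool.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    """One step in the pipeline: a name, start/end times, and free-form attributes."""
    name: str
    start: float
    end: float = 0.0
    attributes: dict = field(default_factory=dict)

    @property
    def duration_ms(self) -> float:
        return (self.end - self.start) * 1000


@dataclass
class Trace:
    """All spans recorded for a single request."""
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list = field(default_factory=list)

    @contextmanager
    def span(self, name, **attributes):
        s = Span(name=name, start=time.perf_counter(), attributes=dict(attributes))
        try:
            yield s
        finally:
            # Record the end time even if the step raised an exception.
            s.end = time.perf_counter()
            self.spans.append(s)


# Usage: each pipeline stage is stubbed out; real code would call the
# embedding model, the vector database, and the LLM inside each block.
trace = Trace()
with trace.span("embed_query", query="What is our refund policy?") as s:
    s.attributes["embedding_dim"] = 3
with trace.span("vector_search", top_k=5) as s:
    s.attributes["documents_returned"] = 5
with trace.span("llm_inference", model="example-model") as s:
    s.attributes["output"] = "stub response"

for s in trace.spans:
    print(f"{trace.trace_id[:8]} {s.name}: {s.duration_ms:.3f} ms {s.attributes}")
```

Managed tools add storage, a UI, and nesting of child spans on top of this basic shape.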

Core Components of LLM Observability

1. Tracing

Tracing captures the execution flow of a request across all components. Each step in your pipeline becomes a span with a start time, end time, inputs, outputs, and associated metadata. Traces let you see the full picture of any single request.

2. Logging

Logging stores the raw data at each step: the complete prompts, model responses, retrieved documents, and any errors. Traditional logs record short, structured events; LLM logs must hold large, unstructured text payloads, since a single prompt with retrieved context can run to thousands of characters.

3. Metrics

Metrics are aggregated numbers that give you a high-level view of system health. Key LLM metrics include:

Token usage: Input tokens, output tokens, and total tokens per request.
Cost: API cost per request and total daily cost (critical for budget control).
Latency: Time to first token (TTFT) and total response time.
Retrieval quality: Number of documents retrieved and their relevance scores.
Error rates: Failed API calls, timeouts, and rate limit hits.
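As a rough illustration of how token counts turn into a cost metric, the sketch below computes per-request cost from input and output tokens. The model name and the per-1,000-token prices are made-up placeholders; substitute your provider's published rates.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1,000 tokens. These are NOT real provider rates.
PRICE_PER_1K = {
    "example-model": {"input": 0.01, "output": 0.03},
}


@dataclass
class Usage:
    model: str
    input_tokens: int
    output_tokens: int

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens


def request_cost(usage: Usage) -> float:
    """Cost of one request: input and output tokens are priced separately."""
    price = PRICE_PER_1K[usage.model]
    return (
        usage.input_tokens * price["input"]
        + usage.output_tokens * price["output"]
    ) / 1000


usage = Usage(model="example-model", input_tokens=1200, output_tokens=300)
cost = request_cost(usage)
print(f"{usage.total_tokens} tokens, ${cost:.4f}")
```

Pricing input and output separately matters because output tokens are often billed at a higher rate, so a verbose response can cost more than a long prompt.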

4. User feedback

Capturing thumbs up/down feedback from users helps identify where the system fails in ways that are hard to detect automatically. This feedback can be linked to specific traces, letting you find the exact prompt and context that caused a bad response.
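One simple way to link feedback to traces is to key every feedback event by the trace ID returned with the response. The sketch below keeps everything in memory; the `FeedbackStore` class and its method names are hypothetical, and a real system would persist the events next to the trace data.

```python
from collections import defaultdict


class FeedbackStore:
    """In-memory map from trace ID to the feedback events recorded for it."""

    def __init__(self):
        self._by_trace = defaultdict(list)

    def record(self, trace_id: str, score: int, comment: str = "") -> None:
        # +1 is a thumbs up, -1 is a thumbs down.
        if score not in (-1, 1):
            raise ValueError("score must be +1 or -1")
        self._by_trace[trace_id].append({"score": score, "comment": comment})

    def negative_trace_ids(self) -> list:
        """Trace IDs with at least one thumbs down: the ones worth inspecting first."""
        return [
            trace_id
            for trace_id, events in self._by_trace.items()
            if any(event["score"] < 0 for event in events)
        ]


store = FeedbackStore()
store.record("trace-001", +1)
store.record("trace-002", -1, comment="Cited the wrong policy document")
store.record("trace-003", +1)
print(store.negative_trace_ids())
```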

The Observability Stack for LLMs

A production LLM observability stack typically includes:

Trace collector: Captures execution traces and sends them to storage (LangSmith, LangFuse, OpenTelemetry).
Log storage: Stores prompts, responses, and metadata (PostgreSQL, S3, or specialized trace databases).
Metrics dashboard: Visualizes cost, latency, and usage trends (Grafana, Datadog, or custom dashboards).
Alerting: Notifies you when something goes wrong — cost spike, high error rate, or unusual latency (PagerDuty, Slack webhooks).

LangSmith: End-to-End Tracing for LangChain

LangSmith is a managed observability platform from the LangChain team. It integrates most tightly with LangChain, where tracing requires almost no code changes, and it can also trace non-LangChain code through its SDK.

What LangSmith provides

Automatic tracing of all LangChain chains and agents — no manual instrumentation needed.
A waterfall view of execution steps with inputs and outputs at each step.
Cost and token tracking per trace.
Debugging tools to replay and inspect failed requests.
Dataset management for evaluation and regression testing.

Setting up LangSmith

As of LangChain version 0.2 and later, tracing is configured via environment variables. You no longer need to add callback handlers to your code — just set the variables, and every LangChain call in your application is automatically traced:

import os

# ChatOpenAI lives in the separate langchain-openai package and reads OPENAI_API_KEY
# from the environment.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate

# Tracing is configured via environment variables rather than callback handlers.
# In a real deployment, set these in the shell or a secrets manager instead of
# hard-coding the key.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-langsmith-api-key"
os.environ["LANGCHAIN_PROJECT"] = "your-project-name"

# Define a simple chain by piping the prompt into the model
llm = ChatOpenAI(model="gpt-4")
prompt = PromptTemplate.from_template("Translate {text} to French.")
chain = prompt | llm

# Run the chain (automatically traced)
result = chain.invoke({"text": "Hello, how are you?"})
print(result.content)

After running your application, open the LangSmith dashboard to see a waterfall view of execution steps, token counts, costs, and latency broken down by component. You can filter by user, session, or time range, and drill into any trace to see the exact prompt that was sent.

LangFuse: Open-Source LLM Observability

LangFuse is an open-source alternative to LangSmith. It works with LangChain, LlamaIndex, and any custom LLM application. Because it is open source, you can self-host it if you need to keep data on your own infrastructure.

Key features

Self-hosted or cloud-hosted options — useful when data privacy is a concern.
Tracing for any Python LLM application, not just LangChain.
Cost tracking and token usage analytics.
Prompt versioning: track how prompt changes affect quality over time.
Integration with evaluation frameworks for automated quality testing.

Example: Using LangFuse

The snippet below shows manual trace creation, useful when you are not using LangChain and need to instrument a custom LLM pipeline directly:

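from langfuse import Langfuse

# This uses the low-level client of the Langfuse Python SDK v2. Later SDK versions
# expose a different interface, so check the docs for the version you have installed.
langfuse = Langfuse(
    public_key="your-public-key",
    secret_key="your-secret-key",
    host="https://cloud.langfuse.com",  # or the URL of your self-hosted instance
)

# Start a trace
trace = langfuse.trace(name="user-query", input="What is the capital of France?")

# Log a generation (one LLM call) inside the trace
generation = trace.generation(
    name="gpt4-response",
    model="gpt-4",
    input="What is the capital of France?",
    output="The capital of France is Paris.",
    usage={"input": 10, "output": 8, "unit": "TOKENS"},
)

# Attach the final output to the trace
trace.update(output="The capital of France is Paris.")

# Events are sent in the background; flush before a short-lived script exits
langfuse.flush()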

Custom Instrumentation with OpenTelemetry

OpenTelemetry is a vendor-neutral, open standard for distributed tracing. If your team already uses Jaeger, Zipkin, or Datadog for tracing traditional services, you can extend that same infrastructure to cover your LLM application.

Why OpenTelemetry?

Vendor-neutral: works with any tracing backend.
Integrates with existing observability stacks.
Fully extensible — you define exactly what each span captures.

Example: Tracing an LLM call

The example below creates a single span covering one LLM call and exports it to the console. In production, replace ConsoleSpanExporter with your preferred backend (Jaeger, Datadog, etc.):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Set up tracing
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)

# Trace an LLM call
with tracer.start_as_current_span("llm-inference") as span:
    span.set_attribute("model", "gpt-4")
    span.set_attribute("input_tokens", 120)
    span.set_attribute("output_tokens", 80)

    # Simulate LLM call
    response = "The answer is 42."

    span.set_attribute("response", response)

What to Monitor in Production

Cost tracking

Track API costs broken down by user, session, and endpoint. Set automated alerts for unexpected spikes. A common trigger: if daily cost exceeds a threshold (e.g., $500), alert the on-call engineer immediately. LLM API costs can balloon unexpectedly due to prompt bugs that inflate token counts.
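The daily-threshold alert described above can be sketched in a few lines. The `DailyCostMonitor` class is hypothetical, and `alert_fn` stands in for whatever notification channel you use (a pager, a chat webhook, email).

```python
from collections import defaultdict


class DailyCostMonitor:
    """Accumulates cost per day and per user, and alerts once per day on a breach."""

    def __init__(self, threshold_usd: float, alert_fn):
        self.threshold_usd = threshold_usd
        self.alert_fn = alert_fn
        self.daily_total = defaultdict(float)
        self.per_user = defaultdict(float)
        self._alerted_days = set()

    def record(self, day: str, user_id: str, cost_usd: float) -> None:
        self.daily_total[day] += cost_usd
        self.per_user[(day, user_id)] += cost_usd
        over = self.daily_total[day] > self.threshold_usd
        if over and day not in self._alerted_days:
            # Alert only once per day so a sustained overrun does not spam on-call.
            self._alerted_days.add(day)
            self.alert_fn(
                f"LLM cost for {day} is ${self.daily_total[day]:.2f}, "
                f"above the ${self.threshold_usd:.2f} threshold"
            )


alerts = []
monitor = DailyCostMonitor(threshold_usd=500.0, alert_fn=alerts.append)
monitor.record("2024-01-15", "user-a", 320.0)
monitor.record("2024-01-15", "user-b", 210.0)  # crosses the threshold
monitor.record("2024-01-15", "user-a", 50.0)   # still over, but no second alert
print(alerts)
```

Keeping the per-user breakdown alongside the daily total makes it possible to tell whether a spike comes from one runaway session or from overall traffic growth.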

Latency monitoring

Monitor latency at each stage of your pipeline separately: embedding, retrieval, inference, and post-processing. This lets you pinpoint bottlenecks precisely. If total latency spikes but LLM inference time is unchanged, the problem lies in one of the other stages, most often retrieval or embedding.
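Given per-stage timings, locating the regressed stage is a comparison against a baseline. The function below is a minimal sketch; the stage names, the millisecond figures, and the 1.5x tolerance are illustrative assumptions.

```python
def find_regressed_stages(baseline_ms: dict, current_ms: dict, tolerance: float = 1.5):
    """Return (stage, slowdown_factor) pairs for stages slower than tolerance x baseline."""
    regressed = []
    for stage, now in current_ms.items():
        base = baseline_ms.get(stage)
        if base is not None and base > 0 and now > base * tolerance:
            regressed.append((stage, now / base))
    # Worst slowdown first.
    return sorted(regressed, key=lambda item: item[1], reverse=True)


baseline = {"embedding": 12, "retrieval": 45, "inference": 1240, "post_processing": 5}
current = {"embedding": 13, "retrieval": 410, "inference": 1255, "post_processing": 5}

regressed = find_regressed_stages(baseline, current)
for stage, factor in regressed:
    print(f"{stage} is {factor:.1f}x slower than baseline")
```

Here total latency rose by several hundred milliseconds while inference barely moved, and the comparison singles out retrieval.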

Token usage

Track average input and output tokens per request. If you see a sudden increase in average prompt length, it likely indicates a bug — perhaps a loop is appending context incorrectly, or a retrieval step is returning too many documents.
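A sudden rise in average prompt length can be caught by comparing a rolling mean against a known baseline. The sketch below assumes you already have a baseline average from a healthy period; the class name, window size, and 1.5x ratio are arbitrary illustrative choices.

```python
from collections import deque
from statistics import mean


class PromptLengthMonitor:
    """Flags when the rolling mean of input tokens exceeds ratio x baseline."""

    def __init__(self, baseline_mean: float, window: int = 20, ratio: float = 1.5):
        self.baseline_mean = baseline_mean
        self.ratio = ratio
        self.recent = deque(maxlen=window)

    def observe(self, input_tokens: int) -> bool:
        self.recent.append(input_tokens)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough data for a stable average yet
        return mean(self.recent) > self.ratio * self.baseline_mean


monitor = PromptLengthMonitor(baseline_mean=800, window=5, ratio=1.5)
normal_traffic = [790, 810, 805, 795, 800]
buggy_traffic = [2400, 2400, 2400, 2400, 2400]  # e.g. context appended repeatedly
flags = [monitor.observe(tokens) for tokens in normal_traffic + buggy_traffic]
print(flags)
```

Using a rolling mean instead of a per-request check avoids alerting on a single legitimately long prompt.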

Retrieval quality

Log the relevance scores of retrieved documents. If scores drop significantly, it may indicate an embedding mismatch (queries are being embedded with a different model or version than the indexed documents) or a stale index (the vector database was not updated after the underlying data changed).

Error rates

Monitor API errors, timeouts, and rate limit hits. Set up automatic retries with exponential backoff for transient failures, and alert on sustained high error rates.
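Retries with exponential backoff can be sketched as follows. `TransientError` is a hypothetical stand-in for whatever timeout or rate-limit exception your client library raises, and the delay parameters are illustrative defaults. The `sleep` argument is injectable so the example runs without actually waiting.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a provider's timeout or rate-limit error."""


def call_with_retries(fn, max_attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn(), retrying on TransientError with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and let the caller (and the error-rate metric) see it
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Add random jitter so many clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))


# Usage: a fake call that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_llm_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("rate limited")
    return "ok"


delays = []
result = call_with_retries(flaky_llm_call, sleep=delays.append)
print(result, attempts["count"], delays)
```

Only errors known to be transient should be retried; retrying on every exception can hide real bugs and inflate cost.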

Debugging with Traces

When a user reports a bad response, traces let you systematically investigate:

Find the exact trace for that user's request.
Inspect what context was retrieved from the database.
See the complete final prompt that was sent to the model.
Identify which stage of the pipeline produced the bad output.

This turns debugging from guesswork into a reproducible, systematic process — the same way a good stack trace makes debugging a crashed program tractable.
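If traces are stored as JSON Lines, the first step (finding the right trace) is a filtered scan. The sketch below assumes each record has a `metadata` object containing `user_id` and `request_id` fields; those field names are hypothetical and should match whatever your logging layer writes.

```python
import json
import os
import tempfile


def find_traces(log_path: str, **filters) -> list:
    """Return log entries whose metadata matches every key=value filter."""
    matches = []
    with open(log_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)
            metadata = entry.get("metadata", {})
            if all(metadata.get(key) == value for key, value in filters.items()):
                matches.append(entry)
    return matches


# Usage with a small throwaway log file.
records = [
    {"prompt": "Refund policy?", "response": "30 days.",
     "metadata": {"user_id": "u1", "request_id": "r1"}},
    {"prompt": "Shipping time?", "response": "2 weeks.",
     "metadata": {"user_id": "u2", "request_id": "r2"}},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "llm_logs.jsonl")
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    matches = find_traces(path, user_id="u2")

for entry in matches:
    print(entry["prompt"], "->", entry["response"])
```

A linear scan is fine for small files; at production volume the same lookup would normally be an indexed query in the trace store.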

Privacy and Security Considerations

Full prompt logging can capture sensitive user data. Treat logs as sensitive data from day one.

Best practices

Sanitize or redact PII (personally identifiable information — names, emails, phone numbers) before logging, using regex rules or a dedicated PII detection library.
Encrypt logs at rest and in transit.
Limit access to trace data to authorized engineers.
Implement data retention policies: for example, delete traces older than 90 days unless they are needed for compliance.
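A regex-based redaction pass, as mentioned in the first practice above, might look like the sketch below. The two patterns are deliberately simple assumptions that cover common email and phone formats only. They will miss names and many other kinds of PII, which is where a dedicated detection library comes in.

```python
import re

# Simplified patterns for illustration; real-world PII is far more varied.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d(?!\w)")


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholder tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


sample = "Contact me at jane.doe@example.com or +1 (555) 010-9999 about my order."
redacted = redact_pii(sample)
print(redacted)
```

Running redaction before the log write, not as a later cleanup job, means unredacted text never reaches storage.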

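Some observability platforms, including LangSmith and LangFuse, provide hooks for masking or redacting inputs and outputs before they are stored. Check each tool's current documentation for what is redacted automatically and what you must configure yourself.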

Evaluation and Testing

Observability platforms often integrate with evaluation frameworks, enabling you to create test datasets from real production traces, run automated regression tests when you change prompts or models, and compare quality metrics across model versions. This is how you ensure that a prompt change or model upgrade does not silently degrade quality for users.
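A regression test over cases harvested from traces can start very small. The sketch below checks that each output still contains required phrases. The case format, the `must_contain` check, and the stubbed `generate` function are assumptions for illustration; real evaluations usually use richer scoring.

```python
def run_regression(cases: list, generate) -> list:
    """Run each case through generate() and return the cases that fail."""
    failures = []
    for case in cases:
        output = generate(case["prompt"])
        missing = [
            phrase for phrase in case["must_contain"]
            if phrase.lower() not in output.lower()
        ]
        if missing:
            failures.append({"prompt": case["prompt"], "missing": missing, "output": output})
    return failures


# Cases would normally be built from production traces that were reviewed as correct.
cases = [
    {"prompt": "What is the capital of France?", "must_contain": ["Paris"]},
    {"prompt": "Translate 'hello' to French.", "must_contain": ["bonjour"]},
]


def stub_generate(prompt: str) -> str:
    """Stand-in for the real pipeline after a prompt or model change."""
    answers = {"What is the capital of France?": "The capital of France is Paris."}
    return answers.get(prompt, "I am not sure.")


failures = run_regression(cases, stub_generate)
for failure in failures:
    print(f"FAILED: {failure['prompt']} (missing {failure['missing']})")
```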

Comparison: Observability Tools

Tool	Type	Best For	Pricing	Key Features
LangSmith	Managed	LangChain apps	Free tier, paid plans	Auto-tracing, datasets, evals
LangFuse	Open-source	Any LLM app	Free (self-hosted)	Flexible, prompt versioning
Helicone	Managed	OpenAI apps	Freemium	Proxy-based, cost tracking
OpenTelemetry	Framework	Custom integrations	Free	Vendor-neutral, extensible
Datadog	Managed	Enterprise monitoring	Paid	Full-stack, APM integration

Building a Custom Observability Layer

If you want full control without a third-party platform, you can build a lightweight logging layer. This is a good starting point for teams that are just getting started or have strict data residency requirements.

Components you need

Trace storage: PostgreSQL or a time-series database.
Logging middleware: A Python class that wraps every LLM call and logs details.
Dashboard: Streamlit, Grafana, or a custom React app for visualization.
Alerting: Slack webhooks or email notifications for anomalies.

Example: Simple logging middleware

The following class writes each LLM call to a JSONL file (one JSON object per line), which is easy to parse for analysis later:

import time
import json

class LLMLogger:
    def __init__(self, log_file="llm_logs.jsonl"):
        self.log_file = log_file

    def log_request(self, prompt, response, metadata):
        log_entry = {
            "timestamp": time.time(),
            "prompt": prompt,
            "response": response,
            "metadata": metadata
        }

        with open(self.log_file, "a") as f:
            f.write(json.dumps(log_entry) + "\n")

# Usage
logger = LLMLogger()

prompt = "What is the capital of France?"
response = "The capital of France is Paris."
metadata = {"model": "gpt-4", "tokens": 18, "cost": 0.0003}

logger.log_request(prompt, response, metadata)
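To have the logging layer wrap each LLM call instead of being invoked by hand, one option is a decorator that times the call and records failures as well as successes. This is a sketch under assumptions: the decorator name, the `latency_ms` and `error` fields, and the stubbed `fake_llm` function are all illustrative.

```python
import functools
import json
import os
import tempfile
import time


def logged_llm_call(log_file="llm_logs.jsonl", **static_metadata):
    """Decorator that appends one JSONL record per call, including latency and errors."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(prompt, *args, **kwargs):
            start = time.perf_counter()
            entry = {
                "timestamp": time.time(),
                "prompt": prompt,
                "metadata": dict(static_metadata),
            }
            try:
                response = fn(prompt, *args, **kwargs)
                entry["response"] = response
                return response
            except Exception as exc:
                entry["error"] = repr(exc)
                raise
            finally:
                # Runs on both success and failure, so every call leaves a record.
                entry["metadata"]["latency_ms"] = (time.perf_counter() - start) * 1000
                with open(log_file, "a") as f:
                    f.write(json.dumps(entry) + "\n")
        return wrapper
    return decorator


# Usage with a stubbed model call and a throwaway log file.
log_path = os.path.join(tempfile.mkdtemp(), "llm_logs.jsonl")


@logged_llm_call(log_file=log_path, model="example-model")
def fake_llm(prompt):
    return "The capital of France is Paris."


fake_llm("What is the capital of France?")

with open(log_path) as f:
    entries = [json.loads(line) for line in f]
print(entries[0]["metadata"])
```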

Common Pitfalls

Over-logging

Logging every intermediate state can generate enormous amounts of data, increasing storage costs and making it harder to find the logs that matter. Log selectively, and implement sampling for high-traffic systems (e.g., log 10% of successful requests but 100% of failures).
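The sampling rule in the example above (keep every failure, keep roughly 10% of successes) can be sketched with a hash of the trace ID. Hashing is an assumption chosen here so that the decision is deterministic: every service that sees the same trace ID makes the same keep-or-drop choice.

```python
import hashlib


def should_log(trace_id: str, success: bool, success_sample_rate: float = 0.10) -> bool:
    """Always keep failures; keep a deterministic fraction of successful requests."""
    if not success:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < success_sample_rate


kept = sum(should_log(f"trace-{i}", success=True) for i in range(10_000))
print(f"kept {kept} of 10000 successful traces")
print(should_log("trace-42", success=False))
```

A random draw per request would give the same average rate, but it could keep only part of a multi-service trace.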

Ignoring privacy

Always sanitize logs before storing them. User queries often contain personal information that you may be legally required to protect, depending on your jurisdiction and industry. Build PII redaction into your logging layer from day one; it is much harder to add retroactively.

Logging without alerting

Logs are useless if no one looks at them. Set up automated alerts for critical signals — cost spikes, elevated error rates, unusual latency — so issues surface before users notice them.

Conclusion

Observability is not optional for production LLM applications. Without it, every bug is a mystery and every cost spike is a surprise.

Start with the simplest tool that works for your stack: LangSmith if you are using LangChain, LangFuse if you need open-source flexibility or self-hosting. Capture traces, monitor costs and latency, and set alerts for anomalies. Once you have this foundation, add evaluation workflows to catch quality regressions before they reach users.

Key Takeaways

Traditional monitoring does not capture LLM-specific behavior; you need traces that record the exact prompt, retrieved context, token counts, and model output at every step.
LangSmith and LangFuse are the most practical starting points: LangSmith for LangChain apps (auto-tracing enabled through a few environment variables), LangFuse for open-source flexibility or self-hosting.
Monitor cost, latency, token usage, retrieval quality, and error rates; set automated alerts so anomalies surface before users notice them.
Always sanitize logs to protect user privacy, and treat trace data with the same care as any other sensitive production data.

