RAG Observability with Langfuse, vLLM, and FAISS

In this lesson, you will learn how to instrument every step of your Retrieval-Augmented Generation (RAG) pipeline using Langfuse, capture traces across ingestion, retrieval, and generation, and understand exactly how your system behaves under the hood.

You will wire tracing into your retriever and generator, monitor latency and token usage, evaluate quality scores, and run the entire stack with vLLM and FAISS locally so you can experiment freely without any cloud dependencies.

By the end, you will have a fully transparent RAG workflow that you can debug, optimize, and scale with confidence.

This lesson is the last in a 3-part series on LLM observability with Langfuse:

LLM Observability with Self-Hosted Langfuse and vLLM
Manual Tracing, Scores, and Evaluation with Langfuse (Self-Hosted)
RAG Observability with Langfuse, vLLM, and FAISS (this tutorial)

To learn how to make your RAG pipeline fully observable with Langfuse, vLLM, and FAISS, just keep reading.

Introduction to Production-Grade RAG and LLM Observability

What Makes a RAG Pipeline Production-Grade

A RAG pipeline becomes “production-grade” only when it consistently delivers correct, stable, and explainable outputs under real-world constraints. In development, it is easy to get an LLM to answer questions using retrieved context. In production, the challenges multiply: retrieval quality varies, embeddings may shift over time, documents evolve, and latency budgets tighten. A production RAG pipeline must remain robust even when the input data is noisy, queries are unpredictable, and traffic is high.

A production-ready RAG system must treat retrieval as a first-class subsystem, not a background detail. That means surfacing similarity scores, exposing ranking decisions, understanding how vector search behaves at scale, and ensuring retriever recall stays high across diverse query types. It also requires that the prompt construction step is deterministic, inspectable, and traceable, because subtle variations in formatting often change model behavior dramatically.

Beyond these retrieval and prompt concerns, the LLM is also a production component. That means retry logic, token accounting, consistent latency, predictable throughput, and graceful failure modes. Production pipelines need clear boundaries between retrieval failures, prompt-generation bugs, and LLM invocation issues. If these concerns remain invisible, debugging becomes guesswork and reliability collapses under load. Production-grade RAG means engineered behavior, not accidental correctness.

Why Observability Is Essential for Retrieval-Augmented Systems

RAG pipelines fail silently. Retrieval may return irrelevant documents, prompting may omit essential context, and the LLM may hallucinate confidently even when grounded context exists. Without observability, it is impossible to diagnose why a particular answer was wrong. Was the embedding model inconsistent? Did FAISS, the vector search library used to retrieve similar documents, return poor matches? Did the prompt formatting break a system instruction? Did the LLM drift or degrade?

Observability solves this by turning the RAG pipeline into a transparent execution graph. Tools like Langfuse give you hierarchical traces: one trace for the whole request, and nested spans for retrieval, LLM calls, evaluation, and supporting steps. Each span captures inputs, outputs, metadata, latencies, token usage, and even scoring metrics. With this information, problems become diagnosable:

Retrieval returned low-relevance documents
Prompt formatting changed unexpectedly
LLM call degraded or hit retry logic
Evaluation metrics began trending downward

In other words, observability provides ground truth for system behavior. Production RAG must be accountable: decisions should be explainable, errors traceable, and failures measurable. Without observability, shipping RAG to production is equivalent to flying an airplane without instruments; the system might work, but you will not know when or why it stops working.

What We Will Build: Traced Retriever, Traced LLM, Full RAG Pipeline, and Evaluation

In this lesson, you will construct a fully observable, component-wise traced RAG system using Langfuse, FAISS, SentenceTransformers, and vLLM. Each part of the pipeline is instrumented for visibility: you will build a traced retriever that logs embeddings, index sizes, similarity scores, and ranking. You will build a traced LLM wrapper that records prompts, responses, retry attempts, and token usage. These components power a fully traced RAG pipeline that captures retrieval, prompt construction, generation, and final evaluation as a single hierarchical execution tree.

You will also implement automatic RAG output evaluation, computing relevancy, hallucination risk, and an overall quality score, with each metric logged back to Langfuse as a scoring node that appears in the dashboards. This gives you a complete introspection loop: every answer is measured, every metric is recorded, and every decision is traceable through structured spans.

By the end, you will have a production-grade RAG observability stack, running locally with:

A traced retriever
A traced LLM client
A fully instrumented RAG pipeline
Automatic scoring and diagnostics
Local dashboards for analyzing behavior

This foundation prepares you for upcoming lessons, where we extend these ideas into multi-step agents and more complex reasoning workflows.

RAG Observability Architecture with Langfuse, vLLM, and FAISS

A production-grade RAG pipeline is not a single model call. It is an orchestrated system composed of independent but cooperating components: retrieval, prompt assembly, LLM inference, and evaluation. In this section, we break down each subsystem and explain how they interact, why they are separated, and how Langfuse stitches everything together into a fully observable execution graph.

Retrieval → Prompt Construction → LLM → Scoring (The Core RAG Loop)

A well-designed RAG architecture follows a clean, linear flow where each stage has a single responsibility:

Retrieval

The system begins by embedding the user query and searching for relevant documents in a vector index. The retriever returns ranked, scored context items that will guide the LLM. In production, retrieval quality is often the primary bottleneck; if retrieval fails, generation cannot be correct. Therefore, retrieval spans log:

embeddings used
search distances and converted relevance scores
number of documents returned
FAISS query latencies

Your TracedRetriever does exactly this in the codebase, generating embeddings, searching the FAISS index, and tracing each step.

Prompt Construction

Next, the system converts retrieved documents into a structured context block and assembles a prompt that the LLM can reliably parse. Prompt construction must be deterministic to avoid instability across runs. The code in rag_pipeline.py builds a system message, a user message, and contextual references ([1] doc1, [2] doc2, etc.). This ensures:

deterministic ordering
visible context structure
stable interface for downstream evaluation
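
The prompt assembly described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual code in rag_pipeline.py: the function name build_messages and the system-prompt wording are hypothetical, and the only input assumed is a list of retrieved documents carrying "content" and "rank" fields.

```python
from typing import Dict, List


def build_messages(question: str, docs: List[Dict]) -> List[Dict[str, str]]:
    """Assemble chat messages with numbered context references ([1], [2], ...)."""
    # Sort by rank so the same retrieval result always yields the same prompt,
    # regardless of the order in which the caller passes the documents.
    ordered = sorted(docs, key=lambda d: d["rank"])
    context = "\n".join(
        f"[{i}] {d['content'].strip()}" for i, d in enumerate(ordered, start=1)
    )
    # Hypothetical system instruction; the real wording lives in rag_pipeline.py.
    system = (
        "Answer using only the numbered context. "
        "If the context is insufficient, say so."
    )
    user = f"Context:\n{context}\n\nQuestion: {question.strip()}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
```

Because the output depends only on the question and the ranked documents, two runs with the same retrieval result produce byte-identical prompts, which makes prompt diffs in traces meaningful.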

LLM Generation

The prompt is sent to the LLM via an OpenAI-compatible Chat Completions API, served by vLLM locally. The TracedLLMClient wraps this call with:

retry logic
token usage reporting
error logging
prompt and response capture
metadata annotations

This is critical for production reliability because LLM latency, token usage, and intermittent failures must all be observable.

Scoring and Evaluation

Finally, the answer is passed through a lightweight evaluation module (evaluation.py). It computes:

a relevancy score
a hallucination risk score
an overall quality score

These metrics are reported back into Langfuse as scoring nodes. Production RAG systems need this because correctness is subjective; evaluation makes correctness measurable.
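
To make the three metrics concrete, here is a deliberately crude sketch based on word overlap. It is not the logic in evaluation.py, which is not shown in this section: the function name evaluate_answer, the tokenizer, and the formulas are all assumptions chosen only to illustrate what "relevancy", "hallucination risk", and "overall quality" can mean numerically.

```python
import re
from typing import Dict, List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric word tokens (a deliberately simple tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def evaluate_answer(question: str, answer: str, contexts: List[str]) -> Dict[str, float]:
    """Score an answer with hypothetical word-overlap heuristics."""
    q, a = _tokens(question), _tokens(answer)
    c = set().union(*(_tokens(t) for t in contexts))
    # Relevancy: share of question terms that also appear in the answer.
    relevancy = len(q & a) / len(q) if q else 0.0
    # Hallucination risk: share of answer terms found in neither the
    # retrieved context nor the question. An empty answer counts as risky.
    unsupported = a - c - q
    hallucination_risk = len(unsupported) / len(a) if a else 1.0
    # Overall quality: relevancy discounted by hallucination risk.
    quality = relevancy * (1.0 - hallucination_risk)
    return {
        "relevancy": relevancy,
        "hallucination_risk": hallucination_risk,
        "quality": quality,
    }
```

A heuristic like this is cheap enough to run on every request; the resulting quality value could then be compared against a threshold such as min_quality_score from the configuration.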

This 4-step pipeline forms the backbone of every modern retrieval-augmented system.

Local Vector Store Using FAISS and SentenceTransformers

RAG pipelines must remain fast, private, and cost-efficient. This system uses FAISS as the vector index and SentenceTransformers for embedding models, giving you:

Zero API cost (everything is local)
GPU acceleration optional (FAISS works on CPU just fine)
Deterministic embeddings (critical for reproducibility)
Config-driven control over the embedding model and dimensionality

The retrieval pipeline is built around the following 3 core mechanisms:

Document Embeddings

Each document is encoded using the local model defined in config.yaml:

embeddings:
  model: "sentence-transformers/all-MiniLM-L6-v2"
  dimension: 384

The TracedRetriever loads this model and produces normalized embeddings for better retrieval precision.

FAISS Index

FAISS stores all document embeddings in a vector index created via:

self.index = faiss.IndexFlatL2(self.dimension)

IndexFlatL2 performs exact, brute-force search over every stored vector. It is simple, needs no training, and is fast enough for local development and for many small-to-medium production corpora.

Similarity Search

Retrieval happens by computing L2 distance and converting those distances into relevance scores, ensuring consistency and interpretability.

You end up with a fully local, high-performance vector store without touching external cloud APIs.

vLLM as an OpenAI-Compatible Inference Server

Instead of relying on OpenAI or Anthropic APIs, this lesson uses vLLM, a high-throughput inference engine built for serving LLMs at scale.

Your Docker Compose file runs vLLM either on GPU (recommended) or on CPU as a fallback, exposing it at:

http://localhost:8000/v1

This allows you to call vLLM with the exact same interface as OpenAI:

response = client.chat.completions.create(
    model=self.model,
    messages=messages,
    temperature=temperature,
    max_tokens=max_tokens
)

Benefits for production-grade RAG:

Predictable latency
Full control over model versioning
No external dependencies
High-throughput serving (PagedAttention)
OpenAI API compatibility (no code rewrite needed)

Your TracedLLMClient wraps all of this with Langfuse observability, giving you:

latency metrics
retry attempts
token usage breakdown
full input/output transparency
error-level spans when inference fails

This is how modern enterprises run private LLMs with production reliability.

Langfuse for Tracing, Metrics, Evaluation, and Span Hierarchies

Langfuse is the backbone of observability in this system. Every major component (i.e., embedding, retrieval, generation, and evaluation) becomes a span inside a single root trace.

A typical trace hierarchy looks like:

rag_pipeline (root)
│
├── retrieve_documents
│   ├── embed_text
│   ├── index_documents (only once)
│   └── retrieve_documents
│
├── llm_completion
│
└── evaluate_rag_output
    ├── evaluate_relevancy
    └── evaluate_hallucination

This structure gives you:

Full-System Visibility

Every question generates a complete execution tree revealing:

what happened
where it happened
how long it took
what went wrong

End-to-End Metrics

token usage
retrieval scores
latency per component
evaluation metrics

Rich Debugging Context

Each span stores:

input messages
embeddings preview
retrieved context
generated outputs
error details

Continuous Quality Monitoring

Your evaluation step logs:

a relevancy score
a hallucination risk
a final pass-or-fail quality metric

Langfuse becomes the single pane of glass for understanding your RAG pipeline’s behavior, serving as the missing observability layer that transforms a working prototype into a production-ready system.

Project Setup

Before we write a single line of RAG logic, the foundation must be solid: a clean folder structure, repeatable environment setup, deterministic configuration, and a reliable inference and observability stack. This section walks you through the project layout, how to launch vLLM and Langfuse via Docker Compose, how to install retrieval dependencies (FAISS and SentenceTransformers), and how to configure all components using a single config.yaml file.

Folder Structure Walkthrough

Your project is organized for production clarity, where each subsystem (RAG, LLM, agent, evaluation, and infrastructure) is isolated in its own module.

project-root/
│
├── configs/
│   └── config.yaml                # Central config: LLM, embeddings, RAG, agent, eval, Langfuse
│
├── data/
│   └── sample_docs.txt            # Example inputs for retrieval
│
├── src/
│   ├── config.py                  # Config loader utilities
│   ├── llm_utils.py               # OpenAI-compatible client initialization
│   ├── llm_client.py              # Traced LLM wrapper (retry + token usage + spans)
│   ├── retriever.py               # FAISS retriever with traced indexing + search
│   ├── evaluation.py              # RAG quality scoring (relevancy + hallucination)
│   ├── rag_pipeline.py            # Full retrieval → prompt → generation → evaluation pipeline
│   ├── agent_orchestration.py     # 3-step traced agent workflow
│   ├── langfuse_instrumentation.py # Bootstraps Langfuse + flush utilities
│
├── docker-compose.yml             # vLLM + Langfuse + Postgres (self-hosted observability)
│
├── requirements.txt               # Python dependencies
│
└── check_rag_health.py            # Full system health check (env, docker, dependencies, files)

This layout ensures:

Decoupled components: easy for testing and future replacement
Reproducible environment: config-driven behavior
Portable observability: one command launches everything
Scalable structure: supports RAG, agents, and future tools

Every file in the src/ directory corresponds to a runnable pipeline step, and each is instrumented with Langfuse decorators so all activity becomes visible in the dashboard.

Starting vLLM and Langfuse Using Docker Compose

For production-like observability, the system relies on 2 running services:

Langfuse: tracing, metrics, and span visualization
vLLM: inference engine serving the LLM

Both are provided in your docker-compose.yml, and both run locally, meaning:

zero cloud dependency
zero per-token cost
repeatable development environment

Start the entire stack (GPU version)

docker-compose --profile gpu up -d

Or start CPU mode

docker-compose --profile cpu up -d

Confirm services are running

docker-compose ps

You should see something like:

**Table 1:** Core services with port mappings and health status

UI access

Langfuse dashboard: http://localhost:3000
vLLM API: http://localhost:8000/v1
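
A quick way to confirm both endpoints respond from Python is sketched below. The helper names http_ok and check_services are hypothetical, and the probe paths are assumptions: /api/public/health for Langfuse and /v1/models for the OpenAI-compatible vLLM server (the latter may require an API key if your vLLM instance was started with one).

```python
import http.client
import urllib.request
from typing import Callable, Dict

# Assumed probe URLs; adjust them if your ports or paths differ.
ENDPOINTS = {
    "langfuse": "http://localhost:3000/api/public/health",
    "vllm": "http://localhost:8000/v1/models",
}


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx status, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers URLError/HTTPError, refused connections, and timeouts;
        # ValueError covers malformed URLs.
        return False


def check_services(
    endpoints: Dict[str, str],
    probe: Callable[[str], bool] = http_ok,
) -> Dict[str, bool]:
    """Probe each named endpoint and report whether it is reachable."""
    return {name: probe(url) for name, url in endpoints.items()}
```

Calling check_services(ENDPOINTS) returns a dictionary such as {"langfuse": True, "vllm": False}, which tells you which container to inspect. The probe argument is injectable so the function can be exercised without a network.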

What these services do internally

Langfuse server: stores traces, spans, scoring, and metadata
Langfuse worker: processes asynchronous scoring and analytics
PostgreSQL: stores trace data
vLLM: serves the Llama 2 model loaded at runtime

This cluster forms your local, production-grade observability and inference backbone.

Installing FAISS and SentenceTransformers

The retrieval layer requires:

FAISS: similarity search
SentenceTransformers: embedding model

These are already declared in your requirements.txt:

sentence-transformers>=2.2.0
faiss-cpu>=1.7.4
numpy>=1.24.0

Install dependencies:

pip install -r requirements.txt

After installation, verify FAISS is working:

python -c "import faiss; print(f'FAISS version: {faiss.__version__}')"

Verify embedding model loads:

python -c "from sentence_transformers import SentenceTransformer; print(SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2'))"

These 2 libraries form the core retrieval engine:

Embeddings: produced locally (MiniLM)
Retrieval: performed locally (FAISS IndexFlatL2)

No external API latency.

No cost.

No vendor lock-in.

Configuring config.yaml (LLM, Embeddings, RAG, and Evaluation)

The entire RAG and agent system is configurable from a single YAML file:

langfuse:
  host: "http://localhost:3000"
  project_name: "rag-selfhosted"

llm:
  base_url: "http://localhost:8000/v1"
  model: "meta-llama/Llama-2-7b-chat-hf"
  temperature: 0.7
  max_tokens: 300
  max_retries: 2

embeddings:
  model: "sentence-transformers/all-MiniLM-L6-v2"
  dimension: 384

rag:
  top_k: 3

agent:
  max_steps: 3

evaluation:
  enable_scoring: true
  min_quality_score: 0.6

Key configuration sections

LLM Configuration

Controls inference behavior:

model name
sampling temperature
max tokens
retry count
endpoint (vLLM server URL)

Your TracedLLMClient loads these automatically via:

from config import get_llm_config

Embeddings Configuration

Controls vector dimension and embedding model, and is consumed by:

from config import get_embeddings_config

RAG Settings

Controls retrieval behavior:

top_k: results returned from FAISS
used inside: TracedRetriever.retrieve()

Agent Settings

Agent workflows build on top of RAG, controlling:

max agent steps
model used for intent detection

Evaluation Configuration

Defines quality control thresholds:

relevancy
hallucination risk
minimum acceptable quality

Used in:

from config import get_evaluation_config

This config-driven system makes your pipeline:

reproducible
tunable
production-friendly
environment-agnostic

Change models or thresholds, and no code changes are required.
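
Because behavior is driven entirely by configuration, it can help to validate the parsed values before starting a run. The sketch below assumes the YAML has already been loaded into a plain dictionary; validate_config is a hypothetical helper, and the accepted ranges (for example, temperature between 0 and 2 and a quality score between 0 and 1) are assumptions rather than rules enforced by the project.

```python
from typing import Any, Dict, List

# The values from config.yaml, as they would look after parsing.
CONFIG: Dict[str, Any] = {
    "llm": {
        "base_url": "http://localhost:8000/v1",
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "temperature": 0.7,
        "max_tokens": 300,
        "max_retries": 2,
    },
    "embeddings": {"model": "sentence-transformers/all-MiniLM-L6-v2", "dimension": 384},
    "rag": {"top_k": 3},
    "evaluation": {"enable_scoring": True, "min_quality_score": 0.6},
}


def validate_config(cfg: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means no issue was found."""
    problems: List[str] = []
    llm = cfg.get("llm", {})
    if not str(llm.get("base_url", "")).startswith(("http://", "https://")):
        problems.append("llm.base_url must be an http(s) URL")
    if not 0.0 <= llm.get("temperature", 0.7) <= 2.0:
        problems.append("llm.temperature should be between 0 and 2")
    if llm.get("max_tokens", 300) <= 0:
        problems.append("llm.max_tokens must be positive")
    if cfg.get("embeddings", {}).get("dimension", 0) <= 0:
        problems.append("embeddings.dimension must be positive")
    if cfg.get("rag", {}).get("top_k", 0) < 1:
        problems.append("rag.top_k must be at least 1")
    score = cfg.get("evaluation", {}).get("min_quality_score", 0.6)
    if not 0.0 <= score <= 1.0:
        problems.append("evaluation.min_quality_score should be between 0 and 1")
    return problems
```

Running validate_config(CONFIG) on the values shown above returns an empty list; a typo such as top_k: 0 is reported before any model is loaded.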

Building a Langfuse-Traced Retriever with FAISS

The retriever is the beating heart of any RAG pipeline. If retrieval is weak, every downstream component (i.e., prompting, LLM generation, and evaluation) will degrade. In this section, we construct a production-grade retriever built on three pillars: local embeddings, FAISS vector search, and Langfuse instrumentation. The result is a component that is fast, reproducible, fully observable, and cheap to run because it never leaves your machine or calls a cloud API.

Your TracedRetriever class in src/retriever.py handles 4 responsibilities:

Load an embedding model
Embed and index documents
Perform similarity-based search
Emit Langfuse spans for every step (embedding, indexing, and retrieval)

Loading and Embedding Documents

The retriever begins by loading a local SentenceTransformers model, which provides dense vector embeddings without any external API calls.

embeddings_config = get_embeddings_config()
model_name = embeddings_config.get("model", "sentence-transformers/all-MiniLM-L6-v2")
self.model = SentenceTransformer(model_name)

Why local embeddings?

No rate limits or API costs after the local environment is configured
Fast inference through optimized ONNX or Torch acceleration
Privacy-safe since no data leaves the environment
Predictable latency, which is critical in production

Embedding a document (with tracing)

Your embed() method is wrapped with the Langfuse @observe decorator:

@observe(name="embed_text")
def embed(self, text: str) -> np.ndarray:

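This automatically creates a span called embed_text in Langfuse. When embed() is called from another traced method such as retrieve(), the span is nested under the caller; when it is called on its own, it becomes the root of a new trace.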

Inside the span, you record:

first 100 characters of the text
embedding dimension

langfuse_context.update_current_observation(
    input={"text_preview": text[:100]}
)

The embedding call itself:

embedding = self.model.encode([text], normalize_embeddings=True)[0]

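This normalization step gives every embedding unit length. For unit vectors, squared L2 distance is a decreasing function of cosine similarity, so ranking by L2 distance in FAISS yields the same order as ranking by cosine similarity, and distances stay in a fixed range (0 to 4).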
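
The link between normalization and L2 search can be checked numerically. For unit-length vectors a and b, the squared distance satisfies ‖a − b‖² = ‖a‖² + ‖b‖² − 2(a · b) = 2 − 2(a · b), and the dot product of unit vectors is their cosine similarity. The snippet below is a small standalone check using only the standard library; the helper names are illustrative.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))


a = normalize([3.0, 4.0])
b = normalize([1.0, 0.0])

# For unit vectors the two quantities agree up to floating-point error.
assert math.isclose(squared_l2(a, b), 2.0 - 2.0 * dot(a, b))
```

Since 2 − 2(a · b) shrinks as the cosine similarity grows, the nearest neighbor by L2 distance is also the most similar vector by cosine.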

The span finishes by storing metadata:

langfuse_context.update_current_observation(
    output={"embedding_dim": len(embedding)}
)

This is extremely useful later for debugging:

Did documents produce embeddings with inconsistent lengths?
Are embeddings accidentally empty?
Are overly long texts being passed in?

Langfuse gives you full visibility.

Creating and Populating a FAISS Index

After loading the embedding model, the retriever constructs a FAISS IndexFlatL2 index:

self.index = faiss.IndexFlatL2(self.dimension)

This index:

Stores vectors in RAM
Uses Euclidean distance (L2) for similarity
Has no training step, making it ideal for small and medium-sized datasets

Your retriever keeps an in-memory list of source documents:

self.documents = []

Indexing documents (with tracing)

@observe(name="index_documents")
def index_documents(self, documents: List[str]):

This span tracks:

how many documents are being indexed
how many embeddings were added
previews of content for debugging

Under the hood:

Store the raw documents
Embed them in a batch
Add the vectors to FAISS

embeddings = self.model.encode(documents, normalize_embeddings=True)
self.index.add(embeddings.astype(np.float32))

Because FAISS expects float32, the cast is mandatory.

Why IndexFlatL2?

Simple
Deterministic
Fast for small–medium corpora (< 200k docs)
Plays well with normalized embeddings (MiniLM, BERT, etc.)

Your pipeline achieves high throughput without additional libraries or GPUs.

Retrieving with Similarity Ranking

The retrieval process begins with the retrieve() method:

@observe(name="retrieve_documents")
def retrieve(self, query: str, top_k: int = None):

Langfuse creates a tracing span named retrieve_documents for every search operation.

Step 1. Embed the query

query_embedding = self.embed(query).reshape(1, -1)

Notice that calling self.embed() creates a nested span under the retrieval span in Langfuse.

This nesting hierarchy:

retrieve_documents
    ├── embed_text

gives you a complete view of:

how long embedding took
the embedding dimension (which changes if you swap the embedding model)
a preview of the query text

Step 2. Search the FAISS index

distances, indices = self.index.search(query_embedding, top_k)

FAISS returns:

indices: the closest documents
distances: squared L2 distances to each doc (IndexFlatL2 does not take the square root)

You convert distances into similarity scores:

relevance_score = 1.0 / (1.0 + float(distance))

This maps smaller distances to higher scores in the range (0, 1], with a score of 1.0 for an exact match.

Step 3. Format ranked results

results.append({
    "content": self.documents[idx],
    "score": relevance_score,
    "rank": rank + 1,
    "distance": float(distance)
})
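
Steps 2 and 3 can be reproduced without FAISS to make the ranking logic easy to inspect. The sketch below is a pure-Python stand-in that computes squared L2 distances by brute force, which is assumed here to mirror what an exact flat index does; the function name search and the sample vectors are illustrative only.

```python
from typing import Dict, List, Sequence


def search(
    query: Sequence[float],
    vectors: Sequence[Sequence[float]],
    documents: Sequence[str],
    top_k: int = 3,
) -> List[Dict]:
    """Rank documents by squared L2 distance and convert distances to scores."""
    # Brute-force distance to every stored vector, paired with its index.
    scored = [
        (sum((q - v) ** 2 for q, v in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    ]
    scored.sort()  # smallest distance first; ties broken by index

    results = []
    for rank, (distance, idx) in enumerate(scored[:top_k]):
        results.append({
            "content": documents[idx],
            "score": 1.0 / (1.0 + distance),  # same conversion as in the text
            "rank": rank + 1,
            "distance": distance,
        })
    return results


vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
documents = ["doc a", "doc b", "doc c"]
results = search([1.0, 0.0], vectors, documents, top_k=2)
```

Here "doc a" ranks first with distance 0 and "doc c" ranks second. One difference to keep in mind with the real library: when an index holds fewer than top_k vectors, FAISS fills the missing slots with index -1, so production code should skip negative indices before looking up documents.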

Step 4. Log retrieval metadata to Langfuse

langfuse_context.update_current_observation(
    output={
        "result_count": len(results),
        "scores": [r["score"] for r in results],
        "results": [...]
    }
)

You even send content previews (200 characters), which appear in the Langfuse UI and make debugging dramatically easier.

Adding Langfuse Spans to Indexing and Retrieval Steps

Langfuse observability is woven into every retrieval path using the @observe decorator and metadata updates.

Spans you automatically get from your retriever

**Table 2:** Key instrumented methods and their corresponding Langfuse tracing spans

These spans appear under your RAG pipeline trace like:

rag_pipeline
    ├── retrieve_documents
    │       ├── embed_text
    │       └── result metadata
    ├── llm_completion
    ├── evaluate_rag_output
    └── final scoring

Why this matters in production

You can identify whether latency is coming from embedding, FAISS search, or LLM inference.
You can detect mismatches like:
- wrong embedding dimension
- missing documents
- unnormalized vectors
- misconfigured top-k
You get complete end-to-end lineage for every query.
You can monitor retriever performance across time.

This is the observability layer that most open-source RAG tutorials never include, but you now have it baked into the core of your retriever.
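
Some of the mismatches listed above can also be caught programmatically before they reach the index. The following sketch shows one possible sanity check for a single embedding; embedding_problems is a hypothetical helper, and the tolerance value is an assumption.

```python
import math
from typing import List, Sequence


def embedding_problems(
    vec: Sequence[float],
    expected_dim: int,
    tol: float = 1e-3,
) -> List[str]:
    """Return a list of detected issues; an empty list means the vector looks fine."""
    problems: List[str] = []
    if len(vec) != expected_dim:
        problems.append(f"dimension {len(vec)} != expected {expected_dim}")
    if any(math.isnan(x) or math.isinf(x) for x in vec):
        problems.append("contains NaN or inf")
    else:
        # Only check the norm when all values are finite.
        norm = math.sqrt(sum(x * x for x in vec))
        if abs(norm - 1.0) > tol:
            problems.append(f"not unit length (norm={norm:.4f})")
    return problems
```

The returned list could be attached to a span's metadata so that a wrong dimension or an unnormalized vector shows up next to the query that triggered it.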

Building a Traced LLM Wrapper for vLLM and Langfuse

OpenAI-Compatible Chat Completions via vLLM

Your LLM wrapper is split into 2 layers:

a low-level OpenAI-compatible client in llm_utils.py, and
a high-level, Langfuse-traced wrapper in llm_client.py.

The low-level client is created in get_llm_client():

import os

from openai import OpenAI

def get_llm_client(timeout: int = 60, load_model_from_config: bool = False):
    if os.getenv("OPENAI_BASE_URL") is None:
        print("⚠️  OPENAI_BASE_URL not found in environment. Using default http://localhost:8000/v1")
   
    if os.getenv("OPENAI_API_KEY") is None:
        print("⚠️  OPENAI_API_KEY not set. Using dummy key.")
   
    client = OpenAI(
        base_url=os.getenv("OPENAI_BASE_URL", "http://localhost:8000/v1"),
        api_key=os.getenv("OPENAI_API_KEY", "dummy"),
        timeout=timeout,
    )
    ...
    return client

This means that as long as vLLM is running behind an OpenAI-compatible server (from docker-compose.yml on http://localhost:8000/v1), the rest of your code simply calls client.chat.completions.create(...) exactly like it would against OpenAI, without any vendor-specific changes.

At the higher level, TracedLLMClient in src/llm_client.py wraps this client and pulls model configuration from configs/config.yaml via get_llm_config():

llm:
  base_url: "http://localhost:8000/v1"
  model: "meta-llama/Llama-2-7b-chat-hf"
  temperature: 0.7
  max_tokens: 300
  max_retries: 2

class TracedLLMClient:
    def __init__(self, model: str = None, max_retries: int = 2, timeout: int = 60):
        self.client = get_llm_client(timeout=timeout)
        if model is None:
            llm_config = get_llm_config()
            model = llm_config.get("model", "meta-llama/Llama-2-7b-chat-hf")
        self.model = model
        self.max_retries = max_retries

The end result: your RAG and agent code never talks to vLLM directly; it always goes through TracedLLMClient, which is OpenAI-compatible, config-driven, and ready for tracing.

Retry Logic and Error Handling

The core of the wrapper is the complete() method:

import time
from typing import Dict, List

from langfuse.decorators import observe, langfuse_context

class TracedLLMClient:
    @observe(name="llm_completion")
    def complete(self, messages: List[Dict[str, str]], **kwargs) -> Dict:
        llm_config = get_llm_config()
        temperature = kwargs.get("temperature", llm_config.get("temperature", 0.7))
        max_tokens = kwargs.get("max_tokens", llm_config.get("max_tokens", 300))

        langfuse_context.update_current_observation(
            input={"messages": messages, "model": self.model}
        )

        last_error = None
        for attempt in range(self.max_retries):
            try:
                response = self.client.chat.completions.create(
                    model=self.model,
                    messages=messages,
                    temperature=temperature,
                    max_tokens=max_tokens
                )
                ...
                return {..., "success": True}
            except Exception as e:
                last_error = e
                if attempt < self.max_retries - 1:
                    time.sleep(1)
                    continue

        error_msg = f"LLM call failed after {self.max_retries} attempts: {last_error}"
        langfuse_context.update_current_observation(
            level="ERROR",
            output={"error": error_msg}
        )
        return {"content": None, "error": error_msg, "success": False}

A few production-grade details are baked in here:

Config-driven defaults: temperature and max_tokens come from config.yaml but can be overridden per-call via kwargs.
Retry loop: the method makes up to self.max_retries attempts in total (default 2, so one retry), with a fixed time.sleep(1) pause between attempts.
Graceful failure: if all attempts fail, you get a structured response {"content": None, "error": "...", "success": False} instead of a hard crash, and the Langfuse span is explicitly marked as "ERROR".

When you call this from rag_pipeline.py or agent_orchestration.py, you can safely check result["success"] and decide whether to return a fallback answer, propagate the error, or short-circuit the pipeline.
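
The retry pattern can be isolated from the LLM call and tested on its own. The sketch below is a variation rather than the code in llm_client.py: it swaps the fixed one-second pause for exponential backoff, and the names call_with_retries and base_delay are hypothetical. The sleep function is injectable so tests do not have to wait.

```python
import time
from typing import Any, Callable, Dict


def call_with_retries(
    fn: Callable[[], Any],
    max_attempts: int = 2,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fn up to max_attempts times and return a structured result."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return {"content": fn(), "attempt": attempt + 1, "success": True}
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            last_error = exc
            if attempt < max_attempts - 1:
                # Exponential backoff: base_delay, then 2x, 4x, ...
                sleep(base_delay * (2 ** attempt))
    return {
        "content": None,
        "error": f"failed after {max_attempts} attempts: {last_error}",
        "success": False,
    }
```

A caller would wrap the request in a zero-argument function, for example call_with_retries(lambda: client.complete(messages)), and branch on result["success"] exactly as described above.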

Logging Request and Response Payloads

Because TracedLLMClient.complete() is decorated with @observe(name="llm_completion"), every call automatically becomes a Langfuse span, and you manually enrich that span with inputs and outputs via langfuse_context.

At the start of the call, you log the request payload:

langfuse_context.update_current_observation(
    input={"messages": messages, "model": self.model}
)

This means that in the Langfuse UI you will see:

the full chat history (messages) you sent to the model, and
which model was used (e.g., "meta-llama/Llama-2-7b-chat-hf").

After a successful LLM call, you log the response content:

content = response.choices[0].message.content

langfuse_context.update_current_observation(
    output={"content": content},
    usage={
        "input": response.usage.prompt_tokens,
        "output": response.usage.completion_tokens,
        "total": response.usage.total_tokens
    },
    metadata={"attempt": attempt + 1}
)

So every Langfuse span for llm_completion will show:

the raw answer text the model generated
which attempt succeeded (first try or retry)
the token usage for that call

On failure, the wrapper logs the error message instead of content:

langfuse_context.update_current_observation(
    level="ERROR",
    output={"error": error_msg}
)

This gives you debuggable traces when vLLM is down, you hit timeouts, or your model name is misconfigured.

Capturing Token Usage and Metadata in Langfuse

vLLM exposes response.usage in an OpenAI-like shape, and you forward that directly into Langfuse as part of the span:

langfuse_context.update_current_observation(
    output={"content": content},
    usage={
        "input": response.usage.prompt_tokens,
        "output": response.usage.completion_tokens,
        "total": response.usage.total_tokens
    },
    metadata={"attempt": attempt + 1}
)

return {
    "content": content,
    "usage": response.usage.model_dump(),
    "success": True
}

This gives you 2 layers of observability:

Inside Langfuse
- You can filter and inspect spans by usage.total, see which prompts are expensive, and spot unusually long generations.
- You can correlate token usage with overall RAG or agent traces because llm_completion spans sit inside higher-level pipeline spans such as rag_pipeline or agent_workflow.
Inside your Python code
- Callers receive result["usage"] and can log or aggregate it themselves (e.g., cost dashboards, quotas, or alerting in future lessons).
- Because usage is returned as response.usage.model_dump(), it is just a normal Python dict you can serialize or send elsewhere.

The metadata={"attempt": attempt + 1} block gives you a clean way to see how often retries are needed; if you start seeing many spans with an attempt value above 1 in Langfuse, you know vLLM or your infra is becoming unreliable and needs attention.
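
On the Python side, the returned usage dictionaries can be rolled up with a few lines of code. The sketch below assumes each result has the shape returned by complete() as shown above, with "success" and, on success, a "usage" dict containing prompt_tokens, completion_tokens, and total_tokens; summarize_usage is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, Iterable

TOKEN_KEYS = ("prompt_tokens", "completion_tokens", "total_tokens")


def summarize_usage(results: Iterable[Dict]) -> Dict[str, int]:
    """Aggregate token usage and failure counts across many LLM results."""
    totals: Counter = Counter()
    calls = 0
    failures = 0
    for result in results:
        calls += 1
        if not result.get("success"):
            failures += 1  # failed calls carry no usage information
            continue
        usage = result.get("usage") or {}
        for key in TOKEN_KEYS:
            totals[key] += int(usage.get(key) or 0)
    summary = {"calls": calls, "failures": failures}
    summary.update({key: totals[key] for key in TOKEN_KEYS})
    return summary
```

Feeding it the results of a batch of questions gives a single dictionary that is easy to log, compare between runs, or check against a token budget.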

Example: Using the Traced LLM Client

Your __main__ block in llm_client.py shows a minimal end-to-end example:

if __name__ == "__main__":
    client = TracedLLMClient()
   
    result = client.complete(
        messages=[
            {"role": "user", "content": "What is RAG in AI?"}
        ]
    )
   
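    # A failed call returns no "usage" key, so check success before reading it.
    if result["success"]:
        print(f"Response: {result['content']}")
        print(f"Tokens: {result['usage']['total_tokens']}")
    else:
        print(f"LLM call failed: {result['error']}")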
   
    trace_id = langfuse_context.get_current_trace_id()
    langfuse_host = os.getenv("LANGFUSE_HOST", "http://localhost:3000")
    print(f"🔍 View trace: {langfuse_host}/trace/{trace_id}")

This script:

verifies that vLLM is responding correctly
verifies that Langfuse keys and host are properly configured
gives you a direct URL to the exact trace for this LLM call in the Langfuse UI

In the next sections, you will see this same TracedLLMClient reused inside the RAG pipeline and RAG evaluation, where it becomes just one span in a larger, nested trace tree.

Building a Fully Traced RAG Pipeline with Langfuse

The run_rag_pipeline Orchestrator

Your full RAG flow is implemented in src/rag_pipeline.py as a single orchestrator function:

@observe(name="rag_pipeline")
def run_rag_pipeline(
    question: str,
    retriever: TracedRetriever,
    llm_client: TracedLLMClient,
    top_k: int = 3
) -> Dict:
    ...

This one function wires together everything you have built so far: it takes a user question, uses the traced retriever to find context, calls the traced LLM client to generate an answer, and then runs RAG evaluation to compute relevancy and hallucination scores. Because it is decorated with @observe(name="rag_pipeline"), the entire run shows up in Langfuse as a top-level trace, with all retrieval, LLM, and evaluation spans nested underneath.

Step 1: Retrieve

The first step is retrieving documents with your TracedRetriever:

langfuse_context.update_current_observation(
    input={"question": question, "top_k": top_k}
)

print("Step 1: Retrieving documents...")
docs = retriever.retrieve(query=question, top_k=top_k)

if not docs:
    print("❌ No documents found")
    return {"answer": "No relevant information found.", "success": False}

Here is what happens in this step:

The pipeline span is enriched with the incoming question and the top_k parameter via langfuse_context.update_current_observation.
retriever.retrieve(...) is itself decorated with @observe(name="retrieve_documents"), so Langfuse automatically creates a child span under rag_pipeline. Inside that span, you log the query, scores, and content previews.
If the index is empty or nothing is returned, you fail fast with a friendly message and success=False instead of trying to prompt the LLM with no context.

By the end of Step 1, you have a ranked list of documents such as:

[
    {"content": "...", "score": 0.89, "rank": 1, "distance": 0.12},
    {"content": "...", "score": 0.85, "rank": 2, "distance": 0.18},
    ...
]

and their retrieval details are already captured in Langfuse.
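To make the shape of that list concrete, here is a minimal sketch of how distances could be turned into ranked results, assuming the 1/(1+d) distance-to-score conversion this lesson describes for the retrieval span. The function name rank_results and the toy inputs are hypothetical, and a real retriever would get its distances from a FAISS search rather than a hard-coded list.

```python
from typing import Dict, List


def rank_results(contents: List[str], distances: List[float]) -> List[Dict]:
    """Pair documents with distances and attach a 1 / (1 + d) score."""
    # A smaller distance means a closer match, so sort ascending by distance.
    pairs = sorted(zip(contents, distances), key=lambda pair: pair[1])
    ranked = []
    for rank, (content, distance) in enumerate(pairs, start=1):
        ranked.append({
            "content": content,
            "score": round(1.0 / (1.0 + distance), 2),
            "rank": rank,
            "distance": distance,
        })
    return ranked


ranked_docs = rank_results(["doc B", "doc A"], [0.18, 0.12])
for doc in ranked_docs:
    print(doc)
```

The conversion keeps scores in the (0, 1] range, with a distance of 0 mapping to a score of 1.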

Step 2: Build Prompt from Retrieved Docs

Next, you turn those retrieved documents into a single, structured prompt:

print("Step 2: Building prompt...")
context = "\n\n".join([f"[{i+1}] {d['content']}" for i, d in enumerate(docs)])
messages = [
    {"role": "system", "content": "Answer based on the provided context."},
    {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"}
]

A few important details:

Each document is tagged with an index ([1], [2], [3]) so it is easy to map parts of the final answer back to specific sources, both as a human and when you are debugging traces.
The system message explicitly constrains the model: “Answer based on the provided context.” This is a simple but effective guardrail against hallucinations.
The user message includes both the stitched context and the original question, finishing with “Answer:” to bias the model toward a direct response.

Because messages are later passed into the traced LLM client, the entire prompt (including context) is visible inside the llm_completion span in Langfuse.
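If you want to inspect the prompt outside the pipeline, the same construction can be pulled into a small standalone function. This is a sketch only: build_messages and the sample documents are hypothetical names and data, while the message contents mirror the snippet above.

```python
from typing import Dict, List


def build_messages(question: str, docs: List[Dict]) -> List[Dict[str, str]]:
    """Number each document and wrap context plus question in chat messages."""
    context = "\n\n".join(
        f"[{i}] {d['content']}" for i, d in enumerate(docs, start=1)
    )
    return [
        {"role": "system", "content": "Answer based on the provided context."},
        {"role": "user",
         "content": f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"},
    ]


sample_docs = [
    {"content": "RAG combines retrieval with generation."},
    {"content": "FAISS performs vector similarity search."},
]
messages = build_messages("What is RAG?", sample_docs)
print(messages[1]["content"])
```

Printing the user message this way shows exactly what the model receives, including the [1] and [2] markers and the blank lines between documents.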

Step 3: Generate with vLLM

You then hand off the prompt to your TracedLLMClient:

print("Step 3: Generating answer...")
result = llm_client.complete(messages)

if not result["success"]:
    print(f"❌ Generation failed: {result.get('error')}")
    return {"answer": None, "error": result.get("error"), "success": False}

answer = result["content"]
print(f"✅ Answer generated\n")

Under the hood:

TracedLLMClient.complete() calls client.chat.completions.create(...) against the vLLM OpenAI-compatible server (configured via OPENAI_BASE_URL and OPENAI_API_KEY, with model and temperature from config.yaml).
The method is decorated with @observe(name="llm_completion"), so a child span is created inside the rag_pipeline trace.
Inside that span, you log:
- the messages and model as input
- the generated content as output
- detailed token usage (prompt_tokens, completion_tokens, total_tokens) as usage, plus metadata={"attempt": ...} indicating which retry succeeded

If vLLM is down, misconfigured, or times out, the wrapper returns {"success": False, "error": ...} and updates the Langfuse span with level="ERROR", so you get a clear red node in the trace instead of a mysterious failure.
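The retry logic itself is not repeated in this section, so the following is only a sketch of the general pattern: try a call a few times, report which attempt succeeded, and return a structured failure instead of raising. The names complete_with_retries, max_retries, and backoff_seconds are hypothetical, and a stand-in callable replaces the real vLLM request.

```python
import time
from typing import Callable, Dict


def complete_with_retries(call: Callable[[], str], max_retries: int = 3,
                          backoff_seconds: float = 0.0) -> Dict:
    """Call `call` up to max_retries times and return a result dict."""
    last_error = None
    for attempt in range(max_retries):
        try:
            content = call()
            # attempt is zero-based, so report attempt + 1 like the span metadata.
            return {"content": content, "success": True, "attempt": attempt + 1}
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            last_error = str(exc)
            time.sleep(backoff_seconds)
    return {"success": False, "error": last_error}


call_counter = {"count": 0}


def flaky_call() -> str:
    """Stand-in for an LLM request that fails once, then succeeds."""
    call_counter["count"] += 1
    if call_counter["count"] < 2:
        raise ConnectionError("server not ready")
    return "RAG stands for Retrieval-Augmented Generation."


def always_down() -> str:
    """Stand-in for a server that never responds in time."""
    raise TimeoutError("request timed out")


ok_result = complete_with_retries(flaky_call)
failed_result = complete_with_retries(always_down)
print(ok_result["attempt"], failed_result)
```

Returning a dict with success=False keeps the pipeline's control flow simple: the caller checks one flag rather than wrapping every call in its own try/except.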

Step 4: Evaluate Response Quality

Once you have an answer, you pass everything into the evaluation layer in src/evaluation.py:

print("Step 4: Evaluating quality...")
evaluation_results = evaluate_rag_output(question, docs, answer)

evaluate_rag_output is itself annotated with @observe(name="evaluate_rag_output"), and it calls 2 more traced helpers under the hood:

evaluate_relevancy(query, retrieved_docs, answer)
evaluate_hallucination_risk(retrieved_docs, answer)

The process inside evaluate_rag_output looks like this:

langfuse_context.update_current_observation(
    input={
        "query": query,
        "doc_count": len(retrieved_docs),
        "answer_length": len(answer)
    }
)

relevancy_score = evaluate_relevancy(query, retrieved_docs, answer)
hallucination_risk = evaluate_hallucination_risk(retrieved_docs, answer)
overall_quality = (relevancy_score + (1.0 - hallucination_risk)) / 2.0

eval_config = get_evaluation_config()
min_quality = eval_config.get("min_quality_score", 0.6)

results = {
    "relevancy_score": relevancy_score,
    "hallucination_risk": hallucination_risk,
    "overall_quality": overall_quality,
    "passed": overall_quality >= min_quality
}

In more detail:

Relevancy scoring (evaluate_relevancy): computes how well the answer overlaps with both the query and the retrieved documents using simple word-level heuristics.
Hallucination risk (evaluate_hallucination_risk): estimates what fraction of the answer’s words are not found in the retrieved documents; less grounding means higher risk.
Overall quality: a simple average of relevancy and 1 − hallucination_risk, giving a single number between 0 and 1.
A minimum quality threshold (min_quality_score) comes from the evaluation section of config.yaml and is used to set a passed boolean.

The function then:

langfuse_context.score_current_observation(
    name="relevancy",
    value=relevancy_score,
    comment="Keyword and document relevance"
)

langfuse_context.score_current_observation(
    name="hallucination_risk",
    value=hallucination_risk,
    comment="Risk of ungrounded claims"
)

langfuse_context.score_current_observation(
    name="overall_quality",
    value=overall_quality,
    comment=f"Combined quality score (threshold: {min_quality})"
)

langfuse_context.update_current_observation(output=results)

So you get 3 named scores on the evaluation span inside Langfuse: relevancy, hallucination_risk, and overall_quality, each with a numeric value and a human-readable comment.

Tracing the Entire RAG Pipeline with Nested Spans

Back in run_rag_pipeline, you finalize the top-level observation and return a structured result:

langfuse_context.update_current_observation(
    output={
        "answer": answer,
        "sources_count": len(docs),
        "evaluation": evaluation_results
    }
)

print(f"✅ Evaluation complete")
print(f"  Relevancy: {evaluation_results['relevancy_score']:.2f}")
print(f"  Hallucination Risk: {evaluation_results['hallucination_risk']:.2f}")
print(f"  Overall Quality: {evaluation_results['overall_quality']:.2f}")
print(f"  Passed: {'✅' if evaluation_results['passed'] else '❌'}\n")

Then you expose the trace URL:

trace_id = langfuse_context.get_current_trace_id()
langfuse_host = os.getenv("LANGFUSE_HOST", "http://localhost:3000")

print(f"{'='*50}")
print(f"✅ Pipeline Complete")
print(f"🔍 View trace: {langfuse_host}/trace/{trace_id}")
print(f"{'='*50}\n")

At this point, a single pipeline run creates a hierarchy of spans roughly like this in Langfuse:

rag_pipeline (top-level)
- retrieve_documents (from TracedRetriever.retrieve)
  - embed_text (from TracedRetriever.embed)
- llm_completion (from TracedLLMClient.complete)
- evaluate_rag_output
  - evaluate_relevancy
  - evaluate_hallucination

Each node contains its own inputs, outputs, usage, and scores, giving you a complete picture of where time is spent, how the model behaved, and whether the final answer passed your quality threshold.

Returned Structure and Downstream Use

Finally, the function returns a rich Python dictionary:

return {
    "answer": answer,
    "sources": docs,
    "evaluation": evaluation_results,
    "success": True
}

This shape is deliberate:

answer: can be rendered in a UI, CLI, or logged for later inspection.
sources: lets you show which documents backed the answer (e.g., for “source citations” in a frontend).
evaluation: gives your downstream systems a simple way to gate responses (e.g., only show answers where overall_quality >= 0.7).
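success: separates a normal completion (True) from the two failure paths (False). To tell the failures apart, check for the error key: only the LLM-failure return includes it, while the no-documents return carries a fallback answer string instead.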
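A downstream consumer might branch on these fields as in the sketch below. The function classify_result and its labels are hypothetical; the three sample dicts follow the return shapes shown in Steps 1 and 3 and in the final return statement.

```python
from typing import Dict


def classify_result(result: Dict) -> str:
    """Map a pipeline result dict to a coarse outcome label."""
    if result.get("success"):
        return "ok"
    # Both early returns set success=False; only the LLM failure has an "error" key.
    if "error" in result:
        return "llm_error"
    return "no_documents"


no_docs = {"answer": "No relevant information found.", "success": False}
llm_failure = {"answer": None, "error": "timeout", "success": False}
completed = {"answer": "...", "sources": [], "evaluation": {}, "success": True}

labels = [classify_result(r) for r in (no_docs, llm_failure, completed)]
print(labels)
```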

Together, this section gives you not just a RAG pipeline, but a fully traced, quality-scored RAG system that is ready to plug into dashboards, UIs, or further production hardening.

Implementing LLM Evaluation Metrics for RAG: Relevancy and Hallucination Risk

Relevancy Scoring

Relevancy is implemented in evaluate_relevancy() and answers a simple but crucial question: “How well does the model’s answer align with the retrieved documents and the user’s query?”

Your scoring function uses a lightweight keyword-overlap heuristic, which suits debugging and observability because it does not introduce another model dependency. The implementation:

@observe(name="evaluate_relevancy")
def evaluate_relevancy(query: str, retrieved_docs: List[Dict], answer: str) -> float:
    langfuse_context.update_current_observation(
        input={"query": query, "doc_count": len(retrieved_docs), "answer_length": len(answer)}
    )

    query_words = set(query.lower().split())
    answer_words = set(answer.lower().split())

    overlap_with_query = len(answer_words & query_words) / max(len(query_words), 1)

    doc_words = set()
    for doc in retrieved_docs:
        doc_words |= set(doc["content"].lower().split())

    overlap_with_docs = len(answer_words & doc_words) / max(len(answer_words), 1)

    relevancy_score = (overlap_with_query + overlap_with_docs) / 2.0

    langfuse_context.score_current_observation(
        name="relevancy",
        value=relevancy_score,
        comment="Keyword and doc overlap relevance"
    )

    langfuse_context.update_current_observation(output={"relevancy": relevancy_score})
    return relevancy_score

What the algorithm evaluates:

Query–Answer overlap: Ensures the model is addressing the question.
Document–Answer overlap: Checks that the model grounds its answer in retrieved context.
The final score is the average of both signals.

While simple, this gives you an interpretable, production-friendly metric that appears directly in Langfuse traces.
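A worked example helps show what the number means. The sketch below repeats the same arithmetic without the Langfuse calls so it runs on its own; the function name relevancy and the toy strings are made up for illustration.

```python
from typing import Dict, List


def relevancy(query: str, retrieved_docs: List[Dict], answer: str) -> float:
    """Average of query-answer overlap and document-answer overlap."""
    query_words = set(query.lower().split())
    answer_words = set(answer.lower().split())
    overlap_with_query = len(answer_words & query_words) / max(len(query_words), 1)

    doc_words = set()
    for doc in retrieved_docs:
        doc_words |= set(doc["content"].lower().split())
    overlap_with_docs = len(answer_words & doc_words) / max(len(answer_words), 1)

    return (overlap_with_query + overlap_with_docs) / 2.0


toy_query = "what is rag"
toy_docs = [{"content": "rag combines retrieval and generation"}]
toy_answer = "rag is retrieval plus generation"

# Query overlap: {"is", "rag"} out of 3 query words -> 2/3
# Doc overlap: {"rag", "retrieval", "generation"} out of 5 answer words -> 3/5
relevancy_score = relevancy(toy_query, toy_docs, toy_answer)
print(round(relevancy_score, 3))
```

Note the different denominators: the query term is normalized by the number of query words, while the document term is normalized by the number of answer words.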

Hallucination Risk Estimation

Hallucination risk is implemented in evaluate_hallucination_risk() and estimates how much of the answer is unsupported by the retrieved documents.

@observe(name="evaluate_hallucination")
def evaluate_hallucination_risk(retrieved_docs: List[Dict], answer: str) -> float:
    all_doc_words = set()
    for doc in retrieved_docs:
        all_doc_words |= set(doc["content"].lower().split())

    answer_words = set(answer.lower().split())

    grounding_ratio = len(answer_words & all_doc_words) / max(len(answer_words), 1)
    hallucination_risk = 1.0 - grounding_ratio

Interpretation:

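If nearly every word in the answer appears in the retrieved context, the hallucination risk is low.
If the answer relies heavily on words not present in any source document, the hallucination risk is high. Note that the code compares all lowercased, whitespace-separated words, not only important ones, so stop words and attached punctuation also affect the ratio.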

Langfuse logs this as:

langfuse_context.score_current_observation(
    name="hallucination_risk",
    value=hallucination_risk,
    comment="Ungrounded token ratio"
)

This trace node helps you immediately visualize how close an answer is to going “off the rails.”
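One practical consequence of splitting on whitespace is that punctuation stays attached to words, so "search." and "search," do not match. The sketch below reproduces the ratio without Langfuse and adds an optional punctuation-stripping step to show the difference. That normalization is an assumption added for illustration and is not part of the lesson's code; the function name and toy strings are hypothetical.

```python
import string
from typing import Dict, List, Set


def hallucination_risk(retrieved_docs: List[Dict], answer: str,
                       strip_punctuation: bool = False) -> float:
    """Return 1 minus the share of answer words found in the documents."""
    def words(text: str) -> Set[str]:
        tokens = text.lower().split()
        if strip_punctuation:
            # Optional normalization, not part of the lesson's implementation.
            tokens = [t.strip(string.punctuation) for t in tokens]
        return {t for t in tokens if t}

    doc_words: Set[str] = set()
    for doc in retrieved_docs:
        doc_words |= words(doc["content"])
    answer_words = words(answer)
    grounding_ratio = len(answer_words & doc_words) / max(len(answer_words), 1)
    return 1.0 - grounding_ratio


toy_docs = [{"content": "FAISS performs fast vector search."}]
toy_answer = "FAISS performs vector search, fast."

raw_risk = hallucination_risk(toy_docs, toy_answer)
normalized_risk = hallucination_risk(toy_docs, toy_answer, strip_punctuation=True)
print(round(raw_risk, 2), round(normalized_risk, 2))
```

With plain whitespace splitting, two of the five answer words fail to match only because of trailing punctuation, which raises the reported risk even though the answer is fully grounded.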

Overall Quality Metric

Your master scoring function evaluate_rag_output() combines the 2 metrics:

overall_quality = (relevancy_score + (1.0 - hallucination_risk)) / 2.0

This means:

high relevancy and low hallucination risk indicate high quality
low relevancy and high hallucination risk indicate low quality

config.yaml defines the minimum acceptable score:

evaluation:
  min_quality_score: 0.6

Then:

passed = overall_quality >= min_quality

This allows your downstream systems to treat RAG evaluation like a gatekeeper:

passed=True: show the answer to the user, store it, or send it downstream
passed=False: trigger fallback mode, self-reflection, or agentic repair workflows
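The gate can be sketched in a few lines. The formula and the 0.6 default come from the snippets above; the function names evaluate_scores and route, the routing labels, and the sample scores are hypothetical.

```python
from typing import Dict


def evaluate_scores(relevancy_score: float, hallucination_risk: float,
                    min_quality: float = 0.6) -> Dict:
    """Combine the two metrics and compare against the threshold."""
    overall_quality = (relevancy_score + (1.0 - hallucination_risk)) / 2.0
    return {
        "relevancy_score": relevancy_score,
        "hallucination_risk": hallucination_risk,
        "overall_quality": overall_quality,
        "passed": overall_quality >= min_quality,
    }


def route(evaluation: Dict) -> str:
    """Decide what a downstream system does with the answer."""
    return "show_answer" if evaluation["passed"] else "fallback"


good = evaluate_scores(0.8, 0.2)  # (0.8 + 0.8) / 2 = 0.8, above the threshold
weak = evaluate_scores(0.5, 0.7)  # (0.5 + 0.3) / 2 = 0.4, below the threshold
print(route(good), route(weak))
```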

All 3 metrics (relevancy, hallucination_risk, and overall_quality) are scored and attached to the current Langfuse span.

How Langfuse Displays Evaluation and Scoring Nodes

The evaluation subsystem produces one of the most informative trace segments in Langfuse. A typical structure:

rag_pipeline
 ├── retrieve_documents
 ├── llm_completion
 └── evaluate_rag_output
       ├── relevancy (score)
       ├── hallucination_risk (score)
       └── overall_quality (score)

Each node includes:

Inputs

user query
document count
answer length

Outputs

numeric scores
pass-or-fail status
evaluation metadata

Visual Benefits Inside Langfuse

Color-coded score nodes help you spot failing RAG runs instantly.
Timeline alignment shows you evaluation overhead and where bottlenecks appear.
Nested spans reveal exactly which part of the pipeline caused a failure.
JSON detail view allows exporting evaluation metrics for dashboards or analytics.

With these evaluation spans, your Langfuse trace evolves from a simple log viewer into a quality monitoring dashboard for your RAG system.

Running and Inspecting the RAG Pipeline End-to-End

Running rag_pipeline.py End-to-End

With all components in place (the retriever, the traced LLM wrapper, and the evaluation module), you can now run the complete production-grade RAG pipeline. The script rag_pipeline.py orchestrates the entire flow:

python src/rag_pipeline.py

This script loads documents, indexes them, retrieves the top_k matches, builds a contextual prompt, generates an answer using vLLM, evaluates the output quality, and logs every step into Langfuse. If all services are running (Langfuse UI on port 3000 and vLLM on port 8000), the run completes with a final console message showing the trace URL:

🔍 View trace: http://localhost:3000/trace/<trace_id>

This makes it trivial to jump directly into the corresponding trace in your observability dashboard and inspect the entire RAG execution, including nested spans and evaluation scores.

Example Trace Outputs

A successful run produces a hierarchical trace structure in Langfuse that mirrors your pipeline architecture. A typical RAG trace looks like this:

rag_pipeline
 ├── retrieve_documents
 │     ├── embed_text
 │     └── FAISS search metadata
 ├── llm_completion
 │     ├── request payload
 │     ├── response payload
 │     └── token usage
 └── evaluate_rag_output
        ├── relevancy (score)
        ├── hallucination_risk (score)
        └── overall_quality (score)

What you will see in the trace:

Retrieval metadata

top_k value
query text
relevance scores
FAISS distances
document preview snippets

LLM generation metadata

system and user messages used for prompting
token usage breakdown
retry attempts
vLLM latency and response time

Evaluation metrics

numeric relevancy score
hallucination risk estimation
overall quality score
pass-or-fail decision using the threshold in config.yaml

Together, these give you a full audit trail for each RAG run, which is perfect for debugging, monitoring, or offline analysis.

Debugging with the Langfuse UI (Span Trees, Scores, and Metadata)

Langfuse is not just a logger; it acts as a visual debugger for your entire RAG system. When you open the trace URL, you will see several powerful debugging tools:

Span Tree View

This hierarchical tree shows the exact execution order and timing of:

retrieval
embedding
indexing
LLM generation
evaluation steps

It helps you detect:

slow spans (bottlenecks)
failed or retried LLM calls
missing or empty retrieval results

Scoring Nodes

Evaluation scores appear as structured nodes:

relevancy
hallucination_risk
overall_quality

Langfuse color-codes these (green, yellow, and red), making it instantly clear when a RAG answer is degrading in quality.

Metadata Panels

Each span contains:

input and output payloads
token counts
FAISS distances
processed document counts
retry counts
trace-level summaries

This makes debugging extremely fast:

Wrong documents retrieved? Inspect retrieval span input and output.
Unexpected LLM answer? Check the exact prompt in the generation span.
Poor evaluation scores? Expand the scoring spans to see the raw metrics.

Because traces are stored locally in your self-hosted Langfuse instance, you get complete transparency without relying on cloud telemetry.

Viewing RAG Traces, Spans, and Scores in Langfuse

Once your RAG pipeline is running end-to-end, the real magic happens inside Langfuse. This is where retrieval steps, LLM calls, evaluation metrics, token usage, and pipeline-level metadata condense into a single, navigable trace. In this section, you will learn how to interpret that trace, span by span, so you can debug, understand, and improve RAG behavior with production-grade visibility.

Understanding Hierarchical Spans (Retrieve → Prompt → Generate → Evaluate)

Langfuse automatically groups each step of your rag_pipeline into a nested hierarchy of spans. A typical RAG trace looks like this:

rag_pipeline (root trace)
- retrieve_documents
  - embed_text
- llm_completion
- evaluate_rag_output
  - evaluate_relevancy
  - evaluate_hallucination
  - scores (relevancy, hallucination_risk, overall_quality)

This hierarchy corresponds directly to your source code:

TracedRetriever.retrieve(): retrieval span
TracedLLMClient.complete(): generation span
evaluate_rag_output(): evaluation span

**Figure 1:** Hierarchical spans created automatically by the RAG pipeline. Notice the Retrieve → Generate → Evaluate structure.

How to Navigate the Span Tree

Each span reveals:

execution time (critical for latency bottlenecks)
inputs and outputs captured via langfuse_context.update_current_observation()
whether nested operations (e.g., embedding calls) executed successfully
metadata from FAISS search, document previews, and query text

Langfuse becomes a timeline and debugger for your RAG system.

Inspecting Retrieval: Document Scores and Previews

The retrieval stage is your first major insight point. The retrieve_documents span logs:

the query that was embedded
the top_k used for FAISS search
distance scores returned
converted relevancy scores (your 1/(1+d) heuristic)
ranked documents with text previews

**Figure 2:** Retrieval span showing FAISS scores, document previews, and ranked results.

What to Look For

High distances and low scores: embedding mismatch or poor docs
Same document repeatedly ranking #1: indexing error
Empty results: index not built or FAISS dimension mismatch

Embedded Text Span

The embed_text span shows a preview of the text that was embedded. Use it to:

inspect the embedding length (output vector dimension)
detect empty or malformed documents
verify the embedding model configuration

**Figure 3:** Embedding span showing text preview and output vector dimension.

Inspecting Prompt Construction (Optional View)

Prompt creation happens between retrieval and generation. Although you do not create a separate Langfuse span for this step, the constructed prompt appears inside the LLM span input.

**Figure 4:** Prompt passed to vLLM including context numbered `[1]`, `[2]`, and `[3]`.

What you verify here:

context formatting
numbering
whitespace
hallucination-reducing system instructions

This becomes essential when debugging wrong answers.

Token Usage and Generation Metadata

Inside the llm_completion span, Langfuse records:

input tokens
output tokens
total tokens
retry count
model name
latency breakdown
response content

**Figure 5:** LLM span showing the request payload, response payload, token usage, retry-attempt metadata, and timing.

What to Look For

Unusually high input tokens: prompt too large
High output tokens: model drifting or verbose
Repeated retries: vLLM throughput issue
Very long latency: GPU under-provisioned or CPU fallback

Evaluation Scoring Nodes (Relevancy, Hallucination, and Overall Quality)

Your evaluation functions (evaluate_relevancy, evaluate_hallucination_risk, evaluate_rag_output) create 3 scoring nodes inside Langfuse:

relevancy
hallucination_risk
overall_quality

These appear alongside the evaluation span.

**Figure 6:** Langfuse scoring nodes: relevancy, hallucination risk, and overall quality.

How to Interpret Them

High relevancy and low hallucination risk: high overall_quality
Low relevancy and high hallucination risk: RAG failure
passed=True means the response met the min_quality_score threshold in config.yaml

Debugging Failures

Relevancy low: retrieval needs improvement
Hallucination risk high: prompt needs stronger grounding
Both poor (low relevancy and high hallucination risk): LLM ignoring context, bad retrieval, or noisy docs
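The triage rules above can be sketched as a small function that you could run over exported scores. The function name diagnose and the 0.5 cutoffs are placeholders chosen for illustration, not values from the lesson; tune them against your own traces.

```python
def diagnose(relevancy: float, hallucination_risk: float,
             low_relevancy: float = 0.5, high_risk: float = 0.5) -> str:
    """Suggest where to look first, given the two evaluation scores."""
    relevancy_is_low = relevancy < low_relevancy
    risk_is_high = hallucination_risk > high_risk
    if relevancy_is_low and risk_is_high:
        return "check retrieval, document quality, and context usage"
    if relevancy_is_low:
        return "improve retrieval"
    if risk_is_high:
        return "strengthen grounding in the prompt"
    return "no obvious issue"


print(diagnose(0.3, 0.2))
print(diagnose(0.8, 0.7))
print(diagnose(0.3, 0.7))
print(diagnose(0.8, 0.2))
```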

Visual Timeline and Performance Profiling

The timeline view shows exact timings:

embedding
retrieval
prompt construction
LLM generation
evaluation

This allows profiling end-to-end latency.

**Figure 7:** Timeline visualization showing latency distribution across RAG stages, including embedding, retrieval, LLM generation, and evaluation.
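If you want a quick latency breakdown in the console as well, a small timing helper gives a similar per-stage view. This is a sketch independent of Langfuse: the timed context manager and the stage names are hypothetical, and time.sleep stands in for the real work.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator

stage_timings: Dict[str, float] = {}


@contextmanager
def timed(stage: str) -> Iterator[None]:
    """Record wall-clock seconds spent inside the with-block for a stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_timings[stage] = time.perf_counter() - start


with timed("retrieval"):
    time.sleep(0.01)   # stand-in for embedding + FAISS search
with timed("llm_generation"):
    time.sleep(0.03)   # stand-in for the vLLM request
with timed("evaluation"):
    time.sleep(0.005)  # stand-in for the scoring functions

total_seconds = sum(stage_timings.values())
for stage, seconds in stage_timings.items():
    print(f"{stage}: {seconds * 1000:.1f} ms ({seconds / total_seconds:.0%})")
```

The percentages make it easy to see which stage dominates a run before you open the trace for details.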

How Langfuse Helps Production Debugging

Langfuse tracing helps answer real production questions:

“Why was this answer wrong?”

Open the evaluation spans, review the hallucination score, inspect the prompt, and then inspect the retrieved documents.

“Which part is slowing down?”

Open the timeline and locate the bottleneck, which is often embeddings or the LLM.

“Did the LLM actually use the retrieved documents?”

Compare:

retrieval previews
answer keywords
relevancy score

“Why did this query fail?”

The trace will show:

empty index
retries
exceptions
malformed inputs
missing environment variables

In production, this becomes indispensable.

Summary

In this lesson, you built a fully instrumented, production-grade RAG pipeline and learned how observability transforms retrieval-augmented systems from “black boxes” into transparent, debuggable, measurable workflows. You started by setting up the core infrastructure (self-hosted Langfuse, vLLM for fast local inference, and FAISS and SentenceTransformers for efficient retrieval) and then wired all these components together using a clean, traceable architecture.

With tracing enabled end-to-end, every stage of your RAG pipeline became inspectable: document embedding, FAISS indexing, retrieval scoring, prompt construction, LLM generation, and quality evaluation. You saw how Langfuse automatically visualizes these steps as nested spans, how it captures token usage and metadata for LLM calls, and how your evaluation functions produce relevancy, hallucination risk, and overall-quality scores directly inside the trace.

By running the pipeline and examining the traces, you learned how to debug retrieval quality, diagnose prompt-related issues, inspect model behavior, and identify performance bottlenecks using Langfuse’s hierarchical tree view and timeline profiler. The final result is an observability-first RAG stack: fully local, fast, and transparent, designed exactly the way production systems must operate.

This foundation prepares you for upcoming lessons, where we extend the same tracing principles to multi-step agents, adding reasoning chains, intent analysis, and multi-span agent workflows on top of the RAG engine you constructed here.

Citation Information

Singh, V. “RAG Observability with Langfuse, vLLM, and FAISS,” PyImageSearch, S. Huot, A. Sharma, and P. Thakur, eds., 2026, https://pyimg.co/g20yk

@incollection{Singh_2026_rag-observability-langfuse-vllm-faiss,
  author = {Vikram Singh},
  title = {{RAG Observability with Langfuse, vLLM, and FAISS}},
  booktitle = {PyImageSearch},
  editor = {Susan Huot and Aditya Sharma and Piyush Thakur},
  year = {2026},
  url = {https://pyimg.co/g20yk},
}
