Table of Contents
- Semantic Caching for LLMs: FastAPI, Redis, and Embeddings
- Introduction: Why Semantic Caching Matters for LLM Systems
- How Semantic Caching Works for LLMs: Embeddings and Similarity Search Explained
- Semantic Caching Architecture and Request Flow
- Configuring Your Environment for Semantic Caching: FastAPI, Redis, and Ollama Setup
- Project Structure
- FastAPI Entry Point for Semantic Caching: Wiring the API Service
- FastAPI Ask Endpoint: End-to-End Semantic Caching Request Flow
- Embeddings: Turning Text into Semantic Vectors
- The Semantic Cache: Cosine Similarity, Redis Storage, and Reusing Meaning
- Cache Entries: What Exactly Gets Stored?
- End-to-End Demo: Verifying Core Cache Behavior
- Summary
Semantic Caching for LLMs: FastAPI, Redis, and Embeddings
In this lesson, you will learn how to build a semantic cache for LLM applications using FastAPI, Redis, and embedding-based similarity search, and how requests flow from exact matches to semantic matches before falling back to the LLM.
This lesson is the 1st in a 2-part series on Semantic Caching for LLMs:
- Semantic Caching for LLMs: FastAPI, Redis, and Embeddings (this tutorial)
- Lesson 2
To learn how to build a semantic cache for LLM applications using embeddings and Redis, just keep reading.
Introduction: Why Semantic Caching Matters for LLM Systems
Cost, Latency, and Redundant LLM Calls
Large language models are powerful, but they are not cheap. Every request to an LLM involves tokenization, inference, decoding, and network overhead. Even when models are hosted locally, response times are measured in hundreds of milliseconds or seconds rather than microseconds.
In real applications, this cost compounds quickly. Users often ask similar questions repeatedly, either across sessions or within the same workflow. Each request is treated as a fresh LLM invocation, even when the underlying intent has already been handled before.
This leads to 3 systemic problems:
- High latency: Users wait for responses that could have been reused instantly
- Increased cost: Identical reasoning is paid for multiple times
- Wasted capacity: LLM throughput is consumed by redundant requests
These issues become especially visible under load, where repeated paraphrased queries can overwhelm an otherwise well-sized system.
Why Exact-Match Caching Breaks Down for Natural Language
Traditional caching assumes that identical inputs produce identical outputs. This works well for APIs, database queries, and deterministic functions. It fails for natural language.
From a string-matching perspective, the following queries are completely unrelated:
- “What is semantic caching?”
- “Can you explain how semantic caching works?”
- “How does caching based on embeddings work for LLMs?”
A traditional cache keyed on raw strings will miss all three. As a result, the system calls the LLM three times, even though a human would expect the same answer.
This brittleness causes exact-match caches to have extremely low hit rates in LLM-backed systems. Worse, it gives a false sense of optimization. The cache exists, but it almost never helps in practice.
Where Semantic Caching Fits in Real Systems
Semantic caching addresses this mismatch by caching meaning instead of exact text.
Rather than asking “have I seen this string before?”, a semantic cache asks “have I answered something semantically similar before?”. It does this by converting queries into embeddings and comparing them using a similarity metric such as cosine similarity.
In a real system, semantic caching sits between the application layer and the LLM:
- The application sends a query
- The cache evaluates whether a prior response is reusable
- Only true cache misses reach the LLM
When designed correctly, this layer is invisible to the user. Responses feel faster, costs drop, and the system scales more gracefully without changing the frontend or prompt logic.
This lesson focuses on building that layer explicitly and transparently, using FastAPI, Redis, and embeddings, without hiding the mechanics behind heavy abstractions.

Exact-match caching treats paraphrased queries as unique requests, resulting in repeated LLM calls. Semantic caching groups similar queries by meaning, allowing responses to be reused and reducing both latency and cost.
How Semantic Caching Works for LLMs: Embeddings and Similarity Search Explained
Section 1 explained why semantic caching exists.
This section explains how it works, conceptually, before we touch any FastAPI, Redis, or code.
The goal here is to give the reader a mental execution model they can keep in their head while reading the implementation.
From Text to Meaning: Embeddings as the Cache Key
Semantic caching replaces raw text comparison with vector similarity.
Instead of caching responses under the literal query string, the system converts each query into an embedding: a high-dimensional numeric vector that captures semantic meaning. Queries that are worded differently but mean the same thing produce embeddings that are close together in vector space.
This is what allows the cache to recognize paraphrases as equivalent:
- “How do I reset my password?”
- “I forgot my password, what should I do?”
- “Guide me through password recovery”
The strings differ, but their embeddings land close together in vector space.
At a high level, semantic caching works by:
- Generating an embedding for the incoming query
- Comparing it against embeddings stored in the cache
- Reusing a cached response if similarity is high enough
The similarity metric used in this lesson is cosine similarity, which measures the angle between two vectors rather than their raw magnitude.
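To make the metric concrete, here is a minimal NumPy sketch (the helper name and toy vectors are purely illustrative, not part of the lesson's codebase):

import numpy as np

def cosine_similarity(a, b):
    # Angle-based similarity: 1.0 means same direction, 0.0 means orthogonal.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two paraphrases should score close to 1.0; unrelated text scores much lower.
print(cosine_similarity([0.9, 0.1, 0.3], [0.85, 0.15, 0.28]))
print(cosine_similarity([0.9, 0.1, 0.3], [0.0, 1.0, 0.0]))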
Why a Layered Cache Beats Semantic-Only Caching
While semantic matching is powerful, it is also computationally expensive.
Embedding generation requires a model call. Similarity search requires vector math. Doing this for every request, even when the exact same query has already been seen, would be wasteful.
That is why this lesson uses a layered caching strategy.
Layer 1: Exact Match (Fast Path)
The query is normalized and hashed.
If the same query has already been answered, the response is returned immediately.
- No embedding generation
- No similarity computation
- Minimal latency
This handles repeated identical queries efficiently.
Layer 2: Semantic Match (Flexible Path)
If no exact match exists, the query is embedded and compared against cached embeddings.
This layer catches:
- paraphrases
- minor wording differences
- reordered phrases
Semantic matches trade compute cost for much higher cache hit rates.
Layer 3: LLM Fallback (Slow Path)
If neither exact nor semantic matches succeed, the request is forwarded to the LLM.
The response is then stored in the cache so future requests can reuse it.
This layered approach ensures:
- the cheapest checks happen first
- expensive operations are only used when necessary
Confidence, Freshness, and Cache Safety
Semantic similarity alone is not enough to decide whether a cached response should be reused.
This lesson introduces the idea of confidence scoring, which combines:
- Similarity: how close the embeddings are
- Freshness: how old the cached entry is
A highly similar but stale response should not necessarily be trusted. Likewise, a fresh response with low similarity should be rejected.
In addition, cached entries are validated to prevent:
- expired responses
- poisoned entries (errors, empty outputs)
These checks ensure the cache improves correctness and performance rather than degrading them.
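The lesson does not prescribe a single formula for confidence, but a reasonable sketch is to blend similarity with a freshness factor derived from the entry's age and TTL (the weights below are an assumption for illustration only):

import time

def compute_confidence(similarity: float, created_at: int, ttl: int) -> float:
    # Freshness decays linearly from 1.0 (just cached) to 0.0 (expired).
    age = time.time() - created_at
    freshness = max(0.0, 1.0 - age / ttl)
    # Similarity dominates; freshness tempers highly similar but stale entries.
    return 0.8 * similarity + 0.2 * freshness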
Incoming queries first attempt an exact-match lookup, then fall back to semantic similarity search using embeddings, and finally call the LLM only on cache miss. This ordering minimizes latency and unnecessary model calls.
Note: In this lesson, we implement this flow using Redis as a simple embedding store with linear similarity scans, rather than a dedicated vector database.
Semantic Caching Architecture and Request Flow
In Section 2, you learned how semantic caching works conceptually.
In this section, we map that mental model to a real request flow in an LLM-backed service.
The goal is to answer one question clearly:
What happens, step by step, when a user sends a request to this system?
We will stay implementation-aware, but not code-specific yet. That comes next.
High-Level System Components
At a high level, the system consists of 5 logical components:
- API layer: Receives user requests and orchestrates the caching pipeline.
- Exact-match cache: Performs fast hash-based lookups for identical queries.
- Embedding model: Converts text queries into semantic vectors when needed.
- Semantic cache: Stores embeddings and responses and performs similarity matching.
- LLM: Acts as the final fallback when no cache entry is suitable.
Each component has a narrowly defined responsibility. This separation is intentional and keeps the system easy to reason about and extend.
In this implementation:
- The API layer is built using FastAPI and acts as the orchestration point.
- Redis is used as the backing store for both exact-match and semantic cache layers.
- Ollama provides both embedding generation and LLM inference locally.
These choices keep the system lightweight, self-contained, and easy to reason about while still reflecting real production patterns.
End-to-End Request Flow
When a user sends a query, the system processes it in the following order.
Step 1: Request enters the API
The API receives a text query along with optional flags, such as bypass_cache. Input validation happens immediately to prevent meaningless or malformed queries from entering the pipeline.
This ensures the cache is not polluted with empty or invalid entries.
Step 2: Exact-match cache lookup
The query is normalized and hashed.
The system checks whether an identical query has already been answered.
- If an exact match exists and is valid, the response is returned immediately.
- No embeddings are generated.
- The LLM is not touched.
This is the fastest possible path through the system.
Step 3: Embedding generation
If the exact-match lookup fails, the query is passed to the embedding model.
The model converts the text into a numeric vector that captures semantic meaning. This vector becomes the key for semantic comparison.
This step is intentionally skipped when an exact match succeeds.
Step 4: Semantic cache lookup
The embedding is compared against cached embeddings using a similarity metric.
A cached response is reused only if:
- similarity exceeds a defined threshold
- the entry has not expired
- the entry is not poisoned
- the computed confidence is high enough
If a suitable match is found, the response is returned to the user without calling the LLM.
Step 5: LLM fallback and cache population
If both cache layers miss, the request is forwarded to the LLM.
Once a response is generated:
- it is returned to the user
- it is stored in the cache with metadata, timestamps, and TTL (Time To Live)
This ensures future requests can reuse the result.
Why This Architecture Works Well
This architecture is intentionally conservative and explicit.
- Cheap operations happen first.
- Expensive operations are deferred.
- Every step is observable and debuggable.
- No component hides complexity behind opaque abstractions.
Most importantly, the system degrades gracefully. Even when the cache provides no benefit, the request still succeeds via the LLM.

User queries enter the API, attempt an exact-match lookup, fall back to semantic similarity search using embeddings, and call the LLM only when both cache layers miss. Successful LLM responses are stored for future reuse.
Configuring Your Environment for Semantic Caching: FastAPI, Redis, and Ollama Setup
To follow this guide, you need a small set of Python libraries and system services that support API orchestration, vector similarity, and LLM interaction. The goal is to keep the environment lightweight, reproducible, and easy to reason about.
At a minimum, you will need:
- Python 3.10 or newer
- Redis (used as the cache backing store)
- An LLM + embedding provider (Ollama in this tutorial)
All required Python dependencies are pip-installable.
Installing Python Dependencies
Create and activate a virtual environment (recommended), then install the required packages:
$ pip install fastapi uvicorn redis httpx python-dotenv numpy
These libraries provide the following functionality:
- FastAPI: API layer and request orchestration
- Uvicorn: ASGI server for running the service
- Redis client: Communication with the cache store
- httpx: HTTP client for embedding and LLM calls
- NumPy: Vector math for cosine similarity
- python-dotenv: Environment-based configuration
Verifying Redis
This lesson assumes Redis is running locally on the default port.
You can verify Redis is available with:
$ redis-cli ping
PONG
If Redis is not installed, you can start it quickly using Docker (you can also spin it up using the docker-compose.yml we provide in the code download):
$ docker run -p 6379:6379 redis:7
Setting Up Ollama
This system uses Ollama for both embedding generation and LLM inference. Make sure Ollama is installed and running, and that the required models are available.
For example:
$ ollama pull nomic-embed-text
$ ollama pull llama3.2
Once running, Ollama exposes local HTTP endpoints that the application will call directly for embeddings and text generation.
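As a quick sanity check, you can call the embedding endpoint directly; the request shape matches what our embedder sends later (the exact vector length depends on the model you pulled):

$ curl http://localhost:11434/api/embeddings \
    -d '{"model": "nomic-embed-text", "prompt": "hello world"}'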
Need Help Configuring Your Development Environment?

All that said, are you:
- Short on time?
- Learning on your employer’s administratively locked system?
- Wanting to skip the hassle of fighting with the command line, package managers, and virtual environments?
- Ready to run the code immediately on your Windows, macOS, or Linux system?
Then join PyImageSearch University today!
Gain access to Jupyter Notebooks for this tutorial and other PyImageSearch guides pre-configured to run on Google Colab’s ecosystem right in your web browser! No installation required.
And best of all, these Jupyter Notebooks will run on Windows, macOS, and Linux!
Project Structure
Before diving into individual components, let’s take a moment to understand how the project is organized.
A clear directory structure is especially important in LLM-backed systems, where responsibilities span API orchestration, caching, embeddings, model calls, and observability. In this project, each concern is isolated into its own module so the request flow remains easy to trace and reason about.
After downloading the source code from the “Downloads” section, your directory structure should look like this:
.
├── app
│   ├── api
│   │   ├── __init__.py
│   │   └── ask.py
│   ├── cache
│   │   ├── __init__.py
│   │   ├── poisoning.py
│   │   ├── schemas.py
│   │   ├── semantic_cache.py
│   │   └── ttl.py
│   ├── config
│   │   ├── __init__.py
│   │   └── settings.py
│   ├── embeddings
│   │   ├── __init__.py
│   │   └── embedder.py
│   ├── llm
│   │   ├── __init__.py
│   │   └── ollama_client.py
│   ├── main.py
│   └── observability
│       └── metrics.py
├── complete-codebase.txt
├── docker-compose.yml
├── Dockerfile
├── README.md
└── requirements.txt
Let’s break this down at a high level.
The app/ Package
The app/ directory contains all runtime application code. Nothing outside this folder is imported at execution time.
This keeps the service self-contained and makes it easy to reason about deployment and dependencies.
app/main.py: Application Entry Point
This file defines the FastAPI application and registers all routers.
It contains no business logic — only service wiring. Every request into the system enters through this file.
app/api/: API Layer
The api/ package defines HTTP-facing endpoints.
- ask.py: Implements the /ask endpoint and acts as the orchestration layer for the entire semantic caching pipeline.
The API layer is responsible for:
- input validation
- enforcing cache ordering
- coordinating cache, embeddings, and LLM calls
- returning structured debug information
It does not implement caching or similarity logic directly.
app/cache/: Caching Logic
This package contains all cache-related functionality.
- semantic_cache.py: Core semantic cache implementation (exact match, semantic match, Redis storage, similarity search).
- schemas.py: Defines the cache entry schema used for Redis storage.
- ttl.py: Application-level TTL configuration and expiration checks.
- poisoning.py: Safety checks to prevent invalid or error responses from being reused.
By isolating caching logic here, the API layer stays clean and reusable.
app/embeddings/: Embedding Generation
embedder.py: Handles embedding generation via Ollama’s embedding endpoint.
This module has a single responsibility: convert text into semantic vectors.
It does not cache, rank, or validate embeddings.
app/llm/: LLM Client
ollama_client.py: Wraps calls to the Ollama text-generation endpoint.
Keeping LLM interaction isolated allows the rest of the system to remain model-agnostic.
app/observability/: Metrics
metrics.py: Implements simple in-memory counters for cache hits, misses, and LLM calls.
These metrics are intentionally lightweight and meant for learning and debugging, not production monitoring.
Configuration and Infrastructure
The remaining configuration and infrastructure files:
- app/config/settings.py: Centralizes environment-based configuration (Redis host, TTLs, model names).
- Dockerfile and docker-compose.yml: Define a reproducible runtime environment for the API and Redis.
- requirements.txt: Lists all Python dependencies required to run the service.
FastAPI Entry Point for Semantic Caching: Wiring the API Service
Before we look at caching logic, embeddings, or Redis, it’s important to understand how the service itself is wired together. Every request to the semantic cache enters the system through a single FastAPI application, defined in app/main.py.
This file acts as the entry point of the service. Its responsibility is not to implement business logic, but to connect the application components and expose HTTP routes.
Application Entry Point (app/main.py)
from fastapi import FastAPI

from api.ask import router as ask_router

app = FastAPI(title="Semantic Cache Basics")
app.include_router(ask_router)
Let’s break this down.
The FastAPI() call creates the application object. This object represents the entire web service and is what the ASGI (Asynchronous Server Gateway Interface) server (uvicorn) runs when the container starts.
The application itself contains no knowledge of caching, embeddings, or LLMs. It simply defines a runtime container that will host those capabilities.
Router Registration
Instead of defining endpoints directly in main.py, the application imports a router from api/ask.py and registers it using include_router().
This pattern serves several purposes:
- Separation of concerns: Routing and request handling live outside the application entry point.
- Scalability: As the system grows, additional routers (for health checks, metrics, or admin endpoints) can be added without modifying core application wiring.
- Readability: main.py remains easy to understand at a glance, even as the codebase expands.
At runtime, FastAPI merges the routes defined in ask_router into the main application. When a request arrives at the /ask endpoint, FastAPI resolves it through the registered router and forwards it to the appropriate handler function.
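For reference, a minimal sketch of what api/ask.py declares looks like the following; the request model and full handler (covered in the next section) replace the placeholder body:

from fastapi import APIRouter
from pydantic import BaseModel

router = APIRouter()

class AskRequest(BaseModel):
    query: str
    bypass_cache: bool = False

@router.post("/ask")
def ask_endpoint(request: AskRequest):
    # Placeholder body; the full handler orchestrates exact-match lookup,
    # semantic search, and LLM fallback.
    return {"response": "...", "from_cache": False}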
Why This Matters
Keeping the entry point minimal is intentional. It ensures that:
- The application startup process is predictable
- Routing logic is easy to trace
- Core functionality can evolve independently of service wiring
With the application structure in place, we can now focus on what actually happens when a request reaches the system.
In the next section, we will walk through the /ask endpoint and see how it orchestrates exact-match caching, semantic search, and LLM fallback step by step.
FastAPI Ask Endpoint: End-to-End Semantic Caching Request Flow
This section makes the architecture concrete. We now walk through the /ask endpoint, which orchestrates the entire semantic caching pipeline from request arrival to response delivery.
The goal here is not to memorize code, but to understand why each step exists, where it lives, and how it protects performance, cost, and correctness.
The Role of the Ask Endpoint
The Ask endpoint is the control plane of the system.
It does not:
- Compute similarity
- Store embeddings
- Talk directly to Redis internals
Instead, it:
- Validates input
- Decides which cache layers to consult
- Enforces ordering between cheap and expensive operations
- Collects observability signals
- Guarantees a response even on cache failure
This separation is intentional. Cache logic remains reusable and testable, while orchestration logic stays explicit at the API boundary.
Defining the API Contract
We begin by defining the request and response models.
class AskRequest(BaseModel):
query: str
bypass_cache: bool = False
The request consists of a user query and an optional bypass_cache flag. This flag allows us to force a cache miss during debugging or testing, ensuring that the LLM and embedding pipeline still function correctly.
Before the request ever reaches the cache, the query field is validated.
@field_validator('query')
@classmethod
def validate_query(cls, v: str) -> str:
if not v or not v.strip():
raise ValueError("Query cannot be empty or whitespace-only")
return v.strip()
This validation step protects the system at the boundary. Rejecting empty or whitespace-only queries prevents:
- wasted embedding computation
- cache pollution with meaningless entries
- unnecessary LLM calls
This is a recurring pattern in production systems: fail fast, before expensive operations are triggered.
class AskResponse(BaseModel):
response: str
from_cache: bool
similarity: float
debug: dict
The response model intentionally exposes diagnostic information through fields such as from_cache, similarity, and debug. During development, this makes cache behavior transparent rather than opaque.
Initializing the Cache
Before handling requests, we create a SemanticCache instance:
cache = SemanticCache()
The endpoint itself remains stateless. All persistence and reuse live inside the cache layer.
Step 1: Entering the Endpoint
The endpoint is registered using FastAPI’s routing mechanism:
@router.post("/ask", response_model=AskResponse)
def ask_endpoint(request: AskRequest):
FastAPI automatically validates incoming requests and outgoing responses using the schemas defined earlier. If invalid data enters or exits the system, FastAPI raises an error instead of silently failing.
Inside the handler, we extract the query and initialize tracking state:
query = request.query
miss_reason = None
The miss_reason variable exists purely for observability. Rather than treating cache misses as a black box, we explicitly track why a miss occurred.
Step 2: Exact-Match Cache Lookup (Fast Path)
The first decision point is the exact-match cache lookup:
if not request.bypass_cache:
cached = cache.search(None, exact_query=query)
This is the cheapest path through the system.
If the same query has already been answered, the response can be returned immediately:
- no embeddings are generated
- no similarity computation occurs
- the LLM is not touched
If a cached entry is found, it is validated:
if is_expired(cached):
miss_reason = "expired"
elif is_poisoned(cached):
miss_reason = "poisoned"
elif cached.get("confidence", 0.0) < 0.7:
miss_reason = "low_confidence"
Only entries that are fresh, valid, and confident are allowed to short-circuit the pipeline.
When all checks pass, the endpoint returns immediately:
metrics.cache_hit()
return AskResponse(...)
This path typically completes in milliseconds and handles repeated identical queries efficiently.
Step 3: Embedding Generation (Escalation Point)
If the exact-match lookup fails or is bypassed, the endpoint escalates:
embedding = embed_text(query)
Embedding generation is expensive, even when running locally. For this reason, it is intentionally delayed until all cheaper options have been exhausted.
This single design choice has a significant impact on system efficiency.
Step 4: Semantic Cache Lookup
With the embedding available, the endpoint attempts a semantic search:
cached = cache.search(embedding)
This path catches paraphrased and reworded queries. As before, cached entries are validated to ensure they are safe to reuse.
If a suitable match is found, the response is returned without calling the LLM.
Step 5: Explicit Cache Bypass
The bypass_cache flag is handled explicitly:
if request.bypass_cache:
miss_reason = "bypass"
This allows controlled testing and debugging without modifying code or disabling cache logic globally.
Step 6: LLM Fallback and Cache Population
If both cache layers miss, the request is forwarded to the LLM:
metrics.cache_miss()
response = generate_llm_response(query)
metrics.llm_call()
This is the slowest path through the system, but it guarantees correctness.
Successful responses are stored in the cache:
if not response.startswith("[LLM Error]"):
cache.store(query, embedding, response, metadata=metadata)
Responses beginning with [LLM Error] are intentionally not cached, preventing cache poisoning and ensuring failures do not propagate to future requests.
Control Flow Summary
The endpoint follows a simple, explicit sequence: validate input, try the exact-match cache, generate an embedding only if needed, try the semantic cache, and fall back to the LLM as a last resort.
Every expensive operation is deferred until absolutely necessary.
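Pulling the fragments above together, a condensed sketch of the handler looks roughly like this (names follow the earlier snippets; confidence checks and debug payloads are simplified for illustration):

@router.post("/ask", response_model=AskResponse)
def ask_endpoint(request: AskRequest):
    query = request.query
    embedding = None

    if not request.bypass_cache:
        # Layer 1: exact match (cheapest path).
        cached = cache.search(None, exact_query=query)
        if cached and not is_expired(cached) and not is_poisoned(cached):
            metrics.cache_hit()
            return AskResponse(response=cached["response"], from_cache=True,
                               similarity=1.0, debug={"layer": "exact"})

        # Layer 2: semantic match (embedding + similarity scan).
        embedding = embed_text(query)
        cached = cache.search(embedding)
        if cached and not is_expired(cached) and not is_poisoned(cached):
            metrics.cache_hit()
            return AskResponse(response=cached["response"], from_cache=True,
                               similarity=cached.get("similarity", 0.0),
                               debug={"layer": "semantic"})

    # Layer 3: LLM fallback, then populate the cache for future reuse.
    metrics.cache_miss()
    response = generate_llm_response(query)
    metrics.llm_call()
    if embedding is None:
        embedding = embed_text(query)
    if not response.startswith("[LLM Error]"):
        cache.store(query, embedding, response, metadata={})
    return AskResponse(response=response, from_cache=False,
                       similarity=0.0, debug={"layer": "llm"})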
Embeddings: Turning Text into Semantic Vectors
Up to this point, we have treated embeddings as a black box: something expensive that we try to avoid unless absolutely necessary.
In this section, we will open that box just enough to understand what embeddings are, when they are generated, and why they enable semantic caching without diving into vector math or model internals.
Why Embeddings Exist in This System
Exact-match caching works only when queries are identical at the string level. As soon as wording changes, exact matching breaks down.
Embeddings solve this problem by converting text into a numeric representation that captures meaning rather than surface form.
Queries that mean the same thing tend to produce vectors that are close together in vector space, even if their wording differs significantly.
This is the foundation that makes semantic caching possible.
Embedding Generation Happens on Demand
In our implementation, embeddings are generated only after the exact-match cache fails.
This decision is intentional.
Embedding generation involves:
- a model invocation
- network overhead
- serialization and deserialization
- non-trivial latency
Because of this cost, embeddings are treated as an escalation step, not a default operation.
This is why the /ask endpoint first attempts an exact-match lookup before calling embed_text().
The embed_text Function
def embed_text(text: str):
This function has one responsibility: Convert input text into a semantic vector representation.
It does not perform caching, similarity search, or validation. Those concerns live elsewhere.
Calling the Embedding Model
url = f"http://{settings.OLLAMA_HOST}:{settings.OLLAMA_PORT}/api/embeddings"
Here, we construct the Ollama embedding endpoint using configuration values (e.g., settings.OLLAMA_HOST, settings.OLLAMA_PORT, etc.).
This allows the embedding service to run locally, inside Docker, or on a remote host without changing code.
resp = httpx.post(
url,
json={"model": settings.EMBEDDING_MODEL, "prompt": text},
timeout=10.0
)
This request sends 2 key pieces of information to the embedding service:
- the embedding model name (e.g., nomic-embed-text)
- the input text to embed
The timeout ensures the request does not hang indefinitely. Embedding generation is expensive, but it should still fail fast if something goes wrong.
Handling the Response
resp.raise_for_status()
return resp.json().get("embedding", [])
If the request succeeds, the embedding model returns a numeric vector — typically a list of floating-point values.
This vector represents the semantic meaning of the input text and becomes the key used for similarity comparison in the cache.
At this stage, we treat the vector as an opaque object. We do not inspect its dimensionality or normalize it here.
Error Handling Strategy
except Exception as e:
raise RuntimeError(f"Failed to generate embedding: {e}")
If embedding generation fails for any reason (network issues, model errors, timeouts), the function raises an exception.
This is intentional.
If embeddings cannot be generated, the system cannot safely perform semantic matching. Silently continuing would lead to unpredictable behavior, so we fail loudly instead.
Why the Embedder Is Intentionally Simple
Notice what this function does not do:
- it does not store embeddings
- it does not perform similarity search
- it does not retry failed requests
- it does not fall back to alternative models
Those decisions are deliberate.
For Lesson 1, the embedder exists purely to convert text into vectors. Keeping it small and focused makes the system easier to understand and test.
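Assembled from the fragments above, the complete function is roughly the following (the settings import path is assumed from the project layout; error handling wraps the HTTP call in a try/except):

import httpx

from config.settings import settings  # assumed import path; see app/config/settings.py

def embed_text(text: str):
    # Build the Ollama embeddings endpoint from configuration.
    url = f"http://{settings.OLLAMA_HOST}:{settings.OLLAMA_PORT}/api/embeddings"
    try:
        resp = httpx.post(
            url,
            json={"model": settings.EMBEDDING_MODEL, "prompt": text},
            timeout=10.0,
        )
        resp.raise_for_status()
        # The embedding is returned as a list of floats.
        return resp.json().get("embedding", [])
    except Exception as e:
        # Fail loudly: without an embedding, semantic matching is unsafe.
        raise RuntimeError(f"Failed to generate embedding: {e}")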
How the Embedder Is Used in the Pipeline
At runtime, the embedder is called only when necessary:
- Exact-match cache fails
- The query is passed to embed_text()
- The returned vector is sent to the semantic cache
- Similarity is computed against stored embeddings
This ensures embeddings are generated only when cheaper paths have already failed.
Key Takeaways
- Embeddings are generated via a simple HTTP call to a local model
- The embedder has a single responsibility
- Errors are surfaced immediately
- Embeddings act as semantic keys for cache lookup
With embedding generation understood, we are now ready to look at the semantic cache itself, how embeddings and responses are stored, scanned, and matched.
In the next section, we will walk through the semantic cache implementation, starting with a deliberately naive but correct linear scan approach.
The Semantic Cache: Cosine Similarity, Redis Storage, and Reusing Meaning
At this point, we understand how queries enter the system and how text is converted into embeddings. What remains is the component that ties everything together: the semantic cache itself.
The semantic cache is responsible for 2 things:
- Storing past queries, embeddings, and responses
- Retrieving the best reusable response for a new query
In Lesson 1, we intentionally implement the cache in the simplest correct way possible: a linear scan over cached entries. This keeps the implementation easy to reason about and makes the request flow fully transparent.
The Semantic Cache Module
The cache logic lives in semantic_cache.py:
class SemanticCache:
This class encapsulates all Redis interaction and similarity logic. The API layer never talks to Redis directly.
Initializing the Cache
def __init__(self):
self.r = redis.Redis(
host=settings.REDIS_HOST,
port=settings.REDIS_PORT,
decode_responses=True
)
self.similarity_threshold = 0.85
self.namespace = "semantic_cache:v1"
Here we establish a Redis connection and configure two important parameters:
- Similarity threshold: Only responses with sufficiently high semantic similarity are eligible for reuse.
- Namespace prefix: All Redis keys are namespaced to avoid collisions and allow future versioning.
For Lesson 1, the exact threshold value is not important. What matters is that a threshold exists and is applied consistently.
Storing Cache Entries
The first core operation is storing new entries.
def store(self, query, embedding, response, metadata=None):
This method is called only after a successful LLM response.
Creating a Cache Entry
entry = CacheEntry(
id=entry_uuid,
query=query,
query_hash=query_hash,
embedding=json.dumps(embedding),
response=response,
created_at=int(time.time()),
ttl=default_ttl(),
metadata=metadata or {}
)
Each cache entry stores:
- the original query
- a normalized query hash (used for exact matching)
- the embedding (serialized for Redis storage)
- the LLM response
- timestamps and TTL
- optional metadata for observability
This structure allows the cache to support both exact-match and semantic lookups.
Writing to Redis
self.r.hset(redis_key, mapping=entry.dict())
self.r.sadd(f"{self.namespace}:keys", redis_key)
Each cache entry is stored as a Redis hash, and all entry keys are tracked in a Redis set.
This allows the cache to iterate over all entries during search operations.
For Lesson 1, this approach is intentionally simple and explicit.
Searching the Cache
The second core operation is lookup.
def search(self, embedding, exact_query=None):
This method supports 2 search modes, which map directly to the layered cache strategy used in the API.
Exact-Match Lookup (Fast Path)
if exact_query:
query_hash = self._hash_query(exact_query)
When an exact query is provided, the cache first attempts a hash-based lookup.
Each cached entry is scanned until a matching hash is found. If found, the entry is returned immediately with a similarity score of 1.0.
No embeddings are involved in this path.
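The exact hashing scheme is an implementation detail; a typical sketch of the _hash_query method normalizes the query before hashing (the lowercasing and SHA-256 choices here are illustrative; the whitespace stripping is what the demo later relies on):

import hashlib

def _hash_query(self, query: str) -> str:
    # Normalize so trivially different strings map to the same cache key.
    normalized = query.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()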
Semantic Lookup (Flexible Path)
If no exact match is found and an embedding is provided, the cache performs a semantic search:
sim = self.cosine_similarity(query_embedding, cached_embedding)
Each cached embedding is compared against the query embedding using cosine similarity.
Only entries that exceed the configured similarity threshold are considered candidates.
Selecting the Best Match
During the scan, the cache tracks the highest similarity score and returns the best matching entry.
This ensures that even when multiple entries are similar, the most relevant response is reused.
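A simplified sketch of that scan, written as a helper on the cache class (the method name and return shape are illustrative; the real implementation also handles expiry and cleanup):

import json

def _best_semantic_match(self, query_embedding):
    best_entry, best_sim = None, 0.0
    # Iterate over every tracked cache key: O(N) by design in Lesson 1.
    for redis_key in self.r.smembers(f"{self.namespace}:keys"):
        entry = self.r.hgetall(redis_key)
        if not entry:
            continue
        cached_embedding = json.loads(entry["embedding"])
        sim = self.cosine_similarity(query_embedding, cached_embedding)
        if sim >= self.similarity_threshold and sim > best_sim:
            best_entry, best_sim = entry, sim
    return best_entry, best_sim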
Why This Implementation Is O(N)
Every search scans all cached entries.
This is not an accident.
For Lesson 1, a linear scan has three advantages:
- the behavior is easy to understand
- the logic is fully visible
- debugging is straightforward
More advanced indexing strategies belong in later lessons.
Why Expired Entries Are Cleaned During Search
While scanning entries, expired items are removed opportunistically.
This prevents stale data from accumulating indefinitely without introducing background workers or schedulers.
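The expiry check itself (in ttl.py) only needs created_at and ttl from the entry; a minimal sketch, assuming Redis returns hash fields as strings:

import time

def is_expired(entry: dict) -> bool:
    # An entry is stale once its age exceeds the application-level TTL.
    created_at = int(entry.get("created_at", 0))
    ttl = int(entry.get("ttl", 0))
    return (time.time() - created_at) > ttl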
Key Takeaways
- The semantic cache owns all Redis interactions
- Exact-match lookup is attempted before semantic matching
- Semantic similarity is computed using embeddings
- A linear scan trades performance for clarity
- The cache returns the best reusable response, not just the first match
At this stage, the system is fully functional: queries can be answered, cached, and reused.
Cache Entries: What Exactly Gets Stored?
So far, we’ve treated the cache as a logical concept: something that stores queries, embeddings, and responses.
In this section, we’ll make that concrete by looking at the structure of a cache entry. Understanding this structure is important because it explains why the cache can support both exact-match and semantic lookup — without duplicating data or logic.
The Cache Entry Schema
Cache entries are defined using a Pydantic model:
class CacheEntry(BaseModel):
id: str
query: str
query_hash: str
embedding: str
response: str
created_at: int
ttl: int
metadata: Optional[Dict] = Field(default_factory=dict)
Each field exists for a specific reason. Let’s walk through them one by one.
Identity and Query Fields
id: str
query: str
query_hash: str
- id: uniquely identifies the cache entry and is used to construct the Redis key.
- query: stores the original user input. This is useful for debugging and inspection.
- query_hash: stores a normalized hash of the query and enables exact-match lookup.
At this stage, it’s enough to know that the hash ensures identical queries can be matched quickly. We’ll revisit how and why this normalization matters in a later lesson.
Embedding Storage
embedding: str
Embeddings are stored as a JSON-serialized string, not as a raw Python list.
This choice is deliberate:
- Redis stores strings efficiently
- Serialization keeps the schema simple
- Deserialization happens only when similarity needs to be computed
For Lesson 1, the important takeaway is that embeddings are stored once, alongside the response they produced.
Response and Timing Information
response: str
created_at: int
ttl: int
- response: the text returned by the LLM.
- created_at: records when the entry was generated.
- ttl: defines how long the entry is considered valid.
The cache does not rely on Redis expiration here. Instead, validity is checked at read time. This gives the application full control over when an entry should be reused or rejected.
We intentionally avoid deeper TTL semantics in this lesson.
Metadata and Safety
metadata: Optional[Dict] = Field(default_factory=dict)
Metadata allows the cache to store contextual information such as:
- pipeline name
- model identifier
- request origin
The use of default_factory=dict avoids shared mutable state across cache entries — a subtle but important correctness detail.
At this stage, metadata is informational rather than functional.
Why This Schema Works Well
This schema supports the layered caching strategy naturally:
- Exact match uses query_hash
- Semantic match uses embedding
- Freshness checks use created_at and ttl
- Safety checks use response and metadata
All required information is co-located in a single cache entry, making lookup and validation straightforward.
End-to-End Demo: Verifying Core Cache Behavior
In this section, we will verify that the semantic cache behaves as expected under a small set of controlled scenarios.
These examples are meant to be run locally by the reader. The responses shown below are representative and may vary slightly depending on the model and configuration.
Demo Case 1: Cold Request (LLM Fallback)
We begin with a query that has not been seen before.
curl -X POST http://localhost:8000/ask \
-H "Content-Type: application/json" \
-d '{"query": "What is semantic caching?"}'
Expected behavior
- Exact-match cache miss
- Semantic cache miss
- LLM call
- Cache population
Response
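A representative response body looks something like this (the wording and debug fields are illustrative and will differ on your machine):

{
  "response": "Semantic caching stores and reuses LLM responses based on the meaning of a query rather than its exact wording ...",
  "from_cache": false,
  "similarity": 0.0,
  "debug": {"cache_hit": false, "miss_reason": "not_found"}
}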

The key signal here is "from_cache": false, confirming the request fell back to the LLM.
Demo Case 2: Exact-Match Cache Hit
Now we send the same query again.
curl -X POST http://localhost:8000/ask \
-H "Content-Type: application/json" \
-d '{"query": "What is semantic caching?"}'
Expected behavior
- Exact-match cache hit
- No embedding generation
- No LLM call
Example response
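An illustrative response body (exact values will vary):

{
  "response": "Semantic caching stores and reuses LLM responses based on the meaning of a query rather than its exact wording ...",
  "from_cache": true,
  "similarity": 1.0,
  "debug": {"cache_hit": true, "match_type": "exact"}
}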

Here, the cache reused the response immediately using an exact-match lookup.
Optional Demo: Whitespace Normalization
curl -X POST http://localhost:8000/ask \
-H "Content-Type: application/json" \
-d '{"query": " What is semantic caching? "}'
This will hit the exact-match cache due to query normalization.
Demo Case 3: Semantic Cache Hit (Paraphrased Query)
Next, we send a paraphrased version of the original query.
curl -X POST http://localhost:8000/ask \
-H "Content-Type: application/json" \
-d '{"query": "Can you explain how semantic caching works?"}'
Expected behavior
- Exact-match cache miss
- Embedding generation
- Semantic cache hit
- No LLM call
Example response
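An illustrative response body (the similarity value depends on your embedding model and threshold):

{
  "response": "Semantic caching stores and reuses LLM responses based on the meaning of a query rather than its exact wording ...",
  "from_cache": true,
  "similarity": 0.91,
  "debug": {"cache_hit": true, "match_type": "semantic"}
}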

Even though the query text is different, the cache successfully reused the response based on semantic similarity.
Demo Case 4: Forcing a Cache Miss with bypass_cache
The bypass_cache flag allows us to force the system to skip both cache layers.
curl -X POST http://localhost:8000/ask \
-H "Content-Type: application/json" \
-d '{"query": "What is semantic caching?", "bypass_cache": true}'
Expected behavior
- Exact-match cache skipped
- Semantic cache skipped
- LLM called unconditionally
Example response
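An illustrative response body; note that the cache is skipped even though an exact match already exists:

{
  "response": "Semantic caching stores and reuses LLM responses based on the meaning of a query rather than its exact wording ...",
  "from_cache": false,
  "similarity": 0.0,
  "debug": {"cache_hit": false, "miss_reason": "bypass"}
}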

This is useful for debugging and validating that the LLM pipeline still works independently of the cache.
Observing Cache Metrics (Optional)
You can inspect basic cache statistics using the /internal/metrics endpoint:
curl http://localhost:8000/internal/metrics
Example response
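The exact field names depend on metrics.py, but the output is a small JSON document of counters, for example:

{
  "cache_hits": 2,
  "cache_misses": 2,
  "llm_calls": 2
}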

These metrics make cache behavior observable without requiring external tooling.
If you can reproduce these behaviors locally, you’ve successfully implemented a working semantic cache.
In the next lesson, we will take this system and begin hardening it for real-world use.
What's next? We recommend PyImageSearch University.
86+ total classes • 115+ hours of on-demand code walkthrough videos • Last updated: April 2026
★★★★★ 4.84 (128 Ratings) • 16,000+ Students Enrolled
I strongly believe that if you had the right teacher you could master computer vision and deep learning.
Do you think learning computer vision and deep learning has to be time-consuming, overwhelming, and complicated? Or has to involve complex mathematics and equations? Or requires a degree in computer science?
That’s not the case.
All you need to master computer vision and deep learning is for someone to explain things to you in simple, intuitive terms. And that’s exactly what I do. My mission is to change education and how complex Artificial Intelligence topics are taught.
If you're serious about learning computer vision, your next stop should be PyImageSearch University, the most comprehensive computer vision, deep learning, and OpenCV course online today. Here you’ll learn how to successfully and confidently apply computer vision to your work, research, and projects. Join me in computer vision mastery.
Inside PyImageSearch University you'll find:
- ✓ 86+ courses on essential computer vision, deep learning, and OpenCV topics
- ✓ 86 Certificates of Completion
- ✓ 115+ hours of on-demand video
- ✓ Brand new courses released regularly, ensuring you can keep up with state-of-the-art techniques
- ✓ Pre-configured Jupyter Notebooks in Google Colab
- ✓ Run all code examples in your web browser — works on Windows, macOS, and Linux (no dev environment configuration required!)
- ✓ Access to centralized code repos for all 540+ tutorials on PyImageSearch
- ✓ Easy one-click downloads for code, datasets, pre-trained models, etc.
- ✓ Access on mobile, laptop, desktop, etc.
Summary
In this lesson, we built a complete semantic caching system for LLM applications from the ground up. We started by wiring a FastAPI service and defining a clean request–response contract, then implemented a layered caching strategy that prioritizes cheap exact-match lookups before escalating to semantic similarity and, finally, LLM inference.
We walked through how text queries are converted into embeddings on demand, how cached responses and embeddings are stored in Redis, and how the cache decides whether a prior response can be safely reused. By keeping the implementation intentionally simple and explicit, every step in the request flow remains observable and easy to reason about.
Finally, we verified the system end-to-end by running controlled demos: a cold request falling back to the LLM, an exact-match cache hit, a semantic cache hit for a paraphrased query, and an explicit cache bypass. At this point, you have a working semantic cache that behaves correctly, makes its decisions visible, and serves as a solid foundation for further hardening and optimization.
Citation Information
Singh, V. “Semantic Caching for LLMs: FastAPI, Redis, and Embeddings,” PyImageSearch, S. Huot, A. Sharma, and P. Thakur, eds., 2026, https://pyimg.co/yso6f
@incollection{Singh_2026_semantic-caching-for-llms-fastapi-redis-and-embeddings,
author = {Vikram Singh},
title = {{Semantic Caching for LLMs: FastAPI, Redis, and Embeddings}},
booktitle = {PyImageSearch},
editor = {Susan Huot and Aditya Sharma and Piyush Thakur},
year = {2026},
url = {https://pyimg.co/yso6f},
}
To download the source code to this post (and be notified when future tutorials are published here on PyImageSearch), simply enter your email address in the form below!
