47 changes: 47 additions & 0 deletions blog/valkey_semantic_caching/index.md
@@ -0,0 +1,47 @@
---
slug: valkey_semantic_caching
title: "Semantic Caching on Valkey and AWS ElastiCache"
date: 2026-06-17T10:00:00
authors:
- yassin
description: "LiteLLM now supports semantic prompt caching on Valkey clusters running the valkey-search module, including AWS ElastiCache for Valkey, with no RediSearch, Redis Stack, or Qdrant required."
tags: [caching, valkey, elasticache, semantic cache]
hide_table_of_contents: false
---

LiteLLM now supports semantic prompt caching on Valkey. If you run a Valkey cluster with the [valkey-search](https://github.com/valkey-io/valkey-search) module, including AWS ElastiCache for Valkey, you can point LiteLLM at it with `type: valkey-semantic` and get embedding-based cache hits without standing up Redis Stack or a separate vector database.

{/* truncate */}

## Why this matters

Semantic caching stores responses by the meaning of a prompt rather than an exact string match, so a reworded request can still hit the cache and skip a paid model call. Until now, LiteLLM's semantic cache was built on RedisVL, which depends on RediSearch's `FT.*` vector API. RediSearch is not available on Redis OSS or on ElastiCache for Redis OSS, which left teams running Redis Stack or Qdrant just to get semantic caching. With Redis moving to a source-available license, more teams are adopting Valkey instead, and ElastiCache for Valkey is a common managed target.

Valkey ships vector search through the valkey-search module, and ElastiCache for Valkey exposes it. LiteLLM's new backend talks to valkey-search directly over the Redis protocol, so semantic caching on ElastiCache for Valkey works without RediSearch, Redis Stack, or Qdrant in the path.

## How it works

The `valkey-semantic` backend builds its own vector index from the field types valkey-search supports: a tag field that isolates each cache key's scope and an HNSW vector field for the prompt embedding. At lookup time it runs a KNN query and returns the cached response when the cosine similarity clears your threshold. Prompt extraction, embedding generation, and response handling are shared with the existing Redis semantic cache, so behavior matches the Redis path, including per-request scope isolation. Connection settings resolve from `VALKEY_HOST`, `VALKEY_PORT`, and `VALKEY_PASSWORD`, falling back to the `REDIS_*` equivalents, and passwordless clusters are supported for IAM or no-auth setups.
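
To make the lookup rule concrete, here is a minimal pure-Python sketch of a scope-filtered nearest-neighbour search with a cosine-similarity threshold. It is an illustration only, not LiteLLM's implementation: it scans a list by brute force where the real backend queries an HNSW index in valkey-search, and the names `cosine_similarity`, `lookup`, and the entry keys `scope`, `embedding`, and `response` are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def lookup(
    entries: List[Dict],
    scope: str,
    query_embedding: Sequence[float],
    similarity_threshold: float,
) -> Optional[str]:
    """Return the closest cached response in `scope`, or None on a miss."""
    best_entry = None
    best_score = -1.0
    for entry in entries:
        # Stand-in for the tag filter: only entries in the same scope compete.
        if entry["scope"] != scope:
            continue
        score = cosine_similarity(entry["embedding"], query_embedding)
        if score > best_score:
            best_entry, best_score = entry, score
    # A hit requires the nearest neighbour to clear the threshold.
    if best_entry is not None and best_score >= similarity_threshold:
        return best_entry["response"]
    return None
```

With a threshold of 0.8, a query embedding that points in nearly the same direction as a stored one is a hit, while an orthogonal embedding or an entry from another scope is a miss.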

## Get started

Add the cache to your `config.yaml`:

```yaml
litellm_settings:
  cache: True
  cache_params:
    type: valkey-semantic
    host: os.environ/VALKEY_HOST
    port: os.environ/VALKEY_PORT
    valkey_semantic_cache_embedding_model: openai-embedding
    similarity_threshold: 0.8
```

For ElastiCache with encryption in transit, pass a `rediss://` URL through `cache_params.redis_url` instead of host and port. To try valkey-search locally, the bundled image has the module ready:

```shell
docker run -d -p 6379:6379 valkey/valkey-bundle:8.1
```

See the [caching docs](https://docs.litellm.ai/docs/proxy/caching) for the full setup, including the SDK usage and the parameter reference.
75 changes: 74 additions & 1 deletion docs/caching/all_caches.md
@@ -331,6 +331,75 @@ assert response1.id == response2.id

</TabItem>

<TabItem value="valkey-sem" label="valkey-semantic cache">

Use this when your vector store is a Valkey instance running the [valkey-search](https://github.com/valkey-io/valkey-search) module, for example [AWS ElastiCache for Valkey](https://aws.amazon.com/elasticache/). RediSearch and RedisVL are not required; LiteLLM drives valkey-search directly over the Redis protocol.

:::info Requirements

The `valkey-search` module must be loaded on the server (run `MODULE LIST` and look for `search`, or run `FT._LIST`). On AWS ElastiCache, vector search is available on **node-based Valkey 8.0+ clusters**; a single-node / cluster-mode-disabled node group is supported and is the recommended target. ElastiCache **Serverless does not support vector search**, so a serverless endpoint will not work here. Multi-shard (cluster-mode-enabled) endpoints are not supported by this backend, because the async client cannot route the `FT.*` search commands across shards; use a single-shard endpoint and scale vertically.

:::
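
The `MODULE LIST` check can also be scripted. The helper below is a small sketch, not part of LiteLLM: `has_search_module` is a hypothetical name, and it assumes the reply has already been parsed into a list of mappings with a `name` entry per module (keys and values may arrive as `bytes` or `str` depending on the client's decoding settings).

```python
from typing import Any, Iterable, Mapping


def _to_text(value: Any) -> str:
    """Normalize bytes or str values from a client reply to str."""
    return value.decode() if isinstance(value, bytes) else str(value)


def has_search_module(modules: Iterable[Mapping[Any, Any]]) -> bool:
    """Return True if any module entry is named `search`.

    Assumes each entry is a mapping with a `name` key, which is how a
    parsed MODULE LIST reply is typically shaped.
    """
    for module in modules:
        for key, value in module.items():
            if _to_text(key) == "name" and _to_text(value) == "search":
                return True
    return False
```

With redis-py, `redis.Redis(host=..., port=...).module_list()` is one way to obtain the input, assuming your client version returns the reply in that shape.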

To run a Valkey instance with valkey-search locally, the `valkey/valkey-bundle` image ships the module:

```shell
docker run -d -p 6379:6379 valkey/valkey-bundle:8.1
```

```python
import os
import random

import litellm
from litellm import completion
from litellm.caching.caching import Cache

random_number = random.randint(
    1, 100000
)  # add a random number to ensure it's always adding / reading from cache

print("testing semantic caching")
litellm.cache = Cache(
    type="valkey-semantic",
    host=os.environ["VALKEY_HOST"],
    port=os.environ["VALKEY_PORT"],
    password=os.environ.get("VALKEY_PASSWORD"),  # omit for passwordless / IAM-auth clusters
    similarity_threshold=0.8,  # similarity threshold for cache hits: 0 == no similarity, 0.5 == 50% similarity, 1 == exact match
    ttl=120,
    valkey_semantic_cache_embedding_model="text-embedding-ada-002",  # passed to litellm.embedding(); any litellm.embedding() model is supported here
    valkey_semantic_cache_index_name="litellm_semantic_cache_index",  # optional, defaults to litellm_semantic_cache_index
)
response1 = completion(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "user",
            "content": f"write a one sentence poem about: {random_number}",
        }
    ],
    max_tokens=20,
)
print(f"response1: {response1}")

random_number = random.randint(1, 100000)

response2 = completion(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "user",
            "content": f"write a one sentence poem about: {random_number}",
        }
    ],
    max_tokens=20,
)
print(f"response2: {response2}")
assert response1.id == response2.id
# response1 == response2: the second call is served from the cache
```

`VALKEY_HOST`, `VALKEY_PORT`, and `VALKEY_PASSWORD` fall back to `REDIS_HOST`, `REDIS_PORT`, and `REDIS_PASSWORD` if they are not set. For ElastiCache with encryption in transit (TLS), either pass `ssl=True` alongside host and port, or pass a full `redis_url="rediss://..."`.
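
The fallback order can be pictured with a short sketch. This is an illustration of the documented behavior under stated assumptions, not LiteLLM's own code: `resolve_setting` is a hypothetical helper, and it treats a variable as unset only when it is absent from the environment.

```python
import os
from typing import Mapping, Optional


def resolve_setting(name: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Look up VALKEY_<name>, then REDIS_<name>; return None if neither is set."""
    value = env.get(f"VALKEY_{name}")
    if value is None:
        # Fall back to the Redis-style variable when the Valkey one is absent.
        value = env.get(f"REDIS_{name}")
    return value
```

For example, `resolve_setting("HOST")` prefers `VALKEY_HOST` and only consults `REDIS_HOST` when the former is missing. A `None` result for `PASSWORD` corresponds to the passwordless case.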

</TabItem>

<TabItem value="in-mem" label="in memory cache">

### Quick Start
@@ -586,7 +655,7 @@ cache.get_cache = get_cache
```python
def __init__(
self,
type: Optional[Literal["local", "redis", "redis-semantic", "s3", "gcs", "disk"]] = "local",
type: Optional[Literal["local", "redis", "redis-semantic", "valkey-semantic", "s3", "gcs", "disk"]] = "local",
supported_call_types: Optional[
List[Literal["completion", "acompletion", "embedding", "aembedding", "atranscription", "transcription"]]
] = ["completion", "acompletion", "embedding", "aembedding", "atranscription", "transcription"],
@@ -613,6 +682,10 @@ def __init__(
redis_semantic_cache_embedding_model: str = "text-embedding-ada-002",
redis_semantic_cache_index_name: Optional[str] = None,

# valkey semantic cache params (valkey-search module, e.g. ElastiCache for Valkey)
valkey_semantic_cache_embedding_model: str = "text-embedding-ada-002",
valkey_semantic_cache_index_name: Optional[str] = None,

# s3 Bucket, boto3 configuration
s3_bucket_name: Optional[str] = None,
s3_region_name: Optional[str] = None,
72 changes: 72 additions & 0 deletions docs/proxy/caching.md
@@ -19,6 +19,7 @@ calling the LLM API again.
- Redis Cache
- Qdrant Semantic Cache
- Redis Semantic Cache
- Valkey Semantic Cache
- S3 Bucket Cache
- GCS Bucket Cache

@@ -409,6 +410,77 @@ one**

</TabItem>

<TabItem value="valkey-semantic" label="Valkey Semantic cache">

Semantic caching on a Valkey instance running the [valkey-search](https://github.com/valkey-io/valkey-search) module, such as AWS ElastiCache for Valkey. RediSearch and RedisVL are not required.

:::info Requirements

The `valkey-search` module must be loaded (check with `MODULE LIST` / `FT._LIST`). On AWS ElastiCache, vector search needs a **node-based Valkey 8.0+ cluster**; a single-node / cluster-mode-disabled node group is supported and recommended. ElastiCache **Serverless does not support vector search**. Multi-shard (cluster-mode-enabled) endpoints are not supported here, so use a single-shard endpoint.

:::

#### Step 1: Add `cache` to the config.yaml

```yaml
model_list:
  - model_name: fake-openai-endpoint
    litellm_params:
      model: openai/fake
      api_key: fake-key
      api_base: https://exampleopenaiendpoint-production.up.railway.app/
  - model_name: openai-embedding
    litellm_params:
      model: openai/text-embedding-3-small
      api_key: os.environ/OPENAI_API_KEY

litellm_settings:
  set_verbose: True
  cache: True
  cache_params:
    type: valkey-semantic
    host: os.environ/VALKEY_HOST
    port: os.environ/VALKEY_PORT
    valkey_semantic_cache_embedding_model: openai-embedding # the model should be defined on the model_list
    valkey_semantic_cache_index_name: litellm_semantic_cache_index # optional
    similarity_threshold: 0.8 # similarity threshold for semantic cache
```

#### Step 2: Add Valkey Credentials to your .env

```shell
VALKEY_HOST = "your-valkey-host"
VALKEY_PORT = "6379"
VALKEY_PASSWORD = "your-password" # omit for passwordless / IAM-auth clusters
```

For ElastiCache with encryption in transit (TLS), add `ssl: true` under `cache_params`, or set `cache_params.redis_url` to a `rediss://` URL instead of host and port. To run valkey-search locally, use `docker run -d -p 6379:6379 valkey/valkey-bundle:8.1`.

#### Step 3: Run proxy with config

```shell
$ litellm --config /path/to/config.yaml
```

#### Step 4: Test it

```shell
curl -i http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-1234" \
  -d '{
    "model": "fake-openai-endpoint",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
```

**Expect to see `x-litellm-semantic-similarity` in the response headers when semantic caching is on**

</TabItem>

<TabItem value="s3" label="s3 cache">

#### Step 1: Add `cache` to the config.yaml