diff --git a/blog/valkey_semantic_caching/index.md b/blog/valkey_semantic_caching/index.md
new file mode 100644
index 000000000..4478f35e1
--- /dev/null
+++ b/blog/valkey_semantic_caching/index.md
@@ -0,0 +1,47 @@
+---
+slug: valkey_semantic_caching
+title: "Semantic Caching on Valkey and AWS ElastiCache"
+date: 2026-06-17T10:00:00
+authors:
+  - yassin
+description: "LiteLLM now supports semantic prompt caching on Valkey clusters running the valkey-search module, including AWS ElastiCache for Valkey, with no RediSearch, Redis Stack, or Qdrant required."
+tags: [caching, valkey, elasticache, semantic cache]
+hide_table_of_contents: false
+---
+
+LiteLLM now supports semantic prompt caching on Valkey. If you run a Valkey cluster with the [valkey-search](https://github.com/valkey-io/valkey-search) module, including AWS ElastiCache for Valkey, you can point LiteLLM at it with `type: valkey-semantic` and get embedding-based cache hits without standing up Redis Stack or a separate vector database.
+
+{/* truncate */}
+
+## Why this matters
+
+Semantic caching stores responses by the meaning of a prompt rather than an exact string match, so a reworded request can still hit the cache and skip a paid model call. Until now LiteLLM's semantic cache was built on RedisVL, which depends on RediSearch's `FT.*` vector API. RediSearch is not available on Redis OSS or on ElastiCache for Redis OSS, which left teams standing up Redis Stack or Qdrant just to get semantic caching. With Redis moving to a source-available license, more teams are standing up Valkey instead, and ElastiCache for Valkey is a common managed target.
+
+Valkey ships vector search through the valkey-search module, and ElastiCache for Valkey exposes it. LiteLLM's new backend talks to valkey-search directly over the Redis protocol, so semantic caching on ElastiCache for Valkey works without RediSearch, Redis Stack, or Qdrant in the path.
+
+## How it works
+
+The `valkey-semantic` backend builds its own vector index from the field types valkey-search supports: a tag field that isolates each cache key's scope and an HNSW vector field for the prompt embedding. At lookup time it runs a KNN query and returns the cached response when the cosine similarity clears your threshold. Prompt extraction, embedding generation, and response handling are shared with the existing Redis semantic cache, so behavior matches the Redis path, including per-request scope isolation. Connections resolve from `VALKEY_HOST`, `VALKEY_PORT`, and `VALKEY_PASSWORD`, falling back to the `REDIS_*` equivalents, and passwordless clusters are supported for IAM or no-auth setups.
+
+## Get started
+
+Add the cache to your `config.yaml`:
+
+```yaml
+litellm_settings:
+  cache: True
+  cache_params:
+    type: valkey-semantic
+    host: os.environ/VALKEY_HOST
+    port: os.environ/VALKEY_PORT
+    valkey_semantic_cache_embedding_model: openai-embedding
+    similarity_threshold: 0.8
+```
+
+For ElastiCache with encryption in transit, pass a `rediss://` URL through `cache_params.redis_url` instead of host and port. To try valkey-search locally, the bundled image has the module ready:
+
+```shell
+docker run -d -p 6379:6379 valkey/valkey-bundle:8.1
+```
+
+See the [caching docs](https://docs.litellm.ai/docs/proxy/caching) for the full setup, including the SDK usage and the parameter reference.
diff --git a/docs/caching/all_caches.md b/docs/caching/all_caches.md
index 7cc329c93..3835347eb 100644
--- a/docs/caching/all_caches.md
+++ b/docs/caching/all_caches.md
@@ -331,6 +331,75 @@ assert response1.id == response2.id
+
+
+Use this when your vector store is a Valkey instance running the [valkey-search](https://github.com/valkey-io/valkey-search) module, for example [AWS ElastiCache for Valkey](https://aws.amazon.com/elasticache/). RediSearch and RedisVL are not required; LiteLLM drives valkey-search directly over the Redis protocol.
+
+:::info Requirements
+
+The `valkey-search` module must be loaded on the server (run `MODULE LIST` and look for `search`, or run `FT._LIST`). On AWS ElastiCache, vector search is available on **node-based Valkey 8.0+ clusters**; a single-node / cluster-mode-disabled node group is supported and is the recommended target. ElastiCache **Serverless does not support vector search**, so a serverless endpoint will not work here. Multi-shard (cluster-mode-enabled) endpoints are not supported by this backend, since the async client cannot route the `FT.*` search commands across shards; use a single-shard endpoint and scale vertically.
+
+:::
+
+To run a Valkey instance with valkey-search locally, the `valkey/valkey-bundle` image ships the module:
+
+```shell
+docker run -d -p 6379:6379 valkey/valkey-bundle:8.1
+```
+
+```python
+import os
+import random
+
+import litellm
+from litellm import completion
+from litellm.caching.caching import Cache
+
+random_number = random.randint(
+    1, 100000
+)  # add a random number to ensure it's always adding / reading from cache
+
+print("testing semantic caching")
+litellm.cache = Cache(
+    type="valkey-semantic",
+    host=os.environ["VALKEY_HOST"],
+    port=os.environ["VALKEY_PORT"],
+    password=os.environ.get("VALKEY_PASSWORD"),  # omit for passwordless / IAM-auth clusters
+    similarity_threshold=0.8,  # similarity threshold for cache hits, 0 == no similarity, 1 == exact match, 0.5 == 50% similarity
+    ttl=120,
+    valkey_semantic_cache_embedding_model="text-embedding-ada-002",  # this model is passed to litellm.embedding(), any litellm.embedding() model is supported here
+    valkey_semantic_cache_index_name="litellm_semantic_cache_index",  # optional, defaults to litellm_semantic_cache_index
+)
+response1 = completion(
+    model="gpt-3.5-turbo",
+    messages=[
+        {
+            "role": "user",
+            "content": f"write a one sentence poem about: {random_number}",
+        }
+    ],
+    max_tokens=20,
+)
+print(f"response1: {response1}")
+
+random_number = random.randint(1, 100000)
+
+response2 = completion(
+    model="gpt-3.5-turbo",
+    messages=[
+        {
+            "role": "user",
+            "content": f"write a one sentence poem about: {random_number}",
+        }
+    ],
+    max_tokens=20,
+)
+print(f"response2: {response2}")
+assert response1.id == response2.id
+# response1 == response2, response 1 is cached
+```
+
+`VALKEY_HOST`, `VALKEY_PORT`, and `VALKEY_PASSWORD` fall back to `REDIS_HOST`, `REDIS_PORT`, and `REDIS_PASSWORD` if they are not set. For ElastiCache with encryption in transit (TLS), either pass `ssl=True` alongside host and port, or pass a full `redis_url="rediss://..."`.
+
+
+
 ### Quick Start
@@ -586,7 +655,7 @@ cache.get_cache = get_cache
 ```python
 def __init__(
     self,
-    type: Optional[Literal["local", "redis", "redis-semantic", "s3", "gcs", "disk"]] = "local",
+    type: Optional[Literal["local", "redis", "redis-semantic", "valkey-semantic", "s3", "gcs", "disk"]] = "local",
     supported_call_types: Optional[
         List[Literal["completion", "acompletion", "embedding", "aembedding", "atranscription", "transcription"]]
     ] = ["completion", "acompletion", "embedding", "aembedding", "atranscription", "transcription"],
@@ -613,6 +682,10 @@ def __init__(
     redis_semantic_cache_embedding_model: str = "text-embedding-ada-002",
     redis_semantic_cache_index_name: Optional[str] = None,
 
+    # valkey semantic cache params (valkey-search module, e.g. ElastiCache for Valkey)
+    valkey_semantic_cache_embedding_model: str = "text-embedding-ada-002",
+    valkey_semantic_cache_index_name: Optional[str] = None,
+
     # s3 Bucket, boto3 configuration
     s3_bucket_name: Optional[str] = None,
     s3_region_name: Optional[str] = None,
diff --git a/docs/proxy/caching.md b/docs/proxy/caching.md
index 85686c99f..8656ea6d3 100644
--- a/docs/proxy/caching.md
+++ b/docs/proxy/caching.md
@@ -19,6 +19,7 @@ calling the LLM API again.
 - Redis Cache
 - Qdrant Semantic Cache
 - Redis Semantic Cache
+- Valkey Semantic Cache
 - S3 Bucket Cache
 - GCS Bucket Cache
@@ -409,6 +410,77 @@ one**
+
+
+Semantic caching on a Valkey instance running the [valkey-search](https://github.com/valkey-io/valkey-search) module, such as AWS ElastiCache for Valkey. RediSearch and RedisVL are not required.
+
+:::info Requirements
+
+The `valkey-search` module must be loaded (check with `MODULE LIST` or `FT._LIST`). On AWS ElastiCache, vector search needs a **node-based Valkey 8.0+ cluster**; a single-node / cluster-mode-disabled node group is supported and recommended. ElastiCache **Serverless does not support vector search**. Multi-shard (cluster-mode-enabled) endpoints are not supported here, so use a single-shard endpoint.
+
+:::
+
+#### Step 1: Add `cache` to the config.yaml
+
+```yaml
+model_list:
+  - model_name: fake-openai-endpoint
+    litellm_params:
+      model: openai/fake
+      api_key: fake-key
+      api_base: https://exampleopenaiendpoint-production.up.railway.app/
+  - model_name: openai-embedding
+    litellm_params:
+      model: openai/text-embedding-3-small
+      api_key: os.environ/OPENAI_API_KEY
+
+litellm_settings:
+  set_verbose: True
+  cache: True
+  cache_params:
+    type: valkey-semantic
+    host: os.environ/VALKEY_HOST
+    port: os.environ/VALKEY_PORT
+    valkey_semantic_cache_embedding_model: openai-embedding # the model should be defined on the model_list
+    valkey_semantic_cache_index_name: litellm_semantic_cache_index # optional
+    similarity_threshold: 0.8 # similarity threshold for semantic cache
+```
+
+#### Step 2: Add Valkey Credentials to your .env
+
+```shell
+VALKEY_HOST="your-valkey-host"
+VALKEY_PORT="6379"
+VALKEY_PASSWORD="your-password" # omit for passwordless / IAM-auth clusters
+```
+
+For ElastiCache with encryption in transit (TLS), add `ssl: true` under `cache_params`, or set `cache_params.redis_url` to a `rediss://` URL instead of host and port. To run valkey-search locally, use `docker run -d -p 6379:6379 valkey/valkey-bundle:8.1`.
+
+#### Step 3: Run proxy with config
+
+```shell
+$ litellm --config /path/to/config.yaml
+```
+
+#### Step 4: Test it
+
+```shell
+curl -i http://localhost:4000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer sk-1234" \
+  -d '{
+    "model": "fake-openai-endpoint",
+    "messages": [
+      {"role": "user", "content": "Hello"}
+    ]
+  }'
+```
+
+**Expect to see `x-litellm-semantic-similarity` in the response headers when semantic caching is
+on**
+
+
+
 #### Step 1: Add `cache` to the config.yaml
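
The flow described under "How it works" in the blog post (a tag field for scope, an HNSW vector field for the embedding, and a KNN query at lookup) can be sketched as the commands sent to valkey-search. This is a minimal illustration and not LiteLLM's actual implementation. The key prefix and the field names `scope`, `embedding`, and `response` are hypothetical. It also assumes the score returned for a `COSINE` index is a distance, so that similarity is `1 - distance`.

```python
import struct
from typing import List, Sequence, Union

Arg = Union[str, bytes, int]

INDEX = "litellm_semantic_cache_index"  # default index name from the docs above
PREFIX = "semcache:"  # hypothetical key prefix, for illustration only


def pack_vector(vec: Sequence[float]) -> bytes:
    """Serialize an embedding as little-endian float32 for a FLOAT32 vector field."""
    return struct.pack(f"<{len(vec)}f", *vec)


def create_index_args(dim: int) -> List[Arg]:
    """FT.CREATE arguments: a TAG field for scope plus an HNSW cosine vector field."""
    return [
        "FT.CREATE", INDEX, "ON", "HASH", "PREFIX", 1, PREFIX,
        "SCHEMA",
        "scope", "TAG",  # hypothetical field name
        "embedding", "VECTOR", "HNSW", 6,  # 6 attribute tokens follow
        "TYPE", "FLOAT32", "DIM", dim, "DISTANCE_METRIC", "COSINE",
    ]


def knn_search_args(scope: str, vec: Sequence[float]) -> List[Arg]:
    """FT.SEARCH arguments: the single nearest neighbor within one scope tag.

    Assumes `scope` contains no characters that need escaping in a tag query.
    """
    query = f"@scope:{{{scope}}}=>[KNN 1 @embedding $vec AS distance]"
    return [
        "FT.SEARCH", INDEX, query,
        "PARAMS", 2, "vec", pack_vector(vec),
        "RETURN", 2, "response", "distance",
        "DIALECT", 2,
    ]


def is_cache_hit(cosine_distance: float, similarity_threshold: float) -> bool:
    """Convert the assumed cosine distance to a similarity and compare to the threshold."""
    return (1.0 - cosine_distance) >= similarity_threshold
```

With a client such as redis-py, these argument lists could be sent with `client.execute_command(*create_index_args(1536))`, where 1536 is only an example embedding dimension. That call is shown as an untested assumption about how one might wire this up, not as what the backend does internally.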