Valkey for AI developers
Vector search, semantic cache, and agent memory in one Valkey process — the unified AI data layer pitch.
The pitch in one sentence
Modern AI applications need three data services that traditional stacks split across three systems: vector search (Pinecone / Weaviate / Qdrant), semantic cache (custom Redis + embedding model), and agent memory (Mem0 / LangGraph store). Valkey 9.x puts all three behind one connection, one ACL, one operations story.
Vector search
HNSW indexes via valkey-search. Hybrid filter + similarity. Same connection as your cache.
Semantic cache
Hash a prompt → embed → ANN lookup → return cached LLM response if cosine similarity > threshold.
Agent memory
Mem0 and LangGraph have native Valkey backends. Episodic and semantic memory in one store.
valkey-search basics
valkey-search is the official vector + full-text module, GA since Valkey 9.0. Index creation:
valkey-cli FT.CREATE idx:docs ON HASH PREFIX 1 doc: \
SCHEMA \
title TEXT \
body TEXT \
embedding VECTOR HNSW 6 TYPE FLOAT32 DIM 1536 DISTANCE_METRIC COSINE
Index a document:
valkey-cli HSET doc:1 \
title "Valkey 9.1 release notes" \
body "..." \
embedding "$(python embed.py 'release notes')"
Hybrid query (top 5 most similar where body contains "RDMA"):
FT.SEARCH idx:docs "@body:RDMA =>[KNN 5 @embedding $vec AS score]" \
PARAMS 2 vec "$BLOB" \
RETURN 2 title score \
DIALECT 2
The "hybrid" piece, combining BM25 filters with KNN, is the differentiator versus pure vector databases. You do not need to round-trip to a second system to filter by tenant id, language, or recency.
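From application code, the same index-and-query flow can be sketched in Python. This is a minimal sketch, assuming a Valkey-compatible Python client that exposes a generic execute_command(*args) method (valkey-py and redis-py both do). The helper names to_float32_blob and knn_query_args are hypothetical; the index and field names come from the commands above. The blob is packed little-endian, which matches what numpy's tobytes() produces on common x86 and ARM hosts.

```python
import struct


def to_float32_blob(vector):
    """Pack a sequence of floats into the raw little-endian FLOAT32 bytes
    stored in a VECTOR field declared with TYPE FLOAT32."""
    return struct.pack(f"<{len(vector)}f", *vector)


def knn_query_args(index, text_filter, blob, k=5):
    """Build the argument list for the hybrid FT.SEARCH shown above.
    Pass the result to client.execute_command(*args)."""
    query = f"{text_filter} =>[KNN {k} @embedding $vec AS score]"
    return [
        "FT.SEARCH", index, query,
        "PARAMS", 2, "vec", blob,
        "RETURN", 2, "title", "score",
        "DIALECT", 2,
    ]


# Toy 4-dimensional vector for illustration only; an index created with
# DIM 1536 needs exactly 1536 components per vector.
blob = to_float32_blob([0.1, 0.2, 0.3, 0.4])
args = knn_query_args("idx:docs", "@body:RDMA", blob)
```

Building the blob in application code also sidesteps passing binary data through shell command substitution, which can drop NUL bytes.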
Semantic cache
This is the pattern for which AWS reported a roughly 86 percent cost reduction in internal customer support workloads:
- Embed the incoming prompt with a cheap embedding model (text-embedding-3-small or open-source).
- ANN search in valkey-search over historical (prompt, response) pairs.
- If best match has cosine similarity > 0.92 (tune per domain), return the cached response.
- Otherwise call the expensive LLM, then HSET the new pair with its embedding for future hits.
A reference implementation in Python is ~80 lines. The cache hit rate is the whole story — high-similarity domains (customer support, FAQ, docs Q&A) hit 60-90%; creative or coding workloads hit < 20%.
Tune the threshold per use case. 0.92 is conservative for English text-embedding-3 vectors; lower thresholds save more cost but risk semantically-close-but-factually-different answers. Always log a cache_hit boolean and sample evals.
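The steps above, together with the cache_hit flag, can be sketched as follows. This is a minimal in-memory illustration, not the reference implementation: a linear scan stands in for the valkey-search ANN lookup, and SemanticCache, toy_embed, and fake_llm are hypothetical names. A real deployment would swap in an actual embedding model, an FT.SEARCH KNN query, and HSET for storage.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Linear-scan stand-in for the ANN lookup over (embedding, response) pairs."""

    def __init__(self, embed, call_llm, threshold=0.92):
        self.embed = embed          # prompt -> list of floats
        self.call_llm = call_llm    # prompt -> response string
        self.threshold = threshold  # tune per domain
        self.entries = []           # stored (embedding, response) pairs

    def ask(self, prompt):
        vec = self.embed(prompt)
        best_score, best_response = -1.0, None
        for stored_vec, response in self.entries:
            score = cosine_similarity(vec, stored_vec)
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score > self.threshold:
            return {"response": best_response, "cache_hit": True, "score": best_score}
        # Cache miss: call the expensive model, then store the pair for future hits.
        response = self.call_llm(prompt)
        self.entries.append((vec, response))
        return {"response": response, "cache_hit": False, "score": best_score}


# Toy embedding: word counts over a tiny vocabulary (illustration only).
VOCAB = ["reset", "password", "refund", "order"]


def toy_embed(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


calls = []


def fake_llm(prompt):
    calls.append(prompt)
    return f"answer to: {prompt}"


cache = SemanticCache(toy_embed, fake_llm)
first = cache.ask("how do I reset my password")   # miss, calls the model
second = cache.ask("reset password please")       # hit, reuses the first answer
third = cache.ask("refund my order")              # miss, unrelated prompt
```

Returning the similarity score alongside cache_hit makes it straightforward to log both and to sample near-threshold hits for evaluation.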
Agent memory: Mem0 and LangGraph
Both frameworks ship Valkey backends as a first-class option:
- Mem0: from mem0 import Memory; m = Memory.from_config({"vector_store": {"provider": "valkey", "config": {"url": "valkey://..."}}}). Stores episodic memories (interactions) and semantic memories (facts) in Valkey hashes + a vector index.
- LangGraph: the langgraph-checkpoint-valkey package replaces the default in-memory checkpointer with Valkey for durable graph state, enabling resume-after-restart for long-running agents.
The shared advantage: your cache, your retrieval, your memory, and your rate-limit counters all live in one place. One Helm chart, one IAM role, one set of CloudWatch dashboards.
Native vector vs compatibility-layer table
A common question is "can I just use valkey-search like Pinecone?" Short answer: similar primitives, different operational shape.
| Aspect | valkey-search (native) | Dedicated vector DB |
|---|---|---|
| Latency to ANN | P99 < 5 ms in-region | 10-30 ms cross-service |
| Throughput | 100k+ QPS / shard | varies, often lower per node |
| Hybrid filter | first class | first class |
| Connection model | same as cache | separate service |
| Operational surface | one process | one more system |
| Updates / TTL | every Valkey command available | varies |
| Max vectors / node | ~100M depending on dim | varies |
| ACL / multi-tenant | full Valkey ACL | varies |
Valkey wins when you already have Valkey and your vector count fits one or a few shards. Dedicated vector DBs win at billion-vector scale with heavy analytical query patterns.
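Whether a vector count "fits one or a few shards" depends heavily on dimension, which a back-of-the-envelope calculation makes concrete. The sketch below counts only the raw FLOAT32 payload (4 bytes per component). HNSW graph links, key names, and other hash fields add overhead that is not modeled here, and the function name is hypothetical.

```python
def raw_vector_bytes(num_vectors, dim, bytes_per_component=4):
    """Bytes needed for the vector payload alone (FLOAT32 = 4 bytes per component)."""
    return num_vectors * dim * bytes_per_component


# Compare the raw footprint of 100M vectors at a few common dimensions.
for dim in (384, 768, 1536):
    gib = raw_vector_bytes(100_000_000, dim) / 2**30
    print(f"dim={dim}: {gib:.0f} GiB for 100M vectors")
```

At 1536 dimensions the raw payload for 100M vectors is already about 572 GiB, so a per-node figure of that order is only realistic for much smaller dimensions or very large-memory nodes.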