Valkey Community

Valkey for AI developers

Vector search, semantic cache, and agent memory in one Valkey process — the unified AI data layer pitch.

The pitch in one sentence

Modern AI applications need three data services that traditional stacks split across three systems: vector search (Pinecone / Weaviate / Qdrant), semantic cache (custom Redis + embedding model), and agent memory (Mem0 / LangGraph store). Valkey 9.x puts all three behind one connection, one ACL, one operations story.

Vector search

HNSW indexes via valkey-search. Hybrid filter + similarity. Same connection as your cache.

Semantic cache

Hash a prompt → embed → ANN lookup → return cached LLM response if cosine similarity > threshold.

Agent memory

Mem0 and LangGraph have native Valkey backends. Episodic and semantic memory in one store.

valkey-search basics

valkey-search is the official vector + full-text module, GA since Valkey 9.0. Index creation:

valkey-cli FT.CREATE idx:docs ON HASH PREFIX 1 doc: \
  SCHEMA \
    title TEXT \
    body TEXT \
    embedding VECTOR HNSW 6 TYPE FLOAT32 DIM 1536 DISTANCE_METRIC COSINE

Index a document:

valkey-cli HSET doc:1 \
  title "Valkey 9.1 release notes" \
  body "..." \
  embedding "$(python embed.py 'release notes')"
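
A FLOAT32 vector field is normally written as the raw bytes of the vector, and bash command substitution discards NUL bytes, so writing the hash from a client library is often safer than going through the shell. A minimal sketch, assuming a client object with a redis-py-style hset(name, mapping=...) method (valkey-py is one candidate); the helper names and the toy 4-dimension vector are illustrative and not part of the original:

```python
import struct
from typing import Any, Dict, Sequence


def pack_float32(vector: Sequence[float]) -> bytes:
    """Pack floats as little-endian FLOAT32, 4 bytes per dimension."""
    return struct.pack(f"<{len(vector)}f", *vector)


def document_fields(title: str, body: str, vector: Sequence[float]) -> Dict[str, Any]:
    """Build hash fields matching the idx:docs schema (title, body, embedding)."""
    return {"title": title, "body": body, "embedding": pack_float32(vector)}


def index_document(client: Any, key: str, fields: Dict[str, Any]) -> None:
    """Write the hash. Assumes a redis-py-style client (hypothetical here)."""
    client.hset(key, mapping=fields)


if __name__ == "__main__":
    # Toy 4-dim vector for illustration; the index above expects DIM 1536.
    fields = document_fields("Valkey 9.1 release notes", "...", [0.0, 0.5, -1.0, 2.0])
    print(len(fields["embedding"]))  # 4 dims * 4 bytes = 16
```

The blob length must equal DIM times 4 bytes for a FLOAT32 field, so a 1536-dimension embedding is 6,144 bytes.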

Hybrid query — top 5 most similar where body contains "RDMA":

FT.SEARCH idx:docs "@body:RDMA =>[KNN 5 @embedding $vec AS score]" \
  PARAMS 2 vec "$BLOB" \
  RETURN 2 title score \
  DIALECT 2

The "hybrid" piece, combining text or attribute filters with KNN in a single query, is the differentiator versus pure vector databases. You do not need to round-trip to a second system to filter by tenant id, language, or recency.
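
The same hybrid query can be issued from application code. The sketch below only assembles the FT.SEARCH arguments shown above and hands them to a client; it assumes a client with a redis-py-style execute_command method, and knn_query_args and search are hypothetical helper names:

```python
import struct
from typing import Any, List, Sequence


def knn_query_args(index: str, text_filter: str, vector: Sequence[float], k: int = 5) -> List[Any]:
    """Build FT.SEARCH arguments mirroring the hybrid query above."""
    # The query vector travels as a FLOAT32 blob in the PARAMS section.
    blob = struct.pack(f"<{len(vector)}f", *vector)
    query = f"{text_filter} =>[KNN {k} @embedding $vec AS score]"
    return [
        "FT.SEARCH", index, query,
        "PARAMS", 2, "vec", blob,
        "RETURN", 2, "title", "score",
        "DIALECT", 2,
    ]


def search(client: Any, args: List[Any]) -> Any:
    """Send the command. Assumes a redis-py-style client (hypothetical here)."""
    return client.execute_command(*args)
```

Keeping argument construction separate from the network call makes the query easy to unit test without a running server.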

Semantic cache

This is the pattern for which AWS reported roughly 86 percent cost reduction in internal customer support workloads:

  1. Embed the incoming prompt with a cheap embedding model (text-embedding-3-small or open-source).
  2. ANN search in valkey-search over historical (prompt, response) pairs.
  3. If best match has cosine similarity > 0.92 (tune per domain), return the cached response.
  4. Otherwise call the expensive LLM, then HSET the new pair with its embedding for future hits.
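
The four steps can be sketched as a small class. This is not the reference implementation mentioned below: to stay self-contained it replaces the valkey-search ANN lookup with a brute-force scan over an in-memory list, and the embed and llm callables are placeholders you would supply.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """In-memory stand-in: a linear scan replaces the valkey-search ANN lookup."""

    def __init__(
        self,
        embed: Callable[[str], Sequence[float]],
        llm: Callable[[str], str],
        threshold: float = 0.92,
    ) -> None:
        self.embed = embed
        self.llm = llm
        self.threshold = threshold
        self.entries: List[Tuple[List[float], str]] = []

    def ask(self, prompt: str) -> Tuple[str, bool]:
        """Return (response, cache_hit)."""
        vec = self.embed(prompt)  # step 1: embed the prompt
        best_score: float = -1.0
        best_response: Optional[str] = None
        for stored_vec, response in self.entries:  # step 2: nearest stored prompt
            score = cosine_similarity(vec, stored_vec)
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score > self.threshold:
            return best_response, True  # step 3: similar enough, serve from cache
        response = self.llm(prompt)  # step 4: miss, call the model and store the pair
        self.entries.append((list(vec), response))
        return response, False
```

Returning the cache_hit flag alongside the response makes it straightforward to log hits and sample them for evaluation.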

A reference implementation in Python is ~80 lines. The cache hit rate is the whole story — high-similarity domains (customer support, FAQ, docs Q&A) hit 60-90%; creative or coding workloads hit < 20%.

Tune the threshold per use case. 0.92 is conservative for English text-embedding-3 vectors; lower thresholds save more cost but risk semantically-close-but-factually-different answers. Always log a cache_hit boolean and sample evals.
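
One way to pick a threshold is to replay a labeled sample of lookups and compare hit rate against wrong-hit rate. A sketch, assuming each sample records the best-match cosine similarity and a judgment (human or eval model) of whether the cached answer would have been acceptable; the sample values are made up for illustration. If your index reports cosine distance rather than similarity, convert first (commonly similarity = 1 - distance, which is an assumption to verify for your setup).

```python
from typing import Iterable, List, NamedTuple, Tuple


class Sample(NamedTuple):
    similarity: float  # best-match cosine similarity for the prompt
    acceptable: bool   # would the cached response have been a correct answer?


def sweep(samples: List[Sample], thresholds: Iterable[float]) -> List[Tuple[float, float, float]]:
    """Return (threshold, hit_rate, wrong_hit_rate) rows.

    hit_rate: share of all lookups that would be served from cache.
    wrong_hit_rate: share of those hits whose answer was not acceptable.
    """
    rows = []
    for t in thresholds:
        hits = [s for s in samples if s.similarity > t]
        wrong = [s for s in hits if not s.acceptable]
        hit_rate = len(hits) / len(samples) if samples else 0.0
        wrong_rate = len(wrong) / len(hits) if hits else 0.0
        rows.append((t, hit_rate, wrong_rate))
    return rows


if __name__ == "__main__":
    # Made-up evaluation sample, for illustration only.
    data = [Sample(0.97, True), Sample(0.94, True), Sample(0.90, False), Sample(0.85, True)]
    for t, hit, wrong in sweep(data, [0.80, 0.92]):
        print(f"threshold={t:.2f} hit_rate={hit:.2f} wrong_hit_rate={wrong:.2f}")
```

Lowering the threshold raises hit_rate and usually wrong_hit_rate with it; the acceptable trade-off depends on how costly a wrong cached answer is in your domain.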

Agent memory: Mem0 and LangGraph

Both frameworks ship Valkey backends as a first-class option:

  • Mem0: from mem0 import Memory; m = Memory.from_config({"vector_store": {"provider": "valkey", "config": {"url": "valkey://..."}}}). Stores episodic memories (interactions) and semantic memories (facts) in Valkey hashes + a vector index.
  • LangGraph: the langgraph-checkpoint-valkey package replaces the default in-memory checkpointer with Valkey for durable graph state, enabling resume-after-restart for long-running agents.

The shared advantage: your cache, your retrieval, your memory, and your rate-limit counters all live in one place. One Helm chart, one IAM role, one set of CloudWatch dashboards.

Native vector vs compatibility-layer table

A common question is "can I just use valkey-search like Pinecone." Short answer: similar primitives, different operational shape.

Aspect               | valkey-search (native)         | Dedicated vector DB
Latency to ANN       | P99 < 5 ms in-region           | 10-30 ms cross-service
Throughput           | 100k+ QPS / shard              | varies, often lower per node
Hybrid filter        | first class                    | first class
Connection model     | same as cache                  | separate service
Operational surface  | one process                    | one more system
Updates / TTL        | every Valkey command available | varies
Max vectors / node   | ~100M depending on dim         | varies
ACL / multi-tenant   | full Valkey ACL                | varies

Valkey wins when you already have Valkey and your vector count fits one or a few shards. Dedicated vector DBs win at billion-vector scale with heavy analytical query patterns.
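
Whether a vector set fits one or a few shards is largely a memory question. A back-of-the-envelope sketch that counts only the raw FLOAT32 payload (4 bytes per dimension) and ignores HNSW graph links, per-key overhead, and replicas, so the result is a lower bound; the function names and vector counts are illustrative:

```python
def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for the vector payload alone (no index or key overhead)."""
    return num_vectors * dim * bytes_per_value


def to_gib(num_bytes: int) -> float:
    """Convert bytes to GiB (2**30 bytes)."""
    return num_bytes / 2**30


if __name__ == "__main__":
    dim = 1536  # matches the DIM used in the index example
    for n in (1_000_000, 10_000_000, 100_000_000):
        gib = to_gib(raw_vector_bytes(n, dim))
        print(f"{n:>11,} vectors x {dim} dims: {gib:,.1f} GiB raw")
```

At DIM 1536 each vector carries 6,144 bytes of payload, so ten million vectors need about 61 GB before any index overhead; lower-dimensional embeddings shrink this proportionally.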
