Two pipelines, or it doesn't scale
Under real load RAG fails at the architecture — separate ingestion from queries, then cache, gate, and isolate.
Most RAG guides obsess over components — which vector DB, which embedding model. That's fine for demos. Under real load the failure is architectural, and the fix isn't a config tweak — it's separating ingestion from queries.
The #1 failure
Ingestion and queries share one path. One user uploads a big PDF → everyone else slows down.
The two pipelines
Pipeline A · Ingestion — background, never touches the user
Pipeline B · Query — live, the user is waiting
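The split can be sketched in a few lines. This is a minimal illustration, not a production design: the in-memory `queue.Queue` stands in for a real message broker, the `index` dict stands in for a vector store, and the names `submit_document`, `ingestion_worker`, and `answer_query` are hypothetical.

```python
import queue
import threading
from typing import Optional

# Hypothetical stand-ins: a bounded in-memory queue instead of a message
# broker, and a dict instead of a vector index.
ingest_queue = queue.Queue(maxsize=100)
index = {}


def submit_document(doc: str) -> str:
    """Pipeline A entry point: enqueue the document and return right away."""
    ingest_queue.put(doc)
    return "accepted"


def ingestion_worker() -> None:
    """Pipeline A worker: runs in the background, off the query path."""
    while True:
        doc = ingest_queue.get()
        try:
            if doc is None:  # sentinel value: shut the worker down
                return
            index[doc.lower()] = doc  # placeholder for chunk + embed + upsert
        finally:
            ingest_queue.task_done()


def answer_query(question: str) -> Optional[str]:
    """Pipeline B: reads the index only and never waits on ingestion work."""
    return index.get(question.lower())


worker = threading.Thread(target=ingestion_worker, daemon=True)
worker.start()
```

The point of the shape is that `submit_document` returns before any heavy work happens, so a large upload only lengthens the queue rather than the response time of `answer_query`.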
Ingestion tip — skip what hasn't changed
Store embeddings keyed by a content-addressable hash of the model ID plus the text. On re-index, skip anything whose hash already exists, which saves both money and time.
    import hashlib

    for text in chunks:  # chunks: the texts being (re)indexed
        # sha256 takes bytes, so encode the string before hashing
        key = hashlib.sha256(f"{model_id}:{text}".encode("utf-8")).hexdigest()
        if key in embedding_cache:  # unchanged since last run
            continue  # skip re-embedding
        embedding_cache[key] = embed(text)
Section 03 · the #1 cost saver
Semantic cache — stop at the first hit
Most users ask the same questions in slightly different words. Instead of running the full pipeline every time, climb a ladder and stop the moment you get a hit.
1. Exact match (Redis): instant and free. The same question, asked again.
2. Similar question (FAISS): fast and cheap. A paraphrase of something already answered.
3. Keyword match (BM25): medium cost. Overlapping terms, not yet a vector hit.
4. Full pipeline (cache miss): slow and expensive. Only the genuinely new questions reach here.
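The ladder above boils down to "try each tier in order, return on the first hit". The sketch below shows that control flow only. Everything in it is an assumption for illustration: a dict stands in for Redis, a toy token-overlap check stands in for the FAISS tier, and a BM25 tier would slot into `TIERS` the same way.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-in for the Redis tier: normalized question -> answer.
exact_cache: Dict[str, str] = {"what is rag": "Retrieval-augmented generation."}


def normalize(question: str) -> str:
    return question.strip().lower().rstrip("?")


def exact_lookup(question: str) -> Optional[str]:
    return exact_cache.get(normalize(question))


def similar_lookup(question: str, threshold: float = 0.6) -> Optional[str]:
    """Toy paraphrase tier: Jaccard token overlap instead of vector search."""
    q_tokens = set(normalize(question).split())
    for cached_q, cached_answer in exact_cache.items():
        c_tokens = set(cached_q.split())
        union = q_tokens | c_tokens
        if union and len(q_tokens & c_tokens) / len(union) >= threshold:
            return cached_answer
    return None


Tier = Tuple[str, Callable[[str], Optional[str]]]
TIERS: List[Tier] = [("exact", exact_lookup), ("similar", similar_lookup)]


def answer(question: str, tiers: List[Tier],
           full_pipeline: Callable[[str], str]) -> Tuple[str, str]:
    """Check tiers cheapest-first and stop at the first hit."""
    for name, lookup in tiers:
        hit = lookup(question)
        if hit is not None:
            return name, hit
    # Cache miss: only now pay for retrieval and generation.
    return "full_pipeline", full_pipeline(question)
```

Returning the tier name alongside the answer makes it easy to count hits per tier, which is what a cache-hit-rate metric needs.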
~50% cost avoided
Check the tiers in order and stop at the first hit. Most repeat questions never reach the full pipeline, so roughly half the cost disappears.
Section 04 · throughput
vLLM — not optional
Standard Hugging Face serving chokes when many users arrive at once; requests pile up. vLLM uses continuous batching: while it is still generating for user A it has already started on user B, so the GPU stays busy instead of sitting idle between requests.
Hardware configurations: A100 80GB and H100, each with a 14B model in BF16, based on ~2,000 tokens per request. Continuous batching means no idle GPU.
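To see why continuous batching helps, it can be useful to count decode steps in a toy model. The sketch below is not vLLM's scheduler and uses no vLLM API. It assumes each request needs a fixed number of steps, that the GPU has `batch_size` slots, and that one step advances every running request by one token.

```python
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Static batching: each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Continuous batching: a freed slot is refilled before the next step."""
    waiting = list(lengths)
    running: List[int] = []  # remaining steps per in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1
        # Advance every running request; drop the ones that just finished.
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps
```

With request lengths `[8, 2, 2, 2]` and two slots, the static version takes 10 steps because the short requests wait behind the long one, while the continuous version takes 8 because the short requests cycle through the second slot while the long one keeps running.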
Section 05 · the hidden trap
Hybrid retrieval has a trap
Combining BM25 (keyword) with vector search (meaning) is good practice — but only if you gate it.
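A minimal sketch of such a gate follows. The additive scoring, the `min_vector_score` threshold, the `bm25_weight`, and the `Candidate` fields are all assumptions made for illustration; the only idea taken from the text is that a keyword boost applies solely to chunks that already cleared a vector-similarity bar.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    chunk_id: str
    vector_score: float  # similarity from vector search, assumed in [0, 1]
    bm25_score: float    # keyword score, assumed normalized to [0, 1]


def gated_hybrid_rank(candidates: List[Candidate],
                      min_vector_score: float = 0.5,
                      bm25_weight: float = 0.6) -> List[Candidate]:
    """Rank by vector score, adding a BM25 boost only above the gate."""

    def final_score(c: Candidate) -> float:
        if c.vector_score >= min_vector_score:
            return c.vector_score + bm25_weight * c.bm25_score
        return c.vector_score  # gate closed: keyword match is ignored

    return sorted(candidates, key=final_score, reverse=True)


def ungated_hybrid_rank(candidates: List[Candidate],
                        bm25_weight: float = 0.6) -> List[Candidate]:
    """The anti-pattern, for comparison: BM25 boosts everything it matches."""
    return sorted(candidates,
                  key=lambda c: c.vector_score + bm25_weight * c.bm25_score,
                  reverse=True)
```

In this toy scoring, a chunk with a perfect keyword match but a vector score of 0.2 can outrank a genuinely relevant chunk in the ungated version, and stays at the bottom in the gated one.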
✕ Wrong
Let BM25 freely boost any chunk it matches → keyword-heavy but irrelevant chunks pollute the results.
✓ Right
BM25 may only boost a chunk that already passed a minimum vector-similarity score → clean, relevant results.
Then rerank
Add a cross-encoder reranker after retrieval; it resolves ~80% of simple lookup queries on its own.
Section 06 · real deployments
Real numbers, real systems
What actually moved the needle in production.
- Latency after moving from MongoDB to Qdrant on 200K chunks (the vector-native move).
- ~80% of lookup queries resolved by the cross-encoder reranker alone.
- Share of dev effort that went into the metadata schema, which had the highest ROI of anything they did.
Section 07 · what breaks
The six anti-patterns
None of these show up in a demo. They all show up when real users hit the system.
Section 08 · the forgotten layer
Multi-tenant isolation
If customers share your RAG system, shared namespaces mean a data-leakage risk. Isolate every tenant from day one.
Tenant A
Own vector DB namespace · own embedding cache · own metadata.
Tenant B
Own vector DB namespace · own embedding cache · own metadata.
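One way to make that isolation hard to bypass is to require a tenant ID on every read and write, and to fold it into cache keys. The sketch below assumes an in-memory store; `TenantStore`, `tenant_namespace`, and `tenant_cache_key` are hypothetical names, and a real vector DB would use its own namespace or collection mechanism.

```python
import hashlib
import json
from typing import Dict, List


def tenant_namespace(tenant_id: str) -> str:
    """Derive the per-tenant namespace name (naming scheme is an assumption)."""
    return f"tenant-{tenant_id}"


def tenant_cache_key(tenant_id: str, model_id: str, text: str) -> str:
    """Embedding-cache key that can never collide across tenants."""
    # JSON-encode the parts so that separators inside values stay unambiguous.
    raw = json.dumps([tenant_id, model_id, text]).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class TenantStore:
    """In-memory stand-in for a namespaced vector DB."""

    def __init__(self) -> None:
        self._namespaces: Dict[str, Dict[str, str]] = {}

    def upsert(self, tenant_id: str, chunk_id: str, text: str) -> None:
        namespace = self._namespaces.setdefault(tenant_namespace(tenant_id), {})
        namespace[chunk_id] = text

    def search(self, tenant_id: str, term: str) -> List[str]:
        """Substring search limited to the caller's own namespace."""
        namespace = self._namespaces.get(tenant_namespace(tenant_id), {})
        return [cid for cid, text in namespace.items()
                if term.lower() in text.lower()]
```

Because there is no method that searches without a tenant ID, a forgotten filter turns into an error at the call site rather than a cross-tenant leak.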
High-privacy domains
Healthcare, legal, and finance may need separate model instances per tenant. Expensive, but sometimes required.
Section 09 · what actually wins
First-class priorities
The winning teams aren't swapping embedding models every month. They obsess over these.
Observability
Track latency, cache hit rate, queue depth, and GPU usage.
Cache hit rate
Your main cost lever — monitor it always.
Queue design
Back-pressure, retries, and dead-letter queues.
Resource isolation
Services and tenants must not affect each other.
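The queue-design item above names three mechanisms, and a small single-threaded sketch can show how they fit together. It assumes an in-memory bounded `queue.Queue` in place of a real broker; `try_enqueue` and `drain_with_retries` are hypothetical helpers, and a real system would add retry delays and persistent dead-letter storage.

```python
import queue
from typing import Callable, List, Tuple


def try_enqueue(jobs: queue.Queue, job: str) -> bool:
    """Back-pressure: reject new work when the bounded queue is full."""
    try:
        jobs.put_nowait((job, 0))  # (payload, attempts so far)
        return True
    except queue.Full:
        return False  # caller should slow down or tell the client to retry


def drain_with_retries(jobs: queue.Queue,
                       handler: Callable[[str], None],
                       max_attempts: int = 3) -> List[Tuple[str, str]]:
    """Process jobs, retrying failures and dead-lettering repeat offenders."""
    dead_letters: List[Tuple[str, str]] = []
    while True:
        try:
            job, attempts = jobs.get_nowait()
        except queue.Empty:
            return dead_letters
        try:
            handler(job)
        except Exception as exc:
            if attempts + 1 < max_attempts:
                # Single-threaded here, so the slot just freed is available.
                jobs.put_nowait((job, attempts + 1))
            else:
                dead_letters.append((job, repr(exc)))
```

A job that keeps failing is attempted `max_attempts` times and then parked in the dead-letter list, so one poison message cannot block the rest of the queue.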
Section 10 · before you ship
Production-readiness checklist
Tick what's true of your system today. The gaps are your launch blockers.
- Ingestion is async via message queue (not inline)
- Ingestion and query paths are fully separated
- Semantic cache ladder is in place
- vLLM is used for generation
- BM25 is gated by vector-similarity score
- Cross-encoder reranker is in the pipeline
- Embeddings stored with content-addressable hash
- Metadata schema designed before dev starts
- Each tenant has its own namespace in the vector DB
- HPA rules defined per service in Kubernetes
- Observability live: latency, cache, queue, GPU