
Two pipelines, or it doesn't scale

Under real load, RAG fails at the architecture level: separate ingestion from queries, then cache, gate, and isolate.

#rag #production #cost #scaling

Most RAG guides obsess over components — which vector DB, which embedding model. That's fine for demos. Under real load the failure is architectural, and the fix isn't a config tweak — it's separating ingestion from queries.

The #1 failure

Ingestion and queries share one path. One user uploads a big PDF → everyone else slows down.

The two pipelines

Pipeline A · Ingestion — background, never touches the user

Upload: user document

Message queue: Celery + Redis / Kafka

Workers: parse → embed → index

Vector DB: ready for queries

Uploads run on a background queue, so they never affect query speed. Ever.
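To make the hand-off concrete, here is a minimal sketch of Pipeline A. It uses Python's standard-library `queue` and `threading` as an in-process stand-in for a real broker such as Celery + Redis or Kafka. The names `ingest_queue`, `vector_index`, `parse`, `worker`, and `upload` are illustrative assumptions, and the embedding step is omitted.

```python
import queue
import threading

# In-process stand-in for the message queue (assumption: a real deployment
# would use a broker such as Celery + Redis or Kafka).
ingest_queue = queue.Queue()
vector_index = {}   # hypothetical stand-in for the vector DB


def parse(document):
    """Toy parser: split a document into non-empty paragraph chunks."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]


def worker():
    """Background worker: parse -> (embed) -> index, off the request path."""
    while True:
        doc = ingest_queue.get()
        if doc is None:                  # sentinel: shut down
            ingest_queue.task_done()
            break
        chunks = parse(doc)
        vector_index[f"doc-{len(vector_index)}"] = chunks   # embedding omitted
        ingest_queue.task_done()


def upload(document):
    """Request handler: enqueue the document and return immediately."""
    ingest_queue.put(document)
    return "accepted"


threading.Thread(target=worker, daemon=True).start()
status = upload("First paragraph.\n\nSecond paragraph.")
ingest_queue.put(None)
ingest_queue.join()   # demo only: wait until the worker has drained the queue
```

The point of the shape is that `upload` does no parsing or embedding itself, so the time a user waits on an upload request does not depend on document size.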

Pipeline B · Query — live, the user is waiting

Question: user

API gateway

Semantic cache: checked first · ~50% of cost avoided

Query rewriter

Hybrid retriever: BM25 + vector

Reranker

vLLM: generation

Answer

Each step scales independently — the slow path is isolated from ingestion entirely.

Ingestion tip — skip what hasn't changed

Store embeddings keyed by a content-addressable hash of the model ID plus the text. On re-index, skip anything whose hash already exists — saving both money and time.

python

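import hashlib

for text in chunks:  # chunks, model_id, embed(): supplied by your pipeline
    key = hashlib.sha256(f"{model_id}:{text}".encode("utf-8")).hexdigest()
    if key in embedding_cache:     # unchanged since last run
        continue                    # skip re-embedding
    embedding_cache[key] = embed(text)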

Section 03 · the #1 cost saver

Semantic cache — stop at the first hit

Most users ask the same questions in slightly different words. Instead of running the full pipeline every time, climb a ladder and stop the moment you get a hit.

1. Exact match (Redis): instant and free. The same question, asked again.
2. Similar question (FAISS): fast and cheap. A paraphrase of something already answered.
3. Keyword match (BM25): medium cost. Overlapping terms, not yet a vector hit.
4. Full pipeline (cache miss): slow and expensive. Only the genuinely new questions reach here.

~50% cost avoided

Check the tiers in order and stop at the first hit. Most repeat questions never reach the full pipeline — roughly half the cost disappears.
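The ladder above can be sketched as an ordered list of lookups that returns at the first hit. This is an illustrative skeleton, not a reference implementation: tier 1 uses a plain dict where the article uses Redis, tiers 2 and 3 are placeholders for FAISS and BM25 lookups, and every function name here is an assumption.

```python
from typing import Callable, Optional

Lookup = Callable[[str], Optional[str]]

exact_cache = {"what is rag?": "Retrieval-augmented generation."}


def exact_match(question: str) -> Optional[str]:
    """Tier 1: exact-string lookup (a dict standing in for Redis)."""
    return exact_cache.get(question.strip().lower())


def similar_question(question: str) -> Optional[str]:
    """Tier 2 placeholder: a real version would query a vector index such as FAISS."""
    return None


def keyword_match(question: str) -> Optional[str]:
    """Tier 3 placeholder: a real version would score cached questions with BM25."""
    return None


def full_pipeline(question: str) -> str:
    """Tier 4: rewrite -> retrieve -> rerank -> generate (stubbed here)."""
    return f"generated answer for: {question}"


TIERS: list[tuple[str, Lookup]] = [
    ("exact", exact_match),
    ("similar", similar_question),
    ("keyword", keyword_match),
]


def answer(question: str) -> tuple[str, str]:
    """Check tiers in order; stop at the first hit, else run the full pipeline."""
    for name, lookup in TIERS:
        hit = lookup(question)
        if hit is not None:
            return name, hit
    result = full_pipeline(question)
    exact_cache[question.strip().lower()] = result   # warm tier 1 for next time
    return "miss", result
```

Returning the tier name alongside the answer makes it straightforward to count hits per tier, which is the cache hit rate the article treats as the main cost lever.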

Section 04 · throughput

vLLM — not optional

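Standard Hugging Face serving chokes when many users arrive at once; requests pile up. vLLM uses continuous batching: it schedules work one decoding step at a time, so a new request can join the running batch as soon as another finishes instead of waiting for the whole batch to complete. While it generates for user A it is already working on user B, which keeps the GPU busy.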

~10,800 req/hr

A100 80GB · 14B model, BF16.

~14,400 req/hr

H100 · 14B model, BF16.

Based on ~2,000 tokens per request. Continuous batching means no idle GPU.
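For capacity planning it helps to convert the hourly figures above into per-second rates. The short calculation below does only that arithmetic; it takes the article's ~2,000 tokens per request as given. Whether that figure counts prompt tokens, generated tokens, or both is not stated in the source.

```python
TOKENS_PER_REQUEST = 2_000   # the article's stated assumption


def per_second(requests_per_hour):
    """Return (requests/s, tokens/s) for a given hourly throughput."""
    rps = requests_per_hour / 3600
    return rps, rps * TOKENS_PER_REQUEST


a100 = per_second(10_800)   # (3.0, 6000.0)
h100 = per_second(14_400)   # (4.0, 8000.0)
```

So the quoted numbers correspond to roughly 3 requests per second on the A100 and 4 on the H100.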

Section 05 · the hidden trap

Hybrid retrieval has a trap

Combining BM25 (keyword) with vector search (meaning) is good practice — but only if you gate it.

✕ Wrong

Let BM25 freely boost any chunk it matches → keyword-heavy but irrelevant chunks pollute the results.

✓ Right

BM25 may only boost a chunk that already passed a minimum vector-similarity score → clean, relevant results.
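One way to express the gate in code is sketched below. The threshold `min_vector_score`, the weight `bm25_weight`, and the assumption that both scores are normalised to [0, 1] are illustrative choices, not values from the article. Chunks below the gate keep their vector score but receive no keyword boost.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    chunk_id: str
    vector_score: float   # cosine similarity, assumed in [0, 1]
    bm25_score: float     # BM25 score, assumed normalised to [0, 1]


def gated_hybrid_rank(candidates, min_vector_score=0.5, bm25_weight=0.3):
    """Rank chunks; BM25 boosts only chunks that clear the vector-similarity gate."""
    scored = []
    for c in candidates:
        passed_gate = c.vector_score >= min_vector_score
        boost = bm25_weight * c.bm25_score if passed_gate else 0.0
        scored.append((c.vector_score + boost, c.chunk_id))
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored]


ranking = gated_hybrid_rank([
    Candidate("relevant", 0.8, 0.2),
    Candidate("keyword_spam", 0.2, 1.0),            # many keyword hits, low similarity
    Candidate("relevant_with_keywords", 0.7, 0.9),
])
```

Here `keyword_spam` has the highest BM25 score but never benefits from it, because it fails the similarity gate.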

Then rerank

Add a cross-encoder reranker after retrieval — it resolves ~80% of simple lookup queries on its own.

Section 06 · real deployments

Real numbers, real systems

What actually moved the needle in production.

400ms → 70ms

Latency after moving from MongoDB to Qdrant on 200K chunks — the vector-native move.

~80% of lookup queries resolved by the cross-encoder reranker alone.

~40% of dev effort went into the metadata schema, and it had the highest ROI of anything the team did.

Section 07 · what breaks

The six anti-patterns

None of these show up in a demo. They all show up when real users hit the system.

Section 08 · the forgotten layer

Multi-tenant isolation

If customers share your RAG system, shared namespaces mean a data-leakage risk. Isolate every tenant from day one.

Tenant A

Own vector DB namespace · own embedding cache · own metadata.

Tenant B

Own vector DB namespace · own embedding cache · own metadata.

High-privacy domains

Healthcare, legal, and finance may need separate model instances per tenant. Expensive, but sometimes required.
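The per-tenant namespaces described above can be enforced structurally: application code receives a handle that is already bound to one tenant, so a query cannot name another tenant's data. The sketch below uses in-memory dicts and substring matching in place of a real vector DB; `TenantScopedStore` and `TenantView` are hypothetical names.

```python
class TenantScopedStore:
    """Toy store in which every read and write is bound to one tenant namespace."""

    def __init__(self):
        self._namespaces = {}   # tenant_id -> {chunk_id: text}

    def for_tenant(self, tenant_id):
        """Return a handle that can only see this tenant's namespace."""
        return TenantView(self._namespaces.setdefault(tenant_id, {}))


class TenantView:
    """Handle bound to one namespace; it exposes no way to reach another tenant."""

    def __init__(self, namespace):
        self._namespace = namespace

    def upsert(self, chunk_id, text):
        self._namespace[chunk_id] = text

    def search(self, term):
        # Substring match stands in for vector search.
        return [cid for cid, text in self._namespace.items() if term in text]


store = TenantScopedStore()
tenant_a = store.for_tenant("tenant-a")
tenant_b = store.for_tenant("tenant-b")
tenant_a.upsert("c1", "invoice policy")
tenant_b.upsert("c1", "patient record")
```

With this shape, `tenant_a.search("patient")` returns an empty list even though both tenants use the same chunk ID.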

Section 09 · what actually wins

First-class priorities

The winning teams aren't swapping embedding models every month. They obsess over these.

Observability

Track latency, cache hit rate, queue depth, and GPU usage.

Cache hit rate

Your main cost lever — monitor it always.

Queue design

Back-pressure, retries, and dead-letter queues.

Resource isolation

Services and tenants must not affect each other.
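The queue-design priority above (back-pressure, retries, dead-letter queues) can be illustrated with a bounded in-process queue. This is a sketch under simplifying assumptions: `MAX_RETRIES`, the queue size, and the `process` handler are made up for the example, and a production system would rely on its broker's own retry and dead-letter features.

```python
import queue

MAX_RETRIES = 3
jobs = queue.Queue(maxsize=2)   # bounded: a full queue pushes back on producers
dead_letters = []               # jobs that exhausted their retries


def submit(job):
    """Back-pressure: reject new work instead of buffering without limit."""
    try:
        jobs.put_nowait((job, 0))
        return True
    except queue.Full:
        return False            # caller should retry later (e.g. HTTP 429)


def process(job):
    """Hypothetical handler; jobs named 'bad...' always fail in this demo."""
    if job.startswith("bad"):
        raise ValueError("unparseable document")


def drain():
    """Process queued jobs, retrying failures and dead-lettering after MAX_RETRIES."""
    while not jobs.empty():
        job, attempts = jobs.get_nowait()
        try:
            process(job)
        except ValueError:
            if attempts + 1 < MAX_RETRIES:
                jobs.put_nowait((job, attempts + 1))   # schedule a retry
            else:
                dead_letters.append(job)               # keep for inspection


accepted = [submit("good.pdf"), submit("bad.pdf"), submit("extra.pdf")]
drain()
```

The third submission is rejected because the queue is full, and `bad.pdf` lands in `dead_letters` after three failed attempts rather than blocking the jobs behind it.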

Section 10 · before you ship

Production-readiness checklist

Tick what's true of your system today. The gaps are your launch blockers.

Ingestion is async via message queue (not inline)
Ingestion and query paths are fully separated
Semantic cache ladder is in place
vLLM is used for generation
BM25 is gated by vector-similarity score
Cross-encoder reranker is in the pipeline
Embeddings stored with content-addressable hash
Metadata schema designed before dev starts
Each tenant has its own namespace in the vector DB
Horizontal Pod Autoscaler (HPA) rules defined per service in Kubernetes
Observability live: latency, cache, queue, GPU