The 7 layers of a production RAG system
Most RAG projects fail at retrieval, not the model — the layers a demo never has to get right.
Most RAG projects don't fail at the model, and they don't fail at the prompt. They fail at retrieval — quietly, in the layers a demo never has to get right. Each layer below can take weeks to tune, but getting the fundamentals right at every stage makes every iteration after it faster and more predictable.
Document parsing
garbage in, garbage out
PDFs and complex documents are not plain text. A basic extractor silently drops the structure that carries the meaning — and you only find out when retrieval misses.
OCR & VLM vs. layout analysis
OCR pulls characters and ignores structure. Layout analysis understands where text sits on the page and what it means in context. For legal, financial, and technical docs, that gap decides whether retrieval works at all.
What the parser must keep
Table rows, columns and merged cells · column reading order · headers and footers (often the metadata you filter on) · charts and images that hold data the text never states.
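To make the reading-order point concrete, here is a small sketch. It assumes a hypothetical parser that emits text blocks with page coordinates; the `Block` type, the `column_split_x` parameter and the two-column heuristic are illustrative assumptions, not any particular library's API.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    x: float  # left edge of the block on the page
    y: float  # top edge of the block (0 = top of page)


def naive_order(blocks):
    """Top-to-bottom order that ignores columns and interleaves them."""
    return sorted(blocks, key=lambda b: (b.y, b.x))


def column_order(blocks, column_split_x):
    """Read the left column fully, then the right column."""
    left = [b for b in blocks if b.x < column_split_x]
    right = [b for b in blocks if b.x >= column_split_x]
    return sorted(left, key=lambda b: b.y) + sorted(right, key=lambda b: b.y)


# A toy two-column page: each clause continues on the next line of its column.
page = [
    Block("Clause 1 begins", x=50, y=100),
    Block("Clause 2 begins", x=320, y=100),
    Block("and ends here.", x=50, y=120),
    Block("and also ends.", x=320, y=120),
]

naive_text = " ".join(b.text for b in naive_order(page))
column_text = " ".join(b.text for b in column_order(page, column_split_x=300))
```

The naive ordering splices the two clauses into each other, which is the kind of silent damage that only shows up later as a retrieval miss.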
Chunking strategy
size is a trade-off, not a constant
There is no single right chunk size — but there are research-backed starting points, and the structure of your documents should drive the splits.
Starting points
A chunk size of 200–400 tokens is a reasonable starting point for most cases. Smaller chunks tend to win on fact lookup, larger ones on summarization. Keep 10–15% overlap so context survives the chunk boundaries.
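As a rough illustration of these starting points, the sketch below cuts a token list into fixed-size windows with a proportional overlap. The function name and defaults are assumptions chosen to match the ranges above, and the synthetic tokens stand in for the output of whatever tokenizer your embedding model uses.

```python
def chunk_tokens(tokens, chunk_size=300, overlap_ratio=0.1):
    """Split a token list into fixed-size windows that share an overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Synthetic tokens stand in for real tokenizer output (an assumption).
tokens = [f"t{i}" for i in range(700)]
chunks = chunk_tokens(tokens, chunk_size=300, overlap_ratio=0.1)
```

With these settings each window starts 270 tokens after the previous one, so the last 30 tokens of one chunk reappear at the start of the next.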
Document-aware splitting
For Markdown and sectioned reports, follow headers and section breaks instead of blindly counting tokens. A cut mid-paragraph loses coherence; a cut on a section boundary preserves it.
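A minimal sketch of header-based splitting for Markdown, using only the standard library. The `split_by_headers` helper is hypothetical, and it deliberately ignores edge cases such as `#` characters inside fenced code blocks.

```python
import re

HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headers(markdown):
    """Return one section per Markdown header, keeping the header as metadata."""
    sections = []
    title, lines = None, []

    def flush():
        # Skip the empty "section" that precedes the first header.
        if title is not None or any(line.strip() for line in lines):
            sections.append({"title": title, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match:
            flush()
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return sections


doc = """# Termination
Either party may terminate with 30 days notice.

## Fees on termination
Outstanding fees remain payable.
"""
sections = split_by_headers(doc)
```

Sections that are still too long can then be passed through a token-based splitter, so the size limit applies inside a section rather than across its boundary.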
Parent–child chunking
Index small chunks for precise retrieval, then return the parent section to the model. Small finds the spot; large gives enough to answer well.
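The parent-child idea can be sketched as follows. The word-overlap score is a toy stand-in for vector similarity, and every name here (`build_parent_child_index`, `retrieve_parents`, `child_size`) is a hypothetical chosen for illustration.

```python
def build_parent_child_index(sections, child_size=40):
    """Split each parent section into small child chunks that remember their parent."""
    parents, children = {}, []
    for parent_id, text in enumerate(sections):
        parents[parent_id] = text
        words = text.split()
        for start in range(0, len(words), child_size):
            children.append({
                "parent_id": parent_id,
                "text": " ".join(words[start:start + child_size]),
            })
    return parents, children


def retrieve_parents(query, parents, children, top_k=1):
    """Rank children by a toy word-overlap score, then return their parent sections."""
    query_words = set(query.lower().split())

    def score(child):
        return len(query_words & set(child["text"].lower().split()))

    seen, results = set(), []
    for child in sorted(children, key=score, reverse=True):
        if child["parent_id"] not in seen:
            seen.add(child["parent_id"])  # return each parent at most once
            results.append(parents[child["parent_id"]])
        if len(results) == top_k:
            break
    return results


sections = [
    "Refund policy: customers may request a refund within 14 days of purchase. "
    "Refunds are issued to the original payment method.",
    "Shipping policy: orders ship within 2 business days. "
    "International delivery takes longer.",
]
parents, children = build_parent_child_index(sections, child_size=8)
context = retrieve_parents("refund within 14 days", parents, children, top_k=1)
```

The match happens on an eight-word child, but the model receives the whole refund section, including the sentence about the payment method that the child alone would have cut off.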
Embedding model
the real bottleneck
The wrong embedding model makes every other layer work harder. Choose on three axes, and test on your own queries before committing.
1 · Retrieval benchmarks
Score on retrieval tasks specifically — not classification or clustering. General leaderboards mislead.
2 · Dimensions vs. performance
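Test at 512 dims before assuming you need 1,536. Models trained with Matryoshka Representation Learning (MRL) are designed to hold most of their quality when embeddings are truncated to fewer dimensions.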
3 · Language support
For Arabic + English, use one true multilingual model — not two models with language detection wedged in between.
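One way to run the dimension test is to truncate embeddings, renormalize them, and check whether the ranking on your own queries changes. The sketch below uses four-dimensional toy vectors in place of real model output, and all function names are assumptions. Truncation like this only makes sense for models trained to support it.

```python
import math


def truncate_and_normalize(vector, dims):
    """Keep the first `dims` components and rescale to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def top1(query, docs, dims):
    """Index of the document most similar to the query at the given dimensionality."""
    q = truncate_and_normalize(query, dims)
    scores = [cosine(q, truncate_and_normalize(d, dims)) for d in docs]
    return scores.index(max(scores))


# Toy vectors standing in for real embeddings (an assumption).
query = [0.9, 0.1, 0.3, 0.1]
docs = [
    [0.8, 0.2, 0.4, 0.0],
    [0.1, 0.9, 0.0, 0.4],
]
full_choice = top1(query, docs, dims=4)
small_choice = top1(query, docs, dims=2)
```

On real data the same comparison would be run over a labelled query set, for example at 512 and at full dimensionality, and the smaller size kept only if retrieval quality holds.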
Vector database
match the tool to your scale
Production means scalability and reliability. The right choice depends on your situation — corpus size and team capacity — not on what is popular.
Under 100K documents
A vector extension on your existing relational DB is enough. pgvector on PostgreSQL is a legitimate production choice with no new infrastructure to run.
100K – 500M vectors
Qdrant — lower operational overhead, excellent metadata filtering, strong for multi-tenant systems.
Above 500M vectors
Milvus / Zilliz Cloud — separates query from ingestion at scale, but needs infrastructure expertise.
Also check
Latency targets (P50 vs P99) · managed vs self-hosted (money vs engineering time) · built-in hybrid search before you build your own fusion layer.
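The P50 versus P99 distinction is easy to check from logged request times. This is a small sketch using Python's standard `statistics.quantiles`; the sample latencies are invented for illustration.

```python
import statistics


def latency_percentiles(latencies_ms):
    """Return the median (P50) and tail (P99) latency of a sample."""
    # n=100 yields 99 cut points; index 49 is the 50th percentile, 98 the 99th.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p99": cuts[98]}


# Invented sample: most queries are fast, a few hit a slow path.
latencies_ms = [20] * 98 + [900] * 2
summary = latency_percentiles(latencies_ms)
```

Here the median looks healthy while the tail is 45 times slower, which is why a target stated only as P50 can hide the requests users actually complain about.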
Hybrid search
works vs. works well
This is the layer that separates systems that work from systems that work well. Dense vector search alone consistently misses the exact strings users actually type.
What dense search alone keeps missing
Exact contract numbers, product codes, rare identifiers and domain-specific codes — the literal strings semantic similarity glides right past:
NDA-2024-0891 · SKU-44821-B · case numbers · medical codes · legal citations
The gate rule
BM25 may only boost chunks that already passed a minimum vector-similarity threshold. Otherwise keyword-heavy junk contaminates the top results.
Reranking
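The gate rule for hybrid search, applied before any reranking, could be sketched as follows. The threshold of 0.5, the BM25 weight of 0.3 and the max-normalization of BM25 scores are all assumptions for illustration and would need tuning on real queries.

```python
def gated_hybrid_scores(vector_scores, bm25_scores, min_similarity=0.5, bm25_weight=0.3):
    """Combine scores so BM25 can only boost chunks that pass the vector gate."""
    max_bm25 = max(bm25_scores.values(), default=0.0)
    combined = {}
    for chunk_id, similarity in vector_scores.items():
        boost = 0.0
        if similarity >= min_similarity and max_bm25 > 0:
            # Gate passed: add the BM25 score, scaled into the range [0, 1].
            boost = bm25_scores.get(chunk_id, 0.0) / max_bm25
        combined[chunk_id] = similarity + bm25_weight * boost
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Invented scores: "keyword_spam" repeats the query terms but is off-topic.
vector_scores = {"contract_clause": 0.62, "faq_intro": 0.58, "keyword_spam": 0.35}
bm25_scores = {"contract_clause": 7.5, "keyword_spam": 12.0}

ranking = [chunk_id for chunk_id, _ in gated_hybrid_scores(vector_scores, bm25_scores)]
```

Without the gate, the spam chunk's BM25 boost would lift it to 0.65 and past the FAQ chunk. With the gate it keeps only its vector score and stays last.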
good results → excellent results
Vector search and BM25 are fast first-stage retrievers: they find good candidates but do not rank them perfectly. A cross-encoder scores the query and each chunk together, which is slower but far more accurate.
candidates = hybrid_search(query, top_k=100) # fast, approximate
ranked = cross_encoder.rank(query, candidates) # slow, precise
context = ranked[:8] # top 5–10
answer = llm.generate(query, context)
Why it kills hallucinations
Hallucinations often come from the model straining to reconcile irrelevant chunks. Better ranking means cleaner context, and cleaner context means fewer gaps to invent into.
Evaluation & observability
or you are flying blind
You can't improve what you can't measure, and without this layer retrieval quality degrades silently until users complain. Run it from day one rather than bolting it on later.
Trace every request
Which chunks were retrieved, with what scores, and what was actually sent to the model.
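A trace record covering those three questions might look like the sketch below. The field names and the JSON-lines style output are assumptions; a tracing tool would define its own schema.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class RetrievedChunk:
    chunk_id: str
    score: float
    sent_to_model: bool  # False if the chunk was retrieved but cut before generation


@dataclass
class RequestTrace:
    query: str
    chunks: list
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)

    def to_json(self):
        """Serialize the whole trace, nested chunks included, as one JSON line."""
        return json.dumps(asdict(self))


trace = RequestTrace(
    query="What is the notice period in NDA-2024-0891?",
    chunks=[
        RetrievedChunk("nda-0891#s4", score=0.81, sent_to_model=True),
        RetrievedChunk("nda-0412#s2", score=0.47, sent_to_model=False),
    ],
)
line = trace.to_json()
```

Writing one such line per request is enough to answer, after the fact, why a given answer was produced from the context it was given.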
Quality metrics, continuously
Context recall — did retrieval find the right chunks? · Faithfulness — is the answer grounded in context? · Answer relevance — does it address the question?
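Of the three metrics, context recall is the simplest to compute yourself if you have a labelled set of queries with known relevant chunks. The sketch below is an ID-based simplification under that assumption, not the definition used by any particular evaluation tool. Faithfulness and answer relevance usually need a judge model and are left out.

```python
def context_recall(retrieved_ids, relevant_ids):
    """Fraction of the known-relevant chunks that retrieval actually returned."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each labelled query needs at least one relevant chunk")
    return len(relevant & set(retrieved_ids)) / len(relevant)


def mean_context_recall(cases):
    """Average context recall over (retrieved_ids, relevant_ids) pairs."""
    return sum(context_recall(r, g) for r, g in cases) / len(cases)


# Invented labelled cases: (what retrieval returned, what it should have found).
cases = [
    (["c1", "c4", "c7"], ["c1", "c2"]),  # found one of two
    (["c3", "c5"], ["c3"]),              # found the only relevant chunk
]
score = mean_context_recall(cases)
```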
Humans + alerts
A human-evaluation workflow for edge cases and calibration · alerts on gradual quality degradation, not just on errors.
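An alert on gradual degradation can be as simple as comparing a rolling mean of a quality metric against a baseline. The class name, window size and tolerance below are assumptions for illustration.

```python
from collections import deque


class DriftAlert:
    """Flags a sustained drop of a quality metric below a baseline."""

    def __init__(self, baseline, window=50, tolerance=0.05):
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores = deque(maxlen=window)  # keeps only the most recent scores

    def observe(self, score):
        """Record one score; return True once the full window's mean has dropped."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data yet to judge a trend
        mean = sum(self.scores) / len(self.scores)
        return mean < self.baseline - self.tolerance


# Invented daily faithfulness scores that slide slowly, with no hard errors.
alert = DriftAlert(baseline=0.90, window=5, tolerance=0.05)
daily_scores = [0.90, 0.89, 0.88, 0.86, 0.84, 0.82, 0.80]
alerts = [alert.observe(s) for s in daily_scores]
```

No single day in this series looks alarming on its own; the alert fires only on the last one, when the five-day mean falls below 0.85.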
Tools
Ragas · DeepEval · LangSmith · Prometheus + Grafana
The seven layers together
Clean input in, grounded answer out.
1 · Document parsing
Clean, structured input before anything else.
2 · Chunking strategy
Right size and overlap for your document types.
3 · Embedding model
Chosen for your data, tested on your queries.
4 · Vector database
Matched to your scale and operational capacity.
5 · Hybrid search
BM25 + vector, fused and gated.
6 · Reranking
Top 50–100 in, top 5–10 out.
7 · Evaluation & observability
Running from day one, not added later.
The production-readiness checklist
Tick what's true of your system today. The gaps are your roadmap.