
The 7 layers of a production RAG system

Most RAG projects fail at retrieval, not the model — the layers a demo never has to get right.

#rag #retrieval #architecture

Most RAG projects don't fail at the model, and they don't fail at the prompt. They fail at retrieval — quietly, in the layers a demo never has to get right. Each layer below can take weeks to tune, but getting the fundamentals right at every stage makes every iteration after it faster and more predictable.

Parse (L1) → Chunk (L2) → Embed (L3) → Store (L4) → Hybrid (L5) → Rerank (L6) → Evaluate (L7)

Clean input in, grounded answer out — seven stages, each a place retrieval can quietly break.

Document parsing

garbage in, garbage out

PDFs and complex documents are not plain text. A basic extractor silently drops the structure that carries the meaning — and you only find out when retrieval misses.

OCR & VLM vs. layout analysis

OCR pulls characters and ignores structure. Layout analysis understands where text sits on the page and what it means in context. For legal, financial, and technical docs, that gap decides whether retrieval works at all.

What the parser must keep

Table rows, columns and merged cells · column reading order · headers and footers (often the metadata you filter on) · charts and images that hold data the text never states.

Chunking strategy

size is a trade-off, not a constant

There is no single right chunk size — but there are research-backed starting points, and the structure of your documents should drive the splits.

Starting points

200–400 tokens suits most cases. Smaller chunks win on fact lookup, larger on summarization. Keep 10–15% overlap so context survives the boundaries.
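
As a rough illustration of these starting points, the sketch below cuts a token list into fixed-size windows that overlap by a fraction of the chunk size. The function name `chunk_tokens` and its defaults (300 tokens, 12.5% overlap) are assumptions chosen from the ranges above, and the pre-split token list stands in for whatever tokenizer your embedding model actually uses.

```python
def chunk_tokens(tokens, chunk_size=300, overlap_ratio=0.125):
    """Split a token list into fixed-size windows that share an overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Hypothetical input: 1,000 distinct placeholder tokens.
tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
print(len(chunks), [len(c) for c in chunks])
```

Each chunk repeats the last 37 tokens of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.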

Document-aware splitting

For Markdown and sectioned reports, follow headers and section breaks instead of blindly counting tokens. A cut mid-paragraph loses coherence; a cut on a section boundary preserves it.
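
One way to follow headers instead of counting tokens is sketched below. `split_markdown_sections` is a hypothetical helper, not a library API: it starts a new section at every `#` header and records the header path, so each chunk keeps its place in the document. It assumes ATX-style headers and does not special-case `#` lines inside fenced code blocks.

```python
import re

HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_sections(text):
    """Split Markdown into one section per header, keeping the header path."""
    sections = []
    path = []  # stack of (level, title) for the headers currently in scope
    body = []  # lines collected since the last header

    def flush():
        content = "\n".join(body).strip()
        if content:
            sections.append({"path": [title for _, title in path], "text": content})
        body.clear()

    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            flush()  # close the previous section under its own header path
            level = len(match.group(1))
            # Drop headers at the same or a deeper level, then push this one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return sections


doc = "# Policy\nIntro text.\n## Refunds\nRefunds take 5 days.\n## Shipping\nShips in 2 days.\n"
for section in split_markdown_sections(doc):
    print(" > ".join(section["path"]), "|", section["text"])
```

Sections that come out longer than your chunk budget can still be split further by token count; the header path is worth storing as metadata either way.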

Parent–child chunking

Index small chunks for precise retrieval, then return the parent section to the model. Small finds the spot; large gives enough to answer well.
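
A minimal sketch of that idea, assuming whitespace-split words as a stand-in for tokens and plain dictionaries as a stand-in for your index. The names `build_parent_child_index` and `expand_to_parents` are hypothetical. In a real system the child texts are what you embed and search; the parent text is only looked up at answer time.

```python
def build_parent_child_index(sections, child_size=40):
    """Split each parent section into small child chunks that remember their parent."""
    parents = {}
    children = []
    for parent_id, text in enumerate(sections):
        parents[parent_id] = text
        words = text.split()
        for start in range(0, len(words), child_size):
            children.append({
                "parent_id": parent_id,
                "text": " ".join(words[start:start + child_size]),
            })
    return parents, children


def expand_to_parents(hits, parents):
    """Map retrieved child chunks to parent sections, de-duplicated, in hit order."""
    seen = set()
    context = []
    for child in hits:
        parent_id = child["parent_id"]
        if parent_id not in seen:
            seen.add(parent_id)
            context.append(parents[parent_id])
    return context


sections = [("alpha " * 100).strip(), ("beta " * 50).strip()]
parents, children = build_parent_child_index(sections)
hits = [children[4], children[0], children[1]]  # pretend these came back from search
context = expand_to_parents(hits, parents)
print(len(children), len(context))
```

Two hits from the same section collapse into one parent, which keeps the prompt from repeating the same text.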

Embedding model

the real bottleneck

The wrong embedding model makes every other layer work harder. Choose on three axes, and test on your own queries before committing.

1 · Retrieval benchmarks

Compare models on their retrieval-task scores specifically, not classification or clustering. Overall leaderboard averages mislead.

2 · Dimensions vs. performance

Test at 512 dims before assuming you need 1,536. MRL models hold quality at lower dimensions.
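
Testing a lower dimension can be as simple as the sketch below, assuming the model was trained with Matryoshka Representation Learning (MRL), so that the leading components of each vector carry most of the signal. For models not trained that way, truncation like this is not expected to hold quality. `truncate_embedding` and `cosine` are hypothetical helpers written with the standard library only.

```python
import math


def truncate_embedding(vector, dims):
    """Keep the first `dims` components and L2-normalize the result."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return list(head)  # a zero vector cannot be normalized
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 4-dimensional vectors; real embeddings would have hundreds of components.
query = [3.0, 4.0, 12.0, 0.5]
chunk = [2.5, 4.5, 11.0, -0.5]
print(cosine(query, chunk))
print(cosine(truncate_embedding(query, 2), truncate_embedding(chunk, 2)))
```

Run your own query set at both sizes and compare retrieval quality before paying for the larger index.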

3 · Language support

For Arabic + English, use one true multilingual model — not two models with language detection wedged in between.

+10–15 pts in top-20 retrieval accuracy from fine-tuning on 500–2,000 domain examples, when your docs use unfamiliar terminology.

Vector database

match the tool to your scale

Production means scalability and reliability. The right choice depends on your situation — corpus size and team capacity — not on what is popular.

Under 100K documents

A vector extension on your existing relational DB is enough. pgvector on PostgreSQL is a legitimate production choice with no new infrastructure to run.

100K – 500M vectors

Qdrant — lower operational overhead, excellent metadata filtering, strong for multi-tenant systems.

Above 500M vectors

Milvus / Zilliz Cloud — separates query from ingestion at scale, but needs infrastructure expertise.

Also check

Latency targets (P50 vs P99) · managed vs self-hosted (money vs engineering time) · built-in hybrid search before you build your own fusion layer.
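
To make the P50 vs P99 distinction concrete, here is a small sketch using Python's standard `statistics.quantiles`. The helper name `latency_percentiles` and the sample latencies are made up for illustration.

```python
import statistics


def latency_percentiles(latencies_ms):
    """Return (p50, p99) for a list of request latencies in milliseconds."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1 through 99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return cuts[49], cuts[98]


# Hypothetical sample: 98 fast queries and two slow ones.
latencies = [20] * 98 + [400, 900]
p50, p99 = latency_percentiles(latencies)
print(p50, p99)
```

Here the median stays at 20 ms while P99 lands above 400 ms, which is why a target stated only as an average or a P50 can hide the requests users actually notice.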

Hybrid search

works vs. works well

This is the layer that separates systems that work from systems that work well. Dense vector search alone consistently misses the exact strings users actually type.

What dense search alone keeps missing

Exact contract numbers, product codes, rare identifiers and domain-specific codes — the literal strings semantic similarity glides right past:

NDA-2024-0891 · SKU-44821-B · case numbers · medical codes · legal citations

BM25 · keyword: exact IDs & codes

Vector: meaning & intent

Gate: min similarity

Fusion: one ranked set

Two recall paths, gated and fused into a single candidate set.

The gate rule

BM25 may only boost chunks that already passed a minimum vector-similarity threshold. Otherwise keyword-heavy junk contaminates the top results.
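
The gate rule can be sketched as follows. The fusion formula here (vector similarity plus a weighted, max-normalized BM25 score) and the default thresholds are assumptions for illustration; the article does not prescribe a specific fusion method, and rank-based schemes are a common alternative. `gated_hybrid` is a hypothetical helper that expects a vector similarity for every candidate chunk.

```python
def gated_hybrid(vector_scores, bm25_scores, min_similarity=0.3, bm25_weight=0.5):
    """Fuse vector and BM25 scores, letting BM25 boost only chunks that pass the gate."""
    # Scale BM25 into [0, 1] so it is comparable with a cosine similarity.
    max_bm25 = max(bm25_scores.values(), default=0.0)
    fused = {}
    for chunk_id, similarity in vector_scores.items():
        if similarity < min_similarity:
            continue  # failed the gate: a keyword match cannot rescue this chunk
        keyword = bm25_scores.get(chunk_id, 0.0) / max_bm25 if max_bm25 > 0 else 0.0
        fused[chunk_id] = similarity + bm25_weight * keyword
    # One ranked candidate set, best first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


vector_scores = {"a": 0.82, "b": 0.55, "c": 0.10}
bm25_scores = {"b": 24.0, "c": 30.0}
for chunk_id, score in gated_hybrid(vector_scores, bm25_scores):
    print(chunk_id, round(score, 2))
```

In this toy run the keyword match lifts "b" above "a", while "c" is dropped despite having the highest BM25 score because it never cleared the similarity gate.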

+15–25% recall over vector search alone, measured in production.

Reranking

good results → excellent results

Vector and BM25 use fast, approximate matching: they find candidates well but don't rank them perfectly. A cross-encoder compares the query and each chunk together — slower, far more accurate.

Top 50–100 from hybrid search → Cross-encoder: rerank all → Top 5–10 sent to the LLM

The production rerank flow: cast a wide net, then narrow with a precise ranker.

```python
# hybrid_search, cross_encoder and llm stand in for your own components.
candidates = hybrid_search(query, top_k=100)      # fast, approximate
ranked = cross_encoder.rank(query, candidates)    # slow, precise
context = ranked[:8]                              # keep the top 5–10
answer = llm.generate(query, context)
```

Why it reduces hallucinations

Hallucinations often come from the model straining to reconcile irrelevant chunks. Better ranking means cleaner context, and cleaner context means fewer gaps to invent into.

−30–35% hallucinations, from cleaner context reaching the model.

Evaluation & observability

or you are flying blind

You can't improve what you can't measure, and without this layer retrieval quality degrades silently until users complain. Run it from day one rather than bolting it on later.

Trace every request

Which chunks were retrieved, with what scores, and what was actually sent to the model.
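
A per-request trace does not need much. The sketch below assumes one JSON record per request written to whatever log sink you already use. The `RetrievalTrace` class and its field names are hypothetical, not part of any tracing tool.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class RetrievalTrace:
    query: str
    retrieved: list      # (chunk_id, score) pairs as returned by retrieval
    sent_to_model: list  # chunk_ids that actually went into the prompt
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)

    def to_json(self):
        """Serialize the trace as one JSON line for a log file or collector."""
        return json.dumps(asdict(self))


trace = RetrievalTrace(
    query="What is the refund window?",
    retrieved=[("c1", 0.91), ("c2", 0.44)],
    sent_to_model=["c1"],
)
print(trace.to_json())
```

Keeping `retrieved` and `sent_to_model` separate is the useful part: it shows whether a bad answer came from retrieval missing the chunk or from the chunk being cut before the prompt.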

Quality metrics, continuously

Context recall — did retrieval find the right chunks? · Faithfulness — is the answer grounded in context? · Answer relevance — does it address the question?
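
Of the three metrics, context recall is the easiest to approximate without a judge model. The sketch below is a simplified, ID-based version that assumes you have a labeled set of queries with the chunk IDs that should be retrieved; evaluation frameworks generally offer richer variants, and faithfulness and answer relevance usually need an LLM or human judge, which this does not cover. Both function names are hypothetical.

```python
def context_recall(retrieved_ids, relevant_ids):
    """Fraction of the known-relevant chunks that retrieval actually returned."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("need at least one labeled relevant chunk")
    return len(relevant & set(retrieved_ids)) / len(relevant)


def mean_context_recall(examples):
    """Average recall over (retrieved_ids, relevant_ids) pairs."""
    scores = [context_recall(retrieved, relevant) for retrieved, relevant in examples]
    return sum(scores) / len(scores)


# Hypothetical labeled examples: what retrieval returned vs. what it should have found.
examples = [
    (["c1", "c2", "c3"], ["c1", "c4"]),  # found one of two relevant chunks
    (["c7", "c8"], ["c7"]),              # found the only relevant chunk
]
print(mean_context_recall(examples))
```

This toy set averages 0.75. Tracking the same number on a fixed query set after every change to chunking, embeddings, or fusion makes regressions visible.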

Humans + alerts

A human-evaluation workflow for edge cases and calibration · alerts on gradual quality degradation, not just on errors.
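
Alerting on gradual degradation means comparing a rolling average against a baseline rather than waiting for errors. The sketch below is one minimal way to do that; `DriftAlert`, its window size, and its tolerance are assumptions to tune for your own metric and traffic.

```python
from collections import deque


class DriftAlert:
    """Flag when the rolling mean of a quality metric falls below a baseline."""

    def __init__(self, baseline, window=50, max_drop=0.05):
        self.baseline = baseline
        self.max_drop = max_drop
        self.scores = deque(maxlen=window)  # keeps only the most recent scores

    def observe(self, score):
        """Record one score; return True when the full window has degraded."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data yet to judge a trend
        rolling_mean = sum(self.scores) / len(self.scores)
        return rolling_mean < self.baseline - self.max_drop


# Hypothetical slow decline in a quality score such as context recall.
alert = DriftAlert(baseline=0.80, window=5, max_drop=0.05)
for score in [0.80, 0.78, 0.76, 0.74, 0.72, 0.70]:
    print(score, alert.observe(score))
```

No single score in this sequence looks alarming on its own, yet the alert fires on the sixth observation, once the rolling mean has slipped more than 0.05 below the baseline.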

Tools

Ragas · DeepEval · LangSmith · Prometheus + Grafana

The seven layers together

Clean input in, grounded answer out.

1 · Document parsing
Clean, structured input before anything else.

2 · Chunking strategy
Right size and overlap for your document types.

3 · Embedding model
Chosen for your data, tested on your queries.

4 · Vector database
Matched to your scale and operational capacity.

5 · Hybrid search
BM25 + vector, fused and gated.

6 · Reranking
Top 50–100 in, top 5–10 out.

7 · Evaluation & observability
Running from day one, not added later.

The production-readiness checklist

Tick what's true of your system today. The gaps are your roadmap.