Retrieval-Augmented Generation (RAG) has become the default architecture for building LLM applications that need to reference specific documents, knowledge bases, or proprietary data. The idea is elegant: instead of relying on the LLM's parametric knowledge (which can be outdated or wrong), retrieve relevant context from a vector database and include it in the prompt. The LLM then generates answers grounded in real data.
In practice, most RAG implementations hallucinate. Not because the concept is flawed, but because the engineering is wrong. A naive RAG pipeline — chunk documents, embed them, retrieve top-k by cosine similarity, stuff them into a prompt — introduces failure modes at every stage that compound into unreliable outputs.
This post is a deep technical guide to the engineering decisions that separate production-grade RAG systems from demo-quality ones. We'll cover chunking, embedding, retrieval, reranking, and generation — and the tradeoffs at each stage.
Where Naive RAG Fails
Before diving into solutions, let's enumerate the specific failure modes of a basic RAG pipeline:
- Wrong chunks retrieved: The retrieval step returns chunks that are semantically similar to the query but don't contain the answer. This is the most common failure — and it's a retrieval problem, not a generation problem
- Answer spread across chunks: The information needed to answer the query is split across multiple chunks that weren't designed to be self-contained. The LLM receives fragments that individually don't make sense
- Lost in the middle: The "Lost in the Middle" research from Stanford (Liu et al.) shows LLMs disproportionately attend to information at the beginning and end of the context window, ignoring the middle. Important context in the middle of a long retrieval set gets lost
- Hallucination despite correct retrieval: Even with the right context, the LLM may generate plausible-sounding text that isn't supported by the retrieved documents. This happens more often with ambiguous or complex queries
- Retrieval of outdated or contradictory chunks: When the knowledge base contains multiple versions of the same information, the retrieval step may pull in outdated or conflicting chunks without any disambiguation
Chunking: The Foundation Nobody Gets Right
Chunking strategy is the single highest-leverage decision in a RAG pipeline. Get it wrong, and no amount of retrieval sophistication will save you. Yet most implementations use the default: split text into fixed-size windows of 500-1000 tokens with some overlap.
Fixed-Size Chunking
The simplest approach: split text every N tokens (or characters) with M tokens of overlap. It's fast and deterministic, but it's also the worst strategy for most use cases. Fixed-size chunking routinely splits sentences mid-thought, separates questions from their answers, and creates chunks that are semantically incoherent.
When to use it: Only as a baseline or for truly unstructured text where no better signal exists. For structured documents (docs, articles, code), there are always better options.
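As a baseline reference, fixed-size chunking fits in a few lines. This is a minimal sketch that assumes the text has already been tokenized into a list (the tokenizer itself is out of scope here); the guard against a trailing all-overlap chunk is one of the small details naive implementations get wrong:

```python
def chunk_fixed(tokens, size=500, overlap=50):
    """Split a token list into fixed-size windows with overlap."""
    step = size - overlap
    # The max(..., 1) stop avoids emitting a final chunk that is
    # nothing but the overlap of the previous one.
    stop = max(len(tokens) - overlap, 1)
    return [tokens[i:i + size] for i in range(0, stop, step)]
```

Every boundary here is arbitrary: token 500 may fall mid-sentence, which is exactly the weakness described above.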
Recursive Character Splitting
LangChain's RecursiveCharacterTextSplitter improves on fixed-size chunking by trying to split on natural boundaries: first paragraphs, then sentences, then words. This preserves more semantic coherence than arbitrary token boundaries. It's a good default, but it still doesn't understand document structure.
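To make the idea concrete without pulling in LangChain, here is a simplified re-implementation of recursive splitting (not LangChain's actual code): try the coarsest separator first, and recurse with finer separators only on pieces that are still too large. Real splitters also merge small adjacent pieces back up toward the chunk size and preserve separators; this sketch drops both for brevity:

```python
def recursive_split(text, chunk_size=500, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator available, recursing with finer
    separators on any piece still larger than chunk_size."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # No natural boundary left: fall back to hard character splits.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p]
    if len(pieces) == 1:
        # Separator not present; try the next, finer one.
        return recursive_split(text, chunk_size, rest)
    out = []
    for p in pieces:
        out.extend(recursive_split(p, chunk_size, rest) if len(p) > chunk_size else [p])
    return out
```

Paragraph breaks win over sentence breaks, which win over word breaks — so chunks tend to align with the document's own structure.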
Semantic Chunking
Semantic chunking uses embedding similarity between consecutive sentences to find natural breakpoints. When the embedding similarity between sentence N and sentence N+1 drops below a threshold, a chunk boundary is placed. This creates chunks that are topically coherent — each chunk discusses one concept or topic.
The downside is computational cost (you're embedding every sentence) and sensitivity to the threshold parameter. Too aggressive and you get single-sentence chunks; too lenient and you get entire sections as single chunks.
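The mechanism is easy to sketch. Here `embed` is any sentence-embedding function — in production it would be a model like BGE; the toy embedder below is a stand-in purely so the example is self-contained:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_chunks(sentences, embed, threshold=0.5):
    """Start a new chunk wherever similarity between consecutive
    sentence embeddings drops below `threshold`."""
    vecs = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, cur, sent in zip(vecs, vecs[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks

# Stand-in embedder for illustration only — a real pipeline uses a model.
toy_embed = lambda s: [1.0, 0.0] if "cat" in s else [0.0, 1.0]
```

The threshold sensitivity mentioned above is visible here: lower `threshold` and everything merges into one chunk; raise it and every sentence becomes its own chunk.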
Parent-Document / Hierarchical Chunking
This is the strategy that unlocks the biggest quality improvements. The idea: create small chunks for retrieval (high precision) but return the parent document or a larger context window for generation (high recall). You embed small, focused chunks so retrieval is precise. But when a small chunk matches, you return the surrounding larger context to the LLM so it has enough information to generate a complete answer.
- Small chunks (200-400 tokens) for precise semantic retrieval
- Parent documents or expanded windows (1000-2000 tokens) for generation context
- This decouples the retrieval granularity from the generation context — the key insight most pipelines miss
- Implementation: store both small and large chunks with parent-child relationships, retrieve on small, return large
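A minimal sketch of that parent-child scheme, using character offsets for IDs and a pluggable `score` function standing in for vector similarity (sizes and field names here are illustrative, not a fixed schema):

```python
def build_hierarchy(doc_id, text, parent_size=1500, child_size=300):
    """Split a document into large parent windows, then split each parent
    into small child chunks that remember their parent's ID."""
    parents, children = {}, []
    for pi in range(0, len(text), parent_size):
        pid = f"{doc_id}:{pi}"
        parent = text[pi:pi + parent_size]
        parents[pid] = parent
        for ci in range(0, len(parent), child_size):
            children.append({"parent_id": pid, "text": parent[ci:ci + child_size]})
    return parents, children

def retrieve_with_parents(query, children, parents, score, k=3):
    """Score the small chunks, but return the deduplicated parent contexts."""
    ranked = sorted(children, key=lambda c: score(query, c["text"]), reverse=True)
    seen, contexts = set(), []
    for c in ranked[:k]:
        if c["parent_id"] not in seen:
            seen.add(c["parent_id"])
            contexts.append(parents[c["parent_id"]])
    return contexts
```

In a real system the children live in the vector DB with `parent_id` as metadata and the parents in a document store; the retrieve-on-small, return-large shape is the same.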
Embedding Model Selection
Your embedding model determines the quality of your retrieval. A bad embedding model means the vector space doesn't capture the semantic relationships your queries need. The choice matters more than most engineers realize.
- OpenAI text-embedding-3-large: Strong general-purpose performance, 3072 dimensions, good for most use cases. But expensive at scale and requires API calls (latency)
- BGE-large-en-v1.5: Open-source, competitive performance, can be self-hosted. Great for organizations that need data privacy or want to avoid API dependencies
- E5-large-v2: Strong instruction-following embeddings. Particularly good when you can prefix queries with 'query:' and documents with 'passage:' to help the model understand the retrieval task
- Cohere embed-v3: Excellent multilingual performance and built-in compression (int8/binary embeddings for cost reduction with minimal quality loss)
- Nomic embed-text-v1.5: Open-source with Matryoshka representations — you can truncate embeddings to smaller dimensions with minimal quality loss, enabling cost/quality tradeoffs
Key considerations: dimensionality (higher = better quality but more storage/compute), whether the model supports instruction prefixes (significantly improves retrieval quality), and whether you can fine-tune on your domain data (domain-specific fine-tuning typically gives 5-15% retrieval improvement).
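The Matryoshka truncation mentioned above is mechanically trivial, which is why it's attractive: keep a prefix of the vector and re-normalize. Note the caveat — this only preserves quality for models trained with Matryoshka representation learning (such as Nomic embed-text-v1.5); truncating an ordinary embedding this way degrades it badly:

```python
import math

def truncate_embedding(vec, dim):
    """Matryoshka-style truncation: keep the first `dim` components and
    re-normalize to unit length so cosine similarity still behaves."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]
```

Dropping from, say, 768 to 256 dimensions cuts vector storage and search compute to a third, which is the cost/quality dial the bullet above describes.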
Hybrid Search: BM25 + Vector Retrieval
Pure vector search has a critical blind spot: it struggles with exact keyword matching. If a user asks about 'EIP-7683', vector search might return chunks about cross-chain intents (semantically similar) but miss the chunk that specifically mentions EIP-7683 by name. This is because embedding models capture semantic meaning, not lexical matching.
The solution is hybrid search: combine vector similarity with BM25 (a traditional keyword-based scoring algorithm). Most production RAG systems use a weighted combination, typically 0.7 * vector_score + 0.3 * bm25_score, though the optimal weights depend on your use case.
- Vector search excels at: conceptual queries, paraphrased questions, natural language questions
- BM25 excels at: exact term matching, technical identifiers, acronyms, proper nouns, code references
- Hybrid search gives you both — the semantic understanding of embeddings and the precision of keyword matching
- Databases like Weaviate, Qdrant, and Elasticsearch support hybrid search natively
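A sketch of the score fusion, assuming you already have per-document scores from each system. The detail that trips people up in hand-rolled versions: BM25 and cosine scores live on completely different scales, so each must be normalized before the weighted sum (min-max here; some systems use rank-based fusion like RRF instead):

```python
def hybrid_scores(vector_scores, bm25_scores, alpha=0.7):
    """Weighted fusion of vector and BM25 scores over doc IDs.
    Assumes both score dicts are non-empty."""
    def norm(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid divide-by-zero when all scores tie
        return {d: (s - lo) / span for d, s in scores.items()}
    v, b = norm(vector_scores), norm(bm25_scores)
    docs = set(v) | set(b)
    # A doc missing from one system simply scores 0 on that side.
    return {d: alpha * v.get(d, 0.0) + (1 - alpha) * b.get(d, 0.0) for d in docs}
```

With `alpha=0.7` this matches the 0.7/0.3 weighting above; tune it on your own query distribution.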
Reranking: The 10x Quality Multiplier
If there's one technique that delivers the highest ROI for RAG quality improvement, it's reranking. The idea is simple: retrieve a larger initial set of candidates (top-20 or top-30) using fast vector search, then rerank them using a more expensive but more accurate cross-encoder model to select the final top-5.
Why does this work so well? Bi-encoder embeddings (what vector databases use) compress an entire document into a single vector. This is fast but lossy — subtle relevance signals are lost. Cross-encoders process the query and document together, allowing for much finer-grained relevance judgments. They're too slow to run against an entire corpus but perfect for reranking a small candidate set.
Reranking Models
- Cohere Rerank v3: Best-in-class commercial reranker. Supports multilingual, long context, and structured document inputs
- BGE-reranker-v2-m3: Open-source, multilingual, competitive with commercial options. Can be self-hosted for data privacy
- cross-encoder/ms-marco-MiniLM-L-12-v2: Lightweight open-source option, fast inference, good for resource-constrained environments
- Jina Reranker v2: Good balance of speed and accuracy, supports code and technical content
The reranking step typically improves retrieval quality by 10-25% in terms of nDCG@5. In practice, this means the difference between 'the answer is in the retrieved context 60% of the time' and 'the answer is in the retrieved context 80% of the time.' This directly translates to fewer hallucinations and more accurate responses.
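The two-stage shape is simple enough to capture in one function. Here `embed_score` stands in for the cheap bi-encoder similarity your vector DB computes, and `cross_score` for a cross-encoder call (e.g. one of the rerankers above); both are pluggable callables so this sketch stays self-contained:

```python
def retrieve_and_rerank(query, corpus, embed_score, cross_score, pool=20, k=5):
    """Stage 1: cheap bi-encoder score over the whole corpus, keep `pool`.
    Stage 2: expensive cross-encoder score over the pool only, keep `k`."""
    candidates = sorted(corpus, key=lambda d: embed_score(query, d), reverse=True)[:pool]
    return sorted(candidates, key=lambda d: cross_score(query, d), reverse=True)[:k]
```

The cross-encoder runs `pool` times per query rather than once per corpus document — that asymmetry is the entire trick.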
Query Transformation: Fixing the Input
Often the problem isn't retrieval quality — it's that the query itself is poorly suited for retrieval. User questions are often vague, multi-part, or use different terminology than the documents.
- Query rewriting: Use an LLM to rewrite the user's question into a better retrieval query. 'What does the company's leave policy say about taking time off in December?' becomes 'employee leave policy December holiday vacation'
- HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer to the query, then use that hypothetical answer as the retrieval query. The embedding of a hypothetical answer is often closer to the actual answer in vector space than the original question
- Query decomposition: Break multi-part questions into sub-queries, retrieve for each, then synthesize. 'Compare our Q1 and Q2 revenue and explain the difference' becomes two retrieval queries: 'Q1 revenue' and 'Q2 revenue'
- Step-back prompting: For specific questions, generate a more general question first, retrieve for the general question, then use that context to answer the specific one
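The decomposition pattern above can be sketched as an orchestration function. Both `llm` and `retrieve` are pluggable callables standing in for real services, and the prompts are illustrative placeholders, not tuned templates:

```python
def decompose_and_retrieve(question, llm, retrieve, k=3):
    """Query decomposition: split a multi-part question into sub-queries,
    retrieve context for each, then answer from the pooled context."""
    plan = llm(f"Split into standalone sub-questions, one per line: {question}")
    sub_queries = [line.strip() for line in plan.splitlines() if line.strip()]
    context, seen = [], set()
    for sq in sub_queries:
        for chunk in retrieve(sq, k=k):
            if chunk not in seen:  # dedupe chunks shared across sub-queries
                seen.add(chunk)
                context.append(chunk)
    joined = "\n".join(context)
    return llm(f"Answer using only this context:\n{joined}\n\nQuestion: {question}")
```

HyDE slots into the same shape with one change: instead of splitting the question, ask the LLM for a hypothetical answer and retrieve with its embedding.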
Evaluation: Measuring RAG Quality
You can't improve what you don't measure. RAG evaluation requires metrics at both the retrieval and generation stages:
Retrieval Metrics
- Context Precision: What fraction of retrieved chunks are actually relevant? High precision = less noise for the LLM to filter
- Context Recall: What fraction of the relevant information was retrieved? High recall = the answer is in the context
- nDCG (Normalized Discounted Cumulative Gain): Are the most relevant chunks ranked first? Order matters because of the 'lost in the middle' problem
- MRR (Mean Reciprocal Rank): How early does the first relevant result appear?
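Two of these metrics are small enough to implement directly against a labeled eval set (ranked result lists plus sets of known-relevant doc IDs). This sketch uses binary relevance, which is usually all a hand-labeled RAG eval set provides:

```python
import math

def mrr(ranked_lists, relevant_sets):
    """Mean reciprocal rank: average of 1/position of the first relevant hit,
    0 for queries where nothing relevant was retrieved."""
    total = 0.0
    for ranked, rel in zip(ranked_lists, relevant_sets):
        for i, doc in enumerate(ranked, start=1):
            if doc in rel:
                total += 1.0 / i
                break
    return total / len(ranked_lists)

def ndcg_at_k(ranked, rel, k=5):
    """Binary-relevance nDCG@k: log-discounted gain, so relevant docs
    ranked early count for more."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, doc in enumerate(ranked[:k], start=1) if doc in rel)
    ideal = sum(1.0 / math.log2(i + 1) for i in range(1, min(len(rel), k) + 1))
    return dcg / ideal if ideal else 0.0
```

Context precision and recall are simpler still — set intersections over retrieved vs. relevant chunks — but the ranked metrics are the ones that catch "lost in the middle" regressions.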
Generation Metrics
- Faithfulness: Does the generated answer only contain information from the retrieved context? This is the hallucination metric — low faithfulness means the LLM is making things up
- Answer Relevance: Does the generated answer actually address the user's question?
- Answer Correctness: Is the answer factually correct? (Requires ground truth labels)
Frameworks like RAGAS, DeepEval, and TruLens provide automated evaluation pipelines for these metrics. Build an evaluation set of 50-100 question-answer pairs from your actual user queries and run evaluations after every pipeline change.
Production Architecture Patterns
Putting it all together, here's what a production-grade RAG architecture looks like:
- Document ingestion pipeline: Parse → clean → chunk (hierarchical) → embed → store in vector DB with metadata
- Query pipeline: Query transformation → hybrid search (vector + BM25) → rerank → context assembly → LLM generation
- Feedback loop: User feedback → evaluation pipeline → automatic detection of retrieval failures → chunk/embedding retraining triggers
- Guardrails: Input validation, output validation (check for hallucination markers), citation extraction, confidence scoring
- Caching layer: Cache frequent queries and their retrievals to reduce latency and cost
- Monitoring: Track retrieval quality, generation faithfulness, latency, cost per query, and user satisfaction metrics
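As one concrete piece of that architecture, the caching layer can start as a simple exact-match cache with a TTL. This is the simplest possible version — production systems often also normalize queries or cache by semantic similarity, and the class below is a sketch, not a drop-in component:

```python
import time

class QueryCache:
    """Exact-match query cache with per-entry TTL."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.store = {}  # query -> (result, inserted_at)

    def get(self, query):
        hit = self.store.get(query)
        if hit and time.time() - hit[1] < self.ttl:
            return hit[0]
        return None  # miss or expired

    def get_or_compute(self, query, compute):
        """Return a cached result, or run the full RAG pipeline
        (`compute`) and cache its output."""
        cached = self.get(query)
        if cached is not None:
            return cached
        result = compute(query)
        self.store[query] = (result, time.time())
        return result
```

Wrapping the whole query pipeline in `get_or_compute` means repeat questions skip retrieval, reranking, and generation entirely — usually the single cheapest latency win available.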
The difference between a demo RAG system and a production RAG system isn't the LLM — it's the retrieval engineering. Most teams spend 80% of their time optimizing prompts and 20% on retrieval. Flip that ratio and your results will improve dramatically.
Build Production-Grade AI Systems with Accelar
Accelar engineers production AI systems that work — from RAG pipelines and ML infrastructure to custom AI agents. We focus on the hard engineering problems that turn AI demos into reliable, scalable products. If your RAG pipeline isn't performing, let's fix it together.
