AI Engineering · 11 min read

Why Your RAG Pipeline Is Hallucinating: Chunking Strategies, Reranking, and the Retrieval-Generation Tradeoff

Retrieval-Augmented Generation promised to ground LLMs in facts. But naive implementations hallucinate almost as much as vanilla models. Here's the engineering that makes RAG actually work in production.

Retrieval-Augmented Generation (RAG) has become the default architecture for building LLM applications that need to reference specific documents, knowledge bases, or proprietary data. The idea is elegant: instead of relying on the LLM's parametric knowledge (which can be outdated or wrong), retrieve relevant context from a vector database and include it in the prompt. The LLM then generates answers grounded in real data.

In practice, most RAG implementations hallucinate. Not because the concept is flawed, but because the engineering is wrong. A naive RAG pipeline — chunk documents, embed them, retrieve top-k by cosine similarity, stuff them into a prompt — introduces failure modes at every stage that compound into unreliable outputs.

This post is a deep technical guide to the engineering decisions that separate production-grade RAG systems from demo-quality ones. We'll cover chunking, embedding, retrieval, reranking, and generation — and the tradeoffs at each stage.

Where Naive RAG Fails

Before diving into solutions, let's enumerate the specific failure modes of a basic RAG pipeline. Chunking can split an idea across two chunks, so neither retrieves well on its own. Embeddings capture semantic similarity but miss exact keyword matches. Top-k cosine similarity surfaces chunks that are topically related yet don't contain the answer. Without reranking, the truly relevant chunk may sit just below the cutoff. And at generation time, the LLM happily falls back on its parametric knowledge when the retrieved context is insufficient. The sections below address these failures stage by stage.

Chunking: The Foundation Nobody Gets Right

Chunking strategy is the single highest-leverage decision in a RAG pipeline. Get it wrong, and no amount of retrieval sophistication will save you. Yet most implementations use the default: split text into fixed-size windows of 500-1000 tokens with some overlap.

Fixed-Size Chunking

The simplest approach: split text every N tokens (or characters) with M tokens of overlap. It's fast and deterministic, but it's also the worst strategy for most use cases. Fixed-size chunking routinely splits sentences mid-thought, separates questions from their answers, and creates chunks that are semantically incoherent.

When to use it: Only as a baseline or for truly unstructured text where no better signal exists. For structured documents (docs, articles, code), there are always better options.
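As a minimal sketch of the approach, with character counts standing in for tokens:

```python
def chunk_fixed(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` characters, with `overlap`
    characters shared between neighbouring chunks."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

Note how the windows pay no attention to sentence or paragraph boundaries: that indifference is exactly the failure mode described above.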

Recursive Character Splitting

LangChain's RecursiveCharacterTextSplitter improves on fixed-size chunking by trying to split on natural boundaries: first paragraphs, then sentences, then words. This preserves more semantic coherence than arbitrary token boundaries. It's a good default, but it still doesn't understand document structure.
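The algorithm is easy to sketch without the library. This toy version (not LangChain's actual implementation) tries each separator from coarsest to finest, only recursing when a piece is still too long:

```python
def recursive_split(text, max_len=500, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator that keeps pieces under max_len,
    falling back to finer separators (and finally a hard cut) as needed."""
    if len(text) <= max_len:
        return [text]
    if not separators:
        # No natural boundary left: fall back to a hard cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    pieces, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_len:
            current = candidate  # keep accumulating into the current chunk
        else:
            if current:
                pieces.append(current)
            if len(part) > max_len:
                # This piece alone is too long: recurse with finer separators.
                pieces.extend(recursive_split(part, max_len, rest))
                current = ""
            else:
                current = part
    if current:
        pieces.append(current)
    return pieces
```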

Semantic Chunking

Semantic chunking uses embedding similarity between consecutive sentences to find natural breakpoints. When the embedding similarity between sentence N and sentence N+1 drops below a threshold, a chunk boundary is placed. This creates chunks that are topically coherent — each chunk discusses one concept or topic.

The downside is computational cost (you're embedding every sentence) and sensitivity to the threshold parameter. Too aggressive and you get single-sentence chunks; too lenient and you get entire sections as single chunks.
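The boundary-detection loop itself is simple. A real system would use a sentence-embedding model; in this sketch a bag-of-words counter stands in so the mechanics are visible:

```python
import math
from collections import Counter

def embed(sentence):
    # Stand-in for a real sentence-embedding model: bag-of-words counts.
    return Counter(sentence.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, threshold=0.3):
    """Start a new chunk whenever similarity between consecutive
    sentences drops below the threshold."""
    chunks, current = [], [sentences[0]]
    for prev, nxt in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(nxt)) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(nxt)
    chunks.append(" ".join(current))
    return chunks
```

The threshold sensitivity mentioned above is visible here: lower it toward 0 and everything merges into one chunk; raise it toward 1 and every sentence becomes its own chunk.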

Parent-Document / Hierarchical Chunking

This is the strategy that unlocks the biggest quality improvements. The idea: create small chunks for retrieval (high precision) but return the parent document or a larger context window for generation (high recall). You embed small, focused chunks so retrieval is precise. But when a small chunk matches, you return the surrounding larger context to the LLM so it has enough information to generate a complete answer.
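In sketch form, the trick is just bookkeeping: each small child chunk records which parent it came from, and retrieval returns the parent. Here `score` is any relevance function (vector similarity in production; word overlap works for a demo):

```python
def build_index(documents, child_size=200):
    """Index small child chunks for retrieval, each pointing
    back to the parent document it came from."""
    index = []  # list of (child_text, parent_id) pairs
    for parent_id, doc in enumerate(documents):
        for start in range(0, len(doc), child_size):
            index.append((doc[start:start + child_size], parent_id))
    return index

def retrieve_parent(query, index, documents, score):
    """Match the query against the small chunks, but return the
    best match's full parent document as generation context."""
    best_child, parent_id = max(index, key=lambda item: score(query, item[0]))
    return documents[parent_id]
```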

Embedding Model Selection

Your embedding model determines the quality of your retrieval. A bad embedding model means the vector space doesn't capture the semantic relationships your queries need. The choice matters more than most engineers realize.

Key considerations: dimensionality (higher-dimensional embeddings generally capture more nuance, at the cost of storage and compute), whether the model supports instruction prefixes (which significantly improve retrieval quality), and whether you can fine-tune on your domain data (domain-specific fine-tuning typically yields a 5-15% retrieval improvement).
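For models that use instruction prefixes, applying them is a one-liner — but the exact strings are model-specific (the `query:` / `passage:` pair below follows the convention popularized by the E5 family; check your model card):

```python
def prepare_for_embedding(text: str, role: str) -> str:
    """Prepend the instruction prefix the embedding model expects.
    These strings follow the E5 convention; other models differ."""
    prefixes = {"query": "query: ", "passage": "passage: "}
    return prefixes[role] + text
```

Forgetting the prefix at query time (or applying it at indexing but not querying) is a common, silent source of degraded retrieval.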

Hybrid Search: BM25 + Vector Retrieval

Pure vector search has a critical blind spot: it struggles with exact keyword matching. If a user asks about 'EIP-7683', vector search might return chunks about cross-chain intents (semantically similar) but miss the chunk that specifically mentions EIP-7683 by name. This is because embedding models capture semantic meaning, not lexical matching.

The solution is hybrid search: combine vector similarity with BM25 (a traditional keyword-based scoring algorithm). Most production RAG systems use a weighted combination, typically 0.7 * vector_score + 0.3 * bm25_score, though the optimal weights depend on your use case.
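The fusion step can be sketched in a few lines. Note the min-max normalization: cosine similarities and BM25 scores live on very different scales, so a raw weighted sum would let one signal dominate:

```python
def min_max(scores):
    """Rescale scores into [0, 1] so the two signals are comparable."""
    lo, hi = min(scores), max(scores)
    return [0.0 if hi == lo else (s - lo) / (hi - lo) for s in scores]

def hybrid_scores(vector_scores, bm25_scores, alpha=0.7):
    """Weighted fusion: alpha * vector + (1 - alpha) * BM25,
    after normalizing each score list independently."""
    v, b = min_max(vector_scores), min_max(bm25_scores)
    return [alpha * vs + (1 - alpha) * bs for vs, bs in zip(v, b)]
```

An alternative worth knowing is reciprocal rank fusion, which combines ranks instead of scores and sidesteps normalization entirely.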

Reranking: The 10x Quality Multiplier

If there's one technique that delivers the highest ROI for RAG quality improvement, it's reranking. The idea is simple: retrieve a larger initial set of candidates (top-20 or top-30) using fast vector search, then rerank them using a more expensive but more accurate cross-encoder model to select the final top-5.

Why does this work so well? Bi-encoder embeddings (what vector databases use) compress an entire document into a single vector. This is fast but lossy — subtle relevance signals are lost. Cross-encoders process the query and document together, allowing for much finer-grained relevance judgments. They're too slow to run against an entire corpus but perfect for reranking a small candidate set.
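The two-stage pattern in sketch form, with `vector_score` and `cross_score` as stand-ins for the bi-encoder and cross-encoder:

```python
def retrieve_and_rerank(query, corpus, vector_score, cross_score,
                        k_retrieve=20, k_final=5):
    """Stage 1: cheap vector_score over the whole corpus.
    Stage 2: slower but sharper cross_score over only the survivors."""
    candidates = sorted(corpus, key=lambda d: vector_score(query, d),
                        reverse=True)[:k_retrieve]
    return sorted(candidates, key=lambda d: cross_score(query, d),
                  reverse=True)[:k_final]
```

The asymmetry in cost is the whole point: the cross-encoder runs only `k_retrieve` times per query, never once per corpus document.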

Reranking Models

The reranking step typically improves retrieval quality by 10-25% in terms of nDCG@5. In practice, this means the difference between 'the answer is in the retrieved context 60% of the time' and 'the answer is in the retrieved context 80% of the time.' This directly translates to fewer hallucinations and more accurate responses.

Query Transformation: Fixing the Input

Often the problem isn't retrieval quality — it's that the query itself is poorly suited for retrieval. User questions are often vague, multi-part, or use different terminology than the documents.
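One common fix is query expansion: ask an LLM to rewrite the question into several retrieval-friendly variants and search with all of them. A sketch, with the LLM injected as a plain callable (the prompt wording here is illustrative, not prescriptive):

```python
def expand_query(query: str, llm) -> list[str]:
    """Rewrite a user question into retrieval-friendly variants.
    `llm` is any callable mapping a prompt string to a text response
    (in production, a call to your LLM provider)."""
    prompt = ("Rewrite the following question as three short search "
              "queries, one per line:\n" + query)
    rewrites = [line.strip() for line in llm(prompt).splitlines() if line.strip()]
    return [query] + rewrites  # always keep the original query too
```

Retrieving with every variant and merging the results (deduplicated, then reranked) covers vocabulary mismatches that any single phrasing would miss.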

Evaluation: Measuring RAG Quality

You can't improve what you don't measure. RAG evaluation requires metrics at both the retrieval and generation stages:

Retrieval Metrics

The standard retrieval metrics are recall@k (is the gold passage anywhere in the top k?), MRR (how high does the first relevant chunk rank?), and nDCG@k (how well ordered is the whole result list?). Watch recall@k first: if the answer isn't retrieved at all, nothing downstream can fix it.

Generation Metrics

On the generation side, measure faithfulness (is every claim in the answer supported by the retrieved context?) and answer relevance (does the answer actually address the question?). Faithfulness is your hallucination metric: a low score means the model is inventing content the context doesn't support.

Frameworks like RAGAS, DeepEval, and TruLens provide automated evaluation pipelines for these metrics. Build an evaluation set of 50-100 question-answer pairs from your actual user queries and run evaluations after every pipeline change.
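Even before adopting a framework, the core retrieval metric is a few lines. A hit-rate sketch over such an evaluation set:

```python
def hit_rate(eval_set, retrieve, k=5):
    """Fraction of questions whose gold passage appears in the top-k
    retrieved chunks. `eval_set` is a list of (question, gold_passage)
    pairs; `retrieve` is your retrieval pipeline."""
    hits = sum(1 for question, gold in eval_set if gold in retrieve(question)[:k])
    return hits / len(eval_set)
```

Run this after every pipeline change; a chunking or reranking tweak that silently drops hit rate by ten points is otherwise invisible until users complain.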

Production Architecture Patterns

Putting it all together, here's what a production-grade RAG architecture looks like: structure-aware (ideally hierarchical) chunking at ingestion, hybrid BM25-plus-vector retrieval over a wide candidate set, cross-encoder reranking down to a small final context, query transformation in front of it all, and a standing evaluation suite that runs on every pipeline change.
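As a skeleton, with each stage injected as a callable so components (hybrid retriever, cross-encoder, hosted LLM) can be swapped without touching the pipeline itself:

```python
def rag_answer(query, chunks, retrieve, rerank, generate,
               k_retrieve=20, k_final=5):
    """Production-shaped pipeline: wide retrieval, reranking, then
    generation over only the reranked context. All stage functions
    are stand-ins to be replaced with real components."""
    candidates = retrieve(query, chunks)[:k_retrieve]      # fast, high recall
    context = rerank(query, candidates)[:k_final]          # slow, high precision
    prompt = ("Answer using only this context:\n"
              + "\n".join(context) + "\nQ: " + query)
    return generate(prompt)
```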

The difference between a demo RAG system and a production RAG system isn't the LLM — it's the retrieval engineering. Most teams spend 80% of their time optimizing prompts and 20% on retrieval. Flip that ratio and your results will improve dramatically.

Build Production-Grade AI Systems with Accelar

Accelar engineers production AI systems that work — from RAG pipelines and ML infrastructure to custom AI agents. We focus on the hard engineering problems that turn AI demos into reliable, scalable products. If your RAG pipeline isn't performing, let's fix it together.