Retrieval-Augmented Generation (RAG) has become the default architecture for building LLM applications that need to reference specific documents, knowledge bases, or proprietary data. The idea is elegant: instead of relying on the LLM's parametric knowledge (which can be outdated or wrong), retrieve relevant context from a vector database and include it in the prompt. The LLM then generates answers grounded in real data.
In practice, most RAG implementations hallucinate. Not because the concept is flawed, but because the engineering is wrong. A naive RAG pipeline — chunk documents, embed them, retrieve top-k by cosine similarity, stuff them into a prompt — introduces failure modes at every stage that compound into unreliable outputs.
This post is a deep technical guide to the engineering decisions that separate production-grade RAG systems from demo-quality ones. We'll cover chunking, embedding, retrieval, reranking, and generation — and the tradeoffs at each stage.
Where Naive RAG Fails
Before diving into solutions, let's enumerate the specific failure modes of a basic RAG pipeline:
- Wrong chunks retrieved: The retrieval step returns chunks that are semantically similar to the query but don't contain the answer. This is the most common failure — and it's a retrieval problem, not a generation problem
- Answer spread across chunks: The information needed to answer the query is split across multiple chunks that weren't designed to be self-contained. The LLM receives fragments that individually don't make sense
- Lost in the middle: The "Lost in the Middle" research from Stanford (Liu et al.) shows LLMs disproportionately attend to information at the beginning and end of the context window, ignoring the middle. Important context in the middle of a long retrieval set gets lost
- Hallucination despite correct retrieval: Even with the right context, the LLM may generate plausible-sounding text that isn't supported by the retrieved documents. This happens more often with ambiguous or complex queries
- Retrieval of outdated or contradictory chunks: When the knowledge base contains multiple versions of the same information, the retrieval step may pull in outdated or conflicting chunks without any disambiguation
Chunking: The Foundation Nobody Gets Right
Chunking strategy is the single highest-leverage decision in a RAG pipeline. Get it wrong, and no amount of retrieval sophistication will save you. Yet most implementations use the default: split text into fixed-size windows of 500-1000 tokens with some overlap.
Fixed-Size Chunking
The simplest approach: split text every N tokens (or characters) with M tokens of overlap. It's fast and deterministic, but it's also the worst strategy for most use cases. Fixed-size chunking routinely splits sentences mid-thought, separates questions from their answers, and creates chunks that are semantically incoherent.
When to use it: Only as a baseline or for truly unstructured text where no better signal exists. For structured documents (docs, articles, code), there are always better options.
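As a baseline reference, fixed-size chunking fits in a few lines. This is a minimal sketch that assumes the text has already been tokenized into a list (the tokenizer itself is out of scope here); the guard against a trailing all-overlap chunk is one of the small details naive implementations get wrong:

```python
def chunk_fixed(tokens, size=500, overlap=50):
    """Split a token list into fixed-size windows with overlap."""
    step = size - overlap
    # The max(..., 1) stop avoids emitting a final chunk that is
    # nothing but the overlap of the previous one.
    stop = max(len(tokens) - overlap, 1)
    return [tokens[i:i + size] for i in range(0, stop, step)]
```

Every boundary here is arbitrary: token 500 may fall mid-sentence, which is exactly the weakness described above.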
Recursive Character Splitting
LangChain's RecursiveCharacterTextSplitter improves on fixed-size chunking by trying to split on natural boundaries: first paragraphs, then sentences, then words. This preserves more semantic coherence than arbitrary token boundaries. It's a good default, but it still doesn't understand document structure.
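To make the idea concrete without pulling in LangChain, here is a simplified re-implementation of recursive splitting (not LangChain's actual code): try the coarsest separator first, and recurse with finer separators only on pieces that are still too large. Real splitters also merge small adjacent pieces back up toward the chunk size and preserve separators; this sketch drops both for brevity:

```python
def recursive_split(text, chunk_size=500, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator available, recursing with finer
    separators on any piece still larger than chunk_size."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # No natural boundary left: fall back to hard character splits.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p]
    if len(pieces) == 1:
        # Separator not present; try the next, finer one.
        return recursive_split(text, chunk_size, rest)
    out = []
    for p in pieces:
        out.extend(recursive_split(p, chunk_size, rest) if len(p) > chunk_size else [p])
    return out
```

Paragraph breaks win over sentence breaks, which win over word breaks — so chunks tend to align with the document's own structure.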
Semantic Chunking
Semantic chunking uses embedding similarity between consecutive sentences to find natural breakpoints. When the embedding similarity between sentence N and sentence N+1 drops below a threshold, a chunk boundary is placed. This creates chunks that are topically coherent — each chunk discusses one concept or topic.
The downside is computational cost (you're embedding every sentence) and sensitivity to the threshold parameter. Too aggressive and you get single-sentence chunks; too lenient and you get entire sections as single chunks.
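The mechanism is easy to sketch. Here `embed` is any sentence-embedding function — in production it would be a model like BGE; the toy embedder below is a stand-in purely so the example is self-contained:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_chunks(sentences, embed, threshold=0.5):
    """Start a new chunk wherever similarity between consecutive
    sentence embeddings drops below `threshold`."""
    vecs = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, cur, sent in zip(vecs, vecs[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks

# Stand-in embedder for illustration only — a real pipeline uses a model.
toy_embed = lambda s: [1.0, 0.0] if "cat" in s else [0.0, 1.0]
```

The threshold sensitivity mentioned above is visible here: lower `threshold` and everything merges into one chunk; raise it and every sentence becomes its own chunk.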
Parent-Document / Hierarchical Chunking
This is the strategy that unlocks the biggest quality improvements. The idea: create small chunks for retrieval (high precision) but return the parent document or a larger context window for generation (high recall). You embed small, focused chunks so retrieval is precise. But when a small chunk matches, you return the surrounding larger context to the LLM so it has enough information to generate a complete answer.
- Small chunks (200-400 tokens) for precise semantic retrieval
- Parent documents or expanded windows (1000-2000 tokens) for generation context
- This decouples the retrieval granularity from the generation context — the key insight most pipelines miss
- Implementation: store both small and large chunks with parent-child relationships, retrieve on small, return large
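A minimal sketch of that parent-child scheme, using character offsets for IDs and a pluggable `score` function standing in for vector similarity (sizes and field names here are illustrative, not a fixed schema):

```python
def build_hierarchy(doc_id, text, parent_size=1500, child_size=300):
    """Split a document into large parent windows, then split each parent
    into small child chunks that remember their parent's ID."""
    parents, children = {}, []
    for pi in range(0, len(text), parent_size):
        pid = f"{doc_id}:{pi}"
        parent = text[pi:pi + parent_size]
        parents[pid] = parent
        for ci in range(0, len(parent), child_size):
            children.append({"parent_id": pid, "text": parent[ci:ci + child_size]})
    return parents, children

def retrieve_with_parents(query, children, parents, score, k=3):
    """Score the small chunks, but return the deduplicated parent contexts."""
    ranked = sorted(children, key=lambda c: score(query, c["text"]), reverse=True)
    seen, contexts = set(), []
    for c in ranked[:k]:
        if c["parent_id"] not in seen:
            seen.add(c["parent_id"])
            contexts.append(parents[c["parent_id"]])
    return contexts
```

In a real system the children live in the vector DB with `parent_id` as metadata and the parents in a document store; the retrieve-on-small, return-large shape is the same.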
Embedding Model Selection
Your embedding model determines the quality of your retrieval. A bad embedding model means the vector space doesn't capture the semantic relationships your queries need. The choice matters more than most engineers realize.
- OpenAI text-embedding-3-large: Strong general-purpose performance, 3072 dimensions, good for most use cases. But expensive at scale and requires API calls (latency)
- BGE-large-en-v1.5: Open-source, competitive performance, can be self-hosted. Great for organizations that need data privacy or want to avoid API dependencies
- E5-large-v2: Strong instruction-following embeddings. Particularly good when you can prefix queries with 'query:' and documents with 'passage:' to help the model understand the retrieval task
- Cohere embed-v3: Excellent multilingual performance and built-in compression (int8/binary embeddings for cost reduction with minimal quality loss)
- Nomic embed-text-v1.5: Open-source with Matryoshka representations — you can truncate embeddings to smaller dimensions with minimal quality loss, enabling cost/quality tradeoffs
Key considerations: dimensionality (higher = better quality but more storage/compute), whether the model supports instruction prefixes (significantly improves retrieval quality), and whether you can fine-tune on your domain data (domain-specific fine-tuning typically gives 5-15% retrieval improvement).
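The Matryoshka truncation mentioned above is mechanically trivial, which is why it's attractive: keep a prefix of the vector and re-normalize. Note the caveat — this only preserves quality for models trained with Matryoshka representation learning (such as Nomic embed-text-v1.5); truncating an ordinary embedding this way degrades it badly:

```python
import math

def truncate_embedding(vec, dim):
    """Matryoshka-style truncation: keep the first `dim` components and
    re-normalize to unit length so cosine similarity still behaves."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]
```

Dropping from, say, 768 to 256 dimensions cuts vector storage and search compute to a third, which is the cost/quality dial the bullet above describes.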
Hybrid Search: BM25 + Vector Retrieval
Pure vector search has a critical blind spot: it struggles with exact keyword matching. If a user asks about 'EIP-7683', vector search might return chunks about cross-chain intents (semantically similar) but miss the chunk that specifically mentions EIP-7683 by name. This is because embedding models capture semantic meaning, not lexical matching.
The solution is hybrid search: combine vector similarity with BM25 (a traditional keyword-based scoring algorithm). Most production RAG systems use a weighted combination, typically 0.7 * vector_score + 0.3 * bm25_score, though the optimal weights depend on your use case.
- Vector search excels at: conceptual queries, paraphrased questions, natural language questions
- BM25 excels at: exact term matching, technical identifiers, acronyms, proper nouns, code references
- Hybrid search gives you both — the semantic understanding of embeddings and the precision of keyword matching
- Databases like Weaviate, Qdrant, and Elasticsearch support hybrid search natively
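A sketch of the score fusion, assuming you already have per-document scores from each system. The detail that trips people up in hand-rolled versions: BM25 and cosine scores live on completely different scales, so each must be normalized before the weighted sum (min-max here; some systems use rank-based fusion like RRF instead):

```python
def hybrid_scores(vector_scores, bm25_scores, alpha=0.7):
    """Weighted fusion of vector and BM25 scores over doc IDs.
    Assumes both score dicts are non-empty."""
    def norm(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid divide-by-zero when all scores tie
        return {d: (s - lo) / span for d, s in scores.items()}
    v, b = norm(vector_scores), norm(bm25_scores)
    docs = set(v) | set(b)
    # A doc missing from one system simply scores 0 on that side.
    return {d: alpha * v.get(d, 0.0) + (1 - alpha) * b.get(d, 0.0) for d in docs}
```

With `alpha=0.7` this matches the 0.7/0.3 weighting above; tune it on your own query distribution.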
Reranking: The 10x Quality Multiplier
If there's one technique that delivers the highest ROI for RAG quality improvement, it's reranking. The idea is simple: retrieve a larger initial set of candidates (top-20 or top-30) using fast vector search, then rerank them using a more expensive but more accurate cross-encoder model to select the final top-5.
Why does this work so well? Bi-encoder embeddings (what vector databases use) compress an entire document into a single vector. This is fast but lossy — subtle relevance signals are lost. Cross-encoders process the query and document together, allowing for much finer-grained relevance judgments. They're too slow to run against an entire corpus but perfect for reranking a small candidate set.
Reranking Models
- Cohere Rerank v3: Best-in-class commercial reranker. Supports multilingual, long context, and structured document inputs
- BGE-reranker-v2-m3: Open-source, multilingual, competitive with commercial options. Can be self-hosted for data privacy
- cross-encoder/ms-marco-MiniLM-L-12-v2: Lightweight open-source option, fast inference, good for resource-constrained environments
- Jina Reranker v2: Good balance of speed and accuracy, supports code and technical content
The reranking step typically improves retrieval quality by 10-25% in terms of nDCG@5. In practice, this means the difference between 'the answer is in the retrieved context 60% of the time' and 'the answer is in the retrieved context 80% of the time.' This directly translates to fewer hallucinations and more accurate responses.
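The two-stage shape is simple enough to capture in one function. Here `embed_score` stands in for the cheap bi-encoder similarity your vector DB computes, and `cross_score` for a cross-encoder call (e.g. one of the rerankers above); both are pluggable callables so this sketch stays self-contained:

```python
def retrieve_and_rerank(query, corpus, embed_score, cross_score, pool=20, k=5):
    """Stage 1: cheap bi-encoder score over the whole corpus, keep `pool`.
    Stage 2: expensive cross-encoder score over the pool only, keep `k`."""
    candidates = sorted(corpus, key=lambda d: embed_score(query, d), reverse=True)[:pool]
    return sorted(candidates, key=lambda d: cross_score(query, d), reverse=True)[:k]
```

The cross-encoder runs `pool` times per query rather than once per corpus document — that asymmetry is the entire trick.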
Query Transformation: Fixing the Input
Often the problem isn't retrieval quality — it's that the query itself is poorly suited for retrieval. User questions are often vague, multi-part, or use different terminology than the documents.
- Query rewriting: Use an LLM to rewrite the user's question into a better retrieval query. 'What does the company's leave policy say about taking time off in December?' becomes 'employee leave policy December holiday vacation'
- HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer to the query, then use that hypothetical answer as the retrieval query. The embedding of a hypothetical answer is often closer to the actual answer in vector space than the original question
- Query decomposition: Break multi-part questions into sub-queries, retrieve for each, then synthesize. 'Compare our Q1 and Q2 revenue and explain the difference' becomes two retrieval queries: 'Q1 revenue' and 'Q2 revenue'
- Step-back prompting: For specific questions, generate a more general question first, retrieve for the general question, then use that context to answer the specific one
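The decomposition pattern above can be sketched as an orchestration function. Both `llm` and `retrieve` are pluggable callables standing in for real services, and the prompts are illustrative placeholders, not tuned templates:

```python
def decompose_and_retrieve(question, llm, retrieve, k=3):
    """Query decomposition: split a multi-part question into sub-queries,
    retrieve context for each, then answer from the pooled context."""
    plan = llm(f"Split into standalone sub-questions, one per line: {question}")
    sub_queries = [line.strip() for line in plan.splitlines() if line.strip()]
    context, seen = [], set()
    for sq in sub_queries:
        for chunk in retrieve(sq, k=k):
            if chunk not in seen:  # dedupe chunks shared across sub-queries
                seen.add(chunk)
                context.append(chunk)
    joined = "\n".join(context)
    return llm(f"Answer using only this context:\n{joined}\n\nQuestion: {question}")
```

HyDE slots into the same shape with one change: instead of splitting the question, ask the LLM for a hypothetical answer and retrieve with its embedding.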
Evaluation: Measuring RAG Quality
You can't improve what you don't measure. RAG evaluation requires metrics at both the retrieval and generation stages:
Retrieval Metrics
- Context Precision: What fraction of retrieved chunks are actually relevant? High precision = less noise for the LLM to filter
- Context Recall: What fraction of the relevant information was retrieved? High recall = the answer is in the context
- nDCG (Normalized Discounted Cumulative Gain): Are the most relevant chunks ranked first? Order matters because of the 'lost in the middle' problem
- MRR (Mean Reciprocal Rank): How early does the first relevant result appear?
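Two of these metrics are small enough to implement directly against a labeled eval set (ranked result lists plus sets of known-relevant doc IDs). This sketch uses binary relevance, which is usually all a hand-labeled RAG eval set provides:

```python
import math

def mrr(ranked_lists, relevant_sets):
    """Mean reciprocal rank: average of 1/position of the first relevant hit,
    0 for queries where nothing relevant was retrieved."""
    total = 0.0
    for ranked, rel in zip(ranked_lists, relevant_sets):
        for i, doc in enumerate(ranked, start=1):
            if doc in rel:
                total += 1.0 / i
                break
    return total / len(ranked_lists)

def ndcg_at_k(ranked, rel, k=5):
    """Binary-relevance nDCG@k: log-discounted gain, so relevant docs
    ranked early count for more."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, doc in enumerate(ranked[:k], start=1) if doc in rel)
    ideal = sum(1.0 / math.log2(i + 1) for i in range(1, min(len(rel), k) + 1))
    return dcg / ideal if ideal else 0.0
```

Context precision and recall are simpler still — set intersections over retrieved vs. relevant chunks — but the ranked metrics are the ones that catch "lost in the middle" regressions.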
Generation Metrics
- Faithfulness: Does the generated answer only contain information from the retrieved context? This is the hallucination metric — low faithfulness means the LLM is making things up
- Answer Relevance: Does the generated answer actually address the user's question?
- Answer Correctness: Is the answer factually correct? (Requires ground truth labels)
Frameworks like RAGAS, DeepEval, and TruLens provide automated evaluation pipelines for these metrics. Build an evaluation set of 50-100 question-answer pairs from your actual user queries and run evaluations after every pipeline change.
Production Architecture Patterns
Putting it all together, here's what a production-grade RAG architecture looks like:
- Document ingestion pipeline: Parse → clean → chunk (hierarchical) → embed → store in vector DB with metadata
- Query pipeline: Query transformation → hybrid search (vector + BM25) → rerank → context assembly → LLM generation
- Feedback loop: User feedback → evaluation pipeline → automatic detection of retrieval failures → chunk/embedding retraining triggers
- Guardrails: Input validation, output validation (check for hallucination markers), citation extraction, confidence scoring
- Caching layer: Cache frequent queries and their retrievals to reduce latency and cost
- Monitoring: Track retrieval quality, generation faithfulness, latency, cost per query, and user satisfaction metrics
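As one concrete piece of that architecture, the caching layer can start as a simple exact-match cache with a TTL. This is the simplest possible version — production systems often also normalize queries or cache by semantic similarity, and the class below is a sketch, not a drop-in component:

```python
import time

class QueryCache:
    """Exact-match query cache with per-entry TTL."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.store = {}  # query -> (result, inserted_at)

    def get(self, query):
        hit = self.store.get(query)
        if hit and time.time() - hit[1] < self.ttl:
            return hit[0]
        return None  # miss or expired

    def get_or_compute(self, query, compute):
        """Return a cached result, or run the full RAG pipeline
        (`compute`) and cache its output."""
        cached = self.get(query)
        if cached is not None:
            return cached
        result = compute(query)
        self.store[query] = (result, time.time())
        return result
```

Wrapping the whole query pipeline in `get_or_compute` means repeat questions skip retrieval, reranking, and generation entirely — usually the single cheapest latency win available.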
The difference between a demo RAG system and a production RAG system isn't the LLM — it's the retrieval engineering. Most teams spend 80% of their time optimizing prompts and 20% on retrieval. Flip that ratio and your results will improve dramatically.
Build Production-Grade AI Systems with Accelar
Accelar engineers production AI systems that work — from RAG pipelines and ML infrastructure to custom AI agents. We focus on the hard engineering problems that turn AI demos into reliable, scalable products. If your RAG pipeline isn't performing, let's fix it together.
