RAG
Components
Retriever: the component responsible for retrieving the most relevant documents or passages from an external knowledge base for a given user query. In Lewis et al., this is Dense Passage Retrieval (DPR), a dual-encoder model (one encoder for queries, one for documents). In modern RAG pipelines, retrievers use precomputed dense embeddings with ANN search (e.g., FAISS), sparse methods (BM25), or both (hybrid search).
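A minimal sketch of the query-time retrieval step, assuming a sentence-transformers encoder and a FAISS index built over L2-normalized document embeddings (the helper name and setup are illustrative; see the indexing sketch further below):

```python
import numpy as np

def retrieve(query, encoder, index, corpus, k=5):
    """Encode the query and return the top-k (passage, score) pairs.

    `encoder` is a sentence-transformers model; `index` is a FAISS index
    built over L2-normalized embeddings of `corpus`, so inner-product
    scores equal cosine similarity.
    """
    q = encoder.encode([query], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(q, dtype="float32"), k)
    return [(corpus[i], float(s)) for i, s in zip(ids[0], scores[0])]
```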
Vector index (vector store): database storing documents or document chunks along with precomputed vector embeddings, enabling efficient approximate nearest neighbor (ANN) search. In the original RAG, the index is FAISS with DPR embeddings. Modern implementations use specialized vector databases (e.g., Pinecone, Weaviate, ChromaDB, Qdrant, pgvector).
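A sketch of building such an index in-process with FAISS (the model choice and corpus are placeholders; a flat index does exact search, while HNSW or IVF variants trade accuracy for speed on large corpora):

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
chunks = ["RAG combines retrieval with generation.",
          "BM25 is a sparse lexical ranking function."]

# Inner product over L2-normalized vectors equals cosine similarity.
vecs = encoder.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(vecs.shape[1])
index.add(np.asarray(vecs, dtype="float32"))
```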
Generator: generative language model that conditions its output on the input query and the retrieved documents. In the original RAG this is BART (a seq2seq Transformer). In modern RAG pipelines, any LLM (GPT-4, Llama, Gemini, etc.) can serve as the generator, with the retrieved context appended to the prompt. In the original RAG, the generator is trained end-to-end with the retriever; in modern applications it is typically frozen.
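A sketch of the frozen-generator case: retrieved chunks are stuffed into the prompt and an unmodified LLM answers (the prompt format and the `llm` client are illustrative placeholders):

```python
def build_prompt(query, passages):
    """Assemble a retrieve-then-read prompt (format is illustrative)."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

# answer = llm.generate(build_prompt(query, top_passages))  # `llm` is any frozen LLM client
```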
Indexing pipeline: offline component responsible for splitting source documents into smaller chunks, computing their embeddings, and indexing them in the vector store. Chunking quality (chunk size, overlap) directly impacts retrieval quality; a naive chunker is sketched below.
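A minimal character-based chunker with overlap (illustrative defaults; production pipelines usually split on token or sentence boundaries instead):

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-size character chunks sharing `overlap` characters."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```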
Implementation pitfalls
If the embedding model uses different encoders or tokenization for queries vs. documents (asymmetric models like DPR), generating document embeddings with the query encoder (or vice versa) produces poor retrieval results.
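With the original DPR checkpoints on HuggingFace, the query side and document side use distinct encoder classes, which makes the asymmetry explicit (a sketch; the checkpoint names are the standard NQ-trained ones):

```python
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

# Query side and document side are different models; do not mix them.
q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
ctx_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

q_emb = q_enc(**q_tok("what is rag?", return_tensors="pt")).pooler_output        # query embedding
d_emb = ctx_enc(**ctx_tok("RAG combines...", return_tensors="pt")).pooler_output  # doc embedding
```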
Overly large document chunks contain much irrelevant text, diluting the context signal and potentially causing hallucinations or causing the LLM to overlook the relevant parts. This relates to the 'lost in the middle' problem: LLMs perform worse on information located in the middle of long contexts.
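One common mitigation (a heuristic sketch, not from the original paper) is to reorder retrieved chunks so the highest-ranked ones sit at the edges of the context, where models attend most reliably:

```python
def reorder_for_long_context(passages_by_rank):
    """Alternate ranks between the front and the back of the final ordering,
    leaving the weakest passages in the middle of the prompt."""
    front, back = [], []
    for i, passage in enumerate(passages_by_rank):
        (front if i % 2 == 0 else back).append(passage)
    return front + back[::-1]

# ranks [1, 2, 3, 4, 5] -> order [1, 3, 5, 4, 2]: strongest at the edges
```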
When the embedding model is updated, existing embeddings in the vector database become incompatible with the new model. Searching an index built with the old model using query embeddings from the new model produces incorrect results.
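A cheap safeguard is to store the embedding model identifier (and dimension) as index metadata and validate it at query time (the metadata scheme here is illustrative):

```python
INDEX_META = {"embedding_model": "all-MiniLM-L6-v2", "dim": 384}  # stored alongside the index

def check_index_compatibility(query_model_name, index_meta):
    """Refuse to query an index built with a different embedding model."""
    if query_model_name != index_meta["embedding_model"]:
        raise ValueError(
            f"Index was built with {index_meta['embedding_model']!r}; "
            f"re-embed the corpus before querying with {query_model_name!r}."
        )
```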
RAG assumes retrieved documents are trustworthy and factually relevant. If the knowledge base contains false, outdated, or malicious documents, retrieval may provide the LLM with context leading to incorrect responses, even if the LLM has correct parametric knowledge.
The LLM may ignore or misinterpret retrieved context, relying instead on its parametric knowledge. This is particularly problematic when retrieved documents contain information contradicting the model's training knowledge.
Evolution
2020: Lewis et al. (NeurIPS 2020) introduced the term RAG and the formal architecture combining DPR (retriever) with BART (generator) in an end-to-end trainable system. Two variants were proposed: RAG-Sequence and RAG-Token. The model achieved state-of-the-art results on three open-domain QA tasks.
2020–2021: Izacard & Grave published Fusion-in-Decoder (FiD; arXiv 2020, EACL 2021), in which each retrieved document is encoded separately by a T5 encoder and the decoder cross-attends over all encoded documents simultaneously, improving scaling to large k.
2023: After the launch of ChatGPT (November 2022) and the surge of interest in LLMs, RAG became the dominant technique for augmenting LLMs with external knowledge without retraining. The term 'RAG' began to be applied broadly to frozen-LLM retrieve-then-read pipelines with vector databases.
2023–2024: RAG variants with adaptive retrieval decisions were published: SELF-RAG (Asai et al., 2023) trains the model to emit reflection tokens ('Should I retrieve?', 'Is this relevant?'), Adaptive-RAG (Jeong et al., 2024) conditions retrieval on query complexity, and Corrective RAG (Yan et al., 2024) verifies the quality of retrieved documents before generation.
2024: Microsoft Research published GraphRAG (Edge et al., 2024), extending RAG with knowledge-graph construction from documents and entity- and community-level retrieval, improving responses to complex queries over global knowledge.
Technical details
Hyperparameters (configurable axes)
Top-k: number of documents or passages retrieved by the retriever for each query. Higher k increases retrieval recall but lengthens the prompt and increases LLM cost. In the original RAG, k = 5 was a typical value.
Chunk size: length of document chunks in the index. Affects retrieval granularity and the length of the appended context. Chunks that are too small lose surrounding context; chunks that are too large slow the LLM and may contain much irrelevant text.
Embedding model: model used to convert queries and documents into vector representations; its quality directly bounds retrieval quality. The original RAG used DPR; modern pipelines use models such as text-embedding-3 (OpenAI), all-MiniLM (SBERT), e5-large, or BGE.
Retrieval method: dense (embedding similarity), sparse (BM25/TF-IDF), or hybrid. Affects retrieval quality and latency; a common fusion scheme for the hybrid case is sketched below.
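A sketch of hybrid retrieval via reciprocal rank fusion (RRF), one common way to merge a BM25 ranking with a dense ranking (k = 60 is the conventional RRF constant; the doc ids are illustrative):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids with RRF: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a sparse (BM25) ranking with a dense (embedding) ranking:
fused = reciprocal_rank_fusion([["d3", "d1", "d2"], ["d1", "d4", "d3"]])
# -> ["d1", "d3", "d4", "d2"]
```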
Compute bottleneck
During inference, RAG introduces two additional costs compared to a standard LLM call: (1) ANN search latency in the vector database (typically tens of milliseconds, depending on index size and infrastructure); (2) prompt lengthening from the appended retrieved context, which increases LLM processing cost in proportion to the number and size of chunks (with quadratic self-attention complexity over long contexts). For example, k = 5 chunks of 500 tokens each add roughly 2,500 tokens to every prompt.
Execution paradigm
RAG is a conditional paradigm: retrieval is triggered by the user query, and only a subset of documents from the full index is fetched and used to condition generation. Not all documents in the corpus are activated for every query — activation is conditional on the input. In Adaptive RAG and Self-RAG variants, the retrieval step can be optionally skipped, making activation even more input-dependent.
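A schematic of the conditional gate (purely illustrative; real Self-RAG and Adaptive RAG use trained decisions such as reflection tokens or a complexity classifier, not this placeholder predicate, and `build_prompt` is the generator sketch from the Components section):

```python
def answer(query, llm, retriever, needs_retrieval):
    """Retrieve only when the gate says the query needs external knowledge."""
    if needs_retrieval(query):            # e.g., a trained classifier or reflection token
        passages = retriever(query)
        return llm.generate(build_prompt(query, passages))
    return llm.generate(query)            # fall back to parametric knowledge
```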
Parallelism
Many independent queries can be processed in parallel. The indexing stage (offline chunking and embedding computation) is fully parallelizable. Retrieval with FAISS and similar ANN libraries can leverage multiple CPU threads or GPUs.
Hardware requirements
Both the embedding model (for generating document and query embeddings) and the LLM generator rely on Transformer architectures accelerated by GPU Tensor Cores. For production-scale deployments (millions of documents, generators of 7B+ parameters), a GPU is required to maintain acceptable inference latency.
ANN search in libraries such as FAISS (Flat index, HNSW) can be executed efficiently on CPU with AVX/AVX-512 extensions for vectorized floating-point operations. For indexes up to tens of millions of documents under moderate load, CPU is sufficient for the retrieval stage.