Robots Atlas

Retrieval-Augmented Generation

Combines a pre-trained generative model with a differentiably trainable document retriever over an external non-parametric memory, allowing the model's parametric knowledge to be dynamically augmented without retraining on new data.

Category
Abstraction level
Operation level
01

Retriever

Selects a subset of k documents from a large collection to condition subsequent generation.

Modular

Component responsible for retrieving the most relevant documents or passages from an external knowledge base based on the user query. In Lewis et al., this is Dense Passage Retrieval (DPR) — a dual-encoder model (one for queries, one for documents). In modern RAG pipelines, retrievers use precomputed dense embeddings with ANN search (e.g., FAISS) or sparse methods (BM25), or both (hybrid search).

• Dense retrieval (DPR, embedding-based)
• Sparse retrieval (BM25, TF-IDF)
• Hybrid retrieval
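As a concrete illustration of the sparse side, BM25 can be sketched in a few lines. This is a toy implementation for intuition only; production pipelines use a library (e.g. rank_bm25) or a Lucene-based engine, and the tokenization here is assumed to have happened upstream.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each pre-tokenized document against the query with BM25.

    Toy sketch: docs is a list of token lists; k1 and b are the
    standard BM25 saturation and length-normalization parameters.
    """
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # document frequency of each query term
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [["rag", "combines", "retrieval", "and", "generation"],
        ["bm25", "is", "a", "sparse", "retrieval", "method"],
        ["cats", "sleep", "most", "of", "the", "day"]]
scores = bm25_scores(["sparse", "retrieval"], docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

The document containing both query terms ranks highest; a document with no matching term scores exactly zero, which is why purely sparse retrieval fails on paraphrases.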
02

Document Index / Vector Store

Stores external, non-parametric knowledge as searchable vector representations.

Modular

Database storing documents or document chunks along with precomputed vector embeddings, enabling efficient approximate nearest neighbor (ANN) search. In original RAG the index is FAISS with DPR embeddings. Modern implementations use specialized vector databases (e.g., Pinecone, Weaviate, ChromaDB, Qdrant, pgvector).
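The vector-store interface can be sketched with a brute-force in-memory stand-in. This is illustrative only: the class name and API below are hypothetical, and real stores (FAISS, pgvector, Qdrant, etc.) replace the exact linear scan with an ANN index.

```python
import math

class TinyVectorStore:
    """Minimal in-memory vector store with exact cosine-similarity search.

    Brute-force stand-in for an ANN index; only suitable for
    small collections, but the add/search interface mirrors the
    real thing.
    """
    def __init__(self):
        self._vectors = []  # list of (doc_id, embedding)

    def add(self, doc_id, embedding):
        self._vectors.append((doc_id, embedding))

    def search(self, query_embedding, k=3):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(y * y for y in b))
            return dot / (na * nb)
        scored = [(cosine(query_embedding, v), doc_id)
                  for doc_id, v in self._vectors]
        scored.sort(reverse=True)
        return [doc_id for _, doc_id in scored[:k]]

store = TinyVectorStore()
store.add("doc-a", [1.0, 0.0])
store.add("doc-b", [0.0, 1.0])
store.add("doc-c", [0.7, 0.7])
hits = store.search([0.9, 0.1], k=2)
```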

03

Generator / LLM

Generates responses grounded in retrieved documents.

Modular

Generative language model conditioning its output on the input query and retrieved documents. In original RAG this is BART (seq2seq Transformer). In modern RAG pipelines, any LLM (GPT-4, Llama, Gemini, etc.) with appended context in the prompt. In original RAG, the generator is trained end-to-end with the retriever; in modern applications it is typically frozen.
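In the frozen-LLM case, "conditioning on retrieved documents" reduces to prompt assembly. The sketch below shows the common retrieve-then-read pattern; the template wording and function name are illustrative assumptions, not a fixed API.

```python
def build_rag_prompt(query, retrieved_chunks):
    """Assemble a retrieve-then-read prompt for a frozen LLM.

    Retrieved chunks are appended as numbered context blocks so the
    model can ground (and ideally cite) its answer.
    """
    context = "\n\n".join(
        f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks)
    )
    return (
        "Answer the question using ONLY the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

prompt = build_rag_prompt(
    "Who introduced RAG?",
    ["RAG was introduced by Lewis et al. (2020).",
     "DPR is a dual-encoder retriever."],
)
```

The resulting string is what gets sent to the LLM; the numbered markers also make it easy to ask the model for source attributions.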

04

Chunking & Indexing Pipeline

Prepares external documents for efficient retrieval by the retriever.

Modular

Offline component responsible for splitting source documents into smaller chunks, computing their embeddings, and indexing them in the vector store. Chunking quality (chunk size, overlap) directly impacts retrieval quality.
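The splitting step can be sketched as a sliding token window with overlap. This is a minimal version for illustration; real pipelines usually split on sentence or section boundaries first (e.g. LangChain's recursive splitters), and the parameter values here are just common defaults.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token sequence into overlapping fixed-size chunks.

    Overlap preserves context across chunk boundaries so a sentence
    cut at the edge of one chunk is still whole in the next.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

chunks = chunk_tokens(list(range(1000)), chunk_size=512, overlap=64)
```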

Bottleneck: Retrieval latency and prompt context length

During inference, RAG adds two costs on top of a standard LLM call: (1) ANN search latency in the vector database (typically tens of milliseconds, depending on index size and infrastructure); (2) a longer prompt, since retrieved document context is appended, increasing LLM processing cost in proportion to the number and size of chunks (self-attention scales quadratically with context length).

Parallelism

Partially parallel

Many independent queries can be processed in parallel. The indexing stage (offline chunking and embedding computation) is fully parallelizable. Retrieval with FAISS and similar ANN libraries can leverage multiple CPU threads or GPUs.

Paradigm

Conditional

Input dependent

RAG is a conditional paradigm: retrieval is triggered by the user query, and only a subset of documents from the full index is fetched and used to condition generation. Not all documents in the corpus are activated for every query — activation is conditional on the input. In Adaptive RAG and Self-RAG variants, the retrieval step can be optionally skipped, making activation even more input-dependent.
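The input-dependent retrieval decision can be illustrated with a crude gate. The keyword heuristic below is a placeholder for the learned decision in Self-RAG / Adaptive RAG (reflection tokens or a trained classifier); `parametric_topics` is a hypothetical allow-list of topics the frozen model already answers well.

```python
def should_retrieve(query, parametric_topics):
    """Decide whether to run retrieval for this query.

    Toy heuristic: skip retrieval for chit-chat or well-covered
    topics, retrieve for everything else. Stands in for the learned
    gates used by Self-RAG and Adaptive RAG.
    """
    words = set(query.lower().split())
    return not (words & parametric_topics)

topics = {"hello", "thanks", "arithmetic"}
needs_kb = should_retrieve("What changed in the 2024 tax code?", topics)
smalltalk = should_retrieve("hello there", topics)
```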

Number of Retrieved Documents (top-k)

Critical
  • 3: Minimum value for simple QA tasks. Low latency.
  • 5: Typical value from the original Lewis et al. 2020 paper.
  • 10–20: For complex tasks requiring broad topic coverage. Increases LLM cost.

Number of documents or passages retrieved by the retriever for each query. Higher k increases retrieval recall but extends the prompt and increases LLM cost. In original RAG, k=5 was a typical value.

Chunk Size (Tokens)

Standard
  • 256 tokens: Fine-grained chunks; high retrieval precision, but risk of losing sentence-level context.
  • 512 tokens: Popular default in many RAG frameworks (e.g., LangChain, LlamaIndex).
  • 1024 tokens: Larger chunks provide more context but increase LLM cost and reduce retrieval precision.

Length of document chunks in the index. Affects retrieval granularity and appended context length. Too small chunks lose context; too large chunks slow the LLM and may contain much irrelevant text.

Embedding Model

Critical

Model used to convert queries and documents to vector representations. Embedding model quality directly impacts retrieval quality. Original RAG used DPR; modern pipelines use models like text-embedding-3 (OpenAI), all-MiniLM (SBERT), e5-large, or BGE.

Retrieval Strategy

Standard
  • dense (cosine similarity): Semantic search; good performance on paraphrases and multilingual queries.
  • BM25 (sparse): Effective for queries containing exact keywords and proper nouns.
  • hybrid (dense + BM25): Best overall retrieval performance on benchmarks (e.g., BEIR). Requires re-ranking or score normalization.

Document search method: dense (embedding similarity), sparse (BM25/TF-IDF), or hybrid. Affects retrieval quality and latency.
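Because dense and BM25 scores live on incompatible scales, hybrid pipelines often fuse ranked lists instead of raw scores. Reciprocal Rank Fusion (RRF) is one common choice, sketched here; k=60 is the constant from the original RRF paper and a typical search-engine default.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank + 1) per document; documents
    ranked well in multiple lists accumulate the highest fused score,
    with no score normalization needed.
    """
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(fused, key=fused.get, reverse=True)

dense = ["d3", "d1", "d2"]   # dense retriever's ranking
sparse = ["d1", "d4", "d3"]  # BM25's ranking
fused = reciprocal_rank_fusion([dense, sparse])
```

Here "d1" wins because it appears near the top of both lists, even though neither retriever ranked it first.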

Common pitfalls

Query-Document Embedding Mismatch (Encoder Asymmetry)
CRITICAL

If the embedding model uses different encoders or tokenization for queries vs. documents (asymmetric models like DPR), generating document embeddings with the query encoder (or vice versa) produces poor retrieval results.

Use embedding models according to their documentation: asymmetric models (DPR, E5) require separate encoders for queries and documents. Validate the embedding pipeline on test pairs.
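For E5-style asymmetric models, the separation is implemented as role-specific text prefixes. The sketch below shows the pattern; `embed_fn` is a placeholder for the actual encoder call (e.g. a sentence-transformers model), and the helper name is hypothetical.

```python
def embed_for_e5(text, role, embed_fn):
    """Apply the role-specific prefix that E5-style models expect.

    E5 models are trained with "query: " / "passage: " prefixes;
    dropping or swapping them degrades retrieval quality. embed_fn
    stands in for the real encoder.
    """
    prefixes = {"query": "query: ", "passage": "passage: "}
    if role not in prefixes:
        raise ValueError("role must be 'query' or 'passage'")
    return embed_fn(prefixes[role] + text)

# Identity "encoder" just to make the prefixing visible; a real
# pipeline would call the embedding model here.
q = embed_for_e5("what is RAG?", "query", lambda s: s)
p = embed_for_e5("RAG combines retrieval and generation.", "passage", lambda s: s)
```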

Oversized Chunks — Context Noise and LLM Quality Degradation
HIGH

Overly large document chunks contain much irrelevant text, diluting the context signal and potentially causing hallucinations or the LLM ignoring relevant parts. Known as the 'lost in the middle' problem — LLMs perform worse on information in the middle of long contexts.

Use moderate chunk sizes (256–512 tokens with overlap). Add re-ranking of retrieved documents (e.g., a cross-encoder re-ranker). Apply context compression (e.g., LLMLingua) before inserting into the prompt.

Embedding inconsistency during index updates
HIGH

When the embedding model is updated, existing embeddings in the vector database become incompatible with the new model. Searching an index created by the old model using new-model query embeddings produces incorrect results.

Full re-embedding of the index after each embedding model change. Index versioning. Using stable, infrequently updated embedding models in production.
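Index versioning can be as simple as tagging the index with the embedding model id and failing fast on mismatch. The class and method names below are illustrative; real vector databases expose this via collection metadata.

```python
class VersionedIndex:
    """Vector index that rejects data from a mismatched embedding model.

    Minimal sketch of index versioning: the index records which model
    produced its vectors, so stale or wrong-model embeddings raise an
    error instead of silently returning garbage results.
    """
    def __init__(self, model_id):
        self.model_id = model_id
        self.vectors = {}

    def add(self, doc_id, vector, model_id):
        if model_id != self.model_id:
            raise ValueError(
                f"index built with {self.model_id}, got {model_id}")
        self.vectors[doc_id] = vector

    def check_query(self, model_id):
        """Return True if query embeddings are compatible with this index."""
        return model_id == self.model_id

index = VersionedIndex("text-embedding-3-small")
index.add("doc-1", [0.1, 0.2], "text-embedding-3-small")
ok = index.check_query("text-embedding-3-small")
stale = index.check_query("e5-large-v2")
```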

Retrieval of Factually Contradictory Passages (Context Poisoning)
HIGH

RAG assumes retrieved documents are trustworthy and factually relevant. If the knowledge base contains false, outdated, or malicious documents, retrieval may provide the LLM with context leading to incorrect responses, even if the LLM has correct parametric knowledge.

Quality control and curation of knowledge base sources. Apply Corrective RAG or fact-verification mechanisms. Add a re-ranker to detect low-quality or contradictory passages.

LLM ignoring retrieved context (context faithfulness failure)
MEDIUM

The LLM may ignore or misinterpret retrieved context, relying instead on its parametric knowledge. This is particularly problematic when retrieved documents contain information contradicting the model's training knowledge.

Explicit prompt instructions directing the model to prioritize the provided context. Faithfulness evaluation on a test set (e.g., RAGAs framework). Using models specialized for RAG (e.g., Command R+ from Cohere).

GENESIS · Source paper

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
2020 · NeurIPS 2020 (Advances in Neural Information Processing Systems 33) · Patrick Lewis, Ethan Perez, Aleksandra Piktus et al.
2020

Lewis et al. define RAG as an end-to-end trainable architecture (Meta AI / NeurIPS 2020)

breakthrough

Paper introduced the term RAG and the formal architecture combining DPR (retriever) with BART (generator) in an end-to-end trainable system. Two variants proposed: RAG-Sequence and RAG-Token. Achieved state-of-the-art on three open-domain QA tasks.

2021

Fusion-in-Decoder and other RAG variants for multi-document reasoning

Izacard & Grave published Fusion-in-Decoder (FiD), where each retrieved document is encoded separately by a T5 encoder and the decoder cross-attends over all encoded documents simultaneously, improving scaling to large k.

2022

ChatGPT launch and rise of RAG as a production technique for LLMs

breakthrough

After ChatGPT launch and growing LLM interest, RAG became the dominant technique for augmenting LLMs with external knowledge without retraining. The term 'RAG' began being broadly applied to frozen-LLM retrieve-then-read pipelines with vector databases.

2023

SELF-RAG, Adaptive RAG, Corrective RAG — variants with adaptive retrieval

RAG variants with adaptive retrieval decisions published: SELF-RAG trains the model to generate reflection tokens ('Should I retrieve?', 'Is this relevant?'), Adaptive RAG conditions retrieval on query complexity, Corrective RAG verifies retrieved document quality before generation.

2024

GraphRAG (Microsoft) – RAG on knowledge graphs

Microsoft Research published GraphRAG (Edge et al.), extending RAG with knowledge graph construction from documents and entity/community-level retrieval, improving responses for complex global knowledge queries.

GPU Tensor Cores
PRIMARY

Both the embedding model (for generating document and query embeddings) and the LLM generator rely on Transformer architectures accelerated by GPU Tensor Cores. For production-scale deployments (millions of documents, LLM models of 7B+ parameters), a GPU is required to maintain acceptable inference latency.

ANN search (FAISS) can run on CPU for indexes up to several hundred million vectors with acceptable latency, or on GPU (FAISS-GPU) for larger indexes or lower latency.

CPU AVX
GOOD

ANN search in libraries such as FAISS (Flat index, HNSW) can be executed efficiently on CPU with AVX/AVX-512 extensions for vectorized floating-point operations. For indexes up to tens of millions of documents under moderate load, CPU is sufficient for the retrieval stage.

The generation stage (LLM) still requires a GPU for acceptable latency with large models. CPU is sufficient for retrieval in small and medium RAG deployments.

Commonly used with

Prompt Engineering

Prompt Engineering is a set of techniques for precisely formulating text inputs (prompts) provided to language models, to guide their responses toward a desired format, style, level of detail, or correctness. Techniques include few-shot prompting (providing examples in context), zero-shot prompting (task without examples), role prompting (assigning a system role), chain-of-thought (requesting reasoning steps), format prompting (specifying output format), and many others. Prompt Engineering is particularly important when model fine-tuning is infeasible or uneconomical. Codification of techniques occurred mainly after GPT-3 (Brown et al., 2020), which demonstrated high sensitivity of performance to prompt formulation.
