RAG
Components
Retriever: the component responsible for retrieving the most relevant documents or passages from an external knowledge base for a given user query. In Lewis et al., this is Dense Passage Retrieval (DPR), a dual-encoder model (one encoder for queries, one for documents). In modern RAG pipelines, retrievers use precomputed dense embeddings with ANN search (e.g., FAISS), sparse methods (BM25), or both (hybrid search).
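A minimal sketch of the query-time retrieval step, assuming a sentence-transformers encoder and a FAISS index built over L2-normalized document embeddings (the helper name and setup are illustrative; see the indexing sketch further below):

```python
import numpy as np

def retrieve(query, encoder, index, corpus, k=5):
    """Encode the query and return the top-k (passage, score) pairs.

    `encoder` is a sentence-transformers model; `index` is a FAISS index
    built over L2-normalized embeddings of `corpus`, so inner-product
    scores equal cosine similarity.
    """
    q = encoder.encode([query], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(q, dtype="float32"), k)
    return [(corpus[i], float(s)) for i, s in zip(ids[0], scores[0])]
```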
Vector index (vector store): database storing documents or document chunks along with precomputed vector embeddings, enabling efficient approximate nearest neighbor (ANN) search. In the original RAG, the index is FAISS with DPR embeddings. Modern implementations use specialized vector databases (e.g., Pinecone, Weaviate, ChromaDB, Qdrant, pgvector).
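A sketch of building such an index in-process with FAISS (the model choice and corpus are placeholders; a flat index does exact search, while HNSW or IVF variants trade accuracy for speed on large corpora):

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
chunks = ["RAG combines retrieval with generation.",
          "BM25 is a sparse lexical ranking function."]

# Inner product over L2-normalized vectors equals cosine similarity.
vecs = encoder.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(vecs.shape[1])
index.add(np.asarray(vecs, dtype="float32"))
```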
Generator: generative language model that conditions its output on the input query and the retrieved documents. In the original RAG this is BART (a seq2seq Transformer). In modern RAG pipelines, any LLM (GPT-4, Llama, Gemini, etc.) can serve as the generator, with the retrieved context appended to the prompt. In the original RAG, the generator is trained end-to-end with the retriever; in modern applications it is typically frozen.
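A sketch of the frozen-generator case: retrieved chunks are stuffed into the prompt and an unmodified LLM answers (the prompt format and the `llm` client are illustrative placeholders):

```python
def build_prompt(query, passages):
    """Assemble a retrieve-then-read prompt (format is illustrative)."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

# answer = llm.generate(build_prompt(query, top_passages))  # `llm` is any frozen LLM client
```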
Indexing pipeline: offline component responsible for splitting source documents into smaller chunks, computing their embeddings, and indexing them in the vector store. Chunking quality (chunk size, overlap) directly impacts retrieval quality; a naive chunker is sketched below.
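A minimal character-based chunker with overlap (illustrative defaults; production pipelines usually split on token or sentence boundaries instead):

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-size character chunks sharing `overlap` characters."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```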
Implementation pitfalls
If the embedding model uses different encoders or tokenization for queries vs. documents (asymmetric models like DPR), generating document embeddings with the query encoder (or vice versa) produces poor retrieval results.
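With the original DPR checkpoints on HuggingFace, the query side and document side use distinct encoder classes, which makes the asymmetry explicit (a sketch; the checkpoint names are the standard NQ-trained ones):

```python
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

# Query side and document side are different models; do not mix them.
q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
ctx_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

q_emb = q_enc(**q_tok("what is rag?", return_tensors="pt")).pooler_output        # query embedding
d_emb = ctx_enc(**ctx_tok("RAG combines...", return_tensors="pt")).pooler_output  # doc embedding
```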
Overly large document chunks contain much irrelevant text, diluting the context signal and potentially causing hallucinations or causing the LLM to overlook the relevant parts. This relates to the 'lost in the middle' problem: LLMs perform worse on information located in the middle of long contexts.
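One common mitigation (a heuristic sketch, not from the original paper) is to reorder retrieved chunks so the highest-ranked ones sit at the edges of the context, where models attend most reliably:

```python
def reorder_for_long_context(passages_by_rank):
    """Alternate ranks between the front and the back of the final ordering,
    leaving the weakest passages in the middle of the prompt."""
    front, back = [], []
    for i, passage in enumerate(passages_by_rank):
        (front if i % 2 == 0 else back).append(passage)
    return front + back[::-1]

# ranks [1, 2, 3, 4, 5] -> order [1, 3, 5, 4, 2]: strongest at the edges
```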
When the embedding model is updated, existing embeddings in the vector database become incompatible with the new model. Searching an index built with the old model using query embeddings from the new model produces incorrect results.
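A cheap safeguard is to store the embedding model identifier (and dimension) as index metadata and validate it at query time (the metadata scheme here is illustrative):

```python
INDEX_META = {"embedding_model": "all-MiniLM-L6-v2", "dim": 384}  # stored alongside the index

def check_index_compatibility(query_model_name, index_meta):
    """Refuse to query an index built with a different embedding model."""
    if query_model_name != index_meta["embedding_model"]:
        raise ValueError(
            f"Index was built with {index_meta['embedding_model']!r}; "
            f"re-embed the corpus before querying with {query_model_name!r}."
        )
```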
RAG assumes retrieved documents are trustworthy and factually relevant. If the knowledge base contains false, outdated, or malicious documents, retrieval may provide the LLM with context leading to incorrect responses, even if the LLM has correct parametric knowledge.
The LLM may ignore or misinterpret retrieved context, relying instead on its parametric knowledge. This is particularly problematic when retrieved documents contain information contradicting the model's training knowledge.
Evolution
2020: Lewis et al. (NeurIPS 2020) introduced the term RAG and the formal architecture combining DPR (retriever) with BART (generator) in an end-to-end trainable system. Two variants were proposed: RAG-Sequence and RAG-Token. The model achieved state-of-the-art results on three open-domain QA tasks.
2020–2021: Izacard & Grave published Fusion-in-Decoder (FiD; arXiv 2020, EACL 2021), in which each retrieved document is encoded separately by a T5 encoder and the decoder cross-attends over all encoded documents simultaneously, improving scaling to large k.
2023: After the launch of ChatGPT (November 2022) and the surge of interest in LLMs, RAG became the dominant technique for augmenting LLMs with external knowledge without retraining. The term 'RAG' began to be applied broadly to frozen-LLM retrieve-then-read pipelines with vector databases.
2023–2024: RAG variants with adaptive retrieval decisions were published: SELF-RAG (Asai et al., 2023) trains the model to emit reflection tokens ('Should I retrieve?', 'Is this relevant?'), Adaptive-RAG (Jeong et al., 2024) conditions retrieval on query complexity, and Corrective RAG (Yan et al., 2024) verifies the quality of retrieved documents before generation.
2024: Microsoft Research published GraphRAG (Edge et al., 2024), extending RAG with knowledge-graph construction from documents and entity- and community-level retrieval, improving responses to complex queries over global knowledge.
Technical details
Hyperparameters (configurable axes)
Top-k: number of documents or passages retrieved by the retriever for each query. Higher k increases retrieval recall but lengthens the prompt and increases LLM cost. In the original RAG, k = 5 was a typical value.
Chunk size: length of document chunks in the index. Affects retrieval granularity and the length of the appended context. Chunks that are too small lose surrounding context; chunks that are too large slow the LLM and may contain much irrelevant text.
Embedding model: model used to convert queries and documents into vector representations; its quality directly bounds retrieval quality. The original RAG used DPR; modern pipelines use models such as text-embedding-3 (OpenAI), all-MiniLM (SBERT), e5-large, or BGE.
Retrieval method: dense (embedding similarity), sparse (BM25/TF-IDF), or hybrid. Affects retrieval quality and latency; a common fusion scheme for the hybrid case is sketched below.
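A sketch of hybrid retrieval via reciprocal rank fusion (RRF), one common way to merge a BM25 ranking with a dense ranking (k = 60 is the conventional RRF constant; the doc ids are illustrative):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids with RRF: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a sparse (BM25) ranking with a dense (embedding) ranking:
fused = reciprocal_rank_fusion([["d3", "d1", "d2"], ["d1", "d4", "d3"]])
# -> ["d1", "d3", "d4", "d2"]
```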
Compute bottleneck
During inference, RAG introduces two additional costs compared to a standard LLM call: (1) ANN search latency in the vector database (typically tens of milliseconds, depending on index size and infrastructure); (2) prompt lengthening from the appended retrieved context, which increases LLM processing cost in proportion to the number and size of chunks (with quadratic self-attention complexity over long contexts). For example, k = 5 chunks of 500 tokens each add roughly 2,500 tokens to every prompt.
Execution paradigm
RAG is a conditional paradigm: retrieval is triggered by the user query, and only a subset of documents from the full index is fetched and used to condition generation. Not all documents in the corpus are activated for every query — activation is conditional on the input. In Adaptive RAG and Self-RAG variants, the retrieval step can be optionally skipped, making activation even more input-dependent.
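A schematic of the conditional gate (purely illustrative; real Self-RAG and Adaptive RAG use trained decisions such as reflection tokens or a complexity classifier, not this placeholder predicate, and `build_prompt` is the generator sketch from the Components section):

```python
def answer(query, llm, retriever, needs_retrieval):
    """Retrieve only when the gate says the query needs external knowledge."""
    if needs_retrieval(query):            # e.g., a trained classifier or reflection token
        passages = retriever(query)
        return llm.generate(build_prompt(query, passages))
    return llm.generate(query)            # fall back to parametric knowledge
```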
Parallelism
Many independent queries can be processed in parallel. The indexing stage (offline chunking and embedding computation) is fully parallelizable. Retrieval with FAISS and similar ANN libraries can leverage multiple CPU threads or GPUs.
Hardware requirements
Both the embedding model (for generating document and query embeddings) and the LLM generator rely on Transformer architectures accelerated by GPU Tensor Cores. For production-scale deployments (millions of documents, generators of 7B+ parameters), a GPU is required to maintain acceptable inference latency.
ANN search in libraries such as FAISS (Flat index, HNSW) can be executed efficiently on CPU with AVX/AVX-512 extensions for vectorized floating-point operations. For indexes up to tens of millions of documents under moderate load, CPU is sufficient for the retrieval stage.