Large Language Model
Scaling autoregressive language modeling to billions of parameters enabled emergent reasoning, instruction following, and general-purpose text generation capabilities.
The Transformer model is trained on tokens from a text corpus, learning to predict the next token (autoregression). At sufficient scale (parameters, data, compute), emergent capabilities arise: reasoning, in-context learning, and instruction following.
Previous NLP models were narrowly specialized (separate models for translation, classification, QA). LLMs unify multiple language tasks within a single generic model.
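As a concrete illustration of the next-token (autoregressive) objective described above, the following minimal sketch shows how a token sequence is split into shifted input/target pairs and scored with cross-entropy. The random logits stand in for a real Transformer's output, and all names here are illustrative rather than any specific library's API.

```python
# Minimal sketch of the next-token (autoregressive) training objective.
# Random logits stand in for a real Transformer's output; names are illustrative.
import numpy as np

def next_token_targets(token_ids):
    """Inputs are tokens 0..n-2; targets are tokens 1..n-1 (shifted by one)."""
    return token_ids[:-1], token_ids[1:]

def cross_entropy(logits, targets):
    """Average negative log-likelihood of the target tokens.
    logits: (seq_len, vocab_size); targets: (seq_len,) integer token ids."""
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return -np.mean(np.log(probs[np.arange(len(targets)), targets]))

tokens = np.array([5, 17, 3, 42, 8])          # a toy tokenized text span
inputs, targets = next_token_targets(tokens)   # the model reads inputs, predicts targets
logits = np.random.randn(len(inputs), 100)     # pretend model output over a 100-token vocab
print(cross_entropy(logits, targets))
```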
GENESIS · Source paper
Language Models are Few-Shot Learners (GPT-3)
GPT-3 – the first widely recognized LLM era
Breakthrough: OpenAI publishes GPT-3 (175B), demonstrating few-shot learning and emergent language capabilities.
ChatGPT – RLHF and mass adoption
Breakthrough: OpenAI releases ChatGPT (based on InstructGPT/GPT-3.5), combining an LLM with RLHF. The conversational interface reaches mass adoption.
LLaMA – the open-weights LLM era
Breakthrough: Meta releases LLaMA, initiating the era of open-weights large language models.
LLM training and inference rely on Transformer matrix operations natively accelerated by Tensor Cores on NVIDIA GPUs (A100, H100, GB200) via CUDA.
Google uses TPUs to train Gemini and PaLM models.
BUILT ON
Transformer
The Transformer is a neural network architecture introduced by Vaswani et al. (Google Brain, 2017) for sequence transduction tasks. It replaces the recurrent layers of previous sequence models (RNN, LSTM, GRU) entirely with multi-head self-attention mechanisms and position-wise feed-forward networks. The core insight is that attention allows direct computation of relationships between any two positions in a sequence in a single step, regardless of their distance, while enabling full parallelization across sequence positions during training.

The original architecture consists of an encoder-decoder stack. The encoder maps an input sequence of token embeddings (augmented with positional encodings) through N identical layers, each containing a multi-head self-attention sublayer and a position-wise feed-forward sublayer with residual connections and layer normalization. The decoder adds a third sublayer of cross-attention over the encoder output. For language modeling tasks, decoder-only variants without cross-attention are used. Key design parameters from the original paper (base model): d_model=512, d_ff=2048, h=8 attention heads, N=6 encoder/decoder layers, d_k=d_v=64 per head.

The self-attention computation is O(n²·d) in time and O(n²) in memory relative to sequence length n, making long contexts computationally expensive. The architecture has since become the dominant building block for language models (GPT series, BERT, T5, LLaMA), vision models (ViT), and multimodal models.
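To make the attention computation concrete, here is a minimal single-head scaled dot-product self-attention sketch in plain NumPy, using the base-model head size quoted above (d_k = 64). Masking, multi-head projection, and the feed-forward sublayer are omitted, and all names are illustrative assumptions.

```python
# Single-head scaled dot-product self-attention, as a minimal sketch (no masking,
# no multi-head split, no output projection). Dimensions follow the base model above.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for one head."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n, n): every position attends to every other
    return softmax(scores) @ V        # (n, d_v)

n, d_model, d_k = 10, 512, 64
X = np.random.randn(n, d_model)       # token embeddings + positional encodings
W_q, W_k, W_v = (np.random.randn(d_model, d_k) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (10, 64); the O(n^2) cost comes from the (n, n) score matrix
```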
EXTENDS
Instruction Tuning
Instruction Tuning (also called instruction fine-tuning or supervised fine-tuning, SFT) is a post-pretraining technique for language models. A pretrained model is fine-tuned on a curated dataset of examples, where each example consists of a natural language instruction describing a task, an optional input context, and the expected output. The training objective is standard supervised learning: cross-entropy loss over the target output tokens, with loss masked on the instruction/input portions.

The key finding, established by Wei et al. (2021) in the FLAN paper, is that training on a sufficiently large and diverse set of instruction-formatted tasks improves zero-shot generalization to unseen task types. This generalization scales with the number of task clusters and the model size.

Instruction Tuning is distinct from RLHF (Reinforcement Learning from Human Feedback): it uses only supervised learning on demonstration data, without a reward model or RL optimization. In practice, instruction tuning is often the first stage in a post-training pipeline, followed optionally by RLHF or direct preference optimization (DPO). Common dataset formats include the Alpaca three-field format (instruction, input, output) and the multi-turn conversation format used in chat models.
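The sketch below illustrates how a single Alpaca-style example might be assembled together with a loss mask that restricts the cross-entropy to the response tokens. The prompt template and whitespace "tokenization" are illustrative simplifications, not the format of any particular training library.

```python
# Sketch of assembling one Alpaca-style instruction-tuning example with a loss
# mask over the prompt, so cross-entropy is computed only on the response tokens.
# The template and whitespace "tokenization" are illustrative simplifications.

def build_example(instruction, context, output):
    prompt = f"Instruction: {instruction}\n"
    if context:
        prompt += f"Input: {context}\n"
    prompt += "Response:"
    prompt_tokens = prompt.split()   # stand-in for a real tokenizer
    output_tokens = output.split()
    tokens = prompt_tokens + output_tokens
    # 0 = masked (instruction/input), 1 = contributes to the training loss (output)
    loss_mask = [0] * len(prompt_tokens) + [1] * len(output_tokens)
    return tokens, loss_mask

tokens, mask = build_example(
    instruction="Translate to French.",
    context="The cat sleeps.",
    output="Le chat dort.",
)
for tok, m in zip(tokens, mask):
    print(f"{m} {tok}")
```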
Commonly used with
RLHF
Reinforcement Learning from Human Feedback (RLHF) is a multi-stage training pipeline used to align language models and other AI systems with human preferences and intent. The approach was formally introduced for deep RL in Christiano et al. (2017) and scaled to large language models in Ouyang et al. (2022) (InstructGPT), where it became the primary alignment technique for systems such as ChatGPT, Claude, and Gemini.

The standard RLHF pipeline for LLMs consists of three sequential stages:

1. Supervised Fine-Tuning (SFT): A pretrained language model is fine-tuned on a curated dataset of high-quality (prompt, response) pairs produced by human annotators, yielding a base aligned policy π_SFT.
2. Reward Model Training: Human annotators compare pairs of model responses to the same prompt and express preferences (which response is better). These pairwise comparisons are used to train a scalar reward model r_φ(x, y), typically using a Bradley-Terry model as the preference objective: loss = −E[log σ(r(x, y_w) − r(x, y_l))], where y_w is the preferred and y_l the rejected response.
3. RL Fine-Tuning via PPO: The SFT-initialized policy π_θ is optimized with Proximal Policy Optimization (PPO) to maximize the reward from r_φ, subject to a KL divergence penalty that prevents the policy from drifting too far from π_SFT: Objective(x, y) = r_φ(x, y) − β · KL(π_θ(y|x) || π_SFT(y|x)). The KL penalty with coefficient β is critical to prevent reward hacking.

During PPO training, four models are needed simultaneously: the active policy, a frozen reference policy (π_SFT), the reward model, and a value/critic network. This makes RLHF computationally expensive, requiring substantial GPU memory.

A key limitation is reward hacking: since the reward model is a proxy for human preferences trained on finite data, the policy can find ways to exploit its imperfections, generating outputs that score highly on the reward model but are degenerate or low-quality. The KL penalty is the primary mitigation mechanism. Direct Preference Optimization (DPO, Rafailov et al., 2023) was proposed as a mathematically equivalent simplification of RLHF that eliminates the explicit reward model and RL training loop, replacing them with a single supervised loss directly on preference pairs.
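A minimal numeric sketch of the two objectives quoted above, the Bradley-Terry reward-model loss and the KL-penalized reward maximized during PPO. Scalar toy values stand in for real model outputs, and β = 0.1 is an arbitrary illustrative coefficient, not a recommended setting.

```python
# Toy numeric sketch of the two objectives above: the Bradley-Terry reward-model
# loss and the KL-penalized reward maximized during PPO. Scalar values stand in
# for real model outputs; beta = 0.1 is an arbitrary illustrative coefficient.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry objective: -log sigma(r(x, y_w) - r(x, y_l))."""
    return -np.log(sigmoid(r_chosen - r_rejected))

def penalized_reward(r, logprob_policy, logprob_ref, beta=0.1):
    """Single-sample estimate of r_phi(x, y) - beta * KL(pi_theta || pi_SFT),
    using log pi_theta(y|x) - log pi_SFT(y|x) as the per-sample KL term."""
    return r - beta * (logprob_policy - logprob_ref)

print(reward_model_loss(r_chosen=1.3, r_rejected=-0.2))  # small when y_w outscores y_l
print(penalized_reward(r=1.3, logprob_policy=-12.0, logprob_ref=-14.5))
```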
RAG
Retrieval-Augmented Generation (RAG) was introduced by Lewis et al. (2020) as a general-purpose fine-tuning recipe combining pre-trained parametric memory (a seq2seq language model, specifically BART in the original paper) with non-parametric memory (a dense vector index of Wikipedia, accessed via Dense Passage Retrieval, DPR). In the original formulation, both the retriever and the generator are fine-tuned end-to-end: given an input query x, the retriever retrieves top-k documents z from the corpus, and the generator produces an output y conditioned on x and z. Two formulations were proposed: RAG-Sequence (the same retrieved documents condition the full output sequence) and RAG-Token (different documents may be used per generated token, marginalized during generation).

In widespread contemporary usage (post-2022, with the growth of LLM applications), 'RAG' has expanded to describe a broader class of retrieve-then-generate pipelines, typically with a frozen LLM, a vector store containing pre-computed dense embeddings of document chunks, and a retrieval step that fetches the top-k relevant chunks based on embedding similarity to the query. The retrieved chunks are appended to the prompt as context before the LLM generates a response. This non-trainable pipeline variant is technically distinct from the original Lewis et al. formulation but is the dominant practical interpretation of RAG as of 2023–2025.

The canonical modern RAG pipeline consists of an offline indexing phase (document chunking, embedding computation, storage in a vector database) and an online query phase (query embedding, approximate nearest neighbor search, context-augmented generation). Key design decisions include: chunk size and overlap, embedding model choice, retrieval strategy (dense, sparse/BM25, or hybrid), number of retrieved documents k, and context integration method (prepend to prompt, cross-attention injection, or fusion-in-decoder).

RAG addresses two fundamental limitations of parametric-only LLMs: the knowledge cutoff problem (inability to access post-training information) and hallucination (generation of factually incorrect content). However, RAG introduces its own failure modes, including retrieval of irrelevant or misleading context and the LLM's susceptibility to being distracted by retrieved content that contradicts its parametric knowledge.
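The following sketch mirrors the offline-indexing / online-query split described above, using a toy embedding and exact cosine-similarity search in NumPy. The embed() and generate() functions are placeholders for a real embedding model and a frozen LLM; every name here is an illustrative assumption, not an actual API.

```python
# Sketch of the frozen-LLM RAG pipeline described above: index chunks offline,
# embed the query, take the top-k chunks by cosine similarity, and prepend them
# to the prompt. embed() and generate() are placeholders, not a real API.
import numpy as np

def embed(text, dim=64):
    """Toy stand-in embedding: a hash-seeded random unit vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def retrieve(query, chunks, chunk_vecs, k=2):
    sims = chunk_vecs @ embed(query)  # cosine similarity (all vectors are unit-norm)
    top = np.argsort(-sims)[:k]       # exact top-k; production systems use ANN indexes
    return [chunks[i] for i in top]

def generate(prompt):
    return f"<frozen-LLM response to a {len(prompt)}-character prompt>"

# Offline indexing phase: chunk, embed, store.
chunks = ["Chunk about LLaMA.", "Chunk about TPUs.", "Chunk about RLHF and ChatGPT."]
chunk_vecs = np.stack([embed(c) for c in chunks])

# Online query phase: retrieve, then generate with the retrieved context prepended.
query = "How was ChatGPT aligned?"
context = "\n".join(retrieve(query, chunks, chunk_vecs))
print(generate(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"))
```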
Related AI models
Grok
| Title | Publisher | Type |
|---|---|---|
| Language Models are Few-Shot Learners (GPT-3) | OpenAI / arXiv | scientific article |