Large Language Model
Scaling autoregressive language modeling to billions of parameters enabled emergent reasoning, instruction following, and general-purpose text generation capabilities.
The Transformer model is trained on tokens from a text corpus, learning to predict the next token (autoregression). At sufficient scale (parameters, data, compute), emergent capabilities arise: reasoning, in-context learning, and instruction following.
Previous NLP models were narrowly specialized (separate models for translation, classification, QA). LLMs unify multiple language tasks within a single generic model.
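As a concrete illustration of the next-token (autoregressive) objective described above, the following minimal sketch shows how a token sequence is split into shifted input/target pairs and scored with cross-entropy. The random logits stand in for a real Transformer's output, and all names here are illustrative rather than any specific library's API.

```python
# Minimal sketch of the next-token (autoregressive) training objective.
# Random logits stand in for a real Transformer's output; names are illustrative.
import numpy as np

def next_token_targets(token_ids):
    """Inputs are tokens 0..n-2; targets are tokens 1..n-1 (shifted by one)."""
    return token_ids[:-1], token_ids[1:]

def cross_entropy(logits, targets):
    """Average negative log-likelihood of the target tokens.
    logits: (seq_len, vocab_size); targets: (seq_len,) integer token ids."""
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return -np.mean(np.log(probs[np.arange(len(targets)), targets]))

tokens = np.array([5, 17, 3, 42, 8])          # a toy tokenized text span
inputs, targets = next_token_targets(tokens)   # the model reads inputs, predicts targets
logits = np.random.randn(len(inputs), 100)     # pretend model output over a 100-token vocab
print(cross_entropy(logits, targets))
```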
GENESIS · Source paper
Language Models are Few-Shot Learners (GPT-3)
GPT-3 – the first widely recognized LLM era
Breakthrough: OpenAI publishes GPT-3 (175B), demonstrating few-shot learning and emergent language capabilities.
ChatGPT – RLHF and mass adoption
Breakthrough: OpenAI releases ChatGPT (based on InstructGPT/GPT-3.5), combining an LLM with RLHF. The conversational interface reaches mass adoption.
LLaMA – the open-weights LLM era
Breakthrough: Meta releases LLaMA, initiating the era of open-weights large language models.
LLM training and inference rely on Transformer matrix operations natively accelerated by Tensor Cores on NVIDIA GPUs (A100, H100, GB200) via CUDA.
Google uses TPUs to train Gemini and PaLM models.
BUILT ON
Transformer
The Transformer is a neural network architecture introduced by Vaswani et al. (Google Brain, 2017) for sequence transduction tasks. It replaces the recurrent layers of previous sequence models (RNN, LSTM, GRU) entirely with multi-head self-attention mechanisms and position-wise feed-forward networks. The core insight is that attention allows direct computation of relationships between any two positions in a sequence in a single step, regardless of their distance, while enabling full parallelization across sequence positions during training.

The original architecture consists of an encoder-decoder stack. The encoder maps an input sequence of token embeddings (augmented with positional encodings) through N identical layers, each containing a multi-head self-attention sublayer and a position-wise feed-forward sublayer with residual connections and layer normalization. The decoder adds a third sublayer of cross-attention over the encoder output. For language modeling tasks, decoder-only variants without cross-attention are used. Key design parameters from the original paper (base model): d_model=512, d_ff=2048, h=8 attention heads, N=6 encoder/decoder layers, d_k=d_v=64 per head.

The self-attention computation is O(n²·d) in time and O(n²) in memory relative to sequence length n, making long contexts computationally expensive. The architecture has since become the dominant building block for language models (GPT series, BERT, T5, LLaMA), vision models (ViT), and multimodal models.
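To make the attention computation concrete, here is a minimal single-head scaled dot-product self-attention sketch in plain NumPy, using the base-model head size quoted above (d_k = 64). Masking, multi-head projection, and the feed-forward sublayer are omitted, and all names are illustrative assumptions.

```python
# Single-head scaled dot-product self-attention, as a minimal sketch (no masking,
# no multi-head split, no output projection). Dimensions follow the base model above.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for one head."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n, n): every position attends to every other
    return softmax(scores) @ V        # (n, d_v)

n, d_model, d_k = 10, 512, 64
X = np.random.randn(n, d_model)       # token embeddings + positional encodings
W_q, W_k, W_v = (np.random.randn(d_model, d_k) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (10, 64); the O(n^2) cost comes from the (n, n) score matrix
```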
EXTENDS
Instruction Tuning
Instruction Tuning (also called instruction fine-tuning or supervised fine-tuning, SFT) is a post-pretraining technique for language models. A pretrained model is fine-tuned on a curated dataset of examples, where each example consists of a natural language instruction describing a task, an optional input context, and the expected output. The training objective is standard supervised learning: cross-entropy loss over the target output tokens, with loss masked on the instruction/input portions.

The key finding, established by Wei et al. (2021) in the FLAN paper, is that training on a sufficiently large and diverse set of instruction-formatted tasks improves zero-shot generalization to unseen task types. This generalization scales with the number of task clusters and the model size.

Instruction Tuning is distinct from RLHF (Reinforcement Learning from Human Feedback): it uses only supervised learning on demonstration data, without a reward model or RL optimization. In practice, instruction tuning is often the first stage in a post-training pipeline, followed optionally by RLHF or direct preference optimization (DPO). Common dataset formats include the Alpaca three-field format (instruction, input, output) and the multi-turn conversation format used in chat models.
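The sketch below illustrates how a single Alpaca-style example might be assembled together with a loss mask that restricts the cross-entropy to the response tokens. The prompt template and whitespace "tokenization" are illustrative simplifications, not the format of any particular training library.

```python
# Sketch of assembling one Alpaca-style instruction-tuning example with a loss
# mask over the prompt, so cross-entropy is computed only on the response tokens.
# The template and whitespace "tokenization" are illustrative simplifications.

def build_example(instruction, context, output):
    prompt = f"Instruction: {instruction}\n"
    if context:
        prompt += f"Input: {context}\n"
    prompt += "Response:"
    prompt_tokens = prompt.split()   # stand-in for a real tokenizer
    output_tokens = output.split()
    tokens = prompt_tokens + output_tokens
    # 0 = masked (instruction/input), 1 = contributes to the training loss (output)
    loss_mask = [0] * len(prompt_tokens) + [1] * len(output_tokens)
    return tokens, loss_mask

tokens, mask = build_example(
    instruction="Translate to French.",
    context="The cat sleeps.",
    output="Le chat dort.",
)
for tok, m in zip(tokens, mask):
    print(f"{m} {tok}")
```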
Commonly used with
RLHF
Reinforcement Learning from Human Feedback (RLHF) is a multi-stage training pipeline used to align language models and other AI systems with human preferences and intent. The approach was formally introduced for deep RL in Christiano et al. (2017) and scaled to large language models in Ouyang et al. (2022) (InstructGPT), where it became the primary alignment technique for systems such as ChatGPT, Claude, and Gemini.

The standard RLHF pipeline for LLMs consists of three sequential stages:

1. Supervised Fine-Tuning (SFT): A pretrained language model is fine-tuned on a curated dataset of high-quality (prompt, response) pairs produced by human annotators, yielding a base aligned policy π_SFT.
2. Reward Model Training: Human annotators compare pairs of model responses to the same prompt and express preferences (which response is better). These pairwise comparisons are used to train a scalar reward model r_φ(x, y), typically using a Bradley-Terry model as the preference objective: loss = −E[log σ(r(x, y_w) − r(x, y_l))], where y_w is the preferred and y_l the rejected response.
3. RL Fine-Tuning via PPO: The SFT-initialized policy π_θ is optimized with Proximal Policy Optimization (PPO) to maximize the reward from r_φ, subject to a KL divergence penalty that prevents the policy from drifting too far from π_SFT: Objective(x, y) = r_φ(x, y) − β · KL(π_θ(y|x) || π_SFT(y|x)). The KL penalty with coefficient β is critical to prevent reward hacking.

During PPO training, four models are needed simultaneously: the active policy, a frozen reference policy (π_SFT), the reward model, and a value/critic network. This makes RLHF computationally expensive, requiring substantial GPU memory.

A key limitation is reward hacking: since the reward model is a proxy for human preferences trained on finite data, the policy can find ways to exploit its imperfections, generating outputs that score highly on the reward model but are degenerate or low-quality. The KL penalty is the primary mitigation mechanism. Direct Preference Optimization (DPO, Rafailov et al., 2023) was proposed as a mathematically equivalent simplification of RLHF that eliminates the explicit reward model and RL training loop, replacing them with a single supervised loss directly on preference pairs.
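A minimal numeric sketch of the two objectives quoted above, the Bradley-Terry reward-model loss and the KL-penalized reward maximized during PPO. Scalar toy values stand in for real model outputs, and β = 0.1 is an arbitrary illustrative coefficient, not a recommended setting.

```python
# Toy numeric sketch of the two objectives above: the Bradley-Terry reward-model
# loss and the KL-penalized reward maximized during PPO. Scalar values stand in
# for real model outputs; beta = 0.1 is an arbitrary illustrative coefficient.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry objective: -log sigma(r(x, y_w) - r(x, y_l))."""
    return -np.log(sigmoid(r_chosen - r_rejected))

def penalized_reward(r, logprob_policy, logprob_ref, beta=0.1):
    """Single-sample estimate of r_phi(x, y) - beta * KL(pi_theta || pi_SFT),
    using log pi_theta(y|x) - log pi_SFT(y|x) as the per-sample KL term."""
    return r - beta * (logprob_policy - logprob_ref)

print(reward_model_loss(r_chosen=1.3, r_rejected=-0.2))  # small when y_w outscores y_l
print(penalized_reward(r=1.3, logprob_policy=-12.0, logprob_ref=-14.5))
```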
RAG
Retrieval-Augmented Generation (RAG) was introduced by Lewis et al. (2020) as a general-purpose fine-tuning recipe combining pre-trained parametric memory (a seq2seq language model, specifically BART in the original paper) with non-parametric memory (a dense vector index of Wikipedia, accessed via Dense Passage Retrieval, DPR). In the original formulation, both the retriever and the generator are fine-tuned end-to-end: given an input query x, the retriever retrieves top-k documents z from the corpus, and the generator produces an output y conditioned on x and z. Two formulations were proposed: RAG-Sequence (the same retrieved documents condition the full output sequence) and RAG-Token (different documents may be used per generated token, marginalized during generation).

In widespread contemporary usage (post-2022, with the growth of LLM applications), 'RAG' has expanded to describe a broader class of retrieve-then-generate pipelines, typically with a frozen LLM, a vector store containing pre-computed dense embeddings of document chunks, and a retrieval step that fetches the top-k relevant chunks based on embedding similarity to the query. The retrieved chunks are appended to the prompt as context before the LLM generates a response. This non-trainable pipeline variant is technically distinct from the original Lewis et al. formulation but is the dominant practical interpretation of RAG as of 2023–2025.

The canonical modern RAG pipeline consists of an offline indexing phase (document chunking, embedding computation, storage in a vector database) and an online query phase (query embedding, approximate nearest neighbor search, context-augmented generation). Key design decisions include: chunk size and overlap, embedding model choice, retrieval strategy (dense, sparse/BM25, or hybrid), number of retrieved documents k, and context integration method (prepend to prompt, cross-attention injection, or fusion-in-decoder).

RAG addresses two fundamental limitations of parametric-only LLMs: the knowledge cutoff problem (inability to access post-training information) and hallucination (generation of factually incorrect content). However, RAG introduces its own failure modes, including retrieval of irrelevant or misleading context and the LLM's susceptibility to being distracted by retrieved content that contradicts its parametric knowledge.
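The following sketch mirrors the offline-indexing / online-query split described above, using a toy embedding and exact cosine-similarity search in NumPy. The embed() and generate() functions are placeholders for a real embedding model and a frozen LLM; every name here is an illustrative assumption, not an actual API.

```python
# Sketch of the frozen-LLM RAG pipeline described above: index chunks offline,
# embed the query, take the top-k chunks by cosine similarity, and prepend them
# to the prompt. embed() and generate() are placeholders, not a real API.
import numpy as np

def embed(text, dim=64):
    """Toy stand-in embedding: a hash-seeded random unit vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def retrieve(query, chunks, chunk_vecs, k=2):
    sims = chunk_vecs @ embed(query)  # cosine similarity (all vectors are unit-norm)
    top = np.argsort(-sims)[:k]       # exact top-k; production systems use ANN indexes
    return [chunks[i] for i in top]

def generate(prompt):
    return f"<frozen-LLM response to a {len(prompt)}-character prompt>"

# Offline indexing phase: chunk, embed, store.
chunks = ["Chunk about LLaMA.", "Chunk about TPUs.", "Chunk about RLHF and ChatGPT."]
chunk_vecs = np.stack([embed(c) for c in chunks])

# Online query phase: retrieve, then generate with the retrieved context prepended.
query = "How was ChatGPT aligned?"
context = "\n".join(retrieve(query, chunks, chunk_vecs))
print(generate(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"))
```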
Related AI models
Grok
| Title | Publisher | Type |
|---|---|---|
| Language Models are Few-Shot Learners (GPT-3) | OpenAI / arXiv | scientific article |