Architecture

N-gram

1948 · Historical · Published: 19 May 2026 · Updated: 19 May 2026 · Published

Key innovation

Approximating sequence probability via an (n−1)-th order Markov assumption: the next token depends only on the previous n−1 tokens, making large-corpus statistical language modeling tractable.

How it works

1. Tokenize the corpus into units (words, characters, subwords).
2. Add sentence boundary markers (<s>, </s>).
3. Count all n-grams and (n−1)-grams in the corpus.
4. Estimate conditional probabilities by maximum likelihood: P(w_i | w_{i−n+1}...w_{i−1}) = count(w_{i−n+1}...w_i) / count(w_{i−n+1}...w_{i−1}).
5. Apply smoothing (Laplace, Good-Turing, Katz back-off, Kneser-Ney) to assign non-zero mass to unseen n-grams.
6. At inference, the sentence probability is the product of the conditional probabilities of successive n-grams, usually computed in log-space to avoid underflow.

Evaluation metric: perplexity (lower is better).
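The steps above can be sketched end to end in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes whitespace tokenization, uses add-one (Laplace) smoothing as the simplest option from step 5, and does not handle out-of-vocabulary words properly. The function names (pad, train, prob, log_prob, perplexity) are made up for this example.

```python
import math
from collections import Counter

def pad(tokens, n):
    """Wrap a token list with n-1 start markers and one end marker."""
    return ["<s>"] * (n - 1) + tokens + ["</s>"]

def train(sentences, n):
    """Count n-grams and their (n-1)-token contexts (steps 1-3)."""
    ngrams, contexts, vocab = Counter(), Counter(), {"</s>"}
    for sentence in sentences:
        words = sentence.split()          # assumption: whitespace tokenizer
        vocab.update(words)               # <s> is never predicted, so it is left out
        tokens = pad(words, n)
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1:i])
            ngrams[context + (tokens[i],)] += 1
            contexts[context] += 1
    return ngrams, contexts, vocab

def prob(word, context, ngrams, contexts, vocab):
    """Add-one smoothed estimate of P(word | context) (steps 4-5)."""
    return (ngrams[context + (word,)] + 1) / (contexts[context] + len(vocab))

def log_prob(sentence, ngrams, contexts, vocab, n):
    """Sentence log-probability as a sum of conditional log-probabilities (step 6)."""
    tokens = pad(sentence.split(), n)
    total = 0.0
    for i in range(n - 1, len(tokens)):
        context = tuple(tokens[i - n + 1:i])
        total += math.log(prob(tokens[i], context, ngrams, contexts, vocab))
    return total

def perplexity(sentence, ngrams, contexts, vocab, n):
    """exp of the negative average log-probability per predicted token."""
    predicted = len(sentence.split()) + 1  # each word plus </s>
    return math.exp(-log_prob(sentence, ngrams, contexts, vocab, n) / predicted)

ngrams, contexts, vocab = train(["the cat sat", "the dog sat"], n=2)
seen = perplexity("the cat sat", ngrams, contexts, vocab, n=2)
scrambled = perplexity("sat cat the", ngrams, contexts, vocab, n=2)
```

On this toy corpus the word order seen in training gets a lower perplexity than the scrambled one, which is the behavior the metric is meant to capture.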

Problem solved

Modeling the full joint probability distribution over language sequences is intractable — the number of possible sequences grows exponentially with length. N-grams solve this via a Markov assumption: only the last n−1 tokens matter, reducing the model to a finite, estimable parameter set.
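Written out, the approximation replaces each full-history factor of the chain rule with a factor conditioned on the last n−1 tokens only. The symbols m (sequence length) and V (vocabulary) are notation introduced here for illustration; they do not appear elsewhere on this page.

```latex
P(w_1, \dots, w_m)
  = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})
  \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
```

Under this assumption the model has at most |V|^n conditional probabilities to estimate, independent of the sequence length m.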

Components

N-gram count table · Statistical memory of the model

Data structure storing count(w_{i−n+1}...w_i) for every n-gram observed in the training corpus; typically implemented as a trie, hash table, or key-value store.
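As an illustration of the trie option, the sketch below stores one counter per trie node, so the count of every prefix is available alongside the full n-gram count. The class names TrieNode and NgramTrie are hypothetical, and production toolkits use far more compact layouts than nested Python dictionaries.

```python
class TrieNode:
    __slots__ = ("count", "children")

    def __init__(self):
        self.count = 0
        self.children = {}   # token -> TrieNode

class NgramTrie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, ngram):
        """Increment the count of the n-gram and of each of its prefixes."""
        node = self.root
        for token in ngram:
            node = node.children.setdefault(token, TrieNode())
            node.count += 1

    def count(self, ngram):
        """Return how often the token sequence was stored as an n-gram or prefix."""
        node = self.root
        for token in ngram:
            node = node.children.get(token)
            if node is None:
                return 0
        return node.count

trie = NgramTrie()
for trigram in [("the", "cat", "sat"), ("the", "cat", "ran"), ("the", "dog", "sat")]:
    trie.add(trigram)

# MLE estimate of P(sat | the cat) read directly from the trie
mle = trie.count(("the", "cat", "sat")) / trie.count(("the", "cat"))
```

Because the context of an n-gram is one of its prefixes, the numerator and denominator of the MLE ratio come from the same lookup path.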

Probability estimator · Conditional probability inference

Component computing P(w_i | w_{i−n+1}...w_{i−1}) from raw counts, typically MLE: count(n-gram) / count(prefix).

Smoothing · Generalization to unseen n-grams

Algorithm that redistributes probability mass to unseen n-grams. Standard methods: Laplace (add-one), Good-Turing, Katz back-off, interpolated Kneser-Ney, modified Kneser-Ney.
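To make the idea concrete, here is a minimal sketch of interpolated Kneser-Ney for bigrams only. It assumes whitespace tokenization and a single fixed discount of 0.75 (a common textbook default, not a tuned value); modified Kneser-Ney uses several count-dependent discounts and is not shown. The function name train_kn_bigram is made up for this example.

```python
from collections import Counter

def train_kn_bigram(sentences, discount=0.75):
    """Return a function prob(w, v) estimating P(w | v) with interpolated Kneser-Ney."""
    bigrams = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        bigrams.update(zip(tokens, tokens[1:]))

    context_count = Counter()  # c(v): how often v is followed by anything
    followers = Counter()      # number of distinct words seen after v
    preceders = Counter()      # number of distinct words seen before w
    for (v, w), c in bigrams.items():
        context_count[v] += c
        followers[v] += 1
        preceders[w] += 1
    bigram_types = len(bigrams)

    def prob(w, v):
        # Continuation probability: in how many distinct contexts does w appear?
        p_cont = preceders[w] / bigram_types
        if context_count[v] == 0:
            return p_cont                      # unseen context: lower-order estimate only
        discounted = max(bigrams[(v, w)] - discount, 0.0) / context_count[v]
        # Mass removed by discounting, redistributed via the continuation distribution
        backoff_weight = discount * followers[v] / context_count[v]
        return discounted + backoff_weight * p_cont

    return prob

prob = train_kn_bigram(["the cat sat", "the dog sat", "a cat ran"])
p_seen = prob("cat", "the")    # bigram observed in training
p_unseen = prob("ran", "the")  # bigram never observed, still gets non-zero mass
```

The unseen bigram receives probability through the continuation term, and for any seen context the estimates over all predictable words sum to 1.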

Back-off / interpolation · Combining estimators of different orders

Mechanism that falls back to lower-order n-grams (e.g. trigram → bigram → unigram) when the higher-order n-gram is unseen or has low count.
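One of the simplest instances of this mechanism is the Stupid Backoff scheme of Brants et al., sketched below. It returns relative-frequency scores rather than normalized probabilities; the back-off factor 0.4 is the value reported in that paper. The helper names and the choice to include boundary markers in the unigram total are assumptions of this sketch.

```python
from collections import Counter

def count_all_orders(sentences, max_n):
    """Count every n-gram of order 1..max_n, plus the total number of tokens."""
    counts = Counter()
    total = 0
    for sentence in sentences:
        tokens = ["<s>"] * (max_n - 1) + sentence.split() + ["</s>"]
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
        total += len(tokens)   # assumption: boundary markers count as tokens
    return counts, total

def stupid_backoff(word, context, counts, total, alpha=0.4):
    """Score of word given context; shortens the context until the n-gram is seen."""
    context = tuple(context)
    ngram = context + (word,)
    if not context:
        return counts[ngram] / total                 # unigram relative frequency
    if counts[ngram] > 0:
        return counts[ngram] / counts[context]       # relative frequency at this order
    # Unseen at this order: drop the oldest context token and penalize by alpha
    return alpha * stupid_backoff(word, context[1:], counts, total, alpha)

counts, total = count_all_orders(["the cat sat", "the dog sat"], max_n=3)
s_trigram = stupid_backoff("sat", ("the", "cat"), counts, total)  # trigram seen
s_bigram = stupid_backoff("sat", ("a", "dog"), counts, total)     # falls back to bigram
s_unigram = stupid_backoff("cat", ("sat", "dog"), counts, total)  # falls back to unigram
```

Katz back-off and interpolated methods follow the same fall-back pattern but compute context-dependent weights so that the result remains a proper probability distribution.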

Implementation

Reference implementations

KenLM

C++ / Python · Kenneth Heafield

Official

SRILM

C++ · SRI International

Official

NLTK n-gram module

Python · NLTK Project

scikit-learn CountVectorizer (n-gram features)

Python · scikit-learn

Implementation pitfalls

No smoothing → zero probabilities · Critical

Any unseen n-gram gets P=0, making the whole sentence have P=0 and log P=−∞. Smoothing is mandatory.

Fix: Use add-one (as a baseline), Katz back-off, or modified Kneser-Ney (generally the strongest choice).

Numerical underflow · High

Multiplying many small probabilities quickly underflows; results lose precision or become 0.

Fix: Always work in log-space: log P(sentence) = Σ log P(w_i | context).
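A small demonstration of the problem and the fix, using made-up numbers: 100 conditional probabilities of 1e-5 each, whose true product (1e-500) lies below the smallest value a 64-bit float can represent.

```python
import math

probs = [1e-5] * 100   # hypothetical per-token conditional probabilities

# Naive product: underflows to exactly 0.0 in double precision
product = 1.0
for p in probs:
    product *= p

# Log-space: a sum of ordinary-sized negative numbers, no underflow
log_sum = sum(math.log(p) for p in probs)
```

The product collapses to 0.0, while the log-space sum stays a finite, comparable score (roughly −1151.3 in natural-log units).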

Missing <s> and </s> markers · Medium

Without sentence boundary markers, P(first word) and P(end of sentence) cannot be computed.

Fix: Add n−1 <s> markers at the start and one </s> at the end of each sentence.

Model size explosion for large n · High

A full 5-gram table over a web-scale corpus can be hundreds of GB. Without pruning and compression the model is unusable.

Fix: Use pruning (Stolcke), trie compression (KenLM), or Bloom filter approximations (Talbot & Osborne).

No long-range dependencies · High

By design, an n-gram model ignores anything more than n−1 tokens back. Syntax, coreference, and discourse context that span longer distances are out of reach.

Fix: Where long context matters, use neural models (RNN, LSTM, Transformer).

Evolution

Original paper · 1948 · Bell System Technical Journal · Claude E. Shannon

A Mathematical Theory of Communication

Claude E. Shannon

1948

Shannon introduces n-gram models in information theory

Inflection point

In "A Mathematical Theory of Communication" Shannon analyzes the statistics of English using character and word n-grams, treating language as an (n−1)-th order stochastic (Markov) process.

1951

Shannon "Prediction and Entropy of Printed English"

Classical experiment estimating the entropy of English using n-grams; shows that humans predict letters better than low-order n-gram models.

1980

Jelinek and IBM apply trigrams to speech recognition

Inflection point

Frederick Jelinek's group at IBM Research introduces trigram language models into large-vocabulary speech recognition, establishing the noisy channel paradigm.

1987

Katz back-off

Slava Katz publishes the back-off scheme for estimating probabilities of rare n-grams — industry standard for two decades.

Estimation of probabilities from sparse data for the language model component of a speech recognizer (paper)

1995

Kneser-Ney smoothing

Inflection point

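Reinhard Kneser and Hermann Ney introduce smoothing in which the lower-order distribution is based on the number of distinct contexts a word appears in, rather than on its raw frequency. Modified Kneser-Ney (Chen & Goodman 1998) is still widely regarded as the strongest smoothing method for word n-grams.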

2007

Google releases "Web 1T 5-gram"

Google releases a corpus of n-gram counts (up to 5-grams) computed over about 1 trillion words of web text. In "Large Language Models in Machine Translation", Brants et al. show that the very simple Stupid Backoff scheme, trained on massive data, approaches the quality of more sophisticated smoothing such as Kneser-Ney.

Large Language Models in Machine Translation (paper)

2010

KenLM — fast n-gram implementation

Kenneth Heafield releases KenLM, an open-source n-gram library with modified Kneser-Ney, the de facto standard for SMT (Moses) and baselines.

2013

Beginning of n-gram displacement by neural language models

Inflection point

Mikolov et al.'s RNN language models and word2vec show that dense vector representations mitigate the data sparsity problem of n-grams; count-based language models begin to be displaced in production systems.