1. Tokenize the corpus into units (words, characters, subwords). 2. Add sentence boundary markers (<s>, </s>). 3. Count all n-grams and (nโ1)-grams in the corpus. 4. Estimate conditional probabilities by maximum likelihood: P(w_i | w_{iโn+1}...w_{iโ1}) = count(w_{iโn+1}...w_i) / count(w_{iโn+1}...w_{iโ1}). 5. Apply smoothing (Laplace, Good-Turing, Katz back-off, Kneser-Ney) to assign non-zero mass to unseen n-grams. 6. At inference, the sentence probability is the product of conditional probabilities of successive n-grams, usually computed in log-space to avoid underflow. Evaluation metric: perplexity (lower is better).
Modeling the full joint probability distribution over language sequences is intractable โ the number of possible sequences grows exponentially with length. N-grams solve this via a Markov assumption: only the last nโ1 tokens matter, reducing the model to a finite, estimable parameter set.
Data structure storing count(w_{iโn+1}...w_i) for every n-gram observed in the training corpus; typically implemented as a trie, hash table, or key-value store.
Component computing P(w_i | w_{iโn+1}...w_{iโ1}) from raw counts, typically MLE: count(n-gram) / count(prefix).
Algorithm that redistributes probability mass to unseen n-grams. Standard methods: Laplace (add-one), Good-Turing, Katz back-off, interpolated Kneser-Ney, modified Kneser-Ney.
Mechanism that falls back to lower-order n-grams (e.g. trigram โ bigram โ unigram) when the higher-order n-gram is unseen or has low count.
Any unseen n-gram gets P=0, making the whole sentence have P=0 and log P=โโ. Smoothing is mandatory.
Multiplying many small probabilities quickly underflows; results lose precision or become 0.
Without sentence boundary markers, P(first word) and P(end of sentence) cannot be computed.
A full 5-gram table over a web-scale corpus can be hundreds of GB. Without pruning and compression the model is unusable.
By design, an n-gram ignores anything beyond nโ1 tokens back. Syntax, coreference, and discourse context are out of reach.
In "A Mathematical Theory of Communication" Shannon analyzes the statistics of English using character and word n-grams, treating language as an (nโ1)-th order stochastic (Markov) process.
Classical experiment estimating the entropy of English using n-grams; shows that humans predict letters better than low-order n-gram models.
Frederick Jelinek's group at IBM Research introduces trigram language models into large-vocabulary speech recognition, establishing the noisy channel paradigm.
Slava Katz publishes the back-off scheme for estimating probabilities of rare n-grams โ industry standard for two decades.
Reinhard Kneser and Hermann Ney introduce smoothing based on the number of unique contexts. Modified Kneser-Ney (Chen & Goodman 1998) remains the best smoothing method for word n-grams.
Google releases a corpus of 5-grams counted over 1 trillion words of web text. Brants et al. show in "Large Language Models in Machine Translation" that simple stupid back-off on massive data matches sophisticated methods.
Kenneth Heafield releases KenLM, an open-source n-gram library with modified Kneser-Ney, the de facto standard for SMT (Moses) and baselines.
Mikolov et al. (RNN LM) and word2vec show that dense vector representations solve the n-gram data sparsity problem; the statistical LM era starts ending in production.
Time complexity: O(N) trening; O(1) lookup (amortyzowany hash) na n-gram podczas inferencji. Space complexity: O(|V|^n) w pesymistycznym przypadku; O(min(N, |V|^n)) w praktyce.
The number of possible n-grams grows as |V|^n. For |V|=50,000 and n=5 this gives 3.1ร10^23 possible 5-grams; the vast majority never occur, but storing observed counts still requires multi-gigabyte structures (Google 5-gram: 1 TB uncompressed).
Only a single path in the count table is touched per query โ highly sparse parameter activation.
N-gram counting is embarrassingly parallel (MapReduce); each n-gram lookup at inference is independent.
Order of the model โ the length of the n-gram. Typical values: 1 (unigram), 2 (bigram), 3 (trigram), 4โ5 for large corpora.
Smoothing method for unseen n-grams. The choice often matters more than the choice of n.
N-gram unit: character, word, syllable, subword, phoneme, base pair.
Vocabulary size |V|. Determines the maximum number of distinct unigrams and drives parameter explosion.
N-gram models are primarily hash/trie lookups โ memory-bound operations well-suited to CPUs with large cache and fast memory.
No dependence on matrix multiplication โ runs anywhere, including embedded devices.
GPUs provide essentially no speedup for n-grams โ the workload is not matrix-based.