Architecture

Long Short-Term Memory

mature

How it works

1. Forget gate: f_t = σ(W_f·[h_{t-1}, x_t] + b_f). Decides what to erase from the previous cell state.

2. Input gate + candidate: i_t = σ(W_i·[h_{t-1}, x_t] + b_i), g_t = tanh(W_g·[h_{t-1}, x_t] + b_g). Decides what new information to add.

3. Cell-state update: c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t. Because the update is additive, gradients can flow along the cell state without being squashed at every step, which mitigates vanishing gradients (the Constant Error Carousel).

4. Output gate: o_t = σ(W_o·[h_{t-1}, x_t] + b_o), h_t = o_t ⊙ tanh(c_t). Filters what the cell exposes as the hidden state h_t, which is passed to the next time step and the next layer.

5. In BiLSTM: two LSTM networks process the sequence in opposite directions and their hidden states are concatenated at each time step.

6. In seq2seq: the LSTM encoder compresses the full sequence into a context vector; the LSTM decoder generates the output step by step.
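The gate equations in steps 1 to 4 can be traced in a few lines of plain Python. This is a minimal sketch, not a production implementation: the `params` layout (one `(W, b)` pair per gate name) and the helper names `sigmoid`, `affine`, and `lstm_step` are assumptions made for illustration, and lists are used instead of tensors to keep it dependency-free.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v, b):
    """Return W·v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias
            for row, bias in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step.

    params maps each gate name ("f", "i", "g", "o") to a (W, b) pair,
    where W has shape hidden x (hidden + input). This layout is an
    assumption of the sketch, not a library convention.
    """
    z = h_prev + x_t  # list concatenation stands in for [h_{t-1}, x_t]
    W_f, b_f = params["f"]
    W_i, b_i = params["i"]
    W_g, b_g = params["g"]
    W_o, b_o = params["o"]
    f = [sigmoid(a) for a in affine(W_f, z, b_f)]    # forget gate
    i = [sigmoid(a) for a in affine(W_i, z, b_i)]    # input gate
    g = [math.tanh(a) for a in affine(W_g, z, b_g)]  # candidate values
    o = [sigmoid(a) for a in affine(W_o, z, b_o)]    # output gate
    # Additive cell-state update, then the filtered hidden state.
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c

# With all-zero weights every sigmoid gate is 0.5 and the candidate is 0,
# so the new cell state is simply half of the previous one.
hidden, inputs = 2, 1
zeros = ([[0.0] * (hidden + inputs) for _ in range(hidden)], [0.0] * hidden)
params = {name: zeros for name in "figo"}
h, c = lstm_step([1.0], [0.0, 0.0], [1.0, -2.0], params)
print(c)  # [0.5, -1.0]
```

The all-zero parameters are chosen only so the result can be checked by hand; real weights would be learned.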

Problem solved

Vanilla RNNs suffer from the vanishing-gradient problem: over long sequences the error signal decays exponentially, preventing the network from learning dependencies hundreds of steps apart. LSTM solves this by introducing an additive cell state — a separate information channel running through the sequence with minimal transformations — and gating mechanisms that selectively retain and forget information.
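A small numeric comparison illustrates the difference. The sketch below assumes a scalar vanilla RNN h_t = tanh(w · h_{t-1}) with no input, and for the LSTM it follows only the direct cell-state path, where the gradient of c_T with respect to c_0 is the product of the forget gates. It ignores the indirect dependence of the gates on h_{t-1}, and the specific values (w = 0.9, f_t = 0.99) are illustrative assumptions.

```python
import math

def rnn_gradient(w, steps, h0=0.5):
    """d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        grad *= w * (1.0 - h * h)  # chain rule through tanh
    return grad

def cell_path_gradient(forget_gates):
    """d c_T / d c_0 along the direct cell path: the product of f_t."""
    grad = 1.0
    for f in forget_gates:
        grad *= f
    return grad

steps = 100
print(rnn_gradient(w=0.9, steps=steps))     # bounded above by 0.9**100
print(cell_path_gradient([0.99] * steps))   # 0.99**100, roughly 0.37
```

Each RNN factor is at most |w|, so the signal shrinks at least geometrically, while a forget gate that stays close to 1 keeps most of the signal after 100 steps.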

Key mechanisms

Cell state — information channel with minimal linear transformations

Forget gate: f_t = σ(W_f · [h_{t-1}, x_t] + b_f)

Input gate: i_t = σ(W_i · [h_{t-1}, x_t] + b_i) and candidate: g_t = tanh(W_g · [h_{t-1}, x_t] + b_g)

State update: c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t

Output gate: o_t = σ(W_o · [h_{t-1}, x_t] + b_o); output: h_t = o_t ⊙ tanh(c_t)

Constant Error Carousel (CEC) — additive state path that prevents gradient vanishing

Optional peephole connections — gates can observe c_{t-1}

Bidirectional variant (BiLSTM) — concatenation of states from two directions
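The bidirectional variant can be sketched by running the same recurrence twice, once over the reversed sequence, and pairing the states per time step. To stay short, the example assumes a hidden size of 1 and a hypothetical weight layout in which each gate name maps to scalars `(w_h, w_x, b)`; `scalar_lstm` and `bilstm` are illustrative names, not a library API.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def scalar_lstm(xs, w):
    """Run a hidden-size-1 LSTM over xs and return every h_t."""
    h, c, outputs = 0.0, 0.0, []
    for x in xs:
        pre = {k: wh * h + wx * x + b for k, (wh, wx, b) in w.items()}
        f, i, o = sigmoid(pre["f"]), sigmoid(pre["i"]), sigmoid(pre["o"])
        g = math.tanh(pre["g"])
        c = f * c + i * g          # state update
        h = o * math.tanh(c)       # output
        outputs.append(h)
    return outputs

def bilstm(xs, w_fwd, w_bwd):
    """Concatenate forward and backward hidden states per time step."""
    fwd = scalar_lstm(xs, w_fwd)
    bwd = scalar_lstm(xs[::-1], w_bwd)[::-1]  # re-align to original order
    return [[hf, hb] for hf, hb in zip(fwd, bwd)]

# Illustrative weights; a real BiLSTM learns separate weights per direction.
w = {name: (0.5, 1.0, 0.0) for name in "figo"}
states = bilstm([1.0, 0.0, -1.0], w, w)
print(len(states), len(states[0]))  # 3 2
```

The first element of each pair has seen only the past, the second only the future, which is why the concatenated state at a given step carries context from both sides.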

Strengths & limitations

Strengths

✓ Models long-term dependencies effectively (hundreds of steps)

✓ Mitigates the vanishing-gradient problem via an additive cell state

✓ Streaming inference: the state (h_t, c_t) has a fixed size, so memory stays O(1) in sequence length

✓ Linear time complexity in sequence length, O(T), versus O(T²) for self-attention in Transformers

✓ Strong results on small and medium-sized datasets

✓ Mature implementations on CPU, GPU and edge accelerators

Limitations

✗ Still sequential: poor training parallelism along sequence length

✗ Higher parameter count per cell than vanilla RNN (4× weight matrices)

✗ Harder to optimize than Transformers for large models

✗ Weaker than Transformers on very long text contexts

✗ Effective scale is limited (rarely above hundreds of millions of parameters in production NLP)

✗ Hyperparameter tuning (layers, dropout, gradient clipping) is unforgiving
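The 4× parameter overhead relative to a vanilla RNN follows directly from the gate equations: each of the four blocks (forget, input, candidate, output) has its own weight matrix over [h_{t-1}, x_t] and its own bias. A quick count, assuming one bias vector per block as in the formulas here (some libraries, PyTorch among them, keep two bias vectors per gate, so their reported totals are slightly higher):

```python
def rnn_params(input_size, hidden_size):
    """One weight matrix over [h_{t-1}, x_t] plus one bias vector."""
    return hidden_size * (hidden_size + input_size) + hidden_size

def lstm_params(input_size, hidden_size):
    """Four such blocks: forget, input, candidate, and output."""
    return 4 * rnn_params(input_size, hidden_size)

print(rnn_params(128, 256))   # 98560
print(lstm_params(128, 256))  # 394240
```

The sizes 128 and 256 are arbitrary example values.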

Implementation

Implementation pitfalls

Sequential processing blocks training parallelism (severity: Medium)

An LSTM processes time steps sequentially: h_t depends on h_{t-1}, so the computation cannot be parallelized across the sequence the way a Transformer's can. Training on long sequences therefore becomes a throughput bottleneck.

Gradient clipping mandatory: exploding gradients (severity: Medium)

LSTMs are prone to exploding gradients when trained without clipping. A common choice is to clip the gradient norm at 1.0 (clip_grad_norm=1.0) or to clip by value (clip_grad_value); with neither in place, losses can turn into NaN within a few iterations.
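Norm-based clipping rescales all gradients together so that their joint L2 norm does not exceed a threshold. The function below is a dependency-free sketch of that idea; `clip_by_global_norm` and its `eps` parameter are illustrative names. In PyTorch the corresponding utilities are `torch.nn.utils.clip_grad_norm_` and `torch.nn.utils.clip_grad_value_`, called between the backward pass and the optimizer step.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale gradient vectors so their joint L2 norm is at most max_norm.

    Returns the clipped gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Scale is 1.0 when the norm is already within bounds.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in vec] for vec in grads], total_norm

grads = [[3.0, 4.0], [0.0, 12.0]]  # joint norm is 13
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
print(norm)  # 13.0
```

Clipping by norm preserves the direction of the update, whereas clipping by value caps each component independently and can change it.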