1. Forget gate: f_t = σ(W_f·[h_{t-1}, x_t] + b_f), where σ is the logistic sigmoid and [h_{t-1}, x_t] is the concatenation of the previous hidden state and the current input. Decides what to erase from the previous cell state.
2. Input gate + candidate: i_t = σ(W_i·[h_{t-1}, x_t] + b_i), g_t = tanh(W_g·[h_{t-1}, x_t] + b_g). Decides what new information to add.
3. Cell-state update: c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t, where ⊙ is elementwise multiplication. Because the update is additive, the gradient along the cell state is scaled only by f_t, which mitigates vanishing gradients (Constant Error Carousel).
4. Output gate: o_t = σ(W_o·[h_{t-1}, x_t] + b_o), h_t = o_t ⊙ tanh(c_t). Filters what the cell exposes as the hidden state, which is passed to the next time step and to the next layer.
5. In BiLSTM: two LSTM networks process the sequence in opposite directions and their hidden states are concatenated at each time step.
6. In seq2seq: the LSTM encoder compresses the full sequence into a context vector; the LSTM decoder generates the output step by step.
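The gate equations above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the helper names (init_params, lstm_step, run_lstm, run_bilstm), the random initialization scale, and the toy sizes are all assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(input_size, hidden_size, rng):
    """One weight matrix and bias per gate, acting on [h_{t-1}, x_t] (hypothetical helper)."""
    concat = hidden_size + input_size
    params = {}
    for gate in ("f", "i", "g", "o"):
        params["W_" + gate] = rng.standard_normal((hidden_size, concat)) * 0.1
        params["b_" + gate] = np.zeros(hidden_size)
    return params

def lstm_step(x_t, h_prev, c_prev, p):
    """One time step, following items 1-4 of the list."""
    z = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    f_t = sigmoid(p["W_f"] @ z + p["b_f"])       # forget gate
    i_t = sigmoid(p["W_i"] @ z + p["b_i"])       # input gate
    g_t = np.tanh(p["W_g"] @ z + p["b_g"])       # candidate values
    c_t = f_t * c_prev + i_t * g_t               # additive cell-state update
    o_t = sigmoid(p["W_o"] @ z + p["b_o"])       # output gate
    h_t = o_t * np.tanh(c_t)                     # hidden state
    return h_t, c_t

def run_lstm(xs, p, hidden_size):
    """Run the cell over a sequence of shape (T, input_size); returns all h_t."""
    h = np.zeros(hidden_size)
    c = np.zeros(hidden_size)
    hs = []
    for x_t in xs:                               # each step needs the previous h and c
        h, c = lstm_step(x_t, h, c, p)
        hs.append(h)
    return np.stack(hs)

def run_bilstm(xs, p_fwd, p_bwd, hidden_size):
    """Item 5: a second LSTM reads the reversed sequence; states are concatenated per step."""
    fwd = run_lstm(xs, p_fwd, hidden_size)
    bwd = run_lstm(xs[::-1], p_bwd, hidden_size)[::-1]   # re-align to original time order
    return np.concatenate([fwd, bwd], axis=1)

rng = np.random.default_rng(0)
T, input_size, hidden_size = 5, 3, 4             # toy sizes, chosen arbitrarily
xs = rng.standard_normal((T, input_size))
p_fwd = init_params(input_size, hidden_size, rng)
p_bwd = init_params(input_size, hidden_size, rng)

hs = run_lstm(xs, p_fwd, hidden_size)
bi = run_bilstm(xs, p_fwd, p_bwd, hidden_size)
print(hs.shape, bi.shape)                        # (5, 4) (5, 8)
```

In a seq2seq setting (item 6), the last row of hs, together with the final cell state, would play the role of the context vector handed to the decoder.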
Vanilla RNNs suffer from the vanishing-gradient problem: over long sequences the error signal decays exponentially, preventing the network from learning dependencies hundreds of steps apart. LSTM solves this by introducing an additive cell state — a separate information channel running through the sequence with minimal transformations — and gating mechanisms that selectively retain and forget information.
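The contrast can be made concrete with a small numerical sketch. It tracks the norm of the Jacobian d h_t / d h_0 for a tanh RNN and compares it with the gradient along the direct cell-state path of an LSTM, which is just the elementwise product of the forget gates. The recurrent matrix with spectral norm 0.9 and the constant forget gate of 0.99 are illustrative assumptions, not values from any trained model, and the LSTM figure ignores the additional gradient paths through the gates.

```python
import numpy as np

def rnn_gradient_norms(W, xs, h0):
    """Spectral norm of d h_t / d h_0 for h_t = tanh(W h_{t-1} + x_t), at every step t."""
    h = h0
    J = np.eye(len(h0))
    norms = []
    for x in xs:
        h = np.tanh(W @ h + x)
        # Jacobian of one step is diag(1 - h_t^2) @ W; chain it onto the running product.
        J = (np.diag(1.0 - h**2) @ W) @ J
        norms.append(np.linalg.norm(J, 2))
    return norms

def cell_path_gradient(forget_gates):
    """d c_T / d c_0 along the direct additive path: product of forget gates over time."""
    return np.prod(forget_gates, axis=0)

rng = np.random.default_rng(0)
hidden, steps = 8, 100
W = rng.standard_normal((hidden, hidden))
W *= 0.9 / np.linalg.norm(W, 2)            # assumed: rescale to spectral norm 0.9
xs = rng.standard_normal((steps, hidden))
rnn_norms = rnn_gradient_norms(W, xs, np.zeros(hidden))

forget = np.full((steps, hidden), 0.99)    # assumed: forget gates held almost fully open
lstm_path = cell_path_gradient(forget)

print(f"RNN  ||d h_100 / d h_0|| = {rnn_norms[-1]:.3e}")
print(f"LSTM  d c_100 / d c_0    = {lstm_path[0]:.3f}")
```

Under these assumptions the RNN Jacobian norm is bounded by 0.9^100, while the cell-state path retains 0.99^100 of the signal, roughly a third.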
LSTM processes a sequence one time step at a time: h_t depends on h_{t-1}, so the computation cannot be fully parallelized across time as in a Transformer. Training on long sequences is therefore a throughput bottleneck.
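LSTM is prone to exploding gradients without clipping. The usual remedy is to clip gradients by global norm (for example max_norm=1.0 with PyTorch's torch.nn.utils.clip_grad_norm_) or by value (torch.nn.utils.clip_grad_value_); without clipping, training can diverge to NaN losses within a few iterations.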
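What the two clipping modes do can be sketched in plain NumPy. This is an illustration of the idea rather than PyTorch's actual code: the function names clip_by_global_norm and clip_by_value and the toy gradients are assumptions made for this example.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale all gradients by one shared factor so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g**2)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))   # never scale gradients up
    return [g * scale for g in grads], total_norm

def clip_by_value(grads, clip_value):
    """Clamp every gradient entry independently to [-clip_value, clip_value]."""
    return [np.clip(g, -clip_value, clip_value) for g in grads]

# Toy gradients for two parameter tensors; joint norm is sqrt(9 + 16 + 144) = 13.
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]

clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(float(np.sum(g**2)) for g in clipped))
print(norm_before, norm_after)

value_clipped = clip_by_value(grads, clip_value=5.0)
print(value_clipped[1])
```

Norm clipping keeps the direction of the overall gradient and only shortens it, whereas value clipping can change the direction because each entry is clamped separately.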
On Penn Treebank, LSTM language models reached ~78 perplexity (vs ~120 for a vanilla RNN). Google Neural Machine Translation (an 8-layer LSTM seq2seq model with attention) reduced translation errors by ~60% relative to phrase-based machine translation (PBMT) in 2016. On ASR (Switchboard), deep LSTMs achieved a word error rate (WER) of around 5.5%, state of the art for years. On classic time-series benchmarks such as M4, LSTM remains one of the strongest neural approaches.
cuDNN provides optimized fused kernels for LSTM, roughly 5–10× faster than a naive PyTorch implementation that loops over time steps in Python. The speedup is especially important with large batch sizes.