At every decoder step t the mechanism performs three operations: (1) for each encoder hidden state h_j and the previous decoder state s_{t-1} it computes a scalar alignment score e_{t,j} = v^T · tanh(W_a · s_{t-1} + U_a · h_j) — a small single-hidden-layer MLP; (2) scores are normalised via softmax into alignment weights α_{t,j}; (3) the context vector c_t = Σ_j α_{t,j} · h_j is fed to the decoder together with the previous token and state to produce the next token. All parameters (W_a, U_a, v) are learned end-to-end together with the encoder and decoder.
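The three operations above can be sketched for a single decoder step in plain Python. This is a minimal illustration, not the paper's implementation: the helper names (`matvec`, `softmax`, `additive_attention_step`) and the toy dimensions and values are assumptions made for the example, and the weights are hand-picked rather than learned.

```python
import math

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def softmax(scores):
    """Softmax over a list of floats, shifted by the max for stability."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention_step(s_prev, H, W_a, U_a, v):
    """One decoder step: return (alignment weights, context vector)."""
    Ws = matvec(W_a, s_prev)  # W_a · s_{t-1}, shared by all source positions
    scores = []
    for h_j in H:
        Uh = matvec(U_a, h_j)  # U_a · h_j
        hidden = [math.tanh(a + b) for a, b in zip(Ws, Uh)]
        scores.append(sum(vi * x for vi, x in zip(v, hidden)))  # e_{t,j}
    alpha = softmax(scores)  # α_{t,j}
    # c_t = Σ_j α_{t,j} · h_j, computed one coordinate at a time
    context = [sum(a * h_j[k] for a, h_j in zip(alpha, H))
               for k in range(len(H[0]))]
    return alpha, context

# Toy values (assumed): decoder state of size 2, three encoder states of size 3.
s_prev = [0.1, -0.2]
H = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5], [0.5, 0.5, 0.0]]
W_a = [[0.3, -0.1], [0.2, 0.4]]
U_a = [[0.5, -0.3, 0.1], [0.0, 0.2, -0.4]]
v = [1.0, -1.0]

alpha, c_t = additive_attention_step(s_prev, H, W_a, U_a, v)
```

Here `alpha` holds one weight per source position and sums to 1, and `c_t` has the same size as an encoder state. If `v` is all zeros, every score is 0, the weights become uniform, and the context vector is simply the mean of the encoder states.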
In the standard RNN-based encoder–decoder architecture the entire source sentence is compressed into a single fixed-length vector, creating an information bottleneck — especially for long sentences — and causing translation quality to degrade sharply as input length grows.
Small feed-forward network with a single tanh hidden layer that produces a scalar alignment score for every (decoder state, encoder state) pair.
Normalises the scores into a probability distribution over all source positions — the alignment weights α_{t,j}.
Weighted sum of encoder hidden states, fed to the decoder as an additional input when generating the next token.
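In the original paper the encoder is a bidirectional RNN with gated hidden units (a bidirectional GRU); each hidden state h_j concatenates the forward and backward states at position j and feeds into the attention mechanism.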
The tanh MLP for each (decoder, encoder) pair is more expensive than the plain dot product used in Luong/Transformer attention.
The mechanism is embedded in a recurrent decoder — steps cannot be parallelised in time, limiting GPU scaling.
First version of the paper introducing attention in NMT.
The paper is accepted as an oral presentation at ICLR 2015, and the community adopts the idea rapidly.
Luong, Pham and Manning propose global and local attention with three score functions (dot, general, concat), of which dot and general are multiplicative, as a simplification and extension of Bahdanau Attention.
Vaswani et al. drop RNNs entirely and build the architecture purely on scaled dot-product self-attention — a direct continuation of the line started by Bahdanau Attention.
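Time complexity: O(T_x · T_y · d), where T_x and T_y are the source and target lengths and d is the hidden size. Space complexity: O(T_x · T_y), one alignment weight per (target position, source position) pair.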
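A rough count makes these bounds concrete. The sketch below assumes all hidden sizes equal d and counts only the work that depends on a (decoder step, source position) pair; the projections W_a · s_{t-1} and U_a · h_j are assumed to be computed once per step and once per source position respectively, and are left out. The function name and the example sizes are made up for illustration.

```python
def attention_cost(T_x, T_y, d):
    """Rough pairwise cost of additive attention over a whole sentence pair."""
    pair_scores = T_x * T_y  # one score e_{t,j} per (t, j) pair
    return {
        "pair_scores": pair_scores,
        # v^T · tanh(...) touches d hidden units for every pair
        "score_multiply_adds": pair_scores * d,
        # one alignment weight α_{t,j} is kept per pair
        "alignment_weights_stored": pair_scores,
    }

# Assumed example sizes: 30 source tokens, 25 target tokens, hidden size 512.
cost = attention_cost(T_x=30, T_y=25, d=512)
```

For these sizes there are 750 scores, about 384,000 multiply-adds in the scoring step, and 750 stored alignment weights.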
Hidden size of the scoring MLP (typically equal to the encoder hidden size).
Hidden size of the encoder (bidirectional RNN/GRU) hidden states.
Each decoder step uses all source positions (soft attention).
Because Bahdanau Attention is embedded in a recurrent decoder (RNN/GRU), token generation is sequential; the attention operation at a given step t can be vectorised over source positions, but decoder steps must run one after another.
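The dependency structure can be sketched as a loop in which each iteration needs the state produced by the one before it. Everything here is a stand-in chosen to keep the example short: `toy_attend` uses a softmax over dot products rather than the additive scorer, and `toy_update` is a hypothetical replacement for the GRU update, not the decoder from the paper.

```python
import math

def decode(s0, H, attend, update, num_steps):
    """Run num_steps decoder steps; step t needs the state from step t-1."""
    s = s0
    contexts = []
    for _ in range(num_steps):  # sequential: cannot start step t before t-1 ends
        c = attend(s, H)        # one call covers every source position
        s = update(s, c)        # the new state depends on the old state and context
        contexts.append(c)
    return s, contexts

def toy_attend(s, H):
    """Stand-in attention: softmax over dot products, then a weighted sum."""
    scores = [sum(a * b for a, b in zip(s, h)) for h in H]
    peak = max(scores)
    exps = [math.exp(x - peak) for x in scores]
    total = sum(exps)
    return [sum(e / total * h[k] for e, h in zip(exps, H))
            for k in range(len(H[0]))]

def toy_update(s, c):
    """Hypothetical state update standing in for a GRU cell."""
    return [math.tanh(a + b) for a, b in zip(s, c)]

final_state, contexts = decode(
    s0=[0.0, 0.0],
    H=[[1.0, 0.0], [0.0, 1.0]],
    attend=toy_attend,
    update=toy_update,
    num_steps=3,
)
```

The work inside `attend` is independent across source positions and could be batched into matrix operations, whereas the outer loop cannot be unrolled in parallel because `s` is threaded through every iteration.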
Operations are matrix-based, but the sequential RNN decoder limits tensor-core utilisation compared to a pure Transformer.
The mechanism itself is a small MLP plus softmax — runs on essentially any accelerator that supports standard neural network ops.