Multi-Head Self-Attention (MHSA)
Modeling contextual dependencies between sequence positions
Mechanism allowing every position in the sequence to attend to all other positions. The input is projected into three matrices: Query (Q), Key (K), Value (V). Attention weights are computed as softmax(QK^T/sqrt(d_k)), and the output is these weights multiplied by V. Multiple 'heads' perform this operation in parallel in lower-dimensional subspaces, and the results are concatenated and projected through W_O.
Input: [B, T, d_model], where B = batch, T = sequence length, d_model = representation dimension (e.g., 512, 4096).
Output: [B, T, d_model], a sequence of contextual representations with the same shape as the input.
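A minimal sketch of the operation described above, written in PyTorch for illustration; the class name and the default values d_model=512 and num_heads=8 are assumptions for the example, not values taken from the source.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # per-head subspace dimension
        # Q, K, V projections and the output projection W_O
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, T, d_model]
        B, T, _ = x.shape
        # Project and split into heads: [B, num_heads, T, d_k]
        q = self.w_q(x).view(B, T, self.num_heads, self.d_k).transpose(1, 2)
        k = self.w_k(x).view(B, T, self.num_heads, self.d_k).transpose(1, 2)
        v = self.w_v(x).view(B, T, self.num_heads, self.d_k).transpose(1, 2)
        # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)  # [B, h, T, T]
        weights = F.softmax(scores, dim=-1)
        context = weights @ v                                    # [B, h, T, d_k]
        # Concatenate heads and project through W_O
        context = context.transpose(1, 2).contiguous().view(B, T, -1)
        return self.w_o(context)                                 # [B, T, d_model]

# Usage: the output has the same shape as the input
x = torch.randn(2, 16, 512)          # [B=2, T=16, d_model=512]
out = MultiHeadSelfAttention()(x)    # [2, 16, 512]
```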