NLP · 2024-03-10 · 15 min read

Understanding Transformers Architecture

Deep dive into the transformer architecture that powers modern NLP models.

transformers · NLP · attention · BERT · GPT

Understanding Transformers

The transformer architecture revolutionized natural language processing and has become the backbone of models like BERT, GPT, and beyond.

Self-Attention Mechanism

The core innovation is the self-attention mechanism, which allows the model to weigh the importance of different parts of the input:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
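The formula above can be sketched directly in NumPy. This is a minimal illustration of scaled dot-product attention, not a production implementation; the matrix shapes and the random toy inputs are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq_q, seq_k) similarity scores
    weights = softmax(scores, axis=-1)  # each query's weights sum to 1
    return weights @ V                  # weighted average of the values

# Toy example: 3 tokens, d_k = 4.
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
out = attention(Q, K, V)  # shape (3, 4)
```

Dividing by sqrt(d_k) keeps the dot products from growing with the key dimension, which would otherwise push the softmax into a near-one-hot regime with tiny gradients.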

Multi-Head Attention

Instead of a single attention function, transformers run multiple attention "heads" in parallel, each head learning to capture a different aspect of the relationships between tokens:

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O

where \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V).
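A compact way to see the split-attend-concat pattern is to slice the projected matrices into h chunks along the feature dimension. This sketch folds the per-head projections W_i^Q, W_i^K, W_i^V into single matrices and slices the result, an equivalent and common formulation; all shapes here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V

def multi_head(X, W_q, W_k, W_v, W_o, h):
    """Split d_model into h heads, attend independently, concat, project."""
    d_model = X.shape[-1]
    d_k = d_model // h
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = [
        attention(Q[:, i*d_k:(i+1)*d_k],
                  K[:, i*d_k:(i+1)*d_k],
                  V[:, i*d_k:(i+1)*d_k])
        for i in range(h)
    ]
    # Concat(head_1, ..., head_h) W^O
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
seq, d_model, h = 5, 8, 2
X = rng.standard_normal((seq, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) for _ in range(4))
out = multi_head(X, W_q, W_k, W_v, W_o, h)  # shape (5, 8)
```

Each head attends over a d_model/h slice, so the total cost is comparable to one full-width attention while the heads can specialize.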

Position Encoding

Since transformers process all tokens simultaneously, positional information is added via sinusoidal encodings or learned embeddings.
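The sinusoidal variant can be computed in a few lines. This follows the usual definition PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the sequence length and model width below are arbitrary.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings: sin on even dims, cos on odd dims."""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(50, 16)  # added elementwise to the token embeddings
```

Because each dimension oscillates at a different wavelength, any fixed offset between two positions corresponds to a linear transformation of the encoding, which is what lets the model reason about relative positions.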

The Architecture

A transformer consists of:

  • Encoder: Processes input with self-attention and feed-forward layers
  • Decoder: Generates output with masked self-attention and cross-attention
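The "masked" part of the decoder's self-attention can be illustrated with a causal mask: future positions are set to negative infinity before the softmax, so their weights collapse to zero. A minimal sketch, with toy shapes assumed for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(Q, K, V):
    """Decoder-style self-attention: each position may attend only to
    itself and earlier positions (causal mask)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Strictly upper-triangular entries are future tokens; -inf before
    # softmax makes their attention weight exactly zero.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
Q = K = V = rng.standard_normal((4, 8))
out = masked_attention(Q, K, V)
```

The first position can attend only to itself, so its output is exactly its own value vector; this masking is what makes autoregressive generation possible during parallel training.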

Impact

Transformers enabled:

  • BERT: Bidirectional understanding
  • GPT: Autoregressive text generation
  • Vision Transformers: Applying attention to images