A decoder-only architecture is a specific type of transformer model design used in large language models like GPT.
Unlike the original transformer architecture, which contained both encoder and decoder components, decoder-only models simplify the design by using only the decoder stack (without the cross-attention layers that connect it to an encoder).
The key components of a decoder-only architecture, sketched in code after this list, are:
- Input processing: The model takes text input, converts it to tokens, and generates embeddings that include both token and position information.
- Self-attention blocks: Multiple layers of masked self-attention (also called causal attention) that prevent the model from looking ahead in the sequence when making predictions.
- Feed-forward networks: After each attention layer, a feed-forward neural network processes each token's representation independently, with the same network applied at every position.
- Output layer: A final layer that converts the processed representations into probabilities for next-token prediction.
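The sketch below ties these components together, assuming PyTorch. The class names, layer sizes, and simplifications (learned position embeddings, no dropout, a pre-norm layout) are illustrative choices for this example, not any particular model's implementation.

```python
# Minimal sketch of a decoder-only language model, assuming PyTorch.
# Hyperparameters and structure are illustrative, not a specific model's design.
import torch
import torch.nn as nn


class DecoderBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        # Causal mask: position i may attend only to positions <= i.
        T = x.size(1)
        mask = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out                 # residual around masked self-attention
        x = x + self.ff(self.ln2(x))     # residual around position-wise feed-forward
        return x


class TinyDecoderOnlyLM(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, n_layers=2, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)  # token embeddings
        self.pos_emb = nn.Embedding(max_len, d_model)     # position embeddings
        self.blocks = nn.ModuleList(
            [DecoderBlock(d_model) for _ in range(n_layers)]
        )
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size)        # output layer -> logits

    def forward(self, token_ids):
        T = token_ids.size(1)
        pos = torch.arange(T, device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(pos)   # input processing
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_f(x))                    # next-token logits
```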
The architecture is “decoder-only” because it operates like the decoder part of the original transformer, using masked attention to ensure each token can only attend to previous tokens in the sequence. This matches the natural left-to-right way we generate text and makes these models particularly effective at text generation tasks.
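A small illustration of how the mask enforces this, again assuming PyTorch: scores for future positions are set to negative infinity before the softmax, so their attention weights come out as zero. The sequence length and random scores here are just toy values.

```python
# Toy demonstration of causal masking (assumed PyTorch; values are random).
import torch
import torch.nn.functional as F

T = 4                                            # toy sequence length
scores = torch.randn(T, T)                       # raw query-key attention scores
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf")) # block attention to future tokens
weights = F.softmax(scores, dim=-1)
print(weights)  # lower-triangular: row i has zero weight on every column j > i
```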
The most well-known implementation is GPT (Generative Pre-trained Transformer), but many other models like LLaMA, Mistral, and Claude also use this architecture. Its effectiveness and relative simplicity have made it the dominant architecture for large language models.
The key advantage of decoder-only models is their straightforward design optimized for next-token prediction, making them especially good at text generation while being simpler to train and implement than full encoder-decoder models.
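To make the next-token-prediction loop concrete, here is a hedged sketch of greedy generation using the toy model class defined earlier. It simply re-runs the model on the growing sequence and appends the most likely token each step; real systems add sampling strategies, KV caching, and stop conditions, none of which are shown here.

```python
# Greedy next-token generation with the toy model above (illustrative only).
import torch

model = TinyDecoderOnlyLM()
model.eval()

tokens = torch.tensor([[1, 5, 42]])      # toy prompt token ids (made up for the example)
with torch.no_grad():
    for _ in range(10):
        logits = model(tokens)           # shape: (batch, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_id], dim=1)  # append and repeat
print(tokens)
```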