A Vector-Level View of GPT-2

Computational infographic of vector-wise inference in a decoder-only transformer

With annotations based on Anthropic’s A Mathematical Framework for Transformer Circuits
and Neel Nanda’s Comprehensive Mechanistic Interpretability Explainer & Glossary

Return to graphic || Source code || Text only version

This graphic situates some of the core insights from the mechanistic interpretability research program in a visual walkthrough of a transformer model. The goal is to blend...

A vector-level computational graph

Neural networks are typically implemented as a series of tensor operations because modern tooling is highly optimized for the parallelized matrix multiplications that tensors facilitate. But the most computationally efficient way to code a neural network isn’t necessarily the best way to understand how it works. Vectors (aka embeddings) are the fundamental information-bearing units in transformers, and are—with few exceptions—operated on completely independently. Discussions framed in terms of [batch × head × position × d_head] tensors, where thousands of high-dimensional vectors are packed together, can lose focus on how information actually flows through the model.

Implementations sometimes even permute the computational structure of the architecture for efficiency. For example, the original transformers paper describes multi-headed attention as involving a concatenation of the result vectors from each head, which is then projected back to the residual stream. Implementations and discussion since has largely adhered to this structuring. But concatenation is an unprincipled operation that obscures the natural way information flows through attention heads: result vectors are independently meaningful, and they can be directly and independently projected back to the residual stream without any concatenation operation.

with,

A mechanistic interpretability infographic

Existing work (such as Anthropic’s excellent Transformer Circuits thread) is weighty, and our understanding is rapidly evolving. A good primer might help people bootstrap into this important research program.

The intended audience already has a rough understanding of the transformer architecture. If a refresher is needed, I recommend Jay Alamar’s The Illustrated Transformer. Note that my diagram depicts a decoder-only model (GPT-2 124M, a common reference model for interpretability work) rather than the original encoder-decoder architecture depicted in Alamar’s piece.

Created for BlueDot Impact’s AI Safety Fundamentals’ AI Alignment Course. Thanks to Hannes Whittingham for feedback and encouragement.

Sources

Anthropic interpretability team:
- The Transformer Circuits series of papers, particularly: A Mathematical Framework for Transformer Circuits
- Transformer Circuits: rough early thoughts (series of research talks)
DeepMind mechanistic interpretability team:
- Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level
Neel Nanda:
- What is a Transformer? video
- Comprehensive Mechanistic Interpretability Explainer & Glossary

Recommended introductory resources

Jay Alamar:
- The Illustrated Transformer
3blue1brown:
- The deep learning video series, particularly: Transformers, Attention, and How might LLMs store facts