Computational infographic of vector-wise inference in a decoder-only transformer
With annotations based on Anthropic’s A Mathematical Framework for Transformer Circuits and Neel Nanda’s Comprehensive Mechanistic Interpretability Explainer & Glossary
This graphic situates some of the core insights from the mechanistic interpretability research program in a visual walkthrough of a transformer model. The goal is to blend a vector-level view of how the computation actually proceeds with an accessible primer on the research program itself.
Neural networks are typically implemented as a series of tensor operations, because modern tooling is highly optimized for the parallelized matrix multiplications that tensors facilitate. But the most computationally efficient way to code a neural network isn’t necessarily the best way to understand how it works. Vectors (aka embeddings) are the fundamental information-bearing units in transformers, and, with few exceptions, they are operated on completely independently of one another. Discussions framed in terms of [batch × head × position × d_head] tensors, where thousands of high-dimensional vectors are packed together, can lose sight of how information actually flows through the model.
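For instance, here is a minimal NumPy sketch of that point; the toy sizes and the weight name W_Q are illustrative assumptions, not taken from any particular implementation. It shows that a batched projection over a whole sequence is just the same per-vector map applied to each position independently:

```python
import numpy as np

d_model, d_head, n_pos = 8, 4, 5           # toy sizes, chosen only for illustration
rng = np.random.default_rng(0)

W_Q = rng.normal(size=(d_model, d_head))   # one head's query projection (hypothetical weights)
resid = rng.normal(size=(n_pos, d_model))  # one residual-stream vector per position

# Tensorized view: a single matrix multiply over every position at once.
queries_batched = resid @ W_Q              # shape (n_pos, d_head)

# Vector-wise view: the identical map applied to each position's vector on its own.
queries_vectorwise = np.stack([x @ W_Q for x in resid])

assert np.allclose(queries_batched, queries_vectorwise)
```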
Implementations sometimes even rearrange the computational structure of the architecture for efficiency. For example, the original transformer paper describes multi-headed attention as concatenating the result vectors from each head and then projecting the concatenation back to the residual stream, and implementations and discussions since have largely adhered to this structure. But concatenation is an unprincipled operation that obscures the natural way information flows through attention heads: result vectors are independently meaningful, and each can be projected back to the residual stream on its own, with no concatenation at all.
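To make that equivalence concrete, here is a small NumPy sketch; the weight name W_O, the toy sizes, and the head-by-head row layout of W_O are assumptions matching the Concat(head_1, …, head_h) · W_O convention, not code from any real model. It checks that concatenating per-head results and projecting once gives the same output as projecting each head’s result through its own slice of W_O and summing:

```python
import numpy as np

d_model, n_heads = 64, 4
d_head = d_model // n_heads
rng = np.random.default_rng(0)

# One result vector per head, for a single token position (toy values).
results = [rng.normal(size=d_head) for _ in range(n_heads)]
# Output projection; rows are assumed grouped head-by-head, as in Concat(...) @ W_O.
W_O = rng.normal(size=(n_heads * d_head, d_model))

# Conventional formulation: concatenate the heads' results, then project once.
concat_then_project = np.concatenate(results) @ W_O

# Vector-wise formulation: project each head's result independently, then sum.
sum_of_projections = sum(
    results[h] @ W_O[h * d_head:(h + 1) * d_head, :] for h in range(n_heads)
)

assert np.allclose(concat_then_project, sum_of_projections)
```

This is the reframing emphasized in A Mathematical Framework for Transformer Circuits: each head writes its own, independently meaningful contribution directly into the residual stream.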
Existing work (such as Anthropic’s excellent Transformer Circuits thread) is weighty, and our understanding is rapidly evolving. A good primer might help people bootstrap into this important research program.
The intended audience already has a rough understanding of the transformer architecture. If a refresher is needed, I recommend Jay Alammar’s The Illustrated Transformer. Note that my diagram depicts a decoder-only model (GPT-2 124M, a common reference model for interpretability work) rather than the original encoder-decoder architecture shown in Alammar’s piece.
Created for the AI Alignment course in BlueDot Impact’s AI Safety Fundamentals program. Thanks to Hannes Whittingham for feedback and encouragement; check out his final project on Reinforcement Learning from LLM Feedback!
Sources
Recommended introductory resources