How self-attention works in LLMs: A mathematical breakdown for beginners

Self-attention mechanisms represent a fundamental building block of modern large language models, serving as the computational engine that allows these systems to understand context and relationships within text. Giles Thomas’s latest installment in his series on building LLMs from scratch dissects the mathematics and intuition behind trainable self-attention, making this complex topic accessible by emphasizing the geometric transformations and matrix operations that enable contextual understanding in neural networks.

The big picture: Self-attention works by projecting input word embeddings into three different spaces—query, key, and value—allowing the model to determine which parts of a sequence to focus on when processing each word.

  • This projection process transforms the input text into mathematical representations that can capture meaningful relationships between words, regardless of their positions in a sequence.
  • The approach differs fundamentally from traditional neural networks by allowing dynamic, content-dependent information flow rather than fixed patterns of connections.
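To make the projection step concrete, here is a minimal NumPy sketch (not code from the original series) that maps a handful of token embeddings into query, key, and value spaces using three randomly initialized weight matrices. In a real model these matrices are learned during training; all names and sizes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 tokens, 8-dimensional embeddings, projections into 4 dimensions.
seq_len, d_in, d_out = 4, 8, 4
X = rng.normal(size=(seq_len, d_in))   # one embedding vector per token

# Three trainable weight matrices (random here) define the three projection spaces.
W_q = rng.normal(size=(d_in, d_out))
W_k = rng.normal(size=(d_in, d_out))
W_v = rng.normal(size=(d_in, d_out))

Q = X @ W_q   # queries: "what is each token looking for?"
K = X @ W_k   # keys:    "what does each token offer?"
V = X @ W_v   # values:  "what information does each token carry?"

print(Q.shape, K.shape, V.shape)  # (4, 4) each: one projected vector per token
```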

Behind the mathematics: Matrices serve as transformation tools that project vectors from one vector space into another, often with a different number of dimensions, essentially teaching the model which features to emphasize in different contexts.

  • When the author describes matrices as “projections between spaces,” he’s explaining how these mathematical objects redirect information, similar to how a spotlight can highlight different areas of a stage.
  • These projections create specialized representations of each word embedding that serve different functions in the attention mechanism.
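As a small illustration of a matrix acting as a projection between spaces, the hypothetical 3-by-2 matrix below maps a 3-dimensional embedding into a 2-dimensional one. The numbers are made up purely to show how the weights decide which input features each output dimension emphasizes.

```python
import numpy as np

# One word embedding in a hypothetical 3-dimensional space.
embedding = np.array([0.9, 0.1, 0.4])

# A 3x2 matrix projects it into a 2-dimensional space. The weights decide which
# input features dominate each output dimension: the first output mixes features
# 1 and 3, the second is driven almost entirely by feature 2.
W = np.array([[0.8, 0.0],
              [0.0, 1.0],
              [0.5, 0.1]])

projected = embedding @ W   # shape (2,)
print(projected)            # [0.92 0.14]
```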

In plain English: Self-attention allows a language model to look at all the words in a sentence simultaneously and decide which ones are most relevant to understanding each specific word, similar to how humans read by constantly connecting related concepts.

How it actually works: The attention mechanism calculates similarity scores between words through dot products of their query and key projections, then uses these scores to create weighted combinations of value projections.

  • The model computes how relevant each word is to every other word by multiplying their respective query and key vectors, essentially measuring their compatibility.
  • These relevance scores are scaled to prevent numerical instability in high-dimensional spaces, then normalized through the softmax function so that each token's attention weights sum to one, as in the sketch below.
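A minimal NumPy sketch of that computation, using randomly generated stand-ins for the query and key projections (the names and sizes are illustrative assumptions, not taken from the series):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_k = 4, 4
Q = rng.normal(size=(seq_len, d_k))   # query projections (random stand-ins)
K = rng.normal(size=(seq_len, d_k))   # key projections

scores = Q @ K.T                          # dot product of every query with every key
weights = softmax(scores / np.sqrt(d_k))  # scale, then normalize each row to sum to 1

print(weights.round(2))   # row i: how strongly token i attends to every token
```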

Why scaling matters: The author explains that dividing attention scores by the square root of the dimension size helps prevent the softmax function from producing extremely sharp probability distributions.

  • Without this scaling factor, dot products grow in magnitude as the dimensionality of the embeddings increases, pushing the softmax toward near-one-hot outputs whose gradients vanish during training and make learning inefficient or impossible.
  • This mathematical insight addresses one of the key challenges in training deeper neural networks with attention mechanisms.
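The effect is easy to demonstrate numerically. In the sketch below, dot products between random vectors in a 512-dimensional space (a size assumed only for illustration) produce a nearly one-hot softmax, while dividing by the square root of the dimension restores a smoother distribution that lets gradients flow to all positions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 512                        # a high-dimensional key/query space (illustrative)
q = rng.normal(size=d_k)         # one random query vector
K = rng.normal(size=(6, d_k))    # six random key vectors

raw = K @ q                      # unscaled dot products grow with d_k
scaled = raw / np.sqrt(d_k)      # scaling brings the variance back toward 1

print(softmax(raw).round(3))     # nearly one-hot: almost all weight on a single key
print(softmax(scaled).round(3))  # smoother distribution with useful gradients
```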

The final transformation: Context vectors are created by weighting value projections according to the normalized attention scores, producing representations that incorporate information from the entire sequence.

  • These context vectors capture which words in the sequence are most relevant to understanding each position, allowing the model to emphasize important relationships.
  • The end result is a more contextually aware representation of each word that can be used for subsequent processing in the language model.
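Putting the pieces together, this minimal NumPy sketch (again with random stand-in projections and illustrative sizes) produces one context vector per token as the attention-weighted mix of the value vectors:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_k, d_v = 4, 4, 4
Q = rng.normal(size=(seq_len, d_k))   # query projections (random stand-ins)
K = rng.normal(size=(seq_len, d_k))   # key projections
V = rng.normal(size=(seq_len, d_v))   # value projections

weights = softmax(Q @ K.T / np.sqrt(d_k))   # attention weights, one row per token
context = weights @ V                        # each row: weighted mix of all value vectors

print(context.shape)   # (4, 4): one context vector per token in the sequence
```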

What’s next: Future posts in the series will expand on this foundation by explaining causal self-attention (which prevents looking ahead in text generation), multi-head attention, and the theoretical underpinnings of why this mechanism works so effectively.

Writing an LLM from scratch, part 8 -- trainable self-attention
