
A Step-by-Step Guide to Transformers: Understanding How Neural Networks Process Texts and How to Program Them#

Introduction#

This guide provides a pathway to understanding how the most widely used neural network architecture in natural language processing, the transformer, actually works. It follows the theoretical explanations of selected chapters from a well-regarded book on the subject. It also proposes learning Python programming along with the basics of PyTorch, a library that enables neural networks to be programmed, trained, and run on GPUs. As a culmination, an existing implementation of the transformer written in PyTorch is studied. The ultimate goal is to modify this code to experiment with a simple problem involving human language. The idea is to gain solid knowledge for tackling more complex tasks later, rather than creating something flashy to showcase immediately.

Note

This page outlines a self-learning roadmap for understanding transformers. It links to documents hosted on other pages of this website. Thus, the collection can be considered a complete guide to assist you on your journey. However, you may have come to these pages from another source (e.g., a specific course) that suggests a different way of using the various contents. In that case, follow the recommendations and plan provided by that source instead of those proposed here.

Some content can be studied in parallel. While learning about the neural models, you can start exploring Python and NumPy, and move on to PyTorch once you are comfortable with those two. You can also review the elements of algebra, calculus, and probability that you might have forgotten. Studying the transformer's code should only be undertaken after thoroughly assimilating all of these prior concepts.

Study Manual#

To understand neural networks mathematically and conceptually, we will rely on the third edition (still unfinished) of the book "Speech and Language Processing" by Dan Jurafsky and James H. Martin. Each section of this guide indicates which chapters and sections are relevant for our purposes. Important: since the online version of the book is unfinished and periodically updated, not only with new content but also with restructurings and section relocations, this guide links and refers to an archived version of the book (available here), which may not correspond to the latest version.

Why a Deep Dive Approach?#

At first glance, writing a program that uses machine learning models seems straightforward. For example, the following lines of code use a language model based on a transformer to complete a given text:

# Load a GPT-2 language model through the Hugging Face text-generation pipeline
from transformers import pipeline
generator = pipeline('text-generation', model='gpt2')
# Generate three possible continuations of the prompt
generator("Hello, I'm a language model and", max_length=30, num_return_sequences=3)

While high-level libraries are immensely important in certain contexts, if you only use the code above:

  • You won't understand how the model actually works.
  • You won't be able to create other models to experiment with different problems.
  • You won't know how to train your own model or what factors influence training quality or duration.
  • You won't understand other neural models used in natural language processing.
  • Ultimately, you’ll view your program as a black box performing magical tasks.

This guide aims to help you open that black box and understand its workings thoroughly.

Content Sequencing#

The following table shows a sequence of the guide's content along with indicative time estimates for each part.

| Step | Content | Estimated Time | Notes |
|------|---------|----------------|-------|
| 1 | Introduction | 10 minutes | This page! |
| 2 | Mathematical Concepts | 5 hours | Refer to the links in this section only if you need to refresh your knowledge of mathematical concepts. |
| 3 | Regressors | 4 hours | This document introduces a machine learning model that is not usually categorized as a neural network but that presents most of the key ideas needed to discuss neural networks. |
| 4 | Learning PyTorch | 5 hours | Going beyond equations and learning to implement the different models we will study is fundamental to fully understanding them. This page links to resources for learning (or reviewing) Python and PyTorch. Invest time here before advancing to the theoretical content so you can better understand the implementations. |
| 5 | Regressor Implementation | 4 hours | Examine PyTorch code for implementing logistic and softmax regressors. Use debugging tools as explained here to step through the code: analyze variable values and types, inspect tensor shapes, and make sure you understand what each dimension represents (see the sketch after this table). |
| 6 | Non-contextual Embeddings | 4 hours | Obtaining non-contextual embeddings is an immediate application of logistic regressors, showcasing the potential of self-supervised learning. |
| 7 | Skip-gram Implementation | 2 hours | Analyze PyTorch code for the skip-gram implementation. Use debugging tools as explained here to step through the code. By this point, it is advisable to have started familiarizing yourself with PyTorch before proceeding further. |
| 8 | Feedforward Networks | 3 hours | This section introduces the neural network concept and builds a very basic language model with feedforward networks. |
| 9 | Feedforward Network Implementation | 1 hour | Explore the code for implementing a simple feedforward-based language model. |
| 10 | Transformers and Attention Models | 6 hours | All previous concepts prepare you to delve into the transformer architecture. This page focuses on the transformer's decoder part, used in language models. |
| 11 | Transformer Code Implementation | 6 hours | Analyze the general transformer implementation and a decoder-based language model. This code is more complex than what you have studied before. |
| 12 | Additional Aspects of Transformers | 4 hours | This page introduces the transformer's encoder part and its potential uses, both standalone and paired with a decoder. |
| 13 | Named Entity Recognizer Implementation | 1 hour | Analyze the code for implementing a named entity recognizer based on an encoder. |
| 14 | GPT-2 Model Implementation | 4 hours | Optionally, analyze the code for implementing a language model capable of loading and using GPT-2. |
| 15 | Speech | 4 hours | This content is optional, as it shifts to the domain of speech processing. |
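
As a taste of the kind of inspection suggested in steps 5 and 7, here is a minimal sketch (an illustration with assumed toy dimensions, not code from the guide) of checking tensor types and shapes in PyTorch, the sort of thing you would look at with the debugger stopped inside a softmax regressor:

import torch

# Toy setup: a batch of 8 samples with 5 features, classified into 3 classes
x = torch.randn(8, 5)                   # input batch, shape (batch, features)
w = torch.randn(5, 3)                   # weight matrix, shape (features, classes)
b = torch.zeros(3)                      # bias vector, shape (classes,)

logits = x @ w + b                      # linear scores, shape (batch, classes)
probs = torch.softmax(logits, dim=-1)   # class probabilities for each sample

# The kind of checks you would make at a breakpoint
print(type(x), x.dtype, x.shape)        # <class 'torch.Tensor'> torch.float32 torch.Size([8, 5])
print(logits.shape, probs.shape)        # torch.Size([8, 3]) torch.Size([8, 3])
print(probs.sum(dim=-1))                # every row sums to 1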

In each section, an icon highlights the essential links, whether to book chapters for reading or code for exploration.

Prerequisite Mathematical Concepts#

Basic algebra, calculus, and probability concepts needed for natural language processing can be found in the "Linear Algebra," "Calculus" (including "Automatic differentiation"), and "Probability and Statistics" sections of Chapter 2 in the book "Dive into Deep Learning." Other topics like information theory or the maximum likelihood principle are covered in the "Information Theory" and "Maximum Likelihood" sections of an appendix in the same book.
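
To give a concrete idea of what automatic differentiation looks like in practice, here is a minimal PyTorch sketch (an illustration of the concept, not taken from the book) that computes the derivative of y = x^2 + 3x at x = 2:

import torch

# Automatic differentiation of y = x**2 + 3*x at x = 2 (by hand: dy/dx = 2*x + 3 = 7)
x = torch.tensor(2.0, requires_grad=True)
y = x**2 + 3*x
y.backward()      # fills x.grad with dy/dx computed by backpropagation
print(x.grad)     # tensor(7.)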

Further Reading#

Expand your knowledge with the following books, most of which are available online:

The following list includes links to video courses by renowned researchers or universities: