A Step-by-Step Guide to Transformers: Understanding How Neural Networks Process Texts and How to Program Them#
Introduction#
This guide provides a pathway to understanding how the most widely used neural network architecture in natural language processing, the transformer, actually works. It follows the theoretical explanations of selected chapters from a well-regarded book on the subject. It also proposes learning Python along with the basics of PyTorch, a library that allows neural networks to be programmed, trained, and run on GPUs. As a culmination, an existing implementation of the transformer written in PyTorch is studied, with the ultimate goal of modifying this code to experiment with a simple problem involving human language. The idea is to gain solid knowledge for tackling more complex tasks later, rather than to create something flashy to showcase immediately.
Note
This page outlines a self-learning roadmap for understanding transformers. It links to documents hosted on other pages of this website. Thus, the collection can be considered a complete guide to assist you on your journey. However, you may have come to these pages from another source (e.g., a specific course) that suggests a different way of using the various contents. In that case, follow the recommendations and plan provided by that source instead of those proposed here.
Some content can be studied in parallel. While learning about neural models, you can start exploring Python and NumPy and, once you are comfortable with those two, move on to PyTorch. You can also review any elements of algebra, calculus, and probability that you may have forgotten. Studying the transformer's code should only be undertaken after thoroughly assimilating all these prior concepts.
Study Manual#
To understand neural networks mathematically and conceptually, we will rely on the third edition (still unfinished) of the book "Speech and Language Processing" by Dan Jurafsky and James H. Martin. Sections of this guide indicate which chapters and sections are relevant for our purposes. Important: Since the online version of the book is unfinished and periodically updated, not only with new content but also with restructurings and section relocations, this guide includes links and references to an archived version of the book that may not correspond to the latest version (available here).
Why a Deep Dive Approach?#
At first glance, writing a program that uses machine learning models seems straightforward. For example, the following lines of code use a language model based on a transformer to complete a given text:
from transformers import pipeline

# Load a GPT-2 text-generation pipeline and ask for three possible continuations of the prompt
generator = pipeline('text-generation', model='gpt2')
generator("Hello, I'm a language model and", max_length=30, num_return_sequences=3)
While high-level libraries are immensely important in certain contexts, if you only use the code above:
- You won't understand how the model actually works.
- You won't be able to create other models to experiment with different problems.
- You won't know how to train your own model or what factors influence training quality or duration.
- You won't understand other neural models used in natural language processing.
- Ultimately, you’ll view your program as a black box performing magical tasks.
This guide aims to help you open that black box and understand its workings thoroughly.
Content Sequencing#
The following table shows a sequence of the guide's content along with indicative time estimates for each part.
Step | Content | Estimated Time | Notes |
---|---|---|---|
1 | Introduction | 10 minutes | This page! |
2 | Mathematical Concepts | 5 hours | Refer to the links in this section only if you need to refresh your knowledge of mathematical concepts. |
3 | Regressors | 4 hours | This document introduces a machine learning model that is not usually categorized as a neural network but helps introduce most of the key ideas relevant for discussing neural networks. |
4 | Learning PyTorch | 5 hours | Going beyond equations and learning to implement the different models we will study is fundamental to fully understanding them. This page links to resources for learning (or reviewing) Python and PyTorch. Invest time here before advancing to theoretical content to better understand the implementations. |
5 | Regressor Implementation | 4 hours | Examine PyTorch code for implementing logistic and softmax regressors. Use debugging tools as explained here to step through the code: analyze variable values and types, tensor shapes, and ensure you understand what each dimension represents (a minimal example of this kind of inspection appears right after this table). |
6 | Non-contextual Embeddings | 4 hours | Obtaining non-contextual embeddings is an immediate application of logistic regressors, showcasing the potential of self-supervised learning. |
7 | Skip-gram Implementation | 2 hours | Analyze PyTorch code for skip-gram implementation. Use debugging tools as explained here to step through the code. By this point, it is advisable to start familiarizing yourself with PyTorch before proceeding further. |
8 | Feedforward Networks | 3 hours | This section introduces the neural network concept and creates a very basic language model with feedforward networks. |
9 | Feedforward Network Implementation | 1 hour | Explore the code for implementing a simple feedforward-based language model. |
10 | Transformers and Attention Models | 6 hours | All previous concepts prepare you to delve into the transformer architecture. This page focuses on the transformer’s decoder part, used in language models. |
11 | Transformer Code Implementation | 6 hours | Analyze the general transformer implementation and a decoder-based language model. This code is more complex than those you studied before. |
12 | Additional Aspects of Transformers | 4 hours | This page introduces the transformer’s encoder part and its potential uses, both standalone and paired with a decoder. |
13 | Named Entity Recognizer Implementation | 1 hour | Analyze the code for implementing a named entity recognizer based on an encoder. |
14 | GPT-2 Model Implementation | 4 hours | Optionally analyze the code for implementing a language model capable of loading and using GPT-2. |
15 | Speech | 4 hours | This content is optional as it shifts to the domain of speech processing. |
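Step 5 recommends stepping through the regressor code with a debugger and checking variable values, types, and tensor shapes. As a small, self-contained taste of what that inspection looks like (the layer sizes and batch size below are arbitrary and are not those of the guide's code):

import torch
import torch.nn as nn

# A toy softmax regressor: 4 input features, 3 output classes (arbitrary sizes)
model = nn.Linear(4, 3)
x = torch.randn(8, 4)              # a batch of 8 examples with 4 features each
logits = model(x)                  # shape (8, 3): one score per class for each example
probs = torch.softmax(logits, dim=-1)

# The kind of checks suggested in step 5: shapes, types, and what each dimension means
print(x.shape, x.dtype)            # torch.Size([8, 4]) torch.float32
print(logits.shape)                # torch.Size([8, 3])
print(probs.sum(dim=-1))           # every row sums to 1: a distribution over the 3 classes

Getting into the habit of making these checks, whether with print statements or a debugger, pays off later when you reach the transformer code, where tensors carry batch, sequence, and embedding dimensions at the same time.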
In each section, an icon highlights the essential links, whether to book chapters to read or to code to explore.
Prerequisite Mathematical Concepts#
Basic algebra, calculus, and probability concepts needed for natural language processing can be found in the "Linear Algebra," "Calculus" (including "Automatic differentiation"), and "Probability and Statistics" sections of Chapter 2 in the book "Dive into Deep Learning." Other topics like information theory or the maximum likelihood principle are covered in the "Information Theory" and "Maximum Likelihood" sections of an appendix in the same book.
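If automatic differentiation is only a vague memory, the following minimal PyTorch snippet (with an arbitrarily chosen function and input, not taken from the book) illustrates the idea that section formalizes: the library records the operations applied to a tensor and can then compute exact gradients on request.

import torch

# y = x**2 + 3x evaluated at x = 2; the derivative dy/dx = 2x + 3 should be 7
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x
y.backward()      # computes dy/dx and stores it in x.grad
print(x.grad)     # tensor(7.)

This same mechanism is what PyTorch uses, at a much larger scale, to train every model in the study plan above, from the logistic regressor to the transformer.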
Further Reading#
Expand your knowledge with the following books, most of which are available online:
- "Speech and Language Processing" by Dan Jurafsky and James H. Martin. Third edition unpublished as of 2024 but with an advanced draft online. Details key NLP concepts and models without delving into implementation details. This guide is based on this book.
- "Dive into Deep Learning" by Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola. Explores deep learning models in detail, with a paper version published in 2023.
- "Deep Learning: Foundations and Concepts" by Chris Bishop and Hugh Bishop. Also available in print since 2024.
- "Understanding Deep Learning" (2023) by Simon J.D. Prince. Filled with illustrations and figures to clarify concepts.
- The series "Probabilistic Machine Learning: An Introduction" (2022) and "Probabilistic Machine Learning: Advanced Topics" (2023) by Kevin Murphy covers various machine learning elements in depth.
- "Deep Learning for Natural Language Processing: A Gentle Introduction" by Mihai Surdeanu and Marco A. Valenzuela-Escárcega. Still under development. Contains code in some chapters.
- "Deep Learning with PyTorch Step-by-Step: A Beginner's Guide" (2022) by Daniel Voigt Godoy. Paid, with digital and print versions (in three volumes). A Spanish version of the first chapters exists. Written in a clear, example-rich style.
- "The Mathematical Engineering of Deep Learning" (2024) by Benoit Liquet, Sarat Moka, and Yoni Nazarathy.
- "The Little Book of Deep Learning" (2023) by François Fleuret.
The following list includes links to video courses by renowned researchers or universities:
- Stanford CS224n ― Natural Language Processing with Deep Learning; course website.
- Stanford CS324 ― Large Language Models; 2022 edition.
- Stanford CS324 2023 ― Advances in Foundation Models; 2023 edition.
- Stanford CS25 ― Transformers United; course website.
- Stanford CS229 ― Machine Learning; course website.
- Stanford CS230 ― Deep Learning; course website.
- MIT 6.S191 ― Introduction to Deep Learning; course website.
- "Neural Networks: Zero to Hero" by Andrew Karpathy.
- Machine Learning Specialization.