Implementation of Models in PyTorch#

Implementing the different models in code is a complementary approach to studying them from a mathematical perspective. This page presents PyTorch implementations of each of the models covered. The idea is to approach each implementation after studying the corresponding model conceptually.

Note

This page is part of the series "A step-by-step guide to transformers," which presents a guide to understanding how neural networks process text and how to program them. It is also possible that you arrived here from another source (e.g., a specific course) that suggests a different way to use this content. In that case, follow the recommendations and planning provided by that source.

Code for a Logistic and a Multinomial Regressor#

Here are two PyTorch implementations of the regressors studied on this page, each in just a few dozen lines of code. Make sure you understand the code well enough to feel confident modifying it to suit other needs.

Review how to debug Python programs before tackling the code. Also, review how the broadcasting mechanism works in PyTorch.

The two programs in this section are:

  • A logistic regressor that classifies two-dimensional synthetic samples into two classes. Only the most basic elements of PyTorch are used to keep the implementation as detailed as possible. As an exercise, trace and analyze the sizes of the tensors. Experiment with the number of training steps and the learning rate to observe how training evolves. Explore various positions of class centers and data dispersion to see how the decision boundary changes. Remove the bias from the equations and observe how forcing the decision boundary to pass through the origin restricts its shape. A minimal sketch of such a regressor appears right after this list. Open In Colab
  • A softmax regressor for classifying texts by topic. As an exercise, try training with a single embedding per training step instead of a batch of all embeddings and observe how the error behaves. You can also adapt the previous logistic regressor code to use the PyTorch functions seen in this program. Open In Colab
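
For reference before opening the notebooks, here is a minimal sketch of the kind of logistic regressor the first notebook implements. The synthetic data, class centers, learning rate, and number of steps below are illustrative choices, not the notebook's actual values.

import torch

# Synthetic 2-D data: two Gaussian clusters (centers chosen arbitrarily here)
torch.manual_seed(0)
n = 100
x0 = torch.randn(n, 2) + torch.tensor([-2.0, -2.0])   # class 0
x1 = torch.randn(n, 2) + torch.tensor([2.0, 2.0])     # class 1
X = torch.cat([x0, x1])
y = torch.cat([torch.zeros(n), torch.ones(n)])

# Parameters of the logistic regressor: a weight vector and a bias
w = torch.zeros(2, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

lr = 0.1
for step in range(200):
    z = X @ w + b                        # logits, shape (2n,)
    loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
    loss.backward()
    with torch.no_grad():                # manual gradient-descent update
        w -= lr * w.grad
        b -= lr * b.grad
        w.grad.zero_()
        b.grad.zero_()

print(loss.item())                       # should decrease toward zero

Removing b from the two lines that use it reproduces the exercise of forcing the decision boundary through the origin.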

If you haven't already, you can start learning Python and PyTorch by following the chapter corresponding to this series.

Code for Skip-Grams#

This is an implementation of the skip-gram algorithm for obtaining static embeddings as studied on this page. It follows the guidelines in the book by Jurafsky and Martin. Open In Colab
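
As a rough orientation before reading the notebook, the following sketch shows the core computation of skip-gram training with negative sampling. The vocabulary size, embedding dimensionality, and number of negative samples are placeholder values, and the notebook's actual code will differ in its details.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder sizes; the notebook's actual values will differ
vocab_size, dim, k = 1000, 50, 5

target_emb = nn.Embedding(vocab_size, dim)    # embeddings for center words
context_emb = nn.Embedding(vocab_size, dim)   # embeddings for context words

def sgns_loss(center, context, negatives):
    # center: (B,), context: (B,), negatives: (B, k) token indices
    v = target_emb(center)                    # (B, dim)
    u = context_emb(context)                  # (B, dim)
    u_neg = context_emb(negatives)            # (B, k, dim)
    pos = F.logsigmoid((v * u).sum(-1))                         # (B,)
    neg = F.logsigmoid(-(u_neg @ v.unsqueeze(-1)).squeeze(-1))  # (B, k)
    return -(pos + neg.sum(-1)).mean()

# A toy batch of random indices stands in for pairs drawn from a corpus
B = 8
loss = sgns_loss(torch.randint(0, vocab_size, (B,)),
                 torch.randint(0, vocab_size, (B,)),
                 torch.randint(0, vocab_size, (B, k)))
loss.backward()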

Code for a Language Model with Feedforward Networks#

This is the implementation of a language model using feedforward networks as studied on this page. It adheres to the equations in the book by Jurafsky and Martin. Open In Colab
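
To fix the general shape of the computation before studying the notebook, here is a minimal feedforward language model: the embeddings of a fixed window of previous tokens are concatenated, passed through one hidden layer, and projected onto the vocabulary. All hyperparameters below are placeholder values.

import torch
import torch.nn as nn

# Placeholder hyperparameters; the notebook's values will differ
vocab_size, context, dim, hidden = 1000, 3, 64, 128

class FFLM(nn.Module):
    """Feedforward LM: concatenate the embeddings of the previous
    `context` tokens and predict a distribution over the next token."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.hidden = nn.Linear(context * dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, idx):               # idx: (B, context)
        e = self.emb(idx).flatten(1)      # (B, context*dim)
        h = torch.relu(self.hidden(e))
        return self.out(h)                # next-token logits: (B, vocab_size)

model = FFLM()
idx = torch.randint(0, vocab_size, (4, context))
logits = model(idx)
next_token = logits.argmax(-1)            # greedy prediction per batch row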

Code for the Transformer#

The transformer (studied in this section of the guide) is presented in three separate notebooks.

Code for a Transformer from the minGPT Project#

A good PyTorch implementation of a transformer-based language model is Andrej Karpathy's minGPT. The code allows for training and using language models and loading the weights of the GPT-2 model. The transformer in our guide is based on minGPT, so the model itself should not be difficult to understand.

This guide's repository has a copy of the minGPT code with minor modifications. Below is a summary of relevant files. You do not need to examine files that are not mentioned. To use and modify the code, you can install it with:

pip install --editable .

Due to changes in external dependencies, the current code may not work as-is. To fix this, modify line 200 of the file mingpt/model.py from:

assert len(keys) == len(sd)

to:

assert len(keys) == len([k for k in sd if not k.endswith(".attn.bias")])

File mingpt/bpe.py#

This file contains the necessary implementation to use the BPE subword model used by GPT-2. Its functionality is discussed later. The main code in the file demonstrates a step-by-step tokenization example of an input string, which you can see by running python bpe.py. The first time the encode or decode methods are called, the files encoder.json and vocab.bpe—containing the vocabulary and subword merge rules used by GPT-2, respectively—are downloaded. These files are stored in the ~/.cache/mingpt directory.

It is not necessary to study the code in this file. Simply know that it allows you to obtain a list of token indices from an input text and retrieve the text associated with a list of token indices output by the model:

import torch
from mingpt.bpe import BPETokenizer

bpe = BPETokenizer()
tokens = bpe("A relaxing cup of café con leche in Plaza Mayor")  # encode
# tokens is a tensor of shape (1, 11)
print(bpe.decode(tokens[0]))
# "A relaxing cup of café con leche in Plaza Mayor"
print(tokens[0].tolist())
# [32, 28175, 6508, 286, 40304, 369, 443, 2395, 287, 23280, 10106]
for token in tokens[0]:
    print(bpe.decode(torch.tensor([token])), end='/')
# A/ relaxing/ cup/ of/ café/ con/ le/che/ in/ Plaza/ Mayor/

File mingpt/utils.py#

It is not necessary to study this file in detail. Simply open it to observe that it defines two utility functions (set_seed and setup_logging) and a class (CfgNode) for managing the model's configuration parameters.

File mingpt/trainer.py#

Study this file, as it contains the general code responsible for training a model. The code is not specific to the transformer architecture and could be applied with minor modifications to other models.
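
For orientation, the typical way the minGPT demos drive this class looks roughly like the sketch below. The toy dataset here is an assumption made up purely to exercise the loop; check trainer.py itself for the exact interface.

import torch
from torch.utils.data import Dataset
from mingpt.model import GPT
from mingpt.trainer import Trainer

class RandomTokens(Dataset):
    """Toy dataset of random token sequences, just to run the loop."""
    def __len__(self):
        return 256
    def __getitem__(self, i):
        seq = torch.randint(0, 100, (17,))
        return seq[:-1], seq[1:]          # (input, target shifted by one)

model_config = GPT.get_default_config()
model_config.model_type = 'gpt-nano'      # smallest predefined model
model_config.vocab_size = 100
model_config.block_size = 16
model = GPT(model_config)

train_config = Trainer.get_default_config()
train_config.max_iters = 100
trainer = Trainer(train_config, model, RandomTokens())

def batch_end_callback(trainer):          # invoked after every batch
    if trainer.iter_num % 10 == 0:
        print(f"iter {trainer.iter_num}: loss {trainer.loss.item():.4f}")

trainer.set_callback('on_batch_end', batch_end_callback)
trainer.run()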

File mingpt/model.py#

This is the most important file for our purposes. However, you can skip the from_pretrained method of the GPT class (which loads GPT-2 weights downloaded from Hugging Face Transformers) and especially the configure_optimizers method (which returns an Adam optimizer whose behavior differs depending on the type of parameter it acts on), as they contain code specific to the GPT-2 system.

Study the CausalSelfAttention and Block classes in detail, as well as the forward, generate, __init__, _init_weights, and get_default_config methods of the GPT class.
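
To fix ideas before diving into the file, here is a stripped-down, single-head version of causal self-attention. It is a minimal sketch, not the file's actual code: mingpt/model.py implements the batched multi-head variant with output projection and dropout.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Single-head causal self-attention reduced to its essentials."""
    def __init__(self, n_embd, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, n_embd)
        self.query = nn.Linear(n_embd, n_embd)
        self.value = nn.Linear(n_embd, n_embd)
        # Lower-triangular mask: position t may only attend to positions <= t
        self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):                               # x: (B, T, n_embd)
        B, T, C = x.shape
        q, k, v = self.query(x), self.key(x), self.value(x)
        att = q @ k.transpose(-2, -1) / math.sqrt(C)    # scores: (B, T, T)
        att = att.masked_fill(self.mask[:T, :T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)                    # attention weights
        return att @ v                                  # (B, T, n_embd)

x = torch.randn(2, 5, 32)
y = CausalSelfAttention(32, 16)(x)                      # y.shape == (2, 5, 32)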

File generate.ipynb#

Study this code, which uses the model to generate text. It is a Python notebook but can be executed from the command line by converting it to a Python script:

pip install nbconvert
jupyter nbconvert --to script generate.ipynb
python generate.py

You can change the model_type variable to use different pre-trained GPT-2 models. From largest to smallest, the available models are gpt2-xl, gpt2-large, gpt2-medium, and gpt2. If you want the code to run on a CPU when no GPU is available, change the device assignment to:

device = 'cuda' if torch.cuda.is_available() else 'cpu'
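
Condensed, what the notebook does amounts to roughly the following; the prompt and sampling parameters here are illustrative, not the notebook's exact values.

import torch
from mingpt.model import GPT
from mingpt.bpe import BPETokenizer

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = GPT.from_pretrained('gpt2')   # downloads the GPT-2 weights
model.to(device)
model.eval()

bpe = BPETokenizer()
x = bpe("The capital of Spain is").to(device)   # (1, T) tensor of indices
y = model.generate(x, max_new_tokens=20, do_sample=True, top_k=40)
print(bpe.decode(y[0]))                          # prompt plus continuation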

File projects/chargpt/chargpt.py#

This code trains a character-level language model using the content of the input.txt file. You can use texts such as Don Quixote or parts of Shakespeare's works as input files.

You can change the C.model.model_type variable to use models of different sizes (from largest to smallest: gpt2-xl, gpt2-large, gpt2-medium, gpt2, gpt-mini, gpt-micro, and gpt-nano). The number of layers, attention heads, and embedding sizes for each model can be found in the GPT class constructor in the mingpt/model.py file.
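
For instance, training the smallest variant is a one-line change in the script's configuration setup (the variable name is the one described above):

C.model.model_type = 'gpt-nano'   # smallest variant: quickest to train, lowest quality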

Run the program and let it train for a while with:

python chargpt.py

The model is saved periodically in the out folder.

Additional Implementations#

The MinT project includes various tutorials with from-scratch implementations of models such as BERT, GPT, BART, and T5. The code is slightly more extensive than what we have studied but can help consolidate knowledge at an advanced stage. The x-transformers project follows a similar approach.

There is some competition among developers to achieve the most compact transformer implementation possible. Some well-known examples are minGPT, nanoGPT, and picoGPT. A notable feature of these implementations is their ability to load GPT-2 weights and perform inference with them. Andrej Karpathy, the developer of minGPT and nanoGPT, has a highly educational video explaining his implementation.