Implementation of Models in PyTorch#
Implementing the different models in code is a complementary approach to studying them from a mathematical perspective. This page presents PyTorch implementations of each of the models covered; the idea is to tackle each implementation after studying the corresponding model conceptually.
Note
This page is part of the series "A step-by-step guide to transformers," which presents a guide to understanding how neural networks process text and how to program them. It is also possible that you arrived here from another source (e.g., a specific course) that suggests a different way to use this content. In that case, follow the recommendations and planning provided by that source.
Code for a Logistic and a Multinomial Regressor#
Here are two PyTorch implementations, each only a few dozen lines long, of the regressors studied on this page. Make sure you understand the code well enough to feel confident modifying it to suit other needs.
Review how to debug Python programs before tackling the code. Also, review how the broadcasting mechanism works in PyTorch.
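As a quick refresher, the following snippet is an illustrative example of broadcasting (not code from the notebooks): a vector is combined with every row of a matrix without an explicit loop.

```python
import torch

X = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # shape (3, 2)
w = torch.tensor([10.0, 100.0])                          # shape (2,)

# w is virtually expanded to shape (3, 2) and multiplied element-wise
print(X * w)           # [[10, 200], [30, 400], [50, 600]]
print((X * w).sum(1))  # row-wise dot products with w: [210, 430, 650]
```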
The two programs in this section are:
- A logistic regressor that classifies two-dimensional synthetic samples into two classes. Only the most basic elements of PyTorch are used to keep the implementation as detailed as possible. As an exercise, trace and analyze the sizes of the tensors. Experiment with the number of training steps and the learning rate to observe how training evolves. Explore various positions of class centers and data dispersion to see how the decision boundary changes. Remove the bias from the equations and observe how forcing the decision boundary to pass through the origin restricts its shape. A minimal sketch of such a regressor appears right after this list.
- A softmax regressor for classifying texts by topic. As an exercise, try training with a single embedding per training step instead of a batch of all embeddings and observe how the error behaves. You can also adapt the previous logistic regressor code to use the PyTorch functions seen in this program.
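As promised above, here is a minimal sketch of a logistic regressor trained with plain gradient descent on two-dimensional synthetic data. The class centers, dispersion, learning rate, and number of steps are illustrative choices, not those of the guide's notebook.

```python
import torch

torch.manual_seed(0)

# Two Gaussian clouds centered at (-1, -1) and (+1, +1)
n = 100
x0 = torch.randn(n, 2) * 0.5 + torch.tensor([-1.0, -1.0])
x1 = torch.randn(n, 2) * 0.5 + torch.tensor([1.0, 1.0])
X = torch.cat([x0, x1])                         # shape (2n, 2)
y = torch.cat([torch.zeros(n), torch.ones(n)])  # shape (2n,)

# Parameters of the regressor: weight vector and bias
w = torch.zeros(2, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr = 0.1

for step in range(200):
    z = X @ w + b                 # logits, shape (2n,)
    p = torch.sigmoid(z)          # predicted probabilities
    loss = -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()
    loss.backward()
    with torch.no_grad():         # manual gradient-descent update
        w -= lr * w.grad
        b -= lr * b.grad
        w.grad.zero_()
        b.grad.zero_()

print(f"final loss {loss.item():.3f}, w {w.tolist()}, b {b.item():.3f}")
```

Removing b from the two lines that use it reproduces the exercise in which the decision boundary is forced through the origin.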
If you haven't already, you can start learning Python and PyTorch by following the corresponding chapter of this series.
Code for Skip-Grams#
This is an implementation of the skip-gram algorithm for obtaining static embeddings as studied on this page. It follows the guidelines in the book by Jurafsky and Martin.
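To make the training objective concrete, here is a hedged sketch of the skip-gram with negative sampling loss; the vocabulary size, embedding dimension, and the way the (target, context, negatives) batch is produced are illustrative assumptions, not the notebook's actual code.

```python
import torch
import torch.nn as nn

vocab_size, dim = 1000, 50
target_emb = nn.Embedding(vocab_size, dim)   # embeddings for center words
context_emb = nn.Embedding(vocab_size, dim)  # embeddings for context words

def sgns_loss(target, context, negatives):
    """target: (B,) ids, context: (B,) ids, negatives: (B, K) ids."""
    t = target_emb(target)        # (B, D)
    c = context_emb(context)      # (B, D)
    neg = context_emb(negatives)  # (B, K, D)
    # maximize similarity with the observed context word...
    pos_term = torch.log(torch.sigmoid((t * c).sum(-1)))                               # (B,)
    # ...and minimize similarity with K randomly sampled negative words
    neg_term = torch.log(torch.sigmoid(-(neg @ t.unsqueeze(-1)).squeeze(-1))).sum(-1)  # (B,)
    return -(pos_term + neg_term).mean()

# Toy batch: 4 center words with one context word and 5 negatives each
loss = sgns_loss(torch.randint(0, vocab_size, (4,)),
                 torch.randint(0, vocab_size, (4,)),
                 torch.randint(0, vocab_size, (4, 5)))
loss.backward()
```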
Code for a Language Model with Feedforward Networks#
This is the implementation of a language model using feedforward networks as studied on this page. It adheres to the equations in the book by Jurafsky and Martin.
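For orientation, here is a minimal sketch of a fixed-window feedforward language model in the spirit of those equations; the window length, layer sizes, and vocabulary size are illustrative choices, not the guide's exact configuration.

```python
import torch
import torch.nn as nn

class FFLM(nn.Module):
    def __init__(self, vocab_size=1000, dim=64, window=3, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.ff = nn.Sequential(
            nn.Linear(window * dim, hidden),  # acts on the concatenated context embeddings
            nn.ReLU(),
            nn.Linear(hidden, vocab_size),    # scores over the next token
        )

    def forward(self, context):                # context: (B, window) token ids
        e = self.emb(context)                  # (B, window, dim)
        return self.ff(e.flatten(1))           # (B, vocab_size) logits

model = FFLM()
contexts = torch.randint(0, 1000, (8, 3))      # batch of 8 windows of 3 tokens
targets = torch.randint(0, 1000, (8,))         # the token that follows each window
loss = nn.functional.cross_entropy(model(contexts), targets)
```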
Code for the Transformer#
The transformer (studied in this section of the guide) is presented in three separate notebooks (a short sketch after the list contrasts the encoder and decoder uses):
- One that contains the base architecture and implementations for both an encoder-based model and a decoder-based model.
- Another that applies the decoder to a language model predicting the next token in a sequence.
- And one based on the encoder to build a named entity recognition system.
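As a rough illustration of how the last two notebooks differ, the sketch below places a next-token head and a token-classification (NER) head on top of PyTorch's built-in transformer layers; it is a stand-in for the guide's own architecture, not its code. The only structural difference between the two uses is the causal attention mask.

```python
import torch
import torch.nn as nn

vocab_size, num_labels, dim = 1000, 9, 64
emb = nn.Embedding(vocab_size, dim)
trunk = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2,
)
lm_head = nn.Linear(dim, vocab_size)   # decoder use: predict the next token
ner_head = nn.Linear(dim, num_labels)  # encoder use: label every token

tokens = torch.randint(0, vocab_size, (2, 10))                        # (batch, seq)
causal = torch.triu(torch.full((10, 10), float("-inf")), diagonal=1)  # hide future positions

h_dec = trunk(emb(tokens), mask=causal)  # each position sees only the past
h_enc = trunk(emb(tokens))               # each position sees the whole input

next_token_logits = lm_head(h_dec)       # (2, 10, vocab_size)
ner_logits = ner_head(h_enc)             # (2, 10, num_labels)
```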
Code for a Transformer from the minGPT Project#
A good PyTorch implementation of a transformer-based language model is Andrej Karpathy's minGPT. The code allows for training and using language models and loading the weights of the GPT-2 model. The transformer in our guide is based on minGPT, so the model itself should not be difficult to understand.
This guide's repository has a copy of the minGPT code with minor modifications. Below is a summary of relevant files. You do not need to examine files that are not mentioned. To use and modify the code, you can install it with:
Due to changes in external dependencies, the current code may not work as-is. To fix this, modify line 200 of the file mingpt/model.py
from:
to:
File mingpt/bpe.py#
This file contains the necessary implementation to use the BPE subword model used by GPT-2. Its functionality is discussed later. The main code in the file demonstrates a step-by-step tokenization example of an input string, which you can see by running python bpe.py. The first time the encode or decode methods are called, the files encoder.json and vocab.bpe (containing the vocabulary and subword merge rules used by GPT-2, respectively) are downloaded. These files are stored in the ~/.cache/mingpt directory.
It is not necessary to study the code in this file. Simply know that it allows you to obtain a list of token indices from an input text and retrieve the text associated with a list of token indices output by the model:
```python
import torch
from mingpt.bpe import BPETokenizer

bpe = BPETokenizer()
tokens = bpe("A relaxing cup of café con leche in Plaza Mayor")  # encode
# tokens is a tensor of shape (1, 11)
print(bpe.decode(tokens[0]))
# "A relaxing cup of café con leche in Plaza Mayor"
print(tokens[0].tolist())
# [32, 28175, 6508, 286, 40304, 369, 443, 2395, 287, 23280, 10106]
for token in tokens[0]:
    print(bpe.decode(torch.tensor([token])), end='/')
# A/ relaxing/ cup/ of/ café/ con/ le/che/ in/ Plaza/ Mayor/
```
File mingpt/utils.py#
It is not necessary to study this file in detail. Simply open it to observe that it defines two utility functions (set_seed
and setup_logging
) and a class (CfgNode
) for managing the model's configuration parameters.
File mingpt/trainer.py#
Study this file, as it contains the general code responsible for training a model. The code is not specific to the transformer architecture and could be applied with minor modifications to other models.
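To give an idea of what such general-purpose training code looks like, here is a simplified sketch of a model-agnostic loop; the dataloader, hyperparameters, and helper names are placeholders rather than the project's actual classes, although the sketch does assume the minGPT convention that the model returns a (logits, loss) pair.

```python
import torch

def train(model, dataloader, steps=1000, lr=3e-4, device="cpu"):
    model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    data_iter = iter(dataloader)
    for step in range(steps):
        try:
            x, y = next(data_iter)
        except StopIteration:        # restart the dataloader when it is exhausted
            data_iter = iter(dataloader)
            x, y = next(data_iter)
        x, y = x.to(device), y.to(device)
        logits, loss = model(x, y)   # assumes a minGPT-style (logits, loss) return value
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip exploding gradients
        optimizer.step()
        if step % 100 == 0:
            print(f"step {step}: loss {loss.item():.4f}")
```

Nothing in the loop depends on the transformer architecture, which is why the same trainer can be reused for other models.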
File mingpt/model.py#
The most important file for our purposes. However, you can skip the from_pretrained method of the GPT class (it incorporates GPT-2 weights downloaded from Hugging Face Transformers) and especially the configure_optimizers method (it returns an Adam optimizer with different behavior depending on the type of parameter it acts upon), as they contain code specific to the GPT-2 system.
Study the CausalSelfAttention and Block classes in detail, as well as the forward, generate, __init__, _init_weights, and get_default_config methods of the GPT class.
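As a pocket reference while reading CausalSelfAttention, here is a single-head sketch of masked (causal) self-attention; the real class adds multiple heads, dropout, and an output projection, so treat this only as the core idea.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCausalAttention(nn.Module):
    def __init__(self, dim, max_len=128):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)  # joint projection to queries, keys, and values
        # lower-triangular mask: position i may only attend to positions <= i
        self.register_buffer("mask", torch.tril(torch.ones(max_len, max_len)))

    def forward(self, x):                   # x: (batch, seq, dim)
        B, T, D = x.shape
        q, k, v = self.qkv(x).split(D, dim=2)
        att = (q @ k.transpose(1, 2)) / math.sqrt(D)                  # (B, T, T) attention scores
        att = att.masked_fill(self.mask[:T, :T] == 0, float("-inf"))  # hide future positions
        att = F.softmax(att, dim=-1)
        return att @ v                      # weighted sum of value vectors, (B, T, D)

y = TinyCausalAttention(dim=16)(torch.randn(2, 10, 16))
```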
File generate.ipynb#
Study this code, which uses the model to generate text. It is a Python notebook but can be executed from the command line by converting it to a Python script:
You can change the model_type variable to use different pre-trained GPT-2 models. From largest to smallest, the available models are gpt2-xl, gpt2-large, gpt2-medium, and gpt2. If you want to run the code on a CPU, change the device value to:
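Conceptually, the notebook drives a sampling loop like the simplified sketch below through the model's generate method; the real method also crops the context to the model's block size and supports top-k filtering, and the model here is only a placeholder assumed to return a (logits, loss) pair.

```python
import torch

@torch.no_grad()
def sample(model, idx, max_new_tokens=20, temperature=1.0):
    """idx: (1, T) tensor of token ids; returns the sequence extended with new tokens."""
    for _ in range(max_new_tokens):
        logits, _ = model(idx)                    # assumes a (logits, loss) return value
        logits = logits[:, -1, :] / temperature   # keep only the last position's scores
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # sample one token id
        idx = torch.cat([idx, next_id], dim=1)    # append it and continue
    return idx
```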
File projects/chargpt/chargpt.py#
This code trains a character-level language model using the content of the input.txt file. You can use texts such as Don Quixote or parts of Shakespeare's works as input files.
You can change the C.model.model_type variable to use models of different sizes (from largest to smallest: gpt2-xl, gpt2-large, gpt2-medium, gpt2, gpt-mini, gpt-micro, and gpt-nano). The number of layers, attention heads, and embedding sizes for each model can be found in the GPT class constructor in the mingpt/model.py file.
Run the program and let it train for a while with:
The model is saved periodically in the out folder.
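To see what character-level modeling means in practice, here is an illustrative sketch of the kind of data preparation such a script performs on input.txt: build a character vocabulary and cut the text into (context, next-character) pairs. The block size and the helper function are assumptions made for the example, not the script's actual code.

```python
import torch

text = open("input.txt", encoding="utf-8").read()
chars = sorted(set(text))                      # character vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> id
itos = {i: ch for ch, i in stoi.items()}       # id -> character

block_size = 128
data = torch.tensor([stoi[ch] for ch in text], dtype=torch.long)

def get_example(i):
    """One training pair: block_size characters and the same block shifted by one."""
    x = data[i : i + block_size]
    y = data[i + 1 : i + block_size + 1]       # targets are simply the next characters
    return x, y

x, y = get_example(0)
print("".join(itos[int(t)] for t in x[:40]))   # first characters of the first context
```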
Additional Implementations#
The MinT project includes various tutorials with from-scratch implementations of models such as BERT, GPT, BART, and T5. The code is somewhat more extensive than what we have studied, but it can help consolidate knowledge at a more advanced stage. The x-transformers project follows a similar approach.
There is some competition among developers to achieve the most compact transformer implementation possible. Some notable ones are minGPT, nanoGPT, and picoGPT. A notable feature of these implementations is their ability to load GPT-2 weights and perform inference. Andrej Karpathy, the developer of minGPT and nanoGPT, has a highly educational video explaining his implementation.