# The transformer architecture

<a target="_blank" href="https://colab.research.google.com/github/jaspock/me/blob/main/docs/materials/transformers/assets/notebooks/transformer.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
<a href="http://dlsi.ua.es/~japerez/"><img src="https://img.shields.io/badge/Universitat-d'Alacant-5b7c99" style="margin-left:10px"></a>

Text written by Juan Antonio Pérez in 2023-2024. Code modified after the original code in [minGPT](https://github.com/karpathy/minGPT/) by Andrej Karpathy.

This notebook presents the low-level bricks of the transformer architecture. PyTorch already provides ready-to-use high-level implementations of the transformer architecture that you can use in your projects, such as the classes `torch.nn.Transformer`, `torch.nn.TransformerDecoder` or `torch.nn.TransformerEncoderLayer`. Moving a little deeper, you may use modules such as `torch.nn.MultiheadAttention` or `torch.nn.LayerNorm` and write only the code that connects these components together. In spite of this catalog, our view is that fully understanding the transformer architecture requires understanding its low-level details, as they reveal what is going on behind the scenes. Therefore, here we go through the transformer architecture step by step, from the most basic building blocks to the full architecture.

It is assumed that you are already familiar with the basics of PyTorch. This notebook complements a [learning guide](https://dlsi.ua.es/~japerez/materials/transformers/intro/) based on studying the math behind the models by reading the book "[Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/)" (3rd edition) by Jurafsky and Martin. It is part of a series of notebooks which are supposed to be incrementally studied, so make sure you follow the right order. If your learning is being supervised by a teacher, follow the additional instructions that you may have received. Although you may use a GPU environment to execute the code, the computational requirements for the default settings are so low that you can probably run it on CPU.

In this notebook, we will incrementally build the transformer architecture, starting from the basic building blocks and ending with an abstract class (a class that is not intended to be instantiated but to be inherited from) called `AbstractTransformer` that implements the full transformer architecture. We will then use this abstract class to implement encoder-only and decoder-only models. The notebook contains a block of code that feeds a randomly-initialized decoder-like transformer with a random sequence of tokens and prints the index of the next token predicted at each step. This simply checks that the code does not crash and has no other purpose. A couple of additional notebooks will exploit our `DecoderTransformer` and `EncoderTransformer` classes to implement a language model and a named entity recognition (NER) system, respectively.

In [None]:
%%capture
%pip install torch numpy

Note how we set the seed of the random number generators to a fixed value to ensure reproducibility (that is, to ensure that the same results are obtained every time the code is run). As there are a number of libraries involved in our code, we need to set the seed of each of them. Besides that, `torch.use_deterministic_algorithms` is mandatory as some methods of PyTorch are not deterministic even with a fixed seed.
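
As a minimal sketch of what fixing the seed buys us, the following lines (which use only `torch.manual_seed` and are independent of the `set_seed` helper of this notebook) draw the same random tensor twice after re-seeding, and a different one when the generator is simply allowed to advance.

```python
import torch

torch.manual_seed(42)
a = torch.rand(3)

torch.manual_seed(42)  # re-seeding restarts the sequence of random numbers
b = torch.rand(3)
c = torch.rand(3)      # the generator has advanced, so new values are drawn

print(torch.equal(a, b))  # True
print(torch.equal(a, c))  # False
```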

📘 *Documentation:* notes on [reproducibility](https://pytorch.org/docs/stable/notes/randomness.html) in PyTorch.

In [None]:
import os
# set before importing pytorch to avoid all non-deterministic operations on GPU
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import random
import numpy as np
import torch

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.use_deterministic_algorithms(True)
    
set_seed(42)  # to ensure reproducibility

## Layer normalization

The implementation follows the formulae in Jurafsky and Martin's book and barely introduces concepts that you have not seen before. The `keepdim` parameter of the `mean` and `std` functions keeps the reduced dimension in the resulting tensor, although with size 1. As this introduces a number of interesting aspects, we start with a simple example.

### Keepdim and broadcasting

In the following sample code, `x - m` will throw an error because the dimensions of `x` and `m` do not match. However, `x - m2` will work because the dimensions of `x` and `m2` allow the mechanism of broadcasting to activate. The broadcasting mechanism is explained in the [PyTorch documentation](https://pytorch.org/docs/stable/notes/broadcasting.html). Read it carefully and make sure you understand it. In our case, `x` with shape `(2, 3)` and `m` with shape `(2,)` have incompatible dimensions as they do not follow the rule that when iterating over the dimension sizes, starting at the last dimension, the dimension sizes must either be equal, one of them is 1, or one of them does not exist. In particular, 3 and 2 are not equal. However, `x` with shape `(2, 3)` and `m2` with shape `(2, 1)` do follow the rule because 3 and 1 are not equal but one of them is 1, and then 2 and 2 are equal. The broadcasting mechanism will therefore expand the dimension of size 1 to match the dimension of size 3 (the resulting dimension size is the max of the sizes of the two tensors along that dimension), and then the subtraction will be performed element-wise.

Note that without the argument `dtype`, `x` would be a tensor of longs. A different way to make it a tensor of floats would be to use `torch.tensor([[1., 2, 3], [4, 5, 6]])`, where at least one of the elements is a float.

Recall that we usually represent the dimensions of a tensor as a Python tuple, and the `(2,)` tuple denotes a vector with size 2, where the extra comma is used to distinguish it from the Python expression `(2)`, which is just the integer 2. When representing this shape with a list (something like `[2]`), there is no need for the extra comma as the expression is not ambiguous.

Finally, the -1 passed as the `dim` argument of the `mean` and `std` functions indicates that the mean and standard deviation should be computed over the last dimension of the tensor. In the case of a matrix such as `x`, whose first dimension indexes the rows and whose second dimension indexes the columns, the reduction runs across the columns, so one mean and one standard deviation are obtained for each row.
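
The following short sketch contrasts a reduction over the last dimension with a reduction over the first one. The variable names `row_means` and `col_means` are only illustrative and are not used elsewhere in the notebook.

```python
import torch

x = torch.tensor([[1., 2., 3.], [4., 5., 6.]])  # shape (2, 3)

row_means = x.mean(dim=-1)  # collapses the last dimension: one mean per row
col_means = x.mean(dim=0)   # collapses the first dimension: one mean per column

print(row_means, row_means.shape)  # values 2 and 5, shape (2,)
print(col_means, col_means.shape)  # values 2.5, 3.5 and 4.5, shape (3,)
```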

📘 *Documentation*: [torch.mean](https://pytorch.org/docs/stable/generated/torch.mean.html), [torch.std](https://pytorch.org/docs/stable/generated/torch.std.html), [broadcasting](https://pytorch.org/docs/stable/notes/broadcasting.html) in PyTorch

In [None]:
import torch

x = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.float32)
print(f"x = {x}")
print(f"x.shape = {x.shape}")

m = x.mean(-1)
print(f"m = {m}")
print(f"m.shape = {m.shape}")
# print(f"x - m = {x - m}")  # raises an error

m2 = x.mean(-1, keepdim=True)
print(f"m2 = {m2}")
print(f"m2.shape = {m2.shape}")
print(f"x - m2 = {x - m2}")

After this temporary digression, we are ready to explore the code of the layer normalization module.
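
Before reading the module, it may help to see the normalization written as plain tensor operations. The sketch below is not the class used in this notebook: it assumes the population variance and places `eps` inside the square root, which is the convention of `torch.nn.functional.layer_norm`, and checks the result against that function. The class that follows relies on `x.std`, which by default uses the unbiased estimator, and adds `eps` to the standard deviation, so its outputs may differ slightly from this reference.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[1., 2., 3.], [4., 5., 6.]])
eps = 1e-5  # default value of eps in F.layer_norm

mean = x.mean(-1, keepdim=True)                # shape (2, 1)
var = x.var(-1, keepdim=True, unbiased=False)  # population variance, shape (2, 1)
manual = (x - mean) / torch.sqrt(var + eps)    # broadcasting over the last dimension

reference = F.layer_norm(x, normalized_shape=(3,))  # no learnable gain or bias

print(torch.allclose(manual, reference, atol=1e-6))
print(manual.mean(-1))  # approximately 0 for every row
```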

In [None]:
import torch.nn as nn
import torch.nn.functional as F
import math

class LayerNorm(nn.Module):
    def __init__(self, size, eps=1e-6):
        super().__init__()
        self.a = nn.Parameter(torch.ones(size))
        self.b = nn.Parameter(torch.zeros(size))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a * (x - mean) / (std + self.eps) + self.b

## Attention in a single head

The implementation of the self-attention mechanism in a single head is in principle relatively easy to understand. The constructor of the class receives the dimensionality of the embeddings $d$ in variable `n_embd`. Before the multi-headed attention is discussed in the book, this is the only dimensionality that we have to deal with. However, after the multi-headed attention is introduced, we will have to deal with two different dimensionality values: the dimensionality of the *main* embeddings $d$ and the dimensionality of the *head* embeddings $d'$, which are usually (but not always) related by the formula $d = h \times d'$, where $h$ is the number of heads. In the code, this value $d'$ is represented by the variable `n_embd_head`. The dimensionality of the head embeddings $d'$ can additionally be different for the queries and keys ($d_k$) and for the values ($d_v$). However, we will simplify this and assume a single `n_embd_head` value for all of them.

To help you follow the shapes of the different tensors in the `forward` function, we have used the common notation of using `B` for the first dimension (batch size), `T` for the second dimension (token or sequence length) and `C` for the third dimension (main embedding dimension). This third dimension is usually denoted as *channels* in the context of convolutional networks which explains the use of the letter `C`. We denote the dimensionality of the head embeddings as `C'`.

The `@` operator is equivalent to the `torch.matmul` function. Although we have previously used the function rather than the operator, we retain the operator here to familiarize you with both.
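
As a quick check of this equivalence on batched tensors, the following sketch computes the query-key dot products both ways. The shapes are arbitrary small values chosen for illustration.

```python
import torch

torch.manual_seed(0)
q = torch.randn(2, 4, 8)  # (B, T, C')
k = torch.randn(2, 4, 8)  # (B, T, C')

scores_op = q @ k.transpose(-2, -1)               # (B, T, C') @ (B, C', T) -> (B, T, T)
scores_fn = torch.matmul(q, k.transpose(-2, -1))  # same computation, function form

print(scores_op.shape)
print(torch.allclose(scores_op, scores_fn))
```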

### Mask

A few remarks are relevant regarding the mask received as a parameter. The mask is a tensor of shape `(B, T, T)` (or something that may be broadcast to that shape) that restricts the keys with which the queries interact in the dot product. Given that the mask can take various forms depending on the transformer's application, we delegate the mask's definition to other parts of the code (see, for instance, the `DecoderTransformer` and `EncoderTransformer` classes further down). For example, when a decoder-only transformer is used for text generation, this mask prevents the $i$-th query from attending to the $j$-th key for $j > i$; there is no need to have a different mask for each sample in the mini-batch as the mask is the same for all of them and, therefore, its shape will probably be `(1, T, T)`. In the case of encoder-based bidirectional models, we may have mini-batches of sentences of different lengths, and the mask will be used to prevent the attention mechanism from attending to the padding tokens; the shape of the mask will then be `(B, T, T)` because each sample in the mini-batch may have a different number of padding tokens.

The mask is a boolean tensor containing `True` in the positions that must be ignored by the attention operation. Notice in the code how it is more efficient to compute the dot products of all queries with all keys and then apply the mask to discard the undesired ones before calculating the softmax. The discard is achieved by writing $-\infty$ into the masked positions of the score matrix, as the softmax then assigns a weight of zero to these elements, effectively ignoring them. The method `torch.Tensor.masked_fill_` fills, in place, the elements of a tensor with a given value at the positions where another tensor is `True`. Review the broadcasting rules of PyTorch to understand how this works for the two different shapes of masks mentioned before.
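
The effect of the mask on the softmax can be seen in a small sketch. The scores are random stand-ins for the scaled dot products and the mask is written by hand with shape `(1, T, T)`; both are assumptions made only for this illustration. The out-of-place `masked_fill` is used here instead of `masked_fill_`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T = 2, 3
att = torch.randn(B, T, T)  # stand-in for the scaled query-key dot products

# hand-written mask of shape (1, T, T); True marks the key positions to discard
mask = torch.tensor([[[False, True,  True],
                      [False, False, True],
                      [False, False, False]]])

att = att.masked_fill(mask, float('-inf'))  # (1, T, T) is broadcast to (B, T, T)
weights = F.softmax(att, dim=-1)            # masked positions get weight 0

print(weights[0])
print(weights.sum(dim=-1))  # every row still sums to 1
```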

### Dropout

Dropout is a regularization technique used in neural networks to prevent overfitting. It works by randomly setting a fraction of input units to 0 at each update during training time, which helps to prevent neurons from co-adapting too much, thereby encouraging the model to learn more robust and generalized representations. `torch.nn.Dropout` creates a so-called *dropout layer* whose argument specifies the probability $p$ of an element being zeroed. During training, dropout randomly deactivates a proportion of neurons and scales up the remaining ones by $1/(1-p)$ so that the total *energy* remains the same in expectation. During inference (testing), dropout should not be applied; this is why it is important to call `model.eval()` before inference. In the transformer model, dropout is applied at different points in the architecture as you can see in the code.
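
The difference between training and evaluation modes can be observed on an isolated dropout layer. The sketch below assumes $p = 0.5$ and an input of ones so that the scaling factor is easy to read.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(2, 8)

drop.train()       # training mode: each element is zeroed with probability p
y_train = drop(x)  # surviving elements are scaled by 1 / (1 - p) = 2

drop.eval()        # evaluation mode: dropout becomes the identity function
y_eval = drop(x)

print(y_train)  # only zeros and twos
print(y_eval)   # identical to x
```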

📘 *Documentation:* [`torch.nn.Linear`](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html), [`torch.matmul`](https://pytorch.org/docs/stable/generated/torch.matmul.html), [`torch.Tensor.masked_fill_`](https://pytorch.org/docs/stable/generated/torch.Tensor.masked_fill_.html), [`torch.nn.Dropout`](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html), [`torch.nn.functional.softmax`](https://pytorch.org/docs/stable/generated/torch.nn.functional.softmax.html), [`torch.transpose`](https://pytorch.org/docs/stable/generated/torch.transpose.html), [`torch.Tensor.view`](https://pytorch.org/docs/stable/generated/torch.Tensor.view.html)

In [None]:
class HeadAttention(nn.Module):
    def __init__(self, n_embd, n_embd_head, attn_pdrop=0.1):
        super().__init__()
        self.q_lin = nn.Linear(n_embd, n_embd_head)
        self.k_lin = nn.Linear(n_embd, n_embd_head)
        self.v_lin = nn.Linear(n_embd, n_embd_head)
        self.attn_dropout = nn.Dropout(attn_pdrop)

    def forward(self, x, mask): 
        B, T, C = x.size()  # batch size, sequence length, main embedding dim, C' = head embedding dim
        q = self.q_lin(x)  # (B, T, C) -> (B, T, C')
        k = self.k_lin(x)
        v = self.v_lin(x)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att.masked_fill_(mask, float('-inf'))
        att = F.softmax(att, dim=-1)
        att = self.attn_dropout(att)
        return att @ v  # (B, T, T) @ (B, T, C') -> (B, T, C')

## Multi-head attention (easy version)

The previous `HeadAttention` class and the following `MultiHeadAttentionSimple` class allow us to implement a naïve version of the multi-headed attention mechanism. It has the advantage of being easy to understand, but it is not very efficient. The reason is that the multi-headed attention mechanism is usually implemented as a series of single matrix multiplications (*one* to obtain the queries of *all* heads, one to obtain the keys of all heads, one to obtain the values of all heads, and one to obtain all the dot products), whereas this version performs separate matrix multiplications for each head (note the loop over the heads in the `forward` function). A more efficient implementation comes after these cells.

The constructor of the class receives the dimensionality of the embeddings $d$ in variable `n_embd` and the number of heads $h$ in variable `n_head`. The `assert` statement checks that the dimensionality of the embeddings is a multiple of the number of heads. This is not strictly necessary, but it is a requirement in our case as we compute the dimensionality of the head embeddings as `n_embd // n_head`, which is the integer division of `n_embd` by `n_head`. 

The `ModuleList` class is a container of `nn.Module` objects that registers the modules it contains. Modules stored in a native Python list would not be registered as submodules. Consequently, their parameters would not be returned by `parameters()` (and hence would not be updated by an optimizer), they would not follow mode switches between training and evaluation, and they would not be saved and loaded with the model, unlike modules in `nn.ModuleList`. The `torch.nn.ModuleList` class is a subclass of the `torch.nn.Module` class, so it can be used as a module itself.
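
The registration issue can be verified by counting parameters. The two toy classes below, `WithPlainList` and `WithModuleList`, are hypothetical and exist only for this comparison.

```python
import torch.nn as nn

class WithPlainList(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(4, 4) for _ in range(3)]  # not registered

class WithModuleList(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])  # registered

n_plain = sum(p.numel() for p in WithPlainList().parameters())
n_registered = sum(p.numel() for p in WithModuleList().parameters())

# each Linear(4, 4) has 16 weights and 4 biases, hence 3 * 20 = 60 parameters
print(n_plain, n_registered)  # 0 60
```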

The `for` loop in the `forward` function applies the `HeadAttention` module to each head and concatenates the results along the last dimension. The `c_proj` module is used to project the concatenated results to the original dimensionality of the embeddings. The `resid_dropout` module is used to apply dropout to the output of the projection.

In [None]:
class MultiHeadAttentionSimple(nn.Module):
    def __init__(self, n_embd, n_head, attn_pdrop=0.1, resid_pdrop=0.1):
        super().__init__()
        assert n_embd % n_head == 0
        self.heads = nn.ModuleList([HeadAttention(n_embd, n_embd // n_head, attn_pdrop) for _ in range(n_head)])
        self.c_proj = nn.Linear(n_embd, n_embd)  # output projection to integrate head outputs
        self.resid_dropout = nn.Dropout(resid_pdrop)

    def forward(self, x, mask):
        y = torch.cat([h(x, mask) for h in self.heads], dim=-1)  # [(B,T,C'), (B,T,C'), ...] -> (B, T, C)
        y = self.resid_dropout(self.c_proj(y))
        return y

## Multi-head attention (efficient yet more complex version)

The study of this class is left as an exercise for the advanced reader. The main difference with the `MultiHeadAttentionSimple` version is that the multi-headed attention mechanism is implemented in a series of single matrix multiplications, instead of a loop iterating over the heads. 
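
As a starting point for that exercise, the following sketch isolates the reshaping trick on a random tensor: a `view` followed by a `transpose` splits the last dimension into heads, and the inverse sequence puts the heads back side by side. The sizes are arbitrary values chosen for illustration.

```python
import torch

torch.manual_seed(0)
B, T, C, H = 2, 5, 8, 2
Cp = C // H  # C', the per-head dimension

x = torch.randn(B, T, C)
heads = x.view(B, T, H, Cp).transpose(1, 2)                # (B, T, C) -> (B, H, T, C')
merged = heads.transpose(1, 2).contiguous().view(B, T, C)  # (B, H, T, C') -> (B, T, C)

print(heads.shape)
print(torch.equal(merged, x))  # the round trip loses nothing
# head 0 holds the first C' features of every token, head 1 the next C', and so on
print(torch.equal(heads[:, 0], x[..., :Cp]))
```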

📘 *Documentation:* [`torch.nn.Linear`](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html), [`torch.split`](https://pytorch.org/docs/stable/generated/torch.split.html), [`torch.transpose`](https://pytorch.org/docs/stable/generated/torch.transpose.html), [`torch.Tensor.view`](https://pytorch.org/docs/stable/generated/torch.Tensor.view.html), [`torch.Tensor.contiguous`](https://pytorch.org/docs/stable/generated/torch.Tensor.contiguous.html), [`torch.matmul`](https://pytorch.org/docs/stable/generated/torch.matmul.html), [`torch.Tensor.masked_fill_`](https://pytorch.org/docs/stable/generated/torch.Tensor.masked_fill_.html), [`torch.nn.functional.softmax`](https://pytorch.org/docs/stable/generated/torch.nn.functional.softmax.html)

In [None]:
class MultiHeadAttentionEfficient(nn.Module):
    def __init__(self, n_embd, n_head, attn_pdrop=0.1, resid_pdrop=0.1):
        super().__init__()
        assert n_embd % n_head == 0
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)
        self.c_proj = nn.Linear(n_embd, n_embd)
        self.attn_dropout = nn.Dropout(attn_pdrop)
        self.resid_dropout = nn.Dropout(resid_pdrop)
        self.n_head = n_head
        self.n_embd = n_embd

    def forward(self, x, mask=None):
        B, T, C = x.size()
        H = self.n_head
        Cp = C // H  # C', the per-head embedding dimension
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)  # three tensors of shape (B, T, C)
        k = k.view(B, T, H, Cp).transpose(1, 2)  # (B, H, T, C')
        q = q.view(B, T, H, Cp).transpose(1, 2)  # (B, H, T, C')
        v = v.view(B, T, H, Cp).transpose(1, 2)  # (B, H, T, C')

        # self-attention: (B, H, T, C') x (B, H, C', T) -> (B, H, T, T)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        if mask is not None:
            # add a head dimension: (1, T, T) or (B, T, T) -> (1, 1, T, T) or (B, 1, T, T)
            att.masked_fill_(mask.unsqueeze(1), float('-inf'))
        att = F.softmax(att, dim=-1)
        att = self.attn_dropout(att)
        y = att @ v # (B, H, T, T) x (B, H, T, C') -> (B, H, T, C')
        y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side
        y = self.resid_dropout(self.c_proj(y))
        return y

## Selection of the attention implementation

Set the following assignment to choose the attention mechanism to use in your transformer: the naïve and inefficient version provided for educational purposes or the more complex and efficient version.

In [None]:
# choose one and comment the other:
MultiHeadAttention = MultiHeadAttentionSimple
# MultiHeadAttention = MultiHeadAttentionEfficient

## Layer block

Each layer of the transformer is made of a multi-headed attention mechanism followed by a feed-forward network, each of them preceded by a layer normalization module and wrapped in a residual connection. The `Block` class implements this layer block.

The only new concept here is the `torch.nn.ModuleDict` class, which is a dictionary that can be used as a module. In the same way as a native Python list is usually not an alternative to a `torch.nn.ModuleList`, a native Python dictionary is usually not an alternative to a `torch.nn.ModuleDict`. Modules in a module list are accessed via integer indices, whereas modules in a module dictionary are accessed via string keys. Note that the use of `ModuleDict` is not strictly necessary here, but we leave it so that you can familiarize yourself with it. Most of the time, a module dictionary groups together several modules that we sometimes want to access independently but that we also want to handle as a whole, for example, to obtain all their parameters at once.

To simplify the logic of the code and the number of function parameters, we hardcode the dimensionality of the hidden layer of the feed-forward network to 4 times the dimensionality of the embeddings.

📘 *Documentation:* [`torch.nn.ModuleDict`](https://pytorch.org/docs/stable/generated/torch.nn.ModuleDict.html)

In [None]:
class Block(nn.Module):
    def __init__(self, n_embd, n_head, attn_pdrop, resid_pdrop):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = MultiHeadAttention(n_embd, n_head, attn_pdrop, resid_pdrop)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = nn.ModuleDict(dict(
            c_fc    = nn.Linear(n_embd, 4 * n_embd),  # ffw hidden layer size is fixed to 4*n_embd
            c_proj  = nn.Linear(4 * n_embd, n_embd),
            act     = nn.GELU(),
            dropout = nn.Dropout(resid_pdrop),
        ))
        
    def forward(self, x, mask):
        x = x + self.attn(self.ln_1(x), mask)
        x = x + self.mlp.dropout(self.mlp.c_proj(self.mlp.act(self.mlp.c_fc(self.ln_2(x)))))
        return x

## The transformer architecture

The previous building blocks are most of the ingredients we need to implement the transformer architecture. 

Our `AbstractTransformer` class inherits from `nn.Module` and `ABC`. The former is the base class for all neural network modules, as you already know, while the latter is the base class for abstract classes in Python. The `abc` module, where `ABC` is defined, also provides the `abstractmethod` decorator, which marks a method as abstract so that it must be implemented in the subclasses. The class is an incomplete module (for example, it lacks output heads and a definition of the mask) and, consequently, cannot be instantiated. Next, we will inherit from this class to implement encoder-only and decoder-only transformers.
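
The pattern can be tried in isolation with two toy classes. `Base` and `Child` are hypothetical names used only in this sketch; note how the abstract `forward` still contains shared logic that the subclass reaches through `super()`.

```python
from abc import ABC, abstractmethod
import torch
import torch.nn as nn

class Base(nn.Module, ABC):  # hypothetical toy class, for illustration only
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(4, 4)

    @abstractmethod
    def forward(self, x):
        return self.lin(x)  # shared logic, reachable via super().forward

class Child(Base):
    def forward(self, x):
        return super().forward(x) * 2

try:
    Base()  # fails because forward is abstract
    instantiated = True
except TypeError as err:
    instantiated = False
    print(f"TypeError: {err}")

out = Child()(torch.zeros(1, 4))
print(out.shape)
```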

We define embeddings for tokens and positions. Although the original transformer used fixed sinusoidal functions for the positional embeddings, it is more common these days to simply learn them after random initialization, as we do here. Note that the loop in the `forward` function cannot be avoided, because each layer takes the output of the previous one as its input.

The `max_len` parameter is used to define the maximum length of the sequences that the transformer can process. This is used to define the number of positional embeddings in the `wpe` embedding matrix. The `pos` tensor is a tensor of shape `(1, T)` that contains the positions of the tokens in the sequence in the form `[0, 1, ..., T-1]`. The `unsqueeze` function adds a dimension of size 1 to the tensor (the function `torch.Tensor.view` could also be used here; check it). The `pos` tensor is then used to obtain the positional embeddings, which are added to the token embeddings (notice the broadcasting here).
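
The shapes involved in this step can be followed in a small standalone sketch. The sizes below are arbitrary values chosen for illustration, and `pos_alt` is a hypothetical name that shows the `view` alternative mentioned above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, C, vocab_size, max_len = 2, 5, 8, 10, 16

wte = nn.Embedding(vocab_size, C)  # token embeddings
wpe = nn.Embedding(max_len, C)     # learned positional embeddings

inputs = torch.randint(0, vocab_size, (B, T))
pos = torch.arange(0, T, dtype=torch.long).unsqueeze(0)    # (T,) -> (1, T)
pos_alt = torch.arange(0, T, dtype=torch.long).view(1, T)  # same result with view

tok_emb = wte(inputs)  # (B, T, C)
pos_emb = wpe(pos)     # (1, T, C)
x = tok_emb + pos_emb  # (1, T, C) is broadcast to (B, T, C)

print(pos.shape, tok_emb.shape, pos_emb.shape, x.shape)
```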

📘 *Documentation:* [`torch.nn.Embedding`](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html), [`torch.unsqueeze`](https://pytorch.org/docs/stable/generated/torch.unsqueeze.html), [`torch.Tensor.view`](https://pytorch.org/docs/stable/generated/torch.Tensor.view.html), [`torch.numel`](https://pytorch.org/docs/stable/generated/torch.numel.html), [`torch.arange`](https://pytorch.org/docs/stable/generated/torch.arange.html)

In [None]:
from abc import ABC, abstractmethod

class AbstractTransformer(nn.Module, ABC):
    def __init__(self, n_embd, n_head, n_layer, max_len, vocab_size,
                 embd_pdrop=0.1, attn_pdrop=0.1, resid_pdrop=0.1):
        super().__init__()
        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(vocab_size, n_embd),
            wpe = nn.Embedding(max_len, n_embd),
            drop = nn.Dropout(embd_pdrop),
            h = nn.ModuleList([Block(n_embd, n_head, attn_pdrop, resid_pdrop) for _ in range(n_layer)]),
            ln_f = LayerNorm(n_embd),  # we could use nn.LayerNorm instead
        ))
        self._init_weights()
        n_params = sum(p.numel() for p in self.transformer.parameters())
        print(f"number of parameters: {(n_params/1e6):.2f}M")
        
    def _init_weights(self):
        for module in self.modules():
            if isinstance(module, nn.Linear):
                torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
                if module.bias is not None:
                    torch.nn.init.zeros_(module.bias)
            elif isinstance(module, nn.Embedding):
                torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            # elif isinstance(module, nn.LayerNorm):
            #    torch.nn.init.zeros_(module.bias)
            #    torch.nn.init.ones_(module.weight)

    @abstractmethod
    def forward(self, inputs, mask):
        B, T = inputs.size()
        device = inputs.device  
        pos = torch.arange(0, T, dtype=torch.long, device=device).unsqueeze(0)  # (1, T)
        tok_emb = self.transformer.wte(inputs)  # (B, T, C)
        pos_emb = self.transformer.wpe(pos)  # (1, T, C)
        
        x = self.transformer.drop(tok_emb + pos_emb)  # broadcasting in the addition
        for block in self.transformer.h:
            x = block(x, mask)
        x = self.transformer.ln_f(x)
        
        return x

## Interlude: a few notes on PyTorch tensors

The main road of this notebook goes on with the implementation of decoder-only and encoder-only transformers below. However, before you continue, it may be worth reading a few remarks on PyTorch tensors. As in the "Choose your own adventure" books, you can proceed to the next cell or jump to the [notes on PyTorch tensors](#notes-on-pytorch-tensors) section below and come back here later.

## A decoder-only transformer

Our decoder-like transformer can be used to implement language models that predict the next token in a sequence given the previous ones. The `DecoderTransformer` class inherits from `AbstractTransformer` and refines its methods. As regards the architecture, it simply adds a predictor layer on top of the transformer architecture that outputs one logit per vocabulary entry; applying a softmax to these logits gives a probability distribution over the vocabulary.

The class also defines the mask that prevents the attention mechanism from attending to future tokens. Recall from above that the mask is a boolean tensor of shape `(B, T, T)` (or something that may be broadcast to that shape) containing `True` in the positions that must be ignored by the attention operation. In the case of a decoder-only transformer, the mask is the same for all samples in the mini-batch. Consequently, we simply create a tensor of shape `(1, T, T)` and let the broadcasting mechanism do the rest.

The mask is given a shape compatible with `(B, T, T)` via `torch.Tensor.view`, which changes the shape of a tensor without changing its data. Make sure that you understand the difference between a tensor of shape, say, `(2, 3)` and a tensor of shape `(1, 2, 3)`.

The method `torch.triu` will keep only the upper triangular part of the matrix (in this case, a matrix of ones), excluding the main diagonal (because `diagonal=1`), and set to 0 the remaining elements, that is, those corresponding to the lower triangular part and the main diagonal. The `torch.Tensor.bool` function converts the resulting tensor of ones and zeros into a tensor of `True` and `False`. 
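
The role of the `diagonal` argument is easier to see on a small matrix. The sketch below, with an arbitrary length of 4 and illustrative variable names, builds the mask with and without the main diagonal.

```python
import torch

T = 4
with_diag = torch.triu(torch.ones(T, T), diagonal=0).bool()     # would also hide the token itself
without_diag = torch.triu(torch.ones(T, T), diagonal=1).bool()  # hides only future tokens

mask = without_diag.view(1, T, T)  # (T, T) -> (1, T, T)

print(without_diag)
print(mask.shape)
```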

📘 *Documentation:* [`torch.triu`](https://pytorch.org/docs/stable/generated/torch.triu.html), [`torch.Tensor.bool`](https://pytorch.org/docs/stable/generated/torch.Tensor.bool.html), [`torch.Tensor.view`](https://pytorch.org/docs/stable/generated/torch.Tensor.view.html)

In [None]:
class DecoderTransformer(AbstractTransformer):
    def __init__(self, n_embd, n_head, n_layer, vocab_size, max_len, 
                 embd_pdrop=0.1, attn_pdrop=0.1, resid_pdrop=0.1):
        super().__init__(n_embd=n_embd, n_head=n_head, n_layer=n_layer, max_len=max_len, vocab_size=vocab_size,
                         embd_pdrop=embd_pdrop, attn_pdrop=attn_pdrop, resid_pdrop=resid_pdrop)
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)
        self._init_weights()
        
    def _init_weights(self):
        super()._init_weights()

    def forward(self, inputs):
        B, T = inputs.size()
        device = inputs.device
        mask = torch.triu(torch.ones(T, T, device=device), diagonal=1).bool()  # causal attention mask
        mask = mask.view(1,T,T) # expand mask, (T, T) -> (1, T, T)
        x = super().forward(inputs, mask)
        logits = self.lm_head(x)

        return logits

## An encoder-only transformer

Our encoder-like transformer can be used in tasks that involve classification or sequence tagging. The `EncoderTransformer` class inherits from `AbstractTransformer` and refines its methods. As regards the architecture, it simply adds a classifier layer on top of the transformer architecture that outputs one logit per class for each token; applying a softmax to these logits gives a probability distribution over the classes.

The mask is initially of shape `(B, T)` (`True` in positions corresponding to padding tokens) and is expanded to `(B, T, T)` to make it compatible with the shape required in our implementation of the attention mechanism. The `torch.Tensor.view` function adds a dimension of size 1 to the tensor (equivalently, `mask.unsqueeze(1)` could also be used here). Then, the `expand` function expands the tensor along the second dimension to give a tensor of shape `(B, T, T)`, but this is carried out efficiently in a way that does not copy the data.
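
These three steps can be traced on a tiny batch. The sketch assumes a hypothetical padding index of 0 and made-up token indexes; the zero stride reported at the end reflects that `expand` does not copy data.

```python
import torch

pad_index = 0  # hypothetical padding index
inputs = torch.tensor([[5, 7, 2, 0, 0],
                       [3, 4, 6, 8, 1]])  # (B, T) with B=2 and T=5
B, T = inputs.size()

mask = inputs == pad_index     # (B, T), True at padding positions
mask = mask.view(B, 1, T)      # (B, 1, T); mask.unsqueeze(1) is equivalent
mask = mask.expand(-1, T, -1)  # (B, T, T), no data is copied

print(mask.shape)
print(mask[0])        # every row repeats the padding pattern of the first sentence
print(mask.stride())  # stride 0 along the expanded dimension
```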

📘 *Documentation:* [`torch.Tensor.view`](https://pytorch.org/docs/stable/generated/torch.Tensor.view.html), [`torch.Tensor.expand`](https://pytorch.org/docs/stable/generated/torch.Tensor.expand.html), [`torch.Tensor.unsqueeze`](https://pytorch.org/docs/stable/generated/torch.unsqueeze.html)

In [None]:
class EncoderTransformer(AbstractTransformer):
    def __init__(self, n_embd, n_head, n_layer, input_vocab_size, output_vocab_size, max_len, pad_index,
                 embd_pdrop=0.1, attn_pdrop=0.1, resid_pdrop=0.1):
        super().__init__(n_embd=n_embd, n_head=n_head, n_layer=n_layer, max_len=max_len, vocab_size=input_vocab_size,
                         embd_pdrop=embd_pdrop, attn_pdrop=attn_pdrop, resid_pdrop=resid_pdrop)
        self.pad_index = pad_index
        self.lm_head = nn.Linear(n_embd, output_vocab_size, bias=False)
        self._init_weights()
        
    def _init_weights(self):
        super()._init_weights()

    def forward(self, inputs):
        B, T = inputs.size()
        mask = inputs == self.pad_index  # padding mask, created on the same device as inputs
        mask = mask.view(B, 1, T)  # (B, T) -> (B, 1, T)
        mask = mask.expand(-1, T, -1)  # (B, 1, T) -> (B, T, T), without copying data
        x = super().forward(inputs, mask)
        logits = self.lm_head(x)

        return logits

## Testing the model

Our main program simply tests a randomly initialized decoder-like transformer with a random sequence of token indexes. The model is not trained, so no meaning can be attributed to the output. The only purpose of these lines is to verify that the whole code does not crash.

Note the use of the `torch.no_grad` context manager. This context manager is used to prevent PyTorch from building the computation graph and computing the gradient, as already studied.
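
A minimal sketch of the effect of this context manager, using an isolated linear layer rather than the transformer of this notebook:

```python
import torch
import torch.nn as nn

lin = nn.Linear(4, 2)
x = torch.ones(1, 4)

y_tracked = lin(x)        # recorded in the computation graph
with torch.no_grad():
    y_untracked = lin(x)  # no graph is recorded inside the context

print(y_tracked.requires_grad, y_untracked.requires_grad)  # True False
```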

📘 *Documentation:* [`torch.no_grad`](https://pytorch.org/docs/stable/generated/torch.no_grad.html)

In [None]:
n_layer = 2
n_embd =  64
n_head = 2
max_len = 32
lr = 0.001
vocab_size = 5

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = DecoderTransformer(n_embd=n_embd, n_head=n_head, n_layer=n_layer, vocab_size=vocab_size, max_len=max_len)
model.to(device)

model.eval()
# create a random input; length is not relevant as long as it is smaller than max_len
x = torch.randint(0, vocab_size, (1, max_len//2), dtype=torch.long, device=device)
print(f"input: {x}")
with torch.no_grad():
    logits = model(x)
print(f"logits: {logits}")
print(f"output: {logits.argmax(-1)}")
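
As a quick reminder of the effect of `torch.no_grad`, the following sketch compares the same forward pass with and without the context manager. The small linear layer is only a toy stand-in for the model, used here so that the snippet can run on its own.

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)  # toy stand-in for the model, for illustration only
x = torch.randn(1, 4)

y_grad = layer(x)         # regular forward pass: the computation graph is recorded
with torch.no_grad():
    y_no_grad = layer(x)  # same computation, but no graph is built

print(y_grad.requires_grad)     # True
print(y_no_grad.requires_grad)  # False
```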

## Notes on PyTorch tensors

This is a collection of notes on PyTorch tensors that you may find useful to fully understand the code in this notebook.

### Unsqueezing tensors

Note: you can associate the terms *squeeze* and *unsqueeze* with the shape alteration that occurs when opening or closing instruments (called *squeezeboxes*) such as an accordion or a concertina.

A frequent operation in PyTorch is the unsqueezing of a tensor using the `unsqueeze` operation. This operation adds a dimension of size 1 at the indicated position. For instance, if we have a tensor of shape (2,3,4) and apply `unsqueeze(1)`, the result will be a tensor of shape (2,1,3,4). If we apply `unsqueeze(0)`, the result will be a tensor of shape (1,2,3,4). If we apply `unsqueeze(-1)`, the result will be a tensor of shape (2,3,4,1). One of the most typical uses of unsqueeze is to convert a single data sample into a minibatch. For example, imagine we have a model for assigning lexical categories (verb, noun, adjective, etc.) to words that receives a minibatch of different word embeddings and returns for each word a probability vector for each category. If we want to apply the model to a single word, we need to convert its embedding into a minibatch of a single item, and for this, we can use `unsqueeze(0)`. If we assume the number of categories is 10, after executing the model, the result will be a tensor of shape (1,10), which we can convert to a tensor of shape (10) with `squeeze(0)`. The squeeze operation is the complement of unsqueeze: by default, it removes all dimensions of size 1, but allows specifying the position of the dimension we want to eliminate.
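
The single-sample scenario described above can be sketched as follows. The tagger is a hypothetical toy model (a linear layer followed by a softmax) and the sizes are arbitrary; the only purpose is to follow the shapes. Note that `nn.Linear` would also accept the unbatched vector directly, so the explicit minibatch dimension is shown here purely for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd, n_categories = 8, 10  # hypothetical sizes
tagger = nn.Sequential(       # toy stand-in for the lexical category model
    nn.Linear(n_embd, n_categories),
    nn.Softmax(dim=-1),
)

embedding = torch.randn(n_embd)  # a single word embedding, shape (8,)
batch = embedding.unsqueeze(0)   # minibatch with a single item, shape (1, 8)
with torch.no_grad():
    probs = tagger(batch)        # shape (1, 10)
probs = probs.squeeze(0)         # shape (10,)

print(batch.shape)  # torch.Size([1, 8])
print(probs.shape)  # torch.Size([10])
```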

Adding a dimension of size 1 at the indicated position, as `unsqueeze` does, does not affect the number of elements in the tensor, but it does affect its shape. The block of data contained in the tensor is not modified in memory. The following example shows the result of unsqueezing at different positions:

In [None]:
import torch
a = torch.tensor([[1,2],[3,4]])  #   [ [ 1,     2 ],     [ 3,     4 ] ]    2x2
a.unsqueeze(0)                   # [ [ [ 1,     2 ],     [ 3,     4 ] ] ]  1x2x2
a.unsqueeze(1)                   # [ [ [ 1,     2 ] ], [ [ 3,     4 ] ] ]  2x1x2
a.unsqueeze(2)                   # [ [ [ 1 ], [ 2 ] ], [ [ 3 ], [ 4 ] ] ]  2x2x1
# a.unsqueeze(3)                 # exception: dimension out of range

As usual in PyTorch, dimensions can be negative, which allows specifying the position of the dimension counting from the end. In the previous example, `a.unsqueeze(-1)` is equivalent to `a.unsqueeze(2)`, since `a` has two dimensions. In terms of the `view` function, `t.squeeze()` is equivalent to `t.view(*[s for s in t.shape if s != 1])`. On the other hand, for a non-negative `i`, `t.unsqueeze(i)` is equivalent to `t.view(*t.shape[:i], 1, *t.shape[i:])`.
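
These equivalences can be verified directly. The tensor shape used below is an arbitrary choice for the illustration.

```python
import torch

t = torch.zeros(2, 1, 3, 1)

# squeeze() drops every dimension of size 1
squeezed = t.view(*[s for s in t.shape if s != 1])
print(squeezed.shape == t.squeeze().shape)  # True, both are (2, 3)

# unsqueeze(i) inserts a dimension of size 1 at position i (non-negative i)
i = 2
unsqueezed = t.view(*t.shape[:i], 1, *t.shape[i:])
print(unsqueezed.shape == t.unsqueeze(i).shape)  # True, both are (2, 1, 1, 3, 1)

# with a negative index, -1 refers to the position after the last dimension
print(t.unsqueeze(-1).shape == t.unsqueeze(t.dim()).shape)  # True
```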

Viewing an $n$-dimensional tensor as a list of $(n-1)$-dimensional tensors facilitates the understanding of tensor representation in PyTorch. You will probably find it easier to visualize a 5-dimensional tensor as a list of 4-dimensional tensors (and so on) than as a matrix of cubes, for example.

### Row and column vectors

The `unsqueeze` operation also helps us clarify the difference between the representation of vectors, row vectors, and column vectors in PyTorch. To begin, consider these two tensors:

In [None]:
a=torch.tensor([[1,2],[3,5]])
b=torch.tensor([2,3])

The tensor `a` corresponds to a 2x2 matrix and `b` to a vector of 2 elements. The operation `torch.mm(a,b)` produces an error because the sizes are incompatible, as this operation does not perform broadcasting and only works on two matrices. We can transform `b` into a column vector `[[2],[3]]` of 2x1 with the help of `unsqueeze` so that `torch.mm(a,b.unsqueeze(1))` works correctly. We can also transform `b` into a row vector `[[2,3]]` of 1x2 with the help of `unsqueeze` so that `torch.mm(b.unsqueeze(0),a)` works correctly. Note that the result of both products is obviously different (the resulting tensors, in fact, have different shapes). We can now use squeeze on the result to obtain a 2-element vector.
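
The products just described can be sketched as follows, reusing the same values of `a` and `b`. The flag `mm_failed` is a name introduced only for this illustration.

```python
import torch

a = torch.tensor([[1, 2], [3, 5]])
b = torch.tensor([2, 3])

mm_failed = False
try:
    torch.mm(a, b)  # b is one-dimensional, so mm rejects it
except RuntimeError:
    mm_failed = True
print(mm_failed)  # True

col = torch.mm(a, b.unsqueeze(1))  # (2,2) x (2,1) -> (2,1)
row = torch.mm(b.unsqueeze(0), a)  # (1,2) x (2,2) -> (1,2)
print(col.shape, col.squeeze(1).tolist())  # torch.Size([2, 1]) [8, 21]
print(row.shape, row.squeeze(0).tolist())  # torch.Size([1, 2]) [11, 19]
```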

The operation `torch.matmul` not only supports broadcasting, but is also prepared to multiply a two-dimensional tensor by a one-dimensional one. The result in this case is a one-dimensional tensor. Therefore, the following two assertions do not fail:

In [None]:
assert torch.equal(torch.mm(b.unsqueeze(0),a).squeeze(), torch.matmul(b,a))
assert torch.equal(torch.mm(a,b.unsqueeze(1)).squeeze(), torch.matmul(a,b))

### Memory representation of tensors

To simplify, consider a 4x3 matrix initialized as follows:

In [None]:
a = torch.tensor([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])

In memory, the elements of a tensor like the one above are stored in consecutive positions following a row-wise order, so they are arranged as 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12. The storage order of the elements of a tensor is characterized by a concept called *stride*, which can be checked with the `stride` method:

In [None]:
print(a.stride())  # (3, 1)

The tuple `(3,1)` indicates that to move to the next element in the first dimension (the rows), it is necessary to jump 3 positions in memory and to move to the next element in the second dimension (the columns), it is necessary to jump 1 position in memory.
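
In other words, the element at position `(i, j)` lives at offset `i * 3 + j * 1` from the start of the data. The following sketch checks this for one arbitrarily chosen position; `flatten` is used here only to expose the row-wise order of the contiguous tensor.

```python
import torch

a = torch.tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
flat = a.flatten()  # the 12 values in row-wise order (a is contiguous)
s0, s1 = a.stride() # (3, 1)

i, j = 2, 1                # arbitrary position chosen for the example
offset = i * s0 + j * s1   # 2*3 + 1*1 = 7
print(offset, flat[offset].item(), a[i, j].item())  # 7 8 8
```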

There are PyTorch operations (for example, transpose or the view function) that modify the strides of the tensors without moving the elements in memory, making the operation very efficient as it does not have to create new values in memory or reorder existing ones:

In [None]:
t = a.t()
print(t.stride())  # (1, 3)

Check that the strides `(1, 3)` are correct if the data in memory has not been modified. Many PyTorch operations are implemented so that they iterate through the data from the last dimension to the first (first by columns and then by rows, for example), expecting this to mean starting with the smallest step dimensions (columns, in our case) and moving towards dimensions with larger steps. This way, when the algorithm accesses the next data point, it is likely to be a neighbor of the current one and will probably be available in cache. If the elements were arranged differently in memory, the algorithm would have to jump more positions in memory to access the data and, therefore, be slower or not work at all. For this reason, sometimes some operations (for example, `t.view(-1)`) throw an exception and we will have to explicitly reorder the data in memory of the affected tensor before we can use such operation:

In [None]:
print(a.is_contiguous())  # True
print(t.is_contiguous())  # False
print(a.data_ptr()==t.data_ptr())  # True
t = t.contiguous()
print(t.stride())  # (4, 1)
print(a.data_ptr()==t.data_ptr())  # False
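
The exception mentioned above can be reproduced with the transposed tensor. This is a minimal sketch; `reshape` is shown as an alternative that copies the data only when a view is not possible.

```python
import torch

a = torch.tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
t = a.t()  # non-contiguous view with strides (1, 3)

view_failed = False
try:
    t.view(-1)  # cannot be expressed by changing strides alone
except RuntimeError:
    view_failed = True
print(view_failed)  # True

flat = t.contiguous().view(-1)  # reorder the data in memory, then view
print(flat.tolist())  # [1, 4, 7, 10, 2, 5, 8, 11, 3, 6, 9, 12]
print(torch.equal(flat, t.reshape(-1)))  # True
```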

The `contiguous` operation returns the input tensor (`self`) if it is already contiguous and returns a copy with the data reorganized otherwise. For contiguous tensors of any shape, the stride of a given dimension is never smaller than that of the next one (the two can coincide when a dimension has size 1):

In [None]:
x= torch.ones((5, 4, 3, 2))
print(x.stride())  # (24, 6, 2, 1)
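
The strides of a contiguous tensor can be derived from its shape: the stride of each dimension is the product of the sizes of all the dimensions that follow it. The helper `contiguous_strides` below is a hypothetical function written only for this illustration, not part of PyTorch.

```python
import torch

def contiguous_strides(shape):
    """Strides of a contiguous (row-wise) tensor with the given shape."""
    strides = []
    step = 1
    for size in reversed(shape):  # walk from the last dimension to the first
        strides.append(step)
        step *= size
    return tuple(reversed(strides))

shape = (5, 4, 3, 2)
print(contiguous_strides(shape))   # (24, 6, 2, 1)
print(torch.ones(shape).stride())  # (24, 6, 2, 1)
```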

## Exercises

If your learning path is supervised by a teacher, they may have provided you with additional instructions on how to proceed with the exercises.

✎ Use SentencePiece to tokenize the data and use more data (for example, a corpus from El Quijote) to train the model.

✎ Efficiently reuse attentions when generating text auto-regressively.

✎ Consider if the mask could be passed to the constructor instead of every time `forward` is called. Note that passing it to `forward` is not necessarily inefficient since what is passed is a reference to the tensor, not a copy of it.

✎ Compare the processing times as you increase the size of the transformer between using the two implementations (naïve, efficient) of the attention mechanism.