# Named entity recognition with transformers

<a target="_blank" href="https://colab.research.google.com/github/jaspock/me/blob/main/docs/materials/transformers/assets/notebooks/nerbert.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
<a href="http://dlsi.ua.es/~japerez/"><img src="https://img.shields.io/badge/Universitat-d'Alacant-5b7c99" style="margin-left:10px"></a>

Notebook and code written by Juan Antonio Pérez in 2023–2024.

This notebook uses the encoder-like transformer from our previous notebook to train and test a toy named entity recognition (NER) model on a tiny dataset. NER consists of identifying named entities in text and classifying them into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, or percentages.

We assume that you are already familiar with the basics of PyTorch. This notebook complements a [learning guide](https://dlsi.ua.es/~japerez/materials/transformers/intro/) based on studying the math behind the models by reading the book "[Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/)" (3rd edition) by Jurafsky and Martin. It is part of a series of notebooks meant to be studied in sequence, so make sure you follow the right order. If your learning is being supervised by a teacher, follow any additional instructions you may have received. Although you may use a GPU environment to execute the code, the computational requirements of the default settings are so low that you can probably run it on a CPU.

In [None]:
%%capture
%pip install torch

## Mini-batch preparation

The auxiliary function `sample_indexes` takes a sentence and its tags and returns a pair of lists of indexes: the first contains the indexes of the words in the sentence, and the second the indexes of the tags. The indexes are looked up in the dictionaries `word_index` and `tag_index`; words missing from `word_index` are mapped to the `[UNK]` index. Both lists are truncated to `max_len` if they are longer, and otherwise padded up to `max_len` with `pad_word_id` and `pad_tag_id`, respectively.
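As a small worked example of the lookup, truncation, and padding steps, the sketch below uses a hypothetical toy vocabulary (`toy_word_index`) and a hypothetical helper (`pad_or_truncate`). Neither is part of the notebook's code; they only illustrate the idea on a single sentence.

```python
# Hypothetical toy vocabulary; index 0 is reserved for padding and 1 for unknown words.
toy_word_index = {'[PAD]': 0, '[UNK]': 1, 'Paris': 2, 'is': 3, 'nice': 4}

def pad_or_truncate(ids, max_len, pad_id):
    """Cut the list to max_len items, then right-pad it with pad_id up to max_len."""
    ids = ids[:max_len]
    return ids + [pad_id] * (max_len - len(ids))

words = "Paris is very nice".split()
# Words missing from the vocabulary ("very") fall back to the [UNK] index.
ids = [toy_word_index.get(w, toy_word_index['[UNK]']) for w in words]

padded = pad_or_truncate(ids, max_len=6, pad_id=toy_word_index['[PAD]'])
truncated = pad_or_truncate(ids, max_len=3, pad_id=toy_word_index['[PAD]'])
print(padded)     # [2, 3, 1, 4, 0, 0]
print(truncated)  # [2, 3, 1]
```

Every sentence ends up with exactly `max_len` indexes, which is what allows several sentences to be stacked into a single rectangular tensor.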

The function `make_batch` is another example of a data generator. It uses `itertools.cycle` to repeat the data indefinitely: `cycle` creates an iterator that returns the elements of an iterable while saving a copy of each one and, once the iterable is exhausted, keeps returning elements from the saved copy. It may therefore require significant memory if the iterable is long. For real training data, it is usually much better to use a generator that reads the data from disk mini-batch by mini-batch.

Our iterable is made of tuples of sentences and tags. Python's `zip` function creates an iterator that aggregates elements from each of its input iterables. For example, `list(zip([1,2,3], [4,5,6]))` returns `[(1,4), (2,5), (3,6)]`.
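The combination of `zip` and `itertools.cycle` can be tried in isolation. The following sketch uses made-up toy sentences and tags, and relies on `itertools.islice` to take a finite number of items from an otherwise endless iterator.

```python
import itertools

sentences = ["a b", "c d", "e f"]
tags = ["O O", "PER O", "O LOC"]

# zip is lazy: wrap it in list() to see the (sentence, tags) pairs it produces.
pairs = list(zip(sentences, tags))
print(pairs)  # [('a b', 'O O'), ('c d', 'PER O'), ('e f', 'O LOC')]

# cycle never stops by itself, so islice is used here to take only the first 5 items.
first_five = list(itertools.islice(itertools.cycle(zip(sentences, tags)), 5))
print(first_five[3])  # ('a b', 'O O'): the data wraps around after three items
```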

We do not do so here, but note that the `PAD` token is so often represented by the index 0 that this value is commonly hardcoded.

In [None]:
import torch
import itertools

def sample_indexes(sentence, tags, word_index, tag_index, max_len, pad_word_id, pad_tag_id):
    words = sentence.split()
    tags = tags.split()
    assert len(words) == len(tags), "Lengths of input sentences and labels do not match"
    # truncate lists to max_len:
    if len(words) > max_len:
        words = words[:max_len]
        tags = tags[:max_len]
    inputs = [word_index.get(n, word_index['[UNK]']) for n in words]
    inputs = inputs + [pad_word_id] * (max_len - len(inputs))  # padded inputs
    tags = [tag_index[n] for n in tags]
    tags = tags + [pad_tag_id] * (max_len - len(tags))  # padded outputs
    return inputs, tags

def make_batch(input_sentences, output_tags, word_index, tag_index, max_len, batch_size, pad_word_id, pad_tag_id, device):
    input_batch = []
    output_batch = []
    data_cycle = itertools.cycle(zip(input_sentences, output_tags))
    for s,t in data_cycle:  # infinite loop
        inputs, outputs = sample_indexes(s, t, word_index, tag_index, max_len, pad_word_id, pad_tag_id)
        input_batch.append(inputs)
        output_batch.append(outputs)
        if len(input_batch) == batch_size:
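            # torch.tensor accepts dtype and device; the legacy torch.LongTensor constructor only builds CPU tensors
            yield (torch.tensor(input_batch, dtype=torch.long, device=device),
                   torch.tensor(output_batch, dtype=torch.long, device=device))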
            input_batch = []
            output_batch = []

## Import our transformer code

We load the `EncoderTransformer` class implemented in the previous notebook. When running in Google Colab, the file is downloaded from GitHub if it is not already present; when running locally, we assume that the file is in the same directory as this notebook. The seed is also set to a fixed value for reproducibility.

In [None]:
%%capture
import os
colab = bool(os.getenv("COLAB_RELEASE_TAG"))  # running in Google Colab?
if not os.path.isfile('transformer.ipynb') and colab:
    %pip install wget
    !wget https://raw.githubusercontent.com/jaspock/me/main/docs/materials/transformers/assets/notebooks/transformer.ipynb

%pip install nbformat
%run './transformer.ipynb'

set_seed(42)

## Corpus preprocessing

This code adds nothing new to what you have already seen in the previous notebook. Note that, in addition to `PAD`, we add some special tokens that are not used in this notebook; we keep them for potential future use, as they are common in encoder-based NLP tasks. The tags used here for named entity recognition are `PER` (person), `LOC` (location), `ORG` (organization), `MISC` (miscellaneous), and `O` (outside any entity).

In [None]:
input_sentences = [
    "Steve Jobs founded Apple in Cupertino .",
    "The Eiffel Tower is located in Paris .",
    "I am currently reading 1984 by George Orwell .",
    "The United Nations was established in 1945 .",
    "Mount Everest is the highest mountain in the world .",
    "Shakespeare wrote Romeo and Juliet ."
]

output_tags = [
    "PER PER O ORG O LOC O",
    "O MISC MISC O O O LOC O",
    "O O O O MISC O PER PER O",
    "O ORG ORG O O O O O",
    "LOC LOC O O O O O O LOC O",
    "PER O PER O PER O"
]

word_list = sorted(set(" ".join(input_sentences).split()))  # sorted so that indexes do not change between runs
word_index = {'[PAD]': 0, '[UNK]': 1, '[CLS]': 2, '[SEP]': 3, '[MASK]': 4}
special_tokens = len(word_index)
for step, w in enumerate(word_list):
    word_index[w] = step + special_tokens
index_word = {i: w for w, i in word_index.items()}  # inverse mapping: index -> word
input_vocab_size = len(word_index)
tag_list = sorted(set(" ".join(output_tags).split()))
tag_index = {'[PAD]': 0}
for step, t in enumerate(tag_list):
    tag_index[t] = step + 1
index_tag = {i: t for t, i in tag_index.items()}  # inverse mapping: index -> tag
output_vocab_size = len(tag_index)
print(f"input_vocab_size = {input_vocab_size}")
print(f"output_vocab_size = {output_vocab_size}")

## Model training

Having studied the other notebooks, you will hopefully find everything here familiar and understandable.

The `for` loop automatically calls the `__next__` method of the iterator returned by `make_batch`. 
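To make the iterator protocol concrete, the following sketch uses a hypothetical generator, `count_up_to`, which is not part of the notebook. It shows that `next` and a `for`-style iteration consume values from the same generator object.

```python
def count_up_to(n):
    """Hypothetical generator used only for illustration."""
    i = 0
    while i < n:
        yield i  # execution pauses here until the next value is requested
        i += 1

gen = count_up_to(3)
first = next(gen)         # equivalent to gen.__next__()
rest = [x for x in gen]   # iteration keeps calling __next__ until StopIteration
print(first, rest)        # 0 [1, 2]
```

Unlike this toy generator, `make_batch` iterates over `itertools.cycle` and never raises `StopIteration`, so the training loop has to be left with an explicit `break`.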

In [None]:
n_layer = 2
n_head = 2
n_embd =  64
embd_pdrop = 0.1
resid_pdrop = 0.1
attn_pdrop = 0.1
batch_size = 3
max_len = 12
lr = 0.001
training_steps = 500
eval_steps = 100

import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = EncoderTransformer(n_embd=n_embd, n_head=n_head, n_layer=n_layer, input_vocab_size=input_vocab_size, output_vocab_size=output_vocab_size, 
                max_len=max_len, pad_index = word_index['[PAD]'], embd_pdrop=embd_pdrop, attn_pdrop=attn_pdrop, resid_pdrop=resid_pdrop)
model.to(device)

criterion = nn.CrossEntropyLoss(ignore_index=tag_index['[PAD]'])
optimizer = optim.Adam(model.parameters(), lr=lr)
scheduler = lr_scheduler.LinearLR(optimizer, start_factor=1.0, end_factor=0.5, total_iters=training_steps)

model.train()
step = 0
for inputs, outputs in make_batch(input_sentences=input_sentences, output_tags=output_tags, word_index=word_index, 
                                    tag_index=tag_index, max_len=max_len, batch_size=batch_size, 
                                    pad_word_id = word_index['[PAD]'], pad_tag_id = tag_index['[PAD]'], device=device):
    optimizer.zero_grad()
    logits = model(inputs)
    loss = criterion(logits.view(-1,logits.size(-1)), outputs.view(-1)) 
    if step % eval_steps == 0:
        print(f'Step [{step}/{training_steps}], loss: {loss.item():.4f}')
    loss.backward()
    optimizer.step()
    scheduler.step()
    step = step + 1
    if step == training_steps:
        break

print(f'Step [{step}/{training_steps}], loss: {loss.item():.4f}')

## Model evaluation

We measure the accuracy of the model by comparing the predicted tags with the gold tags. As expected, we do not take into account the `PAD` tokens when computing the accuracy.

In [None]:
model.eval()

pad_word_id = word_index['[PAD]']
pad_tag_id = tag_index['[PAD]']
test_sentence = "Steve Jobs wrote Romeo and Juliet ."
expected_tags = "PER PER O PER O PER O"
inputs, outputs = sample_indexes(test_sentence, expected_tags, word_index, tag_index, max_len, pad_word_id, pad_tag_id)

inputs = torch.tensor(inputs, dtype=torch.long, device=device).unsqueeze(0)  # convert to batch of size 1
outputs = torch.tensor(outputs, dtype=torch.long, device=device).unsqueeze(0)
with torch.no_grad():  # gradients are not needed at inference time
    logits = model(inputs)
_, indices = torch.max(logits, dim=-1)

# compute accuracy excluding pads (outputs holds tag indexes, so compare against pad_tag_id):
mask = outputs != pad_tag_id
accuracy = torch.sum(indices[mask] == outputs[mask]).item() / torch.sum(mask).item()
print(f"Accuracy: {accuracy*100:.2f}%")
print()

i = 0
print(f'Input: {inputs[i]}')
print(f'Expected output: {outputs[i]}')
print(f'Predicted output: {indices[i]}')
# print words and tags using index_word and index_tag:
print(f'Input words: {[index_word[w.item()] for w in inputs[i] if w!=pad_word_id]}')
print(f'Expected tags: {[index_tag[t.item()] for t in outputs[i] if t!=pad_tag_id]}')
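# keep predictions only at non-PAD gold positions so that they stay aligned with the expected tags:
print(f'Predicted tags: {[index_tag[t.item()] for t, g in zip(indices[i], outputs[i]) if g!=pad_tag_id]}')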
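The PAD-excluding accuracy can be checked by hand on a small example. The sketch below uses made-up gold and predicted tag indexes for a batch of two padded sentences and assumes that the `PAD` tag has index 0, as in `tag_index`.

```python
import torch

pad_tag_id = 0  # assumed PAD index for tags, as in tag_index['[PAD]']

# Toy gold and predicted tag ids for a batch of two padded sentences.
gold = torch.tensor([[1, 2, 2, 0, 0],
                     [3, 1, 0, 0, 0]])
pred = torch.tensor([[1, 2, 3, 1, 1],
                     [3, 1, 2, 2, 2]])

mask = gold != pad_tag_id  # True at real (non-PAD) positions
correct = (pred[mask] == gold[mask]).sum().item()
total = mask.sum().item()
accuracy = correct / total
print(f"{correct}/{total} = {accuracy:.2f}")  # 4/5 = 0.80
```

Whatever the model predicts at padded positions has no effect on the result, because those positions are removed by the mask before the comparison.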

## Exercises

If your learning path is supervised by a teacher, they may have provided you with additional instructions on how to proceed with the exercises.

✎ Compare the original pre-norm implementation of the transformer with the post-norm implementation on this task.

✎ Add a pre-training step to the model that implements the masked language model objective and is trained on a separate corpus. Note that the `MASK` token is already included in the vocabulary.

✎ Exclude some words in the training corpus from the vocabulary and check that they are replaced by the `UNK` token and that some representations are learned for it.

✎ Modify the code so that predictions are not computed for the `PAD` tokens. Make sure your implementation works for mini-batches containing sentences of different lengths.