%% Cell type:markdown id: tags:
# TP 5 : machine learning using neural network for text data
In this practical session, we are going to build simple neural models able to classify reviews as positive or negative. The dataset used comes from AlloCine.
The goals are to understand how to use pretrained embeddings, and to correctly tune a neural model.
You need to load:
- Allocine: Train, dev and test sets
- Embeddings: cc.fr.300.10000.vec (10,000 first lines of the original file)
## Part 1- Pre-trained word embeddings
Define a neural network that takes as input pre-trained word embeddings (here FastText embeddings). Words are represented by real-valued vectors from FastText. A review is represented by a vector that is the average or the sum of the word vectors.
So instead of an input vector of size 5000, we now have an input vector of size e.g. 300 that represents the ‘average’, combined meaning of all the words in the document taken together.
## Part 2- Tuning report
Tune the model built on pre-trained word embeddings by testing several values for the different hyper-parameters, and by testing the addition of a hidden layer.
Describe the performance obtained by reporting the scores for each setting on the development set, plotting the loss against the hyper-parameter values, and reporting the score of the best model on the test set.
-------------------------------------
%% Cell type:markdown id: tags:
## Useful imports
Here we also:
* Check the availability of a GPU. Reminder: in Colab, you have to go to Edit/Notebook settings to enable the use of a GPU
* Set a seed, for reproducibility: https://pytorch.org/docs/stable/notes/randomness.html
%% Cell type:code id: tags:
```
import time
import pandas as pd
import numpy as np
# torch and torch modules to deal with text data
import torch
import torch.nn as nn
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torch.utils.data import DataLoader
# you can use scikit to print scores
from sklearn.metrics import classification_report
# For reproducibility, set a seed
torch.manual_seed(0)
# Check for GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)
```
%% Cell type:markdown id: tags:
Paths to data:
%% Cell type:code id: tags:
```
# Data files
train_file = "allocine_train.tsv"
dev_file = "allocine_dev.tsv"
test_file = "allocine_test.tsv"
# embeddings
embed_file='cc.fr.300.10000.vec'
```
%% Cell type:markdown id: tags:
## 1- Read and load the data
%% Cell type:markdown id: tags:
### The class Dataset is defined below.
%% Cell type:code id: tags:
```
# Here we create a custom Dataset class that inherits from the Dataset class in PyTorch
# A custom Dataset class must implement three functions: __init__, __len__, and __getitem__
class Dataset(torch.utils.data.Dataset):
def __init__(self, tsv_file, vocab=None ):
""" (REQUIRED) Here we save the location of our input file,
load the data, i.e. retrieve the list of texts and associated labels,
build the vocabulary if none is given,
and define the pipelines used to prepare the data """
self.tsv_file = tsv_file
self.data, self.label_list = self.load_data( )
        # splits the sentence on spaces (couldn't make the French tokenizer work)
self.tokenizer = get_tokenizer( None )
self.vocab = vocab
if not vocab:
self.build_vocab()
# pipelines for text and label
self.text_pipeline = lambda x: self.vocab(self.tokenizer(x)) #return a list of indices from a text
self.label_pipeline = lambda x: int(x) #simple mapping to self
def load_data( self ):
""" Read a tsv file and return the list of texts and associated labels"""
data = pd.read_csv( self.tsv_file, header=0, delimiter="\t", quoting=3)
instances = []
label_list = []
for i in data.index:
label_list.append( data["sentiment"][i] )
instances.append( data["review"][i] )
return instances, label_list
def build_vocab(self):
""" Build the vocabulary, i.e. retrieve the list of unique tokens
        appearing in the corpus (= training set). We also add a specific index
corresponding to unknown words. """
self.vocab = build_vocab_from_iterator(self.yield_tokens(), specials=["<unk>"])
self.vocab.set_default_index(self.vocab["<unk>"])
def yield_tokens(self):
""" Iterator on tokens """
for text in self.data:
yield self.tokenizer(text)
def __len__(self):
""" (REQUIRED) Return the len of the data,
i.e. the total number of instances """
return len(self.data)
def __getitem__(self, index):
""" (REQUIRED) Return a specific instance in a format that can be
processed by Pytorch, i.e. torch tensors """
return (
tuple( [torch.tensor(self.text_pipeline( self.data[index] ), dtype=torch.int64),
torch.tensor( self.label_pipeline( self.label_list[index] ), dtype=torch.int64) ] )
)
```
%% Cell type:markdown id: tags:
### The function to generate data batches and iterator is defined below.
%% Cell type:code id: tags:
```
# This function explains how we process data to make batches of instances
# - The list of texts / reviews that is returned is similar to a list of list:
# each element is a batch, i.e. a set of BATCH_SIZE texts. But instead of
# creating sublists, PyTorch concatenates all the tensors corresponding to
# each text sequence into one tensor.
# - The list of labels is the list of list of labels for each batch
# - The offsets are used to save the position of each individual instance
# within the big tensor
def collate_fn(batch):
label_list, text_list, offsets = [], [], [0]
for ( _text, _label) in batch:
text_list.append( _text )
label_list.append( _label )
offsets.append(_text.size(0))
label = torch.tensor(label_list, dtype=torch.int64) #tensor of labels for a batch
offsets = torch.tensor(offsets[:-1]).cumsum(dim=0) #tensor of offset indices for a batch
text_list = torch.cat(text_list) # <--- here we concatenate the reviews in the batch
return text_list.to(device), label.to(device), offsets.to(device) #move the data to GPU
```
%% Cell type:markdown id: tags:
### We load the data:
%% Cell type:code id: tags:
```
# Load the training and development data
train = Dataset( train_file )
dev = Dataset( dev_file, vocab=train.vocab )
train_loader = DataLoader(train, batch_size=2, shuffle=False, collate_fn=collate_fn) #<-- use shuffle = True instead
dev_loader = DataLoader(dev, batch_size=2, shuffle=False, collate_fn=collate_fn)
print(train[0])
print(train[1])
for input, label, offset in train_loader:
print( input, label, input.size(), offset )
break
```
%% Cell type:markdown id: tags:
### The functions to load the embedding vectors and build the weight matrix are defined below.
%% Cell type:code id: tags:
```
import io
def load_vectors(fname):
fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
n, d = map(int, fin.readline().split())
print("Originally we have: ", n, 'tokens, and vectors of',d, 'dimensions') #here in fact only 10000 words
data = {}
for line in fin:
tokens = line.rstrip().split(' ')
data[tokens[0]] = [float(t) for t in tokens[1:]]
return data
vectors = load_vectors( embed_file )
print( 'Version with', len( vectors), 'tokens')
print(vectors.keys() )
print( vectors['de'] )
# Load the weight matrix: modify the code below to check the coverage of the
# pre-trained embeddings
emb_dim = 300
matrix_len = len(train.vocab)
weights_matrix = np.zeros((matrix_len, emb_dim))
words_found, words_unk = 0,0
for i in range(0, len(train.vocab)):
word = train.vocab.lookup_token(i)
try:
weights_matrix[i] = vectors[word]
words_found += 1
except KeyError:
weights_matrix[i] = np.random.normal(scale=0.6, size=(emb_dim, ))
words_unk += 1
weights_matrix = torch.from_numpy(weights_matrix).to( torch.float32)
print( "Weights matrix size:", weights_matrix.size() )
print( "Words found:", words_found )
print( "Unknown words:", words_unk )
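# Sketch of a possible coverage check (one reading of "check the coverage"
# asked above): the proportion of vocabulary words that received a FastText
# vector rather than a random one.
print( "Coverage: {:.1f}% of the vocabulary".format(100 * words_found / matrix_len) )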
```
%% Cell type:markdown id: tags:
### Model definition
%% Cell type:code id: tags:
```
class FeedforwardNeuralNetModel(nn.Module):
def __init__(self, hidden_dim, output_dim, weights_matrix):
        # calls the init function of nn.Module. Don't get confused by the syntax,
        # just always do it in an nn.Module
super(FeedforwardNeuralNetModel, self).__init__()
# Embedding layer
# mode (string, optional) – "sum", "mean" or "max". Default=mean.
self.embedding_bag = nn.EmbeddingBag.from_pretrained(
weights_matrix,
mode='mean')
embed_dim = self.embedding_bag.embedding_dim
# Linear function
self.fc1 = nn.Linear(embed_dim, hidden_dim)
# Non-linearity
self.sigmoid = nn.Sigmoid()
# Linear function (readout)
self.fc2 = nn.Linear(hidden_dim, output_dim)
def forward(self, text, offsets):
# Embedding layer
embedded = self.embedding_bag(text, offsets)
# Linear function
out = self.fc1(embedded)
# Non-linearity
out = self.sigmoid(out)
# Linear function (readout)
out = self.fc2(out)
return out
```
%% Cell type:markdown id: tags:
### Train and evaluation functions are defined below:
%% Cell type:code id: tags:
```
import matplotlib.pyplot as plt
import os
def my_plot(epochs, loss):
plt.plot(epochs, loss)
#fig.savefig(os.path.join('./lossGraphs', 'train.jpg'))
def training(model, train_loader, optimizer, num_epochs=5, plot=False ):
loss_vals = []
for epoch in range(num_epochs):
train_loss, total_acc, total_count = 0, 0, 0
for input, label, offsets in train_loader:
# Step1. Clearing the accumulated gradients
optimizer.zero_grad()
# Step 2. Forward pass to get output/logits
            outputs = model( input, offsets ) # <---- additional offsets argument
# Step 3. Compute the loss, gradients, and update the parameters by
# calling optimizer.step()
# - Calculate Loss: softmax --> cross entropy loss
loss = criterion(outputs, label)
# - Getting gradients w.r.t. parameters
loss.backward()
# - Updating parameters
optimizer.step()
# Accumulating the loss over time
train_loss += loss.item()
total_acc += (outputs.argmax(1) == label).sum().item()
total_count += label.size(0)
# Compute accuracy on train set at each epoch
print('Epoch: {}. Loss: {}. ACC {} '.format(epoch, train_loss/len(train), total_acc/len(train)))
loss_vals.append(train_loss/len(train))
total_acc, total_count = 0, 0
train_loss = 0
if plot:
# plotting
my_plot(np.linspace(1, num_epochs, num_epochs).astype(int), loss_vals)
def evaluate( model, dev_loader ):
predictions = []
gold = []
with torch.no_grad():
for input, label, offsets in dev_loader:
            probs = model(input, offsets) # <---- forward function called with offsets
# -- to deal with batches
predictions.extend( torch.argmax(probs, dim=1).cpu().numpy() )
gold.extend([int(l) for l in label])
print(classification_report(gold, predictions))
return gold, predictions
```
%% Cell type:code id: tags:
```
# Set the values of the hyperparameters
hidden_dim = 4
learning_rate = 0.1
num_epochs = 5
criterion = nn.CrossEntropyLoss()
output_dim = 2
```
%% Cell type:code id: tags:
```
# Initialize the model
model_ffnn = FeedforwardNeuralNetModel( hidden_dim, output_dim, weights_matrix)
optimizer = torch.optim.SGD(model_ffnn.parameters(), lr=learning_rate)
model_ffnn = model_ffnn.to(device)
# Train the model
training( model_ffnn, train_loader, optimizer, num_epochs=5 )
# Evaluate on dev
gold, pred = evaluate( model_ffnn, dev_loader )
```
%% Cell type:markdown id: tags:
## 2 - Exercise: Tuning your model
The model comes with a variety of hyper-parameters. To find the best model, we need to test different values for these free parameters.
Be careful:
* you always optimize / fine-tune your model on the **development set**.
* Then you compare the results obtained with the different settings on the dev set to choose the best setting
* finally you report the results of the best model on the test set
* you always keep track of your experiments, for reproducibility purposes: report the values tested for each hyper-parameter and the values used by your best model.
In this part, you have to test different values for the following hyper-parameters:
1. Batch size
2. Max number of epochs (with best batch size)
3. Size of the hidden layer
4. Activation function
5. Optimizer
6. Learning rate
Formulate some hypotheses on the influence of these hyper-parameters by inspecting how they affect the loss during training and the performance of the model.
**Note:** (not done below) Here you are writing a report on the performance of your model. Try to organise your code to keep track of what you're doing:
* give a different name to each model, to be able to run them again
* save the results in a dictionary or a file, to be able to use them later:
* keep in mind that you should be able to provide e.g. plots of your results (for example, accuracy for different values of a specific hyper-parameter), or analyses of your results (e.g. by inspecting the predictions of your model), so you need to be able to access the results.
%% Cell type:code id: tags:
```
from sklearn.metrics import accuracy_score, f1_score
```
%% Cell type:code id: tags:
```
# For now, we keep a medium number of epochs, e.g. 50
num_epochs = 50
```
%% Cell type:markdown id: tags:
#### 1. BATCH SIZE
We need to reload the data to change the size of the batch.
%% Cell type:code id: tags:
```
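# Possible sketch (not the official solution): rebuild the DataLoaders for each
# candidate batch size, train a fresh model and store the dev accuracy.
# The names below (results_batch, model_bs, ...) are illustrative choices.
results_batch = {}
for batch_size in [2, 16, 64, 256]:
    train_loader = DataLoader(train, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
    dev_loader = DataLoader(dev, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)
    model_bs = FeedforwardNeuralNetModel(hidden_dim, output_dim, weights_matrix).to(device)
    optimizer = torch.optim.SGD(model_bs.parameters(), lr=learning_rate)
    training(model_bs, train_loader, optimizer, num_epochs=num_epochs)
    gold, pred = evaluate(model_bs, dev_loader)
    results_batch[batch_size] = accuracy_score(gold, pred)
print(results_batch)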
```
%% Cell type:markdown id: tags:
#### 3. HIDDEN SIZE
%% Cell type:code id: tags:
```
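# Possible sketch: vary the size of the hidden layer, keeping the other
# hyper-parameters (and the loaders chosen above) fixed.
results_hidden = {}
for hdim in [2, 4, 8, 16, 32]:
    model_h = FeedforwardNeuralNetModel(hdim, output_dim, weights_matrix).to(device)
    optimizer = torch.optim.SGD(model_h.parameters(), lr=learning_rate)
    training(model_h, train_loader, optimizer, num_epochs=num_epochs)
    gold, pred = evaluate(model_h, dev_loader)
    results_hidden[hdim] = accuracy_score(gold, pred)
print(results_hidden)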
```
%% Cell type:markdown id: tags:
#### 4. ACTIVATION FUNCTION
%% Cell type:code id: tags:
```
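# Possible sketch: the class above hard-codes a Sigmoid, so one option is a
# small subclass that overrides the activation (illustrative class name, not
# part of the hand-out).
class FFNNWithActivation(FeedforwardNeuralNetModel):
    def __init__(self, hidden_dim, output_dim, weights_matrix, activation):
        super().__init__(hidden_dim, output_dim, weights_matrix)
        self.sigmoid = activation  # attribute reused by forward()

results_activation = {}
for name, act in [("sigmoid", nn.Sigmoid()), ("tanh", nn.Tanh()), ("relu", nn.ReLU())]:
    model_a = FFNNWithActivation(hidden_dim, output_dim, weights_matrix, act).to(device)
    optimizer = torch.optim.SGD(model_a.parameters(), lr=learning_rate)
    training(model_a, train_loader, optimizer, num_epochs=num_epochs)
    gold, pred = evaluate(model_a, dev_loader)
    results_activation[name] = accuracy_score(gold, pred)
print(results_activation)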
```
%% Cell type:markdown id: tags:
#### 5. LEARNING RATE
%% Cell type:code id: tags:
```
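# Possible sketch: vary the learning rate of SGD, everything else fixed.
results_lr = {}
for lr in [0.001, 0.01, 0.1, 1.0]:
    model_lr = FeedforwardNeuralNetModel(hidden_dim, output_dim, weights_matrix).to(device)
    optimizer = torch.optim.SGD(model_lr.parameters(), lr=lr)
    training(model_lr, train_loader, optimizer, num_epochs=num_epochs)
    gold, pred = evaluate(model_lr, dev_loader)
    results_lr[lr] = accuracy_score(gold, pred)
print(results_lr)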
```
%% Cell type:markdown id: tags:
#### 6. OPTIMIZER
%% Cell type:code id: tags:
```
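# Possible sketch: compare optimizers (the learning rates are illustrative;
# Adam typically needs a much smaller one than SGD).
optimizers = {
    "sgd": lambda params: torch.optim.SGD(params, lr=0.1),
    "adam": lambda params: torch.optim.Adam(params, lr=0.001),
}
results_optim = {}
for name, make_optimizer in optimizers.items():
    model_o = FeedforwardNeuralNetModel(hidden_dim, output_dim, weights_matrix).to(device)
    optimizer = make_optimizer(model_o.parameters())
    training(model_o, train_loader, optimizer, num_epochs=num_epochs)
    gold, pred = evaluate(model_o, dev_loader)
    results_optim[name] = accuracy_score(gold, pred)
print(results_optim)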
```
%% Cell type:markdown id: tags:
#### 2. NUMBER OF EPOCHS
%% Cell type:code id: tags:
```
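# Possible sketch: with the best batch size, train a single model with a larger
# budget and evaluate on dev at regular checkpoints to see when it stops improving.
model_e = FeedforwardNeuralNetModel(hidden_dim, output_dim, weights_matrix).to(device)
optimizer = torch.optim.SGD(model_e.parameters(), lr=learning_rate)
for checkpoint in range(5):           # 5 x 10 = 50 epochs in total
    training(model_e, train_loader, optimizer, num_epochs=10)
    print("Dev results after", (checkpoint + 1) * 10, "epochs:")
    gold, pred = evaluate(model_e, dev_loader)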
```
%% Cell type:markdown id: tags:
### Additional exercise
Modify your model to test a variation of the architecture. You don't have to tune the whole model again here; just try, for example, while keeping the best hyper-parameter values found previously:
7. Try with 1 additional hidden layer
%% Cell type:code id: tags:
```
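# Possible sketch: a variant with one additional hidden layer (illustrative
# class name, not part of the hand-out), trained with the best values found above.
class TwoLayerFFNN(nn.Module):
    def __init__(self, hidden_dim, output_dim, weights_matrix):
        super().__init__()
        self.embedding_bag = nn.EmbeddingBag.from_pretrained(weights_matrix, mode='mean')
        embed_dim = self.embedding_bag.embedding_dim
        self.fc1 = nn.Linear(embed_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)  # additional hidden layer
        self.fc3 = nn.Linear(hidden_dim, output_dim)
        self.activation = nn.Sigmoid()

    def forward(self, text, offsets):
        out = self.activation(self.fc1(self.embedding_bag(text, offsets)))
        out = self.activation(self.fc2(out))
        return self.fc3(out)

model_2h = TwoLayerFFNN(hidden_dim, output_dim, weights_matrix).to(device)
optimizer = torch.optim.SGD(model_2h.parameters(), lr=learning_rate)
training(model_2h, train_loader, optimizer, num_epochs=num_epochs)
gold, pred = evaluate(model_2h, dev_loader)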
```
%% Cell type:markdown id: tags:
# TP 6: Introduction to transformers
In this session, we will see how to use the HuggingFace Transformers library and pretrained models.
We will again focus on the sentiment analysis task, this time on the English IMDB data.
In HuggingFace terms, this is a (word-)sequence classification task.
We will rely on the HuggingFace library and Transformer language models (e.g. BERT).
- https://huggingface.co/ : an open-source NLP library that offers a very rich API for using different architectures and models for the classic problems of classification, sequence tagging, generation ... Feel free to browse the existing demos and models: https://huggingface.co/tasks/text-classification
- A fairly large number of datasets is also directly accessible via the API, for text and images in particular; see the datasets https://huggingface.co/datasets and the documentation for handling them: https://huggingface.co/docs/datasets/index
The code below lets you install:
- the *transformers* module, which contains the language models https://pypi.org/project/transformers/
- the datasets library, to access datasets
%% Cell type:code id: tags:
``` python
!pip install -U transformers
!pip install datasets
```
%% Cell type:markdown id: tags:
Finally, if the installation is successful, we can import the transformers library:
%% Cell type:code id: tags:
``` python
import transformers
from transformers import pipeline
from datasets import load_dataset
import numpy as np
```
%% Cell type:markdown id: tags:
# 1. Sentiment analysis with a pretrained model
Many NLP tasks are made easy to perform within HuggingFace using the Pipeline abstraction.
Useful resource: course made available on HuggingFace website, e.g. part on pipelines: https://huggingface.co/course/chapter1/3?fw=pt#working-with-pipelines
For example, for text classification we can very simply access pretrained models for various tasks, including sentiment analysis:
https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TextClassificationPipeline
Let's try!
%% Cell type:markdown id: tags:
#### 1.1 ▶▶ Exercise: Default model
You can test pipelines by simply specifying the task you want to perform; a model is then chosen by default.
Run the code below:
* what is the name of the chosen pretrained model?
* what language?
* run the next lines and look at the model's predictions: do they seem alright? Can you produce an example that is not well predicted?
%% Cell type:code id: tags:
``` python
classifier = pipeline("sentiment-analysis")
```
%% Cell type:code id: tags:
``` python
classifier("This movie is disgustingly good !")
```
%% Cell type:code id: tags:
``` python
classifier("This movie is not as good as expected !")
```
%% Cell type:code id: tags:
``` python
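# Possible probe (sketch): negation and sarcasm are classic failure cases;
# try sentences like this one and check whether the prediction matches your intuition.
classifier("The movie was not exactly what I would call a masterpiece, but I couldn't stop watching it.")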
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
#### 1.3 Specifying a pretrained model for English
You can specify the pretrained model you want to use.
HuggingFace makes available tons of models for NLP (and other domains).
You can browse them on this page, here restricted to English model for Text classification tasks: https://huggingface.co/models?language=en&pipeline_tag=text-classification&sort=downloads
▶▶ Exercise: use the same model as before, but use the pipeline parameter to specify its name explicitly.
Hint: look at the doc https://huggingface.co/learn/nlp-course/chapter1/3?fw=pt#using-any-model-from-the-hub-in-a-pipeline
%% Cell type:code id: tags:
``` python
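# Possible answer (sketch): at the time of writing, the default English sentiment
# model is DistilBERT fine-tuned on SST-2; check the name printed as a warning by
# the default pipeline above, since it may change.
classifier_en = pipeline("sentiment-analysis",
                         model="distilbert-base-uncased-finetuned-sst-2-english")
classifier_en("This movie is not as good as expected !")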
```
%% Cell type:markdown id: tags:
### 1.4 ▶▶ Exercise: use a pretrained model for French
Now, take a look at the models page and find a suitable model for the task in French: we want to try an adapted version of **FlauBERT**.
* Find the model in the database and look at the documentation: how was this model built?
* Load it. You will need to install the sacremoses library using ```!pip install sacremoses```
* Then try it on a few examples.
%% Cell type:code id: tags:
``` python
!pip install sacremoses
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
# 1.5 Exploring a dataset
In this part, we will focus on exploring datasets that are part of the HuggingFace hub.
%% Cell type:markdown id: tags:
## 1.5.1 Load a dataset
▶▶ Exercise: Find the dataset corresponding to IMDB and load it.
Doc: https://huggingface.co/datasets and https://huggingface.co/docs/datasets/load_hub
%% Cell type:code id: tags:
``` python
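# Possible answer (sketch): the IMDB reviews dataset is available on the hub
# under the name "imdb".
dataset = load_dataset("imdb")
dataset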
```
%% Cell type:markdown id: tags:
## 1.5.2 Print statistics on the dataset
▶▶ Exercise:
* Print the number of classes
* Print the first 2 examples of the dataset (advice: shuffle the dataset..)
* Print the distribution
* Count the total number of tokens and unique tokens
Hint: start by simply 'printing' the dataset object, i.e.:
```
dataset
```
It will show you the structure of this object.
%% Cell type:code id: tags:
``` python
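# Possible sketch for the requested statistics (assumes the `dataset` object
# loaded above).
# Number of classes
print("Classes:", dataset["train"].features["label"].names)
# First 2 examples of a shuffled version of the training set
shuffled = dataset["train"].shuffle(seed=0)
print(shuffled[:2])
# Label distribution
labels = shuffled["label"]
print("Distribution:", {label: labels.count(label) for label in set(labels)})
# Total number of (whitespace) tokens and unique tokens
tokens = [tok for text in shuffled["text"] for tok in text.split()]
print("Total tokens:", len(tokens), "- unique tokens:", len(set(tokens)))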
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
## 1.5.3 Tokenizer
The text in the dataset is not tokenized.
In fact, transformer models have been trained with a specific tokenization, and it is crucial to rely on the same tokenization when using a transformer model.
%% Cell type:markdown id: tags:
### ▶▶ Exercise: Load the pretrained model for English and test it on the first example
%% Cell type:code id: tags:
``` python
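# Possible sketch: reuse the English sentiment model from part 1 (the model name
# below is an assumption, see the warning printed by the default pipeline) and
# test it on the first example of the IMDB training set (truncated, since reviews
# can exceed the model's maximum length).
pretrained_model = "distilbert-base-uncased-finetuned-sst-2-english"
classifier_en = pipeline("sentiment-analysis", model=pretrained_model)
example = dataset["train"][0]
print(example["text"][:200], "...")
print("Gold label:", example["label"], "- prediction:", classifier_en(example["text"][:512]))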
```
%% Cell type:markdown id: tags:
### Notes on tokenizers
Note that the HuggingFace library defines *Auto Classes*: they directly infer the required architecture from the type of model given as argument.
* For example here, the tokenizer is specific to the DistilBERT model; more precisely, it is identical to BERT's and inherits many methods from the *PreTrainedTokenizerFast* class.
* The *transformers.AutoModelForSequenceClassification* class is used for a sequence classification model.
The tokenizer is in charge of preparing the input data: in the case of BERT in particular, it splits tokens into sub-tokens, assigns an id to each sub-token, and allows the mapping in both directions...
- The *Auto Classes*: https://huggingface.co/docs/transformers/model_doc/auto
- Tokenizers in HuggingFace: https://huggingface.co/docs/transformers/v4.25.1/en/main_classes/tokenizer
- *Bert tokenizer*: https://huggingface.co/docs/transformers/v4.25.1/en/model_doc/bert#transformers.BertTokenizer
- The *PreTrainedTokenizerFast* class: https://huggingface.co/docs/transformers/v4.25.1/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast
%% Cell type:code id: tags:
``` python
from transformers import AutoTokenizer
# Defining the tokenizer using Auto Classes
tokenizer = AutoTokenizer.from_pretrained(pretrained_model)
```
%% Cell type:markdown id: tags:
### ▶▶ Exercise: Test the tokenizer
**Use the tokenizer to:**
- encode a sentence (in English);
- convert in the other direction: from a list of token ids back to text.
* What happens with long words?
* With unknown words?
* What do the elements in square brackets represent?
Hint: look at the 'encode' and 'decode' methods in the doc https://huggingface.co/docs/transformers/v4.25.1/en/main_classes/tokenizer (and possibly 'convert_ids_to_tokens()').
%% Cell type:code id: tags:
``` python
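# Possible sketch: encode a sentence, look at the ids and sub-tokens, then decode
# back to text to see the special tokens in square brackets.
ids = tokenizer.encode("This counterintuitive blockbuster is frustratingly unwatchable, bazinga!")
print(ids)
print(tokenizer.convert_ids_to_tokens(ids))
print(tokenizer.decode(ids))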
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
### Compute the vocabulary using the tokenizer
The function below will tokenize the entire dataset.
▶▶ Exercise: compute the total number of tokens and unique tokens.
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
def tokenize_function(examples):
return tokenizer(examples["text"])
tokenized_datasets = dataset.map(tokenize_function)
```
%% Cell type:code id: tags:
``` python
tokenized_datasets
```
%% Cell type:markdown id: tags:
Note that the tokenizer returns two elements:
- input_ids: the numbers representing the tokens in the text.
- attention_mask: indicates which tokens should be attended to and which should be ignored (e.g. padding).
More info on datasets: https://huggingface.co/docs/datasets/use_dataset
%% Cell type:code id: tags:
``` python
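# Possible sketch: count the total and unique token ids over the tokenized training set.
all_ids = [i for ids in tokenized_datasets["train"]["input_ids"] for i in ids]
print("Total tokens:", len(all_ids), "- unique tokens:", len(set(all_ids)))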
```
%% Cell type:markdown id: tags:
# Additional notes about HuggingFace dataset
%% Cell type:markdown id: tags:
#### Available corpora
Note that many corpora are available directly from HuggingFace, for example for text classification tasks:
https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads
In particular you can directly load the full AlloCine corpus:
https://huggingface.co/datasets/allocine
%% Cell type:markdown id: tags:
#### Some preprocessing
The library makes it very easy to perform some preprocessing directly on the Dataset object.
Take a look at the doc: https://huggingface.co/course/chapter5/3?fw=pt
For example, here we compute the length of each review and filter the dataset to exclude outliers, e.g. reviews with too few words.
%% Cell type:code id: tags:
``` python
def compute_review_length(example):
return {"review_length": len(example["review"].split())}
dataset = dataset.map(compute_review_length) # add the review_length column (assumes a dataset with a "review" field, e.g. the AlloCine corpus)
# Inspect the first training example
dataset["train"][0]
```
%% Cell type:markdown id: tags:
Some reviews are very short... Dataset.filter() can be used to remove such examples.
%% Cell type:code id: tags:
``` python
dataset["train"].sort("review_length")[:3]
```
%% Cell type:code id: tags:
``` python
filtered_dataset = dataset.filter(lambda x: x["review_length"] > 10)
print(filtered_dataset.num_rows)
```