%% Cell type:markdown id: tags:
# TP 6: Introduction to transformers
In this session, we will see how to use the HuggingFace Transformers library and pretrained models.
We will again work on the sentiment analysis task, with the English IMDB data.
In HuggingFace terms, this is a (word) sequence classification task.
We will rely on the HuggingFace library and Transformer language models (e.g. BERT).
- https://huggingface.co/ : an open-source NLP library offering a very rich API to use different architectures and models for the classic problems of classification, sequence tagging, generation, etc. Feel free to browse the existing demos and models: https://huggingface.co/tasks/text-classification
- A fairly large number of datasets is also directly accessible via the API, notably for text and images; see the datasets at https://huggingface.co/datasets and the documentation for handling them: https://huggingface.co/docs/datasets/index
The code below installs:
- the *transformers* module, which contains the language models: https://pypi.org/project/transformers/
- the *datasets* library, to access datasets
%% Cell type:code id: tags:
``` python
!pip install -U transformers
!pip install datasets
```
%% Cell type:markdown id: tags:
Finally, if the installation was successful, we can import the required libraries:
%% Cell type:code id: tags:
``` python
import transformers
from transformers import pipeline
from datasets import load_dataset
import numpy as np
```
%% Cell type:markdown id: tags:
# 1. Sentiment analysis with a pretrained model
Many NLP tasks are easy to perform within HuggingFace thanks to the Pipeline abstraction.
Useful resource: the course available on the HuggingFace website, e.g. the part on pipelines: https://huggingface.co/course/chapter1/3?fw=pt#working-with-pipelines
For example, for text classification we can very simply access pretrained models for various tasks, including sentiment analysis:
https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TextClassificationPipeline
Let's try!
%% Cell type:markdown id: tags:
## 1.1 ▶▶ Exercise: Default model
You can test pipelines by simply specifying the task you want to perform; a model is then chosen by default.
Run the code below:
* what is the name of the chosen pretrained model?
* what language is it for?
* run the next lines and look at the model's predictions: do they seem alright? Can you produce an example that is not well predicted? (a sketch of tricky examples follows the empty cells below)
%% Cell type:code id: tags:
``` python
classifier = pipeline("sentiment-analysis")
```
%% Cell type:code id: tags:
``` python
classifier("This movie is disgustingly good !")
```
%% Cell type:code id: tags:
``` python
classifier("This movie is not as good as expected !")
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
```
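%% Cell type:markdown id: tags:
A sketch of inputs that often fool sentiment classifiers (the sentences below are ours, not part of the original exercise): sarcasm and subtle negation are typical failure cases for models fine-tuned on simple polarity data.
%% Cell type:code id: tags:
``` python
# Sarcasm and subtle negation often trip up polarity classifiers
tricky_examples = [
    "Oh great, another two hours of my life I will never get back.",
    "I expected this movie to be terrible, and it absolutely was not.",
]
classifier(tricky_examples)
```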
%% Cell type:markdown id: tags:
## 1.3 Specifying a pretrained model for English
You can specify the pretrained model you want to use.
HuggingFace makes available tons of models for NLP (and other domains).
You can browse them on this page, here restricted to English models for text classification tasks: https://huggingface.co/models?language=en&pipeline_tag=text-classification&sort=downloads
▶▶ Exercise: use the same model as before, but pass its name explicitly via the pipeline's model parameter (a sketch follows the empty cell below).
Hint: look at the doc https://huggingface.co/learn/nlp-course/chapter1/3?fw=pt#using-any-model-from-the-hub-in-a-pipeline
%% Cell type:code id: tags:
``` python
```
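%% Cell type:markdown id: tags:
One possible solution: the default English checkpoint is DistilBERT fine-tuned on SST-2, so naming it explicitly should reproduce the previous behaviour.
%% Cell type:code id: tags:
``` python
# Same model as the default, but named explicitly
classifier_en = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
classifier_en("This movie is disgustingly good !")
```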
%% Cell type:markdown id: tags:
## 1.4 ▶▶ Exercise: use a pretrained model for French
Now take a look at the models page and find a suitable model for the task in French: we want to try an adapted version of **FlauBERT**.
* Find the model on the hub and look at its documentation: how was this model built?
* Load it. You will need to install the sacremoses library using ```!pip install sacremoses```
* Then try it on a few examples (a sketch with a placeholder checkpoint follows the empty cell below).
%% Cell type:code id: tags:
``` python
!pip install sacremoses
```
%% Cell type:code id: tags:
``` python
```
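%% Cell type:markdown id: tags:
A sketch of the loading step; the checkpoint name below is a placeholder, replace it with the FlauBERT-based sentiment model you found on the hub:
%% Cell type:code id: tags:
``` python
# Placeholder name: substitute the FlauBERT sentiment checkpoint found on the hub
french_model_name = "<org>/<flaubert-sentiment-checkpoint>"
classifier_fr = pipeline("sentiment-analysis", model=french_model_name)
classifier_fr("Ce film est une très bonne surprise !")
```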
%% Cell type:markdown id: tags:
## 1.5 Exploring a dataset
In this part, we will focus on exploring datasets that are part of the HuggingFace hub.
%% Cell type:markdown id: tags:
### 1.5.1 Load a dataset
▶▶ Exercise: Find the dataset corresponding to IMDB and load it.
Doc: https://huggingface.co/datasets and https://huggingface.co/docs/datasets/load_hub
%% Cell type:code id: tags:
``` python
```
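%% Cell type:markdown id: tags:
One possible solution: the IMDB reviews are hosted on the hub under the id `imdb`.
%% Cell type:code id: tags:
``` python
# Download (and cache) the IMDB sentiment dataset: 25k train / 25k test reviews
dataset = load_dataset("imdb")
dataset
```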
%% Cell type:markdown id: tags:
### 1.5.2 Print statistics on the dataset
▶▶ Exercise:
* Print the number of classes
* Print the first 2 examples of the dataset (tip: shuffle the dataset first, the split is sorted by label)
* Print the label distribution
* Count the total number of tokens and the number of unique tokens
A sketch is given after the empty cells below. Hint: start by simply 'printing' the dataset object, i.e.:
```
dataset
```
It will show you the structure of this object.
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
```
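%% Cell type:markdown id: tags:
A minimal sketch of these statistics, counting tokens naively on whitespace for now (the model's own tokenizer comes in 1.5.3):
%% Cell type:code id: tags:
``` python
from collections import Counter

train = dataset["train"]

# Number of classes, from the ClassLabel feature
print("Classes:", train.features["label"].names)

# First 2 examples after shuffling (the split is sorted by label)
for example in train.shuffle(seed=42).select(range(2)):
    print(example["label"], example["text"][:200], "...")

# Label distribution
print("Distribution:", Counter(train["label"]))

# Total and unique tokens, with naive whitespace tokenization
counts = Counter()
for text in train["text"]:
    counts.update(text.split())
print("Total tokens:", sum(counts.values()), "- unique tokens:", len(counts))
```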
%% Cell type:markdown id: tags:
### 1.5.3 Tokenizer
The text in the dataset is not tokenized.
In fact, transformer models were trained using a specific tokenization, and it is crucial to rely on the same tokenization when using a transformer model.
%% Cell type:markdown id: tags:
#### ▶▶ Exercise: Load the pretrained model for English and test it on the first example
%% Cell type:code id: tags:
``` python
```
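%% Cell type:markdown id: tags:
One possible solution, reusing the pipeline from section 1; the truncation flag is our addition, since IMDB reviews can exceed the 512-token limit of the model:
%% Cell type:code id: tags:
``` python
# Classify the first training review, truncating inputs that exceed the model limit
first_review = dataset["train"][0]["text"]
classifier(first_review, truncation=True)
```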
%% Cell type:markdown id: tags:
#### Notes on tokenizers
Note that the HuggingFace library defines *Auto Classes*: they directly infer the required architecture from the type of model given as argument.
* For instance here, the tokenizer is specific to the DistilBERT model; more precisely, it is identical to BERT's and inherits many methods from the *PreTrainedTokenizerFast* class.
* The *transformers.AutoModelForSequenceClassification* class is used for a sequence classification model.
The tokenizer is in charge of preparing the input data: in the case of BERT, it splits tokens into sub-tokens, assigns an id to each sub-token, provides the mapping in both directions, etc.
- The *Auto Classes*: https://huggingface.co/docs/transformers/model_doc/auto
- Tokenizers in HuggingFace: https://huggingface.co/docs/transformers/v4.25.1/en/main_classes/tokenizer
- *Bert tokenizer*: https://huggingface.co/docs/transformers/v4.25.1/en/model_doc/bert#transformers.BertTokenizer
- The *PreTrainedTokenizerFast* class: https://huggingface.co/docs/transformers/v4.25.1/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast
%% Cell type:code id: tags:
``` python
from transformers import AutoTokenizer
# Checkpoint used above; we assume the default English sentiment model, adjust if you chose another one
pretrained_model = "distilbert-base-uncased-finetuned-sst-2-english"
# Defining the tokenizer using Auto Classes
tokenizer = AutoTokenizer.from_pretrained(pretrained_model)
```
%% Cell type:markdown id: tags:
#### ▶▶ Exercise: Test the tokenizer
**Use the tokenizer to:**
- encode a sentence (in English)
- convert in the other direction, from a list of token ids back to text
* what happens with long words?
* with unknown words?
* what do the elements between square brackets represent?
Hint: look at the 'encode' and 'decode' methods in the doc https://huggingface.co/docs/transformers/v4.25.1/en/main_classes/tokenizer (and possibly 'convert_ids_to_tokens()'). A sketch follows the empty cells below.
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
```
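%% Cell type:markdown id: tags:
A minimal sketch (the sentence is ours): encode a sentence, inspect the sub-tokens, and decode back.
%% Cell type:code id: tags:
``` python
# Encode: text -> token ids (special tokens are added automatically)
ids = tokenizer.encode("An unforgettable masterpiece of blorptastic cinema.")
print(ids)

# Long or unknown words are split into sub-tokens marked with '##'
print(tokenizer.convert_ids_to_tokens(ids))

# Decode: token ids -> text, including the special [CLS] and [SEP] tokens
print(tokenizer.decode(ids))
```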
%% Cell type:markdown id: tags:
#### Compute the vocabulary using the tokenizer
The function below will tokenize the entire dataset.
▶▶ Exercise: compute the total number of tokens and the number of unique tokens (a sketch follows the tokenized dataset below).
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
def tokenize_function(examples):
    return tokenizer(examples["text"])

tokenized_datasets = dataset.map(tokenize_function)
```
%% Cell type:code id: tags:
``` python
tokenized_datasets
```
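%% Cell type:markdown id: tags:
Once the dataset is tokenized, the counts can be read off the input_ids column; a minimal sketch on the training split:
%% Cell type:code id: tags:
``` python
# Total and unique sub-token ids over the training split
total_subtokens = 0
unique_ids = set()
for ids in tokenized_datasets["train"]["input_ids"]:
    total_subtokens += len(ids)
    unique_ids.update(ids)
print("Total sub-tokens:", total_subtokens, "- unique sub-tokens:", len(unique_ids))
```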
%% Cell type:markdown id: tags:
Note that the tokenizer returns two elements:
- input_ids: the numbers representing the tokens in the text.
- attention_mask: indicates which tokens the model should attend to (1) and which it should ignore (0), typically padding.
More info on datasets: https://huggingface.co/docs/datasets/use_dataset
%% Cell type:code id: tags:
``` python
```
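%% Cell type:markdown id: tags:
For instance, calling the tokenizer directly on a short sentence shows both fields:
%% Cell type:code id: tags:
``` python
# The tokenizer returns a dict with input_ids and attention_mask
tokenizer("A short example.")
```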
%% Cell type:markdown id: tags:
# Additional notes about HuggingFace dataset
%% Cell type:markdown id: tags:
## Available corpora
Note that many corpora are available directly from HuggingFace, for example for text classification tasks:
https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads
In particular you can directly load the full AlloCine corpus:
https://huggingface.co/datasets/allocine
%% Cell type:markdown id: tags:
## Some preprocessing
The library makes it very easy to perform some preprocessing directly on the Dataset object.
Take a look at the doc: https://huggingface.co/course/chapter5/3?fw=pt
For example, here we can compute the length of each review and filter our dataset to exclude outliers, e.g. reviews with too few words.
%% Cell type:code id: tags:
``` python
def compute_review_length(example):
    # Assumes a "review" column, as in AlloCine; for IMDB the column is "text"
    return {"review_length": len(example["review"].split())}

dataset = dataset.map(compute_review_length)  # Add the review_length column
# Inspect the first training example
dataset["train"][0]
%% Cell type:markdown id: tags:
Some reviews are very short... Dataset.filter() can be used to remove some examples.
%% Cell type:code id: tags:
``` python
dataset["train"].sort("review_length")[:3]
```
%% Cell type:code id: tags:
``` python
filtered_dataset = dataset.filter(lambda x: x["review_length"] > 10)
print(filtered_dataset.num_rows)
```