%% Cell type:markdown id: tags:
# TP 9: Additional elements on Transformers
In this practical session, we will look at additional elements of the HuggingFace library:
* Importing a dataset
* Modifying a dataset
* Tuning hyper-parameters
* (code given) Reporting to wandb
In this session, we will see how to use a pre-trained model and adapt it to a new task (transfer learning). This practical session follows up on TP6.
Reminder: the code below installs:
- the *transformers* module, which contains the language models: https://pypi.org/project/transformers/
- the *datasets* library, to access datasets
- the *evaluate* library, used to evaluate and compare models: https://pypi.org/project/evaluate/
%% Cell type:code id: tags:
``` python
!pip install -U transformers
!pip install accelerate -U
!pip install datasets
!pip install evaluate
```
%% Cell type:markdown id: tags:
Finally, if the installation is successful, we can import the transformers library:
%% Cell type:code id: tags:
``` python
import transformers
from datasets import load_dataset, Dataset
import evaluate
import numpy as np
import sklearn
```
%% Cell type:code id: tags:
``` python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import TrainingArguments, Trainer
```
%% Cell type:code id: tags:
``` python
import pandas as pd
from tqdm import tqdm
```
%% Cell type:markdown id: tags:
Path to data, dataset for genre classification of movies:
%% Cell type:code id: tags:
``` python
dataset_file = 'train_data.txt'
```
%% Cell type:markdown id: tags:
# 1- Importing a dataset
We saw how to import a dataset from a CSV file; here, we import a dataset that is not in CSV format.
There are different ways of importing this dataset: https://huggingface.co/docs/datasets/create_dataset
%% Cell type:markdown id: tags:
## 1-1 Using a dictionary
First solution: build a dictionary storing the information for each example, see the code below.
▶▶ **Exercise:** Now, build a Dataset object based on this function. You will use the *from_dict* method: https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.from_dict
Finally, print the Dataset keys and the first example.
%% Cell type:code id: tags:
``` python
def read_dataset( dataset_file ):
    dataset_dict = {"id":[], "title":[], "genre":[], "plot":[] }
    with open( dataset_file, 'r' ) as f:
        mylines = f.readlines()
    for l in mylines:
        l = l.strip()
        data = l.split(' ::: ')
        dataset_dict["id"].append( data[0] )
        dataset_dict["title"].append( data[1] )
        dataset_dict["genre"].append( data[2] )
        dataset_dict["plot"].append( data[3] )
    return dataset_dict
```
%% Cell type:markdown id: tags:
-------------------
SOLUTION
%% Cell type:code id: tags:
``` python
```
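%% Cell type:markdown id: tags:
One possible solution sketch (the variable name *ds* is ours; adapt it to your own code):
%% Cell type:code id: tags:
``` python
# Build the dictionary, then turn it into a Dataset object
dataset_dict = read_dataset( dataset_file )
ds = Dataset.from_dict( dataset_dict )

# Column names (keys) and first example
print( ds.column_names )
print( ds[0] )
```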
%% Cell type:markdown id: tags:
## 1-2 Using a generator
Suppose we import the dataset using the function below, which yields (generates) the examples one by one while reading the input file.
▶▶ **Exercise:** Now, build a Dataset object based on this function. You will use the *from_generator* method:
* The method is described here: https://huggingface.co/docs/datasets/create_dataset#from-python-dictionaries
* You'll probably need to take a look at the API: https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.from_generator
Finally, print the Dataset keys and the first example.
%% Cell type:code id: tags:
``` python
def read_dataset( dataset_file ):
    with open( dataset_file, 'r' ) as f:
        mylines = f.readlines()
    for l in mylines:
        l = l.strip()
        data = l.split(' ::: ')
        yield {'id':data[0], "title":data[1], "genre":data[2], "plot":data[3] }
```
%% Cell type:markdown id: tags:
----------------
SOLUTION
%% Cell type:code id: tags:
``` python
```
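%% Cell type:markdown id: tags:
One possible solution sketch, using *from_generator* with *gen_kwargs* to pass the file path (variable names are ours):
%% Cell type:code id: tags:
``` python
# The generator is passed as a callable; its arguments go into gen_kwargs
ds = Dataset.from_generator( read_dataset, gen_kwargs={"dataset_file": dataset_file} )

print( ds.column_names )
print( ds[0] )
```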
%% Cell type:markdown id: tags:
## 1-3 Using a Pandas DataFrame
▶▶ **Exercise:**
* Read the dataset and save a Pandas dataframe
* Transform the dataframe into a Dataset object.
%% Cell type:markdown id: tags:
-------------------
SOLUTION
%% Cell type:code id: tags:
``` python
```
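%% Cell type:markdown id: tags:
One possible solution sketch, assuming the same ' ::: '-separated format as above (the DataFrame and Dataset names are ours):
%% Cell type:code id: tags:
``` python
import pandas as pd

# Read the file into a DataFrame; the multi-character separator requires the python engine
df = pd.read_csv( dataset_file, sep=' ::: ', engine='python',
                  names=["id", "title", "genre", "plot"] )

# Convert the DataFrame into a Dataset object
ds = Dataset.from_pandas( df )

print( ds.column_names )
print( ds[0] )
```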
%% Cell type:markdown id: tags:
# 2- Modifying a dataset
In the original dataset, the labels are given as text.
For use with HuggingFace, we need to have numeric labels.
But first, we'll see how we can use the filter function to modify the dataset.
%% Cell type:markdown id: tags:
## 2-1 Filtering the dataset
Imagine we want to remove a certain category, for example the least represented one.
▶▶ **Exercise:**
- Count the initial number of examples (i.e. number of rows)
- Print the number of unique labels and the list of labels
- Find the least represented label
- Remove all examples of this category using the *filter* function
- Check the number of unique labels in the filtered dataset
- Recompute the id-to-label mapping
- Count the number of examples after filtering
%% Cell type:markdown id: tags:
-------------------
SOLUTION
%% Cell type:code id: tags:
``` python
```
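%% Cell type:markdown id: tags:
One possible solution sketch, assuming the Dataset built in part 1 is called *ds* (adapt the name to your own code):
%% Cell type:code id: tags:
``` python
from collections import Counter

# Initial number of examples
print( "Number of examples:", ds.num_rows )

# Unique labels
counts = Counter( ds['genre'] )
print( len(counts), "labels:", sorted(counts) )

# Least represented label
least_label = min( counts, key=counts.get )
print( "Least represented label:", least_label, "(", counts[least_label], "examples )" )

# Remove all its examples with the filter function
ds_filtered = ds.filter( lambda example: example['genre'] != least_label )

# Check the remaining labels and recompute the id-to-label mapping
remaining = sorted( set( ds_filtered['genre'] ) )
print( len(remaining), "labels after filtering" )
id2label = dict( enumerate( remaining ) )

# Number of examples after filtering
print( "Number of examples after filtering:", ds_filtered.num_rows )
```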
%% Cell type:markdown id: tags:
## 2-2 Mapping of labels
▶▶ **Exercise:**
* Build a mapping from each label to a numeric value.
%% Cell type:markdown id: tags:
-----------------
SOLUTION
%% Cell type:code id: tags:
``` python
```
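%% Cell type:markdown id: tags:
One possible solution sketch, reusing *ds_filtered* from the previous step:
%% Cell type:code id: tags:
``` python
# Map each remaining genre to a numeric id (and keep the reverse mapping)
label2id = { label: i for i, label in enumerate( sorted( set( ds_filtered['genre'] ) ) ) }
id2label = { i: label for label, i in label2id.items() }
print( label2id )
```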
%% Cell type:markdown id: tags:
## 2-3 Adding numeric labels to the dataset
HuggingFace models expect a column called 'label' that contains a numeric label.
We will add this column to the whole dataset.
You'll need to look at the API: https://huggingface.co/docs/datasets/package_reference/main_classes
▶▶ **Exercise:**
- Add a column called 'label' to the Dataset ds_filtered
- with values corresponding to the numeric label
- Print the keys of the augmented dataset (note that the transformation is not done in place: a new Dataset object is returned).
%% Cell type:code id: tags:
``` python
ds_filtered
```
%% Cell type:markdown id: tags:
----------------
SOLUTION
%% Cell type:code id: tags:
``` python
```
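%% Cell type:markdown id: tags:
One possible solution sketch, using *add_column* (which returns a new Dataset); it reuses *ds_filtered* and *label2id* from the previous steps:
%% Cell type:code id: tags:
``` python
# Compute the numeric label of each example, then add it as a new 'label' column
labels = [ label2id[g] for g in ds_filtered['genre'] ]
ds_labeled = ds_filtered.add_column( "label", labels )

print( ds_labeled.column_names )
```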
%% Cell type:markdown id: tags:
## 2-4 Mapping
Let's say we want to add the title to the plot, for our future classification task.
▶▶ **Exercise:**
- Use the *map* function to add the title to the plot, using the function below.
See the doc: https://huggingface.co/docs/datasets/process#map
%% Cell type:code id: tags:
``` python
def add_plot( example ):
    example['plot'] = example['title'] + " " + example['plot']
    return example
```
%% Cell type:markdown id: tags:
-----------------------
SOLUTION
%% Cell type:code id: tags:
``` python
```
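%% Cell type:markdown id: tags:
One possible solution sketch (here applied to *ds_labeled*, the dataset with the added 'label' column):
%% Cell type:code id: tags:
``` python
# map applies add_plot to every example and returns a new Dataset
ds_mapped = ds_labeled.map( add_plot )

# Check that the title is now prepended to the plot
print( ds_mapped[0]['plot'][:100] )
```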
%% Cell type:markdown id: tags:
## 2-5 Shuffle and split
▶▶ **Exercise:**
- Shuffle the final dataset
- split into train, dev, test
%% Cell type:markdown id: tags:
----------------
SOLUTION
%% Cell type:code id: tags:
``` python
```
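%% Cell type:markdown id: tags:
One possible solution sketch: shuffle, then split 80/10/10 with *select* (the ratios are arbitrary). The names *dataset_train*, *dataset_dev* and *dataset_test* are reused in the DatasetDict cell below.
%% Cell type:code id: tags:
``` python
ds_shuffled = ds_mapped.shuffle( seed=42 )

n = ds_shuffled.num_rows
dataset_train = ds_shuffled.select( range( 0, int(0.8 * n) ) )
dataset_dev = ds_shuffled.select( range( int(0.8 * n), int(0.9 * n) ) )
dataset_test = ds_shuffled.select( range( int(0.9 * n), n ) )

print( dataset_train.num_rows, dataset_dev.num_rows, dataset_test.num_rows )
```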
%% Cell type:markdown id: tags:
## 2-6 DatasetDict
Finally, we put the datasets into a DatasetDict object, with the splits as keys, which is easier to handle:
%% Cell type:code id: tags:
``` python
from datasets.dataset_dict import DatasetDict
d = {'train':dataset_train,
'val':dataset_dev,
'test':dataset_test
}
dataset_dict = DatasetDict(d)
```
%% Cell type:code id: tags:
``` python
print( len(np.unique(dataset_dict["train"]['genre'])))
print( len(np.unique(dataset_dict["train"]['label'])))
print( np.unique(dataset_dict["train"]['label']) )
```
%% Cell type:markdown id: tags:
# 3- Simple training
The HuggingFace Trainer supports hyperparameter search based on Optuna or Ray Tune.
First, let's launch a simple fine-tuning run; we'll see below what we need to modify to run a hyper-parameter search.
%% Cell type:markdown id: tags:
## 3-1 Tokenization
Our base model will be distilBERT (cased or uncased).
▶▶ **Exercise:** Tokenize the dataset based on this model. Define a tokenize_function, then use *map* to apply it to the entire DatasetDict.
%% Cell type:markdown id: tags:
--------------------
SOLUTION
%% Cell type:code id: tags:
``` python
```
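%% Cell type:markdown id: tags:
One possible solution sketch; the exact checkpoint name below is an assumption (the cased variant works the same way). The names *base_model*, *tokenizer* and *tokenized_datasets* are reused in the following cells.
%% Cell type:code id: tags:
``` python
base_model = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained( base_model )

def tokenize_function( examples ):
    # Truncate / pad the plots to the model's maximum input length
    return tokenizer( examples["plot"], truncation=True, padding="max_length" )

# batched=True tokenizes the examples by batches, which is faster
tokenized_datasets = dataset_dict.map( tokenize_function, batched=True )
print( tokenized_datasets )
```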
%% Cell type:markdown id: tags:
## 3-2 Initialize the model
Before training, we need to define:
* a training config, i.e. *TrainingArguments*
* an evaluation metric
▶▶ **Exercise:** Take a look at the training arguments below:
* Add a comment on each line to explain the argument
* Refer to the API if needed: https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments
%% Cell type:code id: tags:
``` python
# Evaluate during training and a bit more often
# than the default to be able to prune bad trials early.
training_args = TrainingArguments(
output_dir="test_trainer",
seed=42,
no_cuda=False,
per_device_train_batch_size=4,
evaluation_strategy="steps",
eval_steps=100,
save_strategy="best",
metric_for_best_model="eval_loss",
greater_is_better=False,
learning_rate=5e-5,
num_train_epochs=3,
report_to="none",
#log_level="debug"
)
```
%% Cell type:markdown id: tags:
----------------------------
SOLUTION
%% Cell type:markdown id: tags:
The code below defines the metrics used to compute performance, here accuracy.
We also define a function that tells the Trainer how to compute performance from the model's output.
%% Cell type:code id: tags:
``` python
metric = evaluate.load("accuracy")
```
%% Cell type:code id: tags:
``` python
def compute_metrics(eval_pred):
    metric = evaluate.load("accuracy")
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
```
%% Cell type:markdown id: tags:
## 3-3 Launch training
The function below is used to retrieve the model.
In the previous TP, we simply defined the model with something like *model = AutoModel...(...)* and used it as the value of the *model* argument of the trainer.
But for hyper-parameter search (below), we need to define a function initializing the model, that will be called at each run. See: https://huggingface.co/docs/transformers/main/main_classes/trainer#transformers.Trainer.hyperparameter_search
%% Cell type:code id: tags:
``` python
# Here we need to specify the number of labels.
# Note that model_init doesn't take any argument: if you want to set the
# number of labels from outside, you need to define this function inside
# e.g. your own train function (as a closure).
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(
        base_model, num_labels = 26 )
```
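%% Cell type:markdown id: tags:
The trainer below uses reduced train and eval sets (*small_train_dataset*, *small_eval_dataset*) so that it can run on CPU. One possible way to build them (the sizes chosen here are arbitrary):
%% Cell type:code id: tags:
``` python
# Small random subsets of the tokenized splits, to keep training fast on CPU
small_train_dataset = tokenized_datasets["train"].shuffle( seed=42 ).select( range(1000) )
small_eval_dataset = tokenized_datasets["val"].shuffle( seed=42 ).select( range(200) )
```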
%% Cell type:code id: tags:
``` python
trainer = Trainer(
model_init=model_init,
args=training_args,
train_dataset=small_train_dataset,
eval_dataset=small_eval_dataset,
compute_metrics=compute_metrics,
)
```
%% Cell type:markdown id: tags:
Now we can launch training; we will compare the results obtained with default values to the results of the hyper-parameter search.
%% Cell type:code id: tags:
``` python
import os
trainer.train( )
```
%% Cell type:code id: tags:
``` python
trainer.save_model( "best_model" )
```
%% Cell type:code id: tags:
``` python
# use a small version of the dataset if run on CPU
logits, gold, metrics = trainer.predict( small_eval_dataset )
#logits, gold, metrics = trainer.predict( tokenized_datasets["val"] )
```
%% Cell type:code id: tags:
``` python
predictions = np.argmax(logits, axis=-1)
all_metrics = metric.compute(predictions=predictions, references=gold)
print( all_metrics )
```
%% Cell type:markdown id: tags:
# 4- (code given) Reporting to wandb
%% Cell type:markdown id: tags:
Weights & Biases (wandb) is a platform that can be used to save the results of your experiments and make comparisons easier.
You need an account to use it; here we'll just see how it works.
Can you spot the differences in the training arguments below?
https://wandb.ai/amogkam/transformers/reports/Hyperparameter-Optimization-for-Huggingface-Transformers--VmlldzoyMTc2ODI
%% Cell type:code id: tags:
``` python
!pip install wandb
```
%% Cell type:code id: tags:
``` python
import wandb
```
%% Cell type:code id: tags:
``` python
# Needs to log during training
training_args = TrainingArguments(
output_dir="test_trainer", # Name of the directory where model will be saved
seed=42, # seed for random initialization
no_cuda=False, # whether to use GPU or not
per_device_train_batch_size=4, # Train batch size (on each GPU/CPU)
evaluation_strategy="steps", # when we want to report evaluation during training
eval_steps=10, # number of steps before reporting evaluation during training
save_strategy="best", # strategy to save models
metric_for_best_model="eval_loss", # metrics to choose the best model
greater_is_better=False, # for metrics best model: False since eval on loss
learning_rate=5e-5, # learning rate value
num_train_epochs=3, # Number of epochs / iterations
report_to="wandb", # <<<<<< reports results to some platforms
log_level="debug", # log level
logging_strategy="steps", # <<<<
logging_steps=10, # <<<<
)
```
%% Cell type:code id: tags:
``` python
# start a new wandb run to track this script
wandb.init(
    # set the wandb project where this run will be logged
    entity='teaching',
    project="tp9_litl",
    # track hyperparameters and run metadata
    config={
        "model_checkpoint": base_model,
        "dataset": dataset_file,
    }
)
```
%% Cell type:code id: tags:
``` python
trainer = Trainer(
model_init=model_init,
args=training_args,
train_dataset=small_train_dataset,
eval_dataset=small_eval_dataset,
compute_metrics=compute_metrics,
)
```
%% Cell type:code id: tags:
``` python
import os
trainer.train( )
```
%% Cell type:markdown id: tags:
# 5- (Code given) Run hyperparameter search
The hyper-parameter search is called on the trainer.
By default, each trial will utilize 1 CPU, and optionally 1 GPU if available.
Note that it can take a long time to run on the full dataset for some of the tasks. You can try to find good hyperparameters on a portion of the training dataset by replacing the train_dataset line by:
```
train_dataset = encoded_dataset["train"].shard(index=1, num_shards=10)
```
for 1/10th of the dataset. Then you can run a full training on the best hyperparameters picked by the search.
%% Cell type:code id: tags:
``` python
!pip install ray[tune]
```
%% Cell type:code id: tags:
``` python
shard_train_dataset = tokenized_datasets["train"].shard(index=1, num_shards=10)
```
%% Cell type:code id: tags:
``` python
from ray.tune.search.hyperopt import HyperOptSearch
from ray.tune.schedulers import ASHAScheduler
from ray import tune
from ray.tune.schedulers import PopulationBasedTraining
from ray.tune import CLIReporter
```
%% Cell type:markdown id: tags:
## 5-1 Simple example of hyper-parameter search
We give the training arguments again and re-initialize the trainer below.
Then we can run the hyper-parameter search, with default arguments.
%% Cell type:code id: tags:
``` python
training_args = TrainingArguments(
output_dir="test_trainer", # Name of the directory where model will be saved
seed=42, # seed for random initialization
no_cuda=False, # whether to use GPU or not
per_device_train_batch_size=4, # Train batch size (on each GPU/CPU)
evaluation_strategy="steps", # when we want to report evaluation during training
eval_steps=10, # number of steps before reporting evaluation during training
save_strategy="best", # strategy to save models
metric_for_best_model="eval_loss", # metrics to choose the best model
greater_is_better=False, # for metrics best model: False since eval on loss
learning_rate=5e-5, # learning rate value
num_train_epochs=3, # Number of epochs / iterations
report_to="wandb", # reports results to some platforms
log_level="debug", # log level
logging_strategy="steps", #
logging_steps=10, #
)
trainer = Trainer(
model_init=model_init,
args=training_args,
train_dataset=shard_train_dataset,
eval_dataset=small_eval_dataset,
compute_metrics=compute_metrics,
tokenizer=tokenizer
)
```
%% Cell type:code id: tags:
``` python
# start a new wandb run to track this script
wandb.init(
    # set the wandb project where this run will be logged
    entity='teaching',
    project="tp9_litl_ray",
    # track hyperparameters and run metadata
    config={
        "model_checkpoint": base_model,
        "dataset": dataset_file,
    }
)
```
%% Cell type:code id: tags:
``` python
tune_config = {
"learning_rate": tune.loguniform(1e-4, 1e-2),
"num_train_epochs": tune.choice(range(1, 6)),
"seed": tune.choice(range(1, 41)),
"per_device_train_batch_size": tune.choice([2, 8]),
}
```
%% Cell type:code id: tags:
``` python
# Default objective is the sum of all metrics
# when metrics are provided, so we have to maximize it.
best_run = trainer.hyperparameter_search(
#hp_space=lambda _: tune_config,
direction="maximize",
backend="ray",
n_trials=3 # number of trials, here very low
)
```
%% Cell type:markdown id: tags:
The hyperparameter_search method returns a BestRun object, which contains the value of the objective that was maximized (by default the sum of all metrics) and the hyperparameters used for that run.
%% Cell type:code id: tags:
``` python
best_run
```
%% Cell type:markdown id: tags:
You can customize the objective to maximize by passing along a compute_objective function to the hyperparameter_search method, and you can customize the search space by passing a hp_space argument to hyperparameter_search.
See this forum post for some examples: https://discuss.huggingface.co/t/using-hyperparameter-search-in-trainer/785/10
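%% Cell type:markdown id: tags:
For instance, a sketch reusing the *tune_config* search space defined above and maximizing the evaluation accuracy instead of the default objective (the number of trials is kept very low here):
%% Cell type:code id: tags:
``` python
# Objective computed from the evaluation metrics returned by compute_metrics
def compute_objective( metrics ):
    return metrics["eval_accuracy"]

best_run = trainer.hyperparameter_search(
    hp_space=lambda _: tune_config,      # our own search space
    compute_objective=compute_objective, # our own objective
    direction="maximize",
    backend="ray",
    n_trials=3,
)
```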
%% Cell type:markdown id: tags:
To reproduce the best training, just set the hyperparameters in your TrainingArgument before creating a Trainer:
%% Cell type:code id: tags:
``` python
for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)

trainer.train()
```
%% Cell type:markdown id: tags:
You can also easily swap in different parameter-tuning algorithms such as HyperBand, Bayesian Optimization, or Population-Based Training.
Read the post: https://huggingface.co/blog/ray-tune
Full example on text classification: https://github.com/huggingface/notebooks/blob/main/examples/text_classification.ipynb
%% Cell type:code id: tags:
``` python
```