"Path to data, dataset for genre classification of movies:"
],
"metadata": {
"id": "urp4cUXq42Us"
}
},
{
"cell_type": "code",
"source": [
"dataset_file = 'train_data.txt'"
],
"metadata": {
"id": "4vnlP28r46SI"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"# 1- Importing a dataset\n",
"\n",
"We saw how to import a dataset in CSV, here say that we import a dataset not in CSV.\n",
"There are different ways of importing this dataset: https://huggingface.co/docs/datasets/create_dataset\n"
],
"metadata": {
"id": "ACnrPB_kyS6j"
}
},
{
"cell_type": "markdown",
"source": [
"## 1-2 Using a dictionnary\n",
"\n",
"First solution: having a dictionnary saving the info for each examples, see the code below.\n",
"\n",
"▶▶ **Exercise:** Now, build a Dataset object based on this function. You will use the *from_dict* method: https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.from_dict\n",
"\n",
"Finally, print the Dataset keys and the first example."
"Suppose we import the dataset using the function below, with a function yielding / generating the examples while reading the input file.\n",
"\n",
"▶▶ **Exercise:** Now, build a Dataset object based on this function. You will use the *from_generator* method:\n",
"* The method is described here: https://huggingface.co/docs/datasets/create_dataset#from-python-dictionaries\n",
"* You'll probably need to take a look at the API: https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.from_generator\n",
"\n",
"Finally, print the Dataset keys and the first example."
"The function below is used to retrieve the model.\n",
"In the previous TP, we were simply defining the model with something like *model = AutoModel...(...)* and using it as the value for the *model* argument of thre trainer.\n",
"But for hyper-parameter search (below), we need to define a function initializing the model, that will be called at each run. See: https://huggingface.co/docs/transformers/main/main_classes/trainer#transformers.Trainer.hyperparameter_search"
],
"metadata": {
"id": "xrADeM0tU63G"
}
},
{
"cell_type": "code",
"source": [
"# Here we need to specify the number of labels\n",
"# Note that model_init doesn't take an argument, if you want to specify the\n",
"# number of labels outside the function, you need to embed the methods within\n",
" output_dir=\"test_trainer\", # Name of the directory where model will be saved\n",
" seed=42, # seed for random initialization\n",
" no_cuda=False, # whether to use GPU or not\n",
" per_device_train_batch_size=4, # Train batch size (on each GPU/CPU)\n",
" evaluation_strategy=\"steps\", # when we want to report evaluation during training\n",
" eval_steps=10, # number of steps before reporting evaluation during training\n",
" save_strategy=\"best\", # strategy to save models\n",
" metric_for_best_model=\"eval_loss\", # metrics to choose the best model\n",
" greater_is_better=False, # for metrics best model: False since eval on loss\n",
" learning_rate=5e-5, # learning rate value\n",
" num_train_epochs=3, # Number of epochs / iterations\n",
" report_to=\"wandb\", # <<<<<< reports results to some platforms\n",
" log_level=\"debug\", # log level\n",
" logging_strategy=\"steps\", # <<<<\n",
" logging_steps=10, # <<<<\n",
" )\n",
"\n"
],
"metadata": {
"id": "TvRtaEBZHnUn"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# start a new wandb run to track this script\n",
"wandb.init(\n",
" # set the wandb project where this run will be logged\n",
" entity='teaching',\n",
" project=\"tp9_litl\",\n",
" # track hyperparameters and run metadata\n",
" # track hyperparameters and run metadata\n",
"\t\t config={\n",
"\t\t\t \"model_checkpoint\": base_model,\n",
"\t\t\t \"dataset\": dataset_file,\n",
"\t\t }\n",
")"
],
"metadata": {
"id": "ghd5iyi49b3U"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Z4Ke6NUb9J1q"
},
"outputs": [],
"source": [
"trainer = Trainer(\n",
" model_init=model_init,\n",
" args=training_args,\n",
" train_dataset=small_train_dataset,\n",
" eval_dataset=small_eval_dataset,\n",
" compute_metrics=compute_metrics,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "vucFggUT9J1v"
},
"outputs": [],
"source": [
"import os\n",
"trainer.train( )"
]
},
{
"cell_type": "markdown",
"source": [
"# 5- (Code given) Run hyperparameter search\n",
"\n",
"The hyper-parameter search is called on the trainer.\n",
"\n",
"By default, each trial will utilize 1 CPU, and optionally 1 GPU if available.\n",
"\n",
" Note that it can take a long time to run on the full dataset for some of the tasks. You can try to find some good hyperparameter on a portion of the training dataset by replacing the train_dataset line by:\n",
"# Default objective is the sum of all metrics\n",
"# when metrics are provided, so we have to maximize it.\n",
"best_run = trainer.hyperparameter_search(\n",
" #hp_space=lambda _: tune_config,\n",
" direction=\"maximize\",\n",
" backend=\"ray\",\n",
" n_trials=3 # number of trials, here very low\n",
")"
],
"metadata": {
"id": "WRMCdQXUPMcj"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"\n",
"\n",
"The hyperparameter_search method returns a BestRun objects, which contains the value of the objective maximized (by default the sum of all metrics) and the hyperparameters it used for that run.\n",
"\n"
],
"metadata": {
"id": "ZQsI7iMgtnwv"
}
},
{
"cell_type": "code",
"source": [
"best_run"
],
"metadata": {
"id": "nZrTWUeMtqNB"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"You can customize the objective to maximize by passing along a compute_objective function to the hyperparameter_search method, and you can customize the search space by passing a hp_space argument to hyperparameter_search.\n",
"See this forum post for some examples: https://discuss.huggingface.co/t/using-hyperparameter-search-in-trainer/785/10"
],
"metadata": {
"id": "3ShywJOst4QY"
}
},
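{
"cell_type": "markdown",
"source": [
"For instance, a minimal sketch, assuming *compute_metrics* reports an *eval_accuracy* metric (the metric name and search space below are illustrative):"
],
"metadata": {
"id": "sketchObjectiveMd"
}
},
{
"cell_type": "code",
"source": [
"from ray import tune\n",
"\n",
"# Hypothetical objective: maximize accuracy only (metric name assumed)\n",
"def my_objective(metrics):\n",
"    return metrics[\"eval_accuracy\"]\n",
"\n",
"best_run_acc = trainer.hyperparameter_search(\n",
"    hp_space=lambda _: {\"learning_rate\": tune.loguniform(1e-5, 1e-3)},\n",
"    compute_objective=my_objective,\n",
"    direction=\"maximize\",\n",
"    backend=\"ray\",\n",
"    n_trials=2,\n",
")"
],
"metadata": {
"id": "sketchObjectiveCode"
},
"execution_count": null,
"outputs": []
},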
{
"cell_type": "markdown",
"source": [
"To reproduce the best training, just set the hyperparameters in your TrainingArgument before creating a Trainer:"
],
"metadata": {
"id": "CWqlhRy3t9dP"
}
},
{
"cell_type": "code",
"source": [
"for n, v in best_run.hyperparameters.items():\n",
" setattr(trainer.args, n, v)\n",
"\n",
"trainer.train()"
],
"metadata": {
"id": "udpzh2YVt_9o"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"You can also easily swap different parameter tuning algorithms such as HyperBand, Bayesian Optimization, Population-Based Training.\n",
"\n",
"Read the post: https://huggingface.co/blog/ray-tune\n",
"\n",
"Full example on text classification: https://github.com/huggingface/notebooks/blob/main/examples/text_classification.ipynb"
],
"metadata": {
"id": "Xu-q6Rs1Pa58"
}
},
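{
"cell_type": "markdown",
"source": [
"A minimal sketch of swapping in the ASHA (HyperBand-style) scheduler; with the Ray backend, extra keyword arguments are forwarded to Ray Tune (usage follows the blog post linked above):"
],
"metadata": {
"id": "sketchSchedulerMd"
}
},
{
"cell_type": "code",
"source": [
"from ray.tune.schedulers import ASHAScheduler\n",
"\n",
"best_run_asha = trainer.hyperparameter_search(\n",
"    direction=\"maximize\",\n",
"    backend=\"ray\",\n",
"    n_trials=3,\n",
"    # passed through to Ray Tune\n",
"    scheduler=ASHAScheduler(metric=\"objective\", mode=\"max\"),\n",
")"
],
"metadata": {
"id": "sketchSchedulerCode"
},
"execution_count": null,
"outputs": []
},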
{
"cell_type": "code",
"source": [],
"metadata": {
"id": "yldJ_Uolc5Ii"
},
"execution_count": null,
"outputs": []
}
],
"metadata": {
"kernelspec": {
"display_name": "visual",
"language": "python",
"name": "visual"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.5"
},
"colab": {
"provenance": [],
"gpuType": "T4",
"toc_visible": true
},
"accelerator": "GPU"
},
"nbformat": 4,
"nbformat_minor": 0
}
%% Cell type:markdown id: tags:
# TP 9: Additional elements on Transformers
In this practical session, we will look at additional elements of the HuggingFace library:
* Importing a dataset
* Modifying a dataset
* Tuning hyper-parameters
* (code given) Reporting to wandb
In this session, we will see how to use a pre-trained model and adapt it to a new task (transfer learning). This TP follows on from TP6.
Reminder: the code below lets you install:
- the *transformers* module, which contains the language models https://pypi.org/project/transformers/
- the *datasets* library, to access datasets
- the *evaluate* library, used to evaluate and compare models https://pypi.org/project/evaluate/
%% Cell type:code id: tags:
``` python
!pip install -U transformers
!pip install accelerate -U
!pip install datasets
!pip install evaluate
```
%% Cell type:markdown id: tags:
Finally, if the installation is successful, we can import the transformers library:
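
%% Cell type:code id: tags:

``` python
# Reconstructed cell (the import cell is missing from this dump):
# a minimal sanity check that the installation worked
import transformers
print(transformers.__version__)
```

%% Cell type:markdown id: tags:
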
Path to data, dataset for genre classification of movies:
%% Cell type:code id: tags:
``` python
dataset_file='train_data.txt'
```
%% Cell type:markdown id: tags:
# 1- Importing a dataset
We saw how to import a dataset in CSV format; suppose now that we want to import a dataset that is not in CSV.
There are different ways of importing this dataset: https://huggingface.co/docs/datasets/create_dataset
%% Cell type:markdown id: tags:
## 1-2 Using a dictionary

First solution: build a dictionary storing the info for each example; see the code below.
▶▶ **Exercise:** Now, build a Dataset object based on this function. You will use the *from_dict* method: https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.from_dict
Finally, print the Dataset keys and the first example.
%% Cell type:markdown id: tags:

Suppose we import the dataset using the function below, with a function yielding / generating the examples while reading the input file.

▶▶ **Exercise:** Now, build a Dataset object based on this function. You will use the *from_generator* method:
* The method is described here: https://huggingface.co/docs/datasets/create_dataset#from-python-dictionaries
* You'll probably need to take a look at the API: https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.from_generator

Finally, print the Dataset keys and the first example.
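
%% Cell type:markdown id: tags:

Below, a minimal sketch of both constructors on toy data (not the exercise solution; the toy *title* / *genre* fields are made up for illustration):

%% Cell type:code id: tags:

``` python
from datasets import Dataset

# from_dict: one dict mapping each column name to the list of its values
toy = Dataset.from_dict({"title": ["Alien", "Heat"], "genre": ["sci-fi", "thriller"]})

# from_generator: a no-argument function yielding one dict per example
def gen():
    for title, genre in [("Alien", "sci-fi"), ("Heat", "thriller")]:
        yield {"title": title, "genre": genre}

toy2 = Dataset.from_generator(gen)
print(toy2.column_names)  # the Dataset keys
print(toy2[0])            # the first example
```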
%% Cell type:markdown id: tags:

The function below is used to retrieve the model.
In the previous TP, we simply defined the model with something like *model = AutoModel...(...)* and used it as the value of the *model* argument of the trainer.
But for hyper-parameter search (below), we need to define a function that initializes the model, which will be called at each run. See: https://huggingface.co/docs/transformers/main/main_classes/trainer#transformers.Trainer.hyperparameter_search
%% Cell type:code id: tags:
``` python
# Here we need to specify the number of labels.
# Note that model_init doesn't take an argument; if you want to specify the
# number of labels outside the function, you need to embed the function
# within a closure (or rely on a global, as here).
def model_init():
    # Reconstructed lines: assume base_model and num_labels are defined
    # earlier in the notebook, and AutoModelForSequenceClassification is imported
    return AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=num_labels)

training_args = TrainingArguments(
    output_dir="test_trainer",          # Name of the directory where the model will be saved
    seed=42,                            # seed for random initialization
    no_cuda=False,                      # whether to use GPU or not
    per_device_train_batch_size=4,      # Train batch size (on each GPU/CPU)
    evaluation_strategy="steps",        # when we want to report evaluation during training
    eval_steps=10,                      # number of steps before reporting evaluation during training
    save_strategy="best",               # strategy to save models
    metric_for_best_model="eval_loss",  # metric to choose the best model
    greater_is_better=False,            # False since the best-model metric is the loss
    learning_rate=5e-5,                 # learning rate value
    num_train_epochs=3,                 # Number of epochs / iterations
    report_to="wandb",                  # <<<<<< reports results to some platforms
    log_level="debug",                  # log level
    logging_strategy="steps",           # <<<<
    logging_steps=10,                   # <<<<
)
```
%% Cell type:code id: tags:
``` python
import wandb

# start a new wandb run to track this script
wandb.init(
    # set the wandb entity and project where this run will be logged
    entity='teaching',
    project="tp9_litl",
    # track hyperparameters and run metadata
    config={
        "model_checkpoint": base_model,
        "dataset": dataset_file,
    }
)
```
%% Cell type:code id: tags:
``` python
trainer = Trainer(
model_init=model_init,
args=training_args,
train_dataset=small_train_dataset,
eval_dataset=small_eval_dataset,
compute_metrics=compute_metrics,
)
```
%% Cell type:code id: tags:
``` python
import os
trainer.train()
```
%% Cell type:markdown id: tags:
# 5- (Code given) Run hyperparameter search
The hyper-parameter search is called on the trainer.
By default, each trial will utilize 1 CPU, and optionally 1 GPU if available.
Note that it can take a long time to run on the full dataset for some of the tasks. You can try to find good hyperparameters on a portion of the training dataset by replacing the *train_dataset* line with a shard of the training set, as sketched just below.

We give the trainer arguments again and re-initialize the trainer below.
Then we can run the hyper-parameter search, with default arguments.
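
%% Cell type:code id: tags:

``` python
# Sketch (assumed cell, not in the original dump): keep 1/10 of the training
# set to speed up the search; small_train_dataset is defined earlier in the TP
shard_train_dataset = small_train_dataset.shard(num_shards=10, index=0)
```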
%% Cell type:code id: tags:
``` python
training_args = TrainingArguments(
    output_dir="test_trainer",          # Name of the directory where the model will be saved
    seed=42,                            # seed for random initialization
    no_cuda=False,                      # whether to use GPU or not
    per_device_train_batch_size=4,      # Train batch size (on each GPU/CPU)
    evaluation_strategy="steps",        # when we want to report evaluation during training
    eval_steps=10,                      # number of steps before reporting evaluation during training
    save_strategy="best",               # strategy to save models
    metric_for_best_model="eval_loss",  # metric to choose the best model
    greater_is_better=False,            # False since the best-model metric is the loss
    learning_rate=5e-5,                 # learning rate value
    num_train_epochs=3,                 # Number of epochs / iterations
    report_to="wandb",                  # reports results to some platforms
    log_level="debug",                  # log level
    logging_strategy="steps",
    logging_steps=10,
)

trainer = Trainer(
    model_init=model_init,
    args=training_args,
    train_dataset=shard_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer
)
```
%% Cell type:code id: tags:
``` python
# start a new wandb run to track this script
wandb.init(
    # set the wandb entity and project where this run will be logged
    entity='teaching',
    project="tp9_litl_ray",
    # track hyperparameters and run metadata
    config={
        "model_checkpoint": base_model,
        "dataset": dataset_file,
    }
)
```
%% Cell type:code id: tags:
``` python
from ray import tune

tune_config = {
    "learning_rate": tune.loguniform(1e-4, 1e-2),
    "num_train_epochs": tune.choice(range(1, 6)),
    "seed": tune.choice(range(1, 41)),
    "per_device_train_batch_size": tune.choice([2, 8]),
}
```
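
%% Cell type:markdown id: tags:

Note that *tune_config* is only used if you uncomment the *hp_space* line in the next cell; otherwise a default search space is used.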
%% Cell type:code id: tags:
``` python
# Default objective is the sum of all metrics
# when metrics are provided, so we have to maximize it.
best_run = trainer.hyperparameter_search(
    #hp_space=lambda _: tune_config,
    direction="maximize",
    backend="ray",
    n_trials=3  # number of trials, here very low
)
```
%% Cell type:markdown id: tags:
The hyperparameter_search method returns a BestRun object, which contains the value of the objective maximized (by default the sum of all metrics) and the hyperparameters used for that run.
%% Cell type:code id: tags:
``` python
best_run
```
%% Cell type:markdown id: tags:
You can customize the objective to maximize by passing a compute_objective function to the hyperparameter_search method, and you can customize the search space by passing an hp_space argument to hyperparameter_search (a sketch follows below).
See this forum post for some examples: https://discuss.huggingface.co/t/using-hyperparameter-search-in-trainer/785/10
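
%% Cell type:markdown id: tags:

For instance, a minimal sketch, assuming *compute_metrics* reports an *eval_accuracy* metric (the metric name and search space below are illustrative):

%% Cell type:code id: tags:

``` python
# Hypothetical objective: maximize accuracy only (metric name assumed)
def my_objective(metrics):
    return metrics["eval_accuracy"]

best_run_acc = trainer.hyperparameter_search(
    hp_space=lambda _: {"learning_rate": tune.loguniform(1e-5, 1e-3)},
    compute_objective=my_objective,
    direction="maximize",
    backend="ray",
    n_trials=2,
)
```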
%% Cell type:markdown id: tags:
To reproduce the best training, just set the hyperparameters in your TrainingArguments before creating a Trainer:
%% Cell type:code id: tags:
``` python
for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)

trainer.train()
```
%% Cell type:markdown id: tags:
You can also easily swap in different parameter tuning algorithms such as HyperBand, Bayesian Optimization, or Population-Based Training.
Read the post: https://huggingface.co/blog/ray-tune
Full example on text classification: https://github.com/huggingface/notebooks/blob/main/examples/text_classification.ipynb
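
%% Cell type:markdown id: tags:

A minimal sketch of swapping in the ASHA (HyperBand-style) scheduler; with the Ray backend, extra keyword arguments are forwarded to Ray Tune (usage follows the blog post linked above):

%% Cell type:code id: tags:

``` python
from ray.tune.schedulers import ASHAScheduler

best_run_asha = trainer.hyperparameter_search(
    direction="maximize",
    backend="ray",
    n_trials=3,
    # passed through to Ray Tune
    scheduler=ASHAScheduler(metric="objective", mode="max"),
)
```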