diff --git a/notebooks/TP9_m2LiTL_transformers_additional_notions_2425_SUJET.ipynb b/notebooks/TP9_m2LiTL_transformers_additional_notions_2425_SUJET.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..9aea4aaa82dd090a6c1775b6769cb83a6a39bd56 --- /dev/null +++ b/notebooks/TP9_m2LiTL_transformers_additional_notions_2425_SUJET.ipynb @@ -0,0 +1,1146 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "-bb49S7B50eh" + }, + "source": [ + "# TP 9: Additional elements on Transformers\n", + "\n", + "In this practical session, we will look at additional elements of the HuggingFace library:\n", + "\n", + "* Importing a dataset\n", + "* Modifying a dataset\n", + "* Tuning hyper-parameters\n", + "* (code given) Reporting to wandb\n", + "\n", + "In this session, we will see how to take a pre-trained model and adapt it to a new task (transfer learning). This TP follows on from TP6.\n", + "\n", + "Reminder: the code below installs:\n", + "- the *transformers* module, which contains the language models https://pypi.org/project/transformers/\n", + "- the *datasets* library, used to access datasets\n", + "- the *evaluate* library, used to evaluate and compare models https://pypi.org/project/evaluate/" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "9UoSnFV250el" + }, + "outputs": [], + "source": [ + "!pip install -U transformers\n", + "!pip install accelerate -U\n", + "!pip install datasets\n", + "!pip install evaluate" + ] + }, + { + "cell_type": "markdown", + "source": [ + "Finally, if the installation is successful, we can import the transformers library:" + ], + "metadata": { + "id": "StClx_Hh9PDm" + } + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ZBQcA9Ol50en" + }, + "outputs": [], + "source": [ + "import transformers\n", + "from datasets import load_dataset, Dataset\n", + "import evaluate\n", + "import numpy 
as np\n", + "import sklearn" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "3TIXCS5P50en" + }, + "outputs": [], + "source": [ + "from transformers import AutoModelForSequenceClassification, AutoTokenizer\n", + "from transformers import TrainingArguments, Trainer" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "vCLf1g8z50ep" + }, + "outputs": [], + "source": [ + "import pandas as pds\n", + "from tqdm import tqdm" + ] + }, + { + "cell_type": "markdown", + "source": [ + "Path to data, dataset for genre classification of movies:" + ], + "metadata": { + "id": "urp4cUXq42Us" + } + }, + { + "cell_type": "code", + "source": [ + "dataset_file = 'train_data.txt'" + ], + "metadata": { + "id": "4vnlP28r46SI" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "# 1- Importing a dataset\n", + "\n", + "We previously saw how to import a dataset from a CSV file; here we import a dataset that is not in CSV format.\n", + "There are different ways of importing this dataset: https://huggingface.co/docs/datasets/create_dataset\n" + ], + "metadata": { + "id": "ACnrPB_kyS6j" + } + }, + { + "cell_type": "markdown", + "source": [ + "## 1-1 Using a dictionary\n", + "\n", + "First solution: build a dictionary that stores the information for each example, see the code below.\n", + "\n", + "▶▶ **Exercise:** Now, build a Dataset object based on this function. You will use the *from_dict* method: https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.from_dict\n", + "\n", + "Finally, print the Dataset keys and the first example." 
+ ], + "metadata": { + "id": "eS5S0b994gyB" + } + }, + { + "cell_type": "code", + "source": [ + "def read_dataset( dataset_file ):\n", + " dataset_dict = {\"id\":[], \"title\":[], \"genre\":[], \"plot\":[] }\n", + " with open( dataset_file, 'r' ) as f:\n", + " mylines = f.readlines()\n", + " for l in mylines:\n", + " l = l.strip()\n", + " data = l.split(' ::: ')\n", + " dataset_dict[\"id\"].append( data[0] )\n", + " dataset_dict[\"title\"].append( data[1] )\n", + " dataset_dict[\"genre\"].append( data[2] )\n", + " dataset_dict[\"plot\"].append( data[3] )\n", + " return dataset_dict" + ], + "metadata": { + "id": "GhZkf2C_4hCi" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "-------------------\n", + "SOLUTION" + ], + "metadata": { + "id": "yBth_OZU75Mi" + } + }, + { + "cell_type": "code", + "source": [], + "metadata": { + "id": "4qQiajHucrFX" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## 1-2 Using a generator\n", + "\n", + "Suppose we import the dataset using the function below, with a function yielding / generating the examples while reading the input file.\n", + "\n", + "▶▶ **Exercise:** Now, build a Dataset object based on this function. You will use the *from_generator* method:\n", + "* The method is described here: https://huggingface.co/docs/datasets/create_dataset#from-python-dictionaries\n", + "* You'll probably need to take a look at the API: https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.from_generator\n", + "\n", + "Finally, print the Dataset keys and the first example." 
+ ], + "metadata": { + "id": "ADItE2v33N3N" + } + }, + { + "cell_type": "code", + "source": [ + "def read_dataset( dataset_file ):\n", + " with open( dataset_file, 'r' ) as f:\n", + " mylines = f.readlines()\n", + " for l in mylines:\n", + " l = l.strip()\n", + " data = l.split(' ::: ')\n", + " yield {'id':data[0], \"title\":data[1], \"genre\":data[2], \"plot\":data[3] }\n" + ], + "metadata": { + "id": "qrdbJmeByTL-" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "----------------\n", + "SOLUTION" + ], + "metadata": { + "id": "PPb3CFsA4H7R" + } + }, + { + "cell_type": "code", + "source": [], + "metadata": { + "id": "yVuz74ndcyU2" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## 1-3 Using a Pandas DataFrame\n", + "\n", + "▶▶ **Exercise:**\n", + "* Read the dataset into a Pandas DataFrame\n", + "* Transform the DataFrame into a Dataset object." + ], + "metadata": { + "id": "U4H5kjyM4UNF" + } + }, + { + "cell_type": "markdown", + "source": [ + "-------------------\n", + "SOLUTION" + ], + "metadata": { + "id": "JeGVSAUO84OS" + } + }, + { + "cell_type": "code", + "source": [], + "metadata": { + "id": "teK3vR0qc9fx" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "# 2- Modifying a dataset\n", + "\n", + "In the original dataset, the labels are given as text.\n", + "For use with HuggingFace, we need to have numeric labels.\n", + "But first, we'll see how we can use the filter function to modify the dataset." + ], + "metadata": { + "id": "fO6xWKaA87e0" + } + }, + { + "cell_type": "markdown", + "source": [ + "## 2-1 Filtering the dataset\n", + "\n", + "Imagine we want to remove a certain category, for example the least represented one.\n", + "\n", + "▶▶ **Exercise:**\n", + "- Count the initial number of examples (i.e. 
number of rows)\n", + "- Print the number of unique labels and the list of labels\n", + "- Find the least represented label\n", + "- Remove all examples of this category using the *filter* function\n", + "- Check the number of unique labels in the filtered dataset\n", + "- Recompute the mapping id to label\n", + "- Count the number of examples after filtering" + ], + "metadata": { + "id": "8UqmkAoK-rxJ" + } + }, + { + "cell_type": "markdown", + "source": [ + "-------------------\n", + "SOLUTION" + ], + "metadata": { + "id": "u2ZExtpaCK_b" + } + }, + { + "cell_type": "code", + "source": [], + "metadata": { + "id": "JOVSVp5Pc_O_" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## 2-2 Mapping of labels\n", + "\n", + "▶▶ **Exercise:**\n", + "* Build a mapping from each label to a numeric value." + ], + "metadata": { + "id": "o4HsEERY9IbG" + } + }, + { + "cell_type": "markdown", + "source": [ + "-----------------\n", + "SOLUTION" + ], + "metadata": { + "id": "GmKcQWts9c-V" + } + }, + { + "cell_type": "code", + "source": [], + "metadata": { + "id": "h76O2MvudBx8" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## 2-3 Adding numeric labels to the dataset\n", + "\n", + "HuggingFace models need a column called 'label' that contains a numeric label.\n", + "We will add this column to the whole dataset.\n", + "You'll need to look at the API: https://huggingface.co/docs/datasets/package_reference/main_classes\n", + "\n", + "▶▶ **Exercise:**\n", + "- Add a column called 'label' to the Dataset ds_filtered\n", + "- with values corresponding to the numeric label\n", + "- Print the keys of the augmented dataset (note that the transformation is not done in place: a new Dataset is returned)." 
+ ], + "metadata": { + "id": "-whji9AN9oyK" + } + }, + { + "cell_type": "code", + "source": [ + "ds_filtered" + ], + "metadata": { + "id": "CFPbtJ7yfOv7" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "----------------\n", + "SOLUTION" + ], + "metadata": { + "id": "DsRi-MAL-ivV" + } + }, + { + "cell_type": "code", + "source": [], + "metadata": { + "id": "wveIbm_6dEFX" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## 2-4 Mapping\n", + "\n", + "Let's say we want to add the title to the plot, for our future classification task.\n", + "\n", + "▶▶ **Exercise:**\n", + "- Use the *map* function to add the title to the plot, using the function below.\n", + "\n", + "See the doc: https://huggingface.co/docs/datasets/process#map" + ], + "metadata": { + "id": "JyVvDi3IDNA8" + } + }, + { + "cell_type": "code", + "source": [ + "def add_plot( example ):\n", + " example['plot'] = example['title'] + \" \" + example['plot']\n", + " return example" + ], + "metadata": { + "id": "OntAlo86EVg4" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "-----------------------\n", + "SOLUTION" + ], + "metadata": { + "id": "EOvniEZuFAra" + } + }, + { + "cell_type": "code", + "source": [], + "metadata": { + "id": "oH7-zEeTdHDE" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## 2-5 Shuffle and split\n", + "\n", + "▶▶ **Exercise:**\n", + "- Shuffle the final dataset\n", + "- split into train, dev, test" + ], + "metadata": { + "id": "mu2Qt4F8FSsu" + } + }, + { + "cell_type": "markdown", + "source": [ + "----------------\n", + "SOLUTION" + ], + "metadata": { + "id": "8yzwEm6zHX3u" + } + }, + { + "cell_type": "code", + "source": [], + "metadata": { + "id": "mN7TI_RRdJH4" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## 2-6 DatasetDict\n", + 
"\n", + "Finally, we put the datasets into a DatasetDict object keyed by split, which is easier to handle:" + ], + "metadata": { + "id": "7i7aoUdhR9qU" + } + }, + { + "cell_type": "code", + "source": [ + "from datasets.dataset_dict import DatasetDict\n", + "\n", + "d = {'train':dataset_train,\n", + " 'val':dataset_dev,\n", + " 'test':dataset_test\n", + " }\n", + "\n", + "dataset_dict = DatasetDict(d)" + ], + "metadata": { + "id": "Fd83tuNgSBri" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "print( len(np.unique(dataset_dict[\"train\"]['genre'])))\n", + "print( len(np.unique(dataset_dict[\"train\"]['label'])))\n", + "print( np.unique(dataset_dict[\"train\"]['label']) )" + ], + "metadata": { + "id": "YXwkow2fhi7B" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "# 3- Simple training\n", + "\n", + "The HuggingFace Trainer supports hyperparameter search based on Optuna or Ray Tune.\n", + "First, let's launch a simple fine-tuning; we'll see below what needs to change to do hyper-parameter search.\n" + ], + "metadata": { + "id": "yXSBE_8wHatD" + } + }, + { + "cell_type": "markdown", + "source": [ + "## 3-1 Tokenization\n", + "\n", + "Our base model will be distilBERT (cased or uncased).\n", + "\n", + "▶▶ **Exercise:** Tokenize the dataset based on this model. Define a tokenize_function then use *map* to apply it to the entire DatasetDict." + ], + "metadata": { + "id": "QQA2AynoHmLp" + } + }, + { + "cell_type": "markdown", + "source": [ + "--------------------\n", + "SOLUTION" + ], + "metadata": { + "id": "bZLNUt1ZT985" + } + }, + { + "cell_type": "code", + "source": [], + "metadata": { + "id": "TDUreFDtdbxx" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## 3-2 Initialize the model\n", + "\n", + "Before training, we need to define:\n", + "* a training config, i.e. 
*TrainingArguments*.\n", + "* an evaluation metric\n", + "\n", + "▶▶ **Exercise:** Take a look at the training arguments below:\n", + "* Add a comment on each line to explain the argument\n", + "* Refer to the API if needed: https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments" + ], + "metadata": { + "id": "R8Nv4Lc_UBAW" + } + }, + { + "cell_type": "code", + "source": [ + "# Evaluate during training and a bit more often\n", + "# than the default to be able to prune bad trials early.\n", + "training_args = TrainingArguments(\n", + " output_dir=\"test_trainer\",\n", + " seed=42,\n", + " no_cuda=False,\n", + " per_device_train_batch_size=4,\n", + " eval_strategy=\"steps\",\n", + " eval_steps=100,\n", + " save_strategy=\"best\",\n", + " metric_for_best_model=\"eval_loss\",\n", + " greater_is_better=False,\n", + " learning_rate=5e-5,\n", + " num_train_epochs=3,\n", + " report_to=\"none\",\n", + " #log_level=\"debug\"\n", + " )" + ], + "metadata": { + "id": "UoSjxQmxMjgN" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "----------------------------\n", + "SOLUTION" + ], + "metadata": { + "id": "vYVhkLBmQDRb" + } + }, + { + "cell_type": "markdown", + "source": [ + "The code below defines the metric used to measure performance, here accuracy.\n", + "We also define a function that tells the trainer how to compute the performance from the model output." 
+ ], + "metadata": { + "id": "F9NbB9-cXuNm" + } + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "XHuT6s2sMy-f" + }, + "outputs": [], + "source": [ + "metric = evaluate.load(\"accuracy\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "qMyHHw7YMy-h" + }, + "outputs": [], + "source": [ + "def compute_metrics(eval_pred):\n", + " metric = evaluate.load(\"accuracy\")\n", + " logits, labels = eval_pred\n", + " predictions = np.argmax(logits, axis=-1)\n", + " return metric.compute(predictions=predictions, references=labels)" + ] + }, + { + "cell_type": "markdown", + "source": [ + "## 3-3 Launch training\n", + "\n", + "The function below is used to retrieve the model.\n", + "In the previous TP, we simply defined the model with something like *model = AutoModel...(...)* and passed it as the value of the *model* argument of the trainer.\n", + "But for hyper-parameter search (below), we need to define a function that initializes the model and is called at each run. See: https://huggingface.co/docs/transformers/main/main_classes/trainer#transformers.Trainer.hyperparameter_search" + ], + "metadata": { + "id": "xrADeM0tU63G" + } + }, + { + "cell_type": "code", + "source": [ + "# Here we need to specify the number of labels\n", + "# Note that model_init doesn't take an argument, if you want to specify the\n", + "# number of labels outside the function, you need to embed the methods within\n", + "# e.g. 
your train method.\n", + "def model_init():\n", + " return AutoModelForSequenceClassification.from_pretrained(\n", + " base_model, num_labels = 26 )" + ], + "metadata": { + "id": "Ow4U0NK6QIgh" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "uX2nBPnk50ew" + }, + "outputs": [], + "source": [ + "trainer = Trainer(\n", + " model_init=model_init,\n", + " args=training_args,\n", + " train_dataset=small_train_dataset,\n", + " eval_dataset=small_eval_dataset,\n", + " compute_metrics=compute_metrics,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "source": [ + "Now we can launch training, we will compare the results with default values to the results of the hyper-parameter search." + ], + "metadata": { + "id": "qMSE6qJApi6n" + } + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "IN58_eaV50ex" + }, + "outputs": [], + "source": [ + "import os\n", + "trainer.train( )" + ] + }, + { + "cell_type": "code", + "source": [ + "trainer.save_model( \"best_model\" )" + ], + "metadata": { + "id": "DebJtbM1vvup" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "# use a small version of the dataset if run on CPU\n", + "logits, gold, metrics = trainer.predict( small_eval_dataset )\n", + "#logits, gold, metrics = trainer.predict( tokenized_datasets[\"val\"] )" + ], + "metadata": { + "id": "bCqGyKpYv-mL" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "predictions = np.argmax(logits, axis=-1)\n", + "all_metrics = metric.compute(predictions=predictions, references=gold)\n", + "print( all_metrics )" + ], + "metadata": { + "id": "Tbvy-TmZ02H1" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + " # 4- (code given) Reporting to wandb" + ], + "metadata": { + "id": "pe_cg8OxF9US" + } + }, + { + "cell_type": "markdown", + "source": [ + 
"Weights & Biases (wandb) is a platform that can be used to save the results of your experiments and make comparisons easier.\n", + "You need an account to use it; here we just look at how it works.\n", + "\n", + "See the differences in the training arguments?\n", + "\n", + "https://wandb.ai/amogkam/transformers/reports/Hyperparameter-Optimization-for-Huggingface-Transformers--VmlldzoyMTc2ODI" + ], + "metadata": { + "id": "fXzVvSw9HnIZ" + } + }, + { + "cell_type": "code", + "source": [ + "!pip install wandb" + ], + "metadata": { + "id": "_9DHvaEx974h" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "import wandb" + ], + "metadata": { + "id": "WgQgkFfA-qQy" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "# Needs to log during training\n", + "training_args = TrainingArguments(\n", + " output_dir=\"test_trainer\", # Name of the directory where model will be saved\n", + " seed=42, # seed for random initialization\n", + " no_cuda=False, # whether to use GPU or not\n", + " per_device_train_batch_size=4, # Train batch size (on each GPU/CPU)\n", + " eval_strategy=\"steps\", # when we want to report evaluation during training\n", + " eval_steps=10, # number of steps before reporting evaluation during training\n", + " save_strategy=\"best\", # strategy to save models\n", + " metric_for_best_model=\"eval_loss\", # metrics to choose the best model\n", + " greater_is_better=False, # for metrics best model: False since eval on loss\n", + " learning_rate=5e-5, # learning rate value\n", + " num_train_epochs=3, # Number of epochs / iterations\n", + " report_to=\"wandb\", # <<<<<< reports results to some platforms\n", + " log_level=\"debug\", # log level\n", + " logging_strategy=\"steps\", # <<<<\n", + " logging_steps=10, # <<<<\n", + " )\n", + "\n" + ], + "metadata": { + "id": "TvRtaEBZHnUn" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "# start a 
new wandb run to track this script\n", + "wandb.init(\n", + " # set the wandb project where this run will be logged\n", + " entity='teaching',\n", + " project=\"tp9_litl\",\n", + " # track hyperparameters and run metadata\n", + " # track hyperparameters and run metadata\n", + "\t\t config={\n", + "\t\t\t \"model_checkpoint\": base_model,\n", + "\t\t\t \"dataset\": dataset_file,\n", + "\t\t }\n", + ")" + ], + "metadata": { + "id": "ghd5iyi49b3U" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Z4Ke6NUb9J1q" + }, + "outputs": [], + "source": [ + "trainer = Trainer(\n", + " model_init=model_init,\n", + " args=training_args,\n", + " train_dataset=small_train_dataset,\n", + " eval_dataset=small_eval_dataset,\n", + " compute_metrics=compute_metrics,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "vucFggUT9J1v" + }, + "outputs": [], + "source": [ + "import os\n", + "trainer.train( )" + ] + }, + { + "cell_type": "markdown", + "source": [ + "# 5- (Code given) Run hyperparameter search\n", + "\n", + "The hyper-parameter search is called on the trainer.\n", + "\n", + "By default, each trial will utilize 1 CPU, and optionally 1 GPU if available.\n", + "\n", + " Note that it can take a long time to run on the full dataset for some of the tasks. You can try to find good hyperparameters on a portion of the training dataset by replacing the train_dataset line with:\n", + " ```\n", + "train_dataset = tokenized_datasets[\"train\"].shard(index=1, num_shards=10)\n", + "```\n", + "\n", + "for 1/10th of the dataset. 
Then you can run a full training on the best hyperparameters picked by the search.\n" + ], + "metadata": { + "id": "nP7oCAG6PMOm" + } + }, + { + "cell_type": "code", + "source": [ + "!pip install ray[tune]" + ], + "metadata": { + "id": "Uo8oR2RMME-y" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "shard_train_dataset = tokenized_datasets[\"train\"].shard(index=1, num_shards=10)" + ], + "metadata": { + "id": "-5a6VUa5qKSr" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "from ray.tune.search.hyperopt import HyperOptSearch\n", + "from ray.tune.schedulers import ASHAScheduler\n", + "from ray import tune\n", + "from ray.tune.schedulers import PopulationBasedTraining\n", + "from ray.tune import CLIReporter" + ], + "metadata": { + "id": "3FdjPsQq6hR7" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## 5-1 Simple example of hyper-parameter search\n", + "\n", + "We define the training arguments again and re-initialize the trainer below.\n", + "Then we can run the hyper-parameter search, with default arguments." 
+ ], + "metadata": { + "id": "k7huV8jj35Yl" + } + }, + { + "cell_type": "code", + "source": [ + "training_args = TrainingArguments(\n", + " output_dir=\"test_trainer\", # Name of the directory where model will be saved\n", + " seed=42, # seed for random initialization\n", + " no_cuda=False, # whether to use GPU or not\n", + " per_device_train_batch_size=4, # Train batch size (on each GPU/CPU)\n", + " eval_strategy=\"steps\", # when we want to report evaluation during training\n", + " eval_steps=10, # number of steps before reporting evaluation during training\n", + " save_strategy=\"best\", # strategy to save models\n", + " metric_for_best_model=\"eval_loss\", # metrics to choose the best model\n", + " greater_is_better=False, # for metrics best model: False since eval on loss\n", + " learning_rate=5e-5, # learning rate value\n", + " num_train_epochs=3, # Number of epochs / iterations\n", + " report_to=\"wandb\", # reports results to some platforms\n", + " log_level=\"debug\", # log level\n", + " logging_strategy=\"steps\", # log at regular step intervals\n", + " logging_steps=10, # number of steps between two logs\n", + " )\n", + "\n", + "\n", + "trainer = Trainer(\n", + " model_init=model_init,\n", + " args=training_args,\n", + " train_dataset=shard_train_dataset,\n", + " eval_dataset=small_eval_dataset,\n", + " compute_metrics=compute_metrics,\n", + " tokenizer=tokenizer\n", + ")" + ], + "metadata": { + "id": "Ejp-jUkup8nr" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "# start a new wandb run to track this script\n", + "wandb.init(\n", + " # set the wandb project where this run will be logged\n", + " entity='teaching',\n", + " project=\"tp9_litl_ray\",\n", + " # track hyperparameters and run metadata\n", + " # track hyperparameters and run metadata\n", + "\t\t config={\n", + "\t\t\t \"model_checkpoint\": base_model,\n", + "\t\t\t \"dataset\": dataset_file,\n", + "\t\t }\n", + ")" + ], + "metadata": { + "id": "Swg3f29xOqER" + }, + "execution_count": null, + "outputs": 
[] + }, + { + "cell_type": "code", + "source": [ + "tune_config = {\n", + " \"learning_rate\": tune.loguniform(1e-4, 1e-2),\n", + " \"num_train_epochs\": tune.choice(range(1, 6)),\n", + " \"seed\": tune.choice(range(1, 41)),\n", + " \"per_device_train_batch_size\": tune.choice([2, 8]),\n", + " }" + ], + "metadata": { + "id": "d2zq914x6Jqs" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "# Default objective is the sum of all metrics\n", + "# when metrics are provided, so we have to maximize it.\n", + "best_run = trainer.hyperparameter_search(\n", + " #hp_space=lambda _: tune_config,\n", + " direction=\"maximize\",\n", + " backend=\"ray\",\n", + " n_trials=3 # number of trials, here very low\n", + ")" + ], + "metadata": { + "id": "WRMCdQXUPMcj" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "\n", + "\n", + "The hyperparameter_search method returns a BestRun object, which contains the value of the maximized objective (by default the sum of all metrics) and the hyperparameters used for that run.\n", + "\n" + ], + "metadata": { + "id": "ZQsI7iMgtnwv" + } + }, + { + "cell_type": "code", + "source": [ + "best_run" + ], + "metadata": { + "id": "nZrTWUeMtqNB" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "You can customize the objective to maximize by passing a compute_objective function to the hyperparameter_search method, and you can customize the search space by passing an hp_space argument to hyperparameter_search.\n", + "See this forum post for some examples: https://discuss.huggingface.co/t/using-hyperparameter-search-in-trainer/785/10" + ], + "metadata": { + "id": "3ShywJOst4QY" + } + }, + { + "cell_type": "markdown", + "source": [ + "To reproduce the best training, just set the hyperparameters in your TrainingArguments before running the Trainer:" + ], + "metadata": { + "id": "CWqlhRy3t9dP" + } + }, + { + 
"cell_type": "code", + "source": [ + "for n, v in best_run.hyperparameters.items():\n", + " setattr(trainer.args, n, v)\n", + "\n", + "trainer.train()" + ], + "metadata": { + "id": "udpzh2YVt_9o" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "You can also easily swap different parameter tuning algorithms such as HyperBand, Bayesian Optimization, Population-Based Training.\n", + "\n", + "Read the post: https://huggingface.co/blog/ray-tune\n", + "\n", + "Full example on text classification: https://github.com/huggingface/notebooks/blob/main/examples/text_classification.ipynb" + ], + "metadata": { + "id": "Xu-q6Rs1Pa58" + } + }, + { + "cell_type": "code", + "source": [], + "metadata": { + "id": "yldJ_Uolc5Ii" + }, + "execution_count": null, + "outputs": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "visual", + "language": "python", + "name": "visual" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.5" + }, + "colab": { + "provenance": [], + "gpuType": "T4", + "toc_visible": true + }, + "accelerator": "GPU" + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file