%% Cell type:markdown id: tags:
# TP 9: Additional elements on Transformers
In this practical session, we will look at additional elements of the HuggingFace library:
* Importing a dataset
* Modifying a dataset
* Tuning hyper-parameters
* (code given) Reporting to wandb
In this session, we will see how to use a pre-trained model and adapt it to a new task (transfer learning). This practical session follows up on TP6.
Reminder: the code below installs:
- the *transformers* module, which contains the language models: https://pypi.org/project/transformers/
- the *datasets* library, to access datasets
- the *evaluate* library, used to evaluate and compare models: https://pypi.org/project/evaluate/
%% Cell type:code id: tags:
``` python
!pip install -U transformers
!pip install accelerate -U
!pip install datasets
!pip install evaluate
```
%% Cell type:markdown id: tags:
Finally, if the installation is successful, we can import the transformers library:
%% Cell type:code id: tags:
``` python
import transformers
from datasets import load_dataset, Dataset
import evaluate
import numpy as np
import sklearn
```
%% Cell type:code id: tags:
``` python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import TrainingArguments, Trainer
```
%% Cell type:code id: tags:
``` python
import pandas as pd
from tqdm import tqdm
```
%% Cell type:markdown id: tags:
Path to data, dataset for genre classification of movies:
%% Cell type:code id: tags:
``` python
dataset_file = 'train_data.txt'
```
%% Cell type:markdown id: tags:
# 1- Importing a dataset
We saw how to import a dataset from a CSV file; here, we import a dataset that is not in CSV format.
There are different ways of importing this dataset: https://huggingface.co/docs/datasets/create_dataset
%% Cell type:markdown id: tags:
## 1-1 Using a dictionary
First solution: build a dictionary storing the information for each example, see the code below.
▶▶ **Exercise:** Now, build a Dataset object based on this function. You will use the *from_dict* method: https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.from_dict
Finally, print the Dataset keys and the first example.
%% Cell type:code id: tags:
``` python
def read_dataset( dataset_file ):
    dataset_dict = {"id":[], "title":[], "genre":[], "plot":[] }
    with open( dataset_file, 'r' ) as f:
        mylines = f.readlines()
    for l in mylines:
        l = l.strip()
        data = l.split(' ::: ')
        dataset_dict["id"].append( data[0] )
        dataset_dict["title"].append( data[1] )
        dataset_dict["genre"].append( data[2] )
        dataset_dict["plot"].append( data[3] )
    return dataset_dict
```
%% Cell type:markdown id: tags:
-------------------
SOLUTION
%% Cell type:code id: tags:
``` python
```
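%% Cell type:markdown id: tags:
One possible solution sketch (the variable name *ds* is ours; adapt it to your own code):
%% Cell type:code id: tags:
``` python
# Build the dictionary, then turn it into a Dataset object
dataset_dict = read_dataset( dataset_file )
ds = Dataset.from_dict( dataset_dict )

# Column names (keys) and first example
print( ds.column_names )
print( ds[0] )
```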
%% Cell type:markdown id: tags:
## 1-2 Using a generator
Suppose we import the dataset using the function below, which yields (generates) the examples one by one while reading the input file.
▶▶ **Exercise:** Now, build a Dataset object based on this function. You will use the *from_generator* method:
* The method is described here: https://huggingface.co/docs/datasets/create_dataset#from-python-dictionaries
* You'll probably need to take a look at the API: https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.from_generator
Finally, print the Dataset keys and the first example.
%% Cell type:code id: tags:
``` python
def read_dataset( dataset_file ):
    with open( dataset_file, 'r' ) as f:
        mylines = f.readlines()
    for l in mylines:
        l = l.strip()
        data = l.split(' ::: ')
        yield {'id':data[0], "title":data[1], "genre":data[2], "plot":data[3] }
```
%% Cell type:markdown id: tags:
----------------
SOLUTION
%% Cell type:code id: tags:
``` python
```
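%% Cell type:markdown id: tags:
One possible solution sketch, using *from_generator* with *gen_kwargs* to pass the file path (variable names are ours):
%% Cell type:code id: tags:
``` python
# The generator is passed as a callable; its arguments go into gen_kwargs
ds = Dataset.from_generator( read_dataset, gen_kwargs={"dataset_file": dataset_file} )

print( ds.column_names )
print( ds[0] )
```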
%% Cell type:markdown id: tags:
## 1-3 Using a Pandas DataFrame
▶▶ **Exercise:**
* Read the dataset and save a Pandas dataframe
* Transform the dataframe into a Dataset object.
%% Cell type:markdown id: tags:
-------------------
SOLUTION
%% Cell type:code id: tags:
``` python
```
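%% Cell type:markdown id: tags:
One possible solution sketch, assuming the same ' ::: '-separated format as above (the DataFrame and Dataset names are ours):
%% Cell type:code id: tags:
``` python
import pandas as pd

# Read the file into a DataFrame; the multi-character separator requires the python engine
df = pd.read_csv( dataset_file, sep=' ::: ', engine='python',
                  names=["id", "title", "genre", "plot"] )

# Convert the DataFrame into a Dataset object
ds = Dataset.from_pandas( df )

print( ds.column_names )
print( ds[0] )
```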
%% Cell type:markdown id: tags:
# 2- Modifying a dataset
In the original dataset, the labels are given as text.
For use with HuggingFace, we need to have numeric labels.
But first, we'll see how we can use the filter function to modify the dataset.
%% Cell type:markdown id: tags:
## 2-1 Filtering the dataset
Imagine we want to remove a certain category, for example the least represented one.
▶▶ **Exercise:**
- Count the initial number of examples (i.e. number of rows)
- Print the number of unique labels and the list of labels
- Find the least represented label
- Remove all examples of this category using the *filter* function
- Check the number of unique labels in the filtered dataset
- Recompute the id-to-label mapping
- Count the number of examples after filtering
%% Cell type:markdown id: tags:
-------------------
SOLUTION
%% Cell type:code id: tags:
``` python
```
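%% Cell type:markdown id: tags:
One possible solution sketch, assuming the Dataset built in part 1 is called *ds* (adapt the name to your own code):
%% Cell type:code id: tags:
``` python
from collections import Counter

# Initial number of examples
print( "Number of examples:", ds.num_rows )

# Unique labels
counts = Counter( ds['genre'] )
print( len(counts), "labels:", sorted(counts) )

# Least represented label
least_label = min( counts, key=counts.get )
print( "Least represented label:", least_label, "(", counts[least_label], "examples )" )

# Remove all its examples with the filter function
ds_filtered = ds.filter( lambda example: example['genre'] != least_label )

# Check the remaining labels and recompute the id-to-label mapping
remaining = sorted( set( ds_filtered['genre'] ) )
print( len(remaining), "labels after filtering" )
id2label = dict( enumerate( remaining ) )

# Number of examples after filtering
print( "Number of examples after filtering:", ds_filtered.num_rows )
```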
%% Cell type:markdown id: tags:
## 2-2 Mapping of labels
▶▶ **Exercise:**
* Build a mapping from each label to a numeric value.
%% Cell type:markdown id: tags:
-----------------
SOLUTION
%% Cell type:code id: tags:
``` python
```
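%% Cell type:markdown id: tags:
One possible solution sketch, reusing *ds_filtered* from the previous step:
%% Cell type:code id: tags:
``` python
# Map each remaining genre to a numeric id (and keep the reverse mapping)
label2id = { label: i for i, label in enumerate( sorted( set( ds_filtered['genre'] ) ) ) }
id2label = { i: label for label, i in label2id.items() }
print( label2id )
```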
%% Cell type:markdown id: tags:
## 2-3 Adding numeric labels to the dataset
HuggingFace models expect a column called 'label' that contains a numeric label.
We will add this column to the whole dataset.
You'll need to look at the API: https://huggingface.co/docs/datasets/package_reference/main_classes
▶▶ **Exercise:**
- Add a column called 'label' to the Dataset ds_filtered
- with values corresponding to the numeric label
- Print the keys of the augmented dataset (note that the transformation is not done in place: a new Dataset object is returned).
%% Cell type:code id: tags:
``` python
ds_filtered
```
%% Cell type:markdown id: tags:
----------------
SOLUTION
%% Cell type:code id: tags:
``` python
```
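%% Cell type:markdown id: tags:
One possible solution sketch, using *add_column* (which returns a new Dataset); it reuses *ds_filtered* and *label2id* from the previous steps:
%% Cell type:code id: tags:
``` python
# Compute the numeric label of each example, then add it as a new 'label' column
labels = [ label2id[g] for g in ds_filtered['genre'] ]
ds_labeled = ds_filtered.add_column( "label", labels )

print( ds_labeled.column_names )
```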
%% Cell type:markdown id: tags:
## 2-4 Mapping
Let's say we want to add the title to the plot, for our future classification task.
▶▶ **Exercise:**
- Use the *map* function to add the title to the plot, using the function below.
See the doc: https://huggingface.co/docs/datasets/process#map
%% Cell type:code id: tags:
``` python
def add_plot( example ):
    example['plot'] = example['title'] + " " + example['plot']
    return example
```
%% Cell type:markdown id: tags:
-----------------------
SOLUTION
%% Cell type:code id: tags:
``` python
```
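%% Cell type:markdown id: tags:
One possible solution sketch (here applied to *ds_labeled*, the dataset with the added 'label' column):
%% Cell type:code id: tags:
``` python
# map applies add_plot to every example and returns a new Dataset
ds_mapped = ds_labeled.map( add_plot )

# Check that the title is now prepended to the plot
print( ds_mapped[0]['plot'][:100] )
```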
%% Cell type:markdown id: tags:
## 2-5 Shuffle and split
▶▶ **Exercise:**
- Shuffle the final dataset
- split into train, dev, test
%% Cell type:markdown id: tags:
----------------
SOLUTION
%% Cell type:code id: tags:
``` python
```
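%% Cell type:markdown id: tags:
One possible solution sketch: shuffle, then split 80/10/10 with *select* (the ratios are arbitrary). The names *dataset_train*, *dataset_dev* and *dataset_test* are reused in the DatasetDict cell below.
%% Cell type:code id: tags:
``` python
ds_shuffled = ds_mapped.shuffle( seed=42 )

n = ds_shuffled.num_rows
dataset_train = ds_shuffled.select( range( 0, int(0.8 * n) ) )
dataset_dev = ds_shuffled.select( range( int(0.8 * n), int(0.9 * n) ) )
dataset_test = ds_shuffled.select( range( int(0.9 * n), n ) )

print( dataset_train.num_rows, dataset_dev.num_rows, dataset_test.num_rows )
```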
%% Cell type:markdown id: tags:
## 2-6 DatasetDict
Finally, we put the datasets into a DatasetDict object, with the splits as keys, which is easier to handle:
%% Cell type:code id: tags:
``` python
from datasets.dataset_dict import DatasetDict
d = {'train':dataset_train,
'val':dataset_dev,
'test':dataset_test
}
dataset_dict = DatasetDict(d)
```
%% Cell type:code id: tags:
``` python
print( len(np.unique(dataset_dict["train"]['genre'])))
print( len(np.unique(dataset_dict["train"]['label'])))
print( np.unique(dataset_dict["train"]['label']) )
```
%% Cell type:markdown id: tags:
# 3- Simple training
The HuggingFace Trainer supports hyperparameter search based on Optuna or Ray Tune.
First, let's launch a simple fine-tuning run; we'll see below what we need to modify to run a hyper-parameter search.
%% Cell type:markdown id: tags:
## 3-1 Tokenization
Our base model will be distilBERT (cased or uncased).
▶▶ **Exercise:** Tokenize the dataset based on this model. Define a tokenize_function, then use *map* to apply it to the entire DatasetDict.
%% Cell type:markdown id: tags:
--------------------
SOLUTION
%% Cell type:code id: tags:
``` python
```
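%% Cell type:markdown id: tags:
One possible solution sketch; the exact checkpoint name below is an assumption (the cased variant works the same way). The names *base_model*, *tokenizer* and *tokenized_datasets* are reused in the following cells.
%% Cell type:code id: tags:
``` python
base_model = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained( base_model )

def tokenize_function( examples ):
    # Truncate / pad the plots to the model's maximum input length
    return tokenizer( examples["plot"], truncation=True, padding="max_length" )

# batched=True tokenizes the examples by batches, which is faster
tokenized_datasets = dataset_dict.map( tokenize_function, batched=True )
print( tokenized_datasets )
```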
%% Cell type:markdown id: tags:
## 3-2 Initialize the model
Before training, we need to define:
* a training config, i.e. *TrainingArguments*
* an evaluation metric
▶▶ **Exercise:** Take a look at the training arguments below:
* Add a comment on each line to explain the argument
* Refer to the API if needed: https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments
%% Cell type:code id: tags:
``` python
# Evaluate during training and a bit more often
# than the default to be able to prune bad trials early.
training_args = TrainingArguments(
output_dir="test_trainer",
seed=42,
no_cuda=False,
per_device_train_batch_size=4,
evaluation_strategy="steps",
eval_steps=100,
save_strategy="best",
metric_for_best_model="eval_loss",
greater_is_better=False,
learning_rate=5e-5,
num_train_epochs=3,
report_to="none",
#log_level="debug"
)
```
%% Cell type:markdown id: tags:
----------------------------
SOLUTION
%% Cell type:markdown id: tags:
The code below defines the metrics used to compute performance, here accuracy.
We also define a function that tells the Trainer how to compute performance from the model's output.
%% Cell type:code id: tags:
``` python
metric = evaluate.load("accuracy")
```
%% Cell type:code id: tags:
``` python
def compute_metrics(eval_pred):
    metric = evaluate.load("accuracy")
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
```
%% Cell type:markdown id: tags:
## 3-3 Launch training
The function below is used to retrieve the model.
In the previous TP, we simply defined the model with something like *model = AutoModel...(...)* and used it as the value of the *model* argument of the trainer.
But for hyper-parameter search (below), we need to define a function initializing the model, that will be called at each run. See: https://huggingface.co/docs/transformers/main/main_classes/trainer#transformers.Trainer.hyperparameter_search
%% Cell type:code id: tags:
``` python
# Here we need to specify the number of labels.
# Note that model_init doesn't take any argument: if you want to set the
# number of labels from outside, you need to define this function inside
# e.g. your own train function (as a closure).
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(
        base_model, num_labels = 26 )
```
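%% Cell type:markdown id: tags:
The trainer below uses reduced train and eval sets (*small_train_dataset*, *small_eval_dataset*) so that it can run on CPU. One possible way to build them (the sizes chosen here are arbitrary):
%% Cell type:code id: tags:
``` python
# Small random subsets of the tokenized splits, to keep training fast on CPU
small_train_dataset = tokenized_datasets["train"].shuffle( seed=42 ).select( range(1000) )
small_eval_dataset = tokenized_datasets["val"].shuffle( seed=42 ).select( range(200) )
```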
%% Cell type:code id: tags:
``` python
trainer = Trainer(
model_init=model_init,
args=training_args,
train_dataset=small_train_dataset,
eval_dataset=small_eval_dataset,
compute_metrics=compute_metrics,
)
```
%% Cell type:markdown id: tags:
Now we can launch training; we will compare the results obtained with default values to the results of the hyper-parameter search.
%% Cell type:code id: tags:
``` python
import os
trainer.train( )
```
%% Cell type:code id: tags:
``` python
trainer.save_model( "best_model" )
```
%% Cell type:code id: tags:
``` python
# use a small version of the dataset if run on CPU
logits, gold, metrics = trainer.predict( small_eval_dataset )
#logits, gold, metrics = trainer.predict( tokenized_datasets["val"] )
```
%% Cell type:code id: tags:
``` python
predictions = np.argmax(logits, axis=-1)
all_metrics = metric.compute(predictions=predictions, references=gold)
print( all_metrics )
```
%% Cell type:markdown id: tags:
# 4- (code given) Reporting to wandb
%% Cell type:markdown id: tags:
Weights & Biases (wandb) is a platform that can be used to save the results of your experiments and make comparisons easier.
You need an account to use it; here we'll just see how it works.
Can you spot the differences in the training arguments below?
https://wandb.ai/amogkam/transformers/reports/Hyperparameter-Optimization-for-Huggingface-Transformers--VmlldzoyMTc2ODI
%% Cell type:code id: tags:
``` python
!pip install wandb
```
%% Cell type:code id: tags:
``` python
import wandb
```
%% Cell type:code id: tags:
``` python
# Needs to log during training
training_args = TrainingArguments(
output_dir="test_trainer", # Name of the directory where model will be saved
seed=42, # seed for random initialization
no_cuda=False, # whether to use GPU or not
per_device_train_batch_size=4, # Train batch size (on each GPU/CPU)
evaluation_strategy="steps", # when we want to report evaluation during training
eval_steps=10, # number of steps before reporting evaluation during training
save_strategy="best", # strategy to save models
metric_for_best_model="eval_loss", # metrics to choose the best model
greater_is_better=False, # for metrics best model: False since eval on loss
learning_rate=5e-5, # learning rate value
num_train_epochs=3, # Number of epochs / iterations
report_to="wandb", # <<<<<< reports results to some platforms
log_level="debug", # log level
logging_strategy="steps", # <<<<
logging_steps=10, # <<<<
)
```
%% Cell type:code id: tags:
``` python
# start a new wandb run to track this script
wandb.init(
    # set the wandb project where this run will be logged
    entity='teaching',
    project="tp9_litl",
    # track hyperparameters and run metadata
    config={
        "model_checkpoint": base_model,
        "dataset": dataset_file,
    }
)
```
%% Cell type:code id: tags:
``` python
trainer = Trainer(
model_init=model_init,
args=training_args,
train_dataset=small_train_dataset,
eval_dataset=small_eval_dataset,
compute_metrics=compute_metrics,
)
```
%% Cell type:code id: tags:
``` python
import os
trainer.train( )
```
%% Cell type:markdown id: tags:
# 5- (Code given) Run hyperparameter search
The hyper-parameter search is called on the trainer.
By default, each trial will utilize 1 CPU, and optionally 1 GPU if available.
Note that it can take a long time to run on the full dataset for some of the tasks. You can try to find good hyperparameters on a portion of the training dataset by replacing the train_dataset line by:
```
train_dataset = encoded_dataset["train"].shard(index=1, num_shards=10)
```
for 1/10th of the dataset. Then you can run a full training on the best hyperparameters picked by the search.
%% Cell type:code id: tags:
``` python
!pip install ray[tune]
```
%% Cell type:code id: tags:
``` python
shard_train_dataset = tokenized_datasets["train"].shard(index=1, num_shards=10)
```
%% Cell type:code id: tags:
``` python
from ray.tune.search.hyperopt import HyperOptSearch
from ray.tune.schedulers import ASHAScheduler
from ray import tune
from ray.tune.schedulers import PopulationBasedTraining
from ray.tune import CLIReporter
```
%% Cell type:markdown id: tags:
## 5-1 Simple example of hyper-parameter search
We give the training arguments again and re-initialize the trainer below.
Then we can run the hyper-parameter search, with default arguments.
%% Cell type:code id: tags:
``` python
training_args = TrainingArguments(
output_dir="test_trainer", # Name of the directory where model will be saved
seed=42, # seed for random initialization
no_cuda=False, # whether to use GPU or not
per_device_train_batch_size=4, # Train batch size (on each GPU/CPU)
evaluation_strategy="steps", # when we want to report evaluation during training
eval_steps=10, # number of steps before reporting evaluation during training
save_strategy="best", # strategy to save models
metric_for_best_model="eval_loss", # metrics to choose the best model
greater_is_better=False, # for metrics best model: False since eval on loss
learning_rate=5e-5, # learning rate value
num_train_epochs=3, # Number of epochs / iterations
report_to="wandb", # reports results to some platforms
log_level="debug", # log level
logging_strategy="steps", #
logging_steps=10, #
)
trainer = Trainer(
model_init=model_init,
args=training_args,
train_dataset=shard_train_dataset,
eval_dataset=small_eval_dataset,
compute_metrics=compute_metrics,
tokenizer=tokenizer
)
```
%% Cell type:code id: tags:
``` python
# start a new wandb run to track this script
wandb.init(
    # set the wandb project where this run will be logged
    entity='teaching',
    project="tp9_litl_ray",
    # track hyperparameters and run metadata
    config={
        "model_checkpoint": base_model,
        "dataset": dataset_file,
    }
)
```
%% Cell type:code id: tags:
``` python
tune_config = {
"learning_rate": tune.loguniform(1e-4, 1e-2),
"num_train_epochs": tune.choice(range(1, 6)),
"seed": tune.choice(range(1, 41)),
"per_device_train_batch_size": tune.choice([2, 8]),
}
```
%% Cell type:code id: tags:
``` python
# Default objective is the sum of all metrics
# when metrics are provided, so we have to maximize it.
best_run = trainer.hyperparameter_search(
#hp_space=lambda _: tune_config,
direction="maximize",
backend="ray",
n_trials=3 # number of trials, here very low
)
```
%% Cell type:markdown id: tags:
The hyperparameter_search method returns a BestRun object, which contains the value of the objective that was maximized (by default the sum of all metrics) and the hyperparameters used for that run.
%% Cell type:code id: tags:
``` python
best_run
```
%% Cell type:markdown id: tags:
You can customize the objective to maximize by passing along a compute_objective function to the hyperparameter_search method, and you can customize the search space by passing a hp_space argument to hyperparameter_search.
See this forum post for some examples: https://discuss.huggingface.co/t/using-hyperparameter-search-in-trainer/785/10
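%% Cell type:markdown id: tags:
For instance, a sketch reusing the *tune_config* search space defined above and maximizing the evaluation accuracy instead of the default objective (the number of trials is kept very low here):
%% Cell type:code id: tags:
``` python
# Objective computed from the evaluation metrics returned by compute_metrics
def compute_objective( metrics ):
    return metrics["eval_accuracy"]

best_run = trainer.hyperparameter_search(
    hp_space=lambda _: tune_config,      # our own search space
    compute_objective=compute_objective, # our own objective
    direction="maximize",
    backend="ray",
    n_trials=3,
)
```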
%% Cell type:markdown id: tags:
To reproduce the best training, just set the hyperparameters in your TrainingArgument before creating a Trainer:
%% Cell type:code id: tags:
``` python
for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)

trainer.train()
```
%% Cell type:markdown id: tags:
You can also easily swap in different parameter-tuning algorithms such as HyperBand, Bayesian Optimization, or Population-Based Training.
Read the post: https://huggingface.co/blog/ray-tune
Full example on text classification: https://github.com/huggingface/notebooks/blob/main/examples/text_classification.ipynb
%% Cell type:code id: tags:
``` python
```