From ce4518f5a534ac08df58464e0bcd07dec1840c42 Mon Sep 17 00:00:00 2001
From: chloebt <chloe.braud@gmail.com>
Date: Wed, 15 Jan 2025 11:17:48 +0100
Subject: [PATCH] add TP9

---
 ...ormers_additional_notions_2425_SUJET.ipynb | 1146 +++++++++++++++++
 1 file changed, 1146 insertions(+)
 create mode 100644 notebooks/TP9_m2LiTL_transformers_additional_notions_2425_SUJET.ipynb

diff --git a/notebooks/TP9_m2LiTL_transformers_additional_notions_2425_SUJET.ipynb b/notebooks/TP9_m2LiTL_transformers_additional_notions_2425_SUJET.ipynb
new file mode 100644
index 0000000..9aea4aa
--- /dev/null
+++ b/notebooks/TP9_m2LiTL_transformers_additional_notions_2425_SUJET.ipynb
@@ -0,0 +1,1146 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "-bb49S7B50eh"
+      },
+      "source": [
+        "# TP 9: Additional elements on Transformers\n",
+        "\n",
+        "In this practical session, we will look at additional elements of the HuggingFace library:\n",
+        "\n",
+        "* Importing a dataset\n",
+        "* Modifying a dataset\n",
+        "* Tuning hyper-parameters\n",
+        "* (code given) Reporting to wandb\n",
+        "\n",
+        "In this session, we will see how to use a pre-trained model and adapt it to a new task (transfer learning). This TP follows on from TP6.\n",
+        "\n",
+        "Reminder: the code below installs:    \n",
+        "- the *transformers* module, which contains the language models https://pypi.org/project/transformers/\n",
+        "- the *datasets* library, used to access datasets\n",
+        "- the *evaluate* library, used to evaluate and compare models https://pypi.org/project/evaluate/"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "9UoSnFV250el"
+      },
+      "outputs": [],
+      "source": [
+        "!pip install -U transformers\n",
+        "!pip install accelerate -U\n",
+        "!pip install datasets\n",
+        "!pip install evaluate"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Finally, if the installation is successful, we can import the transformers library:"
+      ],
+      "metadata": {
+        "id": "StClx_Hh9PDm"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "ZBQcA9Ol50en"
+      },
+      "outputs": [],
+      "source": [
+        "import transformers\n",
+        "from datasets import load_dataset, Dataset\n",
+        "import evaluate\n",
+        "import numpy as np\n",
+        "import sklearn"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "3TIXCS5P50en"
+      },
+      "outputs": [],
+      "source": [
+        "from transformers import AutoModelForSequenceClassification, AutoTokenizer\n",
+        "from transformers import TrainingArguments, Trainer"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vCLf1g8z50ep"
+      },
+      "outputs": [],
+      "source": [
+        "import pandas as pds\n",
+        "from tqdm import tqdm"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Path to the data, a dataset for movie genre classification:"
+      ],
+      "metadata": {
+        "id": "urp4cUXq42Us"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "dataset_file = 'train_data.txt'"
+      ],
+      "metadata": {
+        "id": "4vnlP28r46SI"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# 1- Importing a dataset\n",
+        "\n",
+        "We saw how to import a dataset from a CSV file. Here, suppose the dataset is not in CSV format.\n",
+        "There are different ways of importing this dataset: https://huggingface.co/docs/datasets/create_dataset\n"
+      ],
+      "metadata": {
+        "id": "ACnrPB_kyS6j"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## 1-1 Using a dictionary\n",
+        "\n",
+        "First solution: use a dictionary that stores the information for each example, see the code below.\n",
+        "\n",
+        "▶▶ **Exercise:** Now, build a Dataset object based on this function. You will use the *from_dict* method: https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.from_dict\n",
+        "\n",
+        "Finally, print the Dataset keys and the first example."
+      ],
+      "metadata": {
+        "id": "eS5S0b994gyB"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "def read_dataset( dataset_file ):\n",
+        "  dataset_dict = {\"id\":[], \"title\":[], \"genre\":[], \"plot\":[] }\n",
+        "  with open( dataset_file, 'r' ) as f:\n",
+        "    mylines = f.readlines()\n",
+        "    for l in mylines:\n",
+        "      l = l.strip()\n",
+        "      data = l.split(' ::: ')\n",
+        "      dataset_dict[\"id\"].append( data[0] )\n",
+        "      dataset_dict[\"title\"].append( data[1] )\n",
+        "      dataset_dict[\"genre\"].append( data[2] )\n",
+        "      dataset_dict[\"plot\"].append( data[3] )\n",
+        "  return dataset_dict"
+      ],
+      "metadata": {
+        "id": "GhZkf2C_4hCi"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "-------------------\n",
+        "SOLUTION"
+      ],
+      "metadata": {
+        "id": "yBth_OZU75Mi"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [],
+      "metadata": {
+        "id": "4qQiajHucrFX"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## 1-2 Using a generator\n",
+        "\n",
+        "Suppose we import the dataset using the generator function below, which yields the examples one by one while reading the input file.\n",
+        "\n",
+        "▶▶ **Exercise:** Now, build a Dataset object based on this function. You will use the *from_generator* method:\n",
+        "* The method is described here: https://huggingface.co/docs/datasets/create_dataset#from-python-dictionaries\n",
+        "* You'll probably need to take a look at the API: https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.from_generator\n",
+        "\n",
+        "Finally, print the Dataset keys and the first example."
+      ],
+      "metadata": {
+        "id": "ADItE2v33N3N"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "def read_dataset( dataset_file ):\n",
+        "  with open( dataset_file, 'r' ) as f:\n",
+        "    mylines = f.readlines()\n",
+        "    for l in mylines:\n",
+        "      l = l.strip()\n",
+        "      data = l.split(' ::: ')\n",
+        "      yield {'id':data[0], \"title\":data[1], \"genre\":data[2], \"plot\":data[3] }\n"
+      ],
+      "metadata": {
+        "id": "qrdbJmeByTL-"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "----------------\n",
+        "SOLUTION"
+      ],
+      "metadata": {
+        "id": "PPb3CFsA4H7R"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [],
+      "metadata": {
+        "id": "yVuz74ndcyU2"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## 1-3 Using a Pandas DataFrame\n",
+        "\n",
+        "▶▶  **Exercise:**\n",
+        "* Read the dataset into a Pandas DataFrame\n",
+        "* Transform the dataframe into a Dataset object."
+      ],
+      "metadata": {
+        "id": "U4H5kjyM4UNF"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "-------------------\n",
+        "SOLUTION"
+      ],
+      "metadata": {
+        "id": "JeGVSAUO84OS"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [],
+      "metadata": {
+        "id": "teK3vR0qc9fx"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# 2- Modifying a dataset\n",
+        "\n",
+        "In the original dataset, the labels are given as text.\n",
+        "For use with HuggingFace, we need to have numeric labels.\n",
+        "But first, we'll see how we can use the filter function to modify the dataset."
+      ],
+      "metadata": {
+        "id": "fO6xWKaA87e0"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## 2-1 Filtering the dataset\n",
+        "\n",
+        "Imagine we want to remove a certain category, for example the least represented one.\n",
+        "\n",
+        "▶▶ **Exercise:**\n",
+        "- Count the initial number of examples (i.e. number of rows)\n",
+        "- Print the number of unique labels and the list of labels\n",
+        "- Find the least represented label\n",
+        "- Remove all examples of this category using the *filter* function\n",
+        "- Check the number of unique labels in the filtered dataset\n",
+        "- Recompute the mapping from id to label\n",
+        "- Count the number of examples after filtering"
+      ],
+      "metadata": {
+        "id": "8UqmkAoK-rxJ"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "-------------------\n",
+        "SOLUTION"
+      ],
+      "metadata": {
+        "id": "u2ZExtpaCK_b"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [],
+      "metadata": {
+        "id": "JOVSVp5Pc_O_"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## 2-2 Mapping of labels\n",
+        "\n",
+        "▶▶ **Exercise:**\n",
+        "* Build a mapping from each label to a numeric value."
+      ],
+      "metadata": {
+        "id": "o4HsEERY9IbG"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "-----------------\n",
+        "SOLUTION"
+      ],
+      "metadata": {
+        "id": "GmKcQWts9c-V"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [],
+      "metadata": {
+        "id": "h76O2MvudBx8"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## 2-3 Adding numeric labels to the dataset\n",
+        "\n",
+        "HuggingFace models need a column called 'label', that contains a numeric label.\n",
+        "We will add this column to the whole dataset.\n",
+        "You'll need to look at the API: https://huggingface.co/docs/datasets/package_reference/main_classes\n",
+        "\n",
+        "▶▶ **Exercise:**\n",
+        "- Add a column called 'label' to the Dataset ds_filtered\n",
+        "- with values corresponding to the numeric label\n",
+        "- Print the keys of the augmented dataset (note that the transformation is not done in place)."
+      ],
+      "metadata": {
+        "id": "-whji9AN9oyK"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "ds_filtered"
+      ],
+      "metadata": {
+        "id": "CFPbtJ7yfOv7"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "----------------\n",
+        "SOLUTION"
+      ],
+      "metadata": {
+        "id": "DsRi-MAL-ivV"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [],
+      "metadata": {
+        "id": "wveIbm_6dEFX"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## 2-4 Mapping\n",
+        "\n",
+        "Let's say we want to add the title to the plot, for our future classification task.\n",
+        "\n",
+        "▶▶ **Exercise:**\n",
+        "- Use the *map* function to add the title to the plot, using the function below.\n",
+        "\n",
+        "See the doc: https://huggingface.co/docs/datasets/process#map"
+      ],
+      "metadata": {
+        "id": "JyVvDi3IDNA8"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "def add_plot( example ):\n",
+        "  example['plot'] = example['title'] + \" \" + example['plot']\n",
+        "  return example"
+      ],
+      "metadata": {
+        "id": "OntAlo86EVg4"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "-----------------------\n",
+        "SOLUTION"
+      ],
+      "metadata": {
+        "id": "EOvniEZuFAra"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [],
+      "metadata": {
+        "id": "oH7-zEeTdHDE"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## 2-5 Shuffle and split\n",
+        "\n",
+        "▶▶ **Exercise:**\n",
+        "- Shuffle the final dataset\n",
+        "- Split it into train, dev, and test sets"
+      ],
+      "metadata": {
+        "id": "mu2Qt4F8FSsu"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "----------------\n",
+        "SOLUTION"
+      ],
+      "metadata": {
+        "id": "8yzwEm6zHX3u"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [],
+      "metadata": {
+        "id": "mN7TI_RRdJH4"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## 2-6 DatasetDict\n",
+        "\n",
+        "Finally, we put the datasets into a DatasetDict object keyed by split name, which is easier to handle:"
+      ],
+      "metadata": {
+        "id": "7i7aoUdhR9qU"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "from datasets.dataset_dict import DatasetDict\n",
+        "\n",
+        "d = {'train':dataset_train,\n",
+        "     'val':dataset_dev,\n",
+        "     'test':dataset_test\n",
+        "     }\n",
+        "\n",
+        "dataset_dict = DatasetDict(d)"
+      ],
+      "metadata": {
+        "id": "Fd83tuNgSBri"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "print( len(np.unique(dataset_dict[\"train\"]['genre'])))\n",
+        "print( len(np.unique(dataset_dict[\"train\"]['label'])))\n",
+        "print( np.unique(dataset_dict[\"train\"]['label']) )"
+      ],
+      "metadata": {
+        "id": "YXwkow2fhi7B"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# 3- Simple training\n",
+        "\n",
+        "The HuggingFace Trainer supports hyperparameter search based on Optuna or Ray Tune.\n",
+        "First, let's launch a simple fine-tuning. We'll see below what we need to modify to do hyper-parameter search.\n"
+      ],
+      "metadata": {
+        "id": "yXSBE_8wHatD"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## 3-1 Tokenization\n",
+        "\n",
+        "Our base model will be DistilBERT (cased or uncased).\n",
+        "\n",
+        "▶▶ **Exercise:** Tokenize the dataset based on this model. Define a tokenize_function then use *map* to apply it to the entire DatasetDict."
+      ],
+      "metadata": {
+        "id": "QQA2AynoHmLp"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "--------------------\n",
+        "SOLUTION"
+      ],
+      "metadata": {
+        "id": "bZLNUt1ZT985"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [],
+      "metadata": {
+        "id": "TDUreFDtdbxx"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## 3-2 Initialize the model\n",
+        "\n",
+        "Before training, we need to define:\n",
+        "* a training config, i.e. *TrainingArguments*\n",
+        "* an evaluation metric\n",
+        "\n",
+        "▶▶ **Exercise:** Take a look at the training arguments below:\n",
+        "* Add a comment on each line to explain the argument\n",
+        "* Refer to the API if needed: https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments"
+      ],
+      "metadata": {
+        "id": "R8Nv4Lc_UBAW"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Evaluate during training and a bit more often\n",
+        "# than the default to be able to prune bad trials early.\n",
+        "training_args = TrainingArguments(\n",
+        "                                  output_dir=\"test_trainer\",\n",
+        "                                  seed=42,\n",
+        "                                  no_cuda=False,\n",
+        "                                  per_device_train_batch_size=4,\n",
+        "                                  eval_strategy=\"steps\",\n",
+        "                                  eval_steps=100,\n",
+        "                                  save_strategy=\"best\",\n",
+        "                                  metric_for_best_model=\"eval_loss\",\n",
+        "                                  greater_is_better=False,\n",
+        "                                  learning_rate=5e-5,\n",
+        "                                  num_train_epochs=3,\n",
+        "                                  report_to=\"none\",\n",
+        "                                  #log_level=\"debug\"\n",
+        "                                  )"
+      ],
+      "metadata": {
+        "id": "UoSjxQmxMjgN"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "----------------------------\n",
+        "SOLUTION"
+      ],
+      "metadata": {
+        "id": "vYVhkLBmQDRb"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "The code below defines the metric used to measure performance, here accuracy.\n",
+        "We also define a function that tells the model how to compute the performance based on its output."
+      ],
+      "metadata": {
+        "id": "F9NbB9-cXuNm"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "XHuT6s2sMy-f"
+      },
+      "outputs": [],
+      "source": [
+        "metric = evaluate.load(\"accuracy\")"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "qMyHHw7YMy-h"
+      },
+      "outputs": [],
+      "source": [
+        "def compute_metrics(eval_pred):\n",
+        "    metric = evaluate.load(\"accuracy\")\n",
+        "    logits, labels = eval_pred\n",
+        "    predictions = np.argmax(logits, axis=-1)\n",
+        "    return metric.compute(predictions=predictions, references=labels)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## 3-3 Launch training\n",
+        "\n",
+        "The function below is used to retrieve the model.\n",
+        "In the previous TP, we simply defined the model with something like *model = AutoModel...(...)* and used it as the value of the *model* argument of the trainer.\n",
+        "But for hyper-parameter search (below), we need to define a function initializing the model, that will be called at each run. See: https://huggingface.co/docs/transformers/main/main_classes/trainer#transformers.Trainer.hyperparameter_search"
+      ],
+      "metadata": {
+        "id": "xrADeM0tU63G"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Here we need to specify the number of labels\n",
+        "# Note that model_init doesn't take an argument, if you want to specify the\n",
+        "# number of labels outside the function, you need to embed the methods within\n",
+        "# e.g. your train method.\n",
+        "def model_init():\n",
+        "    return AutoModelForSequenceClassification.from_pretrained(\n",
+        "        base_model, num_labels = 26 )"
+      ],
+      "metadata": {
+        "id": "Ow4U0NK6QIgh"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "uX2nBPnk50ew"
+      },
+      "outputs": [],
+      "source": [
+        "trainer = Trainer(\n",
+        "    model_init=model_init,\n",
+        "    args=training_args,\n",
+        "    train_dataset=small_train_dataset,\n",
+        "    eval_dataset=small_eval_dataset,\n",
+        "    compute_metrics=compute_metrics,\n",
+        ")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Now we can launch training. We will compare the results obtained with default values to those of the hyper-parameter search."
+      ],
+      "metadata": {
+        "id": "qMSE6qJApi6n"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "IN58_eaV50ex"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "trainer.train(  )"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "trainer.save_model( \"best_model\"  )"
+      ],
+      "metadata": {
+        "id": "DebJtbM1vvup"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# use a small version of the dataset if run on CPU\n",
+        "logits, gold, metrics = trainer.predict( small_eval_dataset )\n",
+        "#logits, gold, metrics = trainer.predict( tokenized_datasets[\"val\"] )"
+      ],
+      "metadata": {
+        "id": "bCqGyKpYv-mL"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "predictions = np.argmax(logits, axis=-1)\n",
+        "all_metrics = metric.compute(predictions=predictions, references=gold)\n",
+        "print( all_metrics )"
+      ],
+      "metadata": {
+        "id": "Tbvy-TmZ02H1"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# 4- (code given) Reporting to wandb"
+      ],
+      "metadata": {
+        "id": "pe_cg8OxF9US"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Weights & Biases (wandb) is a platform that can be used to save the results of your experiments and make comparisons easier.\n",
+        "You need an account to use it. Here, we just look at how it works.\n",
+        "\n",
+        "See the differences in the training arguments?\n",
+        "\n",
+        "https://wandb.ai/amogkam/transformers/reports/Hyperparameter-Optimization-for-Huggingface-Transformers--VmlldzoyMTc2ODI"
+      ],
+      "metadata": {
+        "id": "fXzVvSw9HnIZ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "!pip install wandb"
+      ],
+      "metadata": {
+        "id": "_9DHvaEx974h"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import wandb"
+      ],
+      "metadata": {
+        "id": "WgQgkFfA-qQy"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Needs to log during training\n",
+        "training_args = TrainingArguments(\n",
+        "                                  output_dir=\"test_trainer\", # Name of the directory where model will be saved\n",
+        "                                  seed=42, # seed for random initialization\n",
+        "                                  no_cuda=False, # whether to use GPU or not\n",
+        "                                  per_device_train_batch_size=4, # Train batch size (on each GPU/CPU)\n",
+        "                                  eval_strategy=\"steps\", # when we want to report evaluation during training\n",
+        "                                  eval_steps=10, # number of steps before reporting evaluation during training\n",
+        "                                  save_strategy=\"best\", # strategy to save models\n",
+        "                                  metric_for_best_model=\"eval_loss\", # metrics to choose the best model\n",
+        "                                  greater_is_better=False, # for metrics best model: False since eval on loss\n",
+        "                                  learning_rate=5e-5, # learning rate value\n",
+        "                                  num_train_epochs=3, # Number of epochs / iterations\n",
+        "                                  report_to=\"wandb\", # <<<<<< reports results to some platforms\n",
+        "                                  log_level=\"debug\", # log level\n",
+        "                                  logging_strategy=\"steps\", # <<<<\n",
+        "                                  logging_steps=10, # <<<<\n",
+        "                                  )\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "TvRtaEBZHnUn"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# start a new wandb run to track this script\n",
+        "wandb.init(\n",
+        "      # set the wandb project where this run will be logged\n",
+        "      entity='teaching',\n",
+        "      project=\"tp9_litl\",\n",
+        "      # track hyperparameters and run metadata\n",
+        "      # (here: the base checkpoint and the dataset file)\n",
+        "      config={\n",
+        "          \"model_checkpoint\": base_model,\n",
+        "          \"dataset\": dataset_file,\n",
+        "      }\n",
+        ")"
+      ],
+      "metadata": {
+        "id": "ghd5iyi49b3U"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "Z4Ke6NUb9J1q"
+      },
+      "outputs": [],
+      "source": [
+        "trainer = Trainer(\n",
+        "    model_init=model_init,\n",
+        "    args=training_args,\n",
+        "    train_dataset=small_train_dataset,\n",
+        "    eval_dataset=small_eval_dataset,\n",
+        "    compute_metrics=compute_metrics,\n",
+        ")"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vucFggUT9J1v"
+      },
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "trainer.train(  )"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# 5- (Code given) Run hyperparameter search\n",
+        "\n",
+        "The hyper-parameter search is called on the trainer.\n",
+        "\n",
+        "By default, each trial will utilize 1 CPU, and optionally 1 GPU if available.\n",
+        "\n",
+        "Note that it can take a long time to run on the full dataset. You can try to find good hyperparameters on a portion of the training dataset by replacing the train_dataset line with:\n",
+        "```\n",
+        "train_dataset = tokenized_datasets[\"train\"].shard(index=1, num_shards=10)\n",
+        "```\n",
+        "\n",
+        "for 1/10th of the dataset. Then you can run a full training with the best hyperparameters picked by the search.\n"
+      ],
+      "metadata": {
+        "id": "nP7oCAG6PMOm"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "!pip install \"ray[tune]\""
+      ],
+      "metadata": {
+        "id": "Uo8oR2RMME-y"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "shard_train_dataset = tokenized_datasets[\"train\"].shard(index=1, num_shards=10)"
+      ],
+      "metadata": {
+        "id": "-5a6VUa5qKSr"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "from ray.tune.search.hyperopt import HyperOptSearch\n",
+        "from ray.tune.schedulers import ASHAScheduler\n",
+        "from ray import tune\n",
+        "from ray.tune.schedulers import PopulationBasedTraining\n",
+        "from ray.tune import CLIReporter"
+      ],
+      "metadata": {
+        "id": "3FdjPsQq6hR7"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## 5-1 Simple example of hyper-parameter search\n",
+        "\n",
+        "We define the training arguments again and re-initialize the trainer below.\n",
+        "Then we can run the hyper-parameter search, with default arguments."
+      ],
+      "metadata": {
+        "id": "k7huV8jj35Yl"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "training_args = TrainingArguments(\n",
+        "                                  output_dir=\"test_trainer\", # Name of the directory where model will be saved\n",
+        "                                  seed=42, # seed for random initialization\n",
+        "                                  no_cuda=False, # whether to use GPU or not\n",
+        "                                  per_device_train_batch_size=4, # Train batch size (on each GPU/CPU)\n",
+        "                                  eval_strategy=\"steps\", # when we want to report evaluation during training\n",
+        "                                  eval_steps=10, # number of steps before reporting evaluation during training\n",
+        "                                  save_strategy=\"best\", # strategy to save models\n",
+        "                                  metric_for_best_model=\"eval_loss\", # metrics to choose the best model\n",
+        "                                  greater_is_better=False, # for metrics best model: False since eval on loss\n",
+        "                                  learning_rate=5e-5, # learning rate value\n",
+        "                                  num_train_epochs=3, # Number of epochs / iterations\n",
+        "                                  report_to=\"wandb\", # reports results to some platforms\n",
+        "                                  log_level=\"debug\", # log level\n",
+        "                                  logging_strategy=\"steps\", #\n",
+        "                                  logging_steps=10, #\n",
+        "                                  )\n",
+        "\n",
+        "\n",
+        "trainer = Trainer(\n",
+        "    model_init=model_init,\n",
+        "    args=training_args,\n",
+        "    train_dataset=shard_train_dataset,\n",
+        "    eval_dataset=small_eval_dataset,\n",
+        "    compute_metrics=compute_metrics,\n",
+        "    tokenizer=tokenizer\n",
+        ")"
+      ],
+      "metadata": {
+        "id": "Ejp-jUkup8nr"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# start a new wandb run to track this script\n",
+        "wandb.init(\n",
+        "      # set the wandb project where this run will be logged\n",
+        "      entity='teaching',\n",
+        "      project=\"tp9_litl_ray\",\n",
+        "      # track hyperparameters and run metadata\n",
+        "      # (here: the base checkpoint and the dataset file)\n",
+        "      config={\n",
+        "          \"model_checkpoint\": base_model,\n",
+        "          \"dataset\": dataset_file,\n",
+        "      }\n",
+        ")"
+      ],
+      "metadata": {
+        "id": "Swg3f29xOqER"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "tune_config = {\n",
+        "        \"learning_rate\": tune.loguniform(1e-4, 1e-2),\n",
+        "        \"num_train_epochs\": tune.choice(range(1, 6)),\n",
+        "        \"seed\": tune.choice(range(1, 41)),\n",
+        "        \"per_device_train_batch_size\": tune.choice([2, 8]),\n",
+        "    }"
+      ],
+      "metadata": {
+        "id": "d2zq914x6Jqs"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# By default, the objective is the evaluation loss if no metrics are provided,\n",
+        "# and the sum of all metrics otherwise: with compute_metrics set, we maximize it.\n",
+        "best_run = trainer.hyperparameter_search(\n",
+        "    # hp_space=lambda _: tune_config,  # uncomment to use the search space defined above\n",
+        "    direction=\"maximize\",\n",
+        "    backend=\"ray\",\n",
+        "    n_trials=3 # number of trials, here very low\n",
+        ")"
+      ],
+      "metadata": {
+        "id": "WRMCdQXUPMcj"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "\n",
+        "The `hyperparameter_search` method returns a `BestRun` object, which contains the value of the maximized objective (by default, the sum of all metrics) and the hyperparameters used for that run.\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "ZQsI7iMgtnwv"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "best_run"
+      ],
+      "metadata": {
+        "id": "nZrTWUeMtqNB"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "You can customize the objective by passing a `compute_objective` function to `hyperparameter_search`, and you can customize the search space by passing it an `hp_space` argument.\n",
+        "See this forum post for some examples: https://discuss.huggingface.co/t/using-hyperparameter-search-in-trainer/785/10"
+      ],
+      "metadata": {
+        "id": "3ShywJOst4QY"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "To reproduce the best run, set its hyperparameters on the `TrainingArguments` of the `Trainer`, then train again:"
+      ],
+      "metadata": {
+        "id": "CWqlhRy3t9dP"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "for n, v in best_run.hyperparameters.items():\n",
+        "    setattr(trainer.args, n, v)\n",
+        "\n",
+        "trainer.train()"
+      ],
+      "metadata": {
+        "id": "udpzh2YVt_9o"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "You can also easily swap in different hyperparameter tuning algorithms, such as HyperBand, Bayesian Optimization, or Population-Based Training.\n",
+        "\n",
+        "Read the post: https://huggingface.co/blog/ray-tune\n",
+        "\n",
+        "Full example on text classification: https://github.com/huggingface/notebooks/blob/main/examples/text_classification.ipynb"
+      ],
+      "metadata": {
+        "id": "Xu-q6Rs1Pa58"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [],
+      "metadata": {
+        "id": "yldJ_Uolc5Ii"
+      },
+      "execution_count": null,
+      "outputs": []
+    }
+  ],
+  "metadata": {
+    "kernelspec": {
+      "display_name": "visual",
+      "language": "python",
+      "name": "visual"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.9.5"
+    },
+    "colab": {
+      "provenance": [],
+      "gpuType": "T4",
+      "toc_visible": true
+    },
+    "accelerator": "GPU"
+  },
+  "nbformat": 4,
+  "nbformat_minor": 0
+}
\ No newline at end of file
-- 
GitLab