%% Cell type:markdown id: tags:
# TP 6: Introduction to transformers
In this session, we will see how to use the HuggingFace Transformers library and pretrained models.
We will again work on the sentiment analysis task, with the English IMDB data.
In HuggingFace terms, this is a (word) sequence classification task.
We will rely on the HuggingFace library and Transformer language models (e.g. BERT).
- https://huggingface.co/ : an open-source NLP library offering a very rich API to use different architectures and models for the classic problems of classification, sequence tagging, generation, etc. Feel free to browse the existing demos and models: https://huggingface.co/tasks/text-classification
- A fairly large number of datasets is also directly accessible via the API, notably for text and images; see the datasets at https://huggingface.co/datasets and the documentation for handling them: https://huggingface.co/docs/datasets/index
The code below installs:
- the *transformers* module, which contains the language models: https://pypi.org/project/transformers/
- the *datasets* library, to access datasets
%% Cell type:code id: tags:
``` python
!pip install -U transformers
!pip install datasets
```
%% Cell type:markdown id: tags:
Finally, if the installation was successful, we can import the required libraries:
%% Cell type:code id: tags:
``` python
import transformers
from transformers import pipeline
from datasets import load_dataset
import numpy as np
```
%% Cell type:markdown id: tags:
# 1. Sentiment analysis with a pretrained model
Many NLP tasks are easy to perform within HuggingFace thanks to the Pipeline abstraction.
Useful resource: the course available on the HuggingFace website, e.g. the part on pipelines: https://huggingface.co/course/chapter1/3?fw=pt#working-with-pipelines
For example, for text classification we can very simply access pretrained models for various tasks, including sentiment analysis:
https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TextClassificationPipeline
Let's try!
%% Cell type:markdown id: tags:
## 1.1 ▶▶ Exercise: Default model
You can test pipelines by simply specifying the task you want to perform; a model is then chosen by default.
Run the code below:
* what is the name of the chosen pretrained model?
* what language is it for?
* run the next lines and look at the model's predictions: do they seem alright? Can you produce an example that is not well predicted? (a sketch of tricky examples follows the empty cells below)
%% Cell type:code id: tags:
``` python
classifier = pipeline("sentiment-analysis")
```
%% Cell type:code id: tags:
``` python
classifier("This movie is disgustingly good !")
```
%% Cell type:code id: tags:
``` python
classifier("This movie is not as good as expected !")
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
```
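%% Cell type:markdown id: tags:
A sketch of inputs that often fool sentiment classifiers (the sentences below are ours, not part of the original exercise): sarcasm and subtle negation are typical failure cases for models fine-tuned on simple polarity data.
%% Cell type:code id: tags:
``` python
# Sarcasm and subtle negation often trip up polarity classifiers
tricky_examples = [
    "Oh great, another two hours of my life I will never get back.",
    "I expected this movie to be terrible, and it absolutely was not.",
]
classifier(tricky_examples)
```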
%% Cell type:markdown id: tags:
## 1.3 Specifying a pretrained model for English
You can specify the pretrained model you want to use.
HuggingFace makes available tons of models for NLP (and other domains).
You can browse them on this page, here restricted to English models for text classification tasks: https://huggingface.co/models?language=en&pipeline_tag=text-classification&sort=downloads
▶▶ Exercise: use the same model as before, but pass its name explicitly via the pipeline's model parameter (a sketch follows the empty cell below).
Hint: look at the doc https://huggingface.co/learn/nlp-course/chapter1/3?fw=pt#using-any-model-from-the-hub-in-a-pipeline
%% Cell type:code id: tags:
``` python
```
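%% Cell type:markdown id: tags:
One possible solution: the default English checkpoint is DistilBERT fine-tuned on SST-2, so naming it explicitly should reproduce the previous behaviour.
%% Cell type:code id: tags:
``` python
# Same model as the default, but named explicitly
classifier_en = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
classifier_en("This movie is disgustingly good !")
```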
%% Cell type:markdown id: tags:
## 1.4 ▶▶ Exercise: use a pretrained model for French
Now take a look at the models page and find a suitable model for the task in French: we want to try an adapted version of **FlauBERT**.
* Find the model on the hub and look at its documentation: how was this model built?
* Load it. You will need to install the sacremoses library using ```!pip install sacremoses```
* Then try it on a few examples (a sketch with a placeholder checkpoint follows the empty cell below).
%% Cell type:code id: tags:
``` python
!pip install sacremoses
```
%% Cell type:code id: tags:
``` python
```
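%% Cell type:markdown id: tags:
A sketch of the loading step; the checkpoint name below is a placeholder, replace it with the FlauBERT-based sentiment model you found on the hub:
%% Cell type:code id: tags:
``` python
# Placeholder name: substitute the FlauBERT sentiment checkpoint found on the hub
french_model_name = "<org>/<flaubert-sentiment-checkpoint>"
classifier_fr = pipeline("sentiment-analysis", model=french_model_name)
classifier_fr("Ce film est une très bonne surprise !")
```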
%% Cell type:markdown id: tags:
## 1.5 Exploring a dataset
In this part, we will focus on exploring datasets that are part of the HuggingFace hub.
%% Cell type:markdown id: tags:
### 1.5.1 Load a dataset
▶▶ Exercise: Find the dataset corresponding to IMDB and load it.
Doc: https://huggingface.co/datasets and https://huggingface.co/docs/datasets/load_hub
%% Cell type:code id: tags:
``` python
```
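%% Cell type:markdown id: tags:
One possible solution: the IMDB reviews are hosted on the hub under the id `imdb`.
%% Cell type:code id: tags:
``` python
# Download (and cache) the IMDB sentiment dataset: 25k train / 25k test reviews
dataset = load_dataset("imdb")
dataset
```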
%% Cell type:markdown id: tags:
### 1.5.2 Print statistics on the dataset
▶▶ Exercise:
* Print the number of classes
* Print the first 2 examples of the dataset (tip: shuffle the dataset first, the split is sorted by label)
* Print the label distribution
* Count the total number of tokens and the number of unique tokens
A sketch is given after the empty cells below. Hint: start by simply 'printing' the dataset object, i.e.:
```
dataset
```
It will show you the structure of this object.
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
```
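%% Cell type:markdown id: tags:
A minimal sketch of these statistics, counting tokens naively on whitespace for now (the model's own tokenizer comes in 1.5.3):
%% Cell type:code id: tags:
``` python
from collections import Counter

train = dataset["train"]

# Number of classes, from the ClassLabel feature
print("Classes:", train.features["label"].names)

# First 2 examples after shuffling (the split is sorted by label)
for example in train.shuffle(seed=42).select(range(2)):
    print(example["label"], example["text"][:200], "...")

# Label distribution
print("Distribution:", Counter(train["label"]))

# Total and unique tokens, with naive whitespace tokenization
counts = Counter()
for text in train["text"]:
    counts.update(text.split())
print("Total tokens:", sum(counts.values()), "- unique tokens:", len(counts))
```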
%% Cell type:markdown id: tags:
### 1.5.3 Tokenizer
The text in the dataset is not tokenized.
In fact, transformer models were trained using a specific tokenization, and it is crucial to rely on the same tokenization when using a transformer model.
%% Cell type:markdown id: tags:
#### ▶▶ Exercise: Load the pretrained model for English and test it on the first example
%% Cell type:code id: tags:
``` python
```
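%% Cell type:markdown id: tags:
One possible solution, reusing the pipeline from section 1; the truncation flag is our addition, since IMDB reviews can exceed the 512-token limit of the model:
%% Cell type:code id: tags:
``` python
# Classify the first training review, truncating inputs that exceed the model limit
first_review = dataset["train"][0]["text"]
classifier(first_review, truncation=True)
```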
%% Cell type:markdown id: tags:
#### Notes on tokenizers
Note that the HuggingFace library defines *Auto Classes*: they directly infer the required architecture from the type of model given as argument.
* For instance here, the tokenizer is specific to the DistilBERT model; more precisely, it is identical to BERT's and inherits many methods from the *PreTrainedTokenizerFast* class.
* The *transformers.AutoModelForSequenceClassification* class is used for a sequence classification model.
The tokenizer is in charge of preparing the input data: in the case of BERT, it splits tokens into sub-tokens, assigns an id to each sub-token, provides the mapping in both directions, etc.
- The *Auto Classes*: https://huggingface.co/docs/transformers/model_doc/auto
- Tokenizers in HuggingFace: https://huggingface.co/docs/transformers/v4.25.1/en/main_classes/tokenizer
- *Bert tokenizer*: https://huggingface.co/docs/transformers/v4.25.1/en/model_doc/bert#transformers.BertTokenizer
- The *PreTrainedTokenizerFast* class: https://huggingface.co/docs/transformers/v4.25.1/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast
%% Cell type:code id: tags:
``` python
from transformers import AutoTokenizer
# Checkpoint used above; we assume the default English sentiment model, adjust if you chose another one
pretrained_model = "distilbert-base-uncased-finetuned-sst-2-english"
# Defining the tokenizer using Auto Classes
tokenizer = AutoTokenizer.from_pretrained(pretrained_model)
```
%% Cell type:markdown id: tags:
#### ▶▶ Exercise: Test the tokenizer
**Use the tokenizer to:**
- encode a sentence (in English)
- convert in the other direction, from a list of token ids back to text
* what happens with long words?
* with unknown words?
* what do the elements between square brackets represent?
Hint: look at the 'encode' and 'decode' methods in the doc https://huggingface.co/docs/transformers/v4.25.1/en/main_classes/tokenizer (and possibly 'convert_ids_to_tokens()'). A sketch follows the empty cells below.
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
```
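%% Cell type:markdown id: tags:
A minimal sketch (the sentence is ours): encode a sentence, inspect the sub-tokens, and decode back.
%% Cell type:code id: tags:
``` python
# Encode: text -> token ids (special tokens are added automatically)
ids = tokenizer.encode("An unforgettable masterpiece of blorptastic cinema.")
print(ids)

# Long or unknown words are split into sub-tokens marked with '##'
print(tokenizer.convert_ids_to_tokens(ids))

# Decode: token ids -> text, including the special [CLS] and [SEP] tokens
print(tokenizer.decode(ids))
```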
%% Cell type:markdown id: tags:
#### Compute the vocabulary using the tokenizer
The function below will tokenize the entire dataset.
▶▶ Exercise: compute the total number of tokens and the number of unique tokens (a sketch follows the tokenized dataset below).
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
def tokenize_function(examples):
    return tokenizer(examples["text"])

tokenized_datasets = dataset.map(tokenize_function)
```
%% Cell type:code id: tags:
``` python
tokenized_datasets
```
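%% Cell type:markdown id: tags:
Once the dataset is tokenized, the counts can be read off the input_ids column; a minimal sketch on the training split:
%% Cell type:code id: tags:
``` python
# Total and unique sub-token ids over the training split
total_subtokens = 0
unique_ids = set()
for ids in tokenized_datasets["train"]["input_ids"]:
    total_subtokens += len(ids)
    unique_ids.update(ids)
print("Total sub-tokens:", total_subtokens, "- unique sub-tokens:", len(unique_ids))
```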
%% Cell type:markdown id: tags:
Note that the tokenizer returns two elements:
- input_ids: the numbers representing the tokens in the text.
- attention_mask: indicates which tokens the model should attend to (1) and which it should ignore (0), typically padding.
More info on datasets: https://huggingface.co/docs/datasets/use_dataset
%% Cell type:code id: tags:
``` python
```
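%% Cell type:markdown id: tags:
For instance, calling the tokenizer directly on a short sentence shows both fields:
%% Cell type:code id: tags:
``` python
# The tokenizer returns a dict with input_ids and attention_mask
tokenizer("A short example.")
```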
%% Cell type:markdown id: tags:
# Additional notes about HuggingFace dataset
%% Cell type:markdown id: tags:
## Available corpora
Note that many corpora are available directly from HuggingFace, for example for text classification tasks:
https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads
In particular you can directly load the full AlloCine corpus:
https://huggingface.co/datasets/allocine
%% Cell type:markdown id: tags:
## Some preprocessing
The library makes it very easy to perform some preprocessing directly on the Dataset object.
Take a look at the doc: https://huggingface.co/course/chapter5/3?fw=pt
For example, here we can compute the length of each review and filter our dataset to exclude outliers, e.g. reviews with too few words.
%% Cell type:code id: tags:
``` python
def compute_review_length(example):
    # Assumes a "review" column, as in AlloCine; for IMDB the column is "text"
    return {"review_length": len(example["review"].split())}

dataset = dataset.map(compute_review_length)  # Add the review_length column
# Inspect the first training example
dataset["train"][0]
%% Cell type:markdown id: tags:
Some reviews are very short... Dataset.filter() can be used to remove some examples.
%% Cell type:code id: tags:
``` python
dataset["train"].sort("review_length")[:3]
```
%% Cell type:code id: tags:
``` python
filtered_dataset = dataset.filter(lambda x: x["review_length"] > 10)
print(filtered_dataset.num_rows)
```