diff --git a/README.md b/README.md index b5d0eabccf11f0373542244a13d883086d07a5e4..e26536262aefeabee6145d797f867dbefe431787 100644 --- a/README.md +++ b/README.md @@ -4,9 +4,15 @@ Contact: chloe.braud@irit.fr Current schedule: * 24.11 9h30-12h30 : ML reminder - TP Sentiment analysis with ScikitLearn - * To load the notebook: TP1 [](https://colab.research.google.com/github/chloebt/m2-litl-students/blob/main/notebooks/TP1_masterLiTL_2122.ipynb) + * To load the notebook: TP1 [](https://colab.research.google.com/drive/1IOXm9X6BEJgkTjEd_kh_OYPnK1OvwD7l?usp=sharing) + +* Basics of OOP: https://colab.research.google.com/drive/1oPha9EekRpq5Uvm227xBlumZOZ3f8UNF?usp=sharing + +(https://colab.research.google.com/github/chloebt/m2-litl-students/blob/main/notebooks/TP1_masterLiTL_2122.ipynb) * You'll need to save a copy to your google drive or a notebook on your PC Assignments: * Code + report * due on 9.02.2022 + + diff --git a/notebooks/MasterLiTL_2223_basics_of_poo.ipynb b/notebooks/MasterLiTL_2223_basics_of_poo.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..bdc6319255517baff5b2795f40212193957ab33f --- /dev/null +++ b/notebooks/MasterLiTL_2223_basics_of_poo.ipynb @@ -0,0 +1,169 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "language_info": { + "name": "python" + } + }, + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "3chvArhZ7wsg" + }, + "outputs": [], + "source": [ + "class Car:\n", + " def __init__( self, name, color, year):\n", + " self.name = name\n", + " self.color = color\n", + " self.year = year\n", + "\n", + " def present( self ):\n", + " return 'Hey! 
I am '+self.name+', I am '+self.color+' and I appeared in '+str(self.year)\n", + "\n", + " def be_painted( self, new_color ):\n", + " self.color = new_color\n", + " \n", + " def love( self, other_car ):\n", + " return self.name+' is in love with '+other_car.name\n" + ] + }, + { + "cell_type": "code", + "source": [ + "flash = Car( 'Flash McQueen', 'red', 2006 )\n", + "flash.present()" + ], + "metadata": { + "id": "H4jlq4Jk8txx" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "flash.be_painted( 'purple' )\n", + "flash.present()" + ], + "metadata": { + "id": "8yZiyne39Yt7" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Exercise: \n", + "- Define a new object of type Car with the following attributes: its name is Sally Carrera, its color is blue and it has the same year as the object flash.\n", + "- Call the method love of the Car class with Flash and Sally. " + ], + "metadata": { + "id": "0U0iRh0EmcI2" + } + }, + { + "cell_type": "code", + "source": [], + "metadata": { + "id": "Kn3BotmnjegN" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Inheritance\n", + "\n", + "https://www.geeksforgeeks.org/python-oops-concepts/#:~:text=In%20Python%2C%20object%2Doriented%20Programming,%2C%20etc.%20in%20the%20programming. 
" + ], + "metadata": { + "id": "fu1RyW-ZnUun" + } + }, + { + "cell_type": "code", + "source": [ + "# Parent class\n", + "class Person(object):\n", + "\n", + "\t# __init__ is known as the constructor\n", + "\tdef __init__(self, name, idnumber):\n", + "\t\tself.name = name\n", + "\t\tself.idnumber = idnumber\n", + "\n", + "\tdef display(self):\n", + "\t\tprint(self.name)\n", + "\t\tprint(self.idnumber)\n", + "\t\t\n", + "\tdef details(self):\n", + "\t\tprint(\"My name is {}\".format(self.name))\n", + "\t\tprint(\"IdNumber: {}\".format(self.idnumber))" + ], + "metadata": { + "id": "2Rmw4GeonU6n" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "# child class\n", + "class Employee(Person):\n", + " def __init__(self, name, idnumber, salary, post):\n", + " super().__init__(name, idnumber )\n", + " self.salary = salary\n", + " self.post = post\n", + " \n", + " def details(self):\n", + " print(\"My name is {}\".format(self.name))\n", + " print(\"IdNumber: {}\".format(self.idnumber))\n", + " print(\"Post: {}\".format(self.post))" + ], + "metadata": { + "id": "WyH-7GwAo3Hz" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Exercise: \n", + "- Create an object of the class Employee \n", + "- Print his name and id number." 
+ ], + "metadata": { + "id": "ewKCb9ozpCgz" + } + }, + { + "cell_type": "code", + "source": [ + "# creation of an object variable or an instance\n", + "\n", + "\n", + "# calling a function\n", + "\n" + ], + "metadata": { + "id": "OCUkIAdVo3LT" + }, + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file diff --git a/notebooks/TP1_masterLiTL_2223.ipynb b/notebooks/TP1_masterLiTL_2223.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..75cf4d060585ee7d49fbb7d82967b4182bbd7227 --- /dev/null +++ b/notebooks/TP1_masterLiTL_2223.ipynb @@ -0,0 +1,482 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "language_info": { + "name": "python" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "lmulT50Qopks" + }, + "source": [ + "# TP1: Machine learning (reminder)\n", + "Master LiTL - 2021-2022\n", + "\n", + "## Requirements\n", + "In this practical session, we will explore machine learning models for NLP applications ; specifically, we will train a classifier for sentiment analysis on a French dataset of movie reviews. \n", + "For these exercises, we will make use of Python (v3.*), and a number of modules for data processing and machine learning: *numpy*, *scipy*, *scikit-learn*, *pandas* and *spacy* . \n", + "If you want to use your own computer you will need to make sure these are installed (e.g. using the command *pip*). If you’re using *Miniconda*, you can use the command\n", + "```\n", + "conda install <modulename>\n", + "```\n", + "\n", + "\n", + "First, download the data for the practical session from the course github page to an appropriate working directory, and unzip it. 
Under Linux, you can issue the following command:\n", + "```\n", + "$ unzip allocine.zip \n", + "```\n", + "\n", + "If you want to use Google Colab, you need to upload the data using the menu on the left. \n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "D4dW8cDBpG2v" + }, + "source": [ + "## Task and dataset\n", + "\n", + "We’ll go through the following stages of an NLP machine learning pipeline, using sentiment classification as an application:\n", + "* data preprocessing (tokenization) \n", + "* feature extraction\n", + "* model training\n", + "* evaluation\n", + "\n", + "As a dataset, we’ll be using a set of reviews for television series in French, extracted from the website allocine.fr. \n", + "The dataset consists of the text of the review, as well as a sentiment label (positive or negative).\n", + "\n", + "The dataset is divided into a training part (for training, 5576 reviews, ± 90%) and a test part (for evaluation, 544 reviews, ± 10%). \n", + "The dataset is balanced, which means positive and negative instances are evenly distributed. \n", + "Additionally, the training and test sets contain reviews about different TV series (in order to avoid possible bias when evaluating)."
+ ] + }, + { + "cell_type": "code", + "source": [ + "# Useful imports\n", + "import pandas as pd\n", + "# spacy’s preprocessing pipeline and model for French\n", + "import spacy.cli\n", + "spacy.cli.download(\"fr_core_news_sm\")\n", + "nlp = spacy.load('fr_core_news_sm', disable=['tagger', 'parser', 'ner'])\n", + "\n", + "# Path to data\n", + "train_path = \"allocine_train.tsv\"\n", + "dev_path = \"allocine_dev.tsv\"\n", + "test_path = \"allocine_test.tsv\"\n" + ], + "metadata": { + "id": "YoPn-Cbx44bO" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PKlLeW1Rp3Hl" + }, + "source": [ + "## Exercise 1: Preprocessing (code given)\n", + "\n", + "First, we’ll load the training set and explore the dataset.\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vjWd_ZWYj9i0" + }, + "source": [ + "def read_data( data_path ):\n", + " dataset = pd.read_csv(data_path, header=0, delimiter='\\t', quoting=3)\n", + " print( '\\nFile:', data_path, '\\nData format:', dataset.columns.values, \n", + " '\\nFirst instance: ', dataset['sentiment'][0], dataset['review'][0] )\n", + " return dataset, dataset['sentiment']\n", + "\n", + "def preprocess_data( dataset ):\n", + " num_reviews = dataset['review'].size\n", + " print(\"#Reviews =\", num_reviews)\n", + " dataset_tok = []\n", + " for i in range(num_reviews):\n", + " clean_review = review_to_tokens(dataset['review'][i])\n", + " dataset_tok.append(clean_review)\n", + " for i, r in enumerate(dataset_tok[:2]):\n", + " print('\\n', i, r) \n", + " return dataset_tok\n", + "\n", + "def review_to_tokens(raw_review):\n", + " doc = nlp(raw_review)\n", + " tokenList = [token.text for token in doc]\n", + " tokenized_string = ' '.join(tokenList)\n", + " tokenized_string_lowercase = tokenized_string.lower()\n", + " return tokenized_string_lowercase\n", + "\n", + "def read_and_preprocess( data_path ):\n", + " dataset, labels = read_data( data_path )\n", + " 
dataset_tokenized = preprocess_data( dataset )\n", + " return dataset_tokenized, dataset, labels\n", + "\n", + "train, train_df, y_train = read_and_preprocess( train_path )\n", + "dev, dev_df, y_dev = read_and_preprocess( dev_path )\n", + "test, test_df, y_test = read_and_preprocess( test_path ) " + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "90iwyLkc9hw8" + }, + "source": [ + "## Exercise 2: Feature extraction \n", + "\n", + "Now it’s time to decide which features to use in our classifier. We’ll start with simple bag-of-words features.\n", + "\n", + "▶▶ **TODO: write the code to vectorize the data using CountVectorizer:**\n", + "* Take a look at the doc: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html \n", + "* Create an instance of the class CountVectorizer\n", + "* Look at the definition of the class to see how to limit the number of features to 500 (parameter in the constructor)\n", + "* Use the method fit_transform from this instance, with the tokenized train set as argument\n", + "* Print the shape of the obtained matrix\n", + "* Print the vocabulary, i.e. use the method get_feature_names_out() of the class CountVectorizer (get_feature_names() in scikit-learn versions before 1.0)\n", + "* Once this is done for the train set, do the same for the dev set, using the same CountVectorizer instance. Don't forget that for the dev set we don't use 'fit_transform' but only 'transform'. Do you remember why?"
+ ] + }, + { + "cell_type": "code", + "source": [ + "# useful imports \n", + "from sklearn.feature_extraction.text import CountVectorizer" + ], + "metadata": { + "id": "iyUCtQ7661PU" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZGnHRsH_9jtk" + }, + "source": [ + "# -- Vectorize train\n", + "\n", + "# Create an instance of the class CountVectorizer\n", + "\n", + "\n", + "# Use the method fit_transform from this instance, with 'train' as argument\n", + "\n", + "\n", + "# Print the shape of the obtained matrix\n", + "\n", + "\n", + "# Print the vocabulary, i.e. use the method get_feature_names_out() of the class CountVectorizer\n", + "\n", + "\n", + "# --------------------------------------------------------\n", + "# -- Vectorize dev\n", + "\n", + "# Use the method transform from this instance, with 'dev' as argument\n", + "\n", + "\n", + "# Print the shape of the obtained matrix\n", + "\n", + "\n", + "# Vocabulary should remain the same!\n" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u3VCUNqZ9ohj" + }, + "source": [ + "## Exercise 3: Classification\n", + "\n", + "We’ll start with a simple classifier that often performs well: Naive Bayes.\n", + "\n", + "▶▶ **TODO: Train the classifier and report its performance on the dev set.**\n", + "* Take a look at the doc: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html \n", + "* Create an instance of the class MultinomialNB\n", + "* Use the method fit() on this instance of the naive bayes classifier, with the vectorized train set and the train labels (y_train) as arguments\n", + "* Look at the definition of the class and find a method that can be used for evaluating the model on the dev set. 
What does the score represent?\n" + ] + }, + { + "cell_type": "code", + "source": [ + "# Useful imports\n", + "from sklearn.naive_bayes import MultinomialNB\n" + ], + "metadata": { + "id": "uNzDYo6e9fo_" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "CssLBO2SBPMN" + }, + "source": [ + "## -- Classification with NAIVE BAYES\n", + "\n", + "# Create an instance of the class MultinomialNB\n", + "\n", + "\n", + "# Use the method fit() on this instance of the naive bayes classifier, with the vectorized train set and y_train as arguments\n" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "IKuY2GrsKv3R" + }, + "source": [ + "# Compute the performance on the dev set\n" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0KdSuShTBQ00" + }, + "source": [ + "### Exercise 3-b (code given)\n", + "\n", + "* Look at the instances that were misclassified. Do you see why each review was misclassified? " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0fek-bze96bp" + }, + "source": [ + "## -- Look at misclassified instances (uses 'classifier' and 'dev_bow' from Exercises 2 and 3)\n", + "pred = classifier.predict( dev_bow )\n", + "#print(pred) # = array of predicted labels, hard to read\n", + "\n", + "print('Misclassified examples: ')\n", + "count_err = 0\n", + "for i in range(len(pred)):\n", + " if pred[i] != y_dev[i]:\n", + " print( \"\\nGOLD=\", y_dev[i], \"PRED=\",pred[i] , i, dev_df['review'][i])\n", + " count_err += 1\n", + " \n", + "print( \"CHECK: \", \"#Total=\", len(pred), \"#Errors=\", count_err, \"Acc=\", (len(pred)-count_err)/len(pred))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LBe-8okB-9oc" + }, + "source": [ + "## Exercise 4: Experiment with different feature sets.\n", + "\n", + "Here, we'll just try bi-grams.\n", + "\n", + "▶▶ **TODO: write the code to vectorize the data into bigrams. 
Keep 'max_features = 500'. Then retrain and evaluate the classifier.**\n", + "* Look again at the class CountVectorizer and find the parameter that has to be changed in the constructor to get bigrams, and the one that limits the number of features.\n", + "* Use this new instance of CountVectorizer to vectorize the train and dev sets.\n", + "* Train and evaluate a naive bayes model with this data representation.\n", + "\n", + "We could also have tried, e.g., to:\n", + " * Exclude a list of stopwords (high-frequency words that are considered too general to be meaningful, such as une or le)\n", + " * Experiment with n-grams with n>2 \n", + " * Combine features (e.g. BOW + bi-grams)\n", + " * Can you think of other features to include?" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SkyrhwGx-jYz" + }, + "source": [ + "## -- Write the code to extract bigram features\n", + "\n", + "# Create an instance of the class CountVectorizer\n", + "\n", + "\n", + "# Vectorize the train and dev sets, print their shape\n", + "\n", + "\n", + "# Print the vocabulary\n", + "\n", + "\n", + "## -- Train a Naive Bayes classifier and evaluate on dev\n", + "\n" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-rFwwod5_gXI" + }, + "source": [ + "## Exercise 5\n", + "\n", + "Experiment with different classifiers, compare:\n", + "* Naive Bayes \n", + "* MaxEnt (logistic regression)\n", + "\n", + "▶▶ **Compare the results obtained with NB to the ones obtained with MaxEnt.**\n", + "* Take a look at the doc: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html" + ] + }, + { + "cell_type": "code", + "source": [ + "# Useful imports\n", + "from sklearn.linear_model import LogisticRegression" + ], + "metadata": { + "id": "7geCF-2NI1t7" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "kHrZ3vI-_9zE" + }, + "source": [ + "# -- Train a MaxEnt classifier and evaluate on dev\n" + ], + 
"execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "▶▶ **Try again but without the limitation on the number of features. What do you conclude?**" + ], + "metadata": { + "id": "rz3K8eWYKSKK" + } + }, + { + "cell_type": "code", + "source": [ + "## -- Make experiments with BoW and bigrams a,d NB and LR\n", + "# without the limitation on the number of features\n", + "\n" + ], + "metadata": { + "id": "5tYvyvIfQHjH" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BEMs98mR_-AG" + }, + "source": [ + "## Exercise 6: evaluation on the test set\n", + "\n", + "You’ve determined the best feature set and classification algorithm (missing: the best set of hyper-parameters). \n", + "\n", + "▶▶ **compute the performance on the test set**." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "YJYV84q6_-o0" + }, + "source": [ + "# -- Compute the final results on the TEST set\n" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "y71z29pK-jwS" + }, + "source": [ + "## Exercise 7: Intrinsic model evaluation (code given)\n", + "\n", + "Some models allow us to look at the most informative features. \n", + "\n", + "▶▶ **Examine both the top and the bottom of the list. 
Which features are most informative?**" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "U4mRULuT-TQi" + }, + "source": [ + "classifier_lr_bow = LogisticRegression()\n", + "classifier_lr_bow.fit( train_bow, y_train )\n", + "\n", + "vocab = vectorizer_bow.get_feature_names_out()\n", + "\n", + "allCoefficients = [(classifier_lr_bow.coef_[0,i], vocab[i]) for i in range(len(vocab))]\n", + "allCoefficients.sort()\n", + "allCoefficients.reverse()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "in907YQ4-oXv" + }, + "source": [ + "print(\"Top features for positive class:\")\n", + "print( '\\n'.join( [ f+':\\t'+str((round(w,3))) for (w,f) in allCoefficients[:50]] ) )" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "AdrYsRJ7-rMf" + }, + "source": [ + "print(\"Top features for negative class:\")\n", + "print( '\\n'.join( [ f+':\\t'+str((round(w,3))) for (w,f) in allCoefficients[-50:]] ) )" + ], + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file diff --git a/slides/Master LiTL_ Course 1 - 24112022.pdf b/slides/Master LiTL_ Course 1 - 24112022.pdf new file mode 100644 index 0000000000000000000000000000000000000000..8c6fcf292c60c9de94860563dda7fdb849737b73 Binary files /dev/null and b/slides/Master LiTL_ Course 1 - 24112022.pdf differ