diff --git a/notebooks/TP3_m2LiTL_2324_WordEmbeddings_2324.ipynb b/notebooks/TP3_m2LiTL_2324_WordEmbeddings_2324.ipynb
new file mode 100644
index 0000000000000000000000000000000000000000..5879cfa470235688ce68da34d376773ceb9bda26
--- /dev/null
+++ b/notebooks/TP3_m2LiTL_2324_WordEmbeddings_2324.ipynb
@@ -0,0 +1,609 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "fleet-worry",
+   "metadata": {
+    "id": "fleet-worry"
+   },
+   "source": [
+    "# TP3: Word embeddings\n",
+    "Master LiTL\n",
+    "\n",
+    "\n",
+    "In this practical session, we will explore the generation of word embeddings.\n",
+    "\n",
+    "We will make use of *gensim* for generating word embeddings.\n",
+    "If you want to use your own computer, you will need to make sure it is installed (e.g. using the command ```pip```).\n",
+    "If you’re using Anaconda/Miniconda, you can use the command ```conda install <modulename>```.\n",
+    "\n",
+    "Sources:\n",
+    "- Practical from T. van de Cruys\n",
+    "- https://machinelearningmastery.com/develop-word-embeddings-python-gensim/\n",
+    "- https://radimrehurek.com/gensim/models/word2vec.html\n",
+    "- https://www.shanelynn.ie/word-embeddings-in-python-with-spacy-and-gensim/: see an example based on the 20NewsGroup corpus\n",
+    "- http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/\n",
+    "- (not used but seems interesting: https://www.machinelearningplus.com/nlp/gensim-tutorial/#14howtotrainword2vecmodelusinggensim)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "abandoned-basketball",
+   "metadata": {
+    "id": "abandoned-basketball"
+   },
+   "source": [
+    "## 1. Look at the data\n",
+    "\n",
+    "Upload the data: *corpus_food.txt.gz*.\n",
+    "The data come from blogs on cooking.\n",
+    "\n",
+    "You can take a look at your data using a terminal and the following commands:\n",
+    "\n",
+    "* Number of lines:\n",
+    "```\n",
+    "$ wc -l corpus_food.txt\n",
+    "$ 1161270 corpus_food.txt\n",
+    "```\n",
+    "\n",
+    "* First ten lines:\n",
+    "```\n",
+    "$ head -n 10 corpus_food.txt\n",
+    "$ -mention meilleur espoir féminin : on aurait pu ajouter ioudgine .\n",
+    "malheureusement , comme presque tout ce qui est bon , c' est bourré de beurre et de sucre .\n",
+    "j' avais déjà façonné une recette allégée pour weight watchers mais elle contenait encore du beurre et un peu de sucre .\n",
+    "aujourd' hui je vous propose cette recette que j' ai improvisée hier soir , sans beurre et sans sucre .\n",
+    "n' empêche que pour acheter sa propre baguette magique ou pour déguster des bières au beurre , on pourrait partir au bout du monde !\n",
+    "menthe , sucre de canne , rhum , citron vert , sont vos meilleurs amis en soirée ?\n",
+    "parfois , on rêve d' un bon verre de vin .\n",
+    "la marque de biscuits oreo a pensé aux gourmandes et aux gourmands , et s' apprête à lancer des gâteaux dotés de nouvelles saveurs : caramel beurre salé et noix de coco .\n",
+    "rangez les parapluies , et sortez le sel et le citron !\n",
+    "le vin on adore le savourer avec modération .\n",
+    "```\n",
+    "\n",
+    "The first sentence looks odd, but otherwise the beginning of the corpus comes from: http://www.leblogdelaura.com/2017/03/pancakes-sans-sucre-et-sans-graisses.html"
+   ]
+  },
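+  {
+   "cell_type": "markdown",
+   "source": [
+    "If you are working in a hosted notebook (e.g. Colab) without a terminal, the cell below is a minimal Python equivalent of the shell commands above. It is only a sketch and assumes that the compressed file *corpus_food.txt.gz* has been uploaded to the current working directory."
+   ],
+   "metadata": {
+    "id": "peekCorpusMd"
+   },
+   "id": "peekCorpusMd"
+  },
+  {
+   "cell_type": "code",
+   "source": [
+    "# Count the lines of the gzipped corpus and print the first ten of them\n",
+    "# (assumes corpus_food.txt.gz is in the current working directory)\n",
+    "import gzip\n",
+    "\n",
+    "n_lines = 0\n",
+    "with gzip.open('corpus_food.txt.gz', 'rt', encoding='utf-8') as f:\n",
+    "    for i, line in enumerate(f):\n",
+    "        if i < 10:\n",
+    "            print(line.rstrip())\n",
+    "        n_lines += 1\n",
+    "print('Number of lines:', n_lines)"
+   ],
+   "metadata": {
+    "id": "peekCorpusCode"
+   },
+   "id": "peekCorpusCode",
+   "execution_count": null,
+   "outputs": []
+  },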
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## 2. Build word embeddings\n",
+    "\n",
+    "We will use *gensim* in order to induce word embeddings from text.\n",
+    "*gensim* is a vector space modeling and topic modeling toolkit for python, and contains an efficient implementation of the *word2vec* algorithms.\n",
+    "\n",
+    "*word2vec* consists of two different algorithms: *skipgram* (sg) and *continuous-bag-of-words* (cbow).\n",
+    "The underlying prediction task of the former is to estimate the context words from the target word; the prediction task of the latter is to estimate the target word from the sum of the context words.\n",
+    "\n",
+    "### 2.1 Train a model\n",
+    "▶▶**Run the following code: it will build word embeddings based on the food corpus using the Word2Vec algorithm.**\n",
+    "The model will be saved on your disk."
+   ],
+   "metadata": {
+    "id": "FJbl3GWfFYms"
+   },
+   "id": "FJbl3GWfFYms"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "impressed-consumer",
+   "metadata": {
+    "id": "impressed-consumer"
+   },
+   "outputs": [],
+   "source": [
+    "# potential Error: need to update smart_open with conda install smart_open==2.0.0 or pip install smart_open==2.0.0\n",
+    "\n",
+    "\n",
+    "# construct word2vec model using gensim\n",
+    "\n",
+    "from gensim.models import Word2Vec\n",
+    "\n",
+    "import gzip\n",
+    "import logging\n",
+    "\n",
+    "import time\n",
+    "\n",
+    "# set up logging for gensim\n",
+    "logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',\n",
+    "                    level=logging.INFO, force=True)\n",
+    "logging.getLogger('RootLogger').setLevel(logging.INFO)\n",
+    "\n",
+    "logging.info(\"This is a test log ..\")\n",
+    "\n",
+    "# we define a PlainTextCorpus class; this will provide us with an\n",
+    "# iterator over the corpus (so that we don't have to load the corpus\n",
+    "# into memory)\n",
+    "class PlainTextCorpus(object):\n",
+    "    def __init__(self, fileName):\n",
+    "        self.fileName = fileName\n",
+    "\n",
+    "    def __iter__(self):\n",
+    "        for line in gzip.open(self.fileName, 'rt', encoding='utf-8'):\n",
+    "            yield line.split()\n",
+    "\n",
+    "# -- Instantiate the corpus class using the corpus location\n",
+    "sentences = PlainTextCorpus('corpus_food.txt.gz')\n",
+    "\n",
+    "# -- Training\n",
+    "# we only take into account words with a frequency of at least 50, and\n",
+    "# we iterate over the corpus only once\n",
+    "model = Word2Vec(sentences, min_count=50, iter=1, sorted_vocab=1)\n",
+    "\n",
+    "# -- Finally, save the constructed model to disk\n",
+    "# When getting started, you can save the learned model in ASCII format and review the contents.\n",
+    "model.wv.save_word2vec_format('model_word2vec_food.txt', binary=False)\n",
+    "# by default, it is saved as binary\n",
+    "model.save('model_word2vec_food.bin')\n",
+    "# a saved model can be loaded again using:\n",
+    "#model = Word2Vec.load('model_word2vec_food.bin')"
+   ]
+  },
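+  {
+   "cell_type": "markdown",
+   "source": [
+    "As a quick sanity check (a sketch, assuming the training cell above has been run and has written *model_word2vec_food.bin*), you can reload the saved model and print the vocabulary size and the dimensionality of the vectors."
+   ],
+   "metadata": {
+    "id": "reloadCheckMd"
+   },
+   "id": "reloadCheckMd"
+  },
+  {
+   "cell_type": "code",
+   "source": [
+    "# Sanity check (sketch): reload the model saved above and inspect its size\n",
+    "from gensim.models import Word2Vec\n",
+    "\n",
+    "reloaded = Word2Vec.load('model_word2vec_food.bin')\n",
+    "print('Vocabulary size:', len(reloaded.wv.vocab))\n",
+    "print('Embedding size:', reloaded.wv.vector_size)"
+   ],
+   "metadata": {
+    "id": "reloadCheckCode"
+   },
+   "id": "reloadCheckCode",
+   "execution_count": null,
+   "outputs": []
+  },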
+  {
+   "cell_type": "markdown",
+   "id": "bottom-tobacco",
+   "metadata": {
+    "id": "bottom-tobacco"
+   },
+   "source": [
+    "### 2.2 A few remarks:\n",
+    "\n",
+    "From: http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/\n",
+    "\n",
+    "#### Downsampling\n",
+    "\n",
+    "Subsampling frequent words decreases the number of training examples.\n",
+    "\n",
+    "There are two “problems” with common words like “the”:\n",
+    "* When looking at word pairs, (“fox”, “the”) doesn’t tell us much about the meaning of “fox”: “the” appears in the context of pretty much every word.\n",
+    "* We will have many more samples of (“the”, …) than we need to learn a good vector for “the”.\n",
+    "\n",
+    "Word2Vec implements a “subsampling” scheme to address this.\n",
+    "For each word we encounter in our training text, there is a chance that we will effectively delete it from the text.\n",
+    "The probability that we cut the word is related to the word’s frequency.\n",
+    "\n",
+    "If we have a window size of 10, and we remove a specific instance of “the” from our text:\n",
+    "\n",
+    "* As we train on the remaining words, “the” will not appear in any of their context windows.\n",
+    "* We’ll have 10 fewer training samples where “the” is the input word.\n",
+    "\n",
+    "There is also a parameter in the code named ‘sample’ which controls how much subsampling occurs, and the default value is 0.001. Smaller values of ‘sample’ mean words are less likely to be kept.\n",
+    "\n",
+    "\n",
+    "#### Negative sampling (for SkipGram)\n",
+    "\n",
+    "Training a neural network means taking a training example and adjusting all of the neuron weights slightly so that it predicts that training sample more accurately.\n",
+    "In other words, each training sample will tweak all of the weights in the neural network --> prohibitive.\n",
+    "\n",
+    "Negative sampling addresses this by having each training sample only modify a small percentage of the weights, rather than all of them.\n",
+    "\n",
+    "When training the network on the word pair (“fox”, “quick”), i.e. 'fox' is the target, 'quick' a context word: “quick” -> 1; all of the other thousands of output neurons -> 0.\n",
+    "\n",
+    "With negative sampling, we are instead going to randomly select just a small number of “negative” words (let’s say 5) to update the weights for: “quick” -> 1; 5 other random words -> 0.\n",
+    "\n",
+    "Recall that the output layer of our model has a weight matrix that’s dx|V|, e.g. 300 x 23,000. So we will just be updating the weights for our positive word (“quick”), plus the weights for 5 other words that we want to output 0. That’s a total of 6 output neurons, and 1,800 weight values total. That’s only about 0.03% of the 6.9M weights in the output layer! (In the hidden layer, only the weights for the input word are updated -- this is true whether you’re using Negative Sampling or not)."
+   ]
+  },
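+  {
+   "cell_type": "markdown",
+   "source": [
+    "To make the effect of the ‘sample’ parameter concrete, the cell below sketches the keep-probability formula described in the McCormick tutorial cited above (the one used in the original word2vec C code). It is only an illustration: gensim’s own implementation may differ in its details. Here *z* is the fraction of all corpus tokens taken up by a given word."
+   ],
+   "metadata": {
+    "id": "subsamplingSketchMd"
+   },
+   "id": "subsamplingSketchMd"
+  },
+  {
+   "cell_type": "code",
+   "source": [
+    "# Sketch of the subsampling keep-probability from the McCormick tutorial:\n",
+    "# P(keep) = (sqrt(z / sample) + 1) * (sample / z), where z is the word's\n",
+    "# share of the corpus. Values above 1 simply mean the word is always kept.\n",
+    "import math\n",
+    "\n",
+    "def keep_probability(z, sample=0.001):\n",
+    "    return (math.sqrt(z / sample) + 1) * (sample / z)\n",
+    "\n",
+    "# a very frequent word (1% of the corpus) vs. rarer ones\n",
+    "for z in [0.01, 0.001, 0.0001]:\n",
+    "    print(z, round(keep_probability(z), 3))"
+   ],
+   "metadata": {
+    "id": "subsamplingSketchCode"
+   },
+   "id": "subsamplingSketchCode",
+   "execution_count": null,
+   "outputs": []
+  },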
+  {
+   "cell_type": "markdown",
+   "source": [
+    "### 2.3 Print information about the learned model\n",
+    "Note that the corpus is food-related, so food-related terms will work best.\n",
+    "\n",
+    "You can print the vocabulary using:\n",
+    "```\n",
+    "vocabulary = list(model.wv.vocab)\n",
+    "```\n",
+    "\n",
+    "It is possible to look at the individual word embeddings using the following:\n",
+    "```\n",
+    "model.wv['citron']\n",
+    "```\n",
+    "```\n",
+    "print(model['citron'])\n",
+    "```\n",
+    "\n",
+    "Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.\n",
+    "\n",
+    "▶▶ **Print the vocabulary and then the vectors for a few terms, e.g. 'citron' and 'fruit'. Do they seem close?**"
+   ],
+   "metadata": {
+    "id": "ZAgXv40pHQXs"
+   },
+   "id": "ZAgXv40pHQXs"
+  },
+  {
+   "cell_type": "code",
+   "source": [
+    "vocabulary = list(model.wv.vocab)\n",
+    "model.wv[\"citron\"]\n",
+    "print(model['citron'])"
+   ],
+   "metadata": {
+    "id": "exp1yDyJx3y5"
+   },
+   "id": "exp1yDyJx3y5",
+   "execution_count": null,
+   "outputs": []
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## 3. Compute word similarity\n",
+    "\n",
+    "You can now compute a similarity measure between words using\n",
+    "```\n",
+    "model.similarity('manger', 'goûter')\n",
+    "```\n",
+    "\n",
+    "You can also print the most similar words (as measured by cosine similarity between the word vectors) by issuing the following command:\n",
+    "```\n",
+    "model.wv.most_similar('citron')\n",
+    "```\n",
+    "▶▶**Print the similarity between some terms, e.g. ('manger', 'boire'), ('manger', 'dormir') ... Do the results seem coherent?**\n",
+    "\n",
+    "▶▶**Print the words that are most similar to: 'citron', 'manger' and other words, e.g. not related to food. Do the results seem coherent?**"
+   ],
+   "metadata": {
+    "id": "eSm6IZ3IFTP9"
+   },
+   "id": "eSm6IZ3IFTP9"
+  },
+  {
+   "cell_type": "code",
+   "source": [
+    "# wrap the first call in print() so that both results are displayed\n",
+    "print(model.similarity(\"manger\", \"goûter\"))\n",
+    "model.wv.most_similar(\"citron\")"
+   ],
+   "metadata": {
+    "id": "sSusMXEgX5Pd"
+   },
+   "id": "sSusMXEgX5Pd",
+   "execution_count": null,
+   "outputs": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bigger-drink",
+   "metadata": {
+    "id": "bigger-drink"
+   },
+   "source": [
+    "## 4. Exercise: change the parameter values\n",
+    "\n",
+    "By default, the *word2vec* module creates **word embeddings of size 100**, using a **cbow model** with a **window of 5 words**.\n",
+    "\n",
+    "▶▶**Train a model with different parameters:**\n",
+    "- using a different window size,\n",
+    "- using a different embedding size,\n",
+    "- using *skipgram*.\n",
+    "\n",
+    "Inspect the results (similar words) qualitatively. Do the similarity computations change? Are they better or worse?\n",
+    "\n",
+    "See doc: https://radimrehurek.com/gensim_3.8.3/models/word2vec.html"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "source": [
+    "model = Word2Vec(sentences, size=100, window=5, min_count=1, workers=4)\n",
+    "vocabulary = list(model.wv.vocab)\n",
+    "model.wv[\"citron\"]\n",
+    "print(model['citron'])"
+   ],
+   "metadata": {
+    "id": "L0L-muU3yEDr"
+   },
+   "id": "L0L-muU3yEDr",
+   "execution_count": null,
+   "outputs": []
+  },
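+  {
+   "cell_type": "markdown",
+   "source": [
+    "One possible way to carry out the comparison (a sketch only): train a skipgram model with a larger window and a larger embedding size, and compare its nearest neighbours with those of the model trained just above. The parameter names *sg*, *size* and *window* follow the gensim 3.x API used elsewhere in this notebook (*sg=1* selects skipgram, *sg=0* cbow)."
+   ],
+   "metadata": {
+    "id": "compareParamsMd"
+   },
+   "id": "compareParamsMd"
+  },
+  {
+   "cell_type": "code",
+   "source": [
+    "# Sketch: skipgram, larger window and embedding size, same frequency cut-off as in 2.1\n",
+    "model_sg = Word2Vec(sentences, sg=1, size=200, window=10, min_count=50, iter=1)\n",
+    "\n",
+    "# compare the nearest neighbours of a few words across the two models\n",
+    "for word in ['citron', 'manger']:\n",
+    "    print(word)\n",
+    "    print('  previous model:', [w for w, _ in model.wv.most_similar(word, topn=5)])\n",
+    "    print('  skipgram model:', [w for w, _ in model_sg.wv.most_similar(word, topn=5)])"
+   ],
+   "metadata": {
+    "id": "compareParamsCode"
+   },
+   "id": "compareParamsCode",
+   "execution_count": null,
+   "outputs": []
+  },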
+  {
+   "cell_type": "markdown",
+   "id": "least-justice",
+   "metadata": {
+    "id": "least-justice"
+   },
+   "source": [
+    "## 5. Analogical reasoning\n",
+    "\n",
+    "As we saw in class, word embeddings allow us to do analogical reasoning using vector addition and subtraction. *gensim* offers the possibility to do so.\n",
+    "\n",
+    "▶▶ **Try to perform analogical reasoning in the food realm, e.g. fourchette - légume + soupe = ?**\n",
+    "\n",
+    "Hint: the function *most_similar()* takes the arguments *positive* and *negative*.\n",
+    "\n",
+    "▶▶ **Try the same using the function most_similar_cosmul()** (which performs a similar computation but uses multiplication and division instead), and see what works best.\n",
+    "\n",
+    "See: https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.most_similar.html\n",
+    "\n",
+    "https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.most_similar_cosmul.html"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "removable-validity",
+   "metadata": {
+    "id": "removable-validity"
+   },
+   "outputs": [],
+   "source": [
+    "#model.most_similar(positive=[\"fourchette\",\"soupe\"], negative=[\"légume\"])\n",
+    "model.most_similar_cosmul(positive=[\"fourchette\",\"soupe\"], negative=[\"légume\"])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## 6. Visualize word embeddings\n",
+    "\n",
+    "After you learn word embeddings for your text data, it can be nice to explore them with visualization.\n",
+    "\n",
+    "You can use classical projection methods to reduce the high-dimensional word vectors to two-dimensional plots and plot them on a graph.\n",
+    "\n",
+    "The visualizations can provide a qualitative diagnostic for your learned model.\n",
+    "\n",
+    "We can retrieve all of the vectors from a trained model as follows:\n",
+    "```\n",
+    "X = model[model.wv.vocab]\n",
+    "```\n",
+    "(In the code below we only keep the first 1,000 words of the vocabulary, to keep things fast and readable.)\n",
+    "\n",
+    "We can then train a projection method on the vectors, such as those methods offered in scikit-learn, then use matplotlib to plot the projection as a scatter plot.\n",
+    "\n",
+    "Let’s look at an example with Principal Component Analysis or PCA."
+   ],
+   "metadata": {
+    "id": "NrSg7OJhKP5_"
+   },
+   "id": "NrSg7OJhKP5_"
+  },
+  {
+   "cell_type": "code",
+   "source": [
+    "X = model[list(model.wv.vocab)[:1000]]\n",
+    "print(X.shape)"
+   ],
+   "metadata": {
+    "id": "OGnQezWkGwnt"
+   },
+   "id": "OGnQezWkGwnt",
+   "execution_count": null,
+   "outputs": []
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "### 6.1 Using PCA\n",
+    "\n",
+    "We can create a 2-dimensional PCA model of the word vectors using the scikit-learn PCA class as follows."
+   ],
+   "metadata": {
+    "id": "V3GEpKVfKxMe"
+   },
+   "id": "V3GEpKVfKxMe"
+  },
+  {
+   "cell_type": "code",
+   "source": [
+    "from sklearn.decomposition import PCA\n",
+    "\n",
+    "pca = PCA(n_components=2)\n",
+    "result = pca.fit_transform(X)"
+   ],
+   "metadata": {
+    "id": "TDPHq7XVGwqb"
+   },
+   "id": "TDPHq7XVGwqb",
+   "execution_count": null,
+   "outputs": []
+  },
+  {
+   "cell_type": "code",
+   "source": [
+    "from matplotlib import pyplot\n",
+    "\n",
+    "pyplot.scatter(result[:, 0], result[:, 1])"
+   ],
+   "metadata": {
+    "id": "TPxqh9N0Gwta"
+   },
+   "id": "TPxqh9N0Gwta",
+   "execution_count": null,
+   "outputs": []
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "We can go one step further and annotate the points on the graph with the words themselves. A crude version without any nice offsets looks as follows."
+   ],
+   "metadata": {
+    "id": "MUW6T3b4LPE7"
+   },
+   "id": "MUW6T3b4LPE7"
+  },
+  {
+   "cell_type": "code",
+   "source": [
+    "pyplot.scatter(result[:, 0], result[:, 1])\n",
+    "words = list(model.wv.vocab)[:1000]\n",
+    "for i, word in enumerate(words):\n",
+    "\tpyplot.annotate(word, xy=(result[i, 0], result[i, 1]))\n",
+    "#pyplot.show()\n",
+    "pyplot.savefig('plot_w2v.png')"
+   ],
+   "metadata": {
+    "id": "87nkrdIsGwuv"
+   },
+   "id": "87nkrdIsGwuv",
+   "execution_count": null,
+   "outputs": []
+  },
+  {
+   "cell_type": "code",
+   "source": [
+    "# same plot, but annotating only the first 100 words for readability\n",
+    "pyplot.scatter(result[:, 0], result[:, 1])\n",
+    "words = list(model.wv.vocab)[:100]\n",
+    "for i, word in enumerate(words):\n",
+    "\tpyplot.annotate(word, xy=(result[i, 0], result[i, 1]))\n",
+    "#pyplot.show()\n",
+    "pyplot.savefig('plot_w2v_100.png')"
+   ],
+   "metadata": {
+    "id": "E79KLz3ARN14"
+   },
+   "id": "E79KLz3ARN14",
+   "execution_count": null,
+   "outputs": []
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "### 6.2 Using the TensorFlow projector\n",
+    "\n",
+    "As we saw during the course, TensorFlow provides a tool to visualize word embeddings. We need to provide:\n",
+    "* A TSV file with the vectors\n",
+    "* Another TSV file with the words\n",
+    "\n",
+    "The following code writes these files from the model.\n",
+    "\n",
+    "It comes from the source code of the script: https://radimrehurek.com/gensim/scripts/word2vec2tensor.html\n",
+    "\n",
+    "▶▶ **Run the following code and then load the files within the TensorFlow projector. Look e.g. for 'citron', 'manger', 'pain'..., check their neighbors (with PCA and/or T-SNE).**\n",
+    "\n",
+    "https://projector.tensorflow.org/"
+   ],
+   "metadata": {
+    "id": "fCr9WTKLVRLM"
+   },
+   "id": "fCr9WTKLVRLM"
+  },
+  {
+   "cell_type": "code",
+   "source": [
+    "#model = gensim.models.Word2Vec.load_word2vec_format(model_path, binary=True)\n",
+    "tensorsfp = \"model_word2vec_food_tensor.tsv\"\n",
+    "metadatafp = \"metadata_word2vec_food_tensor.tsv\"\n",
+    "with open(tensorsfp, 'w+') as tensors:\n",
+    "    with open(metadatafp, 'w+') as metadata:\n",
+    "        for word in model.wv.index2word:\n",
+    "            metadata.write(word + '\\n')\n",
+    "            vector_row = '\\t'.join(map(str, model[word]))\n",
+    "            tensors.write(vector_row + '\\n')"
+   ],
+   "metadata": {
+    "id": "I40CgQgJTSKd"
+   },
+   "id": "I40CgQgJTSKd",
+   "execution_count": null,
+   "outputs": []
+  },
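+  {
+   "cell_type": "markdown",
+   "source": [
+    "If you prefer to stay inside the notebook, the cell below is a rough local alternative to the projector (a sketch only, assuming scikit-learn and matplotlib are available as in the PCA example): it projects a small subset of the vectors with t-SNE and plots them."
+   ],
+   "metadata": {
+    "id": "tsneSketchMd"
+   },
+   "id": "tsneSketchMd"
+  },
+  {
+   "cell_type": "code",
+   "source": [
+    "# Sketch: local t-SNE projection of the first 300 word vectors\n",
+    "from sklearn.manifold import TSNE\n",
+    "from matplotlib import pyplot\n",
+    "\n",
+    "words = list(model.wv.vocab)[:300]\n",
+    "X_sub = model[words]\n",
+    "result_tsne = TSNE(n_components=2, random_state=0).fit_transform(X_sub)\n",
+    "\n",
+    "pyplot.figure(figsize=(12, 12))\n",
+    "pyplot.scatter(result_tsne[:, 0], result_tsne[:, 1])\n",
+    "for i, word in enumerate(words):\n",
+    "    pyplot.annotate(word, xy=(result_tsne[i, 0], result_tsne[i, 1]))\n",
+    "pyplot.savefig('plot_w2v_tsne.png')"
+   ],
+   "metadata": {
+    "id": "tsneSketchCode"
+   },
+   "id": "tsneSketchCode",
+   "execution_count": null,
+   "outputs": []
+  },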
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## 7. Embeddings with multiword ngrams\n",
+    "\n",
+    "There is a *gensim.models.phrases* module which lets you automatically detect phrases longer than one word, using collocation statistics. Using phrases, you can learn a word2vec model where “words” are actually multiword expressions, such as new_york_times or financial_crisis:\n",
+    "\n"
+   ],
+   "metadata": {
+    "id": "wY8HkerUhbc7"
+   },
+   "id": "wY8HkerUhbc7"
+  },
+  {
+   "cell_type": "code",
+   "source": [
+    "from gensim.models import Phrases\n",
+    "\n",
+    "# Train a bigram detector.\n",
+    "bigram_transformer = Phrases(sentences)\n",
+    "\n",
+    "# Apply the trained MWE detector to a corpus, using the result to train a Word2vec model.\n",
+    "model = Word2Vec(bigram_transformer[sentences], min_count=10, iter=1)"
+   ],
+   "metadata": {
+    "id": "RaYOSCa-TSUy"
+   },
+   "id": "RaYOSCa-TSUy",
+   "execution_count": null,
+   "outputs": []
+  },
+  {
+   "cell_type": "code",
+   "source": [
+    "print(list(model.wv.vocab))"
+   ],
+   "metadata": {
+    "id": "DOjkiWcUhwmH"
+   },
+   "id": "DOjkiWcUhwmH",
+   "execution_count": null,
+   "outputs": []
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "If you’re finished training a model (i.e. no more updates, only querying), you can switch to the KeyedVectors instance:\n",
+    "\n"
+   ],
+   "metadata": {
+    "id": "gbdjeIUlhUvD"
+   },
+   "id": "gbdjeIUlhUvD"
+  },
+  {
+   "cell_type": "code",
+   "source": [
+    "word_vectors = model.wv\n",
+    "del model"
+   ],
+   "metadata": {
+    "id": "foTRPTM9TSWc"
+   },
+   "id": "foTRPTM9TSWc",
+   "execution_count": null,
+   "outputs": []
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## 8. Other algorithms\n",
+    "\n",
+    "There is probably not enough time left, but note that you can also build embeddings with Gensim using the FastText algorithm. Doc2vec is also available.\n",
+    "\n",
+    "https://radimrehurek.com/gensim/apiref.html\n",
+    "\n"
+   ],
+   "metadata": {
+    "id": "mQDB2hnXxP4Z"
+   },
+   "id": "mQDB2hnXxP4Z"
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.9"
+  },
+  "colab": {
+   "provenance": []
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
\ No newline at end of file
diff --git a/slides/Master LiTL_2324_Course3_1201023.pdf b/slides/Master LiTL_2324_Course3_1201023.pdf
new file mode 100644
index 0000000000000000000000000000000000000000..cc70469fa28db27d5d8a8a6bea74bddcdf7c6b24
Binary files /dev/null and b/slides/Master LiTL_2324_Course3_1201023.pdf differ