diff --git a/notebooks/TP3_m2LiTL_WordEmbeddings_SUJET_2425.ipynb b/notebooks/TP3_m2LiTL_WordEmbeddings_SUJET_2425.ipynb
new file mode 100644
index 0000000000000000000000000000000000000000..e2cf62c4e36c1c5e94c31b42831da49e101015cb
--- /dev/null
+++ b/notebooks/TP3_m2LiTL_WordEmbeddings_SUJET_2425.ipynb
@@ -0,0 +1,904 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "fleet-worry",
+ "metadata": {
+ "id": "fleet-worry"
+ },
+ "source": [
+ "# TP3: Word embeddings\n",
+ "Master LiTL\n",
+ "\n",
+ "\n",
+ "In this practical session, we will explore the generation of word embeddings.\n",
+ "\n",
+ "We will make use of *gensim* for generating word embeddings.\n",
+ "If you want to use your own computer, you will need to make sure it is installed (e.g. using the command ```pip```).\n",
+ "If you’re using Anaconda/Miniconda, you can use the command ```conda install <modulename>```.\n",
+ "\n",
+ "Sources:\n",
+ "- Practical from T. van de Cruys\n",
+ "- https://machinelearningmastery.com/develop-word-embeddings-python-gensim/\n",
+ "- https://radimrehurek.com/gensim/models/word2vec.html\n",
+ "- https://www.shanelynn.ie/word-embeddings-in-python-with-spacy-and-gensim/: see an example based on the 20NewsGroup corpus\n",
+ "- http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/\n",
+ "- (not used but seems interesting: https://www.machinelearningplus.com/nlp/gensim-tutorial/#14howtotrainword2vecmodelusinggensim)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "abandoned-basketball",
+ "metadata": {
+ "id": "abandoned-basketball"
+ },
+ "source": [
+ "## 1- Look at the data\n",
+ "\n",
+ "Upload the data: *corpus_food.txt.gz*.\n",
+ "The data come from blogs on cooking.\n",
+ "\n",
+ "You can take a look at your data using a terminal and the following commands:\n",
+ "\n",
+ "* Number of lines:\n",
+ "```\n",
+ "$ wc -l corpus_food.txt\n",
+ "$ 1161270 corpus_food.txt\n",
+ "```\n",
+ "\n",
+ "* first ten lines:\n",
+ "```\n",
+ "$ head -n 10 corpus_food.txt\n",
+ "$ -mention meilleur espoir féminin : on aurait pu ajouter ioudgine .\n",
+ "malheureusement , comme presque tout ce qui est bon , c' est bourré de beurre et de sucre .\n",
+ "j' avais déjà façonné une recette allégée pour weight watchers mais elle contenait encore du beurre et un peu de sucre .\n",
+ "aujourd' hui je vous propose cette recette que j' ai improvisée hier soir , sans beurre et sans sucre .\n",
+ "n' empêche que pour acheter sa propre baguette magique ou pour déguster des bières au beurre , on pourrait partir au bout du monde !\n",
+ "menthe , sucre de canne , rhum , citron vert , sont vos meilleurs amis en soirée ?\n",
+ "parfois , on rêve d' un bon verre de vin .\n",
+ "la marque de biscuits oreo a pensé aux gourmandes et aux gourmands , et s' apprête à lancer des gâteaux dotés de nouvelles saveurs : caramel beurre salé et noix de coco .\n",
+ "rangez les parapluies , et sortez le sel et le citron !\n",
+ "le vin on adore le savourer avec modération .\n",
+ "```\n",
+ "\n",
+ "The first sentence is odd, but otherwise the beginning of the corpus comes from: http://www.leblogdelaura.com/2017/03/pancakes-sans-sucre-et-sans-graisses.html"
+ ]
+ },
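+ {
+ "cell_type": "markdown",
+ "source": [
+ "If you do not have a terminal at hand (e.g. on Colab), the cell below is a small optional sketch that does roughly the same checks from Python, reading the gzipped file directly (it assumes *corpus_food.txt.gz* has been uploaded next to the notebook).\n"
+ ],
+ "metadata": {
+ "id": "peek-corpus-md"
+ },
+ "id": "peek-corpus-md"
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "import gzip\n",
+ "from itertools import islice\n",
+ "\n",
+ "# print the first ten lines of the gzipped corpus\n",
+ "with gzip.open('corpus_food.txt.gz', 'rt', encoding='utf-8') as f:\n",
+ "    for line in islice(f, 10):\n",
+ "        print(line.rstrip())\n",
+ "\n",
+ "# count the number of lines (sentences)\n",
+ "with gzip.open('corpus_food.txt.gz', 'rt', encoding='utf-8') as f:\n",
+ "    print(sum(1 for _ in f), 'lines')"
+ ],
+ "metadata": {
+ "id": "peek-corpus-code"
+ },
+ "id": "peek-corpus-code",
+ "execution_count": null,
+ "outputs": []
+ },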
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## 2 - Build word embeddings\n",
+ "\n",
+ "We will use *gensim* in order to induce word embeddings from text.\n",
+ "*gensim* is a vector space modeling and topic modeling toolkit for Python, and contains an efficient implementation of the *word2vec* algorithms.\n",
+ "\n",
+ "*word2vec* consists of two different algorithms: *skipgram* (sg) and *continuous-bag-of-words* (cbow).\n",
+ "The underlying prediction task of the former is to estimate the context words from the target word; the prediction task of the latter is to estimate the target word from the sum of the context words.\n",
+ "\n"
+ ],
+ "metadata": {
+ "id": "FJbl3GWfFYms"
+ },
+ "id": "FJbl3GWfFYms"
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### 2.1 - Train a model\n",
+ "▶▶**Run the following code: it will build word embeddings based on the food corpus using the Word2Vec algorithm.**\n",
+ "The model will be saved on your disk."
+ ],
+ "metadata": {
+ "id": "7VGe1L4jJ9ra"
+ },
+ "id": "7VGe1L4jJ9ra"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "impressed-consumer",
+ "metadata": {
+ "id": "impressed-consumer"
+ },
+ "outputs": [],
+ "source": [
+ "# potential Error: need to update smart_open with conda install smart_open==2.0.0 or pip install smart_open==2.0.0\n",
+ "\n",
+ "# construct word2vec model using gensim\n",
+ "\n",
+ "from gensim.models import Word2Vec\n",
+ "\n",
+ "import gzip\n",
+ "import logging\n",
+ "\n",
+ "import time\n",
+ "\n",
+ "# set up logging for gensim\n",
+ "logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',\n",
+ "                    level=logging.INFO)\n",
+ "\n",
+ "# we define a PlainTextCorpus class; this will provide us with an\n",
+ "# iterator over the corpus (so that we don't have to load the corpus\n",
+ "# into memory)\n",
+ "class PlainTextCorpus(object):\n",
+ "    def __init__(self, fileName):\n",
+ "        self.fileName = fileName\n",
+ "\n",
+ "    def __iter__(self):\n",
+ "        for line in gzip.open(self.fileName, 'rt', encoding='utf-8'):\n",
+ "            yield line.split()\n",
+ "\n",
+ "# -- Instantiate the corpus class using corpus location\n",
+ "sentences = PlainTextCorpus('corpus_food.txt.gz')\n",
+ "\n",
+ "# -- Training\n",
+ "# we only take into account words with a frequency of at least 50, and\n",
+ "# we iterate over the corpus only once\n",
+ "model = Word2Vec(sentences, min_count=50, epochs=1, sorted_vocab=1)\n",
+ "\n",
+ "# -- Finally, save the constructed model to disk\n",
+ "# When getting started, you can save the learned model in ASCII format and review the contents.\n",
+ "model.wv.save_word2vec_format('model_word2vec_food.txt', binary=False)\n",
+ "# by default, it is saved as binary\n",
+ "model.save('model_word2vec_food.bin')\n",
+ "# a saved model can be loaded again using:\n",
+ "#model = Word2Vec.load('model_word2vec_food.bin')"
+ ]
+ },
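+ {
+ "cell_type": "markdown",
+ "source": [
+ "Optional sanity check: the sketch below reloads the two files that were just written (using the file names from the cell above) and prints the vocabulary size. This is handy if you restart the notebook and do not want to retrain.\n"
+ ],
+ "metadata": {
+ "id": "reload-model-md"
+ },
+ "id": "reload-model-md"
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from gensim.models import Word2Vec, KeyedVectors\n",
+ "\n",
+ "# reload the full model saved with model.save()\n",
+ "model = Word2Vec.load('model_word2vec_food.bin')\n",
+ "\n",
+ "# or load only the word vectors from the text format\n",
+ "wv_from_text = KeyedVectors.load_word2vec_format('model_word2vec_food.txt', binary=False)\n",
+ "\n",
+ "print(len(model.wv.key_to_index), 'words in the vocabulary')"
+ ],
+ "metadata": {
+ "id": "reload-model-code"
+ },
+ "id": "reload-model-code",
+ "execution_count": null,
+ "outputs": []
+ },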
+ {
+ "cell_type": "markdown",
+ "id": "bottom-tobacco",
+ "metadata": {
+ "id": "bottom-tobacco"
+ },
+ "source": [
+ "### 2.2 A few remarks\n",
+ "\n",
+ "From: http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/\n",
+ "\n",
+ "#### Downsampling\n",
+ "\n",
+ "Subsampling frequent words decreases the number of training examples.\n",
+ "\n",
+ "There are two “problems” with common words like “the”:\n",
+ "* When looking at word pairs, (“fox”, “the”) doesn’t tell us much about the meaning of “fox”: “the” appears in the context of pretty much every word.\n",
+ "* We will have many more samples of (“the”, …) than we need to learn a good vector for “the”.\n",
+ "\n",
+ "Word2Vec implements a “subsampling” scheme to address this.\n",
+ "For each word we encounter in our training text, there is a chance that we will effectively delete it from the text.\n",
+ "The probability that we cut the word is related to the word’s frequency.\n",
+ "\n",
+ "If we have a window size of 10, and we remove a specific instance of “the” from our text:\n",
+ "\n",
+ "* As we train on the remaining words, “the” will not appear in any of their context windows.\n",
+ "* We’ll have 10 fewer training samples where “the” is the input word.\n",
+ "\n",
+ "There is also a parameter in the code named ‘sample’ which controls how much subsampling occurs, and the default value is 0.001. Smaller values of ‘sample’ mean words are less likely to be kept.\n",
+ "\n",
+ "\n",
+ "#### Negative sampling (for SkipGram)\n",
+ "\n",
+ "Training a neural network means taking a training example and adjusting all of the neuron weights slightly so that it predicts that training sample more accurately.\n",
+ "In other words, each training sample would tweak all of the weights in the neural network, which quickly becomes prohibitive.\n",
+ "\n",
+ "Negative sampling addresses this by having each training sample only modify a small percentage of the weights, rather than all of them.\n",
+ "\n",
+ "When training the network on the word pair (“fox”, “quick”), i.e. 'fox' is the target, 'quick' a context word: “quick” -> 1; all of the other thousands of output neurons -> 0.\n",
+ "\n",
+ "With negative sampling, we are instead going to randomly select just a small number of “negative” words (let’s say 5) to update the weights for: “quick” -> 1; 5 other random words -> 0.\n",
+ "\n",
+ "Recall that the output layer of our model has a weight matrix that’s d × |V|, e.g. 300 x 23,000. So we will just be updating the weights for our positive word (“quick”), plus the weights for 5 other words that we want to output 0. That’s a total of 6 output neurons, and 1,800 weight values total. That’s only about 0.03% of the roughly 6.9M weights in the output layer! (In the hidden layer, only the weights for the input word are updated -- this is true whether you’re using Negative Sampling or not)."
+ ]
+ },
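+ {
+ "cell_type": "markdown",
+ "source": [
+ "To make the subsampling step concrete, here is a small illustrative sketch of the keep-probability described in the tutorial linked above. It follows the formula used by the original word2vec C code; gensim's internal computation is slightly different, so treat it as an approximation.\n"
+ ],
+ "metadata": {
+ "id": "subsampling-sketch-md"
+ },
+ "id": "subsampling-sketch-md"
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "import math\n",
+ "\n",
+ "def keep_probability(z, sample=0.001):\n",
+ "    # z = fraction of the corpus covered by the word; sample = the 'sample' parameter\n",
+ "    return (math.sqrt(z / sample) + 1) * (sample / z)\n",
+ "\n",
+ "# a word covering 1% of the corpus keeps only ~42% of its occurrences,\n",
+ "# while a word covering 0.01% gets a value above 1, i.e. it is never dropped\n",
+ "print(keep_probability(0.01))\n",
+ "print(keep_probability(0.0001))"
+ ],
+ "metadata": {
+ "id": "subsampling-sketch-code"
+ },
+ "id": "subsampling-sketch-code",
+ "execution_count": null,
+ "outputs": []
+ },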
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### 2.3 Print information about the model learned\n",
+ "Note that the corpus is food-related, so food-related terms will work best.\n",
+ "\n",
+ "You can print the vocabulary using:\n",
+ "```\n",
+ "vocabulary = model.wv.key_to_index\n",
+ "```\n",
+ "\n",
+ "It is possible to look at an individual word embedding using the following:\n",
+ "```\n",
+ "model.wv['citron']\n",
+ "```\n",
+ "```\n",
+ "print(model.wv['citron'])\n",
+ "```\n",
+ "\n",
+ "Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.\n",
+ "\n",
+ "▶▶ **Print the vocabulary and then the vectors for a few terms, e.g. 'citron' and 'fruit'. Do they seem close?**"
+ ],
+ "metadata": {
+ "id": "ZAgXv40pHQXs"
+ },
+ "id": "ZAgXv40pHQXs"
+ },
+ {
+ "cell_type": "code",
+ "source": [],
+ "metadata": {
+ "id": "IKJG3nNpHQjY"
+ },
+ "id": "IKJG3nNpHQjY",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [],
+ "metadata": {
+ "id": "BEbhGilgJJil"
+ },
+ "id": "BEbhGilgJJil",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [],
+ "metadata": {
+ "id": "U2u5ltKhJN0j"
+ },
+ "id": "U2u5ltKhJN0j",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## 3 - Compute word similarity\n",
+ "\n",
+ "You can now compute a similarity measure between words using\n",
+ "```\n",
+ "model.wv.similarity('manger', 'goûter')\n",
+ "```\n",
+ "\n",
+ "You can also print the most similar words (as measured by cosine similarity between the word vectors) by issuing the following command:\n",
+ "```\n",
+ "model.wv.most_similar('citron')\n",
+ "```\n",
+ "▶▶**Print the similarity between some terms, e.g. ('manger', 'boire'), ('manger', 'dormir') ... Do the results seem coherent?**\n",
+ "\n",
+ "▶▶**Print the words that are most similar to: 'citron', 'manger' and other words, e.g. not related to food. Do the results seem coherent?**"
+ ],
+ "metadata": {
+ "id": "eSm6IZ3IFTP9"
+ },
+ "id": "eSm6IZ3IFTP9"
+ },
+ {
+ "cell_type": "code",
+ "source": [],
+ "metadata": {
+ "id": "84TOaPCkOm1z"
+ },
+ "id": "84TOaPCkOm1z",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [],
+ "metadata": {
+ "id": "D-iA3eJsOm4l"
+ },
+ "id": "D-iA3eJsOm4l",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [],
+ "metadata": {
+ "id": "Ux4_ce8hOm9B"
+ },
+ "id": "Ux4_ce8hOm9B",
+ "execution_count": null,
+ "outputs": []
+ },
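+ {
+ "cell_type": "markdown",
+ "source": [
+ "A practical remark on the queries above: only words that reached the *min_count* threshold (50 here) are in the vocabulary, so querying a rare word raises a KeyError. The small sketch below (the helper name *safe_most_similar* is ours, not part of gensim) checks membership first.\n"
+ ],
+ "metadata": {
+ "id": "safe-query-md"
+ },
+ "id": "safe-query-md"
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "def safe_most_similar(model, word, topn=10):\n",
+ "    # words rarer than min_count are absent from the vocabulary\n",
+ "    if word in model.wv.key_to_index:\n",
+ "        return model.wv.most_similar(word, topn=topn)\n",
+ "    print(word, 'is not in the vocabulary')\n",
+ "    return []\n",
+ "\n",
+ "safe_most_similar(model, 'citron')"
+ ],
+ "metadata": {
+ "id": "safe-query-code"
+ },
+ "id": "safe-query-code",
+ "execution_count": null,
+ "outputs": []
+ },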
+ {
+ "cell_type": "markdown",
+ "id": "bigger-drink",
+ "metadata": {
+ "id": "bigger-drink"
+ },
+ "source": [
+ "## 4 - Exercise: change the parameter values\n",
+ "\n",
+ "By default, the *word2vec* module creates **word embeddings of size 100**, using a **cbow model** with a **window of 5 words**.\n",
+ "\n",
+ "▶▶**Train a model with different parameters:**\n",
+ "- using a different window size,\n",
+ "- using a different embedding size,\n",
+ "- using *skipgram*.\n",
+ "\n",
+ "Inspect the results (similar words) qualitatively. Do the similarity computations change? Are they better or worse?\n",
+ "\n",
+ "See doc: https://radimrehurek.com/gensim_3.8.3/models/word2vec.html"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from gensim.models import Word2Vec\n",
+ "import gzip\n",
+ "\n",
+ "# we define a PlainTextCorpus class; this will provide us with an\n",
+ "# iterator over the corpus (so that we don't have to load the corpus\n",
+ "# into memory)\n",
+ "class PlainTextCorpus(object):\n",
+ "    def __init__(self, fileName):\n",
+ "        self.fileName = fileName\n",
+ "\n",
+ "    def __iter__(self):\n",
+ "        for line in gzip.open(self.fileName, 'rt', encoding='utf-8'):\n",
+ "            yield line.split()\n",
+ "\n",
+ "# -- Instantiate the corpus class using corpus location\n",
+ "sentences = PlainTextCorpus('corpus_food.txt.gz')"
+ ],
+ "metadata": {
+ "id": "t29EU--0Txre"
+ },
+ "id": "t29EU--0Txre",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [],
+ "metadata": {
+ "id": "GOTCCxFKO6pk"
+ },
+ "id": "GOTCCxFKO6pk",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [],
+ "metadata": {
+ "id": "S6gLHxxgO6so"
+ },
+ "id": "S6gLHxxgO6so",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [],
+ "metadata": {
+ "id": "7dm3n7p2O6vu"
+ },
+ "id": "7dm3n7p2O6vu",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [],
+ "metadata": {
+ "id": "bJYQRcMZO6yl"
+ },
+ "id": "bJYQRcMZO6yl",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "id": "satellite-colombia",
+ "metadata": {
+ "id": "satellite-colombia"
+ },
+ "source": [
+ "#### According to Mikolov\n",
+ "\n",
+ "**Skip-gram**: works well with a small amount of training data, and represents even rare words or phrases well.\n",
+ "\n",
+ "**CBOW**: several times faster to train than the skip-gram, with slightly better accuracy for the frequent words."
+ ]
+ },
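+ {
+ "cell_type": "markdown",
+ "source": [
+ "For reference, the cell below shows one possible parameterization for the exercise above (a sketch: skip-gram with a wider window and larger vectors). The exact values are up to you; try several and compare the neighbours you get.\n"
+ ],
+ "metadata": {
+ "id": "skipgram-example-md"
+ },
+ "id": "skipgram-example-md"
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# skip-gram (sg=1) instead of cbow, with a larger window and larger vectors\n",
+ "model_sg = Word2Vec(sentences, sg=1, window=10, vector_size=300,\n",
+ "                    min_count=50, epochs=1)\n",
+ "\n",
+ "model_sg.wv.most_similar('citron')"
+ ],
+ "metadata": {
+ "id": "skipgram-example-code"
+ },
+ "id": "skipgram-example-code",
+ "execution_count": null,
+ "outputs": []
+ },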
+ {
+ "cell_type": "markdown",
+ "id": "least-justice",
+ "metadata": {
+ "id": "least-justice"
+ },
+ "source": [
+ "## 5 - Analogical reasoning\n",
+ "\n",
+ "As we saw in class, word embeddings allow us to do analogical reasoning using vector addition and subtraction. *gensim* offers the possibility to do so.\n",
+ "\n",
+ "▶▶ **Try to perform analogical reasoning in the food realm, e.g. fourchette - légume + soupe = ?**\n",
+ "\n",
+ "Hint: the function *most_similar()* takes the arguments *positive* and *negative*.\n",
+ "\n",
+ "▶▶ **Try the same using the function most_similar_cosmul()** (which performs a similar computation but uses multiplication and division instead), and see which works best.\n",
+ "\n",
+ "See: https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.most_similar.html\n",
+ "\n",
+ "https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.most_similar_cosmul.html"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "removable-validity",
+ "metadata": {
+ "id": "removable-validity"
+ },
+ "outputs": [],
+ "source": [
+ "model.wv.most_similar(positive=[\"fourchette\", \"soupe\"], negative=[\"légume\"])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [],
+ "metadata": {
+ "id": "D2MrhPfHGwla"
+ },
+ "id": "D2MrhPfHGwla",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## 6 - Visualize word embeddings\n",
+ "\n",
+ "After you have learned word embeddings for your text data, it can be nice to explore them with a visualization.\n",
+ "\n",
+ "You can use classical projection methods to reduce the high-dimensional word vectors to two dimensions and plot them on a graph.\n",
+ "\n",
+ "The visualizations can provide a qualitative diagnostic for your learned model.\n",
+ "\n",
+ "We can retrieve the (length-normalized) vectors from a trained model as follows, keeping only the first 1000 words here so that the plot stays readable:\n",
+ "```\n",
+ "X = model.wv.get_normed_vectors()[:1000]\n",
+ "```\n",
+ "\n",
+ "We can then train a projection method on the vectors, such as those offered in scikit-learn, and use matplotlib to plot the projection as a scatter plot.\n",
+ "\n",
+ "Let’s look at an example with Principal Component Analysis or PCA."
+ ],
+ "metadata": {
+ "id": "NrSg7OJhKP5_"
+ },
+ "id": "NrSg7OJhKP5_"
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "X = model.wv.get_normed_vectors()[:1000]\n",
+ "print(X.shape)"
+ ],
+ "metadata": {
+ "id": "OGnQezWkGwnt"
+ },
+ "id": "OGnQezWkGwnt",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### 6.1 - Using PCA\n",
+ "\n",
+ "We can create a 2-dimensional PCA projection of the word vectors using the scikit-learn PCA class as follows."
+ ],
+ "metadata": {
+ "id": "V3GEpKVfKxMe"
+ },
+ "id": "V3GEpKVfKxMe"
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from sklearn.decomposition import PCA\n",
+ "\n",
+ "pca = PCA(n_components=2)\n",
+ "result = pca.fit_transform(X)"
+ ],
+ "metadata": {
+ "id": "TDPHq7XVGwqb"
+ },
+ "id": "TDPHq7XVGwqb",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from matplotlib import pyplot\n",
+ "\n",
+ "pyplot.scatter(result[:, 0], result[:, 1])"
+ ],
+ "metadata": {
+ "id": "TPxqh9N0Gwta"
+ },
+ "id": "TPxqh9N0Gwta",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We can go one step further and annotate the points on the graph with the words themselves. A crude version without any nice offsets looks as follows."
+ ],
+ "metadata": {
+ "id": "MUW6T3b4LPE7"
+ },
+ "id": "MUW6T3b4LPE7"
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "words = [ w for w in model.wv.key_to_index.keys() ][:1000]"
+ ],
+ "metadata": {
+ "id": "EpgU1Is9IMyg"
+ },
+ "id": "EpgU1Is9IMyg",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "for i, word in enumerate(words):\n",
+ "\tpyplot.annotate(word, xy=(result[i, 0], result[i, 1]))\n",
+ "#pyplot.show()\n",
+ "pyplot.savefig('plot_w2v.png')"
+ ],
+ "metadata": {
+ "id": "87nkrdIsGwuv"
+ },
+ "id": "87nkrdIsGwuv",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [],
+ "metadata": {
+ "id": "CvTlbB5eIuGM"
+ },
+ "id": "CvTlbB5eIuGM"
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "pyplot.scatter(result[:, 0], result[:, 1])\n",
+ "\n",
+ "for i, word in enumerate(words):\n",
+ "\tpyplot.annotate(word, xy=(result[i, 0], result[i, 1]))\n",
+ "#pyplot.show()\n",
+ "pyplot.savefig('plot_w2v.png')"
+ ],
+ "metadata": {
+ "id": "E79KLz3ARN14"
+ },
+ "id": "E79KLz3ARN14",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [],
+ "metadata": {
+ "id": "sATOlUC0TSH7"
+ },
+ "id": "sATOlUC0TSH7",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### 6.2 - Using TensorFlow projector (at home)\n",
+ "\n",
+ "As we saw during the course, TensorFlow provides a tool to visualize word embeddings. We need to provide:\n",
+ "* A TSV file with the vectors\n",
+ "* Another TSV file with the words\n",
+ "\n",
+ "The following code writes these two files from the model.\n",
+ "\n",
+ "It comes from the source code of the script: https://radimrehurek.com/gensim/scripts/word2vec2tensor.html\n",
+ "\n",
+ "▶▶ **Run the following code and then load the files within the TensorFlow projector. Look e.g. for 'citron', 'manger', 'pain'..., and check their neighbors (with PCA and/or t-SNE).**\n",
+ "\n",
+ "https://projector.tensorflow.org/"
+ ],
+ "metadata": {
+ "id": "fCr9WTKLVRLM"
+ },
+ "id": "fCr9WTKLVRLM"
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "#model = gensim.models.Word2Vec.load_word2vec_format(model_path, binary=True)\n",
+ "tensorsfp = \"model_word2vec_food_tensor.tsv\"\n",
+ "metadatafp = \"metadata_word2vec_food_tensor.tsv\"\n",
+ "with open(tensorsfp, 'w+') as tensors:\n",
+ "    with open(metadatafp, 'w+') as metadata:\n",
+ "        for word in model.wv.index_to_key:\n",
+ "            metadata.write(word + '\\n')\n",
+ "            vector_row = '\\t'.join(map(str, model.wv[word]))\n",
+ "            tensors.write(vector_row + '\\n')"
+ ],
+ "metadata": {
+ "id": "I40CgQgJTSKd"
+ },
+ "id": "I40CgQgJTSKd",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [],
+ "metadata": {
+ "id": "oG_qFvHJJvCU"
+ },
+ "id": "oG_qFvHJJvCU"
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## 7 - Embedding sentences\n",
+ "\n",
+ "Now that we have vectors to represent words, how can we represent sentences / sequences of words?"
+ ],
+ "metadata": {
+ "id": "UW-h-NioJvTc"
+ },
+ "id": "UW-h-NioJvTc"
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "import numpy as np"
+ ],
+ "metadata": {
+ "id": "pbzTLFKpI9JO"
+ },
+ "execution_count": null,
+ "outputs": [],
+ "id": "pbzTLFKpI9JO"
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Below is a set of sentences; some have similar meanings: they are paraphrases.\n",
+ "We want to compute a representation for each sentence such that similar sentences have similar vectors."
+ ],
+ "metadata": {
+ "id": "9fui4z4aSPPS"
+ },
+ "id": "9fui4z4aSPPS"
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "sentence1 = \"My disease was cured with a lot of medication\"\n",
+ "sentence2 = \"I was treated with medicaments\"\n",
+ "sentence3 = 'The sun is shining today'"
+ ],
+ "metadata": {
+ "id": "ZlVT2QGFJvTe"
+ },
+ "execution_count": null,
+ "outputs": [],
+ "id": "ZlVT2QGFJvTe"
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "▶▶ **Write the code to encode a sentence as the average vector over the word vectors.**"
+ ],
+ "metadata": {
+ "id": "0zyDNnh-Shzz"
+ },
+ "id": "0zyDNnh-Shzz"
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# **TODO** exercise\n",
+ "def encode(sentence, model):\n",
+ "    # ... average the vectors of the words of the sentence here\n",
+ "    return vector"
+ ],
+ "metadata": {
+ "id": "9HqunfslIiIe"
+ },
+ "execution_count": null,
+ "outputs": [],
+ "id": "9HqunfslIiIe"
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Run the code below to compute vectors for the sentences above and **check that you found the right shape for the sentence vectors**."
+ ],
+ "metadata": {
+ "id": "azVNk7-ZS9O0"
+ },
+ "id": "azVNk7-ZS9O0"
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "v1 = encode(sentence1,model)\n",
+ "v2 = encode(sentence2,model)\n",
+ "v3 = encode(sentence3,model)\n",
+ "v1.shape"
+ ],
+ "metadata": {
+ "id": "xJJKlJkAI1r6"
+ },
+ "execution_count": null,
+ "outputs": [],
+ "id": "xJJKlJkAI1r6"
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "The code below can be used to compute the cosine similarity between two sentence vectors:\n",
+ "\n"
+ ],
+ "metadata": {
+ "id": "-QSJXdi5JvTg"
+ },
+ "id": "-QSJXdi5JvTg"
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "def cos(v1,v2):\n",
+ "    return np.dot(v1,v2)/(np.linalg.norm(v1)*np.linalg.norm(v2))"
+ ],
+ "metadata": {
+ "id": "oTWuS_oBI9u0"
+ },
+ "execution_count": null,
+ "outputs": [],
+ "id": "oTWuS_oBI9u0"
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "▶▶ **Compute the similarity between the sentences given above. Are the results coherent?**"
+ ],
+ "metadata": {
+ "id": "HwA9JhaxZW8d"
+ },
+ "id": "HwA9JhaxZW8d"
+ },
+ {
+ "cell_type": "code",
+ "source": [],
+ "metadata": {
+ "id": "bXE1ZLIWPrau"
+ },
+ "id": "bXE1ZLIWPrau",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## 8. Other algorithms\n",
+ "\n",
+ "We do not have enough time to cover them here, but note that you can also build embeddings with the FastText algorithm in Gensim. Doc2vec is also available.\n",
+ "\n",
+ "https://radimrehurek.com/gensim/apiref.html\n",
+ "\n"
+ ],
+ "metadata": {
+ "id": "mQDB2hnXxP4Z"
+ },
+ "id": "mQDB2hnXxP4Z"
+ },
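+ {
+ "cell_type": "markdown",
+ "source": [
+ "As a quick illustration of the previous point, here is a minimal FastText sketch. It reuses the *sentences* corpus iterator defined above, and the parameter values are just an example.\n"
+ ],
+ "metadata": {
+ "id": "fasttext-example-md"
+ },
+ "id": "fasttext-example-md"
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from gensim.models import FastText\n",
+ "\n",
+ "# FastText also learns character n-gram vectors, so it can build a vector\n",
+ "# even for a word that never occurred in the training corpus\n",
+ "ft_model = FastText(sentences, vector_size=100, min_count=50, epochs=1)\n",
+ "\n",
+ "ft_model.wv.most_similar('citron')"
+ ],
+ "metadata": {
+ "id": "fasttext-example-code"
+ },
+ "id": "fasttext-example-code",
+ "execution_count": null,
+ "outputs": []
+ },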
+ {
+ "cell_type": "markdown",
+ "source": [
+ "#### Embeddings with multiword ngrams\n",
+ "\n",
+ "There is a *gensim.models.phrases* module which lets you automatically detect phrases longer than one word, using collocation statistics. Using phrases, you can learn a word2vec model where “words” are actually multiword expressions, such as new_york_times or financial_crisis:\n",
+ "\n"
+ ],
+ "metadata": {
+ "id": "wY8HkerUhbc7"
+ },
+ "id": "wY8HkerUhbc7"
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from gensim.models import Phrases\n",
+ "\n",
+ "# Train a bigram detector.\n",
+ "bigram_transformer = Phrases(sentences)\n",
+ "\n",
+ "# Apply the trained MWE detector to the corpus, and use the result to train a Word2Vec model.\n",
+ "model = Word2Vec(bigram_transformer[sentences], min_count=10, epochs=1)"
+ ],
+ "metadata": {
+ "id": "RaYOSCa-TSUy"
+ },
+ "id": "RaYOSCa-TSUy",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "print(list(model.wv.key_to_index))"
+ ],
+ "metadata": {
+ "id": "DOjkiWcUhwmH"
+ },
+ "id": "DOjkiWcUhwmH",
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "If you’re finished training a model (i.e. no more updates, only querying), you can switch to the KeyedVectors instance:\n",
+ "\n"
+ ],
+ "metadata": {
+ "id": "gbdjeIUlhUvD"
+ },
+ "id": "gbdjeIUlhUvD"
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "word_vectors = model.wv\n",
+ "del model"
+ ],
+ "metadata": {
+ "id": "Z9hR0n7JPlem"
+ },
+ "id": "Z9hR0n7JPlem",
+ "execution_count": null,
+ "outputs": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.9"
+ },
+ "colab": {
+ "provenance": [],
+ "collapsed_sections": [
+ "satellite-colombia"
+ ]
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
\ No newline at end of file
diff --git a/projets_etudiants_2425/[NN-LiTL]Projects-NeuralMethodsNLP_2425.pdf b/projets_etudiants_2425/[NN-LiTL]Projects-NeuralMethodsNLP_2425.pdf
new file mode 100644
index 0000000000000000000000000000000000000000..77f1d267287082fef7a1931a10c45834d7b81202
Binary files /dev/null and b/projets_etudiants_2425/[NN-LiTL]Projects-NeuralMethodsNLP_2425.pdf differ