Jack Dry

# Using Doc2Vec to classify movie reviews


In this tutorial, I explain how to get a strong result on the IMDB dataset using gensim's implementation of Paragraph Vector, called Doc2Vec. If you don't know what Paragraph Vector is, I recommend reading the [original paper](https://cs.stanford.edu/~quocle/paragraph_vector.pdf), [gensim's documentation](https://radimrehurek.com/gensim/models/doc2vec.html) and [Lau and Baldwin's empirical evaluation](https://www.aclweb.org/anthology/W16-1609). In case you aren't familiar with the IMDB dataset, it contains 25,000 movie reviews for training and 25,000 for testing. Each review is given a label indicating whether it's positive or negative. The dataset is *balanced*, meaning there are equal numbers of positive and negative reviews. The full code for this tutorial is available here: https://github.com/jackdry/using-doc2vec-to-classify-movie-reviews.

---

## 1. Import packages

```python
from tensorflow.contrib.tensorboard.plugins import projector
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.models.callbacks import CallbackAny2Vec
import tensorflow as tf
import multiprocessing
import numpy as np
import os
```

---

## 2. Download the IMDB dataset

The code below downloads the dataset.

```python
imdb = tf.keras.datasets.imdb
(train_reviews, train_labels), (test_reviews, test_labels) = imdb.load_data()
```

`train_reviews` and `test_reviews` are `ndarray` objects, each containing 25,000 lists of integers, where each integer represents a word in a *vocabulary*.

```python
train_reviews[0]
```

```output
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 22665, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 21631, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 31050, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
```

`train_labels` and `test_labels` are also `ndarray` objects. Reviews are assigned the label `1` if they are positive and `0` if they are negative.

```python
train_labels[0]
```

```output
1
```

---

## 3. Create a function to decode reviews

To learn embeddings for the reviews using `Doc2Vec`, we must first convert each one from a list of integers to a list of words. To do this, we need to obtain the vocabulary that was used to encode the reviews (using [this section](https://www.tensorflow.org/tutorials/keras/basic_text_classification#convert_the_integers_back_to_words) of TensorFlow's IMDB tutorial as a guide).

```python
vocab = imdb.get_word_index()
vocab = {k: (v + 3) for k, v in vocab.items()}
vocab["<PAD>"] = 0
vocab["<START>"] = 1
vocab["<UNK>"] = 2
vocab["<UNUSED>"] = 3
```

```python
vocab["brilliant"]
```

```output
530
```
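The `+3` shift accounts for `load_data`'s default `index_from=3`, which reserves the first few indices for special tokens (with `1` marking the start of each review). If you want, here's a small optional sketch, not part of the pipeline itself, that confirms the shifted vocabulary lines up with the encoded reviews:

```python
# Optional sanity check: the +3 shift matches load_data's default index_from=3,
# so the special tokens occupy indices 0-3 and every encoded review starts with
# the <START> token.
assert vocab["<START>"] == 1
assert all(review[0] == vocab["<START>"] for review in train_reviews)
assert vocab["brilliant"] == 530   # 530 appears in train_reviews[0] above
```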
We then *invert* `vocab` so that we can find the word corresponding to a particular integer.

```python
vocab_inv = dict([(value, key) for (key, value) in vocab.items()])
```

```python
vocab_inv[1048]
```

```output
'incredible'
```

Now we can create a function that decodes a review into a list of words.

```python
def decode_review(review):
    return [vocab_inv.get(i, "?") for i in review]
```

```python
decode_review(train_reviews[0])
```

```output
['<START>', 'this', 'film', 'was', 'just', 'brilliant', 'casting', 'location', 'scenery', 'story', 'direction', "everyone's", 'really', 'suited', 'the', 'part', 'they', 'played', 'and', 'you', 'could', 'just', 'imagine', 'being', 'there', 'robert', "redford's", 'is', 'an', 'amazing', 'actor', 'and', 'now', 'the', 'same', 'being', 'director', "norman's", 'father', 'came', 'from', 'the', 'same', 'scottish', 'island', 'as', 'myself', 'so', 'i', 'loved', 'the', 'fact', 'there', 'was', 'a', 'real', 'connection', 'with', 'this', 'film', 'the', 'witty', 'remarks', 'throughout', 'the', 'film', 'were', 'great', 'it', 'was', 'just', 'brilliant', 'so', 'much', 'that', 'i', 'bought', 'the', 'film', 'as', 'soon', 'as', 'it', 'was', 'released', 'for', 'retail', 'and', 'would', 'recommend', 'it', 'to', 'everyone', 'to', 'watch', 'and', 'the', 'fly', 'fishing', 'was', 'amazing', 'really', 'cried', 'at', 'the', 'end', 'it', 'was', 'so', 'sad', 'and', 'you', 'know', 'what', 'they', 'say', 'if', 'you', 'cry', 'at', 'a', 'film', 'it', 'must', 'have', 'been', 'good', 'and', 'this', 'definitely', 'was', 'also', 'congratulations', 'to', 'the', 'two', 'little', "boy's", 'that', 'played', 'the', "part's", 'of', 'norman', 'and', 'paul', 'they', 'were', 'just', 'brilliant', 'children', 'are', 'often', 'left', 'out', 'of', 'the', 'praising', 'list', 'i', 'think', 'because', 'the', 'stars', 'that', 'play', 'them', 'all', 'grown', 'up', 'are', 'such', 'a', 'big', 'profile', 'for', 'the', 'whole', 'film', 'but', 'these', 'children', 'are', 'amazing', 'and', 'should', 'be', 'praised', 'for', 'what', 'they', 'have', 'done', "don't", 'you', 'think', 'the', 'whole', 'story', 'was', 'so', 'lovely', 'because', 'it', 'was', 'true', 'and', 'was', "someone's", 'life', 'after', 'all', 'that', 'was', 'shared', 'with', 'us', 'all']
```

---

## 4. Learn embeddings for reviews

Before training our `Doc2Vec` model, we need to create a list of `TaggedDocument` objects from the train and test reviews, as well as a callback so that we can monitor training progress.

```python
reviews = np.concatenate((train_reviews, test_reviews))
docs = [TaggedDocument(decode_review(review), [i]) for i, review in enumerate(reviews)]
```

```python
class Doc2VecCallback(CallbackAny2Vec):
    def __init__(self, epochs):
        self.prog_bar = tf.keras.utils.Progbar(epochs)
        self.epoch = 0

    def on_epoch_end(self, model):
        self.epoch += 1
        self.prog_bar.update(self.epoch)
```

I encourage you to experiment with the arguments you pass to `Doc2Vec`. I selected the ones below based on [this notebook](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb).

```python
d2v_model = Doc2Vec(docs, dm=0, min_count=2, vector_size=100, hs=0, negative=5,
                    epochs=100, callbacks=[Doc2VecCallback(100)], sample=0,
                    workers=multiprocessing.cpu_count())
```

Training should take roughly 10 minutes, depending on the number of cores your computer has.
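Once training has finished, it's worth a quick sanity check. Below is a small sketch (not part of the pipeline itself, and assuming the pre-4.0 gensim API used throughout this post): re-infer a vector for one of the training documents and confirm that the model ranks that document's own tag among its nearest neighbours.

```python
# Optional sanity check: infer a fresh vector for the first document and look up
# its nearest neighbours among the trained document vectors. Tag 0 should
# usually appear at (or very near) the top. Inference is stochastic, so the
# similarity scores will vary slightly between runs.
inferred = d2v_model.infer_vector(docs[0].words)
print(d2v_model.docvecs.most_similar([inferred], topn=3))
```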
We can then extract the learned embeddings and split them into `train_embdgs` and `test_embdgs`.

```python
embdgs = d2v_model.docvecs.vectors_docs
train_embdgs, test_embdgs = np.split(embdgs, [25000])
```

```python
train_embdgs[0]
```

```output
array([ 0.08079641,  0.20569757,  0.4738193 ,  0.23749965,  0.06664906,
        1.2267363 , -0.70511824,  0.48151103, -0.55024695, -0.14436685,
       -0.23059061,  0.7129091 , -0.60188824,  0.5016063 ,  0.18376477,
       -0.5230938 ,  0.16004896, -0.18659687,  0.8274295 ,  0.04011085,
        0.03508369,  0.29871807,  0.12340536, -0.55743134,  0.06399595,
       -0.5479066 , -0.89346504, -0.615669  , -0.05332805,  0.28452045,
       -0.08361472, -0.82962734,  1.2487692 , -0.8348145 , -1.3827287 ,
       -0.32844827, -0.05866596, -0.20214   ,  0.8929514 , -0.50951415,
       -0.42142662,  0.2502974 , -0.5526857 , -0.01847663, -0.5334354 ,
       -0.44521442,  0.00903169,  0.09517114, -0.06399161,  0.21078157,
       -0.44145957,  0.79780304,  0.708781  ,  0.52510357,  0.6052623 ,
        0.14815222, -0.5089591 ,  0.20163493, -1.6821849 , -0.6525678 ,
       -0.20529775, -0.34921286, -0.91900027, -0.4330489 , -0.20630024,
        0.02228682, -1.0429921 ,  0.07120833,  0.13347925, -0.16419138,
       -0.16784236, -0.55934054, -0.56118524, -0.37115732,  0.04414184,
        0.18220526,  0.4717986 ,  0.01929729, -0.10927698, -0.32006076,
       -0.16162223,  0.6462481 , -0.6281219 , -1.134469  ,  0.10179093,
        0.23171625,  0.01063073, -0.07949349,  0.27011207, -0.43695652,
        0.16555595,  0.40691534,  0.34857702,  0.6036801 , -0.43055603,
        0.393619  ,  0.11630932,  0.7341948 , -0.86189365, -0.06586093],
      dtype=float32)
```

---

## 5. Visualize embeddings

Feel free to skip this section if you'd like. It just describes how to visualize the learned embeddings using [TensorBoard's Embedding Projector](https://projector.tensorflow.org).

First we need to create a directory to keep a [metadata file](https://www.tensorflow.org/guide/embedding#metadata) (which will contain labels for the embeddings) as well as event and checkpoint files.

```python
embdgs_dir = "embdgs"
os.mkdir(embdgs_dir)
metadata_path = os.path.join(embdgs_dir, "metadata.tsv")
embdgs_path = os.path.join(embdgs_dir, "embdgs.ckpt")
```

Next we write the first 30 words of each review to the metadata file, one line per review (since the file has a single column, it doesn't need a header row).

```python
with open(metadata_path, "w", encoding="utf-8") as f:
    for review in reviews:
        excerpt = " ".join(decode_review(review[1:31]))
        f.write(f"{excerpt}\n")
```

We then create an `InteractiveSession` and add a variable to the default graph, which will be initialized to the learned embeddings. We also instantiate `FileWriter` and `Saver` objects.

```python
sess = tf.InteractiveSession()
tf.get_variable("embdgs", initializer=embdgs)
writer = tf.summary.FileWriter(embdgs_dir)
saver = tf.train.Saver()
```

Next we need to set up the `projector`. There isn't much documentation on how to do this, but I found some useful code [here](https://github.com/tensorflow/tensorflow/blob/152d1822f6592c4c0ee0f606b1522c8d03e29339/tensorflow/examples/tutorials/word2vec/word2vec_basic.py#L294).

```python
config = projector.ProjectorConfig()
embdg_conf = config.embeddings.add()
embdg_conf.tensor_name = "embdgs"
embdg_conf.metadata_path = "metadata.tsv"
projector.visualize_embeddings(writer, config)
```

Lastly, we need to initialize the variable, save it, and close the `InteractiveSession`.

```python
sess.run(tf.global_variables_initializer())
saver.save(sess, embdgs_path)
sess.close()
```

Now's the exciting part. Open a terminal in your working directory, run `tensorboard --logdir embdgs`, and copy the printed URL into your browser. If all goes well, you should see something like the image below.
<img src="https://jackdry.s3.us-east-2.amazonaws.com/content/articles/using-doc2vec-to-classify-movie-reviews/projection.png" width="350px">

If you click on one of the points, you'll be able to see its nearest neighbors. This is useful for judging the quality of the embeddings -- reviews with similar meaning should be mapped to nearby positions.

---

## 6. Classify reviews

All that's left is to train a model to predict whether a review is positive or negative *given* its embedding. Needless to say, we only train on the reviews in the training set.

```python
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(1, activation="sigmoid"))
model.compile(optimizer=tf.train.AdamOptimizer(0.01),
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(train_embdgs, train_labels, batch_size=64, epochs=50, shuffle=True)
```

The code below evaluates our model on the reviews in the test set.

```python
model.evaluate(test_embdgs, test_labels)
```

```output
[0.2794467719936371, 0.88748]
```

It achieved an accuracy of 88.7%, which means an error rate of 11.3%. Pretty impressive, especially when compared with many of the baselines in *Table 2* of the [paper introducing Paragraph Vector](https://cs.stanford.edu/~quocle/paragraph_vector.pdf)!
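As a final illustration, here's a rough sketch of how you might score a brand-new review with the two trained models. The review text below is made up, the tokenisation (lowercasing and splitting on whitespace) only approximates the preprocessing behind the IMDB word index, and `infer_vector` is stochastic, so the predicted probability will vary slightly between runs.

```python
# A rough end-to-end sketch (hypothetical review text): embed the review with
# the trained Doc2Vec model, then score the embedding with the Keras classifier.
new_review = "an absolute joy to watch, brilliant casting and a lovely story"
words = new_review.lower().split()
embdg = d2v_model.infer_vector(words)            # stochastic inference
prob = model.predict(np.array([embdg]))[0][0]    # probability the review is positive
print("positive" if prob > 0.5 else "negative", prob)
```

Because inference is stochastic, averaging a few inferred vectors for the same review tends to give more stable predictions.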
