Gensim Word2vec

 

In this tutorial, you will learn how to use the Gensim implementation of Word2Vec in Python and actually get it to work. I've long heard complaints about poor performance, but it really comes down to a combination of two things: (1) your input data and (2) your parameter settings.

Word2Vec is one of the most popular techniques for learning word embeddings with a shallow neural network. The theory is discussed in the paper Efficient Estimation of Word Representations in Vector Space, available as a PDF download. The implementation used here is the one in the Gensim library, which provides the Word2Vec class for working with a Word2Vec model.

Word2Vec uses the tokens of the input documents to build a vocabulary internally, where a vocabulary is simply the set of unique words. Building the vocabulary and training the model takes a single call:

    model = gensim.models.Word2Vec(documents, size=150, window=10, min_count=2, workers=10, iter=10)

In Gensim 4.0 and later, the size and iter parameters are named vector_size and epochs.

Word2Vec models require a lot of text, e.g. the entire Wikipedia corpus. Nevertheless, we will demonstrate the principles using a small in-memory example of text.

This chapter explains how to develop word embeddings in Gensim.

Word embedding is an approach to representing words and documents as dense vectors, in which words with similar meanings have similar representations. Following are some characteristics of word embedding −

  • It is a class of techniques that represent individual words as real-valued vectors in a predefined vector space.

  • The technique is often grouped with deep learning (DL) because every word is mapped to one vector and the vector values are learned in the same way as the weights of a neural network (NN).

  • The key idea of word embedding is a dense distributed representation for every word.
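
The idea that words with similar meanings get similar vectors can be checked numerically. The sketch below assumes cosine similarity as the measure (the text above does not name one) and uses hand-picked 3-dimensional vectors that are invented purely for illustration; real embeddings are learned and typically have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hand-picked toy vectors, invented for this illustration only.
toy_embedding = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

sim_royal = cosine_similarity(toy_embedding["king"], toy_embedding["queen"])
sim_fruit = cosine_similarity(toy_embedding["king"], toy_embedding["apple"])

# The related pair scores close to 1, the unrelated pair much lower.
print(f"king vs queen: {sim_royal:.3f}")
print(f"king vs apple: {sim_fruit:.3f}")
```

A value near 1 means the two vectors point in almost the same direction, which is the sense in which related words have a "similar representation".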

Different Word Embedding Methods/Algorithms

As discussed above, word embedding methods learn a real-valued vector representation from a corpus of text. The learning can either happen jointly with a neural network model on a task such as document classification, or be an unsupervised process that uses document statistics. Here we discuss two methods that can be used to learn a word embedding from text −

Word2Vec by Google

Word2Vec, developed by Tomas Mikolov et al. at Google in 2013, is a statistical method for efficiently learning a word embedding from a text corpus. It was developed to make neural-network-based training of word embeddings more efficient, and it has become the de facto standard for word embedding.

Work on Word2Vec also involves analysing the learned vectors and exploring vector arithmetic on the word representations. The following two learning methods can be used as part of Word2Vec −

  • CBOW (Continuous Bag of Words) Model
  • Continuous Skip-Gram Model
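
The difference between the two models is easiest to see in how training examples are framed. The following is a minimal sketch, not Gensim's implementation: the helper names (context_window, cbow_examples, skipgram_examples) and the example sentence are made up for illustration, and real Word2Vec training adds further steps such as subsampling and negative sampling.

```python
def context_window(tokens, index, window):
    """Words within `window` positions of tokens[index], excluding the word itself."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """CBOW: the surrounding context words are the input, the centre word is the target."""
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]


def skipgram_examples(tokens, window=2):
    """Skip-gram: the centre word is the input, each context word is a separate target."""
    return [
        (tokens[i], context)
        for i in range(len(tokens))
        for context in context_window(tokens, i, window)
    ]


sentence = ["the", "quick", "brown", "fox", "jumps"]
cbow = cbow_examples(sentence, window=1)
skipgram = skipgram_examples(sentence, window=1)

print(cbow[1])       # (['the', 'brown'], 'quick')
print(skipgram[:3])  # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
```

In Gensim's Word2Vec class the choice between the two is made with the sg parameter: sg=0 (the default) selects CBOW and sg=1 selects skip-gram.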

GloVe by Stanford

GloVe (Global Vectors for Word Representation) is an extension of the Word2Vec method, developed by Pennington et al. at Stanford. The GloVe algorithm combines both −

  • Global statistics of matrix factorization techniques like LSA (Latent Semantic Analysis)
  • Local context-based learning in Word2Vec.

Rather than learning only from individual local context windows, GloVe constructs an explicit word co-occurrence matrix using statistics across the whole text corpus.
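
To make the co-occurrence matrix concrete, here is a minimal sketch of the counting step only. The function name cooccurrence_counts, the window parameter and the toy corpus are assumptions for illustration; counts are gathered from nearby words but aggregated over the whole corpus before any learning happens. The GloVe step that fits vectors to these counts is not shown.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered (word, neighbour) pair occurs within `window` positions."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


# Toy corpus, invented for this illustration.
corpus = [
    ["ice", "is", "cold"],
    ["steam", "is", "hot"],
    ["ice", "is", "solid"],
]
counts = cooccurrence_counts(corpus, window=1)

print(counts[("ice", "is")])   # 2: the pair occurs in two sentences
print(counts[("ice", "hot")])  # 0: never seen next to each other
```

The Counter acts as a sparse matrix: rows and columns are words, and any pair that was never observed reads as zero.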

Developing Word2Vec Embedding

Here, we will develop a Word2Vec embedding using Gensim. To work with a Word2Vec model, Gensim provides the Word2Vec class, which can be imported from gensim.models.word2vec. Word2Vec requires a lot of text, e.g. the entire Amazon review corpus, but here we will apply the principle to a small in-memory text.

Implementation Example

First we need to import the Word2Vec class from gensim.models as follows −

Next, we need to define the training data. Rather than using a large text file, we use a few sentences to illustrate the principle.

Once the training data is provided, we need to train the model. It can be done as follows −

We can summarise the model as follows −

We can summarise the vocabulary as follows −

Next, let’s access the vector for one word. We are doing it for the word ‘tutorial’.

Next, we need to save the model −

Next, we need to load the model −

Finally, print the loaded model as follows −

Complete Implementation Example

Output

Visualising Word Embedding

We can also explore the word embedding with visualisation. It can be done by using a classical projection method (like PCA) to reduce the high-dimensional word vectors to 2-D. Once reduced, we can plot them on a graph.

Plotting Word Vectors Using PCA

First, we need to retrieve all the vectors from a trained model as follows −

Next, we need to create a 2-D PCA model of the word vectors using the PCA class as follows −

Now, we can plot the resulting projection using matplotlib as follows −

We can also annotate the points on the graph with the words themselves, so that the plotted projection is labelled, as follows −

Complete Implementation Example

Output