Generating a Word2Vec model from a block of Text using Gensim (Python)

Word2Vec is a semantic learning framework that uses a shallow neural network to learn representations of the words/phrases in a particular text. Simply put, it’s an algorithm that takes in all the terms (with repetitions) in a particular document, divided into sentences, and outputs a vector for each term. The ‘advantage’ Word2Vec offers is that its neural model picks up the semantic meaning behind those terms from the contexts they appear in. For example, a document may employ the words ‘dog’ and ‘canine’ to mean the same thing, but never use them together in a sentence. Ideally, Word2Vec would be able to learn the context and place them close together in its semantic space. Most applications of Word2Vec use cosine similarity to quantify closeness. This Quora question (or rather its answers) does a good job of explaining the intuition behind it.

You would need to take the following steps to develop a Word2Vec model from a block of text (Usually, documents that are extensive and yet stick to the topic of interest with minimum ambiguity do well):

[I use Gensim’s Word2Vec API in Python to form Word2Vec models of Wikipedia articles.]

1. Obtain the text (obviously)

To obtain the Wikipedia articles, I use the Python wikipedia library. Once it is installed, here’s how you could use it to obtain all the text from an article:


#'title' denotes the exact title of the article to be fetched
title = "Machine learning"
from wikipedia import page
wikipage = page(title)

You could then use wikipage.content to access the entire text of the article as a string. Now, in case you don’t have the exact title and want to do a search, you would do:


from wikipedia import search, page
titles = search('machine learning')
wikipage = page(titles[0])

[Tip: Store the content into a file and access it from there. This would provide you a reference later, if needed.]
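
A minimal sketch of that tip, assuming a small helper (load_or_fetch is a name made up for this example, not part of the wikipedia library). It reads the text from a local file if one exists, and only calls the fetch function otherwise:

```python
from pathlib import Path


def load_or_fetch(path, fetch):
    """Return the text cached at `path`; call `fetch()` only if the file is missing.

    `fetch` is any zero-argument callable returning a string, for example
    lambda: page(title).content (hypothetical usage, not run here).
    """
    cache = Path(path)
    if cache.exists():
        return cache.read_text(encoding="utf-8")
    text = fetch()
    cache.write_text(text, encoding="utf-8")
    return text
```

With this, something like load_or_fetch('machine_learning.txt', lambda: page(title).content) would hit Wikipedia once and reuse the saved file afterwards.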

2. Preprocess the text

In the context of Python, you would require an iterable that yields one iterable for each sentence in the text. The inner iterable would contain the terms in that particular sentence. A ‘term’ could be an individual word like ‘machine’, a phrase (n-gram) like ‘machine learning’, or a combination of both. Coming up with appropriate bigrams/trigrams is a tricky task on its own, so I just stick to unigrams.
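
To make the expected shape concrete, here is a small sketch. The SentenceFile class and the one-sentence-per-line file layout are assumptions made for this example. It is a class rather than a generator because Gensim walks over the corpus more than once (once to build the vocabulary, then again for training), so the iterable has to be restartable.

```python
# Expected shape: an iterable of sentences, each an iterable of terms.
sentences = [
    ["machine", "learning", "is", "fun"],
    ["deep", "learning", "uses", "neural", "networks"],
]

# A common mistake is to pass plain strings. Iterating over a string yields
# characters, so every "term" would end up being a single character.
wrong = ["machine learning is fun"]
letters = sorted({term for sentence in wrong for term in sentence})


class SentenceFile:
    """Restartable iterable over a text file with one sentence per line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every pass, so the corpus can be re-read.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens
```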

First of all, I remove all special characters and short lines from the article, to eliminate noise. Then, I use Porter Stemming on my unigrams, using a ‘wrapper’ around Gensim’s stemming API.


from gensim.parsing import PorterStemmer
global_stemmer = PorterStemmer()

class StemmingHelper(object):
    """
    Class to aid the stemming process - from word to stemmed form,
    and vice versa.
    The 'original' form of a stemmed word will be returned as the
    form in which its been used the most number of times in the text.
    """

    #This reverse lookup will remember the original forms of the stemmed
    #words
    word_lookup = {}

    @classmethod
    def stem(cls, word):
        """
        Stems a word and updates the reverse lookup.
        """

        #Stem the word
        stemmed = global_stemmer.stem(word)

        #Update the word lookup
        if stemmed not in cls.word_lookup:
            cls.word_lookup[stemmed] = {}
        cls.word_lookup[stemmed][word] = (
            cls.word_lookup[stemmed].get(word, 0) + 1)

        return stemmed

    @classmethod
    def original_form(cls, word):
        """
        Returns original form of a word given the stemmed version,
        as stored in the word lookup.
        """

        if word in cls.word_lookup:
            return max(cls.word_lookup[word].keys(),
                       key=lambda x: cls.word_lookup[word][x])
        else:
            return word

Refer to the code and docstrings to understand how it works. (It’s pretty simple anyway.) It can be used as follows:


>>> StemmingHelper.stem('learning')
'learn'
>>> StemmingHelper.original_form('learn')
'learning'

Pre-stemming, you could also use a list of stopwords to eliminate terms that occur frequently in the English language, but don’t carry much semantic meaning.
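
A rough sketch of that filtering step. The stopword list here is a tiny hand-picked set used only for illustration; a real list would be far longer.

```python
# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = frozenset(
    {"a", "an", "the", "is", "are", "of", "in", "to", "and", "that"}
)


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens whose lowercase form is in the stopword list."""
    return [token for token in tokens if token.lower() not in stopwords]


print(remove_stopwords(["The", "dog", "is", "a", "canine"]))
# ['dog', 'canine']
```

Applying this before stemming keeps the comparison simple, since the stopword list holds unstemmed words.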

After your pre-processing, let’s assume you come up with an iterable called sentences from your source of text.
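
For reference, one way such a sentences iterable could be built is sketched below. This is not the exact code used for this post: the make_sentences name, the 20-character cutoff for ‘short’ lines and the punctuation-based sentence splitting are all assumptions. The stem argument is a hook where something like StemmingHelper.stem could be passed in.

```python
import re


def make_sentences(text, min_line_length=20, stem=None):
    """Turn raw text into a list of sentences, each a list of terms.

    Short lines (headings, stray fragments) are dropped, each remaining line
    is split into sentences on '.', '!' or '?', and everything except
    letters is stripped out.
    """
    stem = stem or (lambda word: word)  # no stemming unless a hook is given
    sentences = []
    for line in text.splitlines():
        line = line.strip()
        if len(line) < min_line_length:
            continue
        for raw in re.split(r"(?<=[.!?])\s+", line):
            cleaned = re.sub(r"[^a-z\s]", " ", raw.lower())
            tokens = [stem(word) for word in cleaned.split()]
            if tokens:
                sentences.append(tokens)
    return sentences
```

Called as make_sentences(wikipage.content, stem=StemmingHelper.stem), this would produce stemmed unigrams in the shape the model expects.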

3. Figure out the values for your numerical parameters

Gensim’s Word2Vec API requires some parameters for initialization. Of course they do have default values, but you would want to define some on your own:

i. size – Denotes the number of dimensions present in the vectorial forms. If you have read the document and have an idea of how many ‘topics’ it has, you can use that number. For sizeable blocks, people use 100-200. I use around 50 for the Wikipedia articles. Usually, you would want to repeat the initialization for different numbers of dimensions in a certain range, and pick the one that yields the best results (depending on your application – I will be using them to build Mind-Maps, and I usually have to try values from 20-100). A good heuristic that’s frequently used is the square root of the size of the vocabulary, after pre-processing.

ii. min_count – Terms that occur fewer than min_count times are ignored in the calculations. This reduces noise in the semantic space. I use 2 for Wikipedia. Usually, the bigger and more extensive your text, the higher this number can be.

iii. window – Only terms that occur within a window-neighbourhood of a term, in a sentence, are associated with it during training. The usual value is 4. Unless your text contains big sentences, leave it at that.

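iv. sg – This defines the algorithm. If equal to 1, the skip-gram technique is used. Else, the CBoW method is employed. (Look at the aforementioned Quora answers.) I usually use skip-gram (1), which was the default when I wrote this. Recent Gensim releases default to 0 (CBoW), so pass sg=1 explicitly if you want skip-gram.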
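
The parameters above can be explored on your own data before training anything. The sketch below uses helper names invented for this example: it counts the vocabulary that survives min_count, applies the square-root heuristic for size, and lists the (term, neighbour) pairs a given window allows. It shows the maximal neighbourhood only; the real trainer may use a smaller effective window at a given position.

```python
import math
from collections import Counter


def vocabulary(sentences, min_count=2):
    """Terms that occur at least `min_count` times across all sentences."""
    counts = Counter(term for sentence in sentences for term in sentence)
    return {term for term, n in counts.items() if n >= min_count}


def suggested_size(sentences, min_count=2):
    """Square-root-of-vocabulary heuristic for the number of dimensions."""
    return max(1, round(math.sqrt(len(vocabulary(sentences, min_count)))))


def context_pairs(sentence, window=4):
    """All (term, neighbour) pairs that lie within `window` positions."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        pairs.extend((center, sentence[j]) for j in range(lo, hi) if j != i)
    return pairs
```

Treat the result of suggested_size as a starting point for the range of values you try, not as a final answer.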

4. Initialize the model and use it

The model can be generated using Gensim’s API, as follows:


from gensim.models import Word2Vec
min_count = 2
size = 50
window = 4

model = Word2Vec(sentences, min_count=min_count, size=size, window=window)
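
The call above uses the parameter names of the Gensim release this post was written against. In Gensim 4 the size argument is called vector_size, and from Gensim 1.0 onwards the vocabulary and vectors are reached through model.wv. The sketch below picks the right names at run time; word2vec_kwargs and train are helper names made up for this example, and it assumes Gensim 1.0 or later.

```python
def word2vec_kwargs(gensim_major, size=50, window=4, min_count=2, sg=1):
    """Keyword arguments for Word2Vec, named for the given Gensim major version."""
    size_key = "vector_size" if gensim_major >= 4 else "size"
    return {size_key: size, "window": window, "min_count": min_count, "sg": sg}


def train(sentences, **overrides):
    """Train a model and return it along with its vocabulary as a list."""
    # Imported here so the helper above stays usable without Gensim installed.
    import gensim
    from gensim.models import Word2Vec

    major = int(gensim.__version__.split(".")[0])
    model = Word2Vec(sentences, **word2vec_kwargs(major, **overrides))
    # Gensim 4 exposes the vocabulary as wv.key_to_index; 1.x-3.x as wv.vocab.
    vocab = model.wv.key_to_index if major >= 4 else model.wv.vocab
    return model, list(vocab)
```

On Gensim 4, lookups and similarity queries also go through model.wv, for example model.wv['learn'], model.wv.similarity(a, b) and model.wv.most_similar(term).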

Now that you have the model initialized, you can access all the terms in its vocabulary, using something like list(model.vocab.keys()). To get the vectorial representation of a particular term, use model[term]. If you have used my stemming wrapper, you could find the appropriate original form of the stemmed terms using StemmingHelper.original_form(term). Here’s an example, from the Wiki article on Machine learning:


>>> vocab = list(model.vocab.keys())
>>> vocab[:10]
[u'represent', u'concept', u'founder', u'focus', u'invent', u'signific', u'abil', u'implement', u'benevol', u'hierarch']
>>> 'learn' in model.vocab
True
>>> model['learn']
array([  1.23792759e-03,   5.49776992e-03,   2.18261080e-03,
         8.37465748e-03,  -6.10323064e-03,  -6.94877980e-03,
         6.29429379e-03,  -7.06598908e-03,  -7.16267806e-03,
        -2.78065586e-03,   7.40372669e-03,   9.68673080e-03,
        -4.75220988e-03,  -8.34807567e-03,   5.25208283e-03,
         8.43616109e-03,  -1.07231298e-02,  -3.88528360e-03,
        -9.20894090e-03,   4.17305576e-03,   1.90116244e-03,
        -1.92442467e-03,   2.74807960e-03,  -1.01113841e-02,
        -3.71694425e-03,  -6.60350174e-03,  -5.90716442e-03,
         3.90679482e-03,  -5.32188127e-03,   5.63300075e-03,
        -5.52612450e-03,  -5.57334488e-03,  -8.51202477e-03,
        -8.78736563e-03,   6.41061319e-03,   6.64879987e-03,
        -3.55080629e-05,   4.81080823e-03,  -7.11903954e-03,
         9.83678619e-04,   1.60697231e-03,   7.42980337e-04,
        -2.12235347e-04,  -8.05167668e-03,   4.08948492e-03,
        -5.48054813e-04,   8.55423324e-03,  -7.08682090e-03,
         1.57684216e-03,   6.79725129e-03], dtype=float32)
>>> StemmingHelper.original_form('learn')
u'learning'
>>> StemmingHelper.original_form('hierarch')
u'hierarchical'

As you might have guessed, the vectors are NumPy arrays, and support all their functionality. Now, to compute the cosine similarity between two terms, use the similarity method. Cosine similarity always lies in [-1, 1]. The corresponding ‘distance’ can be measured as 1-similarity. To figure out the terms most similar to a particular one, you can use the most_similar method.


>>> model.most_similar(StemmingHelper.stem('classification'))
[(u'spam', 0.25190210342407227), (u'metric', 0.22569453716278076), (u'supervis', 0.19861873984336853), (u'decis', 0.18607790768146515), (u'inform', 0.17607420682907104), (u'artifici', 0.16593246161937714), (u'previous', 0.16366994380950928), (u'train', 0.15940310060977936), (u'network', 0.14765430986881256), (u'term', 0.14321796596050262)]
>>> model.similarity(StemmingHelper.stem('classification'), 'supervis')
0.19861870268896875
>>> model.similarity('unsupervis', 'supervis')
-0.11546791800661522
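
The similarity values above are plain cosine similarities between the two term vectors. As a sketch of what that means, here it is written in pure Python with hypothetical helper names. It accepts any sequence of numbers, so NumPy vectors taken from the model work as well.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """The corresponding 'distance', measured as 1 - similarity."""
    return 1.0 - cosine_similarity(u, v)
```

Vectors pointing the same way score 1, perpendicular ones 0, and opposite ones -1, which is why the distance ranges from 0 to 2.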

There’s a ton of other functionality that’s supported by the class, so you should have a look at the API I gave a link to. Happy topic modelling 🙂

16 thoughts on “Generating a Word2Vec model from a block of Text using Gensim (Python)”

Andy T says:

25/12/2015 at 06:56

Great post! Do you know a good source to look at examples of some of the other functionality you mentioned at the end?

1. srjoglekar246 says:
  
  25/12/2015 at 10:41
  
  Examples would be a little hard to find, since Word2Vec is relatively new. I am sure you must have read the docs of the class itself here: https://radimrehurek.com/gensim/models/word2vec.html . If all you want is to get your hands dirty, this Kaggle event would be a nice attempt: https://www.kaggle.com/c/word2vec-nlp-tutorial . Hope this helps!
  
zhiyong Young says:

24/04/2016 at 20:03

That’s an Awesome Blog!!! I’m stuck with an error when using word2vec with Python 2.7 running on a 64-bit Win7 system; it seems that there’s something wrong with the seeds for numpy.random
(once = random.RandomState(self.hashfxn(seed_string) & 0xffffffff)).

The following is the traceback

Traceback (most recent call last):
File "D:/pypro/tr01/tryChar.py", line 18, in
model = gensim.models.Word2Vec(lin,seed =20 ,size=20, window=3,min_count=1)
File "D:\Anaconda\lib\site-packages\gensim\models\word2vec.py", line 444, in __init__
self.build_vocab(sentences, trim_rule=trim_rule)
File "D:\Anaconda\lib\site-packages\gensim\models\word2vec.py", line 510, in build_vocab
self.finalize_vocab() # build tables & arrays
File "D:\Anaconda\lib\site-packages\gensim\models\word2vec.py", line 640, in finalize_vocab
self.reset_weights()
File "D:\Anaconda\lib\site-packages\gensim\models\word2vec.py", line 986, in reset_weights
self.syn0[i] = self.seeded_vector(self.index2word[i] + str(self.seed))
File "D:\Anaconda\lib\site-packages\gensim\models\word2vec.py", line 998, in seeded_vector
once = random.RandomState(self.hashfxn(seed_string)& 0xffffffff )
File "mtrand.pyx", line 564, in mtrand.RandomState.__init__ (numpy\random\mtrand\mtrand.c:5260)
File "mtrand.pyx", line 600, in mtrand.RandomState.seed (numpy\random\mtrand\mtrand.c:5515)
ValueError: object of too small depth for desired array

1. Noam Kaplan says:
  
  15/05/2016 at 19:59
  
  Try updating numpy to its latest version. I had a similar problem which was solved this way (I updated scipy as well).
  
Rumesa says:

20/01/2017 at 15:55

I got an error while running this code:
NameError: name 'sentences' is not defined
I am using Python version 3.5.2. How do I resolve this?

1. srjoglekar246 says:
  
  20/01/2017 at 21:43
  
  I have not provided any sentences in the code. You have to do that with your data. I have defined what ‘sentences’ needs to be :-).
  
  1. Rohit Shankar says:
    
    26/09/2017 at 06:06
    
    What do you mean by sentences in the code? Can you give me an example? Thank you for your help.
Rumesa says:

23/01/2017 at 12:21

Now I am getting the following error

Warning (from warnings module):
File "C:\Users\pc\Anaconda3\Lib\site-packages\gensim\utils.py", line 843
warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
UserWarning: detected Windows; aliasing chunkize to chunkize_serial
Traceback (most recent call last):
File "C:/Users/pc/Desktop/word2vec projects/4.py", line 70, in
model['learn']
File "C:\Users\pc\Anaconda3\Lib\site-packages\gensim\models\word2vec.py", line 1297, in __getitem__
return self.wv.__getitem__(words)
File "C:\Users\pc\Anaconda3\Lib\site-packages\gensim\models\keyedvectors.py", line 348, in __getitem__
return self.syn0[self.vocab[words].index]
KeyError: 'learn'

1. srjoglekar246 says:
  
  23/01/2017 at 13:06
  
  Is ‘learn’ in the vocabulary of your text? You might have to change things a bit, depending on what text you are analysing 🙂
  
Rajkumar Kaliyaperumal says:

17/07/2017 at 12:32

This is a great post for beginners of word2vec framework. I had to tweak the code a little bit to use model.wv as there is change in the instance in the newest version of gensim package. Plus I also filtered the words for stop words to get more meaningful results for similar words. Otherwise I get a lot of stop words as similar words.
Can you share some insight on how word2vec can be used for sentiment analysis and for assigning polarity and a score to aspects?

John BC says:

04/02/2019 at 17:37

Hi … I am unable to get vocab keys as ‘words’, I get them as single alphabets. Also I use model.wv for python 3.X. Am I doing something wrong ?

Sachin Joglekar's blog

Programming | Python | ML