Generating a Word2Vec model from a block of Text using Gensim (Python)


Word2Vec is a semantic learning framework that uses a shallow neural network to learn representations of the words/phrases in a particular text. Simply put, it’s an algorithm that takes in all the terms (with repetitions) in a particular document, divided into sentences, and outputs a vector for each term. The ‘advantage’ Word2Vec offers is that its neural model captures the semantic meaning behind those terms. For example, a document may employ the words ‘dog’ and ‘canine’ to mean the same thing, but never use them together in a sentence. Ideally, Word2Vec would be able to learn from the context and place them close together in its semantic space. Most applications of Word2Vec use cosine similarity to quantify closeness. This Quora question (or rather its answers) does a good job of explaining the intuition behind it.

You would need to take the following steps to develop a Word2Vec model from a block of text (Usually, documents that are extensive and yet stick to the topic of interest with minimum ambiguity do well):

[I use Gensim’s Word2Vec API in Python to form Word2Vec models of Wikipedia articles.]

1. Obtain the text (obviously)

To obtain the Wikipedia articles, I use the Python wikipedia library. Once installed from the link, here’s how you could use it to obtain all the text from an article:

#'title' denotes the exact title of the article to be fetched
title = "Machine learning"
from wikipedia import page
wikipage = page(title)

You could then use wikipage.content to access the entire text of the article as a string. Now, in case you don’t have the exact title and want to do a search, you would do:

from wikipedia import search, page
titles = search('machine learning')
wikipage = page(titles[0])

[Tip: Store the content into a file and access it from there. This would provide you a reference later, if needed.]
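Here is a minimal sketch of that tip. The helper names (save_article, load_article) and the file name are my own choices for illustration, and a short stand-in string replaces wikipage.content so the snippet runs offline.

```python
import io
import os
import tempfile


def save_article(text, path):
    """Write the article text to `path` as UTF-8."""
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


def load_article(path):
    """Read the article text back from `path`."""
    with io.open(path, "r", encoding="utf-8") as handle:
        return handle.read()


# Stand-in for wikipage.content (assumption: any unicode string works the same)
content = u"Machine learning is a field of study.\nIt builds on statistics."

# Hypothetical location; pick any path that suits your project
path = os.path.join(tempfile.gettempdir(), "machine_learning.txt")
save_article(content, path)
restored = load_article(path)
```

With a real article you would pass wikipage.content to save_article once, and work from the file afterwards.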

2. Preprocess the text

In the context of Python, you would require an iterable that yields one iterable for each sentence in the text. The inner iterable would contain the terms in the particular sentence. A ‘term’ could be individual words like ‘machine’, or phrases (n-grams) like ‘machine learning’, or a combination of both. Coming up with appropriate bigrams/trigrams is a tricky task on its own, so I just stick to unigrams.

First of all, I remove all special characters and short lines from the article, to eliminate noise. Then, I use Porter Stemming on my unigrams, using a ‘wrapper’ around Gensim’s stemming API.

from gensim.parsing import PorterStemmer
global_stemmer = PorterStemmer()

class StemmingHelper(object):
    """
    Class to aid the stemming process - from word to stemmed form,
    and vice versa.
    The 'original' form of a stemmed word will be returned as the
    form in which it has been used the most number of times in the text.
    """

    #This reverse lookup will remember the original forms of the stemmed
    #words, as {stemmed: {original_form: count}}
    word_lookup = {}

    @classmethod
    def stem(cls, word):
        """
        Stems a word and updates the reverse lookup.
        """

        #Stem the word
        stemmed = global_stemmer.stem(word)

        #Update the word lookup
        if stemmed not in cls.word_lookup:
            cls.word_lookup[stemmed] = {}
        cls.word_lookup[stemmed][word] = (
            cls.word_lookup[stemmed].get(word, 0) + 1)

        return stemmed

    @classmethod
    def original_form(cls, word):
        """
        Returns original form of a word given the stemmed version,
        as stored in the word lookup.
        """

        if word in cls.word_lookup:
            #Pick the original form that was seen most often
            return max(cls.word_lookup[word].keys(),
                       key=lambda x: cls.word_lookup[word][x])
        else:
            #Unknown stem: fall back to the stem itself
            return word

Refer to the code and docstrings to understand how it works. (It’s pretty simple anyway.) It can be used as follows:

>>> StemmingHelper.stem('learning')
'learn'
>>> StemmingHelper.original_form('learn')
'learning'

Pre-stemming, you could also use a list of stopwords to eliminate terms that occur frequently in the English language, but don’t carry much semantic meaning.

After your pre-processing, let’s assume you come up with an iterable called sentences from your source of text.
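To make the preprocessing step concrete, here is a rough sketch of how such a sentences iterable could be built with only the standard library. The function name make_sentences, the thresholds, and the tiny stopword set are all assumptions of mine for illustration, not the exact cleaning I ran on the articles; the sentence splitting on punctuation is also deliberately naive.

```python
import re

# Hypothetical, deliberately tiny stopword list; use a fuller one in practice
STOPWORDS = {"a", "an", "and", "is", "it", "of", "on", "that", "the", "to"}


def make_sentences(text, min_line_chars=20, min_terms=2):
    """Turn raw text into a list of sentences, each a list of unigram terms."""
    sentences = []
    for line in text.splitlines():
        line = line.strip()
        # Drop short lines (section headings, stray captions, etc.)
        if len(line) < min_line_chars:
            continue
        # Naive sentence split on terminal punctuation
        for raw in re.split(r"[.!?]+", line):
            # Lowercase and replace anything that is not a letter or space
            cleaned = re.sub(r"[^a-z\s]", " ", raw.lower())
            terms = [t for t in cleaned.split() if t not in STOPWORDS]
            if len(terms) >= min_terms:
                sentences.append(terms)
    return sentences


sample = (
    "== History ==\n"
    "Machine learning is the study of algorithms. It relies on data!\n"
)
sentences = make_sentences(sample)
```

If you are stemming, each term can be passed through StemmingHelper.stem just before it is added to the sentence.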

3. Figure out the values for your numerical parameters

Gensim’s Word2Vec API requires some parameters for initialization. Of course they do have default values, but you may want to define some on your own:

i. size – Denotes the number of dimensions present in the vectorial forms. If you have read the document and have an idea of how many ‘topics’ it has, you can use that number. For sizeable blocks, people use 100-200. I use around 50 for the Wikipedia articles. Usually, you would want to repeat the initialization for different numbers of topics in a certain range, and pick the one that yields the best results (depending on your application – I will be using them to build Mind-Maps, and I usually have to try values from 20-100). A good heuristic that’s frequently used is the square-root of the length of the vocabulary, after pre-processing.

ii. min_count – Terms that occur fewer than min_count times are ignored in the calculations. This reduces noise in the semantic space. I use 2 for Wikipedia. Usually, the bigger and more extensive your text, the higher this number can be.

iii. window – Only terms that occur within a window-neighbourhood of a term, in a sentence, are associated with it during training. The usual value is 4. Unless your text contains big sentences, leave it at that.

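iv. sg – This defines the algorithm. If equal to 1, the skip-gram technique is used. Else, the CBoW method is employed. (Look at the aforementioned Quora answers). I usually use skip-gram (1), which was the default in the Gensim version I used. Recent Gensim releases default to 0 (CBoW), so pass sg=1 explicitly if you want skip-gram.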
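The square-root heuristic for size mentioned above interacts with min_count, since rare terms are dropped before the vocabulary is counted. A small sketch of that arithmetic, where suggest_size is a hypothetical helper of mine and the toy sentences are made up:

```python
import math
from collections import Counter


def suggest_size(sentences, min_count=2):
    """Return (vocabulary length, rounded square root) after applying min_count."""
    counts = Counter(term for sentence in sentences for term in sentence)
    # Keep only the terms that would survive the min_count filter
    vocab = [term for term, count in counts.items() if count >= min_count]
    return len(vocab), int(round(math.sqrt(len(vocab))))


# Made-up, already stemmed sentences
toy_sentences = [
    ["machin", "learn", "model"],
    ["learn", "model", "data"],
    ["data", "learn", "algorithm"],
    ["machin", "data"],
]
vocab_length, suggested_size = suggest_size(toy_sentences, min_count=2)
```

Here ‘algorithm’ occurs only once and is dropped, leaving 4 terms and a suggested size of 2. On a real article this only gives a starting point for the range of values worth trying.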

4. Initialize the model and use it

The model can be generated using Gensim’s API, as follows:

from gensim.models import Word2Vec
min_count = 2
size = 50
window = 4

#Note: Gensim 4.x renamed the 'size' argument to 'vector_size'
model = Word2Vec(sentences, min_count=min_count, size=size, window=window)

Now that you have the model initialized, you can access all the terms in its vocabulary, using something like list(model.vocab.keys()). To get the vectorial representation of a particular term, use model[term]. (In Gensim 4.x these live on model.wv instead: model.wv.key_to_index and model.wv[term].) If you have used my stemming wrapper, you could find the appropriate original form of the stemmed terms using StemmingHelper.original_form(term). Here’s an example, from the Wiki article on Machine learning:

>>> vocab = list(model.vocab.keys())
>>> vocab[:10]
[u'represent', u'concept', u'founder', u'focus', u'invent', u'signific', u'abil', u'implement', u'benevol', u'hierarch']
>>> 'learn' in model.vocab
>>> model['learn']
array([  1.23792759e-03,   5.49776992e-03,   2.18261080e-03,
         8.37465748e-03,  -6.10323064e-03,  -6.94877980e-03,
         6.29429379e-03,  -7.06598908e-03,  -7.16267806e-03,
        -2.78065586e-03,   7.40372669e-03,   9.68673080e-03,
        -4.75220988e-03,  -8.34807567e-03,   5.25208283e-03,
         8.43616109e-03,  -1.07231298e-02,  -3.88528360e-03,
        -9.20894090e-03,   4.17305576e-03,   1.90116244e-03,
        -1.92442467e-03,   2.74807960e-03,  -1.01113841e-02,
        -3.71694425e-03,  -6.60350174e-03,  -5.90716442e-03,
         3.90679482e-03,  -5.32188127e-03,   5.63300075e-03,
        -5.52612450e-03,  -5.57334488e-03,  -8.51202477e-03,
        -8.78736563e-03,   6.41061319e-03,   6.64879987e-03,
        -3.55080629e-05,   4.81080823e-03,  -7.11903954e-03,
         9.83678619e-04,   1.60697231e-03,   7.42980337e-04,
        -2.12235347e-04,  -8.05167668e-03,   4.08948492e-03,
        -5.48054813e-04,   8.55423324e-03,  -7.08682090e-03,
         1.57684216e-03,   6.79725129e-03], dtype=float32)
>>> StemmingHelper.original_form('learn')
>>> StemmingHelper.original_form('hierarch')

As you might have guessed, the vectors are NumPy arrays, and support all their functionality. Now, to compute the cosine similarity between two terms, use the similarity method. Cosine similarity is generally bounded by [-1, 1]. The corresponding ‘distance’ can be measured as 1-similarity. To figure out the terms most similar to a particular one, you can use the most_similar method.

>>> model.most_similar(StemmingHelper.stem('classification'))
[(u'spam', 0.25190210342407227), (u'metric', 0.22569453716278076), (u'supervis', 0.19861873984336853), (u'decis', 0.18607790768146515), (u'inform', 0.17607420682907104), (u'artifici', 0.16593246161937714), (u'previous', 0.16366994380950928), (u'train', 0.15940310060977936), (u'network', 0.14765430986881256), (u'term', 0.14321796596050262)]
>>> model.similarity(StemmingHelper.stem('classification'), 'supervis')
>>> model.similarity('unsupervis', 'supervis')
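For intuition, this is the calculation behind a cosine similarity score, sketched in plain Python so it runs without NumPy or a trained model. The function names are mine, and the vectors are toy values rather than anything taken from the model above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """The corresponding 'distance', as 1 - similarity."""
    return 1.0 - cosine_similarity(a, b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])       # exactly 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # close to -1
```

Only the direction of the vectors matters, not their length, which is why [1, 2] and [2, 4] count as identical here.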

There’s a ton of other functionality that’s supported by the class, so you should have a look at the API I gave a link to. Happy topic modelling 🙂


Interesting take-aways from ‘Data Science For Business’

I have recently been reading Data Science for Business by Foster Provost and Tom Fawcett. I picked it up on Amazon some time back, while trying to find some good books on the applications of machine learning in industries. And I hit the bull’s eye with this one!

Data Science for Business does an awesome job at getting the reader acquainted with all the basic (as well as some niche) applications of ML frameworks in a business context. What this book does not do is give you the rigor behind the algorithms discussed in it. The focus is on the when and why, not the how. For example, you will find detailed descriptions of the kind of problems Association Mining solves- from Market Basket Analysis to mining Facebook ‘like’ relationships- but you won’t find details of Apriori or any other actual AM technique. Why I love this book is that it does what most other ML texts don’t- it gives the reader an intuitive understanding of when to apply a certain technique and why. For a newbie, reading this book side-by-side with an established text like Elements of Statistical Learning would perhaps be the best way to get acquainted with Machine Learning and its applications.

There were a lot of ‘Aha!’ moments for me while reading this book, especially on the lines of ‘I never thought it could be used this way!’ or ‘That’s a good rule of thumb to follow while applying this!’. To give you guys a brief idea, here are three interesting things I learned/revisited while reading this book:

1. CRISP Data Mining

Okay, anyone who has done a formal course in Data Mining knows about the CRoss Industry Standard Process for Data Mining, and for good reason. It’s one of the most fundamental ways to go about mining any kind of data. A diagrammatic representation of the same is shown below:


However, while most courses would teach their students about this framework and leave it at that, this book tries to explain many other applications of statistical learning techniques in the context of CRISP data mining. For example, in an exploratory data mining technique such as clustering, it’s the Evaluation part of the process that’s highly crucial- since you need to extract the maximum ‘information’ that you can from the results that a standard clustering technique would provide. More on this in a bit.

The book uses the Expected Value framework to augment the way in which CRISP is applied. It basically stresses the fact that you need to formalize your Business Understanding of the problem by expressing it mathematically- as a value to estimate, or function to optimize. At this stage, you would rather think of your solution as a black box- not focusing on the algorithms and implementations underneath. You then need to understand your data in terms of the available attributes, sparseness of the data, etc. (Data Understanding). Only once you have the mathematical formulation and familiarity with the data, are you in a position to combine the two in a synergistic mix of Data Preparation and Modelling – in essence, formulating a solution that performs best with the data you have.
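As a rough illustration of ‘expressing it mathematically’, an expected value is just the sum of each outcome’s value weighted by its probability. The sketch below applies that to a hypothetical targeted-offer decision; the probabilities and monetary values are numbers I made up, not figures from the book, and the model that would estimate the response probability is treated as the black box.

```python
def expected_value(outcomes):
    """Sum of probability * value over a list of (probability, value) pairs."""
    total_probability = sum(p for p, _ in outcomes)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("outcome probabilities must sum to 1")
    return sum(p * v for p, v in outcomes)


# Hypothetical numbers for sending one customer an offer
p_respond = 0.05           # would come from the (black box) model
value_if_respond = 99.0    # profit on the sale, minus the cost of the offer
value_if_ignore = -1.0     # just the cost of the offer

ev_of_targeting = expected_value([
    (p_respond, value_if_respond),
    (1.0 - p_respond, value_if_ignore),
])
should_target = ev_of_targeting > 0
```

Framed this way, the modelling task becomes ‘estimate p_respond well’, which is exactly the kind of value to estimate that the Business Understanding step should pin down.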

2. Using Decision Trees to Understand Clusters

Why understand the output of clustering?

Since clustering, as I mentioned before, is an exploratory process, you don’t exactly know what you expect to find going in. You may have a rough idea of the number of clusters (or not), but not what each of them represents- at least in the general case.

Let’s consider a simple example. You have decided to use DBSCAN to cluster all of your customers into an unknown number of groups, where customers in the same cluster would have similar properties based on some criteria. You hope to use the output of this clustering to classify a new customer into one of these groups, and deal with him better given the characteristics that set him apart from the rest. You can easily use a supervised learning process to classify a new customer given the clustering output, but what you don’t have is an intuitive understanding of what the cluster represents. Usually, one of the following two methods is used to ‘understand’ a given cluster-

i. Pick out a member of the cluster that has the highest value for a certain property. For example, if you are clustering companies, you could just pick the one that has the highest turnover- which would mean the company that’s most ‘well-known’.

ii. Pick out that member which is closest to the centroid.

However, neither of these would be applicable in all contexts and problems. Especially if the cluster members don’t mean anything to you individually, or you need to focus on the trends in a cluster- which cannot be summarized by a single member anyways.

How do you use a Decision Tree here?

A Decision Tree is usually a pretty underestimated algorithm when it comes to supervised learning. The biggest reason for this is its innate simplicity, which results in people going for more ‘sophisticated’ techniques like SVMs (usually). However, this is exactly the property that makes us select a decision tree in this scenario.

To understand a cluster K, what you essentially do is this:

i. Take all members of cluster K on one side, and members of all other clusters on another side.

ii. ‘Learn’ a decision tree to classify any given data point as a member of K or not-member of K – essentially a one-vs-all classification.

iii. Decompose the decision tree into the corresponding set of rules.

Voila! What you now have is a set of rules that help you distinguish the properties of cluster K from the rest. By virtue of decision-tree learning algorithms such as C4.5, your rules will tend to be based on those attributes that play the biggest role in characterising members of cluster K. Elegant, isn’t it? The book explains this using the famous Whiskies dataset.
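To show the idea end to end without any library, here is a heavily simplified sketch: instead of a full tree learner like C4.5, it learns a one-level tree (a decision stump) by picking the single attribute threshold with the lowest weighted Gini impurity. The customers, attribute names and cluster ids are invented for illustration, and a real tree would give you several such rules chained together.

```python
def gini(labels):
    """Gini impurity of a list of booleans (True = member of cluster K)."""
    if not labels:
        return 0.0
    p = sum(labels) / float(len(labels))
    return 2.0 * p * (1.0 - p)


def best_rule(rows, in_cluster, feature_names):
    """Best single split 'feature <= threshold' as (feature, threshold, impurity)."""
    best = None
    n = float(len(rows))
    for j, name in enumerate(feature_names):
        values = sorted(set(row[j] for row in rows))
        # Candidate thresholds: midpoints between consecutive distinct values
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2.0
            left = [y for row, y in zip(rows, in_cluster) if row[j] <= threshold]
            right = [y for row, y in zip(rows, in_cluster) if row[j] > threshold]
            score = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
            if best is None or score < best[2]:
                best = (name, threshold, score)
    return best


# Invented customers as (age, monthly_spend), with ids from some clustering run
feature_names = ["age", "monthly_spend"]
rows = [(22, 300), (25, 320), (24, 90), (47, 310), (52, 80), (58, 100)]
cluster_ids = [0, 0, 1, 0, 1, 1]

# Steps i and ii: cluster K (here, cluster 0) versus everything else
in_k = [cluster_id == 0 for cluster_id in cluster_ids]
feature, threshold, impurity = best_rule(rows, in_k, feature_names)

# Step iii: read the split back as a rule, checking which side holds cluster K
j = feature_names.index(feature)
right_side = [y for row, y in zip(rows, in_k) if row[j] > threshold]
direction = ">" if sum(right_side) > len(right_side) / 2.0 else "<="
rule = "%s %s %.1f => member of K" % (feature, direction, threshold)
```

On this toy data the rule comes out as a threshold on monthly_spend, telling you that cluster K is the group of high spenders and that age plays no real role in defining it.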

3. Virtual Items in Association Mining

This isn’t a big technique of any kind, but a nifty thing to remember all the same. Almost always, when talking of Association Mining, we think of conventional ‘items’ as they would appear in any Market-Basket analysis. However, one way to expand this mining process is to include Virtual Items in the ‘Baskets’. For example, suppose you are a big chain of supermarkets across the country. Obviously, every time a customer buys something, you include all the purchased items in a ‘transaction’. What you could also include are some other attributes of the transaction, such as Location, Gender of the person buying, Time of the Day, etc. So a given transaction might look like (Mumbai, Male, Afternoon, Bread, Butter, Milk). Here, the first three ‘items’ aren’t really items at all! Hence, they are called ‘virtual items’. Nonetheless, if you enter these transactions into an Association Mining framework, you could end up finding relationships between conventional goods and properties such as gender and location!
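A small sketch of what that looks like in practice. It does not implement Apriori or any other mining algorithm; it only computes support and confidence, the two scores such algorithms typically rank rules by, over a handful of made-up transactions that mix virtual and conventional items.

```python
# Made-up transactions: the first three entries of each are virtual items
transactions = [
    {"Mumbai", "Male", "Afternoon", "Bread", "Butter", "Milk"},
    {"Mumbai", "Female", "Morning", "Bread", "Milk"},
    {"Delhi", "Male", "Afternoon", "Bread", "Butter"},
    {"Delhi", "Female", "Evening", "Milk"},
    {"Mumbai", "Male", "Evening", "Butter", "Milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for transaction in transactions if itemset <= transaction)
    return hits / float(len(transactions))


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# A rule linking a virtual item to a conventional one: Male -> Butter
rule_support = support({"Male", "Butter"}, transactions)
rule_confidence = confidence({"Male"}, {"Butter"}, transactions)
```

Because the virtual items are just more members of each basket, nothing in the scoring needs to know which items are ‘real’, which is what makes the trick so cheap to apply.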


These are just three of such things to learn from this book. I would definitely recommend this book to you, especially if you want to supplement your knowledge of various ML algorithms with the intuition of how and where to apply them. Cheers!