# Generating a Word2Vec model from a block of Text using Gensim (Python)

Word2Vec is a semantic learning framework that uses a shallow neural network to learn the representations of words/phrases in a particular text. Simply put, its an algorithm that takes in all the terms (with repetitions) in a particular document, divided into sentences, and outputs a vectorial form of each. The ‘advantage’ word2vec offers is in its utilization of a neural model in understanding the semantic meaning behind those terms. For example, a document may employ the words ‘dog’ and ‘canine’ to mean the same thing, but never use them together in a sentence. Ideally, Word2Vec would be able to learn the context and place them together in its semantic space. Most applications of Word2Vec using cosine similarity to quantify closeness. This Quora question (or rather its answers) does a good job of explaining the intuition behind it.

You would need to take the following steps to develop a Word2Vec model from a block of text (Usually, documents that are extensive and yet stick to the topic of interest with minimum ambiguity do well):

[I use Gensim’s Word2Vec API in Python to form Word2Vec models of Wikipedia articles.]

1. Obtain the text (obviously)

To obtain the Wikipedia articles, I use the Python wikipedia library. Once installed from the link, here’s how you could use it obtain all the text from an aritcle-


#'title' denotes the exact title of the article to be fetched
title = "Machine learning"
from wikipedia import page
wikipage = page(title)



You could then use wikipage.context to access the entire textual context in the form of a String. Now, incase you don’t have the exact title and want to do a search, you would do:


from wikipedia import search, page
titles = search('machine learning')
wikipage = page(titles[0])



[Tip: Store the content into a file and access it from there. This would provide you a reference later, if needed.]

2. Preprocess the text

In the context of Python, you would require an iterable that yields one iterable for each sentence in the text. The inner iterable would contain the terms in the particular sentence. A ‘term’ could be individual words like ‘machine’, or phrases(n-grams) like ‘machine learning’, or a combination of both. Coming up with appropriate bigrams/trigrams is a tricky task on its own, so I just stick to unigrams.

First of all, I remove all special characters and short lines from the article, to eliminate noise. Then, I use Porter Stemming on my unigrams, using a ‘wrapper’ around Gensim’s stemming API.


from gensim.parsing import PorterStemmer
global_stemmer = PorterStemmer()

class StemmingHelper(object):
"""
Class to aid the stemming process - from word to stemmed form,
and vice versa.
The 'original' form of a stemmed word will be returned as the
form in which its been used the most number of times in the text.
"""

#This reverse lookup will remember the original forms of the stemmed
#words
word_lookup = {}

@classmethod
def stem(cls, word):
"""
Stems a word and updates the reverse lookup.
"""

#Stem the word
stemmed = global_stemmer.stem(word)

#Update the word lookup
if stemmed not in cls.word_lookup:
cls.word_lookup[stemmed] = {}
cls.word_lookup[stemmed][word] = (
cls.word_lookup[stemmed].get(word, 0) + 1)

return stemmed

@classmethod
def original_form(cls, word):
"""
Returns original form of a word given the stemmed version,
as stored in the word lookup.
"""

if word in cls.word_lookup:
return max(cls.word_lookup[word].keys(),
key=lambda x: cls.word_lookup[word][x])
else:
return word



Refer to the code and docstrings to understand how it works. (Its pretty simple anyways). It can be used as follows-


>>> StemmingHelper.stem('learning')
'learn'
>>> StemmingHelper.original_form('learn')
'learning'



Pre-stemming, you could also use a list of stopwords to eliminate terms that occur frequently in the English language, but don’t carry much semantic meaning.

After your pre-processing, lets assume you come up with an iterable called sentences from your source of text.

3. Figure out the values for your numerical parameters

Gensim’s Word2Vec API requires some parameters for initialization. Ofcourse they do have default values, but you want to define some on your own:

i. size – Denotes the number of dimensions present in the vectorial forms. If you have read the document and have an idea of how many ‘topics’ it has, you can use that number. For sizeable blocks, people use 100-200. I use around 50 for the Wikipedia articles. Usually, you would want to repeat the initialization for different numbers of topics in a certain range, and pick the one that yields the best results (depending on your application – I will be using them to build Mind-Maps, and I usually have to try values from 20-100.). A good heuristic thats frequently used is the square-root of the length of the vocabulary, after pre-processing.

ii. min_count – Terms that occur less than min_count number of times are ignored in the calculations. This reduces noise in the semantic space. I use 2 for Wikipedia. Usually, the bigger and more extensive your text, the higher this number can be.

iii. window – Only terms hat occur within a window-neighbourhood of a term, in a sentence, are associated with it during training. The usual value is 4. Unless your text contains big sentences, leave it at that.

iv. sg – This defines the algorithm. If equal to 1, the skip-gram technique is used. Else, the CBoW method is employed. (Look at the aforementioned Quora answers). I usually use the default(1).

4. Initialize the model and use it

The model can be generated using Gensim’s API, as follows:


from gensim.models import Word2Vec
min_count = 2
size = 50
window = 4

model = Word2Vec(sentences, min_count=min_count, size=size, window=window)



Now that you have the model initialized, you can access all the terms in its vocabulary, using something like list(model.vocab.keys()). To get the vectorial representation of a particular term, use model[term]. If you have used my stemming wrapper, you could find the appropriate original form of the stemmed terms using StemmingHelper.original_form(term). Heres an example, from the Wiki article on Machine learning:


>>> vocab = list(model.vocab.keys())
>>> vocab[:10]
[u'represent', u'concept', u'founder', u'focus', u'invent', u'signific', u'abil', u'implement', u'benevol', u'hierarch']
>>> 'learn' in model.vocab
True
>>> model['learn']
array([  1.23792759e-03,   5.49776992e-03,   2.18261080e-03,
8.37465748e-03,  -6.10323064e-03,  -6.94877980e-03,
6.29429379e-03,  -7.06598908e-03,  -7.16267806e-03,
-2.78065586e-03,   7.40372669e-03,   9.68673080e-03,
-4.75220988e-03,  -8.34807567e-03,   5.25208283e-03,
8.43616109e-03,  -1.07231298e-02,  -3.88528360e-03,
-9.20894090e-03,   4.17305576e-03,   1.90116244e-03,
-1.92442467e-03,   2.74807960e-03,  -1.01113841e-02,
-3.71694425e-03,  -6.60350174e-03,  -5.90716442e-03,
3.90679482e-03,  -5.32188127e-03,   5.63300075e-03,
-5.52612450e-03,  -5.57334488e-03,  -8.51202477e-03,
-8.78736563e-03,   6.41061319e-03,   6.64879987e-03,
-3.55080629e-05,   4.81080823e-03,  -7.11903954e-03,
9.83678619e-04,   1.60697231e-03,   7.42980337e-04,
-2.12235347e-04,  -8.05167668e-03,   4.08948492e-03,
-5.48054813e-04,   8.55423324e-03,  -7.08682090e-03,
1.57684216e-03,   6.79725129e-03], dtype=float32)
>>> StemmingHelper.original_form('learn')
u'learning'
>>> StemmingHelper.original_form('hierarch')
u'hierarchical'



As you might have guessed, the vectors are NumPy arrays, and support all their functionality. Now, to compute the cosine similarity between two terms, use the similarity method. Cosine similarity is generally bounded by [-1, 1]. The corresponding ‘distance’ can be measured as 1-similarity. To figure out the terms most similar to a particular one, you can use the most_similar method.


>>> model.most_similar(StemmingHelper.stem('classification'))
[(u'spam', 0.25190210342407227), (u'metric', 0.22569453716278076), (u'supervis', 0.19861873984336853), (u'decis', 0.18607790768146515), (u'inform', 0.17607420682907104), (u'artifici', 0.16593246161937714), (u'previous', 0.16366994380950928), (u'train', 0.15940310060977936), (u'network', 0.14765430986881256), (u'term', 0.14321796596050262)]
>>> model.similarity(StemmingHelper.stem('classification'), 'supervis')
0.19861870268896875
>>> model.similarity('unsupervis', 'supervis')
-0.11546791800661522



There’s a ton of other functionality that’s supported by the class, so you should have a look at the API I gave a link to. Happy topic modelling 🙂

# Linear and Quadratic Discriminant Analysis for ML / statistics newbies

(Note: This post assumes that the reader has knowledge of basic statistics and terms used in machine learning. Inspite of that, I have provided links whereever necessary 🙂 ).

Linear Discriminant Analysis(LDA) and Quadratic Discriminant Analysis(QDA) are types of Bayesian classifiers.

A Bayesian classifier, in mathematical terms, does the following-

$C^{bayesian}(x) = argmax_{r \in \{1, 2, ..., k\}} Pr(Y = r | X = x)$

What does this mean? To put it in the form of steps, heres what happens-

1. The classifier is given an input($X$) that is the feature vector $x$. $x$ consists of $p$ different predictors, one for each input dimension.

2. Then, with respect to each possible output class($Y$) $r$ from $1, 2, ..., k$, the classifier computes $Pr(Y = r | X = x)$ – this is the probability that the actual output is $r$, given $x$ as the input.

3. The class (i.e. that value of $r$) that returns the maximum value for the above mentioned probability, is given as the output of the classifier – $C^{bayesian}(x)$.

In essence, a Bayesian classifier tries to minimize the probability of misclassification (which would be equal to $1 - Pr(Y = r | X = x)$).

But then the most natural question is- How is this probability computed? It is precisely at this point where all Bayesian classifiers differ. Each one of them uses a different technique to understand the data, and compute these crucial quantities. In fact, even a simple nearest-neighbors classifier can be used as a Bayesian classifier, with $Pr(Y = r | X = x)$ being the fraction of the nearest neighbors to $x$  that belong to class $r$. LDA and QDA, on the other hand, use a more mathematical approach.

Anyone acquainted with basic statistics knows Bayes theorem. Given two possible events $A$ and $B$, it states-

$Pr(A|B) = \frac{Pr(A)Pr(B|A)}{Pr(B)}$

Applying this to $Pr(Y = r | X = x)$, we would get

$Pr(Y = r | X = x) = \frac{Pr(Y = r)Pr(X = x | Y = r)}{Pr(X = x)}$.

Now, look at the first part in the above numerator- $Pr(Y = r)$. This is essentially the probability of the class of a random input being $r$ – prior to even looking at what the input is! Thats why its called the priori probability. During the training stage itself, it is computed as the fraction of the training samples that belonged to class $r$. As you might have understood by now, this number is independent of the what $x$ is, and reflects the trends seen in the training set.

Now lets consider the second part – $Pr(X = x | Y = r)$. This number is conceptually equivalent to answering the question- Given that a random data point(not necessarily in the training data) was being selected from class $r$, what would be the likelihood that it looked like $x$? To quantify this ‘likelihood’, LDA and QDA use a Multivariate Gaussian Distribution model for each class. Mathematically put, the probability density function for a multivariate Gaussian distribution is-

$f_{\boldsymbol\Sigma, \boldsymbol\mu}(x) = \frac{1}{\sqrt{(2\pi)^k|\boldsymbol\Sigma|}}\exp(-\frac{1}{2}({ x}-{\boldsymbol\mu})^\mathrm{T}{\boldsymbol\Sigma}^{-1}({x}-{\boldsymbol\mu}))$.

You definitely don’t need to remember this huge-ass formula. To put it simply, heres the intuition behind it- A probability distribution model is a way for an algorithm to understand how data points are distributed in a $p$-dimensional space. In the case of the multivariate Gaussian(also called Normal) distribution, the two parameters $\boldsymbol\mu$ (the $p$-dimensional mean vector) and $\boldsymbol\Sigma$ (the $p$x$p$-dimensional covariance matrix) are used to ‘summarize’ how the data ‘looks’ in the conceptual space. These are obviously learnt during training.

While $\boldsymbol\mu$ denotes the approximate average position, $\boldsymbol\Sigma$ summarizes the ‘shape’ of the distribution of data points. (Imagine drawing a random cluster of points on a graph paper. Now draw an ellipse-like shape over that region where most of the points are concentrated. The ‘central’ point of the ellipse will be a 2-dimensional vector $\boldsymbol\mu$, and the ‘shape’ would be approximated by a 2×2 matrix given by $\boldsymbol\Sigma$.) The corresponding probability density function(pdf) for a random $x$ in space, is proportional to how likely it is that a point at that location would be selected randomly from the distribution. The pdf is to probability, what pressure is to force. Pressure isn’t exactly force itself, but it shows how intense the latter is at a particular location- and the higher the pressure, the greater is the likelihood that the force will show its effect at that point.

Thus, LDA and QDA use the aforementioned $f(x)$ to quantify $Pr(X = x | Y = r)$. What about the denominator then – $Pr(X = x)$? It is nothing but the sum of the values of $Pr(Y = r)Pr(X = x | Y = r)$ over all possible $r$s. This is given by the Total Probability Theorem.

But then where do LDA and QDA differ? They differ in only one subtle point- While LDA assumes the same $\boldsymbol\Sigma$ for all classes, QDA computes a different $\boldsymbol\Sigma$ for each class. Essentially, LDA computes a separate $\boldsymbol\mu$ for each class(using training points that belonged to it), but $\boldsymbol\Sigma$ is computed using the entire training data. And this common covariance matrix is used in the computations corresponding to every class. QDA, on the other hand, computes a separate $\boldsymbol\Sigma$ and $\boldsymbol\mu$ for each possible output class.

We will now tackle the two most important question that might be plaguing your mind-

1. Why is LDA ‘linear’, and QDA ‘quadratic’?

The answer lies in the probability function. Consider the simplest case of one input variable only. The answer can be generalized to multiple dimensions using matrix algebra.

$Pr(Y = r | X = x) = \frac{Pr(Y = r)Pr(X = x | Y = r)}{Pr(X = x)}$

We are maximizing over all possible $r$s right? Intuitively, that should tell you that the denominator in the above equation, doesn’t really matter in comparisons- since its the same for all values of $r$. So we are now down to finding that $r$ which gives the maximum value for

$Pr(Y = r)Pr(X = x | Y = r)$

Let $Pr(Y = r) = \beta_{r}$ (to reduce my LaTeX efforts). Using the formula for 1-D Normal distribution density function, we have

$Pr(X = x | Y = r) = \frac{1}{\sigma_r\sqrt{2\pi}}e^\frac{(x - \mu_r)^2}{\sigma_r^2}$

Multiplying the two and taking a natural logarithm of the resultant expression, we are now down to finding that $r$ that gives the maximum value for

$log(\beta_{r}/\sigma_r) - \frac{x^2}{2\sigma_r^2} - \frac{\mu_r^2}{2\sigma_r^2} + \frac{x\mu_r}{\sigma_r^2}$

Observe the above quadratic(in x) expression carefully. Once again, I have omitted those terms that are constants or which remain the same irrespective of $r$(NOT $x$), which is the focus of the whole process. Or have I? Consider the second term. IF I am talking of the LDA classifier, would $\sigma_r$ be dependent on $r$ at all? It wouldn’t, since LDA assumes the same variance/covariance for all classes. As a result, the second term could no longer be something worth taking into the comparison process. Incidentally, that very term is what makes the above expression quadratic. Otherwise, its linear!

And thats (more or less) why LDA is ‘linear’ and QDA is ‘quadratic’ 🙂 .

2. Why would we ever choose LDA? Isn’t QDA conceptually more flexible?

QDA isn’t always better. In a certain way(too mathematically convoluted for this post), LDA does learn a linear boundary between the points belonging to different classes. QDA, on the other hand, learns a quadratic one. If the actual boundaries are indeed linear, QDA may have a higher model bias. – especially if the available data isn’t dense enough. Another(and easier to imagine) scenario is when you have a very limited set of training points. In such a case, if you divide your already sparse dataset into the constituent classes, the covariance matrix computed for each might be extremely inaccurate. In such cases, its better to simplify the entire process and use a common covariance matrix thats computed from the entire dataset we use while training.