On Interpretable Models

Artificial Intelligence is everywhere today. And as intelligent systems get more ubiquitous, the need to understand their behavior becomes equally important. Maybe if you are developing an algorithm to recognize a cat-photo for fun, you don’t need to know how it works as long as it delivers the results. But if you have deployed a model to predict whether a person will default on a loan or not, and you use it to make your decisions, you better be sure you are doing the right thing – for practical, ethical AND legal reasons.

From Dictionary.com,

interpretabilityto give or provide the meaning of; explain; explicate; elucidate

 

Why do models need to be interpretable?

The primary reason why we need explainability in AI, is to develop a sense of understanding and trust. Think about it – the only way you would ever delegate an important task to someone else, is if you had a certain level of trust in their thought process. If for instance Siri makes an error in understanding your commands, thats fine. But now consider self-driven cars. The reason why most people would not readily sit in a self-driven car within a city, is because we cannot guarantee if it will do the right thing in every situation. Interpretability is thus crucial for building trust towards models, especially in domains like healthcare, finance and the judicial system.

Interpretability is also important while debugging problems in a model’s performance. These problems might be caused due to the algorithm itself, or the data being used to train it. And you may not really observe these issues until you deploy a compound system that uses this model. Lets take the example of Google’s Word2Vec. Word2Vec is currently one of the best algorithms for computing word-embeddings given a significant amount of text. It was originally trained on a 3 million-word corpus of Google News articles. In some research conducted by people from Boston university and Microsoft Research, they found a ton of hidden sexism in the word-embeddings generated from that dataset. For example, the framework came up with this particular analogy: “man : computer programmer :: woman : homemaker”. Funny ain’t it? This was not a problem with the algorithm itself, but a screw-up of the way news articles are usually written. Quoting the source, “Any bias contained in word embeddings like those from Word2vec is automatically passed on in any application that exploits it.”.

How do we increase interpretability of models?

There are two ways to promote interpretability when it comes to Machine Learning/AI systems: Transparency, and Post-Hoc Explainability. Algorithmic transparency would mean that you understand the way your model works on an intuitive level, with respect to the dataset you used for training. A Decision Tree, for example, is pretty transparent – in fact, you can use the paths from the root to every leaf node to decompose the Tree into the set of rules used for classification. But a deep Neural Network is not so transparent, for obvious reasons. Though you may understand linear algebra and back-propagation, you will typically not be able to make sense of the weights/biases learned by a deep-NN after training.

Transparency has two aspects: Decomposability, and Simultaneity. Decomposability would mean understanding each individual component of your model. In essence, there should not be a ‘black-box’ component of the system in your eyes. Simultaneity, on the other hand, indicates an understanding of how all these individual components work together as a whole. And the former does not necessarily imply the latter – consider an algorithm as simple as linear regression. You would probably know that if the weight with respect to a predictor is positive after training, it shows a direct proportionality to the target variable. Now, if you train a simple linear regression of Disease-risk vs Vaccination, you would most probably get a negative weight on the Vaccination variable. But if you now take tricky factors such as immunodeficiency or age (old-age or infancy) into the picture, the weight might take on a whole different value. In fact, as the number of predictor variables goes on increasing in regression, it gets more and more difficult to understand how your model will behave as a whole. And thus, the notion that a ‘simple’ model (linear regression) would be far easier to interpret than a ‘complex’ one (deep learning) is misleading.

Post-Hoc means ‘occurring after an event’. In the context of model transparency, post-hoc interpretation would mean an effort to understand its behavior after it has finished training, typically using some test inputs. Some models/algorithms inherently have the ability to ‘explain’ their behavior. For example, take k-NN classifiers. Along with the required output, you can hack (with minimal effort) the model to return the k-nearest neighbors as examples for scrutiny. This way, you get a good idea of the combination of properties that produce similar results by looking at the known training points.

Most algorithms don’t have such easy post-hoc interpretability, though. In such cases, you have to use techniques such as visualization to understand how they behave/work. For instance, you could use a dimensionality reduction technique such as t-SNE to reduce vector points to 2-3 dimensions and visualize class ‘regions’ in 2D/3D space. Essentially, you are enabling easy visualization of higher-dimensional data by embedding it in a lower-dimensional space. Saliency maps are a technique used to interpret deep neural networks. In Natural Language Processing, textual explanations are also being adopted. Since humans usually understand words better than raw numbers, providing text-based explanations makes sense. For example, in a system like LSI, you could ‘understand’ a word’s embedding by (proportionately) looking at the words that strongly belong to the latent topic(s) is most relates to.

Conclusion and further reading

I did kind-of imply that interpretability is required so that we end up trusting automated systems as much as we trust humans. But as it turns out, its not like human actions are perfectly explainable. There is a ton of research in psychology that clearly indicates that the motivations for our actions are not as clear as we ourselves tend to believe. The Illusion of Conscious Will by Daniel Wegner talks about how our decisions tend to be influenced by subconscious processes without us realizing it. Moreover, it seems contradictory to the ultimate aim of AI to avoid building models that we cannot ‘understand’. If there will be machine intelligence smarter than us, the likelihood of us understanding it completely is pretty slim (Terminator, anyone?).

Heres a couple of links for you to look at, if you want to read more:

  1. EU’s General Data Protection Regulation (GDPR) makes interpretability a top priority.
  2. David Gunning’s summarization of the work on Explainable AI at the International Joint Conference on Artificial Intelligence.
  3. “Is Artificial Intelligence Permanently Inscrutable?” by Aaron Bornstein (Has some good discussion in the Comments section)

Fuzzy Speciation in Genetic Algorithms using k-d Trees

You would need to know the basics of Genetic Algorithms to understand this post. AI Junkie’s tutorial is a good place to get to upto speed. You would also need to know what k-D Trees are, for which the Wiki article is a good reference.

allopatric_speciation_med

What is speciation?

Speciation, in simple terms, is an evolutionary process by which new biological species arise. This concept is used in the implementation of evolutionary algorithms (such as GAs) as a form of optimization. To understand speciation, its necessary to understand what we mean by ‘species’.

A species is a group of organisms that – 1) possess some common prominent traits, and 2) can breed/mate with one-another. For example, consider cats and leopards. They share many characteristics, and hence exist in the same ‘family’ of animals. However, since they can’t really mate to provide a cat+leopard offspring, they don’t belong to the same species. The part 2) of the above definition may not be applicable in all cases (such as asexual plants). Nevertheless, it is what is important to us in the context of evolutionary algorithms.

Basically, a genetic algorithm (that implements speciation) will divide the population of candidates into a number of species. Essentially, each species will be a group of solutions that are allowed to crossover with each other. Candidates belonging to different species rarely mate.

Why do we need speciation?

A Genetic Algorithm chiefly deals with two genetic operations: Crossover and Mutation. Mutation works on a single candidate solution by perturbing it in some small way. Crossover, on the other hand, takes two candidate ‘parents’ (selected based on their fitness) and produces offspring(s) that share parts of their genome with the parents. This seems good enough, but it has some serious pitfalls.

Many times, crossover has a pretty high rate of failure (that is, it produces very inferior solutions). The reason for this, in simple terms, is: though both parents may individually have very high levels of fitness, the result of combining random parts of their genome together can be sub-optimal. This tends to happen when the parents come from distant parts of the search space, causing the offspring to lie in some entirely new region that has no optima in the vicinity. For example, imagine you are trying to evolve the fastest mammal. During a crossover, you select a deer (a fast mammal), and a cheetah(the fastest indeed). In this case, even if you could breed their offspring, it may have a random mixture of the anatomical features of its parents thats disastrous from the point of view of speed. In other words, mating fit parents does not guarantee a fit offspring, if the properties of the parents are vastly different.

In cases like these, speciation comes to the rescue. By restricting mating between greatly dissimilar candidates, evolution of sub-par candidates is minimized.

How is speciation implemented?

Usually, the two most common ways of implementing speciation in evolutionary algorithms are: Threshold Speciation and K-Means/Medoids clustering. Both are explained pretty well in Jeff Heaton’s book on Nature-Inspired algorithms in Artificial Intelligence.

The key feature of speciation is the fact that it prevents mating among candidates with highly differing characteristics. This can be thought of as promoting breeding between candidates that lie close to each other in the search-space. To achieve the same effect, I have been using k-d trees to perform nearest-neighbour searches to provide a form of speciation to my implementation of GAs. Its pretty simple(and intuitive) in terms of its approach, doesn’t degrade performance by too much and has been providing good results so far. So, here’s a basic outline of it for you all.

Implementation using k-D Trees

I will outline the algorithm first, and point out the nuances later.

First off, you must have a way of computing the ‘distance’/dissimilarity between candidate solutions. If your genome has a fixed length, you can use Euclidean distance.

Extra attributes required: 1. Starting Temperature (T), and 2. Neighbourhood Value(V). T must be in the range (0. 1). The higher this value, the greater is the tendency to neglect speciation (especially in the earlier iterations). The Neighbourhood Value V denotes the number of nearest candidates that will be considered in searches.

Algorithm:

I) The following selection process is repeated in each iteration until the new population size does not reach the required value:

Initial Values: Temperature t = T

i. Pick a random candidate parent based on fitness (using something like Roulette-Wheel selection, or Tournament selection)

ii.a. Pick a random number in (0, 1). If it is lesser than the current value of t, pick another parent randomly using the same selection process used in step i. If not, goto step ii.b.

ii.b. Using the k-D Tree trained over candidates from the current iteration, find the V nearest candidates to the parent selected in step i. Choose one randomly (using the process employed in step i), based on fitness.

iii. Use crossover with parents produced in steps i and ii. Mutate as/when appropriate. Add offspring to the pool of candidates of the next generation.

II) Once the new population grows to the required size, a new k-D Tree is generated using the set of all new candidates for the next iteration.

III) Decrease t for the next iteration.

Pointers

1) First of all, you need to have a way to compute dissimilarity between candidate solutions (apart from their fitness). As mentioned, if your genome size is constant, you could always use good-old Euclidean distance. If not, you might have to be creative. If you are using Scikit-Learn’s implementation of k-D Trees like me, you could check out the metrics it supports by doing kd_tree.valid_metrics over an instance.

2) Because the selection of the first parent in step i. depends only by the global measure for fitness, parts of the search space that generate good solutions will have higher numbers of candidates in subsequent iterations.

3) Unlike using K-Medoids for speciation, you don’t need to have a strict definition of the species that a given candidate belongs to. Nearest-neighbour searching ensures that a candidate usually mates only with other candidates that are similar to it.

4) As you might have understood, the Temperature parameter only adds some amount of exploratory tendencies to the algorithm. The higher the temperature, the greater is the tendency to ignore speciation. The decrease in t could most likely be done linearly from the user-provided starting point T, to a pre-defined minimum value(like 0.05), across iterations.

Effects on Performance of the GA

As mentioned in the discussion here, it doesn’t make much sense to analytically define the time-complexity of GAs. However, let me give you a rough idea of how things will change. Assume the total size of the population is n. First off, Roulette-Wheel selection (done in step i) takes O(n) time if done in a naive way (going through all candidates in some order one by one). If you do it using the stochastic acceptance method, it takes O(1) time. Usually, this selection will be done twice during every crossover. In case of the algorithm outlined above, for approximately (1-t) fraction of the second crossovers, the Roulette-wheel selection will be replaced by an O(log(n)) search using the k-D Tree and a much smaller (almost O(1)) Roulette Wheel selection amongst the nearest V candidates. Hence, unless your population size is very, very large, this won’t affect time that drastically.

The only significant increase in time will be seen as a result of the k-D Tree learning at the end of each iteration, which would add an O(nlog(n)) time to the time required for generating the next-gen pool of candidates (which is O(n^2) for naive implementations anyways). Since O(n^2 + nlog n) is equivalent to O(n^2), the performance decrease isn’t that significant, provided the size of your population isn’t really high. Even if you were using a highly efficient selection procedure during crossovers, for reasonable population sizes, your total run-time would only be multiplied by a factor less than 2 (mainly because crossover and mutation by themselves take some time too).

Thats all for now! Keep reading and cheers 🙂

Self-Organizing Maps with Google’s TensorFlow

[This post assumes that you know the basics of Google’s TensorFlow library. If you don’t, have a look at my earlier post to get started.]

kohonen1

A Self-Organizing Map, or SOM, falls under the rare domain of unsupervised learning in Neural Networks. Its essentially a grid of neurons, each denoting one cluster learned during training. Traditionally speaking, there is no concept of neuron ‘locations’ in ANNs. However, in an SOM, each neuron has a location, and neurons that lie close to each other represent clusters with similar properties. Each neuron has a weightage vector, which is equal to the centroid of its particular cluster.

AI-Junkie’s post does a great job of explaining how an SOM is trained, so I won’t re-invent the wheel.

The Code

Here’s my code for a 2-D version of an SOM. Its written with TensorFlow as its core training architecture: (Its heavily commented, so look at the inline docs if you want to hack/dig around)


import tensorflow as tf
import numpy as np


class SOM(object):
    """
    2-D Self-Organizing Map with Gaussian Neighbourhood function
    and linearly decreasing learning rate.
    """

    #To check if the SOM has been trained
    _trained = False

    def __init__(self, m, n, dim, n_iterations=100, alpha=None, sigma=None):
        """
        Initializes all necessary components of the TensorFlow
        Graph.

        m X n are the dimensions of the SOM. 'n_iterations' should
        should be an integer denoting the number of iterations undergone
        while training.
        'dim' is the dimensionality of the training inputs.
        'alpha' is a number denoting the initial time(iteration no)-based
        learning rate. Default value is 0.3
        'sigma' is the the initial neighbourhood value, denoting
        the radius of influence of the BMU while training. By default, its
        taken to be half of max(m, n).
        """

        #Assign required variables first
        self._m = m
        self._n = n
        if alpha is None:
            alpha = 0.3
        else:
            alpha = float(alpha)
        if sigma is None:
            sigma = max(m, n) / 2.0
        else:
            sigma = float(sigma)
        self._n_iterations = abs(int(n_iterations))

        ##INITIALIZE GRAPH
        self._graph = tf.Graph()

        ##POPULATE GRAPH WITH NECESSARY COMPONENTS
        with self._graph.as_default():

            ##VARIABLES AND CONSTANT OPS FOR DATA STORAGE

            #Randomly initialized weightage vectors for all neurons,
            #stored together as a matrix Variable of size [m*n, dim]
            self._weightage_vects = tf.Variable(tf.random_normal(
                [m*n, dim]))

            #Matrix of size [m*n, 2] for SOM grid locations
            #of neurons
            self._location_vects = tf.constant(np.array(
                list(self._neuron_locations(m, n))))

            ##PLACEHOLDERS FOR TRAINING INPUTS
            #We need to assign them as attributes to self, since they
            #will be fed in during training

            #The training vector
            self._vect_input = tf.placeholder("float", [dim])
            #Iteration number
            self._iter_input = tf.placeholder("float")

            ##CONSTRUCT TRAINING OP PIECE BY PIECE
            #Only the final, 'root' training op needs to be assigned as
            #an attribute to self, since all the rest will be executed
            #automatically during training

            #To compute the Best Matching Unit given a vector
            #Basically calculates the Euclidean distance between every
            #neuron's weightage vector and the input, and returns the
            #index of the neuron which gives the least value
            bmu_index = tf.argmin(tf.sqrt(tf.reduce_sum(
                tf.pow(tf.sub(self._weightage_vects, tf.pack(
                    [self._vect_input for i in range(m*n)])), 2), 1)),
                                  0)

            #This will extract the location of the BMU based on the BMU's
            #index
            slice_input = tf.pad(tf.reshape(bmu_index, [1]),
                                 np.array([[0, 1]]))
            bmu_loc = tf.reshape(tf.slice(self._location_vects, slice_input,
                                          tf.constant(np.array([1, 2]))),
                                 [2])

            #To compute the alpha and sigma values based on iteration
            #number
            learning_rate_op = tf.sub(1.0, tf.div(self._iter_input,
                                                  self._n_iterations))
            _alpha_op = tf.mul(alpha, learning_rate_op)
            _sigma_op = tf.mul(sigma, learning_rate_op)

            #Construct the op that will generate a vector with learning
            #rates for all neurons, based on iteration number and location
            #wrt BMU.
            bmu_distance_squares = tf.reduce_sum(tf.pow(tf.sub(
                self._location_vects, tf.pack(
                    [bmu_loc for i in range(m*n)])), 2), 1)
            neighbourhood_func = tf.exp(tf.neg(tf.div(tf.cast(
                bmu_distance_squares, "float32"), tf.pow(_sigma_op, 2))))
            learning_rate_op = tf.mul(_alpha_op, neighbourhood_func)

            #Finally, the op that will use learning_rate_op to update
            #the weightage vectors of all neurons based on a particular
            #input
            learning_rate_multiplier = tf.pack([tf.tile(tf.slice(
                learning_rate_op, np.array([i]), np.array([1])), [dim])
                                               for i in range(m*n)])
            weightage_delta = tf.mul(
                learning_rate_multiplier,
                tf.sub(tf.pack([self._vect_input for i in range(m*n)]),
                       self._weightage_vects))                                         
            new_weightages_op = tf.add(self._weightage_vects,
                                       weightage_delta)
            self._training_op = tf.assign(self._weightage_vects,
                                          new_weightages_op)                                       

            ##INITIALIZE SESSION
            self._sess = tf.Session()

            ##INITIALIZE VARIABLES
            init_op = tf.initialize_all_variables()
            self._sess.run(init_op)

    def _neuron_locations(self, m, n):
        """
        Yields one by one the 2-D locations of the individual neurons
        in the SOM.
        """
        #Nested iterations over both dimensions
        #to generate all 2-D locations in the map
        for i in range(m):
            for j in range(n):
                yield np.array([i, j])

    def train(self, input_vects):
        """
        Trains the SOM.
        'input_vects' should be an iterable of 1-D NumPy arrays with
        dimensionality as provided during initialization of this SOM.
        Current weightage vectors for all neurons(initially random) are
        taken as starting conditions for training.
        """

        #Training iterations
        for iter_no in range(self._n_iterations):
            #Train with each vector one by one
            for input_vect in input_vects:
                self._sess.run(self._training_op,
                               feed_dict={self._vect_input: input_vect,
                                          self._iter_input: iter_no})

        #Store a centroid grid for easy retrieval later on
        centroid_grid = [[] for i in range(self._m)]
        self._weightages = list(self._sess.run(self._weightage_vects))
        self._locations = list(self._sess.run(self._location_vects))
        for i, loc in enumerate(self._locations):
            centroid_grid[loc[0]].append(self._weightages[i])
        self._centroid_grid = centroid_grid

        self._trained = True

    def get_centroids(self):
        """
        Returns a list of 'm' lists, with each inner list containing
        the 'n' corresponding centroid locations as 1-D NumPy arrays.
        """
        if not self._trained:
            raise ValueError("SOM not trained yet")
        return self._centroid_grid

    def map_vects(self, input_vects):
        """
        Maps each input vector to the relevant neuron in the SOM
        grid.
        'input_vects' should be an iterable of 1-D NumPy arrays with
        dimensionality as provided during initialization of this SOM.
        Returns a list of 1-D NumPy arrays containing (row, column)
        info for each input vector(in the same order), corresponding
        to mapped neuron.
        """

        if not self._trained:
            raise ValueError("SOM not trained yet")

        to_return = []
        for vect in input_vects:
            min_index = min([i for i in range(len(self._weightages))],
                            key=lambda x: np.linalg.norm(vect-
                                                         self._weightages[x]))
            to_return.append(self._locations[min_index])

        return to_return

A few points about the code:

1) Since my post on K-Means Clustering, I have gotten more comfortable with matrix operations in TensorFlow. You need to be comfortable with matrices if you want to work with TensorFlow (or any data flow infrastructure for that matter, even SciPy). You can code pretty much any logic or operational flow with TensorFlow, you just need to be able to build up complex functionality from basic components(ops), and structure the flow of data(tensors/variables) well.

2) It took quite a while for me to build the whole graph in such a way that the entire training functionality could be enclosed in a single op. This op is called during each iteration, for every vector, during training. Such an implementation is more in line with TensorFlow’s way of doing things, than my previous attempt with clustering.

3) I have used a 2-D grid for the SOM, you can use any geometry you wish. You would just have to modify the _neuron_locations method appropriately, and also the method that returns the centroid outputs. You could return a dict that maps neuron location to the corresponding cluster centroid.

4) To keep things simple, I haven’t provided for online training. You could do that by having bounds for the learning rate(s).

Sample Usage

I have used PyMVPA’s example of RGB colours to confirm that the code does work. PyMVPA provides functionality to train SOMs too (along with many other learning techniques).

Here’s how you would do it with my code:


#For plotting the images
from matplotlib import pyplot as plt

#Training inputs for RGBcolors
colors = np.array(
     [[0., 0., 0.],
      [0., 0., 1.],
      [0., 0., 0.5],
      [0.125, 0.529, 1.0],
      [0.33, 0.4, 0.67],
      [0.6, 0.5, 1.0],
      [0., 1., 0.],
      [1., 0., 0.],
      [0., 1., 1.],
      [1., 0., 1.],
      [1., 1., 0.],
      [1., 1., 1.],
      [.33, .33, .33],
      [.5, .5, .5],
      [.66, .66, .66]])
color_names = \
    ['black', 'blue', 'darkblue', 'skyblue',
     'greyblue', 'lilac', 'green', 'red',
     'cyan', 'violet', 'yellow', 'white',
     'darkgrey', 'mediumgrey', 'lightgrey']

#Train a 20x30 SOM with 400 iterations
som = SOM(20, 30, 3, 400)
som.train(colors)

#Get output grid
image_grid = som.get_centroids()

#Map colours to their closest neurons
mapped = som.map_vects(colors)

#Plot
plt.imshow(image_grid)
plt.title('Color SOM')
for i, m in enumerate(mapped):
    plt.text(m[1], m[0], color_names[i], ha='center', va='center',
             bbox=dict(facecolor='white', alpha=0.5, lw=0))
plt.show()

Here’s a sample of the output you would get (varies each time you train, but the color names should go to the correct locations in the image):

figure_1

K-Means Clustering with TensorFlow

Google recently open-sourced its Artificial Intelligence/Numerical Computing library called TensorFlow. TensorFlow was developed by members of the Google Brain team, and has the flexibility to run on a variety of platforms – including GPUs and mobile devices.

687474703a2f2f7777772e616e64726f696463656e7472616c2e636f6d2f73697465732f616e64726f696463656e7472616c2e636f6d2f66696c65732f7374796c65732f6c617267652f7075626c69632f61727469636c655f696d616765732f323031352f31312f74656e736f72666c6f772e706e67

TensorFlow’s methodology uses what they called data-flow graphs. Consider the following diagram from the Wikipedia page on Genetic Programming (which could have some interesting applications with TensorFlow, I think):

Genetic_Program_TreeAs you probably understood, the graphical structure is a way of representing a computational expression in the form of a Tree. Every node is an operation (TensorFlow calls them ops, short for operations). The non-leaf nodes are pretty easy to understand. Some leaf nodes are a special case of an operation, always ‘returning’ a constant value (like 7 or 2.2 in the Tree). Others (like X or Y) act as placeholders that will be fed in at the time of execution. If you look at the arrows, you will realize that their directions denote the dependencies between outputs of different nodes. Hence, data (TensorFlow calls them Tensors) will flow in the opposite direction along each node – Hence the name TensorFlow. TensorFlow provides other components over this graphical abstraction, like persistent memory elements that retain data (called Variables), and optimization techniques to fine-tune the parameters in these Variables in applications like Neural Networks.

TensorFlow has a powerful Python API. The TensorFlow team has done an awesome job of writing the documentation (which is a little tricky to navigate). If you are completely new to this, heres a few links to get you started (in the order you should visit them):

1. Basic setup

2. Read this example to get a vague idea of what a TensorFlow code looks like.

3. Now read this explanation of the basic components of TensorFlow. It helps if you read the above example again, or simultaneously.

4. Read this detailed example of using TensorFlow for a common ML problem.

5. Once you have a decent understanding of the basic components and how they work, you can look at the Python docs for reference.

Now here is the code I wrote for K-Means clustering using TensorFlow. As a disclaimer, I will mention that this code is based on my (at the time of writing this) 2-day old understanding of how the library works. If you find any errors or know any optimizations possible, do drop a comment! The code is heavily documented, so do go through in-line docs.


import tensorflow as tf
from random import choice, shuffle
from numpy import array


def TFKMeansCluster(vectors, noofclusters):
    """
    K-Means Clustering using TensorFlow.
    'vectors' should be a n*k 2-D NumPy array, where n is the number
    of vectors of dimensionality k.
    'noofclusters' should be an integer.
    """

    noofclusters = int(noofclusters)
    assert noofclusters < len(vectors)

    #Find out the dimensionality
    dim = len(vectors[0])

    #Will help select random centroids from among the available vectors
    vector_indices = list(range(len(vectors)))
    shuffle(vector_indices)

    #GRAPH OF COMPUTATION
    #We initialize a new graph and set it as the default during each run
    #of this algorithm. This ensures that as this function is called
    #multiple times, the default graph doesn't keep getting crowded with
    #unused ops and Variables from previous function calls.

    graph = tf.Graph()

    with graph.as_default():

        #SESSION OF COMPUTATION

        sess = tf.Session()

        ##CONSTRUCTING THE ELEMENTS OF COMPUTATION

        ##First lets ensure we have a Variable vector for each centroid,
        ##initialized to one of the vectors from the available data points
        centroids = [tf.Variable((vectors[vector_indices[i]]))
                     for i in range(noofclusters)]
        ##These nodes will assign the centroid Variables the appropriate
        ##values
        centroid_value = tf.placeholder("float64", [dim])
        cent_assigns = []
        for centroid in centroids:
            cent_assigns.append(tf.assign(centroid, centroid_value))

        ##Variables for cluster assignments of individual vectors(initialized
        ##to 0 at first)
        assignments = [tf.Variable(0) for i in range(len(vectors))]
        ##These nodes will assign an assignment Variable the appropriate
        ##value
        assignment_value = tf.placeholder("int32")
        cluster_assigns = []
        for assignment in assignments:
            cluster_assigns.append(tf.assign(assignment,
                                             assignment_value))

        ##Now lets construct the node that will compute the mean
        #The placeholder for the input
        mean_input = tf.placeholder("float", [None, dim])
        #The Node/op takes the input and computes a mean along the 0th
        #dimension, i.e. the list of input vectors
        mean_op = tf.reduce_mean(mean_input, 0)

        ##Node for computing Euclidean distances
        #Placeholders for input
        v1 = tf.placeholder("float", [dim])
        v2 = tf.placeholder("float", [dim])
        euclid_dist = tf.sqrt(tf.reduce_sum(tf.pow(tf.sub(
            v1, v2), 2)))

        ##This node will figure out which cluster to assign a vector to,
        ##based on Euclidean distances of the vector from the centroids.
        #Placeholder for input
        centroid_distances = tf.placeholder("float", [noofclusters])
        cluster_assignment = tf.argmin(centroid_distances, 0)

        ##INITIALIZING STATE VARIABLES

        ##This will help initialization of all Variables defined with respect
        ##to the graph. The Variable-initializer should be defined after
        ##all the Variables have been constructed, so that each of them
        ##will be included in the initialization.
        init_op = tf.initialize_all_variables()

        #Initialize all variables
        sess.run(init_op)

        ##CLUSTERING ITERATIONS

        #Now perform the Expectation-Maximization steps of K-Means clustering
        #iterations. To keep things simple, we will only do a set number of
        #iterations, instead of using a Stopping Criterion.
        noofiterations = 100
        for iteration_n in range(noofiterations):

            ##EXPECTATION STEP
            ##Based on the centroid locations till last iteration, compute
            ##the _expected_ centroid assignments.
            #Iterate over each vector
            for vector_n in range(len(vectors)):
                vect = vectors[vector_n]
                #Compute Euclidean distance between this vector and each
                #centroid. Remember that this list cannot be named
                #'centroid_distances', since that is the input to the
                #cluster assignment node.
                distances = [sess.run(euclid_dist, feed_dict={
                    v1: vect, v2: sess.run(centroid)})
                             for centroid in centroids]
                #Now use the cluster assignment node, with the distances
                #as the input
                assignment = sess.run(cluster_assignment, feed_dict = {
                    centroid_distances: distances})
                #Now assign the value to the appropriate state variable
                sess.run(cluster_assigns[vector_n], feed_dict={
                    assignment_value: assignment})

            ##MAXIMIZATION STEP
            #Based on the expected state computed from the Expectation Step,
            #compute the locations of the centroids so as to maximize the
            #overall objective of minimizing within-cluster Sum-of-Squares
            for cluster_n in range(noofclusters):
                #Collect all the vectors assigned to this cluster
                assigned_vects = [vectors[i] for i in range(len(vectors))
                                  if sess.run(assignments[i]) == cluster_n]
                #Compute new centroid location
                new_location = sess.run(mean_op, feed_dict={
                    mean_input: array(assigned_vects)})
                #Assign value to appropriate variable
                sess.run(cent_assigns[cluster_n], feed_dict={
                    centroid_value: new_location})

        #Return centroids and assignments
        centroids = sess.run(centroids)
        assignments = sess.run(assignments)
        return centroids, assignments

Never, ever, EVER, do something like this:


for i in range(100):
    x = sess.run(tf.assign(variable1, placeholder))

This may seem pretty harmless at first glance, but every time you initialize an op, (like tf.assign or even tf.zeros, you are adding new ops instances to the default graph. Instead, as shown in the code, define a particular op for each task (however specialized) just once in the code. Then, during every of your iterations, call sess.run over the required nodes. To check if you are crowding your graph with unnecessary ops, just print out the value of len(graph.get_operations()) during every iteration and see if its is increasing. In fact, sess.run should be the only way you interact with the graph during every iteration.

As visible on lines 138 and 139, you can call sess.run over a list of ops/Variables to return a list of the outputs in the same order.

There are a lot of intricacies of TensorFlow that this code does not go into, such as assigning devices to nodes, Graph collections, dependencies, etc. Thats partially because I am still understanding these aspects one by one. But at first glance, TensorFlow seems to be a pretty powerful and flexible way of doing AI/ML-based computations. I would personally like to explore its applications in developing dependency-based statistical metrics for data – for which I am currently using custom tree-like data structures. Lets hope this gesture by Google does lead to an increase in the applications and research in AI. Cheers!

Generating rudimentary Mind-Maps from Word2Vec models

Mind Maps are notorious for being a very powerful organizational tool for a variety of tasks, such as brainstorming, planning and problem solving. Visual (or rather, graphical) arrangement of ideas aids the thought process, and in fact mimics the way we explore our mental knowledge base while ‘thinking’. There are a lot of online tools available for drawing out mind maps, but none that can generate one. By generate, I mean coming up with the verbal content.

For the longest time (almost 8 months now), I have been tinkering with ways to combine text mining and graph theory into a framework to generate a Mind-Map (given a text document). Ofcourse, the first thing argument would be that there cannot be a single possible Mind-Map for any block of text. And its true! However, having such an automated Map as a reference while building your own, might give you more insights (especially while brainstorming), or help you remember links that you might miss out (for studying). Lets see what a Mind-Map looks like:

laws_mindmap

Two points:

i. A Mind-Map is NOT a tree that divides the overall topic into its subtopics recursively. Its infact more like a graph, that links terms that are semantically related.

ii. Like ‘Night’ might make the word ‘Day’ pop up in your mind, a mind-map may have links between opposite meaning concepts, like Thicker-Thinner in the above example.

There are of course other points like using images to enhance concepts, and so on. But thats not the point of this post (And I suck at designer-style creativity anyways). Heres an article to help you get acquainted with the process of building and using your own Mind-Maps, just in case.

In my last post, I described a method to generate a Word2Vec model from a text document (where I used Wikipedia articles as an example). I will now describe the methodology I followed to generate a rudimentary mind-map from a Wikipedia article’s Word2Vec model model.

Step 1: Figuring out the top n terms from the document

(As I mentioned in my previous post, I only use stemmed unigrams. You can of course use higher-order ngrams, but that makes things a little tricky (but more accurate, if your algorithms for n-gram generation are solid).)

Here, n denotes the number of ‘nodes’ in my Mind-Map. In my trial-and-errors, 50 is usually a good enough number. Too less means too little information, and too much would mean noisy mind-maps. You can obviously play around with different choices of n. I use the co-occurrence based technique described in this paper to list out the top n words in a document. Heres the Python code for it:


def _get_param_matrices(vocabulary, sentence_terms):
    """
    Returns
    =======
    1. Top 300(or lesser, if vocab is short) most frequent terms(list)
    2. co-occurence matrix wrt the most frequent terms(dict)
    3. Dict containing Pg of most-frequent terms(dict)
    4. nw(no of terms affected) of each term(dict)
    """

    #Figure out top n terms with respect to mere occurences
    n = min(300, len(vocabulary))
    topterms = list(vocabulary.keys())
    topterms.sort(key = lambda x: vocabulary[x], reverse = True)
    topterms = topterms[:n]

    #nw maps term to the number of terms it 'affects'
    #(sum of number of terms in all sentences it
    #appears in)
    nw = {}
    #Co-occurence values are wrt top terms only
    co_occur = {}
    #Initially, co-occurence matrix is empty
    for x in vocabulary:
        co_occur[x] = [0 for i in range(len(topterms))]

    #Iterate over list of all sentences' vocabulary dictionaries
    #Build the co-occurence matrix
    for sentence in sentence_terms:
        total_terms = sum(list(sentence.values()))
        #This list contains the indices of all terms from topterms,
        #that are present in this sentence
        top_indices = []
        #Populate top_indices
        top_indices = [topterms.index(x) for x in sentence
                       if x in topterms]
        #Update nw dict, and co-occurence matrix
        for term in sentence:
            nw[term] = nw.get(term, 0) + total_terms
            for index in top_indices:
                co_occur[term][index] += (sentence[term] *
                                          sentence[topterms[index]])

    #Pg is just nw[term]/total vocabulary of text
    Pg = {}
    N = sum(list(vocabulary.values()))
    for x in topterms:
        Pg[x] = float(nw[x])/N

    return topterms, co_occur, Pg, nw


def get_top_n_terms(vocabulary, sentence_terms, n=50):
    """
    Returns the top 'n' terms from a block of text, in the form of a list,
    from most important to least.

    'vocabulary' should be a dict mapping each term to the number
    of its occurences in the entire text.
    'sentence_terms' should be an iterable of dicts, each denoting the
    vocabulary of the corresponding sentence.
    """

    #First compute the matrices
    topterms, co_occur, Pg, nw = _get_param_matrices(vocabulary,
                                                     sentence_terms)

    #This dict will map each term to its weightage with respect to the
    #document
    result = {}

    N = sum(list(vocabulary.values()))
    #Iterates over all terms in vocabulary
    for term in co_occur:
        term = str(term)
        org_term = str(term)
        for x in Pg:
            #expected_cooccur is the expected cooccurence of term with this
            #term, based on nw value of this and Pg value of the other
            expected_cooccur = nw[term] * Pg[x]
            #Result measures the difference(in no of terms) of expected
            #cooccurence and  actual cooccurence
            result[org_term] = ((co_occur[term][topterms.index(x)] -
                                 expected_cooccur)**2/ float(expected_cooccur))

    terms = list(result.keys())
    terms.sort(key=lambda x: result[x],
               reverse=True)

    return terms[:n]

The get_top_n_terms function does the job, and I guess the docstrings and in-line comments explain how (combined with the paper, of course). If you have the patience and time, you can infact just see the entire vocabulary of your Word2Vec model and pick out those terms that you want to see in your Mind-Map. This is likely to give you the best results (but with a lot of efforts).

Step 2: Deciding the Root

The Root would be that term out of your nodes, which denotes the central idea behind your entire Mind-Map. Since the number of nodes is pretty small compared to the vocabulary, its best to pick this one out manually. OR, you could use that term which has the highest occurrence in the vocabulary, among the selected nodes. This step may require some trial and error (But then what part of data science doesn’t?).

Step 3: Generating the graph (Mind-Map)

This is of course the most crucial step, and the one I spent the most time on. First off, let me define what I call the contextual vector of a term.

Say the root of the Mind Map is ‘computer’. It is linked to the term ‘hardware’. ‘hardware’ is in turn linked to ‘keyboard’. The Word2Vec vector of ‘keyboard’ would be obtained as model[keyboard] in the Python/Gensim environment. Lets denote it with v_{keyboard}.

Now consider yourself in the position of someone building a Mind Map. When you think of ‘keyboard’, given the structure of what you have come up with so far, you will be thinking of it in the context of ‘computer’ and ‘hardware’. Thats why you probably won’t link ‘keyboard’ to ‘music’ (atleast not directly). This basically shows that the contextual vector for ‘keyboard’ (lets call it v'_{keyboard}) must be biased in its direction (since we use cosine similarity with Word2Vec models, only directions matter) towards v_{computer} and v_{hardware}. Moreover, intuitively speaking, the influence of v_{hardware} on v'_{keyboard} should be greater than that of v_{computer} – in essence, the influence of the context of a parent reduces as you go further and further away from it. To take this into account, I use what I call the contextual decay factor \alpha. Expressing it mathematically,

v'_{computer} = v_{computer}

v'_{hardware} = (1 - \alpha) v_{hardware} + \alpha v'_{computer}

v'_{keyboard} = (1 - \alpha) v_{keyboard} + \alpha v'_{hardware}

And so on…

Finally, to generate the actual Mind-Map, heres the algorithm I use (I hope the inline comments are enough to let you know what I have done):


from scipy.spatial.distance import cosine
from networkx import Graph

def build_mind_map(model, stemmer, root, nodes, alpha=0.2):
    """
    Returns the Mind-Map in the form of a NetworkX Graph instance.

    'model' should be an instance of gensim.models.Word2Vec
    'nodes' should be a list of terms, included in the vocabulary of
    'model'.
    'root' should be the node that is to be used as the root of the Mind
    Map graph.
    'stemmer' should be an instance of StemmingHelper.
    """

    #This will be the Mind-Map
    g = Graph()

    #Ensure that the every node is in the vocabulary of the Word2Vec
    #model, and that the root itself is included in the given nodes
    for node in nodes:
        if node not in model.vocab:
            raise ValueError(node + " not in model's vocabulary")
    if root not in nodes:
        raise ValueError("root not in nodes")

    ##Containers for algorithm run
    #Initially, all nodes are unvisited
    unvisited_nodes = set(nodes)
    #Initially, no nodes are visited
    visited_nodes = set([])
    #The following will map visited node to its contextual vector
    visited_node_vectors = {}
    #Thw following will map unvisited nodes to (closest_distance, parent)
    #parent will obviously be a visited node
    node_distances = {}

    #Initialization with respect to root
    current_node = root
    visited_node_vectors[root] = model[root]
    unvisited_nodes.remove(root)
    visited_nodes.add(root)

    #Build the Mind-Map in n-1 iterations
    for i in range(1, len(nodes)):
        #For every unvisited node 'x'
        for x in unvisited_nodes:
            #Compute contextual distance between current node and x
            dist_from_current = cosine(visited_node_vectors[current_node],
                                       model[x])
            #Get the least contextual distance to x found until now
            distance = node_distances.get(x, (100, ''))
            #If current node provides a shorter path to x, update x's
            #distance and parent information
            if distance[0] > dist_from_current:
                node_distances[x] = (dist_from_current, current_node)

        #Choose next 'current' as that unvisited node, which has the
        #lowest contextual distance from any of the visited nodes
        next_node = min(unvisited_nodes,
                        key=lambda x: node_distances[x][0])

        ##Update all containers
        parent = node_distances[next_node][1]
        del node_distances[next_node]
        next_node_vect = ((1 - alpha)*model[next_node] +
                          alpha*visited_node_vectors[parent])
        visited_node_vectors[next_node] = next_node_vect
        unvisited_nodes.remove(next_node)
        visited_nodes.add(next_node)

        #Add the link between newly selected node and its parent(from the
        #visited nodes) to the NetworkX Graph instance
        g.add_edge(stemmer.original_form(parent).capitalize(),
                   stemmer.original_form(next_node).capitalize())

        #The new node becomes the current node for the next iteration
        current_node = next_node

    return g

Notes: I use NetworkX’s simple Graph-building infrastructure to do the core job of maintaining the Mind-Map (makes it easier later for visualization too). To compute cosine distance, I use SciPy. Moreover, on lines 74 and 75, I use the StemmingHelper class from my last post to include the stemmed words in their original form in the actual mind-map (instead of their stemmed versions). You can pass the StemmingHelper class directly as the parameter stemmer. On the other hand, if you aren’t using stemming at all, just remove those parts of the code on lines 4, 74, and 75.

If you look at the algorithm, you will realize that its somewhat like Dijkstra’s algorithm for single-source shortest paths, but in a different context.

Example Outputs

Now for the results. (I used PyGraphViz for simple, quick-and-dirty visualization)

Heres the Mind-Map that was generated for the Wikipedia article on Machine Learning:

ml

One on Artificial Intellgence:

aiAnd finally, one on Psychology:

yoyo

The results are similar on all the other topics I tried out. A few things I noticed:

i. I should try and involve bi-grams and trigrams too. I am pretty sure it will make the Word2Vec model itself way stronger, and thus improve the interpretation of terms with respect to the document.

ii. There are some unnecessary terms in the Mind Maps, but given the short length of the texts (compared to most text mining tasks), the Keyword extraction algorithm in the paper I mentioned before, seems really good.

iii. Maybe one could use this for brainstorming. Like you start out with a term(node) of your choice. Then, the framework suggests you terms to connect it to. Once you select one of them, you get further recommendations for it based on the context, etc. – Something like a Mind-Map helper.

Anyways, this was a long post, and thanks a lot if you stuck to the end :-). Cheers!

Combining Genetic Algorithms and Recommendation Systems: Some thoughts

I had tinkered with Genetic Algorithms for a bit during college, working on a method to optimize stock portfolios using historical trends. Algorithms inspired from biology have always fascinated me with their elegance and intuitiveness. If you can relate an algorithm with something other than diagrams/numbers, there’s always a good chance you will understand and use it better. With concepts such as Swarm/Colony Optimization or Evolutionary Algorithms, the ease of understanding can be attributed to their roots in what we learn right in school.

Of course, they aren’t always the right way to go about doing things- especially when there exists an efficient and mathematically sound way of reaching the goal. For example, if Lagrangians can get you there fast enough, you probably don’t need a Particle Optimizer. The main reason being that most biologically inspired algorithms come with no guarantee- given a good definition of ‘utility’ and time, they can provide a good or even great solution- but not necessarily the best one(as promised by analytical algorithms). However, if your solution space is too large and/or you don’t fully understand the nature of your utility function(or it depends on too many factors), its a good idea to try stochastic optimization methods like genetic algorithms.

If you are new to genetic algorithms, heres a good place to learn the basics. All GAs follow the same basic format of working. What changes is how you represent your candidate solutions (the chromosomes) and how you perform key operations on them (genetic operators– selection, crossover, mutation). Selection is usually governed by the fitness function, which is perhaps the most important part of your GA- defining how good a solution is, for the problem you are trying to solve. Usually, the easier it is for you to define your fitness function, the less likely it is that you need a genetic algorithm for your problem (ironically).

Anyways, I was thinking of how one could use a genetic algorithm as a recommendation system. The link between the two is clearer if you think of recommendation-sets as evolving instead of getting generated. Let me explain a little more…

1. Chromosome

Your individual chromosome would be a set of recommendations, such as [r1, r2, …, rk]. By analogy, each of the ri‘s would denote a gene. The length of the recommendation-sets could be fixed or variable, depending on what you start out with, and what your crossover/mutation operators are.

2. Fitness function

Dealing with recommendation-sets instead of individual recommendations would allow you to judge the ‘fitness’ of a set of recommendations as a whole. The fitness of a candidate set would typically be:

a. Directly proportional to the average relevance of a recommendation to the user. (Using similarity functions such as the Jaccard coefficient on vectorial representations of recos and users- how you do it is upto you).

b. Inversely proportional to how many times the user has seen one of those recommendations in the past, and how long before.

c. Directly proportional to how diverse the recommendations are. (Maybe use something similar to an entropy for this?)

3. Selection

Selection could be done using a simple Roulette wheel, or using better explore/exploit techniques such as Boltzmann selection or Sigma scaling. These techniques allow you to explore more of your domain in the earlier stages, when you don’t know much about what the ideal solution should be. As the GA iterations progress, they give greater weightage to better solutions in the selection for ‘mating’.

4. Crossover

Crossovers could be simple single-point or two-point. Since the order of recommendations does not matter, even uniform crossovers would do the trick. Depending on the way you implement this, you could have variable-length recommendation sets (But if you prefer to not have too huge or too small sets, make the fitness function correspondingly dependent on the length of your candidate).

5. Mutation

Simply speaking, this would mean replacing/adding/subtracting a random recommendation in the set. This could be tricky if the domain of your recommendations is too large. In such a case, you could use some heuristics, such as adding items from ones most frequently bought/used, those used by users most similar to the current user, etc.

6. When to perform the iterations?

Every time there is a sufficient change in the user’s profile or the environment(other users/more recommendation items being added), you could run a given number of iterations of the GA. OR, when you have exhausted all the sets generated in the last run of the GA, by showing them one by one. The advantage of this approach(and the whole idea of using GAs for recommendation systems) is that you could use the candidates generated in the last run, as the starting population of the next run. Usually, the user’s profile and environment in consideration wouldn’t be radically different from the last run, and hence the past ‘best’ candidates would be a good place to start the new evolution process.

Ofcourse, this doesn’t solve many of the problems currently plaguing recommendation systems, such as cold-start. However, using biological evolution as the blueprint to keep generating newer sets of recommendations could be an interesting way to go about things :-).