K-Means Clustering with TensorFlow

Google recently open-sourced its Artificial Intelligence/Numerical Computing library called TensorFlow. TensorFlow was developed by members of the Google Brain team, and has the flexibility to run on a variety of platforms – including GPUs and mobile devices.


TensorFlow’s methodology uses what it calls data-flow graphs. Consider the following diagram from the Wikipedia page on Genetic Programming (which could have some interesting applications with TensorFlow, I think):

[Diagram from Wikipedia: a genetic programming expression tree]

As you have probably gathered, the graphical structure is a way of representing a computational expression in the form of a tree. Every node is an operation (TensorFlow calls them ops, short for operations). The non-leaf nodes are pretty easy to understand. Some leaf nodes are a special case of an operation that always ‘returns’ a constant value (like 7 or 2.2 in the tree). Others (like X or Y) act as placeholders that will be fed in at the time of execution. If you look at the arrows, you will realize that their directions denote the dependencies between the outputs of different nodes. Data (TensorFlow calls them Tensors) therefore flows in the opposite direction along each edge – hence the name TensorFlow. TensorFlow provides other components on top of this graphical abstraction, like persistent memory elements that retain data (called Variables), and optimization techniques to fine-tune the parameters held in those Variables in applications like Neural Networks.
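
To make this concrete, here is a minimal sketch (in the same 0.x-era TensorFlow API as the code later in this post) that builds a small expression tree out of constants, placeholders and ops, and then runs it. The expression – roughly (2.2 - X/11) + 7*cos(Y) – is just an illustrative stand-in for the tree above.

import tensorflow as tf

#Leaf ops: constants, and placeholders that get fed in at execution time
x = tf.placeholder("float", name="X")
y = tf.placeholder("float", name="Y")

#Non-leaf ops combine the outputs of the nodes they depend on
expr = tf.add(tf.sub(tf.constant(2.2), tf.div(x, tf.constant(11.0))),
              tf.mul(tf.constant(7.0), tf.cos(y)))

sess = tf.Session()
#Tensors actually flow through the graph only when we run it
print(sess.run(expr, feed_dict={x: 3.0, y: 0.0}))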

TensorFlow has a powerful Python API. The TensorFlow team has done an awesome job of writing the documentation (which is a little tricky to navigate). If you are completely new to this, here are a few links to get you started (in the order you should visit them):

1. Basic setup

2. Read this example to get a rough idea of what TensorFlow code looks like.

3. Now read this explanation of the basic components of TensorFlow. It helps if you read the above example again, or simultaneously.

4. Read this detailed example of using TensorFlow for a common ML problem.

5. Once you have a decent understanding of the basic components and how they work, you can look at the Python docs for reference.

Now here is the code I wrote for K-Means clustering using TensorFlow. As a disclaimer, I will mention that this code is based on my (at the time of writing this) 2-day-old understanding of how the library works. If you find any errors or know of any possible optimizations, do drop a comment! The code is heavily documented, so do go through the in-line docs.


import tensorflow as tf
from random import choice, shuffle
from numpy import array


def TFKMeansCluster(vectors, noofclusters):
    """
    K-Means Clustering using TensorFlow.
    'vectors' should be an n*k 2-D NumPy array, where n is the number
    of vectors of dimensionality k.
    'noofclusters' should be an integer.
    """

    noofclusters = int(noofclusters)
    assert noofclusters < len(vectors)

    #Find out the dimensionality
    dim = len(vectors[0])

    #Will help select random centroids from among the available vectors
    vector_indices = list(range(len(vectors)))
    shuffle(vector_indices)

    #GRAPH OF COMPUTATION
    #We initialize a new graph and set it as the default during each run
    #of this algorithm. This ensures that as this function is called
    #multiple times, the default graph doesn't keep getting crowded with
    #unused ops and Variables from previous function calls.

    graph = tf.Graph()

    with graph.as_default():

        #SESSION OF COMPUTATION

        sess = tf.Session()

        ##CONSTRUCTING THE ELEMENTS OF COMPUTATION

        ##First, let's ensure we have a Variable vector for each centroid,
        ##initialized to one of the vectors from the available data points
        centroids = [tf.Variable((vectors[vector_indices[i]]))
                     for i in range(noofclusters)]
        ##These nodes will assign the centroid Variables the appropriate
        ##values
        centroid_value = tf.placeholder("float64", [dim])
        cent_assigns = []
        for centroid in centroids:
            cent_assigns.append(tf.assign(centroid, centroid_value))

        ##Variables for cluster assignments of individual vectors (initialized
        ##to 0 at first)
        assignments = [tf.Variable(0) for i in range(len(vectors))]
        ##These nodes will assign an assignment Variable the appropriate
        ##value
        assignment_value = tf.placeholder("int32")
        cluster_assigns = []
        for assignment in assignments:
            cluster_assigns.append(tf.assign(assignment,
                                             assignment_value))

        ##Now let's construct the node that will compute the mean
        #The placeholder for the input
        mean_input = tf.placeholder("float", [None, dim])
        #The Node/op takes the input and computes a mean along the 0th
        #dimension, i.e. the list of input vectors
        mean_op = tf.reduce_mean(mean_input, 0)

        ##Node for computing Euclidean distances
        #Placeholders for input
        v1 = tf.placeholder("float", [dim])
        v2 = tf.placeholder("float", [dim])
        euclid_dist = tf.sqrt(tf.reduce_sum(tf.pow(tf.sub(
            v1, v2), 2)))

        ##This node will figure out which cluster to assign a vector to,
        ##based on Euclidean distances of the vector from the centroids.
        #Placeholder for input
        centroid_distances = tf.placeholder("float", [noofclusters])
        cluster_assignment = tf.argmin(centroid_distances, 0)

        ##INITIALIZING STATE VARIABLES

        ##This will help initialization of all Variables defined with respect
        ##to the graph. The Variable-initializer should be defined after
        ##all the Variables have been constructed, so that each of them
        ##will be included in the initialization.
        init_op = tf.initialize_all_variables()

        #Initialize all variables
        sess.run(init_op)

        ##CLUSTERING ITERATIONS

        #Now perform the Expectation-Maximization steps of K-Means clustering
        #iterations. To keep things simple, we will only do a set number of
        #iterations, instead of using a Stopping Criterion.
        noofiterations = 100
        for iteration_n in range(noofiterations):

            ##EXPECTATION STEP
            ##Based on the centroid locations till last iteration, compute
            ##the _expected_ centroid assignments.
            #Iterate over each vector
            for vector_n in range(len(vectors)):
                vect = vectors[vector_n]
                #Compute Euclidean distance between this vector and each
                #centroid. Remember that this list cannot be named
                #'centroid_distances', since that is the input to the
                #cluster assignment node.
                distances = [sess.run(euclid_dist, feed_dict={
                    v1: vect, v2: sess.run(centroid)})
                             for centroid in centroids]
                #Now use the cluster assignment node, with the distances
                #as the input
                assignment = sess.run(cluster_assignment, feed_dict = {
                    centroid_distances: distances})
                #Now assign the value to the appropriate state variable
                sess.run(cluster_assigns[vector_n], feed_dict={
                    assignment_value: assignment})

            ##MAXIMIZATION STEP
            #Based on the expected state computed from the Expectation Step,
            #compute the locations of the centroids so as to maximize the
            #overall objective of minimizing within-cluster Sum-of-Squares
            for cluster_n in range(noofclusters):
                #Collect all the vectors assigned to this cluster
                assigned_vects = [vectors[i] for i in range(len(vectors))
                                  if sess.run(assignments[i]) == cluster_n]
                #Compute new centroid location
                new_location = sess.run(mean_op, feed_dict={
                    mean_input: array(assigned_vects)})
                #Assign value to appropriate variable
                sess.run(cent_assigns[cluster_n], feed_dict={
                    centroid_value: new_location})

        #Return centroids and assignments
        centroids = sess.run(centroids)
        assignments = sess.run(assignments)
        return centroids, assignments
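
If you want to try it out, a quick sanity check might look something like this (the data here is arbitrary – any 2-D NumPy array of floats should do):

import numpy as np

#Two well-separated blobs of 2-D points
data = np.vstack([np.random.randn(50, 2),
                  np.random.randn(50, 2) + 10.0])

centroids, assignments = TFKMeansCluster(data, 2)
print(centroids)         #the two centroid vectors
print(assignments[:10])  #cluster indices of the first ten points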

Never, ever, EVER, do something like this:


for i in range(100):
    x = sess.run(tf.assign(variable1, placeholder))

This may seem pretty harmless at first glance, but every time you initialize an op (like tf.assign or even tf.zeros), you are adding new op instances to the default graph. Instead, as shown in the code, define a particular op for each task (however specialized) just once, and then only call sess.run over the required nodes during your iterations. To check whether you are crowding your graph with unnecessary ops, just print out the value of len(graph.get_operations()) in every iteration and see if it is increasing. In fact, sess.run should be the only way you interact with the graph inside a loop.
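
For instance, the offending loop above can be rewritten so that the assign op is built exactly once – a small sketch; note that the op count printed inside the loop stays flat:

import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    variable1 = tf.Variable(0.0)
    placeholder = tf.placeholder("float")
    #Define the assign op ONCE, outside the loop
    assign_op = tf.assign(variable1, placeholder)

    sess = tf.Session()
    sess.run(tf.initialize_all_variables())

    for i in range(100):
        x = sess.run(assign_op, feed_dict={placeholder: float(i)})
        #This number should not grow from one iteration to the next
        print(len(graph.get_operations()))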

As visible in the last two lines of TFKMeansCluster (where the final centroids and assignments are fetched), you can call sess.run over a list of ops/Variables to return a list of the outputs in the same order.

There are a lot of intricacies of TensorFlow that this code does not go into, such as assigning devices to nodes, Graph collections, dependencies, etc. That’s partially because I am still understanding these aspects one by one. But at first glance, TensorFlow seems to be a pretty powerful and flexible way of doing AI/ML-based computations. I would personally like to explore its applications in developing dependency-based statistical metrics for data – for which I am currently using custom tree-like data structures. Let’s hope this gesture by Google does lead to an increase in applications and research in AI. Cheers!

Interesting take-aways from ‘Data Science For Business’

I have recently been reading Data Science for Business by Foster Provost and Tom Fawcett. I picked it up on Amazon some time back, while trying to find some good books on the applications of machine learning in industries. And I hit the bull’s eye with this one!

Data Science for Business does an awesome job of getting the reader acquainted with all the basic (as well as some niche) applications of ML frameworks in a business context. What this book does not do is give you the rigor behind the algorithms discussed in it. The focus is on the when and why, not the how. For example, you will find detailed descriptions of the kinds of problems Association Mining solves – from Market Basket Analysis to mining Facebook ‘like’ relationships – but you won’t find details of Apriori or any other actual association mining algorithm. The reason I love this book is that it does what most other ML texts don’t: give the reader an intuitive understanding of when to apply a certain technique and why. For a newbie, reading this book side-by-side with an established text like Elements of Statistical Learning would perhaps be the best way to get acquainted with Machine Learning and its applications.

There were a lot of ‘Aha!’ moments for me while reading this book, especially along the lines of ‘I never thought it could be used this way!’ or ‘That’s a good rule of thumb to follow while applying this!’. To give you a brief idea, here are three interesting things I learned (or revisited) while reading this book:

1. CRISP Data Mining

Okay, anyone who has done a formal course in Data Mining knows about the CRoss Industry Standard Process for Data Mining, and for good reason. It’s one of the most fundamental ways to go about mining any kind of data. A diagrammatic representation of it is shown below:

[Diagram: the CRISP-DM process]

However, while most courses would teach their students about this framework and leave it at that, this book tries to explain many other applications of statistical learning techniques in the context of CRISP data mining. For example, in an exploratory data mining technique such as clustering, it’s the Evaluation part of the process that’s crucial, since you need to extract the maximum ‘information’ you can from the results that a standard clustering technique provides. More on this in a bit.

The book uses the Expected Value framework to augment the way in which CRISP is applied. It basically stresses that you need to formalize your Business Understanding of the problem by expressing it mathematically – as a value to estimate or a function to optimize. At this stage, you should think of your solution as a black box, not focusing on the algorithms and implementations underneath. You then need to understand your data in terms of the available attributes, sparseness of the data, etc. (Data Understanding). Only once you have the mathematical formulation and familiarity with the data are you in a position to combine the two in a synergistic mix of Data Preparation and Modelling – in essence, formulating a solution that performs best with the data you have.
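
To make that concrete, here is a toy expected-value formulation for a targeted-marketing decision. It is only an illustration – the response probability would come from whatever model sits inside the black box, and the benefit and cost figures below are made-up assumptions:

#Toy expected-value formulation; all numbers are illustrative assumptions
def expected_value_of_targeting(p_response, benefit_if_response, cost_of_contact):
    return p_response * benefit_if_response - cost_of_contact

#Target a customer only if the expected value of doing so is positive
print(expected_value_of_targeting(0.05, 200.0, 5.0))   # 5.0  -> target
print(expected_value_of_targeting(0.01, 200.0, 5.0))   # -3.0 -> don't target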

2. Using Decision Trees to Understand Clusters

Why understand the output of clustering?

Since clustering, as I mentioned before, is an exploratory process, you don’t exactly know what you expect to find going in. You may have a rough idea of the number of clusters (or not), but not of what each of them represents – at least in the general case.

Let’s consider a simple example. You have decided to use DBSCAN to cluster all of your customers into an unknown number of groups, where customers in the same cluster have similar properties based on some criteria. You hope to use the output of this clustering to classify a new customer into one of these groups, and deal with them better given the characteristics that set them apart from the rest. You can easily use a supervised learning process to classify a new customer given the clustering output, but what you don’t have is an intuitive understanding of what each cluster represents. Usually, one of the following two methods is used to ‘understand’ a given cluster:

i. Pick out a member of the cluster that has the highest value for a certain property. For example, if you are clustering companies, you could just pick the one with the highest turnover – which would roughly correspond to the company that’s most ‘well-known’.

ii. Pick out that member which is closest to the centroid.
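
For reference, method (ii) takes only a few lines of NumPy – a sketch, with the function name being my own:

import numpy as np

def closest_to_centroid(cluster_members):
    """cluster_members: 2-D array with one row per member of the cluster."""
    centroid = cluster_members.mean(axis=0)
    distances = np.linalg.norm(cluster_members - centroid, axis=1)
    return cluster_members[np.argmin(distances)]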

However, neither of these is applicable in all contexts and problems – especially if the cluster members don’t mean anything to you individually, or if you need to focus on the trends within a cluster, which cannot be summarized by a single member anyway.

How do you use a Decision Tree here?

A Decision Tree is usually a pretty underestimated algorithm when it comes to supervised learning. The biggest reason for this is its innate simplicity, which results in people (usually) going for more ‘sophisticated’ techniques like SVMs. However, this is exactly the property that makes us pick a decision tree in this scenario.

To understand a cluster K, what you essentially do is this:

i. Take all members of cluster K on one side, and members of all other clusters on another side.

ii. ‘Learn’ a decision tree to classify any given data point as a member of K or not a member of K – essentially a one-vs-all classification.

iii. Decompose the decision tree into the corresponding set of rules.

Voila! What you now have is a set of rules that help you distinguish the properties of cluster K from the rest. By virtue of how decision-tree learning algorithms such as C4.5 work, your rules will be based on the attributes that play the biggest role in characterising members of cluster K. Elegant, isn’t it? The book explains this using the famous Whiskies dataset.
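
If you want to try this yourself, here is a rough sketch using scikit-learn (not what the book uses – its CART-based DecisionTreeClassifier stands in for C4.5 here, and the function name and arguments are my own):

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

def describe_cluster(X, cluster_labels, k, feature_names, max_depth=3):
    """Print rules that separate cluster k from all the other clusters."""
    #One-vs-all target: 1 if the point belongs to cluster k, else 0
    y = (np.asarray(cluster_labels) == k).astype(int)
    #A shallow tree keeps the extracted rules short and interpretable
    tree = DecisionTreeClassifier(max_depth=max_depth).fit(X, y)
    print(export_text(tree, feature_names=feature_names))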

3. Virtual Items in Association Mining

This isn’t a big technique of any kind, but a nifty thing to remember all the same. Almost always, when talking of Association Mining, we think of conventional ‘items’ as they would appear in any Market-Basket analysis. However, one way to expand this mining process is to include Virtual Items in the ‘baskets’. For example, suppose you run a big chain of supermarkets across the country. Obviously, every time a customer buys something, you include all the purchased items in a ‘transaction’. What you could also include are some other attributes of the transaction, such as the location, the gender of the person buying, the time of day, etc. So a given transaction might look like (Mumbai, Male, Afternoon, Bread, Butter, Milk). Here, the first three ‘items’ aren’t really items at all! Hence, they are called ‘virtual items’. Nonetheless, if you feed these transactions into an Association Mining framework, you could end up finding relationships between conventional goods and properties such as gender and location!
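
As a toy sketch of what such ‘baskets’ might look like in code (the transactions below are made up, and a real pipeline would feed them to an Apriori-style frequent-itemset miner instead of counting pairs by hand):

from itertools import combinations
from collections import Counter

#Each transaction mixes real items with 'virtual items' describing its context
transactions = [
    {"Mumbai", "Male", "Afternoon", "Bread", "Butter", "Milk"},
    {"Delhi", "Female", "Morning", "Bread", "Jam"},
    {"Mumbai", "Male", "Evening", "Butter", "Milk"},
]

#Count pairwise co-occurrences; a miner would treat virtual and real items alike
pair_counts = Counter(
    pair for t in transactions for pair in combinations(sorted(t), 2)
)
print(pair_counts[("Butter", "Milk")])   #2 - a conventional association
print(pair_counts[("Male", "Milk")])     #2 - an association with a virtual item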


These are just three of the many things to learn from this book. I would definitely recommend it, especially if you want to supplement your knowledge of various ML algorithms with an intuition for how and where to apply them. Cheers!