The basics of Likelihood

Anyone who has taken a course in statistics or data science must have come across the term ‘likelihood’. In non-technical language, likelihood is synonymous with probability. But ask any mathematician, and their interpretations of the two concepts are significantly different. I went digging into likelihood for a bit this week, so I thought I’d put down the basics of what I revisited.

Whenever we talk about understanding data, we talk about models. In statistics, a model is usually some sort of parameterized function – like a probability density function (pdf) or a regression model. But the effectiveness of the model’s outputs will only be as good as its fit to the data. If the characteristics of the data are very different from the assumed model, the bias turns out to be pretty high. So from a probability perspective, how do we quantify this fit?

Defining the Likelihood Function

The two entities of interest are – the data X, and the parameters \theta. Now consider a function F(X, \theta), that returns a number proportional to the degree of ‘fit’ between the two – essentially quantifying their relationship with each other.

There are two practical ways you could deal with this function. If you kept \theta constant and varied the data X being analyzed, you would be getting a function F_1(X) whose only argument is the data. The output would basically be a measure of how well your input data satisfies the assumptions made by the model.

But in real life, you rarely know your model with certainty. What you do have is a bunch of observed data. So shouldn’t we also think of it the other way round? Suppose you kept X constant, but tried varying \theta instead. Now what you have is a function F_2(\theta), which computes how well different sets of parameters describe your data (which you have for real).

Mind you, in both cases, the underlying mathematical definition is the same. The input ‘variable’ is what has changed. This is how probability and likelihood are related. The function F_1 is what we call the probability function (or pdf for the continuous case), and F_2 is called the likelihood function. While F_1 assumes you know your model and tries to analyze data according to it, F_2 keeps the data in perspective while figuring out how well different sets of parameters describe it.

The above definition might make you think that the likelihood is nothing but a rewording of probability. But keeping the data constant, and varying the parameters has huge consequences on the way you interpret the resultant function.

Let’s take a simple example. Suppose you have a set C_n of n different coin tosses, where r of them were Heads and the others were Tails. Let’s say the coin was biased, and the probability of Heads coming up on it is p. In this case,

F(C_n, p) = {n \choose r} p^r (1-p)^{(n - r)}

Now suppose you made the coin yourself, so you know p = 0.7. In that case,

F_1(C_n) = {n \choose r} 0.7^r 0.3^{(n - r)}

On the other hand, let’s say you don’t know much about the coin, but you do have a bunch of toss outcomes from it. You made 10 different tosses, out of which 5 were Heads. From this data, you want to measure how well any guess of p describes what you observed. Then,

F_2(p) = 252 p^5 (1-p)^5

There is a very, very important distinction between probability and likelihood functions – the value of the probability function sums (or integrates, for continuous data) to 1 over all possible values of the input data. However, the value of the likelihood function does not integrate to 1 over all possible combinations of the parameters.
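
As a quick sanity check of that last point, here’s a small sketch using scipy with the 10-toss coin example above: integrating F_2(p) over all possible values of p gives a number nowhere near 1.

from scipy.integrate import quad
from scipy.special import comb

n, r = 10, 5   # 10 tosses, 5 Heads -- the data is held fixed

def likelihood(p):
    # F_2(p) = C(n, r) * p^r * (1 - p)^(n - r)
    return comb(n, r) * p**r * (1 - p)**(n - r)

area, _ = quad(likelihood, 0, 1)   # integrate over all possible values of p
print(area)                        # ~0.0909, clearly not 1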

The above statement leads to the second important thing to note: DO NOT interpret the value of a likelihood function as the probability of the model parameters. If your probability function gave a value of 0.7 (say) for a discrete data point, you could be pretty sure that no other option is as likely, because the probabilities of all other points would sum to 0.3. However, if you got 0.99 as the output of your likelihood function, it wouldn’t necessarily mean that the parameters you passed in are the most likely ones. Some other set of parameters might give 0.999, 0.9999 or something higher.

The only thing you can be sure of is this: if F_2(\theta_1) > F_2(\theta_2), then \theta_1 is more likely than \theta_2 to denote the parameters of the underlying model.

Log Likelihoods for Maximum Likelihood Estimation

The likelihood function is usually denoted as L(\theta | x) (the likelihood L of the parameters \theta given the data point x), so we will stick with that notation from now on. The most common use of likelihood is to figure out the set of parameters that yields the highest value for it (and thus describes your dataset the best). This method is called Maximum Likelihood Estimation: you maximize the value of the likelihood function in a bid to find the optimal parameters for your model. This trick is applied in many areas of data science, such as logistic regression.

Maximum Likelihood Estimation usually involves computing the partial derivative of the likelihood function with respect to the parameters. We generally deal with the log-likelihood (basically the logarithm of the likelihood function) rather than likelihood itself. Since log is a monotonically increasing function, the optimum value of the likelihood function can be calculated using derivatives of log-likelihood as well. The reason we use logarithms, is to make the process of dealing with derivatives easy. Consider the coin-toss example I gave above:

Your likelihood function for the probability of Heads, given n and r, was:

L(p | r) = {n \choose r} p^r (1-p)^{(n - r)}

Computing log, we get

log(L(p | r)) = log ({n \choose r}) + r log(p) + (n - r) log(1 - p)

To maximise, we will compute the partial derivative with respect to p, and equate to zero.

Using \frac{d(log x)}{dx} = \frac{1}{x} we get,

\frac{r}{p} = \frac{n-r}{1-p}

Solving, we get the intuitive result:

p = \frac{r}{n}
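
If you’d rather not do the calculus, a crude grid search over p (a small numpy sketch) confirms the same result:

import numpy as np

n, r = 10, 5
p_grid = np.linspace(0.01, 0.99, 99)   # candidate values of p

# log-likelihood, ignoring the constant log C(n, r) which doesn't affect the argmax
log_lik = r * np.log(p_grid) + (n - r) * np.log(1 - p_grid)

print(p_grid[np.argmax(log_lik)])   # 0.5, i.e. r / n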

In most cases, when you compute likelihood, you would be dealing with a bunch of independent data points x_1, x_2, ..., x_n, rather than a single one. The likelihood of \theta with respect to the data-set X then gets defined as follows:

L(\theta | X) = L(\theta | x_1) L(\theta | x_2) L(\theta | x_3) ... L(\theta | x_n)

Taking the log, the overall log-likelihood becomes:

log(L(\theta | X)) = \sum_x{log(L(\theta | x))}
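
In code, that is just a sum of per-point log-pdf values. Here’s a minimal sketch with made-up data, assuming a Gaussian model with parameters (mu, sigma):

import numpy as np
from scipy.stats import norm

X = np.array([1.2, 0.8, 1.5, 0.9, 1.1])   # made-up observations

def log_likelihood(mu, sigma):
    # sum of log L(theta | x) over the independent data points
    return np.sum(norm.logpdf(X, loc=mu, scale=sigma))

# compare candidate parameter sets -- the higher log-likelihood is the 'more likely' one
print(log_likelihood(1.0, 0.5), log_likelihood(1.1, 0.25))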

A small and easy introduction to Transductive Learning

Consider the following problem statement:

Input: a) A set U of labelled examples (x_i, y_i) where every x_i is the input vector, and y_i is the corresponding output label. b) A set V of unlabelled instances x'_i.

Output: The set of expected labels y'_i for all instances in V.

There are two ways (or rather, two philosophies) you could use to solve this problem. Let’s look at the first one, which most people in ML are acquainted with:

The Inductive way

Induction, in the context of learning, is the attempted discovery of rules/generalizations based on analysis of collected data. ‘Attempted discovery’ is the key term here – the generalizations are not facts, but approximations based on evidence you have gathered.

The main characteristic of inductive learning is the building of a model – those rules/properties you induce from the data to answer your questions, together make up the model. The learning can happen in a supervised or semi-supervised (or even unsupervised) fashion. What you are basically trying to do, is make generalizations that can help you understand/label unseen instances in the future. Statistical inference (which deals with building parametric models) is one of the techniques used in inductive learning.

Concretely speaking, here’s how inductive learning would work for the problem defined above:

Input: U for supervised learning (and V too if we go for semi-supervised learning).

Output: The set of expected labels y'_i for all instances in V.

Approach: Compute a function f, using the information in U, such that f(x_i) is as close to y_i as possible over all instances in U. Using this function, compute each y'_i as y'_i = f(x'_i).


The other way to go about this would be…

The Transductive way

Read the problem statement at the beginning of this post once again. The output the problem definition asks for is the labels of instances in V only. What we did in inductive learning is build a model to label any unseen instances in the future. If you think about it, we basically solved a problem that’s more general than the one we needed to solve. Instead of building a universal model – which we don’t need anyway – could we leverage the information contained in the instances of V (with respect to those in U) to make better predictions for V specifically? That is precisely what Transductive Learning tries to do.

Think about it this way. Suppose you were shown a pack of dogs, where each dog’s breed can be either A or B. Some of the dogs have been tagged with their correct breed, but many are not. You are asked to label the unlabelled dogs with their respective breeds. What would you do? Apart from the characteristics of dogs from A and B, wouldn’t it make sense to observe the unlabelled dogs and their interactions and similarities to those from A and B, to make good guesses for their breeds? That is the philosophy of Transduction.

Transduction, in the context of learning, refers to reasoning from specific observed (training) instances to specific observed (unlabelled) instances. It was introduced by Vladimir Vapnik, with the core thought behind it being this:

“When solving a problem of interest, do not solve a more general problem as an intermediate step. Try to get the answer that you really need but not a more general one.”

To avoid reinventing the wheel, please refer to the Wikipedia example problem to get an intuitive understanding of how many Transductive Learning algorithms work. For those of you who are acquainted with Support Vector Machines (SVMs), there’s a transductive version of them around too. If you don’t want to go into the mathematical details, here’s a summary: basically, you incorporate the unlabelled samples into determining what the ‘maximum margin’ actually is.
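
Scikit-learn doesn’t ship a transductive SVM, but it does have graph-based transductive methods (LabelSpreading/LabelPropagation) that follow the same philosophy: you hand over both U and V, and you get back labels for exactly the unlabelled points you trained on. A minimal sketch (the data here is made up):

import numpy as np
from sklearn.semi_supervised import LabelSpreading

# U: labelled points, V: unlabelled points (marked with the label -1)
X = np.array([[1.0], [1.2], [5.0], [5.3], [1.1], [5.1]])
y = np.array([0, 0, 1, 1, -1, -1])   # the last two form the set V

model = LabelSpreading(kernel='knn', n_neighbors=3)
model.fit(X, y)

# transduction_ holds inferred labels for every point we trained on -- there is
# no general model here to apply to instances outside X
print(model.transduction_)   # e.g. [0 0 1 1 0 1]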


It is important to understand that not all Semi-Supervised Learning methods are Transductive in nature. The inclusion of unlabelled samples in the training is not the primary characteristic of transduction – the avoidance of building a general ‘model’ is. We are using the information implicit in the instances whose output is required, to understand them better.

The major disadvantage of Transductive learning is pretty obvious: The information you ‘learn’ cannot be used to label new instances (which you did not have during training) – since you are not building a model. Essentially, every time you want to classify a new set of instances, you have to re-do the whole training again. Therefore, it makes most sense only when your goals (in terms of the instances you want to understand) are specific.

Another thing to note, is that the effectiveness of Transduction can suffer if you have some really noisy samples in your data. Think about the dogs example I gave. If some dogs from the unlabelled ones were from some random breed C (or cross-breeds of A and B), your internal understanding of the entire pack would suffer. You would see some dogs behaving randomly with others, some showing characteristics of both – it might make you question the labelled dogs as well! Since the distribution(s) of unlabelled data are used along with the training data, the integrity of the whole dataset (or at least most of it) is a prerequisite in Transductive learning.

Non-Mathematical Feature Engineering techniques for Data Science

“Apply Machine Learning like the great engineer you are, not like the great Machine Learning expert you aren’t.”

This is the first sentence in a Google-internal document I read about how to apply ML. And rightly so. In my limited experience working as a server/analytics guy, data (and how to store/process it) has always demanded the most consideration and had the biggest impact on the overall pipeline. Ask any Kaggle winner, and they will always say that the biggest gains usually come from being smart about representing data, rather than from using some sort of complex algorithm. Even the CRISP-DM data mining process has not one, but two stages dedicated solely to data understanding and preparation.

So what is Feature Engineering?

Simply put, it is the art/science of representing data in the best way possible.

Why do I say art/science? Because good Feature Engineering involves an elegant blend of domain knowledge, intuition, and basic mathematical abilities. Heck, the most effective data representation ‘hacks’ barely involve any mathematical computation at all! (As I will explain in a short while).

What do I mean by ‘best’? In essence, the way you present your data to your algorithm should denote the pertinent structures/properties of the underlying information in the most effective way possible. When you do feature engineering, you are essentially converting your data attributes into data features.

Attributes are basically all the dimensions present in your data. But do all of them, in the raw format, represent the underlying trends you want to learn in the best way possible? Maybe not. So what you do in feature engineering, is pre-process your data so that your model/learning algorithm has to spend minimum effort on wading through noise. What I mean by ‘noise’ here, is any information that is not relevant to learning/predicting your ultimate goal. In fact, using good features can even let you use considerably simpler models since you are doing a part of the thinking yourself.

But as with any technique in Machine Learning, always use validation to make sure that the new features you introduce really do improve your predictions, instead of adding unnecessary complexity to your pipeline.


As mentioned before, good feature engineering involves intuition, domain knowledge (human experience) and basic math skills. So here are a few extremely simple techniques for you to (maybe) apply in your next data science solution:

1. Representing timestamps

Time-stamp attributes are usually denoted by the epoch time, or split up into multiple dimensions such as (Year, Month, Date, Hours, Minutes, Seconds). But in many applications, a lot of that information is unnecessary. Consider, for example, a supervised system that tries to predict traffic levels in a city as a function of Location + Time. In this case, trying to learn trends that vary by seconds would mostly be misleading. The year wouldn’t add much value to the model either. Hours, day and month are probably the only dimensions you need. So when representing the time, try to ensure that your model actually requires all the numbers you are providing it.

And don’t forget time zones. If your data comes from different geographical sources, do remember to normalize by time zone if needed.
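
Here’s a small pandas sketch of this kind of decomposition (the timestamps are made up, and assumed to be epoch seconds in UTC):

import pandas as pd

df = pd.DataFrame({'timestamp': [1451606400, 1451653200, 1451736000]})

# parse once, normalizing to a single time zone
dt = pd.to_datetime(df['timestamp'], unit='s', utc=True)

# keep only the granularity the problem actually needs
df['hour'] = dt.dt.hour
df['day_of_week'] = dt.dt.dayofweek
df['month'] = dt.dt.month
print(df)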

2. Decomposing Categorical Attributes

Some attributes come as categories instead of numbers. A simple example would be a ‘color’ attribute that is (say) one of {Red, Green, Blue}. The most common way to go about representing this, is to convert each category into a binary attribute that takes one value out of {0, 1}. So you basically end up with a number of added attributes equal to the number of categories possible. And for each instance in your dataset, only one of them is 1 (with the others being 0). This is a form of one-hot encoding.

If you are new to this idea, you may think of decomposition as an unnecessary hassle (we are essentially bloating up the dimensionality of the dataset). Instead, you might be tempted to convert the categorical attribute into a scalar value. For example, the color feature might take one value from {1, 2, 3}, representing {Red, Green, Blue} respectively. There are two problems with this. First, for a mathematical model, this would mean that Red is somehow ‘more similar’ to Green than Blue (since |1-3| > |1-2|). Unless your categories do have a natural ordering (such as stations on a railway line), this might mislead your model. Secondly, it would make statistical metrics (such as mean) meaningless – or worse, misleading yet again. Consider the color example again. If your dataset contains equal numbers of Red and Blue instances but no Green ones, the ‘average’ value of color might still come out to be ~2 – essentially meaning Green!

The safest case for converting a categorical attribute into a scalar is when only two categories are possible. So you have {0, 1} corresponding to {Category 1, Category 2}. In this case, an ‘ordering’ isn’t really required, and you can interpret the value of the attribute as the probability of belonging to Category 2 vs Category 1.
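
A one-hot decomposition is a one-liner in pandas (or scikit-learn’s OneHotEncoder); a quick sketch:

import pandas as pd

df = pd.DataFrame({'color': ['Red', 'Green', 'Blue', 'Red']})

# one binary column per category; exactly one of them is 'on' per row
one_hot = pd.get_dummies(df['color'], prefix='color')
print(one_hot)   # columns: color_Blue, color_Green, color_Red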

3. Binning/Bucketing

Sometimes, it makes more sense to represent a numerical attribute as a categorical one. The idea is to reduce the noise endured by the learning algorithm, by assigning certain ranges of a numerical attribute to distinct ‘buckets’. Consider the problem of predicting whether a person owns a certain item of clothing or not. Age might definitely be a factor here. What is actually more pertinent, is the Age Group. So what you could do, is have ranges such as 1-10, 11-18, 19-25, 26-40, etc. Moreover, instead of decomposing these categories as in point 2, you could just use scalar values, since age groups that lie ‘closer by’ do represent similar properties.

Bucketing makes sense when the domain of your attribute can be divided into neat ranges, where all numbers falling in a range imply a common characteristic. It reduces overfitting in certain applications, where you don’t want your model to try and distinguish between values that are too close by – for example, you could club together all latitude values that fall within a city, if your property of interest is a function of the city as a whole. Binning also reduces the effect of tiny errors, by ‘rounding off’ a given value to the nearest representative. Binning does not make sense if the number of your ranges is comparable to the total number of possible values, or if precision is very important to you.
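
With pandas, bucketing is a single pd.cut call; a quick sketch using the age-group idea from above (the cut points are arbitrary):

import pandas as pd

ages = pd.Series([4, 15, 22, 33, 61])

age_group = pd.cut(ages, bins=[0, 10, 18, 25, 40, 100],
                   labels=['child', 'teen', 'young', 'adult', 'senior'])
print(age_group.tolist())   # ['child', 'teen', 'young', 'adult', 'senior']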

4. Feature Crosses

This is perhaps the most important/useful one of these. Feature crosses are a unique way to combine two or more categorical attributes into a single one. This technique is extremely useful when certain features together denote a property better than they do individually. Mathematically speaking, you are taking the cross product of all possible values of the categorical features.

Consider a feature A, with two possible values {A1, A2}. Let B be a feature with possibilities {B1, B2}. Then, a feature-cross between A & B (lets call it AB) would take one of the following values: {(A1, B1), (A1, B2), (A2, B1), (A2, B2)}. You can basically give these ‘combinations’ any names you like. Just remember that every combination denotes a synergy between the information contained by the corresponding values of A and B.

For example, take the diagram shown below:


All the blue points belong to one class, and the red ones belong to another. Let’s put the actual model aside. First off, you would benefit from binning the X, Y values into {x < 0, x >= 0} & {y < 0, y >= 0} respectively. Let’s call them {Xn, Xp} and {Yn, Yp}. It is pretty obvious that Quadrants I & III correspond to class Red, and Quadrants II & IV contain class Blue. So if you could now cross features X and Y into a single feature ‘Quadrant’, you would basically have {I, II, III, IV} being equivalent to {(Xp, Yp), (Xn, Yp), (Xn, Yn), (Xp, Yn)} respectively.

A more concrete/relatable example of a (possibly) good feature cross is something like (Latitude, Longitude). A common latitude corresponds to many places around the globe, and the same goes for longitude. But once you combine Lat & Long buckets into discrete ‘blocks’, they denote ‘regions’ in a geography, with possibly similar properties for each one.

Sometimes, attributes can also be ‘combined’ into a single feature with simple mathematical hacks. In the above example, suppose you define modified features X_{sign} and Y_{sign} as follows:

X_{sign} = \frac{x}{|x|}

Y_{sign} = \frac{y}{|y|}

Now, you could just define a new feature Quadrant_{odd} as follows:

Quadrant_{odd} = X_{sign}Y_{sign}

That’s all! If Quadrant_{odd} = 1, the class is Red. Else, Blue!
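
Here’s that sign trick as a few lines of numpy (a sketch with toy points, one per quadrant, assuming no point lies exactly on an axis):

import numpy as np

x = np.array([ 2.0, -1.5, -3.0,  4.0])
y = np.array([ 3.0,  2.0, -1.0, -2.5])

x_sign = x / np.abs(x)
y_sign = y / np.abs(y)

quadrant_odd = x_sign * y_sign   # +1 for Quadrants I & III, -1 for II & IV
print(quadrant_odd)              # [ 1. -1.  1. -1.]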


For the sake of completeness, I will also mention some mathematically intensive feature engineering techniques, with links for you to read more about them:

5. Feature Selection : Using certain algorithms to automatically select a subset of your original features, for your final model. Here, you are not creating/modifying your current features, but rather pruning them to reduce noise/redundancy.

6. Feature Scaling : Sometimes, you may notice that certain attributes have a higher ‘magnitude’ than others. An example might be a person’s income, as compared to their age. In such cases, for certain models (such as Ridge Regression), it is in fact necessary that you scale all your attributes to comparable/equivalent ranges. This prevents your model from giving greater weightage to certain attributes as compared to others (a short sketch follows this list).

7. Feature Extraction : Feature extraction involves a host of algorithms that automatically generate a new set of features from your raw attributes. Dimensionality reduction falls under this category.
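
As a tiny illustration of points 6 and 7, here’s a sketch with made-up numbers using scikit-learn:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# two attributes on wildly different scales: age (years) and income
X = np.array([[25, 300000], [32, 800000], [47, 1500000], [51, 900000]])

X_scaled = StandardScaler().fit_transform(X)      # zero mean, unit variance per column

# feature extraction: project onto the first principal component
X_reduced = PCA(n_components=1).fit_transform(X_scaled)
print(X_scaled.shape, X_reduced.shape)            # (4, 2) (4, 1)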

The magic behind Attribute Access in Python

Most people know just one thing when it comes to attribute access – the dot ‘.’ (as in x.some_attribute). In simple terms, attribute access is the way you retrieve an object linked to the one you already have. To someone who uses Python without delving too much into the details, it may seem pretty straightforward. However, under the hood, there’s a lot that goes on for this seemingly trivial task.

Let’s look at each of the components one by one.

The __dict__ attribute

Every object in Python has an attribute denoted by __dict__. This dictionary/dictionary-like (I will explain this shortly) object contains all the attributes defined for the object itself. It maps the attribute name to its value.

Here’s an example:

>>> class C(object):
	x = 4

>>> c = C()
>>> c.y = 5
>>> c.__dict__
{'y': 5}

Notice how 'x' is not in c.__dict__. The reason for this is simple enough. While y was defined for the object c, x was defined for its class (C). Therefore, it will appear in the __dict__ of C. In fact, C's __dict__ contains a lot of other keys too (including '__dict__'):

>>> c.__class__.__dict__['x']
4
>>> c.__class__.__dict__
dict_proxy({'__dict__': <attribute '__dict__' of 'C' objects>, 'x': 4, 
'__module__': '__main__', '__weakref__': <attribute '__weakref__' of 'C' objects>, 
'__doc__': None})

We will look at what dictproxy means soon.

The __dict__ of an object is simple enough to understand. It behaves like a Python dict, and is one too.

>>> c.__dict__
{'y': 5}
>>> c.__dict__.__class__
<type 'dict'>
>>> c.__dict__ = {}
>>> c.y

Traceback (most recent call last):
  File "<pyshell#81>", line 1, in <module>
AttributeError: 'C' object has no attribute 'y'
>>> c.__dict__['y'] = 5
>>> c.y
5

The __dict__ of a class, however, is not that straightforward. It’s actually an object of a class called dictproxy. dictproxy is a special class whose objects behave like normal dicts, but they differ in some key behaviours.

>>> C.__dict__
dict_proxy({'__dict__': <attribute '__dict__' of 'C' objects>, 'x': 4, '__module__': '__main__', '__weakref__': <attribute '__weakref__' of 'C' objects>, '__doc__': None})
>>> C.__dict__.__class__
<type 'dictproxy'>
>>> C.__dict__['x']
4
>>> C.__dict__['x'] = 6

Traceback (most recent call last):
  File "<pyshell#87>", line 1, in <module>
    C.__dict__['x'] = 6
TypeError: 'dictproxy' object does not support item assignment
>>> C.x = 6
>>> C.__dict__ = {}

Traceback (most recent call last):
  File "<pyshell#89>", line 1, in <module>
    C.__dict__ = {}
AttributeError: attribute '__dict__' of 'type' objects is not writable

Notice how you cannot set a key in a dictproxy directly (C.__dict__['x'] = 6 does not work). You can accomplish the same using C.x = 6 however, since the internal behaviour then is different. Also notice how you cannot set the __dict__ attribute itself either (C.__dict__ = {} does not work).

There’s a reason behind this weird implementation. If you don’t want to get into the details, just know that it’s there for the Python interpreter to keep working properly, and to enforce some optimizations. If you want a more detailed explanation, have a look at Scott H’s answer to this StackOverflow question.


Descriptors

A descriptor is an object whose class defines at least one of the following magic methods: __get__, __set__ or __delete__ (remember, methods are ultimately objects in Python). Mind you, it’s the descriptor object itself that gets used as an attribute of some other class – the protocol methods live on the descriptor’s own class.

Descriptors can help you define the behaviour of an object’s attribute in Python. With each of the magic methods just mentioned, you implement how the attribute (‘described’ by the descriptor) will be retrieved, set and deleted in the object respectively. There are two types of descriptors – Data Descriptors, and Non-Data Descriptors.

Non-Data Descriptors only have __get__ defined; all others are Data Descriptors. You might naturally wonder why the two types are named this way. The answer is intuitive. Usually, it’s data-related attributes that we tend to ‘set’ or ‘delete’ with respect to an object. Other attributes, like methods themselves, we don’t. So their descriptors are called Non-Data Descriptors. As with a lot of other things in Python, this is not a hard-and-fast rule, but a convention. You could just as well describe a method with a Data Descriptor. But then, its __get__ should return a function.

Here’s an example of two classes whose instances will act as data and non-data descriptors respectively:

class DataDesc(object):
    # A data descriptor: defines __set__ and __delete__ along with __get__
    def __init__(self, name):
        self._name = name

    def __get__(self, obj, objclass):
        print("Retrieving attr " + self._name + " from " + str(obj) + "...")
        return objclass.x + " + " + obj.y

    def __set__(self, obj, value):
        raise AttributeError("Attr " + self._name + " cannot be " +
                             "set in " + str(obj))

    def __delete__(self, obj):
        raise AttributeError("Attr " + self._name + " cannot be " +
                             "deleted in " + str(obj))


class NonDataDesc(object):
    # A non-data descriptor: only __get__ is defined
    def __init__(self, name):
        self._name = name

    def __get__(self, obj, objclass):
        print("Retrieving attr " + self._name + " from " + str(obj) + "...")
        return objclass.x + " + " + obj.y

Notice how the __get__ method takes in an object obj and (its) class objclass. Similarly, setting the value requires obj and some candidate value. Deletion just needs obj. Taking these parameters in (along with the initializer __init__) helps you differentiate between objects of the same descriptor class. Mind you, it’s the objects that are intended to be the descriptors.
(P.S. If you don’t define the __get__ method for a descriptor, the descriptor object itself will get returned).

Let’s use these classes in some code.

class ParentClass(object):
    x = "x1"
    y = "y1"
    data_attr_parent = DataDesc("desc1")
    data_attr_child = DataDesc("desc2")

class ChildClass(ParentClass):
    x = "x2"
    y = "y2"
    data_attr_child = DataDesc("desc3")
    non_data_attr_child = NonDataDesc("desc4")

some_object = ChildClass()

That’s it! You can access the ‘described’ objects as usual in Python.

>>> some_object.data_attr_parent
Retrieving attr desc1 from <__main__.ChildClass object at 0x1062c5790>...
'x2 + y2'

Descriptors are used for a lot of attribute and method related functionality in Python, including static methods, class methods and properties. Using descriptors, you can gain better control over how attributes and methods of a class/its objects are accessed – including defining some ‘behind the scenes’ functionality like logging.
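
You can actually check this from the interpreter. Here’s a quick (illustrative) session showing that property objects are data descriptors, while classmethods and plain functions are non-data descriptors:

>>> hasattr(property(), '__get__'), hasattr(property(), '__set__')
(True, True)
>>> f = lambda cls: None
>>> hasattr(classmethod(f), '__get__'), hasattr(classmethod(f), '__set__')
(True, False)
>>> hasattr(f, '__get__'), hasattr(f, '__set__')   # plain functions are non-data descriptors too
(True, False)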

Now let’s look at the high-level rules governing attribute access in Python.

The Rules

Quoting Shalabh Chaturvedi’s book verbatim, the workflow is as follows:

  1. If attrname is a special (i.e. Python-provided) attribute for objectname, return it.
  2. Check objectname.__class__.__dict__ for attrname. If it exists and is a data-descriptor, return the descriptor result. Search all bases of objectname.__class__ for the same case.
  3. Check objectname.__dict__ for attrname, and return if found. If objectname is a class, search its bases too. If it is a class and a descriptor exists in it or its bases, return the descriptor result.
  4. Check objectname.__class__.__dict__ for attrname. If it exists and is a non-data descriptor, return the descriptor result. If it exists, and is not a descriptor, just return it. If it exists and is a data descriptor, we shouldn’t be here because we would have returned at point 2. Search all bases of objectname.__class__ for the same case.
  5. Raise AttributeError


To make things clearer, here’s some tinkering using the code we wrote in the Descriptors section (have a look at it again just to be clear about things):

data_attr_child is a data descriptor in some_object's class, so you can’t write over it through the instance. Also, the version in ChildClass ('desc3') is used, not the one in ParentClass.

>>> some_object.data_attr_child
Retrieving attr desc3 from <__main__.ChildClass object at 0x1110c9790>...
'x2 + y2'
>>> some_object.data_attr_child = 'xyz'

Traceback (most recent call last):
  File "<pyshell#112>", line 1, in <module>
    some_object.data_attr_child = 'xyz'
  File "/Users/srjoglekar/", line 16, in __set__
    "set in " + str(obj))
AttributeError: Attr desc3 cannot be set in <__main__.ChildClass object at 0x10883f790>

In fact, even if you make an appropriate entry in some_object's __dict__, it still won’t matter (as per Rule 2).

>>> some_object.__dict__['data_attr_child'] = 'xyz'
>>> some_object.data_attr_child
Retrieving attr desc3 from <__main__.ChildClass object at 0x10883f790>...
'x2 + y2'

The Non-Data Descriptor attribute, on the other hand, can be easily overwritten.

>>> some_object.non_data_attr_child
Retrieving attr desc4 from <__main__.ChildClass object at 0x10883f790>...
'x2 + y2'
>>> some_object.non_data_attr_child = 'xyz'
>>> some_object.non_data_attr_child
'xyz'
>>> some_object.__dict__
{'data_attr_child': 'xyz', 'non_data_attr_child': 'xyz'}

You can, however, change the behaviour of data_attr_child if you go to some_object's class and replace the attribute there itself (a plain class-level assignment, rather than going through the dictproxy).

>>> some_object.__class__.data_attr_child = 'abc'
>>> some_object.data_attr_child
'xyz'

Notice how the moment you replace the data descriptor in the class with some non-data descriptor (or some object like a string, in this case), the entry that we initially made in some_object's __dict__ comes into play. Therefore, some_object.data_attr_child returns 'xyz', not 'abc'.

The data_attr_parent attribute behaves similar to data_attr_child.

>>> some_object.data_attr_parent
Retrieving attr desc1 from <__main__.ChildClass object at 0x10883f790>...
'x2 + y2'
>>> some_object.data_attr_parent = 'xyz'

Traceback (most recent call last):
  File "<pyshell#127>", line 1, in <module>
    some_object.data_attr_parent = 'xyz'
  File "/Users/srjoglekar/", line 16, in __set__
    "set in " + str(obj))
AttributeError: Attr desc1 cannot be set in <__main__.ChildClass object at 0x10883f790>
>>> some_object.__class__.data_attr_parent = 'xyz'
>>> some_object.__class__.data_attr_parent
'xyz'

Notice how you can ‘write over’ data_attr_parent in ChildClass itself (even though you couldn’t through the instance). Once you do that, lookup goes through Rules 1-2-3 and stops at 4, giving the result 'xyz'.

Rules for Setting Attributes

Way simpler than the rules for ‘getting them’. Quoting Shalabh’s book again,

  1. Check objectname.__class__.__dict__ for attrname. If it exists and is a data-descriptor, use the descriptor to set the value. Search all bases of objectname.__class__ for the same case.
  2. Insert something into objectname.__dict__ for key "attrname".

That’s it! 🙂


__slots__

To put it concisely, __slots__ is a way to disallow objects from having their own __dict__ in Python. This means that if you define __slots__ in a class, you cannot set arbitrary attributes (apart from the ones mentioned in the ‘slots’) on its objects.

Here’s an example of such a class:

class SomeClass(object):
    __slots__ = ['x', 'y']

obj = SomeClass()

Now see how this behaves:

>>> obj.x = 4
>>> obj.y = 5
>>> obj.x
4
>>> obj.y
5
>>> obj.z = 6

Traceback (most recent call last):
  File "<pyshell#135>", line 1, in <module>
    obj.z = 6
AttributeError: 'SomeClass' object has no attribute 'z'

You can of course do this:

>>> obj.__class__.z = 6
>>> obj.z
6

But then, remember that you have now defined z in SomeClass's __dict__, not in obj's.

As Guido van Rossum himself mentions in his blog post, __slots__ were implemented in Python to improve efficiency, not to enforce ‘stricter’ attribute-setting. The basic intuition is this: suppose you have a class whose objects you intend to construct in large numbers. You don’t really need the flexibility of having ‘dynamic’ attributes on the objects themselves, but you do want efficiency. Since slots essentially eliminate the __dict__ attribute in each of the objects, you get a lot of memory savings this way.

Interestingly, slots are implemented using descriptors in Python.
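
You can verify this yourself: each slot shows up in the class __dict__ as a ‘member’ object, which is a full-blown data descriptor (an illustrative session):

>>> SomeClass.__dict__['x']
<member 'x' of 'SomeClass' objects>
>>> hasattr(SomeClass.__dict__['x'], '__get__'), hasattr(SomeClass.__dict__['x'], '__set__')
(True, True)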


Further Reading

Have a look at this book I have already quoted in the post. It goes into a lot of detail regarding attribute access in Python, including method resolutions.

That’s all for now. Cheers!

Joined Google :-D

Hello people! I haven’t really blogged in quite some time, and I kind of feel guilty about it 🙂. Truth is, I have been busy starting a new job/life at Google, as a Software Engineer at the Hyderabad office. I joined the Google Apps for Work team, and I work on the analytics part of things. I was pretty swamped with setting things up, getting the formalities done, finding a place to live, blah blah – basically learning to be an adult! Things have finally settled down now, and I have (I think) found my groove when it comes to my overall routine – so I will probably start blogging as usual in the coming weeks.

To answer the obvious question, life at Google is… well… pretty awesome. Let me write out some points about my experience at the company that’s supposed to be one of the best employers in the world. And as per my usual style, they will be bullet points, since putting down a coherent/easy-flowing train of thought is just beyond my abilities as a writer.

I. The cool work culture!! I don’t know how many other companies do this (since this is my first real job), but it’s definitely not all of them. Google doesn’t just talk about an open workspace, it actually follows it. My team’s lead engineer sits at a desk that’s exactly like the one I use (except for a lot of swag he has that I don’t), and I can just drop by if I have any doubts/issues regarding pretty much anything (even non-work-related). If my mentor needs to talk to me, he will just come by and sit on a bean-bag lying around my desk and discuss things like a friend at college. You literally won’t be able to tell the tech-levels of people at Google, and that’s something I feel is really nice about the office culture.

II. Getting intimidated by the people and the technology. This is a big one. I don’t think I have ever felt so… small… in front of the people around me, anytime before in my life. To put it honestly, I felt like an idiot around my team for the first week or two. Not because they acted in any such way – in fact I had to stalk them online to know more about them – but because all of them are basically smartasses. And the other aspect is obviously that you get overwhelmed by the internal infrastructure at a company like Google. It’s just so vast, and there are so many parts and bits and pieces working together, that it’s difficult to wrap your head around all of them at first. It’s all humbling.

III. The perks!! I don’t really think I need to explain this one (since it’s covered extensively in articles online). We at Hyderabad don’t have the perks that Mountain View does, but it’s still pretty darn amazing. Free transport, bunker rooms, free food (oh, the damn awesome food :-D), microkitchens, games rooms on all floors, an amazing gym, a massage centre, free internet at home, matching your charity (if you do any), a Techstop for chargers and stuff… the list goes on. Who knows, once we have the campus, the list might expand even further!

Trust me, I know I sound like a freaking fanboy throughout the post. But well… I am barely one year out of my college term, and this is pretty much paradise for a luxury-loving dork like me. I hope to have a great life here at Google, and justify the whole process by making as much of an impact as I can. Cheers!

A practical introduction to Functional Programming for Python coders

This post acquaints the reader with the fundamentals of Functional Programming in the context of Python. Most programmers rarely touch languages with a primary functional focus, such as Lisp or Haskell, except maybe as part of an academic course. Since Python is a widely-used language that supports (mostly) all functional programming constructs, this post tries to demonstrate their usage and advantages. Functional Programming may not be the best/most Pythonic way of doing everything in this language, but it has its advantages in some applications, and that is what this post is all about.

1. What is Functional Programming?

Functional programming is a programming paradigm that revolves around pure functions. If you have ever done coding in your life, you probably associate functions with subroutines. That is the ‘CS’ perspective. Here, we lean more towards the mathematical definition:

A pure function is essentially any function that can be represented as a mathematical expression only. Which means no side-effects: no I/O operations, no global state changes, no database interactions (you can’t really represent these in a mathematical expression, can you?). The output from a pure function is ONLY dependent on its inputs. So if you were to call a pure function with the same inputs a million times, you would get the same result every single time. Of course, in practical functional programming languages such as Haskell, there are ways to do I/O operations etc., but the focus is still on pure functions.

2. The lambda construct

The easiest way to initialize a pure function in Python is by using the lambda keyword:

>>> square_func = lambda x: x**2
>>> square_func(2)
4
>>> some_list = [(1, 3), (5, 2), (6, 1), (4, 6)]
>>> some_list.sort(key=lambda x: x[0]**2 - x[1]*3)
>>> some_list
[(1, 3), (4, 6), (5, 2), (6, 1)]


The lambda keyword helps you define functions in a one-line fashion, which is pretty convenient if it’s just a mathematical expression anyway. In fact, lambda is pretty prominent in functional programming (and not just Python), and has its roots in Lambda Calculus – one of the ‘ancestors’ of functional programming.

Functions initialized with lambda can also be called anonymous functions. If you look at line 5 in the code above, you are passing a lambda-initialized function to the sort method. You aren’t really giving it a name, just defining it on-the-go and passing it as an argument. Hence the term ‘anonymous’. Of course, you can always assign anonymous functions to a variable (as on line 1) and call them like usual functions (as on line 2).

3. Functions as First-Class citizens

In Functional Programming, functions are first-class citizens. This basically means that you can treat them like any other object – you can assign them to variables, pass them as arguments, or even have them returned from other functions. You already saw some of this happening in the code before. Here’s some more:

>>> square_func = lambda x: x**2
>>> function_product = lambda F, m: lambda x: F(x)*m
>>> square_func(2)
4
>>> function_product(square_func, 3)(2)
12


square_func is a function by itself. function_product, on the other hand, is a higher-order function that takes two inputs – a function F(x) and a multiplier m. It returns a function F'(x) which is equal to m*F(x). Therefore on line 5, function_product(square_func, 3) is a function that is being called with the argument 2, returning 2^{2}*3 = 12.

4. Immutability of Data and Data Flows

Immutability in the context of objects means that you never modify the value of data once initialized. In Functional Programming, whenever you call a function on some data, you always get new instances as a result – you never ‘update’ the value of any arguments. Programmatically, this implies that once you initialize a variable like x = 3, the variable x will never appear on the LHS of a statement again.

As a result, any functional code can be thought of as a feed-forward data flow. You never ‘come back’ to change the value of any variable, hence data always moves forward from the inputs to the ultimate output(s) – from one function, to another.


This immutability of data leads to another property called Referential Transparency. It means that the value of an expression is the same anywhere it might occur in the program – as long as the required variables are defined. Since you never update the value of any variables/objects (including functions), they mean the same in any context once defined. For this reason, functional code is extremely easy to analyze and debug. You never need to track the value of state variables or remember any updates.

This enables usage of memoization – you basically ‘remember’ the outputs of expensive functions with some common arguments in a sort of lookup table. This reduces the computational complexity at the expense of memory.
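
In Python (3.2+), the standard library already gives you memoization via functools. A minimal sketch (the function here is trivial, and the print inside it exists only to show when real work happens):

from functools import lru_cache

@lru_cache(maxsize=None)            # cache results keyed by the arguments
def slow_square(x):
    print("actually computing", x)  # impure only to demonstrate the caching
    return x ** 2

print(slow_square(10))   # prints "actually computing 10", then 100
print(slow_square(10))   # served from the cache -- no recomputation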

5. Recursion

Functional programming doesn’t really provide for iteration via while or for statements. Neither does it have the provision of state-updates. As a result, recursion is a strong theme in functional programming. It is worthwhile to remember that any iterative code can be converted to recursive code.

Here’s a functional version of the n-th Fibonacci number calculator:

>>> fibonacci = (lambda x, x_1=1, x_2=0:
	     x_2 if x == 0
	     else fibonacci(x - 1, x_1 + x_2, x_1))
>>> fibonacci(1)
1
>>> fibonacci(5)
5
>>> fibonacci(6)
8

Line 2 defines the base case of the computation, while line 3 makes the recursive call. If you think about it, x, x_1 and x_2 are essentially state variables whose updated versions are being made and passed into every recursive call. This is usually how state is handled in functional code.

It is always better to implement tail recursion when writing functional code, especially in pure-functional languages such as Scheme. There’s a good reason for this: tail-recursive code is easily optimized into iterative code by the underlying compiler (though this doesn’t apply to Python), making the compiled code more efficient.

6. Lazy evaluation

This is an aspect of Functional Programming that Python does not adopt. In many pure functional languages such as Haskell, objects that don’t necessarily need evaluation are not evaluated. Evaluation means computing the value of a function expression. Consider the following line:

length([3 + 4, 5, 1/0])

In a language like Python, the presence of 1/0 would cause an exception immediately. However, if we were to implement lazy evaluation, the value 3 would be returned, since there are 3 objects in the list and their values don’t need to be evaluated for the count. This causes a sort of graph reduction in the data flow, leading to fewer function calls (at the risk of ignoring errors).

Python 3.x does support a different kind of laziness: calls such as some_dict.keys() return lazy view objects (and map/filter return iterators) instead of fully built lists. This prevents the entire data from being loaded into memory at once, and is efficient from a programming perspective.

7. No iterators as sequences

This is a small point, but since the value of the next element in an iterator depends on its state (which violates Referential Transparency), iterators aren’t present in pure-functional code. Instead, if we are writing pure-functional code, we only deal with explicit immutable tuples – which you can generate from an iterator in Python using tuple().

8. map, reduce and filter

map, reduce and filter are three higher-order functions that appear in all pure functional languages – and in Python, too. Their prevalence suggests how often they are used in functional code to make it more elegant.

map basically provides a kind of parallelism by calling a function over all elements in a list/array. Here’s an example:

>>> map(lambda x: x**2, [1, 2, 3])
[1, 4, 9]

Notice how map can be parallelized – since you are calling the same function over all elements in an array without any modifications, you can make the calls in any given order.

filter offers another parallelism over a sequence. It takes as an input a boolean-returning function and a sequence, and retains only those values from the sequence that return True from the function. Example:

>>> filter(lambda x: x % 2 == 0, [1,2,3,4,5,6])
[2, 4, 6]

The above line filters out all the odd numbers from the list.

reduce is a construct that performs a serial iteration over a sequence. Its first argument is a function F(a, x), which takes in two arguments – an accumulator a and the current input x. F computes and returns the new value of a. The second argument to reduce is the sequence itself, [x_1, x_2, ..., x_n], and the third argument is the initial accumulator a. Internally, what it does is:

i.   a_1 = F(a, x_1)

ii.  a_2 = F(a_1, x_2)

iii. a_3 = F(a_2, x_3)

and so on… What you get returned at the end, is a_n. An example:

>>> reduce(lambda x, y: x + y, [1, 2, 3], 0)
6

The accumulator doesn’t have to be of the same type as the elements in the list! It can be a sequence as well. Here’s an example of reduce being used to reverse a sequence:

>>> reduce(lambda L, element: [element] + L, [1, 2, 3], [])
[3, 2, 1]

Notice how no list is being modified in the above line of code. (As a side note, reduce needs to be imported from the functools module in Python 3.x.)

Also, understand that since every call of F that reduce makes on an element depends on the elements that come before it, reduce cannot be parallelized.

Notice the usage of the words ‘map’ and ‘reduce’? These operators in Functional programming provide a good context to understand MapReduce, and are even touched upon in the old Google Lectures on the subject.
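
Putting the three together, here’s a small (illustrative) pipeline that sums the squares of the even numbers in a list:

>>> from functools import reduce   # needed in Python 3.x
>>> reduce(lambda acc, x: acc + x,
...        map(lambda x: x**2,
...            filter(lambda x: x % 2 == 0, [1, 2, 3, 4, 5, 6])),
...        0)
56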

End note 1: This blog post provides very succinct examples and code for you to better understand how functional programming should be done in Python using the aforementioned keywords/operators. It also introduces some advanced techniques such as pipelines. Do have a look!

End note 2: This article provides some perspective on why functional programming is the preferred approach only in specialized operations, and comments on the human-understandability of functional code (which is low).


A (small) introduction to Boosting

What is Boosting?

Boosting is a machine learning meta-algorithm that aims to iteratively build an ensemble of weak learners, in an attempt to generate a strong overall model.

Let’s look at the highlighted parts one by one:

1. Weak Learners: A ‘weak learner’ is any ML algorithm (for regression/classification) that provides an accuracy slightly better than random guessing. For example, consider a problem of binary classification with approximately 50% of samples belonging to each class. Random guessing in this case would yield an accuracy of around 50%. So a weak learner would be any algorithm, however simple, that slightly improves this score – say 51-55% or more. Usually, weak learners are pretty basic in nature. Some examples are Decision Stumps or smaller Decision Trees. In most cases, either one of these two is used as a weak learner in Boosting frameworks.

2. Ensemble: The overall model built by Boosting is a weighted sum of all of the weak learners. The weights and training given to each ensures that the overall model yields a pretty high accuracy (sometimes state-of-the-art).

3. Iteratively build: Many ensemble methods such as bagging/random forests are capable of training their components in parallel. This is because the training of any of those weak learners does not depend on the training of the others. That is not the case with Boosting. At each step, Boosting tries to evaluate the shortcomings of the overall model built till the previous step, and then generates a weak learner to battle these shortcomings. This weak learner is then added to the total model. Therefore, the training must necessarily proceed in a serial/iterative manner.

4. Meta-algorithm: Since Boosting isn’t necessarily an ML algorithm by itself, but rather uses other (basic) algorithms to build a stronger one, it is said to be a ‘meta’ algorithm.

How does Boosting work?


Usually, a Boosting framework for regression works as follows:

(The overall model at step ‘i’ is denoted by F_i.)

1. Start with a simple model F_0, usually predicting a common (average) value for all instances.

2. For i from 1 to M do:

2.a) Evaluate the shortcomings of F_{i - 1}, in terms of a value r_j for each element x_j in the training set.
2.b) Generate a new weak learner h_i(x) based on the r_js.
2.c) Compute the corresponding weight \gamma_i.
2.d) Update the current model as follows:

F_i = F_{i - 1} + \nu \gamma_i h_i(x)

where \nu denotes the learning rate of the framework.

3. Return F_M as the overall output.
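
To make the loop concrete, here’s a bare-bones sketch of the regression case with squared loss, where the shortcomings r_j are simply the residuals, small scikit-learn trees act as the weak learners, and \gamma_i is fixed to 1 for simplicity (so step 2.c is skipped):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# toy regression data
rng = np.random.RandomState(0)
X = rng.rand(200, 1)
y = np.sin(4 * X[:, 0]) + 0.1 * rng.randn(200)

M, nu = 50, 0.1                        # number of weak learners, learning rate
F = np.full_like(y, y.mean())          # step 1: F_0 predicts the average value
learners = []

for i in range(M):
    r = y - F                                         # 2.a: residuals play the role of r_j
    h = DecisionTreeRegressor(max_depth=2).fit(X, r)  # 2.b: weak learner fit to the r_j's
    F = F + nu * h.predict(X)                         # 2.d: update (gamma_i taken as 1)
    learners.append(h)

def predict(X_new):
    # overall model: F_M = F_0 + nu * sum_i h_i(x)
    return y.mean() + nu * sum(h.predict(X_new) for h in learners)

print(np.mean((predict(X) - y) ** 2))   # training error shrinks as M grows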

Each of the iterations in Boosting essentially tries to ‘improve’ the current model by introducing another learner into the ensemble. Having such an ensemble not only reduces the bias (which is generally pretty high for weak learners), but also the variance (since multiple learners contribute to the overall output, each with their own unique training). There are various versions of Boosting, and all of them primarily differ in the way they perform steps 2.a, 2.b. and 2.c.

For example, Gradient Boosting computes r_j as the negative gradient of a loss function (whose expression involves the target output y and the model output till the previous step, F_{i - 1}) with respect to F_{i - 1} at data point x_j. The ith weak learner h_i is then trained to predict r_j from x_j. The weight \gamma_i is then optimized so as to minimize the total loss value with respect to F_i.

AdaBoost, short for Adaptive Boosting, works by assigning r_j as a ‘weight’ to x_j based on how badly F_{i - 1} gets its output wrong. h_i and \gamma_i are then trained to optimize the weighted sum of errors made by h_i. AdaBoost is considered to be one of the best out-of-the-box classifiers today. Mathematically speaking, AdaBoost is a special case of Gradient Boosting.

The weak learners used in Boosting are indeed ‘weak’. The most commonly used ones are decision trees with a fixed (small) number of leaf nodes. Using complex methods such as SVMs in Boosting usually leads to overfitting. However, the total number of weak learners M should also be controlled, mostly based on experimentation. Having a low M will cause underfitting, but a very high value could lead to overfitting.

To improve the bias/variance characteristics of Boosting, subsampling (as done in bagging) is generally employed. What this essentially does is train each weak learner on a random subset of the overall dataset (rather than the whole training set). This causes a good decrease in variance and tends to reduce overfitting.

Classification applications would work slightly differently from regression, of course. The most commonly used method is to build a boosted model for each class. This model would predict the probability of a data point belonging to that class. During training, the inputs belonging to the corresponding class are given a target of 1, with the others getting a target of 0. The outputs from all the models can then be put into a softmax function for normalization.

What are the drawbacks of Boosting?

The most obvious drawback of boosting is the computational complexity. Performing so many iterations, and generating a new model at each, requires a lot of computations and time (and also space). The fact that the ensemble has to be built iteratively doesn’t help matters. However, using simple weak learners (instead of big decision trees) helps to mitigate this problem.

The biggest criticism for Boosting comes from its sensitivity to noisy data. Think of it this way. At each iteration, Boosting tries to improve the output for data points that haven’t been predicted well till now. If your dataset happens to have some (or a lot) of misclassified/outlier points, the meta-algorithm will try very hard to fit subsequent weak learners to these noisy samples. As a result, overfitting is likely to occur. The exponential loss function used in AdaBoost is particularly vulnerable to this issue (since the error from an outlier will get an exponentially-weighed importance for future learners).

Implementation in Scikit-Learn

  1. AdaBoost implementation
  2. Gradient Tree Boosting implementation
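
For reference, here’s roughly what using those two scikit-learn implementations looks like (a sketch on synthetic data, with arbitrary parameter values):

from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingRegressor

# AdaBoost with its default weak learners (decision stumps)
X_clf, y_clf = make_classification(n_samples=500, random_state=0)
ada = AdaBoostClassifier(n_estimators=100).fit(X_clf, y_clf)
print(ada.score(X_clf, y_clf))

# Gradient tree boosting for regression; subsample < 1.0 trains each tree
# on a random fraction of the data, as discussed above
X_reg, y_reg = make_regression(n_samples=500, random_state=0)
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                subsample=0.8).fit(X_reg, y_reg)
print(gbr.score(X_reg, y_reg))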