# Non-Mathematical Feature Engineering techniques for Data Science

“Apply Machine Learning like the great engineer you are, not like the great Machine Learning expert you aren’t.”

This is the first sentence in a Google-internal document I read about how to apply ML. And rightly so. In my limited experience working as a server/analytics guy, data (and how to store/process it) has always been the source of most consideration and impact on the overall pipeline. Ask any Kaggle winner, and they will always say that the biggest gains usually come from being smart about representing data, rather than using some sort of complex algorithm. Even the CRISP data mining process has not one, but two stages dedicated solely to data understanding and preparation.

So what is Feature Engineering?

Simply put, it is the art/science of representing data is the best way possible.

Why do I say art/science? Because good Feature Engineering involves an elegant blend of domain knowledge, intuition, and basic mathematical abilities. Heck, the most effective data representation ‘hacks’ barely involve any mathematical computation at all! (As I will explain in a short while).

What do I mean by ‘best’? In essence, the way you present your data to your algorithm should denote the pertinent structures/properties of the underlying information in the most effective way possible. When you do feature engineering, you are essentially converting your data attributes into data features.

Attributes are basically all the dimensions present in your data. But do all of them, in the raw format, represent the underlying trends you want to learn in the best way possible? Maybe not. So what you do in feature engineering, is pre-process your data so that your model/learning algorithm has to spend minimum effort on wading through noise. What I mean by ‘noise’ here, is any information that is not relevant to learning/predicting your ultimate goal. In fact, using good features can even let you use considerably simpler models since you are doing a part of the thinking yourself.

But as with any technique in Machine Learning, always use validation to make sure that the new features you introduce really do improve your predictions, instead of adding unnecessary complexity to your pipeline.

As mentioned before, good feature engineering involves intuition, domain knowledge (human experience) and basic math skills. So heres a few extremely simple techniques for you to (maybe) apply in your next data science solution:

### 1. Representing timestamps

Time-stamp attributes are usually denoted by the EPOCH time or split up into multiple dimensions such as (Year, Month, Date, Hours, Minutes, Seconds). But in many applications, a lot of that information is unnecessary. Consider for example a supervised system that tries to predict traffic levels in a city as a function of Location+Time. In this case, trying to learn trends that vary by seconds would mostly be misleading. The year wouldn’t add much value to the model as well. Hours, day and month are probably the only dimensions you need. So when representing the time, try to ensure that your model does require all the numbers you are providing it.

And not to forget Time Zones. If your data sources come from different geographical sources, do remember to normalize by time-zones if needed.

### 2. Decomposing Categorical Attributes

Some attributes come as categories instead of numbers. A simple example would be a ‘color’ attribute that is (say) one of {Red, Green, Blue}. The most common way to go about representing this, is to convert each category into a binary attribute that takes one value out of {0, 1}. So you basically end up with a number of added attributes equal to the number of categories possible. And for each instance in your dataset, only one of them is 1 (with the others being 0). This is a form of one-hot encoding.

If you are new to this idea, you may think of decomposition as an unnecessary hassle (we are essentially bloating up the dimensionality of the dataset). Instead, you might be tempted to convert the categorical attribute into a scalar value. For example, the color feature might take one value from {1, 2, 3}, representing {Red, Green, Blue} respectively. There are two problems with this. First, for a mathematical model, this would mean that Red is somehow ‘more similar’ to Green than Blue (since |1-3| > |1-2|). Unless your categories do have a natural ordering (such as stations on a railway line), this might mislead your model. Secondly, it would make statistical metrics (such as mean) meaningless – or worse, misleading yet again. Consider the color example again. If your dataset contains equal numbers of Red and Blue instances but no Green ones, the ‘average’ value of color might still come out to be ~2 – essentially meaning Green!

The safest place to convert a categorical attribute into a scalar, is when you have only two categories possible. So you have {0, 1} corresponding to {Category 1, Category 2}. In this case, an ‘ordering’ isn’t really required, and you can interpret the value of the attribute as the probability of belonging to Category 2 vs Category 1.

### 3. Binning/Bucketing

Sometimes, it makes more sense to represent a numerical attribute as a categorical one. The idea is to reduce the noise endured by the learning algorithm, by assigning certain ranges of a numerical attribute to distinct ‘buckets’. Consider the problem of predicting whether a person owns a certain item of clothing or not. Age might definitely be a factor here. What is actually more pertinent, is the Age Group. So what you could do, is have ranges such as 1-10, 11-18, 19-25, 26-40, etc. Moreover, instead of decomposing these categories as in point 2, you could just use scalar values, since age groups that lie ‘closer by’ do represent similar properties.

Bucketing makes sense when the domain of your attribute can be divided into neat ranges, where all numbers falling in a range imply a common characteristic. It reduces overfitting in certain applications, where you don’t want your model to try and distinguish between values that are too close by – for example, you could club together all latitude values that fall in a city, if your property of interest is a function of the city as a whole. Binning also reduces the effect of tiny errors, by ’rounding off’ a given value to the nearest representative. Binning does not make sense if the number of your ranges is comparable to the total possible values, or if precision is very important to you.

### 4. Feature Crosses

This is perhaps the most important/useful one of these. Feature crosses are a unique way to combine two or more categorical attributes into a single one. This is extremely useful a technique, when certain features together denote a property better than individually by themselves. Mathematically speaking, you are doing a cross product between all possible values of the categorical features.

Consider a feature A, with two possible values {A1, A2}. Let B be a feature with possibilities {B1, B2}. Then, a feature-cross between A & B (lets call it AB) would take one of the following values: {(A1, B1), (A1, B2), (A2, B1), (A2, B2)}. You can basically give these ‘combinations’ any names you like. Just remember that every combination denotes a synergy between the information contained by the corresponding values of A and B.

For example, take the diagram shown below:

All the blue points belong to one class, and the red ones belong to another. Lets put the actual model aside. First off, you would benefit from binning the X, Y values into {x < 0, x >= 0} & {y < 0, y >= 0} respectively. Lets call them {Xn, Xp} and {Yn, Yp}. It is pretty obvious that Quadrants I & III correspond to class Red, and Quadrants II & IV contain class Blue. So if you could now cross features X and Y into a single feature ‘Quadrant’, you would basically have {I, II, III, IV} being equivalent to {(Xp, Yp), (Xn, Yp), (Xn, Yn), (Xp, Yn)} respectively.

A more concrete/relatable example of a (possibly) good feature cross is something like (Latitude, Longitude). A common Latitude corresponds to so many places around the globe. Same goes for the Longitude. But once you combine Lat & Long buckets into discreet ‘blocks’, they denote ‘regions’ in a geography, with possibly similar properties for each one.

Sometimes, attributes can also be ‘combined’ into a single feature with simple mathematical hacks. In the above example, suppose you define modified features $X_{sign}$ and $Y_{sign}$ as follows:

$X_{sign} = \frac{x}{|x|}$

$Y_{sign} = \frac{y}{|y|}$

Now, you could just define a new feature $Quadrant_{odd}$ as follows:

$Quadrant_{odd} = X_{sign}Y_{sign}$

Thats all! If $Quadrant_{odd} = 1$, the class is Red. Else, Blue!

For sake of completeness, I will also mention some mathematically intensive feature engineering techniques, with links for you to read more about them:

5. Feature Selection : Using certain algorithms to automatically select a subset of your original features, for your final model. Here, you are not creating/modifying your current features, but rather pruning them to reduce noise/redundancy.

6. Feature Scaling : Sometimes, you may notice that certain attributes have a higher ‘magnitude’ than others. An example might be a person’s income – as compared to his age. In such cases, for certain models (such as Ridge Regression), it is infact necessary that you scale all your attributes to comparable/equivalent ranges. This prevents your model from giving greater weightage to certain attributes as compared to others.

7. Feature Extraction : Feature extraction involves a host of algorithms that automatically generate a new set of features from your raw attributes. Dimensionality reduction falls under this category.

# The magic behind Attribute Access in Python

Most people know just one thing when it comes to attribute access – the dot ‘.’ (as in x.some_attribute). In simple terms, attribute access is the way you retrieve an object linked to the one you already have. To someone who uses Python without delving too much into the details, it may seem pretty straightforward. However, under the hood, theres a lot that goes on for this seemingly trivial task.

Lets look at each of the components one by one.

### The __dict__ attribute

Every object in Python has an attribute denoted by __dict__. This dictionary/dictionary-like (I will explain this shortly) object contains all the attributes defined for the object itself. It maps the attribute name to its value.

Heres an example:


>>> class C(object):
x = 4

>>> c = C()
>>> c.y = 5
>>> c.__dict__
{'y': 5}



Notice how 'x' is not in c.__dict__. The reason for this is simple enough. While y was defined for the object c, x was defined for its class (C). Therefore, it will appear in the __dict__ of C. In fact, C‘s __dict__ contains a lot of other keys too (including '__dict__'):


>>> c.__class__.__dict__['x']
4
>>> c.__class__.__dict__
dict_proxy({'__dict__': <attribute '__dict__' of 'C' objects>, 'x': 4,
'__module__': '__main__', '__weakref__': <attribute '__weakref__' of 'C' objects>,
'__doc__': None})



We will look at what dictproxy means soon.

The __dict__ of an object is simple enough to understand. It behaves like a Python dict, and is one too.


>>> c.__dict__
{'y': 5}
>>> c.__dict__.__class__
<type 'dict'>
>>> c.__dict__ = {}
>>> c.y

Traceback (most recent call last):
File "<pyshell#81>", line 1, in <module>
c.y
AttributeError: 'C' object has no attribute 'y'
>>> c.__dict__['y'] = 5
>>> c.y
5



The __dict__ of a class however, is not that straight-forward. Its actually an object of a class called dictproxy. dictproxy is a special class whose objects behave like normal dicts, but they differ in some key behaviours.


>>> C.__dict__
dict_proxy({'__dict__': <attribute '__dict__' of 'C' objects>, 'x': 4, '__module__': '__main__', '__weakref__': <attribute '__weakref__' of 'C' objects>, '__doc__': None})
>>> C.__dict__.__class__
<type 'dictproxy'>
>>> C.__dict__['x']
4
>>> C.__dict__['x'] = 6

Traceback (most recent call last):
File "<pyshell#87>", line 1, in <module>
C.__dict__['x'] = 4
TypeError: 'dictproxy' object does not support item assignment
>>> C.x = 6
>>> C.__dict__ = {}

Traceback (most recent call last):
File "<pyshell#89>", line 1, in <module>
C.__dict__ = {}
AttributeError: attribute '__dict__' of 'type' objects is not writable



Notice how you cannot set a key in a dictproxy directly (C.__dict__['x'] = 4 does not work). You can accomplish the same using C.x = 6 however, since the internal behaviour then is different. Also notice how you cannot set the __dict__ attribute itself either(C.__dict__ = {} does not work).

Theres a reason behind this weird implementation. If you don’t want to get into the details, just know that its for the Python interpreter to keep working properly, and to enforce some optimizations. If you want a more detailed explanation, have a look at Scott H’s answer to this StackOverflow question.

### Descriptors

A descriptor is an object that has atleast one of the following magic methods in its attributes: __get__, __set__ or __delete__ (Remember, methods are ultimately objects in Python). Mind you, its the object we are talking about. Its class may or may not have implemented them.

Descriptors can help you define the behaviour of an object’s attribute in Python. With each of the magic methods just mentioned, you implement how the attribute (‘described’ by the descriptor) will be retrieved, set and deleted in the object respectively. There are two types of descriptors – Data Descriptors, and Non-Data Descriptors.

Non-Data Descriptors only have __get__ defined. All others are Data Descriptors. You would naturally think, why these two types are called so. The answer is intuitive. Usually, its data-related attributes that we tend to ‘set’ or ‘delete’ with respect to an object. Other attributes, like methods themselves, we don’t. So their descriptors are called Non-Data Descriptors. As with a lot of other things in Python, this is not a hard-and-fast rule, but a convention. You could just as well describe a method with a Data Descriptor. But then, its __get__ should return a function.

Heres an example of two classes that will come up with data and non-data descriptor objects respectively:


def __init__(self, name):
self._name = name

def __get__(self, obj, objclass):
try:
print("Retrieving attr " + self._name + " from " +
str(obj) + "...")
return objclass.x + " + " + obj.y
except:
raise AttributeError("Attr " + self._name + " could not be " +

def __set__(self, obj, value):
raise AttributeError("Attr " + self._name + " cannot be " +
"set in " + str(obj))

def __delete__(self, obj):
raise AttributeError("Attr " + self._name + " cannot be " +
"deleted in " + str(obj))

def __init__(self, name):
self._name = name

def __get__(self, obj, objclass):
try:
print("Retrieving attr " + self._name + " from " +
str(obj) + "...")
return objclass.x + " + " + obj.y
except:
raise AttributeError("Attr " + self._name + " could not be " +


Notice how the __get__ function takes in an object obj and (its) class objclass. Similarly, setting the value requires obj and some candidate value. Deletion just needs obj. Taking these parameters in (along with the initializer __init__) helps you differentiate between objects of the same descriptor class. Mind you, its the objects that are intended to be the descriptors.
(P.S. If you don’t define the __get__ method for a descriptor, the descriptor object itself will get returned).

Lets use these classes in some code.


class ParentClass(object):
x = "x1"
y = "y1"

class ChildClass(ParentClass):
x = "x2"
y = "y2"

some_object = ChildClass()



Thats it! You can access the ‘described’ objects as usual in Python.


>>> some_object.data_attr_parent
Retrieving attr desc1 from <__main__.ChildClass object at 0x1062c5790>...
'x2 + y2'



Descriptors are used for a lot of attribute and method related functionality in Python, including static methods, class methods and properties. Using descriptors, you can gain better control over how attributes and methods of a class/its objects are accessed – including defining some ‘behind the scenes’ functionality like logging.

Now lets look at the high-level rules governing attribute access in Python.

### The Rules

Quoting Shalabh Chaturvedi’s book verbatim, the workflow is as follows:

1. If attrname is a special (i.e. Python-provided) attribute for objectname, return it.
2. Check objectname.__class__.__dict__ for attrname. If it exists and is a data-descriptor, return the descriptor result. Search all bases of objectname.__class__ for the same case.
3. Check objectname.__dict__ for attrname, and return if found. If objectname is a class, search its bases too. If it is a class and a descriptor exists in it or its bases, return the descriptor result.
4. Check objectname.__class__.__dict__ for attrname. If it exists and is a non-data descriptor, return the descriptor result. If it exists, and is not a descriptor, just return it. If it exists and is a data descriptor, we shouldn’t be here because we would have returned at point 2. Search all bases of objectname.__class__for same case.
5. Raise AttributeError

To make things clearer, heres some tinkering using the code we wrote in the Descriptors section (Have a look at it again just to be clear about things):

data_attr_child is a Data descriptor in some_object‘s class. So you cant write over it. Also, the version in ChildClass (‘desc3’) is used, not the one in ParentClass.


>>> some_object.data_attr_child
Retrieving attr desc3 from <__main__.ChildClass object at 0x1110c9790>...
'x2 + y2'
>>> some_object.data_attr_child = 'xyz'

Traceback (most recent call last):
File "<pyshell#112>", line 1, in <module>
some_object.data_attr_child = 'xyz'
File "/Users/srjoglekar/metaclasses.py", line 16, in __set__
"set in " + str(obj))
AttributeError: Attr desc3 cannot be set in <__main__.ChildClass object at 0x10883f790>



Infact, even if you make an appropriate entry in some_object‘s dict, it still won’t matter (as per Rule 1).


>>> some_object.__dict__['data_attr_child'] = 'xyz'
>>> some_object.data_attr_child
Retrieving attr desc3 from <__main__.ChildClass object at 0x10883f790>...
'x2 + y2'



The Non-Data Descriptor attribute, on the other hand, can be easily overwritten.


>>> some_object.non_data_attr_child
Retrieving attr desc4 from <__main__.ChildClass object at 0x10883f790>...
'x2 + y2'
>>> some_object.non_data_attr_child = 'xyz'
>>> some_object.non_data_attr_child
'xyz'
>>> some_object.__dict__
{'data_attr_child': 'xyz', 'non_data_attr_child': 'xyz'}



You can, however, change the behaviour of data_attr_child, if you go to some_object‘s class and modify it in the dictproxy there itself.


>>> some_object.__class__.data_attr_child = 'abc'
>>> some_object.data_attr_child
'xyz'



Notice how the moment you replace the Data-Descriptor in the class with some non-data descriptor (or some object like a String in this case), the entry that we initially made in some_object‘s __dict__ comes into play. Therefore, some_object.data_attr_child returns 'xyz', not 'abc'.

The data_attr_parent attribute behaves similar to data_attr_child.


>>> some_object.data_attr_parent
Retrieving attr desc1 from <__main__.ChildClass object at 0x10883f790>...
'x2 + y2'
>>> some_object.data_attr_parent = 'xyz'

Traceback (most recent call last):
File "<pyshell#127>", line 1, in <module>
some_object.data_attr_parent = 'xyz'
File "/Users/srjoglekar/metaclasses.py", line 16, in __set__
"set in " + str(obj))
AttributeError: Attr desc1 cannot be set in <__main__.ChildClass object at 0x10883f790>
>>> some_object.__class__.data_attr_parent = 'xyz'
>>> some_object.__class__.data_attr_parent
'xyz'



Notice how you cant ‘write-over’ data_attr_parent in ChildClass itself. Once you do that, we go through Rules 1-2-3 and stop at 4, to get the result 'xyz'.

### Rules for Setting Attributes

Way simpler than the rules for ‘getting them’. Quoting Shalabh’s book again,

1. Check objectname.__class__.__dict__ for attrname. If it exists and is a data-descriptor, use the descriptor to set the value. Search all bases of objectname.__class__ for the same case.
2. Insert something into objectname.__dict__ for key "attrname".

Thats it! :-).

### __slots__

To put it concisely, __slots__ is a way to disallow objects from having their own __dict__ in Python. This means, that if you define __slots__ in a Class, then you cannot set arbitrary attributes(apart from the ones mentioned in the ‘slots’) on its objects.

Heres an example of such a class:


class SomeClass(object):
__slots__ = ['x', 'y']

obj = SomeClass()



Now see how this behaves:


>>> obj.x = 4
>>> obj.y = 5
>>> obj.x
4
>>> obj.y
5
>>> obj.z = 6

Traceback (most recent call last):
File "<pyshell#135>", line 1, in <module>
obj.z = 6
AttributeError: 'SomeClass' object has no attribute 'z'



You can ofcourse do this:


>>> obj.__class__.z = 6
>>> obj.z
6



But then, remember you have now defined z in SomeClass‘s __dict__, not in obj‘s.

As Guido van Rossum himself mentions in his blog post, __slots__ were implemented in Python to introduce efficiency, not ‘stricter’ attribute-setting. The basic intuition is this: Suppose you have a class, whose objects you intend to construct in a large number. You don’t really need the flexibility of having ‘dynamic’ attributes on the objects themselves, but you want efficiency. Since slots essentially eliminates the __dict__ attribute in each one of the objects, you get a lot of memory savings this way.

Interestingly, slots are implemented using descriptors in Python.

Have a look at this book I have already quoted in the post. It goes into a lot of detail regarding attribute access in Python, including method resolutions.

Thats all for now. Cheers!

Hello people! I haven’t really blogged in quite some time, and I kind-of feel guilty about it :-). Truth is, I have been busy starting a new job/life at Google, as a Software Engineer at the Hyderabad office. I joined the Google Apps for Work team, and I work on the analytics part of things. I was pretty swamped with setting things up, getting the formalities done, finding a place to live, blah blah – Basically learning to be an adult! Things have finally settled down now, and I have (I think) found my groove when it comes to my overall routine – so I will probably start blogging as usual in the coming weeks.

To answer the obvious question, life at Google is …well…pretty awesome. Let me write out some points about my experience at the company thats supposed to be one of the best employers in the world. And as per my usual style, they will be bullet points, since putting down a coherent/easy-flowing train of thoughts is just beyond my abilities as a writer.

I. The cool work culture!! I don’t know how many other companies do this (since this is my first real job) but its definitely not all of them. Google doesn’t just talk about an open workspace, it does follow it. My team lead engineer sits on a desk thats exactly like the one I use (except for a lot of Swag he has that I don’t), and I can just go drop by if I have any doubts/issues regarding pretty much anything (even non work related). If my mentor needs to talk to me, he will just come by and sit on a bean-bag lying around my desk and discuss things like a friend at college. You literally won’t be able to tell Tech-levels of people at Google, and thats something I feel is really nice about the office culture.

II. Getting intimidated by the people and the technology. This is a big one. I don’t think I have ever felt so…small…in front of people around me, anytime before in my life. To put it honestly, I felt like an idiot around my team for the first week or two. Not because they acted in any such way – in fact I had to stalk them online to know more about them – but because all of them are basically smartasses. And the other aspect is obviously that you get overwhelmed by the internal infrastructure at a company like Google. Its just so vast and there are so many parts and bits and pieces working together, that its difficult to wrap your head around all of them at first. Its all humbling.

III. The perks!! I don’t really think I need to explain this one (since its covered extensively in articles online). We at Hyderabad don’t have the perks that Mountain View does, but its still pretty darn amazing. Free transport, Bunker rooms, Free food (Oh the damn awesome food :-D), Microkitchens, Games rooms on all floors, Amazing gym, Massage centre, Free internet at home, Matching your charity (if you do any), Techstop for chargers and stuff…the list goes on. Who knows, once we have the campus, the list might expand even further!

Trust me, I know I sound like a freaking fanboy throughout the post. But well… I am barely one year out of my college term, and this is pretty much paradise for a luxury-loving dork like me. I hope to have a great life here at Google, and justify the whole process by making as much of an impact as I can. Cheers!

# A practical introduction to Functional Programming for Python coders

This post acquaints the reader with the fundamentals of Functional Programming in the context of Python. Most programmers rarely touch upon languages with a primary functional focus- such as Lisp or Haskell, except maybe as a part of an academic course. Since Python is a widely-used language that supports (mostly) all functional programming constructs, this post tries to demonstrate their usage and advantages. Functional Programming may not be the best/Pythonic way of doing everything in this language, but it has its advantages in some applications and that is what this post is all about.

1. What is Functional Programming?

Functional programming is a programming paradigm that revolves around pure functions. If you have ever done coding in your life, you probably associate functions with subroutines. That is the ‘CS’ perspective. Here, we lean more towards the mathematical definition:

A pure function is essentially any function that can be represented as a mathematical expression only. Which means no side-effects: no I/O operations, no global state changes, no database interactions (You can’t really represent these in a mathematical expression, can you?). The output from a pure function is ONLY dependent on its inputs. So if you were to call a pure function with the same inputs a million times, you would get the same result every single time. Ofcourse, in practical functional programming languages such as Haskell, there are ways to do I/O operations etc., but the focus is still on pure functions.

2. The lambda construct

The easiest way to initialize a pure function in Python is by using the lambda operative:


>>> square_func = lambda x: x**2
>>> square_func(2)
4
>>> some_list = [(1, 3), (5, 2), (6, 1), (4, 6)]
>>> some_list.sort(key=lambda x: x[0]**2 - x[1]*3)
>>> some_list
[(1, 3), (4, 6), (5, 2), (6, 1)]



The lambda operative helps you define functions in a one-line fashion, which is pretty convenient if its just a mathematical expression anyways. In fact, the lambda keyword is pretty prominent in functional programming (and not just Python), and has its roots in Lambda Calculus – one of the ‘ancestors’ of functional programming.

Functions initialized with lambda can also be called anonymous functions. If you look at line 5 in the code above, you are passing a lambda-initialized function to the sort method. You aren’t really giving it a name, just defining it on-the-go and passing it as an argument. Hence the term ‘anonymous’. Ofcourse, you can always assign anonymous functions to a variable (as on line 1) and call them like usual functions (as on line 2).

3. Functions as First-Class citizens

In Functional Programming, functions are first-class citizens. Which basically means that you can treat them as any other objects – you can assign them to variables, you can pass them as arguments, or even get them returned from other functions. You already saw some of this happening in the code before. Here’s some more:


>>> square_func = lambda x: x**2
>>> function_product = lambda F, m: lambda x: F(x)*m
>>> square_func(2)
4
>>> function_product(square_func, 3)(2)
12



$square\_func$ is a function by itself. $function\_product$ on the other hand, is a higher-order function that takes two inputs- A function $F(x)$ and a multiplier $m$. It returns a function $F'(x)$ which is equal to $m*F(x)$. Therefore on line 5, $function\_product(square\_func, 3)$ is a function that is being called with the argument $2$, returning $2^{2}*3 = 12$.

4. Immutability of Data and Data Flows

Immutability in the context of objects means that you never modify the value of data once initialized. In Functional Programming, whenever you call a function on some data, you always get new instances as a result – you never ‘update’ the value of any arguments. Programmatically, this implies that once you initialize a variable like $x = 3$, the variable $x$ will never appear on the LHS of a statement again.

As a result, any functional code can be thought of as a feed-forward data flow. You never ‘come back’ to change the value of any variable, hence data always moves forward from the inputs to the ultimate output(s) – from one function, to another.

This immutability of data leads to another property called Referential Transparency. It means that the value of an expression is the same anywhere it might occur in the program – as long as the required variables are defined. Since you never update the value of any variables/objects (including functions), they mean the same in any context once defined. For this reason, functional code is extremely easy to analyze and debug. You never need to track the value of state variables or remember any updates.

This enables usage of memoization – you basically ‘remember’ the outputs of expensive functions with some common arguments in a sort of lookup table. This reduces the computational complexity at the expense of memory.

5. Recursion

Functional programming doesn’t really provide for iteration via $while$ or $for$ statements. Neither does it have the provision of state-updates. As a result, recursion is a strong theme in functional programming. It is worthwhile to remember that any iterative code can be converted to recursive code.

Here’s a functional version of the n-th Fibonacci number calculator:


>>> fibonacci = (lambda x, x_1=1, x_2=0:
x_2 if x == 0
else fibonacci(x - 1, x_1 + x_2, x_1))
>>> fibonacci(1)
1
>>> fibonacci(5)
5
>>> fibonacci(6)
8



Line 2 defines the base case of the computation, while line 3 makes the recursive call. If you think about it, $x$, $x\_1$ and $x\_2$ are essentially state variables whose updated versions are being made and passed into every recursive call. This is usually how state is handled in functional code.

It is always better to implement tail-recursion when writing functional code, especially in pure-functional languages such as Scheme. There’s a good reason for this. Tail-recursive code is easily optimized into iterative code by the under-lying compiler (though this doesn’t apply to Python), making the compiled code more efficient.

6. Lazy evaluation

This is an aspect of Functional Programming that Python does not adopt. In many pure functional languages such as Haskell, objects that don’t necessarily need evaluation are not evaluated. Evaluation means computing the value of a function expression. Consider the following line:


length([3 + 4, 5, 1/0])



In a language like Python, the presence of 1/0 would case an Exception immediately. However, if we were to implement lazy-evaluation, the value 3 would be returned- since there exist 3 objects in the list, and their values don’t need to be evaluated for the count. This causes a sort of graph reduction in the data flow, leading to lesser function calls (at the risk of ignoring errors).

Python 3.x does support a different kind of lazy evaluation, that returns iterators for calls such as $some\_dict.keys()$. This prevents the entire data from being loaded into memory, and is efficient from a programming perspective.

7. No iterators as sequences

This is a small point, but since the value of the next element in an iterator depends on its state(which violates Referential Transparency), iterators aren’t present in pure-functional code. Instead if we are writing pure-functional code we only deal with explicit immutable tuples – which you can generate from an iterator in Python using $tuple()$.

7. map, reduce and filter

$map$, $reduce$ and $filter$ are three higher-order functions that appear in all pure functional languages – and in Python, too. Their prevalence suggests how often they are used in functional code to make it more elegant.

map basically provides a kind of parallelism by calling a function over all elements in a list/array. Here’s an example:


>>> map(lambda x: x**2, [1, 2, 3])
[1, 4, 9]



Notice how map can be parallelized – since you are calling the same function over all elements in an array without any modifications, you can make the calls in any given order.

filter offers another parallelism over a sequence. It takes as an input a boolean-returning function and a sequence, and retains only those values from the sequence that return True from the function. Example:


>>> filter(lambda x: x % 2 == 0, [1,2,3,4,5,6])
[2, 4, 6]



The above line filters out all the odd numbers from the list.

reduce is a construct that performs a serial iteration over a sequence. Its first argument is a function $F(a, x)$, which takes in two arguments – An accumulator $a$ and the current input $x$. $F$ computes and returns the new value of $a$. The second argument to reduce is the sequence itself, $[x1, x2, ..., x_n]$, and the third argument is the initial accumulator $a$. Internally, what it does is:

i.   $a_1 = F(a, x1)$

ii.  $a_2 = F(a_1, x2)$

iii. $a_3 = F(a_2, x3)$

and so on… What you get returned at the end, is $a_n$. An example:


>>> reduce(lambda x, y: x + y, [1, 2, 3], 0)
6



The accumulator doesn’t have to be of the same type as the elements in the list! It can be a sequence as well. Here’s an example of reduce being used to reverse a sequence:


>>> reduce(lambda L, element: [element] + L, [1, 2, 3], [])
[3, 2, 1]



Notice how no list is being modified in the above line of code. (As a side note, $reduce$ needs to be imported from the $functools$ library from Python 3.x).

Also, understand that since every call of $F$ that reduce makes on an element depends on the elements that come before it, reduce cannot be parallelized.

Notice the usage of the words ‘map’ and ‘reduce’? These operators in Functional programming provide a good context to understand MapReduce, and are even touched upon in the old Google Lectures on the subject.

End note1: This blog post provides very succinct examples and code for you to better understand how functional programming should be done in Python using the aforementioned keywords/operators. It also introduces some advanced techniques such as pipelines. Do have a look!

End Note2: This article provides some perspective on why functional programming is the preferred approach in only specialized operations, and comments on the human-understandability of functional code (which is low).

# A (small) introduction to Boosting

#### What is Boosting?

Boosting is a machine learning meta-algorithm that aims to iteratively build an ensemble of weak learners, in an attempt to generate a strong overall model.

Lets look at the highlighted parts one-by-one:

1. Weak Learners: A ‘weak learner’ is any ML algorithm (for regression/classification) that provides an accuracy slightly better than random guessing. For example, consider a problem of binary classification with approximately 50% of samples belonging to each class. Random guessing in this case would yield an accuracy of around 50%. So a weak learner would be any algorithm, however simple, that slightly improves this score – say 51-55% or more. Usually, weak learners are pretty basic in nature. Some examples are Decision Stumps or smaller Decision Trees. In most cases, either one of these two is used as a weak learner in Boosting frameworks.

2. Ensemble: The overall model built by Boosting is a weighted sum of all of the weak learners. The weights and training given to each ensures that the overall model yields a pretty high accuracy (sometimes state-of-the-art).

3. Iteratively build: Many ensemble methods such as bagging/random forests are capable of training their components in parallel. This is because the training of any of those weak learners does not depend on the training of the others. That is not the case with Boosting. At each step, Boosting tries to evaluate the shortcomings of the overall model built till the previous step, and then generates a weak learner to battle these shortcomings. This weak learner is then added to the total model. Therefore, the training must necessarily proceed in a serial/iterative manner.

4. Meta-algorithm: Since Boosting isn’t necessarily an ML algorithm by itself, but rather uses other (basic) algorithms to build a stronger one, it is said to be a ‘meta’ algorithm.

#### How does Boosting work?

Usually, a Boosting framework for regression works as follows:

(The overall model at step ‘i’ is denoted by $F_i$.)

1. Start with a simple model $F_0$, usually predicting a common (average) value for all instances.

2. For $i$ from $1$ to $M$ do:

2.a) Evaluate the shortcomings of $F_{i - 1}$, in terms of a value $r_j$ for each element $x_j$ in the training set.
2.b) Generate a new weak learner $h_i(x)$ based on the $r_j$s.
2.c) Compute the corresponding weight $\gamma_i$.
2.d) Update the current model as follows:

$F_i = F_{i - 1} + \nu \gamma_i h_i(x)$

where $\nu$ denotes the learning rate of the framework.

3. Return $F_M$ as the overall output.

Each of the iterations in Boosting essentially tries to ‘improve’ the current model by introducing another learner into the ensemble. Having such an ensemble not only reduces the bias (which is generally pretty high for weak learners), but also the variance (since multiple learners contribute to the overall output, each with their own unique training). There are various versions of Boosting, and all of them primarily differ in the way they perform steps 2.a, 2.b. and 2.c.

For example, Gradient Boosting computes $r_j$ as the gradient of a Loss function (whose expression involves target output $y$ and model output till previous step, $F_{i - 1}$) with respect to $F_{i - 1}$ at data point $x_j$. The $i$th weak learner $h_i$ is then trained to predict $r_j$ from $x_j$. The weight $\gamma_i$ is then optimized so as to minimize the total Loss value with respect to $F_i$.

AdaBoost, short for Adaptive Boosting, works by assigning $r_j$ as a ‘weight’ to $x_j$ based on how miscalculated the model output is according to $F_{i - 1}$. $h_i$ and $\gamma_i$ are then trained to optimize the weighted sum of errors with respect to the output from $h_i$. AdaBoost is considered to be one of the best out-of-the-box classifiers today. Mathematically speaking, AdaBoost is a special case of Gradient Boosting.

The weak-learners used in Boosting are indeed ‘weak’. The most commonly used ones are decision trees of a fixed number of leaf nodes. Using complex methods such as SVMs in Boosting usually leads to overfitting. However, the total number of weak learners $M$ should also be controlled, mostly based on experimentation. Having a low $M$ will cause underfitting, but a very high value could lead to overfitting.

To improve the bias/variance characteristics of Boosting, bagging is generally employed. What this essentially does is train each weak learner on a subset of the overall dataset (rather than the whole training set). This causes a good decrease in variance and tends to reduce overfitting.

Classification applications would work slightly different from regression, of course. The most commonly used method is to build a boosted model for each class. This model would predict the probability of a data point belonging to that class. During training, the inputs belonging to the corresponding class will be given a target of 1, with the others getting a target of 0. The outputs from all the models could then be put into a softmax function for normalization.

#### What are the drawbacks of Boosting?

The most obvious drawback of boosting is the computational complexity. Performing so many iterations, and generating a new model at each, requires a lot of computations and time (and also space). The fact that the ensemble has to be built iteratively doesn’t help matters. However, using simple weak learners (instead of big decision trees) helps to mitigate this problem.

The biggest criticism for Boosting comes from its sensitivity to noisy data. Think of it this way. At each iteration, Boosting tries to improve the output for data points that haven’t been predicted well till now. If your dataset happens to have some (or a lot) of misclassified/outlier points, the meta-algorithm will try very hard to fit subsequent weak learners to these noisy samples. As a result, overfitting is likely to occur. The exponential loss function used in AdaBoost is particularly vulnerable to this issue (since the error from an outlier will get an exponentially-weighed importance for future learners).

# Simple Turing Machine using TensorFlow

A post after many days! This one describes a simple code I wrote to simulate a Deterministic Turing Machine using TensorFlow. Its pretty basic with respect to the design, using TensorFlow only for the state change computations. The movement on the tape, and handling of halts is taken care of by old-school Python. This is not intended to be a practical usage of TensorFlow, its just something I did and am writing about here :-).

To demonstrate the code, I will use the 3-state Busy Beaver example from the Wiki page.

#### Encoding the Problem

Each of the states are encoded as real numbers (integers in this case, but for whatever reasons you could use floats too). So ‘A’, ‘B’, ‘C’ and ‘HALT’ become 0, 1, 2 and 3 respectively. Accordingly, the initial and final states are defined as follows:

initial_state = 0
final_states = [3]


Since there can be multiple final states generally, final_states is defined as a list. As with the states, the potential symbols are numbers too. The blank symbol, as per the Busy Beaver problem definition, is 0.

blank_input = 0


Now we come to the transition function. It is defined using two 2-D NumPy matrices, one for the inputs and the other for the corresponding outputs.

import numpy as np

transition_input = np.array([[0, 0],
[0, 1],
[1, 0],
[1, 1],
[2, 0],
[2, 1]])

transition_output = np.array([[1, 1, 1],
[2, 1, -1],
[0, 1, -1],
[1, 1, 1],
[1, 1, -1],
[3, 1, 1]])


Consider the transition_input[2] and transition_output[2]. The input denotes [current state, current symbol]. The output denotes [next state, next symbol, movement]. The movement value can either be 1(move pointer right, or increment index value) or -1(move pointer left, or decrement index value). Accordingly, transition_input[2] requires the current state to be 1 and current symbol being read to be 0. If these conditions are met, then the next state and next symbol are 0 and 1 respectively, with the pointer on the tape moving one step to the left.

#### The Turing Machine Code

The actual Turing Machine code follows. Its well documented as always, so you shouldn’t have a problem following whats going on if you are well-acquainted with Turing Machines and TensorFlow.

import tensorflow as tf

class TuringMachine(object):
"""
Implementation of a basic Turing Machine using TensorFlow.
"""

def __init__(self, initial_state, final_states, blank_input,
transition_input, transition_output):
"""
Initializes the TensorFlow Graph required for running the Turing
Machine.

States and symbols should both be encoded as real numbers prior to
initialization. Accordingly, 'initial_state' should be a float
denoting the starting state. 'final_states' should be an iterable
of numbers, denoting all the states the Machine can halt at.
'blank_input' should be a float denoting the blank input for the
Machine.
'transition_input' and 'transition_output' will together denote
the transition function followed by the Machine. Both should be
2-D NumPy matrices. Each inner vector in 'transition_input' will
contain [current state, current symbol], while each inner vector
in 'transition_output' will have [next state, next symbol, move].
"move" can either be 1: to increase pointer index by 1 (move
right), or -1: decrease pointer index by 1 (move left).

"""

#Store the necessary information as attributes
self._initial_state = float(initial_state)
self._final_states = set(final_states)
self._blank_input = blank_input

#Initialize Graph
self._graph = tf.Graph()

n_transitions = len(transition_input)
if len(transition_input) != len(transition_output):
raise ValueError("Number of transition inputs and outputs " +
"not same")

#Initialize all components in the Graph
with self._graph.as_default():

#Placeholder for current symbol thats being read from the
#tape
self._current_symbol = tf.placeholder("float")

#Variable to hold current state
self._current_state = tf.Variable(self._initial_state)

#Constant Ops for transitions
transition_in = tf.constant(transition_input)
transition_out = tf.constant(transition_output)

#Generate Ops to compute transition
concat_in = tf.pack([self._current_state,
self._current_symbol])
packed_in = tf.pack([concat_in for i in
range(len(transition_input))])
transition_diff = tf.reduce_sum(
tf.pow(tf.sub(packed_in, transition_input), 2), 1)
#This will return a 0 if theres an accurate match for the
#transition condition (current state, current symbol). If not,
#this will return a non-zero value.
self._transition_match = tf.reduce_min(transition_diff, 0)
#This will correspond to the index of the match
match_index = tf.cast(tf.argmin(transition_diff, 0), "int32")
#Pick out the transition output
self._transition = tf.reshape(
tf.slice(transition_out,  tf.pack([match_index, 0]),
tf.pack([1, 3])), [3])
#This will change state Variable
self._state_change = tf.assign(
self._current_state, tf.reshape(
tf.cast(tf.slice(self._transition, tf.constant([0]),
tf.constant([1])), "float32"), []))

#This will reset the state Variable
self._state_reset = tf.assign(self._current_state,
self._initial_state)

#Initialize Session
self._sess = tf.Session()

#Initialize Variables
init_op = tf.initialize_all_variables()
self._sess.run(init_op)

def run(self, input_tape, start_index, max_iter=1000):
"""
Runs the Turing Machine with the given input.
'input_tape' must be an index-able object, with 'start_index'
denoting the index number to start computation at.

Obviously, beware of inputs that may cause the machine to run
indefinitely. To control this, set 'max_iter' to some number of
iterations you would want to allow.

Will return final state and final index. If Machine halts at a
non-final state, or goes beyond max_iter, i.e. the Machine does
not accept the input, will return None, None.
The input tape will be modified in-place.
"""

#First reset the state
self._sess.run(self._state_reset)

#Run the Machine
index = start_index
for i in range(max_iter):
#Figure out input to be given
if (index > -1 and len(input_tape) > index):
machine_input = input_tape[index]
else:
machine_input = self._blank_input

#Run one step of the Machine
_, match_value, transition = self._sess.run(
[self._state_change, self._transition_match,
self._transition],
feed_dict = {self._current_symbol: machine_input})

#If match_value is not zero, return None, None
if match_value != 0:
return None, None

#First modify tape contents at current index as per transition
if (index > -1 and len(input_tape) > index):
#Index is within input tape limits
input_tape[index] = transition[1]
elif index == len(input_tape):
#Index has gone beyond the range of the input tape,
#to the right
input_tape.append(transition[1])
else:
#Index has gone beyond the range of the input tape,
#to the left
input_tape.insert(0, transition[1])
index = 0

#Change index as per transition
index = index + transition[2]

#If its a final state, return state, index
if transition[0] in self._final_states:
return transition[0], index

return None, None



A few things:

1. If the tape ‘pointer’ goes beyond the limits of input_tape, the blank input is automatically given to the machine (lines 114-118). Moreover,  depending on which end of the tape the pointer has gone across, the output value from the machine is inserted into the list (lines 130-142).

2. Since there is no way to check if the Machine will stop, the run method lets the user provide a maximum number of iterations. If the machine keeps running beyond the limit, (None, None) is returned.

3. The input_tape list is modified in-place.

4. If you want to track the transitions the Turing Machine follows, add a ‘print’ statement (or some kind of logging) at line 125.

#### Running the Turing Machine

First we define the contents of the input tape, and the index to start at.

>>> input_tape = [0, 1, 1, 1, 1, 1]
>>> start_index = 0


Then, we initialize a TuringMachine instance with the algorithm as encoded above:

>>> machine = TuringMachine(initial_state, final_states, blank_input,
transition_input, transition_output)


Finally, we ‘run’ the machine, and check if the input tape has been modified correctly:

>>> machine.run(input_tape, start_index)
(3, 5)
>>> input_tape
[1, 1, 1, 1, 1, 1, 1]
>>> machine.run(input_tape, start_index)
(3, 9)
>>> input_tape
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


Thats it! This is my first implementation of a Turing Machine, so do leave a comment if there are any errors/improvements possible. Cheers!

# Predicting Trigonometric Waves few steps ahead with LSTMs in TensorFlow

I have recently been revisiting my study of Deep Learning, and I thought of doing some experiments with Wave prediction using LSTMs. This is nothing new, just more of a log of some tinkering done using TensorFlow.

### The Problem

The basic input to the model is a 2-D vector – each number corresponding to the value attained by the corresponding wave. Each wave in turn is: (a constant + a sine wave + a cosine wave). The waves themselves have different magnitudes, initial phases and frequencies. The goal is to predict the values that will be attained a certain (I chose 23) steps ahead on the curve.

So first off, heres the wave-generation code:


##Producing Training/Testing inputs+output
from numpy import array, sin, cos, pi
from random import random

#Random initial angles
angle1 = random()
angle2 = random()

#The total 2*pi cycle would be divided into 'frequency'
#number of steps
frequency1 = 300
frequency2 = 200
#This defines how many steps ahead we are trying to predict
lag = 23

def get_sample():
"""
Returns a [[sin value, cos value]] input.
"""
global angle1, angle2
angle1 += 2*pi/float(frequency1)
angle2 += 2*pi/float(frequency2)
angle1 %= 2*pi
angle2 %= 2*pi
return array([array([
5 + 5*sin(angle1) + 10*cos(angle2),
7 + 7*sin(angle2) + 14*cos(angle1)])])

sliding_window = []

for i in range(lag - 1):
sliding_window.append(get_sample())

def get_pair():
"""
Returns an (current, later) pair, where 'later' is 'lag'
steps ahead of the 'current' on the wave(s) as defined by the
frequency.
"""

global sliding_window
sliding_window.append(get_sample())
input_value = sliding_window[0]
output_value = sliding_window[-1]
sliding_window = sliding_window[1:]
return input_value, output_value



Essentially, you just need to call get_pair to get an ‘input, output’ pair – the output being 23 time intervals ahead on the curve. Each have the NumPy dimensionality of [1, 2]. The first value ‘1’ means that the batch size is 1 – we will feed one input at a time while training/testing.

Now, I don’t pass the input directly into the LSTM. I try to improve the LSTM’s understanding of the input, by providing its first and second derivatives as well. So, if the input at time t is x(t), the derivative is x'(t) = (x(t) – x(t-1)). Following the analogy, x”(t) = (x'(t) – x'(t-1)). Here’s the code for that:


#Input Params
input_dim = 2

#To maintain state
last_value = array([0 for i in range(input_dim)])
last_derivative = array([0 for i in range(input_dim)])

def get_total_input_output():
"""
Returns the overall Input and Output as required by the model.
The input is a concatenation of the wave values, their first and
second derivatives.
"""
global last_value, last_derivative
raw_i, raw_o = get_pair()
raw_i = raw_i[0]
l1 = list(raw_i)
derivative = raw_i - last_value
l2 = list(derivative)
last_value = raw_i
l3 = list(derivative - last_derivative)
last_derivative = derivative
return array([l1 + l2 + l3]), raw_o



So the overall input to the model becomes a concatenated version of x(t), x'(t), x”(t). The obvious question to ask would be- Why not do this in the TensorFlow Graph itself? I did try it, and for some reason (which I don’t understand yet), there seems to seep in some noise into the Variables that act as memory units to maintain state.

But anyways, here’s the code for that too:


#Imports
import tensorflow as tf
from tensorflow.models.rnn.rnn import *

#Input Params
input_dim = 2

##The Input Layer as a Placeholder
#Since we will provide data sequentially, the 'batch size'
#is 1.
input_layer = tf.placeholder(tf.float32, [1, input_dim])

##First Order Derivative Layer
#This will store the last recorded value
last_value1 = tf.Variable(tf.zeros([1, input_dim]))
#Subtract last value from current
sub_value1 = tf.sub(input_layer, last_value1)
#Update last recorded value
last_assign_op1 = last_value1.assign(input_layer)

##Second Order Derivative Layer
#This will store the last recorded derivative
last_value2 = tf.Variable(tf.zeros([1, input_dim]))
#Subtract last value from current
sub_value2 = tf.sub(sub_value1, last_value2)
#Update last recorded value
last_assign_op2 = last_value2.assign(sub_value1)

##Overall input to the LSTM
#x and its first and second order derivatives as outputs of
#earlier layers
zero_order = last_assign_op1
first_order = last_assign_op2
second_order = sub_value2
#Concatenated
total_input = tf.concat(1, [zero_order, first_order, second_order])



If you have an idea of what might be going wrong, do leave a comment! In any case, the core model follows.

### The Model

So heres the the TensorFlow model:

1) The Imports:


#Imports
import tensorflow as tf
from tensorflow.models.rnn.rnn import *



2) Our input layer, as always, will be a Placeholder instance with the appropriate type and dimensions:


#Input Params
input_dim = 2

##The Input Layer as a Placeholder
#Since we will provide data sequentially, the 'batch size'
#is 1.
input_layer = tf.placeholder(tf.float32, [1, input_dim*3])



3) We then define out LSTM layer. If you are new to Recurrent Neural Networks or LSTMs, here are two excellent resources:

1. This blog post by Christopher Olah
2. This deeplearning.net post. It defines the math behind the LSTM cell pretty succinctly.

If you like to see implementation-level details too, then heres the relevant portion of the TensorFlow source for you.

Now the LSTM layer:


##The LSTM Layer-1
#The LSTM Cell initialization
lstm_layer1 = rnn_cell.BasicLSTMCell(input_dim*3)
#The LSTM state as a Variable initialized to zeroes
lstm_state1 = tf.Variable(tf.zeros([1, lstm_layer1.state_size]))
#Connect the input layer and initial LSTM state to the LSTM cell
lstm_output1, lstm_state_output1 = lstm_layer1(input_layer, lstm_state1,
scope=&quot;LSTM1&quot;)
#The LSTM state will get updated
lstm_update_op1 = lstm_state1.assign(lstm_state_output1)



We only use 1 LSTM layer. Providing a scope to the LSTM layer call (on line 8) helps in avoiding variable-scope conflicts if you have multiple LSTM layers.

The LSTM layer is followed by a simple linear regression layer, whose output becomes the final output.


##The Regression-Output Layer1
#The Weights and Biases matrices first
output_W1 = tf.Variable(tf.truncated_normal([input_dim*3, input_dim]))
output_b1 = tf.Variable(tf.zeros([input_dim]))
#Compute the output
final_output = tf.matmul(lstm_output1, output_W1) + output_b1



We have finished defining the model itself. But now, we need to initialize the training components. These help fine-tune the parameters/state of the model to make it ready for deployment. We won’t be using these components post training (ideally).

4) First, a Placeholder for the correct output associated with the input:


##Input for correct output (for training)
correct_output = tf.placeholder(tf.float32, [1, input_dim])



Then, the error will be computed using the LSTM output and the correct output as the Sum-of-Squares loss.


##Calculate the Sum-of-Squares Error
error = tf.pow(tf.sub(final_output, correct_output), 2)



Finally, we initialize an Optimizer to adjust the weights for the LSTM layer. I tried Gradient Descent, RMSProp as well as Adam Optimization. Adam works best for this model. Gradient Descent works really bad on LSTMs for some reason (that I can’t grasp right now). If you want to read more about Adam-Optimization, read this paper. I decided on the learning rate of 0.0006 after a lot of trial-and-error, and it seems to work best for the number of iterations I use (100k).


##The Optimizer



5) Finally, we initialize the Session and all required Variables as always.


##Session
sess = tf.Session()
#Initialize all Variables
sess.run(tf.initialize_all_variables())



The Training

Here’s the rudimentary code I used for training the model:


##Training

actual_output1 = []
actual_output2 = []
network_output1 = []
network_output2 = []
x_axis = []

for i in range(80000):
input_v, output_v = get_total_input_output()
_, _, network_output = sess.run([lstm_update_op1,
train_step,
final_output],
feed_dict = {
input_layer: input_v,
correct_output: output_v})

actual_output1.append(output_v[0][0])
actual_output2.append(output_v[0][1])
network_output1.append(network_output[0][0])
network_output2.append(network_output[0][1])
x_axis.append(i)

import matplotlib.pyplot as plt
plt.plot(x_axis, network_output1, 'r-', x_axis, actual_output1, 'b-')
plt.show()
plt.plot(x_axis, network_output2, 'r-', x_axis, actual_output2, 'b-')
plt.show()



Training takes almost a minute on my Intel i5 machine.

Consider the first wave. Initially, the network output is far from the correct one(The red one is the LSTM output):

But by the end, it fits pretty well:

Similar trends are seen for the second wave:

### Testing

In practical scenarios, the state at which you end training would rarely be the state at which you deploy. Therefore, prior to testing, I ‘fastforward’ both the waves first. Then, I flush the contents of the LSTM cell (mind you, the learned matrix parameters for the individual functions don’t change).


##Testing

for i in range(200):
get_total_input_output()

#Flush LSTM state
sess.run(lstm_state1.assign(tf.zeros([1, lstm_layer1.state_size])))



And here’s the rest of the testing code:


actual_output1 = []
actual_output2 = []
network_output1 = []
network_output2 = []
x_axis = []

for i in range(1000):
input_v, output_v = get_total_input_output()
_, network_output = sess.run([lstm_update_op1,
final_output],
feed_dict = {
input_layer: input_v,
correct_output: output_v})

actual_output1.append(output_v[0][0])
actual_output2.append(output_v[0][1])
network_output1.append(network_output[0][0])
network_output2.append(network_output[0][1])
x_axis.append(i)

import matplotlib.pyplot as plt
plt.plot(x_axis, network_output1, 'r-', x_axis, actual_output1, 'b-')
plt.show()
plt.plot(x_axis, network_output2, 'r-', x_axis, actual_output2, 'b-')
plt.show()



Its pretty similar to the training one, except for one small difference: I don’t run the training op anymore. Therefore, those components of the Graph don’t work at all.

Here’s the correct output with the model’s output for the first wave:

And the second wave:

Thats all for now! I am not a deep learning expert, and I still experimenting with RNNs, so do leave comments/suggestions if you have any! Cheers!