Residual Neural Networks as Ensembles

In a previous blog post, I had mentioned Residual Connections in the context of Google-Neural Machine Translation. I was not completely familiar with the intuition behind Residual Networks then, so heres a short post on what I gathered after reading some literature.

Residual Networks first got attention after this paper – “Deep Residual Learning for Image Recognition” by some folks from Microsoft Research. They used what they called residual connections in Convolutional Neural Networks, obtaining 1st-place results at ILSVRC 2015. However, it turns out that their notion of why the networks achieved such great performance might have been flawed.

The original authors thought that the new connections allowed training of very deep neural networks (their state-of-the-art model had 152 layers) by ‘preserving’ gradient across layers during training. This, according to them, solved the Vanishing Gradient Problem. However, the paper “Residual Networks Behave Like Ensembles of Relatively Shallow Networks” challenges this, attempting to prove that Residual NNs actually behave like ensembles of neural networks!

Lets take this step-by-step.

First, lets establish some nomenclature. In a classic neural network, suppose the output from the $n$th layer is written as $y_n$. Mathematically, you can write:

$y_n = f_n(y_{n-1})$ …(1)

Here, $f_n$ is a vectorial function that computes $y_n$ from $y_{n-1}$. In a standard neural network, the mathematical computation performed by $f$ is same across all layers (like the ReLU). What differentiates any $f_n$ from other layers is the set of weights assigned to every element of $y_{n-1}$ towards computing elements of $y_n$. These weights are of course learnt during training.

A typical layer in a Residual NN looks like this (taken from the original paper):

Based on our nomenclature, the expression for $y_n$ now becomes:

$y_n = f_n(y_{n-1}) + y_{n-1}$ …(2)

The difference, as you must have noticed, lies in the second term on the Right-Hand-Side. This term is what they call the identity skip(or short-cut) connection. ‘Skip‘, because you are skipping the $f_n$ computation, and ‘identity‘ because you are not multiplying $y_{n-1}$ by any set of weights before addition.

These identity short-cut connections are what make residual NNs special.

‘Unravelling’ a Residual Neural Network

The reference paper‘s biggest insight lies in the way they look at equation (2) – that is, as a recursive definition. They call this unravelling the NN.

Lets consider a simple residual NN with three layers. The input can be denoted as $y_0$, with the outputs from the three subsequent layers being $y_1$$y_2$ and $y_3$ respectively. $y_3$ is the output of the NN as a whole.

Expanding equation (2) step-by-step, we would get

$y_3 = f_3(y_2) + y_2$ …(3)

$y_3 = f_3(f_2(y_1) + y_1) + f_2(y_1) + y_1$ …(4)

$y_3 = f_3(f_2(f_1(y_0) + y_0) + f_1(y_0) + y_0) + f_2(f_1(y_0) + y_0) + f_1(y_0) + y_0$ …(5)

Equation (5) essentially defines $y_3$ as the sum of outputs from every single layer (including $y_0$) of the neural network.

Contrast this with what we would see if it was a standard network:

$y_3 = f_3(f_2(f_1(y_0)))$ …(6)

(Ofcourse, the weights learnt would be different in each case)

Consider a situation where $y_0$ and the weights for $f_2$ are such that $f_2(y_1)$ turns out to be a vector of very low values.

If this happens in equation (6), you see that there is very little $f_3$ can do. Since there is only one term on the RHS, any input basically follows one path through the NN.

But now consider equation (5). If $f_2(y_1)$ is close to a zero vector, we still get

$y_3 = f_3(f_1(y_0) + y_0) + f_1(y_0) + y_0$

It looks like the input signal bypassed the second layer completely on its way to the output! In fact, if we did not have $f_2$ at all, you would derive the above expression for the network. This flexibility (i.e. the ability to bypass any layer) is provided to the NN by the skip-connections.

And the ability to ‘skip’ is not just restricted to one layer per run- depending on the input and the weights at every level, any combination of them could be on/off for a given case. The name residual can be interpreted better now – residual can mean ‘unused’, which is pretty much what $f_2$ is in the above example! That is not the case for equation (6), since the network does not have the choice to ignore $f_2$‘s output in the single term on the RHS.

Diagrammatically, you could ‘unravel’ the network like this:

Each ‘path’ in the above figure denotes some combination of the layers working on $y_0$ as it makes its way to the output. Every node denotes convergence (addition) of a set of paths, before passing on the sum to the next layer.

Ideally, every output would get significant contributions from every one of those paths. But that is not usually the case, as observed by the authors of the paper.

Now lets see what happens to the figure the moment you switch off $f_2$, cutting off all corresponding paths:

As you can see, even though $f_2$ makes no contributions, the input still makes its way through the remaining paths to generate output.

Intuitively, you can see why the authors think of this structure as an ensemble. At every single node (including the last one), there are multiple paths, any of which may or may not contribute to the overall output for a given scenario. You can say that every $y_0$ activates a unique set of paths in a residual NN.

And how many such paths are there? Its equivalent to the total cases resulting from every one of the layers being either on or off. Simple probability theory will tell you that its $2^N$, where $N$ is the total number of layers in the NN.

Unusual properties of Residual NNs

Thinking of residual NNs as ensembles motivated the authors to conduct some experiments to test their hypothesis. Essentially, what they tried to do is see which properties of ensembles residual NNs satisfy:

1. Resilience to layer deletion: In most Neural Networks (the ones that resemble equation (2)), deleting a layer has disastrous results on the outputs – whatever the size of the overall network may be. And for good reason, as you are disrupting the one term that corresponds to the output.

But that is not the case for Residual Neural Networks. In fact, deleting 1 (or even 2 or 3) layers in large residual NNs introduces only around 6-7% of an error into the performance of the network! Moreover, deleting more and more layers actually has a pretty smooth (as in mathematically smooth) effect on the total error:

This is pretty close to how an ensemble behaves – deleting models from an ensemble does introduce error, but the increase is never drastic with respect to the number of models removed.

This can be explained easily by looking at equation (5). Even if you delete one layer from a residual NN, that still leaves $2^{N-1}$ terms on the RHS of the output. A bunch of them even yield the same result as they would have, with the layer in.

2. Shortening of effective paths: This is actually contradictory to what people first believed about residual NNs – that they promote deeper networks.

During training, the authors observed that the updates were not happening uniformly across all layers, as they would for normal NNs. Every training point would adjust the weights along a specific set of ‘paths’ as shown in the unravelled diagram. And most of these paths, even with 152-layer deep networks, were only 20-30 levels deep!

Even on-line, every input activated only a specific set of paths without taking significant contributions from every single layer and path in the network.

This is where the biggest revelation lies: Residual NNs work better not by increasing the effective paths, but actually reducing them! What works here is that every input has a chance to take its own unique set of paths to the output, without having to go through every single layer.

This is again similar to how ensembles are trained and run. During training, you won’t observe all smaller models in an ensemble getting significantly adjusted for every training point. On-line, there will always only be a subset of models that give a strong output for any input.

Thats it for now! Do read the reference paper if you feel interested, or go through the original paper to see their usage in the image processing scenario.

A (small) introduction to Boosting

What is Boosting?

Boosting is a machine learning meta-algorithm that aims to iteratively build an ensemble of weak learners, in an attempt to generate a strong overall model.

Lets look at the highlighted parts one-by-one:

1. Weak Learners: A ‘weak learner’ is any ML algorithm (for regression/classification) that provides an accuracy slightly better than random guessing. For example, consider a problem of binary classification with approximately 50% of samples belonging to each class. Random guessing in this case would yield an accuracy of around 50%. So a weak learner would be any algorithm, however simple, that slightly improves this score – say 51-55% or more. Usually, weak learners are pretty basic in nature. Some examples are Decision Stumps or smaller Decision Trees. In most cases, either one of these two is used as a weak learner in Boosting frameworks.

2. Ensemble: The overall model built by Boosting is a weighted sum of all of the weak learners. The weights and training given to each ensures that the overall model yields a pretty high accuracy (sometimes state-of-the-art).

3. Iteratively build: Many ensemble methods such as bagging/random forests are capable of training their components in parallel. This is because the training of any of those weak learners does not depend on the training of the others. That is not the case with Boosting. At each step, Boosting tries to evaluate the shortcomings of the overall model built till the previous step, and then generates a weak learner to battle these shortcomings. This weak learner is then added to the total model. Therefore, the training must necessarily proceed in a serial/iterative manner.

4. Meta-algorithm: Since Boosting isn’t necessarily an ML algorithm by itself, but rather uses other (basic) algorithms to build a stronger one, it is said to be a ‘meta’ algorithm.

How does Boosting work?

Usually, a Boosting framework for regression works as follows:

(The overall model at step ‘i’ is denoted by $F_i$.)

1. Start with a simple model $F_0$, usually predicting a common (average) value for all instances.

2. For $i$ from $1$ to $M$ do:

2.a) Evaluate the shortcomings of $F_{i - 1}$, in terms of a value $r_j$ for each element $x_j$ in the training set.
2.b) Generate a new weak learner $h_i(x)$ based on the $r_j$s.
2.c) Compute the corresponding weight $\gamma_i$.
2.d) Update the current model as follows:

$F_i = F_{i - 1} + \nu \gamma_i h_i(x)$

where $\nu$ denotes the learning rate of the framework.

3. Return $F_M$ as the overall output.

Each of the iterations in Boosting essentially tries to ‘improve’ the current model by introducing another learner into the ensemble. Having such an ensemble not only reduces the bias (which is generally pretty high for weak learners), but also the variance (since multiple learners contribute to the overall output, each with their own unique training). There are various versions of Boosting, and all of them primarily differ in the way they perform steps 2.a, 2.b. and 2.c.

For example, Gradient Boosting computes $r_j$ as the gradient of a Loss function (whose expression involves target output $y$ and model output till previous step, $F_{i - 1}$) with respect to $F_{i - 1}$ at data point $x_j$. The $i$th weak learner $h_i$ is then trained to predict $r_j$ from $x_j$. The weight $\gamma_i$ is then optimized so as to minimize the total Loss value with respect to $F_i$.

AdaBoost, short for Adaptive Boosting, works by assigning $r_j$ as a ‘weight’ to $x_j$ based on how miscalculated the model output is according to $F_{i - 1}$. $h_i$ and $\gamma_i$ are then trained to optimize the weighted sum of errors with respect to the output from $h_i$. AdaBoost is considered to be one of the best out-of-the-box classifiers today. Mathematically speaking, AdaBoost is a special case of Gradient Boosting.

The weak-learners used in Boosting are indeed ‘weak’. The most commonly used ones are decision trees of a fixed number of leaf nodes. Using complex methods such as SVMs in Boosting usually leads to overfitting. However, the total number of weak learners $M$ should also be controlled, mostly based on experimentation. Having a low $M$ will cause underfitting, but a very high value could lead to overfitting.

To improve the bias/variance characteristics of Boosting, bagging is generally employed. What this essentially does is train each weak learner on a subset of the overall dataset (rather than the whole training set). This causes a good decrease in variance and tends to reduce overfitting.

Classification applications would work slightly different from regression, of course. The most commonly used method is to build a boosted model for each class. This model would predict the probability of a data point belonging to that class. During training, the inputs belonging to the corresponding class will be given a target of 1, with the others getting a target of 0. The outputs from all the models could then be put into a softmax function for normalization.

What are the drawbacks of Boosting?

The most obvious drawback of boosting is the computational complexity. Performing so many iterations, and generating a new model at each, requires a lot of computations and time (and also space). The fact that the ensemble has to be built iteratively doesn’t help matters. However, using simple weak learners (instead of big decision trees) helps to mitigate this problem.

The biggest criticism for Boosting comes from its sensitivity to noisy data. Think of it this way. At each iteration, Boosting tries to improve the output for data points that haven’t been predicted well till now. If your dataset happens to have some (or a lot) of misclassified/outlier points, the meta-algorithm will try very hard to fit subsequent weak learners to these noisy samples. As a result, overfitting is likely to occur. The exponential loss function used in AdaBoost is particularly vulnerable to this issue (since the error from an outlier will get an exponentially-weighed importance for future learners).