# Weekly Review: 11/18/2017

I finished the Motion Planning course from Robotics this week. It was expected, since the material was quite in line with data structures and algorithms that I have studied during my undergrad. The next one, Mobility, seems to be a notch tougher than Aerial Robotics, mainly because of the focus on calculus and physics (neither of which I have touched heavily in years).

Heres the articles this week:

Neural Networks: Software 2.0

In this article from Medium, the Director of AI at Tesla gives a fresh perspective on NNs. He refers to the set of weights in a Neural Network as a program which is learnt, as opposed to coded in by a human. This line of thought is justified by the fact that many decisions in Robotics, Search, etc. are taken by parametric ML systems. He also compares it to traditional ‘Software 1.0’, and points out the benefits of each.

Baselines in Machine Learning

In this article, a senior Research Scientist from Salesforce points out that we need to pay greater attention to baselines in Machine Learning. A baseline is any meaningful ‘benchmark’ algorithm that you would compare your algorithm against. The actual reference point would depend on your task – random/stratified systems for classification, state-of-the-art CNNs for image processing, etc. Read Neal’s answer to this Quora question for a deeper understanding.

The article ends with a couple of helpful tips, such as:

1. Use meaningful baselines, instead of using very crude code. The better your baseline, the more meaningful your results.
2. Start off with optimizing the baseline itself. Tune the weights, etc. if you have to – this gives you a good base to start your work on.

TensorFlow Lite

TensorFlow Lite is now in the Developer Preview mode. It is a light-weight platform for inference (not training) using ML models on mobile/embedded devices. Google calls it an ‘evolution of TensorFlow mobile’. While the latter is still the system you should use in production, TensorFlow lite appears to perform better on many benchmarks (Differences here). Some of the major plus-points of this new platform are smaller binaries, and support for custom ML-focussed hardware accelerators via the Android Neural Networks API.

Flatbuffers

Reading up on Tensorflow Lite also brought me to Flatbuffers, which are a ‘liter’ version of Protobufs. Flatbuffer is a data serialization library  for performance-critical applications. Flatbuffers provide the benefits of a smaller memory footprint and lesser generated code, mainly due to skipping of the parsing/unpacking step. Heres the Github repo.

This YCombinator article gives a nice overview of Adversarial attacks on ML models – attacks that provide ‘noisy’ data inputs to intelligent systems, in order to get a ‘wrong’ output. The author points out how Gradient descent can be used to sort-of reverse engineer spurious noise, in order to get data ‘misclassified’ by a neural network. The article also shows examples of such faulty inputs, and they are surprisingly indistinguishable from the original data!

# Weekly Review: 11/11/2017

The Motion Planning course is going faster than I expected. I completed 2 weeks within 5 days. Thats good I guess, since it means I might get to the Capstone project before I take a vacation to India.

Heres the stuff from this week:

Graphcore and the Intelligent Processing Unit (IPU)

Graphcore aims to disrupt the world of ML-focussed computing devices. In an interesting blog post, they visualize neuron connections in different CNN architectures, and talk about how they compare to the human brain.

If you are curious about how IPUs differ from CPUs and GPUs, this NextPlatform article gives a few hints: mind you, IPUs are yet to be ‘released’, so theres no concrete information out yet. If you want to brush up on why memory is so important for neural network training (more than inference), this is a good place to start.

Overview of Different CNN architectures

This article on the CV-Tricks blog gives a high-level overview of the major CNN architectures so far: AlexNet, VGG, Inception, ResNets, etc. Its a good place to go for reference if you ever happen to forget what one of them did differently.

On that note, this blog post by Adit Deshpande goes into the ‘Brief History of Deep Learning’, marking out all the main research papers of importance.

Meta-learning and AutoML

The New York Times posted an article about AI systems that can build other AI systems, thus leading to what they call ‘Meta-learning’ (Learning how to learn/build systems that learn).

Google has been dabbling in meta-learning with a project called AutoML. AutoML basically consists of a ‘Generator’ network that comes up with various NN architectures, which are then evaluated by a ‘Scorer’ that trains them and computes their accuracy. The gradients with respect to these scores are passed back to the Generator, in order to improve the output architectures. This is their original paper, in case you want to take a look.

The AutoML team recently wrote another post about large-scale object detection using their algorithms.

Tangent

People from Google recently open-sourced their library for computing gradients of Python functions. Tangent works directly on your Python code(rather than view it as a black-box), and comes up with a derivative function to compute its gradient. This is useful in cases where you might want to debug how/why some NN architecture is not getting trained the way it’s supposed to. Here’s their Github repo.

Reconstructing films with Neural Network

This blog post talks about the use of Autoencoders and GANs to reconstruct films using NNs trained on them. They also venture into reconstructing films using NNs trained on other stylish films (like A Scanner Darkly). The results are pretty interesting.

# Weekly Review: 11/04/2017

A busy week. I finished my Aerial Robotics course! The next in the Specialization is Computational Motion Planning, which I am more excited about – mainly because the curriculum goes more towards my areas of expertise. Aerial Robotics was challenging primarily because I was doing a lot of physics/calculus which I had not attempted since a long time.

Onto the articles for this week:

Colab is now public!

Google made Colaboratory, a previously-internal tool public. ‘Colab’ is a document-collaboration tool, with the added benefits of being able to run script-sized pieces of code. This is especially useful if you want to prototype small proofs-of-concept, which can then be shared with documentation and demo-able output. I had previously used it within Google to tinker with TensorFlow, and write small scripts for database queries.

Visual Guide to Evolution Strategies

The above link is a great introduction to Evolutionary Strategies such as GAs and CMA-ES. They show a visual representation of how each of these algorithms converges on the optima from the first iteration to the last on simple problems. Its pretty interesting to see how each algorithm ‘broadens’ or ‘focuses’ the domain of its candidate solutions as iterations go by.

Baidu’s Deep Voice

In a 2-part series (Part 1 & Part 2), the author discusses the architecture of Baidu’s Text-to-Speech system (Deep Voice). Take a look if you have never read about/worked on such systems and want to have a general idea of how they are trained and deployed.

Capsule Networks

Geoff Hinton and his team at Google recently discussed the idea of Capsule networks, which try and remedy the rigidity in usual CNNs – by defining groups of specialized neurons called ‘capsules’ whose contribution to higher-level neurons is decided by the similarity of output. Heres a small intro on Capsule Networks, or the original paper if you wanna delve deeper.

Nexar Challenge Results

Nexar released the results of its Deep-Learning challenge on Image segmentation – the problem of ‘boxing’ and ‘tagging’ objects in pictures with multiple entities present. This is especially useful in their own AI-dashboard apps, which need to be quite accurate to prevent possible collisions in deployment.

As further reading, you could also check out this article on the history of CNNs in Image Segmentation, another one on Region-of-Interest Pooling in CNNs, and Deformable Neural Networks. (All of these concepts are mentioned in the main Nexar article)

# Weekly Review: 10/28/2017

This was a pretty busy week with a lot going on, but I finally seem to be settling into my new role!

The study for Aerial Robotics is almost over with a week to go. There hasn’t been much coding in this course, but that was to be expected since it was more about PID-Control Theory and quadrotor dynamics. I am particularly interested in the Capstone/’final’ project for this course, which would involve building an autonomous robot in Pi.

Anyway, on to the interesting tidbits from this week:

AlphaGo Zero

Google’s Deepmind recently announced a new version of their AI-based Go player, the AlphaGo Zero. What makes this one so special, is that it breaks the common notion of intelligent systems requiring a LOT of data to produce decent results. AlphaGo Zero was only provided the basic rules of Go, and it performed the rest of the learning all by playing against itself. Oh and BTW, AlphaGo Zero beats AlphaGo, the previous champion in the game. This is indeed a landmark in demonstrating the power of good-old RL.

Read this article for a basic overview, and their paper in Nature for a detailed explanation. Brushing up on Monte Carlo Tree Search would certainly help.

Word Mover’s Distance

Given an excellent embedding of words such as Word2Vec, it is not very difficult to compute the semantic distance between individual terms. However, when it comes to big blocks of text, a simple ‘average’ over term-embeddings isn’t good enough for computing their relative distances.

In such cases, the Word Mover’s Distance, inspired from Earth Mover’s Distance, provides a better solution. It figures out the semantically closest term(s) from one document to each term in another, and then the average effort required to ‘rephrase’ one text in words of another. Click on the article link for a detailed explanation.

Robots generalizing from simulations

OpenAI posted a blog article about how they trained a robot only through simulations. This means that the robot received no data from sensors during the training phase, but was able to perform basic tasks in deployment after some calibration.

During the simulations, they used dynamics randomization to alter basic traits of the environment. This data was then fed to an LSTM to understand the settings and goals. A key insight from this work is Hindsight Experience Replay. Quoting the article, “Hindsight Experience Replay (HER), allows agents to learn from a binary reward by pretending that a failure was what they wanted to do all along and learning from it accordingly. (By analogy, imagine looking for a gas station but ending up at a pizza shop. You still don’t know where to get gas, but you’ve now learned where to get pizza.)

Concurrency in Go

If you are a Go Programmer, take a look at this old (but good) talk on concurrency patterns and constructs in the language.

Generalization Bounds in Machine Learning

The Generalization Gap for an ML system is defined as the difference between the training error and the generalization error. The Generalization Bound tries to put a bound on this value, based on probability theory. Read this article for a detailed mathematical explanation.

# Weekly Review: 10/21/2017

Its been a long while since I last posted, but for good reason! I was busy shifting base from Google’s Hyderabad office to their new location in Sunnyvale. This is my first time in the USA, so there is a lot to take in and process!

Anyway, I am now working on Google’s Social-Search and Ranking team. At the same time, I am also doing Coursera’s Robotics Specialization to learn a subject I have never really touched upon. Be warned if you ever decide to give it a try: their very first course, titled Aerial Robotics, has a lot of linear math and physics involved. Since I last did all this in my freshman year of college, I am just about getting the weeks done!

Since I already have my plate full with a lot of ToDos, but I also feel bad for not posting, I found a middle ground: I will try, to the best of my ability, to post one article each weekend about all the random/new interesting articles I read over the course of the week. This is partly for my own reference later on, since I have found myself going back to my posts quite a few times to revisit a concept I wrote on. So here goes:

Eigenvectors & Eigenvalues

Anything ‘eigen’ has confused me for a while now, mainly because I never understood the intuition behind the concept. The highest-rated answer to this Math-Stackexchange question did the job: Every square matrix is a linear transformation. The corresponding eigenvectors roughly describe how the transformation orients the results (or the directions of maximum change), while the corresponding eigenvalues describe the distortion caused in those directions.

Transfer Learning

Machine Learning currently specializes in utilizing data from a certain {Task, Domain} combo (for e.g., Task: Recognize dogs in photos, Domain: Photos of dogs) to learn a function. However, when this same function/model is used on a different but related task (Recognize foxes in photos) or a different domain (Photos of dogs taken during the night), it performs poorly. This article discusses Transfer Learning, a method to apply knowledge learned in one setting on problems in different ones.

Dynamic Filters

The filters used in Convolutional Neural Network layers usually have fixed weights at a certain layer, for a given feature map. This paper from the NIPS conference discusses the idea of layers that change their filter weights depending on the input. The intuition is this: Even though a filter is trained to look for a specialized feature within a given image, the orientation/shape/size of the feature might change with the image itself. This is especially true while analysing data such as moving objects within videos. A dynamic filter will then be able to adapt to the incoming data, and efficiently recognise the intended features inspite of distortions.

# An introduction to Bayesian Belief Networks

A Bayesian Belief Network (BBN), or simply Bayesian Network, is a statistical model used to describe the conditional dependencies between different random variables.

BBNs are chiefly used in areas like computational biology and medicine for risk analysis and decision support (basically, to understand what caused a certain problem, or the probabilities of different effects given an action).

### Structure of a Bayesian Network

A typical BBN looks something like this:

The shown example, ‘Burglary-Alarm‘ is one of the most quoted ones in texts on Bayesian theory. Lets look at the structural characteristics one by one. We will delve into the numbers/tables later.

#### Directed Acyclic Graph (DAG)

We obviously have one node per random variable.

Directed: The connections/edges denote cause->effect relationships between pairs of nodes. For example Burglary->Alarm in the above network indicates that the occurrence of a burglary directly affects the probability of the Alarm going off (and not the other way round). Here, Burglary is the parent, while Alarm is the child node.

Acyclic: There cannot be a cycle in a BBN. In simple English, a variable $A$ cannot depend on its own value – directly, or indirectly. If this was allowed, it would lead to a sort of infinite recursion which we are not prepared to deal with. However, if you do realize that an event happening affects its probability later on, then you could express the two occurrences as separate nodes in the BBN (or use a Dynamic BBN).

#### Parents of a Node

One of the biggest considerations while building a BBN is to decide which parents to assign to a particular node. Intuitively, they should be those variables which most directly affect the value of the current node.

Formally, this can be stated as follows: “The parents of a variable $X$ ($parents(X)$) are the minimal set of ancestors of $X$, such that all other ancestors of $X$ are conditionally independent of $X$ given $parents(X)$“.

Lets take this step by step. First off, there has to be some sort of a cause-effect relationship between $Y$ and $X$ for $Y$ to be one of the ancestors of $X$. In the shown example, the ancestors of Mary Calls are Burglary, Earthquake and Alarm.

Now consider the two ancestors Alarm and Earthquake. The only way an Earthquake would affect Mary Calls, is if an Earthquake causes Alarm to go off, leading to Mary Calls. Suppose someone told you that Alarm has in fact gone off. In this case, it does not matter what lead to the Alarm ringing – since Mary will react to it based on the stimulus of the Alarm itself. In other words, Earthquake and Mary Calls become conditionally independent if you know the exact value of Alarm.

Mathematically speaking, $P(Mary Calls|Alarm,Earthquake) == P(Mary Calls|Alarm)$.

Thus, $parents(X)$ are those ancestors which do not become conditionally independent of $X$ given the value of some other ancestor. If they do, then the resultant connection would actually be redundant.

#### Disconnected Nodes are Conditionally Independent

Based on the directed connections in a BBN, if there is no way to go from a variable $X$ to $Y$ (or vice versa), then $X$ and $Y$ are conditionally independent. In the example BBN, pairs of variables that are conditionally independent are {Mary Calls, John Calls} and {Burglary, Earthquake}.

It is important to remember that ‘conditionally independent’ does not mean ‘totally independent’. Consider {Mary Calls, John Calls}. Given the value of Alarm (that is, whether the Alarm went off or not), Mary and John each have their own independent probabilities of calling. However, if you did not know about any of the other nodes, but just that John did call, then your expectation of Mary calling would correspondingly increase.

### Mathematics behind Bayesian Networks

BBNs provide a mathematically correct way of assessing the effects of different events (or nodes, in this context) on each other. And these assessments can be made in either direction – not only can you compute the most likely effects given the values of certain causes, but also determine the most likely causes of observed events.

The numerical data provided with the BBN (by an expert or some statistical study) that allows us to do this is:

1. The prior probabilities of variables with no parents (Earthquake and Burglary in our example).
2. The conditional probabilities of any other node given every value-combination of its parent(s). For example, the table next to Alarm defines the probability that the Alarm will go off given the whether an Earthquake and/or Burglary have occurred.

In case of continuous variables, we would need a conditional probability distribution.

The biggest use of Bayesian Networks is in computing revised probabilities. A revised probability defines the probability of a node given the values of one or more other nodes as a fact. Lets take an example from the Burglary-Alarm BBN.

Suppose we want to calculate the probability that an earthquake occurred, given that the alarm went off, but there was no burglary. Essentially, we want $P(Earthquake|Alarm,\sim Burglary)$. Simplifying the nomenclature a bit, $P(E|A,\sim B)$.

Here, you can say that the Alarm going off ($A$) is evidence, the knowledge that the Burglary did not happen ($\sim B$) is context and the Earthquake occurring ($E$) is the hypothesis. Traditionally, if you knew nothing else, $P(E) = 0.002$, from the diagram. However, with the context and evidence in mind, this probability gets changed/revised. Hence, its called ‘computing revised probabilities’.

A version of Bayes Theorem states that

$P(X|YZ) = \frac{P(X|Z)P(Y|XZ)}{P(Y|Z)}$ …(1)

where $X$ is the hypothesis, $Y$ is the evidence, and $Z$ is the context. The numerator on the RHS denotes that probability that $X$$Y$ both occur given $Z$, which is a subset of the probability that atleast $Y$ occurs given $Z$, irrespective of $X$.

Using (1), we get

$P(E|A, \sim B) = \frac{P(E|\sim B)P(A|\sim B, E)}{P(A|\sim B)}$ …(2)

Since $E$ and $B$ are independent phenomena without knowledge of $A$,

$P(E|\sim B) = P(E) = 0.002$ …(3)

From the table for $A$,

$P(A|\sim B, E) = 0.29$ …(4)

Finally, using the Total Probability Theorem,

$P(A| \sim B) = P(E) P(A| E, \sim B) + P(\sim E) P(A| \sim E, \sim B)$ …(5)

Which is nothing but average of $P(A| E, \sim B)$$P(A| \sim E, \sim B)$, weighted on $P(E)$$P(\sim E)$ respectively.

Substituting values in (5),

$P(A| \sim B) = 0.002 * 0.29 + 0.998 * 0.001 = 0.001578$ …(6)

From (2), (3), (4), & (6), we get

$P(E|A, \sim B) = 0.367$

As you can see, the probability of the Earthquake actually increases if you know that the Alarm went off but a Burglary was not the cause of it. This should make sense intuitively as well. Which brings us to the final part –

### The ‘Explain Away’ Effect

The Explain Away effect, commonly associated with BBNs, is a result of computing revised probabilities. It refers to the phenomenon where knowing that one cause has occurred, reduces (but does not eliminate) the probability that the other cause(s) took place.

Suppose instead of knowing that there has been no burglary like in our example, you infact did know that one has taken place. It also led to the Alarm going off. With this information in mind, your tendency to check out the ‘earthquake’ hypothesis reduces drastically. In other words, the burglary has explained away the alarm.

It is important to note that the probability for other causes just gets reduced, but does NOT go down to zero. In a stroke of bad luck, it could have happened that both a burglary and an earthquake happened, and any one of the two stimuli could have led to the alarm ringing. To what extent you can ‘explain away’ an evidence depends on the conditional probability distributions.

# Residual Neural Networks as Ensembles

In a previous blog post, I had mentioned Residual Connections in the context of Google-Neural Machine Translation. I was not completely familiar with the intuition behind Residual Networks then, so heres a short post on what I gathered after reading some literature.

Residual Networks first got attention after this paper – “Deep Residual Learning for Image Recognition” by some folks from Microsoft Research. They used what they called residual connections in Convolutional Neural Networks, obtaining 1st-place results at ILSVRC 2015. However, it turns out that their notion of why the networks achieved such great performance might have been flawed.

The original authors thought that the new connections allowed training of very deep neural networks (their state-of-the-art model had 152 layers) by ‘preserving’ gradient across layers during training. This, according to them, solved the Vanishing Gradient Problem. However, the paper “Residual Networks Behave Like Ensembles of Relatively Shallow Networks” challenges this, attempting to prove that Residual NNs actually behave like ensembles of neural networks!

Lets take this step-by-step.

First, lets establish some nomenclature. In a classic neural network, suppose the output from the $n$th layer is written as $y_n$. Mathematically, you can write:

$y_n = f_n(y_{n-1})$ …(1)

Here, $f_n$ is a vectorial function that computes $y_n$ from $y_{n-1}$. In a standard neural network, the mathematical computation performed by $f$ is same across all layers (like the ReLU). What differentiates any $f_n$ from other layers is the set of weights assigned to every element of $y_{n-1}$ towards computing elements of $y_n$. These weights are of course learnt during training.

A typical layer in a Residual NN looks like this (taken from the original paper):

Based on our nomenclature, the expression for $y_n$ now becomes:

$y_n = f_n(y_{n-1}) + y_{n-1}$ …(2)

The difference, as you must have noticed, lies in the second term on the Right-Hand-Side. This term is what they call the identity skip(or short-cut) connection. ‘Skip‘, because you are skipping the $f_n$ computation, and ‘identity‘ because you are not multiplying $y_{n-1}$ by any set of weights before addition.

These identity short-cut connections are what make residual NNs special.

#### ‘Unravelling’ a Residual Neural Network

The reference paper‘s biggest insight lies in the way they look at equation (2) – that is, as a recursive definition. They call this unravelling the NN.

Lets consider a simple residual NN with three layers. The input can be denoted as $y_0$, with the outputs from the three subsequent layers being $y_1$$y_2$ and $y_3$ respectively. $y_3$ is the output of the NN as a whole.

Expanding equation (2) step-by-step, we would get

$y_3 = f_3(y_2) + y_2$ …(3)

$y_3 = f_3(f_2(y_1) + y_1) + f_2(y_1) + y_1$ …(4)

$y_3 = f_3(f_2(f_1(y_0) + y_0) + f_1(y_0) + y_0) + f_2(f_1(y_0) + y_0) + f_1(y_0) + y_0$ …(5)

Equation (5) essentially defines $y_3$ as the sum of outputs from every single layer (including $y_0$) of the neural network.

Contrast this with what we would see if it was a standard network:

$y_3 = f_3(f_2(f_1(y_0)))$ …(6)

(Ofcourse, the weights learnt would be different in each case)

Consider a situation where $y_0$ and the weights for $f_2$ are such that $f_2(y_1)$ turns out to be a vector of very low values.

If this happens in equation (6), you see that there is very little $f_3$ can do. Since there is only one term on the RHS, any input basically follows one path through the NN.

But now consider equation (5). If $f_2(y_1)$ is close to a zero vector, we still get

$y_3 = f_3(f_1(y_0) + y_0) + f_1(y_0) + y_0$

It looks like the input signal bypassed the second layer completely on its way to the output! In fact, if we did not have $f_2$ at all, you would derive the above expression for the network. This flexibility (i.e. the ability to bypass any layer) is provided to the NN by the skip-connections.

And the ability to ‘skip’ is not just restricted to one layer per run- depending on the input and the weights at every level, any combination of them could be on/off for a given case. The name residual can be interpreted better now – residual can mean ‘unused’, which is pretty much what $f_2$ is in the above example! That is not the case for equation (6), since the network does not have the choice to ignore $f_2$‘s output in the single term on the RHS.

Diagrammatically, you could ‘unravel’ the network like this:

Each ‘path’ in the above figure denotes some combination of the layers working on $y_0$ as it makes its way to the output. Every node denotes convergence (addition) of a set of paths, before passing on the sum to the next layer.

Ideally, every output would get significant contributions from every one of those paths. But that is not usually the case, as observed by the authors of the paper.

Now lets see what happens to the figure the moment you switch off $f_2$, cutting off all corresponding paths:

As you can see, even though $f_2$ makes no contributions, the input still makes its way through the remaining paths to generate output.

Intuitively, you can see why the authors think of this structure as an ensemble. At every single node (including the last one), there are multiple paths, any of which may or may not contribute to the overall output for a given scenario. You can say that every $y_0$ activates a unique set of paths in a residual NN.

And how many such paths are there? Its equivalent to the total cases resulting from every one of the layers being either on or off. Simple probability theory will tell you that its $2^N$, where $N$ is the total number of layers in the NN.

#### Unusual properties of Residual NNs

Thinking of residual NNs as ensembles motivated the authors to conduct some experiments to test their hypothesis. Essentially, what they tried to do is see which properties of ensembles residual NNs satisfy:

1. Resilience to layer deletion: In most Neural Networks (the ones that resemble equation (2)), deleting a layer has disastrous results on the outputs – whatever the size of the overall network may be. And for good reason, as you are disrupting the one term that corresponds to the output.

But that is not the case for Residual Neural Networks. In fact, deleting 1 (or even 2 or 3) layers in large residual NNs introduces only around 6-7% of an error into the performance of the network! Moreover, deleting more and more layers actually has a pretty smooth (as in mathematically smooth) effect on the total error:

This is pretty close to how an ensemble behaves – deleting models from an ensemble does introduce error, but the increase is never drastic with respect to the number of models removed.

This can be explained easily by looking at equation (5). Even if you delete one layer from a residual NN, that still leaves $2^{N-1}$ terms on the RHS of the output. A bunch of them even yield the same result as they would have, with the layer in.

2. Shortening of effective paths: This is actually contradictory to what people first believed about residual NNs – that they promote deeper networks.

During training, the authors observed that the updates were not happening uniformly across all layers, as they would for normal NNs. Every training point would adjust the weights along a specific set of ‘paths’ as shown in the unravelled diagram. And most of these paths, even with 152-layer deep networks, were only 20-30 levels deep!

Even on-line, every input activated only a specific set of paths without taking significant contributions from every single layer and path in the network.

This is where the biggest revelation lies: Residual NNs work better not by increasing the effective paths, but actually reducing them! What works here is that every input has a chance to take its own unique set of paths to the output, without having to go through every single layer.

This is again similar to how ensembles are trained and run. During training, you won’t observe all smaller models in an ensemble getting significantly adjusted for every training point. On-line, there will always only be a subset of models that give a strong output for any input.

Thats it for now! Do read the reference paper if you feel interested, or go through the original paper to see their usage in the image processing scenario.