background-image: url(img/city.jpg)
background-size: cover
class: inverse,  middle

## Optimization in Neural Networks
### Part 2: Better Generalization

Date: 2020-10-25

---

## Introduction

### Techniques to reduce overfitting and improve generalization
* How techniques that reduce model complexity have a **regularizing effect** resulting in less
overfitting and better generalization.
* How to add a **penalty to the loss function to encourage smaller model weights**.
* How to add a **penalty to the loss function to encourage sparse internal representations**.
* How to add a **constraint to the model** to force small model weights and lower complexity
models.
* How to use **dropout** during training to decouple model layers.
* How to add **noise** to the training process to promote model robustness.
* How to use **early stopping** to halt model training at the right time.

---

# Fix Overfitting with Regularization

### Reduce Overfitting by Constraining Complexity

There are two ways to approach an overfit model:

1. Reduce overfitting by training the network on more examples.
2. Reduce overfitting by changing the complexity of the network.

A benefit of very deep neural networks is that their performance continues to improve as
they are fed larger and larger datasets. Even with a near-infinite number of examples, a model will
eventually plateau at the limit of what the capacity of the network allows it to learn. A model
can overfit a training dataset because it has sufficient capacity to do so. Reducing the capacity
of the model reduces the likelihood of the model overfitting the training dataset, to the point
where it no longer overfits. The capacity of a neural network model, **its complexity, is defined by both its structure in terms of nodes and layers and its parameters in terms of its weights.**

---

Therefore, we can reduce the complexity of a neural network to **reduce overfitting** in one of two
ways:

.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph4.mt2[

* Change network complexity by changing the **network structure (number of weights)**.
* Change network complexity by changing the **network parameters (values of weights)**.

]

It is more common to focus on methods that constrain the size of the weights in a neural
network because a single network structure can be defined that is under-constrained, e.g. has a
much larger capacity than is required for the problem, and regularization can be used during
training to ensure that the model does not overfit.

Techniques that seek to reduce overfitting (reduce generalization error) by keeping
network weights small are referred to as **regularization methods.** More specifically, regularization
refers to a class of approaches that add additional information to transform an ill-posed problem
into a more stable well-posed problem.

---

## Regularization

* Regularization methods are so widely used to reduce overfitting that the term regularization
may be used for any method that improves the generalization error of a neural network model.

.bg-lightest-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt2[

Regularization is any modification we make to a learning algorithm that is intended
to reduce its generalization error but not its training error. Regularization is one of
the central concerns of the field of machine learning, rivaled in its importance only
by optimization.

]

---

## Regularization for Neural Networks

The simplest and perhaps most common regularization method is to add a penalty to the loss
function in proportion to the size of the weights in the model.

.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph4.mt2[

**Weight Regularization: Penalize the model during training based on the magnitude of the weights.**

]

This will encourage the model to map the inputs to the outputs of the training dataset
in such a way that the weights of the model are kept small. This approach is called weight
regularization or weight decay and has proven very effective for decades for both simpler linear
models and neural networks.

#### Most common additional regularization methods
* **Activity Regularization:** Penalize the model during training based on the magnitude
of the activations.
* **Weight Constraint:** Constrain the magnitude of weights to be within a range or below
a limit.
* **Dropout:** Probabilistically remove inputs during training.
* **Noise:** Add statistical noise to inputs during training.
* **Early Stopping:** Monitor model performance on a validation set and stop training when
performance degrades.

---

# Penalize Large Weights with Weight Regularization

---

## Penalize Large Weights with Weight Regularization

We will cover these topics:

.bg-lightest-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt2[
* **Large weights in a neural network are a sign of a more complex network that has overfit
the training data.**
* **Penalizing a network based on the size of the network weights during training can reduce
overfitting.**
* **An L1 or L2 vector norm penalty can be added to the optimization of the network to
encourage smaller weights.**

]

---

## Problem with Large Weights

* The longer we train the network, the more specialized the weights will become to the training data, overfitting the training data. The weights will grow in size in order to handle the specifics of the examples seen in the training data. Large weights make the network unstable. Although the weights will be
specialized to the training dataset, minor variation or statistical noise on the expected inputs
will result in large differences in the output.

* Generally, we refer to this model as having a large variance and a small bias. That is, the
model is sensitive to the specific examples, the statistical noise, in the training dataset.

.bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt2[
A model with large weights is more complex than a model with smaller weights. It is a sign of a network
that may be overly specialized to training data. In practice, **we prefer to choose the simpler models to solve a problem (e.g. Occam’s razor). We prefer models with smaller weights.**

]

---

Larger weights result in a larger penalty, in the form of a larger loss score. The optimization
algorithm will then push the model to have smaller weights, i.e. weights no larger than needed
to perform well on the training dataset. Smaller weights are considered more regular or less
specialized and as such, we refer to this penalty as weight regularization. When this approach of
penalizing model coefficients is used in other machine learning models such as linear regression
or logistic regression, it may be referred to as shrinkage, because the penalty encourages the
coefficients to shrink during the optimization process.

**The addition of a weight size penalty or weight regularization to a neural network has the effect of reducing generalization error and of allowing the model to pay less attention to less relevant input variables.**

### How to Penalize Large Weights

There are two parts to penalizing the model based on the size of the weights.

.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph4.mt2[

1. The calculation of the size of the weights.
2. The amount of attention that the optimization process should pay to the penalty.

]

---

## Calculate the size of the weights

Neural network weights are real values that can be positive or negative, so simply adding
the weights is not sufficient. There are two main approaches used to calculate the size of the
weights:

.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph4.mt2[

* Calculate the sum of the absolute values of the weights, called the **L1 norm (or L1)**.
* Calculate the sum of the squared values of the weights, called the **squared L2 norm (or L2)**.

]

L1 encourages weights toward 0.0 where possible, resulting in sparser weights (more weights
with a value of 0.0). L2 offers more nuance: it penalizes larger weights more severely but results
in less sparse weights. The use of L2 in linear and logistic regression is often referred to as
**ridge regression**, and in other academic communities it is also known as **Tikhonov regularization**.
This is useful to know when trying to develop an intuition for the penalty or looking for
examples of its usage.
---

### Weight decay

The weights may be considered a vector and the magnitude of a vector is called its norm,
from linear algebra. As such, penalizing the model based on the size of the weights is also
referred to as a weight or parameter norm penalty. **It is possible to include both L1 and L2 approaches to calculating the size of the weights as the penalty.** This is akin to the use of both
penalties used in the **Elastic Net algorithm** for linear and logistic regression. The L2 approach
is perhaps the most used and is traditionally referred to as weight decay in the field of neural
networks. It is called shrinkage in statistics, a name that encourages you to think of the impact
of the penalty on the model weights during the learning process.
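
The name weight decay can be made concrete with a small sketch. It assumes plain gradient descent and an L2 penalty of `alpha * sum(w**2)`, whose gradient is `2 * alpha * w`. The learning rate, alpha, and starting weights are made-up values. With the data gradient set to zero (the data gives no support to the weights), every update multiplies each weight by the same factor below 1.

```python
learning_rate = 0.1   # assumed value
alpha = 0.5           # assumed penalty strength
weights = [4.0, -2.0]

def decay_step(w, data_grad):
    """One gradient descent step on loss + alpha * sum(w_i ** 2)."""
    return [wi - learning_rate * (gi + 2.0 * alpha * wi)
            for wi, gi in zip(w, data_grad)]

# No support from the data: the data gradient is zero.
for _ in range(3):
    weights = decay_step(weights, [0.0, 0.0])

# Each step scales the weights by 1 - 2 * learning_rate * alpha = 0.9.
print(weights)
```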

.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph4.mt2[
This particular choice of regularizer is known in the machine learning literature
as **weight decay** because in sequential learning algorithms, **it encourages weight values to decay towards zero**, unless supported by the data. In statistics, it provides
an example of a parameter shrinkage method because it shrinks parameter values
towards zero.

]

---

## Control Impact of the Penalty

* The calculated size of the weights is added to the loss objective function when training the
network. Rather than adding the full penalty to the loss directly, it can be weighted using
a new hyperparameter called alpha (α) or sometimes lambda (λ). This controls the amount of
attention that the learning process should pay to the penalty or, put another way, the amount
to penalize the model based on the size of the weights. **The alpha hyperparameter typically has a value between 0.0 (no penalty) and 1.0 (full penalty).**

* This hyperparameter controls the amount of bias in the model from 0.0, or low bias (high variance), to 1.0, or high bias (low variance). If the penalty is too strong, the model will underestimate the weights and underfit the problem. If the penalty is too weak, the model will be allowed to overfit the training data. The vector norm of the weights is often calculated per-layer, rather than across the entire network.
This allows more flexibility in the choice of the type of regularization used (e.g. L1 for inputs,
L2 elsewhere) and flexibility in the alpha value, although it is common to use the same alpha
value on each layer by default.
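
One way to picture how alpha enters the objective is the sketch below, which adds a per-layer penalty to a data loss. The layer names, weights, data loss value, and alpha values are all invented for illustration. A real framework would compute these internally.

```python
def l1(w):
    return sum(abs(v) for v in w)

def l2(w):
    return sum(v * v for v in w)

# Each layer: (flattened weights, penalty function, alpha).
# L1 on the input layer and L2 elsewhere, as suggested in the text.
layers = {
    "input":  ([0.5, -1.0, 0.0], l1, 0.01),
    "hidden": ([2.0, -1.0],      l2, 0.01),
}

def total_loss(data_loss, layers):
    """Data loss plus the alpha-weighted size of each layer's weights."""
    penalty = sum(alpha * norm(w) for w, norm, alpha in layers.values())
    return data_loss + penalty

print(total_loss(0.30, layers))  # 0.30 + 0.01 * 1.5 + 0.01 * 5.0
```

Setting every alpha to 0.0 recovers the unpenalized data loss, while larger alpha values make the weight sizes dominate the objective.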

---

## Examples of Weight Regularization

* **.dark-red[Examples of MLP Weight Regularization]**

Weight regularization was borrowed from penalized regression models in statistics. The most
common type of regularization is L2, also called simply weight decay. Reasonable values of the
regularization hyperparameter (lambda) range between 0 and 0.1 and are often chosen on a
logarithmic scale, such as 0.1, 0.001, 0.0001, etc.

* **.dark-red[Examples of CNN Weight Regularization]**

Weight regularization does not seem widely used in CNN models, or if it is used, its use is not
widely reported. L2 weight regularization with a very small regularization hyperparameter, such
as 0.0005 (5 × 10⁻⁴), may be a good starting point.

* **.dark-red[Examples of LSTM Weight Regularization]**

It is common to use weight regularization with LSTM models. An often used configuration is L2
(weight decay) with a very small hyperparameter (e.g. 10⁻⁶). It is often not reported which weights
are regularized (input, recurrent, and/or bias), although one would assume that only the input and
recurrent weights are regularized.

---

## Tips for Using Weight Regularization

* **.dark-red[Use With All Network Types]**

Weight regularization is a generic approach. It can be used with most, perhaps all, types of
neural network models.

* **.dark-red[Standardize Input Data]**

It is generally good practice to rescale input variables to have the same scale. When input
variables have different scales, the scale of the weights of the network will, in turn, vary
accordingly. This introduces a problem when using weight regularization because the absolute
or squared values of the weights must be added for use in the penalty. This problem can be
addressed by either normalizing or standardizing input variables.

* **.dark-red[Use a Larger Network]**

It is common for larger networks (more layers or more nodes) to more easily overfit the training
data. When using weight regularization, it is possible to use larger networks with less risk of
overfitting. A good configuration strategy may be to start with larger networks and use weight
decay.

---

### Tips for Using Weight Regularization

* **.dark-red[Grid Search Parameters]**

It is common to use small values for the regularization hyperparameter that controls the
contribution of each weight to the penalty. Perhaps start by testing values on a log scale, such
as 0.1, 0.001, and 0.0001. Then use a grid search at the order of magnitude that shows the most
promise.
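
The coarse log-scale search can be sketched as follows. To stay short and runnable, this uses a one-weight linear model with an L2 penalty (ridge regression in closed form) as a stand-in for a network. The toy data, the candidate values, and all function names are assumptions made for illustration.

```python
import random

rng = random.Random(0)

def make_data(n):
    """Noisy samples of y = 2x (a made-up toy problem)."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 0.5) for x in xs]
    return xs, ys

def fit_ridge(xs, ys, alpha):
    """Closed-form minimizer of sum((y - w*x)**2) + alpha * w**2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

def mse(w, xs, ys):
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = make_data(10)
val_x, val_y = make_data(100)

# Coarse search on a log scale; refine around the best value afterwards.
results = {}
for alpha in [0.0, 0.0001, 0.001, 0.01, 0.1, 1.0]:
    w = fit_ridge(train_x, train_y, alpha)
    results[alpha] = mse(w, val_x, val_y)

best_alpha = min(results, key=results.get)
print(best_alpha, results[best_alpha])
```

The same loop structure applies to a neural network: only the fitting and evaluation functions change.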

* **.dark-red[Use L1 + L2 Together]**

Rather than trying to choose between L1 and L2 penalties, use both. Modern and effective
linear regression methods such as the Elastic Net use both L1 and L2 penalties at the same
time and this can be a useful approach to try. This gives you both the nuance of L2 and the
sparsity encouraged by L1.

* **.dark-red[Use on a Trained Network]**

The use of weight regularization may allow more elaborate training schemes. For example, a
model may be fit on training data first without any regularization, then updated later with the
use of a weight penalty to reduce the size of the weights of the already well-performing model.

---

# Sparse Representations with Activity Regularization

---

## Sparse Representations with Activity Regularization

We will cover these topics:

.bg-lightest-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt2[

* Neural networks learn features from data, and models such as autoencoders and
encoder-decoder models explicitly seek effective learned representations.
* **Similar to weights, large values in learned features, e.g. large activations, may indicate an
overfit model.**
* **The addition of penalties to the loss function that penalize a model in proportion to the
magnitude of the activations may result in more robust and generalized learned features.**

]

---

## Problem With Learned Features

* Deep learning models are able to perform **feature learning**. That is, during the training of the network, the model will automatically extract the salient features from the input patterns or **learn features**. 
* These features may be used in the network in order to predict a quantity for regression or predict a class value for classification. These internal representations are tangible things. The output of a hidden layer within the network represents the features learned by the model at that point in the network. 
* **The learned features, or encoded inputs, must be large enough to capture the salient features of the input but also focused enough to not overfit the specific examples in the training dataset**. As such, there is a tension between the expressiveness and the generalization of the learned features.

.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph4.mt2[

* **In the same way that large weights in the network can signify an unstable and overfit model, large output values in the learned features can signify the same problems.**

]

---

## Encourage Small Activations

* The loss function of the network can be updated to penalize models in proportion to the
magnitude of their activations. This is similar to weight regularization, where the loss function is updated to penalize the model in proportion to the magnitude of the weights. The output of a layer is referred to as its activation or activity. As such, this form of penalty or regularization is referred to as **activation regularization or activity regularization**.

The output of an encoder or, generally, the output of a hidden layer in a neural network
may be considered the representation of the problem at that point in the model. As such, this type of penalty may also be referred to as **representation regularization**. **The desire to have small activations or even very few activations with mostly zero values is also called a desire for sparsity**. As such, this type of penalty is also referred to as **sparse feature learning**.

.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph4.mt2[

* **One way to limit the information content of an overcomplete code is to make it sparse.**

]

---

* Sparsity is most commonly sought when a larger-than-required (i.e. overcomplete) hidden layer is used to learn features, which may encourage overfitting. **The introduction of a sparsity penalty counters this problem and encourages better generalization.** A sparse overcomplete learned feature has been shown to be more effective than other types of learned features, offering better robustness to noise and even to transforms in the input, e.g. learned features of images may have improved invariance to the position of objects in the image.

* There is a general focus on sparsity of the representations rather than small vector magnitudes. A study of these representations that is more general than the use of neural networks is known as sparse coding.

---

## How to Encourage Small Activations

An activation penalty can be applied per-layer, perhaps only at one layer that is the focus of the learned representation, such as the output of the encoder model or the middle (bottleneck) of an autoencoder model. A constraint can be applied that adds a penalty proportional to the magnitude of the vector output of the layer. The activation values may be positive or negative, so we cannot simply sum the values. **Two common methods for calculating the magnitude of the activation are**:

.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph4.mt2[

* **Sum of the absolute activation values, called the L1 vector norm.**
* **Sum of the squared activation values, called the squared L2 vector norm.**

]

The **L1 norm encourages sparsity**, i.e. it allows some activations to become zero, whereas the **L2 norm encourages small activation values in general**. The L1 norm may be the more commonly used penalty for activation regularization. A hyperparameter must be specified that indicates the degree to which the loss function will weight or pay attention to the penalty. Common values are on a logarithmic scale between 0 and 0.1, such as 0.1, 0.001, 0.0001, etc. **Activity regularization can be used in conjunction with other regularization techniques, such as weight regularization.**
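
A minimal sketch of an L1 activity penalty on one hidden layer is shown below. The input, weights, biases, and alpha are made-up values, and the helper names are hypothetical. The resulting penalty would be added to the loss during training.

```python
def relu(z):
    return max(0.0, z)

def hidden_activations(x, weights, biases):
    """Output of one dense ReLU layer: one activation per node."""
    return [relu(sum(wi * xi for wi, xi in zip(w, x)) + b)
            for w, b in zip(weights, biases)]

def l1_activity_penalty(activations, alpha):
    """alpha times the sum of absolute activation values."""
    return alpha * sum(abs(a) for a in activations)

# Made-up input, weights, and biases for a layer with three nodes.
x = [1.0, -2.0]
weights = [[0.5, 0.25], [1.0, 1.0], [-1.0, -2.0]]
biases = [0.0, 0.5, 0.0]

h = hidden_activations(x, weights, biases)
print(h)  # [0.0, 0.0, 3.0]
print(l1_activity_penalty(h, alpha=0.001))
```

Only the one active node contributes to the penalty, so the optimizer is pushed toward representations in which few nodes fire strongly.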

---

## Tips for Using Activation Regularization

* **.dark-green[Use With All Network Types]**

Activation regularization is a generic approach. It can be used with most, perhaps all, types of neural network models, not least the most common network types of Multilayer Perceptrons, Convolutional Neural Networks, and Long Short-Term Memory Recurrent Neural Networks.

* **.dark-green[Use With Autoencoders and Encoder-Decoders]**

Activity regularization may be best suited to those model types that explicitly seek an efficient learned representation. These include models such as autoencoders (i.e. sparse autoencoders) and encoder-decoder models, such as encoder-decoder LSTMs used for sequence-to-sequence prediction problems.

* **.dark-green[Experiment With Different Norms]**

The most common activation regularization is the L1 norm as it encourages sparsity. Experiment
with other types of regularization such as the L2 norm or using both the L1 and L2 norms at
the same time, as in the Elastic Net linear regression algorithm.

---

### Tips for Using Activation Regularization

* **.dark-green[Use Rectified Linear Activation]**

The rectified linear activation function, also called relu, is an activation function that is now widely used in the hidden layer of deep neural networks. Unlike classical activation functions such as tanh (hyperbolic tangent function) and sigmoid (logistic function), the relu function allows exact zero values easily. This makes it a good candidate when learning sparse representations, such as with the L1 vector norm activation regularization.
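
The difference in exact zeros can be checked with a short sketch. The pre-activation values are made up for illustration.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Made-up pre-activation values for one layer.
pre_activations = [-3.0, -0.5, 0.0, 0.5, 3.0]

def count_zeros(values):
    return sum(1 for v in values if v == 0.0)

print(count_zeros([relu(z) for z in pre_activations]))       # 3
print(count_zeros([math.tanh(z) for z in pre_activations]))  # 1 (only at z = 0)
print(count_zeros([sigmoid(z) for z in pre_activations]))    # 0
```
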

* **.dark-green[Grid Search Parameters]**

It is common to use small values for the regularization hyperparameter that controls the
contribution of each activation to the penalty. Perhaps start by testing values on a log scale, such as 0.1, 0.001, and 0.0001.

* **.dark-green[Standardize Input Data]**

It is a generally good practice to rescale input variables to have the same scale. When input variables have different scales, the scale of the weights of the network will, in turn, vary accordingly. Large weights can saturate the nonlinear transfer function and reduce the variance in the output from the layer. This may introduce a problem when using activation regularization. This problem can be addressed by either normalizing or standardizing input variables.

---

### Tips for Using Activation Regularization

* **.dark-green[Use an Overcomplete Representation]**

Configure the layer chosen to be the learned features, e.g. the output of the encoder or the
bottleneck in the autoencoder, to have more nodes than may be required. This is called an
overcomplete representation, and it will encourage the network to overfit the training examples. This can be countered with strong activation regularization in order to encourage a rich learned representation that is also sparse.

---

# Force Small Weights with Weight Constraints

---

## Force Small Weights with Weight Constraints

We will cover these topics:

.bg-lightest-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt2[

* **Weight penalties encourage but do not require neural networks to have small weights.**
* **Weight constraints, such as the L2 norm and maximum norm, can be used to force neural networks to have small weights during training.**
* **Weight constraints can improve generalization when used in conjunction with other regularization methods like dropout.**
]

---

## How to Use a Weight Constraint

A constraint is enforced on each node within a layer. All nodes within the layer use the same constraint, and often multiple hidden layers within the same network will use the same constraint. Recall that the vector norm is the magnitude of the vector of weights in a node and that by default it is calculated as the L2 norm, i.e. the square root of the sum of the squared values in the vector. Some examples of constraints that could be used include:

.bg-washed-yellow.b--gold.ba.bw2.br3.shadow-5.ph4.mt2[

* Force the vector norm to be 1.0 (e.g. the unit norm).
* Limit the maximum size of the vector norm (e.g. the maximum norm).
* Limit the minimum and maximum size of the vector norm (e.g. the min max norm).

]

---

### How to Use a Weight Constraint

* The maximum norm, also called **max-norm or maxnorm**, is a popular constraint because it
is less aggressive than other norms such as the unit norm, simply setting an upper bound.

* When using a limit or a range, a hyperparameter must be specified. Given that weights are
small, the hyperparameter too is often a small integer value, such as a value between 1 and 4.

* If the norm exceeds the specified range or limit, the weights are rescaled or normalized such that their magnitude is below the specified parameter or within the specified range.

* **The constraint can be applied after** each update to the weights, e.g. **at the end of each minibatch**.
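
The three constraints listed earlier can be sketched in plain Python as below. This is a simplified illustration applied to the incoming weights of a single node, not the implementation of any particular library. The function names and example values are assumptions.

```python
import math

def l2_norm(w):
    """Square root of the sum of squared weights of one node."""
    return math.sqrt(sum(v * v for v in w))

def rescale(w, target):
    norm = l2_norm(w)
    if norm == 0.0:
        return list(w)  # nothing to rescale
    return [v * target / norm for v in w]

def unit_norm(w):
    """Force the norm to be 1.0."""
    return rescale(w, 1.0)

def max_norm(w, limit):
    """Rescale only if the norm exceeds the limit."""
    return rescale(w, limit) if l2_norm(w) > limit else list(w)

def min_max_norm(w, low, high):
    """Keep the norm within [low, high]."""
    norm = l2_norm(w)
    if norm > high:
        return rescale(w, high)
    if norm < low:
        return rescale(w, low)
    return list(w)

# Incoming weights of one node after a hypothetical update; the norm is 5.0.
node_weights = [3.0, 4.0]
print(l2_norm(max_norm(node_weights, 3.0)))
print(max_norm(node_weights, 6.0))  # [3.0, 4.0], already within the limit
```

In a training loop, a function like `max_norm` would be called on each node's weights right after the optimizer step for a minibatch.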

---

## Tips for Using Weight Constraints

* **.dark-blue[Use With All Network Types]**

* **.dark-blue[Standardize Input Data]**

* **.dark-blue[Use a Larger Learning Rate]**

The use of a weight constraint allows you to be more aggressive during the training of the
network. Specifically, a larger learning rate can be used, allowing the network to, in turn, make larger updates to the weights each update. This is cited as an important benefit of using weight constraints, for example when a constraint is used in conjunction with dropout.

* **.dark-blue[Try Other Constraints]**

Explore the use of other weight constraints, such as a minimum and maximum range, non-negative weights, and more. You may also choose to use constraints on some weights and not others, such as not using constraints on bias weights in an MLP or not using constraints on recurrent connections in an LSTM.

---
class: animated slideInRight fadeOutLeft, inverse, middle

# Decouple Layers with Dropout

---

## Decouple Layers with Dropout

We will cover these topics:

.bg-lightest-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt2[

* **Large weights in a neural network are a sign of a more complex network that has overfit the training data.**
* **Probabilistically dropping out nodes in the network is a simple and effective regularization method.**
* **A large network with more training epochs and the use of a weight constraint are suggested when using dropout.**
]

---

## Randomly Drop Nodes

* **Dropout is a regularization method that approximates training a large number of neural networks with different architectures in parallel. During training, some number of node outputs are randomly ignored or dropped out.** This has the effect of making the layer look like, and be treated like, a layer with a different number of nodes and connectivity to the prior layer. In effect, each update to a layer during training is performed with a different view of the configured layer.

* **Dropout has the effect of making the training process noisy, forcing nodes within a layer to probabilistically take on more or less responsibility for the inputs.** This conceptualization suggests that perhaps dropout breaks up situations where network layers co-adapt to correct mistakes from prior layers, in turn making the model more robust.

* Dropout simulates a sparse activation from a given layer, which interestingly, in turn,
encourages the network to actually learn a sparse representation as a side-effect. As such, it may be used as an alternative to activity regularization for encouraging sparse representations in autoencoder models.

* Because the outputs of a layer under dropout are randomly subsampled, it has the effect of
reducing the capacity or thinning the network during training. As such, a wider network, e.g. more nodes, may be required when using dropout.

---

## How to Dropout

* Dropout is implemented per-layer in a neural network. It can be used with most types of layers, such as dense fully connected layers, convolutional layers, and recurrent layers such as the long short-term memory network layer. Dropout may be implemented on any or all hidden layers in the network as well as the visible or input layer. It is not used on the output layer.

* Dropout is not used after training when making a prediction with the fit network. The
weights of the network will be larger than normal because of dropout. Therefore, before finalizing the network, the weights are first scaled by the chosen dropout rate. The network can then be used as per normal to make predictions.

* The rescaling can instead be performed at training time, after each weight
update at the end of the minibatch. This is sometimes called inverse dropout (or inverted dropout) and does not
require any modification of the weights when making predictions.

* Dropout works well in practice, perhaps replacing the need for weight regularization (e.g.
weight decay) and activation regularization (e.g. representation sparsity).
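
The two rescaling options above can be compared with a small sketch. It assumes `keep_prob` is the probability of retaining a node output, and it applies the training-time rescaling to the activations, which is one common way to realize it. The function names and values are illustrative.

```python
import random

def dropout_train(activations, keep_prob, rng):
    """Standard dropout: zero each output with probability 1 - keep_prob."""
    return [a if rng.random() < keep_prob else 0.0 for a in activations]

def inverted_dropout_train(activations, keep_prob, rng):
    """Inverted dropout: also divide kept outputs by keep_prob during training."""
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(42)
activations = [1.0] * 10000
keep_prob = 0.8  # assumed retention probability

plain = dropout_train(activations, keep_prob, rng)
inverted = inverted_dropout_train(activations, keep_prob, rng)

# Plain dropout lowers the mean output to about keep_prob, which is why
# a correction by keep_prob is needed before making predictions.
print(sum(plain) / len(plain))
# Inverted dropout keeps the mean near 1.0, so prediction needs no change.
print(sum(inverted) / len(inverted))
```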

---

## Tips for Using Dropout Regularization

* **.dark-blue[Use With All Network Types]**

* **.dark-blue[Dropout Rate]**

The default interpretation of the dropout hyperparameter is the probability of retaining a given node in a layer during training, where 1.0 means no dropout and 0.0 means no outputs from the layer. A good value for the retention rate in a hidden layer is between 0.5 and 0.8. Input layers use a larger retention rate, such as 0.8.

* **.dark-blue[Use a Larger Network]**

It is common for larger networks (more layers or more nodes) to more easily overfit the training data. When using dropout regularization, it is possible to use larger networks with less risk of overfitting. In fact, a large network (more nodes per layer) may be required as dropout will probabilistically reduce the capacity of the network. A good rule of thumb is to divide the number of nodes in the layer before dropout by the proposed dropout rate and use that as the number of nodes in the new network that uses dropout.
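
As a worked example of this rule of thumb, the sketch below treats the dropout rate as a retention probability and rounds up to a whole number of nodes. The helper name is hypothetical.

```python
import math

def nodes_with_dropout(nodes_without_dropout, retention_rate):
    """Rule of thumb: divide the layer size by the retention rate."""
    return math.ceil(nodes_without_dropout / retention_rate)

print(nodes_with_dropout(100, 0.5))  # 200
print(nodes_with_dropout(100, 0.8))  # 125
```
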

---

* **.dark-blue[Grid Search Parameters]**

Rather than guess at a suitable dropout rate for your network, test different rates systematically.
For example, test values between 1.0 and 0.1 in increments of 0.1. This will both help you
discover what works best for your specific model and dataset, as well as how sensitive the model is to the dropout rate. A more sensitive model may be unstable and could benefit from an increase in size.

* **.dark-blue[Use a Weight Constraint]**

Network weights will increase in size in response to the probabilistic removal of layer activations. Large weight size can be a sign of an unstable network. To counter this effect a weight constraint can be imposed to force the norm (magnitude) of all weights in a layer to be below a specified value. For example, the maximum norm constraint is recommended with a value between 3 and 4.

* **.dark-blue[Use With Smaller Datasets]**

Like other regularization methods, dropout is more effective on those problems where there is a limited amount of training data and the model is likely to overfit the training data. Problems where there is a large amount of training data may see less benefit from using dropout.

---

# Robustness with Noise

---

## Robustness with Noise

We will cover these topics:

.bg-lightest-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt2[
* **Small datasets can make learning challenging for neural nets and the examples can be memorized.**
* **Adding noise during training can make the training process more robust and reduce generalization error.**
* **Noise is traditionally added to the inputs, but can also be added to weights, gradients, and even activation functions.**

]

---

## Challenge of Small Training Datasets

* Small datasets can introduce problems when training large neural networks.

* **.dark-red[The first problem is that the network may effectively memorize the training dataset.]** Instead of learning a general mapping from inputs to outputs, the model may learn the specific input examples and their associated outputs. This will result in a model that performs well on the training dataset and poorly on new data, such as a holdout dataset.

* **.dark-red[The second problem is that a small dataset provides less opportunity to describe the structure of the input space and its relationship to the output.]** More training data provides a richer description of the problem from which the model may learn. Fewer data points mean that, rather than a smooth input space, the points may represent a jarring and disjointed structure that may result in a difficult, if not unlearnable, mapping function.

* It is not always possible to acquire more data. Further, getting hold of more data may not address these problems.

---

## Add Random Noise During Training

* At first, adding noise sounds like a recipe for making learning more challenging. It is a counterintuitive
suggestion for improving performance because one would expect noise to degrade
performance of the model during training.

* The addition of noise during the training of a neural network model has a regularization effect and, in turn, improves the robustness of the model. It has been shown to have a similar impact on the loss function as the addition of a penalty term, as in the case of weight regularization methods.

* In effect, adding noise expands the size of the training dataset. Each time a training sample is exposed to the model, random noise is added to the input variables, so the sample is slightly different every time the model sees it. In this way, adding noise to input samples is a simple form of data augmentation.

* Adding noise means that the network is less able to memorize training samples because they
are changing all of the time, resulting in smaller network weights and a more robust network
that has lower generalization error. The noise means that it is as though new samples are being drawn from the domain in the vicinity of known samples, smoothing the structure of the input space. This smoothing may mean that the mapping function is easier for the network to learn, resulting in better and faster learning.

---

## How and Where to Add Noise

The most common type of noise used during training is the addition of **Gaussian noise to input variables**. Gaussian noise, or white noise, has a mean of zero and, in its standard form, a standard deviation of one; it can be generated as needed using a pseudorandom number generator. The addition of Gaussian noise to the inputs of a neural network was traditionally referred to as jitter or random jitter, after the use of the term in signal processing to refer to the uncorrelated random noise in electrical circuits. The amount of noise added (e.g. the spread or standard deviation) is a configurable hyperparameter. Too little noise has no effect, whereas too much noise makes the mapping function too challenging to learn.

* The standard deviation of the random noise controls the amount of spread and can be
adjusted based on the scale of each input variable. It can be easier to configure if the scale of the input variables has first been normalized. Noise is only added during training. No noise is added during the evaluation of the model or when the model is used to make predictions on new data. The addition of noise is also an important part of automatic feature learning, such as in the case of autoencoders, so-called denoising autoencoders that explicitly require models to learn robust features in the presence of noise added to inputs.
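
A minimal sketch of input jitter in plain Python, assuming the inputs have already been rescaled. The function name `add_gaussian_noise`, the `stddev` value of 0.1, and the `training` flag are illustrative choices, not part of any particular library.

```python
import random

def add_gaussian_noise(row, stddev=0.1, training=True, rng=random):
    """Return a copy of `row` with zero-mean Gaussian noise added to each value.

    Noise is applied only when `training` is True; at evaluation or
    prediction time the inputs are returned unchanged.
    """
    if not training or stddev == 0.0:
        return list(row)
    return [x + rng.gauss(0.0, stddev) for x in row]

rng = random.Random(1)  # seeded so the sketch is repeatable
sample = [0.5, -1.2, 0.0]

# Each epoch sees a slightly different version of the same sample.
epoch_1 = add_gaussian_noise(sample, stddev=0.1, rng=rng)
epoch_2 = add_gaussian_noise(sample, stddev=0.1, rng=rng)

# At evaluation time the sample passes through untouched.
clean = add_gaussian_noise(sample, training=False)
```

In Keras, the built-in `GaussianNoise` layer plays a similar role and is active only during training.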

---

Although adding noise to the inputs is the most common and widely studied approach,
random noise can be added to other parts of the network during training. Some examples
include:

* **.dark-red[Add noise to activations, i.e. the outputs of each layer.]**

The addition of noise to the layer activations allows noise to be used at any point in the
network. This can be beneficial for very deep networks. Noise can be added to the layer
outputs themselves, but this is more likely achieved via the use of a noisy activation function.

* **.dark-green[Add noise to weights, i.e. an alternative to the inputs.]**

The addition of noise to weights allows the approach to be used throughout the network in
a consistent way instead of adding noise to inputs and layer activations. This is particularly useful in recurrent neural networks.

* **.dark-blue[Add noise to the gradients, i.e. the direction to update weights.]**

The addition of noise to gradients focuses more on **improving the robustness of the optimization process itself rather than the structure of the input domain**. The amount of noise can start high at the beginning of training and decrease over time, much like a decaying learning rate. This approach has proven to be an effective method for very deep networks and for a variety of different network types.
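
One way to picture gradient noise that starts high and decays is a noise variance that shrinks with the training step. The schedule below (variance = eta / (1 + step)^gamma), the values `eta=0.3` and `gamma=0.55`, and the function names are assumptions chosen for illustration only.

```python
import random

def noise_stddev(step, eta=0.3, gamma=0.55):
    """Standard deviation of the gradient noise at a given training step.

    Assumed schedule: variance = eta / (1 + step) ** gamma, so the noise
    starts high and decays as training progresses.
    """
    return (eta / (1.0 + step) ** gamma) ** 0.5

def noisy_gradients(grads, step, rng=random):
    """Add zero-mean Gaussian noise to each gradient value."""
    std = noise_stddev(step)
    return [g + rng.gauss(0.0, std) for g in grads]

rng = random.Random(7)
grads = [0.2, -0.4, 0.05]
early = noisy_gradients(grads, step=0, rng=rng)       # larger noise scale
late = noisy_gradients(grads, step=10_000, rng=rng)   # much smaller noise scale
```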

---

* **.dark-violet[Add noise to the outputs, i.e. the labels or target variables.]**

Adding noise to the activations, weights, or gradients provides a more generic approach that is invariant to the types of input variables provided to the model. **If the problem domain is believed or expected to have mislabeled examples, then the addition of noise to the class label can improve the model’s robustness to this type of error.** However, label noise can easily derail the learning process. Adding noise to a continuous target variable in the case of regression or time series forecasting is much like the addition of noise to the input variables and may be a better use case.
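
Both kinds of output noise can be sketched in a few lines of plain Python. The helper names `flip_labels` and `jitter_targets`, the flip probability, and the standard deviation are hypothetical and would need tuning for a real problem.

```python
import random

def flip_labels(labels, num_classes, flip_prob=0.05, rng=random):
    """With probability `flip_prob`, replace a class label with a different random class."""
    noisy = []
    for y in labels:
        if rng.random() < flip_prob:
            y = rng.choice([c for c in range(num_classes) if c != y])
        noisy.append(y)
    return noisy

def jitter_targets(targets, stddev=0.05, rng=random):
    """Add zero-mean Gaussian noise to continuous regression targets."""
    return [t + rng.gauss(0.0, stddev) for t in targets]

rng = random.Random(3)
labels = [0, 1, 2, 1, 0, 2, 2, 1]
noisy_labels = flip_labels(labels, num_classes=3, flip_prob=0.25, rng=rng)
noisy_targets = jitter_targets([1.0, 2.5, 3.2], stddev=0.05, rng=rng)
```

Keeping `flip_prob` small reflects the caution above: too much label noise removes the signal the model is supposed to learn.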

---

## Tips for Adding Noise During Training

* **.dark-red[Problem Types for Adding Noise]**

Noise can be added to training regardless of the type of problem that is being addressed. It
is appropriate to try adding noise to both classification and regression type problems. The
type of noise can be specialized to the types of data used as input to the model, for example, two-dimensional noise in the case of images and signal noise in the case of audio data.

* **.dark-red[Add Noise to Different Network Types]**

* **.dark-red[Rescale Data First]**

It is important that the addition of noise has a consistent effect on the model. This requires that the input data is rescaled so that all variables have the same scale, so that when noise is added to the inputs with a fixed variance, it has the same effect on each variable. This also applies to adding noise to weights and gradients, as they too are affected by the scale of the inputs. Rescaling can be achieved via standardization or normalization of input variables. If random noise is added after data scaling, then the variables may need to be rescaled again, perhaps per minibatch.
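
The sketch below shows why rescaling comes first: two columns with very different raw scales are standardized, after which a single noise standard deviation is meaningful for both. The column names, values, and the `standardize` helper are made up for illustration.

```python
import random
import statistics

def standardize(column):
    """Rescale a column of values to zero mean and unit standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0.0:
        return [0.0 for _ in column]  # constant column: nothing to scale
    return [(x - mean) / std for x in column]

rng = random.Random(0)
age = [22.0, 35.0, 47.0, 58.0]                      # raw values in the tens
income = [21_000.0, 48_000.0, 52_000.0, 90_000.0]   # raw values in the tens of thousands

# After standardization both columns share a scale, so one stddev
# perturbs them by a comparable relative amount.
stddev = 0.1
noisy_age = [x + rng.gauss(0.0, stddev) for x in standardize(age)]
noisy_income = [x + rng.gauss(0.0, stddev) for x in standardize(income)]
```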

---

* **.dark-red[Test the Amount of Noise]**

You cannot know how much noise will benefit your specific model on your training dataset.
Experiment with different amounts, and even different types of noise, in order to discover what works best. Be systematic and use controlled experiments, perhaps on smaller datasets across a range of values.

* **.dark-red[Add Noise Only During Training]**

Noise is only added during the training of your model. Be sure that any source of noise is not added during the evaluation of your model, or when your model is used to make predictions on new data.

---

# Early Stopping

---

## Halt Training at the Right Time with Early Stopping

We will cover these topics:

.bg-lightest-blue.b--dark-blue.ba.bw2.br3.shadow-5.ph4.mt2[

* **The challenge of training a neural network long enough to learn the mapping, but not so long that it overfits the training data.**
* **Model performance on a holdout validation dataset can be monitored during training and training stopped when generalization error starts to increase.**
* **The use of early stopping requires the selection of a performance measure to monitor, a trigger to stop training, and a selection of the model weights to use.**

]

---

## The Problem of Training Just Enough

* When training a large network, there will be a point during training when the model will stop generalizing and start learning the statistical noise in the training dataset. This overfitting of the training dataset will result in an increase in generalization error, making the model less useful at making predictions on new data. The challenge is to train the network long enough that it is capable of learning the mapping from inputs to outputs, but not training the model so long that it overfits the training data.

* One approach to solving this problem is to treat the number of training epochs as a hyperparameter and train the model multiple times with different values, then select the number of epochs that results in the best performance on the training dataset or a holdout test dataset. The downside of this approach is that it requires multiple models to be trained and discarded. This can be computationally inefficient and time-consuming, especially for large models trained on large datasets over days or weeks.

---

## Stop Training When Generalization Error Increases

* An alternative approach is to train the model once for a large number of training epochs.
During training, the model is evaluated on a holdout validation dataset after each epoch. If
the performance of the model on the validation dataset starts to degrade (e.g. loss begins to increase or accuracy begins to decrease), then the training process is stopped.

* The model at the time that training is stopped is then used and is expected to have good
generalization performance. This procedure is called early stopping and is perhaps one of the oldest and most widely used forms of neural network regularization.

* If regularization methods like weight decay that update the loss function to encourage less complex models are considered explicit regularization, then early stopping may be thought of as a type of implicit regularization, much like using a smaller network that has less capacity.
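
The procedure above can be sketched as a small function over recorded validation losses. The `patience` and `min_delta` parameters are assumptions added for illustration: they let training tolerate a few noisy epochs before stopping, and the function name is hypothetical.

```python
def early_stopping(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses.

    Training stops once the loss has failed to improve by more than
    `min_delta` for `patience` consecutive epochs. `best_epoch` marks the
    epoch whose weights would be kept.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch  # a real training loop would save a checkpoint here
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

history = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.56, 0.60]
stop_epoch, best_epoch = early_stopping(history, patience=3)
print(stop_epoch, best_epoch)  # prints "6 3"
```

Here training halts at epoch 6, but the weights from epoch 3 would be the ones to keep. In Keras, the `EarlyStopping` callback exposes comparable options through its `monitor`, `patience`, and `restore_best_weights` arguments.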

---