Noise, Neural Networks, and Flow-Matching

28/12/2024

I think most people who have worked with Neural Networks know that adding noise usually improves their generalization. That noise can take different forms, such as directly adding noise to the input, using dropout, data augmentation, or something else.

But why? The reason isn't actually very mysterious or mathematical, and we can get a good intuition as to exactly what is going on by just setting up toy regression problems and seeing the effects of these things. So here is a little toy regression problem with a 1d input x, 1d output y, and some made up data points.

Here is the result I get if I train a standard 5-layer feed-forward Neural Network with 256 hidden units and the ELU activation function to fit this data using the Mean-Squared-Error:

To me, this is the kind of result I imagine when I think about using Neural Networks to solve regression problems by curve fitting.
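For reference, here is roughly the kind of setup I mean. This is just a minimal PyTorch sketch: the data values, learning rate, and number of training steps are placeholders rather than exactly what I used.

```python
import torch
import torch.nn as nn

# A standard 5-layer feed-forward network with 256 hidden units and ELU.
net = nn.Sequential(
    nn.Linear(1, 256), nn.ELU(),
    nn.Linear(256, 256), nn.ELU(),
    nn.Linear(256, 256), nn.ELU(),
    nn.Linear(256, 256), nn.ELU(),
    nn.Linear(256, 1))

# Some made-up 1d training data (placeholder values).
x = torch.linspace(-1.0, 1.0, 16).reshape(-1, 1)
y = torch.sin(3.0 * x) + 0.1 * torch.randn_like(x)

opt = torch.optim.Adam(net.parameters(), lr=1e-3)

# Fit with the Mean-Squared-Error.
for step in range(10000):
    loss = ((net(x) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```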

Here is the result I get with a dropout of 0.5 applied to each layer:

As you can see, it does a pretty fantastic job of removing the additional bumps and extrapolations and tightening up the fit.
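The only change for the dropout version is to insert dropout layers after each hidden activation of the sketch above (again, just illustrative):

```python
# The dropout variant: the same network with a dropout of 0.5 after each
# hidden activation. Dropout is active in net_dropout.train() mode and
# disabled in net_dropout.eval() mode when evaluating the final curve.
net_dropout = nn.Sequential(
    nn.Linear(1, 256), nn.ELU(), nn.Dropout(0.5),
    nn.Linear(256, 256), nn.ELU(), nn.Dropout(0.5),
    nn.Linear(256, 256), nn.ELU(), nn.Dropout(0.5),
    nn.Linear(256, 256), nn.ELU(), nn.Dropout(0.5),
    nn.Linear(256, 1))
```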

In practice, however, one thing that has always bothered me about dropout is that once you turn it off at inference time there always seems to be some kind of small distributional shift, and the fitted curve no longer necessarily passes exactly through the training data. I can't really see it in this toy example, but for that reason I have tended not to use dropout when a lot of accuracy is required in the fit.

What about if we just add Gaussian noise to the input x-values? Well, we can think about adding Gaussian noise to the input as effectively the same thing as expanding our dataset with an infinite number of additional data points, each with a slightly perturbed x-value but the same y-value:

The standard deviation of the Gaussian controls how much perturbation to apply - in some sense the "width" of our data points. Training a network with noise therefore produces something a bit more like the kind of fit you often get from a k-nearest-neighbor regression:

This is undoubtedly an "uglier" fit of the data than before. But while a curve like this might not really match our intuitions of what is a "good fit", it's well known that for real-world data nearest-neighbor regression often also exceeds our expectations - so perhaps it shouldn't be a surprise that fits like this perform well on practical problems too.
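Concretely, training with input noise just means perturbing x each time the loss is evaluated. Continuing the first sketch (the noise standard deviation here is an arbitrary placeholder):

```python
noise_std = 0.05  # the "width" we give each data point (arbitrary placeholder)

for step in range(10000):
    # Re-sample the perturbation of x on every step.
    x_noisy = x + noise_std * torch.randn_like(x)
    loss = ((net(x_noisy) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```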

One way to think about this is that adding noise specifies the gradient of the curve around the data points. Adding the noise is like saying that a small change in x should not produce any change in y - that the local gradient should be zero at every data point.
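We can make that a little more precise with a rough back-of-the-envelope calculation: keeping only the first-order term of the Taylor expansion of the fitted curve f around a data point (x, y), the expected loss under the noise becomes

```latex
\mathbb{E}_{\epsilon \sim \mathcal{N}(0,\,\sigma^2)}\left[ (f(x + \epsilon) - y)^2 \right]
\approx (f(x) - y)^2 + \sigma^2 \, f'(x)^2
```

so minimizing the noisy objective also penalizes the slope of the curve at each data point, with the noise standard deviation controlling the strength of that penalty.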

Let's zoom out a bit:

Take a look at the left-hand side of the graph. Specifying zero gradients at our data points does at least one nice thing: it stops the network from trying to extrapolate in such a crazy way outside of the data domain. This is generally a good thing - there are very few Machine Learning problems where, given an out-of-distribution x-value, outputting anything other than the y-value of the most extreme data point is a good idea.

One downside of adding noise is that if the standard deviation is too large then it "blurs" or "over-smooths" the output. This is because we can end up with an "expanded" dataset where we have the same x-value corresponding to multiple different y-values.

This ambiguity causes the regression to tend toward the average of those y-values, which over-smooths the fit.
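This is just the usual behavior of the Mean-Squared-Error: for a fixed x, the constant that minimizes the expected squared error is the conditional mean,

```latex
\arg\min_{c}\, \mathbb{E}\left[ (c - y)^2 \mid x \right] = \mathbb{E}\left[ y \mid x \right]
```

so wherever the expanded dataset contains overlapping x-values with different y-values, the fitted curve heads towards their average.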

So it often makes more sense to try to add only a very small amount of noise - just enough to flatten the gradients and prevent bad extrapolation, but not enough to make the training data overlap and get smoothed out.


All of that is with a standard regression setup where we have a single mode in the dataset (i.e. an assumed single y-value for each x-value).

What happens if we have multiple modes? (i.e. multiple different possible y-values for the same x-value). Here I've prepared another little toy regression problem which looks like it could be multi-modal on the left-hand side.

If we train a standard regression as before we either get horrible over-fitting...

Or, if we add noise, we get something that tends a bit more towards the average of the two modes.

In this case adding noise to the input really discourages the over-fitting because we get so many more pairs of data points with the same x-value and different y-values:

While averaging the two modes is obviously better than over-fitting (in particular if our data is noisy or contains outliers), it's still hardly ideal, and we can see some over-smoothing on the right-hand side as a result.

Dropout again does really well here, cleanly averaging the two modes while fitting the rest of the curve nicely.


What if we do the regression in a way that can handle the multi-modality? For this I've made a basic flow-matching setup which we can use to sample from the conditional probability distribution of y given x, and which should therefore be able to handle the two modes.

(For an excellent introduction to Flow-Matching I highly recommend Iterative-Alpha-Deblending, which is a different name for the same technique.)
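To make "a basic flow-matching setup" concrete, here is a rough sketch of the sort of thing I mean: a network that takes x, an interpolated y-value, and a time t, and is trained to predict the velocity of the straight-line path from a Gaussian sample to the data. The layer sizes, step counts, and the optional noise_std argument are illustrative placeholders rather than exactly what I used.

```python
import torch
import torch.nn as nn

# Conditional flow-matching in 1d: learn a velocity field v(x, y_t, t) that
# transports samples from a Gaussian prior at t=0 to the data y at t=1.
vel = nn.Sequential(
    nn.Linear(3, 256), nn.ELU(),
    nn.Linear(256, 256), nn.ELU(),
    nn.Linear(256, 1))

opt = torch.optim.Adam(vel.parameters(), lr=1e-3)

def train_step(x, y, noise_std=0.0):
    # Optionally perturb the conditioning input x (noise_std=0.0 means no noise).
    x = x + noise_std * torch.randn_like(x)
    y0 = torch.randn_like(y)               # sample from the prior
    t = torch.rand(y.shape[0], 1)          # random blending time in [0, 1]
    yt = (1.0 - t) * y0 + t * y            # point on the straight-line path
    target = y - y0                        # velocity of that path
    pred = vel(torch.cat([x, yt, t], dim=-1))
    loss = ((pred - target) ** 2).mean()   # plain Mean-Squared-Error
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

@torch.no_grad()
def sample(x, steps=64):
    # Integrate the learned velocity field from the prior to the data.
    y = torch.randn(x.shape[0], 1)
    for i in range(steps):
        t = torch.full((x.shape[0], 1), i / steps)
        y = y + vel(torch.cat([x, y, t], dim=-1)) / steps
    return y
```

Training regresses the velocity with the Mean-Squared-Error, and sampling integrates the learned velocity field from the prior to the data; passing noise_std > 0 is one way of adding the input noise discussed below.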

If we train the flow-matching setup without noise we end up with a similarly over-fit result to standard supervised learning - because the gaps between the data points are large enough, the network has the capacity to fit a complex curve that outputs fully deterministic values of y given x.

(Note, however, that the over-fitting is not nearly as extreme, and the curve is a much tighter, more reasonable fit than in the standard supervised case.)

If we add noise to x this "widens" the data points. And as soon as the x-values are overlapping, the network can no longer solve the regression problem with a single mode. Instead it has to properly model the multiple modes of the distribution as intended:

In fact, as you can see, it does more than this: the degree to which the noise at each data point overlaps (and the relative density of the Gaussian distributions in those overlapping regions) corresponds to the likelihood of that data point's y-value being sampled.

Dropout doesn't really seem to work with flow-matching, and produces a noisy result.

We see similar things on our previous single-mode dataset. Now, with more noise, we don't get a "smoother" line; we get a line that passes through all the data points but which will also happily jump between different y-values when the corresponding x-values overlap.

vs

When this behavior first clicked for me it totally changed my understanding of flow-matching and other multi-modal regression methods. The interaction between these methods and any noise we add is just entirely different.

The really nice thing is that these methods don't seem to produce extreme over-fitting, and we can use noise to discourage it even further, giving us multi-modal regressions that are not over-smoothed and still pass through the training data perfectly.

And when we add too much noise, we won't get an over-smoothed result; we will get the opposite - a noisier result! That's because similar x-values are more likely to produce a range of sampled y-values when all our x-values are "overlapped" due to the noise.

Dropout, like before, just looks noisy:

We also need to be even more careful about training these methods on noisy or corrupt data. A noisy set of y-values can no longer be smoothed out simply by adding noise to the x-values like with standard supervised learning. Instead, our regressions will emulate any noise they are trained on even more faithfully, outputting a wider distribution of y-values:

vs

Outliers are an issue too: they also don't get smoothed out like they would with standard supervised learning. Instead of producing a small bump in the curve, the outlier value itself ends up being sampled with some small probability.

vs

That is to say, with these methods, the conventional Machine Learning adage of "noise cancels out, signal adds up" cannot really be relied on anymore.

I find it really interesting that the intuitions and tricks we have from standard supervised learning when it comes to noise don't work in the same way when we switch to multi-modal methods.

Similarly, I find flow-matching to be an incredibly elegant and powerful solution to multi-modal regression problems. In the past I've run similar experiments to these using GANs, Variational methods, and other things, and all have either required tricks to get working (such as information bottlenecks that don't really work in 1d) or given results which are over-smoothed or have other artefacts. Flow-matching, on the other hand, is simple, stable, and easy to train.

There are still issues to solve with it. From my (limited) understanding it relies on the averaging behavior of the Mean-Squared-Error and so cannot be used with other loss functions. The space you perform it in therefore needs to be carefully normalized. Similarly, since the training and inference procedures are different it cannot immediately be used with auto-regressive or recurrent models.

Nonetheless, I am certain these issues will be solved in time, which is an exciting prospect given that multi-modal regression problems have historically been extremely difficult for practitioners like me.

With that, I hope you found this post interesting, and as always, thanks for reading.