My Neural Network isn't working! What should I do?

19/08/2017

So you're developing the next great breakthrough in deep learning but you've hit an unfortunate setback: your neural network isn't working and you have no idea what to do. You go to your boss/supervisor but they don't know either - they are just as new to all of this as you - so what now?

Well luckily for you I'm here with a list of all the things you've probably done wrong, compiled from my own experience implementing neural networks and supervising other students with their projects:

  1. You Forgot to Normalize Your Data
  2. You Forgot to Check your Results
  3. You Forgot to Preprocess Your Data
  4. You Forgot to use any Regularization
  5. You Used Too Large a Batch Size
  6. You Used an Incorrect Learning Rate
  7. You Used the Wrong Activation Function on the Final Layer
  8. Your Network contains Bad Gradients
  9. You Initialized your Network Weights Incorrectly
  10. You Used a Network that was too Deep
  11. You Used the Wrong Number of Hidden Units

You Forgot to Normalize Your Data

What?

When using neural networks it is essential to think about exactly how you are going to normalize your data. This is a non-negotiable step - there is very little chance of your network working at all without doing this correctly and with some care. Since this step is so essential, and so well known in the deep learning community, it is very rarely mentioned in papers and so almost always trips up beginners.

How?

In general normalization means this - subtract the mean from your data and divide your data by its standard deviation. Usually this is done individually for each input and output feature but you may often want to do it for groups of features or to treat the normalization of some features with special care.
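
As a minimal sketch (assuming your data sits in NumPy arrays with one example per row - the array names and shapes here are just placeholders), per-feature normalization might look something like this, with a small epsilon added to guard against features whose standard deviation is close to zero:

```python
import numpy as np

def normalize(data, eps=1e-8):
    # Subtract the per-feature mean and divide by the per-feature standard deviation.
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    return (data - mean) / (std + eps), mean, std

# Hypothetical input and output features, one example per row.
X = np.random.rand(1000, 10).astype(np.float32)
Y = np.random.rand(1000, 3).astype(np.float32)

X_norm, X_mean, X_std = normalize(X)
Y_norm, Y_mean, Y_std = normalize(Y)

# At inference time remember to normalize new inputs with the *training* statistics,
# and to de-normalize the network's outputs:
#   x_norm = (x_new - X_mean) / (X_std + eps)
#   y_new  = network(x_norm) * (Y_std + eps) + Y_mean
```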

Why?

The primary reason we need to normalize our data is that most parts of a neural network pipeline assume that both the input and output data are distributed with a standard deviation of around one and a mean of roughly zero. These assumptions appear everywhere in deep learning literature, from weight initialization, to activation functions, to the optimization algorithms which train the network.

And?

An untrained neural network will typically output values roughly in the range -1 to 1. If you are expecting it to output values in some other range (for example RGB images, which are stored as bytes, are in the range 0 to 255) you are going to have some problems. When training starts the network will be hugely unstable as it will be producing values of -1 or 1 when values like 255 are expected - an error which is considered huge by most optimization algorithms used to train neural networks. This will produce huge gradients and your training error will likely explode. If somehow your training does not explode then the first few stages of the training will still be a waste as the first thing the network will learn is to scale and shift the output values into roughly the desired range. If you normalize your data (in this case you could simply divide by 128 and subtract 1) then none of this will be an issue.

In general, the scale of features in the neural network will also govern their importance. If you have a feature in the output with a large scale then it will generate a larger error compared to other features. Similarly, large scale features in the input will dominate the network and cause larger changes downstream. For this reason it isn't always enough to use the automatic normalization of many neural network libraries which blindly subtract the mean and divide by the standard deviation on a per-feature basis. You may have an input feature which typically ranges between 0.0 and 0.001 - is the range of this feature so small because it is an unimportant feature (in which case perhaps you don't want to re-scale it), or because it has some small unit in comparison to other features (in which case you do)? Similarly, be careful with features that have such a small range that their standard deviation becomes close to, or exactly, zero - these will produce instabilities or NaNs if you normalize them. It is important to think carefully about these issues - think about what each of your features really represent and consider normalization as the process of making the "units" of all the input features equal. This is one of the few aspects of Deep Learning where I believe a human is really required in the loop.


You Forgot to Check your Results

What?

You've trained your network for a few epochs and you can see the error going down - success! Does this mean you've done it? PhD awarded? Unfortunately not - it is almost certain there is still something wrong with your code. It could be a bug in the data pre-processing, the training code, or even the inference. Just because the error goes down doesn't mean your network is learning anything useful.

How?

Checking that your data looks correct at each stage of the pipeline is incredibly important. Usually this means finding some way to visualize the results. If you have image data then it is easy - animation data can also be visualized without too much trouble. If you have something more exotic you must find a way to sanity check it to make sure it looks correct at each stage of your pre-processing, training, and inference pipeline and compare it to ground truth data.

Why?

Unlike traditional programming, machine learning systems will fail silently in almost all cases. With traditional programming we are used to the computer throwing an error when things go wrong and using this as the signal to go back and check for bugs. Unfortunately this process doesn't work with machine learning applications and so instead we should be incredibly careful about checking our processes at every stage with a human eye so that we know when a bug has been produced and when we might need to go back and check our code more thoroughly.

And?

There are many ways to check if your network is working. Part of this is finding out exactly what the reported training error really means. Visualize the result of your network applied to data in the training set - how does the result of your network compare to the ground truth in practice? You might see the error go from 1.0 to 0.01 during training but the results could still be unusable if an error of 0.01 is still in practice an unacceptable result. If it works on the training set check it on the validation set - does it still work for data it hasn't seen before? My advice would be to get used to visualizing everything from the beginning - don't start only when your network isn't working - make sure that before you start experimenting with different neural network structures you have your full pipeline in place all the way to the end user with sanity checks along the way. This is the only way you can accurately evaluate a number of potential different approaches.
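
As a rough sketch of the kind of visualization I mean (using matplotlib, with the network and training data below being stand-ins for your own), plotting predictions side by side with ground truth is usually only a few lines:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical placeholders - swap in your own trained model and training data.
network = lambda x: x                          # stand-in for your trained network
train_images = np.random.rand(4, 64, 64, 3)    # stand-in for training inputs
train_targets = np.random.rand(4, 64, 64, 3)   # stand-in for ground truth outputs

for i in range(4):
    prediction = network(train_images[i])

    plt.subplot(2, 4, i + 1)           # top row: ground truth
    plt.imshow(train_targets[i])
    plt.title('ground truth')
    plt.axis('off')

    plt.subplot(2, 4, i + 5)           # bottom row: network prediction
    plt.imshow(prediction)
    plt.title('prediction')
    plt.axis('off')

plt.show()
```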


You Forgot to Preprocess Your Data

What?

Most data is tricky - and often data for things we know are similar can have vastly different numerical representations. To take an example from character animation - if we represent our data using the 3d positions of the character's joints relative to the center of the motion capture studio then performing a motion in one location, or facing one direction, may have a massively different numerical representation to performing the same motion in a different location, or facing a different direction. What we need to do instead is represent the data differently - for example in some local reference frame (such as relative to the character's center of mass) so that both motions we know are similar get a similar numerical representation.

How?

Think about exactly what your features represent - is there some simple transformation you can do on them to ensure that data points which represent things we know are similar always get a similar numerical representation? Is there a local coordinate system you can represent your data in that makes things more natural - perhaps a better color space - a different format?
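
For the character animation example above, a minimal sketch might look like the following (assuming a NumPy array of world-space joint positions, with the root joint stored at index 0 and y as the up axis - these conventions are assumptions on my part, adjust them to your own data):

```python
import numpy as np

def to_local_positions(joints):
    # joints: array of shape (num_frames, num_joints, 3) in world space.
    # Express positions relative to the root joint projected onto the ground plane.
    root = joints[:, 0:1, :].copy()  # hypothetical convention: root joint at index 0
    root[:, :, 1] = 0.0              # keep the character's height (assuming y is up)
    return joints - root

# Two copies of the "same" motion performed in different places now get
# the same numerical representation.
motion = np.random.rand(240, 31, 3)
shifted = motion + np.array([5.0, 0.0, -2.0])   # same motion, different location
assert np.allclose(to_local_positions(motion), to_local_positions(shifted))
```

Handling the facing direction would additionally require rotating by the inverse of the root's orientation, which I've left out here for brevity.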

Why?

Neural Networks make only a few basic assumptions about the data they take as input - but one of these essential assumptions is that the space the data lies in is somewhat continuous - that for most of the space, a point between two data points is at least somewhat "a mix" of these two data points and that two nearby data points are in some sense representing "similar" things. Having big discontinuities in the data space, or large clusters of separated data which represent the same thing, is going to make the learning task much more difficult.

And?

Another way to think about data pre-processing is as an attempt to reduce the combinatorial explosion of data variations that might be required. For example, if a neural network trained on character animation data has to learn the same set of motions for the character in every location and orientation then a lot of the capacity of the network is being wasted and a lot of the learning process is duplicated.


You Forgot to use any Regularization

What?

Regularization - typically in the form of dropout, noise, or some form of stochastic process injected into the network - is another important aspect of training modern neural networks. Even if you think you have vastly more data than parameters, or that you have some situation where over-fitting does not matter or appears impossible, it can still often be helpful to add dropout or some other form of noise.

How?

The most basic way to regularize a neural network is to add dropout before each linear layer (convolutional or dense) in your network. Start with a medium to high retainment probability such as 0.75 or 0.9. Tweak based on how possible you think over-fitting is and if you find any evidence of it. If you think over-fitting is very unlikely consider setting the retainment probability to something very high such as 0.99.
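
As a small sketch of what this looks like (here in PyTorch, where nn.Dropout takes the probability of *dropping* a unit, so a retainment probability of 0.75 corresponds to p=0.25 - the layer sizes are just placeholders):

```python
import torch.nn as nn

retain_prob = 0.75  # medium to high retainment probability

model = nn.Sequential(
    nn.Dropout(p=1.0 - retain_prob),  # dropout before each linear layer
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Dropout(p=1.0 - retain_prob),
    nn.Linear(256, 10),
)

model.train()  # dropout active during training
model.eval()   # remember to switch it off for prediction
```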

Why?

Regularization isn't just about controlling over-fitting. By introducing some stochastic process into the training procedure you are in some sense "smoothing" out the cost landscape. This can speed up training, help deal with noise or outliers in the data, and prevent extreme weight configurations of the network.

And?

Data augmentation or other kinds of noise can also act as regularization just like dropout does, and sometimes with enough data augmentation dropout isn't required any more. While the common way to think about dropout is as a technique for combining the predictions of many random sub-networks, it is also possible to think about dropout as just one form of data augmentation - producing many similar variations of the input data during training. And as we know, the best way to avoid over-fitting is to have so much data that the neural network never sees the same thing twice! In the end, like all aspects of training neural networks, you need to be careful about the regularization you use. Remember to turn it off during prediction, and be aware that you can often get slightly different results once it has been turned off. In cases where you need extremely numerically accurate predictions some forms of regularization can sometimes make that difficult.


You Used Too Large a Batch Size

What?

Using too large a batch size can have a negative effect on the accuracy of your network during training since it reduces the stochasticity of the gradient descent.

How?

Find the minimum batch size with which you can tolerate the training time. The batch size which makes optimal use of the GPU parallelism might not be the best when it comes to accuracy as at some point a larger batch size will require training the network for more epochs to achieve the same level of accuracy. Don't be scared to start with a very small batch size such as 16, 8, or even 1.
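
In most frameworks this is just a single number to change - for example with a PyTorch DataLoader (the toy dataset below is only there to make the snippet self-contained):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in dataset: 1000 examples with 10 input features and 3 targets.
dataset = TensorDataset(torch.randn(1000, 10), torch.randn(1000, 3))

# Start with a small batch size and only increase it if training becomes too slow.
loader = DataLoader(dataset, batch_size=16, shuffle=True)
```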

Why?

Using a smaller batch size produces choppier, more stochastic weight updates. This can have two positive effects. Firstly, it can help the training to "jump" out of local minima in which it might previously have gotten stuck, and secondly it can cause the training to settle in minima which are "flatter", something that generally indicates better generalization performance.

And?

Some other elements in the data can sometimes effectively act like a batch size. For example processing images at twice the resolution can have a similar effect as using four times the batch size. To get an intuition for this, consider that in a CNN the weight updates for each filter will be averaged over all the pixels to which it was applied in the input image, as well as for each image in the batch. Increasing the image resolution by a factor of two will produce an averaging effect over four times as many pixels, in a very similar way to increasing the batch size by a factor of four. Overall the important thing is to consider how much the final gradient updates will be averaged at each iteration and make sure you balance the detrimental effect this has against the need to use as much of the potential parallelism of the GPU as possible.


You Used an Incorrect Learning Rate

What?

The learning rate can have a huge impact on how easy it is to train your network and if you are a newcomer it is almost certain you've set it incorrectly thanks to the various default options used in common deep learning frameworks.

How?

Turn gradient clipping off. Find the highest value for the learning rate which doesn't make the error explode during training. Set the learning rate one order of magnitude lower than this - this is probably pretty close to the optimal learning rate.
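
A rough sketch of this search might look like the following - the toy network, data, and train_for_a_while helper are stand-ins for a short run of your own training loop (with gradient clipping disabled):

```python
import numpy as np
import torch
import torch.nn as nn

def train_for_a_while(learning_rate, steps=200):
    # Hypothetical stand-in for a short run of your real training loop.
    torch.manual_seed(0)
    x, y = torch.randn(256, 10), torch.randn(256, 1)
    model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
    optim = torch.optim.SGD(model.parameters(), lr=learning_rate)
    losses = []
    for _ in range(steps):
        optim.zero_grad()
        loss = ((model(x) - y) ** 2).mean()
        loss.backward()
        optim.step()
        losses.append(loss.item())
    return losses

for lr in [1.0, 0.1, 0.01, 0.001, 0.0001]:
    losses = train_for_a_while(lr)
    exploded = not np.isfinite(losses[-1]) or losses[-1] > 10.0 * losses[0]
    print(f'lr={lr}: {"exploded" if exploded else "stable"}')

# Take the highest learning rate that stays stable and divide it by ten -
# that is probably pretty close to the optimal learning rate.
```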

Why?

Many deep learning frameworks turn on gradient clipping by default. This option prevents the optimization used during training from exploding by enforcing a maximum amount the weights can change at each step. This can be useful - in particular if your data contains many outliers which produce large errors and therefore large gradients and weight updates - but having it on by default also makes it very difficult to find the optimal learning rate by hand. I've found that most newcomers to deep learning have the learning rate set way too high and account for this with gradient clipping, making the overall training behavior slow and the effect of changing the learning rate unpredictable.

And?

If you've cleaned your data properly, removed most of the outliers, and set the learning rate correctly then you really shouldn't need gradient clipping. If without it you find your training error exploding occasionally then by all means turn gradient clipping on, but just remember that seeing your training error exploding is almost always an indication that something else is wrong with some of your data - clipping is a temporary fix.


You Used the Wrong Activation Function on the Final Layer

What?

Using an activation function on the final layer can sometimes mean that your network cannot produce the full range of required values. The most common error is using a ReLU on the final layer - resulting in a network which can only produce positive values as output.

How?

If you are doing a regression then most often you don't want to use any kind of activation function on the final layer unless you know something specific about the kind of values you wish to produce as output.

Why?

Think again about what your data values actually represent and what their ranges are after normalization. It is most likely the case that your output values are unbounded positive or negative numbers - in which case you shouldn't use an activation function on the final layer. If your output value only makes sense in some range, e.g. it consists of probabilities in the range 0 to 1, then there is most likely a specific activation function that should be used on the final layer such as a sigmoid activation function.
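
A small sketch of the two common cases (in PyTorch, with placeholder layer sizes):

```python
import torch.nn as nn

# Regression to unbounded values: no activation on the final layer.
regression_net = nn.Sequential(
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 3),                 # raw linear output
)

# Output only makes sense as a probability in the range 0 to 1: sigmoid on the final layer.
probability_net = nn.Sequential(
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),
)
```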

And?

There are a number of subtleties to using activation functions on the final layer. Perhaps you know your system will eventually clip outputs into the range -1, 1 after they are produced by the neural network. Then it might seem to make sense to add this clipping process as the final layer activation as this will ensure your network error function doesn't penalize values greater than 1 or less than -1. But no error means there will also be no gradient for values greater than 1 or less than -1 - which in some cases will make your network impossible to train. Alternatively you may be tempted to use tanh on the final layer, knowing that this activation function outputs values in the range -1 to 1 - but this can cause problems too as the gradient of this function near 1 or -1 grows very small, which could cause your weights to grow huge in an attempt to produce -1 or 1 exactly. In general your best bet is often to play it safe and use no activation function at all on the final layer rather than trying to do something clever that may backfire.


Your Network contains Bad Gradients

What?

Deep networks using ReLU activation functions can often suffer from so called "dead neurons" caused by bad gradients. This can negatively affect the performance of the network, or in some cases make it completely impossible to train.

How?

If you find that your training error does not change from epoch to epoch it may be that all your neurons have died due to using the ReLU activation function. Try switching to another activation function such as leaky ReLUs or ELUs and see if the same thing happens.
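
In most frameworks this is a one-line change - a sketch in PyTorch (layer sizes are placeholders):

```python
import torch.nn as nn

def make_model(activation):
    return nn.Sequential(
        nn.Linear(256, 256), activation,
        nn.Linear(256, 256), activation,
        nn.Linear(256, 10),
    )

model = make_model(nn.ReLU())
# If you suspect dead neurons, try one of these drop-in replacements instead:
# model = make_model(nn.LeakyReLU(negative_slope=0.01))  # small gradient below zero
# model = make_model(nn.ELU())                            # smooth, non-zero gradient below zero
```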

Why?

The gradient of the ReLU activation function is 1 for positive values and 0 for negative values. This is because a very small change in the input does not affect the output for inputs less than zero. This might not seem like a problem immediately due to the large gradient for positive values, but with many layers stacked on top of each other, and negative weights able to change large positive values with strong gradients into negative values with zero gradient, it can often be the case that some or even all of your network weights have a gradient of zero with respect to the cost function no matter what input they are given. In this situation we say that the network is "dead" since the weights are completely unable to update.

And?

Any operation with zero gradient, such as clipping, rounding, or taking the maximum/minimum, will also produce bad gradients if it is used in computing the derivative of the cost function with respect to the weights. Be very careful if these operations appear anywhere in your symbolic graph as they can often cause unforeseen difficulties, e.g. if they are used in some custom error metric that is provided as part of the cost function.


You Initialized your Network Weights Incorrectly

What?

If you don't initialize your neural network weights correctly then it is very unlikely your neural network will train at all. Many other components in the neural network assume some form of correct or standardized weight initialization and setting the weights to zero, or using your own custom random initialization is not going to work.

How?

The 'he', 'lecun' or 'xavier' weight initializations are all popular choices which should work perfectly well in practically any situation. Just pick one (my favourite is 'lecun'), but feel free to experiment once your neural network is working until you find the best fit for your task.
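
Whatever framework you use almost certainly provides these built in, but as a sketch of the scales involved (these are the Gaussian variants; uniform variants with the same variance also exist):

```python
import numpy as np

def init_weights(fan_in, fan_out, scheme='lecun'):
    # Sample a (fan_in, fan_out) weight matrix with one of the standard scalings.
    if scheme == 'lecun':        # LeCun: variance 1 / fan_in
        std = np.sqrt(1.0 / fan_in)
    elif scheme == 'xavier':     # Xavier/Glorot: variance 2 / (fan_in + fan_out)
        std = np.sqrt(2.0 / (fan_in + fan_out))
    elif scheme == 'he':         # He: variance 2 / fan_in (designed with ReLUs in mind)
        std = np.sqrt(2.0 / fan_in)
    else:
        raise ValueError(scheme)
    return np.random.randn(fan_in, fan_out) * std

W = init_weights(256, 256, scheme='lecun')
```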

Why?

You may have heard that you can initialize neural network weights using "small random numbers" but it isn't that simple. All of the above initializations were discovered using complex and detailed mathematics which explains exactly why they are optimal. Even more importantly, other neural network components have been built around these initializations and tested empirically using them - using your own initialization may make it far more difficult to reproduce other researchers' results.

And?

Other layers may need to be carefully initialized too. Network biases are initialized to zero, while other more complicated layers such as parametric activation functions may come with their own initializations which are just as important to get correct.


You Used a Network that was too Deep

What?

Deeper is better right? Well not always ... deeper is usually better when we are playing the benchmark game and trying to squeeze 1% more accuracy out of some task, but if your little 3, 4, 5 layer network is failing to learn anything then I can assure you that a 100 layer behemoth is going to fail just as badly if not worse.

How?

Start with a shallow neural network with 3 to 8 layers. Start experimenting with deeper networks only when you already have things working well and are starting to investigate how to increase the accuracy.
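
For example, a perfectly reasonable starting point might be something as plain as this (in PyTorch, with hypothetical sizes):

```python
import torch.nn as nn

input_size, hidden_size, output_size = 32, 256, 8   # hypothetical sizes for your task

# A small 4-layer network: quick to train, quick to iterate on.
model = nn.Sequential(
    nn.Linear(input_size, hidden_size), nn.ReLU(),
    nn.Linear(hidden_size, hidden_size), nn.ReLU(),
    nn.Linear(hidden_size, hidden_size), nn.ReLU(),
    nn.Linear(hidden_size, output_size),
)
```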

Why?

Although it might seem like it, Neural Networks didn't just suddenly start getting ground-breaking results when someone decided to stack hundreds of layers. All of the improvements to neural networks that have been researched over the last decade have been small, fundamental changes which are just as applicable to the performance of smaller networks as deep ones. If your network is not working it is more likely that something other than the depth is wrong.

And?

Starting small also means that training your network will be faster, inference will be faster, and iterating on different designs and setups will be faster. Initially, all of these things will have a much bigger impact on your accuracy than simply stacking a few more layers.


You Used the Wrong Number of Hidden Units

What?

In some cases using vastly too many or too few hidden units can make your network difficult to train. Too few units and it may not have the capacity to express the task required, while too many and it may become slow and unwieldy to train with residual noise that is hard to remove.

How?

Start with somewhere between 256 and 1024 hidden units. Then, look at the numbers being used by fellow researchers working on similar applications and use those as inspiration. If fellow researchers are using vastly different numbers to the ones given above then there may be some specific reason why which is probably important for you to understand.

Why?

When deciding on the number of hidden units to use, the key is to consider roughly the smallest number of real values you think are required to express the information you wish to pass through the network. You should then probably scale this number up a bit. This will allow for dropout, for the network to use a more redundant representation, and for a bit of leeway in your estimate. If you are doing classification you can probably use five to ten times the number of classes as a good initial guess, while if you are doing regression you can probably use two to three times the number of input or output variables. Of course - all of this is highly context dependent and there is no simple automatic solution - a good intuition is still the most important thing when deciding on the number of hidden units.
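
As a tiny worked example of those heuristics (the task sizes here are made up):

```python
# Hypothetical classification task with 50 classes:
num_classes = 50
hidden_units = 10 * num_classes        # five to ten times the number of classes -> 250 to 500

# Hypothetical regression task with 60 inputs and 30 outputs:
num_inputs, num_outputs = 60, 30
hidden_units = 3 * max(num_inputs, num_outputs)   # two to three times the in/out variables -> 120 to 180
```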

And?

In reality the number of hidden units often has quite a small impact on neural network performance when compared to other factors, and in many cases overestimating the number of hidden units required will have little negative effect other than making training slower. Once your network is working, if you are still concerned, just try a whole bunch of different numbers and measure the accuracy until you find the one that works best.


See Also

Any other suggestions for common issues? E-mail me at contact@theorangeduck.com and I will try to add them to the list.