Backpropagation is an algorithm used to teach feed forward artificial neural networks. It works by providing a set of input data and ideal output data to the network, calculating the actual outputs and backpropagating the calculated error (the difference between the ideal and actual outputs) using gradient descent. This is useful for learning specific patterns of input and output data in order to be able to reproduce them afterwards even from slightly different or incomplete input data. Neural networks are able to learn any function that applies to input data by doing a generalization of the patterns they are trained with.

In this post I will start by explaining what feed forward artificial neural networks are and afterwards I will explain the backpropagation algorithm used to teach them. In the end I will provide a sample implementation. This post expects some knowledge of math and computer programming from the reader.

**Artificial neural networks**

Artificial neural networks try to simulate the basic functioning of natural neural networks. The human brain has billions of neurons that work together as one big network. Each natural neuron has stronger or weaker connections with other neurons in the brain. It is believed that in these connections (especially how strong they are) is most of the information that the brain retains.

An artificial neural network retains this basic aspect of natural neural networks and uses a set of real value weights to remember how strong the connections between its neurons are. Each neuron has a number of input connections from other neurons, a number of output connections to other neurons and an activation function that is calculated on the sum of inputs and provides the output value of the neuron.

**Feed forward** neural networks don’t have any cycles in their neuron connections network. This implies that the outputs connections of a neuron are different from the input ones. In this kind of network the neurons are typically arranged in layers.

The feed forward neural network above has 6 neurons arranged in 3 layers. N_{1}, N_{2} and N_{3} are called input neurons as they receive data directly from the user, not other neurons. For input neurons the activation function is not used. N_{4} and N_{5} are part of what is called a hidden layer of neurons. They are hidden as the user can’t set them a value or get their value directly, they only communicate with other neurons. N_{6} is an output neuron. From the output neurons the users get the result of the neural network processing. W_{ij} are the weights of the connections between the neurons (the value of how strong the connections are).

Hidden and output neurons apply an activation function to their sum of input data. The sum of input data is the sum of the products of the input neurons output values and the weights of the input connections from those neurons.

, where N_{i} are the input neurons to neuron N_{j}

For the activation function F(x) we have the output of the neuron N_{j}:

In order to be able to learn any pattern we need a non linear activation function. The most used activation function is the sigmoid function:

Because the sigmoid function takes values in the (0, 1) interval you will have to scale your output values to and from that interval.

In our example above we have:

N_{1}, N_{2} and N_{3} as input values.

N_{4} = F(w_{14}*N_{1} + w_{24}*N_{2} + w_{34}*N_{3})

N_{5} = F(w_{15}*N_{1} + w_{25}*N_{2} + w_{35}*N_{3})

Output = N_{6} = F(w_{46}*N_{4} + w_{56}*N_{5})

**Backpropagation**

The purpose of learning is to determine the weights W_{ij} that allow us to reproduce the provided patterns of inputs and outputs (function of inputs). Once the network is trained we can use it to get the expected outputs with incomplete or slightly different data. Basically, it learns a function of arbitrary complexity from examples. The complexity of the function that can be learned depends on the number of hidden neurons.

Before starting the backpropagation learning iterations you will have to initialize the weights with random values, typically in the interval (-1, 1).

A backpropagation step for a specific input pattern and ideal output starts by calculating the error at the output neurons. This error is the difference between the provided ideal output and the calculated actual output multiplied with the activation function derivate on that output point. For the sigmoid function the derivate is F^{’}(x) = F(x)*(1 – F(x).

OutputError_{j} = (IdealOutput_{j} – Output_{j}) * F^{’}(Output_{j})

In our example, using the sigmoid as an activation function:

N_{6}_Error = (N_{6}_Ideal – N_{6}) * N_{6} * (1-N_{6})

After we have the error for the output layer we calculate an error for each neuron in the hidden layers, going backwards, layer by layer. The error for a neuron in a hidden layer is the sum of the products between the errors of the neurons in the next layer and the weights of the connections to those neurons, multiplied by the derivate of the activation function.

HiddenError_{i} = ∑(OutputError_{j} * w_{ij}) * F^{’}(HiddenOutput_{i})

In our example:

N_{4}_Error = ( N_{6}_Error * w_{46} ) * N_{4} * (1 – N_{4})

N_{5}_Error = ( N_{6}_Error * w_{56} ) * N_{5} * (1 – N_{5})

We will use those errors to calculate the variation of the weights as a result of the current input pattern and ideal outputs. The variation (delta) of a weight is the product of the input neuron output value with the error of the output neuron for that connection. Please note that I use input neuron and output neuron for a connection as a separate term from the network input neurons and output neurons.

∆w_{ij} = Output_{i} * Error_{j}

In our example:

∆w_{46} = N_{4} * N_{6}_Error

∆w_{56} = N_{5} * N_{6}_Error

∆w_{14} = N_{1} * N_{4}_Error

∆w_{24} = N_{2} * N_{4}_Error

∆w_{34} = N_{3} * N_{4}_Error

∆w_{15} = N_{1} * N_{5}_Error

∆w_{25} = N_{2} * N_{5}_Error

∆w_{35} = N_{3} * N_{5}_Error

This process is repeated for all input patterns and the variations (deltas) are accumulated. At the end of a learning iteration we change the actual weights with the accumulated deltas for all the training patterns multiplied with a learning rate (a number typically between 0 and 1 which states how fast a network converges to a result). A neural network typically needs at least a few hundred iterations in order to learn a set of patterns.

∆w_{ij}_Final = ∑∆w_{ij}_Input_{k}

w_{ij} = w_{ij} + ( ∆w_{ij}_Final * LearningRate )

In our example, considering 2 input patterns and a learning rate of 0.3 we have for example:

∆w_{46}_Final = ∆w_{46}_Input1 + ∆w_{46}_Input2

New w_{46} = w_{46} + 0.3 * ∆w_{46}_Final

Resuming, in order to teach a network using backpropagation, we do the following steps:

– Initialize weights with random values

– For a specified number of training iterations do:

- For each input and ideal (expected) output pattern
- Calculate the actual output from the input
- Calculate output neurons error
- Calculate hidden neurons error
- Calculate weights variations (delta): ∆w
_{ij} - Add the weights variations to the accumulated delta

- Learn by using the accumulated deltas and adding them to the weights

I’ve made a C# implementation of a feed forward neural network with just one hidden layer that uses backpropagation: NeuralNet.cs.

I also made an example project (using Visual C# Express Edition) where I test the neural network and backpropagation on simple digit recognition: nnet.zip

Thanks alot, It is The best explanation of backpropagation algorithm.

Comment by farzaneh — September 3, 2010 @ 12:37 pm

Very nice explanation of backprop. I have tried to apply this in Python but somehow the network doesn’t learn correctly for some reason (and I am pretty sure I have done everything exactly as you explain in this article I also checked your C# implementation and the logic of my Python script seems to be exactly the same). I even asked for help at Stack Overflow:

http://stackoverflow.com/questions/3988238/help-me-with-my-backprop-implementation-in-python

🙂

Comment by Richard Knop — October 31, 2010 @ 11:56 pm

Just tested your dataset (as written by kriss on stackoverflow) in C# and it works just fine. I tested all the entries in the dataset after the network learned.

I know too little python to help you with your code. Your implementation is for sure not exactly like mine, but I don’t know which different detail gives you problems. Try to get the smallest data set with the problem, print some intermediary information and work from there to discover the problem.

In order to test the data set yourself you can use the linked nnet.zip project and create a network with 5 hidden neurons and 0.7 learning rate. Do 5000 iterations, just like I do for XOR.

Comment by Mihai Varzaru — November 1, 2010 @ 8:12 pm

Thank you..your site is useful for me.

I can now get, how to do my programming.

Comment by Anup — December 28, 2010 @ 5:38 pm

Nice explained I want explanation with example for too clear understanding pls help

Comment by Sharmila — January 29, 2011 @ 10:58 am

Ahh, I think my neural net is finally working. Thank you for explaining this in plain English where other sites condense each paragraph into a single complex equation! This helped me a lot.

Comment by Tyler — March 24, 2011 @ 12:17 am

Very clean and transparent description. I was frustrated reading many paged on ANN and BP, but this one made me clear. If you update with Bias and Moment, and also checking with MSE for completion of error. It would be complete one — I think. Plz do this update.

Thanks alot again for such clear article.

Comment by Awlad Hossain — April 1, 2011 @ 5:34 pm

u r very tight. Best explanation ever. Why cant scientist write simple explanations such as these…

Comment by loyboy — February 8, 2012 @ 1:10 pm

Im not sure what you mean by accumulating the weight deltas and then updating the weights at the end of each epoch. If I train on 100 examples I will accumulate a certain amount of weight delta. However if I train against 1000 examples my weight deltas will be approx 10 times larger. Surely that can’t be right? Am I supposed to divide the amount of accumulated delta by the number of examples to get an average delta per example? Is the average delta applied to the weight?

I’ve tried the above algorithm, as described, on my own implementation in java. When I calculate the weight delta for a particular instance, then I get a reasonable delta figure to adjust the weight by. But when I update the weights in batch mode (i.e. accumulating delta and applying at the end of the epoch) then the deltas are huge and totally swamp the original weights.

Thanks in advance for clarifying this point.

Comment by Tom McCann — February 29, 2012 @ 6:21 am