Backpropagation is an algorithm used to train feed forward artificial neural networks. It works by presenting the network with a set of input data and the corresponding ideal output data, calculating the actual outputs, and backpropagating the resulting error (the difference between the ideal and actual outputs) using gradient descent. This is useful for learning specific patterns of inputs and outputs so that the network can reproduce them afterwards, even from slightly different or incomplete input data. Neural networks can approximate an arbitrary function of the input data by generalizing from the patterns they are trained with.

In this post I will start by explaining what feed forward artificial neural networks are, then explain the backpropagation algorithm used to train them, and finally provide a sample implementation. This post assumes some knowledge of math and computer programming.

**Artificial neural networks**

Artificial neural networks try to simulate the basic functioning of natural neural networks. The human brain has billions of neurons that work together as one big network. Each natural neuron has stronger or weaker connections with other neurons in the brain. It is believed that most of the information the brain retains is stored in these connections, especially in how strong they are.

An artificial neural network retains this basic aspect of natural neural networks and uses a set of real-valued weights to remember how strong the connections between its neurons are. Each neuron has a number of input connections from other neurons, a number of output connections to other neurons, and an activation function that is applied to the sum of its inputs and provides the output value of the neuron.

**Feed forward** neural networks don’t have any cycles in their connection graph. This implies that the output connections of a neuron are distinct from its input connections. In this kind of network the neurons are typically arranged in layers.

The feed forward neural network above has 6 neurons arranged in 3 layers. N_{1}, N_{2} and N_{3} are called input neurons, as they receive data directly from the user rather than from other neurons. For input neurons the activation function is not applied. N_{4} and N_{5} are part of what is called a hidden layer of neurons. They are hidden because the user can’t set or read their values directly; they only communicate with other neurons. N_{6} is an output neuron; from the output neurons the user gets the result of the neural network’s processing. W_{ij} are the weights of the connections between the neurons (a measure of how strong the connections are).

Hidden and output neurons apply an activation function to the sum of their input data. This sum is the sum of the products of the input neurons’ output values and the weights of the connections from those neurons:

Sum_{j} = ∑(Output_{i} * w_{ij}), where N_{i} are the input neurons to neuron N_{j}

For the activation function F(x), the output of the neuron N_{j} is:

Output_{j} = F(Sum_{j})

In order to be able to learn any pattern we need a non-linear activation function. The most used activation function is the sigmoid function:

F(x) = 1 / (1 + e^{-x})

Because the sigmoid function takes values in the (0, 1) interval you will have to scale your output values to and from that interval.
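For instance, a simple linear scaling could map target values into and out of the sigmoid’s range. This is a sketch with my own helper names, not part of the linked implementation:

```python
def scale_to_unit(value, lo, hi):
    # Map a value from the range [lo, hi] into [0, 1] for use as a training target.
    return (value - lo) / (hi - lo)

def scale_from_unit(value, lo, hi):
    # Map a network output in [0, 1] back to the original range [lo, hi].
    return lo + value * (hi - lo)
```

For example, a target of 5 in the range [0, 10] scales to 0.5, and a network output of 0.5 scales back to 5.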

In our example above we have:

N_{1}, N_{2} and N_{3} as input values.

N_{4} = F(w_{14}*N_{1} + w_{24}*N_{2} + w_{34}*N_{3})

N_{5} = F(w_{15}*N_{1} + w_{25}*N_{2} + w_{35}*N_{3})

Output = N_{6} = F(w_{46}*N_{4} + w_{56}*N_{5})
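The forward pass for this 3-2-1 example network can be sketched in code as follows. This is my own Python sketch (the function and variable names are not from the linked C# code):

```python
import math

def sigmoid(x):
    # Logistic activation: maps any real input into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, w_hidden, w_output):
    """Forward pass for a 3-2-1 feed forward network.

    inputs:   [N1, N2, N3], the input neuron values
    w_hidden: w_hidden[i][j] is the weight from input neuron i to hidden neuron j
    w_output: w_output[j] is the weight from hidden neuron j to the output neuron
    """
    # Hidden neurons: N4 = F(w14*N1 + w24*N2 + w34*N3), likewise N5.
    hidden = [sigmoid(sum(inputs[i] * w_hidden[i][j] for i in range(3)))
              for j in range(2)]
    # Output neuron: N6 = F(w46*N4 + w56*N5).
    output = sigmoid(sum(hidden[j] * w_output[j] for j in range(2)))
    return hidden, output
```

Note that with all-zero weights every weighted sum is 0 and sigmoid(0) = 0.5, so both hidden neurons and the output come out as 0.5.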

**Backpropagation**

The purpose of learning is to determine the weights W_{ij} that allow us to reproduce the provided patterns of inputs and outputs (function of inputs). Once the network is trained we can use it to get the expected outputs with incomplete or slightly different data. Basically, it learns a function of arbitrary complexity from examples. The complexity of the function that can be learned depends on the number of hidden neurons.

Before starting the backpropagation learning iterations you will have to initialize the weights with random values, typically in the interval (-1, 1).

A backpropagation step for a specific input pattern and ideal output starts by calculating the error at the output neurons. This error is the difference between the provided ideal output and the calculated actual output, multiplied by the derivative of the activation function at that output point. For the sigmoid function the derivative is F^{’}(x) = F(x)*(1 – F(x)).

OutputError_{j} = (IdealOutput_{j} – Output_{j}) * F^{’}(Output_{j})

In our example, using the sigmoid as an activation function:

N_{6}_Error = (N_{6}_Ideal – N_{6}) * N_{6} * (1-N_{6})

After we have the error for the output layer we calculate an error for each neuron in the hidden layers, going backwards, layer by layer. The error for a neuron in a hidden layer is the sum of the products of the errors of the neurons in the next layer and the weights of the connections to those neurons, multiplied by the derivative of the activation function at the hidden neuron’s output.

HiddenError_{i} = ∑(OutputError_{j} * w_{ij}) * F^{’}(HiddenOutput_{i})

In our example:

N_{4}_Error = ( N_{6}_Error * w_{46} ) * N_{4} * (1 – N_{4})

N_{5}_Error = ( N_{6}_Error * w_{56} ) * N_{5} * (1 – N_{5})
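The two error formulas for the sigmoid case can be sketched like this (again my own helper names, not from the linked code):

```python
def output_error(ideal, actual):
    # (IdealOutput - Output) * F'(Output), where for the sigmoid
    # F'(x) = F(x) * (1 - F(x)), evaluated at the neuron's output value.
    return (ideal - actual) * actual * (1.0 - actual)

def hidden_errors(hidden, w_output, out_err):
    # Each hidden neuron's error is the output error scaled by the weight of
    # its connection to the output neuron, times the sigmoid derivative at
    # the hidden neuron's own output value.
    return [out_err * w_output[j] * hidden[j] * (1.0 - hidden[j])
            for j in range(len(hidden))]
```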

We will use those errors to calculate the variation of the weights as a result of the current input pattern and ideal outputs. The variation (delta) of a weight is the product of the output value of the connection’s input neuron and the error of the connection’s output neuron. Please note that I use input neuron and output neuron of a connection as separate terms from the network’s input neurons and output neurons.

∆w_{ij} = Output_{i} * Error_{j}

In our example:

∆w_{46} = N_{4} * N_{6}_Error

∆w_{56} = N_{5} * N_{6}_Error

∆w_{14} = N_{1} * N_{4}_Error

∆w_{24} = N_{2} * N_{4}_Error

∆w_{34} = N_{3} * N_{4}_Error

∆w_{15} = N_{1} * N_{5}_Error

∆w_{25} = N_{2} * N_{5}_Error

∆w_{35} = N_{3} * N_{5}_Error
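The ∆w_{ij} = Output_{i} * Error_{j} rule for one pattern can be sketched as (hypothetical helper, one hidden layer):

```python
def weight_deltas(inputs, hidden, hid_errs, out_err):
    # Deltas for the input->hidden weights: each input value times the
    # error of the hidden neuron it connects to.
    d_hidden = [[inputs[i] * hid_errs[j] for j in range(len(hidden))]
                for i in range(len(inputs))]
    # Deltas for the hidden->output weights: each hidden output value
    # times the error of the output neuron.
    d_output = [hidden[j] * out_err for j in range(len(hidden))]
    return d_hidden, d_output
```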

This process is repeated for all input patterns and the variations (deltas) are accumulated. At the end of a learning iteration we update the weights with the deltas accumulated over all the training patterns, multiplied by a learning rate (a number typically between 0 and 1 that controls how fast the network converges to a result). A neural network typically needs at least a few hundred iterations in order to learn a set of patterns.

∆w_{ij}_Final = ∑∆w_{ij}_Input_{k}

w_{ij} = w_{ij} + ( ∆w_{ij}_Final * LearningRate )

In our example, considering 2 input patterns and a learning rate of 0.3 we have for example:

∆w_{46}_Final = ∆w_{46}_Input1 + ∆w_{46}_Input2

New w_{46} = w_{46} + 0.3 * ∆w_{46}_Final
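The update rule itself is a one-liner per weight; a sketch over a flat list of weights (my own helper, not the linked code) could be:

```python
def apply_deltas(weights, accumulated, learning_rate=0.3):
    # w_ij = w_ij + learning_rate * accumulated delta for w_ij
    return [w + learning_rate * d for w, d in zip(weights, accumulated)]
```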

To summarize, in order to train a network using backpropagation, we follow these steps:

– Initialize weights with random values

– For a specified number of training iterations do:

- For each input and ideal (expected) output pattern:
  - Calculate the actual output from the input
  - Calculate the output neurons’ error
  - Calculate the hidden neurons’ error
  - Calculate the weight variations (deltas): ∆w_{ij}
  - Add the weight variations to the accumulated deltas

- Learn by adding the accumulated deltas (multiplied by the learning rate) to the weights
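Putting all the steps above together, a minimal batch-learning sketch for a network with one hidden layer might look like the following. This is a Python sketch under my own naming, not the linked C# implementation:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(patterns, n_in, n_hid, iterations=1000, rate=0.3):
    """Train a 1-hidden-layer feed forward network with batch backpropagation.

    patterns: list of (inputs, ideal_output) pairs; inputs has n_in values.
    Returns (w_hid, w_out) so callers can run the forward pass themselves.
    """
    # Initialize weights with random values in (-1, 1).
    w_hid = [[random.uniform(-1, 1) for _ in range(n_hid)] for _ in range(n_in)]
    w_out = [random.uniform(-1, 1) for _ in range(n_hid)]
    for _ in range(iterations):
        # Accumulated deltas for this iteration (one pass over all patterns).
        d_hid = [[0.0] * n_hid for _ in range(n_in)]
        d_out = [0.0] * n_hid
        for inputs, ideal in patterns:
            # Forward pass: hidden outputs, then the single network output.
            hidden = [sigmoid(sum(inputs[i] * w_hid[i][j] for i in range(n_in)))
                      for j in range(n_hid)]
            output = sigmoid(sum(hidden[j] * w_out[j] for j in range(n_hid)))
            # Backward pass: output error, then hidden errors.
            out_err = (ideal - output) * output * (1.0 - output)
            hid_errs = [out_err * w_out[j] * hidden[j] * (1.0 - hidden[j])
                        for j in range(n_hid)]
            # Accumulate deltas: delta w_ij = Output_i * Error_j.
            for j in range(n_hid):
                d_out[j] += hidden[j] * out_err
                for i in range(n_in):
                    d_hid[i][j] += inputs[i] * hid_errs[j]
        # Learn: add the accumulated deltas, scaled by the learning rate.
        for j in range(n_hid):
            w_out[j] += rate * d_out[j]
            for i in range(n_in):
                w_hid[i][j] += rate * d_hid[i][j]
    return w_hid, w_out
```

For example, training on the four input/output pairs of the logical AND function should push the network output for (1, 1) above its output for (0, 0) after a few thousand iterations.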

I’ve made a C# implementation of a feed forward neural network with just one hidden layer that uses backpropagation: NeuralNet.cs.

I also made an example project (using Visual C# Express Edition) where I test the neural network and backpropagation on simple digit recognition: nnet.zip