Backpropagation Part 2
In part 1 we saw how moving in the direction of the gradient of a function can be used to find the maximum value of that function.
Before moving forward, let's understand some terms:
Gradient Descent:
Intuitively, we can instead move in the direction opposite to the gradient to find the minimum of the function. This is called Gradient Descent.
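As a minimal sketch of this idea, the snippet below applies gradient descent to a one-variable function. The function f(x) = (x - 3)^2, the helper name gradient_descent, the learning rate and the step count are all assumptions chosen for illustration; they are not from part 1.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)  # move opposite to the gradient
    return x

# Hypothetical example: f(x) = (x - 3)^2 has gradient f'(x) = 2 * (x - 3)
# and its minimum at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)  # very close to 3.0
```

If the learning rate is too large the steps can overshoot the minimum, which is why a small value is used here.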
Activation Function:
We will be using the sigmoid activation here:
a = σ(z) = 1/(1 + e^(-z)), where z is the weighted input
Note: understand the notations j, k, l above, which will be used heavily later: l is the layer, j is a neuron in layer l, and k is a neuron in layer l-1. Also note that we usually see z = w.x + b, but here we use a^(l-1) in place of x. They are the same equation: a^(l-1) is simply the input to layer l coming from layer l-1. So when l = 2, a^(l-1) is just the initial input x from layer l-1 = 1.
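The sketch below shows one way to compute the weighted input and the sigmoid activation of a single layer. It assumes weights[j][k] is the weight from neuron k in layer l-1 to neuron j in layer l; the function names and the example numbers are made up for illustration.

```python
import math

def sigmoid(z):
    """Sigmoid activation: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(weights, biases, a_prev):
    """Return (z, a) for layer l, given the activations a_prev of layer l-1."""
    z = [sum(w_jk * a_k for w_jk, a_k in zip(row, a_prev)) + b_j
         for row, b_j in zip(weights, biases)]
    a = [sigmoid(z_j) for z_j in z]
    return z, a

# Hypothetical layer with 2 inputs and 2 neurons; for l = 2 the input is x itself.
x = [1.0, 0.0]
W = [[0.5, -0.5], [0.0, 0.0]]
b = [-0.5, 0.0]
z, a = layer_forward(W, b, x)
print(z, a)  # [0.0, 0.0] [0.5, 0.5]
```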
Cost Function:
We need a way to measure how well our neural network is doing. The cost function helps with this:
C = (1/2n) Σ_x ‖y(x) - a^L(x)‖²
where: n is the total number of training examples; the sum is over individual training examples x; y(x) is the corresponding desired output; L denotes the number of layers in the network; a^L(x) is the vector of activations output by the network when x is input.
The cost for an input x is a function of the weights and biases, so we minimize the cost by finding its gradient w.r.t. the weights and biases and moving in the opposite direction of that gradient.
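To make the cost concrete, here is a small sketch that assumes the quadratic cost C = (1/2n) Σ_x ‖y(x) - a^L(x)‖², which is the cost consistent with the (a - y) term in Equation 1 below. The function name and the sample outputs are illustrative assumptions.

```python
def quadratic_cost(outputs, targets):
    """C = 1/(2n) * sum over examples of ||y(x) - a^L(x)||^2."""
    n = len(outputs)
    total = 0.0
    for a, y in zip(outputs, targets):
        total += sum((y_j - a_j) ** 2 for a_j, y_j in zip(a, y))
    return total / (2 * n)

# Two hypothetical training examples: the first is predicted badly,
# the second perfectly.
outputs = [[0.5, 0.5], [1.0, 0.0]]
targets = [[1.0, 0.0], [1.0, 0.0]]
cost = quadratic_cost(outputs, targets)
print(cost)  # 0.125
```

The cost is zero only when every output matches its desired output, and it grows as the outputs drift away from the targets.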
Backpropagation:
It is an algorithm for computing the gradients of the cost function, which can then be used to minimize the cost by changing the weights and biases in the direction opposite to the gradient.
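Can’t we use calculus to find the gradients directly instead of backpropagation?
A NN has a huge number of weight parameters; some networks have millions. Working out each partial derivative separately, for example by nudging one weight at a time and recomputing the cost, needs at least one extra forward pass per parameter, which makes training far too costly. Backpropagation is still calculus (it is the chain rule), but it is organized so that all the partial derivatives are obtained from one forward pass and one backward pass.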
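To illustrate the cost of the naive approach, the sketch below estimates every partial derivative by finite differences and counts how often the cost has to be evaluated. The toy cost (a sum of squares standing in for a full forward pass), the class CountingCost and the function numerical_gradient are assumptions made for this example.

```python
class CountingCost:
    """Toy stand-in for a network's cost that counts its own evaluations."""
    def __init__(self):
        self.calls = 0

    def __call__(self, params):
        self.calls += 1
        return sum(p * p for p in params)

def numerical_gradient(cost, params, eps=1e-6):
    """Central-difference estimate of dC/dp for each parameter p.

    Needs two cost evaluations per parameter.
    """
    grads = []
    for i in range(len(params)):
        plus = list(params)
        minus = list(params)
        plus[i] += eps
        minus[i] -= eps
        grads.append((cost(plus) - cost(minus)) / (2 * eps))
    return grads

toy_cost = CountingCost()
grads = numerical_gradient(toy_cost, [1.0, -2.0, 3.0])
print(grads)           # approximately [2.0, -4.0, 6.0]
print(toy_cost.calls)  # 6 evaluations for 3 parameters
```

With millions of parameters this means millions of forward passes for a single gradient, which is the cost backpropagation avoids.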
There are 4 fundamental equations behind backpropagation. Let's derive and understand them one by one:
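Equation 1: Deriving the error at the output layer L.
We start with the intuitive assumption that the error of neuron j in layer l can be expressed as the change in C w.r.t. its weighted input z: δ^l_j = ∂C/∂z^l_j.
The error at the output layer L is then the Hadamard (element-wise) product of (a^L - y), which is ∂C/∂a^L for the quadratic cost, and the derivative of the sigmoid activation evaluated at the weighted input z^L:
δ^L = (a^L - y) ⊙ σ'(z^L)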
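A minimal sketch of Equation 1 follows, assuming the quadratic cost and the sigmoid activation, for which σ'(z) = σ(z)(1 - σ(z)). The function names and the example values are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def output_error(a_L, y, z_L):
    """Equation 1: delta^L = (a^L - y) ⊙ sigma'(z^L), element by element."""
    return [(a_j - y_j) * sigmoid_prime(z_j)
            for a_j, y_j, z_j in zip(a_L, y, z_L)]

# Hypothetical output layer with two neurons.
z_L = [0.0, 0.0]
a_L = [sigmoid(z_j) for z_j in z_L]   # [0.5, 0.5]
y = [1.0, 0.0]
delta_L = output_error(a_L, y, z_L)
print(delta_L)  # [-0.125, 0.125]
```

The sign of each entry shows the direction of the mistake: the first neuron's output is too low, the second neuron's output is too high.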
Equation 2: Error at the previous layers l-1, l-2, …
Equation 1 gives the error at the output layer L. This error was built up by the earlier layers through their imperfect weights and biases, so we now propagate it backwards, giving each neuron in the previous layer a share of the error in proportion to its weighted contribution:
δ^l = ((w^(l+1))^T δ^(l+1)) ⊙ σ'(z^l)
As we can see, the error of a neuron in layer l depends on the weighted errors calculated at layer l+1.
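The backward step can be sketched as follows, assuming w_next[j][k] is the weight from neuron k in layer l to neuron j in layer l+1, so that summing over j applies the transposed weight matrix. All names and numbers are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backpropagate_error(w_next, delta_next, z):
    """Equation 2: delta^l = ((w^(l+1))^T delta^(l+1)) ⊙ sigma'(z^l)."""
    delta = []
    for k, z_k in enumerate(z):
        # Error reaching neuron k from every neuron j in layer l+1.
        weighted = sum(w_next[j][k] * delta_next[j]
                       for j in range(len(delta_next)))
        delta.append(weighted * sigmoid_prime(z_k))
    return delta

# Hypothetical values: 2 neurons in layer l, 2 neurons in layer l+1.
w_next = [[1.0, 2.0], [3.0, 4.0]]
delta_next = [0.5, -0.5]
z = [0.0, 0.0]
delta = backpropagate_error(w_next, delta_next, z)
print(delta)  # [-0.25, -0.25]
```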
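Equation 3: Partial derivative of the cost C w.r.t. the bias
∂C/∂b^l_j = δ^l_j, where δ^l_j is the error of neuron j in layer l.
Equation 4: Partial derivative of the cost C w.r.t. the weight w
∂C/∂w^l_jk = a^(l-1)_k δ^l_j, where a^(l-1)_k is the activation of neuron k in layer l-1.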
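Once the error δ^l of a layer is known, Equations 3 and 4 are cheap to evaluate. The sketch below uses the standard forms ∂C/∂b^l_j = δ^l_j and ∂C/∂w^l_jk = a^(l-1)_k δ^l_j; the function name and sample values are hypothetical.

```python
def layer_gradients(delta, a_prev):
    """Return (grad_w, grad_b) for one layer.

    grad_b[j]    = delta[j]               (Equation 3)
    grad_w[j][k] = a_prev[k] * delta[j]   (Equation 4)
    """
    grad_b = list(delta)
    grad_w = [[a_k * d_j for a_k in a_prev] for d_j in delta]
    return grad_w, grad_b

# Hypothetical error of a 2-neuron layer and the activations feeding it.
delta = [-0.125, 0.125]
a_prev = [1.0, 2.0]
grad_w, grad_b = layer_gradients(delta, a_prev)
print(grad_w)  # [[-0.125, -0.25], [0.125, 0.25]]
print(grad_b)  # [-0.125, 0.125]
```

Note that a weight whose input activation a^(l-1)_k is close to zero gets a gradient close to zero, so it learns slowly.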
Steps in Backpropagation: Using the above equations we can find the gradients of the cost C, namely ∂C/∂w and ∂C/∂b. Starting from randomly chosen weights and biases, we then subtract the gradients (scaled by a small step size, the learning rate) from them and slowly move towards the minimum of C.
Repeating the above steps should slowly reduce the cost, and the accuracy of our model should increase gradually.
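Putting the four equations together, the following is a rough end-to-end sketch rather than a reference implementation. It assumes a tiny 2-2-1 sigmoid network, the quadratic cost, the OR function as training data, a learning rate of 0.5, and updates after every single example; all of these choices and all function names are assumptions for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(weights, biases, x):
    """Return the activations of every layer, starting with the input x."""
    acts = [x]
    for W, b in zip(weights, biases):
        a_prev = acts[-1]
        acts.append([sigmoid(sum(w * a for w, a in zip(row, a_prev)) + b_j)
                     for row, b_j in zip(W, b)])
    return acts

def backprop(weights, biases, x, y):
    """Gradients of 0.5 * ||y - a^L||^2 for one training example."""
    acts = forward(weights, biases, x)
    # Equation 1; for the sigmoid, sigma'(z) = a * (1 - a).
    delta = [(a - t) * a * (1 - a) for a, t in zip(acts[-1], y)]
    grads_w, grads_b = [], []
    for l in range(len(weights) - 1, -1, -1):
        a_prev = acts[l]
        grads_b.insert(0, delta)                                      # Equation 3
        grads_w.insert(0, [[d * a for a in a_prev] for d in delta])   # Equation 4
        if l > 0:
            # Equation 2: push the error back through weights[l].
            delta = [sum(weights[l][j][k] * delta[j] for j in range(len(delta)))
                     * a_prev[k] * (1 - a_prev[k])
                     for k in range(len(a_prev))]
    return grads_w, grads_b

def cost(weights, biases, data):
    total = sum((t - a) ** 2
                for x, y in data
                for a, t in zip(forward(weights, biases, x)[-1], y))
    return total / (2 * len(data))

def train(weights, biases, data, learning_rate=0.5, epochs=2000):
    for _ in range(epochs):
        for x, y in data:
            gw, gb = backprop(weights, biases, x, y)
            for l in range(len(weights)):
                for j in range(len(biases[l])):
                    biases[l][j] -= learning_rate * gb[l][j]
                    for k in range(len(weights[l][j])):
                        weights[l][j][k] -= learning_rate * gw[l][j][k]

random.seed(0)  # fixed seed so the random starting point is reproducible
sizes = [2, 2, 1]
weights = [[[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [[random.uniform(-1, 1) for _ in range(n_out)] for n_out in sizes[1:]]
data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]

cost_before = cost(weights, biases, data)
train(weights, biases, data)
cost_after = cost(weights, biases, data)
print(cost_before, cost_after)  # the cost after training should be lower
```

Updating after each example is a simplification; averaging the gradients over all n examples before each update matches the cost function's 1/n factor more closely.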