Hello everyone, welcome to the course on machine learning with Python. In this video, we shall learn about the backpropagation learning algorithm for the multi-layer perceptron. First, a few key notations. This is our simple multi-layer perceptron architecture: there are three layers L1, L2 and L3, with dense connections between the layers. n_l denotes the number of layers in our network; thus, in the figure beside, n_l = 3. s_l denotes the number of nodes in layer l, not counting the bias unit.
These are the bias units, and the bias units always output +1. L1 is the input layer and L_{n_l} is the output layer. b_i^{(l)} is the bias associated with unit i in layer l+1. So, if I consider this particular arrow, it is basically the bias associated with the third neuron in layer two, and this bias is denoted b_3^{(1)}. The neural network as shown above has parameters (W, b), which can be expanded into (W^{(1)}, b^{(1)}) and (W^{(2)}, b^{(2)}); note that these are superscripts. W^{(1)}, b^{(1)} are nothing but the weight and the bias between layer one and layer two, and W^{(2)}, b^{(2)} are the weight and bias between layer two and layer three. In general, W_{ij}^{(l)} denotes the parameter (or weight) associated with the connection between unit j in layer l and unit i in layer l+1.
Note that bias units don't have inputs or connections going into them, since they always output the value +1. In this example, W^{(1)} is nothing but a 3x3 weight matrix, b^{(1)} is a vector of dimension three, W^{(2)} is also a vector of dimension three (because we have only one output unit over here), and b^{(2)} is a scalar. Now, this is basically our forward propagation. We will write a_i^{(l)} to denote the activation of unit i in layer l. For l = 1, we also use a^{(1)} = x to denote the input; that means x_1 = a_1^{(1)}, x_2 = a_2^{(1)}, and x_3 = a_3^{(1)}. Now, essentially, a_1^{(2)} = f(W_{11}^{(1)} x_1 + W_{12}^{(1)} x_2 + W_{13}^{(1)} x_3 + b_1^{(1)}).
The whole expression inside the bracket can be written as z_1^{(2)}. So z is called the pre-activation, a is the post-activation, and f here is called the activation function. Similarly, we can compute a_2^{(2)} and a_3^{(2)}. The final output h_{W,b}(x) is equal to a_1^{(3)}, which is nothing but f(W_{11}^{(2)} a_1^{(2)} + W_{12}^{(2)} a_2^{(2)} + W_{13}^{(2)} a_3^{(2)} + b_1^{(2)}), and the whole expression inside the bracket can be written as z_1^{(3)}. So, all the equations shown above can be written in a simplified vector-matrix form: the post-activation output of layer l+1 is a^{(l+1)} = f(z^{(l+1)}), where the pre-activation input of layer l+1 is z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)}, that is, the weight matrix of layer l multiplied with the post-activation of layer l, plus the bias associated with layer l+1.
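To make the vector-matrix form concrete, here is a minimal NumPy sketch of this forward pass for the 3-3-1 network in the figure, assuming a sigmoid activation; the random initial weights and the names sigmoid and forward are my own illustration, not something fixed by the lecture.

```python
import numpy as np

def sigmoid(z):
    # example choice of activation function f
    return 1.0 / (1.0 + np.exp(-z))

# illustrative 3-3-1 network: W1 is 3x3, b1 has length 3, W2 is 1x3, b2 has length 1
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 3)); b1 = np.zeros(3)
W2 = rng.normal(size=(1, 3)); b2 = np.zeros(1)

def forward(x):
    # layer 2: pre-activation z2 = W1 @ a1 + b1, post-activation a2 = f(z2)
    z2 = W1 @ x + b1
    a2 = sigmoid(z2)
    # layer 3 (output): z3 = W2 @ a2 + b2, h_{W,b}(x) = f(z3)
    z3 = W2 @ a2 + b2
    return sigmoid(z3)

x = np.array([0.5, -1.0, 2.0])   # one input sample with three features
print(forward(x))
```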
Now, the loss or cost function associated with the multi-layer perceptron. Given a training set of m examples, we define the overall cost function to be the mean squared error loss, denoted J(W, b) = (1/m) sum_{i=1}^{m} (1/2)(h_{W,b}(x^{(i)}) - y^{(i)})^2, that is, half of the predicted output minus the actual output, squared, averaged over all m training samples. This is what we call the mean squared error loss. However, we can also define the multi-class log loss, or categorical cross-entropy function, as shown by the corresponding equation; for classification tasks, the categorical cross-entropy loss (multi-class log loss) is mostly used. Now, the backpropagation learning algorithm: the weights are updated using the backpropagation learning algorithm, where each weight is updated using the gradient descent update rule. For that, we have to calculate the partial derivative of the loss function with respect to the weights and biases.
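For reference, here is a small NumPy sketch of the two loss functions just described; the function names mse_loss and categorical_cross_entropy and the tiny two-sample example at the end are my own illustration.

```python
import numpy as np

def mse_loss(y_pred, y_true):
    # mean squared error: (1/m) * sum of 0.5 * (prediction - target)^2
    return np.mean(0.5 * (y_pred - y_true) ** 2)

def categorical_cross_entropy(probs, y_onehot, eps=1e-12):
    # multi-class log loss: -(1/m) * sum over samples of log probability of the true class
    return -np.mean(np.sum(y_onehot * np.log(probs + eps), axis=1))

# toy example: 2 samples, 3 classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
y_onehot = np.array([[1, 0, 0],
                     [0, 1, 0]])
print(categorical_cross_entropy(probs, y_onehot))
```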
The user has to specify the number of layers, that is n_l, and the number of nodes in each hidden layer, that is s_l for l = 2 up to n_l - 1. Note that the number of nodes in the input layer and the output layer is fixed by the problem: in the input layer, the number of nodes will be equal to the number of features available in the data set, and in the output layer, the number of nodes will be equal to the number of classes. The number of epochs for which the network is to be trained also has to be specified by the user. These things specified by the user are called the hyperparameters of the neural network. Now, the user also has to specify the type of gradient descent. First, there is batch gradient descent: in each epoch,
the error in the final layer is the average of the errors due to all the training samples in the data set, and we update the parameters, the weights and biases, using backpropagation after all the training samples have been fed to the network. Then comes mini-batch gradient descent: we compute the error in the final layer for a batch, that is, a fraction of the inputs, and run backpropagation to update the weights after each batch is completed; an epoch is said to be complete when all the batches of input data have been fed to the network, and the batch size should be specified by the user. Then comes stochastic gradient descent: in each epoch, the weights are updated after every single training sample is fed. Usually, the user prefers mini-batch gradient descent; a rough sketch of these three update schedules follows below. Now comes the backpropagation learning algorithm.
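Before we walk through the algorithm step by step, here is that sketch of the three update schedules. It is a skeleton under my own assumptions: grad_fn and update_fn are hypothetical placeholders standing for "compute the average gradient of the loss on this batch" and "apply the gradient descent update"; they are not functions from the lecture or from any library.

```python
def run_epoch(X, Y, params, grad_fn, update_fn, batch_size=None):
    """One training epoch under batch, mini-batch, or stochastic gradient descent.

    batch_size=None        -> batch GD (one update per epoch, averaged over all m samples)
    batch_size=k, 1 < k < m -> mini-batch GD (one update per batch of k samples)
    batch_size=1           -> stochastic GD (one update per training sample)
    """
    m = X.shape[0]
    size = m if batch_size is None else batch_size
    for start in range(0, m, size):
        xb, yb = X[start:start + size], Y[start:start + size]
        grads = grad_fn(params, xb, yb)      # hypothetical: average gradient over this batch
        params = update_fn(params, grads)    # hypothetical: gradient descent step
    return params
```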
So, for each epoch (as we have mentioned, the number of training epochs should be specified by the user), we perform a feed-forward pass, computing the activations for layer L2, L3, and so on up to the output layer. For each output unit i in layer n_l, that means the final output layer, we set delta_i^{(n_l)} = -(y_i - a_i^{(n_l)}) * f'(z_i^{(n_l)}). Note that a_i^{(n_l)} is nothing but the activation of unit i of layer n_l, y is nothing but our desired output, and f' denotes the derivative of the activation function. Now, for l = n_l - 1, n_l - 2, n_l - 3, and so on down to 2, and for each node i in layer l, we set delta_i^{(l)} = (sum_{j=1}^{s_{l+1}} W_{ji}^{(l)} delta_j^{(l+1)}) * f'(z_i^{(l)}). Then, we compute the desired partial derivatives with respect to the weights and the biases using the formulas as shown below.
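In the usual notation, those partial derivatives are dJ/dW_{ij}^{(l)} = a_j^{(l)} delta_i^{(l+1)} and dJ/db_i^{(l)} = delta_i^{(l+1)}. Here is a minimal NumPy sketch of these backpropagation steps for the 3-3-1 network, assuming a sigmoid activation and the squared-error loss for a single training sample; the name backprop_single and the variable names are my own illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # derivative f'(z) of the sigmoid activation
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop_single(x, y, W1, b1, W2, b2):
    # feed-forward pass, keeping the pre-activations z and activations a
    z2 = W1 @ x + b1;  a2 = sigmoid(z2)
    z3 = W2 @ a2 + b2; a3 = sigmoid(z3)
    # output-layer error: delta^(3) = -(y - a^(3)) * f'(z^(3))
    delta3 = -(y - a3) * sigmoid_prime(z3)
    # hidden-layer error: delta^(2) = (W^(2)^T delta^(3)) * f'(z^(2))
    delta2 = (W2.T @ delta3) * sigmoid_prime(z2)
    # partial derivatives: dJ/dW^(l) = delta^(l+1) a^(l)^T, dJ/db^(l) = delta^(l+1)
    grad_W2 = np.outer(delta3, a2); grad_b2 = delta3
    grad_W1 = np.outer(delta2, x);  grad_b1 = delta2
    return grad_W1, grad_b1, grad_W2, grad_b2
```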
Finally, we update the parameters: the new value of W_{ij}^{(l)} is equal to the previous value of W_{ij}^{(l)} minus eta times the derivative of the cost function with respect to W_{ij}^{(l)}, and the new value of the bias b_i^{(l)} is equal to the previous value of the bias minus eta times the derivative of the cost function with respect to b_i^{(l)}. Here, eta is called the learning rate. It may seem a little hard at first to grasp the concept of the backpropagation learning algorithm. However, we will be using the TensorFlow framework for building and training the neural network, and in TensorFlow backpropagation and these derivatives are computed automatically. Hence, we need not bother about them right now.
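As a tiny illustration of both the gradient descent update rule and the fact that TensorFlow computes these derivatives automatically, here is a toy example with a single scalar weight; the toy loss, the constants, and the learning rate of 0.1 are my own choices, not from the lecture.

```python
import tensorflow as tf

# one scalar weight w and a toy squared-error loss 0.5 * (w*x - y)^2
w = tf.Variable(2.0)
x, y = tf.constant(3.0), tf.constant(9.0)

with tf.GradientTape() as tape:
    loss = 0.5 * (w * x - y) ** 2
grad = tape.gradient(loss, w)   # dJ/dw, computed automatically by TensorFlow
w.assign_sub(0.1 * grad)        # gradient descent step: w := w - eta * dJ/dw
print(float(grad), float(w))
```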
However, I would suggest going through these equations once more to understand the concept of backpropagation very well. In the next video, we will implement an artificial neural network for handwritten digit recognition using Keras and TensorFlow. So, thank you for your attention. See you in the next lecture.