Hello everyone, welcome to the course on machine learning with Python. In this video we shall learn about a classification technique called logistic regression. First, some prerequisites: why is linear regression not suitable for a classification problem? Consider the following plot of tumor size versus malignancy, where zero denotes benign and one denotes malignant. As you can see, the small dots denote the benign tumors and the triangles denote the malignant tumors. Now, we can fit a regression line to these data points as shown. Since this is a classification problem, we put forward the following hypothesis: for an unknown tumor of size x, let the corresponding value obtained from the regression line be y.
If y is more than 0.5, then the corresponding tumor is malignant, otherwise it is benign. But we can see from the plot that there are several examples for which this rule gives the wrong label. Moreover, a linear equation gives continuous valued output and is more suitable when the target value is continuous or quantitative in nature. So, why is a linear equation not suitable for a classification problem? In classification, we are more interested in learning the conditional probability of y = 1 given x for different values of x. Hence, our hypothesis function should produce legitimate probability values as output, that is, it should provide output between zero and one. However, a linear equation produces real valued output, which theoretically can lie anywhere between minus infinity and plus infinity.
Hence, a linear equation cannot properly model the hypothesis function for a classification problem. It is worth noting that the probability of y = 1 given x equals one minus the probability of y = 0 given x, or in other words, the probability of y = 1 given x plus the probability of y = 0 given x equals one. That is, the probability that an observation belongs to one class equals one minus the probability that the observation does not belong to that class. So, towards logistic regression for binary classification, consider the following scenario of heart disease versus age. We have heart disease (yes or no) on the vertical axis and age along the x axis, so age here is the predictor variable and heart disease is what we want to predict. Given a new person's age, predict if he or she has heart disease. Hence, mathematically speaking, our task is to estimate the probability that y = yes (the person has the disease) given the age value x. Now, let's consider the training set, and we obtain a plot like this.
Okay, so first calculate the probability of y = yes given x for different ranges of the values of x, then fit a curve that estimates the probability P(y = yes | x). So, we fit a curve like this. The logistic function, also known as the sigmoid function, is given by g(z) = logistic(z) = sigmoid(z) = 1 / (1 + e^(-z)). In other words, it can also be written as e^z / (1 + e^z). Clearly g(z) lies between zero and one. We can plot the function g(z) for different values of z and we obtain a graph like this. If we differentiate g(z) with respect to z, we get the following.
So, g'(z) = g(z)(1 − g(z)). If you go through all the steps, I hope you will follow it; it is just basic calculus. Simplifying, we can readily appreciate that the derivative of the sigmoid function is nothing but the value of the function at that point times one minus the value of the function. We can also plot the graph of g'(z) versus z, and we get a graph like this. We can see that at z = 0, g'(z) is maximum, and since g'(z) is always positive, we can say that the graph of g(z) is monotonically increasing. Now, the hypothesis function for logistic regression: let there be only one predictor variable or feature; then our hypothesis function for logistic regression is h_θ(x) = g(θ₀ + θ₁x), where g(z) is nothing but the logistic or sigmoid function 1 / (1 + e^(-z)).
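As a quick illustration (not part of the lecture slides), a minimal NumPy sketch of the sigmoid and its derivative might look like this; the function names sigmoid and sigmoid_derivative are chosen here just for clarity:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """Derivative of the sigmoid: g'(z) = g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)

# g(z) stays between 0 and 1; g'(z) is always positive and peaks at z = 0,
# so g(z) is monotonically increasing.
z = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
print(sigmoid(z))
print(sigmoid_derivative(z))   # maximum value 0.25 occurs at z = 0
```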
So h_θ(x) = 1 / (1 + e^(-(θ₀ + θ₁x))). Here θ₀ and θ₁ are called the model parameters. If there is more than one predictor variable or feature, say K of them, so that x is a vector of K values, then our hypothesis function for logistic regression is h_θ(x) = g(θ₀ + Σⱼ θⱼxⱼ), where j runs from 1 to K and g(z) is again nothing but the logistic or sigmoid function. So, finally, what is our h_θ(x)? If we put θ₀ + Σⱼ θⱼxⱼ (j from 1 to K) in place of z, we obtain a function like this. Note that there are K + 1 parameter values θ in the model.
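A minimal sketch of this hypothesis function in NumPy, assuming X holds the m training examples row-wise and theta holds the K + 1 parameters with theta[0] as the intercept, could be:

```python
import numpy as np

def hypothesis(theta, X):
    """h_theta(x) = g(theta_0 + sum_j theta_j * x_j), applied row-wise to X.

    theta : shape (K + 1,), theta[0] is the intercept theta_0
    X     : shape (m, K), m examples with K features each
    """
    z = theta[0] + X @ theta[1:]     # theta_0 + theta_1*x_1 + ... + theta_K*x_K
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid squashes the output into (0, 1)
```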
Now, the interpretation of the hypothesis function: h_θ(x) = P(y = 1 | x; θ), that is, the probability that y equals one given x, parameterized by θ. Note that for binary classification our predicted output ŷ is nothing but h_θ(x), so this function gives the predicted output. Similarly, 1 − h_θ(x) = P(y = 0 | x; θ), the probability that y equals zero given x, parameterized by θ. Note that there are only two classes, one corresponding to y = 1 and another corresponding to y = 0; hence, it is also called binary classification. We use the training data to estimate the model parameters θ. Now comes the cost function. The cost function of logistic regression has the following form: if y = 1, the cost is −log(h_θ(x)), where the logarithm is the natural logarithm, and if y = 0, the cost is −log(1 − h_θ(x)).
Now, the interpretation of the cost function: if y = 1 and h_θ(x) is also equal to one, the cost equals zero, which means that for a correct prediction there is no cost. But if h_θ(x) tends to zero, then the cost tends to infinity, which means that for a wrong prediction we are heavily penalizing our algorithm. Similarly, for y = 0, if h_θ(x) is zero, then the cost equals zero, meaning there is no cost for a correct classification; but if there is a misclassification, that is, h_θ(x) tends to one, then the cost again tends to infinity. We can express the cost function in a single equation as shown. Now, the overall cost function is nothing but the average of the individual cost functions, so J(θ) is nothing but the average of the costs over all data points (x⁽ⁱ⁾, y⁽ⁱ⁾).
Hence, in this case J(θ) equals the following formula, which looks complicated, but if you look into it, it is not: J(θ) = −(1/m) Σᵢ [ y⁽ⁱ⁾ log(h_θ(x⁽ⁱ⁾)) + (1 − y⁽ⁱ⁾) log(1 − h_θ(x⁽ⁱ⁾)) ], where the sum runs over all instances, which is why we use the superscript i, with i going from 1 to m. This is also called the binary cross entropy loss or the log loss function. We optimize this cost function using the gradient descent algorithm to find the values of the model parameters for which the cost function J(θ) is minimum. To use gradient descent to optimize the cost function, we have to find the gradient, that is, the partial derivative of J with respect to θⱼ for each of the different parameters θ₀ through θ_K.
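To make the formula concrete, here is an illustrative NumPy version of this binary cross-entropy cost; the small eps guard against log(0) is a numerical detail added here, not something from the lecture:

```python
import numpy as np

def cost(theta, X, y):
    """Binary cross-entropy / log loss:
    J(theta) = -(1/m) * sum_i [ y_i * log(h_i) + (1 - y_i) * log(1 - h_i) ]."""
    m = len(y)
    z = theta[0] + X @ theta[1:]
    h = 1.0 / (1.0 + np.exp(-z))   # h_theta(x) for every example
    eps = 1e-12                    # numerical guard so log(0) never occurs
    return -(1.0 / m) * np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
```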
Okay. So, we know this is our cost function, and we also know that h_θ(x) looks like this, so the cost function has a complicated dependency on the θs. Let z be θ₀ + Σⱼ θⱼxⱼ with j from 1 to K; then h_θ(x) is nothing but g(z), so in place of h_θ(x) we can simply write g(z). Now we can differentiate it according to the chain rule of differentiation.
So first we differentiate J with respect to g(z), then g(z) with respect to z, and then z with respect to θⱼ. If we start from the first partial derivative, we obtain an expression like this; then from the second partial derivative we obtain g(z)(1 − g(z)), which we have already proved when we discussed the logistic function. And as you can see, z is nothing but the linear combination of θ and x, so ∂z/∂θⱼ will produce 1 for j = 0 and produce xⱼ for j = 1, 2, ..., K. Now, this is our final result if we combine all those partial derivatives.
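Writing the chain rule out explicitly (a sketch of the standard derivation, using the lecture's convention that x₀ = 1 so the intercept needs no special case):

```latex
\frac{\partial J}{\partial \theta_j}
  = \frac{\partial J}{\partial g(z)}\cdot\frac{\partial g(z)}{\partial z}\cdot\frac{\partial z}{\partial \theta_j},
\qquad
\frac{\partial J}{\partial g(z)}
  = -\frac{1}{m}\sum_{i=1}^{m}\left[\frac{y^{(i)}}{g(z^{(i)})}-\frac{1-y^{(i)}}{1-g(z^{(i)})}\right],
\qquad
\frac{\partial g(z)}{\partial z} = g(z)\bigl(1-g(z)\bigr),
\qquad
\frac{\partial z}{\partial \theta_j} = x_j
```

Multiplying the three factors and simplifying gives the final result quoted in the lecture:

```latex
\frac{\partial J}{\partial \theta_j}
  = \frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,x_j^{(i)}
```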
Now, note that if we add one extra column of all ones on the left-hand side of the data set, then we no longer need two separate cases in the differentiation; instead, we can use a single expression and assume that x₀ is just another feature which is always equal to one. As I have said, this can be achieved by adding a feature column with all values one to the left of the data set table. Simplifying, we obtain the gradient of the cost function with respect to θ as shown. Then what is the gradient descent update rule? The value of θ at any iteration t + 1 will be equal to the value of θ in the previous iteration minus α times the gradient of the cost function with respect to θ.
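Putting the gradient and this update rule together, a minimal batch gradient descent sketch in NumPy might look as follows; the learning rate alpha and the iteration count n_iters are illustrative choices, not values prescribed in the lecture:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Batch gradient descent for logistic regression.

    Update rule: theta := theta - alpha * (1/m) * X_b^T (h - y),
    where X_b is X with a leading column of ones (so x_0 = 1 covers theta_0).
    """
    m, k = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])          # add the column of ones
    theta = np.zeros(k + 1)                       # start from all-zero parameters
    for _ in range(n_iters):
        h = 1.0 / (1.0 + np.exp(-(Xb @ theta)))   # hypothesis for all m examples
        grad = (1.0 / m) * Xb.T @ (h - y)         # gradient of J(theta)
        theta -= alpha * grad                     # gradient descent update
    return theta
```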
So, we can put in the value of the gradient that we have obtained, and this constitutes the gradient descent update rule for logistic regression. As we have already said in the class on linear regression, α is a user-defined small positive value known as the learning rate. Now, what is the decision boundary? As we have seen, h_θ(x) is nothing but the probability that y = 1 given x, parameterized by θ, and this is a binary classification where y = 0 for class one and y = 1 for class two. We predict y = 0 if the hypothesis function produces output less than 0.5, and y = 1 if the hypothesis function produces output more than 0.5, for some input features x. In other words, since h_θ(x) is nothing but g(z), if g(z) is less than 0.5 then y = 0, and if g(z) is more than 0.5 then y = 1; so y = 0 if z is less than zero and y = 1 if z is greater than zero.
Hence our decision boundary is z = 0, where z is nothing but θ₀ + Σⱼ θⱼxⱼ with j from 1 to K. This boundary separates the two classes, and as we know, z = 0 is the equation of a K-dimensional hyperplane. In the next video, we shall see how to implement logistic regression in Python.
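For completeness, a small illustrative predict function that applies this z ≥ 0 decision rule (equivalent to thresholding h_θ(x) at 0.5) could be:

```python
import numpy as np

def predict(theta, X):
    """Predict y = 1 when h_theta(x) >= 0.5, i.e. when z = theta_0 + sum_j theta_j*x_j >= 0."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # same column of ones used during training
    return (Xb @ theta >= 0).astype(int)            # the decision boundary is the hyperplane z = 0
```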