Multivariate regression


Transcript

Hello everyone, welcome to the course on machine learning with Python. In this video we shall discuss multiple linear regression. So, what is the problem statement? Suppose there are k many features or predictor variables, x_1, x_2, x_3, up to x_k, and a target variable y. We want to explain y as a linear combination of all those predictor variables. We have a vector of features or predictors, x = (x_1, x_2, x_3, ..., x_k)^T. Why the transpose? Because this should be a column vector. Now, the model representation: we can express our multiple linear regression problem by means of the following probabilistic model, y = theta_0 + theta_1 x_1 + theta_2 x_2 + theta_3 x_3 + ... + theta_k x_k + epsilon.

So, epsilon here is the random residual error. We also have a vector of model parameters, theta_0, theta_1, theta_2, up to theta_k, so there are a total of k+1 model parameters and the model parameter vector is of dimension k+1. As I said, epsilon is the randomness in the model. As in the case of simple linear regression, here also we shall use our sample data to estimate the model parameter vector; our estimated model parameter vector will be theta hat (note that a capital Theta is used to denote the vector). Now, the assumptions of multiple linear regression: the predictors should be uncorrelated and linearly independent of one another, which means that if one predictor depends on another variable, we simply discard that predictor; the model's error term should have constant variance; and the error term is preferably normally distributed with zero mean. Now, the hypothesis function: the predicted value of the target variable is obtained from the estimated model parameters as y_hat = theta_0 hat + theta_1 hat x_1 + theta_2 hat x_2 + ... + theta_k hat x_k, that is, nothing but theta_0 hat plus the sum over theta_j hat x_j, where j runs from 1 to k.
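The transcript does not include code, but the hypothesis function above maps directly onto a one-line NumPy computation. A minimal sketch, assuming X is an m-by-k matrix of predictors and theta_hat is a length-(k+1) vector with the intercept stored first (the function name is illustrative):

```python
import numpy as np

def predict(X, theta_hat):
    """Hypothesis: y_hat = theta_0 + theta_1*x_1 + ... + theta_k*x_k for each row of X."""
    return theta_hat[0] + X @ theta_hat[1:]
```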

Now, the cost function: we shall define the cost function in the following way. J(theta), which is a function of the model parameter vector, equals 1/(2m) times the sum over i of (y^(i) - y_hat^(i))^2, where i runs from 1 to m and m is the total number of training samples. If we replace the predicted value y_hat^(i) by theta_0 plus the sum over theta_j x_j^(i), with j running from 1 to k, we obtain the full equation of the cost function. Note that the feature vector x^(i) and y^(i) form the i-th data sample, and y_hat^(i) is the corresponding output of the model; we are using the superscript to denote the instance of the data, that is, the particular data sample. Our objective is to estimate the model parameters theta hat from the given sample of data so that the cost function J is minimized.
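A quick sketch of this cost in NumPy, under the same assumptions and naming as the predict sketch above:

```python
import numpy as np

def cost(X, y, theta_hat):
    """J(theta) = (1/(2m)) * sum_i (y_hat^(i) - y^(i))^2 over the m training samples."""
    m = len(y)
    y_hat = theta_hat[0] + X @ theta_hat[1:]   # model predictions for all m samples
    return np.sum((y_hat - y) ** 2) / (2 * m)
```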

Now, gradient descent for cost function optimization: gradient descent is an optimization technique for minimizing the cost function J. It is an iterative algorithm, and in a nutshell it does the following. We start with some random guesses, that is, some random initial values of theta, the model parameters. It keeps on changing the values of theta, that is theta_0 hat, theta_1 hat, and so on up to theta_k hat, to reduce J, until we end up at the minimum. Okay, so how does the algorithm work? This is the algorithm of gradient descent. First, the input is the feature set, that means the values of x, and the corresponding values of the target variable, that is the values of y. The output will be the model parameters that we want to estimate, for which the cost function is minimum. Now the initialization: first we initialize the model parameters with some random values, and we initialize the iteration number t to 1. Then we repeat the following steps until convergence. First, we compute the cost function for the given value of theta hat.

Then, for a particular theta_j, its value at the (t+1)-th iteration will be equal to its value at the t-th iteration minus alpha times the gradient of the cost function with respect to theta_j, and we apply the same update simultaneously for every j = 0, 1, 2, ..., k; then we increase the iteration number by one. We repeat this step until we get convergence. Okay, we'll discuss convergence a little later. Here, alpha is a small positive number called the learning rate. Now, what is the intuition behind the gradient descent algorithm? The plot of the cost function against any one of the parameters, keeping the other parameters fixed, has one of the following forms (we have already seen this). Now, as per the gradient descent rule, the update is theta_j(t+1) = theta_j(t) - alpha * dJ/dtheta_j; this is what is called the update rule of gradient descent.

So, theta_j at the (t+1)-th iteration equals the value of theta_j at the t-th iteration minus alpha times the gradient of the cost function with respect to theta_j. Now, if the gradient is positive, and since we have already specified that alpha is a small positive number, the product alpha * dJ/dtheta_j is positive. What is the actual update applied to theta_j? The update is -alpha * dJ/dtheta_j, right? So, if this gradient is positive, then the update in the value of theta_j will be negative, indicating that after the next update theta_j will be smaller than its previous value, okay? We can pictorially represent it like this: let's say the value of theta_j is initially here; this is the gradient, and we see that the gradient is positive, okay?

So, in the next step, or the next iteration, the value of theta_j will be decreased, okay? It is denoted by these red dots, and again the gradient is positive, so it will keep on decreasing. On the other hand, if the gradient is negative, then the update in the value of theta_j is positive, indicating that after the next update theta_j will be greater than its previous value. We can depict this pictorially like this: here the gradient at the value of theta_j is negative, so in the next iteration the value of theta_j will be more than its previous value. As we can see, wherever we are on this particular graph, we will always be moving towards the minimum of the function, okay? And as we reach the minimum of the function, the gradient becomes zero and there will not be any further update in the value of theta_j, okay?

So, this is the equation of the gradient descent algorithm. Now, the cost function of linear regression with multiple parameters looks like the one we have already specified. We can calculate the derivative of the cost function with respect to theta_0 and with respect to theta_j, where j runs from 1, 2, 3, up to k. How do we do that? This is the gradient of the cost function J with respect to theta_0 hat, and this is the gradient of the cost function with respect to theta_j hat for j = 1, 2, 3, ..., k; I recommend pausing this video and obtaining these results by yourself, okay? Hence, the update rule in the gradient descent algorithm for each iteration will look like this, because, as you can see, this is basically the gradient; multiplying it by alpha gives the update.
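The slide with those derivatives is not reproduced in the transcript; for reference, a reconstruction of the standard partial derivatives of the squared-error cost defined above, in the same notation:

```latex
\frac{\partial J}{\partial \hat{\theta}_0} = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right),
\qquad
\frac{\partial J}{\partial \hat{\theta}_j} = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right) x_j^{(i)},
\quad j = 1, 2, \dots, k
```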

So, this is basically the update rule of gradient descent in the context of multiple linear regression. Note that if we consider another predictor x_0 whose value is always one, then we can use the last equation as the generalized gradient descent update rule. That means, if we add a column of all ones at the left of the data set, then x_0^(i) = 1 for all i, and we can use that last equation as a generalized rule for the gradient descent update in the case of linear regression, as in the sketch below.
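A minimal NumPy sketch of this generalized update, assuming batch gradient descent on the squared-error cost defined earlier (the function name, learning rate, and iteration count are illustrative):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Batch gradient descent for multiple linear regression."""
    m = len(y)
    Xb = np.c_[np.ones(m), X]                # prepend the x_0 = 1 column
    theta = np.zeros(Xb.shape[1])            # initial values of theta_0 ... theta_k
    for _ in range(n_iters):
        grad = Xb.T @ (Xb @ theta - y) / m   # dJ/dtheta_j for all j at once
        theta = theta - alpha * grad         # simultaneous update of every theta_j
    return theta
```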

Now, the stopping criteria for gradient descent: we can consider the following two methods as stopping criteria for the gradient descent algorithm. The first is the convergence of the model parameters: if in two successive iterations the change in the model parameters is not significant, then we can say that the algorithm has converged. Since the model parameters form a vector, to compare their values across two successive iterations we can take either a norm or absolute differences. Alternatively, if at least one of the parameters has not converged, we keep on running the algorithm until all the parameters have converged. The second method is a predefined number of iterations, where we run the algorithm for a predefined number of iterations, say 1000 or 2000, and we will be happy with whatever we get, okay? Now, how do we check whether gradient descent is working or not? We can plot the cost function versus the number of iterations; if it is continuously decreasing, then we can say that gradient descent is working. However, we may have to fine-tune the value of the learning rate alpha depending on how fast or slow the cost function is converging. Let us consider the example here: the cost function is decreasing smoothly, and gradient descent is working properly. However, if we obtain a cost-versus-iteration graph like this one, which is oscillatory, we may have to decrease alpha; similarly, if it is too sluggish, meaning the change in the cost function from iteration to iteration is very small, then in order to speed up gradient descent we may increase alpha. In both of those cases, gradient descent is not working properly, right?
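One way to combine the two stopping criteria with the cost-versus-iteration diagnostic in code, a sketch under the same assumptions as above (the tolerance value and function name are hypothetical):

```python
import numpy as np

def gradient_descent_with_stopping(X, y, alpha=0.01, tol=1e-6, max_iters=2000):
    """Stop when the norm of the parameter change falls below tol, or after max_iters iterations."""
    m = len(y)
    Xb = np.c_[np.ones(m), X]                # x_0 = 1 column, as before
    theta = np.zeros(Xb.shape[1])
    cost_history = []                        # record the cost at every iteration
    for t in range(max_iters):
        grad = Xb.T @ (Xb @ theta - y) / m
        theta_new = theta - alpha * grad
        cost_history.append(np.sum((Xb @ theta_new - y) ** 2) / (2 * m))
        if np.linalg.norm(theta_new - theta) < tol:   # parameters have converged
            theta = theta_new
            break
        theta = theta_new
    return theta, cost_history
```

Plotting cost_history against the iteration index gives exactly the diagnostic graph described above: a smooth decrease suggests a good alpha, oscillation suggests alpha is too large, and a very slow decline suggests it could be increased.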

Now, an implementation note: choosing the value of the learning rate alpha. As we can see, there is no general, apply-to-all value of the learning rate. However, we can choose values of alpha such as 0.001 or 0.003; this list is not exhaustive, and one may choose different values, like 0.005, as the learning rate. Another implementation note is feature scaling. Linear regression, and most other machine learning methods, perform well when the predictor variables are roughly in the same range. This is because if the range of one predictor variable is significantly larger than that of another, the first one starts to dominate over the second, making the gradient descent algorithm sluggish; hence, feature scaling is necessary for faster convergence of the gradient descent algorithm.

This is done by either of the following two methods. One is called standardization: for a particular predictor variable x_j, whose mean is x_j bar and whose standard deviation is sigma_xj, we scale the feature by replacing x_j with (x_j - x_j bar) / sigma_xj. This ensures that after scaling, x_j will have zero mean and unit standard deviation. Alternatively, you can do min-max scaling, where we replace x_j with (x_j - x_j bar) divided by (max of x_j minus min of x_j), that is, the range of the values of x_j. This ensures that after scaling, x_j will lie between -1 and +1. A brief sketch of both follows.
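A minimal NumPy sketch of the two scaling methods just described, applied to a single feature column (the function names are illustrative; the second follows the lecture's description of dividing the mean-centered feature by its range):

```python
import numpy as np

def standardize(x):
    """(x_j - mean) / std: zero mean and unit standard deviation after scaling."""
    return (x - x.mean()) / x.std()

def scale_by_range(x):
    """(x_j - mean) / (max - min): scaled values lie roughly between -1 and +1."""
    return (x - x.mean()) / (x.max() - x.min())
```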

So, in the next video, we'll see how to implement gradient descent in Python. See you in the next lecture. Thank you.
