Welcome to the clinical data management with SAS program. In this video I'll be discussing the concept of linear regression. So what is linear regression? In linear regression, we basically fit a predictive model, where we predict the value of our dependent variable using one or more independent variables. If we have a single independent variable, then we call it simple linear regression. If we have multiple independent variables, then we call it multiple linear regression.
So in linear regression, our basic objective is to predict the value of the dependent variable using the independent variables, and the line of best fit for a linear regression model is a straight line. So you can see a straight line where I'm predicting y, my dependent variable, based on the values of x. Now, let's discuss the standard form of the linear regression equation. The standard form of the linear regression equation is y_i = α + β₁x₁ + β₂x₂ + ... + βₙxₙ + εᵢ, where y_i is the dependent variable for the i-th observation, α is my intercept or constant, β₁, β₂, β₃, ..., βₙ are my regression coefficients or slopes, x₁, x₂, x₃, ..., xₙ are my independent variables, and εᵢ is the error term, that is, the part of the dependent variable that remains unexplained.
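As a minimal sketch of how such a model could be fit in SAS — the dataset work.patients and the variables y, x1, x2, x3 are illustrative assumptions, not names from the video:

   /* Fit y = alpha + beta1*x1 + beta2*x2 + beta3*x3 + e                         */
   /* Dataset and variable names are assumed for illustration only.              */
   proc reg data=work.patients;
      model y = x1 x2 x3;   /* estimates the intercept (alpha) and the slopes (betas) by least squares */
   run;
   quit;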
So, εᵢ is basically the unexplained variation. Now, what are the different features of a straight line? First, a straight line can be defined by two things: the slope or gradient of the line, and second, the point at which the line crosses the vertical axis of the graph, that is the y-axis of the graph, and that point is called the intercept of the line. For example, in y = 2 + 0.5x, the line crosses the y-axis at 2 and y increases by 0.5 for every one-unit increase in x. In my linear regression equation, α is the intercept, and the slopes or gradients of the line are basically the regression coefficients. Now, let's come to the concept of the method of least squares. What is the method of least squares? The method of least squares is used to find the unknown parameters of our linear regression model; it is used to find the line which best fits the data.
So, of all the possible lines that can be drawn, the line of best fit is the line which has the least amount of difference between the observed data points and the line. So here we can see the independent variable as x and y as the dependent variable, and in the method of least squares we basically minimize the value of Σ(y − ŷ)², where y is the observed value of our dependent variable and ŷ is the predicted value of our dependent variable. When we have points above the line, the differences are positive, which indicates that the model is underestimating their value, and when we have points below the line, the differences are negative, which means the model is overestimating their value.
So this is the concept of the method of least squares, where we find the line of best fit for our model, and we do that by minimizing the differences between our observed and predicted values. That's why we minimize Σ(y − ŷ)². Next, let's come to the concept of the goodness of fit of the model. So, what do we mean by goodness of fit? Goodness of fit is how well your model fits the data. How do you understand how well the model fits? In order to understand how well our classical linear regression model fits, we are going to use the concept of R-square.
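For the simple one-predictor case, this minimization has a well-known closed-form solution; written in LaTeX notation (a standard result, not stated explicitly in the video):

   % minimize the sum of squared differences between observed and predicted values
   \min_{\alpha,\beta}\ \sum_{i=1}^{n}\bigl(y_i - \alpha - \beta x_i\bigr)^2
   % closed-form least-squares estimates of the slope and the intercept
   \hat{\beta} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2},
   \qquad
   \hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}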
So what is R-square? R-square is the ratio of explained variation to total variation. Explained variation is the variation which is explained by all the independent variables, and total variation is the sum of the explained variation and the unexplained variation. So, in total variation, even the error terms are considered. So, basically, suppose in my model I have three independent variables and I keep on increasing my independent variables. The more I increase my independent variables, the more my explained variation will gradually increase. And if my explained variation increases, R-square, which is the ratio of explained variation to total variation, will also increase. In that case, we are going to consider that since the R-square value is increasing, my model is a good fit. But there is a point that we need to note: when we are increasing the number of independent variables, we are not considering which of the independent variables are actually creating a significant impact on our dependent variable.
So, even if the independent variables are redundant in nature, still my R-square value increases. That will not improve the efficiency of the model. In that case we are going to use the concept of adjusted R-square, which only considers those independent variables that create a significant impact on the dependent variable, and which therefore increases the model efficiency. So, adjusted R-square is a more accurate measure of the goodness of fit; it is the value of R-square adjusted for the degrees of freedom. Therefore, we always see that the adjusted R-square value is less than R-square, and the higher the adjusted R-square value, the better the goodness of fit of our model and the more efficient our model is. Next, the tests of significance of the estimated parameters. So, there are several tests that we need to do to check the significance of our parameters.
So, the first test is called the global test. The global test is the test where we check the overall significance of the parameters, where my H0 is that all the parameters are equal to zero, that means they are insignificant, and H1 is that at least one is nonzero, that is, significant. This test is conducted using the F statistic or F test. Next is the local test. The local test is done to check the individual significance of the parameters. In the case of the local test, my H0 is again that the individual parameter is insignificant, that is, the parameter value is equal to zero, and H1 is that the parameter value is nonzero, that means it is significant. The local test is conducted using the t statistic or t test. Now, let's discuss the different assumptions of the classical linear regression model.
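As a rough sketch, the global F test, the local t tests, R-square and adjusted R-square all appear in the default PROC REG output; the dataset and variable names below are assumptions used only for illustration:

   ods output ANOVA=work.anova_tbl            /* global F test of H0: all slopes are zero       */
              ParameterEstimates=work.parms   /* local t test for each individual coefficient   */
              FitStatistics=work.fit;         /* R-Square and Adj R-Sq (goodness of fit)        */
   proc reg data=work.patients;
      model y = x1 x2 x3;
   run;
   quit;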
The first assumption is that the relationship between the dependent and the independent variables should be linear, that is, there should be a linear relationship between the dependent and independent variables. Next, the error terms should be uncorrelated with the independent variables, that is, the error term should not be related to or associated with the independent variables. Next, the expected value of the error terms should be zero. Next, the variance of the error term should be constant, meaning the variance or spread of the error term should be constant; this phenomenon of constant error variance is called homoscedasticity. Next, the residuals are random or uncorrelated with respect to time, that is, the error term at one time period should not be correlated with the error terms at other time periods; we can say that e_t should not be correlated with e_{t-1}, e_{t-1} should not be correlated with e_{t-2}, and so on.
Next, the error term should be normally distributed. The next assumption is that the independent variables should be independent, that is, the independent variables should not get influenced by each other; there should be a minimal amount of multicollinearity. Next, the independent variables should be non-stochastic in nature, that is, they should not follow any probability distribution. So, these are the assumptions of my classical linear regression model; a rough sketch of how some of these are commonly checked in SAS follows below.
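A minimal sketch, assuming the same illustrative dataset and variable names as before: diagnostic plots for linearity and homoscedasticity, the Durbin-Watson statistic for autocorrelated errors, and a normality test on the residuals.

   proc reg data=work.patients plots=diagnostics;
      model y = x1 x2 x3 / dw;                 /* dw requests the Durbin-Watson test           */
      output out=work.resids p=yhat r=resid;   /* save predicted values and residuals          */
   run;
   quit;

   proc univariate data=work.resids normal;    /* normality tests on the residuals             */
      var resid;
   run;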
Now, let's come to the concept of multicollinearity. What do we mean by multicollinearity? Multicollinearity is when my independent variables are correlated with each other, or when the independent variables are getting influenced by each other; we also say that when the independent variables, or predictors, are interrelated among each other, that is the concept of multicollinearity. So, the predictors in a regression model are often called independent variables, but this term does not imply that the predictors are themselves statistically independent from one another. In fact, for natural systems the predictors can be highly intercorrelated. So, multicollinearity is the term used to describe the case when the correlation among predictor variables is high, that is, when the predictor variables are very much influenced by each other. It has been noted that the variance of the estimated regression coefficients depends on the intercorrelation of the predictors. So, the more the independent variables are influenced by each other, the more the variance of the estimated regression coefficients will increase, and this leads to a concept called VIF, or the variance inflation factor. So, multicollinearity has the following negative effects: the variances of the regression coefficients can be inflated so much that the individual coefficients are not statistically significant, even though the regression equation is strong and its predictive ability is good.
So, this is the concept of the variance inflation factor: the variances of the individual regression coefficients will be inflated so much that the individual impact of the independent variables cannot be seen, even though the overall regression equation will be quite strong and the predictive power of the model will be strong. Next, the relative magnitudes and even the signs of the coefficients may defy interpretation. Next, the values of the individual regression coefficients may change radically with the removal or addition of a predictor variable in the equation; in fact, the sign of a coefficient might even switch. So, these are the different disadvantages of multicollinearity.
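A quick sketch, again with assumed variable names, of checking the pairwise correlations among the predictors; very high correlations are one of the warning signs discussed next.

   proc corr data=work.patients;
      var x1 x2 x3;   /* high pairwise correlations among predictors hint at multicollinearity */
   run;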
Next, what are the signs of multicollinearity? High correlation between pairs of predictor variables; regression coefficients whose signs or magnitudes do not make good physical sense; statistically non-significant regression coefficients on important predictors; and extreme sensitivity of the sign or magnitude of the regression coefficients to the insertion or deletion of a predictor variable. Now, let's discuss the concept of VIF. So, the VIF, or variance inflation factor, occurs when there is multicollinearity among our independent variables, that is, when the independent variables are intercorrelated among each other; it results in increased variances of the estimated coefficients of the independent variables. So VIF is a statistic that can be used to identify multicollinearity in a matrix of predictor variables. Variance inflation refers here to the mentioned effect of multicollinearity on the variance of the estimated regression coefficients. Multicollinearity depends not just on the bivariate correlations between pairs of predictors, but on the multivariate predictability of any one predictor from the other predictors.
Accordingly, the VIF is based on the multiple coefficient of determination of a regression of each predictor in the multivariate linear regression model on all the other predictors. So VIFᵢ = 1 / (1 − Rᵢ²), where Rᵢ² is the multiple coefficient of determination for the i-th predictor and VIFᵢ is the variance inflation factor associated with the i-th predictor. If the i-th predictor is independent of the other predictors, the variance inflation factor is one, while if the i-th predictor can be almost perfectly predicted from the other predictors, the variance inflation factor approaches infinity.
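As a minimal sketch, assuming the same illustrative dataset and variable names as before, the VIF option on the MODEL statement in PROC REG prints the variance inflation factor for each predictor:

   proc reg data=work.patients;
      model y = x1 x2 x3 / vif tol;   /* vif prints 1/(1 - R_i^2); tol prints the tolerance 1 - R_i^2 */
   run;
   quit;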
So in this video, we'll stop here. In a coming video I'll be discussing the rest of the concepts that come under the classical linear regression model. Thank you. Goodbye. See you all in the next video.