Hello everyone, welcome to the course on machine learning with Python. In this video, we will see how to use different statistical techniques to examine the relationship among variables. While it is important to know how to describe the distribution of a single variable, most studies pose questions that involve exploring the relationship between two variables using the collected data. Here are a few examples of such research questions, with the two variables highlighted. How is the number of calories in a sandwich related to the type of sandwich, such as egg or chicken? Here the number of calories is one variable and the type of sandwich is another variable. Are the smoking habits of a person related to the person's gender? Here smoking habit is one variable and gender is another variable.
Similarly, what is the relationship between a person's age and farsightedness? Is there any relationship between gender and the score on a particular standardized test? Can we predict a person's favorite type of music based on his or her age? In most studies involving two variables, each of the variables has a role. We distinguish between the explanatory variable, also commonly referred to as the independent variable, which is the variable that claims to explain, predict, or affect the response, and the response variable. The explanatory variable is typically denoted by x. The response variable, also commonly referred to as the dependent variable, is the outcome of the study and is denoted by y. So, can we identify the x and y in the above examples?
In the first example, how is the number of calories in a sandwich related to the type of sandwich, the number of calories is our dependent variable and the type of sandwich is the explanatory or independent variable. Similarly, in the second example, gender is the independent or explanatory variable and smoking habit is the outcome or response variable. In the third example, age is the explanatory variable and farsightedness is the outcome or response variable. In the fourth example, gender is the explanatory variable and test score is the dependent variable. Similarly, in the fifth example, age is the explanatory variable and the favorite type of music, which we want to predict, is the response variable. I hope this is clear to you. Now, moving forward, if we further classify each of the two relevant variables according to its type, that is, categorical or quantitative, we get the following four possibilities for the role-type classification: a categorical explanatory variable with a quantitative response variable, a categorical explanatory variable with a categorical response variable, a quantitative explanatory variable with a quantitative response variable, and a quantitative explanatory variable with a categorical response variable.
So, this is how we can visualize the classification. How is the number of calories in a sandwich related to, or affected by, the type of sandwich? In this case, the number of calories is a quantitative variable and the type of sandwich is a categorical variable. Here x, the explanatory variable, is the type of sandwich, and the response variable is the number of calories. So this is a categorical explanatory to quantitative response classification, which is why you can see it marked as a C to Q relationship. Similarly, are the smoking habits of a person related to the person's gender? This is categorical to categorical, because both smoking habit and gender are categorical variables. Similarly, what is the relationship between a person's age and farsightedness?
This is quantitative to quantitative, because both age and farsightedness are quantitative in nature. Similarly, is there any relationship between gender and test scores? Gender is a categorical variable and test score is a quantitative variable, so this is a categorical to quantitative relationship. And then the last one, the favorite type of music: here the favorite type of music is categorical in nature, whereas age, which is our explanatory variable, is quantitative in nature. So it is a quantitative explanatory and categorical response relationship. Now, moving forward, let's examine categorical to categorical relationships.
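Before moving on, the role-type classification of the five examples can be recapped in a short Python sketch. The helper name `role_type` and the tuple layout are hypothetical, chosen only for illustration; the variable types are the ones stated in the lecture.

```python
# Hypothetical helper: label a pair of variables by role-type
# (C = categorical, Q = quantitative), explanatory first, response second.
def role_type(explanatory_type, response_type):
    codes = {"categorical": "C", "quantitative": "Q"}
    return f"{codes[explanatory_type]}->{codes[response_type]}"

# (explanatory variable, its type, response variable, its type),
# following the classification given in the lecture.
examples = [
    ("type of sandwich", "categorical", "number of calories", "quantitative"),
    ("gender", "categorical", "smoking habit", "categorical"),
    ("age", "quantitative", "farsightedness", "quantitative"),
    ("gender", "categorical", "test score", "quantitative"),
    ("age", "quantitative", "favorite type of music", "categorical"),
]

labels = [role_type(x_type, y_type) for _, x_type, _, y_type in examples]
for (x, _, y, _), label in zip(examples, labels):
    print(f"{x} -> {y}: {label}")
```

The label matters because it decides which display comes next: a two-way table for C to C, side-by-side box plots for C to Q, and a scatterplot for Q to Q.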
A survey was done among 1,200 college students, and they were asked how they feel about their body image, that is, whether they feel they are overweight, underweight, or about right. The table of the data looks like this: there are 1,200 students, the gender is noted as male or female, and the body image is noted as overweight, underweight, or about right. Now suppose we ask questions like: are men and women just as likely to think their weight is about right? Is there any difference between the genders in feelings about body image? To answer this type of question, we need to examine the relationship between the two categorical variables, gender and body image. Here gender is the explanatory variable and body image is the response variable. Both of them are categorical in nature.
That is why it is a C to C relationship. In order to summarize the relationship between two categorical variables, we create a display called a two-way table. So what does a two-way table look like in our example? This is the two-way table for our example. We can see that 560 females think they are about right, and 37 feel that they are underweight. There are a total of 760 females and a total of 440 males. Similarly, there are 855 students who think they are about right and 110 students who feel they are underweight. The totals along this row give the summary of the categorical variable body image, and the totals along this column give the summary of the categorical variable gender.
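As a rough sketch of how such a two-way table could be built from raw survey records, here is a version using only the Python standard library. Only some cells and totals are read out in the lecture (560, 37, 760, 855, 110); the remaining cells below are filled in by subtraction and assume a total of 440 males, so treat them as an assumption rather than the lecturer's data.

```python
from collections import Counter

# Cell counts. 560, 37 and the totals 760, 855, 110 come from the lecture;
# the other cells are derived by subtraction, assuming 440 males (assumption).
cells = {
    ("female", "about right"): 560,
    ("female", "overweight"): 163,
    ("female", "underweight"): 37,
    ("male", "about right"): 295,
    ("male", "overweight"): 72,
    ("male", "underweight"): 73,
}

# Expand into one (gender, body_image) record per student,
# which is how the raw survey data would look.
records = [pair for pair, n in cells.items() for _ in range(n)]

table = Counter(records)                     # cell counts
row_totals = Counter(g for g, _ in records)  # summary of gender
col_totals = Counter(b for _, b in records)  # summary of body image

genders = ["female", "male"]
images = ["about right", "overweight", "underweight"]
print(f"{'':8}" + "".join(f"{b:>13}" for b in images) + f"{'total':>8}")
for g in genders:
    row = "".join(f"{table[(g, b)]:>13}" for b in images)
    print(f"{g:8}" + row + f"{row_totals[g]:>8}")
print(f"{'total':8}" + "".join(f"{col_totals[b]:>13}" for b in images) + f"{len(records):>8}")
```

Counting the records twice, once by gender and once by body image, gives exactly the marginal totals described in the lecture.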
One way to visualize the distribution of categorical variables and the relationship among them is to plot a double bar chart. We can plot a bar chart for each value of a certain categorical variable, let's say about right, where the blue bar indicates female and the orange bar indicates male. This is a nice way to visualize the relationship among categorical variables. As we are examining the relationship between gender and body image, we can create a conditional percentage table and plot the corresponding double bar chart. So this is my original table, and from it we can form a conditional percentage table. How do we find the conditional percentages? Each value in a row is divided by that row's total. So 560 divided by 760 gives 73.7%. Similarly, 163 divided by 760 is 21.4%. Obviously, to convert it to a percentage you have to multiply the value 163/760 by 100.
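The row-wise division can be sketched in a few lines of Python. The function name `conditional_percentages` is hypothetical. The female counts are the ones used in the lecture; the male counts are derived from the stated totals and are an assumption.

```python
# Two-way table of counts: rows = gender (explanatory), columns = body image (response).
# Female counts are from the lecture; male counts are derived from the totals (assumption).
counts = {
    "female": {"about right": 560, "overweight": 163, "underweight": 37},
    "male": {"about right": 295, "overweight": 72, "underweight": 73},
}

def conditional_percentages(table):
    """Divide every cell by its row total and express it as a percentage."""
    result = {}
    for row, cells in table.items():
        row_total = sum(cells.values())
        result[row] = {col: 100 * n / row_total for col, n in cells.items()}
    return result

percentages = conditional_percentages(counts)
for row, cells in percentages.items():
    print(row, {col: round(p, 1) for col, p in cells.items()})
```

We divide by the row total, not the column total, because gender is the explanatory variable: we want the distribution of body image within each gender.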
Note that the values along each row add up to 100%. Then we can plot something called a conditional percentage bar chart. From this conditional percentage bar chart, we can see that an almost equal percentage of the male students and the female students think that they are about right.
Now, the categorical to quantitative relationship. The Survey of Study Habits and Attitudes (SSHA) is a psychological test designed to measure the motivation, study habits, and attitudes toward learning of college students. Is there a relationship between gender and SSHA scores? In other words, is there a gender effect on test performance? Data were collected from 40 randomly selected college students, and here is what the raw data look like.
So, for the 40 students, this is the gender column and this is the SSHA column, that is, the study habits and attitudes score. Here the explanatory variable is gender and the SSHA score is the response variable, so this is a categorical to quantitative relationship. The following is the five-number summary of the SSHA scores separated by gender: we separate out all the female scores and all the male scores, and then compute the five-number summary for each group. It shows that 153 is the median of the female SSHA scores and 114.5 is the median score of the male students. We can then draw the box plot individually for females and males. This is what we call a side-by-side box plot, and it gives us a clear idea of how the scores are distributed across the two categories, female and male.
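Here is a minimal sketch of computing a five-number summary per group. The lecture's raw scores are not given in the transcript, so the six scores per group below are made up, chosen only so that the medians match the 153 and 114.5 quoted above. The quartile rule used here (medians of the lower and upper halves) is one common textbook convention and is an assumption; other software may interpolate differently.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max).

    Q1 and Q3 are taken as the medians of the lower and upper halves
    of the sorted data (the middle value is excluded when n is odd).
    """
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two values")
    half = len(data) // 2
    lower, upper = data[:half], data[-half:]
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Made-up SSHA scores (NOT the lecture's data), grouped by gender.
scores = {
    "female": [120, 135, 150, 156, 165, 178],
    "male": [92, 104, 113, 116, 140, 151],
}

summaries = {gender: five_number_summary(vals) for gender, vals in scores.items()}
for gender, summary in summaries.items():
    print(gender, summary)
```

Each tuple holds exactly the five values a box plot draws, so placing the two summaries next to each other is the numerical counterpart of the side-by-side box plot.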
We can clearly see that the median SSHA score obtained by the females is much higher than the median SSHA score obtained by the males.
Now, the quantitative to quantitative relationship. We look at height and weight data that were collected from 57 males and 24 females, and use the data to explain how the weight of a person is related to his or her height. This implies that height will be our explanatory variable and weight will be our response variable. Both height and weight are quantitative in nature, which is why it is a quantitative to quantitative, or Q to Q, relationship. So this is how our data set looks.
We create a scatterplot from the given data, which looks like the one shown beside it. Drawing the scatterplot is simple: you just have to plot each point specified by its coordinates, and here the coordinates are nothing but (height, weight). Now, interpreting the scatterplot: how do we explore the relationship between two quantitative variables using the scatterplot? Recall that when we describe the distribution of a single quantitative variable with a histogram, we describe the overall pattern of the distribution and any deviations from that pattern, which are called outliers. We do the same thing with the scatterplot.
We try to describe the overall pattern, and then also the deviations from the pattern. For the overall pattern, there are really three things we want to focus on: one is the direction, another is the form, and another is the strength. Now, the direction of the relationship can be positive, negative, or neither. This plot describes a positive relationship, and this one describes a negative relationship.
A positive relationship means that if the explanatory variable goes up, then the response variable also goes up. A negative relationship means that if the explanatory variable goes up, then the response variable goes down, and vice versa. This one is neither positive nor negative: in some places it is negative and in some places it is positive. And this one shows no relationship at all. The form of the relationship is its general shape. When we identify the form, we try to find the simplest way to describe the shape of the scatterplot. There are many possible forms: this one is a linear form, and this one is curvilinear. The strength of the relationship is determined by how closely the data follow the form of the relationship.
Let's look, for example, at the following two scatterplots displaying positive linear relationships. This one is a strong relationship, and this one is a weaker relationship. Now consider the following data set. There is one outlier in the data set. As you can see, this point is an outlier, because it differs significantly from the other data points. So can you guess the direction of the relationship? The direction is positive, the form of the relationship is linear, and the strength is weak.
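In this video, direction and strength are judged by eye. One common numerical companion for a linear form, not covered in this video and brought in here only as an illustration, is the Pearson correlation coefficient r: its sign suggests the direction, and its closeness to 1 or -1 suggests the strength. The height and weight values below are made up, and the 0.1 tolerance for calling a relationship "neither" is an arbitrary assumption.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def direction(r, tolerance=0.1):
    """Rough label for the direction; the tolerance is an arbitrary choice."""
    if r > tolerance:
        return "positive"
    if r < -tolerance:
        return "negative"
    return "neither"

# Made-up (height in cm, weight in kg) pairs, NOT the lecture's data.
heights = [150, 160, 165, 172, 180, 188]
weights = [52, 58, 63, 70, 77, 85]

r = pearson_r(heights, weights)
print(f"r = {r:.3f}, direction: {direction(r)}")
```

Keep in mind that r only makes sense when the form is roughly linear, and a single outlier can change it noticeably, so it complements the scatterplot rather than replacing it.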
So this is how we can interpret the scatterplot. In the next video, we shall continue with the relationship between quantitative variables. See you in the next lecture. Thank you.