Hello everyone, welcome to the course on machine learning with Python. In the last video we saw the different kinds of variables, namely categorical variables and quantitative variables. In this video, we will examine the distribution of a categorical variable and of a quantitative variable. So let's start. We begin by examining one variable at a time. As we saw when we looked at data and variables, the data for each variable are a long list of values, whether numerical or not, and are not very informative in that form.
So in order to convert these raw data into useful information, we need to summarize them and examine the distribution of the variable. By the distribution of a variable we mean what values the variable takes and how often it takes those values. We will first learn how to summarize and examine the distribution of a single categorical variable and then do the same for a quantitative variable.
So, examining the distribution of a categorical variable. Suppose we have taken a survey among office goers on their daily commutes. A few office-going people were randomly chosen and asked how they travel from home to office on a regular basis. The data set we obtained looks something like the one below. Here we have the person, and here we have the type of daily commute. From the given data set we find that there are three modes of transport that people use: the first is public transport, the next is own two wheeler vehicle, and the third is own four wheeler vehicle. There are 1200 rows, or individuals, in total in this data set.
Suppose we ask: what percentage of the sampled office goers fall into each category? This question is easily answered once we summarize and look at the distribution of the variable type of daily commute. To summarize the distribution of a categorical variable, we first create a table of the different values, or categories, that the variable takes, how many times each value occurs (the count), and more importantly, how often each value occurs (the percentage). This table is called the frequency distribution table. Here is the frequency distribution table for our example. As you can see, these are just made-up numbers. The number of people who use public transport is 527, two wheeler is 425, and four wheeler is 248, so the total is 1200. If we divide 527 by 1200 and multiply by 100, we get 43.92.
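As a minimal sketch, the frequency distribution table can be built in Python as follows. It uses the made-up counts from the example, with 248 for four wheeler so that the counts add up to the stated total of 1200. The category labels and the helper name frequency_table are assumptions made for illustration and are not from the lecture.

```python
from collections import Counter

# Rebuild a raw list of responses from the example's counts.
# The labels are illustrative; 248 is used for four wheeler so that
# the three counts sum to the stated total of 1200.
responses = (
    ["public transport"] * 527
    + ["own two wheeler"] * 425
    + ["own four wheeler"] * 248
)


def frequency_table(values):
    """Return {category: (count, percentage)} for a categorical variable."""
    counts = Counter(values)
    total = sum(counts.values())
    return {
        category: (count, round(100 * count / total, 2))
        for category, count in counts.items()
    }


table = frequency_table(responses)
for category, (count, percent) in table.items():
    print(f"{category:18s} {count:5d} {percent:6.2f}%")
```

Each row of the printed table holds a category, its count, and its percentage of the total, which is the information a bar chart or pie chart would then display.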
That is the percentage of people who use public transport. Similarly, it is 35.42 for two wheeler and 20.67 for four wheeler. Now we can express the result pictorially with the help of a bar chart or a pie chart. So this is my frequency distribution table, and from it I can draw a bar chart with the counts: public transport 527, two wheeler 425, and four wheeler 248. Similarly, we can plot the bar chart with percentages: public transport is 43.92%, two wheeler 35.42%, and four wheeler 20.67%. We can also describe the data using a pie chart. Here the area of each segment of the circle is proportional to the percentage. So here you can see that public transport is around 44%, four wheeler is around 21%, and two wheeler is around 35%.
Now let's move on to examining the distribution of a quantitative variable.
To describe the distribution of a quantitative variable, we need a plot called a histogram. We break the range of values that the quantitative variable takes into intervals and count how many observations fall into each interval. As an example, here are the exam grades of 15 students. As you can see, they range from as low as 48 to as high as 97. We first need to break the range of values into intervals, also called bins or classes. A very customary way to break up the marks obtained by students is in ranges of ten: 40 to 50, then 50 to 60, and so on. That gives us the table shown beside. Between 40 and 50 there is one observation, or a count of one; between 50 and 60 there are two observations; between 60 and 70 there are four; between 70 and 80 there are five; between 80 and 90 there are two; and between 90 and 100 there is one. Note that these intervals are closed on one side and open on the other. That means the interval 40 to 50 includes 40 but excludes 50, and similarly the interval 50 to 60 includes 50 but excludes 60.
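The binning step can be sketched in Python as below. The individual grades are not listed in the lecture, so the grades list here is hypothetical and was chosen only to reproduce the counts in the table (lowest 48, highest 97). The function name bin_counts is also an assumption.

```python
# Hypothetical grades for 15 students, chosen only to match the
# interval counts given in the lecture (1, 2, 4, 5, 2, 1).
grades = [48, 52, 57, 61, 64, 66, 69, 70, 73, 75, 76, 79, 84, 88, 97]


def bin_counts(values, start, stop, width):
    """Count values in half-open intervals [lo, lo + width)."""
    counts = {(lo, lo + width): 0 for lo in range(start, stop, width)}
    for v in values:
        if not start <= v < stop:
            raise ValueError(f"{v} is outside [{start}, {stop})")
        # Integer division finds the interval whose lower edge is <= v.
        lo = start + ((v - start) // width) * width
        counts[(lo, lo + width)] += 1
    return counts


counts = bin_counts(grades, start=40, stop=100, width=10)
for (lo, hi), n in counts.items():
    # A rough text histogram: one '#' per observation in the interval.
    print(f"[{lo}, {hi}): {'#' * n} ({n})")
```

Because each interval includes its lower edge and excludes its upper edge, a grade of exactly 70 is counted in the 70 to 80 interval, not in 60 to 70.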
So that is how we form the table of intervals and counts. From this table we can draw the histogram. To construct the histogram, we plot the intervals along the x axis and show the number of observations in each interval, that is, the frequency of the interval, along the y axis, represented by the height of the rectangle located above the interval. So this is the histogram. As we can see, between 40 and 50 there is only one observation, between 50 and 60 there are two observations, between 60 and 70 there are four observations, and so on.
Now, how do we measure the central tendency of a quantitative variable? There are three main numerical measures of the center of a distribution: the mode, the mean, and the median. What is the mode? The mode is the most commonly occurring value in the distribution of a quantitative variable.
What is the mean? The mean is the average of a set of observations, that is, the sum of the observations divided by the number of observations. If the n observations of a variable x are x1, x2, x3, up to xn, then the mean, or the average, which we call x bar, is calculated as x bar equals the sum of x1 to xn divided by n. This is just the good old formula for calculating the average value. Then we have the median. The median M is the midpoint of the distribution: it is the number such that half of the observations fall above it and half fall below it. So how do we calculate the median? First we order the data from smallest to largest. Then we consider whether n, the number of observations, is even or odd. If n is odd, the median M is the center observation of the ordered list, that is, the observation sitting in the (n + 1)/2 spot in the ordered list.
If n is even, then the median M is the mean of the two center observations, that is, the mean of the observations sitting in positions n/2 and n/2 + 1 of the ordered list. Let's go through an example to understand how to calculate the median. Let's say a quantitative variable takes the following values. First we order the values: we start from the smallest number and go up to the largest. Note that if there are any repetitions, we keep them. As you can see, there are two threes here, so there are two threes in the ordered list as well. This is not a set, it is a list, so every number that appears here should appear there as well, along with its repetitions.
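Before counting through the example, here is a minimal Python sketch of the three measures of center. The sample values are hypothetical and are not the lecture's data set. The function names are illustrative. Breaking a tie in the mode by taking the first value encountered is an assumption, since the lecture does not say how ties are handled.

```python
from collections import Counter


def mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


def median(values):
    """Midpoint of the ordered list; repeated values are kept."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the observation in position (n + 1) / 2, counting from 1.
        return ordered[(n + 1) // 2 - 1]
    # Even n: mean of the observations in positions n/2 and n/2 + 1.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


def mode(values):
    """Most commonly occurring value (first one seen if there is a tie)."""
    return Counter(values).most_common(1)[0][0]


# Hypothetical values for illustration only.
odd_sample = [7, 3, 1, 3, 9]        # ordered: 1, 3, 3, 7, 9
even_sample = [7, 3, 1, 3, 9, 10]   # ordered: 1, 3, 3, 7, 9, 10

print(mean(odd_sample), median(odd_sample), mode(odd_sample))
print(median(even_sample))
```

Note that sorted keeps both threes, which matches the point that the ordered data is a list and not a set.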
Now, how many observations are there? As you can see, there are 21 observations, and 21 is an odd number. Therefore the median is the (n + 1)/2-th value, which in this case is the 11th value of the ordered list. Which is the 11th value? We count 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11. So this value is our median.
Along with the measures of central tendency there is another thing called the measure of dispersion. Dispersion measures how much the observations deviate from a central value; the central value could be the mean, the median, and so on. These measures provide different ways to quantify the variability of the distribution. We will discuss the three most commonly used measures of spread: the first is the range, the second is the variance and the standard deviation, and the third is the interquartile range.
What is the range? The range is the distance between the smallest observation and the largest observation, so the range is max minus min. It is really easy to compute. If we consider the previous example, the maximum is 23 and the minimum is 1, so the range of the observations is maximum minus minimum, which is 22. Now, what are the variance and the standard deviation? If there are n observations, say x1, x2, up to xn, and their average value is x bar, then the variance is calculated as one upon n multiplied by the sum, for i from 1 to n, of (xi minus x bar) squared. That means we calculate the deviation of each observation from the mean, square it, sum all those squared deviations, and divide by the total number of observations.
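A minimal Python sketch of the range, variance, and standard deviation follows. The data values are hypothetical and are not the data set shown in the lecture. The variance here divides by n, as in the formula just described, not by n - 1 as the sample variance does. The function names are illustrative.

```python
import math


def value_range(values):
    """Largest observation minus smallest observation."""
    return max(values) - min(values)


def variance(values):
    """Mean of the squared deviations from the mean (divides by n)."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / n


def std_dev(values):
    """Square root of the variance."""
    return math.sqrt(variance(values))


# Hypothetical values for illustration only; their mean is 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(value_range(data))  # 9 - 2
print(variance(data))     # squared deviations sum to 32, divided by 8
print(std_dev(data))
```

Python's standard library offers statistics.pvariance and statistics.pstdev, which also divide by n, so they can be used to cross-check a hand-written version like this one.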
The result is exactly the variance of our quantitative variable. The standard deviation is nothing but the square root of the variance. Here I have shown how to calculate the standard deviation. Let's say this is our data set. We calculate the mean to be 8.8. Then we calculate the values minus mean column and the values minus mean squared column. Then we sum up this fourth column and divide by the total number of observations, which in this case is 10, and we obtain 9.56 as the variance. The standard deviation is the square root of 9.56, which is 3.09.
Next, the interquartile range. The interquartile range measures the variability of a distribution by giving us the range covered by the middle 50% of the data.
The following figure illustrates the idea. Let's say this is the total range of the observations: this is the min and this is the max. The median divides the data into a lower 50% and an upper 50%. So from min to median is the bottom 50% of the data, and from median to max is the top 50%. Quartile one divides the data between the minimum observation and the median into two parts: below quartile one is the lowest 25% of the data, and between quartile one and the median there is another 25%. Similarly, quartile three divides the data between the median and the maximum. So this is quartile one and this is quartile three, and the middle 50% of the data is the interquartile range, which is nothing but the difference between quartile three and quartile one. And as I have already described, the difference between the maximum and the minimum is called the range.
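The quartiles and the interquartile range can be sketched in Python as below. Several conventions exist for computing quartiles. This sketch assumes one of them: quartile one is the median of the values below the overall median, and quartile three is the median of the values above it. The data values are hypothetical (only their minimum of 1 and maximum of 23 echo the lecture's example), and the function names are illustrative.

```python
def median_of_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartile_summary(values):
    """Min, quartile one, median, quartile three, and max of the data.

    Assumed convention: the overall median is left out of both halves
    when the number of observations is odd. Needs at least two values.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # values below the median
    upper = ordered[half + n % 2:]   # values above the median
    return {
        "min": ordered[0],
        "q1": median_of_sorted(lower),
        "median": median_of_sorted(ordered),
        "q3": median_of_sorted(upper),
        "max": ordered[-1],
    }


# Hypothetical values for illustration only.
data = [1, 3, 3, 5, 7, 8, 10, 12, 15, 18, 23]

summary = quartile_summary(data)
iqr = summary["q3"] - summary["q1"]
total_range = summary["max"] - summary["min"]
print(summary)
print("IQR:", iqr, "range:", total_range)
```

Library routines may use a different quartile convention and can give slightly different values for quartile one and quartile three on the same data.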
I hope you have understood what the interquartile range is. This is an example of calculating the interquartile range for the data set that we used previously. As we can see, the minimum value is 1 and the maximum value is 23. Similarly, we have computed the median, quartile one, and quartile three. The combination of these five numbers, that is, the minimum, quartile one, the median, quartile three, and the maximum, is called the five number summary, and it provides a quick numerical description of both the center and the spread of the distribution. So that is all for this video. In the next video we will examine the relationship between categorical and quantitative variables. See you in the next lecture.
Thank you