Exploratory data analysis using Python

Machine Learning Using Python Statistics and Exploratory Data Analysis
8 minutes

Transcript

Hello everyone, welcome to the course Machine Learning with Python. In this video, we will see exploratory data analysis in Python. First, we import the necessary libraries: here we have imported pandas, NumPy, and matplotlib.pyplot. Let's go ahead and run this. Now we'll read the data file.

The data is stored in the folder called data, and the file name is height_weight_gender_kung_san. The Kung San are a tribe in the Kalahari Desert in Africa, and this is the height, weight, and gender data of a few of its people. Let's go ahead and get the data. We'll use the read_csv function built into pandas.

So what is the type of this data? It is nothing but a pandas DataFrame. Now we'll see the first 10 rows of this data frame. As you can see, there are the serial number, height, weight, age, and male columns. The male column is a categorical column: male equal to 1 means the person is male, and 0 means the person is female. Now let's go ahead and check the shape of the data set. It has 544 rows and five columns.
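The loading and inspection steps described so far can be sketched roughly as follows. This is only an illustration: the rows are invented stand-ins for the real file, the column names are assumed from the narration, and an in-memory string replaces the actual path so the snippet runs on its own.

```python
import io

import pandas as pd

# Invented rows standing in for the real CSV file; column names are assumed.
csv_text = """height,weight,age,male
150.0,48.0,60,1
140.0,36.0,58,0
135.0,32.0,64,0
157.0,53.0,40,1
145.0,41.0,50,0
"""

# pd.read_csv accepts a file path or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

print(type(data))    # a pandas DataFrame
print(data.head(3))  # first three rows (the lecture shows the first ten)
print(data.shape)    # (rows, columns), here (5, 4)
```

With the real file, the `io.StringIO(...)` argument would be replaced by the path to the CSV inside the data folder.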

So there are 544 data points in total. Now we take the heights and weights into two separate NumPy arrays. Let's go ahead and run this cell. Now heights contains all the height values and weights contains all the weight values. Next we plot the scatter plot of weight versus height. For that I have used the plt.scatter function; the x label is height, the y label is weight, and the title of the plot is weight versus height.
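A minimal sketch of this step, assuming a data frame with `height` and `weight` columns (the values below are invented, and the exact label strings are a guess at what the notebook uses):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Invented values; the real notebook reads them from the CSV file.
data = pd.DataFrame({
    "height": [150.0, 140.0, 135.0, 157.0, 145.0, 160.0],
    "weight": [48.0, 36.0, 32.0, 53.0, 41.0, 55.0],
})

# Pull each column out as a plain NumPy array.
heights = data["height"].to_numpy()
weights = data["weight"].to_numpy()

# Scatter plot with height on the x axis and weight on the y axis.
plt.figure()
plt.scatter(heights, weights)
plt.xlabel("Height")
plt.ylabel("Weight")
plt.title("Weight vs Height")
```

In a notebook the figure appears when the cell runs; in a plain script, `plt.show()` would be needed at the end.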

Let's go ahead and run this cell. We can see this is the scatter plot of weight versus height of the Kung San people. Now the box plots of heights and weights can be obtained as follows. We create two subplots: in one subplot we plot the box plot of heights, and in the other we plot the box plot of weights. So you can see, this is the box plot of heights, and this is the box plot of weights.
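One way to lay out the two box plots side by side is sketched below. The arrays are randomly generated placeholders rather than the lecture's data, and the figure size and titles are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data drawn at random; not the real heights and weights.
rng = np.random.default_rng(0)
heights = rng.normal(140, 25, size=100)
weights = rng.normal(35, 15, size=100)

# One figure with two side-by-side subplots.
fig, (ax_h, ax_w) = plt.subplots(1, 2, figsize=(8, 4))
ax_h.boxplot(heights)
ax_h.set_title("Heights")
ax_w.boxplot(weights)
ax_w.set_title("Weights")
```

The same two-panel layout works for histograms by calling `ax_h.hist(heights)` and `ax_w.hist(weights)` instead of `boxplot`.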

Now we'll plot the histograms of heights and weights. Here also, in one subplot we plot the histogram of heights, and in the other subplot we plot the histogram of weights. So you can see, this is the histogram of heights and this is the histogram of weights. Now, what are the mean, median, and standard deviation of the heights? To obtain the mean height we use the np.mean function, to obtain the median height we use the np.median function, and to obtain the standard deviation of height we use the np.std function. Okay, let's go ahead and run the cell. The mean height is 138.26.

The median height is 148.59 and the standard deviation of height is 27.577. Now we'll find quartile one, quartile three, and the IQR of the heights. Quartile one and quartile three of the height are obtained with the np.percentile function. To this function I pass the heights and a list containing 25 and 75, which means I want the 25th and 75th percentiles, or in other words the first and third quartiles. The 25th percentile is stored in the q1_height variable and the 75th percentile in the q3_height variable. The interquartile range of height is nothing but the difference between quartile three and quartile one. Let's go ahead and run the cell and print the values of quartile one, quartile three, and the interquartile range of height.
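The summary statistics and quartiles described above can be sketched on a tiny invented array, chosen so the results are easy to check by hand. The variable names follow the narration where it gives them; the rest are assumptions.

```python
import numpy as np

# Five invented heights; small enough to verify the results by hand.
heights = np.array([130.0, 140.0, 150.0, 160.0, 170.0])

mean_height = np.mean(heights)      # 150.0
median_height = np.median(heights)  # 150.0
std_height = np.std(heights)        # population std (ddof=0), sqrt(200)

# Passing a list of percentiles returns one value per entry.
q1_height, q3_height = np.percentile(heights, [25, 75])  # 140.0 and 160.0
iqr_height = q3_height - q1_height                       # 20.0

print(mean_height, median_height, std_height)
print(q1_height, q3_height, iqr_height)
```

Note that `np.std` divides by n by default, while the pandas `Series.std` method divides by n - 1, so the two can give slightly different values for the same column.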

We can see that quartile one is 125.095, quartile three is 157.48, and the interquartile range of height is 32.3845, or about 32.385. Similarly, you can find the mean, median, standard deviation, quartile one, quartile three, and interquartile range of the weight data by yourself using the same functions, with weights in place of heights. Now we find the correlation coefficient between heights and weights. For that we use the np.corrcoef (correlation coefficient) function, and to it I pass heights, weights. Let's go ahead and run this cell. Now, you may be wondering why I am indexing the result with [1, 0].

If we don't use this, it will print a matrix. The diagonal elements are the correlation between height and height and the correlation between weight and weight; these will always be one, and this is called the autocorrelation. The cross-correlation that we are actually interested in is in the off-diagonal positions, so we use this value or this value. For that we index with either [1, 0] or [0, 1]. So this is the correlation coefficient between heights and weights. Now, the box plot of height and weight grouped by male or female. For that, we'll use the DataFrame's built-in boxplot method.
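The matrix returned by `np.corrcoef` and the indexing discussed above can be illustrated with a short sketch. The two arrays are invented for the example.

```python
import numpy as np

# Invented paired measurements.
heights = np.array([130.0, 140.0, 150.0, 160.0, 170.0])
weights = np.array([30.0, 36.0, 41.0, 49.0, 54.0])

# For two 1-D inputs, np.corrcoef returns a 2x2 symmetric matrix.
corr_matrix = np.corrcoef(heights, weights)
print(corr_matrix)

# The diagonal entries are 1; the off-diagonal entries hold the
# height-weight correlation, so [1, 0] and [0, 1] give the same value.
r = corr_matrix[1, 0]
print(r)
```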

This is the data frame that holds the Kung San data, and it has a method called boxplot. I plot height and weight, grouped by male: by equal to male means I want to group by the male column. If I run this cell, it creates two plots. This is the box plot grouped by male, where this one is height and this one is weight. Note that in each plot there are two box plots drawn side by side: 0 is for female and 1 is for male.
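A grouped box plot of this kind might be produced as in the sketch below. The data frame name `data`, its column names, and its values are assumptions made for illustration; `column` and `by` are parameters of the pandas `DataFrame.boxplot` method.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented rows; male is coded 1 for male and 0 for female.
data = pd.DataFrame({
    "height": [150.0, 140.0, 135.0, 157.0, 145.0, 160.0],
    "weight": [48.0, 36.0, 32.0, 53.0, 41.0, 55.0],
    "male": [1, 0, 0, 1, 0, 1],
})

# One panel per listed column; within each panel, one box per value of "male".
axes = data.boxplot(column=["height", "weight"], by="male")
```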

Similarly for weight: this is the female weight box plot and this is the male weight box plot. Now we'll take the heights of males and females separately. For that we use the DataFrame method called query. So heights_male equals the Kung San data frame's query where male == 1, and we take only the height; similarly, the heights of the females are where male == 0. Now we'll print the number of males and the number of females, just to be sure that this works. The number of males is 257 and the number of females in the data set is 287. Similarly we can take the weights of the males and females, but note here that instead of querying, I have also used conditional selection like this. Let's go ahead and run this cell. Now weights_male contains the weight values of all the males in the data set.
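Both selection styles mentioned above, `query` and conditional (boolean mask) selection, are sketched here on invented rows. The data frame name `data` and the column names are assumptions; the four result names follow the narration.

```python
import pandas as pd

# Invented rows; male is coded 1 for male and 0 for female.
data = pd.DataFrame({
    "height": [150.0, 140.0, 135.0, 157.0, 145.0, 160.0],
    "weight": [48.0, 36.0, 32.0, 53.0, 41.0, 55.0],
    "male": [1, 0, 0, 1, 0, 1],
})

# Style 1: query with a string expression, then pick the height column.
heights_male = data.query("male == 1")["height"].to_numpy()
heights_female = data.query("male == 0")["height"].to_numpy()

# Style 2: conditional selection with a boolean mask.
weights_male = data.loc[data["male"] == 1, "weight"].to_numpy()
weights_female = data.loc[data["male"] == 0, "weight"].to_numpy()

# Sanity check: the group sizes should add up to the number of rows.
print(len(heights_male), len(heights_female))  # 3 3
```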

Similarly, weights_female contains the weights of all the female data points in the data. Now we plot the scatter plot of weight versus height grouped by male and female. How will you do that? We plot the scatter plot of the heights and weights of the males using blue, and for the female population's scatter plot we use red. Both scatter plots are drawn in the same plot. Let's go ahead and run this and see what happens. As you can see, there are some blue dots and some red dots: the blue dots indicate male and the red dots indicate female. As I have called plt.legend, it draws a legend like this that shows which one is which.
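A rough sketch of the grouped scatter plot follows. The four arrays are invented, and the label strings passed to `plt.scatter` are assumptions; they are what `plt.legend` displays.

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented values for each group.
heights_male = np.array([150.0, 157.0, 160.0])
weights_male = np.array([48.0, 53.0, 55.0])
heights_female = np.array([140.0, 135.0, 145.0])
weights_female = np.array([36.0, 32.0, 41.0])

# Two scatter calls on the same axes, one color per group.
plt.figure()
plt.scatter(heights_male, weights_male, color="blue", label="Male")
plt.scatter(heights_female, weights_female, color="red", label="Female")
plt.xlabel("Height")
plt.ylabel("Weight")
plt.title("Weight vs Height")

# The legend picks up the label given to each scatter call.
legend = plt.legend()
```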

So this blue dot denotes male and the red dot denotes female. Okay, so this is exploratory data analysis using Python. In the next video, we'll move to another module, regression analysis. Thank you, see you in the next lecture.
