Hello everyone, welcome to the course on machine learning with Python. In this video we will look at pandas, which is another important Python library for data manipulation. So let's go ahead and explore some properties of pandas. The name pandas comes from "panel data". Panel data is a very useful term in econometrics. Pandas works really well for tabular data formats, which are very frequent occurrences in machine learning. Tabular data is data organized in the form of a table. Pandas provides a very useful and easy-to-use API for tabular data processing, which we shall use throughout the course. It is built on top of the NumPy library, hence one can find a lot of similarities between NumPy and pandas.
It provides very fast and easy-to-use APIs for reading and writing various file formats, for example CSV, MS Excel, SQL, HTML, etc. Now, how to use pandas. One has to install pandas before using it, and the command is pip install pandas. So if you type pip install pandas in the Anaconda prompt, it will install pandas and all the necessary dependencies. Now, if we use the Anaconda distribution, pandas comes pre-installed, so we can skip the first step of installing pandas. To use the library functions inside pandas, one has to import it first. Here we shall discuss different operations on the pandas DataFrame. The DataFrame is the default object of pandas. For complete documentation, one can visit the official pandas website. Okay, now we shall see how to use pandas in a Jupyter notebook.
So let's go ahead and open our Jupyter notebook. This is the Jupyter notebook I have created to demonstrate how pandas works. And this is the official pandas documentation. I would highly recommend looking at this particular page, 10 minutes to pandas. If you click on this, you will be redirected to the 10 minutes to pandas page. There you can spend just 10 to 15 minutes to grasp the various important concepts in pandas.
Okay, so I would definitely recommend you visit this particular website. Now let's go ahead and import pandas as pd. It will take some time to load pandas into your Jupyter notebook environment. Now let's read a CSV file. The CSV file is inside the data folder of the parent folder where this particular Jupyter notebook is located. So I am reading the CSV file from there; you can also specify the entire path of the CSV file, or of whichever file you actually want to read.
So data equals pd.read_csv. This is the read_csv API, which provides the functionality to read a CSV file into a data frame. So data equals pd.read_csv, with the path of the file as the argument. Let's go ahead and run this particular cell. It has read the content of the CSV file into the data frame called data. Now, how do we know that this is a data frame? If we just type type(data) and execute this particular cell, see, this is pandas.core.frame.DataFrame.
So the DataFrame is the default object of the pandas library. Similar to CSV, pandas offers different APIs for reading other file types. For example, read_excel is the API for reading Microsoft Excel files. Similarly, read_html is for reading tables from HTML files, read_json is for reading JSON files, and there are many more. Okay, fine. Now let's go ahead and print the information of the data frame at a glance. For that, we have to type DataFrame.info followed by parentheses. Let's go ahead and run this. My data frame is data.
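The read step described above can be sketched roughly as follows. The lecture's actual CSV file is not reproduced here, so this example reads a few made-up rows from an in-memory string instead of a path; the column names and their capitalization are assumptions based on the columns mentioned in the lecture.

```python
import io

import pandas as pd

# Hypothetical stand-in for the lecture's CSV file. These rows are made up
# for illustration and are not the real data set.
csv_text = """Gender,Height,Weight,Index
Male,174,96,4
Female,185,110,4
Female,195,104,3
Male,149,61,3
"""

# read_csv accepts a file path or a file-like object. StringIO lets this
# sketch run without a file on disk; in the notebook you would pass the path.
data = pd.read_csv(io.StringIO(csv_text))

# type() confirms that read_csv returned a DataFrame.
print(type(data))
```

With a real file, the only change would be passing its relative or full path as the first argument to pd.read_csv.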
So I am typing data.info and then the parentheses, and it has returned some values. Let's understand what this means. RangeIndex is 500 entries, that is, it has 500 rows. Then data columns: there are four columns. What are those four columns? Gender, height, weight and index. The gender column is 500 non-null object, that is, all the values are actually there inside the gender column. Sometimes values are missing, and pandas will treat those as null. But here I can see that all the values are there.
So that is why there are no nulls. Gender is a column of 500 non-null objects. Similarly, height is 500 non-null int64, so it is a column of integer values, all non-null. Similarly, weight is also an integer column, and index is also an integer column. Now let's view the first few rows of the data using data.head with parentheses. The command is DataFrame.head, to print the first few rows of the data frame. How many rows will be printed? By default, five rows. You can also specify the number of rows you actually want to see. Let's say I have specified 10. So it will print the first 10 entries of this particular data frame.
As you can see, there are the gender, height, weight and index columns. Fine. Now let's print the shape of the data frame. Like NumPy, here also we can print the shape using the .shape attribute. Here the .shape attribute is applied to the data frame. You can see that this data has 500 rows and four columns. Now let's see the data type of each column. For that, we have to use data.dtypes, or if your data frame name is different, use that name followed by .dtypes. So you can see that the gender column is nothing but strings.
That is why it has returned object. Similarly, heights are all integers, weights are all integers and index values are all integers. Right, good. Now let's describe the data frame using the DataFrame.describe command. Here our data frame is data.
So data.describe is what we'll be using here. Only numerical columns will be described. As you can see, the gender column is non-numeric; it is an object column. So the gender column is not used here to describe the data frame; only the numerical columns are used. Now let's understand what it tells us. It returns another data frame, whose column names are the same as the numerical columns in the original data frame, and it contains a few rows. Count means it has 500 elements. Fine. Mean is the average, so under height, 169.944 is the average height. Similarly, 106 is the average weight. Fine. Similarly, we can find the standard deviation, minimum, maximum, first quartile, third quartile and median.
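The inspection steps covered so far (info, head, shape, dtypes and describe) can be tried on a small frame like the one below. This is only a sketch: the six rows are made up, and the column names are assumed to match those in the lecture's file.

```python
import pandas as pd

# Made-up sample rows standing in for the lecture's 500-row data set.
data = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Male", "Female"],
    "Height": [174, 185, 195, 149, 189, 160],
    "Weight": [96, 110, 104, 61, 87, 70],
    "Index": [4, 4, 3, 3, 2, 3],
})

# Prints the index range, column names, non-null counts and dtypes.
data.info()

# head() returns the first five rows by default; pass a number to change that.
print(data.head())
print(data.head(3))

# shape is an attribute, not a method: (number of rows, number of columns).
print(data.shape)

# One dtype per column. String columns usually show as object, although
# newer pandas versions may report a dedicated string dtype instead.
print(data.dtypes)

# describe() summarizes only the numeric columns by default and
# returns the summary as another DataFrame.
summary = data.describe()
print(summary)
```

The summary frame is indexed by statistic name, so a single value can be looked up with, for example, summary.loc["mean", "Height"].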
Okay, now pandas Series. Series are the building blocks of the pandas DataFrame. Each column of a data frame is nothing but a Series. Let's go ahead and select a particular column from the data frame. How do we do that? This is the data frame name, and within the square brackets I specify the column that I want to pick. So I am picking the gender column and storing it in an object called gender. Okay, fine. Now, after running the cell, you can see that only the gender column has been picked, and the proper indexing is also there. Now, if I check the data type of gender, we can see that this is nothing but pandas.core.series.Series. So this is a Series type object.
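As a small sketch of the column selection just described, the example below picks one column from a made-up frame and checks its type. The column names are assumptions matching those used in the lecture.

```python
import pandas as pd

# Made-up rows for illustration only.
data = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Height": [174, 185, 195, 149],
})

# Square brackets with a single column name return that column as a Series.
gender = data["Gender"]

print(gender)        # the values together with the row index
print(type(gender))  # a pandas Series
```

The Series keeps the same row index as the data frame it came from, which is why the indexing appears alongside the values.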
So Series are nothing but the individual columns in the data frame. Fine. We can also make a new column. Let's say my new column name is height plus weight, and I have used data within square brackets height plus data within square brackets weight, summing the two columns to create this new column. Let's see what it looks like. We can see that another column, height plus weight, has been created, whose elements are nothing but the values produced by summing the height column and the weight column. We can also remove a column, for example if we do not want to keep this new column. Let's say I want to drop height plus weight, this newly created column. So I have to specify which column I have to drop. Now, what I want to drop is a column.
So I have to specify axis equals 1. If I specify axis equals 0, it will try to drop a row, but it will not find any row named height plus weight, so it will throw an error. So I have to specify axis equals 1 in order to denote that I want to drop a column. Fine. Let's go ahead and see. There is a peculiarity: even though I have dropped it, the height plus weight column has not been dropped. Why? Because I have not done it in place.
So in order to do it in place, we have to pass another argument, inplace equals True. Now if we go ahead and run this, you can see that the height plus weight column is gone. Okay, so this is the default behavior of pandas. By default inplace equals False, so that we do not lose our original data. If we want to make any change permanent, we have to use inplace equals True. Similar to selecting columns, we can also select rows in pandas, but for that we have to use loc. So it is DataFrame.loc, where we specify the row name or the row number. So with data.loc, within square brackets I have mentioned zero, that is, I want to see the row of data at index zero. But we can also see a subset of the data frame. So let's say this is our original data frame.
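The add-a-column and drop-a-column behavior discussed above can be sketched as follows. The rows are made up, and the exact label of the new column ("Height+Weight") is an assumption, since the lecture only names it verbally.

```python
import pandas as pd

# Made-up rows for illustration only.
data = pd.DataFrame({
    "Height": [174, 185, 195],
    "Weight": [96, 110, 104],
})

# Adding two columns element-wise and assigning the result creates a new column.
data["Height+Weight"] = data["Height"] + data["Weight"]

# Without inplace=True, drop returns a new frame and leaves data untouched.
dropped = data.drop("Height+Weight", axis=1)
print("Height+Weight" in data.columns)     # the original still has the column
print("Height+Weight" in dropped.columns)  # the returned copy does not

# axis=0 looks for a row label, finds none with this name, and raises KeyError.
try:
    data.drop("Height+Weight", axis=0)
except KeyError as err:
    print("axis=0 failed:", err)

# inplace=True modifies data itself and returns None.
data.drop("Height+Weight", axis=1, inplace=True)
print(list(data.columns))
```

An alternative with the same effect is to reassign the result, as in data = data.drop("Height+Weight", axis=1).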
Now I want to see the first 10 rows and only the height and weight columns. For that I can use data.loc. I have mentioned the first 10 rows, and then I have selected the height and weight columns, passing them as a list. The column names must be passed as a list. If you go ahead and run this, you can see that the first 10 rows of the original data frame have been selected, and only the height and weight columns are shown. Similarly, we can do conditional selection. This particular conditional selection will select all the females of height greater than 190. So how does this work? Data, within square brackets, gender is equal to female, which means I want to select all the rows whose gender column is female and whose height column is greater than 190.
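The row selection with loc described above might look like the sketch below. The data is made up, the column names are assumed, and the exact slice used in the lecture is not known; the point to notice is that loc selects by label and its slices include the end label.

```python
import pandas as pd

# Twelve made-up rows with the default integer index 0..11.
data = pd.DataFrame({
    "Gender": ["Male", "Female"] * 6,
    "Height": list(range(150, 162)),
    "Weight": list(range(60, 72)),
})

# A single label returns that row as a Series.
row0 = data.loc[0]
print(row0)

# loc slices are label-based and include the end label,
# so 0:9 gives ten rows. Column names are passed as a list.
subset = data.loc[0:9, ["Height", "Weight"]]
print(subset)
```

With the default index, row labels and row positions coincide, which is why the row number works as a label here.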
And I have used an and clause here, which means both conditions must be true for a row to be selected. So go ahead and press Shift+Enter on this particular cell, and you can see that all the females whose height is greater than 190 have been selected. This is a very useful command, called conditional selection. Similarly, we can do sorting. To sort the values, we have to specify which column we want to sort by, and the command for this is sort_values. If we go ahead and do this operation on this data frame, we can see that the data frame is sorted according to height. We can also specify inplace equals True in order to make this change permanent. There is a really useful operation in pandas known as the groupby operation. To demonstrate the groupby operation, let's take our original data frame, data; I am grouping it by gender and producing the mean. What does it produce? It produces another data frame. See, here the index is female or male.
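The conditional selection and sorting steps described above can be sketched like this. The rows are made up and the column names are assumed; the threshold of 190 follows the lecture.

```python
import pandas as pd

# Made-up rows for illustration only.
data = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
    "Height": [174, 185, 195, 149, 192],
    "Weight": [96, 110, 104, 61, 80],
})

# Each condition yields a boolean Series. They are combined with &, and each
# condition needs its own parentheses because & binds tighter than == and >.
tall_females = data[(data["Gender"] == "Female") & (data["Height"] > 190)]
print(tall_females)

# sort_values returns a new frame sorted in ascending order by default;
# data itself keeps its original row order unless inplace=True is passed.
by_height = data.sort_values("Height")
print(by_height)
```

Python's plain and keyword does not work element-wise on Series, which is why the & operator is used for the and clause.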
So the mean height of all the females in the data set is 170.22, and the mean height of all the males in the data frame is 169.64. Similarly, the mean weight of females and the mean weight of males are also different, and similarly for the body weight index. So this is a very useful way to summarize the entire data frame. We also have another operation called the query operation. It is like conditional selection, but note that we do not have to specify the data frame name inside the condition. I have just specified, in the query, that height is greater than or equal to 190. So it will return a data frame of all the rows whose height meets that condition.
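The groupby and query operations described above can be tried on a small frame as in this sketch. The rows are made up, so the means will not match the lecture's figures, and the column names are assumptions.

```python
import pandas as pd

# Made-up rows for illustration only.
data = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
    "Height": [174, 185, 195, 149, 192],
    "Weight": [96, 110, 104, 61, 80],
})

# Group the rows by gender and average every remaining (numeric) column.
# The result is a new DataFrame indexed by the group labels.
means = data.groupby("Gender").mean()
print(means)

# query takes the condition as a string, so column names are written
# directly without repeating the data frame name.
tall = data.query("Height >= 190")
print(tall)
```

The same rows could be selected with data[data["Height"] >= 190]; query is mainly a more compact way to write the condition.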
So here I have given a glimpse of a few pandas operations. I recommend going to the official pandas website and exploring the many pandas operations that are documented there. In the next lecture, we will focus more on the probability and statistics part. So see you in the next lecture. Thank you.