Python 3: Automating Your Job Tasks

Okay, we've seen how to load and read data from various file formats. But can we do the same for HTML content using pandas? The answer is absolutely yes. You can find a link attached to this lecture pointing to the section of the pandas documentation where HTML is discussed in detail. For now, the first thing you should do, in order to avoid any errors while using the HTML-specific methods in pandas, is to install an additional module that pandas needs in order to process HTML and XML data. That module is called lxml. So just go to your CMD and type in pip install lxml.
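If you're not sure whether the install worked, you can check from Python itself. This is just a minimal sketch using the standard library; the variable name has_lxml is my own choice and not something pandas requires.

```python
import importlib.util

# find_spec returns None when the module cannot be found,
# and it does not actually import the module.
has_lxml = importlib.util.find_spec("lxml") is not None

if has_lxml:
    print("lxml is installed")
else:
    print("lxml is missing; run: pip install lxml")
```

Run this in the same environment as your notebook, since a package installed for one Python interpreter is not automatically visible to another.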

I've already installed it, but you should pause the video and do it before moving on to the rest of this lecture. Okay, let's say that we want to identify, load, read, and analyze a table that is located on a web page. As always, we want to have that table loaded as a DataFrame in our code. To be even more granular, I chose a web page that contains several tables, so you can also learn how to reference and pick the desired table for your Python needs. First of all, let's define the URL where the web page and its tables are located.

So I'm going to consider this Wikipedia page. I'm going to copy the URL and head over to the Jupyter notebook, where I will create a variable called url and assign it the actual link as a string. So open and close double quotes and paste the link right here, Shift+Enter. Now let's have a look at the page at this URL and see how many tables there are on the page. As you can see, this is the Wikipedia page of the Python programming language. Now we can see a table-like structure right here. So if we right-click and hit Inspect, we can see that this is indeed a table.

Now let's scroll down and see if we have other tables on this page. For instance, we have this table right here. So let's hit Inspect as well. And indeed, this is yet another table on this page. Okay, so we have identified several tables on this web page.

Let's assume we want to load this table right here, which contains Python's built-in data types, as a DataFrame in our code. For this, the first thing we should do is use the read_html method that pandas provides for this purpose. So let's return to our notebook and type in d = pandas.read_html, and in between the parentheses of this method, we simply pass the url variable that we previously defined. Let's also check d, Shift+Enter.
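Here is roughly what that call looks like. To keep this sketch runnable without a network connection, I'm using a tiny inline HTML snippet in place of the Wikipedia page; the snippet and its contents are invented for illustration, and the code assumes pandas and lxml are installed.

```python
from io import StringIO

import pandas

# Hypothetical stand-in for the web page: two small tables.
html = """
<table>
  <tr><th>Paradigm</th><th>Supported</th></tr>
  <tr><td>Object-oriented</td><td>yes</td></tr>
</table>
<table>
  <tr><th>Type</th><th>Mutability</th></tr>
  <tr><td>list</td><td>mutable</td></tr>
  <tr><td>tuple</td><td>immutable</td></tr>
</table>
"""

# read_html accepts a URL, a file path, or a file-like object.
# With a real page you would pass the url variable instead.
d = pandas.read_html(StringIO(html))

print(type(d))  # a plain Python list, not a DataFrame
print(len(d))   # one entry per <table> element that was found
```

Wrapping the string in StringIO matters on recent pandas versions, where passing raw HTML text directly to read_html is deprecated.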

Okay, so we can immediately notice, by scrolling down through this entire result, that we have a list returned, where each table structure on the web page is an element of this list. Notice the square brackets at the beginning and at the end of this list, and also that the tables are separated by commas, as you can see right here, for example, and also down below, and so on until the end of the list. Now, if we look closely, the table of data types that we're looking to get is the second element of the list, which means it sits at index 1. So this would be the one right here. Therefore, to read and load only the table that we're interested in, we would need to use an index, right?
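Scrolling through a long printout works, but you can also let Python summarize the list for you. This is a sketch under the same assumptions as before: the inline HTML is made up, it stands in for the real page, and the name summary is my own.

```python
from io import StringIO

import pandas

# Hypothetical stand-in for a page with several tables.
html = """
<table>
  <tr><th>Paradigm</th><th>Supported</th></tr>
  <tr><td>Object-oriented</td><td>yes</td></tr>
</table>
<table>
  <tr><th>Type</th><th>Mutability</th></tr>
  <tr><td>list</td><td>mutable</td></tr>
  <tr><td>tuple</td><td>immutable</td></tr>
</table>
"""

tables = pandas.read_html(StringIO(html))

# Collect the position, shape, and column names of every table
# so the one we want is easy to spot.
summary = [(i, t.shape, list(t.columns)) for i, t in enumerate(tables)]

for index, shape, columns in summary:
    print(index, shape, columns)
```

The index printed next to the table you recognize is the one to use in square brackets.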

So we have a list, so why not use basic indexing? This means we go back to our syntax up here; let me copy and paste it. We have pandas.read_html(url), so let's add the index 1 in between square brackets right after it. And now if we run it with Shift+Enter, we indeed have the table of data types returned as a pandas DataFrame. Okay, great job. This is exactly what we were looking to achieve.
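Picking the table by index can be sketched like this. I've also included the match parameter of read_html as an alternative that the lecture does not cover: it keeps only tables containing text that matches the given string or regular expression. As before, the inline HTML is invented for illustration and the variable names are my own.

```python
from io import StringIO

import pandas

# Hypothetical stand-in for the web page.
html = """
<table>
  <tr><th>Paradigm</th><th>Supported</th></tr>
  <tr><td>Object-oriented</td><td>yes</td></tr>
</table>
<table>
  <tr><th>Type</th><th>Mutability</th></tr>
  <tr><td>list</td><td>mutable</td></tr>
  <tr><td>tuple</td><td>immutable</td></tr>
</table>
"""

# Basic list indexing: index 1 is the second table.
data_types = pandas.read_html(StringIO(html))[1]

# Alternative: filter by text inside the table. The result is
# still a list, so we take its first (and here only) element.
same_table = pandas.read_html(StringIO(html), match="Mutability")[0]

print(data_types)
print(data_types.equals(same_table))
```

Indexing is the simplest option, but it breaks if the page later gains or loses a table above the one you want. Matching on text is a bit more robust in that case.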

There are other things I could have mentioned about handling HTML data with pandas, but I'll let you discover more on your own using the link attached to this lecture. Moreover, we will use HTML data yet again in the application we're going to build at the end of this section, and you're going to learn new things as we discuss that application's code. But more on that later. For now, I hope you enjoyed this video.

And I'll see you in the next one, where you are going to learn about indexing and slicing tables with pandas.


Reading HTML Content from URLs and HTML Files with Pandas
