Python 3: Automating Your Job Tasks

Okay, great to see you in this lecture, where we are going to upgrade the application we've just built in the previous video. I'm going to keep this new code in a separate file, web_scraper_pagination.py, in the same folder, and I'm only going to highlight the differences compared to the code you've seen earlier. Basically, this new application is going to perform the exact same tasks, meaning extracting the name, link and price of each of the 21 products from a given test website and saving them to an Excel spreadsheet. The new thing here is that this time we don't have all the products listed on a single page. Instead, we have pagination enabled on the website, meaning there are multiple pages, in this case four of them, that contain our products, and our web scraper should be able to handle such a scenario.

For the purpose of this lecture, I'm going to use another link from the same website. This is the link right here; you can find it attached to this video, as well as in the notebook that follows. So before seeing and testing the code for our new application version, let's take a look at how the information is structured. This time, we have the first six products listed on page one, then the next six products on page two, then yet another batch of six tablets listed on page three, and finally, the last three products residing on page four. So this means that our application will have to automatically iterate over all four pages and extract the product information from each page, as we already did in the previous video.

To enable this iterating behavior, let's try and find an identifier in the link of each page that will uniquely reference that particular page. Going back to the original link, as you can see on the screen right now, we don't see any specific identifier. However, as soon as you click on a page number, let's say page number two, the link changes accordingly, and this text gets appended to the initial link: ?page=2. The same happens for page three and page four as well.
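As a small sketch of what that appended text means, the snippet below splits a page-two link into its parts using Python's standard urllib.parse module. The address is a placeholder I made up for illustration; the real link is the one attached to the video.

```python
from urllib.parse import urlsplit, parse_qs

# Placeholder address, not the lecture's real link.
page_two = "https://example.com/test-site/tablets?page=2"

parts = urlsplit(page_two)
query = parse_qs(parts.query)

print(parts.query)       # the text appended after the question mark
print(query["page"][0])  # the page number, as a string
```

So the page number is the only part of the link that changes from one page to the next, which is what lets us loop over the pages.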

So the first thing we should do is define the common part of this link in our code. That is this part right here, up to and including the equal sign, and that's exactly what I did here in the application, using a variable called link, after importing the modules we need, of course. Next, I have created an empty list, this one right here called products, that will eventually hold all the products extracted from all the pages. Now, in order to extract all the products listed on each of the four pages, we need to use the unique link of each page inside the parentheses of the requests.get method.

Therefore, we have to iterate over the four pages using a basic for loop and the range function, as you can see right here, where range(1, 5) produces 1, 2, 3 and 4, which are our page numbers. So for each page in this range, we're getting that page by concatenating the string referenced by the link variable with the corresponding page number, which is, by the way, converted from integer to string; otherwise, we wouldn't be able to compose the necessary link. Then we are just passing the string obtained from the concatenation to the get method from the requests module. Next, as we iterate over each page, we are also loading and parsing the content of that page. Then we have to identify the div tags corresponding to the products that are listed on that particular page.
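To make the range and the string conversion concrete, here is a minimal sketch that only builds the four page links and prints them, without requesting anything. The value of link is a placeholder assumption, and page_links is a name I introduced just for this illustration.

```python
# Placeholder for the common part of the link, up to and including "=".
link = "https://example.com/test-site/tablets?page="

page_links = []
for page in range(1, 5):  # yields 1, 2, 3, 4; the stop value 5 is excluded
    # str() is required here: adding an int to a str raises a TypeError.
    page_links.append(link + str(page))

for page_link in page_links:
    print(page_link)
```

Each of these strings is what would be passed to the get method inside the loop.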

This is done the same way as in the previous lecture, using the find_all method and the correct class value. Finally, inside the same for loop, we have to write another for loop that will iterate through all the products that have been identified on each page and append each product to the general list of products up here, using the append method. This process is performed for each of the four pages. As soon as the list of products is complete, meaning all four pages have been scanned and all 21 products have been saved to the products list, the rest of the code is exactly the same as in the previous video, performing the same tasks, meaning extracting the name, link and price for each product into three different lists. We then zip the three lists together into a list of tuples.
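Putting the steps described so far together, the nested loop could look roughly like the sketch below. This is not the exact script from the lecture: the link and the class value are placeholders, the function name collect_products and its get parameter are my own additions (the latter so the loop can be tried without a network connection), and I'm assuming the built-in html.parser as the parser.

```python
import requests
from bs4 import BeautifulSoup

# Both values below are placeholders, not the ones used in the lecture.
LINK = "https://example.com/test-site/tablets?page="
PRODUCT_CLASS = "product"


def collect_products(link, first_page, last_page, get=requests.get):
    """Return the product div tags found on pages first_page..last_page."""
    products = []  # will hold the products from all the pages
    for page in range(first_page, last_page + 1):
        # Compose the unique link of this page and request it.
        response = get(link + str(page))
        # Load and parse the content of the page.
        soup = BeautifulSoup(response.text, "html.parser")
        # Identify the div tags for the products on this page.
        divs = soup.find_all("div", class_=PRODUCT_CLASS)
        # Append each product to the general list.
        for div in divs:
            products.append(div)
    return products
```

Calling collect_products(LINK, 1, 4) would request pages one through four, which matches range(1, 5) in the lecture.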

After that, we build the pandas DataFrame and finally write the DataFrame to the Excel file. Now it's time to test our new application version to see if the iteration through all four pages is done correctly and if all the data is indeed saved to the Excel file. I'm going to use the Windows cmd again and run the script: python, followed by the path to web_scraper_pagination.py in my web scraping folder on the D drive.

This would be the second Python script in my folder. And we get the message that the web data was successfully written to Excel and that the program is quitting, so no exceptions have been raised. Let's check the folder as well. And there's our products_pagination.xlsx file.

Let's open it and, success! We have all 21 products saved to the Excel spreadsheet, along with their names, links and prices. Basically, the result is identical to the one we got in the previous video, only this time we scraped multiple web pages instead of a single page. Feel free to check out the notebook following this video and download the Python script attached to that notebook to save the upgraded version of our web scraping application. So I hope you enjoyed this section on web scraping with Python, and I will see you soon. Bye!

APPLICATION - Handling Website Pagination When Extracting Data
