Friday 5 June 2015

Beautiful Soup: Parsing HTML web pages in Python

Posted By: Unknown - 00:50


Want to scrape or parse an HTML or XML document? You've come to the right place! In this short tutorial, I'll show you how to use the Beautiful Soup library (which is awesome, by the way) to parse HTML or XML documents with great ease.

Let's get started.

Beautiful Soup - Introduction

Here's what they say about it:
You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it's been saving programmers hours or days of work on quick-turnaround screen scraping projects.
It's a Python library that provides great features and allows you to quickly and easily scrape web pages for information. You can capture all the links on a web page, extract specific bits of text, and do more crazy stuff with great ease!

Beautiful Soup - Installation

Using the pip tool, install the beautifulsoup4 package:
pip install beautifulsoup4
Please refer to the Beautiful Soup installation guide if you face any problems while installing it.
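
If you want to double-check that the install worked, import the package and print its version (recent releases of beautifulsoup4 expose a __version__ attribute):

# Quick sanity check that beautifulsoup4 is importable.
import bs4
print(bs4.__version__)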

Parsing web pages - Beautiful Soup in action!

Now that you have installed everything you need, let me show you how awesome this library actually is. For this tutorial, I'll be targeting the homepage of the Html Agility Pack library (which is another great library for HTML scraping!).


Here's what we will parse:
  • Title
  • All links
  • Current release version
  • Total downloads
  • Reviews from the "MOST HELPFUL REVIEWS" section

Let's start, shall we?

I'm using urllib to send the HTTP request to the server. Once the source code is read, we create a BeautifulSoup object. Here's how:

from urllib import request
from bs4 import BeautifulSoup

response = request.urlopen('https://htmlagilitypack.codeplex.com/').read()

# Create a BeautifulSoup object, using Python's built-in HTML parser.
soupObj = BeautifulSoup(response, 'html.parser')

Now that we have our object, let's extract some useful information.

Title

# Title tag
print(soupObj.title)

# Text inside the title tag.
print(soupObj.title.text)

All links

# Printing href attribute value for each a tag.
for link in soupObj.find_all('a'):
    print(link.get('href'))
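
Many of those href values will be relative URLs. If you'd rather have absolute ones, a small tweak (not part of the original snippet) using urllib.parse.urljoin does the trick:

from urllib.parse import urljoin

baseUrl = 'https://htmlagilitypack.codeplex.com/'

for link in soupObj.find_all('a'):
    href = link.get('href')
    if href:  # Some <a> tags have no href attribute at all.
        print(urljoin(baseUrl, href))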

The above two operations are very basic. Let's now try getting something more interesting, like the latest version of the library.

Current stable version and total downloads

By inspecting the element (right-click > Inspect element) we can see that the data sits in a table (the <table> tag) inside a div tag whose id equals current_rating. Our required data (version and total downloads) lives in the rows of that table (the <tr> tags): the first row contains the current version and the fourth row contains the total downloads. Every row contains two tags, a <th> and a <td>, and our data is inside the <td> tag.

We know enough; let's start by selecting the div tag.

ratingsDiv = soupObj.find('div', id='current_rating')

Now that we have the div tag, we select the first row inside it. The <td> tag within that row contains our required data. Here's how to extract it:

# Selecting the first row inside the div tag.
firstRow = ratingsDiv.find_all('tr')[0]

# Selecting and printing the text inside td tag.
currVersion = firstRow.find('td')
print(currVersion.text.strip())

Similarly, selecting the fourth row inside the div tag and extracting the content of its <td> tag gives us the total number of downloads.

# Selecting the fourth row inside the div tag.
fourthRow = ratingsDiv.find_all('tr')[3]

totalDownloads = fourthRow.find('td')
print(totalDownloads.text.strip())
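
A quick word of caution: find() returns None when nothing matches, so if the page layout ever changes, calling find_all() on ratingsDiv would raise an AttributeError. Here's a slightly more defensive sketch of the same lookup:

# find() returns None when no matching tag exists, so guard before drilling down.
ratingsDiv = soupObj.find('div', id='current_rating')
if ratingsDiv is not None:
    rows = ratingsDiv.find_all('tr')
    if len(rows) >= 4:
        print(rows[0].find('td').text.strip())   # Current version.
        print(rows[3].find('td').text.strip())   # Total downloads.
else:
    print('Ratings section not found - the page layout may have changed.')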

Most Helpful Reviews

Using the same technique (right-click > Inspect element), we see that the reviews in the "MOST HELPFUL REVIEWS" section are inside a div tag whose id equals recent_reviews. The two reviews shown are inside <p> tags.

Selecting the div:

reviewsDiv = soupObj.find('div', id='recent_reviews')

Printing all the reviews inside the p tags:

for para in reviewsDiv.find_all('p'):
    print(para.text.strip())

Did you see how easy it was?
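
For reference, here's the whole walkthrough stitched together into a single script (keep in mind the CodePlex page may no longer be online, so it only runs if the URL is still reachable):

from urllib import request
from bs4 import BeautifulSoup

# Fetch the page and build the soup object.
response = request.urlopen('https://htmlagilitypack.codeplex.com/').read()
soupObj = BeautifulSoup(response, 'html.parser')

# Title.
print(soupObj.title.text)

# All links.
for link in soupObj.find_all('a'):
    print(link.get('href'))

# Current version and total downloads.
ratingsDiv = soupObj.find('div', id='current_rating')
rows = ratingsDiv.find_all('tr')
print(rows[0].find('td').text.strip())   # Current version.
print(rows[3].find('td').text.strip())   # Total downloads.

# Most helpful reviews.
reviewsDiv = soupObj.find('div', id='recent_reviews')
for para in reviewsDiv.find_all('p'):
    print(para.text.strip())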

Now that you know how to work with the library, why don't you head over to the documentation page and start working on your own project?

Feel free to post any questions, comments or suggestions you might have, and yes, don't forget to share this post. Have fun!

About Unknown

Hi. I'm a freelance software developer, also interested in exploit development, reverse engineering, and web development. Here I'll be sharing stuff I find interesting. Feel free to swing by the blog!
