Friday, 5 June 2015

Beautiful Soup: Parsing HTML web pages in Python

Posted By: Unknown - 00:50

Share

& Comment

Wanted to scrape or parse an HTML or an XML document? You've come to  the right place! In this short tutorial, I'll show you how to use the Beautiful Soup library (which is awesome, by the way) to parse HTML or XML documents with great ease.

Let's get started.

Beautiful Soup - Introduction

Here's what they say about it:
You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it's been saving programmers hours or days of work on quick-turnaround screen scraping projects.
It's a Python library that provides great features and allows you quickly and easily scrape web pages for information. You can capture all the links on a web page, extract any specific text, and do more crazy stuff with great ease!

Beautiful Soup - Installation

Using pip tool, install the beautifulsoup4 package.
pip install beautifulsoup4
Please refer the Beautiful Soup installation guide if you face any problem while installing it.

Parsing web pages - Beautiful Soup in action!

Now that you have installed everything you need, let me show you how awesome this library actually is. For this tutorial, I'll be targetting the homepage of the Html Agility Pack library (which is another great library for HTML scraping!)


Here's what we will parse:
  • Title
  • All links.
  • Current release version.
  • Total downloads.
  • Reviews from the "MOST HELPFUL REVIEWS" section.

Let's start, shall we?

I'm using urllib to send the HTTP request to the server. Once the source code is read, we create a BeautifulSoup object. Here's how:

from urllib import request
from bs4 import BeautifulSoup

response = request.urlopen('https://htmlagilitypack.codeplex.com/').read()

# Create a BeautifulSoup object.
soupObj = BeautifulSoup(response)

Now that we have our object, let's extract some useful information.

Title

# Title tag
print(soupObj.title)

# Text inside the title tag.
print(soupObj.title.text)

All links

# Printing href attribute value for each a tag.
for link in soupObj.find_all('a'):
    print(link.get('href'))

The above two operations are very basic. Let's now try getting something important, like the current latest version of the library.

Current stable version and total downloads

By inspecting the element (right click > Inspect element) we can see that it is a table (the <table> tag) and it lies inside a div tag whose id equals current_rating. Our required data (version and total downloads) is a row in that table (the <tr> tag). 1st row contains the info about the current version and the 4th row contains info about the total downloads. Every row (the <tr> tag) contains two tag - <th> and <td> tags. Our data is inside the td tag.

We know enough, let's start by selecting the div tag.

ratingsDiv = soupObj.find('div', id='current_rating')

Now that we have our required div tag, we select the 1st row relative to this div tag. The <td> tag inside this selected row contains our required data. Here's how to extract it:

# Selecting the first row inside the div tag.
firstRow = ratingsDiv.find_all('tr')[0]

# Selecting and printing the text inside td tag.
currVersion = firstRow.find('td')
print(currVersion.text.strip())

Similarly, selecting the 4th row relative to the div tag and extracting the content inside the <td> tag gives us the total number of downloads.

# Selecting the forth row inside the div tag.
forthRow = ratingsDiv.find_all('tr')[3]

totalDownloads = forthRow.find('td')
print(totalDownloads.text.strip())

Most Helpful Reviews

Using the same technique (Right click > Inspect element) we see that the reviews in the "MOST HELPFUL REVIEWS" section are inside a div tag with id equals recent_reviews. The two reviews shows are inside <p> tags.

Selecting the div:

reviewsDiv = soupObj.find('div', id='recent_reviews')

Printing all the reviews inside the p tags:

for para in reviewsDiv.find_all('p'):
    print(para.text.strip())

Did you see how easy it was?

Now that you know how to work with the library, why don't you head over to the documentation page and starting working on your own project?

Feel free to post any questions, comments or suggestions you might have, and yes, don't forget to share this post. Have fun!

About Unknown

Hi. I'm a freelancer software developer, also interested in exploit development, reverse engineering and web development. Here I'll be sharing stuff I find interesting. Feel free to swing by the blog!

0 comments:

Post a Comment

Copyright © 2013 Coding The Void™ is a registered trademark.

Designed by Templateism. Hosted on Blogger Platform.