Friday 5 June 2015

Beautiful Soup: Parsing HTML web pages in Python

Posted By: Unknown - 00:50
Wanted to scrape or parse an HTML or XML document? You've come to the right place! In this short tutorial, I'll show you how to use the Beautiful Soup library (which is awesome, by the way) to parse HTML or XML documents with great ease.

Let's get started.

Beautiful Soup - Introduction

Here's what they say about it:
You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it's been saving programmers hours or days of work on quick-turnaround screen scraping projects.
It's a Python library that provides great features and allows you to quickly and easily scrape web pages for information. You can capture all the links on a web page, extract any specific text, and do more crazy stuff with great ease!

Beautiful Soup - Installation

Using the pip tool, install the beautifulsoup4 package.
pip install beautifulsoup4
Please refer to the Beautiful Soup installation guide if you face any problems while installing it.
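To confirm the install worked, a quick sanity check is to import the package and print its version:

import bs4

# Prints the installed Beautiful Soup version, e.g. '4.3.2'.
print(bs4.__version__)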

Parsing web pages - Beautiful Soup in action!

Now that you have installed everything you need, let me show you how awesome this library actually is. For this tutorial, I'll be targeting the homepage of the Html Agility Pack library (which is another great library for HTML scraping!).


Here's what we will parse:
  • Title
  • All links
  • Current release version
  • Total downloads
  • Reviews from the "MOST HELPFUL REVIEWS" section

Let's start, shall we?

I'm using urllib to send the HTTP request to the server. Once the page source is read, we create a BeautifulSoup object from it. Here's how:

from urllib import request
from bs4 import BeautifulSoup

response = request.urlopen('https://htmlagilitypack.codeplex.com/').read()

# Create a BeautifulSoup object, using Python's built-in HTML parser.
soupObj = BeautifulSoup(response, 'html.parser')
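A side note on parsers: if you don't name one, newer versions of Beautiful Soup warn you and pick the best parser available on your machine, which can make behaviour differ between systems. If you have the optional lxml package installed, you can pass 'lxml' instead for faster parsing:

# Same soup, but built with the (optional) lxml parser.
soupObj = BeautifulSoup(response, 'lxml')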

Now that we have our object, let's extract some useful information.

Title

# Title tag
print(soupObj.title)

# Text inside the title tag.
print(soupObj.title.text)

All links

# Printing href attribute value for each a tag.
for link in soupObj.find_all('a'):
    print(link.get('href'))
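One thing to keep in mind: many of those href values will be relative URLs (starting with /). Here's a small sketch of resolving them against the page's base URL using urljoin from the standard library:

from urllib.parse import urljoin

base = 'https://htmlagilitypack.codeplex.com/'

for link in soupObj.find_all('a'):
    href = link.get('href')
    if href:  # Some <a> tags have no href attribute at all.
        # Relative URLs are resolved against base; absolute ones pass through unchanged.
        print(urljoin(base, href))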

Those two operations are very basic. Let's now try getting something more interesting, like the latest version of the library.

Current stable version and total downloads

By inspecting the element (right click > Inspect element), we can see that the data lives in a table (the <table> tag) inside a div tag whose id equals current_rating. Each piece of data we want (version and total downloads) is a row in that table (a <tr> tag): the first row contains the current version and the fourth row contains the total downloads. Every row holds two tags, a <th> and a <td>, and our data is inside the <td> tag.

We know enough, let's start by selecting the div tag.

ratingsDiv = soupObj.find('div', id='current_rating')

Now that we have our required div tag, we select the first row relative to it. The <td> tag inside this row contains the version number. Here's how to extract it:

# Selecting the first row inside the div tag.
firstRow = ratingsDiv.find_all('tr')[0]

# Selecting and printing the text inside td tag.
currVersion = firstRow.find('td')
print(currVersion.text.strip())

Similarly, selecting the fourth row relative to the div tag and extracting the content of its <td> tag gives us the total number of downloads.

# Selecting the fourth row inside the div tag.
fourthRow = ratingsDiv.find_all('tr')[3]

totalDownloads = fourthRow.find('td')
print(totalDownloads.text.strip())
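Indexing rows by position ([0] and [3]) breaks if the site ever reorders the table. A more defensive variant, purely as a sketch, matches rows by the text of their <th> header instead. Note that 'version' and 'downloads' below are assumed header fragments, so adjust them to whatever the actual page shows:

def cell_for_header(container, header_text):
    # Return the <td> text of the first row whose <th> mentions header_text.
    for row in container.find_all('tr'):
        th = row.find('th')
        if th and header_text.lower() in th.text.lower():
            td = row.find('td')
            return td.text.strip() if td else None
    return None

print(cell_for_header(ratingsDiv, 'version'))    # assumed header fragment
print(cell_for_header(ratingsDiv, 'downloads'))  # assumed header fragment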

Most Helpful Reviews

Using the same technique (right click > Inspect element), we see that the reviews in the "MOST HELPFUL REVIEWS" section are inside a div tag whose id equals recent_reviews. The two reviews shown are inside <p> tags.

Selecting the div:

reviewsDiv = soupObj.find('div', id='recent_reviews')

Printing all the reviews inside the p tags:

for para in reviewsDiv.find_all('p'):
    print(para.text.strip())
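One caveat: find returns None when nothing matches, so if the layout of the page ever changes, reviewsDiv.find_all(...) would raise an AttributeError. A minimal guard looks like this:

reviewsDiv = soupObj.find('div', id='recent_reviews')

if reviewsDiv is None:
    print('Reviews section not found - the page layout may have changed.')
else:
    for para in reviewsDiv.find_all('p'):
        print(para.text.strip())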

Did you see how easy it was?

Now that you know how to work with the library, why don't you head over to the documentation page and start working on your own project?

Feel free to post any questions, comments or suggestions you might have, and yes, don't forget to share this post. Have fun!

Wednesday 3 June 2015

Fibonacci Sequence: Recursive and Iterative solutions (Dynamic Programming)

Posted By: Unknown - 10:37
While studying algorithms, you'll definitely encounter the problem of calculating the Fibonacci sequence. Here's a small post showing you 3 methods to do it (recursive, top-down and bottom-up), along with their time and space complexities.

Here we go.

Recursive

def FibonacciRec(n):

    # Base cases: fib(0) = 0 and fib(1) = 1.
    if (n == 0 or n == 1):
        return n

    return FibonacciRec(n - 1) + FibonacciRec(n - 2)

Time complexity: O(2^n) - each call spawns two more recursive calls, so the call tree grows exponentially.
Space complexity: O(n) - the call stack never holds more than n frames at once.
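A quick sanity check (note how slow the naive version already gets around n = 30):

# First ten Fibonacci numbers: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34
print([FibonacciRec(i) for i in range(10)])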

The recursive solution above is the naive approach. The following two are better ways to do the same thing; I've implemented them in a single class, FibonacciCalc.

Top-down (Memoization) and Bottom-up (Dynamic Programming)

class FibonacciCalc:

    def __init__(self):
        # Cache mapping n -> fib(n), used by the top-down approach.
        self.memory = {}


    def FibonacciTopDown(self, n):
        # Base cases match FibonacciRec: fib(0) = 0 and fib(1) = 1.
        if (n == 0 or n == 1):
            return n

        # Return the cached result if we've already computed fib(n).
        if (n in self.memory):
            return self.memory[n]

        ret_value = self.FibonacciTopDown(n - 1) + self.FibonacciTopDown(n - 2)

        self.memory[n] = ret_value

        return self.memory[n]


    def FibonacciBottomUp(self, n):
        if (n == 0 or n == 1):
            return n

        # fib(0) and fib(1), rolled forward until we reach fib(n).
        prev, current = 0, 1

        for i in range(2, n + 1):
            tmp = current
            current += prev
            prev = tmp

        return current

Time and space complexity for the top-down approach: O(n) each - every value from 2 to n is computed once and cached.

Time and space complexity for the bottom-up approach: O(n) and O(1) respectively - only the last two values are ever kept around.
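Here's a short usage example tying the three methods together; with the matching base cases above, all of them agree on the same values:

calc = FibonacciCalc()

# fib(10) = 55 for all three approaches.
print(FibonacciRec(10))
print(calc.FibonacciTopDown(10))
print(calc.FibonacciBottomUp(10))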

There you go.
Have fun!
