Friday, 5 June 2015

Beautiful Soup: Parsing HTML web pages in Python

Posted By: Unknown - 00:50
Want to scrape or parse an HTML or XML document? You've come to the right place! In this short tutorial, I'll show you how to use the Beautiful Soup library (which is awesome, by the way) to parse HTML or XML documents with great ease.

Let's get started.

Beautiful Soup - Introduction

Here's what they say about it:
"You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it's been saving programmers hours or days of work on quick-turnaround screen scraping projects."
It's a Python library that lets you quickly and easily scrape web pages for information. You can capture all the links on a web page, extract any specific text, and do more crazy stuff with great ease!

Beautiful Soup - Installation

Using the pip tool, install the beautifulsoup4 package.
pip install beautifulsoup4
Please refer to the Beautiful Soup installation guide if you face any problems while installing it.

Parsing web pages - Beautiful Soup in action!

Now that you have installed everything you need, let me show you how awesome this library actually is. For this tutorial, I'll be targeting the homepage of the Html Agility Pack library (which is another great library for HTML scraping!).


Here's what we will parse:
  • Title
  • All links.
  • Current release version.
  • Total downloads.
  • Reviews from the "MOST HELPFUL REVIEWS" section.

Let's start, shall we?

I'm using urllib to send the HTTP request to the server. Once the source code is read, we create a BeautifulSoup object. Here's how:

from urllib import request
from bs4 import BeautifulSoup

response = request.urlopen('https://htmlagilitypack.codeplex.com/').read()

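# Create a BeautifulSoup object, naming the parser explicitly so the
# result does not depend on which parsers happen to be installed.
soupObj = BeautifulSoup(response, 'html.parser')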

Now that we have our object, let's extract some useful information.

Title

# Title tag
print(soupObj.title)

# Text inside the title tag.
print(soupObj.title.text)

All links

# Printing href attribute value for each a tag.
for link in soupObj.find_all('a'):
    print(link.get('href'))
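
The href values printed by the loop are often relative paths, in-page anchors, or missing altogether (link.get('href') returns None when an a tag has no href). Here's a minimal sketch of how you might turn them into absolute URLs with the standard library. The absolute_links helper and the sample values are made up for illustration; they are not taken from the real page.

```python
from urllib.parse import urljoin


def absolute_links(base_url, hrefs):
    """Resolve href values against base_url, skipping missing or in-page ones."""
    resolved = []
    for href in hrefs:
        # link.get('href') yields None for <a> tags without an href attribute.
        if not href or href.startswith('#'):
            continue
        resolved.append(urljoin(base_url, href))
    return resolved


# Made-up values standing in for what link.get('href') might return.
sample_hrefs = ['/documentation', None, '#top', 'https://example.com/page']
links = absolute_links('https://htmlagilitypack.codeplex.com/', sample_hrefs)
print(links)
```

In the tutorial's loop you would pass something like [link.get('href') for link in soupObj.find_all('a')] as the second argument.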

The above two operations are very basic. Let's now try getting something important, like the current latest version of the library.

Current stable version and total downloads

By inspecting the element (right click > Inspect element), we can see that the data sits in a table (the <table> tag) inside a div tag whose id is current_rating. The data we need (version and total downloads) is in the rows of that table (the <tr> tags): the 1st row holds the current version and the 4th row holds the total downloads. Every row contains two tags, <th> and <td>, and our data is inside the <td> tag.

We know enough. Let's start by selecting the div tag.

ratingsDiv = soupObj.find('div', id='current_rating')

Now that we have our required div tag, we select the 1st row relative to this div tag. The <td> tag inside this selected row contains our required data. Here's how to extract it:

# Selecting the first row inside the div tag.
firstRow = ratingsDiv.find_all('tr')[0]

# Selecting and printing the text inside td tag.
currVersion = firstRow.find('td')
print(currVersion.text.strip())

Similarly, selecting the 4th row relative to the div tag and extracting the content inside the <td> tag gives us the total number of downloads.

# Selecting the fourth row inside the div tag.
fourthRow = ratingsDiv.find_all('tr')[3]

# Selecting and printing the text inside td tag.
totalDownloads = fourthRow.find('td')
print(totalDownloads.text.strip())
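
One thing to keep in mind: find() returns None when nothing matches, and indexing past the last row raises an IndexError, so the snippets above break as soon as the page layout changes. Here's a small defensive sketch. The HTML is invented markup that only mirrors the structure described above (it is not the real page), and cell_text is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Invented markup mirroring the described structure; values are placeholders.
SAMPLE_HTML = """
<div id="current_rating">
  <table>
    <tr><th>Current</th><td> 1.0.0 </td></tr>
    <tr><th>Date</th><td>placeholder</td></tr>
    <tr><th>Status</th><td>placeholder</td></tr>
    <tr><th>Downloads</th><td> 12,345 </td></tr>
  </table>
</div>
"""


def cell_text(container, row_index):
    """Return the stripped <td> text of the given row, or None if it is absent."""
    if container is None:
        return None
    rows = container.find_all('tr')
    if row_index >= len(rows):
        return None
    cell = rows[row_index].find('td')
    return cell.get_text(strip=True) if cell else None


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
ratings = soup.find('div', id='current_rating')

version = cell_text(ratings, 0)    # 1st row
downloads = cell_text(ratings, 3)  # 4th row
missing = cell_text(soup.find('div', id='no_such_div'), 0)
print(version, downloads, missing)
```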

Most Helpful Reviews

Using the same technique (right click > Inspect element), we see that the reviews in the "MOST HELPFUL REVIEWS" section are inside a div tag whose id is recent_reviews. The two reviews shown are inside <p> tags.

Selecting the div:

reviewsDiv = soupObj.find('div', id='recent_reviews')

Printing all the reviews inside the p tags:

for para in reviewsDiv.find_all('p'):
    print(para.text.strip())
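
If you'd rather collect the reviews than print them, something like the sketch below could work. It also collapses line breaks and runs of spaces inside each paragraph, which strip() alone leaves untouched. The markup is invented for illustration and review_texts is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Invented markup standing in for the reviews section of the page.
SAMPLE_HTML = """
<div id="recent_reviews">
  <p>  Works well for
       messy markup. </p>
  <p>Easy to pick up.</p>
</div>
"""


def review_texts(soup):
    """Return the text of each <p> in the reviews div with whitespace collapsed."""
    reviews_div = soup.find('div', id='recent_reviews')
    if reviews_div is None:
        return []
    # str.split() with no arguments splits on any whitespace run.
    return [' '.join(p.get_text().split()) for p in reviews_div.find_all('p')]


reviews = review_texts(BeautifulSoup(SAMPLE_HTML, 'html.parser'))
print(reviews)
```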

Did you see how easy it was?

Now that you know how to work with the library, why don't you head over to the documentation page and start working on your own project?

Feel free to post any questions, comments or suggestions you might have, and yes, don't forget to share this post. Have fun!

Wednesday, 3 June 2015

Fibonacci Sequence: Recursive and Iterative solutions (Dynamic Programming)

Posted By: Unknown - 10:37
While studying algorithms, you'll definitely encounter the problem of calculating the Fibonacci sequence. Here's a small post showing you 3 methods to do it (Recursive, Top-down and Bottom-up), along with their time and space complexities.

Here we go.

Recursive

def FibonacciRec(n):

    if (n == 0 or n == 1):
        return n

    return FibonacciRec(n - 1) + FibonacciRec(n - 2)

Time complexity: O(2^n)
Space complexity: O(n)
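
To get a feel for that exponential growth, here's a small sketch that counts how many calls the naive recursion makes. fib_calls is a hypothetical helper written for this illustration; it follows the same recurrence as FibonacciRec.

```python
def fib_calls(n):
    """Return (fib(n), number of calls made) for the naive recursion."""
    if n == 0 or n == 1:
        return n, 1
    a, calls_a = fib_calls(n - 1)
    b, calls_b = fib_calls(n - 2)
    # One call for this invocation plus everything the two branches made.
    return a + b, calls_a + calls_b + 1


for n in (10, 20):
    value, calls = fib_calls(n)
    print(n, value, calls)
```

Going from n = 10 to n = 20 takes the call count from 177 to 21891, which is why this version becomes unusable quickly.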

The recursive solution above is a naive approach. The following two approaches are better ways to do the same thing. I've implemented them in a single class, FibonacciCalc.

Top-down (Memoization) and Bottom-up (Dynamic Programming)

class FibonacciCalc:

    def __init__(self):
        self.memory = {}


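    def FibonacciTopDown(self, n):
        # Base cases match FibonacciRec: F(0) = 0, F(1) = 1.
        if (n == 0 or n == 1):
            return n

        # Reuse the value if it has already been computed.
        if (n in self.memory):
            return self.memory[n]

        ret_value = self.FibonacciTopDown(n - 1) + self.FibonacciTopDown(n - 2)

        self.memory[n] = ret_value

        return self.memory[n]


    def FibonacciBottomUp(self, n):
        # Base cases match FibonacciRec: F(0) = 0, F(1) = 1.
        if (n == 0 or n == 1):
            return n

        # prev holds F(i - 2) and current holds F(i - 1) at the start of each step.
        prev, current = 0, 1

        for i in range(2, n + 1):
            tmp = current
            current += prev
            prev = tmp

        return current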

Time and space complexity for the Top-down approach: O(n) for both.

Time and space complexity for the Bottom-up approach: O(n) and O(1) respectively.
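
If you don't want to manage the cache dictionary yourself, the standard library's functools.lru_cache can do the memoization for you. This is just an alternative sketch, not the class above. fib_cached is a hypothetical name, and it uses the same convention as FibonacciRec (F(0) = 0, F(1) = 1).

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fib_cached(n):
    """Top-down Fibonacci; lru_cache stores each result after the first call."""
    if n < 2:
        return n
    return fib_cached(n - 1) + fib_cached(n - 2)


print(fib_cached(10))
print(fib_cached(50))
```

Like the hand-written top-down version, this still recurses n levels deep on the first call, so very large n can hit Python's recursion limit.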

There you go.
Have fun!

Sunday, 8 February 2015

Pushbullet API: Pushing stuff with Python

Posted By: Unknown - 10:20

Hey, it has been a long time since I last posted. So here I am with something simple but useful for you guys out there. If you're wondering what Pushbullet is, well, here's what it is:

"Pushbullet bridges the gap between your phone, tablet, and computer, enabling them to work better together. From seeing your phone's notifications on your computer, to easily transferring links, files, and more between devices, Pushbullet saves you time by making what used to be difficult or impossible, easy." - Pushbullet.com

You should definitely give it a try.

Now that we know how awesome it is, let's create a simple script to retrieve user/device info and push a link to all devices using the Pushbullet API. All you need is the access token of your account registered with the Pushbullet server. Go to your Account Settings to get it.

PushbulletAPI_Devices.py
__author__ = 'codingthevoid'

import json
import urllib2

# Access Token
pbActionToken = ''

# Target URL to get all devices
targetUrl = 'https://api.pushbullet.com/v2/devices'

# Setup required HTTP request headers
headers = { 'Content-Type' : 'application/json', 'Authorization' : 'Bearer ' + pbActionToken }

# Create an HTTP request object
reqObj = urllib2.Request(targetUrl, None, headers)

# Open the target URL and get the response.
response = urllib2.urlopen(reqObj)

# Decode the JSON response and print it
print(json.loads(response.read()))

... and here's how to push a link to all the devices connected to your account

PushbulletAPI_PushLink.py
__author__ = 'codingthevoid'

import json
import urllib2

# Access Token
pbActionToken = ''

# Target URL to push stuff to all devices
targetUrl = 'https://api.pushbullet.com/v2/pushes'

# Setup required HTTP request headers
headers = { 'Content-Type' : 'application/json', 'Authorization' : 'Bearer ' + pbActionToken }

# Pushing a link with title "My URL title!", message "Some message", and the url "http://codingthevoid.blogspot.in/"
values = { 'type' : 'link', 'title' : 'My URL title!', 'body' : 'Some message', 'url' : 'http://codingthevoid.blogspot.in/' }

# JSON encode the request body.
jsonEncodedValues = json.dumps(values)

# Create an HTTP POST request object
reqObj = urllib2.Request(targetUrl, jsonEncodedValues, headers)

# Open the target URL and get the response.
response = urllib2.urlopen(reqObj)

# Decode the JSON response and print it
print(json.loads(response.read()))
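
The two scripts above target Python 2, where urllib2 exists. On Python 3 that module is gone and its pieces live in urllib.request. Here's a rough sketch of what the link push might look like there. build_link_push is a hypothetical helper name, and the block only builds the request, so you can inspect it without a token or a network connection.

```python
import json
from urllib import request


def build_link_push(access_token, title, body, url):
    """Build (but do not send) a POST request that pushes a link."""
    payload = {'type': 'link', 'title': title, 'body': body, 'url': url}
    headers = {
        'Content-Type': 'application/json',
        'Authorization': 'Bearer ' + access_token,
    }
    # In Python 3 the request body must be bytes, so encode the JSON string.
    data = json.dumps(payload).encode('utf-8')
    return request.Request('https://api.pushbullet.com/v2/pushes',
                           data=data, headers=headers, method='POST')


req = build_link_push('<access_token>', 'My URL title!', 'Some message',
                      'http://codingthevoid.blogspot.in/')
print(req.get_method(), req.full_url)

# To actually send it (needs a real token and network access):
# with request.urlopen(req) as response:
#     print(json.loads(response.read().decode('utf-8')))
```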

Just set pbActionToken to your access token and you're good to go. Feel free to extend this and implement other features the Pushbullet API has to offer. Don't forget to share. Enjoy!

Copyright © 2013 Coding The Void™ is a registered trademark.