Coding The Void

Latest Updates

Friday 5 June 2015

Beautiful, Beautifulsoup, HTML, Parse, python, scrape, scraping, Soup, web

Beautiful Soup: Parsing HTML web pages in Python

Posted By: Unknown - 00:50

Want to scrape or parse an HTML or XML document? You've come to the right place! In this short tutorial, I'll show you how to use the Beautiful Soup library (which is awesome, by the way) to parse HTML or XML documents with great ease.

Let's get started.

Beautiful Soup - Introduction

Here's what they say about it:

You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it's been saving programmers hours or days of work on quick-turnaround screen scraping projects.

It's a Python library that lets you quickly and easily scrape web pages for information. You can capture all the links on a web page, extract any specific text, and do more crazy stuff with great ease!

Beautiful Soup - Installation

Using the pip tool, install the beautifulsoup4 package.

pip install beautifulsoup4

Please refer to the Beautiful Soup installation guide if you face any problems while installing it.

Parsing web pages - Beautiful Soup in action!

Now that you have installed everything you need, let me show you how awesome this library actually is. For this tutorial, I'll be targeting the homepage of the Html Agility Pack library (which is another great library for HTML scraping!).

Parsing HTML web pages in C# with Html Agility Pack

Here's what we will parse:

Title
All links.
Current release version.
Total downloads.
Reviews from the "MOST HELPFUL REVIEWS" section.

Let's start, shall we?

I'm using urllib to send the HTTP request to the server. Once the source code is read, we create a BeautifulSoup object. Here's how:

from urllib import request
from bs4 import BeautifulSoup

response = request.urlopen('https://htmlagilitypack.codeplex.com/').read()

# Create a BeautifulSoup object, naming the parser explicitly.
soupObj = BeautifulSoup(response, 'html.parser')

Now that we have our object, let's extract some useful information.

Title

# Title tag
print(soupObj.title)

# Text inside the title tag.
print(soupObj.title.text)

All links

# Printing href attribute value for each a tag.
for link in soupObj.find_all('a'):
    print(link.get('href'))
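
For comparison, here is a rough sketch of the same link extraction using only Python's standard library. It is not part of Beautiful Soup; the LinkCollector class and the SAMPLE_HTML string are made up for illustration, and the page is a hard-coded string so nothing is downloaded.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a missing href becomes None,
        # which mirrors what link.get('href') returns in Beautiful Soup.
        if tag == 'a':
            self.links.append(dict(attrs).get('href'))


# Made-up sample markup, not taken from the real page.
SAMPLE_HTML = (
    '<p><a href="/releases">Releases</a> '
    '<a name="top">Top</a> '
    '<a href="https://example.com/">Example</a></p>'
)

collector = LinkCollector()
collector.feed(SAMPLE_HTML)

for href in collector.links:
    print(href)
```

That is a fair amount of ceremony for something find_all('a') does in one line, which is pretty much the point of using Beautiful Soup.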

The above two operations are very basic. Let's now try getting something more interesting, like the current version of the library.

Current stable version and total downloads

By inspecting the element (right click > Inspect element) we can see that it is a table (the <table> tag) that lies inside a div tag whose id equals current_rating. Our required data (version and total downloads) sits in rows of that table (the <tr> tags): the 1st row contains the current version and the 4th row contains the total downloads. Every row contains two tags, <th> and <td>, and our data is inside the <td> tag.

We know enough, so let's start by selecting the div tag.

ratingsDiv = soupObj.find('div', id='current_rating')

Now that we have our required div tag, we select the 1st row relative to this div tag. The <td> tag inside this selected row contains our required data. Here's how to extract it:

# Selecting the first row inside the div tag.
firstRow = ratingsDiv.find_all('tr')[0]

# Selecting and printing the text inside td tag.
currVersion = firstRow.find('td')
print(currVersion.text.strip())

Similarly, selecting the 4th row relative to the div tag and extracting the content inside the <td> tag gives us the total number of downloads.

# Selecting the fourth row inside the div tag.
fourthRow = ratingsDiv.find_all('tr')[3]

totalDownloads = fourthRow.find('td')
print(totalDownloads.text.strip())

Most Helpful Reviews

Using the same technique (right click > Inspect element) we see that the reviews in the "MOST HELPFUL REVIEWS" section are inside a div tag whose id equals recent_reviews. The two reviews shown are inside <p> tags.

Selecting the div:

reviewsDiv = soupObj.find('div', id='recent_reviews')

Printing all the reviews inside the p tags:

for para in reviewsDiv.find_all('p'):
    print(para.text.strip())

Did you see how easy it was?

Now that you know how to work with the library, why don't you head over to the documentation page and start working on your own project?

Feel free to post any questions, comments or suggestions you might have, and yes, don't forget to share this post. Have fun!

Wednesday 3 June 2015

Algorithm, Complexity, DP, Dynamic Programming, Fibonacci, Iteration, Iterative, python, Recursion, Recursive, Sequence, Series

Fibonacci Sequence: Recursive and Iterative solutions (Dynamic Programming)

Posted By: Unknown - 10:37

While studying algorithms, you'll definitely encounter the problem of calculating the Fibonacci sequence. Here's a small post showing you 3 methods to do it (Recursive, Top-down and Bottom-up), along with their time and space complexities.

Here we go.

Recursive

def FibonacciRec(n):

    if (n == 0 or n == 1):
        return n

    return FibonacciRec(n - 1) + FibonacciRec(n - 2)

Time complexity: O(2^n)
Space complexity: O(n)

The recursive solution above is the naive approach. The following two approaches are better ways to do the same thing. I've implemented them in a single class, FibonacciCalc.

Top-down (Memoization) and Bottom-up (Dynamic Programming)

class FibonacciCalc:

    def __init__(self):
        self.memory = {}


    def FibonacciTopDown(self, n):
        if (n == 0 or n == 1):
            return n

        if (n in self.memory):
            return self.memory[n]
        
        ret_value = self.FibonacciTopDown(n - 1) + self.FibonacciTopDown(n - 2)
        
        self.memory[n] = ret_value

        return self.memory[n]


    def FibonacciBottomUp(self, n):
        if (n == 0 or n == 1):
            return n

        prev, current = 0, 1

        for i in range(2, n + 1):
            tmp = current
            current += prev
            prev = tmp

        return current

Time and space complexity for the Top-down approach: O(n)

Time and space complexity for the Bottom-up approach: O(n) and O(1) respectively.
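
If you'd rather not manage the memory dictionary yourself, here is a small sketch of the top-down idea using functools.lru_cache from the standard library, next to a compact bottom-up loop. The names fib_cached and fib_iterative are mine, and both use the same convention as FibonacciRec above (F(0) = 0, F(1) = 1).

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fib_cached(n):
    """Top-down: lru_cache stores each result, playing the role of the memory dict."""
    if n < 2:
        return n
    return fib_cached(n - 1) + fib_cached(n - 2)


def fib_iterative(n):
    """Bottom-up: only the last two values are kept, so extra space is O(1)."""
    prev, current = 0, 1
    for _ in range(n):
        prev, current = current, prev + current
    return prev


first_ten = [fib_cached(i) for i in range(10)]
print(first_ten)

# Both approaches should agree on every input.
print(all(fib_cached(i) == fib_iterative(i) for i in range(50)))
```

Keep in mind that the cached version still recurses, so a very large n can hit Python's recursion limit, while the loop cannot.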

There you go.
Have fun!

Sunday 8 February 2015

API, automation, code, coding, notification, push, pushbullet, python, upload

Pushbullet API: Pushing stuff with Python

Posted By: Unknown - 10:20

Hey, it has been a long time since I posted anything. So, here I am with something simple but useful for you guys out there. If you're wondering what Pushbullet is, well, here's what it is:

"Pushbullet bridges the gap between your phone, tablet, and computer, enabling them to work better together. From seeing your phone's notifications on your computer, to easily transferring links, files, and more between devices, Pushbullet saves you time by making what used to be difficult or impossible, easy." - Pushbullet.com

You should definitely give it a try.

Now that we know how awesome it is, let's create a simple script to retrieve user/device info and push a link to all devices using the Pushbullet API. All you need is the access token of your account registered with the Pushbullet server. Go to your Account Settings to get it.

PushbulletAPI_Devices.py

__author__ = 'codingthevoid'

import json
import urllib2

# Access Token
pbActionToken = '<access_token>'

# Target URL to get all devices
targetUrl = 'https://api.pushbullet.com/v2/devices'

# Setup required HTTP request headers
headers = { 'Content-Type' : 'application/json', 'Authorization' : 'Bearer ' + pbActionToken }

# Create a HTTP request object
reqObj = urllib2.Request(targetUrl, None, headers)

# Open the target URL and get the response.
response = urllib2.urlopen(reqObj)

# Decode the JSON response and print it
print(json.loads(response.read()))

... and here's how to push a link to all the devices connected to your account

PushbulletAPI_PushLink.py

__author__ = 'codingthevoid'

import json
import urllib2

# Access Token
pbActionToken = '<access_token>'

# Target URL to push stuff to all devices
targetUrl = 'https://api.pushbullet.com/v2/pushes'

# Setup required HTTP request headers
headers = { 'Content-Type' : 'application/json', 'Authorization' : 'Bearer ' + pbActionToken }

# Pushing a link with title "My URL title!", message "Some message", and the url "http://codingthevoid.blogspot.in/"
values = { 'type' : 'link', 'title' : 'My URL title!', 'body' : 'Some message', 'url' : 'http://codingthevoid.blogspot.in/' }

# JSON encode the request body.
jsonEncodedValues = json.dumps(values)

# Create a HTTP POST request object
reqObj = urllib2.Request(targetUrl, jsonEncodedValues, headers)

# Open the target URL and get the response.
response = urllib2.urlopen(reqObj)

# Decode the JSON response and print it
print(json.loads(response.read()))

Just replace "<access_token>" with your access token and you're good to go. Feel free to extend this and implement other features the Pushbullet API has to offer you. Don't forget to share. Enjoy!
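
The scripts above use urllib2, which only exists in Python 2. Here is a rough Python 3 sketch of the link push using urllib.request. The build_push_request helper is a name I made up for illustration; the endpoint, headers and body fields are the same ones used above. The example only builds the request and leaves the actual sending commented out, so it runs without a real token.

```python
import json
from urllib import request

# Placeholder: replace with the access token from your Account Settings.
ACCESS_TOKEN = '<access_token>'


def build_push_request(token, title, body, url):
    """Builds (but does not send) a POST request that pushes a link."""
    headers = {
        'Content-Type': 'application/json',
        'Authorization': 'Bearer ' + token,
    }
    values = {'type': 'link', 'title': title, 'body': body, 'url': url}
    # In Python 3 the request body must be bytes, so encode the JSON text.
    payload = json.dumps(values).encode('utf-8')
    return request.Request('https://api.pushbullet.com/v2/pushes',
                           data=payload, headers=headers)


req = build_push_request(ACCESS_TOKEN, 'My URL title!', 'Some message',
                         'http://codingthevoid.blogspot.in/')

# Passing data makes urllib issue a POST instead of a GET.
print(req.get_method())

# To actually send it (needs a valid token and network access):
# with request.urlopen(req) as response:
#     print(json.loads(response.read().decode('utf-8')))
```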

Saturday 25 October 2014

C#, extraction, HAP, HTML, Html Agility Pack, scrape, scraping, website

Parse HTML with Html Agility Pack in C#

Posted By: Unknown - 03:35

Here's a short, simple tutorial on how to parse or scrape an HTML page. I'll be using Html Agility Pack to help me parse the document. One could use Regular Expressions as well, but let's not get into that at the moment.

Alright so, what is Html Agility Pack?

"This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams)."

In simple words, it allows you to parse HTML documents with ease, even when the HTML is malformed. Wow, that's impressive!

Let's start

Just for the sake of testing this awesome library, we're going to target its main page (http://htmlagilitypack.codeplex.com/) and extract the reviews from the "Most Helpful Reviews" section.

Create a new project and add a reference to the Html Agility Pack library. By the way, the one I used targets the .NET Framework 2.0.

Here's what my scraping function looks like (without any error handling):

Most helpful reviews function - C#

Yup, it's small and does the job pretty well. Don't worry yet, let me explain the code. I'm using XPath to navigate to my desired element (the review element in this case) here.

Let me break down the XPath query I used:

// means select all the specified elements no matter where they are present in the document.
//div means I want to select all div elements and I simply do not care where they are present in the root document.
[@id='recent_reviews'] means I want to select those div elements whose id equals 'recent_reviews'.
//p means I want to select all p elements inside the div elements (which we have already selected just above) without caring where they are located.

The SelectNodes method will return our desired elements list. We then simply iterate over each element and extract the inner text of the element (the p element in this case).
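
If you want to play with the same query without setting up a C# project, here is a small sketch in Python using xml.etree.ElementTree from the standard library, which understands a limited subset of XPath. This is not Html Agility Pack: ElementTree needs well-formed markup and a path that starts with a dot, and the SAMPLE document below is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample; the real page is not fetched here.
SAMPLE = """
<html><body>
  <div id="other"><p>Not a review.</p></div>
  <div id="recent_reviews">
    <p>First review.</p>
    <span><p>Second review, nested deeper.</p></span>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# .//div                   -> every div, wherever it is below the root
# [@id='recent_reviews']   -> keep only divs whose id matches
# //p                      -> every p below those divs, at any depth
reviews = [p.text.strip()
           for p in root.findall(".//div[@id='recent_reviews']//p")]

for review in reviews:
    print(review)
```

Note how the second paragraph is still picked up even though it sits inside a span, because // does not care how deep the element is.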

In case you're wondering how I got the location of the element inside the document, Right click -> Inspect element (Google Chrome) works wonders ;)

Let me show you just that

Review elements location in the document.

As you can see, all p elements are located inside a div element whose id equals 'recent_reviews'.

I hope this now makes it a bit easier to understand the code. Feel free to leave any questions, comments, or suggestions you might have. Enjoy scraping!

Thursday 31 July 2014

anti hooking, API, bypass, C#, Dynamic, Hooking, usermode hooking, Win32 API, Windows

Dynamic Win32 API Call + Anti User-mode hooking [C#]

Posted By: Unknown - 06:53

When I was working with various user-mode hooking techniques, I developed this very simple way to bypass any trivial user-mode hooks applied to Win32 APIs.

Key features:

32-bit and 64-bit support.
Bypasses trivial user-mode hooks at runtime (IAT, Hotpatching, etc.)
Dynamically calls Win32 API by loading the library or module (if not already loaded) and getting the target function address at runtime.

To-do list / What doesn't work at the moment since I'm too lazy to implement it:

No support for Nt/Zw version APIs i.e. system calls. Don't use this to call them - unless you want your application to crash.
No support for functions not implementing the standard function prologue. (GetCurrentProcess, GetCurrentProcessId, etc.)

Here's the source code of the native DLL: Anti.cpp

Here's a simple wrapper class for using the DLL: DynamicApi.cs

Here's how you'd use it: SampleUsage.cs

Alright so, that's it for now. If you've any questions or suggestions, just leave them below in the comments section. Have fun! ;)

Wednesday 20 November 2013

Bitmap, C#, Fast, faster, GetPixel, Image, Pixel, Processing, QBitmap, SetPixel, Unsafe

QBitmap - MUCH faster image processing [C#]

Posted By: Unknown - 05:59

As the GetPixel and SetPixel functions provided by the .NET framework are (so) slow, I had to write my own implementation of them. This Bitmap class allows you to access image pixels directly in memory (via unsafe code). Trust me, this is way faster!

Features:

Way faster than the core framework's implementation. (~ 10 times faster!)
Supports Format32bppArgb, Format24bppRgb.

To-do list:

Support for more image formats.
Suggestions?

Here it is, QBitmap.cs:

// http://codecav.blogspot.in/
public unsafe class QBitmap : IDisposable
    {
        private Bitmap _bitmap;
        private BitmapData _bitmapData;
 
        public int Width { get; private set; }
 
        public int Height { get; private set; }
 
        public PixelFormat PixelFormat { get; private set; }
 
        public QBitmap(Bitmap bitmap)
        {
            _bitmap = bitmap;
 
            Initialize();
        }
 
        private void Initialize()
        {
            if (_bitmap == null)
                throw new ArgumentNullException();
 
            Width = _bitmap.Width;
            Height = _bitmap.Height;
            PixelFormat = _bitmap.PixelFormat;
        }
 
        public void LockBits()
        {
            try
            {
                _bitmapData = _bitmap.LockBits(new Rectangle(0, 0, Width, Height), ImageLockMode.ReadWrite, PixelFormat);
            }
            catch (Exception e)
            {
                throw new Exception(e.Message);
            }
        }
 
        public void UnlockBits()
        {
            try
            {
                _bitmap.UnlockBits(_bitmapData);
            }
            catch (Exception e)
            {
                throw new Exception(e.Message);
            }
        }
 
        public Color GetPixel(int x, int y)
        {
            // Formats without an alpha channel are treated as fully opaque.
            byte a = 255, r, g, b;
            var pData = (byte*) _bitmapData.Scan0.ToPointer();
 
            switch (PixelFormat)
            {
                case PixelFormat.Format32bppArgb:
                    b = *(pData + (y * _bitmapData.Stride) + (x * 4));
                    g = *(pData + (y * _bitmapData.Stride) + (x * 4) + 1);
                    r = *(pData + (y * _bitmapData.Stride) + (x * 4) + 2);
                    a = *(pData + (y * _bitmapData.Stride) + (x * 4) + 3);
                    break;
 
                case PixelFormat.Format24bppRgb:
                    b = *(pData + (y * _bitmapData.Stride) + (x * 3));
                    g = *(pData + (y * _bitmapData.Stride) + (x * 3) + 1);
                    r = *(pData + (y * _bitmapData.Stride) + (x * 3) + 2);
                    break;
 
                // Support for more formats.
                case PixelFormat.Format8bppIndexed:
                case PixelFormat.Format4bppIndexed:
                case PixelFormat.Format1bppIndexed:
                default:
                    throw new NotSupportedException("The specified pixel format is not supported.");
            }
 
            return Color.FromArgb(a, r, g, b);
        }
 
        public void SetPixel(int x, int y, Color color)
        {
            var pData = (byte*) _bitmapData.Scan0.ToPointer();
 
            switch (PixelFormat)
            {
                case PixelFormat.Format32bppArgb:
                    *(pData + (y*_bitmapData.Stride) + (x*4)) = color.B;
                    *(pData + (y*_bitmapData.Stride) + (x*4) + 1) = color.G;
                    *(pData + (y*_bitmapData.Stride) + (x*4) + 2) = color.R;
                    *(pData + (y*_bitmapData.Stride) + (x*4) + 3) = color.A;
                    break;
 
                case PixelFormat.Format24bppRgb:
                    *(pData + (y*_bitmapData.Stride) + (x*3)) = color.B;
                    *(pData + (y*_bitmapData.Stride) + (x*3) + 1) = color.G;
                    *(pData + (y*_bitmapData.Stride) + (x*3) + 2) = color.R;
                    break;
 
                // Support for more formats.
                case PixelFormat.Format8bppIndexed:
                case PixelFormat.Format4bppIndexed:
                case PixelFormat.Format1bppIndexed:
                default:
                    throw new NotSupportedException("The specified pixel format is not supported.");
            }
        }
 
        private bool _disposed;
        public void Dispose()
        {
            if (_disposed)
                return;
 
            _bitmap.Dispose();
            _bitmap = null;
 
            _disposed = true;
        }
    }

Usage:

// QBitmap usage (http://codecav.blogspot.in/)
 
var image = (Bitmap) Image.FromFile(@"Path\To\Image.png");
 
using (var qBitmap = new QBitmap(image))
{
        qBitmap.LockBits();
 
        // Perform operations on the image.
 
        qBitmap.UnlockBits();
}
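
To make the pointer arithmetic in GetPixel and SetPixel easier to follow, here is a small Python sketch that emulates a locked bitmap with a plain bytearray. It is only an illustration: the function names are mine, the stride formula assumes rows are padded up to a multiple of four bytes, and treating pixels without an alpha byte as opaque is an assumption of this sketch.

```python
def padded_stride(width, bytes_per_pixel):
    """Bytes per row, assuming each row is padded up to a multiple of four bytes."""
    return (width * bytes_per_pixel + 3) // 4 * 4


def set_pixel(buf, stride, bytes_per_pixel, x, y, argb):
    """Writes one pixel; bytes are stored in B, G, R(, A) order as in QBitmap."""
    a, r, g, b = argb
    i = y * stride + x * bytes_per_pixel
    buf[i], buf[i + 1], buf[i + 2] = b, g, r
    if bytes_per_pixel == 4:
        buf[i + 3] = a


def get_pixel(buf, stride, bytes_per_pixel, x, y):
    """Reads one pixel back as an (a, r, g, b) tuple."""
    i = y * stride + x * bytes_per_pixel
    b, g, r = buf[i], buf[i + 1], buf[i + 2]
    a = buf[i + 3] if bytes_per_pixel == 4 else 255  # assumption: no alpha = opaque
    return (a, r, g, b)


# A 3x2 image at 3 bytes per pixel: 9 bytes of pixels per row, 12 with padding.
width, height, bytes_per_pixel = 3, 2, 3
stride = padded_stride(width, bytes_per_pixel)
buf = bytearray(stride * height)

set_pixel(buf, stride, bytes_per_pixel, 2, 1, (255, 10, 20, 30))
pixel = get_pixel(buf, stride, bytes_per_pixel, 2, 1)
print(stride, pixel)
```

This is why the class multiplies y by Stride rather than by Width times the pixel size: the padding bytes at the end of each row have to be skipped too.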

Feel free to ask/suggest anything. Have fun with it!

Tuesday 19 November 2013

Algorithm, Backtrack, Backtracking, C#, sudoku, Sudoku Solver

Sudoku Solver - Backtracking Algorithm

Posted By: Unknown - 03:09

I coded this a few weeks ago, so here it is. A simple, neat and highly customizable Sudoku solver (with a simple board) in C# based on the Backtracking algorithm.

In action:

Sudoku solver in C#

And here's the source code

SudokuBoard.cs:

// SudokuBoard.cs
public class SudokuBoard : GroupBox
{
    public sealed class SudokuCell : TextBox
    {
        public int X { get; private set; }
        public int Y { get; private set; }

        /// <summary>
        /// Thread-safe Get/Set Text.
        /// </summary>
        public string Value
        {
            get
            {
                string text = string.Empty;
                Invoke(new MethodInvoker(delegate { text = Text; }));
                return text;
            }
            set { Invoke(new MethodInvoker(delegate { Text = value; })); }
        }

        // Ctor.
        public SudokuCell(int x, int y, int width, int height)
        {
            X = x;
            Y = y;
            Width = width;
            Height = height;
            TextAlign = HorizontalAlignment.Center;
            Font = new Font("Verdana", 8, FontStyle.Regular);
        }

        // Only digits (except 0) are allowed.
        protected override void OnKeyPress(KeyPressEventArgs e)
        {
            if (char.IsWhiteSpace(e.KeyChar) ||
                (!char.IsDigit(e.KeyChar) && e.KeyChar != (char) 8) ||
                e.KeyChar == (char) 48)
                e.Handled = true;

            base.OnKeyPress(e);
        }
    }

    // An array representing the board.
    private readonly SudokuCell[,] _cell;
    private readonly int _cellWidth, _cellHeight;

    // Current board values. All operations/verifications are performed on this array.
    public readonly string[,] CurrBoardValues;

    public SudokuCell this[int x, int y]
    {
        get { return _cell[x, y]; }
    }

    // Ctor.
    public SudokuBoard(int cellWidth, int cellHeight)
    {
        _cellWidth = cellWidth;
        _cellHeight = cellHeight;
        Width = cellWidth * 9 + 20;
        Height = cellHeight * 9 + 28;
        _cell = new SudokuCell[9, 9];
        CurrBoardValues = new string[9, 9];
        CreateBoardControl();
    }

    // Creates the actual sudoku board control (an array of SudokuCell in a GroupBox.)
    private void CreateBoardControl()
    {
        int extraSpaceX, extraSpaceY = 0;
        for (int y = 0; y < 9; y++)
        {
            extraSpaceX = 0;
            if (y != 0 && y % 3 == 0)
                extraSpaceY += 3;

            for (int x = 0; x < 9; x++)
            {
                if (x != 0 && x % 3 == 0)
                    extraSpaceX += 3;

                _cell[x, y] = new SudokuCell(x, y, _cellWidth, _cellHeight)
                {
                    MaxLength = 1,
                    Location = new Point(Location.X + 7 + x * _cellWidth + extraSpaceX,
                                         Location.Y + 15 + y * _cellHeight + extraSpaceY)
                };
                _cell[x, y].TextChanged += OnSudokuCellTextChanged;
                Controls.Add(_cell[x, y]);
            }
        }
    }

    private void OnSudokuCellTextChanged(object sender, EventArgs eventArgs)
    {
        var cell = (SudokuCell) sender;
        TryMove(cell);
    }

    #region Public methods

    public void ClearBoard()
    {
        for (int y = 0; y < 9; y++)
        {
            for (int x = 0; x < 9; x++)
            {
                _cell[x, y].Value = CurrBoardValues[x, y] = string.Empty;
            }
        }
    }

    public void UpdateBoard()
    {
        for (int y = 0; y < 9; y++)
        {
            for (int x = 0; x < 9; x++)
            {
                _cell[x, y].Value = CurrBoardValues[x, y];
            }
        }
    }

    public bool TryMove(SudokuCell cell)
    {
        return TryMove(cell.X, cell.Y, cell.Value);
    }

    public bool TryMove(int x, int y, string value)
    {
        CurrBoardValues[x, y] = value;

        /* !IsMoveAllowedInRow(x, y, value) || !IsMoveAllowedInColumn(x, y, value) || !IsMoveAllowedInCurrentBox(x, y, value) */
        if (!IsMoveAllowed(x, y, value))
        {
            _cell[x, y].Value = string.Empty;
            CurrBoardValues[x, y] = string.Empty;
            return false;
        }

        return true;
    }

    #endregion

    #region Rules

    public bool IsMoveAllowed(int x, int y, string value)
    {
        return IsMoveAllowedInRow(x, y, value) &&
               IsMoveAllowedInColumn(x, y, value) &&
               IsMoveAllowedInCurrentBox(x, y, value);
    }

    private bool IsMoveAllowedInRow(int x, int y, string value)
    {
        for (int i = 0; i < 9; i++)
        {
            if (i == x)
                continue;

            if (CurrBoardValues[i, y] == value)
                return false;
        }

        return true;
    }

    private bool IsMoveAllowedInColumn(int x, int y, string value)
    {
        for (int i = 0; i < 9; i++)
        {
            if (i == y)
                continue;

            if (CurrBoardValues[x, i] == value)
                return false;
        }

        return true;
    }

    private bool IsMoveAllowedInCurrentBox(int x, int y, string value)
    {
        int boxX = 0, boxY = 0;

        if (x > 5) boxX = 6;
        else if (x > 2) boxX = 3;

        if (y > 5) boxY = 6;
        else if (y > 2) boxY = 3;

        for (int j = 0; j < 3; j++)
        {
            for (int i = 0; i < 3; i++)
            {
                if (boxX + i == x && boxY + j == y)
                    continue;

                if (CurrBoardValues[boxX + i, boxY + j] == value)
                    return false;
            }
        }

        return true;
    }

    #endregion
}

SudokuSolver.cs:

// SudokuSolver.cs
public class SudokuSolver
{
    private int _numOfTrials;
    private readonly SudokuBoard _board;
    private readonly AutoResetEvent _solvedEvent;

    public SudokuSolver(SudokuBoard board)
    {
        _board = board;
        _numOfTrials = 0;
        _solvedEvent = new AutoResetEvent(false);
    }

    #region Solve

    public void Solve()
    {
        var thread = new Thread(StartBacktrackSolver) { IsBackground = true };
        thread.Start();

        // Waiting for the solver thread to solve.
        _solvedEvent.WaitOne();

        _board.UpdateBoard();
        OnGameCompleted(_numOfTrials);
        _numOfTrials = 0;
    }

    private void StartBacktrackSolver()
    {
        BacktrackSolver(0, 0);
        _solvedEvent.Set();
    }

    // Backtrack solving algorithm.
    private bool BacktrackSolver(int x, int y)
    {
        if (x == 9)
        {
            if (y == 8)
                return true; // game solved!

            x = 0;
            y++;
        }

        // pre-solved value.
        if (!string.IsNullOrEmpty(_board.CurrBoardValues[x, y]))
        {
            return BacktrackSolver(x + 1, y);
        }

        // Trying every possible (1 - 9) value on each cell.
        for (int i = 1; i < 10; i++)
        {
            if (_board.IsMoveAllowed(x, y, i.ToString()))
            {
                _numOfTrials++;
                _board.CurrBoardValues[x, y] = i.ToString();

                // Moving on to the next cell.
                if (BacktrackSolver(x + 1, y))
                    return true; // game solved.

                // DEAD END reached.
            }
        }

        // DEAD END (no move was possible!)
        _board.CurrBoardValues[x, y] = string.Empty;
        return false;
    }

    #endregion

    public delegate void GameCompletedHandler(int numOfTrials);

    public event GameCompletedHandler GameCompleted;

    private void OnGameCompleted(int numOfTrials)
    {
        if (GameCompleted != null)
            GameCompleted(numOfTrials);
    }
}
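
If you just want to see the backtracking idea without the WinForms board around it, here is a compact sketch in Python. It is not a port of the classes above: the function names, the list-of-lists grid (0 means empty) and the sample PUZZLE are all choices made for this illustration.

```python
def is_allowed(grid, row, col, value):
    """True if value is not already in the given row, column or 3x3 box."""
    if value in grid[row]:
        return False
    if any(grid[r][col] == value for r in range(9)):
        return False
    box_row, box_col = 3 * (row // 3), 3 * (col // 3)
    return all(grid[box_row + r][box_col + c] != value
               for r in range(3) for c in range(3))


def solve(grid, pos=0):
    """Fills the grid in place, cell by cell; returns True once it is complete."""
    if pos == 81:
        return True  # walked past the last cell: solved
    row, col = divmod(pos, 9)
    if grid[row][col] != 0:
        return solve(grid, pos + 1)  # pre-filled cell, move on
    for value in range(1, 10):
        if is_allowed(grid, row, col, value):
            grid[row][col] = value
            if solve(grid, pos + 1):
                return True
            # Dead end further down: fall through and try the next value.
    grid[row][col] = 0  # no value fits here, undo and backtrack
    return False


# Sample puzzle chosen for illustration; '0' marks an empty cell.
PUZZLE = [
    "530070000",
    "600195000",
    "098000060",
    "800060003",
    "400803001",
    "700020006",
    "060000280",
    "000419005",
    "000080079",
]

grid = [[int(ch) for ch in line] for line in PUZZLE]
solved = solve(grid)

print(solved)
for row in grid:
    print(''.join(str(value) for value in row))
```

The structure is the same as in BacktrackSolver: skip given cells, try 1 to 9 on an empty one, recurse, and clear the cell again when nothing fits.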

Usage: http://pastebin.com/AK213QTN

Wanna try it? Here's just the executable:
http://www.mediafire.com/?umzad2b7jsgyjcg

Found a bug or something? Let me know.
Enjoy! ;)
