Saturday 25 October 2014

Parse HTML with Html Agility Pack in C#

Posted By: Unknown - 03:35

Here's a short, simple tutorial on how to parse or scrape an HTML page. I'll be using Html Agility Pack to help me parse the document. One could use regular expressions as well, but let's not get into that at the moment.

Alright so, what is Html Agility Pack?

"This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams)."

In simple words, it allows you to parse HTML documents with ease, even when the HTML is malformed. Wow, that's impressive!

Let's start

Just for the sake of testing this awesome library, we're going to target its main page (http://htmlagilitypack.codeplex.com/) and extract the reviews from the "Most Helpful Reviews" section.

Create a new project and add a reference to the Html Agility Pack library. By the way, the one I used targets the .NET Framework 2.0.

Here's what my scraping function looks like (without any error handling):
[Image: Most helpful reviews function - C#]
Yup, it's small and does the job pretty well. Don't worry yet, let me explain the code. I'm using XPath to navigate to my desired element (the review element in this case) here.

Let me break down the XPath query I used, //div[@id='recent_reviews']//p:

  • // means select all the specified elements no matter where they are in the document.
  • //div means I want to select all div elements, and I simply don't care where they are in the document.
  • [@id='recent_reviews'] means I want to select only those div elements whose id equals 'recent_reviews'.
  • //p means I want to select all p elements inside the div elements (which we have already selected just above) without caring how deeply they are nested.
The SelectNodes method will return our desired elements list. We then simply iterate over each element and extract the inner text of the element (the p element in this case).
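
The original listing was posted as an image, so it can't be quoted here. As a minimal sketch of the steps described above, assuming the query is the four pieces joined together (//div[@id='recent_reviews']//p), something like the following should work. The class and method names (ReviewScraper, ExtractReviews) are hypothetical, and unlike the original it parses a string and adds a null check, because SelectNodes returns null when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ReviewScraper
{
    // XPath assembled from the breakdown above (an assumption, since the
    // original listing was an image).
    private const string ReviewXPath = "//div[@id='recent_reviews']//p";

    // Returns the trimmed inner text of every matching <p> element,
    // or an empty list when the query matches nothing.
    public static List<string> ExtractReviews(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        List<string> reviews = new List<string>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(ReviewXPath);
        if (nodes == null)
        {
            return reviews; // SelectNodes yields null, not an empty collection
        }

        foreach (HtmlNode node in nodes)
        {
            reviews.Add(node.InnerText.Trim());
        }
        return reviews;
    }
}
```

To run the same query against a live page instead of a string, the document can be fetched with new HtmlWeb().Load(url) and its DocumentNode queried in the same way.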

In case you're wondering how I got the location of the element inside the document, Right click -> Inspect element (Google Chrome) works wonders ;)

Let me show you just that:
[Image: Review elements location in the document.]
As you can see, all p elements are located inside a div element whose id equals 'recent_reviews'.
I hope this now makes it a bit easier to understand the code. Feel free to leave any questions, comments, or suggestions you might have. Enjoy scraping!

Thursday 31 July 2014

Dynamic Win32 API Call + Anti User-mode hooking [C#]

Posted By: Unknown - 06:53
When I was working with various user-mode hooking techniques, I developed this very simple way to bypass any trivial user-mode hooks applied to Win32 APIs.

Key features:
  • 32-bit and 64-bit support.
  • Bypasses trivial user-mode hooks at runtime (IAT, hotpatching, etc.).
  • Dynamically calls Win32 API by loading the library or module (if not already loaded) and getting the target function address at runtime.
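
The last feature, resolving a function address at runtime, can be sketched in plain C#. This is not the DynamicApi.cs wrapper from this post and it does nothing about hooks; it only illustrates the general load-then-resolve pattern using the documented kernel32 functions GetModuleHandle, LoadLibrary and GetProcAddress together with Marshal.GetDelegateForFunctionPointer. The names DynamicCallSketch, Resolve and TickCount are hypothetical, and the code is Windows-only.

```csharp
using System;
using System.Runtime.InteropServices;

public static class DynamicCallSketch
{
    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern IntPtr GetModuleHandle(string moduleName);

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern IntPtr LoadLibrary(string fileName);

    // GetProcAddress only exists in an ANSI form, hence ExactSpelling.
    [DllImport("kernel32.dll", CharSet = CharSet.Ansi, ExactSpelling = true, SetLastError = true)]
    private static extern IntPtr GetProcAddress(IntPtr module, string procName);

    // Managed signature matching DWORD GetTickCount(void).
    [UnmanagedFunctionPointer(CallingConvention.Winapi)]
    private delegate uint GetTickCountDelegate();

    // Hypothetical helper: reuse the module if it is already loaded,
    // load it otherwise, then wrap the export in a delegate.
    public static Delegate Resolve(string module, string function, Type delegateType)
    {
        IntPtr handle = GetModuleHandle(module);
        if (handle == IntPtr.Zero)
        {
            handle = LoadLibrary(module);
        }
        if (handle == IntPtr.Zero)
        {
            throw new DllNotFoundException(module);
        }

        IntPtr address = GetProcAddress(handle, function);
        if (address == IntPtr.Zero)
        {
            throw new EntryPointNotFoundException(function);
        }

        return Marshal.GetDelegateForFunctionPointer(address, delegateType);
    }

    // Example call: resolve and invoke kernel32!GetTickCount at runtime.
    public static uint TickCount()
    {
        GetTickCountDelegate fn = (GetTickCountDelegate)Resolve(
            "kernel32.dll", "GetTickCount", typeof(GetTickCountDelegate));
        return fn();
    }
}
```

The delegate type has to match the native signature exactly; a mismatch here corrupts the stack or crashes rather than throwing a managed exception.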
To-do list / What doesn't work at the moment, since I'm too lazy to implement it:
  • No support for Nt/Zw version APIs i.e. system calls. Don't use this to call them - unless you want your application to crash.
  • No support for functions not implementing the standard function prologue. (GetCurrentProcess, GetCurrentProcessId, etc.)
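
The post doesn't spell out what "standard function prologue" means. Assuming it refers to the common 32-bit x86 hot-patchable prologue (mov edi, edi / push ebp / mov ebp, esp, i.e. the bytes 8B FF 55 8B EC), a check for it could look like the sketch below. The class and method names are hypothetical, this is not taken from Anti.cpp, and 64-bit prologues look different.

```csharp
using System;

public static class PrologueCheck
{
    // Assumed "standard" 32-bit prologue:
    //   8B FF   mov edi, edi
    //   55      push ebp
    //   8B EC   mov ebp, esp
    private static readonly byte[] HotPatchPrologue = { 0x8B, 0xFF, 0x55, 0x8B, 0xEC };

    // Returns true when the first bytes of a function match the prologue above.
    public static bool StartsWithHotPatchPrologue(byte[] code)
    {
        if (code == null || code.Length < HotPatchPrologue.Length)
        {
            return false;
        }
        for (int i = 0; i < HotPatchPrologue.Length; i++)
        {
            if (code[i] != HotPatchPrologue[i])
            {
                return false;
            }
        }
        return true;
    }
}
```

A very short stub such as or eax, -1 / ret (bytes 83 C8 FF C3), the kind of body a function like GetCurrentProcess may compile to, fails this check, which is consistent with the limitation listed above.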


Here's the source code of the native DLL: Anti.cpp

Here's a simple wrapper class for using the DLL: DynamicApi.cs

Here's how you'd use it: SampleUsage.cs

Alright so, that's it for now. If you've any questions or suggestions, just leave them below in the comments section. Have fun! ;)

Copyright © 2013 Coding The Void™ is a registered trademark.
