Here's a short, simple tutorial on how to parse or scrape an HTML page. I'll be using the Html Agility Pack to parse the document. One could use regular expressions as well, but let's not get into that right now.
Alright so, what is Html Agility Pack?
"This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams)."
In simple words, it allows you to parse HTML documents with ease, even malformed ones. Wow, that's impressive!
Let's start
Just for the sake of testing this awesome library, we're going to target its main page (http://htmlagilitypack.codeplex.com/) and extract the reviews from the "Most Helpful Reviews" section.
Create a new project and add a reference to the Html Agility Pack library. By the way, the one I used targets the .NET Framework 2.0.
Here's what my scraping function looks like (without any error handling):
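The original snippet isn't embedded here, but based on the description that follows (a `SelectNodes` call with the XPath `//div[@id='recent_reviews']//p`, then reading each node's inner text), a minimal sketch might look like this. The method name `GetReviews` is my own; the author's actual code may differ in the details:

```csharp
// Minimal sketch of the scraping function described in this post.
// Assumes a project reference to HtmlAgilityPack.
using System.Collections.Generic;
using HtmlAgilityPack;

class Scraper
{
    public static List<string> GetReviews(string url)
    {
        var reviews = new List<string>();

        // HtmlWeb downloads the page and parses it into an HtmlDocument.
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load(url);

        // Select every <p> element inside the <div id="recent_reviews"> element.
        var nodes = doc.DocumentNode.SelectNodes("//div[@id='recent_reviews']//p");

        // SelectNodes returns null (not an empty list) when nothing matches.
        if (nodes != null)
        {
            foreach (HtmlNode node in nodes)
                reviews.Add(node.InnerText.Trim());
        }
        return reviews;
    }
}
```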
Yup, it's small and does the job pretty well. Don't worry, let me explain the code. I'm using XPath to navigate to the desired elements (the review elements, in this case).
Let me break down the XPath query I used:
- // means select all matching elements no matter where they appear in the document.
- //div means I want to select all div elements, and I simply do not care where they are located in the document.
- [@id='recent_reviews'] means I want only those div elements whose id equals 'recent_reviews'.
- //p means I want to select all p elements inside the div elements we have already selected, again without caring where exactly they are located.
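To see the breakdown in action, here's a small self-contained demo that runs the same query against a made-up inline HTML snippet (the markup is my own example, not the actual page):

```csharp
// Demonstrating the XPath breakdown on a small inline document.
// Assumes a project reference to HtmlAgilityPack.
using System;
using HtmlAgilityPack;

class XPathDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(@"<html><body>
            <div id='recent_reviews'>
                <div><p>Great library!</p></div>
                <p>Saved me hours of work.</p>
            </div>
            <p>Unrelated paragraph outside the div.</p>
        </body></html>");

        // Matches both <p> elements inside the div (at any depth),
        // but not the <p> outside it.
        var nodes = doc.DocumentNode.SelectNodes("//div[@id='recent_reviews']//p");
        foreach (HtmlNode node in nodes)
            Console.WriteLine(node.InnerText.Trim());
    }
}
```

Note that the nested `<p>` is still matched: `//p` walks all descendants of the selected div, not just its direct children.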
The SelectNodes method returns the list of matching nodes (or null if nothing matches). We then simply iterate over each node and extract its inner text (the text of each p element, in this case).
In case you're wondering how I found the location of the element inside the document: Right click -> Inspect element in Google Chrome works wonders ;)
As you can see, all the p elements we want are located inside a div element whose id equals 'recent_reviews'.
I hope this now makes it a bit easier to understand the code. Feel free to leave any questions, comments, or suggestions you might have. Enjoy scraping!