When parsing HTML, the recommended approach is to use XPath queries or DOM traversal to get the desired elements. Finding the XPath to a specific set of elements can involve a bit of trial and error, but both Firebug and the Chrome Developer Tools offer utilities to help with this process.
Using Firebug, simply select the node you want with Inspect, then right-click the selected node and click “Copy XPath”:
The process is exactly the same in the Chrome Developer Tools:
Both will copy an XPath selector to your clipboard.
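The same kind of evaluation can also be done outside the browser. Below is a minimal sketch using Python's standard-library ElementTree, with made-up markup and URLs (they are illustrative assumptions, not any real page's structure); note that ElementTree only supports a subset of XPath 1.0:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for a real page.
html = """
<div id="content">
  <ul>
    <li><a href="https://example.com/first">First</a></li>
    <li><a href="https://example.com/second">Second</a></li>
  </ul>
</div>
"""
root = ET.fromstring(html)

# A copied XPath usually pins down a single element, e.g. the
# second list item's link (ElementTree uses a relative .// prefix).
link = root.find(".//ul/li[2]/a")
print(link.get("href"))  # https://example.com/second
```

For full XPath 1.0 support outside the browser, the third-party lxml library is the usual choice.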
Scraping Google Search Results
Let’s take another example: we want to extract the links to all sites from a regular Google search.
First, issue a regular search. Then open Firebug or the Chrome Developer Tools, right-click the a-element of any SERP item and click “Copy XPath”. I got*
Evaluate this XPath in your console using the $x() method (both Firebug and Chrome support it). That will yield exactly one element.
The next step is to generalize this query to match all elements. Each SERP item is a list item, and our query contains li with an index (I chose the fifth SERP item). Simply removing the indexer makes the query match all list items. Our query is now:
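The generalization step can be sketched with the same standard-library approach; the markup and URLs below are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical list markup, loosely mimicking a results page.
html = """
<ol>
  <li><a href="https://example.com/a">A</a></li>
  <li><a href="https://example.com/b">B</a></li>
  <li><a href="https://example.com/c">C</a></li>
</ol>
"""
root = ET.fromstring(html)

# Indexed: matches only the second list item's link.
one = root.findall(".//li[2]/a")
# Index removed: matches the link in every list item.
all_links = root.findall(".//li/a")
print(len(one), len(all_links))  # 1 3
```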
This gives us all a-elements. However, we want the actual URLs, so we apply an attribute selector to retrieve them:
As we can see, this query returns the href attribute for every SERP item. Easy, right?
I recommend the XPath Checker extension for Firefox to simplify this process even further; it makes it much easier to evaluate queries and see the matching results.
*Note that Google’s markup can change depending on which datacenter you end up on.