A conventional web scraping approach would be to use a regular expression or Xpath based scraping tool that would allow you to conveniently select an HTML element via a browser interface, capture its Xpath and then configure the scraper to extract all elements which match this Xpath for an input set of URLs. Along with Xpath, most of these tools also provide an alternate scraping mechanism wherein users enter the regular expression which is to be matched with the source codes of each individual URL from the input set of pages. These pages can either be derived from a third party crawler which runs through your entire website to return a sitemap comprising of all URLs or this functionality is directly built into the scraping software itself.
However, most of us belonging to the SEO industry encounter limited scraping requirements and would prefer a simple macro or formula based excel sheet that would do part of the same job rather than being overwhelmed by the complex interfaces presented by most of the conventional scraping tools. Thus by employing a minimalistic approach, I have created a simple macro based excel sheet with two inbuilt functions which should suffice for most of our day to day average SEO scraping requirements:
Please note that in both of the below mentioned functions, the arguments can either be passed as quoted string constants or cell references holding the input values.
Regexecute(#input string#, #regex #, #index#, #delimiter#, #replacement regex#)
The input parameters have been explained below:
#input string# – This argument would hold the input string on which the regular expression is to be applied
#regex# – The regex which will be applied to the input string to derive the match collection
#index# – In cases where the regular expression matches multiple string instances from the input text, the number specified (say 'n') in the #index# argument will return only the nth matched element
#delimiter# – This is an optional parameter and when specified, the function will return all matched instances each separated by a delimiter as specified in the #delimiter# argument
#replacement regex# – This, again is an optional parameter and the regular expression string passed on to this argument will be matched with the output of the first regex match and all matched instances will be replaced with blanks (""). This is mainly useful when you want to weed out unwanted characters from the end string and return just the exact text value to be extracted
URLregexScrape(#url#, #regex #, #index#, #delimiter#)
This function is exactly similar to the first except that the input string is derived from the HTML source code of the #url# parameter which is meant to hold the page URL to be scraped
Take the example of a massive ecommerce website like flipkart.com hosting innumerable products, each having a list of attributes which are not accessible via the CMS (or you do not have access to the site's backend). Now, if I would like to scrape a product attribute like the number of user ratings received for the product and this is to be extracted for all product pages on the website (please see snapshot below):
The ideal way to go about this would be to first derive the list of all Flipkart product pages by running the website through a crawler like Xenu and then filter out just the product page URLs from the output after applying a suitable filter. As can be seen from the snapshot above, move on to inspecting the source code of the HTML element holding the rating count. In our example, the count is held within the span tag, as shown below:
Now, to extract the user rating count '112' from the above HTML tag, a regular expression of the form 'ratingCount(.*)</span>' will scrape out a sub string from the source code which would appear as:
'ratingCount" >112</span> Ratings | <a href="#read-reviews"><span itemprop="reviewCount">39</sp'
The equivalent function call for this would be:
Here, A2 is the cell reference which holds the first product URL entry and '1' is the #index# which returns just the first matched instance. In this case, it is important to specify the #index# since there are two instances of ratings held within the similar HTML structures and we only want to extract the first instance which is the user rating count of the product. The second undesired instance is shown in the screenshot below (highlighted in red):
In case we'd like to extract the second instance which is the product retailer's rating, we just need to change the index to 2.
Finally, we would need to prune out the unwanted string components from the above mentioned sub string and this would only be possible after specifying the second optional parameter '#replacement regex#' which is the fifth argument to the function. The revised function call would appear as below:
In case we want all matched instances to be returned within a single cell, we would need to specify the #delimiter# parameter which has been left blank in this case (since we are only concerned with the first matched instance which is the user rating). Say for instance, we pass the delimiter as '^', the ouput would now be '112^142557' where '142557' is the seller rating.
The screenshot given below shows the above function being employed to derive both the user as well as seller rating in separate columns on the worksheet:
The above function can be used to scrape out virtually any on page element from list of pages on your website. A few regular expressions you can play around with to scrape some of the common on page SEO elements are given below:
|HTML Tag||Tag Regex (#regex #)||Replacement Regex (#replacement regex#)|
The first function 'Regexecute' can be particularly useful in cases where you need to apply complex pattern based filtering conditions on data which is already existent as text and need not be scraped from a website URL. A common application would be to apply filtering conditions on columns of data based on patterns matched by this function.