Data Scraper and the XPath to Sourcing Success

One of my pet peeves about recruiting-tool discussions is how shiny new objects get the bulk of the attention. This, of course, is not a problem exclusive to recruiting. Websites like Product Hunt and BetaList fill our feeds with bright, shiny new tools that can distract us from what we NEED. Spoiler alert: not all that glitters is 24-karat. The alternative is to let the test of time prevail and potentially miss out on a great new product. My goal is to give a more in-depth review of one tool, with at least one real-world example of how it can save you time.


Data Scraper/Data Miner (both names appear on its menus and web pages) is similar in look and feel to a few other Chrome extensions. You can right-click on a webpage element, and it will try to “Get Similar” data fields and create an orderly file for export.

[Screenshot: right-clicking an element to “Get Similar” data fields]

The uncommon features that make this free extension part of my regular arsenal require some understanding of scraping, but they are well worth exploring. First off, you can save Recipes with site-specific XPath or jQuery logic. This means that once you have mapped a website perfectly, you never have to do the hard work again. Saved recipes sync to your private cloud account, or they can be shared as Public Recipes. You can clone public recipes or customize them for private use. This is a great way for beginners to learn web scraping, since you can study the syntax used on popular sites to advance your own skills.

Here is a screenshot of the pop-up on Twitter. My “Private Recipes” appear above, while community-submitted ones are below. These can be rated with a thumbs up or down, and some have an “Example” link that previews the expected source page. You can click any of these formulas to begin extraction. Links to helpful videos and your personal data collections are found at the bottom.

[Screenshot: the Data Miner pop-up on Twitter, private recipes above and public recipes below]

But wait, there’s more! It handles multi-page (pagination) scraping with configurable delays and somewhat intelligently skips empty fields, shown in gray. This works surprisingly well. Since all the work is performed in your browser, this type of scraping is almost undetectable when done at a proper pace.

[Screenshot: pagination settings, with skipped empty fields shown in gray]
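Under the hood, pagination scraping boils down to a simple loop: extract the rows on the current page, find the “next” link, wait a moment, and repeat. Here is a minimal sketch of that idea using only Python’s standard library; the HTML snippets, URLs, and class names are invented for illustration and are not taken from Data Miner or any real site.

```python
import time
import xml.etree.ElementTree as ET

# Stand-in for fetching pages over the network; a real scraper would
# download these with urllib or similar instead of reading a dict.
PAGES = {
    "/speakers?page=1": "<html><body><ul><li>Ada</li><li>Grace</li></ul>"
                        "<a class='next' href='/speakers?page=2'>next</a></body></html>",
    "/speakers?page=2": "<html><body><ul><li>Alan</li></ul></body></html>",
}

def scrape_all(start, delay=1.0):
    """Follow 'next' links page by page, pausing between requests."""
    url, rows = start, []
    while url:
        root = ET.fromstring(PAGES[url])
        rows += [li.text for li in root.findall(".//li")]
        nxt = root.find(".//a[@class='next']")  # None on the last page
        url = nxt.get("href") if nxt is not None else None
        if url:
            time.sleep(delay)                   # the "proper pace"
    return rows

print(scrape_all("/speakers?page=1", delay=0))  # ['Ada', 'Grace', 'Alan']
```

The delay between requests is the piece that matters most in practice: it is what keeps in-browser scraping at a pace indistinguishable from a human reader.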

Now that we have our data, we can export to CSV and convert to standard Excel for cleanup. A newer option is to use the Collections feature to perform even more tricks. Collections act as a mini database of all the data you have collected. In this case, I pulled the 90 names and links for speakers from bluetoothworldevent.com, and I can use the search box (far left) to search within the text of this private collection.

[Screenshot: a Collection of speaker names and links, with the search box at far left]

I was lazy, though, and did not grab their company and title, so now is my chance to fix it. In Chrome, you can right-click on most items on a website and select “Inspect element” to view where that item appears in the source code. Right-click the element again to “Copy XPath” (think of XPath as an address for an element within the page’s structure). In the larger screenshot, you can see the copied XPaths in Notepad and how the element changes color as you hover over the source code.

 

[Screenshot: copied XPaths in Notepad beside the highlighted element in Chrome’s source view]
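If you want to sanity-check a copied XPath before putting it in a recipe, you can test it against the page source in a few lines of Python. The markup below is an invented stand-in for a speaker card, not the real Bluetooth World page, and note that Python’s built-in `ElementTree` supports only a subset of XPath, with paths written relative to the root element rather than starting at `/html`.

```python
import xml.etree.ElementTree as ET

# Invented stand-in for a block of speaker cards on an event page.
HTML = """
<html><body>
  <div id="speakers">
    <div><h3>Jane Doe</h3><p>CTO, Example Corp</p></div>
    <div><h3>John Roe</h3><p>Engineer, Sample Inc</p></div>
  </div>
</body></html>
"""

root = ET.fromstring(HTML)

# Chrome might copy something like /html/body/div/div[1]/h3.
# ElementTree paths start at the root element, so drop the /html prefix:
print(root.find("./body/div/div[1]/h3").text)  # Jane Doe
print(root.find("./body/div/div[2]/p").text)   # Engineer, Sample Inc
```

Absolute paths like these are brittle, because any change to the page layout shifts the numbered steps; that is exactly why the next step is to generalize them.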

 

The best route to clean results is to find the common denominator. If you look closely, you can see that each XPath starts with the same base, with tiny changes at the end. We then add only those slight changes that define each unique element. This lets us test changes in real time from the extraction page. Here is the new Public Recipe for Bluetooth World with all data fields.
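The common-denominator idea can be sketched in code: one selector matches the repeating row, and short relative paths capture the small per-field differences. This is a hedged illustration with invented markup and class names, not the actual recipe syntax Data Miner stores.

```python
import xml.etree.ElementTree as ET

# Invented markup standing in for a list of speaker cards.
HTML = """
<html><body>
  <div class="speaker"><h3>Jane Doe</h3><span>Example Corp</span></div>
  <div class="speaker"><h3>John Roe</h3><span>Sample Inc</span></div>
</body></html>
"""

root = ET.fromstring(HTML)

ROW = ".//div[@class='speaker']"                 # the shared base
FIELDS = {"name": "./h3", "company": "./span"}   # the tiny changes at the end

# One record per row, one column per relative field path.
records = [
    {col: row.find(path).text for col, path in FIELDS.items()}
    for row in root.findall(ROW)
]
print(records)
# [{'name': 'Jane Doe', 'company': 'Example Corp'},
#  {'name': 'John Roe', 'company': 'Sample Inc'}]
```

Splitting the base path from the field paths is what makes a recipe robust: if the site adds a speaker, nothing changes; if it restyles a field, you edit one short relative path instead of ninety absolute ones.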

[Screenshot: the finished Public Recipe for Bluetooth World with all data fields]

Lastly, there is a Beta version that can run concurrently with the original (link at the bottom of the tutorial page). It shows even more promising features in the pipeline that would usually require dedicated scripting or additional scraping tools.

[Screenshot: the Beta version of Data Miner]

Like any technology, you only learn by playing with it. Data Scraper/Data Miner works out of the box with many popular websites, but you will find the most benefit when you dive deep and create a few formulas of your own. I recommend it for anyone who wants to take their sourcing skills to the next level.

About the Author: Aaron Lintz is a Talent Sourcing Specialist with @Commvault Systems. Over the last decade, he has held corporate sourcing and agency recruiting roles, helped develop applicant tracking solutions, and managed email and social marketing programs. His passion for experimentation and automation, and his willingness to share, make him a natural sourcer.
Follow Aaron on Twitter @AaronLintz or connect with him on LinkedIn.


  • Data Miner

    Great article. 🙂

  • John Ricciardi

    Nice job Aaron! Thanks for the read.

  • Jan Bernhart

This is great stuff Aaron. I’ve played with it but can’t seem to get the pagination right. Their instruction video gave me more questions than answers. Did you happen to see any other instructions on how to do that? (For instance, I want to scrape all member names from http://www.meetup.com/underscore/members/?offset=20&sort=social&desc=1 . The URL part is simple, next page is just a change in offset=. But how to create a recipe with working pagination still puzzles me.)

• //li[@class='nav-pageitem selected']/following-sibling::li/a

      I found this from the public recipe, tested this, and it works!

  • Yves Greijn

    Nice stuff! Problem I have is how to properly extract the information (both a google linkedin scrape and separate from that a CSE scrape). How can I appoint certain information to a certain column when exporting? Thanks for your help!

• It does play nice with CSE, but you may want to write your own script. With the correct formatting, the metadata on LinkedIn separates these fields for easy extraction:
  role, org, location, URL, and thumbnail image if available.

If you format the CSE properly, it will pull the name cleanly from “gs-title” in the format “Name | LinkedIn.” That is easy to clean in Excel, or you can fix it by learning some basic XPath, which works much like Excel formulas. Nothing off the shelf works perfectly for all CSEs.

If you really want to expand this into a big project, contact Data Miner and I am sure they can solve it in an hour or so. https://data-miner.io/buy

      https://uploads.disquscdn.com/images/d573b321e431355476cf5f88966a1e0c1fd947c081e7852efd2bc19076706f4e.png

