
Web Scraping and Natural Language Understanding


This article is a deep dive into the techniques I used to analyze thousands of Glassdoor employee reviews. I have run this analysis for several purposes: to analyze the customer experience for Comcast Cable and to understand the fluctuations in employee sentiment for Verizon Wireless.

PLEASE COME BACK 10/18 for full article here!

Prerequisites:

FREE WEB SCRAPING

There are many free tools available for scraping web pages. The one I used in the LinkedIn article is a free Chrome plugin called Web Scraper. This tool is incredibly powerful. It takes some time to learn all of its features, such as scanning through multiple pages of data and clicking on "read more" buttons, but it can do it all! Here's a video by the author describing the use of the tool:

Web Scraper, by Martin Balodis (webscraper.io)

Here's a link directly to the Chrome Web Store where you can download the extension.

HOW I SCRAPED GLASSDOOR

In order to grab the free-form text reviews from employees of Comcast and Verizon, I built a script (or a "sitemap" in the parlance of webscraper.io). Here is my sitemap, which you can copy and paste directly into the tool:

Verizon Glassdoor Sitemap

Here's how you import the sitemap into your instance of Web Scraper. Make sure the developer toolbar is open and then do the following:

1. Click on "Import sitemap" under the "Create new sitemap" menu

2. Copy/paste the code above

3. Give your sitemap a name, such as "Verizon"

4. Click the Import Sitemap button
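For readers who have not seen one, a sitemap is just a JSON document. The sketch below is not my Glassdoor sitemap; it is a minimal, hypothetical example that assumes the usual Web Scraper export layout (an `_id`, a `startUrl` list, and a `selectors` list whose entries point at their parents). The URL, selector ids, and CSS selectors are placeholders, and the `build_sitemap` helper is my own illustration, so compare the output against a sitemap exported from your own installation before relying on it.

```python
import json


def build_sitemap(sitemap_id, start_url, selectors):
    """Assemble a Web Scraper style sitemap as a plain dict.

    Hypothetical helper: it only checks that every parent selector
    named in the list actually exists, then wraps everything up.
    """
    known_ids = {s["id"] for s in selectors}
    for s in selectors:
        for parent in s["parentSelectors"]:
            if parent != "_root" and parent not in known_ids:
                raise ValueError(
                    f"selector {s['id']!r} has unknown parent {parent!r}"
                )
    return {"_id": sitemap_id, "startUrl": [start_url], "selectors": selectors}


# Placeholder selectors: one element per review, one text field inside it.
selectors = [
    {
        "id": "review",
        "type": "SelectorElement",
        "parentSelectors": ["_root"],
        "selector": "li.review",        # placeholder CSS, not Glassdoor's
        "multiple": True,
    },
    {
        "id": "review_text",
        "type": "SelectorText",
        "parentSelectors": ["review"],
        "selector": "p.review-body",    # placeholder CSS, not Glassdoor's
        "multiple": False,
    },
]

sitemap = build_sitemap("verizon_example", "https://example.com/reviews", selectors)
sitemap_json = json.dumps(sitemap, indent=2)
print(sitemap_json)  # this text is what you would paste into the import box
```

The parent check is there because a selector that points at a missing parent is an easy mistake to make when editing a sitemap by hand.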

You now have my scraper and can scrape and then export to Excel. Look at webscraper.io's documentation to learn how to make changes to the scraper. I'll end this portion by showing you the sitemap I built for Glassdoor:

This is viewable in the Web Scraper tool under Sitemap -> Selector graph

To run the scraper just click on "Scrape":

and then "Export data as CSV" and open the file in Excel.

You will have to work with Excel to massage the data to your liking. But hold on, because I will share with you the first 50 records of the Verizon Wireless scrape, along with a free tool built into Excel that will let you score the sentiment of every one of the reviews and that you can use on your own text! It is free for up to 30,000 Watson inquiries, so have fun and read on!
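If you would rather do the cleanup in code than in Excel, here is a minimal Python sketch of the kind of massaging I mean. It assumes the export contains the bookkeeping columns `web-scraper-order` and `web-scraper-start-url` next to your own selector columns, and that the review text sits in a column called `review_text`; that column name and the `clean_rows` helper are hypothetical, so substitute your own selector id.

```python
import csv
import io

# Columns the scraper adds for its own bookkeeping (assumed names).
BOOKKEEPING = {"web-scraper-order", "web-scraper-start-url"}


def clean_rows(csv_text, text_column):
    """Drop bookkeeping columns, collapse whitespace, skip empty and duplicate reviews."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()
    cleaned = []
    for row in reader:
        # Collapse runs of spaces and line breaks inside a review into single spaces.
        text = " ".join((row.get(text_column) or "").split())
        if not text or text in seen:
            continue
        seen.add(text)
        kept = {k: v for k, v in row.items() if k not in BOOKKEEPING}
        kept[text_column] = text
        cleaned.append(kept)
    return cleaned


# Tiny made-up sample standing in for an exported file.
sample = (
    "web-scraper-order,web-scraper-start-url,review_text\n"
    '1-1,https://example.com/reviews,"  Great   team,\nlong hours. "\n'
    "1-2,https://example.com/reviews,\n"
    '1-3,https://example.com/reviews,"Great team, long hours."\n'
)

rows = clean_rows(sample, "review_text")
print(rows)
```

To run it on a real export, read the file into a string first, for example with `open(path, encoding="utf-8-sig", newline="").read()`; the `utf-8-sig` encoding is a precaution in case the file starts with a byte order mark.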

A technical article is available on IBM's Watson DeveloperWorks site

