I’ve wanted to experiment with intelligent web scraping for some time. I don’t know if intelligent web scraping is even a thing, but I wanted a script that could do three things:

  1. Self-assign future paths for scraping
  2. Reconcile demographic information among different labels
  3. Automatically draw links between entities

After deciding on Node.js and PostgreSQL for the backend and economists as the subject, I set out writing a script that would scrape the internet, starting from a single figure in history, Adam Smith, to build a map of all (significant, documented) economists. Below are the resulting graphs from this endeavor. If graphs aren’t your thing, skip ahead to the process.

Part 2 of this post will dive into specific machine learning techniques and an explanation of the node script.

[Interactive graphs: Influences, Country, and Economists by Date. Double-click a name to reveal more details about that economist.]

The Process

Self-directed scraping

Starting with Adam Smith’s page, the script scans for links and stores them for future processing. Alongside links to other economists’ pages are links to publications, concepts, places, dates, and so on; all of these are added to a database queue. When a link is retrieved from the queue, the script loads the page and applies NLP techniques, looking for a combination of keywords and phrases that indicate the subject is an actual economist. It assigns a confidence score to this decision, and only entities with a very high score are marked for display.
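
In outline, the crawl loop looks something like the sketch below. This is a simplified version: the queue table schema and the scoreAsEconomist() scorer are stand-ins for the real pieces (more on those in Part 2).

// Assumes a Postgres table: queue(url text primary key, processed boolean default false).
const { Pool } = require('pg');
const cheerio = require('cheerio');

const pool = new Pool(); // connection settings come from the PG* environment variables

// Trivial stand-in scorer: the fraction of indicator phrases present on a page.
const INDICATORS = ['economist', 'economics', 'school of thought', 'political economy'];
function scoreAsEconomist(text) {
  const t = text.toLowerCase();
  return INDICATORS.filter((p) => t.includes(p)).length / INDICATORS.length;
}

async function crawlNext() {
  // Pull the next unprocessed link off the queue.
  const { rows } = await pool.query('SELECT url FROM queue WHERE NOT processed LIMIT 1');
  if (rows.length === 0) return null; // queue drained

  const url = rows[0].url;
  const html = await (await fetch(url)).text(); // global fetch (Node 18+)
  const $ = cheerio.load(html);

  // Self-assign future paths: store every outbound link for later processing.
  for (const a of $('a[href]').toArray()) {
    try {
      const href = new URL($(a).attr('href'), url).href;
      await pool.query('INSERT INTO queue (url) VALUES ($1) ON CONFLICT DO NOTHING', [href]);
    } catch { /* skip malformed hrefs */ }
  }

  // Score the page text and mark this link as processed.
  const confidence = scoreAsEconomist($('body').text());
  await pool.query('UPDATE queue SET processed = true WHERE url = $1', [url]);
  return { url, confidence };
}

Only entities scoring above the 90% threshold were marked for display, which is where the cutoff in the statistics below comes from.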

Script statistics

Total run time                 2 hours, 36 minutes
Possible economists found      3,872
Found with > 90% confidence    512

Information reconciliation

The source pages often categorized demographic information under different labels. For example, a person’s birthday might be labeled DOB, birth, or date of birth, or appear simply as a parenthesized date following the individual’s name. Similarly, a concept such as school (e.g., Austrian, physiocracy, Chicago, Keynesian, econometrics) might be referred to as school of thought, tradition, field, or principle. The script needed to abstract these keywords and then reconcile them piecemeal to fill in the blanks. This was achieved by defining a simple map of what an economist looks like:

{
   "name" : string,
   "country" : string,
   "link" : URI,
   "nationality" : string,
   "dob" : date,
   "dod" : date,
   "works" : array,
   "contributions" : array,
   "field" : string,
   "influences" : array,
   "influenced" : array,
   "tradition" : string,
   "era" : string, deduced from dob,
}
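
With the map in place, reconciliation is mostly an alias lookup: each label found on a page is normalized to one of the canonical keys above, and values only fill empty slots, never overwrite existing ones. A sketch (the alias lists here are illustrative, not exhaustive):

// Illustrative alias table: alternate labels seen on source pages, normalized
// to the canonical keys of the economist map above.
const ALIASES = {
  dob: ['dob', 'birth', 'date of birth'],
  dod: ['dod', 'death', 'date of death'],
  field: ['field', 'school', 'school of thought', 'principle'],
  tradition: ['tradition'],
};

function canonicalKey(label) {
  const needle = label.trim().toLowerCase();
  for (const [key, variants] of Object.entries(ALIASES)) {
    if (variants.includes(needle)) return key;
  }
  return null; // unknown label: leave it for later review rather than guess
}

// Fill in the blanks: merge newly scraped label/value pairs into a record,
// never overwriting a value that is already present.
function reconcile(economist, scraped) {
  for (const [label, value] of Object.entries(scraped)) {
    const key = canonicalKey(label);
    if (key !== null && economist[key] == null) economist[key] = value;
  }
  return economist;
}
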
Linking entities

Once all entities were gathered, a function was developed to find only “high quality” economists: entities where every property of the above map was populated and whose original source page or works referenced other scraped entities. Of the 512 economists found with high confidence in the original scraping, 92 were marked as “high quality” and are displayed in the graphs above.
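
In code, that filter amounts to a completeness check plus a cross-reference check. A sketch, with hypothetical names:

// Every property of the economist map must be populated.
const REQUIRED = [
  'name', 'country', 'link', 'nationality', 'dob', 'dod', 'works',
  'contributions', 'field', 'influences', 'influenced', 'tradition', 'era',
];

// An economist is "high quality" when the record is complete and at least
// one influence edge points at another scraped entity.
function isHighQuality(economist, knownNames) {
  const complete = REQUIRED.every((key) => {
    const value = economist[key];
    return Array.isArray(value) ? value.length > 0 : value != null && value !== '';
  });
  const crossReferenced = [...economist.influences, ...economist.influenced]
    .some((name) => knownNames.has(name));
  return complete && crossReferenced;
}

Requiring at least one cross-reference keeps the final graph connected; a complete but isolated entity would otherwise render as a floating node.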