Charting influential economists
I’ve wanted to experiment with intelligent web scraping for some time. I don’t know if intelligent web scraping is even a thing, but I wanted a script that could do three things:
- Self-assign future paths for scraping
- Reconcile demographic information among different labels
- Automatically draw links between entities
After deciding on Node.js and PostgreSQL for the backend and economists as the subject, I set out to write a script that would scrape the internet, starting from a single figure in history, Adam Smith, to build a map of all (significant, documented) economists. Below are the resulting graphs from this endeavor. If graphs aren’t your thing, skip to the process.
Part 2 of this post will dive into specific machine learning techniques and an explanation of the Node script.
Double-click a name to reveal more details about the economist!

[Interactive graphs: Influences · Country · Economists by Date]
The Process
Self-directed scraping
Starting from Adam Smith’s page, the script scans for links and stores them for future processing. Alongside links to other economists’ pages are links to publications, concepts, places, dates, and so on; all of these are added to a database queue for later processing. When a link is retrieved from the queue, the script loads the page and applies NLP techniques, looking for a combination of keywords and phrases that indicate the page describes an actual economist. It assigns a confidence score to this decision, and only entities above a high confidence threshold are marked for display.
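The full script is the subject of Part 2, but a minimal sketch of the self-directed scraping loop looks roughly like this. It assumes an axios + cheerio + pg stack, a hypothetical `scrape_queue` table with `url`, `status`, and `confidence` columns, and a crude keyword-count score standing in for the real NLP-based confidence scoring.

```js
// Minimal sketch of the self-directed scraping loop (not the original script).
const axios = require('axios');
const cheerio = require('cheerio');
const { Pool } = require('pg');

const pool = new Pool(); // connection settings read from PG* environment variables

// Very rough keyword-based score; the real script applies NLP techniques.
const KEYWORDS = ['economist', 'economics', 'school of thought', 'monetary', 'political economy'];
function estimateConfidence(text) {
  const lower = text.toLowerCase();
  const hits = KEYWORDS.filter((k) => lower.includes(k)).length;
  return hits / KEYWORDS.length;
}

async function processNext() {
  // Pull one pending link from the queue.
  const { rows } = await pool.query(
    "UPDATE scrape_queue SET status = 'processing' " +
    "WHERE url = (SELECT url FROM scrape_queue WHERE status = 'pending' LIMIT 1) RETURNING url"
  );
  if (rows.length === 0) return false;

  const url = rows[0].url;
  const { data: html } = await axios.get(url);
  const $ = cheerio.load(html);

  // Score the page and record whether it looks like an actual economist.
  const confidence = estimateConfidence($('body').text());
  await pool.query(
    "UPDATE scrape_queue SET status = 'done', confidence = $2 WHERE url = $1",
    [url, confidence]
  );

  // Self-assign future paths: every outgoing link goes back into the queue.
  const links = $('a[href]').map((_, a) => $(a).attr('href')).get();
  for (const link of links) {
    let absolute;
    try {
      absolute = new URL(link, url).href; // resolve relative links
    } catch {
      continue; // skip malformed hrefs
    }
    if (!absolute.startsWith('http')) continue;
    await pool.query(
      "INSERT INTO scrape_queue (url, status) VALUES ($1, 'pending') ON CONFLICT (url) DO NOTHING",
      [absolute]
    );
  }
  return true;
}
```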
Script statistics
| Statistic | Value |
| --- | --- |
| Total run time | 2 hours, 36 minutes |
| Possible economists found | 3,872 |
| Found with > 90% confidence | 512 |
Information reconciliation
The source pages often categorize the same demographic information under different labels. For example, a person’s birthday may be labeled DOB, birth, or date of birth, or simply appear as a parenthesized date after the individual’s name. Similarly, an economist’s school (e.g., Austrian, physiocracy, Chicago, Keynesian, econometrics) may be referred to as a school of thought, tradition, field, or principle. The script needed to abstract these keywords and then reconcile them piecemeal to fill in the blanks. This was achieved by defining a simple map of what an economist looks like (a sketch of the reconciliation step follows the map below):
{ "name" : string, "country" : string, "link" : URI, "nationality" : string, "dob" : date, "dod" : date, "works" : array, "contributions" : array, "field" : string, "influences" : array, "influenced" : array, "tradition" : string, "era" : string, deduced from dob, }
Linking entities
Once all entities were gathered, a function was developed to find only “high quality” economists: entities for which every property of the above map was filled in and whose original source page or works referenced other entities. Of the 512 economists found with high confidence in the original scraping, 92 were marked as “high quality” and are displayed in the graphs above.
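A rough sketch of that filter is below. The property names follow the map shown earlier; the helper functions and the exact completeness rules are assumptions, not the script’s implementation.

```js
// Properties from the economist map that must all be present.
const REQUIRED = [
  'name', 'country', 'link', 'nationality', 'dob', 'dod',
  'works', 'contributions', 'field', 'influences', 'influenced', 'tradition', 'era',
];

function isHighQuality(economist, knownNames) {
  // Every property of the map must be filled in (non-empty).
  const complete = REQUIRED.every((key) => {
    const value = economist[key];
    return Array.isArray(value) ? value.length > 0 : value != null && value !== '';
  });
  if (!complete) return false;

  // The entity must also reference at least one other scraped economist.
  const references = [...economist.influences, ...economist.influenced];
  return references.some((name) => knownNames.has(name));
}

// Filter the high-confidence candidates down to the displayed set.
function selectHighQuality(economists) {
  const knownNames = new Set(economists.map((e) => e.name));
  return economists.filter((e) => isHighQuality(e, knownNames));
}
```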