Introduction to scraping

About me

Tom Cardoso, data journalist at The Globe and Mail

You can find the GitHub repository for this presentation here.

Today's schedule

  • Part 1: Introduction (9:15am to 9:45am)

  • Part 2: The basics of markup (9:45am to 10:30am)

  • 15-minute break, 10:30am to 10:45am

  • Part 3: Patterns and selections (10:45am to 12pm)

  • Lunch break, 12pm to 1:15pm

  • Part 4: Writing your first scraper with rvest (1:15pm to 2:45pm)

  • 15-minute break, 2:45pm to 3pm

  • Part 5: Offline document scraping (3:00pm to 3:30pm)

  • Part 6: (Time allowing) Let's build a scraper from scratch! (3:30pm to end)

What is scraping?

A way of systematically and reproducibly collecting information

Consists of three steps (sketched in code below):

  1. Visiting a web page

  2. Selecting and extracting data from that page

  3. Saving the results of that extraction to a local file (most often a CSV or another structured data file)
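
A minimal sketch of all three steps using rvest. The URL and the h2.headline selector are hypothetical stand-ins, not a real page:

    library(rvest)

    # 1. Visit the page
    page <- read_html("https://example.com/news")

    # 2. Select and extract the data
    headlines <- page |>
      html_elements("h2.headline") |>
      html_text2()

    # 3. Save the results to a local CSV
    write.csv(data.frame(headline = headlines),
              "headlines.csv", row.names = FALSE)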

Scraping is useful for:

  • Extracting text

  • Downloading the contents of a table (see the sketch after this list)

  • Downloading images

  • Bulk downloading files (such as PDFs)

  • Automating web form entry
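
For example, pulling every table on a page into R data frames is a short job with rvest's html_table(). The URL here is a hypothetical placeholder:

    library(rvest)

    tables <- read_html("https://example.com/stats") |>
      html_elements("table") |>
      html_table()

    # tables[[1]] is an ordinary data frame you can filter, sort and save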

Many different scraping techniques

  • Manual entry (yes, that's right)

  • Text pattern matching (e.g. regular expressions; see the sketch after this list)

  • Using application programming interfaces (commonly known as APIs)

  • Parsing the DOM (Document Object Model)

  • Using a headless browser
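
As a taste of text pattern matching, here's a base R sketch that pulls email addresses out of a blob of text with a regular expression (the text is made up and the pattern is deliberately simplified):

    text <- "Contact us at news@example.com or tips@example.com"
    emails <- regmatches(text,
                         gregexpr("[[:alnum:].]+@[[:alnum:].]+", text))[[1]]
    emails
    #> [1] "news@example.com" "tips@example.com"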

We're going to focus on parsing the DOM
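
Parsing the DOM means loading a page's HTML into a tree of elements and querying that tree with selectors. Here's a self-contained sketch using rvest's minimal_html() helper; the markup and names are invented for illustration:

    library(rvest)

    html <- minimal_html('
      <ul>
        <li class="mp">Jane Smith</li>
        <li class="mp">John Doe</li>
      </ul>
    ')

    html |>
      html_elements("li.mp") |>
      html_text2()
    #> [1] "Jane Smith" "John Doe"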

By the end of today, you'll (hopefully) have learned how to:

  • Pick a scraping strategy

  • Navigate the HTML structure of a web page

  • Identify markup patterns you can exploit for scraping

  • Write a basic selection query in your browser's console

  • Write a basic scraper using rvest

  • Extract data from offline documents, namely PDFs (see the sketch below)
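
One way to read PDF text in R is with the pdftools package, sketched here; "report.pdf" is a placeholder file name:

    library(pdftools)

    pages <- pdf_text("report.pdf")   # one character string per page
    cat(substr(pages[1], 1, 200))     # peek at the first page's text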

Next section: Part 2: The basics of markup