Tom Cardoso, data journalist at The Globe and Mail
You can find the GitHub repository for this presentation here.
Part 1: Introduction (9:15am to 9:45am)
Part 2: The basics of markup (9:45am to 10:30am)
15min break, 10:30am to 10:45am
Part 3: Patterns and selections (10:45am to 12pm)
Lunch break, 12pm to 1:15pm
Part 4: Writing your first scraper with rvest (1:15pm to 2:45pm)
15min break, 2:45pm to 3pm
Part 5: Offline document scraping (3pm to 3:30pm)
Part 6: (Time allowing) Let's build a scraper from scratch! (3:30pm to end)
A way of systematically and reproducibly collecting information from the web
Consists of three steps:
1. Visiting a web page
2. Selecting and extracting data from that page
3. Saving the results of that extraction to a local file (most often a CSV, or sometimes a database), as in the sketch after this list
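To make those three steps concrete, here's a minimal rvest sketch. The URL and the `h2` selector are placeholders, not a real target:

```r
library(rvest)

# 1. Visit a web page
page <- read_html("https://example.com")

# 2. Select and extract data from that page
# (here, the text of every <h2> element)
headlines <- page %>%
  html_elements("h2") %>%
  html_text2()

# 3. Save the results to a local file
write.csv(data.frame(headline = headlines), "headlines.csv", row.names = FALSE)
```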
Scraping is useful for:
Extracting text
Downloading the contents of a table (see the sketch after this list)
Downloading images
Bulk downloading files (such as PDFs)
Automating web form entry
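Grabbing a table, for instance, is often just a couple of lines in rvest. This sketch assumes a made-up URL and simply keeps the first table on the page:

```r
library(rvest)

page <- read_html("https://example.com/stats")

# html_table() converts each <table> on the page to a data frame;
# we keep the first one here
tables <- page %>% html_table()
stats <- tables[[1]]

write.csv(stats, "stats.csv", row.names = FALSE)
```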
Many different scraping techniques:
Manual entry (yes, that's right)
Text pattern matching (e.g. regular expressions; see the example after this list)
Using application programming interfaces (commonly known as APIs)
Parsing the DOM (the Document Object Model, the tree structure of a web page)
Using a headless browser
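As a taste of the pattern-matching approach, here's a base R example that pulls email addresses out of a string with a regular expression. The sample text is invented:

```r
# Invented sample text for illustration
raw <- "Contact: jane@example.com, backup: joe@example.org"

# A (simplified) regular expression for email addresses
pattern <- "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}"

# regmatches() + gregexpr() return every match in the string
regmatches(raw, gregexpr(pattern, raw))[[1]]
#> [1] "jane@example.com" "joe@example.org"
```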
We're going to focus on parsing the DOM
By the end of today, you'll (hopefully) have learned how to:
Pick a scraping strategy
Navigate the HTML structure of a web page
Identify markup patterns you can exploit for scraping
Write a basic selection query in your browser's console
Write a basic scraper using rvest
Extract data from offline documents (namely PDFs; see the sketch below)
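One common way to do this in R is with the pdftools package. A quick sketch, where both the package choice and the file name are assumptions rather than part of this workshop's materials:

```r
library(pdftools)

# pdf_text() returns one character string per page
pages <- pdf_text("report.pdf")

# Print the first page to the console
cat(pages[1])
```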
Next section: Part 2: The basics of markup