Introduction to scraping

About me

Tom Cardoso, data journalist at The Globe and Mail

You can find the GitHub repository for this presentation here.

Today's schedule

  • Part 1: Introduction (9:15am to 9:45am)

  • Part 2: The basics of markup (9:45am to 10:30am)

  • 15-minute break, 10:30am to 10:45am

  • Part 3: Patterns and selections (10:45am to 12pm)

  • Lunch break, 12pm to 1:15pm

  • Part 4: Writing your first scraper with rvest (1:15pm to 2:45pm)

  • 15-minute break, 2:45pm to 3pm

  • Part 5: Offline document scraping (3:00pm to 3:30pm)

  • Part 6: (Time allowing) Let's build a scraper from scratch! (3:30pm to end)

What is scraping?

A way of systematically and reproducibly collecting information

Consists of three steps (sketched in code below):

  1. Visiting a web page

  2. Selecting and extracting data from that page

  3. Saving the results of that extraction to a local file (most often a CSV or another structured data file)
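
A minimal sketch of all three steps using rvest. The URL and the h2.headline selector are hypothetical stand-ins, not a real page:

    library(rvest)

    # 1. Visit the page
    page <- read_html("https://example.com/news")

    # 2. Select and extract the data
    headlines <- page |>
      html_elements("h2.headline") |>
      html_text2()

    # 3. Save the results to a local CSV
    write.csv(data.frame(headline = headlines),
              "headlines.csv", row.names = FALSE)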

Scraping is useful for:

  • Extracting text

  • Downloading the contents of a table (see the sketch after this list)

  • Downloading images

  • Bulk downloading files (such as PDFs)

  • Automating web form entry
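
For example, pulling every table on a page into R data frames is a short job with rvest's html_table(). The URL here is a hypothetical placeholder:

    library(rvest)

    tables <- read_html("https://example.com/stats") |>
      html_elements("table") |>
      html_table()

    # tables[[1]] is an ordinary data frame you can filter, sort and save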

Many different scraping techniques

  • Manual entry (yes, that's right)

  • Text pattern matching (e.g. regular expressions; see the sketch after this list)

  • Using application programming interfaces (commonly known as APIs)

  • Parsing the DOM (Document Object Model)

  • Using a headless browser
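
As a taste of text pattern matching, here's a base R sketch that pulls email addresses out of a blob of text with a regular expression (the text is made up and the pattern is deliberately simplified):

    text <- "Contact us at news@example.com or tips@example.com"
    emails <- regmatches(text,
                         gregexpr("[[:alnum:].]+@[[:alnum:].]+", text))[[1]]
    emails
    #> [1] "news@example.com" "tips@example.com"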

We're going to focus on parsing the DOM
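
Parsing the DOM means loading a page's HTML into a tree of elements and querying that tree with selectors. Here's a self-contained sketch using rvest's minimal_html() helper; the markup and names are invented for illustration:

    library(rvest)

    html <- minimal_html('
      <ul>
        <li class="mp">Jane Smith</li>
        <li class="mp">John Doe</li>
      </ul>
    ')

    html |>
      html_elements("li.mp") |>
      html_text2()
    #> [1] "Jane Smith" "John Doe"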

By the end of today, you'll (hopefully) have learned how to:

  • Pick a scraping strategy

  • Navigate the HTML structure of a web page

  • Identify markup patterns you can exploit for scraping

  • Write a basic selection query in your browser's console

  • Write a basic scraper using rvest

  • Extract data from offline documents, namely PDFs (see the sketch below)
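
One way to read PDF text in R is with the pdftools package, sketched here; "report.pdf" is a placeholder file name:

    library(pdftools)

    pages <- pdf_text("report.pdf")   # one character string per page
    cat(substr(pages[1], 1, 200))     # peek at the first page's text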

Next section: Part 2: The basics of markup