Offline document scraping

Phew! This is the last section of the day.

While scraping can get you a long way, sometimes there's no way around having to work with a traditional document, such as a PDF.

I'm a fan of the hybrid approach — build a scraper to bulk download files, then analyze them in a different tool. Let's try out bulk downloading now.

pdf.R

I've used methods like this to download thousands of files at once. It can get pretty intense!
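
For reference, here is a minimal sketch of what a bulk download can look like in R. It is not the contents of pdf.R, just an illustration under a few assumptions: the files are linked from a single index page, the links end in .pdf, and the server is fine with one request per second. The function names find_pdf_links and download_pdfs, the pdfs folder, and the example.com URL are all made up for this example.

```r
library(rvest)  # html_elements(), html_attr()
library(xml2)   # read_html(), url_absolute()

# Return absolute URLs for every link on a page that points to a PDF.
# `page` is a parsed HTML document; `base_url` is the page's own address,
# used to resolve relative links.
find_pdf_links <- function(page, base_url) {
  hrefs <- html_attr(html_elements(page, "a"), "href")
  # Drop anchors without an href, then keep only links ending in .pdf
  hrefs <- hrefs[!is.na(hrefs) & grepl("\\.pdf$", hrefs, ignore.case = TRUE)]
  unique(url_absolute(hrefs, base_url))
}

# Download each URL into dest_dir, skipping files that already exist,
# so the script can be re-run after an interruption.
download_pdfs <- function(urls, dest_dir = "pdfs", pause = 1) {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  for (u in urls) {
    dest <- file.path(dest_dir, basename(u))
    if (file.exists(dest)) next
    tryCatch(
      # mode = "wb" keeps binary files intact on Windows
      download.file(u, dest, mode = "wb", quiet = TRUE),
      error = function(e) message("Failed: ", u)
    )
    Sys.sleep(pause)  # pause between requests to go easy on the server
  }
  invisible(file.path(dest_dir, basename(urls)))
}

# Usage (hypothetical URL, not run here):
# index_url <- "https://example.com/reports/"
# links <- find_pdf_links(read_html(index_url), index_url)
# download_pdfs(links)
```

Skipping files that already exist and pausing between requests are the two choices that matter most once you get into the thousands: the first lets you restart a failed run without starting over, and the second makes it less likely you'll get blocked.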

But once you've got your files, how do you extract information from them? Let's talk through the options.

Tabula

Tesseract, pdfplumber, docs2csv

Adobe Acrobat

Let's use Tabula and Acrobat to try to extract some data from the PDFs we just downloaded.

That's it! We'll take a short break, then get down to writing our own scrapers.
