Writing your first scraper with rvest

Let's get into the R portion of the class. Please let me know if I'm going too slow (or too fast!).

Robust scraping of the sort we're about to attempt can be frustrating. It's a lot of write, run, tweak, repeat.

That's okay! The most important thing to remember when dealing with any kind of coding problem: someone else has almost certainly had the same problem, asked about it online, and gotten an answer, which means… you can Google it! Or ask about it yourself on a forum such as Stack Overflow.

Before we get to the good stuff, let's talk about the tidyverse a little bit. It's a community and a collection of packages for the R language that make analysis monumentally easier. rvest is part of the tidyverse.

There's a lot to know about the tidyverse, but the most important piece we'll need for this class is the pipe operator. It looks like this:

%>%

It, together with other parts of the tidyverse, allows you to turn this…

for (n in unique(d$x)) {
  subd <- d[d$x == n, ]
  …

Into this:

d %>%
  group_by(x) %>%
  …

All the pipe does is tell R: "take the result of the left-hand side and pass it into the function on the right."

Let's use tidyverse packages together to get a feel for how this works.

tidyverse.R
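
The contents of tidyverse.R aren't reproduced here, but a minimal sketch of the kind of piped workflow it walks through might look something like this (the data frame and column names are invented purely for illustration):

library(dplyr)

# An invented data frame standing in for real data
d <- tibble(
  x = c("a", "a", "b", "b", "b"),
  value = c(1, 2, 3, 4, 5)
)

# Each %>% hands the result on its left to the function on its right
d %>%
  group_by(x) %>%
  summarise(total = sum(value))

Read it top to bottom: take d, group it by x, then summarise each group.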

A slightly more advanced example: sunshine.R

Now let's load the rvest package and get down to it.

At its core, rvest hinges on one function: read_html(). You give it a URL, and it fetches the page and parses the HTML so you can pull out the pieces you need. Let's try it out.

procurement.R
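
procurement.R isn't included inline, but the basic pattern read_html() enables looks roughly like this (the URL and CSS selectors here are placeholders, not the ones we use in class):

library(rvest)

# read_html() fetches the page and parses it into an object we can query
page <- read_html("https://example.com/contracts")   # placeholder URL

# From there we can pull out whatever we need with CSS selectors
page %>%
  html_elements("h2") %>%
  html_text2()

# ...including the href attribute of every link, which gives us a list of URLs
urls <- page %>%
  html_elements("table a") %>%   # placeholder selector
  html_attr("href")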

Building that index of URLs is a common scraping tactic: build one scraper to grab a list of URLs, then build a second scraper that consumes those URLs.
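
Roughly, the second stage looks like this, assuming the urls vector from the sketch above (again, the selector is a placeholder):

# Stage 2: a second scraper that visits each URL collected in stage 1.
# lapply() runs the same extraction on every page and collects the results.
results <- lapply(urls, function(u) {
  detail <- read_html(u)
  detail %>%
    html_element("table") %>%   # placeholder selector
    html_table()
})

# Relative links may need to be converted to absolute ones first,
# e.g. with xml2::url_absolute(urls, "https://example.com/")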

The only issue with this approach is that you have no guarantee the HTML structure will be consistent page to page. I originally planned to show you the nested scraper working on the Treasury Board website, but the HTML structure of the tables was all over the place. A great real-life example of this variability!

Before reaching for rvest, you should make sure you actually need it. Here's an example: https://www.canada.ca/en/health-canada/services/drugs-health-products/drug-products/prescription-drug-list/list.html

Because this page is a single table, you can actually just select the table with your cursor, copy it, and paste it into Excel!

rvest won't always work. High-traffic sites are wise to our scraping ways, and many have made themselves fairly scrape-proof. Let's take a look at the source for Facebook and see for ourselves.

Even on fairly easy-to-scrape sites, you'll want to be careful not to scrape too aggressively. Left to its own devices, your scraper will fire off requests as fast as R can run them, and that can get your IP address automatically banned by the server.

That's where throttling comes in. Let's build a basic scraper with some throttling functionality.

globeandmail.R
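
globeandmail.R isn't reproduced here, but the core of throttling is just a pause between requests, usually with Sys.sleep(). A sketch, with placeholder URLs:

library(rvest)

urls <- c("https://example.com/page-1",   # placeholder URLs
          "https://example.com/page-2",
          "https://example.com/page-3")

pages <- vector("list", length(urls))

for (i in seq_along(urls)) {
  pages[[i]] <- read_html(urls[i])
  Sys.sleep(2)   # pause two seconds between requests so we don't hammer the server
}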

One important mantra to follow when scraping: never manipulate or clean up your data within the scrape process. It's much better to get a raw scrape of a page, then clean it up in a different file or a different variable. Depending on your data, you may have thousands of pages to scrape; you don't want to have to scrape them more than once if you can avoid it.
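
One way to stick to that rule is to save the raw HTML to disk during the scrape and do all of the parsing in a separate step. A rough sketch of that pattern, assuming a vector of URLs called urls and a placeholder selector:

library(rvest)
library(xml2)

dir.create("raw_pages", showWarnings = FALSE)

# Step 1: fetch and save the raw pages; no cleaning, no parsing
for (i in seq_along(urls)) {
  page <- read_html(urls[i])
  write_html(page, file.path("raw_pages", paste0("page_", i, ".html")))
  Sys.sleep(2)
}

# Step 2, in a separate script or variable: parse the saved files at leisure
raw_files <- list.files("raw_pages", full.names = TRUE)
tables <- lapply(raw_files, function(f) {
  read_html(f) %>%
    html_element("table") %>%   # placeholder selector
    html_table()
})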

A corollary of this rule is that you should always err on the side of collecting more rather than less data. Put another way: if the web page you're scraping has a series of fields (let's say date, name, address, age, and postal code), you may think you don't need the postal code because you already have the address. You should still scrape it. You never know what you'll end up needing later.

Finally, as much as possible, try to make your scrape reproducible. That is, design it so that you can re-run it two months later with minimal tweaking. rvest does a good job of offering reproducibility out of the box, but be aware that you may end up revisiting an old scraper two years later.

Now that that's all done, let's pick a website together and write a new scraper from scratch. This could go off the rails, so apologies in advance…

newscrape.R

One last thing: in some cases you may want to scrape a website by filling in forms or submitting data. For example, say you want to scrape a web page that requires you to paginate through a list of results…

You can automate clicking that "next" button and filling in forms with a package called RSelenium. It's very powerful, but also very involved, and a bit outside the scope of this course.
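
Just to give a taste, here's a rough sketch of RSelenium's basic moves. The URL and the selector for the "next" button are invented, and you need a browser driver installed for rsDriver() to work:

library(RSelenium)

# Start a browser session (rsDriver() also starts a local Selenium server)
driver <- rsDriver(browser = "firefox")
remDr <- driver$client

remDr$navigate("https://example.com/search-results")   # placeholder URL

# Find the "next" button by CSS selector and click it
next_btn <- remDr$findElement(using = "css selector", "a.next")   # invented selector
next_btn$clickElement()

# At any point, hand the rendered page back to rvest for the actual scraping
page <- rvest::read_html(remDr$getPageSource()[[1]])

remDr$close()
driver$server$stop()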

If you're curious, I've included a link to an RSelenium tutorial in the online course notes.

15-minute break!
