Tom Cardoso, investigative reporter
at The Globe and Mail
Hint: Not this.
Answer: A way of systematically and reproducibly collecting information.
Consists of three steps:
Scraping is useful for:
Three parts:
Scraping is primarily concerned with extracting data from what we call the “front end,” or the stuff that gets rendered in your browser (servers are often called the “back end”).
You can extract data directly from a server (such as by using APIs), but that’s beyond the scope of this session, and usually requires coding expertise.
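To make that concrete, here's a minimal sketch of the fetching step in Python (the requests library and the example.com URL are my own illustrative choices, not part of the talk):

import requests

# Fetch the "front end": the same raw HTML the server sends to every browser.
# example.com is a placeholder; swap in the public page you want to scrape.
response = requests.get("https://example.com")
print(response.text[:500])  # first 500 characters of the page's HTML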
This is important: When scraping, you’re selecting and collecting information that’s been made public by the person or organization running the web page. The information’s all public!
The front end is basically an instruction manual:
If web pages were an IKEA dresser…
For the most part, we only care about the HTML.
We don’t care what the page looks like, or what happens when you click a button. We just want the data!
If you’ve ever hit “View Page Source” in your browser before, you’ll know what we’re talking about.
HTML follows a tree model. The top-level node is the <html> element, and everything else is a child of that element.
<html>
  <body>
    <p>Hello, world!</p>
    <table>
      <tr>
        <td>Apple</td>
        <td>3 oz.</td>
        <td>$14</td>
      </tr>
      <tr>
        <td>Orange</td>
        <td>5 oz.</td>
        <td>$5</td>
      </tr>
    </table>
  </body>
</html>
Scraping works because websites are templated. For the most part, people don’t code sites by hand. Instead, they build templates, and templates on top of those templates, and so on.
Luckily for us, that means we can take advantage of predictable structures to extract information!
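As a quick sketch of what that looks like in code, here's one way to pull the rows out of the example table above, using Python and BeautifulSoup (my choice of parser; any HTML library would do the same job):

from bs4 import BeautifulSoup

html = """
<html><body>
<table>
<tr><td>Apple</td><td>3 oz.</td><td>$14</td></tr>
<tr><td>Orange</td><td>5 oz.</td><td>$5</td></tr>
</table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# The table is templated: every row has the same three cells
# (name, weight, price), so one loop handles all of them.
for row in soup.find_all("tr"):
    print([td.get_text() for td in row.find_all("td")])
# ['Apple', '3 oz.', '$14']
# ['Orange', '5 oz.', '$5']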
I realize this is a lot to take in. You may be feeling like this…
First, let’s focus on getting alerts. The principles for alerters and data downloaders are the same, so they’re a great way to practice your scraping skills.
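To show how simple an alerter can be, here's a sketch in Python: it fetches a page, compares it to the copy saved last time, and flags any change (the URL and file name are placeholders, and a real alerter would email or message you rather than print):

import hashlib
import requests

URL = "https://example.com/some-public-page"   # placeholder page to watch
SNAPSHOT = "last_version.txt"                  # where we remember the previous copy

new_hash = hashlib.sha256(requests.get(URL).text.encode()).hexdigest()

try:
    with open(SNAPSHOT) as f:
        old_hash = f.read().strip()
except FileNotFoundError:
    old_hash = None  # first run: nothing to compare against yet

if new_hash != old_hash:
    print("The page changed! Go take a look.")  # a real alerter would email or Slack you
    with open(SNAPSHOT, "w") as f:
        f.write(new_hash)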
Pros:
Cons:
Pros:
Cons:
Bonus slide: Tabula
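Tabula pulls tables out of PDFs rather than web pages, but the spirit is the same. As a sketch only (the tabula-py wrapper and the report.pdf file name are my own assumptions; Tabula also works as a point-and-click desktop app):

import tabula  # the tabula-py package, which wraps the Tabula engine

# report.pdf is a placeholder file name; pages="all" scans the whole document.
tables = tabula.read_pdf("report.pdf", pages="all")
print(tables[0])  # each extracted table comes back as a pandas DataFrame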