Tom Cardoso, investigative reporter
at The Globe and Mail
Hint: Not this.
Answer: A way of systematically and reproducibly collecting information.
Consists of three steps:
Scraping is useful for:
Three parts:
Scraping is primarily concerned with extracting data from what we call the “front end,” or the stuff that gets rendered in your browser (servers are often called the “back end”).
You can extract data directly from a server (such as by using APIs), but that’s beyond the scope of this session, and usually requires coding expertise.
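To make that concrete, here's a minimal sketch of the fetching step in Python (the requests library and the example.com URL are my own illustrative choices, not part of the talk):

import requests

# Fetch the "front end": the same raw HTML the server sends to every browser.
# example.com is a placeholder; swap in the public page you want to scrape.
response = requests.get("https://example.com")
print(response.text[:500])  # first 500 characters of the page's HTML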
This is important: When scraping, you’re selecting and collecting information that’s been made public by the person or organization running the web page. The information’s all public!
The front end is basically an instruction manual:
If web pages were an IKEA dresser…
For the most part, we only care about the HTML.
We don’t care what the page looks like, or what happens when you click a button. We just want the data!
If you’ve ever hit “View Page Source” in your browser before, you’ll know what we’re talking about.
HTML follows a tree model. The top-level node is the <html> element, and everything else is a child of that element.
<html>
  <body>
    <p>Hello, world!</p>
    <table>
      <tr>
        <td>Apple</td>
        <td>3 oz.</td>
        <td>$14</td>
      </tr>
      <tr>
        <td>Orange</td>
        <td>5 oz.</td>
        <td>$5</td>
      </tr>
    </table>
  </body>
</html>
Scraping works because websites are templated. For the most part, people don’t code sites by hand. Instead, they build templates, and templates on top of those templates, and so on.
Luckily for us, that means we can take advantage of predictable structures to extract information!
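As a quick sketch of what that looks like in code, here's one way to pull the rows out of the example table above, using Python and BeautifulSoup (my choice of parser; any HTML library would do the same job):

from bs4 import BeautifulSoup

html = """
<html><body>
<table>
<tr><td>Apple</td><td>3 oz.</td><td>$14</td></tr>
<tr><td>Orange</td><td>5 oz.</td><td>$5</td></tr>
</table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# The table is templated: every row has the same three cells
# (name, weight, price), so one loop handles all of them.
for row in soup.find_all("tr"):
    print([td.get_text() for td in row.find_all("td")])
# ['Apple', '3 oz.', '$14']
# ['Orange', '5 oz.', '$5']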
I realize this is a lot to take in. You may be feeling like this…
First, let’s focus on getting alerts. The principles for alerters and data downloaders are the same, so they’re a great way to practice your scraping skills.
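To show how simple an alerter can be, here's a sketch in Python: it fetches a page, compares it to the copy saved last time, and flags any change (the URL and file name are placeholders, and a real alerter would email or message you rather than print):

import hashlib
import requests

URL = "https://example.com/some-public-page"   # placeholder page to watch
SNAPSHOT = "last_version.txt"                  # where we remember the previous copy

new_hash = hashlib.sha256(requests.get(URL).text.encode()).hexdigest()

try:
    with open(SNAPSHOT) as f:
        old_hash = f.read().strip()
except FileNotFoundError:
    old_hash = None  # first run: nothing to compare against yet

if new_hash != old_hash:
    print("The page changed! Go take a look.")  # a real alerter would email or Slack you
    with open(SNAPSHOT, "w") as f:
        f.write(new_hash)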
Pros:
Cons:
Pros:
Cons:
Bonus slide: Tabula
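Tabula pulls tables out of PDFs rather than web pages, but the spirit is the same. As a sketch only (the tabula-py wrapper and the report.pdf file name are my own assumptions; Tabula also works as a point-and-click desktop app):

import tabula  # the tabula-py package, which wraps the Tabula engine

# report.pdf is a placeholder file name; pages="all" scans the whole document.
tables = tabula.read_pdf("report.pdf", pages="all")
print(tables[0])  # each extracted table comes back as a pandas DataFrame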