Gathering data: Scraping the web


Intro

A couple of weeks ago I uploaded to my GitHub account the first version of a simple scraping script.

Getting web pages and working with them from a script, when APIs are not available, can be quite useful.

In this case, I “scraped” the crt.sh website, looking up information about my own site. Nothing aggressive, one request was all it took. But working with the HTML afterwards is a bit more elaborate.
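To give an idea of scale, the fetching part really is a single request. A minimal sketch using the curl package, with example.com standing in for the real domain:

```r
library(curl)

# One GET request to crt.sh for a domain ("example.com" is a placeholder).
res <- curl_fetch_memory("https://crt.sh/?q=example.com")

res$status_code                  # 200 if the request went through
html <- rawToChar(res$content)   # the raw HTML body, as a character string
```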

A few libraries

Although scraping is a very common use case in Data Science at large, it turns out that “parsing” HTML is not all that straightforward.

In the end I had to combine several tricks: rvest to extract particular contents (e.g. tables), XPath to locate them (using package xml2), and curl (from package… curl) to fetch the page.
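Put together, the extraction step looks roughly like this. It is a sketch only: I am assuming here that the certificate list is the last table on the crt.sh results page, which may not match the actual layout or my repo’s code exactly.

```r
library(curl)
library(xml2)
library(rvest)

# Fetch and parse the results page (placeholder domain).
res <- curl_fetch_memory("https://crt.sh/?q=example.com")
doc <- read_html(rawToChar(res$content))

# XPath (via xml2) to locate the tables on the page...
tables <- xml_find_all(doc, "//table")

# ...then rvest to turn the relevant one into a data frame.
# Assumption: the certificate list is the last table on the page.
certs <- html_table(tables[[length(tables)]])
head(certs)
```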

Then reading in the HTML tables would not keep the line breaks inside the cells, so I had to trick my way around that too.
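One common way around it (a sketch of the general trick, not necessarily the exact one in my repo) is to replace the `<br>` tags with a visible separator before parsing, so the values stacked inside a single cell can still be told apart afterwards:

```r
library(curl)
library(xml2)
library(rvest)

# Fetch the raw HTML (placeholder domain)...
html <- rawToChar(curl_fetch_memory("https://crt.sh/?q=example.com")$content)

# ...and swap each <br> for a visible separator before parsing, since
# html_table() otherwise concatenates the cell contents with nothing in between.
html_fixed <- gsub("<br\\s*/?>", " ; ", html, ignore.case = TRUE, perl = TRUE)

doc    <- read_html(html_fixed)
tables <- xml_find_all(doc, "//table")
certs  <- html_table(tables[[length(tables)]])   # same "last table" assumption as above
```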

My code for this example is on GitHub.

What for?

In this case, we’re looking for sub-domains associated with TLS certificates for our website. Putting the demo code in a loop can help look up the same thing for several domains.
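Something along these lines, say: wrap the single-domain lookup in a small function and call it in a loop. The function name and the domain list are made up for the sketch.

```r
library(curl)
library(xml2)
library(rvest)

# Hypothetical wrapper around the steps shown earlier (name made up for the sketch).
get_crtsh_table <- function(domain) {
  res  <- curl_fetch_memory(paste0("https://crt.sh/?q=", domain))
  doc  <- read_html(rawToChar(res$content))
  tabs <- xml_find_all(doc, "//table")
  html_table(tabs[[length(tabs)]])   # same "last table" assumption as above
}

domains <- c("example.com", "example.org")   # placeholder domains
results <- lapply(domains, get_crtsh_table)
names(results) <- domains
```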

YES: some error control will be needed too (I should know: I ran into problems with that particular demo code precisely because of that, a few days later). Please remember, my GitHub is not so much about perfect code as it is about demos, just like this blog. I probably re-use some of that code (I do), but then I will most certainly add error controls, checks, and plenty of refactoring of all sorts. The code with fewer controls is simply easier to read, that’s all. This is NOT what I recommend, though 😉
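For what it’s worth, the minimum I would add before re-using that loop is a tryCatch() around each lookup, so one failing domain doesn’t kill the whole run. This builds on the hypothetical get_crtsh_table() and domains from the sketch above:

```r
# Wrap each lookup so a failure is reported and skipped instead of stopping the loop.
safe_lookup <- function(domain) {
  tryCatch(
    get_crtsh_table(domain),   # hypothetical wrapper from the previous sketch
    error = function(e) {
      message("Lookup failed for ", domain, ": ", conditionMessage(e))
      NULL                     # return NULL so lapply() can carry on
    }
  )
}

results <- lapply(domains, safe_lookup)   # 'domains' as defined in the previous sketch
```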

But more generally, getting back to the point: having a script read information from a web page is a great way of gathering data automatically (when an API is not an option, that is).

And that’s all for today. (Shorter than usual, I know :))