
Git scraping is the name I’ve given a scraping technique that I’ve been experimenting with for a few years now. It’s really powerful, and I think more people should use it.

The internet is full of interesting data that changes over time. These changes can sometimes be more interesting than the underlying static data. The @nyt_diff Twitter account tracks changes made to New York Times headlines, for example, which offers a sharp insight into that publication’s editorial process.

We already have a great tool for efficiently tracking changes to text over time: Git. And GitHub Actions (and other CI systems) make it easy to build a scraper that runs every few minutes, records the current state of a resource and tracks changes to that resource over time in the commit history.

Here’s a recent example. Fires continue to rage in California, and the CAL FIRE website offers an incident map showing the latest fire activity around the state.

Firing up the Firefox Network pane, filtering to requests triggered by XHR and sorting by size, largest first, reveals this endpoint:

https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents

That’s a 241KB JSON endpoint with full details of the many fires around the state.

So… I started running a git scraper against it. My scraper lives in the simonw/ca-fires-history repository on GitHub.

Every 20 minutes it grabs the latest copy of that JSON endpoint, pretty-prints it (for diff readability) using jq and commits it back to the repo if it has changed.

This means I now have a commit log of changes to that data about fires in California. Here’s an example commit showing that last night the Zogg Fires percentage contained increased from 90% to 92%, the number of personnel involved dropped from 968 to 798 and the number of engines responding dropped from 82 to 59.
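To see how that kind of commit log reads, here’s a sketch in a throwaway local repo. The file contents, commit messages and values below are invented for illustration; the point is that each change to the JSON becomes a readable line-level diff in the history.

```shell
# Build a tiny repo with two versions of a data file, then read the
# history back as a changelog for the data itself.
set -euo pipefail
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Automated"
git config user.email "demo@example.com"

# First snapshot of the (made-up) data.
printf '{\n  "PercentContained": 90,\n  "Personnel": 968\n}\n' > incidents.json
git add -A
git commit -qm "Latest data: day one"

# Second snapshot with changed values.
printf '{\n  "PercentContained": 92,\n  "Personnel": 798\n}\n' > incidents.json
git add -A
git commit -qm "Latest data: day two"

# The commit log and diff now show exactly what changed and when.
git log --oneline -- incidents.json
git diff HEAD~1 HEAD -- incidents.json
```

Running `git diff HEAD~1 HEAD` here shows the containment percentage and personnel count moving between snapshots, which is exactly the kind of signal the real commit history surfaces.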

The implementation of the scraper is entirely contained in a single GitHub Actions workflow. It’s in a file called .github/workflows/scrape.yml which looks like this:

name: Scrape latest data

on:
  push:
  workflow_dispatch:
  schedule:
    - cron: '6,26,46 * * * *'

jobs:
  scheduled:
    runs-on: ubuntu-latest
    steps:
      - name: Check out this repo
        uses: actions/checkout@v2
      - name: Fetch latest data
        run: |-
          curl https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents | jq . > incidents.json
      - name: Commit and push if it changed
        run: |-
          git config user.name "Automated"
          git add -A
          timestamp=$(date -u)
          git commit -m "Latest data: ${timestamp}" || exit 0
          git push

That’s not a lot of code!

It runs on a schedule at 6, 26 and 46 minutes past the hour. I like to offset my cron times like this since I assume that the majority of crons run exactly on the hour, so running not-on-the-hour feels polite.

The scraper itself works by fetching the JSON using curl, piping it through jq . to pretty-print it and saving the result to incidents.json.
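The pretty-printing step matters because the endpoint returns compact JSON on a single line, and Git diffs at line granularity: one long line would turn every change into an unreadable whole-file diff. A quick sketch of what `jq .` does, using made-up sample values:

```shell
# `jq .` reformats compact JSON with one key per line, so later diffs
# pinpoint exactly which values changed.
echo '{"name":"Zogg Fires","contained":92}' | jq . > pretty.json
cat pretty.json
```

This writes the object across four lines with two-space indentation, giving a stable format for Git to diff against on each run.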

The “commit and push if it changed” block uses a pattern that commits and pushes only if the file has changed. I wrote about this pattern in a TIL a few months ago.
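The pattern can be sketched in a throwaway repo (paths and values here are invented): `git commit` exits non-zero when nothing is staged, and appending `|| exit 0` turns that failure into a clean no-op, so the workflow stays green on runs where the data hasn’t changed.

```shell
# Demonstrate the "commit only if it changed" behaviour.
set -euo pipefail
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Automated"
git config user.email "demo@example.com"

# First run: the file is new, so the commit succeeds.
echo '{"fires": 1}' > incidents.json
git add -A
git commit -qm "Latest data: $(date -u)"

# Second run: identical content, nothing to stage, commit exits
# non-zero and the fallback branch runs instead.
echo '{"fires": 1}' > incidents.json
git add -A
git commit -qm "Latest data: $(date -u)" || echo "no change, skipping"
```

Only one commit ends up in the history, even though the “scrape” ran twice.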

I have a whole bunch of repositories running git scrapers now. I’ve been labeling them with the git-scraping topic so they show up in one place on GitHub (other people have started using that topic too).

I’ve written about a number of these in the past:

I’m hoping that by giving this technique a name I can encourage more people to add it to their toolbox. It’s an extremely powerful way of turning all sorts of interesting data sources into a changelog over time.