Git scraping: track changes over time by scraping to a Git repository

Git scraping is the name I’ve given a scraping technique that I’ve been experimenting with for a few years now. It’s really effective, and more people should be using it.

The web is full of interesting data that changes over time. These changes can sometimes be more interesting than the underlying static data. The @nyt_diff Twitter account, for example, tracks changes made to New York Times headlines, which offers a sharp insight into that publication’s editorial process.

We already have a great tool for efficiently tracking changes to text over time: Git. And GitHub Actions (and other CI systems) make it easy to build a scraper that runs every few minutes, records the current state of a resource and tracks changes to that resource over time in the commit history.

Here’s a recent example. Fires continue to rage in California, and the CAL FIRE website offers an incident map showing the latest fire activity around the state.

Firing up the Firefox Network pane, filtering to requests triggered by XHR and sorting by size, largest first, reveals this endpoint:

https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents

That’s a 241KB JSON endpoint with full details of the many fires around the state.

So… I started running a Git scraper against it. My scraper lives in the simonw/ca-fires-history repository on GitHub.

Every 20 minutes it grabs the latest copy of that JSON endpoint, pretty-prints it (for diff readability) using jq and commits it back to the repo if it has changed.

This means I now have a commit log of changes to that data about fires in California. Here’s an example commit showing that last night the Zogg Fires percentage contained increased from 90% to 92%, the number of personnel involved dropped from 968 to 798 and the number of engines responding dropped from 82 to 59.
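The resulting history can be explored with ordinary Git commands. Here is a minimal local sketch of the idea, using a throwaway repo and dummy data rather than the real CAL FIRE feed: commit two snapshots of a JSON file, then read the change back with git log -p:

```shell
# Simulate two scraped snapshots in a throwaway repo (dummy data).
set -e
cd "$(mktemp -d)"
git init -q demo && cd demo
git config user.name "Demo"
git config user.email "demo@example.com"

echo '{"PercentContained": 90}' > incidents.json   # first snapshot
git add incidents.json && git commit -qm "Latest data: snapshot 1"

echo '{"PercentContained": 92}' > incidents.json   # the value changed
git add incidents.json && git commit -qm "Latest data: snapshot 2"

# Each commit's diff records exactly what changed, and when.
git log -p --oneline -- incidents.json
```

The log output shows the old value as a `-` line and the new value as a `+` line, which is the changelog the technique is after.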

Screenshot of a diff against the Zogg Fires, showing personnel involved dropping from 968 to 798, engines dropping from 82 to 59, water tenders dropping from 31 to 27 and percent contained rising from 90 to 92.

The implementation of the scraper is entirely contained in a single GitHub Actions workflow. It’s in a file called .github/workflows/scrape.yml which looks like this:

name: Scrape latest data

on:
  push:
  workflow_dispatch:
  schedule:
    - cron:  '6,26,46 * * * *'

jobs:
  scheduled:
    runs-on: ubuntu-latest
    steps:
    - name: Check out this repo
      uses: actions/checkout@v2
    - name: Fetch latest data
      run: |-
        curl https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents | jq . > incidents.json
    - name: Commit and push if it changed
      run: |-
        git config user.name "Automated"
        git config user.email "actions@users.noreply.github.com"
        git add -A
        timestamp=$(date -u)
        git commit -m "Latest data: ${timestamp}" || exit 0
        git push

That’s not a lot of code!

It runs on a schedule at 6, 26 and 46 minutes past the hour. I like to offset my cron times like this since I assume that most crons run exactly on the hour, so running not-on-the-hour feels polite.

The scraper itself works by fetching the JSON using curl, piping it through jq . to pretty-print it and saving the result to incidents.json.
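The pretty-printing step matters because compact JSON arrives as one long line, so any change rewrites that entire line and the diff is unreadable; jq . puts each field on its own line, keeping diffs small. A quick illustration with made-up data:

```shell
# Compact JSON is a single line; `jq .` expands it to one field per line,
# so a change to one field produces a one-line diff.
echo '{"name": "Zogg Fires", "PercentContained": 92}' | jq .
# {
#   "name": "Zogg Fires",
#   "PercentContained": 92
# }
```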

The “commit and push if it changed” block uses a pattern that commits and pushes only if the file has changed. I wrote about this pattern in a TIL a few months ago.
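The trick in that block is the `|| exit 0`: git commit exits non-zero when there is nothing staged to commit, so the step succeeds quietly instead of failing the workflow. An equivalent, more explicit guard (a sketch, not the exact pattern from the TIL) checks for unstaged changes first:

```shell
# Demo in a throwaway repo: only commit when the file actually changed.
set -e
cd "$(mktemp -d)"
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"
echo '{}' > incidents.json
git add incidents.json && git commit -qm "initial"

commit_if_changed() {
  # `git diff --quiet` exits non-zero when there are unstaged changes.
  if ! git diff --quiet -- incidents.json; then
    git add incidents.json
    git commit -qm "Latest data: $(date -u)"
  fi
}

commit_if_changed                        # no change: nothing committed
echo '{"changed": true}' > incidents.json
commit_if_changed                        # changed: one new commit
git rev-list --count HEAD                # prints 2
```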

I have a whole bunch of repositories running Git scrapers now. I’ve been labeling them with the git-scraping topic so that they show up in one place on GitHub (other people have started using that topic as well).

I’ve written about a few of these projects previously.

I’m hoping that by giving this technique a name I can encourage more people to add it to their toolbox. It’s an extremely powerful way of turning all sorts of interesting data sources into a changelog over time.
