I wanted to get dustbin collection days into the house calendar
server. Shouldn't be too hard, right?
It's not quite as simple as "recyclables week A, main rubbish
week B", because collections get deferred for bank holidays
(especially around Christmas), and sometimes (as last summer when the
council refused to pay extra money to the contracting company, I mean
"had a labour shortage") some collections get cancelled completely.
The local council provides this information in various ways. It used
to put a card through the door a couple of times a year, and sometimes
it still does; a PDF version of that is made available, but generally
the new one isn't released (electronically or physically) until after
the old one has expired. And of course that requires me to type in all
the exceptions by hand, and doesn't get updated for emergencies.
But help is at hand! They have a web page on which you can specify
your address, and get back the next collection for each sort of
rubbish. Not much in the way of advance notice, but they do actually
keep it up to date for extra bank holidays and suchlike. So I can just
scrape that and parse the page, right? Right?
Well.
If you are me, you already know your house's UPRN, which of course is
what they (quite reasonably) use as an input to the lookup. But you
can't just submit that. Or even type in an address. Or even bookmark
the results page. No, you have to go in through their postcode lookup.
Which needs JavaScript, so that's rather beyond what poor old
WWW::Mechanize can manage. (Somewhere behind all this
there's a straightforward API call, but I wasn't able to get it to
respond to my prodding any more simply than going through the pages;
the necessary parameters are put together by the JavaScript, and even
replaying a request captured in the browser didn't work reliably.)
This calls, in fact, for a headless browser. Selenium is the canonical
answer to this problem, but that needs a great big Java daemon – and
Java in general doesn't have the best of security reputations, nor
what one might call a small footprint. So instead I ended up using
PhantomJS – canonically a dead project, but it still works, it's in
Debian/stable, and it's much more lightweight.
This is basically a central lump of code with tentacles. To the user
it presents itself as a JavaScript interpreter; to the web it runs a
WebKit browser. One directs it with JavaScript, which I've been
learning since last year, and one can also mark code to be run
inside the context of the loaded page.
So the procedure ends up being:
- load the first page
- enter my postcode
- click on the lookup
- wait
- check the dropdown for my address
- select it, and trigger a "change" event on the dropdown
- wait
- submit the form
- wait
- get back the results page, and parse it for the dates
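
The "wait" steps are the fiddly part, since each one really means "wait until the page has caught up". Here is a minimal sketch of one way to sequence that, not necessarily how the real script is laid out. Each step has a readiness check and an action, and a `tick` function is called repeatedly until every step has fired or one has timed out. The names `makeRunner`, `ready`, `act` and `maxTicksPerStep` are made up for illustration, and the page is faked with a counter so that the sketch runs on its own; under PhantomJS the checks and actions would inspect and poke the loaded page instead.

```javascript
// Hypothetical step sequencer: each step waits until its ready() check
// passes, then runs act() exactly once before moving to the next step.
function makeRunner(steps, maxTicksPerStep) {
  var index = 0;      // position of the step currently being waited on
  var waited = 0;     // ticks spent waiting for the current step
  var failed = null;  // name of the step that timed out, if any

  function tick() {
    if (failed !== null || index >= steps.length) {
      return;         // finished or given up: nothing left to do
    }
    var step = steps[index];
    if (step.ready()) {
      step.act();
      index += 1;
      waited = 0;
    } else {
      waited += 1;
      if (waited >= maxTicksPerStep) {
        failed = step.name;
      }
    }
  }

  return {
    tick: tick,
    done: function () { return index >= steps.length; },
    failedStep: function () { return failed; }
  };
}

// Stand-in for the real page (an assumption for the demo): the address
// dropdown only "appears" on the third check after the lookup is clicked.
var log = [];
var checksSinceLookup = -1;

var runner = makeRunner([
  { name: 'lookup postcode',
    ready: function () { return true; },
    act: function () { log.push('clicked lookup'); checksSinceLookup = 0; } },
  { name: 'select address',
    ready: function () { checksSinceLookup += 1; return checksSinceLookup >= 3; },
    act: function () { log.push('selected address'); } },
  { name: 'submit form',
    ready: function () { return true; },
    act: function () { log.push('submitted form'); } }
], 20);

// Under PhantomJS this loop would be a timer, e.g. setInterval(runner.tick, 250).
var ticks = 0;
while (!runner.done() && runner.failedStep() === null) {
  runner.tick();
  ticks += 1;
}
console.log(log.join(' -> '), 'after', ticks, 'ticks');
```

Keeping the sequencing separate from the page-poking means the timeout logic can be exercised without a browser, and a step that never becomes ready is reported by name rather than hanging forever.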
In-browser JavaScript has useful methods like
document.getElementsByTagName() so I do the final HTML parsing there,
and dump JSON onto stdout for a calmanager plugin to pick up and
update my iCalendar server. (That does things like lumping
multiple collections together into a single calendar entry, and making
the actual diary event go off on the previous evening to remind me to
put the bins out on the night before what might be an early morning
pickup.)
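
To make that parenthesis concrete, here is a small sketch of the lumping-together and night-before logic. It is in JavaScript only to match the scraper; the real calmanager plugin isn't shown here and may well do it differently. The input shape (a `type` and an ISO `date` per collection), the function names, the "Bins:" summary text and the 19:00 reminder time are all assumptions.

```javascript
// Return the evening before an ISO date (YYYY-MM-DD) as a floating
// local timestamp. The 19:00 reminder time is an arbitrary assumption.
function previousEvening(isoDate) {
  var parts = isoDate.split('-').map(Number);
  // Work in UTC so the local timezone cannot shift the day; a day
  // number of zero rolls back into the previous month.
  var dayBefore = new Date(Date.UTC(parts[0], parts[1] - 1, parts[2] - 1));
  return dayBefore.toISOString().slice(0, 10) + 'T19:00:00';
}

// Merge collections that share a date into one calendar entry each,
// returned in date order.
function groupCollections(collections) {
  var byDate = {};
  collections.forEach(function (c) {
    if (!byDate[c.date]) {
      byDate[c.date] = [];
    }
    byDate[c.date].push(c.type);
  });
  // ISO dates sort chronologically when sorted as strings.
  return Object.keys(byDate).sort().map(function (date) {
    return {
      date: date,
      summary: 'Bins: ' + byDate[date].join(', '),
      reminder: previousEvening(date)
    };
  });
}

// Made-up sample data standing in for the scraper's JSON output.
var sample = [
  { type: 'refuse', date: '2024-03-08' },
  { type: 'recycling', date: '2024-03-01' },
  { type: 'garden waste', date: '2024-03-01' }
];
console.log(JSON.stringify(groupCollections(sample), null, 2));
```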
I'm not planning to make this code public, but if you have a use for
it, let me know.
I wonder how much the council paid for this overcomplicated setup?