I've just replaced a chunk of code I use for work with new versions
written in Rust.
What it's for is web server log processing; we get about two
million HTTP requests a day, and various summary numbers are wanted by
the Powers That Be. This includes distinct "visitors" (an IP address
and user-agent combination is sufficient), and ditto "visits" (a
series of page requests by one visitor, with no more than a half-hour
gap between them).
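For concreteness, the visit rule can be sketched like this (the timestamps and the `count_visits` name are mine for illustration, not the real code's):

```rust
// Sketch of the "visit" rule: a visit is a run of requests from one
// visitor with no gap longer than half an hour between them.
// Timestamps are Unix seconds, assumed already sorted per visitor.
fn count_visits(timestamps: &[u64], max_gap_secs: u64) -> u64 {
    let mut visits = 0;
    let mut last: Option<u64> = None;
    for &t in timestamps {
        match last {
            // First request ever, or a gap over the limit: a new visit.
            None => visits += 1,
            Some(prev) if t - prev > max_gap_secs => visits += 1,
            _ => {}
        }
        last = Some(t);
    }
    visits
}

fn main() {
    // Three requests close together, then one 40 minutes later: 2 visits.
    let times = [0, 600, 1200, 1200 + 40 * 60];
    assert_eq!(count_visits(&times, 30 * 60), 2);
}
```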
Making things slightly more complicated, we have multiple
public-facing servers, and some client software will spread its usage
between them. The clocks are synchronised (via NTP), but that still
means I need to look at multiple logs; and because weekly log rotation
does not necessarily happen at the same moment on every server, I need
to look at more than just the currently active log on each one.
So the basic approach, which I had in Perl, was:
- open all the logs (156-ish for a year)
- parse the first line of each
- loop:
  - process the earliest-timestamped line across all files
  - parse the next line out of that file
  - when a file reaches its end, close it
where "process" involves looking at the requested URL, and the client
host address and user-agent.
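That loop is essentially a k-way merge. A sketch of one way to do it in Rust, using a `BinaryHeap` over pre-parsed (timestamp, line) pairs rather than real file handles (so the `merge_logs` shape is mine, not the actual analyser's):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Keep the head line of each log in a min-heap keyed by timestamp,
// always take the earliest, then pull the next line from that same
// source; a source drops out of the heap when it is exhausted.
fn merge_logs(logs: Vec<Vec<(u64, String)>>) -> Vec<(u64, String)> {
    // Heap entries: Reverse((timestamp, source index, position)).
    let mut heap = BinaryHeap::new();
    for (i, log) in logs.iter().enumerate() {
        if let Some(&(ts, _)) = log.first() {
            heap.push(Reverse((ts, i, 0usize)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((ts, i, pos))) = heap.pop() {
        out.push((ts, logs[i][pos].1.clone()));
        // "Parse the next line out of that file", if there is one.
        if let Some(&(next_ts, _)) = logs[i].get(pos + 1) {
            heap.push(Reverse((next_ts, i, pos + 1)));
        }
    }
    out
}

fn main() {
    let a = vec![(1, "a1".to_string()), (4, "a2".to_string())];
    let b = vec![(2, "b1".to_string()), (3, "b2".to_string())];
    let merged = merge_logs(vec![a, b]);
    let order: Vec<u64> = merged.iter().map(|(t, _)| *t).collect();
    assert_eq!(order, vec![1, 2, 3, 4]);
}
```

The heap means each step costs O(log n) in the number of open files rather than a scan over all of them, though with ~156 files either would do.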
That's all fine. But with growing traffic (a) this was taking longer
and (b) it was starting to have trouble fitting the full list of
"visitor" tuples into the 32G machines that are the largest I have.
(And swapping that information to disc, even solid-state disc, using
DBM::Deep is Very Slow.)
Also I get asked to segment results by IP address range: for example,
from the UK versus the EU (when I first wrote this code it was "non-UK
EU", sob) and the rest of the world. (Obviously this is of very
limited utility given the existence of hosting companies and the
bizarre desire of UK legal practices in particular to funnel all their
web browsing through Microsoft machines in the US, but it may be
regarded as indicative.) OK, I can do that with MaxMind's IP address
location lists. But that then means I need to keep a separate stack of
IP-host identifiers for each of the active classes…
Also I need to take the MaxMind lists (most conveniently for me in CSV
format) and work out that, for example, the UK address ranges have the
ID tag 2635167 (that's 56,218 blocks in IPv4 space and 20,702 in
IPv6), and maybe they can be coalesced so that I have fewer ranges to
check an address against. (E.g. if 2.0.0.0/8 and 3.0.0.0/8 were both in the
same country code, I could run them together as the single range
2.0.0.0/7.) That was a separate Perl program that took a bit of work
to reconfigure for new areas of interest (e.g. "give us everything
from Singapore").
OK, enough. Also a multi-day run time for the analyser isn't great
even when it does get to the end without exhausting memory. Time to
rewrite from Perl to Rust.
So what have I ended up doing?
geoipselector takes a specification file with a series of queries,
extracts the matching ranges from the MaxMind lists, and coalesces
them into larger blocks where possible. So you tell it "all the ranges
for which the country_iso_code field is GB should be marked as the UK
block".
classifylog takes a set of logs, and a file with a set of address
ranges and classes, and spits out separate subsets of logs that match
each class. So this can take the whole year's logs and give me
separate sets of logs from GB, EU and unknown addresses.
Of course I can feed a hand-written set of classes and ranges into it
instead; for example a couple of institutions want similar reports on
their own users' use of the site, and they can usually manage to give
me a list of the relevant networks. (In that case I don't need the
"unknown" addresses.) The perl version of the analyser did this
segmenting on the fly, but I've split it out as separate code.
I still need to take an unknown address and test it against every
single network to see if it's a member, but as before I keep a hashmap
of addresses I've already seen.
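A sketch of that lookup with the seen-address cache (IPv4 only; the `Classifier` shape and the "lan" class name are illustrative, not the real classifylog code):

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;

// Linear membership test over all networks, memoised per address so
// each distinct client host is only tested once.
struct Classifier {
    // (network base, prefix length, class name)
    ranges: Vec<(u32, u8, String)>,
    cache: HashMap<Ipv4Addr, Option<String>>,
}

impl Classifier {
    fn classify(&mut self, addr: Ipv4Addr) -> Option<String> {
        // Cheap path: we've seen this address before.
        if let Some(hit) = self.cache.get(&addr) {
            return hit.clone();
        }
        let a = u32::from(addr);
        // Otherwise test against every network in turn.
        let class = self
            .ranges
            .iter()
            .find(|(net, len, _)| {
                let mask = if *len == 0 { 0 } else { u32::MAX << (32 - *len) };
                a & mask == *net
            })
            .map(|(_, _, name)| name.clone());
        self.cache.insert(addr, class.clone());
        class
    }
}

fn main() {
    let mut c = Classifier {
        ranges: vec![(u32::from(Ipv4Addr::new(192, 168, 0, 0)), 16, "lan".to_string())],
        cache: HashMap::new(),
    };
    assert_eq!(c.classify(Ipv4Addr::new(192, 168, 5, 9)), Some("lan".to_string()));
    assert_eq!(c.classify(Ipv4Addr::new(8, 8, 8, 8)), None);
}
```

Caching misses as well as hits matters here: "unknown" addresses are exactly the ones that pay the full scan, so they benefit the most.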
monthlysegment is written for one of those institutional users; they
want a report broken down by month. No problem, just look at the
logged dates and filter as above.
sasstats is the core worker in all this (the School of Advanced
Study, part of the University of London, is our parent institution).
It takes logs from a single class, processes the entries in order, and
spits out the aggregate stats. It does the other half of the job of
the Perl analyser.
Finally, csvmerge takes the sasstats output files and integrates
them into a single report covering multiple classes.
It's not at all parallel, though of course I could run multiple copies
of sasstats on different classes (or months) at the OS level. But
the session and visitor counting is essentially a bottleneck since it
all has to flow through a single aggregator; and the single address
cache when classifying logs is the sort of thing that threading
libraries seem to make hard work. (Most of the examples I see say
"spawn N workers to do this in parallel", which is fine, but they
don't care about reconciling the results from those N workers back
into a single place.) For future improvement.
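For what it's worth, one channel-based shape for that reconciliation, where partial counts from N spawned workers flow back through a single receiver (a sketch of the pattern, not my actual plan, and `parallel_total` is a made-up stand-in for the real work):

```rust
use std::sync::mpsc;
use std::thread;

// Each worker sums its own chunk (standing in for classifying or
// counting a slice of log lines), sends the partial result down a
// channel, and one receiver aggregates everything in a single place.
fn parallel_total(chunks: Vec<Vec<u64>>) -> u64 {
    let (tx, rx) = mpsc::channel();
    let mut handles = Vec::new();
    for chunk in chunks {
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            let partial: u64 = chunk.iter().sum();
            tx.send(partial).unwrap();
        }));
    }
    drop(tx); // close the sending side so the receiver can finish
    // Everything flows back through this single point.
    let total = rx.iter().sum();
    for h in handles {
        h.join().unwrap();
    }
    total
}

fn main() {
    let chunks = vec![vec![1, 2, 3], vec![10], vec![100, 200]];
    assert_eq!(parallel_total(chunks), 316);
}
```

The single-aggregator bottleneck doesn't go away, but at least the classification and parsing work happens on the workers' time.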
I'm not planning to make this code public because I don't suppose it
will be of any use to anyone except me. Also I suspect my Rust will be
very embarrassing once I've learned a bit better. But ask me if you're
interested.
But more important than any of that, even the increased speed and
memory efficiency: I had fun. I often find it harder to do
things in Rust than in Perl, which is perhaps why I still feel like a
novice after some years of playing with it, but the result is well
worth it. Yes, splitting up the logs took about half a day, and
analysing all the classes separately about half an hour, as opposed to
the four or more days the Perl version took on the same hardware, but what
matters more to me is that it was fun to write; even the compiler
error messages are often useful. And because it's strongly typed, the
class of error that I often made in Perl, in which I used the wrong
variable or treated it the wrong way, becomes a compile-time error
rather than a run-time debugging job.