I've just replaced a chunk of code I use for work with new versions
written in Rust.
What it's for is web server log processing; we get about two
million HTTP requests a day, and various summary numbers are wanted by
the Powers That Be. This includes distinct "visitors" (an IP address
and user-agent combination is sufficient), and ditto "visits" (a
series of page requests by one visitor, with no more than a half-hour
gap between them).
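For concreteness, the visit rule can be sketched like this (the timestamps and the `count_visits` name are mine for illustration, not the real code's):

```rust
// Sketch of the "visit" rule: a visit is a run of requests from one
// visitor with no gap longer than half an hour between them.
// Timestamps are Unix seconds, assumed already sorted per visitor.
fn count_visits(timestamps: &[u64], max_gap_secs: u64) -> u64 {
    let mut visits = 0;
    let mut last: Option<u64> = None;
    for &t in timestamps {
        match last {
            // First request ever, or a gap over the limit: a new visit.
            None => visits += 1,
            Some(prev) if t - prev > max_gap_secs => visits += 1,
            _ => {}
        }
        last = Some(t);
    }
    visits
}

fn main() {
    // Three requests close together, then one 40 minutes later: 2 visits.
    let times = [0, 600, 1200, 1200 + 40 * 60];
    assert_eq!(count_visits(&times, 30 * 60), 2);
}
```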
Making things slightly more complicated, we have multiple
public-facing servers, and some client software will spread its usage
between them. The clocks are synchronised (via NTP), but that still
means I need to look at multiple logs; and because weekly log rotation
does not necessarily happen at the same moment on every server, I need
to look at more than just the currently active log on each one.
So the basic approach, which I had in Perl, was:
- open all the logs (156-ish for a year)
- parse the first line of each
- loop:
  - process the earliest-timestamped line across all files
  - parse the next line out of that file
  - when a file reaches its end, close it
where "process" involves looking at the requested URL, and the client
host address and user-agent.
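That loop is essentially a k-way merge. A sketch of one way to do it in Rust, using a `BinaryHeap` over pre-parsed (timestamp, line) pairs rather than real file handles (so the `merge_logs` shape is mine, not the actual analyser's):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Keep the head line of each log in a min-heap keyed by timestamp,
// always take the earliest, then pull the next line from that same
// source; a source drops out of the heap when it is exhausted.
fn merge_logs(logs: Vec<Vec<(u64, String)>>) -> Vec<(u64, String)> {
    // Heap entries: Reverse((timestamp, source index, position)).
    let mut heap = BinaryHeap::new();
    for (i, log) in logs.iter().enumerate() {
        if let Some(&(ts, _)) = log.first() {
            heap.push(Reverse((ts, i, 0usize)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((ts, i, pos))) = heap.pop() {
        out.push((ts, logs[i][pos].1.clone()));
        // "Parse the next line out of that file", if there is one.
        if let Some(&(next_ts, _)) = logs[i].get(pos + 1) {
            heap.push(Reverse((next_ts, i, pos + 1)));
        }
    }
    out
}

fn main() {
    let a = vec![(1, "a1".to_string()), (4, "a2".to_string())];
    let b = vec![(2, "b1".to_string()), (3, "b2".to_string())];
    let merged = merge_logs(vec![a, b]);
    let order: Vec<u64> = merged.iter().map(|(t, _)| *t).collect();
    assert_eq!(order, vec![1, 2, 3, 4]);
}
```

The heap means each step costs O(log n) in the number of open files rather than a scan over all of them, though with ~156 files either would do.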
That's all fine. But with growing traffic (a) this was taking longer
and (b) it was starting to have trouble fitting the full list of
"visitor" tuples into the 32G machines that are the largest I have.
(And swapping that information to disc, even solid-state disc, using
DBM::Deep is Very Slow.)
Also I get asked to segment results by IP address range: for example,
from the UK versus the EU (when I first wrote this code it was "non-UK
EU", sob) and the rest of the world. (Obviously this is of very
limited utility given the existence of hosting companies and the
bizarre desire of UK legal practices in particular to funnel all their
web browsing through Microsoft machines in the US, but it may be
regarded as indicative.) OK, I can do that with MaxMind's IP address
location lists. But that then means I need to keep a separate stack of
IP-host identifiers for each of the active classes…
Also I need to take the MaxMind lists (most conveniently for me in CSV
format) and work out that, for example, the UK address ranges have the
ID tag 2635167 (that's 56,218 blocks in IPv4 space and 20,702 in
IPv6), and maybe they can be coalesced so that I have fewer ranges to
check an address against. (E.g. if 2.0.0.0/8 and 3.0.0.0/8 were both in the
same country code, I could run them together as the single range
2.0.0.0/7.) That was a separate Perl program that took a bit of work
to reconfigure for new areas of interest (e.g. "give us everything
from Singapore").
OK, enough. Also a multi-day run time for the analyser isn't great
even when it does get to the end without exhausting memory. Time to
rewrite from Perl to Rust.
So what have I ended up doing?
geoipselector takes a specification file with a series of queries,
extracts the matching ranges from the MaxMind lists, and coalesces
them into larger blocks where possible. So you tell it "all the ranges
for which the country_iso_code field is GB should be marked as the UK
block".
classifylog takes a set of logs, and a file with a set of address
ranges and classes, and spits out separate subsets of logs that match
each class. So this can take the whole year's logs and give me
separate sets of logs from GB, EU and unknown addresses.
Of course I can feed a hand-written set of classes and ranges into it
instead; for example a couple of institutions want similar reports on
their own users' use of the site, and they can usually manage to give
me a list of the relevant networks. (In that case I don't need the
"unknown" addresses.) The perl version of the analyser did this
segmenting on the fly, but I've split it out as separate code.
I still need to take an unknown address and test it against every
single network to see if it's a member, but as before I keep a hashmap
of addresses I've already seen.
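A sketch of that lookup with the seen-address cache (IPv4 only; the `Classifier` shape and the "lan" class name are illustrative, not the real classifylog code):

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;

// Linear membership test over all networks, memoised per address so
// each distinct client host is only tested once.
struct Classifier {
    // (network base, prefix length, class name)
    ranges: Vec<(u32, u8, String)>,
    cache: HashMap<Ipv4Addr, Option<String>>,
}

impl Classifier {
    fn classify(&mut self, addr: Ipv4Addr) -> Option<String> {
        // Cheap path: we've seen this address before.
        if let Some(hit) = self.cache.get(&addr) {
            return hit.clone();
        }
        let a = u32::from(addr);
        // Otherwise test against every network in turn.
        let class = self
            .ranges
            .iter()
            .find(|(net, len, _)| {
                let mask = if *len == 0 { 0 } else { u32::MAX << (32 - *len) };
                a & mask == *net
            })
            .map(|(_, _, name)| name.clone());
        self.cache.insert(addr, class.clone());
        class
    }
}

fn main() {
    let mut c = Classifier {
        ranges: vec![(u32::from(Ipv4Addr::new(192, 168, 0, 0)), 16, "lan".to_string())],
        cache: HashMap::new(),
    };
    assert_eq!(c.classify(Ipv4Addr::new(192, 168, 5, 9)), Some("lan".to_string()));
    assert_eq!(c.classify(Ipv4Addr::new(8, 8, 8, 8)), None);
}
```

Caching misses as well as hits matters here: "unknown" addresses are exactly the ones that pay the full scan, so they benefit the most.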
monthlysegment is written for one of those institutional users; they
want a report broken down by month. No problem, just look at the
logged dates and filter as above.
sasstats is the core worker in all this (the School of Advanced
Study, part of the University of London, is our parent institution).
It takes logs from a single class, processes the entries in order, and
spits out the aggregate stats. It does the other half of the job of
the Perl analyser.
Finally, csvmerge takes the sasstats output files and integrates
them into a single report covering multiple classes.
It's not at all parallel, though of course I could run multiple copies
of sasstats on different classes (or months) at the OS level. But
the session and visitor counting is essentially a bottleneck since it
all has to flow through a single aggregator; and the single address
cache when classifying logs is the sort of thing that threading
libraries seem to make hard work. (Most of the examples I see say
"spawn N workers to do this in parallel", which is fine, but they
don't care about reconciling the results from those N workers back
into a single place.) For future improvement.
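For what it's worth, one channel-based shape for that reconciliation, where partial counts from N spawned workers flow back through a single receiver (a sketch of the pattern, not my actual plan, and `parallel_total` is a made-up stand-in for the real work):

```rust
use std::sync::mpsc;
use std::thread;

// Each worker sums its own chunk (standing in for classifying or
// counting a slice of log lines), sends the partial result down a
// channel, and one receiver aggregates everything in a single place.
fn parallel_total(chunks: Vec<Vec<u64>>) -> u64 {
    let (tx, rx) = mpsc::channel();
    let mut handles = Vec::new();
    for chunk in chunks {
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            let partial: u64 = chunk.iter().sum();
            tx.send(partial).unwrap();
        }));
    }
    drop(tx); // close the sending side so the receiver can finish
    // Everything flows back through this single point.
    let total = rx.iter().sum();
    for h in handles {
        h.join().unwrap();
    }
    total
}

fn main() {
    let chunks = vec![vec![1, 2, 3], vec![10], vec![100, 200]];
    assert_eq!(parallel_total(chunks), 316);
}
```

The single-aggregator bottleneck doesn't go away, but at least the classification and parsing work happens on the workers' time.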
I'm not planning to make this code public because I don't suppose it
will be of any use to anyone except me. Also I suspect my Rust will be
very embarrassing once I've learned a bit better. But ask me if you're
interested.
But more important than any of that, even the increased speed and
memory efficiency: I had fun. I often find it harder to do
things in Rust than in Perl, which is perhaps why I still feel like a
novice after some years of playing with it, but the result is well
worth it. Yes, splitting up the logs took about half a day, and
analysing all the classes separately about half an hour, as opposed to
the four or more days the Perl version took on the same hardware, but what
matters more to me is that it was fun to write; even the compiler
error messages are often useful. And because it's strongly typed, the
class of error that I often made in Perl, in which I used the wrong
variable or treated it the wrong way, becomes a compile-time error
rather than a run-time debugging job.