RogerBW's Blog

Log Processing Fun With Rust 10 March 2026

I've just replaced a chunk of code I use for work with a new set of programs written in Rust.

What it's for is web server log processing; we get about two million HTTP requests a day, and various summary numbers are wanted by the Powers That Be. This includes distinct "visitors" (an IP address and user-agent combination is sufficient), and ditto "visits" (a series of page requests by one visitor, with no more than a half-hour gap between them).

Making things slightly more complicated, we have multiple public-facing servers, and some client software will spread its usage between them. The clocks are synchronised (via NTP), but that still means I need to look at multiple logs; and because weekly log rotation does not necessarily happen at the exact same moment, I need to look at more than just the active one on each server at a given moment.

So the basic approach, which I had in Perl, was:

  • open all the logs (156-ish for a year)
  • parse the first line of each
  • start loop
    • process the first-timestamped line across all files
    • parse the next line out of that file
    • when you reach the end of a file, close it

where "process" involves looking at the requested URL, and the client host address and user-agent.
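For illustration, here is a minimal sketch of that merge loop in Rust. It is not the code I actually run: the `Entry` type, the `merge_logs` name and the use of plain integer timestamps are all assumptions made for the example, and each source is taken to be an iterator of already-parsed lines in time order. A min-heap holds the current head of every open file, so finding the earliest line costs a heap pop rather than a scan of all 156-ish files.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// A parsed log line (hypothetical type). Deriving `Ord` compares
/// `timestamp` first, which is what the merge relies on.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct Entry {
    pub timestamp: u64, // seconds since the epoch, assumed already parsed
    pub line: String,
}

/// Feed entries from several individually time-ordered sources to
/// `process` in overall time order.
pub fn merge_logs<I, F>(mut sources: Vec<I>, mut process: F)
where
    I: Iterator<Item = Entry>,
    F: FnMut(Entry),
{
    let mut heap = BinaryHeap::new();
    // "Parse the first line of each": prime the heap with one entry per source.
    for (idx, src) in sources.iter_mut().enumerate() {
        if let Some(entry) = src.next() {
            heap.push(Reverse((entry, idx)));
        }
    }
    // `Reverse` turns the max-heap into a min-heap, so `pop` yields the
    // earliest pending entry across all sources.
    while let Some(Reverse((entry, idx))) = heap.pop() {
        // Refill from the source just consumed; an exhausted source
        // simply drops out of the heap.
        if let Some(next) = sources[idx].next() {
            heap.push(Reverse((next, idx)));
        }
        process(entry);
    }
}
```

Passing a closure rather than returning a collected list means only one line per file has to be in memory at a time.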

That's all fine. But with growing traffic (a) this was taking longer and (b) it was starting to have trouble fitting the full list of "visitor" tuples into the 32G machines that are the largest I have. (And swapping that information to disc, even solid-state disc, using DBM::Deep is Very Slow.)

Also I get asked to segment results by IP address range: for example, from the UK versus the EU (when I first wrote this code it was "non-UK EU", sob) and the rest of the world. (Obviously this is of very limited utility given the existence of hosting companies and the bizarre desire of UK legal practices in particular to funnel all their web browsing through Microsoft machines in the US, but it may be regarded as indicative.) OK, I can do that with MaxMind's IP address location lists. But that then means I need to keep a separate stack of IP-host identifiers for each of the active classes…

Also I need to take the MaxMind lists (most conveniently for me in CSV format) and work out that, for example, the UK address ranges have the ID tag 2635167, which covers 56,218 blocks in IPv4 and 20,702 in IPv6 space, and that maybe they can be coalesced so that I have fewer ranges to check an address against. (E.g. if 2.0.0.0/8 and 3.0.0.0/8 were both in the same country code, I could run them together as the single range 2.0.0.0/7.) That was a separate Perl program that took a bit of work to reconfigure for new areas of interest (e.g. "give us everything from Singapore").

OK, enough. Also a multi-day run time for the analyser isn't great even when it does get to the end without exhausting memory. Time to rewrite from Perl to Rust.

So what have I ended up doing?

geoipselector takes a specification file with a series of queries, extracts the matching ranges from the MaxMind lists, and coalesces them into blocks. So you tell it "all the ranges for which the country_iso_code field is GB should be marked as the UK block", and it hands back the smallest set of blocks it can make from them.
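The coalescing step can be sketched like this for IPv4. Again this is an illustration rather than the real geoipselector code: the `coalesce` function and its input shape are made up for the example, and it assumes every input block is a properly aligned network address with a prefix length of at most 32. It sorts the blocks, drops any that sit inside an earlier one, and merges two adjacent blocks of the same size whenever together they form a valid block one bit shorter.

```rust
use std::net::Ipv4Addr;

/// Number of addresses in a block with the given prefix length.
fn block_size(len: u8) -> u64 {
    1u64 << (32 - u32::from(len))
}

/// Coalesce IPv4 CIDR blocks, given as (network address, prefix length).
/// Assumes each network address is aligned to its prefix length.
pub fn coalesce(blocks: &[(Ipv4Addr, u8)]) -> Vec<(Ipv4Addr, u8)> {
    let mut sorted: Vec<(u64, u8)> = blocks
        .iter()
        .map(|&(addr, len)| (u64::from(u32::from(addr)), len))
        .collect();
    // Sort by start address; for equal starts the larger block comes first.
    sorted.sort();

    let mut out: Vec<(u64, u8)> = Vec::new();
    for (start, len) in sorted {
        // Aligned blocks either nest or are disjoint, so a block that
        // starts inside the previous one is wholly contained in it.
        if let Some(&(prev_start, prev_len)) = out.last() {
            if start < prev_start + block_size(prev_len) {
                continue;
            }
        }
        out.push((start, len));
        // Merge sibling pairs for as long as the last two blocks allow.
        while out.len() >= 2 {
            let (hi_start, hi_len) = out[out.len() - 1];
            let (lo_start, lo_len) = out[out.len() - 2];
            let siblings = lo_len == hi_len
                && lo_len > 0
                && lo_start % block_size(lo_len - 1) == 0
                && lo_start + block_size(lo_len) == hi_start;
            if !siblings {
                break;
            }
            out.truncate(out.len() - 2);
            out.push((lo_start, lo_len - 1));
        }
    }
    out.into_iter()
        .map(|(start, len)| (Ipv4Addr::from(start as u32), len))
        .collect()
}
```

With the example above, 2.0.0.0/8 and 3.0.0.0/8 come out as 2.0.0.0/7, whereas 1.0.0.0/8 and 2.0.0.0/8 stay separate because 1.0.0.0 is not aligned on a /7 boundary.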

classifylog takes a set of logs, and a file with a set of address ranges and classes, and spits out separate subsets of logs that match each class. So this can take the whole year's logs and give me separate sets of logs from GB, EU and unknown addresses.

Of course I can feed a hand-written set of classes and ranges into it instead; for example a couple of institutions want similar reports on their own users' use of the site, and they can usually manage to give me a list of the relevant networks. (In that case I don't need the "unknown" addresses.) The Perl version of the analyser did this segmenting on the fly, but I've split it out as separate code.

I still need to take an unknown address and test it against every single network to see if it's a member, but as before I keep a hashmap of addresses I've already seen.
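As a rough sketch of that lookup (IPv4 only, with a made-up `Classifier` type and made-up class names, so not the real classifylog): each network is stored as an inclusive address range, a new address is checked against every range in turn, and the answer, including "no class at all", is remembered in a hashmap so that repeat visitors cost a single lookup.

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;

/// Hypothetical classifier: linear scan over ranges plus a per-address cache.
pub struct Classifier {
    ranges: Vec<(u32, u32, String)>, // inclusive first and last address, class
    cache: HashMap<Ipv4Addr, Option<String>>,
}

impl Classifier {
    /// Build from (network address, prefix length, class name) triples.
    pub fn new(blocks: &[(Ipv4Addr, u8, &str)]) -> Self {
        let ranges = blocks
            .iter()
            .map(|&(net, len, class)| {
                let lo = u32::from(net);
                // Mask of the host bits; a /32 has none, and shifting a
                // u32 by 32 is not allowed, hence checked_shr.
                let host_bits = u32::MAX.checked_shr(u32::from(len)).unwrap_or(0);
                (lo, lo | host_bits, class.to_string())
            })
            .collect();
        Classifier { ranges, cache: HashMap::new() }
    }

    /// Class for this address, or None if it matches no range.
    pub fn classify(&mut self, addr: Ipv4Addr) -> Option<&str> {
        let ranges = &self.ranges;
        self.cache
            .entry(addr)
            .or_insert_with(|| {
                // Only reached the first time an address is seen.
                let a = u32::from(addr);
                ranges
                    .iter()
                    .find(|(lo, hi, _)| *lo <= a && a <= *hi)
                    .map(|(_, _, class)| class.clone())
            })
            .as_deref()
    }

    /// Number of distinct addresses seen so far.
    pub fn cache_len(&self) -> usize {
        self.cache.len()
    }
}
```

Caching the misses matters as much as caching the hits, since an address that matches nothing is the one that costs a full scan of every range.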

monthlysegment is written for one of those institutional users; they want a report broken down by month. No problem, just look at the logged dates and filter as above.

sasstats is the core worker in all this (the School of Advanced Study, part of the University of London, is our parent institution). It takes logs from a single class, processes the entries in order, and spits out the aggregate stats. It does the other half of the job of the Perl analyser.
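The visitor and visit counting at the heart of this can be sketched as follows. It is a simplified illustration with invented names (`VisitCounter`, `record`), not the sasstats source, and it assumes entries arrive in time order with timestamps in seconds. A visitor is an (IP address, user-agent) pair; a request from a visitor not seen before, or seen more than half an hour ago, starts a new visit.

```rust
use std::collections::HashMap;
use std::net::IpAddr;

/// Longest gap between requests that still counts as the same visit.
const VISIT_GAP_SECS: u64 = 30 * 60;

/// Hypothetical aggregator for distinct visitors and visits.
#[derive(Default)]
pub struct VisitCounter {
    last_seen: HashMap<(IpAddr, String), u64>, // visitor -> last request time
    visits: u64,
}

impl VisitCounter {
    /// Record one request. Requests must be fed in timestamp order.
    pub fn record(&mut self, ip: IpAddr, user_agent: &str, timestamp: u64) {
        let key = (ip, user_agent.to_string());
        // `insert` returns the previous timestamp for this visitor, if any.
        match self.last_seen.insert(key, timestamp) {
            // Seen within the last half hour: same visit continues.
            Some(prev) if timestamp.saturating_sub(prev) <= VISIT_GAP_SECS => {}
            // New visitor, or a gap of more than half an hour: new visit.
            _ => self.visits += 1,
        }
    }

    pub fn visitors(&self) -> usize {
        self.last_seen.len()
    }

    pub fn visits(&self) -> u64 {
        self.visits
    }
}
```

This is why the logs have to be merged into time order first: the gap test only works if each visitor's requests are seen in sequence.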

Finally, csvmerge takes the sasstats output files and integrates them into a single report covering multiple classes.

It's not at all parallel, though of course I could run multiple copies of sasstats on different classes (or months) at the OS level. But the session and visitor counting is essentially a bottleneck since it all has to flow through a single aggregator; and the single address cache when classifying logs is the sort of thing that threading libraries seem to make hard work. (Most of the examples I see say "spawn N workers to do this in parallel", which is fine, but they don't care about reconciling the results from those N workers back into a single place.) For future improvement.
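For the simplest part of that reconciliation, counting distinct visitors, the standard library's scoped threads may be enough. The sketch below is an assumption about how it could be done, not something I have tried on the real data: the `distinct_visitors` function and the pre-split `chunks` input are invented for the example. Each worker builds its own set of visitor keys, and the partial sets are merged into one once every worker has been joined. Visits are harder, because a visit can straddle a chunk boundary.

```rust
use std::collections::HashSet;
use std::thread;

/// Count distinct (address, user-agent) pairs across several chunks,
/// one worker thread per chunk.
pub fn distinct_visitors(chunks: &[Vec<(String, String)>]) -> usize {
    // Scoped threads may borrow `chunks`, so nothing is copied.
    let partials: Vec<HashSet<&(String, String)>> = thread::scope(|s| {
        // Spawn every worker before joining any, so that they overlap.
        let handles: Vec<_> = chunks
            .iter()
            .map(|chunk| s.spawn(move || chunk.iter().collect::<HashSet<_>>()))
            .collect();
        // Each join hands back that worker's partial result.
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .collect()
    });

    // Reconcile: the union of the per-worker sets is the overall set.
    let mut all = HashSet::new();
    for partial in partials {
        all.extend(partial);
    }
    all.len()
}
```

The merge at the end is still single-threaded, so this only pays off if building the partial sets is the expensive part.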

I'm not planning to make this code public because I don't suppose it will be of any use to anyone except me. Also I suspect my Rust will be very embarrassing once I've learned a bit better. But ask me if you're interested.

But more importantly than the increased speed and memory efficiency, I had fun. I often find it harder to do things in Rust than in Perl, which is perhaps why I still feel like a novice after some years of playing with it, but the result is well worth it. Yes, splitting up the logs took about half a day, and analysing all the classes separately about half an hour, as opposed to the four days or more that the Perl version took on the same hardware, but what matters more to me is that it was fun to write; even the compiler error messages are often useful. And because it's strongly typed, the class of error that I often made in Perl, in which I used the wrong variable or treated it the wrong way, becomes a compile-time error rather than a run-time debugging job.
