RogerBW's Blog

Log Processing Fun With Rust

10 March 2026

I've just replaced a chunk of code I use for work with new versions written in Rust.

What it's for is web server log processing; we get about two million HTTP requests a day, and various summary numbers are wanted by the Powers That Be. This includes distinct "visitors" (an IP address and user-agent combination is sufficient), and ditto "visits" (a series of page requests by one visitor, with no more than a half-hour gap between them).
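
The visitor and visit definitions above can be sketched in a few lines of Rust. This is a minimal illustration, not the actual code: it assumes records arrive in chronological order as (unix timestamp, IP, user-agent) tuples, and the `count` function name is hypothetical.

```rust
use std::collections::HashMap;

const VISIT_GAP_SECS: u64 = 30 * 60; // a new visit starts after a >30-minute gap

/// Count distinct visitors and visits from (timestamp, ip, user_agent)
/// records that arrive in chronological order.
fn count(records: &[(u64, &str, &str)]) -> (usize, usize) {
    // Visitor key (IP + user-agent) -> timestamp of that visitor's last request.
    let mut last_seen: HashMap<(String, String), u64> = HashMap::new();
    let mut visits = 0;
    for &(ts, ip, ua) in records {
        let key = (ip.to_string(), ua.to_string());
        match last_seen.get(&key) {
            // Within half an hour of this visitor's last request: same visit.
            Some(&prev) if ts - prev <= VISIT_GAP_SECS => {}
            // First request ever, or the gap was exceeded: a new visit.
            _ => visits += 1,
        }
        last_seen.insert(key, ts);
    }
    (last_seen.len(), visits)
}
```

The map of visitor keys is exactly the structure that grows with traffic, which is where the memory pressure described below comes from.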

Making things slightly more complicated, we have multiple public-facing servers, and some client software will spread its usage between them. The clocks are synchronised (via NTP), but that still means I need to look at multiple logs; and because weekly log rotation does not necessarily happen at the exact same moment, I need to look at more than just the active one on each server at a given moment.

So the basic approach, which I had in Perl, was:

  • open all the logs (156-ish for a year)
  • parse the first line of each
  • start loop
    • process the first-timestamped line across all files
    • parse the next line out of that file
    • when you reach the end of a file, close it

where "process" involves looking at the requested URL, and the client host address and user-agent.
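
The loop above is a classic k-way merge, and in Rust it falls out naturally from a min-heap of "next line per file". A minimal sketch (not the actual code; `merge_logs` is a hypothetical name, and it assumes each line begins with a lexically sortable timestamp):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::io::{BufRead, BufReader, Cursor};

/// Merge several already-sorted log streams into one chronological stream.
/// Assumes each line starts with a sortable timestamp, e.g. "2026-03-10T00:01 ...".
fn merge_logs<R: BufRead>(readers: Vec<R>) -> Vec<String> {
    let mut heap = BinaryHeap::new();
    let mut lines: Vec<_> = readers.into_iter().map(BufRead::lines).collect();

    // Parse the first line of each file to prime the heap.
    for (i, it) in lines.iter_mut().enumerate() {
        if let Some(Ok(line)) = it.next() {
            heap.push(Reverse((line, i)));
        }
    }

    let mut out = Vec::new();
    // Process the first-timestamped line across all files, then refill
    // from the same file; a file that runs out stops contributing.
    while let Some(Reverse((line, i))) = heap.pop() {
        if let Some(Ok(next)) = lines[i].next() {
            heap.push(Reverse((next, i)));
        }
        out.push(line);
    }
    out
}

fn main() {
    let a = BufReader::new(Cursor::new("2026-03-10T00:01 a\n2026-03-10T00:05 a\n"));
    let b = BufReader::new(Cursor::new("2026-03-10T00:02 b\n2026-03-10T00:03 b\n"));
    for line in merge_logs(vec![a, b]) {
        println!("{line}");
    }
}
```

Wrapping the tuples in `Reverse` turns `BinaryHeap`'s max-heap into the min-heap the algorithm wants, so only one line per open file is held in memory at a time.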

That's all fine. But with growing traffic (a) this was taking longer and (b) it was starting to have trouble fitting the full list of "visitor" tuples into the 32G machines that are the largest I have. (And swapping that information to disc, even solid-state disc, using DBM::Deep is Very Slow.)

Also I get asked to segment results by IP address range: for example, from the UK versus the EU (when I first wrote this code it was "non-UK EU", sob) and the rest of the world. (Obviously this is of very limited utility given the existence of hosting companies and the bizarre desire of UK legal practices in particular to funnel all their web browsing through Microsoft machines in the US, but it may be regarded as indicative.) OK, I can do that with MaxMind's IP address location lists. But that then means I need to keep a separate stack of IP-host identifiers for each of the active classes…

Also I need to take the MaxMind lists (most conveniently for me in CSV format) and work out that, for example, the UK address ranges have the ID tag 2635167, that's 56,218 blocks in IPv4 and 20,702 in IPv6 space, and maybe they can be coalesced so that I have fewer ranges to check an address against. (E.g. if 2.0.0.0/8 and 3.0.0.0/8 were both in the same country code, I could run them together as the single range 2.0.0.0/7.) That was a separate Perl program that took a bit of work to reconfigure for new areas of interest (e.g. "give us everything from Singapore").
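
The coalescing step is simple once CIDR blocks are converted to numeric intervals: sort them, then fold each range into the previous one whenever they touch or overlap. A minimal sketch under those assumptions (`coalesce` and `ip` are hypothetical names, IPv4 only):

```rust
use std::net::Ipv4Addr;

/// Coalesce IPv4 ranges (inclusive start..=end, as u32) into maximal
/// contiguous runs, so an address needs checking against fewer ranges.
fn coalesce(mut ranges: Vec<(u32, u32)>) -> Vec<(u32, u32)> {
    ranges.sort_unstable();
    let mut out: Vec<(u32, u32)> = Vec::new();
    for (start, end) in ranges {
        match out.last_mut() {
            // This range touches or overlaps the previous run: extend it.
            Some(last) if start <= last.1.saturating_add(1) => {
                last.1 = last.1.max(end);
            }
            _ => out.push((start, end)),
        }
    }
    out
}

/// Convenience: an IPv4 address as its u32 value.
fn ip(a: u8, b: u8, c: u8, d: u8) -> u32 {
    u32::from(Ipv4Addr::new(a, b, c, d))
}

fn main() {
    // 2.0.0.0/8 and 3.0.0.0/8 are adjacent, so they merge into a single
    // run equivalent to 2.0.0.0/7.
    let merged = coalesce(vec![
        (ip(2, 0, 0, 0), ip(2, 255, 255, 255)),
        (ip(3, 0, 0, 0), ip(3, 255, 255, 255)),
    ]);
    assert_eq!(merged, vec![(ip(2, 0, 0, 0), ip(3, 255, 255, 255))]);
}
```

Working in plain intervals rather than CIDR notation sidesteps the prefix arithmetic: a merged run need not be expressible as a single CIDR block, and for membership testing it doesn't have to be.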

OK, enough. Also a multi-day run time for the analyser isn't great even when it does get to the end without exhausting memory. Time to rewrite from Perl to Rust.

So what have I ended up doing?

geoipselector takes a specification file with a series of queries, extracts the matching ranges from the MaxMind lists, and coalesces them into blocks. So you tell it "all the ranges for which the country_iso_code field is GB should be marked as the UK block".

classifylog takes a set of logs, and a file with a set of address ranges and classes, and spits out separate subsets of logs that match each class. So this can take the whole year's logs and give me separate sets of logs from GB, EU and unknown addresses.

Of course I can feed a hand-written set of classes and ranges into it instead; for example a couple of institutions want similar reports on their own users' use of the site, and they can usually manage to give me a list of the relevant networks. (In that case I don't need the "unknown" addresses.) The Perl version of the analyser did this segmenting on the fly, but I've split it out as separate code.

I still need to take an unknown address and test it against every single network to see if it's a member, but as before I keep a hashmap of addresses I've already seen.
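
That scan-plus-cache pattern might look like the following sketch. It's illustrative, not the actual code: the `Classifier` type and its fields are hypothetical, and real ranges would come from the geoipselector output rather than being written inline.

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;

/// Classify an address against (start, end, class) ranges, memoising
/// results so each distinct address is only scanned once.
struct Classifier {
    ranges: Vec<(u32, u32, &'static str)>,
    cache: HashMap<u32, Option<&'static str>>,
}

impl Classifier {
    fn classify(&mut self, addr: Ipv4Addr) -> Option<&'static str> {
        let ip = u32::from(addr);
        // Already seen this address: answer from the cache.
        if let Some(&hit) = self.cache.get(&ip) {
            return hit;
        }
        // Cache miss: test against every single range (linear scan).
        let class = self
            .ranges
            .iter()
            .find(|&&(start, end, _)| start <= ip && ip <= end)
            .map(|&(_, _, class)| class);
        self.cache.insert(ip, class);
        class
    }
}

fn main() {
    let mut c = Classifier {
        // Hypothetical range standing in for a coalesced UK block.
        ranges: vec![(
            u32::from(Ipv4Addr::new(2, 0, 0, 0)),
            u32::from(Ipv4Addr::new(3, 255, 255, 255)),
            "UK",
        )],
        cache: HashMap::new(),
    };
    assert_eq!(c.classify(Ipv4Addr::new(2, 17, 0, 1)), Some("UK"));
    assert_eq!(c.classify(Ipv4Addr::new(9, 9, 9, 9)), None);
}
```

Since the ranges are sorted and non-overlapping after coalescing, the linear scan could be swapped for a binary search, but with the cache absorbing repeat addresses it may not matter much.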

monthlysegment is written for one of those institutional users; they want a report broken down by month. No problem, just look at the logged dates and filter as above.

sasstats is the core worker in all this (the School of Advanced Study, part of the University of London, is our parent institution). It takes logs from a single class, processes the entries in order, and spits out the aggregate stats. It does the other half of the job of the Perl analyser.

Finally, csvmerge takes the sasstats output files and integrates them into a single report covering multiple classes.

It's not at all parallel, though of course I could run multiple copies of sasstats on different classes (or months) at the OS level. But the session and visitor counting is essentially a bottleneck since it all has to flow through a single aggregator; and the single address cache when classifying logs is the sort of thing that threading libraries seem to make hard work. (Most of the examples I see say "spawn N workers to do this in parallel", which is fine, but they don't care about reconciling the results from those N workers back into a single place.) For future improvement.

I'm not planning to make this code public because I don't suppose it will be of any use to anyone except me. Also I suspect my Rust will be very embarrassing once I've learned a bit better. But ask me if you're interested.

But most importantly, more important even than the increased speed and memory efficiency: I had fun. I often find it harder to do things in Rust than in Perl, which is perhaps why I still feel like a novice after some years of playing with it, but the result is well worth it. Yes, splitting up the logs took about half a day, and analysing all the classes separately about half an hour, as opposed to four or more days with Perl on the same hardware; but what matters more to me is that it was fun to write, and even the compiler error messages are often useful. And because Rust is strongly typed, the class of error I often made in Perl, in which I used the wrong variable or treated it the wrong way, becomes a compile-time error rather than a run-time debugging job.


  1. Posted by Owen Smith at 03:43pm on 10 March 2026

    At work I keep finding syntax errors or typos in variable names in Python that people wrote 5 years ago. The (to my mind unnecessary but forced on us) conversion from Python 2 to 3 introduced a ton of errors, which again we are still finding. It seems bizarre to me that as an industry we're still promoting and developing interpreted languages, I thought we learned decades ago that compiled languages are better for many reasons. Compile time versus runtime errors being one of the big ones.

  2. Posted by RogerBW at 04:06pm on 10 March 2026

    I think you are conflating two separate considerations.

    It would be entirely possible to construct a variant of Python with more translation-time checking even if that isn't "compilation" in the strictest sense, for example making the type system compulsory; similarly it would be possible for Rust to have a less strict compiler that would allow through more problems that would produce run-time errors.

    I have many objections to Python, some of which I've blogged about, but I don't hold its interpreted nature against it.

  3. Posted by John P at 10:19pm on 11 March 2026

    I've probably not understood everything you're trying to do, but would it make sense to turn the data you need into records in a database. Then you can slice & dice any way you want? Either by pulling the log files in or getting the various servers to chuck them at a webservice periodically. In fact, could get them to dupe the data into the database at the same time as they write it to their local log? The database could have a table of addresses/networks that serves as a lookup for grouping. Dunno. That was just my first thought anyway. 2p.

  4. Posted by RogerBW at 09:28am on 12 March 2026

    There are definitely cases in which that would be useful, but since the vast majority of the processing here is necessarily of every single record, one at a time in chronological order, the database's ability to select many records that match a criterion isn't particularly helpful.

    (And then there's the overhead of translating them into the database in the first place.)

    If I could throw a year's worth of logs at a standard sorting program and tell it to arrange them all in order, I would, rather than the faffing about with reading the first line of each of 150 separate files; but that just runs out of memory.

