RogerBW's Blog

How I saved 2.2 terabytes by disabling compression
17 February 2019

As part of the procedure for making sure a large set of work-related data remains intact and recoverable, I keep backups at home.

In the early days this was compressed and then burned to CD-ROMs and DVD-ROMs, but the data sets got larger, so I've been keeping them on the file server. (Which is on RAID-6 and backed up nightly.)

Force of habit meant that, while I did each monthly backup by rsync, I still compressed the results separately as if I were going to burn them. So while an individual month's backup might have shrunk from 22G to 11G or so, the next month took up another 11G even though most of the contents were unchanged. This was up to about 2.3 tebibytes (out of about 48 on the server, so while it wasn't a major burden it was starting to make itself felt).

So I've decompressed them all, and then hard-linked identical files together; and the result is a mere 110G, less than 5% of the compressed size.
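The post does not say which tool did the hard-linking, so what follows is only a minimal Python sketch of the general idea: group files by size, hash the candidates, and replace later copies with hard links to the first one seen. The function names (`file_digest`, `hardlink_duplicates`) and the `.hardlink-tmp` suffix are hypothetical, and the sketch assumes that everything under the root is on one filesystem and that it is acceptable for linked files to end up sharing ownership, permissions and timestamps.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hardlink_duplicates(root):
    """Replace byte-identical regular files under root with hard links.

    Returns the number of files that were relinked. Metadata differences
    (owner, mode, mtime) are ignored, which is an assumption of this sketch.
    """
    # Files of different sizes cannot be identical, so bucket by size first.
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            by_size[os.path.getsize(path)].append(path)

    relinked = 0
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        first_seen = {}  # digest -> path kept as the link target
        for path in sorted(paths):
            keeper = first_seen.setdefault(file_digest(path), path)
            if keeper == path or os.path.samefile(keeper, path):
                continue  # this is the keeper, or it is already linked to it
            # Link under a temporary name, then rename over the duplicate,
            # so the path never disappears part-way through.
            tmp = path + ".hardlink-tmp"
            os.link(keeper, tmp)
            os.replace(tmp, path)
            relinked += 1
    return relinked
```

Calling `hardlink_duplicates("/path/to/backups")` on a tree of monthly rsync snapshots would leave each unchanged file stored once, however many months refer to it. Try it on a copy first.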

Compression might still offer some space saving, but only if it were done at the individual file level, and it's useful having the files immediately ready for access. This is why I do my backups with rsync into a filesystem: almost always, what I want to restore is not the full machine image but a single file or directory, and having to rootle through some non-filesystem interface to get out what I want produces more faff than having smaller backups would save.

Tags: computing linux

Posted by John Dallman at 09:33am on 17 February 2019

So a transparently compressing filesystem, which also had hardlinks, would be ideal? Like you, I do backups into filesystems and for the same reason; recovering one or two files is much commoner than needing to rebuild a machine.

Posted by RogerBW at 09:45am on 17 February 2019

Well, with zfs (which is what I'm using) it's actually standard practice to turn on filesystem compression anyway - even if the data are relatively incompressible, like video files, the cost in CPU time is less than the saving in disc transfer time. But any saving from that doesn't show up in the disc usage stats. I can see how much space a whole filesystem is taking up, but this particular one contains both the backups and other things.
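One possible way to estimate the compression saving for a single directory tree, rather than the whole filesystem, is to compare each file's apparent size with the space allocated to it. This is a hedged sketch and not something the post describes doing: it assumes a Unix-like system where `st_blocks` is counted in 512-byte units, and it assumes the filesystem reports post-compression allocation there, which is worth checking on your own setup. The name `tree_usage` is hypothetical.

```python
import os
import stat


def tree_usage(root):
    """Return (apparent_bytes, allocated_bytes) for regular files under root.

    Hard-linked files are counted once, keyed by (device, inode), so a tree
    of hard-linked snapshots is not counted several times over.
    """
    seen = set()
    apparent = 0
    allocated = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            st = os.lstat(os.path.join(dirpath, name))
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks, sockets, device nodes
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue
            seen.add(key)
            apparent += st.st_size
            # Assumption: st_blocks is in 512-byte units (true on Linux).
            allocated += st.st_blocks * 512
    return apparent, allocated
```

If `allocated` comes out well below `apparent` for the backup tree, that difference is a rough measure of what compression (and any sparse files) is saving there.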

The next stage would be to turn on block-level deduplication in the filesystem - at which point the hardlinks would become irrelevant. But this is moderately expensive in both CPU and memory on the file server.

Posted by Peter C at 12:26pm on 18 February 2019

The popular advice to avoid dedupe is based on systems from the 2000s running the original Sun ZFS.

The usual warning is about the size of the DDT in RAM. Essentially, the DDT should not grow larger than RAM. People tend to assume that this is because it is accessed randomly and needs to be cached to not crater performance on hard disks, but a more important reason is that if the system crashes, it may run out of kernel memory trying to replay the journal on reboot. ZoL is more forgiving than FreeBSD on this front, and is how I got my data back when a FreeNAS box went castors-up.

The size of the DDT scales with the number of records. (It's a B-tree so slightly greater than linear scaling, but close enough for our purposes.) The advice of 2-5 GB of RAM per TB of disk is based on recordsize=128k, which was the historical maximum. Contemporary OpenZFS lets you set recordsize=1M -- which I recommend you use by default unless you can justify some other value -- and with an "I know what I'm doing" sysctl, recordsize=16M. This reduces the DDT size by a factor of 8 or 128 on large files, which is nice. The downside is that it makes no difference if you mainly have small files, and it will do less deduplication except in the cases where the files could have just been hardlinked together anyway.
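The factors of 8 and 128 follow directly from the record counts. As a rough sketch, assuming the data consists entirely of large files that fill whole records and treating the DDT as strictly one entry per unique record (the function name `ddt_entries` and the 1 TiB figure are illustrative, not from the comment):

```python
KIB = 1024


def ddt_entries(data_bytes, recordsize_bytes):
    """Estimate DDT entries as one per record, rounding up.

    Assumes every record is unique and every file is much larger than
    recordsize, so this is an upper bound for large-file workloads only.
    """
    return -(-data_bytes // recordsize_bytes)  # ceiling division


data = 2**40  # 1 TiB of large files, chosen so the divisions are exact
baseline = ddt_entries(data, 128 * KIB)
for label, recordsize in [("128k", 128 * KIB),
                          ("1M", 1024 * KIB),
                          ("16M", 16 * 1024 * KIB)]:
    entries = ddt_entries(data, recordsize)
    print(f"recordsize={label}: {entries} entries, "
          f"{baseline // entries}x fewer than at 128k")
```

For small files the picture is different: a file smaller than recordsize occupies a single record whatever the setting, so the entry count does not shrink.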

On a sample test box with 4.3TB of "Linux ISOs", with recordsize=1M, compression=lz4 and dedup=on, the average record is 1022kB, which compares favourably with the record size of 1049kB/1MiB. There are 4.2M DDT entries, "size 895B on disk, 144B in core", i.e. about 3GB of diskspace and 600MB of memory (or 150MB/TB). Both are negligible on a modern system. Dedupe and compression save about 100GB each, so it seems to have been worthwhile to turn both on, even though these files have supposedly already been compressed and shouldn't have much intra- or inter-file redundancy.
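For anyone wanting to redo this sum for their own pool, here is a small sketch that multiplies out the quoted entry count and per-entry sizes. The inputs are the rounded figures from the comment above; the function name `ddt_footprint` is hypothetical, and the units are power-of-ten as in the rest of this comment.

```python
def ddt_footprint(entries, disk_bytes_per_entry, core_bytes_per_entry):
    """Return (on_disk_bytes, in_core_bytes) for a dedup table."""
    return entries * disk_bytes_per_entry, entries * core_bytes_per_entry


entries = 4_200_000   # "4.2M DDT entries"
pool_tb = 4.3         # "4.3TB" of data
on_disk, in_core = ddt_footprint(entries, 895, 144)

print(f"on disk: {on_disk / 1e9:.2f} GB")
print(f"in core: {in_core / 1e6:.1f} MB")
print(f"in core per TB of data: {in_core / 1e6 / pool_tb:.1f} MB")
```

With these rounded inputs the product is roughly 3.76 GB on disk and 605 MB in core, or about 141 MB per TB, so the in-core figure matches the quoted one closely and the on-disk figure lands a little above the quoted "about 3GB". Either way the conclusion stands: both are small on a modern system.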

The larger recordsize may already reduce the I/O hit of dedupe to acceptable levels, but it can be mitigated further by adding a small L2ARC such as a reasonable-quality USB key and setting "secondarycache=metadata" to ensure that the L2ARC only accumulates DDTs, directories, inode tables etc, and is not filled up with large files which are cheap to get from disk and would wear out the flash. This is not a bad idea even without dedup. A €20 128GB key serves my needs here; there is never more than about 8GB written to it, but my old 8GB keys are too slow and knackered for this purpose.

(All numbers given here are proper power-of-ten SI units, unless I'm quoting somebody else's misuse of power-of-two non-SI units. Some figures obviously have enough slop in them that it doesn't really matter anyway.)

