RogerBW's Blog

Moving filenames to UTF-8 11 May 2019

I've been renaming my large music collection into UTF-8.

Not re-tagging; I tag files when I notice a lack of tags, not as one big project. Just renaming them, because while working on my mpd tools I noticed that mpd won't index a file whose name isn't valid UTF-8. (Which makes my job as a tool-writer easier, because that means I can assume valid UTF-8 when working with filenames.)

Checking something like a million files for invalid characters was obviously not something I wanted to do by eye, but fortunately I'm running *ix so I have monstrously powerful tools at my disposal.

find /path/to/files -mindepth 1 -maxdepth 1 | split -l1

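One depth level at a time, because I want to rename a directory before I rename the files in it. Each filename ends up in a separate file in my working directory (called xaa, xab, etc.; split's default two-letter suffix allows 676 of these per pass, and -a asks for a longer suffix if more are needed).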

for V in x*; do isutf8 -q "$V" && rm "$V"; done

For each file with a filename in it, see if it's valid UTF-8 (the testing program is in the Debian "moreutils" package, which I install everywhere because of the monstrously-useful parallel and vidir commands). If it is, delete it because we don't need to do anything about it. (In an ideal world I'd have bothered to parallelise this.)
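The same check can be sketched without the one-file-per-name step, assuming iconv is installed: asking it to convert from UTF-8 to UTF-8 fails on any invalid sequence, so it can stand in for isutf8. The function names is_valid_utf8 and list_invalid_names are made up for this illustration, and like the commands above it assumes there are no newlines in filenames.

```shell
# Succeeds only if the string $1 is valid UTF-8.
# Assumption: iconv exits non-zero on invalid input when asked to
# convert UTF-8 to UTF-8 (glibc's iconv does).
is_valid_utf8() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Print every path exactly $2 levels below directory $1 that is not
# valid UTF-8. Paths containing newlines are not handled.
list_invalid_names() {
    find "$1" -mindepth "$2" -maxdepth "$2" | while IFS= read -r name; do
        is_valid_utf8 "$name" || printf '%s\n' "$name"
    done
}
```

Something like list_invalid_names /path/to/files 1 | sort >a would then produce the same file a in one step.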

cat x* |sort >a; rm x*; cp -a a b

Put all those filenames into one big file, and make another copy of it.

I then looked through file b to see how the filenames had been encoded. (This had often been done blindly on the basis of what some CD lookup service returned.) Mostly they were ISO-8859-1/Windows-1252 (about two thirds of the total) or CP850 (about one third), but a couple of discs of Polish music seemed to be in Windows-1250 or something very like it (not ISO-8859-2, at any rate). I visually inspected the odd characters in the names in file b (if a name has bytes in the range 0x80-0x9f in it, it's not ISO-8859-anything), and used iconv to convert them into something valid. (I did this in vi with LANG=C, so that it wouldn't try to do any automatic character set conversions for me.)
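The 0x80-0x9f test and the conversion itself can be sketched as two small functions. The names has_c1_bytes and convert_name are invented here, and the byte test only means anything for names already known not to be UTF-8, since UTF-8 uses those byte values too. Choosing the right source encoding still needs a human eye.

```shell
# True if $1 contains any byte in the range 0x80-0x9f (octal 200-237).
# In ISO-8859-* those are control codes, so a name containing them is
# more likely CP850 or a Windows-125x code page.
has_c1_bytes() {
    [ "$(printf '%s' "$1" | LC_ALL=C tr -d '\200-\237')" != "$1" ]
}

# Convert the string $2 from encoding $1 to UTF-8.
convert_name() {
    printf '%s' "$2" | iconv -f "$1" -t UTF-8
}
```

For instance, the byte 0xe9 read as ISO-8859-1 and the byte 0x82 read as CP850 both come out as é.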

And then there were the Japanese discs (anime soundtracks mostly). I resorted at times to running iconv with each of the character sets it can convert from (into UTF-8), putting the results into a series of files, and then grepping all of them with a known track name in the original Japanese characters to see if any of them was the right one. Even that didn't always work, and in a few cases I simply looked up the relevant CD and copied the track names from whatever site would give me a plausible-looking listing.
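That brute-force search might look roughly like this. It is a sketch rather than the exact commands I ran: guess_encoding is a made-up name, the candidate encodings are passed in by hand rather than parsed out of iconv -l (whose output format varies between implementations), and the results go straight into grep rather than into a series of files.

```shell
# Print each candidate encoding under which file $1, once converted to
# UTF-8, contains the known string $2. The remaining arguments are the
# encoding names to try; `iconv -l` lists what your iconv supports.
guess_encoding() {
    file=$1 known=$2
    shift 2
    for enc in "$@"; do
        # A failed conversion simply produces no match.
        if iconv -f "$enc" -t UTF-8 "$file" 2>/dev/null | grep -qF -- "$known"; then
            printf '%s\n' "$enc"
        fi
    done
}
```

A call such as guess_encoding b 'some known track title' SHIFT_JIS EUC-JP ISO-2022-JP prints whichever of those encodings yields the title. More than one may match, because several encodings agree on many characters.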

And then there were the special cases. For example, the Hepburn romanisation of Japanese uses a macron (overbar) to indicate a long vowel. But whoever put the track listing of Geinoh Yamashirogumi's Akira Symphonic Suite into whatever service I used (cddb? freedb? Just me transcribing it?) didn't have access to macrons, because they were using ISO-8859-1, so they encoded them as circumflexes instead - track 6 was filed as "Shômyô" rather than either "Shōmyō" or "Shoumyou". So I fixed that too.

Apostrophes were surprisingly annoying. I tend to use the plain old ASCII ones, but various people had used curly quotes ‘’ or the grave accent character ` – and each of those could be encoded in a variety of exciting ways.

Then it was time to do the actual renaming:

sed -e 's/^/"/' -e 's/$/"/' <a >aq
sed -e 's/^/"/' -e 's/$/"/' <b >bq
paste -d " " aq bq |sed -e 's/^/mv -i /' >v
sh v

which builds and executes a script to rename each file from its invalid name to the corrected one. (If any of the filenames had had shell metacharacters in them, such as `, this wouldn't have worked. Shellmetas are a pain.)
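A variant that should survive shell metacharacters, offered as a sketch and not what I actually ran: wrap each name in single quotes instead of double quotes, and rewrite any embedded single quote as '\''. The function names shell_quote and make_rename_script are invented for this example, and it still assumes one name per line with no embedded newlines.

```shell
# Wrap $1 in single quotes, rewriting each embedded ' as '\''.
# Inside single quotes the shell treats every other character literally.
shell_quote() {
    printf "'%s'" "$(printf '%s' "$1" | LC_ALL=C sed "s/'/'\\\\''/g")"
}

# Read old names from file $1 and new names from file $2 (same order,
# one per line) and print one quoted mv command per pair.
make_rename_script() {
    while IFS= read -r old <&3 && IFS= read -r new <&4; do
        printf 'mv -i -- %s %s\n' "$(shell_quote "$old")" "$(shell_quote "$new")"
    done 3<"$1" 4<"$2"
}
```

Usage would be make_rename_script a b >v followed by sh v, after reading through v first.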

(Yeah, this is the kind of thing that you can just do with a Unix command-line: say "here's an arbitrarily huge bunch of stuff; process it the way I want it processed and then act on the result". This paradigm just doesn't seem to happen in GUI-based computing, with a few rare exceptions. Just as we have factories to do the boring physical work so that humans don't have to, I feel that I have a computer to do the boring mental work so that I don't have to.)

Then it was time to increase the search depth by one level and do it again, while being careful not to double-convert in cases where a UTF-8-named directory contained UTF-8-named tracks.

I didn't keep track of the total number of invalidly-named files, though I'd guess it was something like a thousand. The total play time of all indexed files has increased by about five days - so about 0.3% of the total.

Another small perversion, though not actually invalid: Unicode that's been squashed down for transmission through an ASCII-only medium, so that character á would be shown as #U00e1. Well, that's easy to convert!


  1. Posted by John Dallman at 10:40am on 11 May 2019

    Yup, looks like a good method for doing that kind of number of files just once. Bigger file sets would imply more automation, but that would need more testing.
