RogerBW's Blog: Moving filenames to UTF-8

I've been renaming my large music collection into UTF-8.

Not re-tagging; I tag files when I notice a lack of tags, not as one big project. Just renaming them, because while working on my mpd tools I noticed that mpd won't index a file whose name isn't valid UTF-8. (Which makes my job as a tool-writer easier, because it means I can assume that any filename mpd gives me is valid UTF-8.)

Checking something like a million files for invalid characters was obviously not something I wanted to do by eye, but fortunately I'm running *ix so I have monstrously powerful tools at my disposal.

find /path/to/files -mindepth 1 -maxdepth 1 | split -l1

One depth level at a time, because I want to rename a directory before I rename the files in it. Each filename ends up in a separate file in my working directory (called xaa, xab, etc.).

for V in x*; do isutf8 -q "$V" && rm "$V"; done

For each file with a filename in it, see if it's valid UTF-8 (the testing program is in the Debian "moreutils" package, which I install everywhere because of the monstrously-useful parallel and vidir commands). If it is, delete it because we don't need to do anything about it. (In an ideal world I'd have bothered to parallelise this.)
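The same check can be sketched without one temporary file per name. This is not the method used above: it swaps isutf8 for iconv, relying on the fact that converting from UTF-8 to UTF-8 fails on invalid input. The function names `is_valid_utf8` and `list_invalid_names` are made up for illustration, and like the original pipeline it assumes no filename contains a newline.

```shell
# Hypothetical helper: succeed only if the argument is valid UTF-8.
# iconv exits non-zero when it meets a byte sequence it cannot decode.
is_valid_utf8() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Hypothetical helper: print the names directly under directory $1
# that are NOT valid UTF-8, one per line.
list_invalid_names() {
    find "$1" -mindepth 1 -maxdepth 1 | while IFS= read -r name; do
        is_valid_utf8 "$name" || printf '%s\n' "$name"
    done
}
```

Something like `list_invalid_names /path/to/files | LC_ALL=C sort >a` would then produce the list of names needing attention in one step.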

cat x* |sort >a; rm x*; cp -a a b

Put all those filenames into one big file, and make another copy of it.

I then looked through file b to see how the filenames had been encoded. (This had often been done blindly on the basis of what some CD lookup service returned.) Mostly they were ISO-8859-1/Windows-1252 (about two thirds of the total) or CP850 (about one third), but a couple of discs of Polish music seemed to be in Windows-1250 or something very like it (not ISO-8859-2, at any rate). I visually inspected the odd characters in the names in file b (if a name has bytes in the range 0x80-0x9f in it, it's not ISO-8859-anything, because those encodings have no printable characters there), using vi with LANG=C so that it wouldn't try to do any automatic character set conversions for me, and used iconv to convert them into something valid.
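To make the guessing less of an eyeball job, one could print a suspect name decoded under each of the likely encodings and pick whichever looks like real words. A minimal sketch, assuming an iconv that knows these four encoding names (glibc's does); `try_encodings` is a hypothetical helper, not something from the original workflow.

```shell
# Hypothetical helper: show how one raw filename decodes under each
# candidate encoding mentioned above.
try_encodings() {
    for enc in ISO-8859-1 WINDOWS-1252 CP850 WINDOWS-1250; do
        if out=$(printf '%s\n' "$1" | iconv -f "$enc" -t UTF-8 2>/dev/null); then
            printf '%s: %s\n' "$enc" "$out"
        else
            # Some byte values are undefined in some encodings.
            printf '%s: (cannot decode)\n' "$enc"
        fi
    done
}
```

For example, `try_encodings "$(sed -n 5p b)"` would show the fifth name in file b four ways; the byte 0xe9 comes out as é under ISO-8859-1 but as Ú under CP850, which is usually enough to tell them apart.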

And then there were the Japanese discs (anime soundtracks mostly). I resorted at times to running iconv with each of the character sets it can convert from (into UTF-8), putting the results into a series of files, and then grepping all of them with a known track name in the original Japanese characters to see if any of them was the right one. Even that didn't always work, and in a few cases I simply looked up the relevant CD and copied the track names from whatever site would give me a plausible-looking listing.
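The brute-force search described above might look something like this. It is a sketch only: the original tried every character set iconv offers, whereas this uses a short hard-coded list of common Japanese encodings (an assumption, chosen because the output format of `iconv -l` varies between implementations). `candidate_encodings` is a hypothetical name.

```shell
# Hypothetical helper: print each encoding under which file $1,
# converted to UTF-8, contains the known UTF-8 string $2.
candidate_encodings() {
    file=$1
    known=$2
    for enc in SHIFT_JIS CP932 EUC-JP ISO-2022-JP; do
        if iconv -f "$enc" -t UTF-8 "$file" 2>/dev/null | grep -qF -- "$known"; then
            printf '%s\n' "$enc"
        fi
    done
}
```

Called as `candidate_encodings b 'known track name'`, it lists every encoding that yields the expected text. More than one may match, since several of these encodings overlap heavily.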

And then there were the special cases. For example, the Hepburn romanisation of Japanese uses a macron (overbar) to indicate a long vowel. But whoever put the track listing of Geinoh Yamashirogumi's Akira Symphonic Suite into whatever service I used (cddb? freedb? Just me transcribing it?) didn't have access to macrons, because they were using ISO-8859-1, so they encoded them as circumflexes instead - track 6 was filed as "Shômyô" rather than either "Shōmyō" or "Shoumyou". So I fixed that too.

Apostrophes were surprisingly annoying. I tend to use the plain old ASCII ones, but various people had used curly quotes ‘’ or the grave accent character ` – and each of those could be encoded in a variety of exciting ways.
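Once the names are valid UTF-8, flattening the apostrophe variants is a simple substitution. A minimal sketch, assuming the names have already been converted so that the curly quotes appear in their UTF-8 form; `normalise_apostrophes` is a made-up name, and whether a grave accent really stands for an apostrophe in any given title is a judgment call the filter can't make.

```shell
# Hypothetical filter: replace curly single quotes and grave accents
# with a plain ASCII apostrophe, reading names on stdin.
normalise_apostrophes() {
    sed -e "s/‘/'/g" -e "s/’/'/g" -e "s/\`/'/g"
}
```

Run over the corrected list, for instance `normalise_apostrophes <b >b.new`, and compare the result with b before trusting it.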

Then it was time to do the actual renaming:

sed -e 's/^/"/' -e 's/$/"/' <a >aq
sed -e 's/^/"/' -e 's/$/"/' <b >bq
paste -d " " aq bq |sed -e 's/^/mv -i /' >v
sh v

which builds and executes a script to rename each file from its invalid name to the corrected one. (If any of the filenames had had shell metacharacters in them, such as `, this wouldn't have worked: inside double quotes the shell still interprets `, $, \ and the double quote itself. Shellmetas are a pain.)
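One way round the metacharacter problem is never to turn the names into shell source at all. The sketch below is an alternative to the generated script, not what was actually run: it reads the old and new lists in step on two spare file descriptors and hands each pair straight to mv. `rename_pairs` is a hypothetical name, and it still assumes one name per line with no embedded newlines.

```shell
# Hypothetical helper: rename line N of file $1 to line N of file $2.
# Names are passed to mv as data, so quotes, $ and ` are harmless.
rename_pairs() {
    while IFS= read -r old <&3 && IFS= read -r new <&4; do
        # Skip entries that turned out not to need changing.
        [ "$old" = "$new" ] && continue
        mv -i -- "$old" "$new"
    done 3<"$1" 4<"$2"
}
```

Here that would be `rename_pairs a b`. Using descriptors 3 and 4 leaves standard input free, so the prompt from `mv -i` still works if a target already exists.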

(Yeah, this is the kind of thing that you can just do with a Unix command-line: say "here's an arbitrarily huge bunch of stuff; process it the way I want it processed and then act on the result". This paradigm just doesn't seem to happen in GUI-based computing, with a few rare exceptions. Just as we have factories to do the boring physical work so that humans don't have to, I feel that I have a computer to do the boring mental work so that I don't have to.)

Then it was time to increase the search depth by one level and do it again, while being careful not to double-convert in cases where a UTF-8-named directory contained UTF-8-named tracks.

I didn't keep track of the total number of invalidly-named files, though I'd guess it was something like a thousand. The total play time of all indexed files has increased by about five days - so about 0.3% of the total.

Another small perversion, though not actually invalid: Unicode that's been squashed down for transmission through an ASCII-only medium, so that character á would be shown as #U00e1. Well, that's easy to convert!
