I've been renaming my large music collection into UTF-8.
Not re-tagging; I tag files when I notice a lack of tags, not as
one big project. Just renaming them, because while working on my mpd
tools I noticed that mpd won't index a file whose name isn't valid
UTF-8. (Which makes my job as a tool-writer easier, because it means
I can assume valid UTF-8 when working with filenames.)
Checking something like a million files for invalid characters was
obviously not something I wanted to do by eye, but fortunately I'm
running *ix so I have monstrously powerful tools at my disposal.
find /path/to/files -mindepth 1 -maxdepth 1 | split -l1
One depth level at a time, because I want to rename a directory before
I rename the files in it. Each filename ends up in a separate file in
my working directory (called xa, xb, etc.).
for V in x*; do isutf8 -q $V && rm $V; done
For each file with a filename in it, see if that name is valid UTF-8
(the testing program is in the Debian "moreutils" package, which I
install everywhere because of the monstrously-useful parallel and
vidir commands). If it is, delete it because we don't need to do
anything about it. (In an ideal world I'd have bothered to
parallelise this.)
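(If I had parallelised it, moreutils' own parallel would have been the
obvious tool; something like this, where the job count is pure
guesswork and -i substitutes {} for each argument:)
parallel -j 4 -i sh -c 'isutf8 -q {} && rm {}' -- x*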
cat x* |sort >a; rm x*; cp -a a b
Put all those filenames into one big file, and make another copy of
it.
I then looked through file b to see how the filenames had been
encoded. (This had often been done blindly on the basis of what some
CD lookup service returned.) Mostly they were ISO-8859-1/Windows-1252
(about two thirds of the total) or CP850 (about one third), but a
couple of discs of Polish music seemed to be in Windows-1250 or
something very like it (not ISO-8859-2, at any rate).
I visually inspected the odd characters in the names in file b (if a
name has bytes in the 0x80-0x9F range it's not ISO-8859-anything),
and used iconv to convert them into something valid. (Using vi with
LANG=C, so that it wouldn't try to do any automatic character set
conversions for me.)
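For a block of names that had all been mangled the same way, the
conversion itself is a one-liner; something like this for the
ISO-8859-1 batch (the CP850 batch just needs a different -f; the file
names here are only for illustration):
iconv -f ISO-8859-1 -t UTF-8 <names-latin1 >names-utf8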
And then there were the Japanese discs (anime soundtracks mostly). I
resorted at times to running
iconv with each of the
character sets it can convert from (into UTF-8), putting the results
into a series of files, and then grepping all of them with a known
track name in the original Japanese characters to see if any of them
was the right one. Even that didn't always work, and in a few cases I
simply looked up the relevant CD and copied the track names from
whatever site would give me a plausible-looking listing.
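The brute-force pass was roughly this shape; mystery here stands in
for a file holding one undecoded name, and KNOWN-TRACK-NAME for the
Japanese text I was searching for (iconv -l lists every encoding it
knows; on glibc the names come with trailing slashes, hence the tr):
for CS in $(iconv -l | tr -d /); do
  # try every encoding iconv knows, one candidate decoding per encoding
  iconv -f "$CS" -t UTF-8 mystery >"out.$CS" 2>/dev/null
done
# then see which candidate(s) contain a track name I know
grep -l 'KNOWN-TRACK-NAME' out.*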
And then there were the special cases. For example, the Hepburn
romanisation of Japanese uses a macron (overbar) to indicate a long
vowel. But whoever put the track listing of Geinoh Yamashirogumi's
Akira Symphonic Suite into whatever service I used (cddb? freedb?
Just me transcribing it?) didn't have access to macrons, because they
were using ISO-8859-1, so they encoded them as circumflexes instead -
track 6 was filed as "Shômyô" rather than either "Shōmyō" or
"Shoumyou". So I fixed that too.
Apostrophes were surprisingly annoying. I tend to use the plain old
ASCII ones, but various people had used curly quotes ‘’ or the grave
accent character ` – and each of those could be encoded in a variety
of exciting ways.
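Once decoded into valid UTF-8 they're at least easy to flatten to the
plain ASCII apostrophe, if that's what you want; a sketch with GNU
sed in a UTF-8 locale, run over the corrected-names file b:
sed -i -e "s/’/'/g" -e "s/‘/'/g" -e "s/\`/'/g" b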
Then it was time to do the actual renaming:
sed -e 's/^/"/' -e 's/$/"/' <a >aq
sed -e 's/^/"/' -e 's/$/"/' <b >bq
paste -d " " aq bq |sed -e 's/^/mv -i /' >v
sh v
which builds and executes a script to rename each file from its
invalid name to the corrected one. (If any of the filenames had had
shell metacharacters in them, such as `, this wouldn't have worked.
Shellmetas are a pain.)
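A metacharacter-safe variant would have been to skip the generated
script and read the two lists side by side instead, at the cost of
assuming bash and no tabs or newlines in the names; a sketch:
# read invalid name and corrected name as tab-separated pairs
paste a b | while IFS=$'\t' read -r FROM TO; do
  mv -i -- "$FROM" "$TO"
done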
(Yeah, this is the kind of thing that you can just do with a Unix
command-line: say "here's an arbitrarily huge bunch of stuff; process
it the way I want it processed and then act on the result". This
paradigm just doesn't seem to happen in GUI-based computing, with a
few rare exceptions. Just as we have factories to do the boring
physical work so that humans don't have to, I feel that I have a
computer to do the boring mental work so that I don't have to.)
Then it was time to increase the search depth by one level and do it
again. While being careful not to double-convert in cases where a
UTF-8-named directory contained UTF-8-named tracks.
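That is, the same pipeline again, this time starting from
find /path/to/files -mindepth 2 -maxdepth 2 | split -l1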
I didn't keep track of the total number of invalidly-named files,
though I'd guess it was something like a thousand. The total play time
of all indexed files has increased by about five days - so about 0.3%
of the total.
Another small perversion, though not actually invalid: Unicode that's
been squashed down for transmission through an ASCII-only medium, so
that the character á would be shown as #U00e1. Well, that's easy to
convert!
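A sketch of that conversion, assuming the escapes are always exactly
four hex digits and leaning on perl for the arithmetic (b and
b.unescaped are stand-in file names):
perl -CSD -pe 's/#U([0-9a-fA-F]{4})/chr(hex($1))/ge' <b >b.unescaped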