UTF-8 and Normalizing Latin Names from Amazon
I managed to avoid understanding anything related to UTF-8 until yesterday. I had a few hacks that allowed me to normalize song titles coming from Amazon so I could match them to the titles in other online libraries. For example:
“Danza de los Ñáñigos” would be normalized to “Danza de los Nanigos”.
My old hacks did not cover many of the Latin American characters with diacritics, which limited my ability to collect Spanish-speaking artists (Ferret crashes when faced with UTF-8 multibyte characters). Unfortunately I couldn’t find a library that allowed me to do what I wanted (though this Perl reference was a good explanation).
I ended up with an improved hack: I manually built a translation table for the extended Latin characters and use String#unpack('U*') to get the code points, translating U+00C0 through U+017E to the corresponding ASCII characters. Amazon Web Services also inserts multibyte control characters in some track titles, which I now just discard.
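The approach above can be sketched roughly as follows. This is a minimal illustration, not my actual code: the table name TRANSLITERATIONS and the method normalize_title are made up for the example, the table covers only a handful of Spanish characters rather than the full U+00C0 through U+017E range, and it assumes that anything outside ASCII without a table entry (including those control characters) can simply be dropped.

```ruby
# Hypothetical, partial translation table: Unicode code point => ASCII letter.
# A real table would need an entry for every character you care about
# in the U+00C0..U+017E range.
TRANSLITERATIONS = {
  0x00C1 => "A", 0x00C9 => "E", 0x00CD => "I", 0x00D1 => "N",
  0x00D3 => "O", 0x00DA => "U", 0x00DC => "U",
  0x00E1 => "a", 0x00E9 => "e", 0x00ED => "i", 0x00F1 => "n",
  0x00F3 => "o", 0x00FA => "u", 0x00FC => "u"
}.freeze

# Hypothetical helper: returns an ASCII-only version of a UTF-8 title.
def normalize_title(title)
  pieces = title.unpack("U*").map do |code_point|
    if code_point < 0x80
      [code_point].pack("U")                # plain ASCII passes through
    else
      TRANSLITERATIONS.fetch(code_point, "") # mapped, or discarded if unknown
    end
  end
  pieces.join
end

puts normalize_title("Danza de los \u00D1\u00E1\u00F1igos")
# prints "Danza de los Nanigos"
```

Dropping every unmapped character is the simplest policy, but it silently loses letters the table does not know about, so logging the discarded code points is probably worth it while the table is still growing.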
It seems that Rails 1.2 may have some UTF-8 libraries included - looking forward to that. Here are some other links that I found useful if you are trying to get a handle on UTF-8.
- Wikipedia article that demystified the representation and what each bit means. The variety of notations used in different articles (octal, hex, or decimal) is very confusing.
- Great dynamic Unicode/UTF Table
- This article, which has code and a discussion, helped me understand some of Ruby’s problems with Unicode/multibyte representations.
- Unicode Hacks - Rails support for Unicode - seems this is what will be included in 1.2. Read the documentation.
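Since the mix of octal, hex, and decimal was what confused me most, here is a small sketch that prints the same character in every notation, along with the bits of each UTF-8 byte. The method name describe_char is made up for this example.

```ruby
# Hypothetical helper: breaks a single character down into its code point
# and its UTF-8 bytes, with each byte also shown as 8 binary digits.
def describe_char(char)
  code_point = char.unpack("U*").first # Unicode code point as an integer
  bytes = char.unpack("C*")            # raw bytes of the UTF-8 encoding
  {
    code_point: code_point,
    bytes: bytes,
    binary: bytes.map { |b| "%08b" % b }
  }
end

info = describe_char("\u00D1") # capital N with tilde
cp = info[:code_point]
puts "code point: U+%04X, decimal %d, octal %o" % [cp, cp, cp]
info[:bytes].each do |b|
  puts "byte: hex %02X, decimal %d, octal %o, binary %08b" % [b, b, b, b]
end
```

For U+00D1 the two bytes are C3 and 91 in hex. In binary, the first byte starts with 110 (the lead byte of a two-byte sequence) and the second with 10 (a continuation byte); the remaining bits, joined together, spell out the code point.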
