Charset detection for fun and non-profit…

FreeDB [www.freedb.org] is a free online CD-information database that applications can query over the internet for disc and track titles. The database was built from user-submitted information, and each entry was stored in whatever default character set the submitting user happened to have. The problem arises when someone with a different default character set retrieves that entry and finds it completely scrambled.
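To make the scrambling concrete, here is a minimal sketch in Python. The disc title and the two character sets are assumptions picked purely for illustration; they are not taken from a real FreeDB entry.

```python
# Hypothetical disc title; the two encodings are chosen only for illustration.
original = "Café del Mar"

# The submitter's machine stores the title as windows-1252 bytes.
stored = original.encode("cp1252")

# A reader whose default is windows-1251 (Cyrillic) interprets the same bytes differently.
scrambled = stored.decode("cp1251")

print(original)   # what the submitter typed
print(scrambled)  # what the reader sees: the accented letter has become a Cyrillic one
```

The bytes never change; only the interpretation does, which is why the entry can still be repaired if the original character set can be worked out.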

The problem is not limited to FreeDB; it applies to any system where text in one character set meets another, internet email for example. Unicode was created for this reason: a single character set intended to encompass all existing written languages.

I wrote a simple application to process entries from the FreeDB database. It attempts to ‘detect’ the original character set of each entry, converts the text to Unicode (UTF-8) and then writes out the converted entry.

First it strips the FreeDB formatting from the entry, since that scaffolding is plain US-ASCII and would bias the character set detection.

- Any lines beginning with “#” are skipped
- Only the VALUE part of the NAME=VALUE format is kept
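The two rules above can be sketched roughly as follows. This is not the original tool, just an illustration in Python; the function name `extract_values`, the sample entry and the choice to drop empty values are all assumptions of mine. It works on raw bytes because the character set is not yet known at this stage.

```python
def extract_values(entry_bytes: bytes) -> bytes:
    """Return only the VALUE parts of a FreeDB-style entry, one per line."""
    values = []
    for line in entry_bytes.splitlines():
        if line.startswith(b"#"):
            continue  # rule 1: skip comment lines
        name, sep, value = line.partition(b"=")
        if sep and value:  # rule 2: keep VALUE only (empty values dropped, my assumption)
            values.append(value)
    return b"\n".join(values)


# Hypothetical sample entry; \xf6 is "ö" in windows-1252.
sample = (
    b"# xmcd\n"
    b"# Disc length: 2942 seconds\n"
    b"DISCID=940b7a0b\n"
    b"DTITLE=Bj\xf6rk / Debut\n"
    b"TTITLE0=Human Behaviour\n"
    b"EXTD=\n"
)

print(extract_values(sample))
```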

It then passes the remaining text through the character set detection provided by cpdetector.
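For readers who want a feel for the detect-then-convert step, here is a deliberately naive stand-in in Python. It does not use cpdetector and does not reproduce its heuristics: it simply tries a short, assumed list of candidate character sets in order and accepts the first one that decodes without error. The function name and the candidate list are hypothetical.

```python
from typing import Optional, Tuple

# Assumed candidate order: strictest first, permissive single-byte set last.
CANDIDATES = ("ascii", "utf-8", "cp1252")


def detect_and_convert(
    raw: bytes, candidates: Tuple[str, ...] = CANDIDATES
) -> Tuple[Optional[str], Optional[bytes]]:
    """Return (charset name, UTF-8 bytes), or (None, None) if nothing decodes."""
    for name in candidates:
        try:
            text = raw.decode(name)  # strict decoding raises on invalid bytes
        except UnicodeDecodeError:
            continue
        return name, text.encode("utf-8")
    return None, None


print(detect_and_convert(b"Debut"))
print(detect_and_convert(b"Bj\xf6rk"))
print(detect_and_convert(b"\x81"))
```

Trial decoding like this cannot tell single-byte character sets apart, since almost any byte sequence is valid in several of them, so a real detector needs statistical or language-based heuristics on top.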

Out of the 1,865,309 entries in the latest FreeDB database (freedb-complete-20051104.tar.bz2), cpdetector failed to ‘detect’ only 1,106. Below is a breakdown of the detected character sets.

  Character set      Entries
  UTF-8            1,007,658
  US-ASCII           444,896
  UTF-16LE            31,442
  UTF-16BE            14,899
  windows-1252       214,906
  GB18030             76,914
  Big5                56,379
  x-EUC-CN             6,778
  EUC-KR               5,823
  Shift_JIS            4,042
  x-EUC-TW               438
  EUC-JP                  28
  (unknown)            1,106
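
As a quick sanity check on the breakdown above, the counts can be totalled and the failure rate worked out. This is only arithmetic on the published figures, sketched in Python; the variable names are mine.

```python
# Counts copied from the breakdown above.
counts = {
    "UTF-8": 1007658,
    "US-ASCII": 444896,
    "UTF-16LE": 31442,
    "UTF-16BE": 14899,
    "windows-1252": 214906,
    "GB18030": 76914,
    "Big5": 56379,
    "x-EUC-CN": 6778,
    "EUC-KR": 5823,
    "Shift_JIS": 4042,
    "x-EUC-TW": 438,
    "EUC-JP": 28,
    "(unknown)": 1106,
}

total = sum(counts.values())
undetected_rate = counts["(unknown)"] / total

print(total)                     # matches the 1,865,309 entries quoted in the text
print(f"{undetected_rate:.2%}")  # share of entries with no detected character set
```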

The entries that failed turned out to be a mix of less common character sets such as ‘IBM-866’, ‘windows-1251’, ‘koi8-r’ and ‘x-mac-cyrillic’, which cpdetector can’t detect at the moment.

Over the next couple of posts I plan to release the tool to do the conversion as well as a fully converted copy of the database.

Related Links

- Unicode? Character Sets? UTF-what?
- SourceForge.net – cpdetector
- SourceForge.net – jchardet
- Mozilla – Charset Detectors

Unicode? Character Sets? UTF-what?

I was inspired to add this note after recent frustrations with how completely character sets are ignored in both commercial and open software.

I seriously recommend that everyone read the following. Less ISO-blahblahblah and more UTF-8, no excuses!

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
http://www.joelonsoftware.com/articles/Unicode.html