<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Weblog of Michael Cutler &#187; I18N</title>
	<atom:link href="http://blog.lobstertechnology.com/category/i18n/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.lobstertechnology.com</link>
	<description>"I felt a great disturbance in the Force, as if millions of peers suddenly cried out in terror and were suddenly silenced."</description>
	<lastBuildDate>Tue, 17 Oct 2006 14:40:43 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Charset detection for fun and non-profit…</title>
		<link>http://blog.lobstertechnology.com/2005/11/29/charset-detection-for-fun-and-non-profit/</link>
		<comments>http://blog.lobstertechnology.com/2005/11/29/charset-detection-for-fun-and-non-profit/#comments</comments>
		<pubDate>Tue, 29 Nov 2005 23:52:28 +0000</pubDate>
		<dc:creator>Michael Cutler</dc:creator>
				<category><![CDATA[I18N]]></category>

		<guid isPermaLink="false">http://blog.lobstertechnology.com/2005/11/29/charset-detection-for-fun-and-non-profit%e2%80%a6/</guid>
		<description><![CDATA[The problem of character sets is not limited to FreeDB; it applies to any system where text in one character set meets another, such as internet email. Unicode was created for this reason: a single character set intended to encompass all existing written languages.]]></description>
			<content:encoded><![CDATA[<p>FreeDB [<a href="http://www.freedb.org/">www.freedb.org</a>] is a free online CD-information database that applications can query over the internet for disc &#038; track titles. The database was built from user-submitted information, so each entry was stored in whatever default character set the submitting user's system happened to use. The problem arises when someone with a different default character set retrieves that CD information and finds it completely scrambled.</p>
<p>The problem of character sets is not limited to FreeDB; it applies to any system where text in one character set meets another, such as internet email. Unicode was created for this reason: a single character set intended to encompass all existing written languages.</p>
<p>I wrote a simple application to process entries from the FreeDB database; it attempts to ‘detect’ the original character set of each entry and then converts the entry to Unicode (UTF-8) before writing it out.</p>
<p>First it strips the FreeDB formatting from the entry, because that formatting is plain US-ASCII and would bias the character set detection.</p>
<p>- Any lines beginning with “#” are skipped<br />
- Only the VALUE part of the NAME=VALUE format is kept</p>
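<p>The two stripping rules above can be sketched as follows. This is a minimal Python illustration, not the original tool: the function name <code>strip_freedb_format</code> and the sample entry are made up for the example, and dropping lines that contain no “=” is an assumption. It works on raw bytes because the character set is not yet known at this stage.</p>

```python
def strip_freedb_format(entry_bytes):
    """Return only the VALUE bytes of an entry, one value per line.

    Hypothetical helper: operates on raw bytes because the entry's
    character set has not been detected yet.
    """
    values = []
    for line in entry_bytes.splitlines():
        # Rule 1: lines beginning with "#" are skipped.
        if line.startswith(b"#"):
            continue
        # Rule 2: keep only the VALUE part of NAME=VALUE,
        # splitting on the first "=" only.
        name, sep, value = line.partition(b"=")
        if sep:  # assumption: lines without "=" are dropped
            values.append(value)
    return b"\n".join(values)


# Made-up sample entry; the byte 0xF6 is "ö" in windows-1252.
sample = b"# xmcd\nDISCID=940aac0d\nDTITLE=Bj\xf6rk / Debut\nTTITLE0=Human Behaviour\n"
stripped = strip_freedb_format(sample)
print(stripped)
```

<p>Only the values reach the detector, so the ASCII field names and comments cannot pull the result towards US-ASCII.</p>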
<p>It then passes the remaining text through the character set detection provided by cpdetector.</p>
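<p>To illustrate the detect-then-convert step, here is a rough Python sketch. It does not reproduce cpdetector's detection logic; it swaps in a much cruder technique, trial decoding against a short list of candidate charsets. The function name <code>detect_and_convert</code> and the candidate list are assumptions made for the example.</p>

```python
def detect_and_convert(raw, candidates=("ascii", "utf-8", "cp1252")):
    """Return (charset_name, utf8_bytes), or (None, None) if nothing fits.

    Trial decoding: the first candidate that decodes the bytes without
    error is treated as the "detected" charset. This is a stand-in for a
    real detector, not a description of how cpdetector works.
    """
    for name in candidates:
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this charset
        # Re-encode the decoded text as UTF-8 for output.
        return name, text.encode("utf-8")
    return None, None


print(detect_and_convert(b"Bj\xf6rk"))
```

<p>The order of the candidates matters: stricter encodings go first, because a permissive single-byte charset such as windows-1252 accepts almost any byte sequence and would otherwise mask everything after it.</p>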
<p>Out of the 1,865,309 entries in the latest FreeDB database (freedb-complete-20051104.tar.bz2), cpdetector failed to ‘detect’ only 1,106. Below is a breakdown of the detected character sets.</p>
<pre>
UTF-8         1007658
US-ASCII       444896
UTF-16LE        31442
UTF-16BE        14899
windows-1252   214906
GB18030         76914
Big5            56379
x-EUC-CN         6778
EUC-KR           5823
Shift_JIS        4042
x-EUC-TW          438
EUC-JP             28
(unknown)        1106
</pre>
<p></p>
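<p>As a quick sanity check on the breakdown above, the short Python snippet below sums the counts and prints each character set's share. The numbers are copied from the listing; the variable names are just for illustration. The counts add up to the full 1,865,309 entries, UTF-8 accounts for roughly 54% of them, and the undetected entries for about 0.06%.</p>

```python
# Counts copied from the breakdown above.
counts = {
    "UTF-8": 1007658,
    "US-ASCII": 444896,
    "UTF-16LE": 31442,
    "UTF-16BE": 14899,
    "windows-1252": 214906,
    "GB18030": 76914,
    "Big5": 56379,
    "x-EUC-CN": 6778,
    "EUC-KR": 5823,
    "Shift_JIS": 4042,
    "x-EUC-TW": 438,
    "EUC-JP": 28,
    "(unknown)": 1106,
}

total = sum(counts.values())

# Print the largest groups first, with each group's percentage share.
for name, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:14}{n:9d}{100 * n / total:8.2f}%")

print(f"{'total':14}{total:9d}")
```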
<p>The entries that failed turned out to be a mix of less common character sets, such as the Cyrillic encodings ‘IBM-866’, ‘windows-1251’, ‘koi8-r’ &#038; ‘x-mac-cyrillic’, which cpdetector can’t detect at the moment.</p>
<p>Over the next couple of posts I plan to release the tool to do the conversion as well as a fully converted copy of the database.</p>
<p><strong>Related Links</strong></p>
<p>-	<a href="http://blog.lobstertechnology.com/2005/07/09/unicode-character-sets-utf-what/">Unicode? Character Sets? UTF-what?</a><br />
-	<a href="http://sourceforge.net/projects/cpdetector/">SourceForge.net – cpdetector</a><br />
-	<a href="http://sourceforge.net/projects/jchardet/">SourceForge.net – jchardet</a><br />
-	<a href="http://www.mozilla.org/projects/intl/chardet.html">Mozilla – Charset Detectors</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.lobstertechnology.com/2005/11/29/charset-detection-for-fun-and-non-profit/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Unicode? Character Sets? UTF-what?</title>
		<link>http://blog.lobstertechnology.com/2005/07/09/unicode-character-sets-utf-what/</link>
		<comments>http://blog.lobstertechnology.com/2005/07/09/unicode-character-sets-utf-what/#comments</comments>
		<pubDate>Fri, 08 Jul 2005 23:35:35 +0000</pubDate>
		<dc:creator>Michael Cutler</dc:creator>
				<category><![CDATA[I18N]]></category>
		<category><![CDATA[Uncategorised]]></category>

		<guid isPermaLink="false">http://blog.lobstertechnology.com/?p=18</guid>
		<description><![CDATA[Because the whole world only uses the letters A to Z, doesn't it?]]></description>
			<content:encoded><![CDATA[<p>I was inspired to write this note by recent frustration at the complete ignorance of character sets in both commercial &#038; open-source software...</p>
<p>I seriously recommend that everyone read the following: less ISO-blahblahblah and more UTF-8, no excuses!</p>
<p><em>The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)</em><br />
<a href='http://www.joelonsoftware.com/articles/Unicode.html'>http://www.joelonsoftware.com/articles/Unicode.html</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.lobstertechnology.com/2005/07/09/unicode-character-sets-utf-what/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
