Charset detection for fun and non-profit…

FreeDB [www.freedb.org] is a free online CD-information database that lets applications query the FreeDB server over the internet for disc and track titles. The database was built from user-submitted information, and each entry was submitted in whatever default character set the submitter's system used. The problem arises when someone with a different default character set retrieves an entry and finds it completely scrambled.
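
To make the scrambling concrete, here is a minimal sketch using only the Java standard library. The sample title and the class name `MojibakeDemo` are made up for illustration; they are not taken from FreeDB.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: the title below is a made-up example, not a FreeDB entry.
final class MojibakeDemo {

    // Encode text with one charset, then decode the same bytes with another.
    static String misread(String text, Charset writtenAs, Charset readAs) {
        byte[] raw = text.getBytes(writtenAs);
        return new String(raw, readAs);
    }

    public static void main(String[] args) {
        String title = "Caf\u00e9 del Mar"; // the e-acute is written as an escape
        Charset cp1252 = Charset.forName("windows-1252");

        // Submitted as UTF-8 but read by a client that assumes windows-1252.
        System.out.println(misread(title, StandardCharsets.UTF_8, cp1252));

        // Read back with the charset it was written in: the text survives.
        System.out.println(misread(title, cp1252, cp1252));
    }
}
```

The first line printed shows the accented letter replaced by two unrelated characters, because its two-byte UTF-8 encoding is read as two separate windows-1252 characters.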

The problem of character sets is not limited to FreeDB; it applies to any system where text in one character set meets another, internet email for example. Unicode was created for this reason: a single character set intended to encompass all existing written languages.

I wrote a simple application to process entries from the FreeDB database; it attempts to ‘detect’ each entry's original character set and then converts the entry to Unicode (UTF-8) before writing it out.

First, it strips the FreeDB formatting from the entry, since that formatting is plain US-ASCII and would bias the character set detection:

- Any lines beginning with “#” are skipped
- Only the VALUE part of the NAME=VALUE format is kept
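
The two rules above can be sketched as follows. This is not the tool's actual code: the class and method names are hypothetical, the sample entry is made up, and dropping lines that contain no "=" is an assumption. The sketch also works on already-decoded strings for readability, whereas a real tool would have to work on the raw bytes, since the character set is not yet known at this point.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the two stripping rules; class and method names are hypothetical.
final class FreedbStripper {

    // Returns only the VALUE parts of NAME=VALUE lines, dropping "#" comment lines.
    static List<String> values(List<String> entryLines) {
        List<String> kept = new ArrayList<>();
        for (String line : entryLines) {
            if (line.startsWith("#")) {
                continue; // rule 1: skip comment lines
            }
            int eq = line.indexOf('=');
            if (eq >= 0) {
                kept.add(line.substring(eq + 1)); // rule 2: keep VALUE only
            }
            // Assumption: lines without "=" carry no value and are dropped.
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample entry in the NAME=VALUE layout described above.
        List<String> sample = Arrays.asList(
            "# xmcd",
            "DTITLE=Some Artist / Some Album",
            "TTITLE0=First Track");
        for (String value : values(sample)) {
            System.out.println(value);
        }
    }
}
```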

It then passes the remaining text through the character set detection provided by cpdetector.
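
Once a detector has returned a charset name, the conversion to UTF-8 needs nothing beyond the standard library. The sketch below assumes the name arrives as a plain string; the class name `ToUtf8` and the sample bytes are illustrative, and the detection call itself is deliberately left out.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Sketch only: the charset name is assumed to come from a detector.
final class ToUtf8 {

    // Decode raw bytes using the detected charset, then re-encode as UTF-8.
    static byte[] convert(byte[] raw, String detectedCharset) {
        String text = new String(raw, Charset.forName(detectedCharset));
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xE9 is e-acute in windows-1252; UTF-8 encodes it as the two bytes C3 A9.
        byte[] raw = { 'C', 'a', 'f', (byte) 0xE9 };
        byte[] utf8 = convert(raw, "windows-1252");
        System.out.println(raw.length + " bytes in, " + utf8.length + " bytes out");
    }
}
```

A wrong detection does not fail loudly here: the bytes are simply decoded into the wrong characters, which is why the quality of the detection step matters so much.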

Out of the 1,865,309 entries in the latest FreeDB database (freedb-complete-20051104.tar.bz2) cpdetector only failed to ‘detect’ 1,106 of them. Below is a breakdown of the detected character sets.

CODE:
  UTF-8          1007658
  US-ASCII        444896
  UTF-16LE         31442
  UTF-16BE         14899
  windows-1252    214906
  GB18030          76914
  Big5             56379
  x-EUC-CN          6778
  EUC-KR            5823
  Shift_JIS         4042
  x-EUC-TW           438
  EUC-JP              28
  (unknown)         1106

The entries that failed turned out to be a mix of less common character sets such as ‘IBM-866’, ‘windows-1251’, ‘koi8-r’ and ‘x-mac-cyrillic’, which cpdetector can’t detect at the moment.

Over the next couple of posts I plan to release the tool to do the conversion as well as a fully converted copy of the database.

Related Links

- Unicode? Character Sets? UTF-what?
- SourceForge.net – cpdetector
- SourceForge.net – jchardet
- Mozilla – Charset Detectors

More thoughts on Comment Spam

After my previous post on the subject of comment spam, I decided to use the might of Google to see how many sites have fallen prey to comment spamming bots.

Given that the phrases “my parents didnt told me about it” and “think that will make relief” are pretty obscure, I was confident they would produce accurate results from phrase searches.

Below are approximate figures based on what Google tells me. Most results are mixed with ‘normal’ sites due to Google’s PageRank:

“think that will make relief” - 80,200 results
“my parents didnt told me about it” - 88,800 results
“thins that excited you at 14” - 87,200 results
“substances that cure you” - 82,400 results
“black girls on their mission” - 77,700 results

The sites spammed tend to be Blogs (Wordpress, MT etc.) and Forums which don’t require any user-signup.

Interestingly, the bots don’t appear to have any direct benefit to spammers. The links they post are mostly to ‘mainstream’ websites like www.apple.com etc.

However, I suppose if I were the spammer... I would probably use this as a method to locate all the sites with open-comment policies to abuse at a later date. ;)

Blocking Wordpress comment spammers by User-Agent

I have been plagued with automated comment spam lately. It is still at a level where it's manageable manually but..... I am lazy.

The comment message itself is nearly always along the lines of:

XML:
  Excellent! I enjoyed reading your material. think that will make relief: http://www.av.com ,
  <a href="http://www.adobe.com" rel="nofollow">substances that cure you</a> ,
  <a href="http://www.apple.com" rel="nofollow">my parents didnt told me about it</a>

It appears to be more of a test message; blogs that accept the comment will probably be hammered with real spam at a later date.

I use ModSecurity on my server and wondered if there was an easy way to filter out these requests before they even reach Wordpress. I dug out my access logs looking for the offending requests. The programs being used to post the comment spam appear to be quite simplistic, posting directly to "wp-comments-post.php":

CODE:
  blog.lobstertechnology.com 209.200.xxx.xxx - - [16/Oct/2005:04:36:20 +0100]
     "POST /wp-comments-post.php HTTP/1.1" 302 5
     "http://blog.lobstertechnology.com/wp-comments-post.php"
     "Jakarta Commons-HttpClient/3.0-rc3"

  blog.lobstertechnology.com 207.195.xxx.xxx - - [12/Nov/2005:09:57:15 +0000]
     "POST /wp-comments-post.php HTTP/1.1" 302 5
     "-"
     "Mozilla/4.78 (TuringOS; Turing Machine; 0.0)"

Others are a little more sophisticated and at least bother to change the default User-Agent:

CODE:
  blog.lobstertechnology.com 209.200.xxx.xxx - - [09/Nov/2005:12:24:23 +0000]
     "POST /wp-comments-post.php HTTP/1.1" 302 5
     "http://blog.lobstertechnology.com/wp-comments-post.php"
     "Mozilla/4.0"

I crafted a very simple ModSecurity filter to catch these. Although it is a little crude, it will only trigger when one of the listed User-Agents sends a request to "/wp-comments-post.php". Adjust as required:

XML:
  <IfModule mod_security.c>

     # Turn the filtering engine On or Off
     SecFilterEngine On

     ...

     # proof of concept Wordpress User-Agent filter
     <Location /wp-comments-post.php>
        SecFilterSelective HTTP_USER_AGENT "HttpClient"
        SecFilterSelective HTTP_USER_AGENT "Java"
        SecFilterSelective HTTP_USER_AGENT "TuringOS"
     </Location>

  </IfModule>
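
As a quick sanity check of which logged requests the three patterns would catch, the matching can be approximated in a few lines of Java. This is only a sketch: it uses plain, case-sensitive substring checks, whereas ModSecurity treats each pattern as a regular expression, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Approximation of the rule set using plain substring checks.
// ModSecurity itself treats each pattern as a regular expression.
final class UserAgentFilterCheck {

    static final String PROTECTED_PATH = "/wp-comments-post.php";
    static final List<String> PATTERNS = Arrays.asList("HttpClient", "Java", "TuringOS");

    // True when a request to the comment endpoint carries a listed User-Agent.
    static boolean wouldBlock(String path, String userAgent) {
        if (!PROTECTED_PATH.equals(path)) {
            return false; // the rules only apply inside the Location block
        }
        for (String pattern : PATTERNS) {
            if (userAgent.contains(pattern)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // User-Agent strings taken from the access log excerpts earlier in the post.
        String[] agents = {
            "Jakarta Commons-HttpClient/3.0-rc3",
            "Mozilla/4.78 (TuringOS; Turing Machine; 0.0)",
            "Mozilla/4.0"
        };
        for (String agent : agents) {
            System.out.println(wouldBlock(PROTECTED_PATH, agent) + "  " + agent);
        }
    }
}
```

Under this approximation the bare "Mozilla/4.0" agent is not matched by any of the three patterns, so catching it would need an additional rule.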

Related Links
ModSecurity - http://www.modsecurity.org/

Talking XMLRPC to a Wordpress instance

This is a short example demonstrating how to use the Apache XMLRPC library to talk to a running Wordpress instance.

Wordpress 1.5 supports various XMLRPC operations by default (the MT & Blogger APIs, for example). The two shown below, however, are "demo" operations that all Wordpress installations should support and that require no authentication.

To run this example you will need to include the dependencies (Apache XMLRPC & Jakarta COMMONS Codec). If you are lazy, here is my "xmlrpc-2.0-dep.jar" JAR containing both.

Code

JAVA:
  import java.util.Vector;

  import org.apache.xmlrpc.XmlRpcClient;

  public class XMLRPCTest {

     /**
      * Calls the two unauthenticated "demo" operations and prints each
      * result together with its Java type.
      */
     public static void main(String[] args) {

        try {

           Object result;

           /* You can specify an alternative target URL here */
           XmlRpcClient xmlrpc = new XmlRpcClient("http://blog.lobstertechnology.com/xmlrpc");

           /*
            * demo.sayHello operation, takes no parameters,
            * returns a java.lang.String
            */
           result = xmlrpc.execute("demo.sayHello", new Vector());
           System.out.println("demo.sayHello: " + result +
              " (" + result.getClass().getName() + ")");

           /*
            * demo.addTwoNumbers operation, returns a java.lang.Integer.
            * The two operands are sent as XML-RPC strings; the sample
            * output below shows the server returning their integer sum.
            */
           Vector params = new Vector();
           params.addElement("5");
           params.addElement("8");
           result = xmlrpc.execute("demo.addTwoNumbers", params);
           System.out.println("demo.addTwoNumbers: " + result +
              " (" + result.getClass().getName() + ")");

        } catch (Exception e) {
           e.printStackTrace();
        }

     }

  }

Example

CODE:
  demo.sayHello: Hello! (java.lang.String)
  demo.addTwoNumbers: 13 (java.lang.Integer)
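
For anyone curious what travels over the wire, an XML-RPC call is just a small XML document sent by HTTP POST. The sketch below builds such a document by hand for the demo.addTwoNumbers call, with both operands as strings to mirror the example. It is an illustration of the XML-RPC request format only: the class name is hypothetical, and the exact bytes the Apache XMLRPC library produces (whitespace, XML declaration, headers) may well differ.

```java
// Hand-built XML-RPC request body, for illustration only; the exact bytes
// produced by the Apache XMLRPC library may differ.
final class XmlRpcPayloadSketch {

    // Builds a methodCall document whose parameters are all sent as strings.
    static String methodCall(String method, String... params) {
        StringBuilder xml = new StringBuilder("<?xml version=\"1.0\"?>");
        xml.append("<methodCall><methodName>").append(method).append("</methodName><params>");
        for (String p : params) {
            xml.append("<param><value><string>")
               .append(escape(p))
               .append("</string></value></param>");
        }
        return xml.append("</params></methodCall>").toString();
    }

    // XML-RPC string values must not contain raw "&" or "<".
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;");
    }

    public static void main(String[] args) {
        System.out.println(methodCall("demo.addTwoNumbers", "5", "8"));
    }
}
```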

Related Links

XMLRPCTest.java - Java Source Code
xmlrpc-2.0-dep.jar - Dependencies
Apache XMLRPC - http://ws.apache.org/xmlrpc/
Jakarta COMMONS Codec - http://jakarta.apache.org/commons/codec/

Using MEncoder to convert a DVD to DivX

More for my own benefit, but here goes...

This is an example of using MEncoder (Windows version) to convert a DVD to DivX. The original movie was widescreen and is rescaled here to 720x408, which is close to the 16:9 aspect ratio (an exact 16:9 frame at this width would be 720x405). The video bitrate I used was 1024 kbit/s but you can tweak this as desired.

It may seem unusual for the first run to output to NUL (the Windows equivalent of /dev/null), but the first pass is really writing its information out to the file "divx2pass.log"; the second pass then writes the movie out.
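
The arithmetic behind the scale figure can be sketched as below. An exact 16:9 frame at width 720 is 405 pixels high, and 408 happens to be the nearest multiple of 8. Rounding to a multiple of 8 is an assumption made here for illustration, not something this post states as the reason, and the class name is hypothetical.

```java
// Arithmetic sketch; rounding to a multiple of 8 is an assumption made here
// for illustration, not something the post states.
final class ScaleHeight {

    // Height for the given width and aspect ratio, rounded to the nearest multiple.
    static int height(int width, int aspectW, int aspectH, int multiple) {
        double exact = (double) width * aspectH / aspectW;
        return (int) Math.round(exact / multiple) * multiple;
    }

    public static void main(String[] args) {
        System.out.println(height(720, 16, 9, 1)); // exact 16:9 height
        System.out.println(height(720, 16, 9, 8)); // nearest multiple of 8
    }
}
```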

CODE:
  mencoder -dvd-device D:\DVD\DVD_VIDEO dvd://1 -ovc lavc -lavcopts vcodec=mpeg4:vbitrate=1024:mbd=2:turbo:vpass=1 -oac mp3lame -lameopts vbr=3 -ffourcc DX50 -vf scale=720:408 -o NUL
  mencoder -dvd-device D:\DVD\DVD_VIDEO dvd://1 -ovc lavc -lavcopts vcodec=mpeg4:vbitrate=1024:mbd=2:turbo:vpass=2 -oac mp3lame -lameopts vbr=3 -ffourcc DX50 -vf scale=720:408 -o DVD_VIDEO.avi
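
The two command lines differ only in the vpass value and the output file. A small sketch that builds both argument lists from one template makes that explicit. Nothing is executed here; the class and method names are hypothetical, and all options are copied from the commands in this post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Builds the argument lists for both passes; nothing is executed here.
final class TwoPassArgs {

    // Same options for both passes; only vpass and the output file change.
    static List<String> pass(int vpass, String dvdDevice, String output) {
        return new ArrayList<>(Arrays.asList(
            "mencoder", "-dvd-device", dvdDevice, "dvd://1",
            "-ovc", "lavc",
            "-lavcopts", "vcodec=mpeg4:vbitrate=1024:mbd=2:turbo:vpass=" + vpass,
            "-oac", "mp3lame", "-lameopts", "vbr=3",
            "-ffourcc", "DX50",
            "-vf", "scale=720:408",
            "-o", output));
    }

    public static void main(String[] args) {
        String device = "D:\\DVD\\DVD_VIDEO";
        System.out.println(String.join(" ", pass(1, device, "NUL")));
        System.out.println(String.join(" ", pass(2, device, "DVD_VIDEO.avi")));
    }
}
```

The resulting lists are in the form java.lang.ProcessBuilder accepts, should you want to run the passes from Java rather than from a batch file.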

Related Links
MPlayer (provides MEncoder) - http://www.mplayerhq.hu/homepage/
MEncoder Introduction Guide

MediaCodeSpeedEdit tool for DVD-Writers by ala42

Stumbled across this when trying to find out why my 16x DVD media wouldn't burn at anything higher than 4x.

Download your drive's latest firmware, feed it into MediaCodeSpeedEdit and you can edit the burn speeds for all media the drive can recognise.

Save the modified firmware and re-flash your drive with it. Pretty neat!

My only gripe is that the way you do it seems a little odd from the user-interface point of view. You select the media code of your blank discs by name, then double-click it to replace its burn speeds with the speeds of another media code. But hey..... it works!

Related Links

MediaCodeSpeedEdit - http://ala42.cdfreaks.com/MCSE/

[Oracle Toplink] Importing a 10.1.3DP3 project into the 10.1.3DP4 workbench doesnt work…