Upgrading to Wordpress 2.0

Keen to try out my own plugins in Wordpress 2.0, and swayed by this post, I took the plunge and installed the nightly Wordpress 2.0 [20051219] build.

I manage the entire installation of my Wordpress blog in CVS. Although I have several customisations to the core Wordpress code, they merged into the 2.0 code without problems. During the upgrade I did get some warnings that the ‘wp_usermeta’ table didn’t exist, because I hadn’t yet run /wp-admin/upgrade.php, but that was no big deal. Apart from that the whole process was very straightforward!

All the plugins I use appear to work perfectly: Weighted Words, iG:Syntax Hiliter, Spelling Checker and my own.

My first opinions…

Although the fancy new rich-text editor looks great, I am still going to use the original code version as I am so used to it and prefer writing my own XHTML. I am not so sure I like the new admin colour scheme yet either; I have grown quite attached to the previous grey one.

I am extremely impressed that Wordpress has changed hugely ‘under-the-hood’ seemingly without breaking any of the existing interfaces and therefore plugins.

Version 2.0 boasts a lot of new features and the Trac Timeline is a hive of activity. I will probably keep running on the latest nightly build until the final release.

More thoughts on SpamKit…

After my recent release of the SpamKit plugin I have been contemplating the whole spam problem in much greater depth. Gerry highlighted one major problem with the plugin itself – trackbacks are considered spam because they don’t include the time-based token. I started to look into amending my plugin to support them when I realised that this would be a serious loophole.

As far as the SpamKit plugin is concerned, trackbacks & pingbacks look like new comments without any time-based token. Thankfully Wordpress passes a ‘comment_type’ flag of ‘trackback’ or ‘pingback’ in these cases, and there is also a separate action of ‘trackback_post’ which the plugin can hook into. But what is stopping a particularly clever spammer from sending trackback / pingback spam? Nothing…
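
A minimal sketch of that loophole, assuming the comment data arrives as an array with a ‘comment_type’ key as described above. The function name spamkit_requires_token is hypothetical and is not part of SpamKit or Wordpress.

```php
<?php
// Hypothetical helper: decide whether a submission must carry a time-based token.
// Assumption: $commentdata has a 'comment_type' key that is 'trackback' or
// 'pingback' for remote pings and empty (or absent) for normal comments.
function spamkit_requires_token(array $commentdata): bool
{
    $type = isset($commentdata['comment_type']) ? $commentdata['comment_type'] : '';

    // Trackbacks and pingbacks never see the comment form, so they cannot
    // carry a token. Exempting them here is exactly the loophole: anything
    // that claims to be a trackback skips the token check entirely.
    return !in_array($type, array('trackback', 'pingback'), true);
}
```

Anything returning false here would have to be judged by some other means, which is the whole problem.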

So, slightly disheartened by this particular problem, I went back to the mental drawing board. The SpamKit plugin is great for a very specific problem, but it’s certainly not the grand unified solution I am looking for. All it would take is for a smarter spammer to pull the comment form with my time-based token and then send it along with the spam message for instant approval. It’s worryingly easy to do; I did it in about 10 lines of PHP code – no joke.

The spam problem will persist as long as we try to ‘filter’ out automatic spam clients from humans; it is simply the wrong approach. Eventually spam clients will become so sophisticated that we can’t tell the difference anyway, and then all the filters and tricks in the book won’t help you. I have already started to see distinct differences between the spam clients hitting my blog daily, a trend suggesting that more sophisticated clients are being developed all the time.

The only way you will truly eradicate spam is to validate the content of the message or the content of the web page being linked; everything else is just a bonus.

The way I see it, there are two solutions that will blow Spam out of the water forever:

Smarter Application

An application that verifies everything based on in-built rules, configuration and past experiences.

* validating that the comment is relevant to the post
* validating that any associated info [URL, email] is valid
* validating that the trackback/pingback came from somewhere that exists
* validating the content behind links in the comment: is it relevant? Are they selling Viagra?
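
As a rough illustration of the cheapest item on that list, here is a sketch that checks only whether the email and URL are well formed. It does not check that they exist or are relevant, which is where the real cost lies. The function name comment_metadata_problems is hypothetical.

```php
<?php
// Hypothetical helper: return the names of the fields that fail a purely
// syntactic check. No network lookups are made.
function comment_metadata_problems(string $email, string $url): array
{
    $problems = array();

    if (filter_var($email, FILTER_VALIDATE_EMAIL) === false) {
        $problems[] = 'email';
    }

    // Assumption: an empty URL is allowed, a non-empty one must be http(s).
    if ($url !== '') {
        $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
        $wellFormed = filter_var($url, FILTER_VALIDATE_URL) !== false;
        if (!$wellFormed || !in_array($scheme, array('http', 'https'), true)) {
            $problems[] = 'url';
        }
    }

    return $problems;
}
```

An empty array means nothing looked wrong syntactically; it says nothing about whether the commenter is a spammer.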

All this is great but it goes far beyond what a humble PHP application should be doing. I dread to think what server load this would produce on even a small-scale blog – not to mention a multi-blog-site or a shared web server.

Dumb Application made smarter through On-line Collaboration

This basically describes Akismet – where a central system has all the complex logic and the client-application receives a simple yes/no decision.

Personally, I would do this asynchronously because the client application [Wordpress for example] doesn’t really need an immediate decision. Comments can easily wait in a queue until a later XML-RPC callback delivers the verdict. This would greatly reduce the load on the system and allow far more complex algorithms and lookups to take place, improving the overall accuracy of the decisions dispensed.
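
A minimal sketch of that asynchronous idea, using an in-memory queue. This is not how Akismet works; the class and method names are hypothetical, and a real plugin would persist the comments in the database and receive the verdict over the network.

```php
<?php
// Hypothetical moderation queue: comments wait as 'pending' until a later
// callback from the central service supplies the decision.
class PendingCommentQueue
{
    private $statuses = array();

    // Called when the comment is submitted; no decision is needed yet.
    public function enqueue(int $commentId): void
    {
        $this->statuses[$commentId] = 'pending';
    }

    // Called later, e.g. from a callback handler, once the central service
    // has decided. Returns false for unknown or already decided comments.
    public function applyVerdict(int $commentId, bool $isSpam): bool
    {
        if (!isset($this->statuses[$commentId]) || $this->statuses[$commentId] !== 'pending') {
            return false;
        }
        $this->statuses[$commentId] = $isSpam ? 'spam' : 'approved';
        return true;
    }

    public function status(int $commentId): ?string
    {
        return isset($this->statuses[$commentId]) ? $this->statuses[$commentId] : null;
    }
}
```

Refusing a second verdict for the same comment keeps a late or replayed callback from overturning an earlier decision.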

Other widely used anti-spam solutions also look promising. For example, the ‘roadmap’ for Spam Karma SK2.2 has some really good ideas, my favourites being the honeypot and the anti-PageRank idea.

If you have read my previous posts you’ll have seen how wide the spam problem already is. Although I am very proud of my little SpamKit plugin because it does exactly what I wanted, I am quite frustrated with its limitations.

The whole spam problem is very interesting to me, and it’s not going to go away any time soon. It has existed since 1978 and has been evolving ever since…

Horde 3.0.8 appears to be broken

Horde is the application framework used by IMP, the web-based email client I use to read my email. From the Horde site [www.horde.org]:

The Horde Project is about creating high quality Open Source applications, based on PHP and the Horde Framework.

The guiding principles of the Horde Project are to create solid standards-based applications using intelligent object oriented design that, wherever possible, are designed to run on a wide range of platforms and backends.
There is great emphasis on making Horde as friendly to non-English speakers as possible. The Horde Framework currently supports many localization features such as Unicode and right-to-left text and generous users have contributed many translations for the framework and applications.

Today I downloaded and attempted to install Horde 3.0.8 – released on Sunday 11th December 2005 – but something appears to be wrong, as I didn’t get very far. I followed all the given instructions; my server is configured correctly and all the dependencies are installed and working. I got as far as the web-based setup / configuration screen, but it didn’t allow me to save any settings or complete the setup process.

Following the instructions to the letter, I went to the ‘Authentication’ tab and selected ‘IMAP Authentication’. The page reloaded but didn’t reflect my choice from the ‘authentication backend’ drop-down list. Instead it wouldn’t display anything other than ‘Let a Horde application handle authentication’, and without the additional drop-down to select the application to use.

I initially suspected some JavaScript incompatibility as I normally use Firefox and sooo many applications are written against Internet Explorer. But after several attempts from different browsers & platforms I gave up on the authentication tab, opting to at least try the ‘Database’ tab and configure MySQL. I could easily fill out all the details, but when I tried to ‘Generate Horde Configuration’ it threw me back to the ‘General’ tab, complaining that I had not completed required fields to do with error reporting & URL generation, even though both were set to valid values.

I re-read the documentation and re-did the whole installation… just in case I had missed something or been too eager to lock down permissions. Again, exactly the same problem. Next, I relaxed ALL the permissions possible; I basically chmod’d the whole thing to 777 in case the setup wasn’t able to write to the config directory, but this didn’t help either.

The FAQ didn’t provide much help so I went to the IRC channel #horde @ irc.freenode.net and found others with exactly the same problem *holds back the tears of frustration* … But unfortunately no-one seemed to have any immediate answers.

On a hunch I grabbed Horde 3.0.7 from the FTP site and went through the whole setup process again. However this time it worked as expected and was running within ten minutes!!

Argh… Next step is to diff the code and see where it went wrong… (stay tuned)

Update – This issue was fixed in version 3.0.9 which is now available from www.horde.org

WP Plugin » SpamKit Plugin 0.0 – Time-Based-Tokens to Fight Spam

This is the first release and prototype of SpamKit for Wordpress.

SpamKit was written by Gerard Calderhead; it’s a PHP library that uses secure time-based tokens to help validate form posts and can be used on guestbooks, blogs, form-mailers etc.

It does this by generating a checksum’d and encrypted ‘token’ containing the UNIX timestamp from when it was generated. This ‘token’ is written out into the form as a hidden field. When the form is posted back to the server, the token’s value is validated. If it is invalid or has been tampered with, validation automatically fails; if the token has ‘expired’, it also fails.
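
To make the idea concrete, here is a minimal sketch of a time-based token. It is not SpamKit’s actual format: it signs the timestamp with an HMAC instead of checksumming and encrypting it, and the function names, the secret and the maximum age are all assumptions made for illustration.

```php
<?php
// Sketch only: "<timestamp>:<hmac>" rather than SpamKit's real token layout.
function make_time_token(string $secret, int $now): string
{
    return $now . ':' . hash_hmac('sha256', (string) $now, $secret);
}

// Returns true only if the token is well formed, untampered and not expired.
function time_token_is_valid(string $token, string $secret, int $now, int $maxAge): bool
{
    $parts = explode(':', $token, 2);
    if (count($parts) !== 2 || preg_match('/^[0-9]+\z/', $parts[0]) !== 1) {
        return false; // missing or malformed
    }

    // Recompute the signature over the claimed timestamp and compare in
    // constant time; any change to either half makes this fail.
    $expected = hash_hmac('sha256', $parts[0], $secret);
    if (!hash_equals($expected, $parts[1])) {
        return false; // tampered with
    }

    $age = $now - (int) $parts[0];
    return $age >= 0 && $age <= $maxAge; // reject future-dated and expired tokens
}
```

The token would be written into the form with make_time_token($secret, time()) and checked on post with time_token_is_valid($token, $secret, time(), 600). Note that a token fetched moments ago by an automated client validates just as well as one fetched by a human.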

I took SpamKit and plugged it into Wordpress to do the following:

- When a comment form is drawn, a time-based-token is generated and inserted as a hidden field in the form.
- Where the comment would normally be approved, SpamKit is used to validate the token; if it is corrupt, missing or expired, the comment is flagged as ‘spam’, preventing any email notification of the comment being posted.
- After the comment has been saved (as ‘spam’) by Wordpress the plugin changes the comment’s status to ‘Awaiting Moderation’ to allow the moderator to delete it at a later date.

The end result is that comment spam sits in the ‘Awaiting Moderation’ list without generating any email to say so.

The third step may not be what everyone wants from the plugin but, this being a prototype, there are no option pages to control it as yet.

The SpamKit plugin has been tested on Wordpress 1.5 only and found to operate as expected on even the most liberal configurations.

Installation is simple and there are no configuration options that require changing: simply copy it into the plugins directory and activate it from the administration screen.

Download: spamkit-plugin.zip

Comments, Questions and Feedback welcomed!

Updated [3rd January 2006] – Download link points to wp-plugins.org

‘NASA Search 1.0’ ??? Something Google should worry about ???

Having written my own Wordpress logging / statistics plug-in over the weekend – which is still a prototype, consider it ‘coming soon’ – I have started to notice more and more peculiar User-Agents visiting my blog.

I quite like to keep an eye on which spiders / bots visit my sites and how often they return, and to try to infer something about how they were designed by watching them visit.

I was surprised recently to see that the big three ( Yahoo!, MSN & Google ) actually pull RSS feeds as well as HTML pages. Of course this makes sense from an efficiency & bandwidth point of view: the RSS feed is the interesting stuff, already stripped of everything else.

Today’s one is a real winner though, coming from the following net block and advertising itself as “NASA Search 1.0”.

CODE:
  Comcast Cable Communications, Inc. NJ-SOUTH-4 (NET-68-46-128-0-1)
  68.46.128.0 - 68.46.191.255

The bot / spider crawled my entire site within a few minutes, starting from my ‘changes-in-wordpress-152’ post and was completely oblivious to my robots.txt (it didn’t even request it).
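
For spotting return visits from that net block in a log, a range check along these lines would do. This is a sketch: the function name ip_in_range is hypothetical, it handles IPv4 only, and it assumes a 64-bit PHP build where ip2long() never returns a negative number.

```php
<?php
// Hypothetical helper: is $ip between $first and $last inclusive?
// Assumption: IPv4 addresses on a 64-bit PHP build.
function ip_in_range(string $ip, string $first, string $last): bool
{
    $n  = ip2long($ip);
    $lo = ip2long($first);
    $hi = ip2long($last);

    // ip2long() returns false for anything that is not a valid IPv4 address.
    if ($n === false || $lo === false || $hi === false) {
        return false;
    }

    return $n >= $lo && $n <= $hi;
}
```

For example, ip_in_range($_SERVER['REMOTE_ADDR'], '68.46.128.0', '68.46.191.255') would flag any visitor from the block listed above.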

Also, it appeared to be quite a primitive HTTP client, providing no referrer information or any of the usual headers such as “Connection: close” or “Accept: */*”, even though it was sending an “HTTP/1.1” request. Surprisingly though, it did persist a session cookie for the duration of its visit.
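
A logging plug-in could record that kind of primitiveness with something like the sketch below. The function name and the choice of headers are assumptions, and a missing header is only a hint: plenty of legitimate first visits arrive without a referrer.

```php
<?php
// Hypothetical helper: list which of the "usual" request headers are absent
// or empty. $headers maps header names to values; names are matched
// case-insensitively, as HTTP header names are.
function missing_usual_headers(array $headers): array
{
    $present = array_change_key_case($headers, CASE_LOWER);
    $missing = array();

    // 'referer' is the header's (misspelt) name on the wire.
    foreach (array('accept', 'connection', 'referer') as $name) {
        if (!isset($present[$name]) || trim((string) $present[$name]) === '') {
            $missing[] = $name;
        }
    }

    return $missing;
}
```

A client that sends nothing but a User-Agent would come back with all three names listed.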

I Google’d for the phrase “NASA Search 1.0” and only seemed to find results where auto-generated-stats pages list visiting User-Agents.

It would be quite interesting (and maybe even fun – in a very geeky way) to write a Wordpress plug-in that watches for these peculiar bots and pings their details to a centralised stats database – forming a sort of spider-spider.

Anyway, I will be keeping a keen eye out for the return of “NASA Search 1.0” … Could it be the next great NASA-funded project? Or is it just some smart a** who has figured out how to change the User-Agent string in his favourite spider/bot?

Stay tuned!

Wordpress Hack » Inserting rel nofollow on links in all categories…

I wanted to add rel='external nofollow' to every Link in every Category that didn't have an explicit rel attribute defined. I looked through the Wordpress Codex but found no Plugin Hooks to do it, so I broke out the source code.

Somewhere down the chain the link markup "<a ..." is generated in the function get_links() found in /wp-includes/links.php (approx line 142). I added the two lines of code shown below and it works a treat.

Before

PHP:
  $rel = $row->link_rel;

  if ($rel != '') {
      $rel = " rel='$rel'";
  }

After

PHP:
  $rel = $row->link_rel;

  if ($rel != '') {
      $rel = " rel='$rel'";
  } else {
      $rel = " rel='external nofollow'";
  }
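
The same logic can be pulled out into a standalone function, which makes it easy to try outside Wordpress. This is a sketch: the function name build_rel_attribute is hypothetical, and the escaping is an addition that the two-line hack above does not have.

```php
<?php
// Hypothetical helper mirroring the hack above: use the link's own rel value
// if it has one, otherwise fall back to a default.
function build_rel_attribute(string $linkRel, string $default = 'external nofollow'): string
{
    $rel = ($linkRel !== '') ? $linkRel : $default;

    // Escaping is an addition to the original hack; it stops a stray quote
    // in the stored value from breaking out of the attribute.
    return " rel='" . htmlspecialchars($rel, ENT_QUOTES) . "'";
}
```

Calling build_rel_attribute('') gives the nofollow default, while a link with its own rel value keeps it.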