WP Plugin » SpamKit Plugin 0.1 – Time-Based-Tokens to Fight Spam

This is a minor release of SpamKit Plugin to address an easy-to-fix problem where Trackbacks from the same blog or server get treated as spam because they don’t include the time-based token. It is checked into Subversion over at WP-Plugins.org and you can download the new version here: spamkit-plugin.php.

Changelog:

Added a check in spam_action_pre_comment_approved to compare the REMOTE_ADDR with the SERVER_ADDR; if they match, the comment is allowed to bypass time-based token checking. However, this could be abused if another web application on the server is exploited, allowing an attacker to post comments that appear to come from this server. Then again, if someone goes to all the effort of exploiting a web application, comment-spam is probably low on their priorities.
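A minimal sketch of that same-server check follows. The function name is hypothetical and this is not the plugin's actual code; it only illustrates comparing the two addresses, and it refuses to treat two missing values as a match.

```php
<?php
// Hypothetical helper (not taken from the plugin): returns true when the
// request appears to come from the web server's own address.
function spamkit_is_local_request(array $server): bool
{
    $remote = isset($server['REMOTE_ADDR']) ? $server['REMOTE_ADDR'] : '';
    $local  = isset($server['SERVER_ADDR']) ? $server['SERVER_ADDR'] : '';

    // Both values must be present; two empty strings must not count as a match.
    return $remote !== '' && $remote === $local;
}
```

Inside the approval hook this would be called as spamkit_is_local_request($_SERVER), skipping the token check only when it returns true.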

Blogs are fundamentally flawed for the typical Grandma-User

It may seem a little sad but I can honestly say that reading my access_log is far more interesting than any soap opera on TV; it is filled with exotic foreigners, futuristic robots, drama, intrigue and personal tragedy. The best thing about it is that it’s all real; these are (mostly) real people who stumble across your humble Blog in the hope of finding the solution to their problems.

Over the Christmas period I have observed more people visiting an ancient post of mine than in the past six months. The post is about my experiences with an external hard drive enclosure; more accurately, the chip / controller that a great many hard drive enclosures use. Based on this I would guess that a considerable number of people got hard drive enclosures for Christmas and ran into the same problems I had. Anyway, I am wandering a little.

It was reading my access_logs that made me realise that Blogs are actually a really bad format for the Grandma user…

Picture this: imagine your Grandma is Googling and happens to get a result that points to your Blog. She sees a teaser in the search results that shows you’ve written something about what she is looking for. Grandma clicks your link and is presented with your last 10 posts about God knows what and no sign of the post she saw the snippet of. Grandma goes back to Google thoroughly disappointed, never to return…

I encounter this phenomenon frequently, but because I am familiar with the Blog format I think nothing of drilling down to the relevant category to find the post I wanted or, if I am lazy, clicking Google’s cached copy of the page. However, for the average internet surfer it presents a fundamental flaw in the usability of the Blog format.

The problem is quite simple; search engines can never be up to date with your content all the time. The more frequently you post, the more often the problem will occur and the harder the post will be to find. The way I see it there are two possible solutions.

Smarter search engines

Enhancing search engines so they can distinguish between an index-page of posts and individual posts. This could be done by identifying sections of text within a page as an extract from another URL using something like RDF [http://www.w3.org/RDF/] which can already be embedded within XHTML [http://internetalchemy.org/2005/10/introducing-embedded-rdf]. Enclosing the section of text between the ‘<rdf:RDF>’ tags would do the trick.

In the Blog format the index pages and category pages would all contain embedded RDF indicating that the enclosed section of text is actually from another URL – its permalink. However this idea is not just limited to the Blog format, it has huge potential for most modern website formats.

This wouldn’t be a trivial change for search engines to make; it would be time-consuming and therefore costly, but I believe it would be worthwhile for the future of internet content.

Smarter websites

A more short-term solution I am looking at is improving my website [i.e. Wordpress] to detect that the visitor has come from a search engine, try and determine the query they used from the ‘Referer’ HTTP header, then find and present the best matches to that query before any other posts are displayed.

Obviously this method has quite a few shortfalls:

* The ‘Referer’ header may not be there (some people disable it within the browser or through third-party software)
* Although handling the query formats of the main players is quite easy, not all search engines can be catered for
* It requires an intensive search of all the site’s posts, the standard Wordpress search won’t cut it
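The first step, pulling the visitor's query out of the ‘Referer’ header, could be sketched roughly as below. The function name and the small host-to-parameter table are assumptions for illustration only; real engines use more URL formats than this and change them over time, and the substring match on the host is deliberately loose.

```php
<?php
// Hypothetical helper: extract the search query from a Referer URL.
// Returns null when the header is missing, unparseable or not from a
// search engine listed in the (illustrative) table below.
function search_query_from_referer(string $referer): ?string
{
    // Assumed mapping of host fragments to the query-string parameter
    // that carries the search terms.
    $engines = array(
        'google.'       => 'q',
        'search.yahoo.' => 'p',
        'search.msn.'   => 'q',
    );

    $parts = parse_url($referer);
    if ($parts === false || empty($parts['host']) || empty($parts['query'])) {
        return null;
    }

    $host = strtolower($parts['host']);
    foreach ($engines as $fragment => $param) {
        if (strpos($host, $fragment) === false) {
            continue;
        }
        // parse_str() URL-decodes the values, so "a+b" becomes "a b".
        parse_str($parts['query'], $args);
        if (isset($args[$param]) && is_string($args[$param])) {
            $query = trim($args[$param]);
            return $query === '' ? null : $query;
        }
    }

    return null;
}
```

The result would then be fed into whatever post search the site uses; a null return simply means the normal index page is shown.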

I contemplated getting the site to pull a copy of the URL given in the ‘Referer’ header, scan for the result that led the visitor to your site then locate the correct post given the snippet text… Then I decided that was a reeeeeeaaally bad idea.

In the long-run I believe the content and therefore the search engines that index it have to improve to cater for the format of internet content today and I think embedded RDF might be the key; unfortunately this cannot happen overnight.

In the meantime making smarter websites will help the situation until the content and the search engines catch up.

SpamKit Plugin moved to WP-Plugins.org

This morning I moved SpamKit Plugin over to WP-Plugins.org.

You can always download the latest version directly from here.

Upgrading to Wordpress 2.0

Keen to try out my own PlugIns in Wordpress 2.0 and swayed by this post I took the plunge and installed the nightly Wordpress 2.0 [20051219] build.

I manage the entire installation of my Wordpress blog in CVS; although I have several customisations to the core Wordpress code, I merged them seamlessly into the 2.0 code without problems. During the upgrade process I did get some warnings that the ‘wp_usermeta’ table didn’t exist, because I hadn’t run /wp-admin/upgrade.php yet, which was no big deal. Apart from that the whole process was very straightforward!

All the plugins I use appear to work perfectly: Weighted Words, iG:Syntax Hiliter, Spelling Checker & my own

My first opinions…

Although the new fancy rich-text editor looks great, I am still going to use the original code-version as I am so used to it and prefer writing my own xhtml. I am not so sure if I like the new admin colour scheme yet either, I have grown quite attached to the previous grey one.

I am extremely impressed that Wordpress has changed hugely ‘under-the-hood’ seemingly without breaking any of the existing interfaces and therefore plugins.

Version 2.0 boasts a lot of new features and the Trac Timeline is a hive of activity. I will probably keep running on the latest nightly build until the final release.

More thoughts on SpamKit…

After my recent release of SpamKit Plugin I have been contemplating the whole spam problem in much greater depth. Gerry highlighted one major problem with my SpamKit plugin itself – trackbacks are considered spam because they don’t include the time-based token. I started to look into amending my plugin to support them when I realised that this would be a serious loophole.

As far as the SpamKit plugin is concerned it would see trackbacks & pingbacks as new comments, without any time-based token. Thankfully Wordpress passes a ‘comment_type’ flag of ‘trackback’ or ‘pingback’ in these cases, and there is also a separate action of ‘trackback_post’ which the plugin can hook into. But what is stopping a particularly clever spammer from doing trackback / pingback spam? Nothing…

So, slightly disheartened by this particular problem I went back to the mental drawing-board. SpamKit plugin is great for a very specific problem, but it’s certainly not the grand unified solution I am looking for. All it would take is for a smarter spammer to pull the comment form with my time-based-token and then send it along with the spam message for instant approval. It’s worryingly easy to do; I did it in about 10 lines of PHP code – no joke.

The spam problem will always present itself as long as we try to ‘filter’ out automatic-spam-clients from humans; it is simply the wrong approach. Eventually spam-clients will evolve to become so sophisticated we can’t tell the difference anyway, then all the filters and tricks in the book won’t help you. I have already started to see a distinct difference between the spam clients hitting my blog daily, a trend showing that more sophisticated clients are being developed all the time.

The only way you will truly eradicate spam is to validate the content of the message or the content of the web page being linked, everything else is just a bonus.

The way I see it, there are two solutions that will blow Spam out of the water forever:

Smarter Application

An application that verifies everything based on in-built rules, configuration and past experiences.

* validating that the comment is relevant to the post
* validating that any associated info [URL, email] is genuine
* validating that the trackback/pingback came from somewhere that exists
* validating links in the comments for their content, is it relevant? are they selling Viagra?

All this is great but it goes far beyond what a humble PHP application should be doing. I dread to think what server load this would produce on even a small-scale blog – not to mention a multi-blog-site or a shared web server.

Dumb Application made smarter through On-line Collaboration

This basically describes Akismet – where a central system has all the complex logic and the client-application receives a simple yes/no decision.

Personally, I would do this asynchronously because the client-application [Wordpress for example] doesn’t really need an immediate decision. Comments can easily wait in a queue for approval which a later XMLRPC callback provides. This would greatly reduce the load on the system and allow far more complex algorithms and lookups to take place improving the overall accuracy of the decisions dispensed.

Other widely used anti-spam solutions also look promising; for example, the ‘roadmap’ for Spam Karma SK2.2 has some really good ideas, my favourites being the honeypot and the anti-PageRank idea.

If you have read my previous posts you’ll have seen how wide the spam problem already is. Although I am very proud of my little SpamKit plugin because it does exactly what I wanted, I am quite frustrated with its limitations.

The whole spam problem is very interesting to me; it’s not going to go away any time soon. It’s existed since 1978 and been evolving ever since…

WP Plugin » SpamKit Plugin 0.0 – Time-Based-Tokens to Fight Spam

This is the first release and prototype of SpamKit for Wordpress.

SpamKit was written by Gerard Calderhead; it’s a PHP library that uses secure time-based-tokens to aid validating form posts and can be used on guestbooks, blogs, form-mailers etc.

It does this by generating a checksum’d and encrypted ‘token’ containing the UNIX-timestamp from when it was generated. This ‘token’ is written out into the form as a hidden field. When the form is posted back to the server, the token’s value is validated. If the token is invalid or has been tampered with, validation will automatically fail; if the token has ‘expired’ it will also fail.
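To make the idea concrete, here is a rough sketch of a time-based token. It is not SpamKit's actual format: instead of a checksum plus encryption it swaps in a simpler technique, signing the timestamp with an HMAC, and the secret, lifetime and function names are all hypothetical.

```php
<?php
// Hypothetical settings; SpamKit's real secret handling and expiry window
// are not described in this post.
const TOKEN_SECRET  = 'change-me';
const TOKEN_MAX_AGE = 3600; // seconds a token stays valid

// Build a token of the form "<unix-timestamp>:<hmac of that timestamp>".
function make_token(int $now): string
{
    $mac = hash_hmac('sha256', (string) $now, TOKEN_SECRET);
    return $now . ':' . $mac;
}

// Validate a posted token: it must be well formed, carry a signature that
// matches its timestamp, and be neither from the future nor expired.
function check_token(string $token, int $now): bool
{
    $parts = explode(':', $token, 2);
    if (count($parts) !== 2 || !ctype_digit($parts[0])) {
        return false; // missing or malformed
    }
    list($issued, $mac) = $parts;

    $expected = hash_hmac('sha256', $issued, TOKEN_SECRET);
    if (!hash_equals($expected, $mac)) {
        return false; // tampered with or forged
    }

    $age = $now - (int) $issued;
    return $age >= 0 && $age <= TOKEN_MAX_AGE;
}
```

The form would emit make_token(time()) in a hidden field and the handler would call check_token($_POST['token'], time()), where the field name is again only an example. Passing the clock in as a parameter keeps the expiry logic easy to test.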

I took SpamKit and plugged it into Wordpress to do the following:

- When a comment form is drawn, a time-based-token is generated and inserted as a hidden field in the form.
- Where the comment would normally be approved, SpamKit is used to validate the token; if corrupt, missing or expired the comment is flagged as ‘spam’ preventing any email notification of the comment being posted.
- After the comment has been saved (as ‘spam’) by Wordpress the plugin changes the comment’s status to ‘Awaiting Moderation’ to allow the moderator to delete it at a later date.

The end result is comment-spam sits in the ‘Awaiting Moderation’ list without generating any email to say so.

The third step may not be what everyone desires for the plugin’s functionality but being a prototype there are no option pages to control this as yet.

The SpamKit plugin has been tested on Wordpress 1.5 only and found to operate as expected on even the most liberal configurations.

Installation is simple; there are no configuration options that require changing. Simply copy it into the plugins directory and activate it from the administration screen.

Download: spamkit-plugin.zip

Comments, Questions and Feedback welcomed!

Updated [3rd January 2006] – Download link points to wp-plugins.org

‘NASA Search 1.0’ ??? Something Google should worry about ???

Having written my own Wordpress logging / statistics plug-in over the weekend – which is still in prototype, consider it a ‘coming soon’ – I have started to notice more and more peculiar User-Agents visiting my blog.

I quite like to keep an eye on what spiders / bots visit my sites, how often they return and try to infer something about how they were designed by watching them visit.

I was surprised recently to see that the big three ( Yahoo!, MSN & Google ) actually pull RSS feeds as well as HTML pages – of course this makes sense from an efficiency & bandwidth point of view, as the RSS feed is the interesting stuff already stripped out.

Today’s one is a real winner though, coming from the following net block and advertising itself as “NASA Search 1.0”.

CODE:
    Comcast Cable Communications, Inc. NJ-SOUTH-4 (NET-68-46-128-0-1)
                                      68.46.128.0 - 68.46.191.255

The bot / spider crawled my entire site within a few minutes, starting from my ‘changes-in-wordpress-152’ post and was completely oblivious to my robots.txt (it didn’t even request it).

Also, it appeared to be quite a primitive HTTP client, providing no referrer information or any of the usual headers such as “Connection: close” or “Accept: */*”, even though it was sending an “HTTP/1.1” request. Surprisingly though it did persist a session cookie for the duration of its visit.

I Google’d for the phrase “NASA Search 1.0” and only seemed to find results where auto-generated-stats pages list visiting User-Agents.

It would be quite interesting (and maybe even fun – in a very geeky way) to write a Wordpress plug-in that watches for these peculiar bots and pings their details to a centralised stats database – forming a sort of spider-spider.

Anyway, I will be keeping a keen eye out for the return of “NASA Search 1.0” … Could it be the next greatest NASA funded project? Or is it just some smart a** that has figured out how to change the User-Agent string in his favourite spider/bot.

Stay tuned!

Wordpress Hack » Inserting rel nofollow on links in all categories…

I wanted to add rel='external nofollow' onto every Link in every Category that didn't have an explicit rel attribute defined. I looked through the Wordpress Codex but found no Plugin Hooks to do it so I broke out the source code.

Somewhere down the chain the link markup "<a ..." is generated in the function get_links() found in /wp-includes/links.php (approx line 142). I added the two lines of code shown below and it works a treat.

Before

PHP:
    $rel = $row->link_rel;

    if ($rel != '') {
        $rel = " rel='$rel'";
    }

After

PHP:
    $rel = $row->link_rel;

    if ($rel != '') {
        $rel = " rel='$rel'";
    } else {
        $rel = " rel='external nofollow'";
    }

 

Blocking Wordpress comment spammers by User-Agent

I have been plagued with automated comment spam lately; it is still at a level where it's manageable manually but... I am lazy.

The comment message itself is nearly always along the lines of:

XML:
    Excellent! I enjoyed reading your material. think that will make relief: http://www.av.com ,
    <a href="http://www.adobe.com" rel="nofollow">substances that cure you</a> ,
    <a href="http://www.apple.com" rel="nofollow">my parents didnt told me about it</a>

It would appear to be more of a test message; blogs that accept the comment will probably be hammered with real spam at a later date.

I use ModSecurity on my server and wondered if there was an easy way to filter out these requests before they even reach Wordpress. I dug out my access_logs looking for the offending requests. The programs being used to post the comment spam appear to be quite simplistic, posting directly to "wp-comments-post.php"

CODE:
    blog.lobstertechnology.com 209.200.xxx.xxx - - [16/Oct/2005:04:36:20 +0100]
       "POST /wp-comments-post.php HTTP/1.1" 302 5
       "http://blog.lobstertechnology.com/wp-comments-post.php"
       "Jakarta Commons-HttpClient/3.0-rc3"

    blog.lobstertechnology.com 207.195.xxx.xxx - - [12/Nov/2005:09:57:15 +0000]
       "POST /wp-comments-post.php HTTP/1.1" 302 5
       "-"
       "Mozilla/4.78 (TuringOS; Turing Machine; 0.0)"

Others are a little more sophisticated and at least bother to change the default User-Agent:

CODE:
    blog.lobstertechnology.com 209.200.xxx.xxx - - [09/Nov/2005:12:24:23 +0000]
       "POST /wp-comments-post.php HTTP/1.1" 302 5
       "http://blog.lobstertechnology.com/wp-comments-post.php"
       "Mozilla/4.0"

I crafted a very simple ModSecurity filter to catch these. Although it is a little crude, it will only trigger when the listed User-Agents send a request to "/wp-comments-post.php". Adjust as required:

XML:
    <IfModule mod_security.c>

       # Turn the filtering engine On or Off
       SecFilterEngine On

       ...

       # proof of concept Wordpress User-Agent filter
       <Location /wp-comments-post.php>
          SecFilterSelective HTTP_USER_AGENT "HttpClient"
          SecFilterSelective HTTP_USER_AGENT "Java"
          SecFilterSelective HTTP_USER_AGENT "TuringOS"
       </Location>

    </IfModule>

Related Links
ModSecurity - http://www.modsecurity.org/

Changes in Wordpress 1.5.2

A few vulnerabilities found in recent weeks have been addressed by the Wordpress team with release 1.5.2 available now from wordpress.org.

The most recent vulnerability I’ve noticed was cross posted on the Full-Disclosure & Bugtraq mailing lists on the 9th August. The exploit made use of an old security issue in the PHP engine itself; a compile-time & run-time setting ‘register_globals’.

Thankfully the impact ought to be low; the PHP team have long been trying to stop the use of ‘register_globals’. It’s not enabled by default (you explicitly have to switch it on) and there are warnings all over its usage.

The most significant change I found in this release of Wordpress is in the file wp-settings.php

PHP:
    // Turn register globals off
    function unregister_GLOBALS() {
        if ( !ini_get('register_globals') )
            return;

        if ( isset($_REQUEST['GLOBALS']) )
            die('GLOBALS overwrite attempt detected');

        // Variables that shouldn't be unset
        $noUnset = array('GLOBALS', '_GET', '_POST', '_COOKIE', '_REQUEST', '_SERVER', '_ENV', '_FILES', 'table_prefix');

        $input = array_merge($_GET, $_POST, $_COOKIE, $_SERVER, $_ENV, $_FILES, isset($_SESSION) && is_array($_SESSION) ? $_SESSION : array());
        foreach ( $input as $k => $v )
            if ( !in_array($k, $noUnset) && isset($GLOBALS[$k]) )
                unset($GLOBALS[$k]);
    }

    unregister_GLOBALS();

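The above function sanitises the globals before proceeding with the request: when ‘register_globals’ is enabled, it unsets every global variable that was created from request input, apart from the listed exceptions. It should greatly reduce the vulnerability of Wordpress on servers with ‘register_globals’ enabled in the future.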

Related Links:
http://wordpress.org/development/2005/08/one-five-two/
http://blog.blackdown.de/2005/08/14/another-wordpress-security-update/