WP Plugin » SpamKit Plugin 0.4 – Time-Based-Tokens to Fight Spam

This is a pretty significant release of SpamKit Plugin which provides some cool new features. This is checked into Subversion over at WP-Plugins.org and you can download the new version here spamkit-plugin.php.

Released as version 0.4:
* Added an options page. This required sanity checks to prevent double definition of functions, implemented in a C-style #ifdef / #define pattern.
* Added full configuration functionality. This uses built-in defaults overridden by saved options, making it upgrade-proof.
* Added new EXPERIMENTAL check: comments posted by clients with no User-Agent string are auto-spammed and don’t make it to the moderation page.
* Added new EXPERIMENTAL check: the submitted email address is subject to format validation & a DNS check for a mail exchanger.
* Updated to use Gerry’s new OO-based TBT code, removing the dependency on MCRYPT.
* Removed any path-dependent problems, making it compatible with all WP installs *I hope*.
* Added option to place trackbacks & pingbacks in the moderation queue; disabling this option causes them to be auto-approved.
* Added option to moderate comments which fail TBT checks; disabling this option means the comments are automatically marked as spam and will never be seen.
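
To illustrate the first two items, here is a minimal sketch of a double-definition guard plus "defaults overridden by saved options". The function name and the option keys are hypothetical, made up for this example; they are not SpamKit's real names.

```php
<?php
// Guard against double definition, in the spirit of C's #ifndef / #define.
if ( ! function_exists( 'spamkit_example_options' ) ) {
    /**
     * Merge saved options over built-in defaults.
     * The option keys below are hypothetical, not SpamKit's real keys.
     */
    function spamkit_example_options( array $saved ) {
        $defaults = array(
            'moderate_trackbacks'   => true,
            'moderate_tbt_failures' => true,
            'check_user_agent'      => false,
        );
        // Saved values win; keys that are not in the defaults are ignored.
        return array_merge( $defaults, array_intersect_key( $saved, $defaults ) );
    }
}

$options = spamkit_example_options( array( 'check_user_agent' => true ) );
```

Because every key always has a built-in default, an upgrade that introduces a new option still works with an older saved configuration.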

Known Issues:
* Because direct calls to this script (for the badge) cannot access WP or any options, there is no easy way to provide a configurable /tmp directory. There is however a configuration option to disable this functionality if it causes problems.

Analysis of Spamming Zombie Botnets

Since writing my SpamKit Plugin I have been keeping a keen eye on the comment/trackback spam subject and have guinea pig’d my ideas on my own blog. Recently I noticed a distinct change in the sophistication of comment-spammers.

The early comment-spammers were using very basic HTTP clients, mostly without thinking about what’s going on ‘under the hood’. As such their spam-messages would come through with easily filtered HTTP “User-Agent” headers like “PEAR HTTP_Request class ( http://pear.php.net/ )” and “libwww-perl/5.803”. Over a period of a few months these – what I call 1st generation – bots began to dwindle in numbers, replaced by slightly more sophisticated clients which loosely emulated real browsers.
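
As a rough sketch of how such 1st generation clients could be filtered on their “User-Agent” header: the function name and the marker list below are illustrative assumptions built from the two examples above, not the plugin's actual code.

```php
<?php
/**
 * Return true when a User-Agent header looks like a first-generation bot.
 * The marker list is illustrative only (hypothetical helper).
 */
function example_is_suspicious_user_agent( $user_agent ) {
    $user_agent = trim( (string) $user_agent );
    if ( $user_agent === '' ) {
        return true; // no User-Agent at all
    }
    $library_markers = array( 'PEAR HTTP_Request', 'libwww-perl' );
    foreach ( $library_markers as $marker ) {
        // Case-insensitive substring match against known HTTP libraries.
        if ( stripos( $user_agent, $marker ) !== false ) {
            return true;
        }
    }
    return false;
}
```

A list like this only catches clients that announce themselves honestly, which is exactly why it stopped working once the bots started emulating browsers.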

These 2nd generation bots were still very primitive: apart from changing the “User-Agent” and adding a few other headers, they would repeatedly attempt to post comments on a number of posts over a period of a few seconds. This activity is also easily filtered since not even a superhuman Blog-fiend could comment on your top ten posts in less than 10 seconds.

All the attempts so far have been very basic; beginners in Perl / PHP could probably pull them off easily, and they are just as easily filtered out.
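
A burst check along these lines might look like the sketch below. It assumes you already have the recent comment attempts for one IP address; the function name, the array shape and the thresholds are all hypothetical.

```php
<?php
/**
 * $attempts: list of array( 'time' => unix seconds, 'post_id' => int ) for ONE IP.
 * Returns true if at least $max_posts distinct posts were targeted within
 * any window of $window seconds. (Hypothetical helper and thresholds.)
 */
function example_is_burst( array $attempts, $max_posts = 3, $window = 10 ) {
    usort( $attempts, function ( $a, $b ) {
        return $a['time'] - $b['time'];
    } );
    $count = count( $attempts );
    for ( $i = 0; $i < $count; $i++ ) {
        $posts = array();
        for ( $j = $i; $j < $count; $j++ ) {
            if ( $attempts[ $j ]['time'] - $attempts[ $i ]['time'] > $window ) {
                break; // outside the window that starts at attempt $i
            }
            $posts[ $attempts[ $j ]['post_id'] ] = true;
        }
        if ( count( $posts ) >= $max_posts ) {
            return true;
        }
    }
    return false;
}
```

Counting distinct posts rather than raw attempts avoids flagging a real visitor who double-clicks the submit button on a single post.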

Over the Christmas period I observed some very unusual activity, a ‘spam attack’ coming from dozens of source IP addresses, coordinated within a few minutes. I initially spotted it because the “User-Agent” header was completely empty – stands out a bit. After some investigation and further attacks I became pretty confident this wasn’t a fluke or coincidence of independent spammers.

I knocked up a quick Wordpress plug-in to capture as much info about these suspicious requests as possible. Here is one of the first attacks.

03/02/2006 20:37:44 212.0.XXX.XXX GET /
03/02/2006 20:38:14 201.242.XXX.XXX GET /category/wordpress/plugins/
03/02/2006 20:39:54 210.183.XXX.XXX GET /2006/02/02/search-term-highlighter-plugin-0-0/
03/02/2006 20:40:25 200.122.XXX.XXX GET /category/java/jakarta-velocity/
03/02/2006 20:40:37 62.23.XXX.XXX GET /2006/02/02/sitecom-cn-502-usb-bluetooth-dongle-works-on-linux/
03/02/2006 20:40:55 68.96.XXX.XXX GET /2006/02/02/search-term-highlighter-plugin-0-0/
03/02/2006 20:41:18 70.88.XXX.XXX POST /wp-comments-post.php
03/02/2006 20:41:20 70.88.XXX.XXX GET /category/thoughts/
03/02/2006 20:41:44 200.21.XXX.XXX POST /wp-comments-post.php
03/02/2006 20:41:48 200.21.XXX.XXX GET /2006/01/25/ti-7x21-flashmedia-sd-host-controller-104c-8033/
03/02/2006 20:42:16 61.145.XXX.XXX GET /category/wordpress/plugins/search-term-highlighter/
03/02/2006 20:42:24 217.113.XXX.XXX GET /category/flash/
03/02/2006 20:42:48 212.251.XXX.XXX GET /category/internet/
03/02/2006 20:43:04 205.180.XXX.XXX POST /wp-comments-post.php
03/02/2006 20:43:22 82.76.XXX.XXX GET /keywords/
03/02/2006 20:43:56 218.248.XXX.XXX GET /2006/02/02/search-term-highlighter-plugin-0-0/#postcomment
03/02/2006 20:44:13 206.191.XXX.XXX GET /2006/02/02/search-term-highlighter-plugin-0-0/%23postcomment
03/02/2006 20:44:14 206.191.XXX.XXX GET /category/tools/
03/02/2006 20:44:15 206.191.XXX.XXX GET /category/wordpress/plugins/search-term-highlighter/
03/02/2006 20:44:38 62.23.XXX.XXX GET /category/wordpress/plugins/search-term-highlighter/
03/02/2006 20:45:33 82.76.XXX.XXX POST /wp-comments-post.php
03/02/2006 20:45:34 82.76.XXX.XXX GET /category/tools/
03/02/2006 20:45:35 82.76.XXX.XXX POST /wp-comments-post.php
03/02/2006 20:45:48 203.162.XXX.XXX POST /wp-comments-post.php

In this particular instance, the attack took place over a ten-minute period. The first request was an HTTP GET on the root of my Blog “/”, almost definitely used to feed the other bots with URLs. Next, other clients in the Botnet continued to spider my Blog in parallel, building a list of URLs to try later, and lastly came the first of the attempts to post a comment.

If you examine the sequence of requests, the bots are posting a comment, then coming back to check if it was successful. Analysis of later attacks even found other bots in the group checking if the comment posted by a peer bot was successful. The participating hosts are located all over the world but the majority are in North America and Asia.
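
The “post, then come back to check” pattern can be pulled out of a log like the one above with a few lines of PHP. This is only a sketch: the function name is hypothetical, the dates are assumed to be day/month/year, and because the addresses are masked two different hosts in the same range would be counted as one.

```php
<?php
/**
 * Return the (masked) IPs that POST a comment and then issue a GET
 * within $max_gap seconds. Hypothetical helper for the log format above.
 */
function example_post_then_check( array $lines, $max_gap = 5 ) {
    $utc       = new DateTimeZone( 'UTC' );
    $last_post = array(); // ip => timestamp of its most recent POST
    $found     = array();
    foreach ( $lines as $line ) {
        if ( ! preg_match( '#^(\S+ \S+) (\S+) (GET|POST) (\S+)$#', trim( $line ), $m ) ) {
            continue; // not a log line in the expected shape
        }
        // Assumption: dates are written day/month/year.
        $when = DateTime::createFromFormat( 'd/m/Y H:i:s', $m[1], $utc );
        if ( $when === false ) {
            continue;
        }
        $ts = $when->getTimestamp();
        $ip = $m[2];
        if ( $m[3] === 'POST' ) {
            $last_post[ $ip ] = $ts;
        } elseif ( isset( $last_post[ $ip ] ) && $ts - $last_post[ $ip ] <= $max_gap ) {
            $found[ $ip ] = true;
        }
    }
    return array_keys( $found );
}

$sample = array(
    '03/02/2006 20:41:18 70.88.XXX.XXX POST /wp-comments-post.php',
    '03/02/2006 20:41:20 70.88.XXX.XXX GET /category/thoughts/',
    '03/02/2006 20:41:44 200.21.XXX.XXX POST /wp-comments-post.php',
    '03/02/2006 20:41:48 200.21.XXX.XXX GET /2006/01/25/ti-7x21-flashmedia-sd-host-controller-104c-8033/',
    '03/02/2006 20:42:24 217.113.XXX.XXX GET /category/flash/',
);
$checkers = example_post_then_check( $sample );
```

With the five sample lines, both 70.88.XXX.XXX and 200.21.XXX.XXX are reported, since each follows its POST with a GET within a few seconds.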

This obviously demonstrates a very high level of sophistication. Initially I presumed that there was a single client application running requests in parallel over a group of HTTP proxies. After tracing down the locations & owners of each of the participants in the attacks I concluded it was infeasible that they all happened to have open proxies being abused in this way. A large proportion of the machines being used are actually web servers which have probably been exploited and are running IRC-controlled Trojan software.

Backing this up is the pace at which these attacks are evolving. The first few were very primitive, without even an HTTP “User-Agent” header; however, this was very quickly amended. The most recent attack I observed (1st March 2006) showed even more improvements: each client was almost indistinguishable from a normal visitor, providing full ‘Internet Explorer’-like headers of accepted MIME types, charsets and languages, and even including valid HTTP referrer headers and cookies.
Thankfully, all their time seems to be invested in improving the client software; the actual content of the comment was practically identical.

My SpamKit Plugin has so far easily handled each of these situations. It uses Gerry’s “Time Based Tokens”, which are auto-generated and written into a hidden form field. Any incoming comments without a token or with an invalid token can be held for moderation, while having zero impact on real visitors writing comments. Unlike techniques used by other solutions, it does not require the user to type in a random key from an image like the ‘captcha’ technique, nor does it rely on JavaScript support in the browser. Until these spam bots reach a level of sophistication where they are parsing out HTML forms including hidden values and posting them, the current version of SpamKit will still be an effective solution.

However, there is one major drawback with SpamKit: pingbacks/trackbacks are machine-generated, so they will not have a “Time Based Token” and will be held for moderation as if they were spam. The problem with this is that spammers are also increasingly using the pingback/trackback mechanism to get their comments through the net. A lot of thought and discussion on this subject with Gerry led to one potential solution: scoring & validation of the URL the pingback/trackback is supposedly from.

In early examples of trackback spam the URL given pointed straight to some advertising-based web page. Something like this lends itself to easy detection and filtering as the content when examined would score highly for spam key words like ‘Viagra’ etc. However, these attacks have also evolved; the most recent point to real web pages or Blogs that contain obfuscated JavaScript redirection code, redirecting real visitors’ browsers but avoiding any page content detection techniques. In some cases the code has been inserted into Bulletin Boards or Guestbooks which allow unfiltered HTML.

An example page with obfuscated JavaScript redirection (warning, this will redirect you to mp3search.ru)

http://zigfrid.blog.kataweb.it/il_mio_weblog/

So, what measures can be taken to stop spam?

Personally I don’t think you will ever get rid of spam. You have a pretty good chance of eradicating all but the most sophisticated of spammers, but you’ll never stop 100% of spam. The best methodology is to constantly evolve your defences at the same rate or faster than the opposition. For starters Gerry & I are constantly dreaming up new ways we can enhance SpamKit… Recent updates include encoding the original source IP address in the “Time Based Token”, which becomes invalid if submitted from a different address. Other works in progress include hardcore validation of the email address submitted (does the domain exist? does it have a mail exchanger MX record? etc.), content validation, key word searching and probabilities of the content being spam. Progress will be reported here and on Gerry’s site.
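
The email validation idea (format check, then an MX lookup on the domain) could be sketched as follows. The function name is hypothetical and this is not the plugin's actual implementation; the DNS lookup is passed in as a callable purely so the logic can be tried without network access.

```php
<?php
/**
 * Format-check an address, then ask whether its domain has a mail exchanger.
 * $mx_lookup: optional callable( string $domain ): bool. When omitted, a
 * checkdnsrr() MX query is used. (Hypothetical helper, sketch only.)
 */
function example_email_looks_deliverable( $email, $mx_lookup = null ) {
    if ( filter_var( $email, FILTER_VALIDATE_EMAIL ) === false ) {
        return false; // fails basic format validation
    }
    // Everything after the last "@" is the domain part.
    $domain = substr( strrchr( $email, '@' ), 1 );
    if ( $mx_lookup === null ) {
        $mx_lookup = function ( $d ) {
            return checkdnsrr( $d, 'MX' );
        };
    }
    return (bool) call_user_func( $mx_lookup, $domain );
}
```

A DNS lookup on every comment adds latency, so a check like this probably belongs behind a configuration option.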

In the long term spammers are going to have clients that pretty much replicate real users down to the delays & randomness between requests. Countermeasures are going to have to be just as sophisticated, evaluating content and even executing JavaScript as if they were also real clients.

WP Plugin » SpamKit Plugin 0.3 – Time-Based-Tokens to Fight Spam

This is a(nother) minor release of SpamKit Plugin which provides some cool new features. This is checked into Subversion over at WP-Plugins.org and you can download the new version here spamkit-plugin.php.

New Features:

* Minor improvements to the use of TBTs: any token used within 5 seconds of being generated will be declared invalid. This is to stop the majority of automated clients parsing and sending the TBT token.

* Added a ‘web badge’ for display on your blog pages, it shows the number of spam comments caught with SpamKit. To use it simply add the following where you want the badge to appear:

PHP:
  <?php
     // Only call the badge function if the SpamKit plugin is active.
     if ( function_exists("spamkit_badge") ) {
        spamkit_badge();
     }
  ?>

Alternatively, you can have this spamkit_badge method return you the HTML markup by calling spamkit_badge( true ), for example:

PHP:
  <?php
     if ( function_exists("spamkit_badge") ) {
        // Passing true returns the markup instead of printing it.
        $html = spamkit_badge(true);
        echo $html;
     }
  ?>

And it looks like this: SpamKit Plugin for Wordpress: Caught 25 Spam Comments!

* Added a custom pingback to my own blog triggered when the plugin is installed and activated on your own blog. This is used for installation counting and version tracking. Future versions will have this as configurable and optional.
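
The five-second rule from the first feature above boils down to an age check on the token's timestamp. A minimal sketch, with a hypothetical function name and a made-up one-hour expiry (the real expiry value isn't stated here):

```php
<?php
/**
 * A token is acceptable only if it is at least $min_age seconds old
 * (too fast means an automated client) and no older than $max_age.
 * Hypothetical helper; $max_age is an assumed value.
 */
function example_token_age_ok( $issued_at, $now, $min_age = 5, $max_age = 3600 ) {
    $age = $now - $issued_at;
    return $age >= $min_age && $age <= $max_age;
}
```

For example, example_token_age_ok( 1000, 1002 ) is false because only two seconds have passed, while example_token_age_ok( 1000, 1030 ) is true.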

Known Issues:

* The HTML generated by spamkit_badge links back to the plugin using an absolute URL (/wp-content/plugins/spamkit-plugin.php) which may not suit everyone’s Wordpress installation. This will be addressed in the next release.

* SpamKit uses temporary files to store the count, saving the image generating part of the script from having to make SQL calls. To do this it is assumed that all systems have a “/tmp” directory which is writable by the user the WWW server is running as. Temporary file names are fairly unique, they are generated by taking the crc32 value of $_SERVER['SERVER_NAME'].
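
For illustration, a per-site counter file name derived from the crc32 of the server name could be built like this. The function name and the "spamkit-…​.count" naming are hypothetical; only the crc32-of-SERVER_NAME idea comes from the plugin.

```php
<?php
/**
 * Build a per-site temp file path from the crc32 of the server name.
 * The file name pattern is made up for this example.
 */
function example_counter_path( $server_name, $tmp_dir = '/tmp' ) {
    // sprintf( '%u' ) keeps the checksum unsigned on 32-bit PHP builds.
    $checksum = sprintf( '%u', crc32( $server_name ) );
    return rtrim( $tmp_dir, '/' ) . '/spamkit-' . $checksum . '.count';
}
```

In the plugin the input would be $_SERVER['SERVER_NAME']; a crc32 is short and stable, but it is not collision-proof, hence “fairly unique”.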

WP Plugin » SpamKit Plugin 0.2 – Time-Based-Tokens to Fight Spam

This is a minor release of SpamKit Plugin to update Gerry's TBT code which now incorporates the IP address from $_SERVER['REMOTE_ADDR'] into the validation. This is checked into Subversion over at WP-Plugins.org and you can download the new version here spamkit-plugin.php.

Changelog:

version 0.2 - updated TBT code with improvements from Gerard Calderhead. TBT now includes the IP address from $_SERVER['REMOTE_ADDR'] in the check and fails if the IP is different during validation.
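
To show the idea of binding a token to the client's IP address, here is a sketch that uses an HMAC over the timestamp and the address. This is a swapped-in technique for illustration: Gerry's TBT code has its own format and API, which are not reproduced here, and both function names are hypothetical.

```php
<?php
// Hypothetical helpers; NOT the real TBT library's API or token format.

/** Issue a token whose signature covers both the timestamp and the IP. */
function example_ip_token_issue( $secret, $ip, $now ) {
    $mac = hash_hmac( 'sha256', $now . '|' . $ip, $secret );
    return $now . ':' . $mac;
}

/** Accept the token only if it was issued to the same IP address. */
function example_ip_token_check( $token, $secret, $ip ) {
    $parts = explode( ':', (string) $token, 2 );
    if ( count( $parts ) !== 2 || ! ctype_digit( $parts[0] ) ) {
        return false; // missing or malformed
    }
    $expected = hash_hmac( 'sha256', $parts[0] . '|' . $ip, $secret );
    return hash_equals( $expected, $parts[1] );
}
```

In a plugin the $ip argument would come from $_SERVER['REMOTE_ADDR'], so a token harvested by one bot is useless to a peer posting from another address.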

WP Plugin » SpamKit Plugin 0.1 – Time-Based-Tokens to Fight Spam

This is a minor release of SpamKit Plugin to address an easy-to-fix problem where Trackbacks from the same blog or server get treated as spam because they don't include the time-based token. This is checked into Subversion over at WP-Plugins.org and you can download the new version here spamkit-plugin.php.

Changelog:

Added a check in spam_action_pre_comment_approved to compare the REMOTE_ADDR with the SERVER_ADDR; if they match, the comment is allowed to bypass time-based token checking. However, this could be abused if another web application on the server is exploited, allowing an attacker to post comments apparently from this server. Then again, if someone goes to all the effort of exploiting a web application, comment-spam is probably low on their priorities.
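
The comparison itself is tiny. A sketch, with a hypothetical function name and the server variables passed in as an array so it can be tried outside a web request:

```php
<?php
/**
 * True when the request appears to originate from the web server itself.
 * $server would normally be $_SERVER. (Hypothetical helper.)
 */
function example_is_local_request( array $server ) {
    return isset( $server['REMOTE_ADDR'], $server['SERVER_ADDR'] )
        && $server['REMOTE_ADDR'] !== ''
        && $server['REMOTE_ADDR'] === $server['SERVER_ADDR'];
}
```

Requiring both values to be present and non-empty avoids treating a request as local just because neither variable happens to be set.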

SpamKit Plugin moved to WP-Plugins.org

This morning I moved SpamKit Plugin over to WP-Plugins.org.

You can always download the latest version directly from here.

More thoughts on SpamKit…

After my recent release of SpamKit Plugin I have been contemplating the whole spam problem in much greater depth. Gerry highlighted one major problem with my SpamKit plugin itself – trackbacks are considered spam because they don't include the time-based token. I started to look into amending my plugin to support them when I realised that this would be a serious loophole.

As far as the SpamKit plugin is concerned it would see trackbacks & pingbacks as new comments, without any time-based token. Thankfully Wordpress passes a 'comment_type' flag of 'trackback' or 'pingback' in these cases, and there is also a separate action of 'trackback_post' which the plugin can hook into. But what is stopping a particularly clever spammer from doing trackback / pingback spam? Nothing...

So, slightly disheartened by this particular problem, I went back to the mental drawing-board. SpamKit plugin is great for a very specific problem, but it's certainly not the grand unified solution I am looking for. All it would take is for a smarter spammer to pull the comment form with my time-based-token and then send it along with the spam message for instant approval. It's worryingly easy to do; I did it in about 10 lines of PHP code – no joke.

The spam problem will always present itself as long as we try to 'filter' out automatic-spam-clients from humans; it is simply the wrong approach. Eventually spam-clients will evolve to become so sophisticated we can't tell the difference anyway, then all the filters and tricks in the book won't help you. I have already started to see a distinct difference between the spam clients hitting my blog daily, a trend showing that more sophisticated clients are being developed all the time.

The only way you will truly eradicate spam is to validate the content of the message or the content of the web page being linked, everything else is just a bonus.

The way I see it, there are two solutions that will blow Spam out of the water forever:

Smarter Application

An application that verifies everything based on in-built rules, configuration and past experiences.

* validating that the comment is relevant to the post
* validating any info associated [URL, email] is valid
* validating that the trackback/pingback came from somewhere that exists
* validating links in the comments for their content, is it relevant? are they selling Viagra?

All this is great but it goes far beyond what a humble PHP application should be doing. I dread to think what server load this would produce on even a small-scale blog – not to mention a multi-blog-site or a shared web server.

Dumb Application made smarter through On-line Collaboration

This basically describes Akismet – where a central system has all the complex logic and the client-application receives a simple yes/no decision.

Personally, I would do this asynchronously because the client-application [Wordpress for example] doesn't really need an immediate decision. Comments can easily wait in a queue for approval which a later XMLRPC callback provides. This would greatly reduce the load on the system and allow far more complex algorithms and lookups to take place improving the overall accuracy of the decisions dispensed.

Other widely used anti-spam solutions also look promising, for example the 'roadmap' for Spam Karma SK2.2 has some really good ideas, my favourites being the honeypot and the anti-PageRank idea.

If you have read my previous posts you'll have seen how wide the spam problem already is. Although I am very proud of my little SpamKit plugin because it does exactly what I wanted, I am quite frustrated with its limitations.

The whole spam problem is very interesting to me; it's not going to go away any time soon. It's existed since 1978 and been evolving ever since...

WP Plugin » SpamKit Plugin 0.0 – Time-Based-Tokens to Fight Spam

This is the first release and prototype of SpamKit for Wordpress.

SpamKit was written by Gerard Calderhead; it’s a PHP library that uses secure time-based-tokens to aid validating form posts and can be used on guestbooks, blogs, form-mailers etc.

It does this by generating a checksum’d and encrypted ‘token’ containing the UNIX-timestamp from when it was generated. This ‘token’ is written out into the form as a hidden field. When the form is posted back to the server, the token’s value is validated. If it is invalid or has been tampered with, validation will automatically fail; if the token has ‘expired’ it will also fail.
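
To make the idea concrete, here is a minimal sketch of a time-based token. It is not Gerry's code: instead of a checksum’d and encrypted token it signs the timestamp with an HMAC, which is a different technique with the same goal of detecting tampering. The function names and the one-hour expiry are assumptions made for this example.

```php
<?php
// Hypothetical helpers; an HMAC-signed stand-in for the real SpamKit token.

/** Create "timestamp.signature" for embedding in a hidden form field. */
function example_token_create( $secret, $now ) {
    return $now . '.' . hash_hmac( 'sha256', (string) $now, $secret );
}

/** Validate a posted token: well-formed, untampered and not expired. */
function example_token_valid( $token, $secret, $now, $max_age = 3600 ) {
    $parts = explode( '.', (string) $token, 2 );
    if ( count( $parts ) !== 2 || ! ctype_digit( $parts[0] ) ) {
        return false; // missing or malformed
    }
    if ( ! hash_equals( hash_hmac( 'sha256', $parts[0], $secret ), $parts[1] ) ) {
        return false; // timestamp or signature was altered
    }
    $age = $now - (int) $parts[0];
    return $age >= 0 && $age <= $max_age; // reject expired (or future) tokens
}
```

The token would be written into the form with something like echo '<input type="hidden" name="tbt" value="' . htmlspecialchars( $token ) . '" />'; where the field name "tbt" is again just an example.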

I took SpamKit and plugged it into Wordpress to do the following:

- When a comment form is drawn, a time-based-token is generated and inserted as a hidden field in the form.
- Where the comment would normally be approved, SpamKit is used to validate the token; if corrupt, missing or expired the comment is flagged as ‘spam’ preventing any email notification of the comment being posted.
- After the comment has been saved (as ‘spam’) by Wordpress the plugin changes the comment’s status to ‘Awaiting Moderation’ to allow the moderator to delete it at a later date.

The end result is comment-spam sits in the 'Awaiting Moderation' list without generating any email to say so.

The third step may not be what everyone desires for the plugin’s functionality but being a prototype there are no option pages to control this as yet.

The SpamKit plugin has been tested on Wordpress 1.5 only and found to operate as expected on even the most liberal configurations.

Installation is simple: there are no configuration options that require changing, simply copy it into the plugins directory and activate it from the administration screen.

Download: spamkit-plugin.zip

Comments, Questions and Feedback welcomed!

Updated [3rd January 2006] - Download link points to wp-plugins.org