After my recent release of the SpamKit plugin I have been contemplating the whole spam problem in much greater depth. Gerry highlighted one major problem with the plugin itself: trackbacks are treated as spam because they don’t include the time-based token. I started to look into amending the plugin to support them when I realised that this would open a serious loophole.
As far as the SpamKit plugin is concerned, trackbacks and pingbacks look like new comments that arrive without any time-based token. Thankfully WordPress passes a ‘comment_type’ flag of ‘trackback’ or ‘pingback’ in these cases, and there is also a separate ‘trackback_post’ action which the plugin can hook into. But if the plugin lets those through without a token, what is stopping a particularly clever spammer from sending trackback / pingback spam instead? Nothing…
So, slightly disheartened by this particular problem, I went back to the mental drawing board. The SpamKit plugin is great for a very specific problem, but it’s certainly not the grand unified solution I am looking for. All it would take is for a smarter spammer to fetch the comment form containing my time-based token and then send that token along with the spam message for instant approval. It’s worryingly easy to do; I did it in about 10 lines of PHP code, no joke.
The spam problem will always be with us as long as we try to ‘filter’ automated spam clients out from humans; it is simply the wrong approach. Eventually spam clients will become so sophisticated that we can’t tell the difference anyway, and then all the filters and tricks in the book won’t help you. I have already started to see a distinct difference between the spam clients hitting my blog daily, a trend showing that more sophisticated clients are being developed all the time.
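To make the weakness concrete, here is a minimal sketch of a time-based token check and the replay that defeats it. It is written in Python rather than PHP for brevity, and it is not SpamKit’s actual code: the HMAC scheme, the secret, the one-hour lifetime and the names make_token, token_is_valid and fetch_form are all assumptions made for illustration.

```python
import hashlib
import hmac
import time

SECRET = b"hypothetical-site-secret"  # assumption: a per-site secret key
MAX_AGE = 3600  # assumption: tokens are accepted for one hour


def make_token(now=None):
    """Return 'timestamp:signature', as a form might embed in a hidden field."""
    ts = str(int(time.time() if now is None else now))
    sig = hmac.new(SECRET, ts.encode(), hashlib.sha256).hexdigest()
    return f"{ts}:{sig}"


def token_is_valid(token, now=None):
    """Accept the token if the signature matches and it has not expired."""
    now = time.time() if now is None else now
    ts, sep, sig = token.partition(":")
    if not sep or not ts.isdigit():
        return False
    expected = hmac.new(SECRET, ts.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False
    return 0 <= now - int(ts) <= MAX_AGE


def fetch_form():
    """Stand-in for a spam client downloading the comment form."""
    return {"token": make_token()}


# A blind bot that posts without loading the form is rejected...
blind_ok = token_is_valid("")
# ...but a bot that loads the form first and echoes the token back passes.
replay_ok = token_is_valid(fetch_form()["token"])
```

The token proves only that somebody requested the form recently. It says nothing about who is posting or what they are posting.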
The only way you will truly eradicate spam is to validate the content of the message or the content of the web page being linked; everything else is just a bonus.
The way I see it, there are two solutions that will blow spam out of the water forever:
Smarter Application
An application that verifies everything based on built-in rules, configuration and past experience:
* validating that the comment is relevant to the post
* validating that any associated info [URL, email] is valid
* validating that the trackback/pingback came from somewhere that exists
* validating the content behind links in the comments: is it relevant? Are they selling Viagra?
All this is great but it goes far beyond what a humble PHP application should be doing. I dread to think what server load this would produce on even a small-scale blog – not to mention a multi-blog-site or a shared web server.
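As a rough idea of what the cheapest of these checks could look like, here is a Python sketch that works on the comment text alone. It is an illustration only: the word-overlap test for relevance, the two-word blacklist and the loose email pattern are all assumptions, and the expensive parts (fetching linked pages, confirming that a trackback source exists) are left out because they need network requests.

```python
import re
from urllib.parse import urlparse

SPAM_WORDS = {"viagra", "casino"}  # assumption: a tiny illustrative blacklist
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # deliberately loose


def words(text):
    """Lower-cased words of four or more letters, as a crude topic signature."""
    return set(re.findall(r"[a-z]{4,}", text.lower()))


def url_looks_valid(url):
    """Syntax check only; it does not confirm that the site exists."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and "." in parts.netloc


def check_comment(post_text, comment):
    """Return a list of reasons to distrust the comment (empty = no objection)."""
    reasons = []
    body_words = words(comment["body"])
    if not body_words & words(post_text):
        reasons.append("no words in common with the post")
    if body_words & SPAM_WORDS:
        reasons.append("contains blacklisted words")
    if comment.get("url") and not url_looks_valid(comment["url"]):
        reasons.append("malformed URL")
    if comment.get("email") and not EMAIL_RE.match(comment["email"]):
        reasons.append("malformed email")
    return reasons


post = "Trackback spam and time-based tokens in WordPress plugins"
good = check_comment(post, {
    "body": "Trackback spam is getting worse on my blog too",
    "url": "http://example.com",
    "email": "me@example.com",
})
spam = check_comment(post, {
    "body": "Buy cheap viagra now",
    "url": "not a url",
    "email": "nobody",
})
```

Even this toy version runs on every comment, and the real checks would add an outbound request per link, which is where the server load worry comes from.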
Dumb Application made smarter through On-line Collaboration
This basically describes Akismet – where a central system has all the complex logic and the client-application receives a simple yes/no decision.
Personally, I would do this asynchronously, because the client application [WordPress for example] doesn’t really need an immediate decision. Comments can easily wait in a queue until a later XML-RPC callback delivers the verdict. This would greatly reduce the load on the system and allow far more complex algorithms and lookups to take place, improving the overall accuracy of the decisions dispensed.
Other widely used anti-spam solutions also look promising. For example, the ‘roadmap’ for Spam Karma SK2.2 has some really good ideas, my favourites being the honeypot and the anti-PageRank idea.
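The asynchronous flow could be sketched roughly like this in Python. It is a hypothetical design, not how Akismet or WordPress work: the ModerationQueue class and its submit and on_verdict methods are names invented here, and the transport that would carry the callback (XML-RPC or otherwise) is left out.

```python
from collections import OrderedDict


class ModerationQueue:
    """Hold comments as 'pending' until a verdict arrives from elsewhere."""

    def __init__(self):
        self.pending = OrderedDict()  # comment_id -> comment text
        self.approved = {}
        self.rejected = {}
        self._next_id = 1

    def submit(self, text):
        """Store the comment and return at once; no spam check happens here."""
        comment_id = self._next_id
        self._next_id += 1
        self.pending[comment_id] = text
        return comment_id

    def on_verdict(self, comment_id, is_spam):
        """Callback the central service would invoke later with its decision."""
        text = self.pending.pop(comment_id, None)
        if text is None:
            return False  # unknown or already-decided comment: ignore
        (self.rejected if is_spam else self.approved)[comment_id] = text
        return True


queue = ModerationQueue()
first = queue.submit("Nice write-up, thanks")
second = queue.submit("Buy pills here")

# Verdicts may arrive later and in any order.
queue.on_verdict(second, is_spam=True)
queue.on_verdict(first, is_spam=False)
```

Because submit never waits on the central service, the blog stays responsive however long the analysis takes, and a verdict that never arrives simply leaves the comment pending.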
If you have read my previous posts you’ll have seen how widespread the spam problem already is. Although I am very proud of my little SpamKit plugin because it does exactly what I wanted, I am quite frustrated with its limitations.
The whole spam problem is very interesting to me, and it’s not going to go away any time soon. It has existed since 1978 and has been evolving ever since…