IT Terminology – Hard Drive Jenga

Hard Drive Jenga

A term describing the removal of hard drives from a RAID storage array, where the objective of the game is to remove as many drives as possible without the array collapsing and causing catastrophic data loss. Not for the faint-hearted!

Not to be confused with Hard Drive Dominoes (another fine example).

Experimenting with Googlebot

In my previous post 'Blogs are fundamentally flawed…' I observed that, more often than not, search results direct a user to an index-style page containing the post instead of directly to the post's 'permalink' location. This leads to a poor experience from the visitor’s point of view: on busy blogs the post has almost certainly moved off that page since it was spidered. Google in particular appeared to be the worst for it.

Discussing the subject with Gerry, we concluded that this is most likely down to Google's PageRank technology, where index-style pages score higher than the post pages themselves. To get around this he suggested adding robots directives (as 'robots' meta tags) to the index-style pages.

On Google's "Information for Webmasters" help page I found that, when spidering, Googlebot looks for 'robots.txt' directives and meta tags addressed specifically to it. This meant I could single out Googlebot for these directives and not affect other search engines (which don’t exhibit the problem so much).

I basically want Google to 'FOLLOW' links on all pages, but not to 'INDEX' the index-style pages like categories & archives by date. The desired effect is that Google can still find all posts as before but ignores the index-style pages themselves. Implementing this is quite simple; I modified my theme's "header.php" file, inserting the following code in the "head" section:

PHP:
  <?php
      // Ask Googlebot not to index listing pages, but still follow their links.
      if ( !is_single() && !is_page() && !is_home() )
          echo "  <meta name=\"GOOGLEBOT\" content=\"NOINDEX,FOLLOW\" />\n";
  ?>

This reads almost literally: if this is not a single post view, not a page view and not the home page, add the "meta..." tag. Although the home page is an index-style page, I am reluctant to add 'NOINDEX' to it because I don't want it disappearing from search results. ;)

Now the long wait for the changes to reflect in Google's results.

Updated 24th January 2006 - Gerry pointed out this can be simplified using De Morgan's Law :P

PHP:
  <?php
      if ( ! (is_single() || is_page() || is_home()) )
          echo "  <meta name=\"GOOGLEBOT\" content=\"NOINDEX,FOLLOW\" />\n";
  ?>
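
As a quick sanity check that the two conditions really are interchangeable, here is a small sketch that compares them over every combination of the three flags. The function names and boolean parameters are hypothetical stand-ins for the results of is_single(), is_page() and is_home(); none of this is part of the theme itself.

```php
<?php
// Hypothetical stand-ins: each argument plays the role of the value returned
// by is_single(), is_page() or is_home() in the theme code.

function noindex_original(bool $single, bool $page, bool $home): bool
{
    return !$single && !$page && !$home;
}

function noindex_demorgan(bool $single, bool $page, bool $home): bool
{
    return !($single || $page || $home);
}

// Compare the two forms across all eight combinations of the flags.
$mismatches = 0;
foreach ([false, true] as $single) {
    foreach ([false, true] as $page) {
        foreach ([false, true] as $home) {
            if (noindex_original($single, $page, $home) !== noindex_demorgan($single, $page, $home)) {
                $mismatches++;
            }
        }
    }
}

echo $mismatches === 0 ? "equivalent\n" : "differ in $mismatches cases\n";
```

Both forms give the same answer in every case, so the choice between them is purely about readability.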

Blogs are fundamentally flawed for the typical Grandma-User

It may seem a little sad, but I can honestly say that reading my access_log is far more interesting than any soap opera on TV; it is filled with exotic foreigners, futuristic robots, drama, intrigue and personal tragedy. The best thing about it is that it’s all real; these are (mostly) real people who stumble across your humble Blog in the hope of finding the solution to their problems.

Over the Christmas period I have observed more people visiting an ancient post of mine than in the past six months. The post is about my experiences with an external hard drive enclosure; more accurately, the chip / controller that a great many hard drive enclosures use. Based on this I would guess that a considerable number of people got hard drive enclosures for Christmas and ran into the same problems I had. Anyway, I digress.

It was reading my access_logs that made me realise that Blogs are actually a really bad format for the Grandma user...

Picture this: your Grandma is Googling and happens to get a result that points to your Blog. She sees a teaser in the search results showing that you've written something about what she is looking for. Grandma clicks your link and is presented with your last 10 posts about God knows what, with no sign of the post she saw the snippet of. Grandma goes back to Google thoroughly disappointed, never to return...

I encounter this phenomenon frequently, but because I am familiar with the Blog format I think nothing of drilling down to the relevant category to find the post I wanted or, if I am lazy, clicking Google’s cached copy of the page. However, for the average internet surfer it is a fundamental flaw in the usability of the Blog format.

The problem is quite simple: search engines can never be up to date with your content all the time. The more frequently you post, the more often the problem will occur and the harder the post will be to find. The way I see it there are two possible solutions.

Smarter search engines

Enhancing search engines so they can distinguish between an index-page of posts and individual posts. This could be done by identifying sections of text within a page as an extract from another URL using something like RDF [http://www.w3.org/RDF/] which can already be embedded within XHTML [http://internetalchemy.org/2005/10/introducing-embedded-rdf]. Enclosing the section of text between the ‘<rdf:RDF>’ tags would do the trick.

In the Blog format the index pages and category pages would all contain embedded RDF indicating that the enclosed section of text is actually from another URL – its permalink. However, this idea is not limited to the Blog format; it has huge potential for most modern website formats.

This wouldn’t be a trivial change for search engines to make; it would be time-consuming and therefore costly, but I believe it would be worthwhile for the future of internet content.

Smarter websites

A more short-term solution I am looking at is improving my website [i.e. Wordpress] to detect that the visitor has come from a search engine, try and determine the query they used from the ‘Referer’ HTTP header, then find and present the best matches to that query before any other posts are displayed.

Obviously this method has quite a few shortfalls:

* The ‘Referer’ header may not be there (some people disable it within the browser or through third-party software)
* Although handling the query formats of the main players is quite easy, not all search engines can be catered for
* It requires an intensive search of all the site’s posts, the standard Wordpress search won’t cut it
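
The first step, pulling the query out of the ‘Referer’ header, might look something like the sketch below. The function name and the host-to-parameter map are my own assumptions for illustration (only a few engines are listed, and the host match is a crude substring test), so treat it as a starting point and not a finished plugin.

```php
<?php
// Assumed map of host fragment => query-string parameter carrying the search
// terms. Illustrative only; other engines would need their own entries.
const SEARCH_ENGINE_PARAMS = [
    'google.' => 'q',
    'yahoo.'  => 'p',
    'msn.'    => 'q',
];

/**
 * Return the search terms found in a Referer URL, or null when the header is
 * missing, unparseable, or not from a search engine in the map above.
 */
function search_query_from_referer(?string $referer): ?string
{
    if ($referer === null || $referer === '') {
        return null; // header absent or stripped by the browser
    }

    $parts = parse_url($referer);
    if ($parts === false || empty($parts['host']) || empty($parts['query'])) {
        return null;
    }

    // parse_str() also URL-decodes the values ("hard+drive" => "hard drive").
    parse_str($parts['query'], $params);

    foreach (SEARCH_ENGINE_PARAMS as $hostFragment => $param) {
        if (strpos($parts['host'], $hostFragment) !== false
            && isset($params[$param])
            && is_string($params[$param])
            && trim($params[$param]) !== '') {
            return trim($params[$param]);
        }
    }

    return null;
}

var_dump(search_query_from_referer('http://www.google.com/search?q=hard+drive+enclosure&hl=en'));
```

In a live request the header value would typically come from `$_SERVER['HTTP_REFERER']`, which is simply unset when the visitor's browser withholds it. The returned terms would then feed whatever post search the site uses.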

I contemplated getting the site to pull a copy of the URL given in the ‘Referer’ header, scan for the result that led the visitor to your site then locate the correct post given the snippet text… Then I decided that was a reeeeeeaaally bad idea.

In the long-run I believe the content and therefore the search engines that index it have to improve to cater for the format of internet content today and I think embedded RDF might be the key; unfortunately this cannot happen overnight.

In the meantime making smarter websites will help the situation until the content and the search engines catch up.

More thoughts on SpamKit…

After my recent release of SpamKit Plugin I have been contemplating the whole spam problem in much greater depth. Gerry highlighted one major problem with my SpamKit plugin itself – trackbacks are considered spam because they don't include the time-based token. I started to look into amending my plugin to support them when I realised that this would open a serious loophole.

As far as the SpamKit plugin is concerned, trackbacks & pingbacks look like new comments without any time-based token. Thankfully Wordpress passes a 'comment_type' flag of 'trackback' or 'pingback' in these cases, and there is also a separate action of 'trackback_post' which the plugin can hook into. But what is stopping a particularly clever spammer from doing trackback / pingback spam? Nothing...

So, slightly disheartened by this particular problem, I went back to the mental drawing board. SpamKit plugin is great for a very specific problem, but it's certainly not the grand unified solution I am looking for. All it would take is for a smarter spammer to pull the comment form with my time-based token and then send it along with the spam message for instant approval. It's worryingly easy to do; I did it in about 10 lines of PHP code – no joke.
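
To make the weakness concrete, here is a minimal sketch of how a time-based token could work. This is NOT SpamKit's actual code: the token format, function names and secret are all assumptions made up for illustration. The token is just the issue time plus an HMAC of that time under a server-side secret.

```php
<?php
// Hypothetical token scheme for illustration only (not SpamKit's real one).

// Build a token of the form "<issue time>:<HMAC of issue time>".
function make_token(string $secret, int $issuedAt): string
{
    return $issuedAt . ':' . hash_hmac('sha256', (string) $issuedAt, $secret);
}

// Accept the token only if the signature matches and it is not too old.
function token_is_valid(string $token, string $secret, int $now, int $maxAge = 3600): bool
{
    $parts = explode(':', $token, 2);
    if (count($parts) !== 2 || !ctype_digit($parts[0])) {
        return false; // missing or malformed token (e.g. a trackback)
    }

    $issuedAt = (int) $parts[0];
    $expected = hash_hmac('sha256', (string) $issuedAt, $secret);
    $fresh    = $issuedAt <= $now && ($now - $issuedAt) <= $maxAge;

    // hash_equals() compares in constant time.
    return $fresh && hash_equals($expected, $parts[1]);
}

$secret = 'example-secret';             // assumed per-site secret
$token  = make_token($secret, 1000000); // embedded in the comment form

var_dump(token_is_valid($token, $secret, 1000060)); // submitted a minute later
var_dump(token_is_valid($token, $secret, 1010000)); // submitted hours later
var_dump(token_is_valid('', $secret, 1000060));     // no token at all
```

Notice that the check only proves the form was fetched recently. A script that requests the form, copies the token and posts straight away passes exactly like a human would, which is the loophole described above.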

The spam problem will always present itself as long as we try to 'filter' out automatic spam clients from humans; it is simply the wrong approach. Eventually spam clients will evolve to become so sophisticated we can't tell the difference anyway, and then all the filters and tricks in the book won't help you. I have already started to see a distinct difference between the spam clients hitting my blog daily, a trend showing that more sophisticated clients are being developed all the time.

The only way you will truly eradicate spam is to validate the content of the message or the content of the web page being linked; everything else is just a bonus.

The way I see it, there are two solutions that will blow Spam out of the water forever:

Smarter Application

An application that verifies everything based on in-built rules, configuration and past experiences.

* validating that the comment is relevant to the post
* validating any info associated [URL, email] is valid
* validating that the trackback/pingback came from somewhere that exists
* validating links in the comments for their content, is it relevant? are they selling Viagra?
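
The cheapest of these checks, whether the supplied email and URL are at least well-formed, can be sketched with PHP's built-in filters. The function name and the returned labels are hypothetical, and this only tests syntax; it says nothing about whether the address exists or the link is relevant.

```php
<?php
// Return a list of the commenter fields that fail a basic syntax check.
// Hypothetical helper: an empty array means both fields look well-formed.
function comment_field_problems(string $email, string $url): array
{
    $problems = [];

    if (filter_var($email, FILTER_VALIDATE_EMAIL) === false) {
        $problems[] = 'email';
    }

    // The URL is treated as optional; only check it when one was supplied.
    if ($url !== '') {
        $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
        $isWeb  = in_array($scheme, ['http', 'https'], true);
        if (!$isWeb || filter_var($url, FILTER_VALIDATE_URL) === false) {
            $problems[] = 'url';
        }
    }

    return $problems;
}

var_dump(comment_field_problems('someone@example.com', 'http://example.com/'));
var_dump(comment_field_problems('not-an-email', 'ftp://example.com/'));
```

The heavier checks in the list (fetching linked pages, judging relevance) are where the real cost lies.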

All this is great but it goes far beyond what a humble PHP application should be doing. I dread to think what server load this would produce on even a small-scale blog – not to mention a multi-blog-site or a shared web server.

Dumb Application made smarter through On-line Collaboration

This basically describes Akismet – where a central system has all the complex logic and the client-application receives a simple yes/no decision.

Personally, I would do this asynchronously because the client-application [Wordpress for example] doesn't really need an immediate decision. Comments can easily wait in a queue for approval which a later XMLRPC callback provides. This would greatly reduce the load on the system and allow far more complex algorithms and lookups to take place improving the overall accuracy of the decisions dispensed.
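
Roughly what I have in mind is sketched below: comments are stored as pending, and a later callback from the central service settles them. The class and method names are hypothetical, and the queue lives in memory purely to keep the example short; a real plugin would persist it in the database and wire `applyVerdict()` up to its callback handler.

```php
<?php
// Hypothetical in-memory moderation queue, for illustration only.
final class ModerationQueue
{
    /** @var array<int, array{text: string, status: string}> */
    private array $comments = [];
    private int $nextId = 1;

    // Store the comment as pending and return an id that the central
    // service can quote back when its verdict is ready.
    public function submit(string $text): int
    {
        $id = $this->nextId++;
        $this->comments[$id] = ['text' => $text, 'status' => 'pending'];
        return $id;
    }

    // Apply a verdict delivered later (for example by a callback request).
    // Returns false for unknown ids or comments that were already decided.
    public function applyVerdict(int $id, bool $isSpam): bool
    {
        if (!isset($this->comments[$id]) || $this->comments[$id]['status'] !== 'pending') {
            return false;
        }
        $this->comments[$id]['status'] = $isSpam ? 'spam' : 'approved';
        return true;
    }

    // Current status of a comment, or null if the id is unknown.
    public function status(int $id): ?string
    {
        return $this->comments[$id]['status'] ?? null;
    }
}

$queue = new ModerationQueue();
$id = $queue->submit('Nice post!');
echo $queue->status($id), "\n";   // still waiting for the central service
$queue->applyVerdict($id, false); // verdict arrives some time later
echo $queue->status($id), "\n";
```

Because nothing blocks while waiting, the central service is free to take as long as its checks need.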

Other widely used anti-spam solutions also look promising, for example the 'roadmap' for Spam Karma SK2.2 has some really good ideas, my favourites being the honeypot and the anti-PageRank idea.

If you have read my previous posts you'll have seen how wide the spam problem already is. Although I am very proud of my little SpamKit plugin because it does exactly what I wanted, I am quite frustrated with its limitations.

The whole spam problem is very interesting to me; it's not going to go away any time soon. It has existed since 1978 and has been evolving ever since...

Horde 3.0.8 appears to be broken

Horde is the application framework behind IMP, the web-based email client I use to read my email. From the Horde site [www.horde.org]:

The Horde Project is about creating high quality Open Source applications, based on PHP and the Horde Framework.

The guiding principles of the Horde Project are to create solid standards-based applications using intelligent object oriented design that, wherever possible, are designed to run on a wide range of platforms and backends.
There is great emphasis on making Horde as friendly to non-English speakers as possible. The Horde Framework currently supports many localization features such as Unicode and right-to-left text and generous users have contributed many translations for the framework and applications.

Today I downloaded and attempted to install Horde 3.0.8 (released on Sunday 11th December 2005). Something appears to be wrong, as I didn't get very far. I followed all the given instructions; my server is configured correctly and all the dependencies are installed and working. I got as far as the web-based setup / configuration screen, but it didn't allow me to save any settings or complete the setup process.

Following the instructions to the letter, I went to the 'Authentication' tab and selected 'IMAP Authentication'. The page reloaded but didn't reflect my choice from the 'authentication backend' drop-down list. Instead it wouldn't display anything other than 'Let a Horde application handle authentication', and without the additional drop-down to select the application to use.

I initially suspected some Javascript incompatibility as I normally use Firefox and sooo many applications are written against Internet Explorer. But after several attempts from different browsers & platforms I gave up on the authentication tab, opting to try at least the 'Database' tab and configure MySQL. I could easily fill out all the details but when I tried to 'Generate Horde Configuration' it threw me back to the 'General' tab, highlighting that I had not completed required fields to do with error reporting & URL generation – both were set to valid values.

I re-read the documentation and re-did the whole installation... just in case I missed something or was too eager to lock down permissions. Again, exactly the same problem. Next, I relaxed ALL the permissions possible (I basically chmod'd the whole thing to 777) in case the setup wasn't able to write to the config directory, but this didn't help either.

The FAQ didn't provide much help so I went to the IRC channel #horde @ irc.freenode.net and found others with exactly the same problem *holds back the tears of frustration* ... But unfortunately no-one seemed to have any immediate answers.

On a hunch I grabbed Horde 3.0.7 from the FTP site and went through the whole setup process again. However this time it worked as expected and was running within ten minutes!!

Argh... Next step is to diff the code and see where it went wrong... (stay tuned)

Update - This issue was fixed in version 3.0.9 which is now available from www.horde.org

‘NASA Search 1.0’ ??? Something Google should worry about ???

Having written my own Wordpress logging / statistics plug-in over the weekend – which is still a prototype, so consider it ‘coming soon’ – I have started to notice more and more peculiar User-Agents visiting my blog.

I quite like to keep an eye on what spiders / bots visit my sites, how often they return and try to infer something about how they were designed by watching them visit.

I was surprised recently to see that the big three ( Yahoo!, MSN & Google ) actually pull RSS feeds as well as HTML pages – of course this makes sense from an efficiency & bandwidth point of view, as the RSS feed is the interesting stuff already stripped out.

Today’s one is a real winner though, coming from the following net block and advertising itself as “NASA Search 1.0”.

CODE:
  Comcast Cable Communications, Inc. NJ-SOUTH-4 (NET-68-46-128-0-1)
                                    68.46.128.0 - 68.46.191.255
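
For anyone wanting to spot visits from the same net block in their own logs, a range check along these lines should do. The function name is hypothetical and the first test address is just an arbitrary example from inside the range, not the bot's real address.

```php
<?php
// Is $ip inside the inclusive IPv4 range [$start, $end]?
// Hypothetical helper; all three arguments are dotted-quad strings.
function ipv4_in_range(string $ip, string $start, string $end): bool
{
    $n  = ip2long($ip);
    $lo = ip2long($start);
    $hi = ip2long($end);

    if ($n === false || $lo === false || $hi === false) {
        return false; // one of the inputs is not a valid IPv4 address
    }

    return $n >= $lo && $n <= $hi;
}

var_dump(ipv4_in_range('68.46.150.7', '68.46.128.0', '68.46.191.255')); // inside the block
var_dump(ipv4_in_range('68.46.192.1', '68.46.128.0', '68.46.191.255')); // just outside it
```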

The bot / spider crawled my entire site within a few minutes, starting from my ‘changes-in-wordpress-152’ post and was completely oblivious to my robots.txt (it didn’t even request it).

Also, it appeared to be quite a primitive HTTP client, providing no referrer information or any of the usual headers such as “Connection: close” or “Accept: */*”, even though it was sending an “HTTP/1.1” request. Surprisingly, though, it did persist a session cookie for the duration of its visit.
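
A crude way to flag clients like this is to count how many of the usual headers are missing. The sketch below is only a heuristic: the function names, the header list and the threshold are my own assumptions, and plenty of legitimate clients will trip it.

```php
<?php
// Headers that browsers and well-behaved crawlers usually send.
// Assumed list for illustration; a missing Referer is normal for
// crawlers, so it is deliberately left out.
const EXPECTED_HEADERS = ['accept', 'connection', 'user-agent'];

// Return the expected header names that are absent or empty.
// $headers is an array of header name => value, in any letter case.
function missing_headers(array $headers): array
{
    $present = array_change_key_case($headers, CASE_LOWER);
    $missing = [];
    foreach (EXPECTED_HEADERS as $name) {
        if (!isset($present[$name]) || trim((string) $present[$name]) === '') {
            $missing[] = $name;
        }
    }
    return $missing;
}

// Treat a client as "primitive" when it omits at least $threshold headers.
function looks_primitive(array $headers, int $threshold = 2): bool
{
    return count(missing_headers($headers)) >= $threshold;
}

// A request shaped like the one described above: User-Agent and a cookie,
// but no Accept or Connection header.
$bot = ['Host' => 'example.org', 'User-Agent' => 'NASA Search 1.0', 'Cookie' => 'session=abc'];
var_dump(looks_primitive($bot));
```

On Apache the header array could come from `getallheaders()`; elsewhere it may need rebuilding from the `HTTP_*` entries in `$_SERVER`.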

I Google’d for the phrase “NASA Search 1.0” and only seemed to find results where auto-generated-stats pages list visiting User-Agents.

It would be quite interesting (and maybe even fun – in a very geeky way) to write a Wordpress plug-in that watches for these peculiar bots and pings their details to a centralised stats database – forming a sort of spider-spider.

Anyway, I will be keeping a keen eye out for the return of “NASA Search 1.0” … Could it be the next great NASA-funded project? Or is it just some smart a** who has figured out how to change the User-Agent string in his favourite spider/bot?

Stay tuned!