GoogleBot Experiment Success!

A month has passed since I changed my WordPress templates to experiment with Googlebot (see previous post) and I can proudly report that it works like a charm.

My original problem was that Google was returning search results pointing to index-style pages on my Blog instead of the posts themselves. These index pages, like Categories & Archives, update quickly, so the majority of visitors coming from Google search results were having a poor experience – the post that drew them in wasn’t obviously visible.

I knew I could use robots meta tag directives to control the INDEX-ing and FOLLOW-ing of my site, but I was hesitant about applying experimental rules to all Search Engine robots. Thankfully Googlebot looks for a meta tag addressed specifically to itself, which let me apply custom rules to Google only very easily.

Using my WordPress templates I added the following meta tag to all index-style pages except the home page:

<meta name="GOOGLEBOT" content="NOINDEX,FOLLOW"/>

Basically this is instructing GoogleBot to follow links on this page, but not to index the page itself. The end result is that search results pointing to my blog are using the ‘permalink’ URL, not the index page it is listed on.

;)

Experimenting with Googlebot

In my previous post 'Blogs are fundamentally flawed…' I observed that, more often than not, search results direct a user to an index-style page containing the post instead of directly to the 'permalink' location of the post. This leads to a poor user experience from the visitor’s point of view: on busy blogs the post has almost certainly moved off that page since it was spidered. Google in particular appeared to be the worst for it.

Discussions on the subject with Gerry determined that this is most likely down to Google's PageRank technology, where index-style pages have a higher value than the post pages themselves. To get around this he suggested adding robots meta directives to the index-style pages.

On Google's "Information for Webmasters" help page I found that, when spidering, Google looks for 'robots.txt' directives and meta tags addressed specifically to Googlebot. This meant I could single out Googlebot for these directives and not affect other search engines (which don’t exhibit the problem so much).

I basically want Google to 'FOLLOW' links on all pages, but not to 'INDEX' the index-style pages like categories & archives by date. The desired effect is that Google can find all posts as before but simply ignores the index-style pages themselves. Implementing this is quite simple: I modified my theme's "header.php" file, inserting the following code in the "head" section:

PHP:
  <?php
      // Index-style views only: not a single post, a static page or the home page.
      if ( !is_single() && !is_page() && !is_home() ) {
          echo "  <meta name=\"GOOGLEBOT\" content=\"NOINDEX,FOLLOW\" />\n";
      }
  ?>

This reads almost literally: if this is not a single post view, not a page view and not the home page, add the following "meta..." tag. Although the home page is an index-style page I am reluctant to add 'NOINDEX' because I don't want it disappearing from search results. ;)

Now the long wait for the changes to reflect in Google's results.

Updated 24th January 2006 - Gerry pointed out this can be optimised using De Morgan's Law :P

PHP:
  <?php
      // Same condition as before, with the three negations folded into one.
      if ( ! (is_single() || is_page() || is_home()) ) {
          echo "  <meta name=\"GOOGLEBOT\" content=\"NOINDEX,FOLLOW\" />\n";
      }
  ?>
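
To convince yourself the two conditions really are interchangeable, the sketch below walks every combination of the three checks. The function names are hypothetical stand-ins for illustration; the boolean arguments play the roles of is_single(), is_page() and is_home(), so it runs outside WordPress.

```php
<?php
// Hypothetical stand-ins: each argument plays the role of is_single(),
// is_page() and is_home() respectively, so no WordPress install is needed.
function original_form(bool $single, bool $page, bool $home): bool
{
    return !$single && !$page && !$home;
}

function de_morgan_form(bool $single, bool $page, bool $home): bool
{
    return !($single || $page || $home);
}

// Walk all eight combinations and count any disagreement between the forms.
function count_mismatches(): int
{
    $mismatches = 0;
    foreach ([false, true] as $single) {
        foreach ([false, true] as $page) {
            foreach ([false, true] as $home) {
                if (original_form($single, $page, $home) !== de_morgan_form($single, $page, $home)) {
                    $mismatches++;
                }
            }
        }
    }
    return $mismatches;
}

echo count_mismatches(), "\n"; // prints 0
```

Both forms give the same answer in all eight cases, so the rewrite is a readability change rather than a change in behaviour.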

Blogs are fundamentally flawed for the typical Grandma-User

It may seem a little sad but I can honestly say that reading my access_log is far more interesting than any soap opera on TV; it is filled with exotic foreigners, futuristic robots, drama, intrigue and personal tragedy. The best thing about it is that it’s all real; these are (mostly) real people who stumble across your humble Blog in the hope of finding the solution to their problems.

Over the Christmas period I have observed more people visiting an ancient post of mine than in the past six months. The post is about my experiences with an external hard drive enclosure; more accurately, the chip / controller a great deal of hard drive enclosures use. Based on this I would guess that a considerable number of people got hard drive enclosures for their Christmas and ran into the same problems I had. Anyway, I am wandering a little.

It was reading my access_logs that made me realise that Blogs are actually a really bad format for the Grandma user...

Picture this: your Grandma is Google'ing and happens to get a result that points to your Blog. She sees a teaser in the search results that shows you've written something about what she is looking for. Grandma clicks your link and is presented with your last 10 posts about God knows what and no sign of the post she saw the snippet of. Grandma goes back to Google thoroughly disappointed, never to return...

I encounter this phenomenon frequently, but because I am familiar with the Blog format I think nothing of drilling down to the relevant category to find the post I wanted or, if I am lazy, clicking Google’s cached copy of the page. However, for the average internet surfer it presents a fundamental flaw in the usability of the Blog format.

The problem is quite simple: search engines can never be up to date with your content all the time. The more frequently you post, the more often the problem will occur and the harder the post will be to find. The way I see it there are two possible solutions.

Smarter search engines

Enhancing search engines so they can distinguish between an index-page of posts and individual posts. This could be done by identifying sections of text within a page as an extract from another URL using something like RDF [http://www.w3.org/RDF/] which can already be embedded within XHTML [http://internetalchemy.org/2005/10/introducing-embedded-rdf]. Enclosing the section of text between the ‘<rdf:RDF>’ tags would do the trick.

In the Blog format the index pages and category pages would all contain embedded RDF indicating that the enclosed section of text is actually from another URL – its permalink. However, this idea is not limited to the Blog format; it has huge potential for most modern website formats.

This wouldn’t be a trivial change for search engines to make. It would be time-consuming and therefore costly, but I believe it would be worthwhile for the future of internet content.

Smarter websites

A more short-term solution I am looking at is improving my website [i.e. WordPress] to detect that the visitor has come from a search engine, try to determine the query they used from the ‘Referer’ HTTP header, then find and present the best matches to that query before any other posts are displayed.

Obviously this method has quite a few shortcomings:

* The ‘Referer’ header may not be there (some people disable it within the browser or through third-party software)
* Although handling the query formats of the main players is quite easy, not all search engines can be catered for
* It requires an intensive search of all the site’s posts; the standard WordPress search won’t cut it
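
As a rough sketch of the first step (pulling the query out of the ‘Referer’ header), something like the following could work. The function name and the host-to-parameter table are my own assumptions for illustration: Google and MSN carry the terms in ‘q’ and Yahoo! in ‘p’, but the list is nowhere near exhaustive and the host matching is deliberately loose.

```php
<?php
// Assumed mapping of search engine host fragments to the query-string
// parameter that carries the search terms. Illustrative, not exhaustive.
const SEARCH_QUERY_PARAMS = [
    'google.' => 'q',
    'search.yahoo.' => 'p',
    'search.msn.' => 'q',
];

/**
 * Return the search terms found in a Referer URL, or null when the header is
 * missing, malformed, or not from a recognised search engine.
 * (Hypothetical helper, not part of WordPress.)
 */
function search_terms_from_referer(?string $referer): ?string
{
    if ($referer === null || $referer === '') {
        return null;
    }
    $host = parse_url($referer, PHP_URL_HOST);
    $query = parse_url($referer, PHP_URL_QUERY);
    if (!is_string($host) || !is_string($query)) {
        return null;
    }
    // parse_str() URL-decodes the values, turning "+" back into spaces.
    parse_str($query, $params);
    foreach (SEARCH_QUERY_PARAMS as $fragment => $param) {
        if (stripos($host, $fragment) !== false
            && isset($params[$param])
            && is_string($params[$param])
        ) {
            $terms = trim($params[$param]);
            return $terms === '' ? null : $terms;
        }
    }
    return null;
}

// prints "hard drive enclosure"
echo search_terms_from_referer('http://www.google.com/search?q=hard+drive+enclosure&hl=en'), "\n";
```

In a theme the input would presumably come from $_SERVER['HTTP_REFERER'] when it is set. A null result covers the first two shortcomings above (no header, or an engine that isn't catered for), in which case the page should simply render as normal.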

I contemplated getting the site to pull a copy of the URL given in the ‘Referer’ header, scan for the result that led the visitor to your site then locate the correct post given the snippet text… Then I decided that was a reeeeeeaaally bad idea.

In the long run I believe the content, and therefore the search engines that index it, have to improve to cater for the format of internet content today, and I think embedded RDF might be the key. Unfortunately this cannot happen overnight.

In the meantime making smarter websites will help the situation until the content and the search engines catch up.

‘NASA Search 1.0’ ??? Something Google should worry about ???

Having written my own WordPress logging / statistics plug-in over the weekend – which is still a prototype, consider it a ‘coming soon’ – I have started to notice more and more peculiar User-Agents visiting my blog.

I quite like to keep an eye on what spiders / bots visit my sites and how often they return, and to try to infer something about how they were designed by watching them visit.

I was surprised recently to see that the big three ( Yahoo!, MSN & Google ) actually pull RSS feeds as well as HTML pages – of course this makes sense from an efficiency & bandwidth point of view, since the RSS feed is the interesting stuff already stripped out.

Today’s one is a real winner though, coming from the following net block and advertising itself as “NASA Search 1.0”.

CODE:
  Comcast Cable Communications, Inc. NJ-SOUTH-4 (NET-68-46-128-0-1)
                                    68.46.128.0 - 68.46.191.255
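
To spot a return visit from that net block in the logs, a simple range check is enough. This is only a sketch: the constant and function names are made up for illustration, and it assumes 64-bit PHP, where ip2long() always returns a non-negative integer.

```php
<?php
// Bounds copied from the net block listed above.
const NET_BLOCK_START = '68.46.128.0';
const NET_BLOCK_END = '68.46.191.255';

/**
 * True when a dotted-quad IPv4 address lies within the inclusive range.
 * Returns false for anything ip2long() cannot parse.
 * (Hypothetical helper; assumes 64-bit PHP.)
 */
function ip_in_range(string $ip, string $start, string $end): bool
{
    $value = ip2long($ip);
    $low = ip2long($start);
    $high = ip2long($end);
    if ($value === false || $low === false || $high === false) {
        return false;
    }
    return $value >= $low && $value <= $high;
}

var_dump(ip_in_range('68.46.150.12', NET_BLOCK_START, NET_BLOCK_END)); // bool(true)
```

The block belongs to a residential cable provider, so a match says where the request came from, not who is behind it.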

The bot / spider crawled my entire site within a few minutes, starting from my ‘changes-in-wordpress-152’ post and was completely oblivious to my robots.txt (it didn’t even request it).

Also, it appeared to be quite a primitive HTTP client, providing no referrer information or any of the usual headers such as “Connection: close” and “Accept: */*”, even though it was sending an “HTTP/1.1” request. Surprisingly, though, it did persist a session cookie for the duration of its visit.
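
A logging plug-in could flag this kind of client with a check along these lines. It is a minimal sketch: the function name and the choice of which headers count as "usual" are my assumptions, taken from the three I noticed missing, and a missing ‘Referer’ on its own is perfectly normal for a direct visit.

```php
<?php
// Headers noted as absent from this bot's requests. Treating these as
// "usual" is this sketch's assumption, not a rule of HTTP.
const USUAL_HEADERS = ['Accept', 'Connection', 'Referer'];

/**
 * Return the names from USUAL_HEADERS that are missing from a request.
 * $headers maps header names to values; names are matched case-insensitively.
 * (Hypothetical helper for illustration.)
 */
function missing_usual_headers(array $headers): array
{
    $present = array_change_key_case($headers, CASE_LOWER);
    $missing = [];
    foreach (USUAL_HEADERS as $name) {
        if (!array_key_exists(strtolower($name), $present)) {
            $missing[] = $name;
        }
    }
    return $missing;
}

// A request shaped like the one described: a User-Agent and little else.
print_r(missing_usual_headers(['Host' => 'example.com', 'User-Agent' => 'NASA Search 1.0']));
```

Where it is available, getallheaders() could supply the array. A request missing all three is worth a closer look in the logs, nothing more.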

I Google’d for the phrase “NASA Search 1.0” and only seemed to find results where auto-generated-stats pages list visiting User-Agents.

It would be quite interesting (and maybe even fun – in a very geeky way) to write a WordPress plug-in that watches for these peculiar bots and pings their details to a centralised stats database – forming a sort of spider-spider.

Anyway, I will be keeping a keen eye out for the return of “NASA Search 1.0” … Could it be the next greatest NASA funded project? Or is it just some smart a** who has figured out how to change the User-Agent string in his favourite spider/bot?

Stay tuned!