<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Weblog of Michael Cutler &#187; Spiders &amp; Bots</title>
	<atom:link href="http://blog.lobstertechnology.com/category/spiders-n-bots/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.lobstertechnology.com</link>
	<description>"I felt a great disturbance in the Force, as if millions of peers suddenly cried out in terror and were suddenly silenced."</description>
	<lastBuildDate>Tue, 17 Oct 2006 14:40:43 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>GoogleBot Experiment Success!</title>
		<link>http://blog.lobstertechnology.com/2006/02/24/googlebot-experiment-success/</link>
		<comments>http://blog.lobstertechnology.com/2006/02/24/googlebot-experiment-success/#comments</comments>
		<pubDate>Fri, 24 Feb 2006 02:58:28 +0000</pubDate>
		<dc:creator>Michael Cutler</dc:creator>
				<category><![CDATA[Internet]]></category>
		<category><![CDATA[Search Engines]]></category>
		<category><![CDATA[Spiders & Bots]]></category>
		<category><![CDATA[Wordpress]]></category>

		<guid isPermaLink="false">http://blog.lobstertechnology.com/2006/02/24/googlebot-experiment-success/</guid>
		<description><![CDATA[A month has passed since I changed my Wordpress templates to experiment with Googlebot (see previous post), and I can proudly report that it works like a charm.
My original problem was that Google was returning search results pointing to index-style pages on my Blog instead of the posts themselves. These index [...]]]></description>
			<content:encoded><![CDATA[<p>A month has passed since I changed my Wordpress templates to experiment with Googlebot (<a href='http://blog.lobstertechnology.com/2006/01/23/experimenting-with-googlebot/'>see previous post</a>), and I can proudly report that it works like a charm.</p>
<p>My original problem was that Google was returning search results pointing to index-style pages on my Blog instead of the posts themselves. Index pages such as Categories &#038; Archives update quickly, so most visitors arriving from Google search results had a poor experience &#8211; the post that drew them in wasn&#8217;t obviously visible.</p>
<p>I knew I could use robots meta tag directives to control the INDEX-ing and FOLLOW-ing of my site, but I was hesitant about applying experimental rules to all Search Engine robots. Thankfully Googlebot looks for a meta tag addressed to it alone, which let me apply custom rules to Google only very easily.</p>
<p>Using my Wordpress templates I added the following meta tag to the head of all index-style pages except the home page:</p>
<p><code>&lt;meta name="GOOGLEBOT" content="NOINDEX,FOLLOW"/&gt;</code></p>
<p>This instructs Googlebot to follow the links on the page but not to index the page itself. The end result is that search results pointing to my blog now use each post&#8217;s &#8216;permalink&#8217; URL rather than the index page the post is listed on.</p>
<p> <img src='http://blog.lobstertechnology.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://blog.lobstertechnology.com/2006/02/24/googlebot-experiment-success/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Experimenting with Googlebot</title>
		<link>http://blog.lobstertechnology.com/2006/01/23/experimenting-with-googlebot/</link>
		<comments>http://blog.lobstertechnology.com/2006/01/23/experimenting-with-googlebot/#comments</comments>
		<pubDate>Mon, 23 Jan 2006 15:27:17 +0000</pubDate>
		<dc:creator>Michael Cutler</dc:creator>
				<category><![CDATA[Hacks]]></category>
		<category><![CDATA[Internet]]></category>
		<category><![CDATA[Search Engines]]></category>
		<category><![CDATA[Spiders & Bots]]></category>
		<category><![CDATA[Thoughts]]></category>

		<guid isPermaLink="false">http://blog.lobstertechnology.com/2006/01/23/experimenting-with-googlebot/</guid>
		<description><![CDATA[In my previous post 'Blogs are fundamentally flawed…' I noted an observation that more often than not search results would direct a user to an index-style page containing the post instead of directly to the 'permalink' location of the post. This leads to a poor user-experience from the visitor’s point of view, on busy blogs [...]]]></description>
			<content:encoded><![CDATA[<p>In my previous post '<a href="http://blog.lobstertechnology.com/2006/01/03/blogs-are-fundamentally-flawed-for-the-typical-grandma-user/">Blogs are fundamentally flawed…</a>' I observed that, more often than not, search results direct a user to an index-style page containing the post instead of directly to the 'permalink' location of the post. This leads to a poor user experience from the visitor’s point of view: on busy blogs the post has almost certainly moved off that page since it was spider'd. Google in particular appeared to be the worst for it.</p>
<p>Discussions on the subject with <a href='http://webofshite.com/'>Gerry</a> determined that this is most likely down to Google's PageRank technology, where index-style pages have a higher value than the post pages themselves. To get around this he suggested manipulating robots meta directives within the index-style pages.</p>
<p>On Google's "Information for Webmasters" help page I found that, when spidering, Googlebot looks for 'robots.txt' directives and meta tags addressed specifically to it. This meant I could single out Googlebot for these directives and not affect other search engines (which don’t exhibit the problem so much).</p>
<p>I basically want Google to 'FOLLOW' links on all pages, but not to 'INDEX' the index-style pages like categories &#038; archives by date. The desired effect is that Google can still find all posts as before but ignores the index-style pages themselves. Implementing this is quite simple; I modified my theme's "header.php" file, inserting the following code in the "head" section:</p>
<div class="syntax_hilite"><span class="langName">PHP:</span>
<div id="php-3">
<div class="php">
<ol>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color:#000000; font-weight:bold;">&lt;?php</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#616100;">if</span> <span style="color:#006600; font-weight:bold;">&#40;</span> !is_single<span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#006600; font-weight:bold;">&#41;</span> &amp;&amp; !is_page<span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#006600; font-weight:bold;">&#41;</span> &amp;&amp; !is_home<span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#006600; font-weight:bold;">&#41;</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; <a href="http://www.php.net/echo"><span style="color:#000066;">echo</span></a> <span style="color:#FF0000;">"&nbsp; &lt;meta name=<span style="color:#000099; font-weight:bold;">\"</span>GOOGLEBOT<span style="color:#000099; font-weight:bold;">\"</span> content=<span style="color:#000099; font-weight:bold;">\"</span>NOINDEX,FOLLOW<span style="color:#000099; font-weight:bold;">\"</span> /&gt;<span style="color:#000099; font-weight:bold;">\n</span>"</span>;</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color:#000000; font-weight:bold;">?&gt;</span> </div>
</li>
</ol>
</div>
</div>
</div>
<p></p>
<p>This reads almost literally: if this is not a single post view, not a page view and not the home page, add the following "meta..." tag. Although the home page is an index-style page, I am reluctant to add 'NOINDEX' to it because I don't want it disappearing from search results. <img src='http://blog.lobstertechnology.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<p>Now the long wait for the changes to reflect in Google's results.</p>
<p>Updated 24th January 2006 - <a href='http://webofshite.com/'>Gerry</a> pointed out this can be optimised using <a href="http://en.wikipedia.org/wiki/De_Morgan's_Law">De Morgan's Law</a> <img src='http://blog.lobstertechnology.com/wp-includes/images/smilies/icon_razz.gif' alt=':P' class='wp-smiley' /> </p>
<div class="syntax_hilite"><span class="langName">PHP:</span>
<div id="php-4">
<div class="php">
<ol>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color:#000000; font-weight:bold;">&lt;?php</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; <span style="color:#616100;">if</span> <span style="color:#006600; font-weight:bold;">&#40;</span> ! <span style="color:#006600; font-weight:bold;">&#40;</span>is_single<span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#006600; font-weight:bold;">&#41;</span> || is_page<span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#006600; font-weight:bold;">&#41;</span> || is_home<span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#006600; font-weight:bold;">&#41;</span><span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#006600; font-weight:bold;">&#41;</span></div>
</li>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; <a href="http://www.php.net/echo"><span style="color:#000066;">echo</span></a> <span style="color:#FF0000;">"&nbsp; &lt;meta name=<span style="color:#000099; font-weight:bold;">\"</span>GOOGLEBOT<span style="color:#000099; font-weight:bold;">\"</span> content=<span style="color:#000099; font-weight:bold;">\"</span>NOINDEX,FOLLOW<span style="color:#000099; font-weight:bold;">\"</span> /&gt;<span style="color:#000099; font-weight:bold;">\n</span>"</span>;</div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;"><span style="color:#000000; font-weight:bold;">?&gt;</span> </div>
</li>
</ol>
</div>
</div>
</div>
<p></p>
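<p>As a quick sanity check that the two conditions really are interchangeable, here is a minimal sketch that compares them over every combination of inputs. It is an illustration only: the functions original_form() and demorgan_form() are hypothetical names, and plain booleans stand in for the results of Wordpress's is_single(), is_page() and is_home(), so it runs outside Wordpress.</p>
<pre><code>&lt;?php
// Hypothetical helpers: the booleans stand in for is_single(), is_page(), is_home().
function original_form($single, $page, $home) {
    return !$single &amp;&amp; !$page &amp;&amp; !$home;
}

function demorgan_form($single, $page, $home) {
    return !($single || $page || $home);
}

// Walk all eight combinations and count any disagreement between the two forms.
$mismatches = 0;
foreach (array(false, true) as $single) {
    foreach (array(false, true) as $page) {
        foreach (array(false, true) as $home) {
            if (original_form($single, $page, $home) !== demorgan_form($single, $page, $home)) {
                $mismatches++;
            }
        }
    }
}

echo $mismatches === 0 ? "equivalent\n" : "not equivalent\n";
</code></pre>
<p>Both forms agree on all eight combinations, so which one goes in "header.php" is purely a matter of readability.</p>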
]]></content:encoded>
			<wfw:commentRss>http://blog.lobstertechnology.com/2006/01/23/experimenting-with-googlebot/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>&#8216;NASA Search 1.0&#8217; &#8211; Something Google should worry about?</title>
		<link>http://blog.lobstertechnology.com/2005/12/06/nasa-search-10-something-google-should-worry-about/</link>
		<comments>http://blog.lobstertechnology.com/2005/12/06/nasa-search-10-something-google-should-worry-about/#comments</comments>
		<pubDate>Tue, 06 Dec 2005 10:00:23 +0000</pubDate>
		<dc:creator>Michael Cutler</dc:creator>
				<category><![CDATA[Internet]]></category>
		<category><![CDATA[Misc]]></category>
		<category><![CDATA[Search Engines]]></category>
		<category><![CDATA[Spiders & Bots]]></category>
		<category><![CDATA[Thoughts]]></category>
		<category><![CDATA[Wordpress]]></category>

		<guid isPermaLink="false">http://blog.lobstertechnology.com/2005/12/06/%e2%80%9cnasa-search-10%e2%80%9d-something-google-should-worry-about/</guid>
		<description><![CDATA[I quite like to keep an eye on what spiders / bots visit my sites, how often they return and try to infer something about how they were designed by watching them visit. ...... Today’s one is a real winner though, coming from the following net block and advertising itself as 'NASA Search 1.0'.]]></description>
			<content:encoded><![CDATA[<p>Having written my own Wordpress logging / statistics plug-in over the weekend – it is still a prototype, so consider it ‘coming soon’ – I have started to notice more and more peculiar User-Agents visiting my blog.</p>
<p>I quite like to keep an eye on which spiders / bots visit my sites and how often they return, and to try to infer something about how they were designed by watching them visit.</p>
<p>I was surprised recently to see that the big three ( Yahoo!, MSN &#038; Google ) actually pull RSS feeds as well as HTML pages – of course this makes sense from an efficiency &#038; bandwidth point of view, since the RSS feed is the interesting content with everything else already stripped out.</p>
<p>Today’s one is a real winner though, coming from the following net block and advertising itself as “NASA Search 1.0”.</p>
<div class="syntax_hilite"><span class="langName">CODE:</span>
<div id="code-5">
<div class="code">
<ol>
<li style="font-family: 'Courier New', Courier, monospace; color: black; font-weight: normal; font-style: normal;color:#3A6A8B;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">Comcast Cable Communications, Inc. <span style="">NJ</span>-SOUTH-<span style="color:#800000;color:#800000;">4</span> <span style="color:#006600; font-weight:bold;">&#40;</span>NET-<span style="color:#800000;color:#800000;">68</span>-<span style="color:#800000;color:#800000;">46</span>-<span style="color:#800000;color:#800000;">128</span>-<span style="color:#800000;color:#800000;">0</span>-<span style="color:#800000;color:#800000;">1</span><span style="color:#006600; font-weight:bold;">&#41;</span></div>
</li>
<li style="font-weight: bold;color:#26536A;">
<div style="font-family: 'Courier New', Courier, monospace; font-weight: normal;">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span style="color:#800000;color:#800000;">68</span>.<span style="color:#800000;color:#800000;">46</span>.<span style="color:#800000;color:#800000;">128</span>.<span style="color:#800000;color:#800000;">0</span> - <span style="color:#800000;color:#800000;">68</span>.<span style="color:#800000;color:#800000;">46</span>.<span style="color:#800000;color:#800000;">191</span>.<span style="color:#800000;color:#800000;">255</span> </div>
</li>
</ol>
</div>
</div>
</div>
<p></p>
<p>The bot / spider crawled my entire site within a few minutes, starting from my ‘changes-in-wordpress-152’ post and was completely oblivious to my robots.txt (it didn’t even request it).</p>
<p>Also, it appeared to be quite a primitive HTTP client, providing no referrer information and none of the usual headers such as “Connection: close” or “Accept: */*”, even though it was sending an “HTTP/1.1” request. Surprisingly though it did persist a session cookie for the duration of its visit.</p>
<p>I Google’d for the phrase “NASA Search 1.0” and only seemed to find results where auto-generated-stats pages list visiting User-Agents.</p>
<p>It would be quite interesting (and maybe even fun – in a very geeky way) to write a Wordpress plug-in that watches for these peculiar bots and pings their details to a centralised stats database – forming a sort of spider-spider.</p>
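<p>A first step towards such a plug-in might look something like the sketch below, which flags a request that lacks the headers I noticed were missing. This is only a rough illustration under my own assumptions: peculiar_bot_signs() is a hypothetical helper, not a Wordpress function, and it expects the caller to pass the request headers as an array keyed by lower-cased header name. A missing referrer on its own is perfectly normal, so treat the result as a hint rather than proof.</p>
<pre><code>&lt;?php
// Hypothetical helper, not part of Wordpress: lists the signs that a request
// came from a primitive client. $headers maps lower-cased header names to values.
function peculiar_bot_signs($headers) {
    $signs = array();
    foreach (array('accept', 'connection', 'referer') as $name) {
        if (!isset($headers[$name]) || $headers[$name] === '') {
            $signs[] = 'missing ' . $name;
        }
    }
    return $signs;
}

// Example request resembling the visit described above (values are illustrative).
$request = array(
    'user-agent' =&gt; 'NASA Search 1.0',
    'host'       =&gt; 'blog.lobstertechnology.com',
);

echo implode(', ', peculiar_bot_signs($request)) . "\n";
</code></pre>
<p>The list of signs, together with the User-Agent and net block, is the sort of record that could be pinged to the central database.</p>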
<p>Anyway, I will be keeping a keen eye out for the return of “NASA Search 1.0” … Could it be the next great NASA-funded project? Or is it just some smart a** who has figured out how to change the User-Agent string in his favourite spider/bot?</p>
<p>Stay tuned!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.lobstertechnology.com/2005/12/06/nasa-search-10-something-google-should-worry-about/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
