Make Your Robot Overlords Bow Down To You

The Definitive Guide to Robots Exclusion

A robot overlord (picture by Tom Hilton)

Master your robot overlords with robots.txt

At Storm ID we pride ourselves on providing cutting-edge solutions and modern, exciting advice and consultancy.

However, while the industry evolves, grows and constantly invents new and energising technologies and opportunities, some of the basic issues never seem to go away. There are a number of fundamentals that many in the industry never quite grasp properly (thankfully, I’m not including our lovely development team here at Storm in that group – but we frequently see third-party managed sites coming through to our marketing team with exactly these issues).

One of the most common issues we see – and one that should really be part of web development 101 – is the use of the robots exclusion protocol: robots.txt.

Why should websites use robots.txt?

Since robots exclusion was first considered (in 1994!) there have been several valid reasons for wanting to exclude robots from indexing some or all of your website content.

The reasons might historically have included:

  1. privacy – you might not have wanted some of your website content to appear in search engines, such as customer data
  2. security – you might have had concerns that a robot might index secure content such as financial data
  3. duplicate content – you might have wanted to have different versions of your site for different users – such as having a large text version, or a printer friendly version – which you wanted to avoid a search engine indexing and counting as duplicate
  4. You might have been cloaking (naughty, naughty – slaps your wrist) – you could have created a text-heavy version that used some kind of JavaScript redirect to push real users to a robots-excluded, image- or Flash-rich version of the site (welcome to 1998!) *disclaimer – I would heavily recommend NOT doing this, even in 1998!

In 2012, there are even more reasons to consider using robots exclusion, though. Search engines, and Google in particular, are concerned with the overall quality of your website. Google wants to index useful content. It’s not so concerned with all the other detritus of the web. You might not think your 2 million internal search result pages are detritus – but Google might. If Google sees a few really good pages on your site, but many more useless ones, there is a real risk that your good pages will suffer in rankings, pulled down by the negative weight of the bad pages. Google ranks the content of individual web pages, but it also weighs the reputation and importance of the site as a whole.
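As a quick, purely illustrative sketch – assuming your internal search results all live under a single path such as /search/ (your own site’s URLs will differ) – keeping robots away from them takes just two lines:
User-agent: *
Disallow: /search/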

Why robots.txt needs care

If you don’t correctly set up your robots exclusion file, there are a number of bad side-effects that might happen.

Clearly, one risk is that it is ignored completely, and search engines end up indexing everything on your website, including all the stuff you actually wanted to exclude.

Worse still, you might accidentally end up blocking important content from being indexed, maybe even all of your content.

Getting the syntax right

Perhaps the most common mistake I see in robots.txt files is incorrect syntax. Specifically, the single most common mistake is use of “Allow” as an instruction.

There are only two valid instructions that robots will parse in a robots.txt file (actually there is a third one that some will parse but I will come to that later on).

Those two instructions are: “User-agent:” and “Disallow:”

As the names imply, “User-agent:” is used to instruct a specific robot based on its user agent, and “Disallow:” is used to inform the robot which URLs it must not index.

There is no “Allow” instruction in the standard for making exceptions to a Disallow rule.

However, it is acceptable to include comments in the robots.txt file using the “#” character; everything from the “#” to the end of the line is treated as a comment.

Furthermore, there is no use of wildcards or regular expressions in the robots.txt protocol, with the exception that the wildcard “*” can be used to instruct all robots in the directive “User-agent: *”.

The Disallow field strictly describes partial URLs (not directories, as is commonly described, although a directory is one kind of partial URL).

For example:
Disallow: /help # disallows /help/index.htm and /help.htm
Disallow: /help/ # disallows /help/index.htm but allows /help.htm
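Putting those pieces together, a complete robots.txt might look something like the sketch below (the bot name and the paths are purely illustrative):
# Keep a particular (hypothetical) robot out of the whole site
User-agent: ExampleBot
Disallow: /

# Keep all other robots out of the help and print-friendly areas
User-agent: *
Disallow: /help/
Disallow: /print/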

You must place your robots.txt file in the root of your website, i.e. at http://www.yourwebsite.com/robots.txt, and it must be called robots.txt.

The third instruction – Sitemap

There is a third instruction that is obeyed by some robots.  That is the instruction “Sitemap:” followed by a URL.  This instruction is used by some robots to indicate the location of an XML file with a list of URLs that should be indexed.

As not all robots obey “Sitemap:”, it should be placed at the end of the robots.txt file so that other robots can still parse the rest of the instructions in the file. Multiple lines should be used to list multiple sitemap files, and the full URL of each sitemap should be specified.

For example:
Sitemap: http://www.yourwebsite.com/sitemap.xml
Sitemap: http://www.yourwebsite.com/videositemap.xml

Obviously, this creates the possibility that a URL indicated in a sitemap might otherwise be implicitly or explicitly excluded in the robots “Disallow:” fields. It isn’t always clear what action a search engine will take when it receives contradictory instructions such as this.  Furthermore, it is an indicator that not enough care is being taken in the preparation of the robots.txt and sitemap files, and as a result the search engines may reduce the trust they have in the site, which may impact ranking.
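As a purely illustrative example, the combination below sends mixed signals if the sitemap lists URLs under the disallowed /archive/ path:
User-agent: *
Disallow: /archive/

Sitemap: http://www.yourwebsite.com/archive-sitemap.xml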

Alternative tactics

If you can’t access your robots.txt file to edit it, then there are alternative tactics that can be used (although if your robots.txt file disallows something, you can’t use these to override that).

Noindex

The Meta tag “robots” is correctly used to indicate to a robot whether to index a page, and whether to follow links on the page. Permission to index or follow is implied by absence, so there is no need to instruct a robot to index or to follow.  However, you may have reasons to specify noindex or nofollow.  Noindex is broadly equivalent to a web page being disallowed in robots.txt, although the robot can still crawl the page in order to see the tag.

The syntax is:
<meta name="robots" content="noindex, nofollow">

Edit: there are also some other values you can apply to the robots Meta tag, although they aren’t all equivalent to robots.txt behaviour in the way ‘noindex’ is. Thanks @John_Meffen for reminding me to mention that ;-)
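For instance, Google also recognises values such as “noarchive” (don’t show a cached copy of the page) and “nosnippet” (don’t show a text snippet in results), which can be combined in the same content attribute:
<meta name="robots" content="noindex, noarchive, nosnippet">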

x-robots-tag

Another alternative, which is also useful for binary files such as images and downloads, is an HTTP response header sent by the server that instructs a robot that a file is not to be indexed.

Different server software and server-side languages have different ways of creating the x-robots-tag. In PHP, for example, you would include the following line at the top of your PHP file:
header("X-Robots-Tag: noindex", true);

The difference between indexing and ranking

It is important to understand that blocking a page from being indexed by a search engine does not mean that it is prevented from appearing in Google and other search engines.  Indexing is the process a search engine uses to crawl and understand a web page’s content – this is what robots.txt prevents.

Google et al will still potentially include URLs in their index that they find via other publicly available data, such as links from third-party sites (even links that are “nofollowed”) and social shares on Facebook and Google Plus.

Summary

To round up, I will emphasise that it is important to make sure that your good content is not blocked by robots.txt, but that your poor content perhaps should be, so that the good content benefits (although perhaps you would be better off with a proper content strategy!).

You should block content you don’t want indexed, such as private or secure content, but really that content should also live behind a log-in so that its URLs don’t end up listed in Google anyway. Remember to use the correct response headers for users who are not logged in – a 401 response, not a 200 or a 301/302 redirect.
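As a rough PHP sketch (assuming, purely for illustration, that the site flags logged-in users with a session variable called user_id – substitute your own authentication check), the top of a private page might look like this:
<?php
session_start();
// Assumed convention: a logged-in user has $_SESSION['user_id'] set
if (empty($_SESSION['user_id'])) {
    // 401 tells browsers and robots alike that authentication is required
    header('HTTP/1.1 401 Unauthorized');
    header('WWW-Authenticate: Basic realm="Private area"');
    exit;
}
// ...render the private content for logged-in users...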

Remember, if you have a staging or test server, block it all with robots.txt and hide it behind a log-in.  Make sure, though, that when you push your code to the live server you don’t push the staging robots.txt or the log-in live with it.
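For the staging server’s robots.txt, a blanket block that keeps every compliant robot out of the whole site is simply:
User-agent: *
Disallow: /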

References:

http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449

http://www.robotstxt.org/orig.html and http://www.robotstxt.org/robotstxt.html

http://en.wikipedia.org/wiki/Robots.txt

http://www.w3.org/TR/html4/appendix/notes.html#h-B.4.1.1

http://yoast.com/x-robots-tag-play/

‘Robot Overlord’ photo by Tom Hilton
