how-to block ads
220.127.116.11 - - [14/Nov/2004:04:51:13 -0500] "GET /robots.txt HTTP/1.0" 200 69 "-" "msnbot/0.3 (+http://search.msn.com/msnbot.htm)"
This is an example of a search engine spider (msnbot, in this case) requesting files from your server. But how do you prevent a bot from doing this, or how do you direct a bot to only index certain portions of your site?
To define what a spider can and cannot do on your site, you can use the Robots Exclusion Standard. Notice that the example above is a request for /robots.txt. When a robot visits your site, it checks this file first to find out what it is allowed to look at.
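If you want to pick such requests out of a log automatically, a log line like the one above can be split into fields with a regular expression. The following is only a sketch, assuming your server writes the common "combined" log format; the pattern, the function name parse_log_line, and the field names are illustrative and not part of any standard tool.

```python
import re

# Assumed layout: Apache-style "combined" log format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_log_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.fullmatch(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    return fields


SAMPLE = (
    '220.127.116.11 - - [14/Nov/2004:04:51:13 -0500] '
    '"GET /robots.txt HTTP/1.0" 200 69 "-" '
    '"msnbot/0.3 (+http://search.msn.com/msnbot.htm)"'
)

entry = parse_log_line(SAMPLE)
if entry is not None and entry["path"] == "/robots.txt":
    # A request for /robots.txt is a strong hint that the client is a robot.
    print(entry["user_agent"], "asked for robots.txt, status", entry["status"])
```

Lines that do not fit the assumed format return None, so you can skip them rather than guess at their contents.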
Start by creating a file called robots.txt and placing it in the root directory of your web site. The following example blocks all bots from the entire site:
User-agent: *
Disallow: /
To block access only to a specific directory on your site (in this case, /secret):
User-agent: *
Disallow: /secret/
You can also block only one bot (again, we'll use msnbot):
User-agent: msnbot
Disallow: /
Are you getting 404 (not found) errors when robots try to find robots.txt?
Even if you don't want to block any robots at all, create a robots.txt with the following contents, which allow all robots access to your entire site:
User-agent: *
Disallow:
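Before uploading a robots.txt, it can help to check that the rules do what you expect. The sketch below uses Python's standard urllib.robotparser module to test the four cases described in this entry. The constant names, the helper allowed(), and the sample paths are made up for illustration, and writing the directory rule with a trailing slash (/secret/) is an assumption.

```python
from urllib.robotparser import RobotFileParser

# Rule sets for the four cases; adjust them to match your own file.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
BLOCK_SECRET = "User-agent: *\nDisallow: /secret/\n"
BLOCK_MSNBOT = "User-agent: msnbot\nDisallow: /\n"
ALLOW_ALL = "User-agent: *\nDisallow:\n"


def allowed(rules, user_agent, path):
    """Return True if the given rules let user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for name, rules in [
        ("block all", BLOCK_ALL),
        ("block /secret/", BLOCK_SECRET),
        ("block msnbot", BLOCK_MSNBOT),
        ("allow all", ALLOW_ALL),
    ]:
        for bot in ("msnbot", "googlebot"):
            for path in ("/index.html", "/secret/plans.html"):
                print(name, bot, path, allowed(rules, bot, path))
```

This only shows how a parser that follows the standard reads the rules. It says nothing about whether a particular robot will honor them.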
Most of the popular search engine spiders (msnbot, Yahoo! Slurp, Googlebot) are well behaved and will obey your directives. Keep an eye on your logs to make sure that they do. If you believe a bot is not behaving as it should, report it to the bot's owner. There is usually a URL in the bot's user-agent string that you can visit to find out who is running it, how to contact them, and so forth.
Since misbehaving robots don't pay attention to robots.txt, you may have to block the offending robot's traffic. You can do this by examining your logs to see what sort of signature the robot can be identified by when it comes to your site. You may choose to block the robot's traffic by IP address or "user-agent" (what a robot calls itself).
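The decision itself is simple: compare each request's IP address and user-agent against the signatures you collected from your logs. The sketch below shows only that logic, using Python's standard ipaddress module. The blocked network (a reserved documentation range), the "badbot" name, and the function should_block are hypothetical. In practice you would normally enforce the block in your web server or firewall configuration rather than in a script.

```python
import ipaddress

# Hypothetical signatures; replace them with what you find in your own logs.
BLOCKED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]
BLOCKED_AGENT_SUBSTRINGS = ["badbot"]


def should_block(ip, user_agent):
    """Return True if the request matches a blocked network or user-agent."""
    address = ipaddress.ip_address(ip)
    if any(address in network for network in BLOCKED_NETWORKS):
        return True
    # User-agent strings are matched case-insensitively by substring.
    agent = user_agent.lower()
    return any(fragment in agent for fragment in BLOCKED_AGENT_SUBSTRINGS)


if __name__ == "__main__":
    print(should_block("192.0.2.15", "Mozilla/5.0"))
    print(should_block("203.0.113.7", "BadBot/1.0"))
    print(should_block("203.0.113.7", "msnbot/0.3"))
```

Keep in mind that a user-agent is whatever the robot chooses to send, so a badly behaved robot can change it. An IP-based match is harder to evade but may need updating as the robot moves.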
The robots meta tag
There is another method of controlling access to your content. This one works on a page-by-page basis. Add the following line inside the head section:
<meta name="robots" content="noindex, nofollow" />
This tells any robot that honors the tag not to include the page in its search results and not to follow any links on the page.
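To confirm that a page really carries the tag, you can read the directives back out of its HTML. The following is a rough sketch using Python's standard html.parser module. The class RobotsMetaParser and the function robots_directives are illustrative names, and the sample page is invented.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if (attr_map.get("name") or "").lower() != "robots":
            return
        content = attr_map.get("content") or ""
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html):
    """Return the set of robots meta directives present in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


PAGE = (
    "<html><head><title>Private</title>"
    '<meta name="robots" content="noindex, nofollow" />'
    "</head><body>Hello</body></html>"
)

if __name__ == "__main__":
    print(sorted(robots_directives(PAGE)))
```

A page without the tag yields an empty set, which robots treat as permission to index the page and follow its links.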
For more information on web robots and robots.txt, visit The Web Robots Pages.