What is a robots.txt file?


A robots.txt file is a small text file that tells search engine spiders, robots, and crawlers which pages they should leave out when they crawl (and index) a web site.  A robots.txt file is placed in the root directory of a web site (most often called the public directory, the htdocs directory, or the www directory).

A robots.txt file is used by webmasters and Internet marketers to advise which directories or files to EXCLUDE from being indexed by search engines.  By default, search engines normally index everything they can read, so a conscious effort must be made to tell them what NOT TO INCLUDE!

A robots.txt file is created in a plain text editor and uploaded to the root directory of a site via FTP in ASCII mode.  There are many robots.txt file generators on the Internet.  In this author's view, the robots.txt file is so simple (once understood) that a file generator is simply not needed.

Care must be taken to set up the robots.txt file correctly.  There are 2 common statements in a robots.txt file, namely the "User-agent" and "Disallow" statements.  The "User-agent" statement names the robot that the rules apply to.  The "Disallow" statement is the exclusion statement.

Common illustrations follow:

The "User-agent" statement can name a specific robot, or it can use the "wild card" character " * " (asterisk) to address all robots.

* This statement specifies the User-agent for Google:
User-agent: googlebot

* This statement specifies the User-agent for all robots:
User-agent: *

* This Disallow specifies not to access the page /private-stuff.html in the root directory:
Disallow: /private-stuff.html

* This Disallow specifies not to access the entire directory /images/:
Disallow: /images/

* This Disallow specifies not to access the entire site.  Clearly, you will want to avoid this Disallow, or your site will likely never be indexed.
Disallow: /
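
To see how a robot might read these statements, here is a minimal sketch in Python that uses the standard library module urllib.robotparser.  The file contents, the helper name is_allowed, and the domain www.yourdomain.com are assumptions made up for illustration.  Real search engine robots may interpret the rules somewhat differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt built from the statements illustrated above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private-stuff.html
Disallow: /images/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no download is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/index.html", "/private-stuff.html", "/images/logo.gif"):
        url = "http://www.yourdomain.com" + path
        print(path, is_allowed(ROBOTS_TXT, "googlebot", url))
```

With these rules, only /index.html is reported as allowed.  Because the rules are addressed to " * ", they apply to googlebot as well as to any other robot name.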

There is a lot of good information on the treatment of robots.txt files on the Internet.

Hint #1:  Requesting exclusion does not guarantee exclusion.  The robots.txt file is advisory: well-behaved robots honor it, but nothing forces a robot to comply.  A web site is a public document, and files excluded by the robots.txt instructions can still be viewed by typing the complete address in the browser.

Hint #2:  To see if a site has a robots.txt file in place, type the domain name followed by /robots.txt in your browser.  For example, http://www.yourdomain.com/robots.txt.
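
The same address can be worked out in a script.  The sketch below is only an illustration: the function name robots_url is made up, and it assumes the robots.txt file sits in the root directory as described earlier.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host name; replace the path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.yourdomain.com/some/page.html"))
# prints http://www.yourdomain.com/robots.txt
```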

Hint #3:  To validate a robots.txt file, go to http://www.searchengineworld.com/cgi-bin/robotcheck.cgi
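
A few basic mistakes can also be caught locally before uploading.  The sketch below is not how the validator above works.  It is a hypothetical checker (the name check_robots_txt is made up) that only knows the two statements covered in this article, so any other field is reported as unknown even if some robots support it.

```python
# Only the two statements discussed in this article are recognized here.
KNOWN_FIELDS = {"user-agent", "disallow"}


def check_robots_txt(text):
    """Return a list of problem descriptions; an empty list means none found."""
    problems = []
    seen_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        # Drop anything after "#" (a comment) and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        field, sep, _value = line.partition(":")
        field = field.strip().lower()
        if not sep:
            problems.append(f"line {number}: missing ':' separator")
        elif field not in KNOWN_FIELDS:
            problems.append(f"line {number}: unknown field '{field}'")
        elif field == "user-agent":
            seen_user_agent = True
        elif not seen_user_agent:
            problems.append(f"line {number}: Disallow before any User-agent")
    return problems


if __name__ == "__main__":
    print(check_robots_txt("User-agent: *\nDisallow: /images/\n"))
```

An empty list means that none of these simple checks failed.  It does not prove that every robot will read the file the way you intend.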


