| A robots.txt file is a small text file that tells
search engine spiders, robots, and crawlers which pages they may and may not visit when they
crawl a web site for indexing. A robots.txt file is placed in the root directory of
a web site (most often called the public directory, the htdocs directory, or the www directory).
A robots.txt file is used by webmasters and Internet
marketers to advise which directories or files to EXCLUDE from being indexed by search
engines. By default, everything a search engine can read is normally indexed, so
a conscious effort must be made to tell search engines what NOT TO INCLUDE!
A robots.txt file is created in a plain text editor,
and is uploaded to the root directory of a site via FTP in ASCII mode. There are
many robots.txt file generators on the Internet. In this author's view, the
robots.txt file is so simple (once understood) that a file generator is simply not
needed.
Care must
be taken to set up the robots.txt file correctly. There are 2 common statements in
a robots.txt file, namely the "User-agent" and "Disallow" statements. The
"User-agent" statement names the robot that the rules apply to, and the
"Disallow" is the exclusion statement.
Common illustrations follow below:
The "User-agent"
statement can name a specific robot, or it can address all robots at once
by using the "wild card" character " * " (asterisk).
* This statement specifies the User-agent for Google:
User-agent: googlebot
* This statement specifies the User-agent for all robots:
User-agent: *
* This Disallow specifies not to access the page /private-stuff.html in the root directory:
Disallow: /private-stuff.html
* This Disallow specifies not to access the entire directory /images/:
Disallow: /images/
* This Disallow specifies not to access the entire site. Clearly you will want to make
sure NOT to use this Disallow, or your site will likely never be indexed.
Disallow: /
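To see how a well-behaved robot might read the statements above, here is a minimal sketch using the robots.txt parser in Python's standard library (urllib.robotparser). This is only an illustration and not how any particular search engine works. The RULES text, the build_parser helper, and the www.yourdomain.com address are placeholders chosen for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content built from the statements shown above.
RULES = """\
User-agent: *
Disallow: /private-stuff.html
Disallow: /images/
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(RULES)
site = "http://www.yourdomain.com"  # placeholder domain

# Ask whether a robot calling itself "googlebot" may fetch each path.
for path in ("/index.html", "/private-stuff.html", "/images/logo.png"):
    allowed = parser.can_fetch("googlebot", site + path)
    print(path, "allowed" if allowed else "excluded")

# "Disallow: /" excludes the entire site for the robots it applies to.
block_all = build_parser("User-agent: *\nDisallow: /\n")
print("/index.html", "allowed" if block_all.can_fetch("googlebot", site + "/index.html") else "excluded")
```

With these rules the sketch reports /index.html as allowed and the other two paths as excluded, while the "Disallow: /" variant excludes every path.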
There is a lot of
good information on the treatment of robots.txt files on the Internet. Refer to
these sites:
http://www.webmasterworld.com/forum93/
http://en.wikipedia.org/wiki/Robots.txt
http://www.robotstxt.org/wc/robots.html
http://www.searchengineworld.com/robots/robots_tutorial.htm
Hint #1:
Just because you request exclusion does not mean exclusion is guaranteed.
Robots follow the robots.txt instructions voluntarily, and a web site is a public document.
Files excluded by the robots.txt instructions can
still be viewed by typing the complete address in the browser.
Hint #2: To
see if a site has a robots.txt file in place, type the domain name
followed by /robots.txt in your browser. For example,
http://www.yourdomain.com/robots.txt.
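The same address can be worked out in a script. The sketch below is a small illustration that assumes plain http when no scheme is given; the robots_url helper name is made up for this example, and it only builds the address without fetching anything.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(site: str) -> str:
    """Return the robots.txt address for a site given as a domain or a full URL."""
    # Assumption: a bare domain such as "www.yourdomain.com" is served over http.
    if "://" not in site:
        site = "http://" + site
    parts = urlsplit(site)
    # robots.txt always sits in the root directory, so any path or query is dropped.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("www.yourdomain.com"))
# prints http://www.yourdomain.com/robots.txt
```

Passing a deeper address such as http://www.yourdomain.com/images/logo.png gives the same result, because the file is looked for only at the root of the site.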
Hint #3: To
validate a robots.txt file, go to
http://www.searchengineworld.com/cgi-bin/robotcheck.cgi |