Robots Text
You may have come across this term if you have read anything about search
engine optimization or getting your website listed in a search engine. On
message boards and forums it usually appears as robots.txt.
This is a simple text file (not HTML, PHP, ASP, etc.) that well-behaved search
engines consult before they spider your website. You can tell some search engine
robots (crawlers) not to spider your website at all, or tell them which folders
not to crawl. You might have come across this when you were trying to add your
website manually to a search engine. The robots.txt file is placed in your root
folder, so you should be able to see this blog's robots.txt file. Most websites
will have a robots.txt file, even the White House.
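As a rough illustration of the "folders not to search" rule, here is a minimal Python sketch using the standard library's urllib.robotparser module. The /private/ folder, the AnyBot name, and the example.com addresses are made up for the example and do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks every robot to stay out of one folder.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A page outside the folder, then a page inside it.
    print(is_allowed(ROBOTS_TXT, "AnyBot", "http://example.com/index.html"))
    print(is_allowed(ROBOTS_TXT, "AnyBot", "http://example.com/private/notes.html"))
```

This only shows what a well-behaved robot would conclude from the file. Nothing in robots.txt forces a robot to obey it.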
While you do want a robots.txt file, keep in mind that it can be used against you,
because anyone can read it. For example, spammers might search these files for
e-mail addresses. Robots that do not follow robots.txt can still read it and
locate the folders that you wish to remain hidden. Hackers might use the
robots.txt file the same way to locate these "hidden" folders.
The Basic Robots.txt File
You can use a wildcard (*) to address any and all robots at once:
User-agent: *
Disallow:
This tells all robots that it is OK to crawl (spider) your website and that every
folder can be crawled. It helps ensure that the robots will spider your website
(some might not otherwise).
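One detail that is easy to get backwards: an empty Disallow line allows everything, while "Disallow: /" blocks everything. The sketch below, which assumes Python's standard urllib.robotparser module and a made-up example.com address, is one way to check the difference before uploading a file.

```python
from urllib.robotparser import RobotFileParser

# The basic "crawl everything" file from this post.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""

# One extra character turns it into "crawl nothing".
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""


def can_crawl(robots_txt, url, user_agent="*"):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://example.com/any/page.html"  # hypothetical URL
    print("empty Disallow:", can_crawl(ALLOW_ALL, page))
    print("Disallow: /   :", can_crawl(BLOCK_ALL, page))
```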
You should not see any 404 errors in your log files from robots requesting
your robots.txt file. If you see any of these error messages, make sure that
the file is named robots.txt, all lowercase. If you are on a *NIX server and you
named your file Robots.txt, the robots might not be able to retrieve the file,
since file names on *NIX servers are case sensitive. Windows servers are not
case sensitive.
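If you want to check your logs for this, something like the following Python sketch could help. It assumes your server writes the Common Log Format (the default layout for many Apache setups). The sample lines are invented, so adjust the pattern to whatever your own log files look like.

```python
import re

# Assumed log layout (Common Log Format), for example:
# 1.2.3.4 - - [10/Oct/2007:13:55:36 -0700] "GET /robots.txt HTTP/1.1" 404 209
REQUEST_PATTERN = re.compile(r'"(?:GET|HEAD) (\S+) [^"]*" (\d{3})\b')


def robots_txt_misses(log_lines):
    """Return the log lines where a request for /robots.txt got a 404."""
    misses = []
    for line in log_lines:
        match = REQUEST_PATTERN.search(line)
        if match and match.group(1) == "/robots.txt" and match.group(2) == "404":
            misses.append(line)
    return misses


if __name__ == "__main__":
    # Invented sample lines, not taken from a real server.
    sample = [
        '1.2.3.4 - - [10/Oct/2007:13:55:36 -0700] "GET /robots.txt HTTP/1.1" 404 209',
        '1.2.3.4 - - [10/Oct/2007:13:55:40 -0700] "GET /index.html HTTP/1.1" 200 5120',
    ]
    for miss in robots_txt_misses(sample):
        print(miss)
```

Any line this prints means a robot asked for /robots.txt and the server could not find it, which is the sign to check the file name and its location.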
XML Sitemaps and the Robots.txt File
Most of the larger search engines support XML Sitemaps. Even live.com got on
board with this protocol. And it is very easy to add this information to your
robots.txt file:
Sitemap: http://www.loudexpressions.com/sitemap.xml
User-agent: *
Disallow:
Now the search engines will know how to locate the XML sitemap. Keep in mind
that the XML sitemap is different from a sitemap for users.
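To see how a robot might pick up the Sitemap line, here is a small Python sketch using the standard urllib.robotparser module. It assumes Python 3.8 or newer, since that is when the site_maps() method was added, and it reads the file from a string rather than fetching it from the live site.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the example above, held in a string for testing.
ROBOTS_TXT = """\
Sitemap: http://www.loudexpressions.com/sitemap.xml
User-agent: *
Disallow:
"""


def find_sitemaps(robots_txt):
    """Return the list of Sitemap URLs in robots_txt (empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    for sitemap_url in find_sitemaps(ROBOTS_TXT):
        print(sitemap_url)
```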
Robots.txt and Search Engines
Some search engines might request too much information too fast for your web
server. Remember that earlier we spoke a little bit about "hits". You can tell
some search engines, like Yahoo!®, to slow down a bit, so your robots.txt file
would look something like this:
Sitemap: http://www.loudexpression.com/sitemap.xml
User-agent: *
Disallow:
User-agent: Slurp
Crawl-delay: 2
This would tell the Slurp robot (Yahoo!'s crawler) to wait two seconds between requests.
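As a quick check of how that file reads to a robot, the Python sketch below parses it with the standard urllib.robotparser module and asks for the crawl delay. OtherBot is a made-up robot name used only for comparison, and not every search engine honors Crawl-delay.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the example above, held in a string for testing.
ROBOTS_TXT = """\
Sitemap: http://www.loudexpression.com/sitemap.xml
User-agent: *
Disallow:
User-agent: Slurp
Crawl-delay: 2
"""


def delay_for(robots_txt, user_agent):
    """Return the Crawl-delay in seconds for user_agent, or None if unset."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


if __name__ == "__main__":
    print("Slurp:", delay_for(ROBOTS_TXT, "Slurp"))
    # OtherBot is hypothetical; it falls under "User-agent: *", which sets no delay.
    print("OtherBot:", delay_for(ROBOTS_TXT, "OtherBot"))
```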



