Expression® Web Design

Robots Text

Friday, January 09, 2009
If you have done any reading on search engine optimization, or on getting your website listed in a search engine, you have probably heard of this file before.  On message boards and forums it usually goes by its file name, robots.txt.

This is a simple text file (not HTML, PHP, ASP, etc.) that well-behaved search engines rely on when they spider your website.  You can tell some search engines (robots / crawlers) not to spider your website at all, and you can tell them which folders not to crawl.  You might have come across this file when you were trying to add your website manually to a search engine.  The robots.txt file is placed in your root folder, so it is served from the top level of your site; for example, you should be able to see this blog's robots.txt file.  Most websites will have a robots.txt file, even the White House.
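To make the idea of blocking a robot or a folder a little more concrete, here is a minimal sketch using Python's standard urllib.robotparser module, which reads the rules the way a well-behaved crawler would.  The robot name BadBot, the folder /private/ and the example.com addresses are made up for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: "BadBot" and "/private/" are invented for this example.
RULES = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""

# Parse the rules from a string instead of fetching them from a live site.
parser = RobotFileParser()
parser.parse(RULES.splitlines())

# BadBot is shut out of the whole site.
print(parser.can_fetch("BadBot", "http://www.example.com/index.html"))            # False
# Every other robot is only kept out of the /private/ folder.
print(parser.can_fetch("Googlebot", "http://www.example.com/private/notes.html"))  # False
print(parser.can_fetch("Googlebot", "http://www.example.com/index.html"))          # True
```

Keep in mind that this only shows what a polite robot would decide; nothing in the file forces a robot to obey it.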

While you do want a robots.txt file, keep in mind that it can be used against you.  For example, spammers might search these files for e-mail addresses.  Robots that do not follow the robots.txt rules can still read the file, so they might be able to locate folders that you wish to remain hidden.  Hackers might use the robots.txt file the same way to help locate these "hidden" folders.

The Basic Robots.txt File

You can use a wildcard (*) to match any and all robots:

User-agent: *
Disallow:


This tells all robots that it is OK to crawl (spider) your website and that every folder can be crawled.  Having the file in place helps ensure that robots will spider your website (some might not otherwise).  You should also not see any 404 errors in your log files from robots requesting your robots.txt file.  If you do see these errors, make sure that the file is named robots.txt: if you are on a *NIX server and you named your file Robots.txt, the robots might not be able to retrieve the file, since *NIX servers are case sensitive.  Windows servers are not case sensitive.
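If you want to double-check that an empty Disallow line really does allow everything, a quick sketch with Python's standard urllib.robotparser module can confirm it.  The robot names and paths below are arbitrary samples chosen for the test, not anything special.

```python
from urllib.robotparser import RobotFileParser

# The basic "allow everything" file from above.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ALLOW_ALL.splitlines())

# Arbitrary sample robots and paths, used only for this check.
agents = ("Googlebot", "Slurp", "SomeOtherBot")
paths = ("/", "/images/logo.png", "/blog/archive/2009/")

# Collect the answer for every robot/path combination.
results = [parser.can_fetch(agent, path) for agent in agents for path in paths]
print(all(results))  # True: every robot may fetch every path
```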

XML Sitemaps and the Robots.txt File

Most of the larger search engines support XML Sitemaps.  Even live.com got on board with this protocol.  And it is very easy to add this information to your robots.txt file:

Sitemap: http://www.loudexpressions.com/sitemap.xml
 
User-agent: *
Disallow:


Now the search engines will know how to locate the XML sitemap.  Keep in mind that the XML sitemap is different from a sitemap for users.
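As a rough illustration of how a crawler might pick up that Sitemap line, Python's standard urllib.robotparser module exposes a site_maps() method (this assumes Python 3.8 or newer).  The rules are parsed from a string here, so nothing is fetched from the live site.

```python
from urllib.robotparser import RobotFileParser

# The same rules as above, held in a string for the sake of the example.
RULES = """\
Sitemap: http://www.loudexpressions.com/sitemap.xml

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# site_maps() returns a list of sitemap URLs, or None if the file lists none.
sitemaps = parser.site_maps()
print(sitemaps)  # ['http://www.loudexpressions.com/sitemap.xml']
```

The Sitemap line is independent of the User-agent lines, so it is found no matter where it sits in the file.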

Robots.txt and Search Engines

Some search engines might request too much information too fast for your web server.  Remember earlier we spoke a little bit about "hits".  You can tell some search engines, like Yahoo!®, to slow down a bit, so your robots.txt file would look something like:

Sitemap: http://www.loudexpression.com/sitemap.xml
 
User-agent: *
Disallow:
 
User-agent: Slurp
Crawl-delay: 2


This would tell the Slurp robot to wait two seconds between requests.
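To see how a crawler might read that delay, here is a small sketch using the crawl_delay() method of Python's standard urllib.robotparser module.  The delay_for helper and its default value are hypothetical, added only to show how a polite robot could choose its pause between requests.

```python
from urllib.robotparser import RobotFileParser

# Rules like the ones above, held in a string for the sake of the example.
RULES = """\
Sitemap: http://www.loudexpression.com/sitemap.xml

User-agent: *
Disallow:

User-agent: Slurp
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def delay_for(agent, default=0):
    """Hypothetical helper: seconds a robot should pause between requests."""
    delay = parser.crawl_delay(agent)  # None when no Crawl-delay applies
    return default if delay is None else delay


print(delay_for("Slurp"))      # 2
print(delay_for("Googlebot"))  # 0, since only Slurp was given a delay
```

Not every search engine honors Crawl-delay, so treat it as a request rather than a guarantee.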
