Bing

Is your robots.txt file on the clock?

September 12, 2008, 04:12 PM by Webmaster Center team | 16 Comments

Just recently a strange problem came across my desk that I thought was worth sharing with you. A customer notified us that content from a site she was interested in was not showing up in our results. Wanting to understand why we may or may not have indexed the site, I took a look and stumbled upon an interesting but potentially very bad use of the robots.txt file.

The first visit I made to the site had a very standard robots.txt file that read like:

User-agent: *
Disallow: /cgi-bin/
Disallow: /ads/

But within an hour the robots.txt had changed to:

User-agent: *
Disallow: /cgi-bin/
Disallow: /ads/
Disallow: /products/
Disallow: /home/content/archive/
Disallow: /Survey/
Disallow: /info/
Disallow: /staff/

The new robots.txt file was blocking the Live Search crawler and all others from accessing the main content of the site. The webmaster was switching between different robots.txt files several times a day as a way to control the crawl rate, or the impact of search crawlers on the site.

At Live Search, and in fact at most search engines, changing your robots.txt every few hours can cause problems. When you change your robots.txt file, the crawler treats whichever version it last downloaded as definitive until it downloads the file again. For most search engines, robots.txt files may be downloaded only a few times per day, and the downloaded copy is then used to give the crawl servers allow and disallow information for each URL.

When the crawler returns, it does not retry URLs that were disallowed a few hours earlier, and it may fetch content that your new directives disallow, because it may still be using a previously cached robots.txt file. You may perceive a benefit from changing your file throughout the day, since you may see less server load from some search engines, but this behavior also works against you.
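To make the caching effect concrete, here is a minimal Python sketch using the standard library's `urllib.robotparser`. Everything in it is an assumption made for illustration: the `CachingCrawler` class, the `ExampleBot` name, the six-hour refresh interval, and the 09:00 to 15:00 window in which the site serves the restrictive file. It is not a description of how the Live Search crawler is implemented.

```python
from urllib.robotparser import RobotFileParser

PERMISSIVE = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /ads/
"""

# Hypothetical daytime file that also blocks the main content.
RESTRICTIVE = PERMISSIVE + "Disallow: /products/\n"


def load_rules(text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def live_robots_txt(hour):
    """Assumed schedule: the site serves the restrictive file from 09:00 to 15:00."""
    return RESTRICTIVE if 9 <= hour < 15 else PERMISSIVE


class CachingCrawler:
    """Hypothetical crawler that re-downloads robots.txt every `refresh_hours`."""

    def __init__(self, refresh_hours):
        self.refresh_hours = refresh_hours
        self.cached = None
        self.fetched_at = None

    def allowed(self, hour, url):
        # Refresh the cached copy only when it is missing or stale.
        if self.cached is None or hour - self.fetched_at >= self.refresh_hours:
            self.cached = load_rules(live_robots_txt(hour))
            self.fetched_at = hour
        return self.cached.can_fetch("ExampleBot", url)


crawler = CachingCrawler(refresh_hours=6)
url = "/products/widget.html"

# Hours at which the crawler's cached view disagrees with the file being served.
mismatched_hours = []
for hour in range(24):
    live_allowed = load_rules(live_robots_txt(hour)).can_fetch("ExampleBot", url)
    if crawler.allowed(hour, url) != live_allowed:
        mismatched_hours.append(hour)

print(mismatched_hours)
```

Under these assumptions the crawler disagrees with the served file in two kinds of windows: it keeps fetching /products/ after the restrictive file goes live, and it keeps skipping /products/ after the permissive file is restored. Neither matches what the webmaster intended.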

The downside to changing your robots.txt file is that your content will dance in and out of search engines' indexes, depending on whether the crawl servers that fetched it were working from a version of the file that allowed it. When content is disallowed, search engines will assume that you don't want it indexed and will hide the blocked URLs for days to weeks, until they update them.

Instead of changing your robots.txt throughout the day, we recommend keeping a fixed robots.txt with the same rules at all times of the day and, if needed, using the crawl-delay directive to prevent aggressive crawls. In a previous post we discussed how we continue to partner with Google and Yahoo to provide better standards and communication around the Robots Exclusion Protocol. While we don’t have complete agreement on how to delay the crawl, the crawl-delay directive is supported by Live Search and Yahoo today. For Google, you can make crawl speed adjustments within Google’s Webmaster Central.
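As a rough illustration of the fixed-file approach, the sketch below embeds a single robots.txt that keeps the original two Disallow rules and adds a Crawl-delay line, then reads it back with Python's standard `urllib.robotparser` (the `crawl_delay` method needs Python 3.6 or later). The delay value of 10 and the `ExampleBot` name are assumptions chosen for the example. This post does not say how each search engine interprets the number, so check the engine's own documentation before picking a value.

```python
from urllib.robotparser import RobotFileParser

# One robots.txt served at all times of day. The value 10 is an
# arbitrary illustration, not a recommended setting.
FIXED_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /cgi-bin/
Disallow: /ads/
"""

parser = RobotFileParser()
parser.parse(FIXED_ROBOTS_TXT.splitlines())

# "ExampleBot" is a hypothetical crawler name; it falls under the "*" rules.
delay = parser.crawl_delay("ExampleBot")
products_allowed = parser.can_fetch("ExampleBot", "/products/widget.html")
ads_allowed = parser.can_fetch("ExampleBot", "/ads/banner.html")

print(delay, products_allowed, ads_allowed)
```

With this file the main content stays crawlable around the clock, and the only lever for reducing load is the delay value rather than swapping whole files in and out.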

If you have additional questions regarding robots.txt, or feedback on how we handle robots.txt directives, you can discuss them in our forums.

--Jeremiah Andrick, Program Manager, Live Search Webmaster Center


Comments

Kate Morris

Posted On September 12, 2008, 04:26 PM

People do that?? One robots.txt file people, srsly. Thanks, I had never thought of even doing this ... one more thing to know that potential clients *might* be doing.


Jeremiah Andrick

Posted On September 12, 2008, 04:37 PM

@kate

Thanks for the comment. Publishers do a lot of things that may or may not be good for them or the searcher. Hopefully, by educating ourselves, the SEM community, and publishers, we can correct a lot of mistakes. Keep a "heart of a teacher" attitude when working with customers and that should help us all.

Jeremiah


Paul

Posted On September 16, 2008, 12:42 PM

very interesting. one extra thing to look out for when such a problem arises


M.L. Stone

Posted On September 16, 2008, 01:23 PM

Wow.  Thank you for saying what needed to be (apparently) said...that's certainly a first for me, but I think your article will come in handy at some point in the future.  I've been in contact with a few clients who probably think that changing the robots.txt every hour is a good idea.  O_o


Jeremiah Andrick

Posted On September 16, 2008, 02:53 PM

@M.L. Stone

Thanks for taking the time to respond. Please feel free to let us know if you find any issues with clients that need addressing at this level. We will do our best to provide insight and help on the best practice for that area.

Thanks

Jeremiah


Clint Dixon

Posted On September 16, 2008, 03:14 PM

Only problem I see is search spiders crawl servers and happen to follow links on the pages there, so while the robots.txt may be downloaded, more often than not the instructions within are not followed.

Next my server logs seem to indicate a once daily crawl so why would you suggest live search crawls more than that?

Actually Live Search barely crawls any of my clients' sites or any of the 50+ I own, and all have a robots.txt file.

Lastly, the number of webmasters who would change their robots.txt throughout the day is probably less than 1% of 1%.

Have any juicy information most of us can use as opposed to the rare person???


Bullet Boy

Posted On September 17, 2008, 02:49 AM

Some CMS systems and Blog platforms actually generate robots files dynamically.  I have never seen one that changes without the discretion of the admin.

Definitely something to watch out for.

I tested a lesser known open source Content Management System recently that generated it from the DB.


Levert Marketing

Posted On September 17, 2008, 12:25 PM

What? They are restricting robots from crawling their products page??? It would be interesting to hear about their sitemap.xml file! lol


Jeremiah Andrick

Posted On September 17, 2008, 06:46 PM

Clint

Thanks for the feedback.  We are working to provide more content in the blog for both the wide and narrow audiences. I wrote this post as something that caught my attention.  As for your questions, crawl frequency per site is based on a number of factors and you may potentially be crawled more than once a day. I am glad you are using robots.txt but that also doesn't guarantee crawl frequency either.  

You bring up a good point (your first) about links and we are always looking to improve the protocol to provide reasonable control to publishers.  Good to think about.

Thanks for the feedback


Jeremiah Andrick

Posted On September 17, 2008, 06:57 PM

@bullet boy

Thanks for the comment.  I imagine this is more of a custom implementation rather than a normal tool in a CMS.  Auto-generation of the file really isn't the problem.  It is more of how the auto-generation was pushing the file at various times of day and the intent of those pushes.  

Jeremiah


Strail

Posted On October 02, 2008, 08:28 AM

There I was thinking that robots.txt was a simple file. I'll have to look at these other things mentioned, I never knew you could run with different robot.txt files in your web site


Jon

Posted On October 05, 2008, 11:23 PM

Interesting... guess I'll have to check over my robot.txt files.


Col

Posted On October 08, 2008, 10:06 AM

Cool. I didnt know that either.


Anonymous

Posted On October 30, 2008, 06:50 PM

I don't see the benefit of using several robots texts. Why would someone do that?


Quality Directory

Posted On June 09, 2009, 09:49 AM

I don't see why people block bots by practicing things they have no way of knowing if it's working.

