Bing blogs

Webmaster Blog

May 03

To crawl or not to crawl, that is BingBot's question

If you are reading this column, there is a good chance you publish quality content on your web site and would like Bing to index it. Usually, things go smoothly: BingBot visits your web site and indexes your content, which then appears in our search results and generates traffic to your site. You are happy, Bing is happy and the searcher is happy.

However, things do not always go so smoothly. Sometimes BingBot gets really excited about your quality content and ends up crawling your web site beyond all expectations, digging deeper and harder than you otherwise wanted. Sometimes you did everything you could to promote your quality content but BingBot still does not visit your site.

Although robots.txt is the reference tool for controlling BingBot’s behavior, it is also a double-edged sword: it may be interpreted in a way that disallows (or allows) much more than you initially intended. In this column, we will go through the most common robots.txt directives supported by Bing, highlighting a few of their pitfalls, as seen in real-life feedback over the past few months.

Where does BingBot look for my robots.txt file?

For a given page, BingBot looks at the root of the host for your robots.txt file. For example, in order to determine if it is allowed to crawl the following page (and at which rate):

http://us.contoso.com/products.htm

BingBot will fetch and analyze your robots.txt file at:

http://us.contoso.com/robots.txt

Note that the host here is the full subdomain (us.contoso.com), not contoso.com or www.contoso.com. This means that if you have multiple subdomains, BingBot must be able to fetch robots.txt at the root of each one of them, even if all these robots.txt files are the same. In particular, if a robots.txt file is missing from a subdomain, BingBot will not fall back to any other file in your domain, meaning it will consider itself allowed anywhere on that subdomain. BingBot does not “assume” directives from the robots.txt files of other hosts in the same domain.
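
The lookup rule above can be sketched in a few lines of Python. This only illustrates the rule as described here and is not BingBot's actual code; the function name robots_url_for is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url.

    The file is looked up at the root of the exact host (scheme plus
    full subdomain), with no fallback to a parent domain.
    """
    parts = urlsplit(page_url)
    # Keep scheme and host unchanged; replace path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url_for("http://us.contoso.com/products.htm"))
# http://us.contoso.com/robots.txt
```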

When does BingBot look for my robots.txt file?

Because it would cause a lot of unwanted traffic if BingBot tried to fetch your robots.txt file every single time it wanted to crawl a page on your website, it keeps your directives in memory for a few hours. Then, on an ongoing basis, it tries to fetch your robots.txt file again to see if anything changed.

This means that any change you put in your robots.txt file will be honored only after BingBot fetches the new version of the file, which could take a few hours if it was fetched recently.
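
To picture why a change can take a few hours to be noticed, here is a minimal sketch of a time-limited cache. It is an assumption about the general technique, not a description of BingBot's internals: the class name RobotsCache, the fetch callback and the three-hour lifetime are all hypothetical (the only figure given above is "a few hours").

```python
import time
from typing import Callable, Dict, Tuple

# Hypothetical value: the article only says "a few hours".
ASSUMED_TTL_SECONDS = 3 * 60 * 60


class RobotsCache:
    """Keep each host's robots.txt body in memory until it expires."""

    def __init__(self, fetch: Callable[[str], str],
                 ttl: float = ASSUMED_TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch    # downloads robots.txt for a host
        self._ttl = ttl
        self._clock = clock    # injectable so the sketch can be tested
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, host: str) -> str:
        now = self._clock()
        cached = self._entries.get(host)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]   # still fresh: no request is sent
        body = self._fetch(host)   # expired or never seen: fetch again
        self._entries[host] = (now, body)
        return body
```

With a cache like this, an edit published just after a fetch stays invisible until the cached copy expires, which matches the delay described above.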

Which directives does BingBot honor?

If there is no specific set of directives for the bingbot or msnbot user agent, then BingBot will honor the default set of directives, defined with the wildcard user agent. For example:

User-Agent: *
Disallow: /useless_folder

In most cases, you want to tell all search engines which URL paths they may crawl and which they may not. Maintaining only one default set of directives for all search engines is also less error-prone, and is what we recommend.
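
One way to sanity-check a default section like the one above before publishing it is Python's standard urllib.robotparser module. Keep in mind that this is Python's parser, not BingBot's, so it only approximates how a crawler may read the file; the page URLs are illustrative.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-Agent: *
Disallow: /useless_folder
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# There is no bingbot-specific section, so the wildcard section applies.
in_folder = parser.can_fetch(
    "bingbot", "http://us.contoso.com/useless_folder/page.htm")
elsewhere = parser.can_fetch(
    "bingbot", "http://us.contoso.com/products.htm")
print(in_folder, elsewhere)
# False True
```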

What if I want to allow only BingBot?

In your robots.txt file, you can choose to define individual sections based on user agent. For example, if you want to authorize only BingBot while all other crawlers are disallowed, you can do this by including the following directives in your robots.txt file:

User-Agent: *
Disallow: /

User-Agent: bingbot
Allow: /

A key rule to remember is that BingBot honors only one set of directives, in this order of priority:

  • The section for the bingbot user agent, discarding everything else.
  • The section for the msnbot user agent (for backwards compatibility), discarding everything else.
  • The default section (wildcard user agent).

This rule has two main consequences in terms of what BingBot will be allowed to crawl (or not):

  • If you have a specific set of directives for the bingbot user agent, BingBot will ignore all the other directives in the robots.txt file.  Therefore, if there is a default directive that should apply to BingBot as well, you must copy it to the bingbot section in order for BingBot to honor it.
  • If you have a specific set of directives for the msnbot user agent (but not for the bingbot user agent), BingBot will honor these. In particular, if you have old directives blocking MSNBot, you are also blocking BingBot altogether as a side effect. The most common example is:

User-agent: msnbot
Disallow: /
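
The priority rule can be sketched as a small section selector. This is a simplified illustration, not BingBot's parser: it assumes exact, case-insensitive matching of the user agent name, and the helper names parse_sections and section_for_bingbot are hypothetical.

```python
from typing import Dict, List, Tuple

Directive = Tuple[str, str]  # (field, value), e.g. ("disallow", "/")


def parse_sections(text: str) -> Dict[str, List[Directive]]:
    """Group directives by the user agent(s) they are declared for."""
    sections: Dict[str, List[Directive]] = {}
    agents: List[str] = []
    reading_agents = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:  # a new section starts here
                agents = []
            reading_agents = True
            agents.append(value.lower())
            sections.setdefault(value.lower(), [])
        else:
            reading_agents = False
            for agent in agents:
                sections[agent].append((field, value))
    return sections


def section_for_bingbot(text: str) -> Tuple[str, List[Directive]]:
    """Pick the single section to honor: bingbot, then msnbot, then *."""
    sections = parse_sections(text)
    for agent in ("bingbot", "msnbot", "*"):
        if agent in sections:
            return agent, sections[agent]
    return "", []  # no applicable section: nothing is disallowed


legacy = "User-agent: *\nDisallow: /tmp\n\nUser-agent: msnbot\nDisallow: /\n"
print(section_for_bingbot(legacy))
# ('msnbot', [('disallow', '/')])
```

In the usage example, the old msnbot section wins over the default one, so the whole site is blocked for BingBot, as described above.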

Does BingBot honor the Crawl-delay directive?

Yes, BingBot honors the Crawl-delay directive, whether it is defined in the most specific set of directives or in the default one. This is an important exception to the rule defined above. The directive allows you to throttle BingBot and, indirectly, to cap the number of pages it will crawl.
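
A minimal sketch of this exception follows. It takes the Crawl-delay found in each section (or None when a section declares none) and returns the value to honor. One point is an assumption rather than something stated here: when several sections declare a delay, the sketch lets the most specific one win. The function name effective_crawl_delay is hypothetical.

```python
from typing import Dict, Optional

# Agent name -> Crawl-delay declared in that section (None if absent).
DelayBySection = Dict[str, Optional[float]]


def effective_crawl_delay(delays: DelayBySection) -> Optional[float]:
    """Return the Crawl-delay to honor, or None if none is declared.

    Assumption: if several sections declare a delay, the most specific
    one (bingbot, then msnbot, then *) wins.
    """
    for agent in ("bingbot", "msnbot", "*"):
        delay = delays.get(agent)
        if delay is not None:
            return delay
    return None


# A bingbot section without Crawl-delay still picks up the default one.
print(effective_crawl_delay({"bingbot": None, "*": 5}))
# 5
```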

A common mistake is to assume that Crawl-delay represents a crawl rate. Instead, it defines the size of a time window (from 1 to 30 seconds) during which BingBot will crawl your web site only once. For example, if your crawl delay is 5, BingBot will slice the day into five-second windows, crawling only one page (or none) in each of these, for a maximum of 17,280 pages per day (86,400 seconds divided by 5).
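
The arithmetic behind that figure is easy to reproduce for other values. The helper below is only a back-of-the-envelope sketch of the window rule described above; the name max_pages_per_day is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def max_pages_per_day(crawl_delay_seconds: int) -> int:
    """Upper bound on pages crawled per day for one host.

    At most one page is crawled per window of crawl_delay_seconds.
    """
    if not 1 <= crawl_delay_seconds <= 30:
        raise ValueError("the article describes delays from 1 to 30 seconds")
    return SECONDS_PER_DAY // crawl_delay_seconds


for delay in (1, 5, 10, 30):
    print(delay, max_pages_per_day(delay))
# 1 86400
# 5 17280
# 10 8640
# 30 2880
```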

This means the higher your crawl delay is, the fewer pages BingBot will crawl. As crawling fewer pages may result in getting less content indexed, we usually do not recommend a high crawl delay, although we also understand that different web sites may have different bandwidth constraints.

Importantly, if your web site has several subdomains, each having its own robots.txt file defining a Crawl-delay directive, BingBot will manage each crawl delay separately. For example, if you have the following directive in the robots.txt files on both us.contoso.com and www.contoso.com:

User-agent: *
Crawl-delay: 1

Then BingBot will be allowed to crawl one page at us.contoso.com and one page at www.contoso.com during each one-second window. Therefore, this is something you should take into account when setting the crawl delay value if you have several subdomains serving your content.

My robots.txt file looks good… what else should I know?

There are other mechanisms available for you to control BingBot’s behavior. One of them is to define hourly crawl rates through Bing Webmaster Tools (see the Crawl Settings section). This is particularly useful when your traffic is very cyclical during the day and you would like BingBot to visit your web site more outside of peak hours. By adjusting the graph up or down, you can apply a positive or negative factor to the crawl rate automatically determined by BingBot, raising or lowering crawl activity at a given time of day. It is important to note that a crawl delay set in your robots.txt file will override the settings made in Bing Webmaster Tools, so plan carefully to ensure you are not sending BingBot contradictory messages.

Comments

  • When do you think Bing's crawl pace will increase? Google crawls websites very quickly. A new website of mine had over 3,500 web pages. Google indexed each and every one of them in just 3 days.

    Bing has only indexed 4 in the past week and a half.

  • :(  To crawl or not to crawl, that is not BingBot's question,That's us,webmasters sad question. :(

    As they all said,bingbot doesn't crawl in time.Besides,I can submit 40urls via bing webmaster tools,now i can't find it. :(.Totally sad.

  • please give me a tutorial step by step to submit a sitemap to bing webmaster.... What use atom or rss ??

  • @ aziez - this article should help.  It explains how to submit a sitemap to Bing.  We do accept RSS and Atom feeds as well, and you submit them through the same location.

    onlinehelp.microsoft.com/.../hh204487.aspx

  • Its not an issue now a day, one can even leave the blog, and bots will crawl the site

  • always my sitemap rss or atom always error again and error in bing webmaster, why?? i can't delete too

  • atom error again way...

  • could be good only

    User-Agent: *

    allow: /

    ???

  • Give us Bing Site Explorer, it will really enhance all webmasters' interest. You guys do have all the data and all you guys need to do is share it.

  • Very useful article. I never knew it :)

  • I think that has a high ranking site that will be appears in bing search results

    I got an email from microsoft customer support:

    "Please note That the indexed pages reported in the Bing Webmaster Tools Sometimes the which differ of that is shown in Bing SERP. This is Because only high ranking Those pages are actually shown in Bing and not all the indexed pages reported."

    ....???????

  • I submitted my sitemap but the status is "Pending" and it is still waiting for approval. How long does it take for a sitemap to be approved?

  • Does BingBot honor the Crawl-delay directive?

    The answer should be no, they do not, nor do they obey robots.txt. We tried everything to slow Bing bots down on our site, nothing worked and emails to support got the typical MS "canned response" to try the things we had already done, and were indicated in the letter we sent.

    We finally had to resort to blocking all of Bing's IP ranges in .htaccess to keep from getting slammed by Bing with in excess of 30 bots crawling at once. This was causing our site to become unstable and unresponsive. As soon as they were blocked, things returned to normal.

  • Disappointed with this article. All the content is about robots and a little about crawl control in Bing Webmaster Tools. We as webmasters want to know how to measure a website's quality, and in what circumstances Bing is more likely to index content. Regarding crawl control: when I set it up, I can't set the crawl speed for 0am - 8am, which is my website's non-busy time, so crawl control is of no use to me. Maybe Google's automatic crawl speed control is easier and more reasonable for webmasters.

  • Bing search engine crawls a page that has already been removed from our website! The Bing sitemap points to the page that has already been removed. It is very frustrating. I ran diagnostics and it says the page cannot be found and that the <H1> tag is missing.

    I called my web developer and he asked me to give him my webmaster tools password so he can look into it. He doesn't think that there is any encoding error.

    How can I get Bing NOT to crawl the page that has already been removed from my site?

    Please help.