
Webmaster Blog

June 03

Robots Exclusion Protocol: joining together to provide better documentation

As a member of the Live Search Webmaster Team, I'm often asked by web publishers how they can control the way search engines access and display their content. The de facto standard for managing this is the Robots Exclusion Protocol (REP), introduced back in the early 1990s. Over the years, the REP has evolved to support more than "exclusion" directives; it now supports directives controlling what content gets included, how the content is displayed, and how frequently the content is crawled. The REP offers an easy and efficient way to communicate with search engines, and is currently used by millions of publishers worldwide. Its strength lies in its flexibility to evolve in parallel with the web, its universal implementation across major search engines and all major robots, and the way it works for any publisher, no matter how large or small.

In the spirit of making webmasters' lives simpler, Microsoft, Yahoo and Google are coming forward with detailed documentation about how we implement the Robots Exclusion Protocol (REP). This provides a common reference for webmasters and makes it easier for any publisher to know how their REP directives will be handled by the three major search providers, making REP more intuitive and friendly to even more publishers on the web.

Common REP Directives and Use Cases

The following list includes all the major REP features currently implemented by Google, Microsoft, and Yahoo. We are documenting the features and the use cases they enable for site owners. With each feature, you'll see what it does and how you should communicate it.

Each of these directives can be applied to all crawlers or to specific crawlers by targeting specific user-agents, which is how each crawler identifies itself. Beyond identification by user-agent, each of our crawlers also supports reverse DNS-based verification so you can confirm a crawler's identity.
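For illustration, here is a minimal sketch of how a robots.txt file can scope its rules by user-agent (msnbot is the Live Search crawler; the paths shown are hypothetical):

    # Rules that apply only to Live Search's crawler
    User-agent: msnbot
    Disallow: /beta/

    # Rules that apply to every other crawler
    User-agent: *
    Disallow: /private/

A crawler obeys the group of rules that matches its user-agent most specifically, falling back to the '*' group otherwise.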

1. Robots.txt Directives

  • Disallow
    Impact: Tells a crawler not to crawl your site or parts of your site -- your site's robots.txt file still needs to be fetched to find this directive, but the disallowed pages will not be crawled.
    Use case: 'No crawl' pages from a site. In the default syntax, this directive prevents specific path(s) of a site from being crawled.

  • Allow
    Impact: Tells a crawler the specific pages on your site you want indexed, so you can use this in combination with Disallow. If both Disallow and Allow clauses apply to a URL, the most specific rule (the longest rule) applies.
    Use case: Particularly useful in conjunction with Disallow clauses, where a large section of a site is disallowed except for a small section within it.

  • $ Wildcard Support
    Impact: Tells a crawler to match everything from the end of a URL -- a large number of directories without specifying specific pages (available by end of June).
    Use case: 'No crawl' files with specific patterns, e.g., files of a certain file type that always have a certain extension, say '.pdf'.

  • * Wildcard Support
    Impact: Tells a crawler to match a sequence of characters (available by end of June).
    Use case: 'No crawl' URLs with certain patterns, e.g., disallow URLs with session IDs or other extraneous parameters.

  • Sitemaps Location
    Impact: Tells a crawler where it can find your Sitemaps.
    Use case: Point to other locations where feeds exist that point the crawlers to the site's content.
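Putting the directives above together, a sketch of a robots.txt file might look like this (the domain, paths and 'sessionid' parameter are hypothetical examples, not recommendations):

    User-agent: *
    # Block a whole section of the site...
    Disallow: /private/
    # ...but allow one page inside it back in; the longer, more specific
    # Allow rule wins for that URL
    Allow: /private/annual-report.html
    # '*' wildcard: block any URL carrying a session id parameter
    Disallow: /*sessionid=
    # '$' wildcard: block all URLs ending in .pdf
    Disallow: /*.pdf$

    # Sitemaps location; this directive is independent of any user-agent group
    Sitemap: http://www.example.com/sitemap.xml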

2. HTML META Directives

The tags below can appear as META tags in the page HTML or as X-Robots-Tag directives in the HTTP header. The HTTP header form allows non-HTML resources to implement identical functionality. If both forms are present for a page, the most restrictive version applies.

  • NOINDEX META Tag
    Impact: Tells a crawler not to index a given page.
    Use case: Don't index the page. This allows pages that are crawled to be kept out of the index.

  • NOFOLLOW META Tag
    Impact: Tells a crawler not to follow links to other content from a given page.
    Use case: Prevent publicly writable areas from being abused by spammers looking for link credit. With NOFOLLOW, you let the robot know that you are discounting all outgoing links from the page.

  • NOSNIPPET META Tag
    Impact: Tells a crawler not to display snippets in the search results for a given page.
    Use case: Present no abstract for the page in the search results.

  • NOARCHIVE / NOCACHE META Tag
    Impact: Tells a search engine not to show a "cached" link for a given page.
    Use case: Do not make a copy of the page available to users from the search engine cache.

  • NOODP META Tag
    Impact: Tells a crawler not to use the title and snippet from the Open Directory Project (ODP) for a given page.
    Use case: Do not use the ODP title and abstract for this page in the search results.
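As a quick sketch, the page-level form of these directives goes in the HTML head:

    <meta name="robots" content="noindex, nofollow">
    <meta name="robots" content="nosnippet, noarchive, noodp">

and the equivalent directive can be sent as an HTTP response header, which is what makes this work for non-HTML resources such as PDFs:

    X-Robots-Tag: noindex, nofollow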

Other REP Directives

The directives listed above are used by Microsoft, Google and Yahoo, but may not be implemented by all other search engines.  Additionally, Live Search and Yahoo support the Crawl-Delay directive, which is not supported by Google at this time.

  • Crawl-Delay - Allows a site to reduce the frequency with which a crawler checks for new content (supported by Live Search and Yahoo); see the sketch below.
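As an illustrative sketch, Crawl-Delay sits in robots.txt alongside the other directives; the value is commonly read as the number of seconds a crawler should wait between successive requests, though the exact interpretation is up to each engine:

    User-agent: msnbot
    Crawl-delay: 10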

Learn more

Going forward, we plan to continue this coordination and ensure that as new uses of REP arise, we're able to make it as easy as possible for webmasters to use them. Until then, you can find more information about robots.txt at http://www.robotstxt.org and within Live Search's Webmaster Center, which contains lots of helpful information.

There is also a useful list of the bots used by the major search engines here: http://www.robotstxt.org/wc/active/html/index.html

-- Fabrice Canel & Nathan Buggia, Live Search Webmaster Team

Comments

  • It's good work that the three major search engines are teaming up to create standards that ease our work. I hope this is just the first glimpse of further cooperation in the future.

  • Hi Live Search!

    Google Webmaster Tools allows me to set the crawl rate: Slower, Normal (default), or Faster. How long are Slower, Normal, and Faster exactly?

  • Can I see an example robots.txt for msnbot?

  • Are robots directives recommended for FAQ pages?

  • Live Search Team, can you please tell us 1) when your spiders will accept wildcard directives, and 2) whether the Live Search Webmaster "Validate robots.txt" tool will reflect the acceptance of wildcard values at the same time as the spider?

  • It's always easier for people when different search engines work together.

  • About wildcards:

    Does the path '*.doc$' mean the same as '.doc$'?

  • Great blog, but I was wondering: how can I exclude parts of my page from being indexed, yet allow the search engine to follow the links inside the excluded parts?

    Do the Microsoft products Live, MSN, SharePoint, and Search Server follow the same standards for this exclusion?

  • If I wanted to include /videos but disallow /videos/[here], I would include $, so /videos/$, as the $ part is a search query.

  • Yes, REP provides the easiest way to communicate with the major search engines. But some lesser search engines don't seem to respect the rules it sets out.

  • What is the name of the Bing spider? I want to include it in my robots.txt.

  • Thanks a lot for sharing this. It will help webmasters a lot.

  • Thanks, I needed the noindex for my website; clearly I don't want the search engines to index my privacy policy pages.

    Matt

  • Do the Microsoft products Live, MSN, SharePoint, and Search Server follow the same standards for this exclusion? Good question.

  • Does Bing respect nofollow? Really?