As a member of the Live Search Webmaster Team, I'm often asked by web publishers how they can control the way search engines access and display their content. The de facto standard for managing this is the Robots Exclusion Protocol (REP), introduced back in the early 1990s. Over the years, the REP has evolved to support more than "exclusion" directives; it now supports directives controlling what content gets included, how the content is displayed, and how frequently the content is crawled. The REP offers an easy and efficient way to communicate with search engines, and is currently used by millions of publishers worldwide. Its strength lies in its flexibility to evolve in parallel with the web, its universal implementation across major search engines and all major robots, and the way it works for any publisher, no matter how large or small.
In the spirit of making the lives of webmasters simpler, Microsoft, Yahoo and Google are coming forward with detailed documentation about how we implement the Robots Exclusion Protocol (REP). This will provide a common implementation for webmasters and make it easier for any publisher to know how their REP directives will be handled by three major search providers, making REP more intuitive and friendly to even more publishers on the web.
Common REP Directives and Use Cases
The following list includes all the major REP features currently implemented by Google, Microsoft, and Yahoo. We are documenting the features and the use cases they enable for site owners. With each feature, you'll see what it does and how you should communicate it.
Each of these directives can apply to all crawlers or only to specific crawlers by targeting specific user-agents, the string by which a crawler identifies itself. Apart from identification by user-agent, each of our crawlers also supports reverse DNS-based authentication so that you can verify the identity of the crawler.
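The reverse DNS check can be sketched roughly as follows. This is a minimal illustration, not any search engine's official procedure: the function name `is_verified_crawler`, the injectable lookup parameters, and the domain suffix you pass in are all assumptions made for the example. The idea is to resolve the visiting IP to a host name, confirm that the host name belongs to a domain you trust, and then resolve the host name forward again to confirm it maps back to the same IP.

```python
import socket
from typing import Callable, Tuple


def _ptr_name(ip: str) -> str:
    """Default reverse lookup: return the primary host name for an IP."""
    return socket.gethostbyaddr(ip)[0]


def is_verified_crawler(
    ip: str,
    allowed_suffixes: Tuple[str, ...],
    reverse_lookup: Callable[[str], str] = _ptr_name,
    forward_lookup: Callable[[str], str] = socket.gethostbyname,
) -> bool:
    """Forward-confirmed reverse DNS check for a crawler IP (hypothetical helper)."""
    try:
        host = reverse_lookup(ip).rstrip(".").lower()
    except OSError:
        return False  # no reverse record, so treat the visitor as unverified
    # The host name must sit inside one of the domains we trust.
    if not any(host == s or host.endswith("." + s) for s in allowed_suffixes):
        return False
    try:
        # The forward lookup must point back at the IP that contacted us.
        return forward_lookup(host) == ip
    except OSError:
        return False
```

The lookups are passed in as parameters so the check can be exercised without network access; in production you would leave the defaults in place and supply the domain suffixes that each search engine documents for its own crawler.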
Allow
Tells a crawler which specific pages on your site you want indexed; use it in combination with Disallow. If both Disallow and Allow clauses apply to a URL, the most specific rule (the longest one) applies.
Use case: particularly useful alongside Disallow clauses, where a large section of a site is disallowed except for a small section within it.
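To make the "longest rule applies" behavior concrete, here is a small sketch that treats each rule as a plain path prefix. The function name `is_allowed`, the example paths, and the tie-breaking choice (Allow wins when two matching rules have the same length) are assumptions for illustration and are not stated in the documentation above.

```python
from typing import List, Tuple

Rule = Tuple[str, str]  # (directive, path prefix), e.g. ("disallow", "/archive/")


def is_allowed(path: str, rules: List[Rule]) -> bool:
    """Apply the longest matching Allow/Disallow prefix rule to a URL path."""
    matches = [
        (len(prefix), directive)
        for directive, prefix in rules
        if prefix and path.startswith(prefix)
    ]
    if not matches:
        return True  # no rule applies, so crawling is permitted
    # Longest prefix wins; on a tie this sketch prefers Allow (an assumption).
    _, directive = max(matches, key=lambda m: (m[0], m[1] == "allow"))
    return directive == "allow"


# A large section is disallowed, except for a small section within it.
rules = [
    ("disallow", "/archive/"),
    ("allow", "/archive/public/"),
]
```

With these rules, `/archive/2008/report.html` is blocked by the Disallow clause, while `/archive/public/index.html` is permitted because the Allow rule that matches it is longer.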
$ Wildcard Support
Tells a crawler to match everything from the end of a URL, covering a large number of directories without listing specific pages (available by end of June).
Use case: 'no crawl' files that match a specific pattern, for example file types that always have a certain extension, such as '.pdf'.
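One way to picture the `$` anchor is to translate a rule into a regular expression. The sketch below is an approximation under stated assumptions: the helper names `rule_to_regex` and `rule_matches` are hypothetical, and the handling of `*` as "any sequence of characters" is included only because patterns such as `/*.pdf$` need it; it is not described in the row above.

```python
import re


def rule_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex (illustrative)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with "any characters"
    # wherever the original pattern contained '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    # Without a trailing '$' the rule behaves as a simple prefix match.
    return re.compile("^" + body + ("$" if anchored else ""))


def rule_matches(pattern: str, path: str) -> bool:
    """Return True when the URL path is covered by the pattern."""
    return rule_to_regex(pattern).match(path) is not None
```

Under this reading, `rule_matches("/*.pdf$", "/docs/guide.pdf")` is true, while a URL such as `/docs/guide.pdf?download=1` is not matched because it does not end in `.pdf`.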
Sitemaps Location
Tells a crawler where it can find your sitemaps.
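As a rough illustration of how a crawler might pick up sitemap locations, the sketch below scans a robots.txt file for lines of the form `Sitemap: <URL>`, which is the convention described at sitemaps.org. The function name `sitemap_urls` and the example URLs are assumptions made for this example.

```python
from typing import List


def sitemap_urls(robots_txt: str) -> List[str]:
    """Collect the URLs named on Sitemap lines of a robots.txt file."""
    urls = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        field, sep, value = line.partition(":")
        # Field names are matched case-insensitively; empty values are skipped.
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


example = """\
User-agent: *
Disallow: /private/

Sitemap: http://www.example.com/sitemap.xml
sitemap: http://www.example.com/news-sitemap.xml
"""
```

Calling `sitemap_urls(example)` collects both sitemap URLs in the order they appear; the Sitemap line is independent of any user-agent group, so it can sit anywhere in the file.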
The tags below can be present as meta tags in the page HTML or as X-Robots-Tag headers in the HTTP response. This allows non-HTML resources to implement identical functionality. If both forms of tags are present for a page, the most restrictive version applies.
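One possible reading of "the most restrictive version applies" is that a restriction stated in either place is honored. The sketch below takes that interpretation; it is an assumption, not a description of how any particular crawler resolves conflicts. The set of directive names in `RESTRICTIVE` and the helper names are likewise chosen for illustration.

```python
from typing import Set

# Directive names treated as restrictions in this sketch (an assumed list).
RESTRICTIVE = {"noindex", "nofollow", "nosnippet", "noarchive", "noodp"}


def parse_directives(value: str) -> Set[str]:
    """Split a comma-separated robots value into lower-case tokens."""
    return {token.strip().lower() for token in value.split(",") if token.strip()}


def effective_restrictions(meta_content: str, header_value: str) -> Set[str]:
    """Union of the restrictions found in the meta tag and the HTTP header."""
    combined = parse_directives(meta_content) | parse_directives(header_value)
    # Permissive tokens such as "index" or "follow" never lift a restriction.
    return combined & RESTRICTIVE
```

For example, a page whose meta tag says `index, follow` but whose X-Robots-Tag header says `noindex` would be treated as `noindex` under this interpretation.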
The directives listed above are used by Microsoft, Google and Yahoo, but may not be implemented by all other search engines. Additionally, Live Search and Yahoo support the Crawl-Delay directive, which is not supported by Google at this time.
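To see how a Crawl-Delay value might be read per crawler, the sketch below uses Python's standard `urllib.robotparser` module. The robots.txt content, the value `10`, and the paths are made up for the example; the value is commonly interpreted as a number of seconds between requests, though each engine defines its own handling.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: a delay for msnbot only, no delay for other crawlers.
robots_txt = """\
User-agent: msnbot
Crawl-delay: 10
Disallow: /search/

User-agent: *
Disallow: /search/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The msnbot group carries a delay; other agents fall back to the '*' group,
# which sets none, so crawl_delay() returns None for them.
msnbot_delay = parser.crawl_delay("msnbot")
other_delay = parser.crawl_delay("Googlebot")
```

Placing the directive inside a crawler-specific group keeps it from affecting crawlers that do not support it.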
Going forward, we plan to continue this coordination and ensure that as new uses of REP arise, we're able to make it as easy as possible for webmasters to use them. Until then, you can find more information about robots.txt at http://www.robotstxt.org and within Live Search's Webmaster Center, which contains lots of helpful information, including:
There is also a useful list of the bots used by the major search engines here: http://www.robotstxt.org/wc/active/html/index.html
-- Fabrice Canel & Nathan Buggia, Live Search Webmaster Team
Good work that the three major search engines are teaming up to create standards that ease our work. I hope this is just a first glimpse of further cooperation in the future.
Hi Live search!
Google Webmaster Tools lets me set the crawl rate: Slower, Normal (default), or Faster. But how long are Slower, Normal, and Faster?
Can I see an example robots.txt for msnbot?
Are robots rules recommended for FAQ pages?
Live Search Team, can you please tell us 1) when your spiders will accept wildcard directives and 2) will the Live Search Webmaster "Validate robots.txt" tool reflect the acceptance of wildcard values concurrently with the spider?
It's always easier for people when different search engines work together.
About wildcards:
Does the path '*.doc$' mean the same as '.doc$'?
Great blog, but I was wondering how I can exclude parts of my page from being indexed while still allowing the search engine to follow the links inside the excluded parts?
Do the Microsoft products Live, MSN, SharePoint, and Search Server follow the same standards for this exclusion?
If I wanted to include /videos but disallow /videos/[here] I would include $ so /videos/$ as the $ part is a search query.
Yes, REP provides the easiest way to communicate with major search engines. But some lesser search engines don't seem to respect its set out rules.
What is the name of the Bing spider? I want to list it in my robots.txt.
Thanks a lot for sharing this. It will help webmasters a lot.
Thanks, I needed noindex for my website; clearly I don't want the search engines to index my privacy policy pages.
Matt
Does Bing respect nofollow? Really?