Bing blogs

This is a place devoted to giving you deeper insight
into the news, trends, people and technology behind Bing.

Search Blog

November 29

Search robots in disguise

There are plenty of bots out there and, as a result, some conventions have arisen.  Well-behaved bots identify themselves with a unique user-agent.  They also follow the robots.txt conventions, which allow webmasters to control how their sites are crawled.
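To make the robots.txt convention concrete, here is a minimal Python sketch of how a well-behaved bot might check its permissions before fetching a page. The bot name 'ExampleBot', the rules, and the example.com URLs are made up for illustration; the parsing is done by Python's standard urllib.robotparser module.

```python
from urllib import robotparser

# Hypothetical robots.txt: keep 'ExampleBot' out of /private/,
# and let every other crawler fetch anything.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow:
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A polite crawler would call something like allowed("ExampleBot", url) before every request and skip the page when the answer is False.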

 

Here at Live Search, our crawlers are identified by the user-agent ‘MSNBot’.  This may seem a little unintuitive, but many webmasters depend on it, so we have chosen not to change it.  To make things a little more transparent, we also identify our different types of crawlers.  The complete list is as follows:

 

    MSNBot              Main web crawler (www.live.com)
    MSNBot-Media        Images & all other media (images.live.com)
    MSNBot-NewsBlogs    News and blogs (search.live.com/news)
    MSNBot-Products     Products & shopping (products.live.com)
    MSNBot-Academic     Academic search (academic.live.com)
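
If you want to tell these crawlers apart in your logs, the list above can be turned into a small lookup.  This is only a sketch: the function name is made up, and matching case-insensitively on the start of the user-agent string is an assumption about how the names appear in real requests.

```python
# Crawler names from the list above, lower-cased for matching.
CRAWLERS = {
    "msnbot": "Main web crawler",
    "msnbot-media": "Images & all other media",
    "msnbot-newsblogs": "News and blogs",
    "msnbot-products": "Products & shopping",
    "msnbot-academic": "Academic search",
}


def classify_crawler(user_agent):
    """Return the crawler type claimed by user_agent, or None.

    Assumption: the user-agent starts with the crawler name and the
    comparison can ignore case. This only reads the claim; it does
    not prove the request really came from Live Search.
    """
    ua = user_agent.strip().lower()
    # Try longer names first so 'msnbot-media' is not read as plain 'msnbot'.
    for name in sorted(CRAWLERS, key=len, reverse=True):
        if ua.startswith(name):
            return CRAWLERS[name]
    return None
```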

 

But what about crawlers that aren’t so well-behaved?  After all, anyone could call themselves ‘MSNBot’, and proceed to be as rude and aggressive as they like.  Fortunately, there is a way you can catch these impersonators. Here is how it works:

 

  1. When you get a page view request, it specifies a user-agent and comes from an IP address.  As I described above, all requests from Live Search use a user-agent starting with the word ‘MSNBot’.
  2. If you see the MSNBot user-agent, it’s time to check the identity of the bot.  Starting with the IP address (e.g. 207.46.98.149), you can use a reverse DNS lookup to find the registered host name of the machine.
  3. Once you have the host name (in this case, msnbot-207-46-98-149.search.msn.com), you can check that it really belongs to Live Search.  The names of all Live Search crawlers end with ‘search.msn.com’.  If the name doesn’t end with ‘search.msn.com’, you know it’s not really our crawler.
  4. Finally, you need to verify that the name is accurate, because whoever controls an IP address also controls what its reverse lookup returns.  To do this, use a forward DNS lookup to find the IP address associated with the host name.  This should match the IP address you started with in Step 2; if it doesn’t, the name was fake.
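
The steps above can be sketched in Python using the standard socket module.  Treat this as an illustration rather than official code: the function and parameter names are made up, the lookups can be swapped out for testing, and requiring a dot before ‘search.msn.com’ is a slightly stricter check than the plain "ends with" rule described in Step 3.

```python
import socket

CRAWLER_SUFFIX = ".search.msn.com"  # suffix from Step 3, with a leading dot (assumption)


def _reverse(ip):
    """Reverse DNS lookup: IP address -> primary host name."""
    return socket.gethostbyaddr(ip)[0]


def _forward(host):
    """Forward DNS lookup: host name -> list of IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]


def is_live_search_crawler(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if ip passes the reverse-then-forward DNS check."""
    reverse_lookup = reverse_lookup or _reverse
    forward_lookup = forward_lookup or _forward

    # Step 2: find the host name registered for the IP address.
    try:
        host = reverse_lookup(ip)
    except OSError:  # no reverse record, or the lookup failed
        return False

    # Step 3: the host name must end with the Live Search suffix.
    host = host.rstrip(".").lower()
    if not host.endswith(CRAWLER_SUFFIX):
        return False

    # Step 4: the host name must resolve back to the same IP address.
    try:
        addresses = forward_lookup(host)
    except OSError:
        return False
    return ip in addresses
```

DNS lookups are slow compared with serving a page, so in practice you would probably cache the result for each IP address rather than repeat the check on every request.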

 

By verifying the crawler’s identity, you can catch masquerading crawlers.  When you do catch one, you can simply return an HTTP error, blocking it from seeing your content.
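
As a rough sketch of what returning an error could look like, here is a small WSGI wrapper in Python.  The names block_fake_msnbot and verify_ip are hypothetical, the choice of 403 Forbidden is an assumption (any HTTP error would do), and verify_ip stands for whatever DNS check you use to confirm an IP address.

```python
def block_fake_msnbot(app, verify_ip):
    """Wrap a WSGI app; answer 403 to unverified 'MSNBot' requests.

    verify_ip is a callable taking an IP address string and returning
    True only if the address passed your DNS verification.
    """
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        ip = environ.get("REMOTE_ADDR", "")
        claims_msnbot = user_agent.lower().startswith("msnbot")
        if claims_msnbot and not verify_ip(ip):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Everyone else, including verified crawlers, gets the real site.
        return app(environ, start_response)
    return middleware
```

If your site sits behind a proxy or load balancer, REMOTE_ADDR may hold the proxy's address rather than the crawler's, so the IP address would need to come from wherever your setup records it.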

 

We are constantly looking for your feedback to help improve our engine – please send it our way using this link.

 

Brent Hands, Program Manager, Live Search

Comments

  • Wow. Thanks MS! Google makes a big deal about their google bot being so big and great that they cannot release the IP address and not to worry about fake bots using the googlebot name. You guys on the other hand have given out the IP address and encourage people to do a reverse DNS on it. Looks like google is dropping the ball again!

  • Could you give a more in-depth explanation of identifying the bot with Reverse and Forward DNS Lookup?

    Also, I don't want my site images to show up in the listings, can I DisAllow MSNBot-Media in robots.txt

    Will it ensure that the rest of my site i.e. content will still be crawled?

  • @ Explorer5 Actually MS is following Google's lead on this one.  See http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html

  • Nice work Brent, this is useful stuff, only a couple more engines to follow.

  • this sounds great, yahoo should do the same... hehe

  • What about other Google or MSN bot products including newly bought sites ?