Bing blogs

This is a place devoted to giving you deeper insight
into the news, trends, people and technology behind Bing.

Search Blog

November 18

Crawling the Internet...

Now that we have a beta, people are starting to pay attention to whether their sites are in our index. The two most common questions we get are (1) why did you not crawl my site, and (2) you crawled page X, but it’s not in your index; why? Let’s take these one at a time.


Why did MSNBot not crawl my site? The answer to this is not straightforward, so I will mention a couple of things that are worth considering. The first is to determine whether your pages are crawler friendly. An example of a page that might look "unfriendly" to a crawler is one that looks like this: http://www.somesite.com/info/default.aspx?view=22&tab=9&pcid=81-A4-76&section=848&origin=msnsearch&cookie=false. When MSNBot looks at this URL it gets scared (well, not really; it’s a machine, not a human, so it doesn’t have feelings). The algorithm starts to wonder whether it is going to get stuck in a loop, endlessly crawling every single permutation of the query parameters. Thus, URLs with many (definitely more than 5) query parameters have a very low chance of ever being crawled.

Another thing to consider is whether we can find your page. If we need to traverse through eight pages on your site before finding leaf pages that nobody but yourself points to, MSNBot might choose not to go that far. This is why many people recommend creating a site map, and we would as well.

Lastly, you can also use the "Submit your URL" tool to submit your site to MSN Search.
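Both rules of thumb above (few query parameters, shallow link depth) can be checked on your own site with a short script. What follows is only a rough sketch, not a description of how MSNBot works: the limit of 5 is borrowed from the "definitely more than 5" remark, the function names are made up for illustration, and the link graph is a hypothetical dictionary you would build from your own pages.

```python
from collections import deque
from urllib.parse import parse_qsl, urlsplit

# Assumption: 5 comes from the post's "definitely more than 5" remark;
# the crawler's real cutoff is not published.
MAX_QUERY_PARAMS = 5


def count_query_params(url: str) -> int:
    """Return the number of key=value pairs in the URL's query string."""
    query = urlsplit(url).query
    return len(parse_qsl(query, keep_blank_values=True))


def has_too_many_params(url: str, limit: int = MAX_QUERY_PARAMS) -> bool:
    """True if the URL carries more query parameters than the assumed limit."""
    return count_query_params(url) > limit


def click_depths(links: dict[str, list[str]], home: str) -> dict[str, int]:
    """Breadth-first search: fewest link hops from the home page to each page.

    `links` is a hypothetical map from a page to the pages it links to.
    Pages that cannot be reached from `home` are absent from the result.
    """
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


if __name__ == "__main__":
    url = ("http://www.somesite.com/info/default.aspx?view=22&tab=9"
           "&pcid=81-A4-76&section=848&origin=msnsearch&cookie=false")
    print(count_query_params(url), has_too_many_params(url))

    site = {"/": ["/products"], "/products": ["/products/widgets"]}
    print(click_depths(site, "/"))
```

A site map helps with the second check because it links to leaf pages directly from one page, which cuts the number of hops a crawler needs to reach them.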


You crawled my site, so why can’t I find it in your search index? This one is a little bit easier. The most likely reason is that we are detecting the page as spam when we analyze it to build our index. How can you make sure that this does not happen? The best thing to do is to not spam us. On our site owners help we talk about some of the things that we consider spam. In case you have not read it, here is a quick refresher: dirty JavaScript redirects, stuffing alt text, white-on-white links, off-topic links, etc. We take this stuff very seriously and we are continuously working to improve our spam detection -- we still have room for improvement. The reason that we take this seriously is that web spam threatens our entire industry. To the extent that spam is successful, people will not be able to turn to search engines to find what they are looking for.

Lastly, a brief moment on peanut butter -- why is it that we stop liking peanut butter after like 8th grade? Or is it just me?  I have not had a peanut butter and jelly sandwich for the longest time.  This morning I had one.  Yummy.  Here’s to peanut butter.

Eytan Seidman, Program Manager

Comments

  • Nice :-)
  • I never lost a taste for peanut butter. In fact, I usually buy the big tubs of the stuff to make sure I don't run out. It's great stuff.
  • Ok I get the drift, the Bot gets scared..no biggie. But what about the refresh cycles and the index creation process? Why does it take a long time, sometimes days, before the 'submitted' URL appears as part of the returned results ??

    token URL Submitted : http://peterdawson.typepad.com/blog/2004/11/gmail_security_.html
    Search Query :Gmail security Flaw /RSS ??

    Approx 18 hrs after submission and no relevant results being displayed as of this comment.. I understand this is beta... but then again, I assume that forcing a manual token submit will fire the engine to force indexing, such that quicker and relevant results get displayed.

    Well, ain't that the core service which the submit URL process should be ?? I dunno, just askin' around..
  • Eytan, I have 2 questions and a comment

    Questions:
    1. "An example of a page that might look "unfriendly" to a crawler is one that looks like this.... This is why many people recommend creating a site map." – So if someone includes the scary link in the site map, will it be crawled?

    2. “We take this stuff very seriously and we are continuously working to improve our spam detection -- we still have room for improvement. The reason that we take this seriously is that web spam threatens our entire industry.” – Can you go into a little more detail here or in a future post as to what you consider Spam? Some search engines consider sites with similar content spam, which is not always the case. For example with real estate websites, each agent has different listings, yet the content may generally be the same. What are your thoughts on this?

    Comment:
    Two weeks ago, as I made myself a PB&J sandwich, I also thought, "Why did I ever stop eating this stuff?" It is good for you (my nails have been growing again), and you feel like a kid again. Of course, that last part could have something to do with me making my PB&J sandwiches with Smucker’s Goober Grape - that PB&J swirl mix that was popular when I was a child - it was on sale at the market and I've eaten PB&J like every day since. Mmm.


    Natasha Robinson
    Real Estate Logic
    “…putting Logic in Real Estate.”
    http://www.realestatelogic.net
  • I still eat sandwich with peanut butter and jelly as well but I won't eat this... http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&category=19270&item=5535890757&rd=1
  • ew it's got a bite out of it
  • So if my early-childhood music education site has links to local activities that have nothing directly to do with children, music, or education (but are of interest to the parents of the students, so I want and need to provide the links), plus a link to the St. Chad site because I still haven't gotten over the whole Dade County vote in 2000 thing, plus my incredibly amusing Off-Topic-Link-of-the-Week that I get so much buzz from --- that's bad?

    What are the parameters for determining whether links are off-topic, and does a site get a quota beyond which it can't go? I recognize that you can't give me rules because that gives Web-schmucks what they need to spoof your crawler, but there must be rules of thumb beyond "don't go off-topic."

    I have a peanut butter comment... but... I must resist... must stay on topic... very diffi...aarrgh
  • Firstly let me commend you on having this blog up in the first place.

    1. Assuming a site is ranked with MSN, how often does the bot revisit the site and how often are the rankings reviewed?

    2. Is there any intention to provide a Google-style revenue (read: banner) service on MSN?

    3. Will sites be classed into groups or channels of any kind, e.g. tech, adult, mechanics, political, etc.?
  • You are right about peanut butter jelly sandwich.
    Also, I wanted to congratulate MSNBot for being very regular. It is a wonder to watch the way it comes every day to check my site.
    Please let him know (even if it's a robot) that he is very neat.
  • >Lastly, a brief moment on peanut butter -- why is it that we stop liking peanut butter after like 8th grade? Or is it just me?

    It is just you. "We" never stop liking peanut butter.
  • Is your site unfriendly to search engine spiders like MSNBot? @ Stephan Spencer’s Scatterings
  • I created www.thereareplaces.com (a travel web site) using Microsoft FrontPage. The site looks good and all the pages work well, but I am concerned that the site may not index well since FrontPage uses some non-standard HTML. Fixing this code will cause havoc with FrontPage.

    In addition, I downloaded an HTML checker and was surprised to find how much "bad" code was in the files. Apparently when you delete something using FrontPage it will delete the content but does not always delete the HTML commands, leading to broken code. The bad code is minor, but the site is over 700 pages and making it perfect would take a lot of time.
    The question is - do I need to change this code to get the site indexed and crawled, or can I change it over time when it is more convenient to me? I would appreciate any thoughts.
  • While people are complaining about not being indexed or crawled, I am being killed by the bot. This is not really a complaint since I am very happy with the way the site is indexed in your search. What I would like is for you to enable the bot to use HTTP/1.1. MSNBot alone used more than 5 GB of bandwidth on my site last month. My site does offer gzip compression, if only the bot were able to take advantage of it. Googlebot started to use 1.1 and accept gzip-encoded data a few months ago, and their bot uses 80%-90% less bandwidth.
  • I also would like to finally see some official elaboration, from a search engine's side of the coin, on what is considered "off-topic links", versus hundreds of people's individual interpretations, which may be way off the mark, of what those three words mean.

    There are some reasons to share an off-topic link where search engine considerations such as link exchange or "boost another site's rankings" are not part of the decision at all. For example: copyrights, crediting [or thanking] someone and/or a site, a list of resources used for that page[s]/site, etc.

    So it doesn't have to go into great or extensive detail, but it would be nice to have some of the gray areas no longer being _as_ gray, versus trying to figure out if Joe's interpretation is closer to the mark than Jim's or Susie's.
  • You have talked about query string URLs being indexed, but it’s a shame that it looks to me like they are being heavily penalised in the ranking?