Bing blogs

This is a place devoted to giving you deeper insight
into the news, trends, people and technology behind Bing.

Webmaster Blog

August 11

Find out how Live Search is crawling your site

My favorite feature of our recent launch is the Crawl Issues tool, which gives you details about issues Live Search may encounter while crawling and indexing your website. This information can help you better understand what Live Search sees when crawling your site and should ultimately help you improve your results from Live Search.

[Screenshot: the Crawl Issues report]

We report four types of issues:

  • File not found (404) errors – reported when Live Search encountered a "404 File Not Found" HTTP status code when last attempting to load a URL.
  • Pages Blocked by Robots Exclusion Protocol (REP) – reported when Live Search has been prevented from indexing a page, or from displaying a cached copy of it, by a REP rule such as a robots.txt directive or a robots meta tag.
  • Long Dynamic URLs – reported when Live Search encounters a URL with an exceptionally long query string. These URLs have the potential to create an infinite loop for search engines due to the number of combinations of their parameters, and are often not crawled.
  • Unsupported Content-Types – reported when a page either specifies a content-type that is not supported by Live Search, or simply doesn’t specify any content type. Examples of supported content-types are: text/html, text/xml, and application/PowerPoint.
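Taken together, the four checks above are easy to approximate offline. Here's a minimal Python sketch, not the actual crawler logic: the supported-types list is just the partial one from this post, and the 200-character query cutoff is an invented stand-in for whatever threshold Live Search actually uses.

```python
from urllib.parse import urlparse

# Partial list of supported content-types, taken from the examples in this post.
SUPPORTED_TYPES = {"text/html", "text/xml", "application/PowerPoint"}

# Hypothetical cutoff for an "exceptionally long" query string; the real
# threshold used by the crawler is not published.
MAX_QUERY_LENGTH = 200

def classify(url, status, content_type, blocked_by_rep=False):
    """Return the crawl-issue labels that would apply to one fetched URL."""
    issues = []
    if status == 404:
        issues.append("File not found (404)")
    if blocked_by_rep:
        issues.append("Blocked by REP")
    if len(urlparse(url).query) > MAX_QUERY_LENGTH:
        issues.append("Long dynamic URL")
    # A missing content-type counts as unsupported, just like an unknown one;
    # strip any "; charset=..." suffix before comparing.
    if not content_type or content_type.split(";")[0].strip() not in SUPPORTED_TYPES:
        issues.append("Unsupported content-type")
    return issues
```

A URL can trigger more than one label at once, which mirrors how the tool reports each issue type independently.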

For each of these types of issues, we show you the first 20 results on the Crawl Issues page, and allow you to download up to the first 1,000 results in a CSV file that opens easily in Excel. For large websites with potentially thousands of issues, we’ve supplied a filter option that allows you to scope the results by subdomain or by subfolder. For example, if you were the webmaster for microsoft.com and there were 250,000 file-not-found results, you could filter them by “support.microsoft.com” or “support.microsoft.com/kb” to see just the issues from a particular section of your website. Generally, we support up to 2 levels of subdomains and 2 levels of subfolders per URL, but a website may have fewer available.
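If you'd rather scope the exported CSV yourself, the same subdomain/subfolder filtering can be sketched in a few lines of Python. This assumes the export has a "URL" column; the actual column names in the real CSV may differ.

```python
import csv
from urllib.parse import urlparse

def filter_issues(csv_path, host=None, path_prefix=None):
    """Yield rows whose URL matches the given subdomain and/or subfolder,
    mimicking the tool's filter (e.g. host='support.microsoft.com',
    path_prefix='/kb')."""
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            parsed = urlparse(row["URL"])
            if host and parsed.netloc != host:
                continue
            if path_prefix and not parsed.path.startswith(path_prefix):
                continue
            yield row
```

Running this with `host="support.microsoft.com"` and `path_prefix="/kb"` would narrow a 250,000-row export down to just the KB section, the same scoping the filter box performs in the tool.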

[Screenshot: the Crawl Issues filter box]

Once you’ve created the filter that gives you just the URLs you need, you can download the results in CSV format and email them to the webmaster that owns that part of the website. This gives them a clear idea of the issues that need to be fixed.

Let’s take an example site and see how you might use this tool; fortunately, microsoft.com is always willing to help us out here. Microsoft.com is a gigantic website, with more than 300 full-time people working on it across developers, IT personnel, marketers, and content authors. And they have almost every type of legacy system you can think of, so it is no wonder that they experience almost every type of issue there is. For example, the File Not Found report shows about 218,000 errors. That alone is way too many to deal with, so I usually scan through the first couple hundred results to see which parts of the website are affected, or I start with the subdomains that are most important.

One of the most highly trafficked portions of the site is the popular support knowledge base (KB articles). These articles store information about security fixes and all sorts of other issues, so let’s drill in there. Looking through the 404 pages from that section, one of the first issues that I notice is a series of URLs that look like this: https://mvp.support.microsoft.com/default.aspx/profile/hongfeng.liu. Adding to the mystery, when I pull the page up in a browser, it loads perfectly. Hmmm, is this the first bug in our tools? With a little more research using Live HTTP Headers, I discover this page is the result of some funky redirecting and status codes. Here’s what’s going on:

[Diagram: the 302-to-404 redirect chain]

The “http:” version of the page is 302 redirecting to the “https:” version of the page, which is returning a 404 File Not Found error code while still displaying a valid page. Because the page renders correctly in a browser, it can be difficult to manually detect this type of issue. But now that I’ve figured out the problem, I can use the filter functionality to generate a list of all 160 URLs that appear to have this problem, download them as a CSV file, and email them to the site manager who owns that part of microsoft.com.
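To catch this pattern yourself, you have to follow each redirect hop by hand and inspect the status code of every response, rather than trusting what the browser renders. A minimal sketch; the `fetch` callable is a stand-in for whatever HTTP client you use (Live HTTP Headers shows the same information interactively):

```python
def diagnose_chain(fetch, url, max_hops=5):
    """Follow redirects manually and record each hop, so a 302 that lands on
    a 404 page is visible even though a browser would render it normally.
    `fetch(url)` must return (status_code, location_header_or_None)."""
    hops = []
    for _ in range(max_hops):
        status, location = fetch(url)
        hops.append((url, status))
        if status in (301, 302) and location:
            url = location
            continue
        break  # not a redirect: this is the final response
    return hops

def is_redirect_to_404(hops):
    """True when the chain ends at a 404 after at least one redirect."""
    return len(hops) > 1 and hops[-1][1] == 404
```

Against the mvp.support.microsoft.com URL above, the chain would come back as a 302 hop followed by a 404, which is exactly the signature the Crawl Issues report surfaced.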

Hopefully folks will find this tool useful in diagnosing issues within their own sites as well. Please let us know if you have any questions or comments.

--Nathan Buggia, Lead Program Manager, Webmaster Center

Comments

  • Thanks for providing us with more information about how you crawl and index our sites. Thanks and greetings!

  • That's a nice list of suggestions, useful for optimizing a web page! Thanks a lot

  • Our data isn't better or worse than googlebot, however there are some differences. We provide two reports that Google doesn't:

    - Long Dynamic URLs

    - Unsupported Content-Types

    Google provides similar data for the other reports; however, they may have different coverage of your website and may have implemented their crawler differently, which could give you different results.

    Look for us to continue to expand the reports we provide here in the future.

    -- Nathan Buggia, Webmaster Team

  • Hi Nathan Buggia,

    Great job! I expect more from MSN too; Flash indexing would be great if it could be implemented.

    I love the HTTP Compression and HTTP Conditional Get results facility, which is cool.

    I have just added a post to my blog about the adLabs Research Center tools, which give me useful stats like age, gender, etc.

    Keyword Forecast Tool:

    http://adlab.msn.com/Keyword-Forecast/

    Thanx

  • For some reason I cannot figure out, with one of my registered sites (my main one!):

    1 - The site seems to be registered OK

    2 - The backlinks tool shows 600+ backlinks

    3 - The Crawl Issues shows no issue

    ... but none of the pages has been referenced for several months.

    I would hope that Crawl Issues would hint at the source of the problem, but it does not.

    Is there some option of the tools that could help me to find the source of the problem?

  • The Crawl Issues tool is helpful to attentive webmasters. And your reports about Long Dynamic URLs and Unsupported Content-Types make your search engine unique.

  • I like the bing community here and just learned quite a bit by visiting here. I think it is cool that we get to know if our sites got any crawl issues by checking in with the tools here. Thank you guys.

  • Hi, I have been waiting for Google and Bing to index my blog www.diyanswerdirect.com/blog for two weeks now and am getting very frustrated with them, and I am also still waiting for them to update my site www.diyanswerdirect.com. My question is: when do they update, and how can I get them to crawl or spider my site?

  • Thanks for sharing useful info with us.

  • thanks, will follow for http://www.singhsuperstore.com & http://www.coltstelecom.com