At Live Search, one of the most common questions we receive from our peers at microsoft.com and msn.com is how to optimize their sites for search. But microsoft.com is unlike most other sites on the Internet. It is huge, containing millions of URLs, and is growing all the time. However, large content sites like microsoft.com and msn.com are not the only sites that can have an infinite number of URLs. There are also large ecommerce sites and government agency sites that produce very large numbers of URLs.
As with any site, our original recommendations on how to rank in Live Search are still important. But we’ve given it some thought and wanted to provide some recommendations oriented toward very large sites. Over the next several posts, we will be discussing topics that may help your site, especially if it is very large.
By producing lots of content, a site exposes a huge surface area for the search engines to crawl. This can lead to sub-optimal results or underperforming pages. As a site grows, the number of URLs produced will also grow, requiring the search engine bot to dig deeper and work harder. One way to control the growth of URLs is to ensure that you are only exposing one URL per piece of content. This is the process of canonicalization, although it is sometimes referred to as normalization.
Canonicalization encompasses a number of issues related to the selection of the best URL for a page. For most site owners, the issue of canonicalization is a familiar topic because they have selected (or should have selected) the preferred form of the domain they want to point to. For example, the following are all valid ways to get to microsoft.com:
For the large site that uses complex URLs for tracking, co-branding, or ecommerce activities, managing canonicalization is a much larger and more important issue. To reduce the risk of duplicate URLs and diluting your link equity, the following are some additional conversions that should be made:
http://www.mysite.com/ to http://www.mysite.com
http://www.mysite.com/default.aspx to http://www.mysite.com
http://www.mysite.com/en/us/default.aspx to http://www.mysite.com/en/us
http://www.mysite.com/FooBar/ to http://www.mysite.com/foobar
http://www.mysite.com/downloads/details.aspx?FamilyID=ab99&displaylang=en to http://www.mysite.com/downloads/en/family/ab99
http://www.mysite.com:8080/ to http://www.mysite.com
https://www.mysite.com/en/us/ to http://www.mysite.com/en/us
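To make the conversions listed above concrete, here is a minimal Python sketch of a function that maps the variants to one form. The function name, the set of default-document names, and the choice to always emit http and drop the port are assumptions for illustration, not something Live Search prescribes. Lowercasing the path is only safe on a server that treats paths case-insensitively, and the query-string-to-path rewrite shown in the list is site-specific, so the query is left untouched here.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical default-document names; adjust to match your server.
DEFAULT_DOCUMENTS = {"default.aspx", "default.htm", "index.html"}


def canonicalize(url: str) -> str:
    """Return a single canonical form for the URL variants listed above."""
    parts = urlsplit(url)

    # .hostname is lowercased and excludes the port, so ":8080" is dropped.
    host = (parts.hostname or "").lower()

    # Assumes the server treats paths case-insensitively (as IIS does).
    segments = parts.path.lower().split("/")

    # Drop a trailing default document such as default.aspx.
    if segments[-1] in DEFAULT_DOCUMENTS:
        segments[-1] = ""

    # Remove the trailing slash so "/en/us/" and "/en/us" become one URL.
    path = "/".join(segments).rstrip("/")

    # Assumption: http is the canonical scheme, as in the examples above.
    return urlunsplit(("http", host, path, parts.query, ""))


if __name__ == "__main__":
    for variant in (
        "http://www.mysite.com/en/us/default.aspx",
        "https://www.mysite.com/en/us/",
        "http://www.mysite.com:8080/EN/US",
    ):
        print(variant, "->", canonicalize(variant))
```

Whichever rules you settle on, the point is that every variant of a page resolves to exactly one output.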
The best option for webmasters is to consider canonicalization when they are planning their site. Unfortunately, many large sites were developed before optimization became an issue. Developing an architecture and site plan that considers canonical forms from the start will prevent a number of headaches later. But because many site owners already have challenges with canonicalization, here are a few solutions for correcting the situation with existing sites:
For a small site, this can be done fairly easily. However, the more canonical issues your site has, the more complicated the solution will be. The first step is to determine what canonical issues your site suffers from and then build a solution that takes into account both the need to scale and performance requirements.
For some examples of how to do this, I recommend reading Tony Spencer’s article on .htaccess, 301 Redirects & SEO.
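The linked article covers Apache's .htaccess. For readers on other stacks, the same idea can be sketched as a small piece of Python WSGI middleware that answers any non-canonical request with a 301 (permanent) redirect. This is only an illustration: the host name, the lowercase and no-trailing-slash rules, and the name `canonical_redirect` are assumptions, and a production version would need to account for the scale and performance concerns mentioned above.

```python
# Hypothetical canonical host; replace with your own.
CANONICAL_HOST = "www.mysite.com"


def canonical_redirect(app):
    """Wrap a WSGI app so non-canonical requests get a 301 to the canonical URL."""

    def middleware(environ, start_response):
        # Strip any port from the Host header and compare case-insensitively.
        host = environ.get("HTTP_HOST", "").split(":")[0].lower()
        path = environ.get("PATH_INFO", "/") or "/"

        # Assumed convention: lowercase paths with no trailing slash.
        wanted_path = path.lower().rstrip("/") or "/"

        if host != CANONICAL_HOST or path != wanted_path:
            location = "http://" + CANONICAL_HOST
            if wanted_path != "/":
                location += wanted_path
            query = environ.get("QUERY_STRING", "")
            if query:
                location += "?" + query
            start_response(
                "301 Moved Permanently",
                [("Location", location), ("Content-Length", "0")],
            )
            return [b""]

        # Already canonical: hand the request to the wrapped application.
        return app(environ, start_response)

    return middleware
```

Doing the check in one place, in front of the application, keeps the rule set consistent across every page rather than scattering redirects through individual handlers.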
Whatever you choose as your canonical form, never deviate from that form in your sitemap or your internal link structure. For example, always link to mysite.com, rather than sometimes linking to mysite.com/default.htm. You can’t enforce how people link to your site externally, but you should ensure your convention is enforced internally. This can be a long process to enact on a large site, so it may help to start with your most trafficked pages where there is the most impact and work your way through the site to your least important pages.
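One way to start enforcing the convention internally is to scan a page's HTML for links that deviate from it. The sketch below uses only the Python standard library. The host names, the default-document list, and the function name are hypothetical, and it checks absolute links only (relative links are skipped).

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Hypothetical values for illustration.
SITE_HOSTS = {"mysite.com", "www.mysite.com"}  # hosts that serve this site
CANONICAL_HOST = "www.mysite.com"
DEFAULT_DOCUMENTS = {"default.htm", "default.aspx"}


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def non_canonical_links(html: str) -> list:
    """Return internal absolute links that break the canonical convention."""
    collector = LinkCollector()
    collector.feed(html)

    flagged = []
    for href in collector.links:
        parts = urlsplit(href)
        host = (parts.hostname or "").lower()
        if host not in SITE_HOSTS:
            continue  # external or relative link: not checked here
        last_segment = parts.path.rsplit("/", 1)[-1].lower()
        if (
            host != CANONICAL_HOST                # wrong host form
            or parts.path != parts.path.lower()   # mixed-case path
            or last_segment in DEFAULT_DOCUMENTS  # explicit default document
        ):
            flagged.append(href)
    return flagged
```

Running a check like this against your most trafficked pages first matches the prioritization suggested above.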
I recently received a question from a publisher on the issue of co-branding. This is where a site may have multiple versions of a single page that it serves with different brands applied, each with a different URL. For example, the Microsoft download page can be reached directly or from any number of branded pages. The URL for such a download could be read:
The content is the same across all the pages other than the brand skin on the page. For a large site like microsoft.com, this means that the same page could be generated over and over, creating additional URLs for the search engine to crawl. The ideal situation in co-branded pages is to have a single page that the search engine bot sees and to block access to any other, branded versions. For a good solution to this problem, Nathan Buggia wrote a very detailed post on how to deal with URL referrer tracking at Jane and Robot. In it, he offers several patterns that will help in this situation and others.
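As a rough illustration of collapsing branded variants to a single URL, the following Python sketch strips branding parameters from the query string. It assumes the brand is carried in a query parameter; the parameter names `brand` and `partner` are hypothetical, and this is not necessarily one of the patterns described in the post referenced above. The result could serve as a redirect target or as the one version you leave open to the search engine bot.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical query parameters that only change the brand skin.
BRAND_PARAMETERS = {"brand", "partner"}


def strip_branding(url: str) -> str:
    """Return the URL with any branding parameters removed from its query."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in BRAND_PARAMETERS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )
```

Two differently branded URLs for the same download then map to the same unbranded URL, which is the property that matters for crawling.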
Once you have done the items above, or if you are starting from scratch, you may want to consider moving from relative to absolute links. A relative link specifies its target relative to the current page on your server. While there are performance and architectural reasons that webmasters use relative links, the goal of this exercise is to bring consistency. Absolute links tend to make doing so easier.
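If you want to convert links in bulk, Python's standard `urljoin` resolves a relative link against the URL of the page it appears on. The wrapper name and the example URLs below are made up for illustration.

```python
from urllib.parse import urljoin


def to_absolute(page_url: str, href: str) -> str:
    """Resolve a link found on page_url into its absolute form."""
    return urljoin(page_url, href)


if __name__ == "__main__":
    page = "http://www.mysite.com/en/us/products"
    for href in ("details", "../uk/products", "/downloads"):
        print(href, "->", to_absolute(page, href))
```

A link that is already absolute passes through unchanged, so the function can be applied to every link on a page.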
A relative link will look like:
The absolute form will look like:
While there are other issues that can increase the number of URLs, solving the issue of canonicalization will help both your customers and us at Live Search get more of the URLs that matter.
In our next post on large sites, we will discuss the benefits and implementation of HTTP compression and conditional GET statements. As always, if you have additional questions, feel free to ask in our forums.
Jeremiah Andrick -- Program Manager, Live Search Webmaster
I think that such optimization is useful not only for large sites.
Most of what is described here is common SEO practice for any website.
Thank you for posting this. I can't tell you how many .NET developers use bad URL practices such as CamelCase file names and exposed extensions (.aspx). The extensions were largely due to the lack of easy rewrite support in IIS, although plenty of options were available via the ISAPI route. Still, having this on a Microsoft site to point .NET developers to is a good thing, if only to get rid of CamelCase files and paths in favor of lowercase, '-' separated names.
Excellent advice! But curious indeed from Live Search, which has a notoriously bad time handling 301 redirects.
In fact, after reading this post, I yet again looked at my site in Live's index. Again, I looked at the headers...
-> 301 ->
No problemo. Except that Live stubbornly keeps "mysite.com" as the URL displayed in and linked by its index. And this 301 has been in place like ... forever.
I'll keep monitoring and update this comment in what I'm regarding as the increasingly unlikely event that Live gets this right.
Error in your example for "Add or remove the trailing /"
The URLs http://www.mysite.com/ and http://www.mysite.com are exactly the same. This is because the very first / character is not actually part of the URL path, but more of a separator between the URL hostname and path.
This is only true for the first "/" - all other ones are part of the path. So a better example would be:
http://www.mysite.com/widgets/ and http://www.mysite.com/widgets
It is very refreshing to see that I have been recommending 100% of these suggestions to my clients. I usually have the hardest problems with clients who use Microsoft servers, because the default settings allow upper- and lowercase URLs to resolve to the same content even though the URLs are not the same.
The CamelBack default settings really need to be addressed for IIS. I keep hearing how wonderful MS is lately and how far the dev tools have come with .NET and so forth.
There are also some horrid shopping carts out there for Apache/PHP servers. OS Commerce is the worst! They have a fix now, but it takes some work to get it working.
Also, don't forget to talk about never-ending URLs. These usually happen in shopping carts that use a unique ID in the URL every time you add or remove a product. You could actually have a limitless number of URLs. This happens with calendars too, typically on small hotel and event sites.
I usually just block them from being indexed altogether because the value of those pages for searchers is very small, if anything.
Good to know all the info you guys shared. Thanks a lot!
Thanks for very doable SEO tips we should all be taking note of - not just for larger sites. When you are new, it can be tough to get reliable info. This sounds like common sense stuff worth implementing.
" As with any site, our original recommendations on how to rank in Live Search are still important. But we’ve given it some thought and wanted to provide some recommendations "
Link doesn't work :-(
Thanks for all the feedback. I fixed the broken link in this post as well.
Some good points, maybe talk about site hierarchy issues.
You could also segment your site or break it up into smaller quadrants.
Thank you for this post :-)
I have a big gallery where I didn't fix the URLs to lowercase!
So I know it's very easy to change, but I wonder if the website will get a bad ranking on other search engines.
Is it necessary to lowercase all URLs?
I didn't know anything about the issue of canonicalization until I read Matt Cutts' post on his blog. Your post here has shed more light on ways to solve the issue.
All canonicalization issues have been fixed. And I use mod_rewrite to rewrite the dynamically generated URLs to shorter, search engine-friendly URLs.
Hope to be better. Better means more features.
© 2013 Microsoft