Does anyone have any idea why the MSNBot would be requesting URLs from my site that include a hash mark (the pound symbol, #)? Here are the logs:
65.55.106.243 - - [03/Jun/2009:18:40:47 -0400] "GET /2004/12/26/iphoto-improvement/#respond HTTP/1.1" 404 32129 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.211 - - [03/Jun/2009:18:56:38 -0400] "GET /2005/01/04/spammer-block-listed/#respond HTTP/1.1" 404 32129 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.191 - - [03/Jun/2009:19:29:29 -0400] "GET /2000/03/11/site-changes-march-11-2000/#respond HTTP/1.1" 404 32129 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.193 - - [03/Jun/2009:19:57:51 -0400] "GET /2004/07/15/branding-the-web/#respond HTTP/1.1" 404 32129 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.115 - - [03/Jun/2009:20:20:20 -0400] "GET /2005/08/07/applelotcom-blocked/#respond HTTP/1.1" 404 32129 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.229 - - [03/Jun/2009:20:30:14 -0400] "GET /2006/08/12/re-classic-goes-out-with-nary-a-whisper/#respond HTTP/1.1" 404 32129 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.168 - - [03/Jun/2009:20:34:43 -0400] "GET /2004/05/22/valpak/#respond HTTP/1.1" 404 32129 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.184 - - [03/Jun/2009:20:49:13 -0400] "GET /2006/12/07/chinese-spidersrobots-downloading-mp3-files/#respond HTTP/1.1" 404 32129 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
Thanks, Mike
Michael Clark
http://www.planetmike.com
Hi Mike,
I'm getting our crawler team to look at this. I'll update you as soon as I've heard back.
~B
*I no longer work for Bing.
I'm getting this too. We've spent multiple hours tracking down performance issues on our Windows server, it turned out to be
How do I exclude msnbot from crawling our sites?
You can use robots.txt to control all bots.
Just create a robots.txt file and disallow the pages you don't want bots to crawl. For assistance, visit
www.robotstxt.org
Can't seem to get robots.txt to exclude it. I've tried:
User-agent: msnbot
Disallow: /
and
User-agent: msnbot/2.0b
Disallow: /
and yet it is still crawling my sites. I've read elsewhere, from earlier this year, that 2.0b doesn't follow robots.txt properly, but I don't know if that's still an issue.
Can someone supply a rule that is proven to work?
rbrandt,
Is MSNBot over-crawling your site? Bot 2.0 should be obeying all robots.txt commands under MSNBot (you don't need to add 2.0 to the string).
If you are still having issues, please email me at bwmc@microsoft.com with your domain name and "MSNbot crawling issue" in the subject line, and I will route it to the crawler team for investigation.
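For anyone checking their own file, here is a minimal sketch that uses Python's standard urllib.robotparser to confirm that a rule written with the plain "msnbot" token, one directive per line, matches the user-agent string from the logs earlier in this thread. The sample path and the "SomeOtherBot" agent are only illustrative, and this checks how Python's parser reads the file, not how msnbot itself behaves.

```python
from urllib.robotparser import RobotFileParser

# Each directive sits on its own line.
ROBOTS_TXT = """\
User-agent: msnbot
Disallow: /
"""

# User-agent string as it appears in the logs earlier in this thread.
MSNBOT_UA = "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"


def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if this parser would let user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # Illustrative checks: one for msnbot, one for an unrelated (made-up) bot.
    print(is_allowed(ROBOTS_TXT, MSNBOT_UA, "/2004/05/22/valpak/"))
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", "/2004/05/22/valpak/"))
```

If the first check reports that the path is still allowed, the file is probably not being read as two separate directives.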
Will the robots.txt solve this issue?
I've been tracing this closely. I've set up my script so that I get an email every time msnbot crawls my site; it sends me the complete URL and variable values. Because I did not get a quick response here, I also added a filter to that script that strips the # and everything following it from any values, so that it wouldn't wedge my server. Since shortly after my last post here, the # symbol and what follows it have NOT been included in the URL, so the problem seems to be solved; it looks to me like the engineers fixed it.
Yes, msnbot is still crawling the site, although I don't know for sure that I have set up robots.txt correctly. I asked for specific syntax here but got none. At this point I no longer care, since it's not causing a problem, but if you want me to try it, please supply the correct syntax and I'll do it.
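The filter mentioned above (dropping the # and everything after it) could look roughly like this. It is only a sketch in Python; the poster's actual script and language are not shown in the thread, and the function name strip_fragment is hypothetical.

```python
def strip_fragment(request_path: str) -> str:
    """Drop the first '#' and everything after it; return the rest unchanged."""
    return request_path.split("#", 1)[0]


if __name__ == "__main__":
    # One of the request paths from the logs earlier in this thread.
    print(strip_fragment("/2004/12/26/iphoto-improvement/#respond"))
```

A path without a # passes through untouched, so the filter can be applied to every request.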
I am going to have to "take this back". I got no notifications that the pound character was passed between the 20th and 30th of July, but shortly after I posted the above note I started getting them again.
I'm not too sure. This topic is too deep. You could probably get the answer at MSDN.
# mark. By the way, does anyone know the answer: how can it utilize the hash?