This is a place devoted to giving you deeper insight
into the news, trends, people and technology behind Bing.
We've already covered in past blog articles some of the basics about how webmasters can use a file called robots.txt to control how search engine crawlers (aka bots) crawl their websites. But there is so much more to talk about with bots. So let's take a bit of a deeper dive into the subject.
Topic 1: Using the proper text file encoding
The robots.txt file is used by webmasters to define which files and directories compliant search engine bots may or may not crawl. Robots.txt files are basically text files. However, even something as seemingly straightforward as a text file is not as simple as it might seem. The file encoding scheme used to save the file makes a big difference. For example, when you use the quintessential text file editor, the Notepad utility in Windows, you can save your text files in your choice of the following encoding types: ANSI, Unicode, Unicode big endian, and UTF-8.
If you choose to save your robots.txt file as either Unicode or Unicode big endian, the resulting file will not be compatible with most search engine bots.
Robots.txt file requirements
To ensure that search engine bots (not just Bing's, but all of them) can read the directives for blocking or allowing content to be indexed in the robots.txt file, save the file using one of the compatible encoding formats: ANSI or UTF-8.
Sticking with one of these compatible encoding formats will ensure that the bots you wish to control can read, and thus act upon, your robots.txt file. For more information, check out this article covering the history of character sets from the Microsoft Typography team.
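If you want to double-check how a robots.txt file was saved, you can look at its first few bytes. The following Python sketch (the function name and the simple byte-order-mark check are my own illustration, not something the article prescribes) flags the UTF-16 encodings that Notepad labels "Unicode" and "Unicode big endian":

```python
# Sketch: inspect the leading bytes of a robots.txt file to spot
# encodings that most search engine bots cannot read.
def check_robots_encoding(path="robots.txt"):
    with open(path, "rb") as f:
        head = f.read(3)
    if head.startswith(b"\xff\xfe") or head.startswith(b"\xfe\xff"):
        # UTF-16 byte order marks: Notepad's "Unicode" / "Unicode big endian"
        return "UTF-16 - incompatible with most bots"
    if head.startswith(b"\xef\xbb\xbf"):
        return "UTF-8 with BOM - usually readable, but BOM-less is safest"
    return "ANSI or UTF-8 without BOM - compatible"
```

A file saved as plain ASCII or BOM-less UTF-8 falls through to the last branch, which is where you want to be.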
Topic 2: Writing non-ASCII alphabetic characters in robots.txt
The limited number of compatible file encoding formats for robots.txt exposes a potential problem for some users.
The Internet Engineering Task Force (IETF) specifies that Uniform Resource Identifiers (URIs), comprising both Uniform Resource Locators (URLs) and Uniform Resource Names (URNs), must be written using the US-ASCII character set. However, ASCII's 128 characters cover only the English alphabet, numbers, and punctuation marks. Some alphabetic characters from other Latin-based languages, such as ñ in Spanish and ç in French, are left out of ASCII. More significantly, most characters in non-Latin-based alphabets, such as pi (π) in Greek and ya (я) in Cyrillic, along with the entire alphabets of many other world languages, can't be written at all in the limited, English-oriented ASCII.
This limitation can come into play for webmasters whose web servers host files and directories named in languages whose characters fall outside the ASCII character set. If a robots.txt file on such a server includes directives to block bots from indexing content in files and directories whose names include non-ASCII characters, the bots may not interpret those directives as the webmaster intended.
Percent encoding to the rescue
There is a way to make sure bots can properly read file and directory path names, even when those names don't adhere to the ASCII standard. When writing directives that include characters unavailable in ASCII, you can "escape" (aka percent-encode) them, which enables the bot to read them.
Percent-encoded characters, defined in the IETF's RFC 3986, serve as character substitutes. A percent-encoded character is a sequence of one or more three-character codes, each representing one octet (byte) and each consisting of a "%" sign followed by two hexadecimal digits. Percent encoding converts the character's UTF-8 value into a sequence of one or more ASCII-based octet codes that a URI-compliant bot can read.
To demonstrate what percent-encoded text looks like, type www.%62%69%6e%67.com into your browser's address bar. It will be automatically decoded into www.bing.com: the octet codes %62, %69, %6e, and %67 are decoded by the browser into the letters b, i, n, and g, respectively. Note, though, that percent encoding is really intended for the non-ASCII characters in a URL path, to minimize the potential for decoding errors.
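You can watch the same translation happen in code. Python's standard urllib.parse module (used here purely as an illustrative sketch, not something the article itself relies on) decodes and encodes percent-escaped text just as a browser does:

```python
from urllib.parse import quote, unquote

# The browser example from above: %62%69%6e%67 decodes to "bing".
decoded = unquote("www.%62%69%6e%67.com")
print(decoded)  # prints: www.bing.com

# Going the other way, plain ASCII letters need no escaping at all,
# which is why encoding them (as in the demo) is legal but unnecessary.
print(quote("bing"))  # prints: bing
```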
Real world example
Let's look at a real-world example. Suppose you were the webmaster of a website containing the URL http://www.domain.com/папка/ (the folder name in the sample URL is written in Cyrillic and literally means "folder"). To block a bot from indexing that folder using percent encoding in your robots.txt file, you would write the directive as follows:

Disallow: /%D0%BF%D0%B0%D0%BF%D0%BA%D0%B0/

If instead you simply wrote

Disallow: /папка/

the bot may not be able to read the directive and thus fail to perform as desired.
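One quick way to generate or verify such a directive is Python's standard urllib.parse.quote function, which performs exactly the UTF-8 percent encoding described here. This is an illustrative sketch, not a tool the article endorses:

```python
from urllib.parse import quote

folder = "/папка/"        # the Cyrillic folder name from the example
encoded = quote(folder)   # "/" is in quote()'s default "safe" set, so it stays
print("Disallow: " + encoded)
# prints: Disallow: /%D0%BF%D0%B0%D0%BF%D0%BA%D0%B0/
```

Each Cyrillic letter expands to two octets (%D0%BF for п, %D0%B0 for а, and so on), matching the two-octet rule for code points in the 0080-07FF range.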
Performing percent encoding
So how do you translate your non-ASCII characters into percent-encoded octets? Well, frankly, it's a bit of a chore. If you search for them, a few websites and tools offer to perform percent encoding for you, but rather than endorse a site I know nothing about, I'll instead tell you how to manually calculate the conversion. If you want to use an automated tool, go for it. But knowing how the process works will let you verify that a tool encoded your characters correctly.
Warning! I'm going to get pretty tech geeky here. If working with hexadecimal and binary numbers is not your thing, I apologize up front!
OK, thus warned, let's get to it. You first need to know the Unicode code point for each character you want to encode. Code points are usually presented as U+HHHH; the four "H" hexadecimal digits are what you need.
As defined in IETF RFC 3629, a UTF-8 encoded character can be between one and four octets in length. The character's code point determines how many octets you need: the higher the number, the more octets required to express it.
Below is a table to help illustrate these concepts. The letter n in the table represents the open bit positions in each octet for encoding the character's binary number.
Unicode code point range (hex)    Octet sequence (in binary)
0000 0000 - 0000 007F             0nnnnnnn
0000 0080 - 0000 07FF             110nnnnn 10nnnnnn
0000 0800 - 0000 FFFF             1110nnnn 10nnnnnn 10nnnnnn
0001 0000 - 0010 FFFF             11110nnn 10nnnnnn 10nnnnnn 10nnnnnn
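The table above translates directly into code. The following Python sketch (the function name is mine, not from the article) builds the octet sequence by filling the open n bit positions with the code point's bits, then prefixes each octet with "%":

```python
def percent_encode_char(ch):
    """Percent encode one character per the UTF-8 octet table above."""
    cp = ord(ch)
    if cp <= 0x7F:                          # one octet: 0nnnnnnn
        octets = [cp]
    elif cp <= 0x7FF:                       # two octets: 110nnnnn 10nnnnnn
        octets = [0xC0 | (cp >> 6),
                  0x80 | (cp & 0x3F)]
    elif cp <= 0xFFFF:                      # three octets
        octets = [0xE0 | (cp >> 12),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)]
    else:                                   # four octets
        octets = [0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)]
    return "".join("%%%02X" % o for o in octets)

# Cross-check against Python's built-in UTF-8 encoder:
print(percent_encode_char("п"))  # prints: %D0%BF
```

The bit masks mirror the table: 0xC0 is 110 00000, 0x80 is 10 000000, and 0x3F keeps the low six bits that fill each continuation octet.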
Let's demo this using the first letter of the Cyrillic example given above, п. To manually percent encode this character, do the following:
1. Look up the character's Unicode code point: п is U+043F.
2. Write 043F in binary: 0000 0100 0011 1111.
3. 043F falls in the range 0000 0080 - 0000 07FF, so the character needs two octets: 110nnnnn 10nnnnnn.
4. Fill the open n bit positions with the character's bits, right-aligned: 110 10000 and 10 111111, giving 1101 0000 1011 1111.
5. Convert each octet back to hexadecimal: D0 and BF. The percent-encoded result is %D0%BF.
You can confirm your percent encoded path works as expected by typing it into your browser as part of a URL. If it resolves correctly, you're golden.
There's always more to talk about with robots (and so many other webmaster-related topics). If you have any questions, comments, or suggestions, feel free to post them in our SEM forum. Until next time...
-- Rick DeJarnette, Bing Webmaster Center
And visit us at our dear forum, Muntadayat al-Jannah.
I'd like to have an instant translator from different languages, like a bluetooth earphone.
thanks for the examples
Good Info. Thank you!
Good information. If a handy converter is available online, it would be great. Anyone know of such a tool?
Great information. Thank you so much Rick.
Yea, excellent post. I didn't get through it all but have it bookmarked. Great info.
This went over my head; I didn't understand much. Maybe someone can explain it in simpler terms.
Thanks for this article, but I would like to know if anybody can translate it into German, since many people don't understand English well.
thanks a lot it's very helpful
thanks man, but I have a big problem with the MSN bot on my site: http://www.gazetna.com/vb/
It crawls slowly: 5 visits a day, not more :)
This is a good write up. Thanks!
© 2013 Microsoft