The Story of Bytes, Unicode, UTF-8 and ANSI, and How to Convert Between Them

Tag: Unicode conversion | Date: 2021-04-12

Collected from http://www.cppblog.com/AutomateProgram/archive/2010/03/26/110567.html. A fairly good write-up.

The story of Unicode, UTF-8 and ANSI

原文地址http://blog.csdn.net/iscandy/archive/2009/02/02/3859219.aspx

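Long, long ago, there was a group of people who decided to combine eight transistors, each of which could be switched on or off, into different states to represent everything in the world. They saw that eight switches were good, and they called this a "byte".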

再后来他们又做了一些可以处理这些字节的机器机器开动了可以用字节来组合出很多状态状态开始变来变去他们看到这样是好的于是它们就这机器称为”计算机” 。

At first computers were used only in the United States. An eight-bit byte can take 256 (2 to the power of 8) different states.

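They took the first 32 states, numbered from 0, and gave each a special purpose: whenever a terminal or printer received one of these agreed bytes, it had to perform an agreed action. On 0x0A the terminal starts a new line; on 0x07 the terminal beeps at people; on 0x1B, for example, the printer prints reverse-video text or the terminal displays letters in color. They saw that this was good, and they called the byte states below 0x20 "control codes".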

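They then assigned consecutive byte states to every space, punctuation mark, digit, and upper- and lower-case letter, numbering them all the way up to 127, so that a computer could store English text using different bytes. Everyone was pleased with this, and the scheme became known as ANSI's "ASCII" encoding (American Standard Code for Information Interchange). At that time every computer in the world used the same ASCII scheme to store English text.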

后来就像建造巴比伦塔一样世界各地的都开始使用计算机但是很多国家用的不是英文他们的字母里有许多是ASCI I里没有的为了可以在计算机保存他们的文字他们决定采用127号之后的空位来表示这些新的字母、符号还加入了很多画表格时需要用下到的横线、竖线、交叉等形状一直把序号编到了最后一个状态255。从128到255这一页的字符集被称”扩展字符集” 。从此之后贪婪的人类再没有新的状态可以用了美帝国主义可能没有想到还有第三世界国家的人们也希望可以用到计算机吧

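By the time the Chinese got computers, there were no byte states left to represent Chinese characters, and there were more than 6,000 commonly used characters to store. This did not stump the resourceful Chinese people. We unceremoniously discarded the odd symbols after 127 and ruled that a byte of 127 or less keeps its original meaning, but two bytes greater than 127 in a row together represent one Chinese character. The first byte (called the high byte) runs from 0xA1 to 0xF7, and the second (the low byte) from 0xA1 to 0xFE, which gives room for roughly 7,000 simplified Chinese characters. In these codes we also included mathematical symbols, Roman and Greek letters, and Japanese kana; even the digits, punctuation, and letters already in ASCII were all given new two-byte codes. These are the so-called "full-width" characters, while the originals at 127 and below are called "half-width" characters.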

The Chinese people saw that this was good, and they called this Chinese character scheme "GB2312". GB2312 is a Chinese extension of ASCII.
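As a rough illustration of the half-width and full-width distinction, the sketch below encodes three characters with Python's built-in `gb2312` codec. The helper name `gb2312_hex` is made up for this example, and the Python codec is only assumed to be a faithful stand-in for the scheme described here.

```python
def gb2312_hex(ch: str) -> str:
    """Return the GB2312 bytes of one character as spaced, upper-case hex."""
    return ch.encode("gb2312").hex(" ").upper()

half_a = gb2312_hex("A")        # half-width letter: one ASCII byte
full_a = gb2312_hex("\uff21")   # full-width letter A: re-encoded as two bytes
han = gb2312_hex("汉")          # a Chinese character: two bytes, both above 0xA0

print(half_a, "|", full_a, "|", han)   # 41 | A3 C1 | BA BA
```

Both bytes of each two-byte code fall in the range 0xA1 to 0xFE, which is what lets a program tell them apart from ordinary ASCII.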

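But China has far too many characters, and we soon found that many people's names could not be typed with it, especially those of certain national leaders who are very good at causing trouble for others.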

So we had to keep digging out the code points that GB2312 had left unused and put them to work without ceremony.

Later even that was not enough, so the requirement that the low byte be a code above 127 was dropped: as long as the first byte is greater than 127, it always marks the start of a Chinese character, whether or not what follows belongs to the extended character set. The resulting extended scheme is called the GBK standard. GBK includes everything in GB2312 and adds nearly 20,000 new Chinese characters (including traditional characters) and symbols.

Later still, ethnic minorities also began to use computers, so we extended the scheme again with several thousand new characters for minority scripts, and GBK grew into GB18030. From then on, the culture of the Chinese nation could be passed down in the computer age.

Chinese programmers saw that this series of Chinese encoding standards was good, and collectively called them "DBCS" (Double Byte Character Set). The most distinctive feature of the DBCS family is that two-byte Chinese characters and one-byte English characters coexist in the same encoding scheme. A program that supports Chinese must therefore check the value of every byte in a string: if the value is greater than 127, a double-byte character has appeared. In those days, every computer monk who had been initiated and could program had to chant the following mantra hundreds of times a day:

"One Chinese character counts as two English characters! One Chinese character counts as two English characters......"

Because every country produced its own encoding standard just as China did, nobody understood anyone else's encoding and nobody supported anyone else's encoding. Even the mainland and Taiwan, brother regions only 150 nautical miles apart that share the same language, adopted different DBCS schemes. A Chinese user who wanted the computer to display Chinese characters had to install a "Chinese character system" dedicated to Chinese display and input, but a fortune-telling program written by some benighted, feudal-minded person in Taiwan would only run with yet another "Chinese character system" that supported the BIG5 encoding. Install the wrong character system and the display turned to chaos. What was to be done? And among the nations of the world there were also poor peoples who could not yet use computers; what about their scripts? Truly a computing Tower of Babel!
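The byte-scanning rule behind the mantra, and the chaos that follows when bytes written under one DBCS scheme are read under another, can be sketched roughly as follows. The function name `count_dbcs_chars` is hypothetical, and Python's `gbk` and `big5` codecs are assumed to stand in for the systems of the time.

```python
def count_dbcs_chars(data: bytes) -> int:
    """Count characters with the classic DBCS rule: a byte above 127
    starts a two-byte character, any other byte is a character by itself."""
    count = 0
    i = 0
    while i < len(data):
        i += 2 if data[i] > 127 else 1
        count += 1
    return count

raw = "abc汉字".encode("gbk")
n_chars = count_dbcs_chars(raw)
print(len(raw), n_chars)   # 7 bytes, 5 characters

# The same bytes read with the wrong DBCS table no longer give the same text.
as_gbk = raw.decode("gbk")
as_big5 = raw.decode("big5", errors="replace")
print(as_gbk == as_big5)   # False
```

The scan itself works the same way under either scheme; only the table used to turn byte pairs into characters differs, which is why the wrong "character system" produced garbage rather than an error.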

Just then the archangel Gabriel appeared in the nick of time: an international organization called ISO (International Organization for Standardization) decided to tackle the problem. Their method was simple: scrap all regional encoding schemes and create a new encoding that covers every culture, letter, and symbol on Earth. They planned to call it the "Universal Multiple-Octet Coded Character Set", UCS for short, commonly known as "Unicode".

When Unicode was being drafted, computer memory capacity had grown enormously and space was no longer a problem. So ISO simply stipulated that every character must be represented with two bytes, that is, 16 bits. For the "half-width" characters of ASCII, Unicode keeps the original code values and merely widens them from 8 bits to 16, while the characters of all other cultures and languages are re-encoded from scratch in one unified scheme. Because the half-width English symbols need only the low 8 bits, their high 8 bits are always 0, so this generous scheme doubles the space needed to store English text.

At this point, programmers from the old society noticed something strange: their strlen function was no longer reliable. A Chinese character was no longer equal to two characters, but one! Yes, starting with Unicode, a half-width English letter and a full-width Chinese character are each uniformly "one character", and each is also uniformly "two bytes". Note the difference between the terms "character" and "byte": a byte is a physical storage unit of 8 bits, while a character is a culture-related symbol. In Unicode (in the two-byte form described here), one character is two bytes. The era in which one Chinese character counts as two English characters is almost over.
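A small sketch of the character versus byte distinction, assuming Python's `gbk` codec as the DBCS example and `utf-16-le` as a stand-in for the two-byte Unicode form discussed here:

```python
text = "A汉"

chars = len(text)                            # characters: 2
gbk_bytes = len(text.encode("gbk"))          # DBCS: 1 + 2 = 3 bytes
ucs2_bytes = len(text.encode("utf-16-le"))   # two bytes each: 2 + 2 = 4 bytes
print(chars, gbk_bytes, ucs2_bytes)          # 2 3 4

# A half-width letter keeps its ASCII value; the extra high byte is zero.
wide_a = "A".encode("utf-16-le")
print(wide_a)                                # b'A\x00'
```

Counting bytes, as strlen does, gives the character count only for plain ASCII text in a one-byte encoding.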

In the days when many character sets coexisted, companies that made multilingual software ran into a great deal of trouble. To sell the same software in different countries, they had to chant the double-byte character set mantra while localizing it, taking care not to make mistakes at every step and converting the text in the software back and forth between character sets. Unicode was a very good all-in-one solution for them. Starting with Windows NT, Microsoft took the opportunity to rework its operating system so that all the core code works in Unicode. From then on, Windows could finally display the characters of every culture in the world without any add-on native-language system.

However, Unicode was not designed to be compatible with any existing encoding scheme, so GBK and Unicode lay out the codes of Chinese characters in completely different ways. No simple arithmetic converts text between Unicode and another encoding; the conversion has to go through a lookup table.

As mentioned earlier, Unicode represents one character with two bytes, so it can express 65,536 different characters in total, which can probably cover the symbols of every culture in the world. If that is still not enough, never mind: ISO has prepared the UCS-4 scheme, which, put simply, uses four bytes per character. That gives about 2.1 billion different characters (the most significant bit has other uses), which should last until the day a Milky Way federation is founded!
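To make the table-lookup point concrete, here is a minimal sketch that compares the Unicode code point of one character with its GBK code and works out the two capacity figures. It assumes Python's `gbk` codec, whose internal mapping table plays the role of the lookup table.

```python
ch = "汉"
code_point = ord(ch)                          # position in Unicode
gbk_bytes = ch.encode("gbk")                  # found through the codec's mapping table
gbk_value = int.from_bytes(gbk_bytes, "big")

print(f"Unicode {code_point:04X}, GBK {gbk_value:04X}")   # Unicode 6C49, GBK BABA

two_byte_capacity = 2 ** 16    # states available with two bytes
ucs4_capacity = 2 ** 31        # four bytes with the top bit reserved
print(two_byte_capacity, ucs4_capacity)                   # 65536 2147483648
```

The two values for the same character bear no obvious arithmetic relation to each other, which is the reason a table is needed.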

Unicode arrived together with the rise of computer networks, so how to transmit Unicode over a network also had to be considered. This gave rise to the many transmission-oriented UTF (UCS Transfer Format) standards. As the names suggest, UTF-8 transmits data 8 bits at a time and UTF-16 transmits 16 bits at a time. For the sake of reliable transmission, however, the mapping from Unicode to UTF is not direct; it goes through certain algorithms and rules.

Computer monks initiated into network programming all know that there is a very important issue when passing information over a network: how the high and low bytes of data are interpreted. Some computers send the low byte first, such as our PCs with the Intel architecture, while others send the high byte first. To check that both sides agree on byte order when exchanging data over a network, a very simple method is used: a marker is sent at the start of the text stream. If the text that follows is high byte first, "FEFF" is sent; otherwise "FFFE" is sent. If you don't believe it, open a UTF-16 file in binary mode and see whether its first two bytes are these two bytes (a UTF-8 file with a byte-order mark starts with the three bytes EF BB BF instead).
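A minimal sketch of the byte-order check, using the constants `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE` from Python's standard library. The function name `detect_utf16_order` is invented for this example.

```python
import codecs

def detect_utf16_order(data: bytes) -> str:
    """Report the byte order announced by a leading UTF-16 byte-order mark."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF: high byte first
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE: low byte first
        return "little-endian"
    return "no BOM"

be = codecs.BOM_UTF16_BE + "汉".encode("utf-16-be")
le = codecs.BOM_UTF16_LE + "汉".encode("utf-16-le")
print(be.hex(" "), "->", detect_utf16_order(be))   # fe ff 6c 49 -> big-endian
print(le.hex(" "), "->", detect_utf16_order(le))   # ff fe 49 6c -> little-endian
```

The same character, 6C49, appears as 6C 49 or 49 6C depending on the order, and the two-byte marker in front tells the reader which one to expect.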

Here we come to a famous strange phenomenon: create a new file in Windows Notepad, type the two characters "联通" (Liantong, the name of China Unicom), save, close, and open it again. You will find that the two characters have vanished, replaced by a few garbled characters. Ha, some say this is why Unicom cannot compete with China Mobile. In fact it happens because the GB2312 encoding and the UTF-8 encoding collide.

Here is a conversion rule table from Unicode to UTF-8, taken from the internet:

Unicode (hex)     UTF-8 (binary)
0000 - 007F       0xxxxxxx
0080 - 07FF       110xxxxx 10xxxxxx
0800 - FFFF       1110xxxx 10xxxxxx 10xxxxxx

For example, the Unicode code of the character "汉" is 6C49. 6C49 lies between 0800 and FFFF, so the three-byte template applies: 1110xxxx 10xxxxxx 10xxxxxx. Written in binary, 6C49 is:

0110 1100 0100 1001

Split this bit stream into 0110 110001 001001 as the three-byte template requires, and substitute the pieces for the x's in the template in order. The result is 1110-0110 10-110001 10-001001, that is E6 B1 89, which is the UTF-8 encoding of the character.
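The template substitution above can be sketched in a few lines. The function name `utf8_encode_bmp` is made up here; it covers only the three templates in the table and does not reject surrogate values, so it is an illustration rather than a complete encoder.

```python
def utf8_encode_bmp(code_point: int) -> bytes:
    """Encode a code point in 0000-FFFF using the three templates in the table."""
    if code_point <= 0x7F:                           # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:                          # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:                         # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("outside the three templates shown")

encoded = utf8_encode_bmp(0x6C49)
print(encoded.hex(" ").upper())   # E6 B1 89
```

The result agrees with Python's built-in `"汉".encode("utf-8")`.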

When you create a new text file, Notepad's default encoding is ANSI. Chinese characters typed under ANSI encoding are actually stored in the GB family of encodings, so the bytes of "联通" are:

C1    1100 0001
AA    1010 1010
CD    1100 1101
A8    1010 1000

Notice anything? The first and second bytes, and the third and fourth bytes, begin with "110" and "10" respectively, which matches the two-byte template of the UTF-8 rules exactly. So when the file is opened again, Notepad mistakes it for a UTF-8 file. Strip the 110 from the first byte and the 10 from the second and we get "00001 101010"; align the bits and pad with leading zeros and we get "0000 0000 0110 1010". Sorry, that is Unicode 006A, the lowercase letter "j". The next two bytes decode under UTF-8 to 0368, which is not a usable standalone character (it is a combining mark). That is why a file containing only the two characters "联通" cannot be displayed properly in Notepad.

If you type a few more characters after "联通", the codes of the other characters will not necessarily also start with 110 and 10, so when the file is reopened Notepad no longer insists that it is UTF-8 and reads it as ANSI, and the garbling disappears.
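The misreading can be reproduced by applying the two-byte template blindly to the GBK bytes of "联通". The helper `decode_two_byte_template` is hypothetical and deliberately lenient; it is a sketch of the arithmetic described above, not of Notepad's actual detection logic.

```python
def decode_two_byte_template(first: int, second: int) -> int:
    """Apply the 110xxxxx 10xxxxxx template blindly and return the code point."""
    if first >> 5 != 0b110 or second >> 6 != 0b10:
        raise ValueError("bytes do not fit the two-byte template")
    return ((first & 0x1F) << 6) | (second & 0x3F)

raw = "联通".encode("gbk")
points = [decode_two_byte_template(raw[0], raw[1]),
          decode_two_byte_template(raw[2], raw[3])]
print(raw.hex(" ").upper(), [f"{p:04X}" for p in points])
# C1 AA CD A8 ['006A', '0368']
```

A strict decoder behaves differently: Python's own `raw.decode("utf-8")` raises `UnicodeDecodeError`, because C1 AA is an overlong way of writing a character that fits in one byte.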

Interconversion:

Original address: http://club.topsage.com/thread-670150-1-1.html

Conversion between ANSI, Unicode, and UTF-8 strings, and writing text files

The ANSI string is the one we know best: an English character takes one byte, a Chinese character takes two bytes, and the string ends with a \0. It is commonly used in TXT text files.
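A minimal sketch of converting between an ANSI byte string and UTF-8 by way of a Unicode string. It assumes that "ANSI" means the simplified Chinese code page, approximated here by Python's `gbk` codec; the function names are invented for this example. Python byte strings carry their own length, so there is no trailing \0.

```python
def ansi_to_utf8(data: bytes, ansi_codec: str = "gbk") -> bytes:
    """Decode ANSI bytes to a Unicode string, then encode that string as UTF-8."""
    return data.decode(ansi_codec).encode("utf-8")

def utf8_to_ansi(data: bytes, ansi_codec: str = "gbk") -> bytes:
    """Decode UTF-8 bytes to a Unicode string, then encode that string as ANSI."""
    return data.decode("utf-8").encode(ansi_codec)

ansi = "abc汉".encode("gbk")
utf8 = ansi_to_utf8(ansi)
print(len(ansi), len(utf8))   # 5 6

round_trip = utf8_to_ansi(utf8)
print(round_trip == ansi)     # True
```

The English letters take one byte in both forms, while the Chinese character takes two bytes in ANSI and three in UTF-8.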
