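Once upon a time, a group of people decided to use eight transistors that could be switched on and off, combined into different states, to represent everything in the world. They saw that eight switch states were good, so they called this a "byte".
Later they built machines that could process these bytes. The machines started up, bytes could be combined into many states, and the states began to change. They saw that this was good, so they called the machine a "computer".
At first computers were used only in the United States. An eight-bit byte can be combined into 256 (2 to the 8th power) different states.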
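They gave special purposes to the 32 states numbered from 0 upward: whenever a terminal or printer received one of these agreed bytes, it had to perform an agreed action. On 0x0A the terminal starts a new line; on 0x07 the terminal beeps at people; on 0x1B, for example, the printer prints inverse text or the terminal displays letters in color. They saw that this was good, so they called the byte states below 0x20 "control codes".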
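They then assigned spaces, punctuation marks, digits, and upper- and lowercase letters to consecutive byte states, numbering all the way up to 127, so that computers could store English text with different bytes. Everyone felt good about this, so everyone called the scheme the ANSI "ASCII" encoding (American Standard Code for Information Interchange). At that time every computer in the world used the same ASCII scheme to store English text.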
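Later, as with the building of the Tower of Babel, computers came into use all over the world. But many countries do not use English, and their alphabets contain many letters that are not in ASCII. To store their own text on a computer, they decided to use the free slots after 127 for these new letters and symbols. They also added many shapes needed for drawing tables, such as horizontal lines, vertical lines, and crosses, and kept numbering until the last state, 255. The characters on this page from 128 to 255 are called the "extended character set". After that, greedy humanity had no new states left to use. Perhaps the American imperialists never imagined that people in third-world countries would also want to use computers!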
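By the time the Chinese got computers, there were no byte states left to represent Chinese characters, and there were more than 6,000 commonly used characters to store. But this could not defeat the clever Chinese people. We unceremoniously threw out the odd symbols after 127 and made a rule: a character less than 127 keeps its original meaning, but two characters greater than 127 in a row represent one Chinese character. The first byte (called the high byte) runs from 0xA1 to 0xF7, and the second byte (the low byte) runs from 0xA1 to 0xFE, which lets us combine roughly 7,000 simplified Chinese characters. Into these codes we also put mathematical symbols, Roman and Greek letters, and Japanese kana. Even the digits, punctuation marks, and letters that already existed in ASCII were all re-encoded as two-byte codes. These are the so-called "full-width" characters, while the original ones below 127 are called "half-width" characters.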
The Chinese people saw that this was good, so they called this Chinese character scheme "GB2312". GB2312 is a Chinese extension of ASCII.
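The GB2312 layout described above can be checked with a short sketch. It assumes Python's built-in `gb2312` codec follows the scheme described here; the helper name `is_gb2312_pair` is made up for illustration.

```python
def is_gb2312_pair(pair: bytes) -> bool:
    """True if pair fits the two-byte ranges given in the text:
    high byte 0xA1-0xF7, low byte 0xA1-0xFE."""
    return (
        len(pair) == 2
        and 0xA1 <= pair[0] <= 0xF7
        and 0xA1 <= pair[1] <= 0xFE
    )

han = "汉".encode("gb2312")          # a common simplified character
half_a = "A".encode("gb2312")        # half-width: plain ASCII, one byte
full_a = "\uff21".encode("gb2312")   # full-width "A" (U+FF21), two bytes

for label, raw in [("han", han), ("half A", half_a), ("full A", full_a)]:
    print(label, raw.hex(" "), "double-byte" if is_gb2312_pair(raw) else "single-byte")
```

The half-width letter stays a single ASCII byte, while its full-width twin occupies two bytes, both above 127.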
GB2312 still was not enough, so we had to keep hunting for the code points that GB2312 had left unused and press them into service without ceremony.
Later even that was not enough, so we simply dropped the requirement that the low byte be a code above 127. As long as the first byte is greater than 127, it always marks the start of a Chinese character, whether or not the byte that follows belongs to the extended character set. The resulting extended scheme is called the GBK standard. GBK includes everything in GB2312 and adds nearly 20,000 new Chinese characters (including traditional characters) and symbols.
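The relaxed GBK rule (a byte above 127 always opens a two-byte character, whatever the second byte is) can be sketched as a small scanner. This is an illustration, not any library's real parser. The helpers `split_dbcs` and `first_low_trail_char` are hypothetical names, and the search range 0x4E00-0x9FA5 is an assumption about where to look for Chinese characters.

```python
def split_dbcs(data: bytes) -> list[bytes]:
    """Split a GBK byte string into per-character chunks.

    A byte above 127 starts a two-byte character no matter what
    the following byte is; any other byte is a one-byte ASCII character.
    """
    chunks = []
    i = 0
    while i < len(data):
        width = 2 if data[i] > 127 else 1
        chunks.append(data[i:i + width])
        i += width
    return chunks

def first_low_trail_char() -> str:
    """Find a Chinese character whose GBK second byte is below 128."""
    for cp in range(0x4E00, 0x9FA6):
        try:
            raw = chr(cp).encode("gbk")
        except UnicodeEncodeError:
            continue  # not part of GBK
        if len(raw) == 2 and raw[1] < 0x80:
            return chr(cp)
    raise LookupError("no such character found")

sample = first_low_trail_char()
chunks = split_dbcs(("a" + sample + "b").encode("gbk"))
print([c.hex(" ") for c in chunks])
```

The middle chunk has a second byte in the ASCII range, which is why a program cannot classify GBK bytes one at a time; it has to track whether it is inside a two-byte character.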
Later still, ethnic minorities also began to use computers, so we extended the scheme again, adding several thousand characters for minority scripts, and GBK was expanded into GB18030. From then on, the culture of the Chinese nation could be handed down in the computer age.
Chinese programmers saw that this series of Chinese encoding standards was good, so they called them collectively "DBCS" (Double Byte Character Set). The most distinctive feature of the DBCS standards is that two-byte Chinese characters and one-byte English characters coexist in the same encoding scheme. Programs written to support Chinese therefore had to pay attention to the value of every byte in a string: if the value is greater than 127, a double-byte character has appeared. In those days, every computer monk who had been blessed and could program had to recite this mantra hundreds of times a day:
"One Chinese character counts as two English characters! One Chinese character counts as two English characters..."
At that time every country, like China, came up with its own encoding standard, and as a result nobody understood anybody else's encoding and nobody supported anybody else's encoding. Even mainland China and Taiwan, brothers separated by only 150 nautical miles and using the same language, adopted different DBCS encoding schemes. In those days, a Chinese user who wanted the computer to display Chinese characters had to install a "Chinese character system" dedicated to displaying and inputting Chinese characters. But a fortune-telling program written by some feudal, superstitious soul in Taiwan could only be used after installing yet another system that supported the BIG5 encoding, and with the wrong character system installed, the display turned to chaos! What was to be done? And among the peoples of the world there were poor ones who could not use computers for the time being; what about their scripts? Truly a computer Tower of Babel!
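The mismatch between regional DBCS schemes is easy to reproduce. The sketch below assumes Python's built-in `gbk` and `big5` codecs stand in for the two "Chinese character systems"; the sample text is arbitrary.

```python
text = "中文"

big5_bytes = text.encode("big5")   # as a BIG5 program would store it
gbk_bytes = text.encode("gbk")     # as a GBK program would store it

# Reading BIG5 bytes through the GBK table produces other characters.
# errors="replace" keeps undecodable bytes from raising an exception.
misread = big5_bytes.decode("gbk", errors="replace")

print("BIG5 bytes:", big5_bytes.hex(" "))
print("GBK bytes: ", gbk_bytes.hex(" "))
print("BIG5 bytes read as GBK:", misread)
```

The same two characters get different byte values in each scheme, so text written under one system shows up as unrelated characters under the other.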
Just then, the archangel Gabriel appeared in the nick of time: an international organization called ISO (International Organization for Standardization) decided to tackle the problem. Their approach was simple: abolish all regional encoding schemes and create a new encoding that covers all the cultures, letters, and symbols on earth! They planned to call it "Universal Multiple-Octet Coded Character Set", or UCS for short, commonly known as "UNICODE".
When UNICODE was being drawn up, computer memory capacity had grown enormously, and space was no longer a problem. So ISO simply stipulated that every character must be represented uniformly with two bytes, that is, 16 bits. For the "half-width" characters in ASCII, UNICODE keeps the original code values unchanged but extends their length from 8 bits to 16 bits, while the characters of other cultures and languages are all re-encoded uniformly. Because the half-width English symbols need only the low 8 bits, the high 8 bits are always 0, so this generous scheme uses twice the space when storing English text.
At this point, programmers from the old society began to notice a strange phenomenon: their strlen function was no longer reliable, and a Chinese character no longer counted as two characters, but as one! Yes, starting with UNICODE, a half-width English letter and a full-width Chinese character are both uniformly "one character", and at the same time both uniformly "two bytes". Note the difference between the terms "character" and "byte": a "byte" is a physical storage unit of 8 bits, while a "character" is a culturally meaningful symbol. In UNICODE, one character is two bytes. The era in which one Chinese character counted as two English characters was almost over.
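The difference between counting characters and counting bytes can be seen directly. In this sketch, Python's `utf-16-be` codec stands in for the fixed two-byte UNICODE form described here; that equivalence is an assumption that holds only for characters that fit in 16 bits. The helper name is hypothetical.

```python
def char_and_byte_counts(text: str, encoding: str) -> tuple[int, int]:
    """Return (number of characters, number of bytes once encoded)."""
    return len(text), len(text.encode(encoding))

sample = "汉a"

# DBCS world: the Chinese character is two bytes, the letter one.
print("gbk:      ", char_and_byte_counts(sample, "gbk"))
# Two-byte UNICODE world: every character is two bytes.
print("utf-16-be:", char_and_byte_counts(sample, "utf-16-be"))
# The half-width letter keeps its ASCII value; the high byte is 0.
print("'a' as two bytes:", "a".encode("utf-16-be").hex(" "))
```

The character count stays the same in both cases; only the byte count depends on the encoding.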
In the days when many character sets coexisted, companies that made multilingual software ran into a lot of trouble. To sell the same software in different countries, they had to recite the double-byte character set mantra whenever they localized it, taking care everywhere not to make mistakes and converting the text in the software back and forth among different character sets. UNICODE was a very good all-in-one solution for them, so starting with Windows NT, MS took the opportunity to overhaul their operating system and changed all the core code to versions that work in UNICODE. From then on, WINDOWS could finally display the characters of every culture in the world without installing assorted native-language systems.
However, apart from keeping the ASCII code values, UNICODE was drawn up without considering compatibility with any existing encoding scheme, so the code layout of Chinese characters in GBK is completely different from the layout in UNICODE. No simple arithmetic can convert text between UNICODE and another encoding; the conversion has to go through a lookup table.
As mentioned earlier, UNICODE represents one character with two bytes, so it can combine 65,536 different values in total, which can probably cover the symbols of all the world's cultures. If that is still not enough, never mind: ISO has prepared the UCS-4 scheme, which simply uses four bytes per character, so that we can combine about 2.1 billion different characters (the highest bit has another use). That should last until the day the Milky Way Federation is founded!
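A quick check of the two claims above: there is no fixed arithmetic offset between GBK codes and UNICODE code points, and two bytes or 31 bits give the stated number of combinations. The sketch assumes Python's `gbk` codec as the lookup table; the helper names are made up for illustration.

```python
def gbk_value(ch: str) -> int:
    """GBK code of a character, read as one big-endian integer."""
    return int.from_bytes(ch.encode("gbk"), "big")

def offset(ch: str) -> int:
    """Difference between the GBK code and the UNICODE code point."""
    return gbk_value(ch) - ord(ch)

for ch in "一汉":
    print(ch, f"U+{ord(ch):04X}", f"GBK {gbk_value(ch):04X}", f"offset {offset(ch):#x}")

two_byte_states = 2 ** 16   # all combinations of 16 bits
ucs4_states = 2 ** 31       # four bytes with the highest bit unused
print(two_byte_states, ucs4_states)
```

Because the offset differs from character to character, a converter cannot add or subtract a constant; it needs a mapping table, which is what the codec carries internally.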
Along with UNICODE came the rise of computer networks, and how UNICODE should be transmitted over a network became a problem that had to be considered. So a number of transmission-oriented UTF (UCS Transfer Format) standards appeared. As the names suggest, UTF8 transmits data 8 bits at a time and UTF16 transmits 16 bits at a time. For the sake of reliable transmission, however, the mapping from UNICODE to UTF is not direct; it goes through certain algorithms and rules.
Computer monks who have been blessed with network programming all know that there is one very important issue when transmitting information over a network: the order in which the high and low bytes of data are interpreted. Some computers send the low byte first, such as our PCs with the INTEL architecture, while others send the high byte first. When exchanging data over a network, a very simple method is used to check that both sides agree on the byte order: a marker is sent at the start of the text stream. If the text that follows is high byte first, "FEFF" is sent; otherwise "FFFE" is sent. If you don't believe it, open a UTF-16 file in binary mode and see whether its first two bytes are these two bytes. (A UTF-8 file, if it carries the marker at all, starts with the three bytes EF BB BF instead.)
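The byte-order marker check can be sketched with the constants in Python's `codecs` module. The function name `detect_utf16_order` is hypothetical, and the sketch looks only at the two-byte UTF-16 markers named in the text.

```python
import codecs

def detect_utf16_order(data: bytes) -> str:
    """Guess the byte order from the first two bytes of a stream."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF: high byte first
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE: low byte first
        return "little-endian"
    return "unknown"

# The same character, sent in each byte order with its marker.
be_stream = codecs.BOM_UTF16_BE + "汉".encode("utf-16-be")
le_stream = codecs.BOM_UTF16_LE + "汉".encode("utf-16-le")

for stream in (be_stream, le_stream):
    print(stream.hex(" "), "->", detect_utf16_order(stream))
```

The marker costs two bytes per stream and lets the receiver decide how to pair up the bytes that follow.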
Here we should mention a famous strange phenomenon: when you create a new file in Windows Notepad, type the two characters "联通" (China Unicom), save, close, and open the file again, you will find that the two characters have vanished and been replaced by garbage! Ha ha, some people say this is why Unicom cannot keep up with China Mobile. In fact, it happens because of an encoding collision between GB2312 and UTF8.
Here is the conversion rule from UNICODE to UTF8, taken from the internet:

Unicode        UTF8
0000 - 007F    0xxxxxxx
0080 - 07FF    110xxxxx 10xxxxxx
0800 - FFFF    1110xxxx 10xxxxxx 10xxxxxx
For example, the Unicode code of the character "汉" is 6C49. 6C49 lies between 0800 and FFFF, so the three-byte template is used: 1110xxxx 10xxxxxx 10xxxxxx. Written in binary, 6C49 is 0110 1100 0100 1001.
Splitting this bit stream the way the three-byte template requires gives 0110 110001 001001. Substituting these groups for the x's in the template in order gives 1110-0110 10-110001 10-001001, that is, E6 B1 89, which is the UTF8 encoding of the character.
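The template substitution worked through above can be written as a few shifts and masks. This is a minimal sketch limited to the three ranges in the table (0000 to FFFF); the function name `utf8_encode_bmp` is made up, and it ignores details such as surrogate code points.

```python
def utf8_encode_bmp(code_point: int) -> bytes:
    """Encode a code point in 0000-FFFF using the three templates."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only covers 0000-FFFF")
    if code_point <= 0x7F:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0x3F),
        ])
    # 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([
        0b11100000 | (code_point >> 12),
        0b10000000 | ((code_point >> 6) & 0x3F),
        0b10000000 | (code_point & 0x3F),
    ])

encoded = utf8_encode_bmp(0x6C49)
print(encoded.hex(" "))
```

For 6C49 the three pieces are 0110, 110001, and 001001, which land in the three templates exactly as in the walkthrough.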
When you create a new text file, Notepad's default encoding is ANSI. If you type Chinese characters under the ANSI encoding, they are actually stored in the GB family of encodings. Under this encoding, the internal code of "联通" is:
C1 11000001
AA 10101010
CD 11001101
A8 10101000
Did you notice? The first and second bytes, and the third and fourth bytes, begin with "110" and "10" respectively, which happens to match the two-byte template in the UTF8 rules. So when Notepad opens the file again, it mistakenly takes it for a UTF8-encoded file. Strip the 110 from the first byte and the 10 from the second byte and we get "00001 101010"; align the bits and pad with leading zeros and we get "0000 0000 0110 1010". Sorry, but this is UNICODE 006A, the lowercase letter "j". The next two bytes decode under UTF8 to 0368, which does not display as any sensible character. That is why a file containing only the two characters "联通" cannot be displayed properly in Notepad.
If you type a few more characters besides "联通", the codes of the other characters will not necessarily also begin with 110 and 10. When the file is opened again, Notepad will then not insist that it is a UTF8-encoded file and will read it as ANSI, so no garbage appears.
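The collision can be reproduced bit by bit for "联通" (the two characters rendered as "China Unicom" in the text). This sketch illustrates the naive template check described above; it is not Notepad's actual detection code, and both helper names are hypothetical.

```python
def looks_like_two_byte_utf8(pair: bytes) -> bool:
    """True if the pair fits the template 110xxxxx 10xxxxxx."""
    return len(pair) == 2 and pair[0] >> 5 == 0b110 and pair[1] >> 6 == 0b10

def naive_two_byte_decode(pair: bytes) -> int:
    """Drop the 110 and 10 prefixes and join the remaining 11 bits."""
    return ((pair[0] & 0x1F) << 6) | (pair[1] & 0x3F)

raw = "联通".encode("gbk")      # the ANSI (GB) bytes saved in the file
pairs = [raw[0:2], raw[2:4]]

for pair in pairs:
    print(
        pair.hex(" "),
        "fits template:", looks_like_two_byte_utf8(pair),
        f"decodes to U+{naive_two_byte_decode(pair):04X}",
    )
```

Both byte pairs fit the two-byte template, and the naive decode yields 006A and 0368, matching the walkthrough. A strict decoder such as Python's own `utf-8` codec rejects these bytes, because a lead byte of C1 would encode a value that already fits in a single byte.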
Conversion between ANSI, Unicode, and UTF-8 strings, and writing text files
The ANSI string is the one we know best: an English character takes one byte, a Chinese character takes two bytes, and the string ends with a \0. It is commonly used in TXT text files.
