The story of bytes, Unicode, UTF-8, and ANSI, and how to convert between them

Topic: Unicode conversion  Date: 2021-04-12

Reposted from http://www.cppblog.com/AutomateProgram/archive/2010/03/26/110567.html, a fairly good write-up.

The story of Unicode, UTF-8, and ANSI

Original article: http://blog.csdn.net/iscandy/archive/2009/02/02/3859219.aspx

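A long, long time ago, a group of people decided to combine eight transistors that could be switched on and off into different states, in order to represent everything in the world. They saw that eight switch states were good, and so they called this a "byte".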

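Later they built machines that could process these bytes. The machines started up, bytes could be combined into many states, and the states began to change back and forth. They saw that this was good, and so they called the machine a "computer".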

At first, computers were used only in the United States. An eight-bit byte can take 256 (2 to the 8th power) different states.

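They gave special purposes to the 32 states numbered from 0: whenever a terminal or printer received one of these agreed bytes, it had to perform an agreed action. On 0x0A the terminal started a new line; on 0x07 the terminal beeped at people; on 0x1B, for example, the printer printed in reverse video or the terminal displayed letters in color. They saw that this was good, and so they called the byte states below 0x20 "control codes".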

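They then assigned consecutive byte states to the space, the punctuation marks, the digits, and the upper- and lower-case letters, numbering them up to 127, so that computers could store English text using different bytes. Everyone thought this was good, and so they called the scheme the "ASCII" encoding (American Standard Code for Information Interchange). At that time every computer in the world used the same ASCII scheme to store English text.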

后来就像建造巴比伦塔一样世界各地的都开始使用计算机但是很多国家用的不是英文他们的字母里有许多是ASCI I里没有的为了可以在计算机保存他们的文字他们决定采用127号之后的空位来表示这些新的字母、符号还加入了很多画表格时需要用下到的横线、竖线、交叉等形状一直把序号编到了最后一个状态255。从128到255这一页的字符集被称”扩展字符集” 。从此之后贪婪的人类再没有新的状态可以用了美帝国主义可能没有想到还有第三世界国家的人们也希望可以用到计算机吧

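By the time the Chinese got computers, there were no byte states left to represent Chinese characters, and there were more than 6,000 commonly used characters to store. But this could not defeat the ingenious Chinese people. We unceremoniously threw out the odd symbols after 127 and made a rule: a byte below 127 keeps its original meaning, but two bytes above 127 placed together represent one Chinese character. The first byte (called the high byte) runs from 0xA1 to 0xF7, and the second byte (the low byte) runs from 0xA1 to 0xFE, which gives us room for about 7,000 simplified Chinese characters. Into these codes we also put mathematical symbols, Roman and Greek letters, and Japanese kana. Even the digits, punctuation marks, and letters that ASCII already had were all given new two-byte codes. These are the so-called "full-width" characters, while the original ones below 127 are called "half-width" characters.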

The Chinese people saw that this was good, and so they called this Chinese character scheme "GB2312". GB2312 is a Chinese extension of ASCII.
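
As a rough illustration of the byte ranges above, the following Python sketch encodes a Chinese character, a half-width letter, and a full-width letter with Python's built-in `gb2312` codec. The helper `gb2312_bytes` and the sample characters are choices made for this example, not part of the original article.

```python
# Show how GB2312 stores half-width and full-width characters.
# gb2312_bytes is a hypothetical helper for this sketch.
def gb2312_bytes(ch: str) -> bytes:
    return ch.encode("gb2312")

samples = ["汉", "A", "Ａ"]   # Chinese character, half-width A, full-width A
for ch in samples:
    raw = gb2312_bytes(ch)
    hex_bytes = " ".join(f"{b:02X}" for b in raw)
    print(repr(ch), hex_bytes, len(raw), "byte(s)")

# Both bytes of a two-byte GB2312 character lie in the range 0xA1..0xFE.
han = gb2312_bytes("汉")
both_high = all(0xA1 <= b <= 0xFE for b in han)
print(both_high)
```

The half-width letter keeps its one-byte ASCII code, while the full-width letter gets a separate two-byte code, which is exactly the duplication the story describes.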

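But there are far too many Chinese characters, and we soon found that many people's names could not be typed with this scheme, especially the names of certain national leaders who are very good at giving other people trouble.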

So we had to keep looking for the code positions that GB2312 had left unused and, without standing on ceremony, put them to use.

Later even that was not enough, so the requirement that the low byte also be a code above 127 was simply dropped: whenever the first byte is greater than 127, it marks the start of a Chinese character, whether or not the byte that follows falls in the extended character set. The resulting extended scheme is called the GBK standard. GBK contains everything in GB2312 and adds nearly 20,000 new Chinese characters (including traditional characters) and symbols.
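
The difference between the two standards can be probed from Python, whose standard library ships both a `gb2312` and a `gbk` codec. The sketch below is an illustration only: `encodable` is a hypothetical helper, and the character 镕 is chosen for this example as one that appears in names but is absent from GB2312.

```python
# Probe which standard can represent a given character.
# encodable is a hypothetical helper for this sketch.
def encodable(ch: str, encoding: str) -> bool:
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

name_char = "镕"   # example character assumed for this sketch
print(encodable(name_char, "gb2312"))   # GB2312 has no code for it
print(encodable(name_char, "gbk"))      # GBK does

raw = name_char.encode("gbk")
print(" ".join(f"{b:02X}" for b in raw))

# GBK is a superset: characters already in GB2312 keep the same bytes.
same_bytes = "汉".encode("gbk") == "汉".encode("gb2312")
print(same_bytes)
```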

Later still, ethnic minorities also began to use computers, so the standard was extended again with several thousand characters from minority scripts, and GBK grew into GB18030. From then on, the culture of the Chinese nation could be handed down in the computer age.

Chinese programmers saw that this series of Chinese encoding standards was good, and so they called them "DBCS" (Double Byte Character Set). The most distinctive feature of the DBCS family is that two-byte Chinese characters and one-byte English characters coexist in the same encoding scheme. A program that handles Chinese therefore has to examine the value of every byte in a string: if the value is greater than 127, a double-byte character has begun. In those days, every programmer who had been blessed with the ability to program had to recite the following mantra hundreds of times a day:

"One Chinese character counts as two English characters! One Chinese character counts as two English characters..."

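The byte-by-byte check described above can be sketched as follows. This is a minimal illustration under the rule stated in the text (a byte above 127 starts a two-byte character); the function name `dbcs_char_count` is made up for this example, and the sketch does not validate the second byte.

```python
# Count characters in a DBCS byte string by walking it byte by byte.
# dbcs_char_count is a hypothetical helper for this sketch.
def dbcs_char_count(data: bytes) -> int:
    count = 0
    i = 0
    while i < len(data):
        if data[i] > 127:
            i += 2      # lead byte of a double-byte character
        else:
            i += 1      # plain one-byte ASCII character
        count += 1
    return count

raw = "中文abc".encode("gbk")
print(len(raw))               # bytes: 2 + 2 + 1 + 1 + 1
print(dbcs_char_count(raw))   # characters
```

A plain byte count, which is what a C-style `strlen` reports, gives 7 here, while the walk gives 5 characters.
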
In those days every country, like China, came up with its own encoding standard. As a result, nobody understood anybody else's encoding and nobody supported anybody else's encoding. Even mainland China and Taiwan, brothers separated by only 150 nautical miles and using the same language, adopted different DBCS schemes. Back then, a Chinese user who wanted the computer to display Chinese characters had to install a "Chinese character system" dedicated to displaying and entering them. But a fortune-telling program written by some superstitious soul in Taiwan required yet another Chinese system, one that supported the BIG5 encoding, and if you installed the wrong character system the display turned into chaos. What was to be done? And what about the poor peoples of the world who could not yet use computers at all, and their writing? Truly a Tower of Babel for computers!
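
The chaos is easy to reproduce: take bytes written under one DBCS scheme and read them under another. The sketch below uses Python's built-in `gbk` and `big5` codecs; the sample text and variable names are assumptions made for this example.

```python
# Bytes written under one DBCS scheme, read back under another.
original = "汉字"
raw = original.encode("gbk")

# Reading with the right codec restores the text.
roundtrip = raw.decode("gbk")

# Reading the same bytes as BIG5 produces different characters
# (or replacement marks where the bytes are not valid BIG5).
misread = raw.decode("big5", errors="replace")

print(roundtrip)
print(misread)
print(misread == original)
```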

Just then the archangel Gabriel appeared in the nick of time: an international organization called ISO (International Organization for Standardization) decided to tackle the problem. Their approach was simple: abolish all regional encoding schemes and create a new encoding that includes every culture, every letter, and every symbol on earth. They planned to call it the "Universal Multiple-Octet Coded Character Set", or UCS for short, commonly known as "UNICODE".

By the time UNICODE was being drawn up, computer memory capacity had grown enormously, and space was no longer a problem. So ISO simply stipulated that every character be represented uniformly with two bytes, that is, 16 bits. For the "half-width" characters in ASCII, UNICODE kept the original code values unchanged but widened them from 8 bits to 16 bits, while the characters of all other cultures and languages were encoded afresh. Because half-width English symbols need only the low 8 bits, the high 8 bits are always 0, so this generous scheme doubles the space needed to store English text.

At this point, programmers from the old society began to notice a strange phenomenon: their strlen function was no longer reliable, and a Chinese character no longer counted as two characters but as one! Yes, from UNICODE onward, a half-width English letter and a full-width Chinese character are both simply "one character", and both are also "two bytes". Please note the difference between the terms "character" and "byte". A "byte" is a physical storage unit of 8 bits, while a "character" is a culturally meaningful symbol. In UNICODE, one character is two bytes. The era in which one Chinese character counted as two English characters was nearly over.
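
The character/byte distinction can be seen directly in Python, where `len` of a string counts characters and `len` of encoded bytes counts bytes. The sketch below stands in for the two-byte UNICODE form with Python's `utf-16-be` codec, which is an assumption of this example; for the characters used here it produces exactly two bytes per character.

```python
# Characters versus bytes under a two-byte-per-character encoding.
text = "A汉"
ucs2 = text.encode("utf-16-be")   # stand-in for the two-byte form

print(len(text))     # characters
print(len(ucs2))     # bytes
print(ucs2.hex())    # 00416c49: the high byte of "A" is 0

# Under the older DBCS scheme the same text takes 1 + 2 bytes.
dbcs = text.encode("gbk")
print(len(dbcs))
```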

While multiple character sets coexisted, companies that made multilingual software ran into a great deal of trouble. To sell the same software in different countries, they too had to recite the double-byte character set mantra when localizing it. They not only had to take care everywhere not to get it wrong, but also had to convert the text in their software back and forth between character sets. UNICODE was an excellent all-in-one solution for them. Starting with Windows NT, Microsoft took the opportunity to rework its operating system so that all the core code used UNICODE. From then on, Windows could display the characters of every culture in the world without any native-language add-on system installed.

However, UNICODE was not designed to be compatible with any existing encoding scheme, so GBK and UNICODE lay out the Chinese characters in completely different orders. No simple arithmetic can convert text between UNICODE and another encoding; the conversion has to go through a lookup table.

As mentioned earlier, UNICODE uses two bytes for one character, which gives 65,536 different code values in total, probably enough to cover the symbols of every culture in the world. And if that is not enough, never mind: ISO has also prepared a UCS-4 scheme, which put simply uses four bytes for one character. That gives about 2.1 billion different characters (the most significant bit has other uses), which should last until the day the Galactic Federation is founded!
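
To see why no simple arithmetic links the two layouts, compare two characters that sit next to each other in GB2312. The sketch below is illustrative only; the two byte pairs are sample values chosen for this example, and `utf-32-be` is used as a stand-in for the four-byte UCS-4 form.

```python
# Two adjacent GB2312 codes do not map to adjacent UNICODE code points.
pair = [bytes([0xB0, 0xA1]), bytes([0xB0, 0xA2])]
chars = [b.decode("gb2312") for b in pair]
code_points = [ord(c) for c in chars]

print(chars)
print([hex(cp) for cp in code_points])

gap = code_points[1] - code_points[0]
print(gap)   # not 1, so the codec needs a mapping table

# Capacity of the two-byte and four-byte schemes.
print(2 ** 16)   # code values available in two bytes
print(2 ** 31)   # four bytes with the top bit reserved

# The four-byte form of a character, via Python's utf-32-be codec.
print("汉".encode("utf-32-be").hex())
```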

UNICODE arrived together with the rise of computer networks, and how UNICODE should be transmitted over a network was also a question that had to be answered. So a number of transmission-oriented UTF (UCS Transfer Format) standards appeared. As the names suggest, UTF8 transmits data 8 bits at a time and UTF16 transmits it 16 bits at a time. For the sake of reliable transmission, however, the mapping from UNICODE to UTF is not direct; it goes through certain algorithms and rules.

Anyone who has been blessed with experience in network programming knows that one very important issue in transmitting information over a network is how to interpret the high and low bytes of the data. Some computers send the low byte first, such as our PCs with their INTEL architecture, while others send the high byte first. When exchanging data over a network, a very simple method is used to check whether both sides agree on byte order: at the start of the text stream the sender transmits a marker. If the text that follows is high byte first, it sends "FEFF"; otherwise it sends "FFFE". If you do not believe it, open a UTF16 file in binary form and see whether the first two bytes are one of these pairs.
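
The marker can be inspected with Python's standard `codecs` module, which exposes both byte orders as constants. The helper `sniff_utf16_order` below is a hypothetical name for this sketch, and it only looks at the first two bytes.

```python
import codecs

# The two possible markers at the start of a UTF-16 stream.
print(codecs.BOM_UTF16_BE.hex())   # high byte first
print(codecs.BOM_UTF16_LE.hex())   # low byte first

# sniff_utf16_order is a hypothetical helper for this sketch.
def sniff_utf16_order(data: bytes) -> str:
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little-endian"
    return "unknown"

be = codecs.BOM_UTF16_BE + "汉".encode("utf-16-be")
le = codecs.BOM_UTF16_LE + "汉".encode("utf-16-le")

print(be.hex(), sniff_utf16_order(be))
print(le.hex(), sniff_utf16_order(le))

# Python's generic utf-16 decoder reads the marker and picks the order.
print(be.decode("utf-16") == le.decode("utf-16"))
```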

Here we should mention a famous oddity. Create a new file in Windows Notepad, type the two characters "联通" (Unicom), save, close, and open the file again: the two characters have vanished and been replaced by garbage! Ha, some people say this is why Unicom cannot beat China Mobile. In fact, it happens because the GB2312 encoding and the UTF8 encoding collide.

Here is a conversion rule from UNICODE to UTF8, taken from the internet:

Unicode (hex)    UTF-8 (binary)
0000 - 007F      0xxxxxxx
0080 - 07FF      110xxxxx 10xxxxxx
0800 - FFFF      1110xxxx 10xxxxxx 10xxxxxx

For example, the Unicode code of the character "汉" is 6C49. 6C49 lies between 0800 and FFFF, so the three-byte template applies: 1110xxxx 10xxxxxx 10xxxxxx. Written in binary, 6C49 is:

0110 1100 0100 1001

Following the three-byte template, this bit stream is split into 0110 110001 001001, and the pieces replace the x's in the template in order, giving 1110-0110 10-110001 10-001001, that is, E6 B1 89. This is the UTF8 encoding of the character.
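
The three templates translate almost line for line into bit operations. The function below is a sketch written for this example (the name `utf8_encode_bmp` is made up); it covers only the three ranges in the table and does not reject the surrogate range that a full encoder would refuse.

```python
# Encode one code point in 0000..FFFF using the three templates above.
# utf8_encode_bmp is a hypothetical helper for this sketch.
def utf8_encode_bmp(cp: int) -> bytes:
    if not 0 <= cp <= 0xFFFF:
        raise ValueError("only 0000..FFFF is handled in this sketch")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    # 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xE0 | (cp >> 12),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])

encoded = utf8_encode_bmp(0x6C49)
print(encoded.hex())                       # e6b189
print(encoded == "汉".encode("utf-8"))     # agrees with the built-in codec
```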

When you create a new text file in Notepad, the default encoding is ANSI. If you type Chinese characters under the ANSI encoding, they are actually stored in the GB family of encodings. Under that encoding, the bytes of "联通" are:

C1    1100 0001
AA    1010 1010
CD    1100 1101
A8    1010 1000

Did you notice? The first and second bytes, and likewise the third and fourth bytes, begin with "110" and "10" respectively, which happens to match the two-byte template in the UTF8 rules. So when Notepad opens the file again, it mistakenly takes it for a UTF8-encoded file. Strip the 110 from the first byte and the 10 from the second byte and we get "00001 101010"; align the bits and pad with leading zeros and we get "0000 0000 0110 1010". Embarrassingly, this is UNICODE 006A, the lowercase letter "j". The next two bytes decode under UTF8 to 0368, which is not a character anyone would expect to see. That is why a file containing only the two characters "联通" cannot be displayed correctly in Notepad.

If the file contains other characters besides "联通", their encodings are unlikely to also start with 110 and 10, so when the file is reopened Notepad will not insist that it is UTF8. It reads the file as ANSI, and no garbled text appears.
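
The collision can be replayed in Python. The sketch below does not claim to reproduce Notepad's actual detection logic; `looks_like_two_byte_utf8` and `naive_decode_pair` are hypothetical helpers that apply only the bit-pattern check and the template arithmetic described in the text. Python's own strict UTF-8 decoder rejects these bytes, which is why the decoding is done by hand.

```python
raw = "联通".encode("gbk")
print(" ".join(f"{b:02X}" for b in raw))   # C1 AA CD A8

# Hypothetical check: does every byte pair fit 110xxxxx 10xxxxxx?
def looks_like_two_byte_utf8(data: bytes) -> bool:
    if len(data) == 0 or len(data) % 2 != 0:
        return False
    return all(
        (data[i] & 0xE0) == 0xC0 and (data[i + 1] & 0xC0) == 0x80
        for i in range(0, len(data), 2)
    )

# Hypothetical decoder: strip 110 and 10, then join the remaining bits.
def naive_decode_pair(lead: int, trail: int) -> int:
    return ((lead & 0x1F) << 6) | (trail & 0x3F)

# A strict decoder refuses these bytes, so wrap the attempt.
def strict_utf8_ok(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_two_byte_utf8(raw))
print(hex(naive_decode_pair(raw[0], raw[1])))   # 0x6a, the letter j
print(hex(naive_decode_pair(raw[2], raw[3])))   # 0x368
print(strict_utf8_ok(raw))

# With more characters in the file, the pattern no longer fits.
longer = "中国联通".encode("gbk")
print(looks_like_two_byte_utf8(longer))
```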

Interconversion:

Original article: http://club.topsage.com/thread-670150-1-1.html

Converting between ANSI, Unicode, and UTF-8 strings, and writing text files

ANSI strings are the kind we know best: an English character takes one byte, a Chinese character takes two bytes, and the string ends with \0. They are commonly used in TXT text files.
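
As a rough sketch of the three-way conversion and of writing a text file, the Python example below goes through the language's `str` type, which plays the role of the Unicode string. Several things here are assumptions of this example rather than the original article's code: "ANSI" is taken to mean the GBK code page of a Simplified Chinese Windows system, the helper names are made up, and the file is written with a UTF-8 marker via the `utf-8-sig` codec. Python strings carry their own length, so there is no trailing \0.

```python
import os
import tempfile

# Hypothetical helpers; "ANSI" is assumed to be the GBK code page.
def ansi_to_unicode(data: bytes, codepage: str = "gbk") -> str:
    return data.decode(codepage)

def unicode_to_utf8(text: str) -> bytes:
    return text.encode("utf-8")

def utf8_to_ansi(data: bytes, codepage: str = "gbk") -> bytes:
    return data.decode("utf-8").encode(codepage)

text = "汉字abc"
ansi = text.encode("gbk")
utf8 = unicode_to_utf8(ansi_to_unicode(ansi))
back = utf8_to_ansi(utf8)

print(len(ansi), len(utf8))   # 2+2+3 bytes versus 3+3+3 bytes
print(back == ansi)

# Write the text as UTF-8 with a leading marker, then read the raw bytes.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo.txt")
    with open(path, "w", encoding="utf-8-sig") as f:
        f.write(text)
    with open(path, "rb") as f:
        on_disk = f.read()

print(on_disk[:3].hex())      # efbbbf, the UTF-8 marker
```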
