字节unicode、utf-8、ansi的故事及其相互转换(The story of Unicode, UTF-8 and ANSI and their mutual transformation)

unicode转换  时间:2021-04-12  阅读:()

unicode、 utf-8、 ansi的故事及其相互转换The story of Unicode,

UTF-8 and ANSI and their mutual transformation由HTTP / /www.cppb log。 COM/Automat eProgram/存档/ 2010 /03 / 26 / 110567。 H TML收藏 比较好。

、 、 ANSI的故事Un icode UTF-8

原文地址http://blog.csdn.net/iscandy/archive/2009/02/02/3859219.aspx

很久很久以前有一群人他们决定用8个可以开合的晶体管来组合成不同的状态以表示世界上的万物。他们看到8个开关状态是好的于是他们把这称为”字节” 。

再后来他们又做了一些可以处理这些字节的机器机器开动了可以用字节来组合出很多状态状态开始变来变去他们看到这样是好的于是它们就这机器称为”计算机” 。

开始计算机只在美国用。八位的字节一共可以组合出256 2的8次方种不同的状态。

他们把其中的编号从0开始的32种状态分别规定了特殊的用途一但终端、打印机遇上约定好的这些字节被传过来时就要做一些约定的动作。遇上00x10终端就换行遇上0x07终端就向人们嘟嘟叫例好遇上0x1B打印机就打印反白的字或者终端就用彩色显示字母。他们看到这样很好于是就把这些0x20以下的字节状态称为”控制码” 。

他们又把所有的空格、标点符号、数字、大小写字母分别用连续的字

节状态表示一直编到了第127号这样计算机就可以用不同字节来存储英语的文字了。大家看到这样都感觉很好于是大家都把这个方案叫做的“ASCII”编码ANSI 美标信息交换码美国信息互换标准代码。当时世界上所有的计算机都用同样的ASCI I方案来保存英文文字。

后来就像建造巴比伦塔一样世界各地的都开始使用计算机但是很多国家用的不是英文他们的字母里有许多是ASCI I里没有的为了可以在计算机保存他们的文字他们决定采用127号之后的空位来表示这些新的字母、符号还加入了很多画表格时需要用下到的横线、竖线、交叉等形状一直把序号编到了最后一个状态255。从128到255这一页的字符集被称”扩展字符集” 。从此之后贪婪的人类再没有新的状态可以用了美帝国主义可能没有想到还有第三世界国家的人们也希望可以用到计算机吧

等中国人们得到计算机时 已经没有可以利用的字节状态来表示汉字况且有6000多个常用汉字需要保存呢。但是这难不倒智慧的中国人民我们不客气地把那些127号之后的奇异符号们直接取消掉规定一个小于127的字符的意义与原来相同但两个大于127的字符连在一起时就表示一个汉字前面的一个字节他称之为高字节从0xA1用到0xf 7后面一个字节低字节从0xA1到0xfe这样我们就可以组合出大约7000多个简体汉字了。在这些编码里我们还把数学符号、罗马希腊的字母、 日文的假名们都编进去了连在ASCII里本来就有的数字、标点、字母都统统重新编了两个字节长的编码这就是常说的”全角”字符而原来在127号以下的那些就叫”半角”字符了。

中国人民看到这样很不错于是就把这种汉字方案叫做“GB2312” 。GB2312是对ASCII的中文扩展。

但是中国的汉字太多了我们很快就就发现有许多人的人名没有办法

在这里打出来特别是某些很会麻烦别人的国家领导人。

So we have to continue to find the code bits that GB2312 didn'tuse and use it honestly and politely.

Later, or not enough, so I must be no longer required low bytecode number 127, if the first byte is greater than 127 is fixedto indicate that this is a Chinese characters start, no matteris not followed by the contents of the extended character set.As a result, the extended coding scheme is called the GBKstandard, and GBK includes all the content of GB2312, whileadding nearly 20000 new Chinese characters (includingtraditional characters) and symbols.

Later, minorities also used computers, so we expanded and addedthousands of new minority words, and GBK expanded into GB18030.From then on, the culture of the Chinese nation can be handeddown in the computer age.

Chinese programmers see this series of Chinese coding standardsas good, so they are commonly known as "DBCS" (Double, Byte,Charecter, Set, double byte character sets) . In the DBCS seriesof standards, the biggest feature is the word of a long andEnglish character character Chinese characters one byte longcoexisted in the same set of encoding scheme, so they writeprograms to support Chinese treatment, attention must be paidto the value of each byte in a string, if this value is greaterthan 127, then as a double byte character set character appears.At that time, computer monks who had been blessed and programmedcould read the mantra hundreds of times each day:

"A Chinese character is counted in two English characters! AChinese character is counted in two English characters. . . . . ."

Because when countries like Chinese that produce a set of theirown encoding standard, the other who do not know who theencoding, who do not support the other encoding, the mainlandand Taiwan that even separated by only 150 nautical miles, usingthe same language brother, also uses DBCS encoding at the timethe Chinese want the computer to display Chinese characters ofdifferent programs, you must install a "Chinese characterssystem", dedicated to the display and input processing Chinesecharacters, but the Taiwan people write the fortune tellingfeudal ignorance must be added another set of support BIG5encoding what "the system can only be used Chinese characters"with the wrong character, system will show chaos! What aboutthis?And in the world' s forests, there are those poor peoplewho can not use computers for a while. What about their words?What a computer tower of Babylon!

Just then, Archangel Gabriel appeared in time, an internationalorganization called ISO (International Organization forStandardization) , which decided to solve the problem. Themethods they used were simple: they removed all regional codingschemes, and re coded a code that included all the cultures ofthe earth and all letters and symbols! They plan to call it"Universal Multiple-Octet Coded Character Set", or UCS forshort, commonly known as UNICODE".

When UNICODE began to formulate, the memory capacity of the

computer was greatly developed, and space no longer became aproblem. So ISO must be stipulated directly by two bytes, or16 bits to unify all characters, for those "half" charactersin ASCII, UNICODE to the original encoding unchanged, but itslength by 8 extensions of the original 16, and other culturaland linguistic characters are all the re unification ofencoding. Because of the "half" English symbols only need touse the low 8 bits, so the high 8 bits will always be 0, so theair scheme will waste a times in the preservation of Englishtext space.

At this time, programmers from the old society began to finda strange phenomenon: their strlen function is unreliable, aChinese character is no longer equivalent to two characters,but one! Yes, from the beginning of UNICODE, both semiangleEnglish letters, or the whole Chinese characters is a character,they are unified""! At the same time, it is also a uniform"twobytes". Please pay attention to the difference between the twoterms of "character" and "byte". "Byte" is a physical storageunit of 8 bits,

And "character" is a cultural symbol. In UNICODE, a characteris two bytes. An era where Chinese characters are counted intwo English characters is almost over.

Once upon a time when there are multiple character sets, themulti language software company encountered a lot of trouble,they are in different countries in order to sell the same setof software to the regional software but also to bless thedouble byte character set spell, not only to be careful not tomistake, but also the software text focused around to the

different characters. UNICODE is a very good package solutionfor them, and from Windows NT, MS took the opportunity to changeover their operating system, the core code all changed to workwith UNICODE version, from the beginning, no need to installthe WINDOWS system finally a variety of native language system,it can display the whole world all cultural character.However, UNICODE did not consider maintaining compatibilitywith any existing encoding scheme in the formulation, whichmakes the GBK and UNICODE in Chinese characters code layout isnot the same, not a simple arithmetic methods can send textcontent from UNICODE encoding and another encoding forconversion, the conversion must be through the look-up table.As mentioned earlier, UNICODE is represented by two bytes asa single character, and he can combine 65535 differentcharacters in a way that can cover all the symbols of allcultures in the world. If it is not also Never mind, ISO hasprepared a UCS-4 program, the simple answer is four bytes ofa character, so that we can combine 2 billion 100 milliondifferent characters out (MSB has other uses) , it can be usedfor the establishment of the Milky Way that day!

UNICODE came together, there came the rise of computer network,how UNICODE network transmission is also a problem that mustbe considered, so many UTF transmission oriented (UCS Transferformat) standard, as the name suggests, UTF8 is each of the 8bit data transmission, while the UTF16 is 16 each time, but inorder to reliability of transmission, from UNICODE to UTF whenthe correspondence is not directly, but to some algorithms andrules to convert.

Computer network programming by the monks blessing all know,there is a very important issue to transmit information in thenetwork, is for the interpretation of data of high and low, somecomputer is a method of using low first send, such as our PCmachine adopts INTEL architecture, while others are using highfirst sent way and exchange of data in the network, in orderto check whether they understand for high and low is the same,using a very simple method, is sent to each other at thebeginning of the text stream when a symbol is high - text ifit is sent in "FEFF", on the other hand, is sent "FFFE". No,you can open a UTF-X file in binary form to see if the firsttwo bytes are the two bytes

Here, we mention a strange phenomenon is very famous: when youcreate a Windows file in Notepad, enter the "China Unicom" twowords, save, close, and then open again, you will find thatthese two words have disappeared, replaced by a garbled! Ha ha,some people say that this is the reason why Unicom can not move.In fact, this is because GB2312 coding and UTF8 coding haveproduced coding collisions.

Draw a conversion rule from UNICODE to UTF8 from the internet:Unicode

UTF-8

0000 - 007F

0xxxxxxx

0080 - 07FF

110xxxxx 10xxxxxx

0800 - FFF F

1110xxxx 10xxxxxx 10xxxxxx

For example, the Unicode encoding for the word "Han" is 6C49.6C49 is between 0800-FFFF, so use the 3 byte template: 1110xxxx,10xxxxxx, 10xxxxxx. Writing 6C49 in binary is:

0110110001001001,

This bit stream is divided into 0110110001001001 by the threebyte template segmentation method, instead of the X in thetemplate, to get: 1110-0110 10-110001 10-001001, or E6 B1 89,which is the encoding of its UTF8.

When you create a new text file, the default encoding is ANSINotepad, if you input Chinese characters encoding ANSI, so heis actually a series of GB encoding, the encoding, the code is"China Unicom":

C1 11000001

AA 10101010

CD 11001101

A8 10101000

Did you notice that? The first two bytes, three or four bytesin the initial part of all is "110" and "10", coinciding withthe UTF8 rules in the two byte template is the same, so onceagain openNotepad, Notepadmistakenly think that this is aUTF8encoding file, let us take the first word of the 110 day andsecond bytes of 10 removed, we get "00001101010", then you fillthe alignment, leading 0, was "0000000001101010", this is theUNICODE 006A feel shy, that is, the lowercase letter "J", andthen the two bytes after UTF8 decoding is 0368, the what is thecharacter. This is the only "Unicom" two words of the document,there is no way to show in Notepad normal reasons.

If you input multiple words in the "China Unicom", the otherword encoding is not necessarily also happens to be 110 and 10bytes starting, it is opened again, Notepad would not insiston this is a utf8 encoding the file, and will use the ANSI wayof reading, then does not appear garbled.

Interconversion:

Original address:http://club.topsage.com/thread-670150-1-1.html

Conversion between Ansi, Unicode, UTF-8 strings, and writingtext files

Ansi string we are most familiar with, English accounted forone byte, Chinese characters 2 bytes, endingwith a\0, commonlyused in TXT text files

弘速云(28元/月)香港葵湾2核2G10M云服务器

弘速云怎么样?弘速云是创建于2021年的品牌,运营该品牌的公司HOSU LIMITED(中文名称弘速科技有限公司)公司成立于2021年国内公司注册于2019年。HOSU LIMITED主要从事出售香港vps、美国VPS、香港独立服务器、香港站群服务器等,目前在售VPS线路有CN2+BGP、CN2 GIA,该公司旗下产品均采用KVM虚拟化架构。可联系商家代安装iso系统,目前推出全场vps新开7折,...

妮妮云(119元/季)日本CN2 2核2G 30M 119元/季

妮妮云的知名度应该也不用多介绍了,妮妮云旗下的云产品提供商,相比起他家其他的产品,云产品还是非常良心的,经常出了一些优惠活动,前段时间的八折活动推出了很多优质产品,近期商家秒杀活动又上线了,秒杀产品比较全面,除了ECS和轻量云,还有一些免费空间、增值代购、云数据库等,如果你是刚入行安稳做站的朋友,可以先入手一个119/元季付的ECS来起步,非常稳定。官网地址:www.niniyun.com活动专区...

无忧云( 9.9元/首月),河南洛阳BGP 2核 2G,大连BGP线路 20G高防 ,

无忧云怎么样?无忧云服务器好不好?无忧云值不值得购买?无忧云,无忧云是一家成立于2017年的老牌商家旗下的服务器销售品牌,现由深圳市云上无忧网络科技有限公司运营,是正规持证IDC/ISP/IRCS商家,自营有国内雅安高防、洛阳BGP企业线路、香港CN2线路、国外服务器产品等,非常适合需要稳定的线路的用户,如游戏、企业建站业务需求和各种负载较高的项目,同时还有自营的高性能、高配置的BGP线路高防物理...

unicode转换为你推荐
莲都区招投标中心办公场所地址变更公告datehttpflashftpFLASHFXP怎么用有没有详细的说明??asp.net空间谁知道免费的ASP空间重庆杨家坪猪肉摊主杀人重庆一市民发现买的新鲜猪肉晚上发蓝光.专家解释,猪肉中含磷较多且携带了一种能发光的细菌--磷光杆菌时美要求解锁iPhone如何看美版苹果是有锁无锁2828商机网28商机网适合年轻人做的项目??pintang深圳御品堂怎么才能保证他们卖的东西都是有机食品?图文模块图文模块的标题栏填什么啊?建站无忧求好点的免费建站网
域名解析 php空间租用 budgetvm naning9韩国官网 http500内部服务器错误 轻博 网站实时监控 debian7 中国智能物流骨干网 架设服务器 怎么测试下载速度 php空间推荐 电信虚拟主机 ca187 腾讯总部在哪 域名与空间 网站加速软件 西安服务器托管 沈阳主机托管 linode支付宝 更多