Character encoding knowledge: how are Unicode, UTF-8, ASCII, GB2312, and other encodings converted between one another?

Unicode conversion  Date: 2021-04-12


Character encoding is a cornerstone of computing. To really understand computers, you need to understand character encoding. Many people pay it little attention, but the terminology can be genuinely confusing, and anyone learning about computers will benefit from sorting it out. I picked up what I know about it gradually.

1. ASCII code

Inside a computer, all information is ultimately represented as binary values. Each binary digit (bit) has two states, 0 and 1, so eight bits can be combined into 256 states; such a group of eight bits is called a byte. In other words, one byte can represent 256 different states, each corresponding to one symbol, for 256 symbols in all, from 00000000 to 11111111.

In the 1960s, the United States developed a character encoding that fixed the relationship between English characters and binary values. It is called ASCII, and it is still in use today.

The ASCII code specifies 128 characters in total. For example, the space "SPACE" is 32 (binary 00100000), and the uppercase letter "A" is 65 (binary 01000001). These 128 symbols, which include 33 non-printable control characters (codes 0 to 31 and 127), use only the last 7 bits of a byte; the first bit is uniformly set to 0. For the full table, see this webpage: http://www.nengcha.com/code/ascii/all/
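The two code values quoted above can be checked with a short Python sketch. It relies only on the built-in ord() and format() functions; the variable names are made up for illustration.

```python
# Look up the ASCII code numbers of the space and of uppercase "A",
# and show each one as an 8-bit binary string.
space_code = ord(" ")
a_code = ord("A")
space_bits = format(space_code, "08b")
a_bits = format(a_code, "08b")

print(space_code, space_bits)
print(a_code, a_bits)

# Every ASCII code is below 128, so the first of the eight bits is always 0.
all_top_bit_zero = all(format(code, "08b")[0] == "0" for code in range(128))
print(all_top_bit_zero)
```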

2. Non-ASCII encodings

128 symbols are enough to encode English, but not enough for other languages. In French, for example, a letter can carry an accent mark above it, which ASCII cannot represent. So some European countries decided to use the unused highest bit of the byte to encode new symbols. For example, é in French is encoded as 130 (binary 10000010). In this way, the encodings used by these European countries can represent up to 256 symbols. But this creates a new problem. Different countries have different letters, so even though they all use 256 symbols, the letters those codes represent differ. For example, 130 stands for é in the French encoding, for the letter Gimel (ג) in the Hebrew encoding, and for yet another symbol in the Russian encoding. In all of these encodings, however, the symbols for 0-127 are the same; only the 128-255 range differs.
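As a rough illustration of this ambiguity, the sketch below decodes the single byte 130 with three legacy code pages that ship with Python. The choice of cp437, cp862, and cp866 is an assumption: the text does not say which French, Hebrew, and Russian encodings it has in mind, and these DOS code pages are simply ones in which the byte behaves as described.

```python
# Decode the same byte (130, i.e. 0x82) under three different legacy code pages.
# The code page names are assumptions chosen for illustration.
raw = bytes([130])
code_pages = ("cp437", "cp862", "cp866")  # Western European, Hebrew, Russian
decoded = {name: raw.decode(name) for name in code_pages}

for name, char in decoded.items():
    print(name, repr(char))

# Codes 0-127 agree everywhere: byte 65 is "A" in all three code pages.
shared = {bytes([65]).decode(name) for name in code_pages}
print(shared)
```

The same byte yields three different letters, while the 0-127 range decodes identically in all three.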

Asian languages use even more symbols; Chinese alone has as many as about 100 thousand characters. A single byte can represent only 256 symbols, which is certainly not enough, so more than one byte must be used per symbol. For example, the common encoding for simplified Chinese is GB2312, which uses two bytes to represent one Chinese character, so in theory it can represent up to 256 x 256 = 65536 symbols.
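A minimal sketch of the two-byte idea, using Python's built-in "gb2312" codec on the character 严 (the character used in the later examples); the variable names are illustrative only.

```python
# Encode one Chinese character with GB2312 and inspect the result.
text = "严"
gb_bytes = text.encode("gb2312")

print(len(gb_bytes))           # number of bytes used for one character
print(gb_bytes.hex().upper())  # the two bytes in hexadecimal

# Theoretical upper bound for a two-byte encoding.
upper_bound = 256 * 256
print(upper_bound)
```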

3. Unicode

As the previous section showed, many encoding methods exist in the world, and the same binary number can be interpreted as different symbols. Therefore, to open a text file you must know its encoding; if it is decoded with the wrong one, the text appears garbled. Why is email so often garbled? Because the sender and the receiver use different encodings. My personal understanding is this: when a text file is written on an English system, each character is mapped to a binary number by the English encoding table and saved. If the file is then sent to a user of a Russian system, what travels is only the binary stream. The Russian system decodes each number with its own encoding table, and because the two tables differ, the same byte may stand for one character in the English table and a different one in the Russian table. The result is garbled text.

How, then, are GB2312, the Japanese encodings, and other non-Unicode encodings displayed? Are they converted to Unicode through a conversion table (code page)?

Imagine a single encoding that includes every symbol in the world and gives each one a unique code; the garbling problem would then disappear. This is Unicode. As its name indicates, it is an encoding of all symbols. Unicode is of course a very large set, with room for more than 1,000,000 symbols. Each symbol has a different code, for example:

U+0639 stands for the Arabic letter Ain, U+0041 stands for the English capital letter A, and U+4E25 stands for the Chinese character 严. For the full symbol table, consult unicode.org or a dedicated table of Chinese characters.
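These three assignments can be looked up with Python's standard unicodedata module. This is only a quick check of the examples above, not a substitute for the tables at unicode.org.

```python
import unicodedata

# Map each code point mentioned above to its character and official name.
code_points = (0x0639, 0x0041, 0x4E25)
names = {cp: unicodedata.name(chr(cp)) for cp in code_points}

for cp, name in names.items():
    print(f"U+{cp:04X}", chr(cp), name)

# Going the other way: from a character back to its code point.
yan_code_point = ord("严")
print(hex(yan_code_point))
```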

4. The problem with Unicode

Note that Unicode is only a symbol set, a specification. It defines the binary code of each symbol, but it does not specify how that code should be stored on a computer.

For example, the Unicode code of the Chinese character 严 is hexadecimal 4E25, which in binary is a full 15 bits (100111000100101). In other words, this symbol needs at least 2 bytes. Larger symbols may need 3 or 4 bytes.
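The bit count above is easy to verify. The following sketch uses only Python integers; the variable names are illustrative.

```python
# Convert the code point 4E25 to binary and count the bits it needs.
code_point = 0x4E25
bits = bin(code_point)[2:]            # strip the "0b" prefix
bit_count = code_point.bit_length()
bytes_needed = (bit_count + 7) // 8   # round up to whole bytes

print(bits)
print(bit_count, bytes_needed)
```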

This raises two serious problems. The first is: how can Unicode be told apart from ASCII? How does a computer know that three bytes represent one symbol rather than three separate symbols? The second is that, as we already know, one byte is enough for an English letter. If Unicode required every symbol to be stored in three or four bytes, each English letter would have to be preceded by two or three bytes of zeros. That is a great waste of storage: a text file would become several times larger, which is unacceptable. The results were: 1) a variety of storage methods for Unicode emerged, that is, there are many different binary formats that can represent Unicode; 2) Unicode could not become widespread for a long time, until the advent of the internet.
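To make the storage cost concrete, here is a small sketch that compares one byte per letter with a fixed four bytes per symbol. It uses Python's "utf-32-be" codec as a stand-in for such a fixed-width scheme; that choice is an assumption for illustration, since the text does not name a specific format here.

```python
# Compare one-byte-per-letter storage with a fixed four-byte scheme.
text = "Hello"
one_byte_each = text.encode("ascii")
four_bytes_each = text.encode("utf-32-be")  # fixed width, no byte order mark

print(len(one_byte_each), len(four_bytes_each))

# The first letter "H" is padded with three zero bytes in the fixed-width form.
first_symbol = four_bytes_each[:4]
print(first_symbol.hex())
```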

5. UTF-8

With the spread of the internet, there was strong demand for a unified encoding. UTF-8 is the most widely used implementation of Unicode on the internet. Other implementations include UTF-16 and UTF-32, but they are rarely used on the internet. To repeat, the relationship is this: UTF-8 is one implementation of Unicode, and it specifies how characters are stored and transmitted. The most notable feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes to represent a symbol, and the byte length varies with the symbol.

The encoding rules for UTF-8 are simple, only two:

1) For a single-byte symbol, the first bit of the byte is set to 0, and the remaining 7 bits are the Unicode code of the symbol. So for English letters, the UTF-8 encoding is the same as the ASCII code.

2) For a symbol of n bytes (n > 1), the first n bits of the first byte are set to 1, bit n+1 is set to 0, and the first two bits of each following byte are set to 10. All remaining bits hold the Unicode code of the symbol.
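Read in reverse, these two rules let a decoder tell from the first byte alone how many bytes belong to one symbol. The sketch below shows one way this could be written; the function name utf8_sequence_length is hypothetical, and it checks only the leading bits, not the full validity of a sequence.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return how many bytes a UTF-8 symbol occupies, judged by its first byte."""
    if lead_byte & 0b10000000 == 0b00000000:   # 0xxxxxxx: single byte
        return 1
    if lead_byte & 0b11100000 == 0b11000000:   # 110xxxxx: two bytes
        return 2
    if lead_byte & 0b11110000 == 0b11100000:   # 1110xxxx: three bytes
        return 3
    if lead_byte & 0b11111000 == 0b11110000:   # 11110xxx: four bytes
        return 4
    # 10xxxxxx is a following byte, not a first byte; anything else is invalid.
    raise ValueError(f"not a valid first byte: {lead_byte:#04x}")


print(utf8_sequence_length(0x41))  # first byte of "A"
print(utf8_sequence_length(0xE4))  # first byte of a three-byte symbol
```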

The following table summarizes the encoding rules; the letter "x" marks the bits available for encoding.

Unicode symbol range | UTF-8 encoding
(hexadecimal)        | (binary)
---------------------+---------------------------------------------
0000 0000-0000 007F  | 0xxxxxxx
0000 0080-0000 07FF  | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF  | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF  | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The following example shows how to perform UTF-8 encoding, using the Chinese character 严.

The Unicode code of 严 is known to be 4E25 (100111000100101). According to the table, 4E25 falls in the range of the third row (0000 0800-0000 FFFF), so the UTF-8 encoding of 严 needs three bytes, in the format "1110xxxx 10xxxxxx 10xxxxxx". Then, starting from the last binary bit of 严, fill in the x positions of the format from back to front, and pad any remaining positions with 0. The result is that the UTF-8 encoding of 严 is "11100100 10111000 10100101". This is the actual data saved on the computer. Converted to hexadecimal it is E4B8A5; hexadecimal is used only to make it easier to read.
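The filling procedure just described can be sketched with bit shifts and masks. The function name utf8_encode is hypothetical, and the sketch follows only the four ranges in the table above; it does not reject the surrogate range D800-DFFF, which a production encoder would.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point to UTF-8 using the four table ranges."""
    if code_point < 0x80:                      # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                   # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:                 # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of Unicode range")


encoded = utf8_encode(0x4E25)
print(" ".join(format(b, "08b") for b in encoded))
print(encoded.hex().upper())
```

Each shift selects a group of six bits from the back of the code point, which mirrors filling the x positions from the last bit forward.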

6. Conversion between Unicode and UTF-8

The example in the previous section shows that the Unicode code of 严 is 4E25 while its UTF-8 encoding is E4B8A5; the two are different. The conversion between them can be done by a program.
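As one possible illustration of doing this by program, the sketch below uses Python's built-in codecs rather than any tool named in the text. It converts the code point to UTF-8 and back, and also converts GB2312 bytes to UTF-8 by passing through Unicode.

```python
# Unicode code point -> UTF-8 bytes -> back to the code point.
code_point = 0x4E25
utf8_bytes = chr(code_point).encode("utf-8")
round_trip = ord(utf8_bytes.decode("utf-8"))

print(utf8_bytes.hex().upper())
print(hex(round_trip))

# GB2312 bytes -> Unicode string -> UTF-8 bytes.
gb_bytes = bytes.fromhex("D1CF")
converted = gb_bytes.decode("gb2312").encode("utf-8")
print(converted.hex().upper())
```

Converting between two byte encodings always goes through the decoded text in the middle: decode with the source encoding, then encode with the target one.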

On Windows, one of the simplest ways to convert is the built-in Notepad program, Notepad.exe. After opening a file, choose "Save As" from the File menu; a dialog box appears with an "Encoding" drop-down list at the bottom.

There are four options: ANSI, Unicode, Unicode big endian, and UTF-8.

1) ANSI is the default encoding. For English files it is ASCII; for simplified Chinese files it is GB2312 (only on the simplified Chinese version of Windows; the traditional Chinese version uses Big5).

2) Unicode refers to the UCS-2 encoding, that is, the Unicode code of each character is stored directly in two bytes. This option uses the little endian format.

3) Unicode big endian corresponds to the previous option. The meaning of little endian and big endian is explained in the next section.

4) UTF-8 is the encoding method described in the previous section.

After you choose an encoding and click the "Save" button, the encoding of the file is converted immediately.

7. Little endian and big endian

As mentioned in the previous section, a Unicode code can be stored directly in UCS-2 format. Take the Chinese character 严 as an example: its Unicode code is 4E25, which needs two bytes, one byte being 4E and the other 25. If 4E is stored first and 25 second, that is the big endian way; if 25 is stored first and 4E second, that is the little endian way.
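The two byte orders can be reproduced with Python's int.to_bytes, and compared with the "utf-16-be" and "utf-16-le" codecs, which give the same two bytes for this character. This is a small sketch with illustrative variable names.

```python
# Store the code 4E25 in two bytes, in both byte orders.
code_point = 0x4E25
big = code_point.to_bytes(2, byteorder="big")        # 4E first, 25 second
little = code_point.to_bytes(2, byteorder="little")  # 25 first, 4E second

print(big.hex().upper(), little.hex().upper())

# For this character, the UTF-16 codecs produce the same bytes.
same_as_codec = ("严".encode("utf-16-be") == big
                 and "严".encode("utf-16-le") == little)
print(same_as_codec)
```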

Naturally, a question arises: how does a computer know which way a file is encoded?

The Unicode specification defines a character that is placed at the very start of a file to indicate the byte order. The character is called "zero width no-break space" (ZERO WIDTH NO-BREAK SPACE), and its code is FEFF. This is exactly two bytes, and FF is one larger than FE.

If the first two bytes of a text file are FE FF, the file is big endian; if the first two bytes are FF FE, the file is little endian.
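A check of the first two bytes might look like the sketch below. The function name detect_utf16_byte_order is hypothetical; the two constants come from Python's standard codecs module.

```python
import codecs


def detect_utf16_byte_order(data: bytes) -> str:
    """Report the byte order signalled by the first two bytes, if any."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "little"
    return "unknown"                           # no marker at the start


print(detect_utf16_byte_order(bytes.fromhex("FEFF4E25")))
print(detect_utf16_byte_order(bytes.fromhex("FFFE254E")))
```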

8. Example

Here is an example.

Open Notepad (Notepad.exe), create a new text file containing the single character 严, and save it in turn with the ANSI, Unicode, Unicode big endian, and UTF-8 encodings.

Then use the hexadecimal view of the text editor UltraEdit to inspect how each file is encoded internally.

1) ANSI: the file is two bytes, "D1 CF", which is the GB2312 encoding of 严. This also implies that GB2312 stores the high byte first (big endian).

2) Unicode: the file is four bytes, "FF FE 25 4E", where "FF FE" indicates little endian storage and the actual code is 4E25.

3) Unicode big endian: the file is four bytes, "FE FF 4E 25", where "FE FF" indicates big endian storage.

4) UTF-8: the file is six bytes, "EF BB BF E4 B8 A5". The first three bytes, "EF BB BF", indicate that this is UTF-8; the last three, "E4 B8 A5", are the actual encoding of 严, stored in the same order as the encoding itself.
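The four byte sequences listed above can be reproduced without Notepad or UltraEdit. The sketch below is an approximation in Python: it assumes that "ANSI" means GB2312 (as on a simplified Chinese system), and it builds the two Unicode variants by prepending the byte order mark explicitly. The hex_dump helper is made up for illustration.

```python
import codecs


def hex_dump(data: bytes) -> str:
    """Format bytes as space-separated uppercase hexadecimal pairs."""
    return " ".join(f"{b:02X}" for b in data)


text = "严"
saved = {
    "ANSI (GB2312)": text.encode("gb2312"),
    "Unicode": codecs.BOM_UTF16_LE + text.encode("utf-16-le"),
    "Unicode big endian": codecs.BOM_UTF16_BE + text.encode("utf-16-be"),
    "UTF-8": text.encode("utf-8-sig"),  # "utf-8-sig" writes EF BB BF first
}

for label, data in saved.items():
    print(f"{label}: {hex_dump(data)}")
```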
