Character encoding basics: how Unicode, UTF-8, ASCII, and GB2312 relate to and convert between one another

Unicode conversion  Date: 2021-04-12


Character encoding is a cornerstone of computing. To master computers, you need to understand it. Many people never pay attention to the subject, but the terminology can be genuinely confusing, and understanding it matters if you want to learn how computers work. I picked up the knowledge below gradually myself.

1. ASCII code

Inside a computer, all information is ultimately represented as a binary string. Each binary digit (bit) has two states, 0 and 1, so eight bits can be combined into 256 states. A group of eight bits is called a byte. In other words, one byte can represent 256 different states, each corresponding to one symbol, for 256 symbols in total, from 00000000 to 11111111.

In the 1960s, the United States developed a character encoding that standardized the relationship between English characters and binary digits. It is called ASCII and is still in use today.

ASCII defines 128 characters in total. For example, the space character (SPACE) is 32 (binary 00100000), and the uppercase letter "A" is 65 (binary 01000001). These 128 symbols, including 32 control characters that cannot be printed, occupy only the lower 7 bits of a byte; the leading bit is always 0. A full table is available at: http://www.nengcha.com/code/ascii/all/
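The two values quoted above can be checked with a short Python sketch. Python is used here only for illustration and is not part of the original article; the helper name `ascii_code` is hypothetical.

```python
def ascii_code(ch):
    """Return the ASCII code of one character as (decimal, 8-bit binary string)."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not an ASCII character")
    # "08b" pads to 8 binary digits, so the leading 0 bit is visible.
    return code, format(code, "08b")


print(ascii_code(" "))  # (32, '00100000')
print(ascii_code("A"))  # (65, '01000001')
```

The leading digit of every result is 0, which matches the statement that ASCII uses only the lower 7 bits of the byte.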

2. Non-ASCII encodings

128 symbols are enough to encode English, but not enough for other languages. French, for example, puts accent marks above some letters, which ASCII cannot represent. So some European countries decided to use the idle highest bit of the byte to define new symbols. For example, é in a French encoding is 130 (binary 10000010). The encodings used in these European countries can therefore represent up to 256 symbols.

This creates a new problem. Different countries have different letters, so although all of these encodings use 256 symbols, the letters they represent differ. For example, 130 stands for é in the French encoding, for the letter Gimel (ג) in the Hebrew encoding, and for yet another symbol in the Russian encoding. In all of these encodings, however, codes 0 to 127 represent the same symbols; only the range 128 to 255 differs.
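The effect is easy to reproduce. The sketch below decodes the single byte 130 under three legacy code pages. The article does not name the specific encodings it has in mind, so the choice of `cp437`, `cp862`, and `cp866` (DOS code pages for Western European, Hebrew, and Russian text) is an assumption made for illustration.

```python
raw = bytes([130])  # the single byte 0x82, binary 10000010

# Assumed code pages (not named in the article):
# cp437 = Western European DOS, cp862 = Hebrew DOS, cp866 = Russian DOS.
decoded = {codec: raw.decode(codec) for codec in ("cp437", "cp862", "cp866")}

for codec, symbol in decoded.items():
    print(codec, symbol, f"U+{ord(symbol):04X}")

# Bytes in the range 0-127 decode to the same symbol in all three.
same_low_half = {b"A".decode(codec) for codec in decoded}
print(same_low_half)  # {'A'}
```

Under these code pages the same byte comes out as é, ג, and the Cyrillic letter В respectively.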

Asian languages use even more symbols; there are as many as 100,000 or so Chinese characters. One byte, which can represent only 256 symbols, is certainly not enough, so more than one byte must be used per symbol. For example, the common encoding for simplified Chinese is GB2312, which uses two bytes per Chinese character and can therefore represent at most 256 x 256 = 65,536 symbols in theory.
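As a small illustration, Python's built-in `gb2312` codec shows the two-byte representation directly. The character 严 (U+4E25) is the example used later in this article.

```python
encoded = "严".encode("gb2312")
print(len(encoded), encoded.hex())  # 2 d1cf

# ASCII characters still take a single byte under GB2312.
print(len("A".encode("gb2312")))    # 1

# Theoretical upper bound for a two-byte code.
print(256 * 256)                    # 65536
```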

3. Unicode

As the previous section showed, the world has many encodings, and the same binary number can be interpreted as different symbols. To open a text file you must therefore know its encoding; reading it with the wrong encoding produces garbled text. Why is email so often garbled? Because the sender and the receiver use different encodings.

My own understanding is as follows. Suppose a text file is written in English. Under the English encoding, each character is mapped to a binary number (something like 00101000) and saved to disk. The file is then sent to a user whose computer is set up for Russian. What is transmitted is just a binary stream of 0s and 1s, and the Russian user's machine decodes it with the Russian encoding, looking up each binary number in the Russian table to display a character. Because the two tables interpret the same number differently (the same number might stand for A under one table and B under the other), the result is garbled.
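A minimal sketch of how garbling arises, assuming a sender who writes Russian text in KOI8-R and a receiver who assumes Windows-1251. Both encodings are choices made here for illustration; the article does not name any.

```python
text = "Привет"                 # "Hello" in Russian
data = text.encode("koi8_r")    # the bytes the sender actually transmits

wrong = data.decode("cp1251")   # receiver guesses a different encoding
right = data.decode("koi8_r")   # receiver uses the sender's encoding

print(data.hex(" "))
print("wrong guess:", wrong)
print("right guess:", right)

# Characters in the shared 0-127 range survive either way.
print("Hello".encode("koi8_r").decode("cp1251"))  # Hello
```

The bytes never change; only the table used to interpret them does, which is why the wrong guess still yields six characters, just not the intended ones.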

Are GB2312, the Japanese encodings, and other non-Unicode encodings converted to Unicode through a conversion table (code page)? If not, how are they displayed?

Imagine an encoding that includes every symbol in the world and gives each symbol a unique code. The garbling problem would then disappear. That is Unicode: as its name suggests, an encoding for all symbols. Unicode is, of course, a very large set; it currently has room for more than one million symbols. Each symbol has a different code, for example:

U+0639 is the Arabic letter Ain, U+0041 is the Latin capital letter A, and U+4E25 is the Chinese character 严. For the full symbol tables, consult unicode.org or a dedicated table of Chinese character code points.
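These three code points can be looked up with Python's standard `unicodedata` module, shown here as a small sketch.

```python
import unicodedata

# The three code points mentioned in the text.
for code_point in (0x0639, 0x0041, 0x4E25):
    ch = chr(code_point)
    print(f"U+{code_point:04X}", unicodedata.name(ch))

# ord() goes the other way, from character to code point.
print(hex(ord("严")))  # 0x4e25
```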

4. Problems with Unicode

Note that Unicode is only a character set, a specification: it defines the binary code of each symbol but not how that code should be stored on a computer.

For example, the Unicode code of the Chinese character 严 is hexadecimal 4E25, which in binary is a full 15 bits (100111000100101). In other words, this symbol needs at least 2 bytes. Symbols with larger codes may need 3 or 4 bytes, or even more.
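The bit count can be verified with a few lines of Python; this is only an arithmetic check of the figures above.

```python
code_point = 0x4E25  # the character 严

bits = format(code_point, "b")
print(bits, len(bits))  # 100111000100101 15

# Round the bit length up to whole bytes.
min_bytes = (code_point.bit_length() + 7) // 8
print(min_bytes)        # 2
```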

This raises two serious problems. First, how can Unicode be distinguished from ASCII? How does a computer know that three bytes represent one symbol rather than three separate symbols? Second, English letters need only one byte each. If Unicode uniformly required three or four bytes per symbol, every English letter would have to be preceded by two or three bytes of zeros. That is a huge waste of storage: a text file would become three or four times as large, which is unacceptable.

The results were: 1) several storage methods for Unicode emerged, that is, there are many different binary formats that can represent Unicode; 2) Unicode could not become widespread for a long time, until the rise of the Internet.
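To make the storage cost concrete, the sketch below uses UTF-32 as a stand-in for the hypothetical fixed-width scheme described above. That substitution is an assumption for illustration; the article does not name a specific format here.

```python
text = "Hello, world"

one_byte = text.encode("ascii")        # 1 byte per character
four_bytes = text.encode("utf-32-be")  # fixed 4 bytes per character, no BOM

print(len(one_byte), len(four_bytes))  # 12 48

# Each English letter is preceded by three zero bytes.
print("A".encode("utf-32-be"))         # b'\x00\x00\x00A'
```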

5. UTF-8

With the spread of the Internet, a unified encoding was urgently needed. UTF-8 is one of the most widely used implementations of Unicode on the Internet. Other implementations include UTF-16 and UTF-32, but they are rarely used on the Internet. To repeat the relationship: UTF-8 is one implementation of Unicode; it specifies how characters are stored and transmitted.

One of UTF-8's most important features is that it is a variable-length encoding. It uses 1 to 4 bytes per symbol, with the length depending on the symbol.

The encoding rules for UTF-8 are simple; there are only two:

1) For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code. For English letters, UTF-8 is therefore identical to ASCII.

2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, bit n+1 is set to 0, and the first two bits of every following byte are set to 10. All remaining bits hold the symbol's Unicode code.
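These two rules are also what lets a decoder tell where one symbol ends and the next begins: the first byte alone announces the length. A minimal sketch follows. The function names `sequence_length` and `split_symbols` are hypothetical, and the code deliberately skips validation of the continuation bytes, which a real decoder would need.

```python
def sequence_length(lead):
    """Return the length in bytes of a UTF-8 sequence, given its first byte."""
    if lead >> 7 == 0b0:       # 0xxxxxxx: single byte
        return 1
    if lead >> 5 == 0b110:     # 110xxxxx: two bytes
        return 2
    if lead >> 4 == 0b1110:    # 1110xxxx: three bytes
        return 3
    if lead >> 3 == 0b11110:   # 11110xxx: four bytes
        return 4
    raise ValueError("continuation byte or invalid lead byte")


def split_symbols(data):
    """Split UTF-8 bytes into one chunk per symbol (no validation)."""
    chunks, i = [], 0
    while i < len(data):
        n = sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


print(split_symbols("A严".encode("utf-8")))  # [b'A', b'\xe4\xb8\xa5']
```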

The following table summarizes the encoding rules; the letter x marks the bits available for encoding.

Unicode symbol range    | UTF-8 encoding
(hexadecimal)           | (binary)
------------------------+-------------------------------------
0000 0000-0000 007F     | 0xxxxxxx
0000 0080-0000 07FF     | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF     | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF     | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The following example shows how UTF-8 encoding works, using the Chinese character 严.

We know that the Unicode code of 严 is 4E25 (100111000100101). According to the table, 4E25 falls in the range of the third row (0000 0800-0000 FFFF), so the UTF-8 encoding of 严 takes three bytes, in the format "1110xxxx 10xxxxxx 10xxxxxx". Then, starting from the last binary digit of 严, fill in the x positions from back to front, padding any leftover positions with 0. The result is that the UTF-8 encoding of 严 is "11100100 10111000 10100101". This is the data actually stored in the computer; in hexadecimal it is E4B8A5, which is easier to read.
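The fill-from-the-back procedure can be written out as a short function. This is a sketch of the rules as stated in this article, not a replacement for a library encoder: the name `utf8_encode` is hypothetical, and the code does not reject surrogate code points (U+D800 to U+DFFF), which a strict encoder would.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point to UTF-8 following the range table."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")
    if code_point <= 0x7F:
        return bytes([code_point])        # 0xxxxxxx
    if code_point <= 0x7FF:
        n, prefix = 2, 0b11000000         # 110xxxxx
    elif code_point <= 0xFFFF:
        n, prefix = 3, 0b11100000         # 1110xxxx
    else:
        n, prefix = 4, 0b11110000         # 11110xxx

    # Fill the continuation bytes from the back, 6 bits at a time.
    tail = []
    for _ in range(n - 1):
        tail.append(0b10000000 | (code_point & 0b111111))  # 10xxxxxx
        code_point >>= 6

    # Whatever bits remain go into the first byte after its prefix.
    lead = prefix | code_point
    return bytes([lead] + tail[::-1])


encoded = utf8_encode(0x4E25)
print(encoded.hex().upper())                     # E4B8A5
print(" ".join(f"{b:08b}" for b in encoded))     # 11100100 10111000 10100101
```

Comparing the output with Python's built-in `str.encode("utf-8")` for a few characters is a quick way to check the sketch.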

6. Conversion between Unicode and UTF-8

The example in the previous section shows that the Unicode code of 严 is 4E25 while its UTF-8 encoding is E4B8A5; the two are different. Conversion between them can be done by a program.
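As one possible illustration of doing the conversion by program, Python's built-in string methods convert in both directions. The article does not prescribe a language; Python is an assumption here.

```python
code_point = 0x4E25

# Unicode code point -> UTF-8 bytes
utf8 = chr(code_point).encode("utf-8")
print(utf8.hex().upper())   # E4B8A5

# UTF-8 bytes -> Unicode code point
back = ord(bytes.fromhex("E4B8A5").decode("utf-8"))
print(f"{back:04X}")        # 4E25
```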

On Windows, the simplest way to convert is the built-in Notepad program, Notepad.exe. After opening a file, choose "Save As" from the File menu; a dialog box appears with an "Encoding" drop-down list at the bottom.

It has four options: ANSI, Unicode, Unicode big endian, and UTF-8.

1) ANSI is the default encoding. For English files this is ASCII; for simplified Chinese files it is GB2312 (only on the simplified Chinese version of Windows; the traditional Chinese version uses Big5).

2) Unicode refers to UCS-2, which stores each character's Unicode code directly in two bytes. This option uses little-endian format.

3) Unicode big endian corresponds to the previous option, in big-endian format. The next section explains what little endian and big endian mean.

4) UTF-8 is the encoding described in the previous section.

After choosing an encoding, click "Save" and the file's encoding is converted immediately.
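The same kind of conversion can be sketched in code: decode the bytes with the source encoding, then encode the resulting text with the target encoding. The helper `transcode` is a hypothetical name, and this is not a description of how Notepad itself works internally.

```python
def transcode(data, source, target):
    """Re-encode bytes from one character encoding to another."""
    # Step 1: bytes -> text, using the encoding the data was saved in.
    text = data.decode(source)
    # Step 2: text -> bytes, using the desired encoding.
    return text.encode(target)


gb2312_bytes = bytes.fromhex("D1CF")  # 严 in GB2312
converted = transcode(gb2312_bytes, "gb2312", "utf-8")
print(converted.hex().upper())        # E4B8A5
```

The conversion only works if `source` really is the encoding the bytes were written in; a wrong guess either raises an error or silently produces garbled text.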

7. Little endian and big endian

As the previous section mentioned, Unicode codes can be stored directly in UCS-2 format. Take the Chinese character 严: its Unicode code is 4E25, which needs two bytes of storage, one byte being 4E and the other 25. Storing 4E first and 25 second is the big-endian way; storing 25 first and 4E second is the little-endian way.
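The two byte orders can be shown with `int.to_bytes`, and they match what Python's `utf-16-be` and `utf-16-le` codecs produce for this character. This is a small sketch for illustration.

```python
code_point = 0x4E25  # the character 严

big = code_point.to_bytes(2, "big")        # high byte first
little = code_point.to_bytes(2, "little")  # low byte first

print(big.hex(" "), "|", little.hex(" "))  # 4e 25 | 25 4e

# The built-in codecs store the character the same way.
print("严".encode("utf-16-be") == big)      # True
print("严".encode("utf-16-le") == little)   # True
```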

This naturally raises a question: how does a computer know which byte order a file uses?

The Unicode specification defines a character that is placed at the very beginning of a file to indicate the byte order. This character is called "ZERO WIDTH NO-BREAK SPACE" and is written FEFF. It is exactly two bytes, and FF is 1 greater than FE.

If the first two bytes of a text file are FE FF, the file is big endian; if the first two bytes are FF FE, the file is little endian.
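A minimal sketch of this check, using the byte order mark constants from Python's standard `codecs` module. The function name `detect_byte_order` is hypothetical, and the check covers only the two-byte case discussed here; it does not distinguish UTF-32, whose little-endian mark also begins with FF FE.

```python
import codecs


def detect_byte_order(data):
    """Guess the byte order of two-byte Unicode text from its first two bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "big endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "little endian"
    return "unknown"


print(detect_byte_order(bytes.fromhex("FEFF4E25")))  # big endian
print(detect_byte_order(bytes.fromhex("FFFE254E")))  # little endian
print(detect_byte_order(bytes.fromhex("E4B8A5")))    # unknown
```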

8. Example

Here is an example.

Open Notepad (Notepad.exe), create a new text file containing the single character 严, and save it in turn with the ANSI, Unicode, Unicode big endian, and UTF-8 encodings.

Then use the hex mode of the text editor UltraEdit to inspect how each file is encoded internally.

1) ANSI: the file is the two bytes "D1 CF", which is the GB2312 encoding of 严. This also implies that GB2312 is stored big endian (high byte first).

2) Unicode: the file is the four bytes "FF FE 25 4E", where "FF FE" indicates little-endian storage and the actual code is 4E25.

3) Unicode big endian: the file is the four bytes "FE FF 4E 25", where "FE FF" indicates big-endian storage.

4) UTF-8: the file is the six bytes "EF BB BF E4 B8 A5". The first three bytes, "EF BB BF", indicate that the file is UTF-8; the last three, "E4 B8 A5", are the actual encoding of 严, stored in the same order as the encoding itself.
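The four byte sequences can be reproduced without Notepad or a hex editor. The sketch below builds each one with Python's built-in codecs; the mapping from Notepad's option names to codec names is an assumption based on the descriptions in this article (for instance, "ANSI" is taken to mean GB2312, as on a simplified Chinese system).

```python
import codecs

ch = "严"

# Notepad option name -> bytes built with the codec assumed to match it.
files = {
    "ANSI": ch.encode("gb2312"),
    "Unicode": codecs.BOM_UTF16_LE + ch.encode("utf-16-le"),
    "Unicode big endian": codecs.BOM_UTF16_BE + ch.encode("utf-16-be"),
    "UTF-8": ch.encode("utf-8-sig"),  # "utf-8-sig" prepends EF BB BF
}

for name, data in files.items():
    print(f"{name}: {data.hex(' ').upper()}")
```

Note that `bytes.hex` with a separator argument requires Python 3.8 or later.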
