Character encoding basics: how Unicode, UTF-8, ASCII, GB2312 and other encodings are converted to one another

Unicode conversion  Date: 2021-04-12


Character encoding is a cornerstone of computing. To really understand computers, you need to understand character encoding. Many people never pay attention to it, but terms such as ASCII, Unicode and UTF-8 can be genuinely confusing, and anyone who wants to learn about computers should get them straight. I picked up this knowledge gradually myself.

1. ASCII code

Inside a computer, all information is ultimately represented as a binary string. Each binary digit (bit) has two states, 0 and 1, so eight bits can be combined into 256 states. A group of eight bits is called a byte. In other words, one byte can represent 256 different states, each corresponding to one symbol, for 256 symbols in total, from 00000000 to 11111111.

In the 1960s, the United States developed a character encoding that standardized the relationship between English characters and binary digits. It is called ASCII and is still in use today.

ASCII specifies 128 characters in total. For example, the space character is 32 (binary 00100000), and the uppercase letter "A" is 65 (binary 01000001). These 128 symbols, which include the non-printing control characters (codes 0 to 31, plus 127), use only the last 7 bits of a byte; the first bit is always 0. The full table is available at: http://www.nengcha.com/code/ascii/all/
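As a quick check of the two values above, here is a minimal Python sketch that prints the decimal and 8-bit binary form of an ASCII character. The helper name `ascii_bits` is only illustrative and does not come from the original article.

```python
def ascii_bits(ch: str) -> str:
    """Return the 8-bit binary form of a single ASCII character."""
    code = ord(ch)                # numeric code of the character
    if code > 127:
        raise ValueError("not an ASCII character")
    return format(code, "08b")    # pad to a full byte; the first bit stays 0


for ch in (" ", "A"):
    print(repr(ch), ord(ch), ascii_bits(ch))
```

Every string returned by this helper starts with 0, which is the unused eighth bit that the next section builds on.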

2. Non-ASCII encodings

128 symbols are enough to encode English, but not other languages. French, for example, puts accent marks above some letters, which ASCII cannot represent. Some European countries therefore decided to use the idle highest bit of the byte to define new symbols. For example, é in a French encoding is 130 (binary 10000010). Encodings of this kind can represent up to 256 symbols.

This creates a new problem. Different countries use different letters, so although all of these encodings have 256 symbols, the letters they represent differ. For example, 130 represents é in a French encoding, the letter Gimel (ג) in a Hebrew encoding, and yet another symbol in a Russian encoding. In all of these encodings, however, the symbols for 0 to 127 are the same; only the range 128 to 255 differs.
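The ambiguity of byte 130 can be reproduced with a short Python sketch. The article does not name the specific encodings, so the three legacy DOS code pages used here (cp437, cp862, cp866) are an assumption chosen to stand in for the "French", "Hebrew" and "Russian" encodings.

```python
import unicodedata

raw = bytes([130])  # the single byte 10000010

# Assumption: cp437, cp862 and cp866 stand in for the French, Hebrew
# and Russian encodings mentioned in the text.
for codec in ("cp437", "cp862", "cp866"):
    ch = raw.decode(codec)
    print(codec, f"U+{ord(ch):04X}", unicodedata.name(ch))

# Bytes in the range 0-127 decode identically in all three.
assert bytes([65]).decode("cp437") == bytes([65]).decode("cp862") == "A"
```

The same byte comes out as three unrelated letters, while the byte 65 is "A" everywhere.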

The languages of Asian countries use even more symbols; there are as many as about 100,000 Chinese characters. One byte, which can represent only 256 symbols, is certainly not enough, so more than one byte must be used to represent a symbol. For example, GB2312, the common encoding for Simplified Chinese, uses two bytes per Chinese character, so in theory it can represent up to 256 x 256 = 65536 symbols.

3. Unicode

As described in the previous section, many encodings exist in the world, and the same binary number can be interpreted as different symbols. To open a text file you must therefore know its encoding; if it is read with the wrong one, the text comes out garbled. Why do emails so often turn into gibberish? Because the sender and the receiver use different encodings.

My own way of picturing it: a text file written in English is saved under an English encoding, so each character becomes a binary number (something like 00101000). If that file is then sent to a user in Russia, what travels is just a stream of 0s and 1s, and the Russian user's computer decodes it with a Russian encoding table, turning each binary number back into a character. Because the two tables can interpret the same number differently, a number that stands for the letter A in one table might stand for B in the other (for the single-byte encodings of section 2, this happens for values from 128 to 255), and the result is garbled text.
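The garbling effect is easy to reproduce. The sketch below is only an illustration: the sample word and the choice of UTF-8 and Latin-1 as the mismatched pair are assumptions, not something the article specifies.

```python
text = "caf\u00e9"                  # the word "café"
data = text.encode("utf-8")         # what the sender stores: b'caf\xc3\xa9'

wrong = data.decode("latin-1")      # receiver guesses the wrong encoding
right = data.decode("utf-8")        # receiver uses the sender's encoding

print(wrong)   # cafÃ©
print(right)   # café
```

The bytes never change; only the table used to interpret them does, which is why the fix is to agree on the encoding rather than to repair the data.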

Non-Unicode encodings such as GB2312 and the Japanese encodings are mapped to Unicode through conversion tables (code pages); otherwise, how could they be displayed?

Imagine a single encoding that included every symbol in the world and gave each one a unique code. The garbling problem would then disappear. That is Unicode: as its name suggests, it is an encoding of all symbols. Unicode is of course a very large set, with room for more than one million symbols. Each symbol has its own code. For example, U+0639 is the Arabic letter Ain, U+0041 is the English capital letter A, and U+4E25 is the Chinese character "严" (Yan). For the full symbol tables, consult unicode.org or a dedicated table of Chinese character codes.
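The three code points mentioned here can be looked up with Python's standard `unicodedata` module. This is a small sketch for checking the examples, not a replacement for the tables at unicode.org.

```python
import unicodedata

for code_point in (0x0639, 0x0041, 0x4E25):
    ch = chr(code_point)            # code point -> character
    assert ord(ch) == code_point    # character -> code point
    print(f"U+{code_point:04X}", unicodedata.name(ch))
```

`chr` and `ord` convert between a character and its Unicode code; neither says anything yet about how that code is stored as bytes.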

4. Unicode's problem

Note that Unicode is only a symbol set. It specifies the binary code of each symbol, but not how that binary code should be stored.

For example, the Unicode code of the Chinese character "严" (Yan) is the hexadecimal number 4E25, which in binary is 15 bits long (100111000100101). This symbol therefore needs at least 2 bytes. Symbols with larger codes may need 3 or 4 bytes.
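The bit count can be verified directly. In this sketch the helper `bytes_needed` is a hypothetical name; it computes only the minimum number of whole bytes needed to hold the raw code, not the size under any particular encoding.

```python
def bytes_needed(code_point: int) -> int:
    """Minimum number of whole bytes that can hold the raw code value."""
    return max(1, (code_point.bit_length() + 7) // 8)


yan = 0x4E25
print(bin(yan)[2:])        # 100111000100101
print(yan.bit_length())    # 15
print(bytes_needed(yan))   # 2
```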

Two serious problems arise here. The first: how can Unicode be distinguished from ASCII? How does a computer know that three bytes represent one symbol rather than three separate symbols? The second: we already know that one byte is enough for an English letter. If Unicode uniformly required three or four bytes per symbol, every English letter would be preceded by two or three bytes of zeros, a huge waste of storage that would make text files several times larger. That is unacceptable.

The consequences were: 1) multiple storage methods for Unicode emerged, that is, there are several different binary formats that can represent Unicode; 2) Unicode could not be widely adopted for a long time, until the Internet arrived.
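The storage cost of a fixed-width format can be seen with a short sketch. UTF-32 is used here as an assumed example of a scheme with "four bytes for every symbol"; the sample string is arbitrary.

```python
text = "Hello, world"

fixed = text.encode("utf-32-be")    # every character takes 4 bytes
compact = text.encode("ascii")      # every character takes 1 byte

print(len(compact), len(fixed))     # 12 48
print(fixed[:8].hex(" "))           # 00 00 00 48 00 00 00 65
```

Each English letter is preceded by three zero bytes, which is exactly the waste described above.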

5. UTF-8

With the spread of the Internet, a unified encoding came to be in strong demand. UTF-8 is one of the most widely used implementations of Unicode on the Internet. Other implementations include UTF-16 and UTF-32, but they are rarely used on the Internet. To repeat the relationship: UTF-8 is one implementation of Unicode, and it specifies how characters are stored and transmitted. UTF-8's most notable feature is that it is a variable-length encoding: it uses 1 to 4 bytes per symbol, with the length depending on the symbol.

The UTF-8 encoding rules are simple; there are only two:

1) For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code. For English letters, therefore, the UTF-8 encoding is identical to ASCII.

2) For a symbol of n bytes (n > 1), the first n bits of the first byte are set to 1, bit n+1 is set to 0, and the first two bits of each following byte are set to 10. All remaining bits hold the symbol's Unicode code.

The following table summarizes the encoding rules. The letter "x" marks the bits available for the code.

Unicode symbol range | UTF-8 encoding
(hexadecimal)        | (binary)
---------------------+-------------------------------------
0000 0000-0000 007F  | 0xxxxxxx
0000 0080-0000 07FF  | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF  | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF  | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The following example shows how UTF-8 encoding works, using the Chinese character "严" (Yan).

The Unicode code of "严" (Yan) is 4E25 (100111000100101). According to the table, 4E25 falls in the third range (0000 0800-0000 FFFF), so its UTF-8 encoding takes three bytes, in the format "1110xxxx 10xxxxxx 10xxxxxx". Starting from the last binary digit of "严", fill in the x positions of the format from back to front, and pad any remaining x positions with 0. The result is that the UTF-8 encoding of "严" is "11100100 10111000 10100101". This is the data actually stored in the computer; in hexadecimal, which is easier to read, it is E4B8A5.
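The filling procedure just described can be written out as a minimal Python sketch. The function name `utf8_encode` is hypothetical, and the sketch follows only the four-row table (it does not reject surrogate code points, for instance); in practice Python's built-in `str.encode("utf-8")` does this job.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by following the four-row table."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of range")
    if code_point <= 0x7F:                   # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:                  # 110xxxxx 10xxxxxx
        lead, n_tail = 0b11000000, 1
    elif code_point <= 0xFFFF:               # 1110xxxx 10xxxxxx 10xxxxxx
        lead, n_tail = 0b11100000, 2
    else:                                    # 11110xxx plus three 10xxxxxx
        lead, n_tail = 0b11110000, 3

    tail = []
    for _ in range(n_tail):
        # Fill from the back: take the lowest 6 bits and prefix them with 10.
        tail.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6
    # The bits that remain go after the marker bits of the first byte.
    return bytes([lead | code_point] + tail[::-1])


print(utf8_encode(0x4E25).hex().upper())     # E4B8A5
```

Comparing the result with `chr(0x4E25).encode("utf-8")` is a simple way to confirm that the hand-written rule matches the built-in codec.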

6. Conversion between Unicode and UTF-8

As the example in the previous section shows, the Unicode code of "严" (Yan) is 4E25 while its UTF-8 encoding is E4B8A5; the two are different. Converting between them can be done by a program.
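As one possible illustration of such a program, the sketch below converts the UTF-8 bytes E4B8A5 back into the code 4E25. The function name `utf8_decode_one` is hypothetical, and the checks are deliberately minimal (for example, over-long sequences are not rejected); the built-in `bytes.decode("utf-8")` is the practical choice.

```python
def utf8_decode_one(data: bytes) -> int:
    """Decode the bytes of a single UTF-8 sequence back to its code point."""
    first = data[0]
    if first < 0b10000000:             # 0xxxxxxx: one byte
        n, value = 1, first
    elif first >> 5 == 0b110:          # 110xxxxx: two bytes
        n, value = 2, first & 0b00011111
    elif first >> 4 == 0b1110:         # 1110xxxx: three bytes
        n, value = 3, first & 0b00001111
    elif first >> 3 == 0b11110:        # 11110xxx: four bytes
        n, value = 4, first & 0b00000111
    else:
        raise ValueError("invalid leading byte")
    if len(data) != n:
        raise ValueError("wrong sequence length")
    for byte in data[1:]:
        if byte >> 6 != 0b10:          # every following byte starts with 10
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (byte & 0b00111111)
    return value


utf8_bytes = bytes.fromhex("E4B8A5")
print(f"U+{utf8_decode_one(utf8_bytes):04X}")        # U+4E25
print(chr(0x4E25).encode("utf-8").hex().upper())     # E4B8A5
```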

On Windows there is a simple way to convert: the built-in Notepad program (Notepad.exe). After opening a file, choose "Save As" from the File menu. The dialog box that appears has an "Encoding" drop-down list at the bottom.

It offers four options: ANSI, Unicode, Unicode big endian, and UTF-8.

1) ANSI is the default. For English files this is ASCII; for Simplified Chinese files it is GB2312 (only on the Simplified Chinese version of Windows; the Traditional Chinese version uses Big5).

2) Unicode refers to UCS-2, which stores each character's Unicode code directly in two bytes. This option uses the little endian format.

3) Unicode big endian is the counterpart of the previous option. The next section explains what little endian and big endian mean.

4) UTF-8 is the encoding described in the previous section.

After choosing an encoding, click "Save" and the file's encoding is converted immediately.

7. Little endian and Big endian

As mentioned in the previous section, Unicode codes can be stored directly in UCS-2 format. Take "严" (Yan) as an example: its Unicode code is 4E25, which needs two bytes, one holding 4E and the other 25. Storing 4E first and 25 second is the big endian way; storing 25 first and 4E second is the little endian way.
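The two byte orders can be shown with Python's `int.to_bytes`. This is a small sketch; the comparison with the UTF-16 codecs is added only as a cross-check and holds for this character because it fits in a single two-byte unit.

```python
code_point = 0x4E25

big = code_point.to_bytes(2, "big")         # most significant byte first
little = code_point.to_bytes(2, "little")   # least significant byte first

print(big.hex(" ").upper())      # 4E 25
print(little.hex(" ").upper())   # 25 4E

# Python's explicit-endian UTF-16 codecs give the same bytes here.
assert "\u4e25".encode("utf-16-be") == big
assert "\u4e25".encode("utf-16-le") == little
```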

Well, naturally, a question arises: how does a computer knowwhich way a file is encoded?

The Unicode specification defines a character that is placed at the very beginning of a file to indicate the byte order. This character is called "ZERO WIDTH NO-BREAK SPACE" and its code is FEFF. That is exactly two bytes, and FF is one greater than FE.

If the first two bytes of a text file are FE FF, the file is big endian; if the first two bytes are FF FE, the file is little endian.
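The check on the first two bytes can be sketched as follows. The function name `detect_byte_order` is hypothetical; the constants come from Python's standard `codecs` module. The sketch covers only the two-byte case described here and does not try to tell UTF-16 apart from UTF-32, whose little endian mark also begins with FF FE.

```python
import codecs


def detect_byte_order(data: bytes) -> str:
    """Guess the byte order of two-byte Unicode text from its first two bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return "big endian"
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return "little endian"
    return "unknown (no byte order mark)"


print(detect_byte_order(bytes.fromhex("FEFF4E25")))   # big endian
print(detect_byte_order(bytes.fromhex("FFFE254E")))   # little endian
```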

8. Example

Here is an example.

Open Notepad (Notepad.exe), create a new text file whose content is the single character "严" (Yan), and save it in turn with the ANSI, Unicode, Unicode big endian and UTF-8 encodings.

Then use the hex mode of the text editor UltraEdit to inspect how each file is encoded internally.

1) ANSI: the file is two bytes, "D1 CF", which is the GB2312 encoding of "严". This also implies that GB2312 is stored big endian.

2) Unicode: the file is four bytes, "FF FE 25 4E", where "FF FE" indicates little endian storage and the actual code is 4E25.

3) Unicode big endian: the file is four bytes, "FE FF 4E 25", where "FE FF" indicates big endian storage.

4) UTF-8: the file is six bytes, "EF BB BF E4 B8 A5". The first three bytes, "EF BB BF", indicate that this is UTF-8; the last three, "E4 B8 A5", are the actual encoding of "严", stored in the same order as the encoding itself.
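The same four byte sequences can be reproduced without Notepad or UltraEdit. In this sketch, the mapping from Notepad's option names to Python codec names is an assumption based on the descriptions above (GB2312 for ANSI on Simplified Chinese Windows, a byte order mark plus two-byte little or big endian data for the two Unicode options, and a UTF-8 signature for the last).

```python
import codecs

text = "\u4e25"  # the character 严

# Assumed mapping from Notepad's option names to Python codecs.
saved = {
    "ANSI (GB2312)": text.encode("gb2312"),
    "Unicode": codecs.BOM_UTF16_LE + text.encode("utf-16-le"),
    "Unicode big endian": codecs.BOM_UTF16_BE + text.encode("utf-16-be"),
    "UTF-8": text.encode("utf-8-sig"),    # "utf-8-sig" prepends EF BB BF
}

for label, data in saved.items():
    print(f"{label:20}", data.hex(" ").upper())
```

The explicit `utf-16-le` and `utf-16-be` codecs are used instead of plain `utf-16` so that the output does not depend on the byte order of the machine running the code.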
