Character encoding basics: how Unicode, UTF-8, ASCII, and GB2312 relate to one another and how to convert between them

Unicode conversion  Date: 2021-04-12


Character encoding is a cornerstone of computing. To really understand computers, you need to understand character encoding. People who never deal with it may not care, but the terminology can be genuinely confusing, and anyone who wants to learn about computers needs to get it straight. I picked up what I know about the topic gradually.

1. ASCII code

Inside a computer, all information is ultimately represented as a string of binary digits. Each binary digit (bit) has two states, 0 and 1, so eight bits can be combined into 256 states. This group of eight bits is called a byte. In other words, one byte can represent 256 different states, each corresponding to one symbol, for 256 symbols in total, from 00000000 to 11111111.

In the 1960s, the United States developed a character encoding that standardized the mapping between English characters and binary numbers. It is called ASCII and is still in use today.

ASCII defines a total of 128 characters. For example, the space character (SPACE) is 32 (binary 00100000), and the uppercase letter "A" is 65 (binary 01000001). These 128 symbols, which include 32 control characters that cannot be printed, occupy only the low 7 bits of a byte; the highest bit is always 0. The full table is available at: http://www.nengcha.com/code/ascii/all/
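As a quick check of the values above, the following Python sketch looks up ASCII codes with the built-in `ord` and prints them as 8-bit binary. The helper name `ascii_bits` is made up for this illustration.

```python
def ascii_bits(ch: str) -> str:
    """Return the 8-bit binary form of a single ASCII character."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not an ASCII character")
    # Pad to 8 digits so the leading 0 bit is visible.
    return format(code, "08b")


print(ord(" "), ascii_bits(" "))  # 32 00100000
print(ord("A"), ascii_bits("A"))  # 65 01000001
```

Every value returned by `ascii_bits` starts with 0, which is the unused highest bit described above.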

2. Non-ASCII encodings

128 symbols are enough to encode English, but not enough for other languages. French, for example, puts accent marks above some letters, which ASCII cannot represent. So some European countries decided to use the unused highest bit of the byte to define new symbols. For example, é in a French encoding is 130 (binary 10000010). In this way the encodings used by these European countries can represent up to 256 symbols.

But this creates a new problem. Different countries have different letters, so even though they all use 256-symbol encodings, the letters represented are not the same. For example, 130 represents é in the French encoding, the letter Gimel (ג) in the Hebrew encoding, and yet another symbol in the Russian encoding. In all of these encodings, however, the symbols for 0-127 are the same; only the range 128-255 differs.
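The effect can be reproduced with a small Python sketch. The text does not name the specific national encodings, so the DOS code pages 437, 862, and 866 are used here as assumed stand-ins for the French, Hebrew, and Russian encodings; other single-byte encodings assign byte 130 differently.

```python
raw = bytes([130])  # the single byte 10000010

# Assumed stand-ins for the national encodings mentioned in the text:
# cp437 (Western European), cp862 (Hebrew), cp866 (Russian).
decoded = {codec: raw.decode(codec) for codec in ("cp437", "cp862", "cp866")}
for codec, ch in decoded.items():
    print(codec, ch)

# Bytes 0-127 decode to the same characters in all three code pages.
shared = bytes(range(128))
same_low_half = len({shared.decode(c) for c in decoded}) == 1
print(same_low_half)  # True
```

The same byte comes out as é, ג, and В respectively, while the lower half of the table is identical in all three.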

The languages of Asian countries use even more symbols; there are as many as about 100,000 Chinese characters. One byte can represent only 256 symbols, which is certainly not enough, so more than one byte must be used per symbol. For example, the common encoding for simplified Chinese is GB2312, which uses two bytes for each Chinese character, so in theory it can represent up to 256 x 256 = 65,536 symbols.

3. Unicode

As the previous section showed, there are many encodings in the world, and the same binary number can be interpreted as different symbols. To open a text file you therefore have to know its encoding; reading it with the wrong one produces garbled text. Why is email so often garbled? Because the sender and the receiver use different encodings.

My own way of picturing it: a text file is saved under the encoding of the computer it was written on, with each character mapped to a binary number (something like 00101000). If that file is sent to a user on a Russian-language computer, what travels is only the binary stream of 0s and 1s, and the receiving computer decodes it with its Russian encoding table, turning each binary number back into a character for display. Because the two tables assign some numbers differently, the same number can stand for one letter in the sender's table and a different letter in the receiver's (this happens for values in the 128-255 range, since 0-127 are shared), and the result is garbled text. This is my personal understanding.
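A minimal Python sketch of this kind of garbling follows. The two encodings are assumptions chosen for illustration: the sender is taken to use Windows-1251 (`cp1251`, a Russian encoding) and the receiver Latin-1.

```python
original = "Привет"                # Russian for "hello"
data = original.encode("cp1251")   # the byte stream that actually travels

# The receiver interprets the same bytes with a different table.
garbled = data.decode("latin-1")
print(garbled)                     # Ïðèâåò

# Nothing was lost: reversing the wrong step recovers the text.
repaired = garbled.encode("latin-1").decode("cp1251")
print(repaired == original)        # True
```

The bytes are never changed in transit; only the table used to read them differs.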

Are GB2312, the Japanese encodings, and other non-Unicode encodings converted to Unicode through a conversion table (code page) before being displayed, or how else are they displayed?

Imagine an encoding that includes every symbol in the world, with each symbol given a unique code. The garbling problem would disappear. This is Unicode: as its name suggests, it is an encoding for all symbols. Unicode is of course a very large set, and it now has room for more than one million symbols. Each symbol has a different code: for example, U+0639 stands for the Arabic letter Ain, U+0041 for the English capital letter A, and U+4E25 for the Chinese character 严. For the full table of symbols, consult unicode.org or a dedicated table of Chinese character codes.
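The three code points above can be checked with Python's built-in `ord` and `chr` and the standard `unicodedata` module. The helper name `u_plus` is made up for this sketch.

```python
import unicodedata


def u_plus(ch: str) -> str:
    """Format a character's code point in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"


for ch in ("\u0639", "A", "严"):
    print(u_plus(ch), unicodedata.name(ch))

# chr goes the other way, from code point to character.
print(chr(0x4E25))  # 严
```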

4. The problem with Unicode

Note that Unicode is only a set of symbols. It specifies the binary code of each symbol, but not how that binary code should be stored.

For example, the Unicode code of the Chinese character 严 is the hexadecimal number 4E25, which in binary is 15 bits long (100111000100101). So representing this symbol takes at least 2 bytes. Other, larger symbols may need 3 or 4 bytes, or even more.
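The bit count is easy to verify. A short Python sketch, using only built-in integer operations:

```python
code_point = 0x4E25

bits = bin(code_point)[2:]        # strip the "0b" prefix
print(bits, len(bits))            # 100111000100101 15

# Round the bit count up to whole bytes.
min_bytes = (code_point.bit_length() + 7) // 8
print(min_bytes)                  # 2
```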

This raises two serious problems. The first is: how can Unicode be told apart from ASCII? How does a computer know that three bytes represent one symbol rather than three separate symbols? The second is that, as we already know, one byte is enough for an English letter. If Unicode uniformly required three or four bytes per symbol, every English letter would have to be preceded by two or three bytes of zeros. That is a huge waste of storage: a text file would grow to three or four times its size, which is unacceptable.

The results were: 1) several storage formats for Unicode emerged, that is, there are many different binary formats that can represent Unicode; 2) Unicode could not be widely adopted for a long time, until the internet arrived.
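The storage cost of a fixed-width format can be seen in a small Python sketch. UTF-32 is used here as an assumed example of a format that spends four bytes on every symbol; the paragraph above does not name a specific format.

```python
text = "Hello"

ascii_bytes = text.encode("ascii")
fixed_bytes = text.encode("utf-32-be")   # 4 bytes per symbol, no byte order mark

print(len(ascii_bytes), len(fixed_bytes))  # 5 20
print(fixed_bytes[:4].hex(" "))            # 00 00 00 48
```

The letter H (hexadecimal 48) is preceded by three zero bytes, and the same holds for every other English letter in the text.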

5. UTF-8

With the spread of the internet, the demand for a unified encoding grew strong. UTF-8 is one of the most widely used implementations of Unicode on the internet. Other implementations include UTF-16 and UTF-32, but they are rarely used on the internet. To repeat, the relationship is this: UTF-8 is one implementation of Unicode; it specifies how characters are stored and transmitted.

The most notable feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes per symbol, with the byte length depending on the symbol.

The encoding rules for UTF-8 are simple, only two:

1) For a single-byte symbol, the first bit of the byte is set to 0, and the remaining 7 bits are the symbol's Unicode code. So for English letters, the UTF-8 encoding is identical to the ASCII code.

2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, bit n+1 is set to 0, and the first two bits of each following byte are set to 10. All the remaining bits are filled with the symbol's Unicode code.

The following table summarizes the encoding rules; the letter x marks the bits available for encoding.

Unicode symbol range   |  UTF-8 encoding
(hexadecimal)          |  (binary)
-----------------------+---------------------------------------
0000 0000-0000 007F    |  0xxxxxxx
0000 0080-0000 07FF    |  110xxxxx 10xxxxxx
0000 0800-0000 FFFF    |  1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF    |  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The following example shows how to apply the UTF-8 encoding, using the Chinese character 严.

The Unicode code of 严 is 4E25 (100111000100101). According to the table above, 4E25 falls in the range of the third row (0000 0800-0000 FFFF), so the UTF-8 encoding of 严 takes three bytes, in the format "1110xxxx 10xxxxxx 10xxxxxx". Then, starting from the last binary digit of 严, fill in the x positions of the format from back to front, and pad the leftover positions with 0. The result is that the UTF-8 encoding of 严 is "11100100 10111000 10100101". This is the actual data stored in the computer; written in hexadecimal, which is easier to read, it is E4B8A5.
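The table and the fill-from-the-back procedure can be sketched in Python as follows. The function name `utf8_encode` is made up for this illustration, and the sketch is deliberately minimal: unlike Python's built-in encoder, it does not reject surrogate code points (D800-DFFF).

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by following the four table rows."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")
    if code_point <= 0x7F:
        return bytes([code_point])           # 0xxxxxxx
    if code_point <= 0x7FF:
        n, lead = 2, 0b11000000              # 110xxxxx
    elif code_point <= 0xFFFF:
        n, lead = 3, 0b11100000              # 1110xxxx
    else:
        n, lead = 4, 0b11110000              # 11110xxx
    out = []
    # Fill the trailing 10xxxxxx bytes from the back, 6 bits at a time.
    for _ in range(n - 1):
        out.append(0b10000000 | (code_point & 0b111111))
        code_point >>= 6
    # Whatever is left goes into the x positions of the first byte.
    out.append(lead | code_point)
    return bytes(reversed(out))


print(utf8_encode(0x4E25).hex().upper())  # E4B8A5
```

For ordinary characters the result agrees with the built-in `str.encode("utf-8")`, which is what real code should use.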

6. Conversion between Unicode and UTF-8

The example in the previous section shows that the Unicode code of 严 is 4E25 while its UTF-8 encoding is E4B8A5; the two are different. Converting between them can be done by a program.
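One way such a program might look in Python, using only built-in string and bytes methods. The two function names are made up for this sketch.

```python
def code_point_to_utf8_hex(hex_code: str) -> str:
    """'4E25' -> 'E4B8A5': Unicode code point to UTF-8 bytes, both as hex."""
    return chr(int(hex_code, 16)).encode("utf-8").hex().upper()


def utf8_hex_to_code_point(hex_bytes: str) -> str:
    """'E4B8A5' -> '4E25': UTF-8 bytes back to the Unicode code point."""
    ch = bytes.fromhex(hex_bytes).decode("utf-8")
    if len(ch) != 1:
        raise ValueError("expected the bytes of exactly one character")
    return format(ord(ch), "04X")


print(code_point_to_utf8_hex("4E25"))    # E4B8A5
print(utf8_hex_to_code_point("E4B8A5"))  # 4E25
```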

On Windows, there is a simple way to convert: the built-in Notepad program, Notepad.exe. After opening a file, click "Save As" in the File menu. A dialog box appears with an "Encoding" drop-down list at the bottom.

There are four options: ANSI, Unicode, Unicode big endian, and UTF-8.

1) ANSI is the default encoding. For English files it is ASCII; for simplified Chinese files it is GB2312 (only on the simplified Chinese version of Windows; the traditional Chinese version uses Big5).

2) Unicode refers to the UCS-2 encoding, which stores a character's Unicode code directly in two bytes. This option uses the little endian format.

3) Unicode big endian is the counterpart of the previous option. The next section explains what little endian and big endian mean.

4) UTF-8 is the encoding described in the previous section.

After choosing an encoding and clicking "Save", the file is converted to that encoding immediately.
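The same kind of conversion can be scripted. The following Python sketch is not how Notepad works internally; it simply reads a file under one encoding and writes the text back under another. The function name `convert_file` and the file names are made up for the illustration.

```python
import tempfile
from pathlib import Path


def convert_file(src, dst, src_encoding: str, dst_encoding: str) -> None:
    """Decode src with src_encoding and save the text to dst in dst_encoding."""
    text = Path(src).read_text(encoding=src_encoding)
    Path(dst).write_text(text, encoding=dst_encoding)


# Demo: a GB2312 file containing the single character 严 (bytes D1 CF).
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "ansi.txt"
    dst = Path(tmp) / "utf8.txt"
    src.write_bytes(b"\xd1\xcf")
    convert_file(src, dst, "gb2312", "utf-8")
    converted = dst.read_bytes()

print(converted.hex(" "))  # e4 b8 a5
```

The source encoding has to be supplied by the caller; as section 3 explains, the bytes alone do not say which table they were written with.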

7. Little endian and Big endian

As mentioned in the previous section, a Unicode code can be stored directly in UCS-2 format. Take the Chinese character 严: its Unicode code is 4E25, which needs two bytes, one byte being 4E and the other 25. Storing 4E first and 25 second is the Big endian way; storing 25 first and 4E second is the Little endian way.
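Both byte orders can be produced in Python with `int.to_bytes`. The comparison with the `utf-16-be` and `utf-16-le` codecs relies on the fact that for a character like 严 UTF-16 stores the same two bytes as UCS-2.

```python
code_point = 0x4E25

big = code_point.to_bytes(2, "big")        # 4E first, then 25
little = code_point.to_bytes(2, "little")  # 25 first, then 4E
print(big.hex(" "), "|", little.hex(" "))  # 4e 25 | 25 4e

# Python's codecs produce the same bytes for this character.
print("严".encode("utf-16-be") == big)      # True
print("严".encode("utf-16-le") == little)   # True
```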

Well, naturally, a question arises: how does a computer knowwhich way a file is encoded?

The Unicode specification defines a character that is placed at the very start of a file to indicate the byte order. This character is called "ZERO WIDTH NO-BREAK SPACE" and its code is FEFF. That is exactly two bytes, and FF is 1 greater than FE.

If the first two bytes of a text file are FE FF, the file is big endian; if the first two bytes are FF FE, the file is little endian.
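A minimal Python sketch of this check, using the byte order mark constants from the standard `codecs` module. The function name `detect_byte_order` is made up for the illustration, and the sketch only covers the two-byte case described above (it would misreport a UTF-32 little endian file, which also begins with FF FE).

```python
import codecs


def detect_byte_order(data: bytes) -> str:
    """Guess the byte order of a two-byte Unicode file from its first bytes."""
    if data[:2] == codecs.BOM_UTF16_BE:   # FE FF
        return "big endian"
    if data[:2] == codecs.BOM_UTF16_LE:   # FF FE
        return "little endian"
    return "unknown"


print(detect_byte_order(bytes.fromhex("FEFF4E25")))  # big endian
print(detect_byte_order(bytes.fromhex("FFFE254E")))  # little endian
```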

8. Example

Here is an example.

Open Notepad (Notepad.exe), create a new text file containing the single character 严, and save it in turn with the ANSI, Unicode, Unicode big endian, and UTF-8 encodings.

Then use the hex mode of the text editor UltraEdit to inspect how each file is encoded internally.

1) ANSI: the file is encoded as two bytes, "D1 CF", which is the GB2312 encoding of 严. This also implies that GB2312 is stored big end first.

2) Unicode: the encoding is four bytes, "FF FE 25 4E", where "FF FE" indicates little endian storage and the actual code is 4E25.

3) Unicode big endian: the encoding is four bytes, "FE FF 4E 25", where "FE FF" indicates big endian storage.

4) UTF-8: the encoding is six bytes, "EF BB BF E4 B8 A5". The first three bytes, "EF BB BF", indicate that this is UTF-8, and the last three, "E4 B8 A5", are the actual encoding of 严; they are stored in the same order as the encoding itself.
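The four byte sequences can be reproduced without Notepad or UltraEdit. In this Python sketch the codec choices are assumptions that mirror the description above: `gb2312` stands in for ANSI on a simplified Chinese system, UTF-16 with an explicit byte order mark stands in for the two Unicode options, and `utf-8-sig` is Python's UTF-8 variant that writes the leading EF BB BF.

```python
import codecs

ch = "严"

saved = {
    "ANSI": ch.encode("gb2312"),
    "Unicode": codecs.BOM_UTF16_LE + ch.encode("utf-16-le"),
    "Unicode big endian": codecs.BOM_UTF16_BE + ch.encode("utf-16-be"),
    "UTF-8": ch.encode("utf-8-sig"),
}

for name, data in saved.items():
    print(f"{name}: {data.hex(' ').upper()}")
# ANSI: D1 CF
# Unicode: FF FE 25 4E
# Unicode big endian: FE FF 4E 25
# UTF-8: EF BB BF E4 B8 A5
```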
