Character encoding basics: how Unicode, UTF-8, ASCII, and GB2312 relate and how to convert between them

Date: 2021-04-12


Character encoding is a cornerstone of computing. To really understand computers, you need to understand character encoding. Many people never think about it, but the terminology can be genuinely confusing, and it is worth sorting out if you want to learn how computers work. These notes collect what I have gradually learned on the subject.

1. ASCII code

Inside a computer, all information is ultimately represented as a string of binary digits. Each binary digit (bit) has two states, 0 and 1, so eight bits can be combined into 256 states. A group of eight bits is called a byte. In other words, one byte can represent 256 different states, each of which can stand for one symbol, giving 256 symbols, from 00000000 to 11111111.

In the 1960s, the United States developed a character encoding that standardized the relationship between English characters and binary numbers. It is called ASCII and is still in use today.

ASCII specifies 128 characters in total. For example, the space character (SPACE) is 32 (binary 00100000), and the uppercase letter "A" is 65 (binary 01000001). These 128 symbols, which include the non-printing control characters (codes 0 to 31, plus 127), occupy only the low 7 bits of a byte; the leading bit is always 0. A full ASCII table is available at: http://www.nengcha.com/code/ascii/all/
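The two values quoted above are easy to check. This is a minimal sketch; the helper name `ascii_info` is made up for illustration.

```python
def ascii_info(ch):
    """Return (decimal code, 8-bit binary string) for one ASCII character."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not an ASCII character: %r" % ch)
    # "08b" pads to eight binary digits, so the leading 0 bit is visible.
    return code, format(code, "08b")

for ch in (" ", "A", "a"):
    print(repr(ch), *ascii_info(ch))
```

The first binary digit printed is always 0, because every ASCII code fits in 7 bits.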

2. Non-ASCII encodings

128 symbols are enough to encode English, but not enough for other languages. French, for example, puts accent marks above some letters, which ASCII cannot represent. So some European countries decided to use the unused highest bit of the byte to define new symbols. For example, é in a French encoding is 130 (binary 10000010). The encodings used in these European countries could therefore represent up to 256 symbols.

But this created a new problem. Different countries use different letters, so although all of these encodings have 256 symbols, the letters they represent differ. For example, 130 stands for é in the French encoding, for the letter Gimel (ג) in the Hebrew encoding, and for yet another symbol in the Russian encoding. In all of these encodings, however, the symbols for 0 to 127 are the same; only the range 128 to 255 differs.
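The text does not name the specific encodings it has in mind. As an assumption, the sketch below uses three legacy DOS code pages that ship with Python: cp437 (where byte 130 is é), cp862 (Hebrew), and cp866 (Russian).

```python
# One and the same byte, 130 (binary 10000010), read under three code pages.
# The choice of cp437 / cp862 / cp866 is an assumption made for illustration.
raw = bytes([130])
high = {name: raw.decode(name) for name in ("cp437", "cp862", "cp866")}

# A byte in the 0-127 range reads the same everywhere.
low = {name: bytes([65]).decode(name) for name in ("cp437", "cp862", "cp866")}

for name in high:
    print(name, "130 ->", high[name], "| 65 ->", low[name])
```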

As for Asian languages, they use far more symbols; there are as many as about 100,000 Chinese characters. One byte, which can represent only 256 symbols, is certainly not enough, so more than one byte must be used per symbol. For example, the common encoding for Simplified Chinese is GB2312, which uses two bytes for each Chinese character, so in theory it can represent up to 256 x 256 = 65536 symbols.
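A quick way to see the two-byte layout is to encode a short string with Python's built-in gb2312 codec. The sample string is my own choice.

```python
# An ASCII letter stays one byte in GB2312; a Chinese character takes two.
sample = "A严"
encoded = sample.encode("gb2312")

print(len(sample), "characters ->", len(encoded), "bytes")
print(" ".join(f"{b:02X}" for b in encoded))

# The theoretical ceiling mentioned in the text.
two_byte_combinations = 256 * 256
```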

3. Unicode

As the previous section showed, there are many encodings in the world, and the same binary number can be interpreted as different symbols. So to open a text file you must know its encoding; read with the wrong encoding, it comes out garbled. Why do emails so often arrive garbled? Because the sender and the receiver use different encodings.

My own understanding, by way of illustration: when a text file is written on a computer set up for Western European languages, each character is mapped to a binary number according to that computer's encoding table and saved. If the file is then sent to a user on a Russian computer, what is transmitted is only the stream of bits. The Russian computer decodes that stream with its own encoding table, and the two tables disagree about bytes above 127. The same byte, for example 11101001, may stand for é under a Western European encoding such as Latin-1 and for й under a Russian encoding such as Windows-1251, so the text comes out garbled.
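Garbling of this kind can be reproduced in a few lines. The sketch assumes the sender uses Windows-1251 (Python codec cp1251) and the receiver wrongly assumes Latin-1; both choices are mine, not the article's.

```python
original = "Привет"               # Russian for "hello"
wire = original.encode("cp1251")  # the bytes that actually travel

wrong = wire.decode("latin-1")    # receiver guesses the wrong table
right = wire.decode("cp1251")     # receiver uses the sender's table

print("bytes :", " ".join(f"{b:02X}" for b in wire))
print("wrong :", wrong)
print("right :", right)
```

The bytes never change; only the table used to interpret them does.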

How are GB2312, the Japanese encodings, and other non-Unicode encodings displayed? Are they converted to Unicode through a conversion table (code page)?

Now imagine a single encoding that includes every symbol in the world, with each symbol given a unique code. The garbling problem would disappear. This is Unicode: as its name suggests, it is one encoding for all symbols. Unicode is, of course, a very large set; it currently has room for more than one million symbols. Each symbol has its own code. For example, U+0639 is the Arabic letter Ain, U+0041 is the English capital letter A, and U+4E25 is the Chinese character 严. For the full table of symbols, consult unicode.org or a dedicated table of Chinese characters.
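The three code points above can be looked up with Python's standard `unicodedata` module. The helper name `code_point` is made up for this sketch.

```python
import unicodedata

def code_point(ch):
    """Format a character's Unicode code point in the usual U+XXXX style."""
    return f"U+{ord(ch):04X}"

for ch in ("\u0639", "A", "严"):
    print(code_point(ch), unicodedata.name(ch))
```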

4. The problem with Unicode

Note that Unicode is only a set of symbols. It specifies the binary code of each symbol, but not how that binary code should be stored.

For example, the Unicode code of the Chinese character 严 is the hexadecimal number 4E25, which in binary is 15 bits long (100111000100101). So this symbol needs at least 2 bytes. Symbols with larger codes may need 3 or 4 bytes, or even more.
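The bit count can be checked directly. This sketch only counts the bits of the raw code number; it says nothing yet about any particular storage format, and `min_bytes` is a made-up helper name.

```python
def min_bytes(code):
    """Return (number of significant bits, whole bytes needed to hold them)."""
    bits = max(code.bit_length(), 1)
    return bits, (bits + 7) // 8

print(format(0x4E25, "b"), min_bytes(0x4E25))  # the character 严
print(format(0x41, "b"), min_bytes(0x41))      # the letter A
```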

This raises two serious problems. The first: how can Unicode be told apart from ASCII? How does a computer know that three bytes represent one symbol rather than three separate symbols? The second: we already know that one byte is enough for an English letter. If Unicode required every symbol to occupy three or four bytes, each English letter would have to be preceded by two or three bytes of zeros. That is a huge waste of storage; a text file would become three or four times as large, which is unacceptable.

As a result: 1) several ways of storing Unicode appeared, that is, there are many different binary formats that can represent Unicode; 2) Unicode did not become widespread for a long time, until the Internet came along.
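The storage cost is easy to measure. As an assumption, the sketch uses UTF-32 to stand in for a fixed four-bytes-per-symbol scheme and compares it with a one-byte-per-letter encoding of the same English text.

```python
text = "Hello, world"

fixed = text.encode("utf-32-be")   # always 4 bytes per symbol
compact = text.encode("ascii")     # 1 byte per English letter

print(" ".join(f"{b:02X}" for b in "A".encode("utf-32-be")))
print(len(compact), "bytes vs", len(fixed), "bytes")
```

Each English letter is preceded by three zero bytes in the fixed-width form, which is exactly the waste described above.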

5. UTF-8

With the spread of the Internet, a unified encoding was urgently needed. UTF-8 is one of the most widely used implementations of Unicode on the Internet. Other implementations include UTF-16 and UTF-32, but they are rarely used on the Internet. To repeat, the relationship is this: UTF-8 is one implementation of Unicode, specifying how characters are stored and transmitted.

The most notable feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes per symbol, with the length depending on the symbol.

The encoding rules for UTF-8 are simple, only two:

1) For single-byte symbols, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code. So for English letters, the UTF-8 encoding is identical to ASCII.

2) For a symbol of n bytes (n > 1), the first n bits of the first byte are set to 1 and bit n+1 is set to 0; the first two bits of every following byte are set to 10. All remaining bits are filled with the symbol's Unicode code.

The following table summarizes the encoding rules; the letter x marks the bits available for encoding.

Unicode symbol range  | UTF-8 encoding
(hexadecimal)         | (binary)
----------------------+-------------------------------------
0000 0000-0000 007F   | 0xxxxxxx
0000 0080-0000 07FF   | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF   | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF   | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The following example shows how to perform UTF-8 encoding, using the Chinese character 严.

The Unicode code of 严 is 4E25 (100111000100101). According to the table, 4E25 falls in the third range (0000 0800-0000 FFFF), so the UTF-8 encoding of 严 takes three bytes, in the format "1110xxxx 10xxxxxx 10xxxxxx". Then, starting from the last bit of 严, fill in the x positions of the format from back to front, padding any leftover positions with 0. The result is that the UTF-8 encoding of 严 is "11100100 10111000 10100101". This is what is actually stored in the computer. Converted to hexadecimal, which is easier to read, it is E4B8A5.
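The fill-from-the-back procedure can be written out by hand. This is a minimal sketch of the two rules and the range table, not a replacement for the built-in codec: the function name `utf8_encode` is made up, and it does not reject the surrogate range D800-DFFF as a strict encoder would.

```python
def utf8_encode(code):
    """Encode one Unicode code number as UTF-8 bytes, following the table."""
    if code < 0 or code > 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if code <= 0x7F:
        return bytes([code])              # rule 1: 0xxxxxxx
    if code <= 0x7FF:
        n = 2
    elif code <= 0xFFFF:
        n = 3
    else:
        n = 4
    out = []
    # Fill the trailing bytes from the back, 6 bits at a time: 10xxxxxx.
    for _ in range(n - 1):
        out.append(0b10000000 | (code & 0b00111111))
        code >>= 6
    # First byte: n one-bits, then a zero, then whatever bits are left.
    prefix = (0xFF << (8 - n)) & 0xFF
    out.append(prefix | code)
    return bytes(reversed(out))

encoded = utf8_encode(0x4E25)
print(" ".join(format(b, "08b") for b in encoded))
print(encoded.hex().upper())
```

Comparing the result with Python's own `str.encode("utf-8")` for a few characters is a simple sanity check.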

6. Conversion between Unicode and UTF-8

The example in the previous section shows that the Unicode code of 严 is 4E25 while its UTF-8 encoding is E4B8A5; the two are different. Converting between them can be done by a program.
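Going the other way, from UTF-8 bytes back to the Unicode code, means stripping the marker bits and joining what is left. A rough sketch, with a made-up function name; it handles one symbol only and skips some checks a strict decoder performs (for example, overlong forms).

```python
def utf8_decode_one(data):
    """Decode the first UTF-8 symbol in data and return its Unicode code."""
    first = data[0]
    if first < 0x80:
        return first                      # single byte: 0xxxxxxx
    # Count the leading 1 bits of the first byte to get the length n.
    n = 0
    while first & (0x80 >> n):
        n += 1
    if n < 2 or n > 4 or len(data) < n:
        raise ValueError("invalid or truncated UTF-8 sequence")
    code = first & (0x7F >> n)            # payload bits of the first byte
    for b in data[1:n]:
        if b & 0b11000000 != 0b10000000:
            raise ValueError("expected a 10xxxxxx byte")
        code = (code << 6) | (b & 0b00111111)
    return code

print(format(utf8_decode_one(bytes.fromhex("E4B8A5")), "04X"))
```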

On Windows, there is a simple way to convert: the built-in Notepad program, Notepad.exe. After opening a file, choose "Save As" from the File menu. A dialog box appears with an "Encoding" drop-down list at the bottom.

It has four options: ANSI, Unicode, Unicode big endian, and UTF-8.

1) ANSI is the default encoding. For English files it is ASCII; for Simplified Chinese files it is GB2312 (only on the Simplified Chinese version of Windows; the Traditional Chinese version uses Big5).

2) Unicode refers to UCS-2, that is, storing each character's Unicode code directly in two bytes. This option uses the little-endian format.

3) Unicode big endian corresponds to the previous option. The next section explains what little endian and big endian mean.

4) UTF-8 is the encoding described in section 5.

After choosing an encoding, click "Save" and the file is converted immediately.
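The same conversion can be done without Notepad. The sketch below is one possible way in Python: read the text with the source encoding, write it back with the target encoding. The function name and file names are hypothetical, and the files live in a temporary directory so nothing is left behind.

```python
import os
import tempfile

def convert_file(src, dst, src_encoding, dst_encoding):
    """Re-save a text file in another encoding (hypothetical helper)."""
    # newline="" keeps the original line endings untouched.
    with open(src, "r", encoding=src_encoding, newline="") as f:
        text = f.read()
    with open(dst, "w", encoding=dst_encoding, newline="") as f:
        f.write(text)

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "yan_gb2312.txt")
    dst = os.path.join(tmp, "yan_utf8.txt")
    with open(src, "wb") as f:
        f.write(bytes.fromhex("D1CF"))    # the character 严 in GB2312
    convert_file(src, dst, "gb2312", "utf-8")
    with open(dst, "rb") as f:
        converted = f.read()

print(converted.hex().upper())
```

Unlike Notepad's UTF-8 option, Python's plain "utf-8" codec writes no leading marker bytes; "utf-8-sig" does.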

7. Little endian and big endian

As mentioned in the previous section, Unicode codes can be stored directly in UCS-2 format. Take the Chinese character 严: its Unicode code is 4E25, which takes two bytes, one 4E and the other 25. Storing 4E first and 25 second is big endian; storing 25 first and 4E second is little endian.
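Both byte orders can be produced from the number 4E25 with `int.to_bytes`. For a character like this one, Python's utf-16-be and utf-16-le codecs give the same two bytes as UCS-2 would.

```python
code = 0x4E25  # the character 严

big = code.to_bytes(2, "big")        # 4E first, then 25
little = code.to_bytes(2, "little")  # 25 first, then 4E

print("big endian   :", " ".join(f"{b:02X}" for b in big))
print("little endian:", " ".join(f"{b:02X}" for b in little))
```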

Well, naturally, a question arises: how does a computer knowwhich way a file is encoded?

The Unicode specification defines a character to be placed at the very start of a file to indicate the byte order. It is called ZERO WIDTH NO-BREAK SPACE and has the code FEFF. That is exactly two bytes, and FF is 1 larger than FE.

If the first two bytes of a text file are FE FF, the file is big endian; if the first two bytes are FF FE, the file is little endian.
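A check of the first two bytes might look like this. The function name is made up; the two constants come from Python's standard `codecs` module. It only covers the two-byte case described here.

```python
import codecs

def detect_byte_order(data):
    """Return the byte order signalled by the first two bytes, or None."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "big endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "little endian"
    return None

print(detect_byte_order(bytes.fromhex("FEFF4E25")))
print(detect_byte_order(bytes.fromhex("FFFE254E")))
print(detect_byte_order(bytes.fromhex("D1CF")))
```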

8. Example

Here is an example.

Open Notepad (Notepad.exe) and create a new text file containing the single character 严. Save it in turn with the ANSI, Unicode, Unicode big endian, and UTF-8 encodings.

Then use the hex mode of the text editor UltraEdit to inspect how each file is encoded internally.

1) ANSI: the file is two bytes, "D1 CF", which is the GB2312 encoding of 严. This also shows that GB2312 is stored big end first.

2) Unicode: the file is four bytes, "FF FE 25 4E". "FF FE" indicates little-endian storage, and the actual code is 4E25.

3) Unicode big endian: the file is four bytes, "FE FF 4E 25". "FE FF" indicates big-endian storage.

4) UTF-8: the file is six bytes, "EF BB BF E4 B8 A5". The first three bytes, "EF BB BF", indicate that this is UTF-8; the last three, "E4 B8 A5", are the actual encoding of 严, stored in the same order as the encoding itself.
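The four byte sequences can be reproduced without Notepad or UltraEdit. The sketch assumes that Notepad's "ANSI" means GB2312 here (as on Simplified Chinese Windows) and maps the other three options onto Python codecs; the dictionary labels are my own.

```python
import codecs

ch = "严"
files = {
    "ANSI (GB2312)": ch.encode("gb2312"),
    "Unicode": codecs.BOM_UTF16_LE + ch.encode("utf-16-le"),
    "Unicode big endian": codecs.BOM_UTF16_BE + ch.encode("utf-16-be"),
    "UTF-8": ch.encode("utf-8-sig"),   # utf-8-sig prepends EF BB BF
}

for label, data in files.items():
    print(f"{label:20}", " ".join(f"{b:02X}" for b in data))
```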
