The Story of Bytes, Unicode, UTF-8 and ANSI, and Converting Between Them

Date: 2021-04-12

Collected from http://www.cppblog.com/AutomateProgram/archive/2010/03/26/110567.html. A fairly good write-up.

The story of Unicode, UTF-8 and ANSI

原文地址http://blog.csdn.net/iscandy/archive/2009/02/02/3859219.aspx

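Long, long ago there was a group of people who decided to use 8 transistors that could be switched on and off to form different states, and so represent everything in the world. They saw that 8 switch states were good, so they called this a "byte".

Later they built machines that could process these bytes. When the machines started up, bytes could be combined into many states, and the states began to change back and forth. They saw that this was good, so they called the machine a "computer".

At first computers were used only in the United States. An 8-bit byte can form 256 (2 to the power of 8) different states.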

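They gave special purposes to the 32 states numbered from 0: whenever a terminal or printer received one of these agreed bytes, it had to perform an agreed action. On 0x0A the terminal started a new line; on 0x07 the terminal beeped at people; on 0x1B, for example, the printer printed in reverse video, or the terminal displayed letters in color. They saw that this was good, so they called the byte states below 0x20 "control codes".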

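They then represented all the spaces, punctuation marks, digits, and upper- and lower-case letters with consecutive byte states, numbering them up to 127, so that computers could store English text using different bytes. Everyone felt this was good, so they called the scheme ANSI's "ASCII" encoding (American Standard Code for Information Interchange). At that time every computer in the world used the same ASCII scheme to store English text.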

后来就像建造巴比伦塔一样世界各地的都开始使用计算机但是很多国家用的不是英文他们的字母里有许多是ASCI I里没有的为了可以在计算机保存他们的文字他们决定采用127号之后的空位来表示这些新的字母、符号还加入了很多画表格时需要用下到的横线、竖线、交叉等形状一直把序号编到了最后一个状态255。从128到255这一页的字符集被称”扩展字符集” 。从此之后贪婪的人类再没有新的状态可以用了美帝国主义可能没有想到还有第三世界国家的人们也希望可以用到计算机吧

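By the time the Chinese got computers, there were no byte states left to represent Chinese characters, and more than 6000 commonly used characters needed to be stored. But this could not defeat the clever Chinese people. We unceremoniously dropped the odd symbols after 127 and ruled that a byte of 127 or below keeps its original meaning, while two bytes greater than 127 placed together represent one Chinese character. The first byte (called the high byte) runs from 0xA1 to 0xF7, and the second byte (the low byte) runs from 0xA1 to 0xFE, which lets us form about 7000 simplified Chinese characters. Into this encoding we also put mathematical symbols, Roman and Greek letters, and Japanese kana. Even the digits, punctuation, and letters already present in ASCII were all given new two-byte codes: these are the so-called "full-width" characters, while the originals below 127 are called "half-width" characters.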
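As a rough illustration of the layout just described, the sketch below counts how many two-byte codes fit in the stated ranges and shows a half-width and a full-width letter side by side. It assumes Python and its built-in "gb2312" codec; the helper name gb2312_capacity is made up for this example and is not part of any standard.

```python
# Hypothetical helper for illustration only.
def gb2312_capacity(hi_first=0xA1, hi_last=0xF7, lo_first=0xA1, lo_last=0xFE):
    """Number of two-byte combinations in the given high/low byte ranges."""
    return (hi_last - hi_first + 1) * (lo_last - lo_first + 1)

capacity = gb2312_capacity()               # 87 high bytes * 94 low bytes
half_width = "A".encode("gb2312")          # plain ASCII letter: one byte
full_width = "\uff21".encode("gb2312")     # FULLWIDTH LATIN CAPITAL LETTER A
hanzi = "汉".encode("gb2312")              # a Chinese character

print(capacity)
print(half_width.hex(), full_width.hex(), hanzi.hex())
```

The range has room for 8178 codes, which is consistent with the "about 7000" characters and symbols mentioned above, since not every position is assigned.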

The Chinese people saw that this was quite good, so they called this Chinese character scheme "GB2312". GB2312 is a Chinese extension of ASCII.

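But China has too many characters, and we soon found that many people's names could not be typed with it, especially those of certain state leaders who are very good at giving others trouble.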

So we had to keep finding the code points that GB2312 had left unused and put them to use without ceremony.

Later even that was not enough, so we simply dropped the requirement that the low byte be a code above 127. As long as the first byte is greater than 127, it always marks the start of a Chinese character, whether or not the byte that follows belongs to the extended character set. The resulting extended scheme is called the GBK standard. GBK includes everything in GB2312 and adds nearly 20000 new Chinese characters (including traditional characters) and symbols.

Later still, the ethnic minorities also wanted to use computers, so we extended the scheme again and added several thousand new characters for minority scripts, and GBK grew into GB18030. From then on, the culture of the Chinese nation could be handed down in the computer age.

Chinese programmers saw that this series of Chinese encoding standards was good, so they collectively called them "DBCS" (Double Byte Character Set). The defining feature of the DBCS family is that two-byte Chinese characters and one-byte English characters coexist in the same encoding scheme. Programs that support Chinese therefore have to check the value of every byte in a string: if the value is greater than 127, a double-byte character has appeared. In those days every computer monk who had been blessed with the ability to program had to recite the following mantra hundreds of times a day:

"One Chinese character counts as two English characters! One Chinese character counts as two English characters..."

At that time every country, like China, came up with its own encoding standard. As a result nobody understood anybody else's encoding, and nobody supported anybody else's encoding. Even mainland China and Taiwan, brothers separated by only 150 nautical miles and using the same language, adopted different DBCS schemes. A Chinese user who wanted the computer to display Chinese characters had to install a "Chinese character system" dedicated to the display and input of Chinese characters. But to run a fortune-telling program of the feudal-superstition variety written in Taiwan, you had to install yet another Chinese character system that supported the BIG5 encoding. Install the wrong character system and the display turned into chaos! What was to be done? And among the nations of the world there were also poor peoples who could not yet afford computers. What about their scripts? Truly a Tower of Babel problem for computers!
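The chaos described above can be reproduced in a few lines. This is a minimal sketch, assuming Python's built-in "gbk" and "big5" codecs stand in for the mainland and Taiwan character systems; the sample text and variable names are chosen only for illustration.

```python
text = "中文"
gbk_bytes = text.encode("gbk")              # bytes as a GBK system would store them

read_as_gbk = gbk_bytes.decode("gbk")       # matching character system: text survives
# Wrong character system: the same bytes are looked up in a different table.
# errors="replace" substitutes U+FFFD where a byte pair has no BIG5 meaning.
read_as_big5 = gbk_bytes.decode("big5", errors="replace")

print(gbk_bytes.hex(" ").upper())
print(read_as_gbk)
print(read_as_big5)
```

The bytes never change; only the table used to interpret them does, which is why installing the wrong character system garbles the display.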

Just then the Archangel Gabriel appeared in the nick of time: an international organization called ISO (International Organization for Standardization) decided to tackle the problem. Their method was simple: scrap all the regional encoding schemes and create a new encoding that includes every culture on earth and every letter and symbol! They planned to call it "Universal Multiple-Octet Coded Character Set", or UCS for short, commonly known as "UNICODE".

By the time UNICODE was being drawn up, computer memory capacity had grown enormously and space was no longer a problem. So ISO directly stipulated that every character must be represented uniformly with two bytes, that is, 16 bits. For the "half-width" characters in ASCII, UNICODE keeps the original code unchanged and only extends its length from 8 bits to 16 bits, while the characters of all other cultures and languages are re-encoded in one unified scheme. Because the "half-width" English symbols need only the low 8 bits, their high 8 bits are always 0, so this generous scheme takes twice the space when storing English text.
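To see the fixed 16-bit layout concretely, here is a small sketch. It assumes Python's "utf-16-be" codec as a stand-in for the original two-byte scheme, which is equivalent for the characters used here; the variable names are illustrative.

```python
for ch in ("A", "z", "汉"):
    units = ch.encode("utf-16-be")      # big-endian, no byte order mark
    print(ch, units.hex(" ").upper())

ascii_text = "hello"
as_ascii = ascii_text.encode("ascii")       # one byte per character
as_16bit = ascii_text.encode("utf-16-be")   # two bytes per character

print(len(as_ascii), len(as_16bit))
print(as_16bit[0::2])                       # the high bytes of each character
```

Every high byte of the English text is zero, which is the doubled storage cost the paragraph above describes.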

At this point programmers who had come over from the old society began to notice something strange: their strlen function could no longer be trusted, and a Chinese character was no longer equivalent to two characters, but to one! Yes, starting with UNICODE, a half-width English letter and a full-width Chinese character are each uniformly "one character", and each is also uniformly "two bytes". Please note the difference between the terms "character" and "byte": a "byte" is a physical storage unit of 8 bits, while a "character" is a culturally defined symbol. In the original 16-bit UNICODE, one character is two bytes. The era in which one Chinese character counts as two English characters is almost over.
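The difference between counting bytes and counting characters can be sketched as follows. The function dbcs_char_count is a hypothetical helper that applies the old DBCS rule (a byte greater than 127 starts a two-byte character); Python's "gbk" and "utf-16-le" codecs are assumed as stand-ins for the DBCS and 16-bit UNICODE representations.

```python
def dbcs_char_count(data: bytes) -> int:
    """Count characters in a DBCS byte string by inspecting each lead byte."""
    count = 0
    i = 0
    while i < len(data):
        # A lead byte > 127 means this byte and the next form one character.
        i += 2 if data[i] > 127 else 1
        count += 1
    return count

text = "汉字abc"
gbk_bytes = text.encode("gbk")          # 2 + 2 + 1 + 1 + 1 bytes
ucs2_bytes = text.encode("utf-16-le")   # 2 bytes for every character

print(len(gbk_bytes))                   # what a byte-counting strlen would report
print(dbcs_char_count(gbk_bytes))       # characters, after scanning lead bytes
print(len(ucs2_bytes) // 2)             # characters, by simple division
print(len(text))                        # Python's own character count
```

With the fixed-width form the character count is a plain division, whereas the DBCS form needs a scan over every byte.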

Back when many character sets coexisted, companies that made multilingual software ran into a lot of trouble. To sell the same software in different countries, they had to invoke the double-byte character set mantra whenever they localized it. They had to be careful everywhere not to make mistakes, and they also had to keep converting the text in their software back and forth between character sets. UNICODE was a very good package solution for them. Starting with Windows NT, Microsoft took the opportunity to rework its operating system and changed all of the core code to work with UNICODE. From then on, a WINDOWS system could finally display the characters of every culture in the world without any add-on native-language system.

However, UNICODE was drawn up without regard for compatibility with any existing encoding scheme. As a result, the code layout of Chinese characters in GBK and in UNICODE is completely different. No simple arithmetic converts text between UNICODE and another encoding; the conversion has to go through a lookup table.

As mentioned earlier, UNICODE represents one character with two bytes, which gives 65536 different combinations in total, probably enough to cover the symbols of every culture in the world. If that is still not enough, never mind: ISO has prepared the UCS-4 scheme, which simply means four bytes per character, so that we can form about 2.1 billion different characters (the most significant bit has other uses). That should last until the day a Galactic Federation is founded!
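A quick way to see why a lookup table is needed is to compare the numeric distance between a character's GBK code and its UNICODE code point for a few neighbouring characters. This sketch assumes Python's built-in "gbk" codec as the lookup table; the helper name offset is invented for the example.

```python
def offset(ch: str) -> int:
    """Difference between the two-byte GBK code and the Unicode code point."""
    gbk_code = int.from_bytes(ch.encode("gbk"), "big")
    return gbk_code - ord(ch)

# Three characters that sit next to each other in Unicode.
neighbours = [chr(0x4E00), chr(0x4E01), chr(0x4E03)]
offsets = {ch: offset(ch) for ch in neighbours}

for ch, off in offsets.items():
    print(f"U+{ord(ch):04X} -> GBK {ch.encode('gbk').hex().upper()}  offset {off:#x}")

# Capacity figures from the text: 16 bits, and 31 usable bits of UCS-4.
print(2 ** 16, 2 ** 31)
```

If a single constant could convert one encoding to the other, all the offsets would be equal; they are not, so a table is required.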

Along with UNICODE came the rise of computer networks, and how to transmit UNICODE over a network was another problem that had to be considered. So a number of transmission-oriented UTF (UCS Transfer Format) standards appeared. As the names suggest, UTF8 transmits data 8 bits at a time and UTF16 transmits 16 bits at a time. For the sake of reliable transmission, though, UNICODE does not map directly onto UTF; the conversion goes through certain algorithms and rules.

Computer monks who have been blessed with network programming all know that one very important issue when passing information over a network is how the high and low bytes of the data are interpreted. Some computers send the low byte first, such as our PCs with their INTEL architecture, while others send the high byte first. When exchanging data over a network, a very simple method is used to check that both sides agree on the byte order: a marker is sent at the beginning of the text stream. If the text that follows is high byte first, "FEFF" is sent; otherwise "FFFE" is sent. If you don't believe it, open a UTF-16 file in binary form and see whether its first two bytes are these two bytes. (In a UTF-8 file the same marker appears as the three bytes EF BB BF.)
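The byte order marker can be checked with a few lines of code. This is a minimal sketch using the BOM_UTF16_BE and BOM_UTF16_LE constants from Python's standard codecs module; the function name sniff_utf16_order is made up for the example.

```python
import codecs

def sniff_utf16_order(data: bytes) -> str:
    """Guess the byte order from the first two bytes of a text stream."""
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF: high byte first
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE: low byte first
        return "little-endian"
    return "unknown"

# The same character, written in both byte orders with its marker in front.
be = codecs.BOM_UTF16_BE + "汉".encode("utf-16-be")
le = codecs.BOM_UTF16_LE + "汉".encode("utf-16-le")

print(be.hex(" ").upper(), sniff_utf16_order(be))
print(le.hex(" ").upper(), sniff_utf16_order(le))
```

The receiver only has to look at the marker to know which of the two byte pairs that follow is the intended reading.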

While we are here, let us mention a famous oddity. Create a new file in Windows Notepad, type the two characters "联通" (Liantong, the short name of China Unicom), save, close, and open the file again: you will find that the two characters have vanished and a few garbled characters have taken their place! Ha ha, some people say this is why Unicom cannot beat China Mobile. In fact it happens because the GB2312 encoding and the UTF8 encoding have collided.

Here is a table of conversion rules from UNICODE to UTF8, taken from the internet:

Unicode (hex)    UTF-8 (binary)
0000 - 007F      0xxxxxxx
0080 - 07FF      110xxxxx 10xxxxxx
0800 - FFFF      1110xxxx 10xxxxxx 10xxxxxx

For example, the Unicode code of the character "汉" (Han) is 6C49. 6C49 lies between 0800 and FFFF, so the 3-byte template applies: 1110xxxx 10xxxxxx 10xxxxxx. Written in binary, 6C49 is:

0110 1100 0100 1001

Following the three-byte template, this bit stream is split into 0110 110001 001001. Substituting these groups for the x's in the template in order gives 1110-0110 10-110001 10-001001, that is, E6 B1 89, which is the UTF8 encoding of the character.
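The template substitution above can be written out as a short function. This is a sketch that covers only the three rows of the table (0000 to FFFF) and does not treat the surrogate range specially; the name utf8_encode_bmp is hypothetical.

```python
def utf8_encode_bmp(code_point: int) -> bytes:
    """Encode a code point in 0000-FFFF using the three templates."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only covers 0000-FFFF")
    if code_point <= 0x7F:                          # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:                         # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xE0 | (code_point >> 12),        # 1110xxxx
                  0x80 | ((code_point >> 6) & 0x3F),    # 10xxxxxx
                  0x80 | (code_point & 0x3F)])          # 10xxxxxx

print(utf8_encode_bmp(0x6C49).hex(" ").upper())
```

For 6C49 this prints E6 B1 89, matching the worked example, and the result agrees with Python's own "汉".encode("utf-8").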

When you create a new text file in Notepad, the default encoding is ANSI. If you type Chinese characters under the ANSI encoding, they are actually stored in the GB family of encodings. In this encoding, the codes for "联通" are:

C1 11000001

AA 10101010

CD 11001101

A8 10101000

Did you notice? The first and second bytes, and likewise the third and fourth bytes, begin with "110" and "10", which matches the two-byte template in the UTF8 rules exactly. So when the file is opened again, Notepad mistakenly concludes that it is a UTF8-encoded file. Remove the 110 from the first byte and the 10 from the second byte and we get "00001 101010". Align the bits and pad with leading zeros and we get "0000 0000 0110 1010". Sorry, that is UNICODE 006A, the lowercase letter "j". The next two bytes decode under UTF8 to 0368, which is not an ordinary printable character at all. This is why a file containing only the two characters "联通" cannot be displayed properly in Notepad.
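The misreading can be traced byte by byte. The sketch below applies the two-byte template without any validity checks, which is an assumption about how a lenient decoder behaves and not a description of Notepad's actual code; lenient_two_byte_decode is a hypothetical helper, and Python's "gbk" codec supplies the ANSI bytes.

```python
def lenient_two_byte_decode(b1: int, b2: int) -> int:
    """Apply the 110xxxxx 10xxxxxx template with no further validation."""
    if b1 >> 5 != 0b110 or b2 >> 6 != 0b10:
        raise ValueError("bytes do not match the two-byte template")
    return ((b1 & 0x1F) << 6) | (b2 & 0x3F)

raw = "联通".encode("gbk")                   # the bytes Notepad saved as ANSI
first = lenient_two_byte_decode(raw[0], raw[1])
second = lenient_two_byte_decode(raw[2], raw[3])

print(raw.hex(" ").upper())
print(f"{first:04X} {second:04X}")

# A strict UTF-8 decoder, such as Python's, refuses these bytes.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)
```

The lenient reading yields 006A and 0368 as in the text, while a strict decoder rejects the bytes because C1 is not a valid UTF-8 lead byte.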

If you type a few more characters after "联通", the codes of the other characters will not necessarily also happen to be bytes starting with 110 and 10. When the file is opened again, Notepad will then no longer insist that it is a UTF8-encoded file and will read it as ANSI, so the garbled text does not appear.

Interconversion:

Original address: http://club.topsage.com/thread-670150-1-1.html

Conversion between ANSI, Unicode, and UTF-8 strings, and writing text files

ANSI strings are the ones we are most familiar with: an English character takes one byte, a Chinese character takes 2 bytes, and the string ends with a \0. They are commonly used in TXT text files.
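As one possible illustration of such conversions, the sketch below re-encodes a string between the three forms by going through Python's str type. It assumes that "ANSI" on a Simplified Chinese system means the "gbk" codec and that 16-bit "Unicode" means "utf-16-le"; the helper name convert is invented here and is not the method of the original thread.

```python
def convert(data: bytes, src: str, dst: str) -> bytes:
    """Re-encode bytes from one encoding to another via a decoded string."""
    return data.decode(src).encode(dst)

ansi = "汉".encode("gbk")                        # assumed ANSI form: 2 bytes
utf8 = convert(ansi, "gbk", "utf-8")             # ANSI -> UTF-8
utf16 = convert(utf8, "utf-8", "utf-16-le")      # UTF-8 -> 16-bit Unicode
back = convert(utf16, "utf-16-le", "gbk")        # 16-bit Unicode -> ANSI

for label, data in (("ANSI", ansi), ("UTF-8", utf8), ("UTF-16LE", utf16)):
    print(label, data.hex(" ").upper())
print(back == ansi)
```

Each step is a table lookup inside the codec, and the round trip returns the original ANSI bytes.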
