UTF-8: Unicode Transformation Format, 8-bit. A BOM is allowed but usually omitted. It is a multi-byte encoding designed to handle international characters: English letters take 8 bits (one byte) and Chinese characters take 24 bits (three bytes). UTF-8 covers the characters used by every country in the world, so it is an international encoding with strong versatility. UTF-8 encoded text can be displayed in any browser that supports the UTF-8 character set. For example, a Chinese page encoded in UTF-8 will display correctly even on an English-language IE abroad, without the user having to download IE's Chinese language support pack.
GBK is a standard expanded from, and backward compatible with, the national standard GB2312. GBK represents Chinese characters in double bytes (standard ASCII characters remain single-byte); to distinguish the bytes of a Chinese character from ASCII, their highest bits are set to 1. GBK contains all Chinese characters and is a national encoding. It is less versatile than UTF-8, but Chinese text stored as UTF-8 occupies more database space than the same text stored as GBK.
GBK, GB2312, etc. must be converted to UTF8 through Unicode encoding:
GBK, GB2312--Unicode--UTF8
UTF8--Unicode--GBK, GB2312
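The conversion path above can be sketched in Python, whose str type serves exactly as the Unicode intermediate. A minimal example of the GBK to Unicode to UTF-8 round trip, using "汉" as the sample character:

```python
# GBK -> Unicode -> UTF-8, with "汉" as the sample character.
gbk_bytes = "汉".encode("gbk")       # GBK bytes: BA BA
text = gbk_bytes.decode("gbk")       # GBK -> Unicode (Python str)
utf8_bytes = text.encode("utf-8")    # Unicode -> UTF-8 bytes: E6 B1 89
print(gbk_bytes.hex(), utf8_bytes.hex())
```

The reverse direction (UTF-8 to GBK) simply swaps the decode and encode arguments.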
For a website or forum, if there are many English characters, it is recommended to use UTF-8 to save space. However, many forum plug-ins now generally only support GBK.
To explain the difference between the encodings simply: Unicode, GBK and Big5 are encoded values, while UTF-8, UTF-16 and the like are representations of such values. The three encodings are mutually incompatible: for the same Chinese character, the three code values are completely different. For example, the Unicode value of "汉" differs from its GBK value; suppose the Unicode value were a040 and the GBK value b030. UTF-8 is then a form in which that value is expressed, and UTF-8 is organized purely around Unicode. So if GBK is to be converted to UTF-8, it must first be converted to Unicode, and then to UTF-8.
For details, see the article reprinted below.
Let's talk about Unicode encoding, and briefly explain terms such as UCS, UTF, BMP, and BOM. This is a fun read written by a programmer for programmers. "Fun" means you can easily clear up some previously fuzzy concepts and improve your knowledge, much like leveling up in an RPG game. The motivation for writing this article was two questions:
Question one:
Using "Save As" in Windows Notepad, you can convert between GBK, Unicode, Unicode big endian and UTF-8 encoding methods. Or you can go directly to http://www.knowsky.com/tools/utf8.asp for online conversion.
They are all just .txt files, so how does Windows identify the encoding method?
I discovered long ago that txt files saved as Unicode, Unicode big endian, and UTF-8 have a few extra bytes at the beginning: FF FE (Unicode), FE FF (Unicode big endian), and EF BB BF (UTF-8). But on what standard are these markers based?
Question two:
Recently I saw ConvertUTF.c on the Internet, which implements mutual conversion among UTF-32, UTF-16 and UTF-8. I already knew about encoding schemes such as Unicode (UCS-2), GBK, and UTF-8, but this program confused me a little: I could not remember the relationship between UTF-16 and UCS-2.
After checking the relevant information, I finally clarified these issues, and also learned some details about Unicode. Write an article and send it to friends who have similar questions. This article is written as easy to understand as possible, but readers are required to know what bytes are and what hexadecimal is.
0. big endian and little endian
big endian and little endian are different ways the CPU handles multibyte numbers. For example, the Unicode encoding of the character "汉" is 6C49. So when writing to a file, should 6C be written in front or 49 be written in front? If 6C is written in front, it is big endian. If 49 is written in front, it is little endian.
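The two byte orders can be seen directly with Python's struct module, packing the code point 6C49 of "汉" both ways:

```python
import struct

cp = 0x6C49                      # Unicode code point of "汉"
big = struct.pack(">H", cp)      # big endian: 6C written first
little = struct.pack("<H", cp)   # little endian: 49 written first
print(big.hex(), little.hex())
```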
The word "endian" comes from "Gulliver's Travels". The civil war in Lilliput was fought over whether eggs should be cracked from the big end (Big-Endian) or the little end (Little-Endian). As a result there were six rebellions; one emperor lost his life and another lost his throne.
We generally translate endian as "byte order", and big endian and little endian are called "big end" and "little end".
1. Character encoding and internal code (with a digression on Chinese character encoding)
Characters must be encoded before a computer can process them. The default encoding a computer uses is its internal code. Early computers used 7-bit ASCII encoding. To process Chinese characters, programmers designed GB2312 for Simplified Chinese and Big5 for Traditional Chinese.
The IETF's RFC 2781 and RFC 3629 describe the UTF-16 and UTF-8 encoding schemes clearly, crisply, and rigorously, in the consistent style of RFCs. I can never remember that IETF stands for Internet Engineering Task Force, but the RFCs the IETF maintains are the basis for all specifications on the Internet.
2.1. Internal code and code page
Currently, the Windows kernel already supports the Unicode character set, so the kernel can support every language in the world. However, because a large number of existing programs and documents use encodings for particular languages, such as GBK, Windows cannot simply drop support for existing encodings and use only Unicode.
Windows uses code pages to adapt to various countries and regions. The code page can be understood as the internal code mentioned earlier. The code page corresponding to GBK is CP936.
Microsoft also defines a code page for GB18030: CP54936. However, since GB18030 has some 4-byte encodings, and the Windows code page only supports single-byte and double-byte encodings, this code page cannot really be used.
3. UCS-2, UCS-4, BMP
UCS comes in two formats: UCS-2 and UCS-4. As the name suggests, UCS-2 is encoded with two bytes, and UCS-4 is encoded with 4 bytes (actually only 31 bits are used, the highest bit must be 0). Let's do some simple math games:
UCS-2 has 2^16=65536 code points, and UCS-4 has 2^31=2147483648 code points.
UCS-4 is divided into 2^7=128 groups according to the highest byte with the highest bit being 0. Each group is divided into 256 planes based on the next highest byte. Each plane is divided into 256 rows according to the third byte, and each row contains 256 cells. Of course, cells in the same row only differ in the last byte, and the rest are the same.
Plane 0 of group 0 is called the Basic Multilingual Plane, or BMP. In other words, in UCS-4, the code positions whose upper two bytes are 0 form the BMP.
UCS-2 is obtained by removing the first two zero bytes of UCS-4's BMP. Add two zero bytes in front of the two bytes of UCS-2 to get the BMP of UCS-4. There are no characters allocated outside the BMP in the current UCS-4 specification.
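The group/plane/row/cell decomposition above is just byte slicing. A sketch (the helper name ucs4_parts is mine):

```python
def ucs4_parts(cp):
    """Split a 31-bit UCS-4 code position into (group, plane, row, cell)."""
    group = (cp >> 24) & 0x7F   # highest byte; its top bit is always 0
    plane = (cp >> 16) & 0xFF   # next-highest byte
    row = (cp >> 8) & 0xFF      # third byte
    cell = cp & 0xFF            # last byte
    return group, plane, row, cell

# "汉" (6C49) has group 0 and plane 0, i.e. it lies in the BMP.
print(ucs4_parts(0x6C49))
```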
4. UTF encoding
UTF-8 encodes UCS in 8-bit units. The encoding from UCS-2 to UTF-8 is as follows:
UCS-2 encoding (hexadecimal)    UTF-8 byte stream (binary)
0000-007F                       0xxxxxxx
0080-07FF                       110xxxxx 10xxxxxx
0800-FFFF                       1110xxxx 10xxxxxx 10xxxxxx
For example, the Unicode encoding of "汉" is 6C49. 6C49 lies between 0800 and FFFF, so the 3-byte template must be used: 1110xxxx 10xxxxxx 10xxxxxx. Writing 6C49 in binary gives: 0110 110001 001001. Substituting this bit stream into the x positions of the template in turn gives: 11100110 10110001 10001001, which is E6 B1 89.
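The three templates can be applied mechanically. A minimal sketch covering only the BMP (the function name is mine), which can be checked against Python's built-in codec:

```python
def utf8_encode_bmp(cp):
    """Encode a BMP code point (0000-FFFF) using the three UTF-8 templates."""
    if cp <= 0x7F:
        return bytes([cp])                      # 0xxxxxxx
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6),         # 110xxxxx
                      0x80 | (cp & 0x3F)])      # 10xxxxxx
    return bytes([0xE0 | (cp >> 12),            # 1110xxxx
                  0x80 | ((cp >> 6) & 0x3F),    # 10xxxxxx
                  0x80 | (cp & 0x3F)])          # 10xxxxxx

# 6C49 -> E6 B1 89, matching the hand calculation above
print(utf8_encode_bmp(0x6C49).hex())
```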
Readers can use Notepad to test whether our coding is correct. It should be noted that UltraEdit will automatically convert to UTF-16 when opening a UTF-8 encoded text file, which may cause confusion. You can turn this option off in settings. A better tool is Hex Workshop.
UTF-16 encodes UCS in 16-bit units. For UCS codes below 0x10000, the UTF-16 encoding is simply the 16-bit unsigned integer equal to the UCS code. For UCS codes at or above 0x10000, an algorithm (surrogate pairs) is defined. However, since the BMP of the actually used UCS-2 or UCS-4 stays below 0x10000, for now UTF-16 and UCS-2 can be considered basically the same. The difference is that UCS-2 is only an encoding scheme, while UTF-16 is used for actual transmission, so the issue of byte order must be considered.
5. UTF byte order and BOM
UTF-8 uses the byte as its encoding unit and has no byte order problem. UTF-16 uses two bytes as its encoding unit, so before interpreting a UTF-16 text, you must first know the byte order of each encoding unit. For example, the Unicode encoding of "奎" is 594E, and the Unicode encoding of "乙" is 4E59. If we receive the UTF-16 byte stream 59 4E, is it "奎" or "乙"?
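This ambiguity can be reproduced directly: the same two bytes 59 4E decode to different characters under the two byte orders.

```python
data = b"\x59\x4e"
print(data.decode("utf-16-be"))  # interpreted big endian: U+594E
print(data.decode("utf-16-le"))  # interpreted little endian: U+4E59
```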
The method recommended by the Unicode specification for marking byte order is the BOM. This BOM is not the "Bill of Materials" BOM, but the Byte Order Mark. The BOM is a clever little idea:
UCS includes a character called "ZERO WIDTH NO-BREAK SPACE", whose encoding is FEFF. FFFE, on the other hand, is a character that does not exist in UCS, so it should never appear in actual transmission. The UCS specification recommends transmitting the character "ZERO WIDTH NO-BREAK SPACE" before transmitting a byte stream.
In this way, if the receiver receives FEFF, it indicates that the byte stream is Big-Endian; if it receives FFFE, it indicates that the byte stream is Little-Endian. Therefore the character "ZERO WIDTH NO-BREAK SPACE" is also called BOM.
UTF-8 does not need a BOM to indicate byte order, but it can use a BOM to indicate the encoding. The UTF-8 encoding of "ZERO WIDTH NO-BREAK SPACE" is EF BB BF (readers can verify this with the encoding method introduced above). So if a receiver gets a byte stream starting with EF BB BF, it knows the stream is UTF-8 encoded.
Windows uses BOM to mark the encoding of text files.
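A minimal BOM sniffer following the rules above (the function name and return convention are mine; the longer UTF-8 mark is checked before the two-byte marks):

```python
BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),     # EF BB BF
    (b"\xff\xfe", "utf-16-le"),     # FF FE -> little endian
    (b"\xfe\xff", "utf-16-be"),     # FE FF -> big endian
]

def sniff_bom(data):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, enc in BOMS:
        if data.startswith(bom):
            return enc, len(bom)
    return None, 0

print(sniff_bom(b"\xef\xbb\xbf\xe6\xb1\x89"))
```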
6. Further reference materials The main reference material for this article is "Short overview of ISO-IEC 10646 and Unicode" ( http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html ).
I also found two pieces of information that looked good, but because I already had the answers to my initial questions, I didn’t read them:
"Understanding Unicode A general introduction to the Unicode Standard" ( http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-Chapter04a )
"Character set encoding basics Understanding character set encodings and legacy encodings" ( http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-Chapter03 )
I have written software packages for converting UTF-8, UCS-2, and GBK to and from each other, including versions using Windows API and versions that do not use Windows API. If I have time in the future, I will sort it out and put it on my personal homepage ( http://fmddlmyy.home4u.china.com ).
I started writing this article only after I had thought all the issues through, and I expected to finish quickly. Unexpectedly, weighing the wording and checking the details took a long time: I wrote from 1:30 in the afternoon until 9:00 in the evening. I hope some readers benefit from it.
Appendix 1. On the location code, GB2312, internal code and code page
Some friends still have questions about this sentence in the article:
"The original text of GB2312 is still a location code. To go from the location code to the internal code, add A0 to the high byte and the low byte respectively."
Let me explain it in detail:
"The original text of GB2312" refers to a 1980 national standard, the "National Standard of the People's Republic of China: Basic Set of the Chinese Coded Character Set for Information Interchange (GB2312-80)". This standard encodes Chinese characters and Chinese symbols with two numbers: the first is called the "area" and the second the "position", so it is also called the location code. Areas 1-9 hold Chinese symbols, areas 16-55 first-level Chinese characters, and areas 56-87 second-level Chinese characters. Windows still offers a location input method; for example, inputting 1601 yields "啊". (This input method can also recognize hexadecimal GB2312 codes alongside decimal location codes, meaning that inputting B0A1 also yields "啊".)
Internal code refers to the character encoding within the operating system. The internal code of early operating systems was language-dependent. Today's Windows supports Unicode within the system, and then uses code pages to adapt to various languages. The concept of "internal code" is relatively vague. Microsoft generally refers to the encoding specified by the default code page as internal code.
There is no official definition of the term "internal code", and "code page" is just a Microsoft term. As programmers, as long as we know what they refer to, there is no need to scrutinize these terms too closely.
The so-called code page is the character encoding for a particular language. For example, the code page of GBK is CP936, the code page of Big5 is CP950, and the code page of GB2312 is CP20936.
Windows has the concept of a default code page, that is, what encoding is used by default to interpret characters. For example, Windows Notepad opens a text file, and the content inside is a byte stream: BA, BA, D7, D6. How should Windows interpret it?
Should it be interpreted as Unicode, GBK, Big5, or ISO 8859-1? Interpreted as GBK, it reads "汉字" ("Chinese characters"). Under other encoding interpretations, the corresponding characters may not be found, or the wrong characters may be found. "Wrong" here means inconsistent with the text author's intent, producing garbled characters.
The answer is that Windows interprets the byte stream in the text file according to the current default code page. The default code page can be set through the Regional Options in Control Panel. There is an ANSI item in Notepad's Save As, which actually saves according to the encoding method of the default code page.
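The effect of choosing the wrong code page is easy to reproduce by decoding the byte stream from the example above under different encodings (the Latin-1 result is typical mojibake):

```python
data = b"\xba\xba\xd7\xd6"
print(data.decode("gbk"))       # the author's intended text: 汉字
print(data.decode("latin-1"))   # four unrelated characters: mojibake
```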
The internal code of Windows is Unicode, which can technically support multiple code pages at the same time. As long as the file can explain what encoding it uses and the user has installed the corresponding code page, Windows can display it correctly. For example, charset can be specified in an HTML file.
Some HTML authors, especially English-speaking ones, assume that everyone in the world uses English and do not specify a charset in the file. If such an author uses characters in the 0x80-0xFF range, and Chinese Windows interprets them under the default GBK, garbled characters appear. The fix is simply to add a charset declaration to the HTML file, for example:
<meta http-equiv="Content-Type" content="text/html; charset=ISO8859-1">
If the code page used by the original author is compatible with ISO8859-1, there will be no garbled characters.
Back to the location code. The location code of "啊" is 1601, which in hexadecimal is 0x10, 0x01. This conflicts with the ASCII encoding widely used by computers, so, to remain compatible with ASCII's 00-7F range, A0 is added to both the high and low bytes of the location code. The code for "啊" thus becomes B0A1. The encoding with the two A0s added is also called GB2312 encoding, although the original GB2312 text never mentions this at all.
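The A0 shift is a one-line computation. A sketch (the helper name is mine):

```python
def location_to_gb2312(area, position):
    """Convert a decimal location ("area-position") code to GB2312 bytes
    by adding 0xA0 to each byte, as described above."""
    return bytes([area + 0xA0, position + 0xA0])

b = location_to_gb2312(16, 1)        # location code 1601
print(b.hex(), b.decode("gb2312"))   # B0A1, which decodes to "啊"
```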