Detailed explanation of the use of Java character encoding

Author: Eve Cole  Update Time: 2024-11-23 19:36:01

1. What is character encoding?

A character is a general term for written symbols, including letters, ideographs, graphic symbols, and mathematical symbols. A collection of abstract characters is a character set (charset), and a character encoding maps each character in the set to a numeric code and a byte sequence. Character sets exist to make information easier to store and exchange. Commonly used character sets include ASCII, ISO 8859-1, Unicode, and GB2312.

2. What are the characteristics of each character set?

ASCII:

ASCII (American Standard Code for Information Interchange) is a character encoding based on the Latin alphabet.

Contents: control characters (such as carriage return, backspace, and line feed) and printable characters (uppercase and lowercase English letters, Arabic numerals, and Western punctuation).

Technical characteristics: 7 bits represent one character, a total of 128 characters

Disadvantages: It can represent only English text. The additional letters and symbols used by the languages of Western Europe, East Asia, and Latin America cannot be represented.
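The 7-bit limit can be observed directly from Java. The following is a minimal sketch, not part of the original article: the class name AsciiDemo and the helper isAscii are made up for illustration, while StandardCharsets.US_ASCII and String.getBytes are standard JDK APIs. It shows that each ASCII character encodes to one byte and that a character outside the range is replaced with '?' when encoded.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; only the JDK calls are real APIs.
class AsciiDemo {
    // Returns true when every char in the text fits in 7 bits (0x00-0x7F).
    static boolean isAscii(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] bytes = "Java".getBytes(StandardCharsets.US_ASCII);
        System.out.println(bytes.length); // 4: one byte per character
        System.out.println(bytes[0]);     // 74, the ASCII code of 'J'

        // U+00E9 (e with acute accent) is outside ASCII, so getBytes
        // substitutes the charset's replacement byte, which is '?'.
        byte[] lossy = "\u00E9".getBytes(StandardCharsets.US_ASCII);
        System.out.println((char) lossy[0]);     // ?
        System.out.println(isAscii("\u00E9"));   // false
    }
}
```

Because the substitution is silent, checking with a helper such as isAscii before encoding is one way to avoid losing data.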

ISO 8859-1:

ISO 8859-1, formally ISO/IEC 8859-1:1998, also known as Latin-1 or "Western European", is the first part of the ISO/IEC 8859 series of 8-bit character sets published by the International Organization for Standardization.

It builds on ASCII and adds 96 letters and symbols in the 0xA0-0xFF range, which ASCII leaves unused, for Latin-alphabet languages that need additional characters. An earlier edition was published as ISO 8859-1:1987.

Contents: the ASCII characters plus the additional characters used by some Western European languages.

Technical characteristics: 8 bits represent a character.
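As a small illustration of the one-byte-per-character rule, the sketch below encodes a string as ISO 8859-1 and prints each byte value. The class name Latin1Demo and the helper latin1Codes are assumptions made for this example; StandardCharsets.ISO_8859_1 is the standard JDK constant.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; only the JDK calls are real APIs.
class Latin1Demo {
    // Encodes text as ISO 8859-1 and returns the unsigned value of each byte.
    static int[] latin1Codes(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        int[] codes = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            codes[i] = bytes[i] & 0xFF; // Java bytes are signed, so mask to 0-255
        }
        return codes;
    }

    public static void main(String[] args) {
        // "A" followed by U+00E9 (e with acute accent)
        int[] codes = latin1Codes("A\u00E9");
        System.out.printf("0x%02X 0x%02X%n", codes[0], codes[1]); // 0x41 0xE9
    }
}
```

The ASCII letter keeps its ASCII value, and the accented letter lands in the 0xA0-0xFF range added by ISO 8859-1.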

Unicode:

Unicode is a character encoding system developed by the Unicode Consortium. Its repertoire corresponds to the Universal Multiple-Octet Coded Character Set (UCS) defined by ISO/IEC 10646. It supports the exchange, processing, and display of text written in the world's languages. Work on the encoding began around 1990, and Unicode 1.0 was published in 1991. Unicode 4.1.0 was released on March 31, 2005, and many later versions have followed.

Technical characteristics: Unicode was originally designed as a 16-bit encoding in which each character occupies 2 bytes. Current versions define code points from U+0000 to U+10FFFF, so characters beyond the first 65,536 need more than 16 bits. The Unicode code point of a character is fixed. In practice, however, platforms differ in design and space needs to be saved, so code points are serialized to bytes in different ways. These serializations are called Unicode Transformation Formats (UTF). For example, transmitting a file of 7-bit ASCII characters with a 2-byte-per-character encoding wastes half of the space. UTF-8 addresses this: it is a variable-length encoding that still represents each 7-bit ASCII character in a single byte whose high bit is 0. Other Unicode characters are converted by a fixed algorithm into sequences of 2 to 4 bytes (2 or 3 bytes for characters in the 16-bit range). Every byte in such a sequence has its high bit set to 1, and the leading bits of the first byte indicate the length of the sequence.
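The space trade-off between a fixed 2-byte form and UTF-8 can be checked with a short Java sketch. The class name Utf8LengthDemo and its two helpers are hypothetical; StandardCharsets.UTF_8 and StandardCharsets.UTF_16BE are standard JDK constants (UTF_16BE is chosen here because it writes no byte-order mark). Note that a character outside the 16-bit range takes four bytes in both encodings.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; only the JDK calls are real APIs.
class Utf8LengthDemo {
    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the text occupies as big-endian UTF-16 without a BOM.
    static int utf16Length(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",            // U+0041, ASCII letter
            "\u00E9",       // U+00E9, Latin small letter e with acute
            "\u4E2D",       // U+4E2D, a common Chinese character
            "\uD83D\uDE00"  // U+1F600, outside the 16-bit range (surrogate pair)
        };
        for (String s : samples) {
            System.out.printf("%d %d%n", utf8Length(s), utf16Length(s));
        }
        // Prints:
        // 1 2
        // 2 2
        // 3 2
        // 4 4
    }
}
```

For mostly ASCII text UTF-8 halves the size, while for Chinese text UTF-16 is the more compact of the two.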

GB2312:

GB 2312, or GB 2312-80, is China's national standard character set for simplified Chinese. Its full name is "Chinese Coded Character Set for Information Exchange Basic Set", and it is also known as GB0. It was issued by the State Administration of Standards of China and took effect on May 1, 1981. GB2312 is widely used in mainland China and is also used in Singapore and elsewhere. Almost all Chinese-language systems and internationalized software in mainland China support GB 2312.

Contents: 6763 Chinese characters, of which 3755 are level-1 characters and 3008 are level-2 characters, plus 682 other characters, including Latin letters, Greek letters, Japanese hiragana and katakana, and Russian Cyrillic letters.

Technical features: Each Chinese character or symbol is represented by two bytes. The first byte is called the "high byte" and the second the "low byte". The high byte uses 0xA1-0xF7 (area numbers 01-87 plus 0xA0), and the low byte uses 0xA1-0xFE (position numbers 01-94 plus 0xA0). Because level-1 Chinese characters start at area 16, the high byte of the Chinese character region ranges over 0xB0-0xF7 and the low byte over 0xA1-0xFE, which gives 72*94=6768 code positions. Of these, the 5 positions D7FA-D7FE are unassigned, leaving 6763 characters.
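The byte arithmetic above can be sketched in a few lines of Java. This is an illustrative example only: the class name Gb2312Demo and the helper toCode are made up here, and the code does the arithmetic by hand rather than relying on a JDK charset, since GB2312 support is not guaranteed on every Java runtime.

```java
// Hypothetical demo class that reproduces the GB2312 byte arithmetic by hand.
class Gb2312Demo {
    // Converts an area number and a position number (each 1-94) into the
    // two-byte GB2312 code by adding 0xA0 to each of them.
    static int toCode(int area, int position) {
        if (area < 1 || area > 94 || position < 1 || position > 94) {
            throw new IllegalArgumentException("area and position must be 1-94");
        }
        int high = area + 0xA0;     // high byte
        int low = position + 0xA0;  // low byte
        return (high << 8) | low;
    }

    public static void main(String[] args) {
        // Area 16, position 1 is the first level-1 Chinese character.
        System.out.printf("0x%04X%n", toCode(16, 1)); // 0xB0A1

        int areas = 0xF7 - 0xB0 + 1;     // 72 areas in the Chinese character region
        int positions = 0xFE - 0xA1 + 1; // 94 positions per area
        System.out.println(areas * positions);     // 6768
        System.out.println(areas * positions - 5); // 6763 after the 5 unassigned positions
    }
}
```

The last line matches the 6763 Chinese characters that the standard defines.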