Every region of the world has its own language, and regional differences lead directly to different language environments. When developing an internationalized program, handling language issues correctly is essential.
This is a worldwide problem, so Java provides a worldwide solution. The method described in this article is aimed at processing Chinese, but by extension it applies equally to languages from other countries and regions.
Chinese characters are double-byte. "Double-byte" means that each character occupies two bytes (16 bits), called the high byte and the low byte. The Chinese character encoding specified in China is GB2312, which is a mandatory standard; almost every application that can process Chinese supports it. GB2312 contains the level-1 and level-2 Chinese characters plus 9 zones of symbols. The high byte ranges from 0xa1 to 0xfe, and the low byte also ranges from 0xa1 to 0xfe. Within this space, Chinese characters occupy 0xb0a1 to 0xf7fe.
Another encoding, GBK, is a specification rather than a mandatory standard. GBK provides 20902 Chinese characters, is compatible with GB2312, and has an encoding range of 0x8140 to 0xfefe. Every character in GBK maps one-to-one to Unicode 2.0.
In the near future, China will promulgate another standard: GB18030-2000 (GBK2K). It adds the scripts of Tibetan, Mongolian and other ethnic minorities, fundamentally solving the shortage of code positions. Note that it is no longer fixed-length: the two-byte part is compatible with GBK, and the four-byte part holds the extended characters. In a four-byte code, the first and third bytes range from 0x81 to 0xfe, and the second and fourth bytes range from 0x30 to 0x39.
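The byte ranges above are enough to tell how long a GB18030 sequence is from its first two bytes. The following is a minimal sketch of that check, not a full validator; the class and method names are hypothetical, and the exclusion of 0x7f as a second byte is an assumption taken from the GBK layout rather than from this article.

```java
final class Gb18030LengthSketch {
    /**
     * Returns 1, 2 or 4 for a sequence that starts with bytes b1 and b2
     * (given as unsigned values 0..255), or -1 if they fit none of the ranges.
     */
    static int sequenceLength(int b1, int b2) {
        if (b1 <= 0x7f) {
            return 1;                       // single-byte ASCII
        }
        if (b1 < 0x81 || b1 > 0xfe) {
            return -1;                      // not a valid lead byte
        }
        if (b2 >= 0x30 && b2 <= 0x39) {
            return 4;                       // four-byte extended part
        }
        if (b2 >= 0x40 && b2 <= 0xfe && b2 != 0x7f) {
            return 2;                       // two-byte part, compatible with GBK
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength(0xd6, 0xd0)); // 2
        System.out.println(sequenceLength(0x81, 0x30)); // 4
        System.out.println(sequenceLength(0x41, 0x00)); // 1
    }
}
```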
This article does not intend to introduce Unicode; interested readers can browse "http://www.unicode.org/" for more information. Unicode has one key property: it aims to include every character in the world. Languages from every region can therefore be mapped to Unicode, and Java takes advantage of this to convert between different languages.
In the JDK, the encodings related to Chinese are:
Table 1 List of encodings related to Chinese in JDK
Encoding name | Description |
ASCII | 7-bit, same as ascii7 |
ISO8859-1 | 8-bit, same as 8859_1, ISO-8859-1, ISO_8859-1, latin1, etc. |
GB2312-80 | 16-bit, same as gb2312, gb2312-1980, EUC_CN, euccn, 1381, Cp1381, 1383, Cp1383, ISO2022CN, ISO2022CN_GB, etc. |
GBK | Same as MS936. Note: case-sensitive |
UTF8 | Same as UTF-8 |
GB18030 | Same as cp1392 and 1392. Currently, few JDKs support it |
In actual programming, the encodings I deal with most are GB2312 (GBK) and ISO8859-1.
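Which of these names a given JDK accepts can be checked at run time. The sketch below uses java.nio.charset.Charset, which is newer than the JDKs this article was written against; the class and method names are hypothetical, and the set of aliases a particular JDK recognizes may differ from Table 1.

```java
import java.nio.charset.Charset;

final class CharsetAliasSketch {
    /** Resolves an alias to the canonical charset name used by the running JDK. */
    static String canonicalName(String alias) {
        return Charset.forName(alias).name();
    }

    public static void main(String[] args) {
        for (String alias : new String[] {"latin1", "8859_1", "UTF8"}) {
            System.out.println(alias + " -> " + canonicalName(alias));
        }
        // isSupported avoids an exception when an encoding is missing.
        System.out.println("GB18030 supported: " + Charset.isSupported("GB18030"));
    }
}
```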
Why is there a "?" sign?
As mentioned above, conversion between different languages is completed through Unicode. Suppose there are two different languages A and B. The conversion steps are: first convert A to Unicode, and then convert Unicode to B.
For example, the Chinese character "李" has the GB2312 code "C0EE" and needs to be converted to ISO8859-1. The steps are: first convert "李" to Unicode, giving "674E", and then convert "674E" to an ISO8859-1 character. Of course, this mapping fails, because ISO8859-1 has no character corresponding to "674E".
The problem arises when the mapping fails. When converting from some language to Unicode, if the character does not exist in that language, the result is the Unicode code "\ufffd" ("\u" denotes a Unicode code). When converting from Unicode to some language, if that language has no corresponding character, the result is "0x3f" ("?"). This is where the "?" comes from.
For example, applying new String(buf, "gb2312") to the byte stream buf = "0x80 0x40 0xb0 0xa1" gives "\ufffd\u554a", and printing it with println gives "?啊", because "0x80 0x40" is a character in GBK but not in GB2312.
As another example, applying new String(buf.getBytes("GBK")) to the string buf = "\u00d6\u00ec\u00e9\u0046\u00bb\u00f9" gives "3fa8aca8a6463fa8b4". Here "\u00d6" has no corresponding character in GBK, so it becomes "3f"; "\u00ec" maps to "a8ac"; "\u00e9" maps to "a8a6"; "\u0046" maps to "46" (because it is an ASCII character); "\u00bb" is not found and becomes "3f"; and finally "\u00f9" maps to "a8b4". Printing this string gives "?ìéF?ù". As you can see, the result is not all question marks, because the mapping between GBK and Unicode covers other characters besides Chinese characters. This example is the best proof.
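The two failure modes can be observed directly. This is a small sketch with hypothetical class and helper names; it assumes the running JDK ships the GB2312 and GBK charsets, and the exact string produced when decoding invalid bytes may vary between JDK versions, so only the presence of "\ufffd" is checked.

```java
import java.nio.charset.Charset;

final class QuestionMarkSketch {
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Unicode -> charset: unmappable characters become 0x3f ("?"). */
    static String encodeHex(String text, String charsetName) {
        return hex(text.getBytes(Charset.forName(charsetName)));
    }

    /** charset -> Unicode: invalid byte sequences become \ufffd. */
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        System.out.println(encodeHex("\u674e", "GB2312"));     // c0ee
        System.out.println(encodeHex("\u674e", "ISO-8859-1")); // 3f
        System.out.println(encodeHex("\u00e9", "GBK"));        // a8a6
        byte[] buf = {(byte) 0x80, 0x40, (byte) 0xb0, (byte) 0xa1};
        System.out.println(decode(buf, "GB2312").indexOf('\ufffd') >= 0);
    }
}
```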
Therefore, when transcoding Chinese characters goes wrong, you will not necessarily get question marks. But a mistake is still a mistake: being slightly wrong is no different in kind from being badly wrong.
You may also ask: what happens if a character exists in the source character set but not in Unicode? The answer is that I don't know, because I have no such source character set at hand to test with. One thing is certain, though: such a source character set is not well standardized. In Java, if this happens, an exception will be thrown.
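The String constructor and getBytes substitute "\ufffd" or "?" silently, as shown earlier. When an exception is wanted instead, one option is the stricter java.nio.charset encoder API, which postdates the JDKs this article targets. The sketch below illustrates it in the Unicode-to-ISO8859-1 direction only, since no source character set outside Unicode is at hand; the class and method names are hypothetical.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictEncodeSketch {
    /** Returns true only if every character of text exists in ISO-8859-1. */
    static boolean encodesCleanly(String text) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            encoder.encode(CharBuffer.wrap(text));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown instead of silently writing 0x3f.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(encodesCleanly("abc"));    // true
        System.out.println(encodesCleanly("\u674e")); // false
    }
}
```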
What is UTF
UTF is the abbreviation of Unicode Transformation Format. For 16-bit Unicode characters, it is defined as follows:
(1) If the first 9 bits of a 16-bit Unicode character are 0, it is represented by one byte. The first bit of this byte is "0", and the remaining 7 bits are the same as the last 7 bits of the original character. For example, "\u0034" (0000 0000 0011 0100) is represented by "34" (0011 0100), the same value as the source Unicode character;
(2) If the first 5 bits of a 16-bit Unicode character are 0, it is represented by 2 bytes. The first byte starts with "110", and its remaining 5 bits are the highest 5 bits of the source character after the leading 5 zeros are dropped; the second byte starts with "10", and its remaining 6 bits are the lowest 6 bits of the source character. For example, "\u025d" (0000 0010 0101 1101) is converted to "c99d" (1100 1001 1001 1101);
(3) If neither of the above rules applies, the character is represented by three bytes. The first byte starts with "1110", and its last four bits are the highest four bits of the source character; the second byte starts with "10", and its last six bits are the middle six bits of the source character; the third byte starts with "10", and its last six bits are the lowest six bits of the source character. For example, "\u9da7" (1001 1101 1010 0111) is converted to "e9b6a7" (1110 1001 1011 0110 1010 0111).
The relationship between Unicode and UTF in Java programs can be described as follows, although it is not absolute: while a string is in memory at run time, it is held as Unicode; when it is saved to a file or other medium, UTF is used. This conversion is performed by writeUTF and readUTF.
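The three rules can be written out as bit operations and compared with what writeUTF produces. This is an illustrative sketch with hypothetical class and method names. It follows the rules exactly as stated above; note that DataOutputStream.writeUTF also prefixes a two-byte length and encodes "\u0000" as two bytes, which the rules above do not cover.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class UtfRulesSketch {
    /** Encodes one 16-bit character by rules (1) to (3). */
    static byte[] encode(char c) {
        if (c <= 0x007f) {                       // rule (1): 0xxxxxxx
            return new byte[] {(byte) c};
        } else if (c <= 0x07ff) {                // rule (2): 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xc0 | (c >> 6)),
                (byte) (0x80 | (c & 0x3f))
            };
        } else {                                 // rule (3): 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xe0 | (c >> 12)),
                (byte) (0x80 | ((c >> 6) & 0x3f)),
                (byte) (0x80 | (c & 0x3f))
            };
        }
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Hex dump of what writeUTF emits: a 2-byte length followed by the UTF bytes. */
    static String writeUtfHex(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hex(buffer.toByteArray());
    }

    public static void main(String[] args) {
        System.out.println(hex(encode('\u0034')));  // 34
        System.out.println(hex(encode('\u025d')));  // c99d
        System.out.println(hex(encode('\u9da7')));  // e9b6a7
        System.out.println(writeUtfHex("\u9da7"));  // 0003e9b6a7
    }
}
```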
Okay, the basic discussion is almost done, let’s get to the point.
First think of the problem as a black box. Let’s first look at the first-level representation of the black box:
input(charsetA)->process(Unicode)->output(charsetB)
This is simple: it is an IPO model, that is, input, processing and output. The same content is converted from charsetA to Unicode and then to charsetB.
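In code, the first-level black box amounts to one decode followed by one encode. A minimal sketch, assuming the whole input fits in memory; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;

final class TranscodeSketch {
    /** input(charsetA) -> process(Unicode) -> output(charsetB) */
    static byte[] transcode(byte[] input, String charsetA, String charsetB) {
        String unicode = new String(input, Charset.forName(charsetA)); // charsetA -> Unicode
        return unicode.getBytes(Charset.forName(charsetB));            // Unicode -> charsetB
    }

    public static void main(String[] args) {
        // "李" is C0 EE in GB2312 and E6 9D 8E in UTF-8.
        byte[] out = transcode(new byte[] {(byte) 0xc0, (byte) 0xee}, "GB2312", "UTF-8");
        for (byte b : out) {
            System.out.printf("%02x ", b & 0xff);
        }
        System.out.println();
    }
}
```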
Let’s look at the secondary representation:
SourceFile(jsp,java)->class->output.
In this figure, it can be seen that the input is jsp and java source files. During the processing, the Class file is used as the carrier and then output. Then refine it to the third level:
jsp->temp file->class->browser,os console,db
app,servlet->class->browser,os console,db
This diagram is clearer. A JSP file is first translated into an intermediate Java file and then compiled into a Class. Servlets and ordinary applications are compiled directly into Class files. The Class then writes its output to the browser, console, database and so on.
JSP: The process from source file to Class
The source file of Jsp is a text file ending with ".jsp". In this section, the interpretation and compilation process of JSP files will be explained, and the Chinese changes will be tracked.
1. The JSP conversion tool (jspc) provided by the JSP/Servlet engine searches for the charset specified in <%@ page contentType ="text/html; charset=<Jsp-charset>"%> in the JSP file. If <Jsp-charset> is not specified in the JSP file, the default setting file.encoding in the JVM is used. Under normal circumstances, this value is ISO8859-1;
2. jspc uses the equivalent of the "javac -encoding <Jsp-charset>" command to interpret all characters appearing in the JSP file, including Chinese characters and ASCII characters, converts these characters into Unicode, then converts them into UTF format and saves them as a JAVA file. Converting an ASCII character into Unicode simply adds "00" in front: "A" becomes "\u0041" (no reason is needed; this is how the Unicode code table is laid out). After conversion to UTF, it changes back to "41". This is why you can view the JAVA files generated from JSP in an ordinary text editor;
3. The engine uses a command equivalent to "javac -encoding UNICODE" to compile the JAVA file into a CLASS file.
First, let's look at how Chinese characters are converted in these steps. Take the following source code:
<%@ page contentType="text/html; charset=gb2312"%>
<html><body>
<%
String a="中文";
out.println(a);
%>
</body></html>
This code was written in UltraEdit for Windows. After saving, the hexadecimal encoding of the two characters "中文" is "D6 D0 CE C4" (GB2312 encoding). Looking up the table, the Unicode encoding of "中文" is "\u4E2D\u6587", which in UTF is "E4 B8 AD E6 96 87". Opening the JAVA file that the engine generated from the JSP file shows that "中文" has indeed been replaced by "E4 B8 AD E6 96 87". Checking the CLASS file compiled from that JAVA file shows exactly the same result as in the JAVA file.
Let’s look at the situation where the CharSet specified in JSP is ISO-8859-1.
<%@ page contentType="text/html; charset=ISO-8859-1"%>
<html><body>
<%
String a="中文";
out.println(a);
%>
</body></html>
Similarly, this file was written in UltraEdit, and the two characters "中文" are again stored as the GB2312 encoding "D6 D0 CE C4". First simulate the generation of the JAVA and CLASS files: jspc interprets "中文" as ISO-8859-1 and maps it to Unicode. Since ISO-8859-1 is an 8-bit Latin-script encoding, its mapping rule is to add "00" before each byte, so the mapped Unicode encoding should be "\u00D6\u00D0\u00CE\u00C4", which after conversion to UTF should be "C3 96 C3 90 C3 8E C3 84". Now open the files and look: in both the JAVA file and the CLASS file, "中文" is indeed represented as "C3 96 C3 90 C3 8E C3 84".
If <Jsp-charset> is not specified in the code above, that is, the first line is written as "<%@ page contentType="text/html" %>", jspc uses the file.encoding setting to interpret the JSP file. On RedHat 6.2, the result is exactly the same as specifying ISO-8859-1.
So far, the mapping process of Chinese characters in the conversion process from JSP files to CLASS files has been explained. In a word: From "JspCharSet to Unicode to UTF". The following table summarizes this process:
Table 2 "中文" conversion process from JSP to CLASS
Jsp-CharSet | In JSP file | In JAVA file | In CLASS file |
GB2312 | D6 D0 CE C4 (GB2312) | from \u4E2D\u6587 (Unicode) to E4 B8 AD E6 96 87 (UTF) | E4 B8 AD E6 96 87 (UTF) |
ISO-8859-1 | D6 D0 CE C4 (GB2312) | from \u00D6\u00D0\u00CE\u00C4 (Unicode) to C3 96 C3 90 C3 8E C3 84 (UTF) | C3 96 C3 90 C3 8E C3 84 (UTF) |
None (default = file.encoding) | Same as ISO-8859-1 | Same as ISO-8859-1 | Same as ISO-8859-1 |
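The rows of Table 2 can be reproduced without a JSP engine by decoding the saved bytes with the chosen charset and re-encoding them as UTF. This is a simulation under stated assumptions, not what jspc actually runs: the class and method names are hypothetical, and standard UTF-8 is used in place of the class-file UTF format, which yields the same bytes for these characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class JspToClassSketch {
    /** The bytes UltraEdit saved for "中文" in GB2312. */
    static final byte[] SOURCE = {(byte) 0xd6, (byte) 0xd0, (byte) 0xce, (byte) 0xc4};

    /** Jsp-charset -> Unicode -> UTF, returned as a hex string. */
    static String classFileHex(byte[] sourceBytes, String jspCharset) {
        String unicode = new String(sourceBytes, Charset.forName(jspCharset));
        StringBuilder sb = new StringBuilder();
        for (byte b : unicode.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(classFileHex(SOURCE, "GB2312"));     // e4b8ade69687
        System.out.println(classFileHex(SOURCE, "ISO-8859-1")); // c396c390c38ec384
    }
}
```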
Servlet: The process from source file to Class
The Servlet source file is a text file ending with ".java". This section discusses the Servlet compilation process and tracks how the Chinese characters change.
Use "javac" to compile the Servlet source file. javac accepts the "-encoding <Compile-charset>" parameter, which means "use the encoding specified in <Compile-charset> to interpret the Servlet source file."
When the source file is compiled, <Compile-charset> is used to interpret all characters, including Chinese characters and ASCII characters. The character constants are then converted into Unicode characters, and finally Unicode is converted into UTF.
In Servlet, there is another place to set the CharSet of the output stream. Usually before outputting the result, the setContentType method of HttpServletResponse is called to achieve the same effect as setting <Jsp-charset> in JSP, which is called <Servlet-charset>.
Note that a total of three variables are mentioned in the article: <Jsp-charset>, <Compile-charset> and <Servlet-charset>. Among them, JSP files are only related to <Jsp-charset>, while <Compile-charset> and <Servlet-charset> are only related to Servlet.
Look at the following example:
import javax.servlet.*;
import javax.servlet.http.*;

// Declared public so that the servlet container can instantiate it.
public class testServlet extends HttpServlet
{
    public void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, java.io.IOException
    {
        // <Servlet-charset>: set before getWriter() so the writer encodes output as GB2312.
        resp.setContentType("text/html; charset=GB2312");
        java.io.PrintWriter out = resp.getWriter();
        out.println("<html>");
        out.println("#中文#");
        out.println("</html>");
    }
}
This file was also written in UltraEdit for Windows, and the two characters "中文" are saved as "D6 D0 CE C4" (GB2312 encoding).
Now compile. The following table shows the hexadecimal code of "中文" in the CLASS file for different values of <Compile-charset>. During compilation, <Servlet-charset> has no effect; it only affects the output of the CLASS file. In fact, <Servlet-charset> and <Compile-charset> together achieve the same effect as <Jsp-charset> in a JSP file, because <Jsp-charset> affects both compilation and the output of the CLASS file.
Table 3 The transformation process of "中文" from Servlet source file to Class
Compile-charset | In the Servlet source file | In the Class file | Equivalent Unicode code |
GB2312 | D6 D0 CE C4 (GB2312) | E4 B8 AD E6 96 87 (UTF) | \u4E2D\u6587 (in Unicode = "中文") |
ISO-8859-1 | D6 D0 CE C4 (GB2312) | C3 96 C3 90 C3 8E C3 84 (UTF) | \u00D6 \u00D0 \u00CE \u00C4 (a 00 is added in front of each of D6 D0 CE C4) |
None (default) | D6 D0 CE C4 (GB2312) | Same as ISO-8859-1 | Same as ISO-8859-1 |
Table 4 Output process of "中文" in JSP when Jsp-charset = GB2312
No. | Step description | Result |
1 | Write the JSP source file and save it in GB2312 format | D6 D0 CE C4 (D6D0=中 CEC4=文) |
2 | jspc converts the JSP source file into a temporary JAVA file, maps the string to Unicode according to GB2312, and writes it into the JAVA file in UTF format | E4 B8 AD E6 96 87 |
3 | Compile the temporary JAVA file into a CLASS file | E4 B8 AD E6 96 87 |
4 | At run time, first read the string from the CLASS file using readUTF; its Unicode encoding in memory is | 4E 2D 65 87 (in Unicode, 4E2D=中 6587=文) |
5 | Convert Unicode to a byte stream according to Jsp-charset=GB2312 | D6 D0 CE C4 |
6 | Output the byte stream to IE, and set the encoding of IE to GB2312 (Author's note: this information is hidden in the HTTP header) | D6 D0 CE C4 |
7 | IE views the result using "Simplified Chinese" | "中文" (correct display) |
Table 5 Output process of "中文" in JSP when Jsp-charset = ISO8859-1
No. | Step description | Result |
1 | Write the JSP source file and save it in GB2312 format | D6 D0 CE C4 (D6D0=中 CEC4=文) |
2 | jspc converts the JSP source file into a temporary JAVA file, maps the string to Unicode according to ISO8859-1, and writes it into the JAVA file in UTF format | C3 96 C3 90 C3 8E C3 84 |
3 | Compile the temporary JAVA file into a CLASS file | C3 96 C3 90 C3 8E C3 84 |
4 | At run time, first read the string from the CLASS file using readUTF; its Unicode encoding in memory is | 00 D6 00 D0 00 CE 00 C4 (means nothing!!!) |
5 | Convert Unicode to a byte stream according to Jsp-charset=ISO8859-1 | D6 D0 CE C4 |
6 | Output the byte stream to IE, and set the encoding of IE to ISO8859-1 (Author's note: this information is hidden in the HTTP header) | D6 D0 CE C4 |
7 | IE views the result using "Western European characters" | Garbled characters. These are actually four single-byte characters, but because their values are greater than 128 they display strangely |
8 | Change the page encoding of IE to "Simplified Chinese" | "中文" (correct display) |
Table 6 Output process of "中文" in JSP when Jsp-charset is not specified (the default, ISO8859-1, applies)
No. | Step description | Result |
1 | Write the JSP source file and save it in GB2312 format | D6 D0 CE C4 (D6D0=中 CEC4=文) |
2 | jspc converts the JSP source file into a temporary JAVA file, maps the string to Unicode according to ISO8859-1, and writes it into the JAVA file in UTF format | C3 96 C3 90 C3 8E C3 84 |
3 | Compile the temporary JAVA file into a CLASS file | C3 96 C3 90 C3 8E C3 84 |
4 | At run time, first read the string from the CLASS file using readUTF; its Unicode encoding in memory is | 00 D6 00 D0 00 CE 00 C4 |
5 | Convert Unicode to a byte stream according to Jsp-charset=ISO8859-1 | D6 D0 CE C4 |
6 | Output the byte stream to IE | D6 D0 CE C4 |
7 | IE views the result using the encoding the page had when the request was made | Depends on the situation. If it is Simplified Chinese, the result displays correctly; otherwise step 8 in Table 5 needs to be performed |
Table 7 Output process of "中文" in a Servlet when Compile-charset = Servlet-charset = GB2312
No. | Step description | Result |
1 | Write the Servlet source file and save it in GB2312 format | D6 D0 CE C4 (D6D0=中 CEC4=文) |
2 | Use javac -encoding GB2312 to compile the JAVA source file into a CLASS file | E4 B8 AD E6 96 87 (UTF) |
3 | At run time, first read the string from the CLASS file using readUTF; it is stored in memory as Unicode | 4E 2D 65 87 (Unicode) |
4 | Convert Unicode to a byte stream according to Servlet-charset=GB2312 | D6 D0 CE C4 (GB2312) |
5 | Output the byte stream to IE and set the encoding attribute of IE to Servlet-charset=GB2312 | D6 D0 CE C4 (GB2312) |
6 | IE views the result using "Simplified Chinese" | "中文" (correctly displayed) |
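Steps 3 to 5 of the Servlet tables can be simulated outside a container by wrapping a byte buffer in a writer that uses <Servlet-charset>. This is a rough stand-in for what resp.getWriter() does after setContentType, not the container's actual implementation; the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;

final class ServletOutputSketch {
    /** Converts the in-memory Unicode string to the bytes sent to the browser. */
    static byte[] render(String text, String servletCharset) {
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        try (PrintWriter out = new PrintWriter(
                new OutputStreamWriter(wire, Charset.forName(servletCharset)))) {
            out.print(text);
        } // closing the writer flushes the encoded bytes into the buffer
        return wire.toByteArray();
    }

    public static void main(String[] args) {
        // Compiled with GB2312: memory holds 4E2D 6587, output charset GB2312.
        for (byte b : render("\u4e2d\u6587", "GB2312")) {
            System.out.printf("%02x ", b & 0xff);
        }
        System.out.println();
        // Compiled with ISO8859-1: memory holds 00D6 00D0 00CE 00C4, output charset ISO8859-1.
        for (byte b : render("\u00d6\u00d0\u00ce\u00c4", "ISO-8859-1")) {
            System.out.printf("%02x ", b & 0xff);
        }
        System.out.println();
    }
}
```

Both calls put the same four bytes on the wire, which is why the ISO8859-1 page displays correctly once the browser is switched to "Simplified Chinese".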
Table 8 Output process of "中文" in a Servlet when Compile-charset = Servlet-charset = ISO8859-1
No. | Step description | Result |
1 | Write the Servlet source file and save it in GB2312 format | D6 D0 CE C4 (D6D0=中 CEC4=文) |
2 | Use javac -encoding ISO8859-1 to compile the JAVA source file into a CLASS file | C3 96 C3 90 C3 8E C3 84 (UTF) |
3 | At run time, first read the string from the CLASS file using readUTF; its Unicode encoding in memory is | 00 D6 00 D0 00 CE 00 C4 |
4 | Convert Unicode to a byte stream according to Servlet-charset=ISO8859-1 | D6 D0 CE C4 |
5 | Output the byte stream to IE and set the encoding attribute of IE to Servlet-charset=ISO8859-1 | D6 D0 CE C4 (GB2312) |
6 | IE views the result using "Western European characters" | Garbled characters (for the same reason as in Table 5) |
7 | Change the page encoding of IE to "Simplified Chinese" | "中文" (correctly displayed) |
Table 9 Conversion process of "中文" when written to and read from a database whose internal code is ISO8859-1
No. | Step description | Result | Domain |
1 | Enter "中文" in IE | D6 D0 CE C4 | IE |
2 | IE converts the string into UTF and sends it to the transport stream | E4 B8 AD E6 96 87 | |
3 | The Servlet receives the input stream and reads it with readUTF | 4E 2D 65 87 (Unicode) | Servlet |
4 | In the Servlet, the programmer must restore the string to a byte stream according to GB2312 | D6 D0 CE C4 | |
5 | The programmer generates a new string according to the database internal code ISO8859-1 | 00 D6 00 D0 00 CE 00 C4 | |
6 | Submit the newly generated string to JDBC | 00 D6 00 D0 00 CE 00 C4 | |
7 | JDBC detects that the internal code of the database is ISO8859-1 | 00 D6 00 D0 00 CE 00 C4 | JDBC |
8 | JDBC converts the received string into a byte stream according to ISO8859-1 | D6 D0 CE C4 | |
9 | JDBC writes the byte stream into the database | D6 D0 CE C4 | |
10 | The data storage work is complete | D6 D0 CE C4 | Database |
The following is the process of retrieving data from the database | | | |
11 | JDBC retrieves the byte stream from the database | D6 D0 CE C4 | JDBC |
12 | JDBC generates a string according to the database character set ISO8859-1 and submits it to the Servlet | 00 D6 00 D0 00 CE 00 C4 (Unicode) | |
13 | The Servlet obtains the string | 00 D6 00 D0 00 CE 00 C4 (Unicode) | Servlet |
14 | The programmer must restore the original byte stream according to the database internal code ISO8859-1 | D6 D0 CE C4 | |
15 | The programmer must generate a new string according to the client character set GB2312 | 4E 2D 65 87 (Unicode) | |
The Servlet prepares to output the string to the client | | | |
16 | The Servlet generates the byte stream based on <Servlet-charset> | D6 D0 CE C4 | Servlet |
17 | The Servlet outputs the byte stream to IE. If <Servlet-charset> has been specified, the encoding of IE is also set to <Servlet-charset> | D6 D0 CE C4 | |
18 | IE views the result according to the specified encoding or the default encoding | "中文" (correctly displayed) | IE |
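The two programmer-side conversions in the table (steps 4 and 5 on the way in, steps 14 and 15 on the way out) are each a getBytes followed by a new String. A minimal sketch, assuming the client uses GB2312 and the database internal code is ISO8859-1 as in the table; the class and method names are hypothetical and no real JDBC connection is involved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DbRoundTripSketch {
    static final Charset CLIENT = Charset.forName("GB2312");
    static final Charset DATABASE = StandardCharsets.ISO_8859_1;

    /** Steps 4 and 5: 4E2D 6587 -> D6 D0 CE C4 -> 00D6 00D0 00CE 00C4. */
    static String toDatabaseForm(String clientText) {
        return new String(clientText.getBytes(CLIENT), DATABASE);
    }

    /** Steps 14 and 15: 00D6 00D0 00CE 00C4 -> D6 D0 CE C4 -> 4E2D 6587. */
    static String fromDatabaseForm(String databaseText) {
        return new String(databaseText.getBytes(DATABASE), CLIENT);
    }

    public static void main(String[] args) {
        String stored = toDatabaseForm("\u4e2d\u6587");
        System.out.println(stored.equals("\u00d6\u00d0\u00ce\u00c4"));            // true
        System.out.println(fromDatabaseForm(stored).equals("\u4e2d\u6587"));      // true
    }
}
```

The round trip is lossless here because ISO8859-1 maps every byte value to exactly one character, so the original GB2312 bytes survive inside the intermediate string.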