In-depth analysis of JSP and Servlet processing of Chinese

Author：Eve Cole Update Time：2009-07-02 17:21:56

Every region in the world has its own local language. Regional differences directly lead to differences in language environment. In the process of developing an international program, it is important to deal with language issues.

This is a problem that exists worldwide, so Java provides a worldwide solution. The method described in this article is for processing Chinese, but by extension, it is also applicable to processing languages from other countries and regions in the world.

Chinese characters are double-byte. Double-byte means that one character occupies two bytes (that is, 16 bits), called the high byte and the low byte respectively. The Chinese character encoding specified in China is GB2312, which is mandatory. Currently, almost all applications that can process Chinese support GB2312. GB2312 includes the level-1 and level-2 Chinese characters and 9 zones of symbols. The high byte ranges from 0xa1 to 0xfe, and the low byte also ranges from 0xa1 to 0xfe. Within this space, the encoding range of Chinese characters is 0xb0a1 to 0xf7fe.
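The byte ranges above can be checked mechanically. The following is a minimal sketch; the class and method names are hypothetical, and it only tests the numeric ranges quoted in the text, not whether a given code position is actually assigned.

```java
// Hypothetical helper for illustration; not part of any JDK API.
final class Gb2312Range {
    /** True if both bytes lie in 0xA1..0xFE, the GB2312 double-byte range. */
    static boolean isGb2312Pair(int high, int low) {
        return high >= 0xA1 && high <= 0xFE && low >= 0xA1 && low <= 0xFE;
    }

    /** True if the pair lies in the Chinese-character area 0xB0A1..0xF7FE. */
    static boolean isGb2312Hanzi(int high, int low) {
        return high >= 0xB0 && high <= 0xF7 && low >= 0xA1 && low <= 0xFE;
    }

    public static void main(String[] args) {
        System.out.println(isGb2312Hanzi(0xB0, 0xA1)); // first position of the Chinese-character area
        System.out.println(isGb2312Hanzi(0xA1, 0xA1)); // symbol area, not a Chinese character
    }
}
```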

There is another encoding called GBK, but it is only a specification, not a mandatory standard. GBK provides 20902 Chinese characters, is compatible with GB2312, and has an encoding range of 0x8140 to 0xfefe. Every character in GBK maps one-to-one to Unicode 2.0.

In the near future, China will promulgate another standard: GB18030-2000 (GBK2K). It includes the characters of Tibetan, Mongolian and other ethnic minority scripts, fundamentally solving the problem of insufficient code positions. Note: it is no longer fixed-length. The two-byte part is compatible with GBK, and the four-byte part holds the extended characters. In the four-byte form, the first and third bytes range from 0x81 to 0xfe, and the second and fourth bytes range from 0x30 to 0x39.
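As a small illustration of the four-byte layout, the sketch below tests whether four bytes fall inside the ranges quoted above. The names are hypothetical, and the check says nothing about whether the position is actually assigned in GB18030.

```java
// Hypothetical helper for illustration; it checks byte ranges only.
final class Gb18030Shape {
    static boolean inRange(int b, int lo, int hi) {
        return b >= lo && b <= hi;
    }

    /** True if the bytes match the four-byte pattern: 81..FE, 30..39, 81..FE, 30..39. */
    static boolean looksLikeFourByte(int b1, int b2, int b3, int b4) {
        return inRange(b1, 0x81, 0xFE) && inRange(b2, 0x30, 0x39)
            && inRange(b3, 0x81, 0xFE) && inRange(b4, 0x30, 0x39);
    }

    public static void main(String[] args) {
        System.out.println(looksLikeFourByte(0x81, 0x30, 0x81, 0x30)); // fits the pattern
        System.out.println(looksLikeFourByte(0x81, 0x40, 0x81, 0x30)); // second byte out of range
    }
}
```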

This article does not intend to introduce Unicode. Those who are interested can browse "http://www.unicode.org/" to view more information. Unicode has a characteristic: it includes all character glyphs in the world. Therefore, languages in various regions can establish mapping relationships with Unicode, and Java takes advantage of this to achieve conversion between heterogeneous languages.

In JDK, the encodings related to Chinese are:

Table 1 List of encodings related to Chinese in JDK

Encoding name	Description
ASCII	7-bit, same as ascii7
ISO8859-1	8-bit, same as 8859_1, ISO-8859-1, ISO_8859-1, latin1, etc.
GB2312-80	16-bit, same as gb2312, gb2312-1980, EUC_CN, euccn, 1381, Cp1381, 1383, Cp1383, ISO2022CN, ISO2022CN_GB, etc.
GBK	Same as MS936. Note: case-sensitive
UTF8	Same as UTF-8
GB18030	Same as cp1392 and 1392. Currently supported by few JDKs

In actual programming, the encodings I come into contact with most are GB2312 (GBK) and ISO8859-1.

Why is there a "?" sign?

As mentioned above, conversion between different languages is completed through Unicode. Suppose there are two different languages A and B. The conversion steps are: first convert A to Unicode, and then convert Unicode to B.

For example, the Chinese character "李" has the GB2312 code "C0EE" and needs to be converted to ISO8859-1. The steps are: first convert "李" to Unicode to get "674E", and then convert "674E" to an ISO8859-1 character. Of course, this mapping will not succeed, because ISO8859-1 has no character corresponding to "674E".

The problem arises when the mapping fails. When converting from a certain language to Unicode, if the bytes do not correspond to any character in that language, the result is the Unicode code "ufffd" ("u" denotes a Unicode code). When converting from Unicode to a certain language, if that language has no corresponding character, the result is "0x3f" ("?"). This is where the "?" comes from.
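Both failure directions can be reproduced in a few lines. This is only a sketch: the class and method names are made up for illustration, and it relies on String.getBytes(Charset) and new String(byte[], Charset) substituting a replacement for anything they cannot map.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
final class QuestionMarkDemo {
    /** Unicode -> target charset: unmappable characters become the charset's replacement byte. */
    static byte[] encode(String s, Charset target) {
        return s.getBytes(target);
    }

    /** Source charset -> Unicode: undecodable bytes become U+FFFD. */
    static String decode(byte[] bytes, Charset source) {
        return new String(bytes, source);
    }

    public static void main(String[] args) {
        String li = "\u674e"; // the character with Unicode code 674E
        byte[] latin1 = encode(li, StandardCharsets.ISO_8859_1);
        System.out.printf("%02x%n", latin1[0]); // 3f, the question mark

        String bad = decode(new byte[] {(byte) 0x80}, StandardCharsets.US_ASCII);
        System.out.println(Integer.toHexString(bad.charAt(0))); // fffd
    }
}
```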

For example, performing new String(buf, "gb2312") on the byte stream buf = "0x80 0x40 0xb0 0xa1" gives "ufffdu554a"; printing it with println gives "?啊", because "0x80 0x40" is not a character in GB2312, while "0xb0 0xa1" is the GB2312 code of "啊".

As another example, perform new String(buf.getBytes("GBK")) on the string buf = "u00d6u00ecu00e9u0046u00bbu00f9". The intermediate bytes are "3fa8aca8a6463fa8b4". Here "u00d6" has no corresponding character in GBK, so it becomes "3f"; "u00ec" maps to "a8ac"; "u00e9" maps to "a8a6"; "u0046" maps to "46" (because it is an ASCII character); "u00bb" is not found either and becomes "3f"; finally, "u00f9" maps to "a8b4". Printing the resulting string with println gives "?ìéF?ù". Notice that the result is not all question marks, because the mapping between GBK and Unicode covers other characters in addition to Chinese characters. This example is the best proof.
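The GBK example can be replayed as follows. This sketch assumes the running JDK ships the "GBK" charset (it is part of the extended charsets in standard JDK builds); the class and helper names are hypothetical.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; assumes the "GBK" charset is available in this JDK.
final class GbkPartialMapping {
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes the string with the named charset and returns the bytes as lowercase hex. */
    static String encodeToHex(String s, String charsetName) {
        return toHex(s.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        String source = "\u00d6\u00ec\u00e9\u0046\u00bb\u00f9";
        // The article reports 3fa8aca8a6463fa8b4 for this input.
        System.out.println(encodeToHex(source, "GBK"));
    }
}
```

Decoding those bytes again with GBK yields a string in which only the two unmappable characters have turned into question marks.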

Therefore, when Chinese characters are transcoded incorrectly, you will not necessarily get question marks. A mistake is still a mistake, though: partial corruption is no better than complete corruption.

You may also ask: what happens if a character exists in the source character set but not in Unicode? The answer is that I don't know, because I have no such source character set at hand to test with. One thing is certain, though: such a source character set is not standardized enough. In Java, if this happens, an exception will be thrown.

What is UTF

UTF is the abbreviation of Unicode Transformation Format. The 8-bit form discussed here (UTF-8) is defined as follows:

(1) If the first 9 bits of a 16-bit Unicode character are 0, it is represented by one byte. The first bit of this byte is "0", and the remaining 7 bits are the same as the last 7 bits of the original character. For example, "u0034" (0000 0000 0011 0100) is represented by "34" (0011 0100), the same as the source Unicode character;

(2) If the first 5 bits of a 16-bit Unicode character are 0, it is represented by 2 bytes. The first byte starts with "110", and its remaining 5 bits are the highest 5 bits of the source character after the first 5 zeros are dropped; the second byte starts with "10", and its remaining 6 bits are the lowest 6 bits of the source character. For example, "u025d" (0000 0010 0101 1101) is converted to "c99d" (1100 1001 1001 1101);

(3) If neither of the above rules applies, it is represented by three bytes. The first byte starts with "1110", and its last four bits are the highest four bits of the source character; the second byte starts with "10", and its last six bits are the middle six bits of the source character; the third byte starts with "10", and its last six bits are the lowest six bits of the source character. For example, "u9da7" (1001 1101 1010 0111) is converted to "e9b6a7" (1110 1001 1011 0110 1010 0111).
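The three rules translate almost directly into bit operations. The following is a sketch written for this article rather than the JDK's own encoder; it handles a single 16-bit char and gives surrogates no special treatment.

```java
// Hypothetical hand-written encoder for one 16-bit character, following rules (1) to (3).
final class Utf8Rules {
    static byte[] encode(char c) {
        if (c <= 0x007F) {                          // rule 1: the top 9 bits are zero
            return new byte[] {(byte) c};
        } else if (c <= 0x07FF) {                   // rule 2: the top 5 bits are zero
            return new byte[] {
                (byte) (0xC0 | (c >> 6)),           // 110 + highest 5 significant bits
                (byte) (0x80 | (c & 0x3F))          // 10 + lowest 6 bits
            };
        } else {                                    // rule 3: everything else
            return new byte[] {
                (byte) (0xE0 | (c >> 12)),          // 1110 + highest 4 bits
                (byte) (0x80 | ((c >> 6) & 0x3F)),  // 10 + middle 6 bits
                (byte) (0x80 | (c & 0x3F))          // 10 + lowest 6 bits
            };
        }
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(encode('\u0034'))); // 34
        System.out.println(toHex(encode('\u025d'))); // c99d
        System.out.println(toHex(encode('\u9da7'))); // e9b6a7
    }
}
```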

The relationship between Unicode and UTF in Java programs can be described like this, although it is not absolute: while a string is in memory, it is represented as Unicode; when it is saved to a file or other medium, UTF is used. This conversion is performed by writeUTF and readUTF.
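To see writeUTF and readUTF at work, the sketch below writes a string into an in-memory buffer and reads it back. The class name and the use of a byte-array buffer are assumptions made for illustration. Note that writeUTF prefixes the data with a two-byte length, so a two-character Chinese string occupies 2 + 6 bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class; it uses an in-memory buffer instead of a real file.
final class WriteUtfDemo {
    /** Unicode string in memory -> UTF bytes on the medium. */
    static byte[] write(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** UTF bytes on the medium -> Unicode string in memory. */
    static String read(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] stored = write("\u4e2d\u6587");
        for (byte b : stored) {
            System.out.printf("%02x ", b & 0xFF); // 00 06 e4 b8 ad e6 96 87
        }
        System.out.println();
        System.out.println(read(stored).length()); // 2
    }
}
```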

Okay, the basic discussion is almost done, let’s get to the point.

First think of the problem as a black box. Let’s first look at the first-level representation of the black box:

input(charsetA)->process(Unicode)->output(charsetB)

This is simple. It is an IPO model, that is, input, processing and output. The same content is converted from charsetA to Unicode and then to charsetB.
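In code, the first-level black box is a decode followed by an encode. The sketch below is one possible rendering; the class name, method name and parameters are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper mirroring input(charsetA) -> process(Unicode) -> output(charsetB).
final class IpoTranscoder {
    static byte[] transcode(byte[] input, Charset charsetA, Charset charsetB) {
        String unicode = new String(input, charsetA); // input(charsetA) -> Unicode
        return unicode.getBytes(charsetB);            // Unicode -> output(charsetB)
    }

    public static void main(String[] args) {
        byte[] gb = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};
        byte[] utf8 = transcode(gb, Charset.forName("GB2312"), StandardCharsets.UTF_8);
        System.out.println(utf8.length); // 6
    }
}
```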

Let’s look at the second-level representation:

SourceFile(jsp,java)->class->output.

In this figure, the input is the jsp and java source files. During processing, the Class file serves as the carrier, and then the output is produced. Refining further to the third level:

jsp->temp file->class->browser,os console,db

app,servlet->class->browser,os console,db

This figure is clearer. The JSP file is first converted into an intermediate Java file, which is then compiled into a Class. Servlets and ordinary applications are compiled directly into a Class. The Class then outputs to the browser, console, database, and so on.

JSP: The process from source file to Class

The source file of Jsp is a text file ending with ".jsp". In this section, the interpretation and compilation process of JSP files will be explained, and the Chinese changes will be tracked.

1. The JSP conversion tool (jspc) provided by the JSP/Servlet engine searches the JSP file for the charset specified in <%@ page contentType="text/html; charset=<Jsp-charset>"%>. If <Jsp-charset> is not specified in the JSP file, the JVM's default file.encoding setting is used. Under normal circumstances, this value is ISO8859-1;

2. jspc uses the equivalent of the "javac -encoding <Jsp-charset>" command to interpret all characters in the JSP file, including Chinese characters and ASCII characters, converts them into Unicode characters, then converts them into UTF format and saves the result as a JAVA file. Converting an ASCII character into Unicode simply adds "00" in front; for example, "A" becomes "u0041" (no reason is needed, this is simply how the Unicode code table is laid out). After conversion to UTF, it turns back into "41"! This is why the JAVA file generated from a JSP can be viewed with an ordinary text editor;

3. The engine uses the equivalent of the "javac -encoding UNICODE" command to compile the JAVA file into a CLASS file.

First, let’s look at how Chinese characters are converted in these steps. Take the following source code:

<%@ page contentType="text/html; charset=gb2312"%>
<html><body>
<%
String a="中文";
out.println(a);
%>
</body></html>

This code was written in UltraEdit for Windows. After saving, the hexadecimal encoding of the two characters "中文" is "D6 D0 CE C4" (GB2312 encoding). According to the code table, the Unicode encoding of "中文" is "u4E2Du6587", which in UTF is "E4 B8 AD E6 96 87". Opening the JAVA file that the engine generated from the JSP file shows that "中文" has indeed been replaced by "E4 B8 AD E6 96 87". Checking the CLASS file compiled from the JAVA file shows exactly the same result as in the JAVA file.

Let’s look at the situation where the CharSet specified in JSP is ISO-8859-1.

<%@ page contentType="text/html; charset=ISO-8859-1"%>
<html><body>
<%
String a="中文";
out.println(a);
%>
</body></html>

Similarly, this file was written with UltraEdit, and the two characters "中文" are again stored as the GB2312 encoding "D6 D0 CE C4". First, simulate the generation of the JAVA and CLASS files: jspc interprets "中文" as ISO-8859-1 and maps it to Unicode. Since ISO-8859-1 is an 8-bit Latin encoding, its mapping rule is to add "00" before each byte, so the mapped Unicode encoding should be "u00D6u00D0u00CEu00C4", which after conversion to UTF should be "C3 96 C3 90 C3 8E C3 84". Now open the files and check: in both the JAVA file and the CLASS file, "中文" is indeed represented as "C3 96 C3 90 C3 8E C3 84".

If <Jsp-charset> is not specified in the above code, that is, if the first line is written as "<%@ page contentType="text/html" %>", jspc uses the file.encoding setting to interpret the JSP file. On RedHat 6.2, the result is exactly the same as when ISO-8859-1 is specified.

So far, the mapping of Chinese characters during the conversion from JSP file to CLASS file has been explained. In short: "Jsp-charset to Unicode to UTF". The following table summarizes this process:

Table 2 Conversion of "中文" from JSP to CLASS

Jsp-CharSet	In JSP file	In JAVA file	In CLASS file
GB2312	D6 D0 CE C4 (GB2312)	from u4E2Du6587 (Unicode) to E4 B8 AD E6 96 87 (UTF)	E4 B8 AD E6 96 87 (UTF)
ISO-8859-1	D6 D0 CE C4 (GB2312)	from u00D6u00D0u00CEu00C4 (Unicode) to C3 96 C3 90 C3 8E C3 84 (UTF)	C3 96 C3 90 C3 8E C3 84 (UTF)
None (default = file.encoding)	Same as ISO-8859-1	Same as ISO-8859-1	Same as ISO-8859-1
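The rows of Table 2 can be recomputed without a JSP engine. The sketch below only imitates the described behaviour (decode with the Jsp-charset, then store as UTF); it does not invoke jspc, and all names in it are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical simulation of the conversion described for jspc; no JSP engine is involved.
final class JspcSimulation {
    /** The bytes that the editor saved for the two Chinese characters. */
    static final byte[] SOURCE = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Interprets the source bytes with the given Jsp-charset. */
    static String toUnicode(byte[] jspBytes, String jspCharset) {
        return new String(jspBytes, Charset.forName(jspCharset));
    }

    /** Encodes the in-memory string as UTF-8, the form described for the JAVA and CLASS files. */
    static String toUtfHex(String unicode) {
        return hex(unicode.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        for (String charset : new String[] {"GB2312", "ISO-8859-1"}) {
            System.out.println(charset + ": " + toUtfHex(toUnicode(SOURCE, charset)));
        }
    }
}
```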

The next section first discusses the conversion of a Servlet from JAVA file to CLASS file, and then explains how the CLASS file outputs to the client. The reason for this arrangement is that JSP and Servlet handle output in the same way.

Servlet: The process from source file to Class.

The Servlet source file is a text file ending with ".java". This section will discuss the Servlet compilation process and track the Chinese changes.

Use "javac" to compile the Servlet source file. javac accepts the "-encoding <Compile-charset>" parameter, which means "use the encoding specified in <Compile-charset> to interpret the Servlet source file."

When the source file is compiled, use <Compile-charset> to interpret all characters, including Chinese characters and ASCII characters. Then convert the character constants into Unicode characters, and finally, convert Unicode into UTF.

In Servlet, there is another place to set the CharSet of the output stream. Usually before outputting the result, the setContentType method of HttpServletResponse is called to achieve the same effect as setting <Jsp-charset> in JSP, which is called <Servlet-charset>.

Note that a total of three variables are mentioned in the article: <Jsp-charset>, <Compile-charset> and <Servlet-charset>. Among them, JSP files are only related to <Jsp-charset>, while <Compile-charset> and <Servlet-charset> are only related to Servlet.

Look at the following example:

import javax.servlet.*;
import javax.servlet.http.*;

// Declared public so that the Servlet engine can instantiate the class.
public class testServlet extends HttpServlet
{
    public void doGet(HttpServletRequest req, HttpServletResponse resp)
        throws ServletException, java.io.IOException
    {
        // <Servlet-charset>: the charset used when the output is encoded
        resp.setContentType("text/html; charset=GB2312");
        java.io.PrintWriter out = resp.getWriter();
        out.println("<html>");
        out.println("#中文#");
        out.println("</html>");
    }
}

This file was also written with UltraEdit for Windows, and the two characters "中文" are saved as "D6 D0 CE C4" (GB2312 encoding).

Now compile. The following table shows the hexadecimal code of "中文" in the CLASS file for different values of <Compile-charset>. During compilation, <Servlet-charset> has no effect; it only affects the output of the CLASS file. In fact, <Servlet-charset> and <Compile-charset> together achieve the same effect as <Jsp-charset> in a JSP file, because <Jsp-charset> affects both the compilation and the output of the CLASS file.

Table 3 Conversion of "中文" from Servlet source file to Class

Compile-charset	In the Servlet source file	In the Class file	Equivalent Unicode code
GB2312	D6 D0 CE C4 (GB2312)	E4 B8 AD E6 96 87 (UTF)	u4E2Du6587 (in Unicode = "中文")
ISO-8859-1	D6 D0 CE C4 (GB2312)	C3 96 C3 90 C3 8E C3 84 (UTF)	u00D6 u00D0 u00CE u00C4 (a 00 is added in front of each of D6 D0 CE C4)
None (default)	D6 D0 CE C4 (GB2312)	Same as ISO-8859-1	Same as ISO-8859-1

The compilation process of ordinary Java programs is exactly the same as that of Servlets.

Is the representation of Chinese in the CLASS file clear now? OK, next let’s see how the CLASS outputs Chinese.

Class: output string

As mentioned above, strings are represented in memory as Unicode. What this Unicode represents depends on which character set it was mapped from, that is, on its ancestry. It is like checked luggage: every piece looks like a cardboard box, and what is inside depends on what the sender actually packed.

Look at the example above. Given the Unicode string "00D6 00D0 00CE 00C4", if you look it up directly in the Unicode code table without conversion, it is four characters (special characters, at that). If you map it with "ISO8859-1", you simply remove the leading "00" to get "D6 D0 CE C4", four 8-bit characters in the ISO8859-1 code table. If you map it with GB2312, the result is very likely garbled, because GB2312 may or may not have characters corresponding to 00D6 and the others. If it does not, you get 0x3f, a question mark; if it does, then since codes such as 00D6 sit so near the front of the table, they are probably special symbols as well. The real Chinese characters in Unicode start at 4E00.

As you can see, the same Unicode characters can be interpreted in different ways, and only one of them is the desired result. In the example above, "D6 D0 CE C4" is what we want: when "D6 D0 CE C4" is output to IE and viewed in "Simplified Chinese" mode, the two characters "中文" appear clearly. (Of course, if you insist on viewing it as "Western European characters", nothing can be done and you will get nothing meaningful.) Why? Because "00D6 00D0 00CE 00C4" was originally converted from ISO8859-1.

The following conclusions can be drawn:

Before a Class outputs a string, the Unicode string is converted back into a byte stream according to a certain internal code, and the byte stream is then output. This is equivalent to performing a "String.getBytes(???)" operation, where ??? stands for a certain character set.

If it is a Servlet, then this internal code is the internal code specified in the HttpServletResponse.setContentType() method, which is the <Servlet-charset> defined above.

If it is JSP, then this internal code is the internal code specified in <%@ page contentType=""%>, which is the <Jsp-charset> defined above.

If it is a Java program, then this internal code is the internal code specified in file.encoding, and the default is ISO8859-1.
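For an ordinary Java program, the effect of the output charset can be observed by printing into a buffer. The sketch below passes the charset explicitly to PrintStream instead of relying on file.encoding, so that the result does not depend on the machine; the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

// Hypothetical demo class: captures the bytes a PrintStream emits for a given charset.
final class OutputEncodingDemo {
    static byte[] printWith(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(sink, true, charsetName)) {
            out.print(text); // the Unicode string is turned into bytes here
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // The charset used when none is given explicitly; it varies by platform.
        System.out.println(Charset.defaultCharset());
        System.out.println(printWith("\u4e2d\u6587", "GB2312").length);     // 4 bytes
        System.out.println(printWith("\u4e2d\u6587", "ISO-8859-1").length); // 2 bytes, both 0x3f
    }
}
```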

When the output object is a browser

Take the popular browser IE as an example. IE supports multiple internal codes. If IE receives a byte stream "D6 D0 CE C4", you can try to use various internal codes to check it. You will find that you get the correct results when using "Simplified Chinese". Because "D6 D0 CE C4" is originally the encoding of the word "中文" in Simplified Chinese.

OK, let’s go through the whole process.

JSP: The source file is a text file in GB2312 format, and it contains the two Chinese characters "中文"

If <Jsp-charset> is specified as GB2312, the conversion process is as follows.

Table 4 Change process when Jsp-charset = GB2312

No.	Step description	Result
1	Write the JSP source file and save it in GB2312 format	D6 D0 CE C4 (D6D0=中 CEC4=文)
2	jspc converts the JSP source file into a temporary JAVA file, maps the string to Unicode according to GB2312, and writes it into the JAVA file in UTF format	E4 B8 AD E6 96 87
3	The temporary JAVA file is compiled into a CLASS file	E4 B8 AD E6 96 87
4	At run time, the string is first read from the CLASS file with readUTF; its Unicode encoding in memory is	4E 2D 65 87 (in Unicode, 4E2D=中 6587=文)
5	Unicode is converted into a byte stream according to Jsp-charset=GB2312	D6 D0 CE C4
6	The byte stream is output to IE, and IE's encoding is set to GB2312 (author's note: this information is hidden in the HTTP header)	D6 D0 CE C4
7	IE views the result with "Simplified Chinese"	"中文" (correct display)

If <Jsp-charset> is specified as ISO8859-1, the conversion process is as follows.

Table 5 Change process when Jsp-charset = ISO8859-1

No.	Step description	Result
1	Write the JSP source file and save it in GB2312 format	D6 D0 CE C4 (D6D0=中 CEC4=文)
2	jspc converts the JSP source file into a temporary JAVA file, maps the string to Unicode according to ISO8859-1, and writes it into the JAVA file in UTF format	C3 96 C3 90 C3 8E C3 84
3	The temporary JAVA file is compiled into a CLASS file	C3 96 C3 90 C3 8E C3 84
4	At run time, the string is first read from the CLASS file with readUTF; its Unicode encoding in memory is	00 D6 00 D0 00 CE 00 C4 (meaningless!)
5	Unicode is converted into a byte stream according to Jsp-charset=ISO8859-1	D6 D0 CE C4
6	The byte stream is output to IE, and IE's encoding is set to ISO8859-1 (author's note: this information is hidden in the HTTP header)	D6 D0 CE C4
7	IE views the result with "Western European characters"	Garbled. These are actually four 8-bit characters, but because their codes are greater than 128 they display strangely.
8	Change IE's page encoding to "Simplified Chinese"	"中文" (correct display)

Strange! Why is "中文" displayed correctly whether <Jsp-charset> is set to GB2312 or to ISO8859-1? Because steps 2 and 5 in Table 4 and Table 5 are inverses of each other, so they "cancel out". However, when ISO8859-1 is specified, step 8 must be added, which is particularly inconvenient.
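The cancelling effect is easy to confirm: decoding and re-encoding with the same charset returns the original bytes, even though the string held in memory differs. The following is a sketch with hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class: step 2 (decode) followed by step 5 (encode) with one charset.
final class CancelOutDemo {
    static byte[] throughJspCharset(byte[] sourceBytes, Charset jspCharset) {
        String inMemory = new String(sourceBytes, jspCharset); // step 2: bytes -> Unicode
        return inMemory.getBytes(jspCharset);                  // step 5: Unicode -> bytes
    }

    public static void main(String[] args) {
        byte[] source = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};
        Charset gb2312 = Charset.forName("GB2312");
        System.out.println(Arrays.equals(source, throughJspCharset(source, gb2312)));
        System.out.println(Arrays.equals(source, throughJspCharset(source, StandardCharsets.ISO_8859_1)));
        // The intermediate strings differ: 2 characters versus 4.
        System.out.println(new String(source, gb2312).length());
        System.out.println(new String(source, StandardCharsets.ISO_8859_1).length());
    }
}
```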

Let’s look at the situation when <Jsp-charset> is not specified.

Table 6 Change process when Jsp-charset is not specified

No.	Step description	Result
1	Write the JSP source file and save it in GB2312 format	D6 D0 CE C4 (D6D0=中 CEC4=文)
2	jspc converts the JSP source file into a temporary JAVA file, maps the string to Unicode according to ISO8859-1, and writes it into the JAVA file in UTF format	C3 96 C3 90 C3 8E C3 84
3	The temporary JAVA file is compiled into a CLASS file	C3 96 C3 90 C3 8E C3 84
4	At run time, the string is first read from the CLASS file with readUTF; its Unicode encoding in memory is	00 D6 00 D0 00 CE 00 C4
5	Unicode is converted into a byte stream according to Jsp-charset=ISO8859-1	D6 D0 CE C4
6	The byte stream is output to IE	D6 D0 CE C4
7	IE views the result with the encoding of the page from which the request was made	Depends on the situation. If it is Simplified Chinese, the display is correct; otherwise, step 8 in Table 5 is needed.

Servlet: The source file is a JAVA file in GB2312 format, and it contains the two Chinese characters "中文"

If <Compile-charset>=GB2312, <Servlet-charset>=GB2312

Table 7 Change process when Compile-charset=Servlet-charset=GB2312

No.	Step description	Result
1	Write the Servlet source file and save it in GB2312 format	D6 D0 CE C4 (D6D0=中 CEC4=文)
2	Use javac -encoding GB2312 to compile the JAVA source file into a CLASS file	E4 B8 AD E6 96 87 (UTF)
3	At run time, the string is first read from the CLASS file with readUTF and stored in memory as Unicode	4E 2D 65 87 (Unicode)
4	Unicode is converted into a byte stream according to Servlet-charset=GB2312	D6 D0 CE C4 (GB2312)
5	The byte stream is output to IE, and IE's encoding attribute is set to Servlet-charset=GB2312	D6 D0 CE C4 (GB2312)
6	IE views the result with "Simplified Chinese"	"中文" (correctly displayed)

If <Compile-charset>=ISO8859-1, <Servlet-charset>=ISO8859-1

Table 8 Change process when Compile-charset=Servlet-charset=ISO8859-1

No.	Step description	Result
1	Write the Servlet source file and save it in GB2312 format	D6 D0 CE C4 (D6D0=中 CEC4=文)
2	Use javac -encoding ISO8859-1 to compile the JAVA source file into a CLASS file	C3 96 C3 90 C3 8E C3 84 (UTF)
3	At run time, the string is first read from the CLASS file with readUTF; its Unicode encoding in memory is	00 D6 00 D0 00 CE 00 C4
4	Unicode is converted into a byte stream according to Servlet-charset=ISO8859-1	D6 D0 CE C4
5	The byte stream is output to IE, and IE's encoding attribute is set to Servlet-charset=ISO8859-1	D6 D0 CE C4 (GB2312)
6	IE views the result with "Western European characters"	Garbled (for the same reason as in Table 5)
7	Change IE's page encoding to "Simplified Chinese"	"中文" (correctly displayed)

If Compile-charset or Servlet-charset is not specified, the default value is ISO8859-1.

When Compile-charset = Servlet-charset, steps 2 and 4 are inverses of each other and cancel out, so the displayed result is correct. Readers can try working through the case where Compile-charset ≠ Servlet-charset; the result is definitely incorrect.
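The mismatched case can be worked through in code as well. The sketch below imitates compilation (decode with Compile-charset) and output (encode with Servlet-charset) for the same four source bytes. The names are hypothetical, and no compiler or Servlet engine is involved.

```java
import java.nio.charset.Charset;

// Hypothetical simulation of a Servlet whose Compile-charset and Servlet-charset may differ.
final class CharsetMismatch {
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    static String outputHex(byte[] sourceBytes, String compileCharset, String servletCharset) {
        String inMemory = new String(sourceBytes, Charset.forName(compileCharset)); // compile step
        return hex(inMemory.getBytes(Charset.forName(servletCharset)));             // output step
    }

    public static void main(String[] args) {
        byte[] source = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};
        System.out.println(outputHex(source, "GB2312", "GB2312"));     // d6d0cec4, intact
        System.out.println(outputHex(source, "GB2312", "ISO-8859-1")); // 3f3f, two question marks
        System.out.println(outputHex(source, "ISO-8859-1", "GB2312")); // question marks again
    }
}
```

In both mismatched cases the original bytes are lost before they reach the browser, so changing the encoding in IE afterwards cannot recover them.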

When the output object is a database

When outputting to the database, the principle is the same as outputting to the browser. This section only takes Servlet as an example. Readers are asked to deduce the situation of JSP by themselves.

Suppose there is a Servlet that receives a Chinese character string from the client (IE, Simplified Chinese), writes it into a database whose internal code is ISO8859-1, and then retrieves the string from the database and displays it to the client.

Table 9 Change process when the output object is a database (1)

No.	Step description	Result	Domain
1	Enter "中文" in IE	D6 D0 CE C4	IE
2	IE converts the string into UTF and sends it into the transport stream	E4 B8 AD E6 96 87	IE
3	The Servlet receives the input stream and reads it with readUTF	4E 2D 65 87 (Unicode)	Servlet
4	In the Servlet, the programmer must restore the string to a byte stream according to GB2312	D6 D0 CE C4	Servlet
5	The programmer generates a new string according to the database internal code ISO8859-1	00 D6 00 D0 00 CE 00 C4	Servlet
6	The newly generated string is submitted to JDBC	00 D6 00 D0 00 CE 00 C4	Servlet
7	JDBC detects that the internal code of the database is ISO8859-1	00 D6 00 D0 00 CE 00 C4	JDBC
8	JDBC converts the received string into a byte stream according to ISO8859-1	D6 D0 CE C4	JDBC
9	JDBC writes the byte stream into the database	D6 D0 CE C4	JDBC
10	The data storage work is complete	D6 D0 CE C4	Database
The following is the process of retrieving the data from the database
11	JDBC retrieves the byte stream from the database	D6 D0 CE C4	JDBC
12	JDBC generates a string according to the database character set ISO8859-1 and submits it to the Servlet	00 D6 00 D0 00 CE 00 C4 (Unicode)	JDBC
13	The Servlet obtains the string	00 D6 00 D0 00 CE 00 C4 (Unicode)	Servlet
14	The programmer must restore the original byte stream according to the database internal code ISO8859-1	D6 D0 CE C4	Servlet
15	The programmer must generate a new string according to the client character set GB2312	4E 2D 65 87 (Unicode)	Servlet
The Servlet prepares to output the string to the client
16	The Servlet generates the byte stream based on <Servlet-charset>	D6 D0 CE C4	Servlet
17	The Servlet outputs the byte stream to IE. If <Servlet-charset> has been specified, IE's encoding is also set to <Servlet-charset>	D6 D0 CE C4	Servlet
18	IE views the result according to the specified encoding or the default encoding	"中文" (correctly displayed)	IE

To explain: steps 4, 5 and 14, 15 in the table are the ones in which the programmer must perform the conversion. Steps 4 and 5 amount to a single statement: "new String(source.getBytes("GB2312"), "ISO8859-1")". Steps 14 and 15 are likewise a single statement: "new String(source.getBytes("ISO8859-1"), "GB2312")". Dear reader, are you aware of every one of these details when you write code like this?
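The two one-line conversions quoted above can be wrapped in a pair of helpers and checked with a round trip. This is a sketch only: the class and method names are hypothetical, and the charsets are fixed to the GB2312 client and ISO8859-1 database of this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helpers for a GB2312 client talking to an ISO8859-1 database.
final class DbCharsetBridge {
    static final Charset CLIENT = Charset.forName("GB2312");
    static final Charset DATABASE = StandardCharsets.ISO_8859_1;

    /** Before writing: re-wrap the client bytes as a string in the database's charset. */
    static String toDatabaseForm(String clientText) {
        return new String(clientText.getBytes(CLIENT), DATABASE);
    }

    /** After reading: recover the original bytes and decode them as the client charset. */
    static String fromDatabaseForm(String dbText) {
        return new String(dbText.getBytes(DATABASE), CLIENT);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587";
        String stored = toDatabaseForm(original);
        System.out.println(stored.length());                            // 4 characters
        System.out.println(fromDatabaseForm(stored).equals(original));  // true
    }
}
```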

As for the process when the client internal code and database internal code are other values, and the process when the output object is the system console, readers can think about it themselves. Once you understand the principles of the above process, I believe you can write it easily.

At this point, the article has come to an end. The end point returns to the starting point, and for programmers it changes almost nothing,

because we have long been told to do it this way.

A conclusion is given below to close.

1. In a JSP file, contentType must be specified, and its charset value must be the same as the character set used by the client browser. String constants need no internal code conversion. String variables must be restorable, according to the character set specified in contentType, to a byte stream that the client can recognize. Simply put, "string variables are based on the <Jsp-charset> character set";

2. In a Servlet, you must use HttpServletResponse.setContentType() to set the charset, and set it to be consistent with the client's internal code. For string constants, you need to specify the encoding when compiling with javac, and this encoding must match the character set of the platform on which the source file was written; generally speaking, it is GB2312 or GBK. String variables, as in JSP, must be "based on the <Servlet-charset> character set".