First, a note: this article is about String in Java. Although I have decided to switch to C/C++, I ran into a problem today and still wanted to take a look. The definition of String is as follows:
public final class String
{
    private final char value[]; // the characters of the string
    private final int offset;   // starting position within value
    private final int count;    // number of characters
    private int hash;           // cached hash code, 0 until computed
    ...
}
When debugging, you can inspect these fields directly. Note that hash stays 0 until hashCode() has been called for the first time. The value field is the char array holding the actual characters of the string (here, the example string), and each char stores a Unicode (UTF-16) code unit, which is easy to verify in the debugger.
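Since hash is only a cache, hashCode() computes it lazily on first use. The JDK of that era did it roughly like this (a sketch; details vary by version):

public int hashCode() {
    int h = hash;
    if (h == 0) { // not cached yet: compute and remember
        int off = offset;
        char val[] = value;
        int len = count;
        for (int i = 0; i < len; i++) {
            h = 31 * h + val[off++];
        }
        hash = h;
    }
    return h;
}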
At this point you can guess how the commonly used substring is implemented: if we implemented it ourselves, we would let the new String share the same value (char array) and only change offset and count. That saves space and is fast (no copying), and in fact that is exactly what happens:
public String substring(int beginIndex) {
    return substring(beginIndex, count);
}

public String substring(int beginIndex, int endIndex) {
    ...
    // share the same char array; only offset and count differ
    return ((beginIndex == 0) && (endIndex == count)) ? this :
        new String(offset + beginIndex, endIndex - beginIndex, value);
}

// package-private constructor: no copying, just a new view over the same value[]
String(int offset, int count, char value[]) {
    this.value = value;
    this.offset = offset;
    this.count = count;
}
Since we are discussing strings: what encoding does the JVM use by default? Debugging leads to the following:
public static Charset defaultCharset() {
    if (defaultCharset == null) {
        synchronized (Charset.class) {
            java.security.PrivilegedAction pa = new GetPropertyAction("file.encoding");
            String csn = (String) AccessController.doPrivileged(pa);
            Charset cs = lookup(csn);
            if (cs != null)
                defaultCharset = cs;
            else
                defaultCharset = forName("UTF-8");
        }
    }
    return defaultCharset;
}
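You can check what this resolves to on your own machine; a minimal sketch:

import java.nio.charset.Charset;

public class CharsetCheck {
    public static void main(String[] args) {
        // the charset the JVM resolved
        System.out.println(Charset.defaultCharset());
        // the raw system property it was derived from
        System.out.println(System.getProperty("file.encoding"));
    }
}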
The default charset can be set with the JVM option -Dfile.encoding=utf-8. You can even set it to a nonsense value such as "abc"; lookup then finds nothing and it falls back to UTF-8. The actual value can be read via System.getProperty("file.encoding"). Why care about defaultCharset at all? Because what travels over the network is byte arrays, and different encodings may turn the same string into different byte arrays, so we need to know where the encoding comes from. The method that actually produces the byte array is getBytes, which we focus on below. It ultimately calls the encode method of CharsetEncoder:
public final CoderResult encode(CharBuffer in, ByteBuffer out, boolean endOfInput) {
    int newState = endOfInput ? ST_END : ST_CODING;
    if ((state != ST_RESET) && (state != ST_CODING) && !(endOfInput && (state == ST_END)))
        throwIllegalStateException(state, newState);
    state = newState;
    for (;;) {
        CoderResult cr;
        try {
            cr = encodeLoop(in, out);
        } catch (BufferUnderflowException x) {
            throw new CoderMalfunctionError(x);
        } catch (BufferOverflowException x) {
            throw new CoderMalfunctionError(x);
        }
        if (cr.isOverflow())
            return cr;
        if (cr.isUnderflow()) {
            if (endOfInput && in.hasRemaining()) {
                cr = CoderResult.malformedForLength(in.remaining());
            } else {
                return cr;
            }
        }
        CodingErrorAction action = null;
        if (cr.isMalformed())
            action = malformedInputAction;
        else if (cr.isUnmappable())
            action = unmappableCharacterAction;
        else
            assert false : cr.toString();
        if (action == CodingErrorAction.REPORT)
            return cr;
        if (action == CodingErrorAction.REPLACE) {
            if (out.remaining() < replacement.length)
                return CoderResult.OVERFLOW;
            out.put(replacement);
        }
        if ((action == CodingErrorAction.IGNORE) || (action == CodingErrorAction.REPLACE)) {
            in.position(in.position() + cr.length());
            continue;
        }
        assert false;
    }
}
Of course, a CharsetEncoder matching the requested encoding is chosen first; the key point is that each charset supplies its own implementation of encodeLoop. You may wonder why there is a for(;;) here. Looking at the package CharsetEncoder lives in (java.nio.charset, part of NIO) and at its buffer parameters gives a rough idea: the method is designed to encode streams, buffer by buffer (although in our usage here it does not actually loop). Inside encodeLoop, as many chars as possible are converted into bytes; new String(byte[], ...) is roughly the reverse of the process above.
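To make that concrete, here is a hedged sketch of driving a CharsetEncoder by hand, which is roughly what getBytes does for us:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class EncodeDemo {
    public static void main(String[] args) throws CharacterCodingException {
        CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();
        // the convenience overload wraps the encode(...) loop shown above
        ByteBuffer bytes = encoder.encode(CharBuffer.wrap("字符串"));
        while (bytes.hasRemaining()) {
            System.out.print(bytes.get() + " "); // prints the signed byte values
        }
    }
}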
In actual development, garbled text (mojibake) comes up all the time, for example:
the file name when uploading a file;
strings passed from JS to the backend.
First, try running the following code:
public static void main(String[] args) throws Exception {
    String str = "字符串"; // three Chinese characters meaning "string"
    // -41 -42 -73 -5 -76 -82 (default encoding here: GBK)
    printArray(str.getBytes());
    // -27 -83 -105 -25 -84 -90 -28 -72 -78
    printArray(str.getBytes("utf-8"));
    // ???
    System.out.println(new String(str.getBytes(), "utf-8"));
    // 瀛楃涓? (mojibake; one character in the middle has no visible glyph)
    String garbled = new String(str.getBytes("utf-8"), "gbk");
    System.out.println(garbled);
    // 字符??
    System.out.println(new String(garbled.getBytes("gbk"), "utf-8"));
    // -41 -42 -73 -5 63 63
    printArray(new String(garbled.getBytes("gbk"), "utf-8").getBytes());
}

public static void printArray(byte[] bs) {
    for (int i = 0; i < bs.length; i++) {
        System.out.print(bs[i] + " ");
    }
    System.out.println();
}
The outputs are described in the comments in the program:
GBK uses 2 bytes per Chinese character, hence 6 bytes;
UTF-8 uses 3 bytes per (common) Chinese character, hence 9 bytes;
bytes produced by GBK form sequences that UTF-8 cannot accept, so decoding them as UTF-8 prints ???;
this is where most mojibake comes from: bytes produced with one encoding (here UTF-8) are decoded with another (here GBK);
although the result looks garbled to us, the computer does not mind; getBytes("gbk") still returns the underlying bytes, and most of those bytes are still recognizable as UTF-8;
the last two 63s ('?') should be filled in by the encoder for characters it cannot map (or for the dangling byte; I did not examine this part closely).
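That guess about the 63s is easy to confirm: when getBytes meets a character the target charset cannot represent (such as the replacement character U+FFFD produced by the failed decode above), the encoder substitutes its replacement byte '?' (63). A minimal sketch:

public class ReplacementDemo {
    public static void main(String[] args) throws Exception {
        // U+FFFD has no mapping in GBK, so the encoder writes '?' instead
        byte[] bs = "\uFFFD".getBytes("GBK");
        System.out.println(bs[0]); // 63
    }
}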
Letters and digits are encoded identically in GBK and UTF-8 (both agree with ASCII), so those characters never get garbled. Chinese characters, however, are encoded differently by the two, and that is the origin of many problems. Look at the code below:
new String(new String("我们".getBytes("UTF-8"), "GBK").getBytes("GBK"), "UTF-8");
Obviously the result of this code is "我们" ("we") again, but what does it actually do to the string? First, notice:

new String("我们".getBytes("UTF-8"), "GBK");

The result of this inner expression is mojibake, and a lot of real-world mojibake is produced in exactly this way. But remember: it is garbled only to us; to the computer there is no "garbled" or "not garbled". Even when we are about to give up, getBytes("GBK") can still pull the original bytes, the string's "backbone", out of the garbled text, and from that backbone the original string can be restored.
It seems the code above can fix mojibake between GBK and UTF-8, but the trick only works in a special case: every consecutive run of Chinese characters must have an even length! The reason follows from the byte counts above: an even run yields a multiple of six UTF-8 bytes, which the GBK decode consumes as complete two-byte pairs, while an odd run leaves a dangling byte that is replaced by '?' and lost.
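A sketch of that limitation, under the assumptions above: two characters survive the round trip, three do not:

public class RoundTripDemo {
    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("我们"));   // 2 chars, 6 UTF-8 bytes: restored intact
        System.out.println(roundTrip("字符串")); // 3 chars, 9 UTF-8 bytes: one byte is lost, stays garbled
    }

    // decode UTF-8 bytes as GBK (creating mojibake), then reverse the mistake
    static String roundTrip(String s) throws Exception {
        String garbled = new String(s.getBytes("UTF-8"), "GBK");
        return new String(garbled.getBytes("GBK"), "UTF-8");
    }
}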
So how to solve this problem?
The first solution: encodeURI. Why use this method? The reason is simple: GBK and UTF-8 encode '%', digits, and letters identically, so the encoded string is guaranteed to mean the same thing under both encodings, and decoding it on the other side recovers the original string. Judging from the layout of String, we can also expect this encoding and decoding to be very cheap, so it is a decent solution.
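As an illustration (one possible pairing, not the only one): encodeURIComponent in the browser, URLDecoder on the Java side, with UTF-8 pinned at both ends:

import java.net.URLDecoder;
import java.net.URLEncoder;

public class UriDemo {
    public static void main(String[] args) throws Exception {
        // what encodeURI/encodeURIComponent would send from the JS side
        String encoded = URLEncoder.encode("字符串", "UTF-8");
        System.out.println(encoded); // %E5%AD%97%E7%AC%A6%E4%B8%B2
        // pure ASCII survives any GBK/UTF-8 confusion along the way
        System.out.println(URLDecoder.decode(encoded, "UTF-8"));
    }
}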
The second solution: unify the encoding format. We use Webx here, where it is enough to set defaultCharset="UTF-8" in webx.xml.