Today I used a JavaScript script to build a menu tool for an ASP.NET page and saved it as MenuScript.js. I included it in the page with <script language="javascript" src="../js/MenuScript.js"></script>, and a strange phenomenon appeared at runtime: the Chinese characters on the page displayed normally, but the Chinese characters in the menu were garbled.
No need to even think about it: this had to be an encoding problem. Switching between the UTF-8 and GB2312 encodings under the browser's "View" - "Encoding" menu only made the Chinese characters on the page and the Chinese characters in the menu take turns becoming garbled.
Solution: the configuration file has an encoding setting: <globalization requestEncoding="utf-8" responseEncoding="utf-8" />
There is also an encoding option when saving the MenuScript.js file (open the file in an editor such as Word, choose Save As, and pick an encoding). Just keep the two encodings the same.
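If you prefer to fix the file programmatically, here is a minimal sketch, assuming the script was originally saved as GB2312 (adjust the source encoding to whatever yours actually is; on newer .NET runtimes the GB2312 code page additionally requires registering CodePagesEncodingProvider):

using System.IO;
using System.Text;

class FixJsEncoding
{
    static void Main()
    {
        // Read MenuScript.js with its current (assumed) encoding...
        string text = File.ReadAllText(@"..\js\MenuScript.js",
                                       Encoding.GetEncoding("GB2312"));
        // ...and write it back out as UTF-8 so it matches responseEncoding.
        File.WriteAllText(@"..\js\MenuScript.js", text, Encoding.UTF8);
    }
}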
To understand the encoding problem better, I found an article about it on CSDN, by fmddlmyy. I am reprinting it here for reference:
Let's talk about coding
0. Big endian and little endian
Big endian and little endian are the two different ways a CPU can order the bytes of a multi-byte number. For example, the Unicode encoding of the character "汉" is 6C49. When writing it to a file, should 6C be written first or 49? If 6C comes first, that is big endian; if 49 comes first, that is little endian.
The word "endian" comes from "Gulliver's Travels". The civil war in Lilliput broke out over whether eggs should be cracked from the big end (Big-Endian) or the little end (Little-Endian). It led to six rebellions; one emperor lost his life and another lost his throne.
We generally translate "endian" as "byte order", and call big endian and little endian the "big end" and the "little end".
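A small C# sketch of the difference (the exact output depends on the CPU the code runs on):

using System;

class EndianDemo
{
    static void Main()
    {
        ushort han = 0x6C49; // the Unicode encoding of "汉"
        // BitConverter returns the bytes in the machine's native order.
        // On a little endian x86 PC this prints "49-6C": the low byte first.
        Console.WriteLine(BitConverter.ToString(BitConverter.GetBytes(han)));
        Console.WriteLine(BitConverter.IsLittleEndian); // True on x86
    }
}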
1. Character encoding and internal code, with a digression on Chinese character encoding
Characters must be encoded before a computer can process them. The default encoding a computer uses is its internal code. Early computers used 7-bit ASCII. To handle Chinese characters, programmers designed GB2312 for Simplified Chinese and Big5 for Traditional Chinese.
GB2312 (1980) contains 7445 characters in total: 6763 Chinese characters and 682 other symbols. The internal code range of the Chinese character area runs from B0 to F7 in the high byte and from A1 to FE in the low byte, occupying 72*94=6768 code points, 5 of which (D7FA-D7FE) are vacant.
GB2312 supports too few Chinese characters. The 1995 Chinese character extension specification GBK 1.0 includes 21886 symbols, divided into a Chinese character area and a graphic symbol area; the Chinese character area contains 21003 characters. GB18030, released in 2000, is the official national standard that superseded GBK 1.0. It includes 27484 Chinese characters as well as Tibetan, Mongolian, Uyghur, and other major minority scripts. PC platforms are now required to support GB18030; there is no such requirement for embedded products, so mobile phones and MP3 players generally support only GB2312.
From ASCII through GB2312 and GBK to GB18030, these encodings are backward compatible: the same character always has the same encoding in each of these schemes, and each later standard supports more characters. In these encodings English and Chinese can be processed uniformly; Chinese-encoded bytes are distinguished by the highest bit of the high byte being nonzero. In programmers' terms, GB2312, GBK, and GB18030 all belong to the double-byte character sets (DBCS).
The default internal code of Chinese Windows is still GBK; it can be upgraded to GB18030 with the GB18030 upgrade package. However, the characters GB18030 adds beyond GBK are of little use to ordinary users, so we usually still use GBK to refer to the Chinese Windows internal code.
Here are some more details:
The original GB2312 standard is expressed in area codes (区位码); to convert an area code to the internal code, add A0 to the high byte and to the low byte. For example, the area code of "汉" is 2626: 26 decimal is 1A hex, and 1A + A0 = BA for each byte, giving the internal code BABA.
In DBCS, the storage format of the GB internal code is always big endian, i.e. the high byte comes first.
The highest bits of both bytes of GB2312 are 1, but only 128*128=16384 code points satisfy that condition, so in GBK and GB18030 the highest bit of the low byte is not necessarily 1. This does not affect parsing a DBCS character stream: while reading, whenever a byte with its high bit set is encountered, that byte and the byte after it are treated together as one double-byte character, regardless of the low byte's high bit.
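A short C# sketch of these points (a minimal illustration; Encoding.GetEncoding("GB2312") is available out of the box on .NET Framework, while newer runtimes need CodePagesEncodingProvider):

using System;
using System.Text;

class GbDemo
{
    static void Main()
    {
        // "汉" in the GB internal code: BA-BA, high byte first.
        byte[] gb = Encoding.GetEncoding("GB2312").GetBytes("汉");
        Console.WriteLine(BitConverter.ToString(gb)); // BA-BA

        // Area code to internal code: add 0xA0 to each byte (2626 -> BABA).
        Console.WriteLine("{0:X2}{1:X2}", 26 + 0xA0, 26 + 0xA0); // BABA

        // DBCS parsing rule: a byte with its high bit set starts
        // a double-byte character.
        Console.WriteLine((gb[0] & 0x80) != 0); // True
    }
}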
2. Unicode, UCS and UTF
As mentioned earlier, the encodings from ASCII through GB2312 and GBK to GB18030 are backward compatible. Unicode is compatible only with ASCII (more precisely, with ISO-8859-1) and is not compatible with the GB codes. For example, the Unicode encoding of the character "汉" is 6C49, while its GB code is BABA.
Unicode is also a character encoding method, but it was designed by international organizations and can accommodate encodings for all of the world's languages and scripts. The formal name of the Unicode scheme is "Universal Multiple-Octet Coded Character Set", UCS for short. UCS can be thought of as an abbreviation of "Unicode Character Set".
According to Wikipedia ( http://zh.wikipedia.org/wiki/ ): historically, two organizations tried independently to design a universal character set: the International Organization for Standardization (ISO) and an association of software manufacturers (unicode.org). ISO developed the ISO 10646 project, and the Unicode Consortium developed the Unicode project.
Around 1991 both sides recognized that the world did not need two incompatible character sets, so they began merging their work and cooperating to create a single code list. From Unicode 2.0 onward, the Unicode project has used the same character repertoire and code points as ISO 10646-1.
Both projects still exist and publish their standards independently. The Unicode Consortium's latest version is Unicode 4.1.0, from 2005; the latest ISO standard is ISO 10646-3:2003.
UCS only specifies how multiple bytes are used to represent text. How these encodings are transmitted is specified by the UTF (UCS Transformation Format) specifications; the common ones are UTF-8, UTF-7, and UTF-16.
The IETF's RFC 2781 and RFC 3629 describe the UTF-16 and UTF-8 encodings clearly, crisply, and rigorously, in the consistent style of RFCs. I always forget that IETF stands for Internet Engineering Task Force, but the RFCs the IETF maintains are the basis of all specifications on the Internet.
3. UCS-2, UCS-4, BMP
UCS comes in two formats: UCS-2 and UCS-4. As the names suggest, UCS-2 encodes with two bytes and UCS-4 with four bytes (of which only 31 bits are actually used; the highest bit must be 0). Let's do some simple math games:
UCS-2 has 2^16=65536 code points, and UCS-4 has 2^31=2147483648 code points.
UCS-4 is divided into 2^7=128 groups according to the highest byte (whose highest bit is 0). Each group is divided into 256 planes by the next byte, each plane into 256 rows by the third byte, and each row into 256 cells. Cells in the same row differ only in their final byte; the rest is identical.
Plane 0 of group 0 is called the Basic Multilingual Plane, or BMP. Put differently, the code points of UCS-4 whose upper two bytes are 0 form the BMP.
Removing the two leading zero bytes from the BMP of UCS-4 gives UCS-2; prepending two zero bytes to a UCS-2 code gives the corresponding BMP code point of UCS-4. The current UCS-4 specification allocates no characters outside the BMP.
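In code, the two directions are just zero extension and truncation; a tiny sketch:

using System;

class BmpDemo
{
    static void Main()
    {
        ushort ucs2 = 0x6C49;          // "汉" in UCS-2
        uint ucs4 = ucs2;              // prepend two zero bytes: 0x00006C49,
                                       // group 0, plane 0 -- the BMP
        Console.WriteLine("{0:X8}", ucs4);          // 00006C49
        Console.WriteLine("{0:X4}", (ushort)ucs4);  // 6C49: back to UCS-2
    }
}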
4. UTF encodings
UTF-8 encodes UCS in 8-bit units. The mapping from UCS-2 to UTF-8 is as follows:
UCS-2 encoding (hexadecimal)    UTF-8 byte stream (binary)
0000-007F                       0xxxxxxx
0080-07FF                       110xxxxx 10xxxxxx
0800-FFFF                       1110xxxx 10xxxxxx 10xxxxxx
For example, the Unicode encoding of the character "汉" is 6C49. 6C49 falls between 0800 and FFFF, so the three-byte template must be used: 1110xxxx 10xxxxxx 10xxxxxx. Writing 6C49 in binary gives 0110 110001 001001. Substituting this bit stream into the x positions of the template in order yields 11100110 10110001 10001001, i.e. E6 B1 89.
Readers can use Notepad to test whether our encoding is correct.
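You can also check it with a few lines of C#; here is a sketch that encodes one BMP code point by hand using the three templates above:

using System;

class Utf8Demo
{
    static byte[] EncodeBmp(ushort cp)
    {
        if (cp <= 0x007F)                      // 0xxxxxxx
            return new byte[] { (byte)cp };
        if (cp <= 0x07FF)                      // 110xxxxx 10xxxxxx
            return new byte[] { (byte)(0xC0 | (cp >> 6)),
                                (byte)(0x80 | (cp & 0x3F)) };
        return new byte[] {                    // 1110xxxx 10xxxxxx 10xxxxxx
            (byte)(0xE0 | (cp >> 12)),
            (byte)(0x80 | ((cp >> 6) & 0x3F)),
            (byte)(0x80 | (cp & 0x3F)) };
    }

    static void Main()
    {
        // "汉" (6C49) should come out as E6-B1-89.
        Console.WriteLine(BitConverter.ToString(EncodeBmp(0x6C49)));
    }
}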
UTF-16 encodes UCS in 16-bit units. For UCS codes below 0x10000, the UTF-16 encoding is simply the 16-bit unsigned integer equal to the UCS code. For UCS codes of 0x10000 and above, an algorithm is defined. However, since the BMP of the UCS-2 and UCS-4 actually in use stays below 0x10000, UTF-16 and UCS-2 can for now be considered essentially the same. The difference is that UCS-2 is only an encoding scheme, while UTF-16 is used for actual transmission, so the question of byte order must be considered.
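The two byte orders are easy to see in C#, which exposes both UTF-16 variants:

using System;
using System.Text;

class Utf16Demo
{
    static void Main()
    {
        string s = "汉"; // Unicode 6C49
        // Little endian UTF-16: low byte first.
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(s)));          // 49-6C
        // Big endian UTF-16: high byte first.
        Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes(s))); // 6C-49
    }
}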
5. UTF byte order and BOM
UTF-8 uses the byte as its encoding unit, so there is no byte order problem. UTF-16 uses two bytes as its encoding unit, so before interpreting a UTF-16 text one must first know the byte order of each encoding unit. For example, the Unicode encoding of "奎" is 594E and the Unicode encoding of "乙" is 4E59. If we receive the UTF-16 byte stream "594E", is it "奎" or "乙"?
The Unicode specification's recommended way of marking byte order is the BOM. BOM here is not the "Bill Of Material" BOM, but the Byte Order Mark. The BOM is a clever little idea:
UCS contains a character called "ZERO WIDTH NO-BREAK SPACE", whose encoding is FEFF. FFFE, on the other hand, is a code point that does not exist in UCS, so it should never appear in actual transmission. The UCS specification recommends transmitting the character "ZERO WIDTH NO-BREAK SPACE" before transmitting a byte stream.
In this way, if a receiver sees FEFF, the byte stream is Big-Endian; if it sees FFFE, the byte stream is Little-Endian. That is why the character "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.
UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to indicate the encoding itself. The UTF-8 encoding of "ZERO WIDTH NO-BREAK SPACE" is EF BB BF (readers can verify this with the encoding method introduced above). So if a receiver sees a byte stream starting with EF BB BF, it knows the stream is UTF-8 encoded.
Windows uses BOM to mark the encoding of text files.
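Putting the three marks together, a sketch of BOM sniffing (the file name is just a placeholder; only the BOMs discussed above are checked):

using System;
using System.IO;

class BomDemo
{
    static string SniffBom(byte[] b)
    {
        if (b.Length >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
            return "UTF-8";
        if (b.Length >= 2 && b[0] == 0xFE && b[1] == 0xFF)
            return "UTF-16 big endian";
        if (b.Length >= 2 && b[0] == 0xFF && b[1] == 0xFE)
            return "UTF-16 little endian";
        return "no BOM (ANSI/GBK, or UTF-8 without a BOM)";
    }

    static void Main()
    {
        byte[] head = new byte[3];
        using (FileStream fs = File.OpenRead("test.txt"))
            fs.Read(head, 0, head.Length);
        Console.WriteLine(SniffBom(head));
    }
}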
6. Further reference materials
The main reference for this article is "Short overview of ISO-IEC 10646 and Unicode" ( http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html ).
I also found two resources that looked good, but since I already had the answers to my original questions, I did not read them:
"Understanding Unicode A general introduction to the Unicode Standard" ( http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-Chapter04a )
"Character set encoding basics Understanding character set encodings and legacy encodings" ( http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-Chapter03 )
I have written software packages for converting among UTF-8, UCS-2, and GBK, in both a version that uses the Windows API and a version that does not. If I have time later, I will tidy them up and put them on my personal homepage ( http://fmddlmyy.home4u.china.com ).
I started writing this article after thinking all the issues through, expecting to finish it quickly. Unexpectedly, weighing the wording and checking the details took a long time, and I ended up writing from 1:30 in the afternoon until 9:00. I hope some readers benefit from it.