Thoughts on MySQL character set architecture

Author：Eve Cole Update Time：2010-01-07 17:31:04

With the widespread use of various multi-byte character sets, a very high proportion of English-speaking programmers in software development do not know much about multi-byte characters. This is why many vulnerabilities in recent years are multi-byte. a cause. The author of this article talks about his own views on the role of MySQL's character set architecture. In the past few months, every time I use MySQL, I almost always think: Is MySQL’s current hierarchical character set architecture really useful?

MySQL character set processing

Send request

Client (character_set_client)=》Database connection (character_set_connection)=》Storage (table, column)

return request

Storage (table, column)=》Database connection (character_set_connection)=》Client (character_set_results)

At each non-initial node, a character set conversion operation is performed from the previous node to the current node. For example, consider the following environment:

◆ character_set_connection utf-8

◆ character_set_results gbk

◆ character_set_client gb2312

◆ There is table A, and the field character sets are all BIG5

When sending a request, the data is first converted from gbk to utf-8, then to BIG5, and then stored.

When returning the request, the data is first converted from BIG5 to utf-8, then to gb2312, and then sent to the client.

The role of architecture

1. Allow different clients to have different character sets. A typical example is that I have a UTF-8 site, which is a client with a charset client of UTF-8. At the same time, I may need to read and write the database on a gbk terminal, which is another client, but its character set is gbk.

2. When operating the file system through the database, you need to convert the file path to the character set of the file system. For example my client is gbk and the server file system is utf-8. Operation "/A slice/Rina.rmvb", among the data sent, the data of "slice" is different from the server. At this time, there needs to be a way to convert the "slice" of GBK to utf-8. Here MySQL introduces something called character_filesystem to accomplish this.

Other than that, I can't think of any other uses for the moment. But think about it carefully, do we really need this kind of treatment? Many websites just hope that their data can come out as they please. There are two more situations here.

1. I hope I can sort or perform like operations based on the data. Let’s talk about sorting first. For fields containing Chinese, the concept of sorting based on character sets is useless. When sorting Simplified Chinese, you generally want to sort by Pinyin. I have not really understood the verification in MySQL, but judging from the programs I have come into contact with, if this type of sorting is required, a field is specially created to store pinyin for sorting. There are also polyphonic characters in Pinyin. If it is UTF-8, there is also a situation where a certain range of Chinese is shared by China, Japan and South Korea at the same time. It is not so easy to implement, so neither the GBK nor the UTF-8 checkset of MySQL should implement Pinyin. I dare say that most websites in China that use MySQL now use a check set that is just a byte sort. With byte sorting, there is no need to use any character set at all. Therefore, for Chinese sites, MySQL character verification has no meaning in sorting.

But in terms of like operation, it does have a little meaning. For example, if I like '%a%', it is possible to match a Chinese character containing a in a certain part. Of course, this situation will not be encountered under UTF-8, because the storage format of UTF-8 means that a can only be a, and cannot be part of a multi-byte character. But this problem may occur in other character sets. In the end, like becomes the same as order, making verification meaningless. faint.

2. If there is no need to sort the data, like or full-text search, then please stop using char, varchar, text and the like. binary, varbinary, BLOB are the correct choices. Binary and the like will not perform character set conversion when storing and retrieving, but when sorting, they are only sorted according to the binary content, so the efficiency is much higher than that of char, varchar, and text.

In this case, there is no need for a character set. However, according to the current MySQL architecture, character set operations between client and connection ignore field types. Character set conversion will still be performed between these two nodes.

Also mention the character set setting in PHP. Please stop using statements like mysql_query("set names utf8"). mysql_set_charset() is the most complete character set setting method. The latter has one more setting than the former, which is to set the charset member of struct MySQL. This member variable plays a very important role in escape, especially for encoding formats such as GBK that use "" as part of the character. If you only use mysql_query("set names XXX"), then in some character sets, there will be major security holes, causing mysql_real_escape_string to become as unsafe as addslashes.