Evaluation of several methods of dynamically converting GB encoding to UTF-8 encoding in PHP

Author: Eve Cole  Update Time: 2009-06-02 18:07:15

The article "Evaluation of IP Address->Geographical Location Conversion" found that reading the IP database file directly with the ip2addr function is the most efficient approach, while storing the IP data in a MySQL database and looking it up with SQL queries is the least efficient. However, the IP database file QQWry.dat is GB2312 encoded, and I now need the geolocation results in UTF-8. With the MySQL approach, the data can be converted to UTF-8 once and for all when it is stored in the database. The QQWry.dat file, however, cannot be modified, so the output of the ip2addr function has to be converted dynamically.

There are at least four ways to dynamically convert GB->UTF-8 encoding:

1. Use PHP's iconv extension.
2. Use PHP's mbstring extension.
3. Use a lookup table stored in a MySQL database.
4. Use a lookup table stored in a text file.

The first two methods can only be used if the server has been set up accordingly (the corresponding extensions have been compiled and installed). My virtual host does not have these two extensions, so I have to consider the latter two methods. The first two methods are not evaluated in this article.
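For readers whose servers do have the extensions, the first two methods might look roughly like the sketch below. This is not part of the evaluation; gbToUtf8Ext() is a hypothetical helper name, and the sketch assumes that iconv knows the name GB2312 and that mbstring is addressed by its EUC-CN name for the same encoding.

```php
<?php
// Sketch only: try the two extension-based converters if they are loaded.
// gbToUtf8Ext() is a hypothetical name, not a function from the article.
function gbToUtf8Ext($strGB) {
    if (function_exists('iconv')) {
        // iconv(from, to, string) returns false on failure
        $strOut = @iconv('GB2312', 'UTF-8', $strGB);
        if ($strOut !== false) return $strOut;
    }
    if (function_exists('mb_convert_encoding')) {
        // mb_convert_encoding(string, to, from)
        return mb_convert_encoding($strGB, 'UTF-8', 'EUC-CN');
    }
    return null; // neither extension is available
}
```

A null return value means the caller has to fall back to one of the lookup-table methods evaluated below.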

The evaluation program is as follows (for func_ip.php, see the article "Evaluation of IP Address->Geographical Location Conversion"):

<?php
require_once("func_ip.php");

// Encode one Unicode code point as a UTF-8 byte sequence
function u2utf8($c) {
    $str = "";
    if ($c < 0x80) {
        $str .= chr($c);
    } elseif ($c < 0x800) {
        $str .= chr(0xC0 | $c >> 6);
        $str .= chr(0x80 | $c & 0x3F);
    } elseif ($c < 0x10000) {
        $str .= chr(0xE0 | $c >> 12);
        $str .= chr(0x80 | $c >> 6 & 0x3F);
        $str .= chr(0x80 | $c & 0x3F);
    } elseif ($c < 0x200000) {
        $str .= chr(0xF0 | $c >> 18);
        $str .= chr(0x80 | $c >> 12 & 0x3F);
        $str .= chr(0x80 | $c >> 6 & 0x3F);
        $str .= chr(0x80 | $c & 0x3F);
    }
    return $str;
}

// Convert a GB2312 string to UTF-8, looking up each character in MySQL
function GB2UTF8_SQL($strGB) {
    if (!trim($strGB)) return $strGB;
    $strRet = "";
    $intLen = strlen($strGB);
    for ($i = 0; $i < $intLen; $i++) {
        if (ord($strGB[$i]) > 127) {
            // Two-byte GB character: clear the high bit of each byte to get the table key
            $strCurr = substr($strGB, $i, 2);
            $intGB = hexdec(bin2hex($strCurr)) - 0x8080;
            $strSql = "SELECT code_unicode FROM nnstats_gb_unicode"
                . " WHERE code_gb = " . $intGB . " LIMIT 1";
            $resResult = mysql_query($strSql);
            if ($arrCode = mysql_fetch_array($resResult)) $strRet .= u2utf8((int)$arrCode["code_unicode"]);
            else $strRet .= "??";
            $i++; // skip the second byte of the pair
        } else {
            $strRet .= $strGB[$i];
        }
    }
    return $strRet;
}

// Convert a GB2312 string to UTF-8 using the text lookup table.
// The whole table is re-read from disk on every call.
function GB2UTF8_FILE($strGB) {
    if (!trim($strGB)) return $strGB;
    $arrCodeTable = array();
    $arrLines = file("gb_unicode.txt");
    foreach ($arrLines as $strLine) {
        $arrCodeTable[hexdec(substr($strLine, 0, 6))] = hexdec(substr($strLine, 7, 6));
    }
    $strRet = "";
    $intLen = strlen($strGB);
    for ($i = 0; $i < $intLen; $i++) {
        if (ord($strGB[$i]) > 127) {
            $strCurr = substr($strGB, $i, 2);
            $intGB = hexdec(bin2hex($strCurr)) - 0x8080;
            if (isset($arrCodeTable[$intGB])) $strRet .= u2utf8($arrCodeTable[$intGB]);
            else $strRet .= "??";
            $i++; // skip the second byte of the pair
        } else {
            $strRet .= $strGB[$i];
        }
    }
    return $strRet;
}

// Convert a dotted-quad IP address to its integer value
function EncodeIp($strDotquadIp) {
    $arrIpSep = explode('.', $strDotquadIp);
    if (count($arrIpSep) != 4) return 0;
    $intIp = 0;
    foreach ($arrIpSep as $k => $v) $intIp += (int)$v * pow(256, 3 - $k);
    return $intIp;
    //return sprintf('%02x%02x%02x%02x', $arrIpSep[0], $arrIpSep[1], $arrIpSep[2], $arrIpSep[3]);
}

// Current time in seconds, with microsecond resolution
function GetMicroTime() {
    list($msec, $sec) = explode(" ", microtime());
    return ((double)$msec + (double)$sec);
}

// Randomly generate 100 IP addresses and look up their locations
for ($i = 0; $i < 100; $i++) {
    $strIp = mt_rand(0, 255).".".mt_rand(0, 255).".".mt_rand(0, 255).".".mt_rand(0, 255);
    $arrAddr[$i] = ip2addr(EncodeIp($strIp));
}

// Note: the mysql_* functions were removed in PHP 7; newer PHP needs mysqli or PDO
$resConn = mysql_connect("localhost", "netnest", "netnest");
mysql_select_db("test");

// Evaluate encoding conversion using MySQL queries
$dblTimeStart = GetMicroTime();
for ($i = 0; $i < 100; $i++) {
    $strUTF8Region = GB2UTF8_SQL($arrAddr[$i]["region"]);
    $strUTF8Address = GB2UTF8_SQL($arrAddr[$i]["address"]);
}
$dblTimeDuration = GetMicroTime() - $dblTimeStart;
// Evaluation finished; output the result
echo $dblTimeDuration; echo "\r\n";

// Evaluate encoding conversion using the text file lookup
$dblTimeStart = GetMicroTime();
for ($i = 0; $i < 100; $i++) {
    $strUTF8Region = GB2UTF8_FILE($arrAddr[$i]["region"]);
    $strUTF8Address = GB2UTF8_FILE($arrAddr[$i]["address"]);
}
$dblTimeDuration = GetMicroTime() - $dblTimeStart;
// Evaluation finished; output the result
echo $dblTimeDuration; echo "\r\n";
?>
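
To make the arithmetic in the program above concrete, here is a small worked example. The helper names gbKey() and codePointToUtf8() are made up for this illustration; the logic mirrors the key computation and u2utf8() from the listing, restricted to the 1- to 3-byte cases, which is all GB2312 needs. The character used is the one encoded as the bytes B0 A1 in GB2312, which the standard table maps from key 0x3021 to Unicode 0x554A.

```php
<?php
// Map a 2-byte GB2312 sequence to the lookup-table key (both high bits cleared).
function gbKey($strPair) {
    return hexdec(bin2hex($strPair)) - 0x8080;
}

// Encode one Unicode code point below 0x10000 as UTF-8.
function codePointToUtf8($c) {
    if ($c < 0x80) return chr($c);
    if ($c < 0x800) return chr(0xC0 | ($c >> 6)) . chr(0x80 | ($c & 0x3F));
    return chr(0xE0 | ($c >> 12))
        . chr(0x80 | (($c >> 6) & 0x3F))
        . chr(0x80 | ($c & 0x3F));
}

printf("0x%04X\n", gbKey("\xB0\xA1"));        // prints 0x3021
echo bin2hex(codePointToUtf8(0x554A)), "\n";  // prints e5958a
```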
Results of two runs (rounded to 3 decimal places, in seconds):

MySQL query conversion: 0.112
Text query conversion: 10.590

MySQL query conversion: 0.099
Text query conversion: 10.623

This time the MySQL method is far ahead of the text file method. But there is no need to rush to adopt the MySQL method: the text file method is so slow mainly because it reads the entire gb_unicode.txt into memory on every conversion. gb_unicode.txt is a text file in the following format:

0x2121 0x3000 # IDEOGRAPHIC SPACE
0x2122 0x3001 # IDEOGRAPHIC COMMA
0x2123 0x3002 #IDEOGRAPHIC FULL STOP
0x2124 0x30FB # KATAKANA MIDDLE DOT
0x2125 0x02C9 # MODIFIER LETTER MACRON (Mandarin Chinese first tone)
…
0x552A 0x6458 # <CJK>
0x552B 0x658B # <CJK>
0x552C 0x5B85 # <CJK>
0x552D 0x7A84 # <CJK>
…
0x777B 0x9F37 # <CJK>
0x777C 0x9F3D # <CJK>
0x777D 0x9F3E # <CJK>
0x777E 0x9F44 # <CJK>
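
One way to avoid re-reading the file on every conversion is to parse it once per request and keep the result. The sketch below assumes the line format shown above; loadGbTable() is a hypothetical helper, not a function from the evaluation program, and it uses a regular expression instead of fixed substr() offsets so that tab- or space-separated files both work.

```php
<?php
// Hypothetical helper: parse the table on first use and reuse it afterwards.
function loadGbTable($strPath) {
    static $arrCache = array();
    if (!isset($arrCache[$strPath])) {
        $arrTable = array();
        $arrLines = @file($strPath);
        if ($arrLines === false) $arrLines = array(); // unreadable file: empty table
        foreach ($arrLines as $strLine) {
            // Data lines start with "0xGGGG", whitespace, then "0xUUUU"
            if (preg_match('/^0x([0-9A-Fa-f]{4})\s+0x([0-9A-Fa-f]{4})/', $strLine, $arrM)) {
                $arrTable[hexdec($arrM[1])] = hexdec($arrM[2]);
            }
        }
        $arrCache[$strPath] = $arrTable;
    }
    return $arrCache[$strPath];
}
```

A conversion function could then call loadGbTable("gb_unicode.txt") in place of the file() loop, so that only the first call in a request pays the parsing cost.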

Since the text file is inefficient, consider converting it into a binary file and then searching that file with binary search (the halving method), without reading the whole file into memory. The file format is: a 2-byte header storing the number of records, followed by the records one after another. Each record is 4 bytes: the first 2 bytes hold the GB code and the last 2 bytes hold the Unicode code. All values are stored low byte first. The conversion program is as follows:

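<?php
// Parse the text table into an array keyed by GB code
$arrCodeTable = array();
$arrLines = file("gb_unicode.txt");
foreach ($arrLines as $strLine) {
    $arrCodeTable[hexdec(substr($strLine, 0, 6))] = hexdec(substr($strLine, 7, 6));
}
// Sort by GB code so the file can be searched by halving
ksort($arrCodeTable);
// Header: record count as 2 bytes, low byte first
$intCount = count($arrCodeTable);
$strCount = chr($intCount % 256) . chr(floor($intCount / 256));
$fileGBU = fopen("gbu.dat", "wb");
fwrite($fileGBU, $strCount);
// Each record: GB code, then Unicode code, 2 bytes each, low byte first
foreach ($arrCodeTable as $k => $v) {
    $strData = chr($k % 256) . chr(floor($k / 256)) . chr($v % 256) . chr(floor($v / 256));
    fwrite($fileGBU, $strData);
}
fclose($fileGBU);
?>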
Running the program produces the binary GB->Unicode lookup table gbu.dat, whose records are sorted by GB code so that it can be searched with the halving method. The function for transcoding using gbu.dat is as follows:
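
The listing itself does not appear at this point, so as a stand-in here is a minimal sketch of the lookup step only, assuming the gbu.dat layout described above (2-byte count, then 4-byte records, low byte first). The name gbuLookup() is hypothetical, and this is not necessarily how the evaluated function was written.

```php
<?php
// Hypothetical sketch: find one GB key in an open gbu.dat handle by halving.
// Returns the Unicode code point, or null when the key is not in the table.
function gbuLookup($resFile, $intGB) {
    fseek($resFile, 0);
    $arrHead = unpack('vcount', fread($resFile, 2)); // little-endian record count
    $intLow = 0;
    $intHigh = $arrHead['count'] - 1;
    while ($intLow <= $intHigh) {
        $intMid = ($intLow + $intHigh) >> 1;
        fseek($resFile, 2 + 4 * $intMid);            // skip header, jump to record
        $arrRec = unpack('vgb/vuni', fread($resFile, 4));
        if ($arrRec['gb'] == $intGB) return $arrRec['uni'];
        if ($arrRec['gb'] < $intGB) $intLow = $intMid + 1;
        else $intHigh = $intMid - 1;
    }
    return null;
}
```

A transcoding function would open gbu.dat once with fopen("gbu.dat", "rb"), call gbuLookup() for every 2-byte GB character in the same way the earlier functions consult their tables, and pass each result to u2utf8(). Each lookup reads only a handful of 4-byte records instead of the whole file.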

Results of two evaluations with all three methods (in seconds):

MySQL method: 0.125
Text file method: 10.873
Binary file halving method: 0.106

MySQL method: 0.102
Text file method: 10.677
Binary file halving method: 0.092

The binary file halving method has a slight advantage over the MySQL method. However, the evaluations above all transcode short geographical location strings. What about longer texts? I found 5 blog RSS 2.0 files, all encoded in GB2312, and measured the time each of the three methods takes to transcode all 5 files. The results of two runs are as follows (accurate to 3 decimal places, unit: seconds):

MySQL method: 7.206
Text file method: 0.772
Binary file halving method: 5.022

MySQL method: 7.440
Text file method: 0.766
Binary file halving method: 5.055

The text file method is the best for long texts, because once the lookup table has been read into memory, transcoding is very efficient. Given that, we can try a further improvement: read the lookup table into memory from the binary file gbu.dat instead of from the text file. The evaluation data is as follows (same precision and unit as above):

Lookup table read from text file: 0.766
Lookup table read from binary file: 0.831

Lookup table read from text file: 0.774
Lookup table read from binary file: 0.833

This shows that the improvement failed: reading the lookup table from the text file is more efficient.
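
For reference, loading the whole table from gbu.dat could be done along the following lines. This is only a sketch under the file layout described earlier; loadGbuTable() is a hypothetical name, and the code that was actually timed may have differed.

```php
<?php
// Hypothetical sketch: read every record of gbu.dat into an array keyed by GB code.
function loadGbuTable($strPath) {
    $strData = file_get_contents($strPath);
    $arrHead = unpack('vcount', substr($strData, 0, 2)); // little-endian record count
    $arrTable = array();
    for ($i = 0; $i < $arrHead['count']; $i++) {
        // Record $i starts after the 2-byte header, 4 bytes per record
        $arrRec = unpack('vgb/vuni', substr($strData, 2 + 4 * $i, 4));
        $arrTable[$arrRec['gb']] = $arrRec['uni'];
    }
    return $arrTable;
}
```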

Summary: When using PHP to dynamically convert GB encoding to UTF-8, if each conversion handles a small amount of text, a binary file combined with the halving method is suitable; if each conversion handles a large amount of text, it is better to store the lookup table in a text file and read it into memory once before converting.