Fixing garbled page content collected by web crawling tools

Author: Eve Cole  Update Time: 2009-07-01 16:25:14

Problem:
When a web crawling tool automatically collects page content, some pages come out garbled. The cause:
The page was read with the wrong encoding. The encoding that C#/.NET reports for the response (the CharacterSet property of HttpWebResponse, used in the code below) is sometimes wrong. My guess is that in applications other than ASP.NET, the encoding information it reports cannot be relied on.
Solution:
Idea: First determine the page's encoding at runtime, then read the page content with that encoding, so the content is not garbled.
Method:
1: Read the page content using ASCII encoding.
2: Use a regular expression to extract the page's encoding declaration from the content just read. The text obtained in step 1 may be garbled, but the HTML markup itself is plain ASCII and therefore intact, and the encoding declaration can be read from that markup.
3: Read the page content again using the correct encoding.
If anyone has a better method, please enlighten me!
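
As a minimal sketch of the three steps, the variant below works on bytes that have already been downloaded, so the page only needs to be fetched once. The names CharsetSniffer, DetectCharset and Decode are made up for this illustration, and the pattern is a slightly more tolerant assumption than the one in the program that follows (it also accepts a quoted charset value).

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class CharsetSniffer
{
    // Matches charset=NAME in the markup, with or without quotes around NAME.
    static readonly Regex CharsetPattern = new Regex(
        @"charset\s*=\s*[""']?(?<charset>[\w.:-]+)",
        RegexOptions.IgnoreCase);

    // Steps 1 and 2: view the bytes as ASCII and look for a charset declaration.
    // Returns null when the page declares no charset.
    public static string DetectCharset(byte[] body)
    {
        string asciiView = Encoding.ASCII.GetString(body);
        Match m = CharsetPattern.Match(asciiView);
        return m.Success ? m.Groups["charset"].Value : null;
    }

    // Step 3: decode the same bytes again with the declared encoding,
    // using the caller's fallback when nothing usable was declared.
    public static string Decode(byte[] body, Encoding fallback)
    {
        string name = DetectCharset(body);
        Encoding encoding = fallback;
        if (name != null)
        {
            try { encoding = Encoding.GetEncoding(name); }
            catch (ArgumentException) { /* unknown name: keep the fallback */ }
        }
        return encoding.GetString(body);
    }
}
```

The byte array could come from, for example, new WebClient().DownloadData(url). On newer .NET runtimes, legacy code pages such as gb2312 may not be available by default, in which case this sketch silently falls back to the encoding passed in.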

The code is attached below:

Code Demonstration
using System;
using System.Collections.Generic;
using System.Text;
using System.Net;
using System.Web;
using System.IO;
using System.Text.RegularExpressions;

namespace charset
{
    class Program
    {
        static void Main(string[] args)
        {
            string url = "http://www.gdqy.edu.cn";
            GetCharset1(url);
            GetChartset2(url);

            // Keep the console window open until a key is pressed
            Console.Read();
        }
        // Method 1: get the page encoding directly from HttpWebResponse
        static void GetCharset1(string url)
        {
            try
            {
                WebRequest webRequest = WebRequest.Create(url);
                using (HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse())
                {
                    string charset = webResponse.CharacterSet;
                    string contentEncoding = webResponse.ContentEncoding;
                    string contentType = webResponse.ContentType;

                    Console.WriteLine("context type:{0}", contentType);
                    Console.WriteLine("charset:{0}", charset);
                    Console.WriteLine("content encoding:{0}", contentEncoding);

                    // Uncomment to check whether the page content comes out garbled
                    //Console.WriteLine(getHTML(url, charset));
                }
            }
            catch (UriFormatException ex)
            {
                Console.WriteLine(ex.Message);
            }
            catch (WebException ex)
            {
                Console.WriteLine(ex.Message);
            }
        }
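        // Method 2: use a regular expression to get the page encoding
        static void GetChartset2(string url)
        {
            try
            {
                // Read the page as ASCII first; the markup survives even if the text is garbled.
                // WebName (e.g. "us-ascii") is the name that Encoding.GetEncoding accepts.
                string html = getHTML(url, Encoding.ASCII.WebName);
                if (html == null)
                {
                    return; // download failed; getHTML has already printed the error
                }

                Regex reg_charset = new Regex(@"charset\b\s*=\s*(?<charset>[^""]*)");
                string encoding = null;
                Match match = reg_charset.Match(html);
                if (match.Success)
                {
                    encoding = match.Groups["charset"].Value;
                    Console.WriteLine("charset:{0}", encoding);
                }
                else
                {
                    // No declaration found: fall back to the system default encoding
                    encoding = Encoding.Default.WebName;
                }

                // Uncomment to check whether the page content comes out garbled
                //Console.WriteLine(getHTML(url, encoding));
            }
            catch (UriFormatException ex)
            {
                Console.WriteLine(ex.Message);
            }
            catch (WebException ex)
            {
                Console.WriteLine(ex.Message);
            }
        }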
        // Download the page and decode it with the named encoding.
        // Returns null if the request fails.
        static string getHTML(string url, string encodingName)
        {
            try
            {
                WebRequest webRequest = WebRequest.Create(url);
                using (WebResponse webResponse = webRequest.GetResponse())
                using (Stream stream = webResponse.GetResponseStream())
                using (StreamReader sr = new StreamReader(stream, Encoding.GetEncoding(encodingName)))
                {
                    return sr.ReadToEnd();
                }
            }
            catch (UriFormatException ex)
            {
                Console.WriteLine(ex.Message);
                return null;
            }
            catch (WebException ex)
            {
                Console.WriteLine(ex.Message);
                return null;
            }
        }
    }
}

The encoding format used on the http://www.gdqy.edu.cn page is: gb2312
The content displayed by the first method is:
context type:text/html
charset:ISO-8859-1
content encoding:
The content displayed by the second method is:
charset:gb2312

So the information obtained by the first method is wrong, and the second method is correct.
Why does the first method report the encoding as ISO-8859-1?
I used the Reflector tool to decompile the CharacterSet property, and the reason is easy to see: when the Content-Type header starts with "text/" and carries no charset parameter, the property falls back to "ISO-8859-1". Here the server reports only text/html, so that default is what comes back. The source of the ContentType property would show how the header value itself is obtained, but I have not managed to track it down after a long search. If someone can fill in that gap, I would be very grateful.
Below is the source code of the CharacterSet property as decompiled by Reflector. Friends who are interested can take a look.

CharacterSet source code
public string CharacterSet
{
    get
    {
        this.CheckDisposed();
        string text1 = this.m_HttpResponseHeaders.ContentType;
        if ((this.m_CharacterSet == null) && !ValidationHelper.IsBlankString(text1))
        {
            this.m_CharacterSet = string.Empty;
            string text2 = text1.ToLower(CultureInfo.InvariantCulture);
            if (text2.Trim().StartsWith("text/"))
            {
                // Default for text/* types; kept if no charset parameter is found below
                this.m_CharacterSet = "ISO-8859-1";
            }
            int num1 = text2.IndexOf(";");
            if (num1 > 0)
            {
                while ((num1 = text2.IndexOf("charset", num1)) >= 0)
                {
                    num1 += 7;
                    if ((text2[num1 - 8] == ';') || (text2[num1 - 8] == ' '))
                    {
                        while ((num1 < text2.Length) && (text2[num1] == ' '))
                        {
                            num1++;
                        }
                        if ((num1 < (text2.Length - 1)) && (text2[num1] == '='))
                        {
                            num1++;
                            int num2 = text2.IndexOf(';', num1);
                            if (num2 > num1)
                            {
                                this.m_CharacterSet = text1.Substring(num1, num2).Trim();
                                break;
                            }
                            this.m_CharacterSet = text1.Substring(num1).Trim();
                            break;
                        }
                    }
                }
            }
        }
        return this.m_CharacterSet;
    }
}
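
The property above only ever looks at the Content-Type header, and for a text/ type without a charset parameter it keeps the ISO-8859-1 default. To tell "the server declared this charset" apart from "this is only the default", one option is to parse the header value yourself. The sketch below uses System.Net.Mime.ContentType for that; the HeaderCharset class and FromContentType method are hypothetical names for this illustration, not part of the original program.

```csharp
using System;
using System.Net.Mime;

static class HeaderCharset
{
    // Returns the charset parameter of a Content-Type header value,
    // or null when the header does not declare one.
    public static string FromContentType(string headerValue)
    {
        if (string.IsNullOrWhiteSpace(headerValue))
        {
            return null;
        }
        try
        {
            return new ContentType(headerValue).CharSet;
        }
        catch (FormatException)
        {
            return null; // malformed header: treat it as "not declared"
        }
    }
}
```

Passing webResponse.ContentType to this helper gives null for a bare text/html header like the one in the output above, which is a signal to fall back to the regular-expression approach rather than trusting CharacterSet.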

http://www.cnblogs.com/xuanfeng/archive/2007/01/21/626296.html