Fixing garbled text when a web crawler fetches page content

Author: Eve Cole  Updated: 2009-07-01 16:25:14

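Problem:
When a web crawler automatically collects page content, some pages come out garbled.
Cause:
The page was read with the wrong encoding. The encoding information that the existing .NET classes report to C# code is sometimes wrong; I believe that for pages not served by an ASP.NET application, the encoding it reports is always wrong.
Solution:
Idea: determine the page's encoding at run time first, and only then read the page content. That way the content will not come out garbled.
Method:
1. Read the page content using ASCII encoding.
2. Use a regular expression to pick the encoding declaration out of the content just read. The content from step 1 may contain garbled characters, but the HTML tags are still intact, so the encoding can be taken from them.
3. Read the page again using the correct encoding.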
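Step 2 works because HTML markup is plain ASCII, so it survives even when the page is decoded with the wrong encoding. The following is a minimal sketch of that idea, not the code from this article: the sample page, the `SniffDemo` class and the `ExtractCharset` helper are made up for illustration, and the sample bytes are UTF-8 rather than gb2312 so that the snippet does not depend on which code pages the runtime has installed.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class SniffDemo
{
    // Hypothetical helper: pulls the charset value out of markup that was
    // decoded as ASCII. Returns null when no charset declaration is found.
    public static string ExtractCharset(string asciiHtml)
    {
        Match m = Regex.Match(asciiHtml,
            @"charset\s*=\s*[""']?(?<charset>[A-Za-z0-9_\-]+)",
            RegexOptions.IgnoreCase);
        return m.Success ? m.Groups["charset"].Value : null;
    }

    public static void Main()
    {
        // Made-up sample page; the title contains non-ASCII text.
        string page = "<html><head><meta http-equiv=\"Content-Type\" " +
                      "content=\"text/html; charset=utf-8\">" +
                      "<title>編碼測試</title></head></html>";
        byte[] raw = Encoding.UTF8.GetBytes(page);

        // Decoding as ASCII turns every non-ASCII byte into '?',
        // but the tags and the charset declaration stay readable.
        string asciiHtml = Encoding.ASCII.GetString(raw);
        Console.WriteLine(ExtractCharset(asciiHtml)); // prints "utf-8"
    }
}
```

The pattern here is slightly stricter than a plain "everything up to the next quote" match: it also accepts an opening quote after the equals sign, which covers the shorter `<meta charset="...">` form.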
If anyone knows a better way, please share it!

The code is attached below:

Code demo
using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

namespace charset
{
    class Program
    {
        static void Main(string[] args)
        {
            string url = "http://www.gdqy.edu.cn";
            GetCharset1(url);
            GetCharset2(url);

            Console.Read();
        }

        // Get the page encoding directly from HttpWebResponse
        static void GetCharset1(string url)
        {
            try
            {
                WebRequest webRequest = WebRequest.Create(url);
                using (HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse())
                {
                    string charset = webResponse.CharacterSet;
                    string contentEncoding = webResponse.ContentEncoding;
                    string contentType = webResponse.ContentType;

                    Console.WriteLine("context type:{0}", contentType);
                    Console.WriteLine("charset:{0}", charset);
                    Console.WriteLine("content encoding:{0}", contentEncoding);
                    // Test whether the fetched page comes out garbled
                    //Console.WriteLine(getHTML(url, charset));
                }
            }
            catch (UriFormatException ex)
            {
                Console.WriteLine(ex.Message);
            }
            catch (WebException ex)
            {
                Console.WriteLine(ex.Message);
            }
        }

        // Get the page encoding with a regular expression
        static void GetCharset2(string url)
        {
            try
            {
                // WebName is the name Encoding.GetEncoding accepts;
                // EncodingName is only a human-readable description.
                string html = getHTML(url, Encoding.ASCII.WebName);
                if (html == null)
                {
                    return; // download failed; getHTML already printed the error
                }

                Regex reg_charset = new Regex(@"charset\b\s*=\s*(?<charset>[^""]*)");
                string encoding;
                Match match = reg_charset.Match(html);
                if (match.Success)
                {
                    encoding = match.Groups["charset"].Value;
                    Console.WriteLine("charset:{0}", encoding);
                }
                else
                {
                    // No declaration found: fall back to the system default
                    encoding = Encoding.Default.WebName;
                }
                // Test whether the fetched page comes out garbled
                //Console.WriteLine(getHTML(url, encoding));
            }
            catch (UriFormatException ex)
            {
                Console.WriteLine(ex.Message);
            }
            catch (WebException ex)
            {
                Console.WriteLine(ex.Message);
            }
        }

        // Read the page content using the named encoding
        static string getHTML(string url, string encodingName)
        {
            try
            {
                WebRequest webRequest = WebRequest.Create(url);
                using (WebResponse webResponse = webRequest.GetResponse())
                using (Stream stream = webResponse.GetResponseStream())
                using (StreamReader sr = new StreamReader(stream, Encoding.GetEncoding(encodingName)))
                {
                    return sr.ReadToEnd();
                }
            }
            catch (UriFormatException ex)
            {
                Console.WriteLine(ex.Message);
                return null;
            }
            catch (WebException ex)
            {
                Console.WriteLine(ex.Message);
                return null;
            }
        }
    }
}

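The page at http://www.gdqy.edu.cn uses the gb2312 encoding.
Output of the first method:
context type:text/html
charset:ISO-8859-1
content encoding:
Output of the second method:
charset:gb2312

So the information from the first method is wrong, and the second method is right.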
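The second method needs two requests for the same page: one to find the charset and one to read the content. A possible variation, sketched here under the assumption that the whole page fits comfortably in memory, is to download the bytes once and decode them twice. The `PageDecoder` class and its `Decode` method are hypothetical names, and the UTF-8 fallback is an arbitrary choice for this sketch. `WebClient.DownloadData` is a standard .NET call, though newer runtimes mark `WebClient` as obsolete, and newer runtimes may also need the code pages provider registered before they recognize gb2312.

```csharp
using System;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

public static class PageDecoder
{
    // Decodes raw page bytes using the charset declared in the markup.
    // Falls back to UTF-8 (an arbitrary choice for this sketch) when no
    // charset is declared or the runtime does not know the declared name.
    public static string Decode(byte[] raw)
    {
        // First pass: ASCII is enough to read the tags.
        string asciiHtml = Encoding.ASCII.GetString(raw);
        Match m = Regex.Match(asciiHtml,
            @"charset\s*=\s*[""']?(?<charset>[A-Za-z0-9_\-]+)",
            RegexOptions.IgnoreCase);

        Encoding encoding = Encoding.UTF8;
        if (m.Success)
        {
            try
            {
                encoding = Encoding.GetEncoding(m.Groups["charset"].Value);
            }
            catch (ArgumentException)
            {
                // Unknown or unsupported charset name: keep the fallback.
            }
        }

        // Second pass: decode the same bytes with the detected encoding.
        return encoding.GetString(raw);
    }

    public static void Main()
    {
        using (WebClient client = new WebClient())
        {
            // One request only; the same bytes are decoded twice in memory.
            byte[] raw = client.DownloadData("http://www.gdqy.edu.cn");
            Console.WriteLine(Decode(raw));
        }
    }
}
```

Besides saving a request, this avoids the case where the server returns different content on the second fetch.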
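Why does the first method report ISO-8859-1?
I used the Reflector decompiler to get the source of the CharacterSet property, and the reason is not hard to see from it: the Content-Type reported for this page is just text/html with no charset parameter, and for any type starting with "text/" the property falls back to ISO-8859-1. If I could get the source of the ContentType property as well, it would not be hard to see where the error comes from, but I tried for a long time without finding it. If anyone can fill that in, I would be very grateful.
Below is the source of the CharacterSet property as decompiled by Reflector, for anyone who is interested.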

CharacterSet source code
public string CharacterSet
{
    get
    {
        this.CheckDisposed();
        string text1 = this.m_HttpResponseHeaders.ContentType;
        if ((this.m_CharacterSet == null) && !ValidationHelper.IsBlankString(text1))
        {
            this.m_CharacterSet = string.Empty;
            string text2 = text1.ToLower(CultureInfo.InvariantCulture);
            if (text2.Trim().StartsWith("text/"))
            {
                this.m_CharacterSet = "ISO-8859-1";
            }
            int num1 = text2.IndexOf(";");
            if (num1 > 0)
            {
                while ((num1 = text2.IndexOf("charset", num1)) >= 0)
                {
                    num1 += 7;
                    if ((text2[num1 - 8] == ';') || (text2[num1 - 8] == ' '))
                    {
                        while ((num1 < text2.Length) && (text2[num1] == ' '))
                        {
                            num1++;
                        }
                        if ((num1 < (text2.Length - 1)) && (text2[num1] == '='))
                        {
                            num1++;
                            int num2 = text2.IndexOf(';', num1);
                            if (num2 > num1)
                            {
                                this.m_CharacterSet = text1.Substring(num1, num2).Trim();
                                break;
                            }
                            this.m_CharacterSet = text1.Substring(num1).Trim();
                            break;
                        }
                    }
                }
            }
        }
        return this.m_CharacterSet;
    }
}
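
To make the fallback in the decompiled property easier to see, here is a simplified re-implementation. It is an illustration only, not the framework's actual code: `ContentTypeDemo` and `CharsetFromContentType` are hypothetical names, and the sketch keeps just the two behaviours that matter here, namely that any `text/` type starts out as ISO-8859-1 and that an explicit `charset=` parameter overrides that default.

```csharp
using System;

public static class ContentTypeDemo
{
    // Simplified stand-in for the CharacterSet logic: returns the charset
    // implied by a Content-Type header value, or null for a blank header.
    public static string CharsetFromContentType(string contentType)
    {
        if (string.IsNullOrEmpty(contentType) || contentType.Trim().Length == 0)
        {
            return null;
        }

        string charset = string.Empty;
        if (contentType.Trim().StartsWith("text/", StringComparison.OrdinalIgnoreCase))
        {
            // Default for text types when the server names no charset.
            charset = "ISO-8859-1";
        }

        // Look for an explicit "charset=value" parameter after the media type.
        foreach (string part in contentType.Split(';'))
        {
            string p = part.Trim();
            int eq = p.IndexOf('=');
            if (eq > 0 && p.Substring(0, eq).Trim()
                    .Equals("charset", StringComparison.OrdinalIgnoreCase))
            {
                string value = p.Substring(eq + 1).Trim();
                if (value.Length > 0)
                {
                    charset = value;
                    break;
                }
            }
        }
        return charset;
    }

    public static void Main()
    {
        // Header without a charset, as reported for the page in this article.
        Console.WriteLine(CharsetFromContentType("text/html"));
        // Header with an explicit charset.
        Console.WriteLine(CharsetFromContentType("text/html; charset=gb2312"));
    }
}
```

The first call prints ISO-8859-1 and the second prints gb2312, which matches what the first method showed: the header carried no charset, so the value reported was only the default, not the encoding the page declares in its markup.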

http://www.cnblogs.com/xuanfeng/archive/2007/01/21/626296.html