The so-called data collection program is really a web scraper, a "web thief" program (please don't scold me). Now that I have finished writing it, I am posting some notes here. I hope anyone with ideas will share them so we can study together.
1. First, downloading the data. Some websites require you to log in before you can view anything, so to get the data we have to send the login user name and password. I did log in, but the server is not that naive: it redirected me, and a total of 2 SESSIONs were generated. I did not know how to capture the second SESSION, so I took a shortcut ^-^ and captured it with a packet-sniffing tool called Ethereal, then added it to the HTTP request headers with the following code:
WebClient myWebClient = new WebClient();
// Cookie string captured with Ethereal, and the page the request should appear to come from
string sessionkey = textBox78.Text;
string refererurl = textBox77.Text;
myWebClient.Headers.Clear();
myWebClient.Headers.Add("Cookie", sessionkey);
myWebClient.Headers.Add("Referer", refererurl);
// Present a browser-like User-Agent
myWebClient.Headers.Add("User-Agent", "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5) Gecko/20031107 Debian/1.5-3");
This makes the server treat the request as if it came from a logged-in browser, haha.
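Instead of copying the cookie out of a packet capture by hand, the login request itself can be replayed in code. The sketch below assumes the site uses an ordinary HTML form login; the class name LoginSketch, the form field names "username" and "password", and the login URL are all hypothetical and have to be replaced with whatever the real login form posts. With a CookieContainer attached and automatic redirects left on, HttpWebRequest should also keep cookies that are set along the redirect chain, which may cover the second SESSION, but this is untested against any particular site.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

static class LoginSketch
{
    // Build an application/x-www-form-urlencoded body.
    // The field names "username" and "password" are assumptions.
    public static string BuildFormBody(string user, string password)
    {
        return "username=" + Uri.EscapeDataString(user)
             + "&password=" + Uri.EscapeDataString(password);
    }

    // POST the login form and return the container holding whatever
    // cookies the server set while the request was processed.
    public static CookieContainer Login(string loginUrl, string user, string password)
    {
        CookieContainer cookies = new CookieContainer();
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(loginUrl);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.CookieContainer = cookies;   // collect Set-Cookie headers
        request.AllowAutoRedirect = true;    // follow the redirect after login

        byte[] body = Encoding.ASCII.GetBytes(BuildFormBody(user, password));
        request.ContentLength = body.Length;
        using (Stream stream = request.GetRequestStream())
        {
            stream.Write(body, 0, body.Length);
        }
        using (WebResponse response = request.GetResponse())
        {
            // The response body is not needed here; only the cookies matter.
        }
        return cookies;
    }
}
```

To reuse the result with the WebClient code above, cookies.GetCookieHeader(new Uri(pageUrl)) returns a string that can be passed as the "Cookie" header in place of the hand-captured value.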
2. The second part is downloading the page source:
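// Fetch the raw bytes of the page
byte[] myDataBuffer = myWebClient.DownloadData(remoteUri);
// Encoding.Default is the system ANSI code page; it only decodes correctly if the page uses the same charset
download = Encoding.Default.GetString(myDataBuffer);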
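Because Encoding.Default is the machine's ANSI code page, a page served in a different charset can come out garbled. A minimal sketch of one alternative, assuming the server declares the charset in its Content-Type header: EncodingSketch and FromContentType are hypothetical names, and falling back to UTF-8 is an assumption of this sketch, not something the code above does.

```csharp
using System;
using System.Text;

static class EncodingSketch
{
    // Pick an Encoding from a Content-Type value such as
    // "text/html; charset=utf-8". Falls back to UTF-8 when the header
    // is missing or names a charset this runtime does not know.
    public static Encoding FromContentType(string contentType)
    {
        if (!string.IsNullOrEmpty(contentType))
        {
            foreach (string part in contentType.Split(';'))
            {
                string p = part.Trim();
                if (p.StartsWith("charset=", StringComparison.OrdinalIgnoreCase))
                {
                    string name = p.Substring("charset=".Length).Trim('"', ' ');
                    try { return Encoding.GetEncoding(name); }
                    catch (ArgumentException) { /* unknown charset: use fallback */ }
                }
            }
        }
        return Encoding.UTF8;
    }
}
```

With the WebClient above this could be used as EncodingSketch.FromContentType(myWebClient.ResponseHeaders["Content-Type"]).GetString(myDataBuffer). On .NET Core and later, legacy code pages such as gb2312 are only available after registering CodePagesEncodingProvider, so there the fallback may be hit more often than expected.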
3. The third part is matching the data. I read the downloaded data into a string, use IndexOf to find the positions of two marker strings, and then use Substring to pull out the text between them. I know this is clumsy, but I find regular expressions hard to use (can anyone give me some advice?). After extracting the string, I used the following function to remove the HTML code:
// Requires: using System.Text.RegularExpressions;
private string StripHTML(string strHtml)
{
    // Patterns are applied in order; aryRep[i] is the replacement for aryReg[i].
    string[] aryReg = {
        @"<script[^>]*?>.*?</script>",
        @"<(\/\s*)?!?((\w+:)?\w+)(\w+(\s*=?\s*(([""'])(\\[""'tbnr]|[^\7])*?\7|\w+)|.{0})|\s)*?(\/\s*)?>",
        @"([\r\n])[\s]+",
        @"&(quot|#34);",
        @"&(amp|#38);",
        @"&(lt|#60);",
        @"&(gt|#62);",
        @"&(nbsp|#160);",
        @"&(iexcl|#161);",
        @"&(cent|#162);",
        @"&(pound|#163);",
        @"&(copy|#169);",
        @"&#(\d+);",
        @"-->",
        @"<!--.*\n"
    };
    string[] aryRep = {
        "",
        "",
        "",
        "\"",
        "&",
        "<",
        ">",
        " ",
        "\xa1", // chr(161)
        "\xa2", // chr(162)
        "\xa3", // chr(163)
        "\xa9", // chr(169)
        "",
        "\r\n",
        ""
    };
    string strOutput = strHtml;
    for (int i = 0; i < aryReg.Length; i++)
    {
        Regex regex = new Regex(aryReg[i], RegexOptions.IgnoreCase);
        strOutput = regex.Replace(strOutput, aryRep[i]);
    }
    // String.Replace returns a new string, so the result has to be assigned back.
    strOutput = strOutput.Replace("<", "");
    strOutput = strOutput.Replace(">", "");
    strOutput = strOutput.Replace("\r\n", "");
    return strOutput;
}
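The IndexOf/Substring extraction described in step 3, together with a regular-expression equivalent, could be sketched roughly as follows. ExtractSketch, Between, and BetweenRegex are hypothetical names, and the markers in the usage line further down are made up for illustration; the real ones depend on the page being scraped.

```csharp
using System;
using System.Text.RegularExpressions;

static class ExtractSketch
{
    // Return the text between the first startMarker and the next endMarker,
    // or null when either marker is missing.
    public static string Between(string text, string startMarker, string endMarker)
    {
        int start = text.IndexOf(startMarker, StringComparison.Ordinal);
        if (start < 0) return null;
        start += startMarker.Length;
        int end = text.IndexOf(endMarker, start, StringComparison.Ordinal);
        if (end < 0) return null;
        return text.Substring(start, end - start);
    }

    // Same idea with a regular expression: escape both markers so they are
    // matched literally, and capture the shortest run of text between them.
    public static string BetweenRegex(string text, string startMarker, string endMarker)
    {
        string pattern = Regex.Escape(startMarker) + "(.*?)" + Regex.Escape(endMarker);
        // Singleline lets "." match newlines, since HTML often spans lines.
        Match m = Regex.Match(text, pattern, RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

For example, ExtractSketch.Between(download, "<td class=\"price\">", "</td>") would pull out the contents of the first such cell. The regex version gives the same result here; it mainly pays off once the markers themselves need to vary, which is where plain IndexOf becomes awkward.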
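4. The next step is storing the data in the database, which everyone knows how to do. But I still have a problem: when I write the data, an EXCEPTION is thrown saying that my field is too long to be written to the database (a likely cause is that an Access Text column holds at most 255 characters, while a Memo column accepts longer text). I am using ACCESS at the moment; I will try SQL Server next.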
5. If you have any good suggestions, please leave me a comment. Let’s make progress together.
Source: http://jetadv.cnblogs.com/archive/2006/02/18/333213.html