Learning Python web crawling is mainly divided into three major parts: crawling, analysis, and storage.
In addition, the commonly used crawler framework Scrapy is introduced in more detail at the end.
First, let’s list the relevant articles that I have summarized, which cover the basic concepts and skills needed to get started with web crawlers: Ning Ge’s Small Station - Web Crawler
When we enter a URL in the browser and press Enter, what happens behind the scenes? For example, if you enter http://www.lining0806.com/, you will see the homepage of Ning Ge's site.
Simply put, this process takes place in the following four steps:
Simply put, what a web crawler has to do is reproduce what the browser does: given a URL, it returns the data the user needs directly, without manually operating a browser to fetch it step by step.
In this step, you need to be clear about what content you want to get: the HTML source code, a JSON-formatted string, and so on.
Most crawling scenarios are GET requests, that is, fetching data directly from the target server.
First of all, Python ships with the urllib and urllib2 modules, which can basically handle general page fetching. In addition, requests is a very useful third-party package; similar options include httplib2 and so on.
Requests:
import requests
response = requests.get(url)
content = requests.get(url).content
print "response headers:", response.headers
print "content:", content
Urllib2:
import urllib2
response = urllib2.urlopen(url)
content = urllib2.urlopen(url).read()
print "response headers:", response.headers
print "content:", content
Httplib2:
import httplib2
http = httplib2.Http()
response_headers, content = http.request(url, 'GET')
print "response headers:", response_headers
print "content:", content
In addition, for URLs with a query string, a GET request generally appends the request data to the URL: the URL and the data are separated by "?", and multiple parameters are joined with "&".
data = {'data1':'XXXXX', 'data2':'XXXXX'}
Requests: data is a dict or json
import requests
response = requests.get(url=url, params=data)
Urllib2: data is a string
import urllib, urllib2
data = urllib.urlencode(data)
full_url = url+'?'+data
response = urllib2.urlopen(full_url)
Related reference: Review of NetEase News Ranking Crawl
Reference project: The most basic crawler of web crawlers: crawling NetEase news rankings
2.1 Log in using form
This is a POST request: the form data is first sent to the server, and the cookie returned by the server is then stored locally.
data = {'data1':'XXXXX', 'data2':'XXXXX'}
Requests: data is a dict or json
import requests
response = requests.post(url=url, data=data)
Urllib2: data is a string
import urllib, urllib2
data = urllib.urlencode(data)
req = urllib2.Request(url=url, data=data)
response = urllib2.urlopen(req)
2.2 Log in using cookies
When you log in with cookies, the server treats you as a logged-in user and returns logged-in content to you. Therefore, when a verification code is required, you can get around it by logging in once with the verification code and then reusing the resulting cookies.
import requests
requests_session = requests.session()
response = requests_session.post(url=url_login, data=data)
If there is a verification code, you cannot simply use response = requests_session.post(url=url_login, data=data); instead, do the following:
response_captcha = requests_session.get(url=url_login, cookies=cookies)
response1 = requests.get(url_login) # not logged in
response2 = requests_session.get(url_login) # logged in, because the session already holds the response cookies!
response3 = requests_session.get(url_results) # logged in, because the session already holds the response cookies!
Related reference: Web crawler-verification code login
Reference project: Web crawler username, password and verification code login: Crawling Zhihu website
3.1 Using a proxy
Applicable situation: the website restricts access by IP address. A proxy can also solve the problem of having to enter a verification code to log in because of "frequent clicks".
In this case, the best approach is to maintain a pool of proxy IPs. There are many free proxy IPs on the Internet, of very mixed quality, and you can filter out the usable ones. For "frequent clicks", we can also avoid being banned by limiting how often the crawler visits the website.
proxies = {'http':'http://XX.XX.XX.XX:XXXX'}
Requests:
import requests
response = requests.get(url=url, proxies=proxies)
Urllib2:
import urllib2
proxy_support = urllib2.ProxyHandler(proxies)
opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
urllib2.install_opener(opener) # install the opener; every later call to urlopen() will use it
response = urllib2.urlopen(url)
3.2 Time setting
Applicable situation: the website limits how frequently you may request pages.
Both Requests and Urllib2 can use the sleep() function of the time library:
import time
time.sleep(1)
3.3 Disguise as a browser, or counter "anti-hotlinking"
Some websites check whether you are really visiting through a browser or whether the request is made automatically by a machine. In this case, adding a User-Agent header shows that you are visiting through a browser. Sometimes a site also checks whether a Referer header is present and whether it is legitimate; in that case, add a Referer header as well.
headers = {'User-Agent':'XXXXX'} # pretend to be a browser; useful for sites that reject crawlers
headers = {'Referer':'XXXXX'}
headers = {'User-Agent':'XXXXX', 'Referer':'XXXXX'}
Requests:
response = requests.get(url=url, headers=headers)
Urllib2:
import urllib, urllib2
req = urllib2.Request(url=url, headers=headers)
response = urllib2.urlopen(req)
For handling dropped connections, not much needs to be said: simply wrap the request in a retry loop.
def multi_session(session, *arg):
    # Retry a session.post() call up to 20 times before giving up.
    retryTimes = 20
    while retryTimes > 0:
        try:
            return session.post(*arg)
        except:
            print '.',          # print a dot for each failed attempt
            retryTimes -= 1
or
def multi_open(opener, *arg):
    # Retry an opener.open() call up to 20 times before giving up.
    retryTimes = 20
    while retryTimes > 0:
        try:
            return opener.open(*arg)
        except:
            print '.',          # print a dot for each failed attempt
            retryTimes -= 1
In this way, we can use multi_session or multi_open to keep the session or opener held by the crawler alive across retries.
Here is an experimental comparison of parallel crawling for Wall Street news: Python multi-process crawling and Java single-threaded and multi-threaded crawling
Related reference: Comparison of multi-process and multi-thread computing methods in Python and Java
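As a minimal sketch of the multi-process idea (not the code from the referenced experiment; the page URLs and the fetch_one helper below are made up for illustration), a multiprocessing.Pool can fetch several pages in parallel:
import requests
from multiprocessing import Pool

def fetch_one(url):
    # Download a single page and return its URL and status code.
    try:
        response = requests.get(url, timeout=10)
        return url, response.status_code
    except requests.RequestException:
        return url, None

if __name__ == '__main__':
    urls = ['http://example.com/page/%d' % i for i in range(1, 11)]  # placeholder URLs
    pool = Pool(processes=4)             # four worker processes crawl in parallel
    results = pool.map(fetch_one, urls)  # distribute the URL list across the pool
    pool.close()
    pool.join()
    print(results)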
For the "load more" case, use Ajax to transfer a lot of data.
How it works is: after loading the source code of the web page from the URL of the web page, the JavaScript program will be executed in the browser. These programs load more content and "populate" the web page. This is why if you go directly to crawl the URL of the web page itself, you won't find the actual content of the page.
Here, if you use Google Chrome to analyze the link corresponding to "Request" (method: right-click → Inspect Element → Network → Clear, click "Load More", the corresponding GET link will appear and look for the Type of text/html, click to view the get parameters Or copy the Request URL), loop process.
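As a rough sketch (the endpoint path and the page parameter below are hypothetical; the real ones must be copied from the Network panel as described above), the captured Ajax URL can then be requested in a loop until nothing more comes back:
import requests

ajax_url = 'http://example.com/api/list'   # hypothetical endpoint copied from the Network panel
page = 1
while True:
    response = requests.get(ajax_url, params={'page': page})  # same GET parameters as "Load more"
    items = response.json()   # assuming the endpoint returns JSON; some return HTML fragments instead
    if not items:             # stop once the server has nothing more to load
        break
    for item in items:
        print(item)
    page += 1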
Selenium is an automated testing tool. It can control the browser: filling in text, clicking the mouse, getting elements, switching pages, and a whole series of other operations. In short, Selenium can do anything a browser can do.
Here it is used to dynamically fetch ticket price information from Qunar.com for a given list of cities.
Reference project: Web crawler Selenium uses proxy login: Crawl Qunar website
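A minimal Selenium sketch in the same spirit (not the reference project's actual code; the element id is a placeholder, and a matching browser driver such as chromedriver must be installed):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()                      # or webdriver.Firefox(), etc.
driver.get('http://www.qunar.com/')              # open the page in a real browser
box = driver.find_element(By.ID, 'searchbox')    # locate an input box (placeholder id)
box.send_keys('Beijing')                         # fill in text just like a user would
html = driver.page_source                        # the HTML after JavaScript has run
driver.quit()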
For the situation where the website has a verification code, we have three methods:
Using a proxy and using cookies to log in have been discussed before. Now let’s talk about verification code identification.
You can use the open-source Tesseract-OCR system to recognize the downloaded verification code images and pass the recognized characters to the crawler for a simulated login. Of course, you can also upload the verification code image to a captcha-solving platform for recognition. If recognition fails, simply refresh the verification code and try again until it succeeds.
Reference project: Verification code identification project version 1: Captcha1
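A hedged sketch of the OCR route (the captcha image URL is a placeholder, pytesseract is assumed as the Python wrapper around Tesseract-OCR, and plain OCR only works when the captcha is not heavily distorted):
import io
import requests
import pytesseract            # Python wrapper around the Tesseract-OCR engine
from PIL import Image

captcha_url = 'http://example.com/captcha.jpg'    # placeholder captcha image URL
session = requests.session()
image_data = session.get(captcha_url).content     # download the image inside the session
captcha_text = pytesseract.image_to_string(Image.open(io.BytesIO(image_data)))
# Submit captcha_text together with the login form, reusing `session`
# so that the captcha cookie and the login request match.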
There are two issues that need to be paid attention to when crawling:
After crawling, the fetched content is analyzed: whatever content you need, you extract it from the response.
Common analysis tools include regular expressions, BeautifulSoup, lxml, etc.
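For example, a minimal BeautifulSoup sketch (the URL, tag names and class name are placeholders for whatever the target page actually uses):
import requests
from bs4 import BeautifulSoup

url = 'http://example.com/'                        # placeholder page
html = requests.get(url).content
soup = BeautifulSoup(html, 'lxml')                 # parse the HTML with the lxml parser
for link in soup.find_all('a'):                    # every anchor tag on the page
    print(link.get('href'))
titles = [tag.get_text() for tag in soup.find_all('h2', class_='title')]  # placeholder selector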
After analyzing what we need, the next step is to store it.
We can choose to save it to a text file, or to a MySQL or MongoDB database, etc.
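As a small sketch of the storage step (the extracted items, database name and collection name below are made up for illustration):
import json
import pymongo

items = [{'title': 'XXXXX', 'url': 'XXXXX'}]       # whatever the analysis step extracted

# Save to a text file, one JSON object per line
with open('items.txt', 'w') as f:
    for item in items:
        f.write(json.dumps(item) + '\n')

# Or save to MongoDB
client = pymongo.MongoClient('localhost', 27017)
collection = client['crawler_db']['items']         # placeholder database/collection names
collection.insert_many(items)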
There are two issues that need to be paid attention to when storing:
Scrapy is an open source Python crawler framework based on Twisted, which is widely used in industry.
For related content, you can refer to the article on building a web crawler based on Scrapy. The WeChat search crawling project code introduced there is also provided as a learning reference.
Reference project: Use Scrapy or Requests to recursively crawl WeChat search results
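A minimal Scrapy spider sketch, just to show the shape of the framework (the spider name, start URL and CSS selectors are placeholders, not the WeChat search project's actual code):
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'                                # placeholder spider name
    start_urls = ['http://example.com/']            # placeholder start page

    def parse(self, response):
        # Extract data from the downloaded page
        for title in response.css('h2.title::text').getall():
            yield {'title': title}
        # Recursively follow the "next page" link, if any
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
Such a spider can be run with scrapy runspider example_spider.py -o items.json, with Scrapy handling the scheduling, downloading and item export.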
A good web crawler should first comply with the Robots protocol. The Robots protocol (also known as the crawler protocol or robot protocol), whose full name is the "Robots Exclusion Protocol", is how a website tells search engines which pages may be crawled and which may not.
A robots.txt text file is placed in the root directory of the website (for example, https://www.taobao.com/robots.txt). It specifies which pages different crawlers may access and which pages are off limits, with the page patterns expressed as regular-expression-like rules. Before collecting from a website, a crawler first fetches this robots.txt file, parses the rules in it, and then collects the site's data according to those rules.
User-agent: specifies which crawlers the rules apply to
Disallow: specifies URLs that must not be accessed
Allow: specifies URLs that may be accessed
Note: the first letter of each directive must be capitalized, the colon must be the ASCII colon followed by a space, and "/" stands for the entire website.
Disallow access for all robots
User-agent: *
Disallow: /
Allow access for all robots
User-agent: *
Disallow:
Disallow access for a specific robot
User-agent: BadBot
Disallow: /
Allow access for a specific robot
User-agent: GoodBot
Disallow:
Disallow access to a specific directory
User-agent: *
Disallow: /images/
Allow access only to a specific directory
User-agent: *
Allow: /images/
Disallow: /
Disallow access to specific files
User-agent: *
Disallow: /*.html$
Allow access only to specific files
User-agent: *
Allow: /*.html$
Disallow: /
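As a short sketch of how a crawler can honor these rules programmatically, Python's standard robotparser module (urllib.robotparser in Python 3) can fetch and evaluate a site's robots.txt before crawling:
import robotparser                      # urllib.robotparser in Python 3

rp = robotparser.RobotFileParser()
rp.set_url('https://www.taobao.com/robots.txt')
rp.read()                               # download and parse the robots.txt file
# Ask whether a generic crawler ('*') may fetch a given URL
print(rp.can_fetch('*', 'https://www.taobao.com/some/page'))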