Learning Python web crawling is mainly divided into three major parts: crawling, analysis, and storage.
In addition, the commonly used crawler framework Scrapy is introduced in more detail at the end.
First, let’s list the relevant articles that I have summarized, which cover the basic concepts and skills needed to get started with web crawlers: Ning Ge’s Small Station - Web Crawler
When we enter a URL in the browser and press Enter, what happens behind the scenes? For example, if you enter http://www.lining0806.com/, you will see the homepage of Ning Ge's site.
Simply put, this process takes place in the following four steps:
Simply put, what a web crawler has to do is reproduce what the browser does: given a URL, it returns the data the user needs directly, without manually operating a browser to fetch it step by step.
In this step, you need to be clear about what content you want to get: the HTML source code, a JSON-formatted string, and so on.
Most crawling scenarios are GET requests, that is, fetching data directly from the target server.
First of all, Python ships with the urllib and urllib2 modules, which can basically handle general page fetching. In addition, requests is a very useful third-party package; similar options include httplib2 and so on.
Requests:
import requests
response = requests.get(url)
content = requests.get(url).content
print "response headers:", response.headers
print "content:", content
Urllib2:
import urllib2
response = urllib2.urlopen(url)
content = urllib2.urlopen(url).read()
print "response headers:", response.headers
print "content:", content
Httplib2:
import httplib2
http = httplib2.Http()
response_headers, content = http.request(url, 'GET')
print "response headers:", response_headers
print "content:", content
In addition, for URLs with a query string, a GET request generally appends the request data to the URL: the URL and the data are separated by "?", and multiple parameters are joined with "&".
data = {'data1':'XXXXX', 'data2':'XXXXX'}
Requests: data is a dict or json
import requests
response = requests.get(url=url, params=data)
Urllib2: data is a string
import urllib, urllib2
data = urllib.urlencode(data)
full_url = url+'?'+data
response = urllib2.urlopen(full_url)
Related reference: Review of NetEase News Ranking Crawl
Reference project: The most basic crawler of web crawlers: crawling NetEase news rankings
2.1 Log in using form
This is a POST request: the form data is first sent to the server, and the cookie returned by the server is then stored locally.
data = {'data1':'XXXXX', 'data2':'XXXXX'}
Requests: data is a dict or json
import requests
response = requests.post(url=url, data=data)
Urllib2: data is a string
import urllib, urllib2
data = urllib.urlencode(data)
req = urllib2.Request(url=url, data=data)
response = urllib2.urlopen(req)
2.2 Log in using cookies
When you log in with cookies, the server treats you as a logged-in user and returns logged-in content to you. Therefore, when a verification code is required, you can get around it by logging in once with the verification code and then reusing the resulting cookies.
import requests
requests_session = requests.session()
response = requests_session.post(url=url_login, data=data)
If there is a verification code, you cannot simply use response = requests_session.post(url=url_login, data=data); instead, do the following:
response_captcha = requests_session.get(url=url_login, cookies=cookies)
response1 = requests.get(url_login) # not logged in
response2 = requests_session.get(url_login) # logged in, because the session already holds the response cookies!
response3 = requests_session.get(url_results) # logged in, because the session already holds the response cookies!
Related reference: Web crawler-verification code login
Reference project: Web crawler username, password and verification code login: Crawling Zhihu website
3.1 Using a proxy
Applicable situation: the website restricts access by IP address. A proxy can also solve the problem of having to enter a verification code to log in because of "frequent clicks".
In this case, the best approach is to maintain a pool of proxy IPs. There are many free proxy IPs on the Internet, of very mixed quality, and you can filter out the usable ones. For "frequent clicks", we can also avoid being banned by limiting how often the crawler visits the website.
proxies = {'http':'http://XX.XX.XX.XX:XXXX'}
Requests:
import requests
response = requests.get(url=url, proxies=proxies)
Urllib2:
import urllib2
proxy_support = urllib2.ProxyHandler(proxies)
opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
urllib2.install_opener(opener) # install the opener; every later call to urlopen() will use it
response = urllib2.urlopen(url)
3.2 Time setting
Applicable situation: the website limits how frequently you may request pages.
Both Requests and Urllib2 can use the sleep() function of the time library:
import time
time.sleep(1)
3.3 Disguise as a browser, or counter "anti-hotlinking"
Some websites check whether you are really visiting through a browser or whether the request is made automatically by a machine. In this case, adding a User-Agent header shows that you are visiting through a browser. Sometimes a site also checks whether a Referer header is present and whether it is legitimate; in that case, add a Referer header as well.
headers = {'User-Agent':'XXXXX'} # pretend to be a browser; useful for sites that reject crawlers
headers = {'Referer':'XXXXX'}
headers = {'User-Agent':'XXXXX', 'Referer':'XXXXX'}
Requests:
response = requests.get(url=url, headers=headers)
Urllib2:
import urllib, urllib2
req = urllib2.Request(url=url, headers=headers)
response = urllib2.urlopen(req)
For handling dropped connections, not much needs to be said: simply wrap the request in a retry loop.
def multi_session(session, *arg):
    # Retry a session.post() call up to 20 times before giving up.
    retryTimes = 20
    while retryTimes > 0:
        try:
            return session.post(*arg)
        except:
            print '.',          # print a dot for each failed attempt
            retryTimes -= 1
or
def multi_open(opener, *arg):
    # Retry an opener.open() call up to 20 times before giving up.
    retryTimes = 20
    while retryTimes > 0:
        try:
            return opener.open(*arg)
        except:
            print '.',          # print a dot for each failed attempt
            retryTimes -= 1
In this way, we can use multi_session or multi_open to keep the session or opener held by the crawler alive across retries.
Here is an experimental comparison of parallel crawling for Wall Street news: Python multi-process crawling and Java single-threaded and multi-threaded crawling
Related reference: Comparison of multi-process and multi-thread computing methods in Python and Java
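As a minimal sketch of the multi-process idea (not the code from the referenced experiment; the page URLs and the fetch_one helper below are made up for illustration), a multiprocessing.Pool can fetch several pages in parallel:
import requests
from multiprocessing import Pool

def fetch_one(url):
    # Download a single page and return its URL and status code.
    try:
        response = requests.get(url, timeout=10)
        return url, response.status_code
    except requests.RequestException:
        return url, None

if __name__ == '__main__':
    urls = ['http://example.com/page/%d' % i for i in range(1, 11)]  # placeholder URLs
    pool = Pool(processes=4)             # four worker processes crawl in parallel
    results = pool.map(fetch_one, urls)  # distribute the URL list across the pool
    pool.close()
    pool.join()
    print(results)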
For the "load more" case, use Ajax to transfer a lot of data.
How it works is: after loading the source code of the web page from the URL of the web page, the JavaScript program will be executed in the browser. These programs load more content and "populate" the web page. This is why if you go directly to crawl the URL of the web page itself, you won't find the actual content of the page.
Here, if you use Google Chrome to analyze the link corresponding to "Request" (method: right-click → Inspect Element → Network → Clear, click "Load More", the corresponding GET link will appear and look for the Type of text/html, click to view the get parameters Or copy the Request URL), loop process.
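As a rough sketch (the endpoint path and the page parameter below are hypothetical; the real ones must be copied from the Network panel as described above), the captured Ajax URL can then be requested in a loop until nothing more comes back:
import requests

ajax_url = 'http://example.com/api/list'   # hypothetical endpoint copied from the Network panel
page = 1
while True:
    response = requests.get(ajax_url, params={'page': page})  # same GET parameters as "Load more"
    items = response.json()   # assuming the endpoint returns JSON; some return HTML fragments instead
    if not items:             # stop once the server has nothing more to load
        break
    for item in items:
        print(item)
    page += 1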
Selenium is an automated testing tool. It can control the browser: filling in text, clicking the mouse, getting elements, switching pages, and a whole series of other operations. In short, Selenium can do anything a browser can do.
Here it is used to dynamically fetch ticket price information from Qunar.com for a given list of cities.
Reference project: Web crawler Selenium uses proxy login: Crawl Qunar website
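A minimal Selenium sketch in the same spirit (not the reference project's actual code; the element id is a placeholder, and a matching browser driver such as chromedriver must be installed):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()                      # or webdriver.Firefox(), etc.
driver.get('http://www.qunar.com/')              # open the page in a real browser
box = driver.find_element(By.ID, 'searchbox')    # locate an input box (placeholder id)
box.send_keys('Beijing')                         # fill in text just like a user would
html = driver.page_source                        # the HTML after JavaScript has run
driver.quit()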
For the situation where the website has a verification code, we have three methods:
Using a proxy and using cookies to log in have been discussed before. Now let’s talk about verification code identification.
You can use the open-source Tesseract-OCR system to recognize the downloaded verification code images and pass the recognized characters to the crawler for a simulated login. Of course, you can also upload the verification code image to a captcha-solving platform for recognition. If recognition fails, simply refresh the verification code and try again until it succeeds.
Reference project: Verification code identification project version 1: Captcha1
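A hedged sketch of the OCR route (the captcha image URL is a placeholder, pytesseract is assumed as the Python wrapper around Tesseract-OCR, and plain OCR only works when the captcha is not heavily distorted):
import io
import requests
import pytesseract            # Python wrapper around the Tesseract-OCR engine
from PIL import Image

captcha_url = 'http://example.com/captcha.jpg'    # placeholder captcha image URL
session = requests.session()
image_data = session.get(captcha_url).content     # download the image inside the session
captcha_text = pytesseract.image_to_string(Image.open(io.BytesIO(image_data)))
# Submit captcha_text together with the login form, reusing `session`
# so that the captcha cookie and the login request match.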
There are two issues that need to be paid attention to when crawling:
After crawling, the fetched content is analyzed: whatever content you need, you extract it from the response.
Common analysis tools include regular expressions, BeautifulSoup, lxml, etc.
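For example, a minimal BeautifulSoup sketch (the URL, tag names and class name are placeholders for whatever the target page actually uses):
import requests
from bs4 import BeautifulSoup

url = 'http://example.com/'                        # placeholder page
html = requests.get(url).content
soup = BeautifulSoup(html, 'lxml')                 # parse the HTML with the lxml parser
for link in soup.find_all('a'):                    # every anchor tag on the page
    print(link.get('href'))
titles = [tag.get_text() for tag in soup.find_all('h2', class_='title')]  # placeholder selector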
After analyzing what we need, the next step is to store it.
We can choose to save it to a text file, or to a MySQL or MongoDB database, etc.
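As a small sketch of the storage step (the extracted items, database name and collection name below are made up for illustration):
import json
import pymongo

items = [{'title': 'XXXXX', 'url': 'XXXXX'}]       # whatever the analysis step extracted

# Save to a text file, one JSON object per line
with open('items.txt', 'w') as f:
    for item in items:
        f.write(json.dumps(item) + '\n')

# Or save to MongoDB
client = pymongo.MongoClient('localhost', 27017)
collection = client['crawler_db']['items']         # placeholder database/collection names
collection.insert_many(items)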
There are two issues that need to be paid attention to when storing:
Scrapy is an open source Python crawler framework based on Twisted, which is widely used in industry.
For related content, you can refer to the article on building a web crawler based on Scrapy. The WeChat search crawling project code introduced there is also provided as a learning reference.
Reference project: Use Scrapy or Requests to recursively crawl WeChat search results
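A minimal Scrapy spider sketch, just to show the shape of the framework (the spider name, start URL and CSS selectors are placeholders, not the WeChat search project's actual code):
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'                                # placeholder spider name
    start_urls = ['http://example.com/']            # placeholder start page

    def parse(self, response):
        # Extract data from the downloaded page
        for title in response.css('h2.title::text').getall():
            yield {'title': title}
        # Recursively follow the "next page" link, if any
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
Such a spider can be run with scrapy runspider example_spider.py -o items.json, with Scrapy handling the scheduling, downloading and item export.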
A good web crawler should first comply with the Robots protocol. The Robots protocol (also known as the crawler protocol or robot protocol), whose full name is the "Robots Exclusion Protocol", is how a website tells search engines which pages may be crawled and which may not.
A robots.txt text file is placed in the root directory of the website (for example, https://www.taobao.com/robots.txt). It specifies which pages different crawlers may access and which pages are off limits, with the page patterns expressed as regular-expression-like rules. Before collecting from a website, a crawler first fetches this robots.txt file, parses the rules in it, and then collects the site's data according to those rules.
User-agent: specifies which crawlers the rules apply to
Disallow: specifies URLs that must not be accessed
Allow: specifies URLs that may be accessed
Note: the first letter of each directive must be capitalized, the colon must be the ASCII colon followed by a space, and "/" stands for the entire website.
Disallow access for all robots
User-agent: *
Disallow: /
Allow access for all robots
User-agent: *
Disallow:
Disallow access for a specific robot
User-agent: BadBot
Disallow: /
Allow access for a specific robot
User-agent: GoodBot
Disallow:
Disallow access to a specific directory
User-agent: *
Disallow: /images/
Allow access only to a specific directory
User-agent: *
Allow: /images/
Disallow: /
Disallow access to specific files
User-agent: *
Disallow: /*.html$
Allow access only to specific files
User-agent: *
Allow: /*.html$
Disallow: /
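As a short sketch of how a crawler can honor these rules programmatically, Python's standard robotparser module (urllib.robotparser in Python 3) can fetch and evaluate a site's robots.txt before crawling:
import robotparser                      # urllib.robotparser in Python 3

rp = robotparser.RobotFileParser()
rp.set_url('https://www.taobao.com/robots.txt')
rp.read()                               # download and parse the robots.txt file
# Ask whether a generic crawler ('*') may fetch a given URL
print(rp.can_fetch('*', 'https://www.taobao.com/some/page'))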