Practical task: collect the Chinese and English titles, Hong Kong and Taiwan titles, directors, release years, genres, and ratings of the movies in Douban's Top 250, and store the data in a database and in files. The list URL is: https://movie.douban.com/top250?start=.
Earlier sections introduced several ways to crawl web page data. Here we start by fetching the list pages.
import re
import requests
from bs4 import BeautifulSoup

for i in range(0, 2):  # first two pages only; range(0, 10) covers all 250 movies
    headers = {
        # Simulate a browser request
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36',
        'Host': 'movie.douban.com'
    }
    url = 'https://movie.douban.com/top250?start=' + str(25 * i)  # each page lists 25 movies
    r = requests.get(url, headers=headers, timeout=10)  # set the timeout
    soup = BeautifulSoup(r.text, 'html.parser')  # set the parser
    print(soup)
The output is:
<!DOCTYPE html>
<html class="ua-windows ua-webkit">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="webkit" name="renderer"/>
<meta content="always" name="referrer"/>
<meta content="ok0wCgT20tBBgo9_zat2iAcimtN4Ftf5ccsh092Xeyw" name="google-site-verification">
<title>Top250 Douban Movies</title>
.....
</script><!--dae-web-movie--default-759d9f45f7-b69fc--><script>_SPLITTEST=''</script></link></link></body></html>
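The start parameter in the URL is an offset into the list, not a page number. The following is a minimal sketch of how the offsets map to page URLs, assuming the list spans 250 entries at 25 per page (as the 25*i in the loop suggests). The helper name page_urls is hypothetical and not part of the original code; no request is sent.

```python
BASE_URL = 'https://movie.douban.com/top250?start='
PAGE_SIZE = 25  # assumed number of movies per list page


def page_urls(total=250, page_size=PAGE_SIZE):
    """Return the list-page URLs needed to cover `total` entries."""
    return [BASE_URL + str(offset) for offset in range(0, total, page_size)]


urls = page_urls()
print(len(urls))   # number of pages to request
print(urls[0])     # first page, offset 0
print(urls[-1])    # last page, offset 225
```

Building the URL list up front keeps the offset arithmetic separate from the network code, which makes it easy to check before any page is requested.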
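To detect whether a page has changed since it was last crawled, we use the md5() function from Python's hashlib module. If you have only just crawled the data, you can skip this step.
MD5 is a hash (message-digest) algorithm, not an encryption algorithm: it maps input of any length to a 128-bit digest, and the same input always produces the same digest, which makes it convenient for detecting changes. The check code is as follows.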
import hashlib
import os

def vertifyupdate(html):
    # Compute the MD5 digest of the page content
    md5 = hashlib.md5()
    md5.update(html.encode(encoding='utf-8'))
    md5code = md5.hexdigest()
    print(md5code)
    old_html = ''
    html_name = 'gp.txt'  # file that stores the previous digest
    if os.path.exists(html_name):
        with open(html_name, 'r', encoding='utf-8') as f:
            old_html = f.read()
    if md5code == old_html:
        print('data not updated')
        return False
    else:
        # Save the new digest for the next comparison
        with open(html_name, 'w', encoding='utf-8') as f:
            f.write(md5code)
        print('data updated')
        return True
The function relies on the hashlib module (plus os for the file check). It creates an md5 object and feeds it the content of the current page with the update() method, which expects bytes (hence the encode() call), then reads the result as a hexadecimal string with hexdigest().
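To see the digest behaviour in isolation, here is a small sketch. The helper md5_hex and the two sample strings are made up for illustration; only hashlib.md5() and hexdigest() come from the standard library.

```python
import hashlib


def md5_hex(text):
    """Return the MD5 digest of `text` as a hexadecimal string."""
    return hashlib.md5(text.encode('utf-8')).hexdigest()


page_v1 = '<title>Top 250</title>'
page_v2 = '<title>Top 250!</title>'  # one character added

same = md5_hex(page_v1) == md5_hex(page_v1)     # identical input
changed = md5_hex(page_v1) != md5_hex(page_v2)  # modified input
print(same, changed)
print(len(md5_hex(page_v1)))  # 32 hex characters, i.e. 128 bits
```

Comparing two short digests is cheaper than storing and comparing the full page text, which is why only the digest needs to be saved between runs.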
After computing the digest, the function checks whether the file holding the previous digest exists. If it does, it reads the stored MD5 code and compares it with the new one. If the two are the same, the page has not been updated; otherwise it has been updated, and the new MD5 code is written to the file. On the first run no file exists yet, so the page counts as updated and its digest is saved.
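The flow described above can be tried out without touching the network or the real digest file. This is a sketch under stated assumptions: has_changed is a hypothetical variant of the function that takes the file path as a parameter, and the demo strings stand in for real page content.

```python
import hashlib
import os
import tempfile


def has_changed(html, digest_path):
    """Return True if `html` differs from the digest stored at `digest_path`.

    The stored digest is overwritten whenever the content has changed.
    """
    new_digest = hashlib.md5(html.encode('utf-8')).hexdigest()
    old_digest = ''
    if os.path.exists(digest_path):
        with open(digest_path, 'r', encoding='utf-8') as f:
            old_digest = f.read().strip()
    if new_digest == old_digest:
        return False
    with open(digest_path, 'w', encoding='utf-8') as f:
        f.write(new_digest)
    return True


# Demo in a temporary directory so no real file is touched
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'digest.txt')
    first = has_changed('<html>v1</html>', path)   # no stored digest yet
    second = has_changed('<html>v1</html>', path)  # same content as before
    third = has_changed('<html>v2</html>', path)   # content changed
    print(first, second, third)  # True False True
```

Passing the path in as an argument, rather than hard-coding the file name, lets each crawled page keep its own digest file.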
Crawling the data is only the first step. If stored data has sat unused for a long time, check whether the page has been updated before relying on it. These steps are relatively simple; the harder part is extracting accurate data. In the next section, we analyze the data.