Because a course project required a CNKI crawler, I searched GitHub and found CyrusRenty's CNKI-download repository. After cloning it I found it no longer worked, so I modified it. Apart from the document-download and verification-code features, all other functionality has been tested and works.
Because my campus network alone does not let CNKI recognize my school (it has to go through another school's VPN), I cannot use the document-download feature o(╥﹏╥)o, so I could not fix it. I have not run into verification codes; keeping a sufficient interval between page requests should avoid that problem.
PS: For issues such as NoneType errors, I simply ignored them with a crude try/except. So if you find a row with no data in the excel after crawling finishes, just delete it as noise~ If you urgently need all fields complete, file an issue and I'll see if I can help.
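For illustration only, a minimal sketch of the crude handling described above; the field names and selectors are hypothetical, not the project's actual code:

```python
# Hypothetical sketch: if any field of a result row fails to parse
# (e.g. a missing tag makes .find() return None), the row is dropped
# and shows up as an empty excel row (the "noise" described above).
def parse_row(item):
    try:
        return {
            "title": item.find("a", class_="title").get_text(strip=True),
            "authors": item.find("td", class_="author").get_text(strip=True),
        }
    except AttributeError:  # the "NoneType has no attribute ..." case
        return None
```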
PS: Sometimes the following error may be reported when running the crawler:
If this happens, turn off any global proxy/VPN, enter the CNKI address in your browser to check that it loads normally, and then try running the program again.
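A quick way to do the same reachability check from Python (just a sketch, not part of the project):

```python
import requests

# If this request fails or hangs, a global proxy/VPN is likely
# intercepting traffic; turn it off and try again.
try:
    resp = requests.get("https://www.cnki.net", timeout=10)
    print("CNKI reachable, HTTP status:", resp.status_code)
except requests.RequestException as exc:
    print("Cannot reach CNKI:", exc)
```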
If you have any questions about this project, feel free to open an issue; I'll help as best I can! If you have better code to contribute, PRs are very welcome!
The following content is taken from the original author's README; my respects to the original author ∠(°ゝ°)
This project is a Python3-based crawler for CNKI data. It supports queries modeled on CNKI's advanced search and can crawl detailed information such as basic document metadata, document downloads, and document abstracts.
The implementation process can be viewed on my blog
The program runs as follows:
The detailed information excel table is as follows:
Downloading caj files works as follows:
Captures data by sending requests and parsing the responses directly, which performs somewhat better than selenium-style browser automation (see the sketch after this list).
You can use CNKI’s advanced search function to search and retrieve documents more efficiently.
Crawling of detailed information and downloading of caj documents can each be toggled on or off depending on network conditions and CNKI's anti-crawler behavior.
The excel table gives a quick view of abstracts and other information for the retrieved literature, and you can download selectively via the links it provides, to avoid triggering CNKI's anti-crawler measures through excessive downloading.
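To illustrate the request-and-parse approach mentioned in the first point, here is a minimal sketch; the URL and parameters are placeholders, not CNKI's actual search endpoint:

```python
import requests
from bs4 import BeautifulSoup

# Fetch a results page directly and parse the HTML, instead of
# driving a real browser with selenium.
resp = requests.get(
    "https://example.com/search",   # placeholder, not CNKI's real URL
    params={"q": "deep learning"},
    timeout=10,
)
soup = BeautifulSoup(resp.text, "html.parser")
for link in soup.select("a"):
    print(link.get_text(strip=True), link.get("href"))
```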
`tesserocr` is used in the verification-code handling, but its recognition is currently not very good, so manual verification-code entry is enabled by default. If `tesseract` is not installed locally, install it first and then run `pip install tesserocr`. Alternatively, comment out lines 15, 63, and 64 of the `CrackVerifyCode.py` file and then run the installation command.
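For reference, a minimal sketch of recognizing a verification-code image with tesserocr (assumes tesseract is installed; the file name is a placeholder):

```python
import tesserocr
from PIL import Image

# One-shot OCR of a captcha image; accuracy on CNKI's codes is poor,
# which is why manual entry is the default in this project.
code = tesserocr.image_to_text(Image.open("captcha.png")).strip()
print("recognized code:", code)
```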
```
pip install -r requirements.txt
```
```ini
# Config.ini is the project configuration file
# 0 = off, 1 = on
isDownloadFile = 1   # whether to download documents
isCrackCode = 0      # whether to automatically recognize verification codes
isDetailPage = 0     # whether to save document details to excel
isDownLoadLink = 1   # whether to save download links in excel
stepWaitTime = 5     # pause time for each download and detail-page crawl
```
It is recommended not to enable downloading and detail-page crawling at the same time, and the pause time should not be less than 3 seconds.
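For reference, a sketch of how such flags can be read with the standard library; the section name `settings` is an assumption, so adjust it to match the project's actual Config.ini layout:

```python
from configparser import ConfigParser

cfg = ConfigParser()
cfg.read("Config.ini", encoding="utf-8")

# Section name "settings" is assumed for illustration.
is_download_file = cfg.getint("settings", "isDownloadFile")
step_wait_time = cfg.getint("settings", "stepWaitTime")
print(is_download_file, step_wait_time)
```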
```
python run-spider.py
```
After the crawler finishes running, all data is saved in the data folder. The old data folder is deleted automatically each time the program is re-run.
```
CNKI_download
-- data                      stores all crawled data
   -- CAJs                   stores all downloaded caj full texts
      -- xxxxxxx.caj
      -- xxxxxxx.caj
   -- Links.txt              download links for all crawled documents
   -- ReferenceList.txt      brief information on the crawled documents
   -- Reference_detail.xls   excel table of detailed document information
```
The project assumes the computer can access CNKI via its IP and download from it (most universities have purchased the database). As I was finishing, I found there is also a redirect interface; public-network access will be added later.
If "access denied by the remote host" appears, you can appropriately lengthen the pause time for each session.
After one run, remember to close all files in the data folder before running again; otherwise an error may occur because the data folder cannot be deleted.
If you only crawl information without downloading, you may be asked repeatedly for a verification code (even when it is entered correctly) after roughly 1,000 documents. The cause is not yet known.
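The pacing idea behind lengthening the pause looks roughly like this (a sketch, not the project's code):

```python
import time
import requests

STEP_WAIT_TIME = 5  # seconds, mirroring stepWaitTime in Config.ini

def fetch_politely(urls):
    """Yield responses with a pause between requests so that CNKI's
    anti-crawler measures are less likely to trigger; lengthen the
    pause if "access denied by the remote host" appears."""
    for url in urls:
        yield requests.get(url, timeout=10)
        time.sleep(STEP_WAIT_TIME)
```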
Complete the remaining unimplemented advanced-search functionality.
Add the ability to specify a starting page, so a crawl can resume from the point of the last error.
Add a public-network redirect to the CNKI interface so that users who cannot log in via IP can still use this crawler.
Build a proxy pool on top of the public-network redirect to access CNKI via proxy IPs, reducing IP blocking by CNKI and the number of verification-code prompts.
Write up notes on the program's implementation and analysis process.