Because a course project required a CNKI crawler, I searched GitHub and found CyrusRenty's CNKI-download repository. After cloning it I found it no longer worked, so I modified it. Apart from the document-download and verification-code features, all other functionality has been tested and works.
Because my campus network alone does not let CNKI recognize my school (it has to go through another school's VPN), I cannot use the document-download feature o(╥﹏╥)o, so I could not fix it. I have not run into verification codes; keeping a sufficient interval between page requests should avoid that problem.
PS: For issues such as NoneType errors, I simply ignored them with a crude try/except. So if you find a row with no data in the excel after crawling finishes, just delete it as noise~ If you urgently need all fields complete, file an issue and I'll see if I can help.
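For illustration only, a minimal sketch of the crude handling described above; the field names and selectors are hypothetical, not the project's actual code:

```python
# Hypothetical sketch: if any field of a result row fails to parse
# (e.g. a missing tag makes .find() return None), the row is dropped
# and shows up as an empty excel row (the "noise" described above).
def parse_row(item):
    try:
        return {
            "title": item.find("a", class_="title").get_text(strip=True),
            "authors": item.find("td", class_="author").get_text(strip=True),
        }
    except AttributeError:  # the "NoneType has no attribute ..." case
        return None
```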
PS: Sometimes the following error may be reported when running the crawler:
If this happens, turn off any global proxy/VPN, enter the CNKI address in your browser to check that it loads normally, and then try running the program again.
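A quick way to do the same reachability check from Python (just a sketch, not part of the project):

```python
import requests

# If this request fails or hangs, a global proxy/VPN is likely
# intercepting traffic; turn it off and try again.
try:
    resp = requests.get("https://www.cnki.net", timeout=10)
    print("CNKI reachable, HTTP status:", resp.status_code)
except requests.RequestException as exc:
    print("Cannot reach CNKI:", exc)
```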
If you have any questions about this project, feel free to open an issue; I'll help as best I can! If you have better code to contribute, PRs are very welcome!
The following content is taken from the original author's README; my respects to the original author ∠(°ゝ°)
This project is a Python3-based crawler for CNKI data. It supports queries modeled on CNKI's advanced search and can crawl detailed information such as basic document metadata, document downloads, and document abstracts.
The implementation process can be viewed on my blog
The program runs as follows:
The detailed information excel table is as follows:
Downloading caj files works as follows:
Captures data by sending requests and parsing the responses directly, which performs somewhat better than selenium-style browser automation (see the sketch after this list).
You can use CNKI’s advanced search function to search and retrieve documents more efficiently.
Crawling of detailed information and downloading of caj documents can each be toggled on or off depending on network conditions and CNKI's anti-crawler behavior.
The excel table gives a quick view of abstracts and other information for the retrieved literature, and you can download selectively via the links it provides, to avoid triggering CNKI's anti-crawler measures through excessive downloading.
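To illustrate the request-and-parse approach mentioned in the first point, here is a minimal sketch; the URL and parameters are placeholders, not CNKI's actual search endpoint:

```python
import requests
from bs4 import BeautifulSoup

# Fetch a results page directly and parse the HTML, instead of
# driving a real browser with selenium.
resp = requests.get(
    "https://example.com/search",   # placeholder, not CNKI's real URL
    params={"q": "deep learning"},
    timeout=10,
)
soup = BeautifulSoup(resp.text, "html.parser")
for link in soup.select("a"):
    print(link.get_text(strip=True), link.get("href"))
```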
`tesserocr` is used in the verification-code handling, but its recognition is currently not very good, so manual verification-code entry is enabled by default. If `tesseract` is not installed locally, install it first and then run `pip install tesserocr`. Alternatively, comment out lines 15, 63, and 64 of the `CrackVerifyCode.py` file and then run the installation command.
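For reference, a minimal sketch of recognizing a verification-code image with tesserocr (assumes tesseract is installed; the file name is a placeholder):

```python
import tesserocr
from PIL import Image

# One-shot OCR of a captcha image; accuracy on CNKI's codes is poor,
# which is why manual entry is the default in this project.
code = tesserocr.image_to_text(Image.open("captcha.png")).strip()
print("recognized code:", code)
```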
```
pip install -r requirements.txt
```
```ini
# Config.ini is the project configuration file
# 0 = off, 1 = on
isDownloadFile = 1   # whether to download documents
isCrackCode = 0      # whether to automatically recognize verification codes
isDetailPage = 0     # whether to save document details to excel
isDownLoadLink = 1   # whether to save download links in excel
stepWaitTime = 5     # pause time for each download and detail-page crawl
```
It is recommended not to enable downloading and detail-page crawling at the same time, and the pause time should not be less than 3 seconds.
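For reference, a sketch of how such flags can be read with the standard library; the section name `settings` is an assumption, so adjust it to match the project's actual Config.ini layout:

```python
from configparser import ConfigParser

cfg = ConfigParser()
cfg.read("Config.ini", encoding="utf-8")

# Section name "settings" is assumed for illustration.
is_download_file = cfg.getint("settings", "isDownloadFile")
step_wait_time = cfg.getint("settings", "stepWaitTime")
print(is_download_file, step_wait_time)
```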
```
python run-spider.py
```
After the crawler finishes running, all data is saved in the data folder. The old data folder is deleted automatically each time the program is re-run.
```
CNKI_download
-- data                      stores all crawled data
   -- CAJs                   stores all downloaded caj full texts
      -- xxxxxxx.caj
      -- xxxxxxx.caj
   -- Links.txt              download links for all crawled documents
   -- ReferenceList.txt      brief information on the crawled documents
   -- Reference_detail.xls   excel table of detailed document information
```
The project assumes the computer can access CNKI via its IP and download from it (most universities have purchased the database). As I was finishing, I found there is also a redirect interface; public-network access will be added later.
If "access denied by the remote host" appears, you can appropriately lengthen the pause time for each session.
After one run, remember to close all files in the data folder before running again; otherwise an error may occur because the data folder cannot be deleted.
If you only crawl information without downloading, you may be asked repeatedly for a verification code (even when it is entered correctly) after roughly 1,000 documents. The cause is not yet known.
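The pacing idea behind lengthening the pause looks roughly like this (a sketch, not the project's code):

```python
import time
import requests

STEP_WAIT_TIME = 5  # seconds, mirroring stepWaitTime in Config.ini

def fetch_politely(urls):
    """Yield responses with a pause between requests so that CNKI's
    anti-crawler measures are less likely to trigger; lengthen the
    pause if "access denied by the remote host" appears."""
    for url in urls:
        yield requests.get(url, timeout=10)
        time.sleep(STEP_WAIT_TIME)
```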
Complete the remaining unimplemented advanced-search functionality.
Add the ability to specify a starting page, so a crawl can resume from the point of the last error.
Add a public-network redirect to the CNKI interface so that users who cannot log in via IP can still use this crawler.
Build a proxy pool on top of the public-network redirect to access CNKI via proxy IPs, reducing IP blocking by CNKI and the number of verification-code prompts.
Write up notes on the program's implementation and analysis process.