"I used a crawler to "steal" one million Zhihu users in one day, just to prove that PHP is the best language in the world"
phpspider is a crawler development framework. With it, you don't need to understand the underlying implementation of a crawler, or handle problems such as the crawler being blocked by a website, or sites that require login or CAPTCHA recognition before they can be crawled. A few lines of PHP code are enough to create your own crawler. The multi-process Worker class library encapsulated by the framework keeps the code simpler and makes execution more efficient.
There are some crawling rules for specific websites in the demo directory. As long as you have a PHP environment installed, the code can be run directly on the command line. Developers who are interested in crawlers can join the QQ group to discuss: 147824717.
Let's take Qiushibaike (糗事百科) as an example to see what a crawler looks like:
$configs = array(
    'name' => '糗事百科',
    'domains' => array(
        'qiushibaike.com',
        'www.qiushibaike.com'
    ),
    'scan_urls' => array(
        'http://www.qiushibaike.com/'
    ),
    'content_url_regexes' => array(
        "http://www.qiushibaike.com/article/\d+"
    ),
    'list_url_regexes' => array(
        "http://www.qiushibaike.com/8hr/page/\d+\?s=\d+"
    ),
    'fields' => array(
        array(
            // Extract the article content from the content page
            'name' => "article_content",
            'selector' => "//*[@id='single-next-link']",
            'required' => true
        ),
        array(
            // Extract the article author from the content page
            'name' => "article_author",
            'selector' => "//div[contains(@class,'author')]//h2",
            'required' => true
        ),
    ),
);
$spider = new phpspider($configs);
$spider->start();
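To get a feel for which URLs the content and list patterns are meant to accept (an article ID or page number written as `\d+`, a run of digits), here is a minimal standalone sketch using PHP's built-in preg_match. It does not use phpspider itself: the helper name classify_url and the sample URLs are hypothetical, and wrapping each pattern in `#` delimiters with `^`/`$` anchors is an assumption about how such patterns might be applied, not a description of the framework's internals.

```php
<?php
// Hypothetical helper, not part of phpspider: reports which kind of page a URL looks like.
function classify_url(string $url): string
{
    $content_url_regexes = array("http://www.qiushibaike.com/article/\d+");
    $list_url_regexes    = array("http://www.qiushibaike.com/8hr/page/\d+\?s=\d+");

    foreach ($content_url_regexes as $regex) {
        // "#" is used as the delimiter so the slashes in the URL need no escaping.
        // Anchoring with ^ and $ is an assumption made for this sketch.
        if (preg_match('#^' . $regex . '$#', $url)) {
            return 'content';
        }
    }
    foreach ($list_url_regexes as $regex) {
        if (preg_match('#^' . $regex . '$#', $url)) {
            return 'list';
        }
    }
    return 'other';
}

// Sample URLs are made up for illustration.
echo classify_url('http://www.qiushibaike.com/article/123456'), "\n";         // content
echo classify_url('http://www.qiushibaike.com/8hr/page/2?s=4867046'), "\n";  // list
echo classify_url('http://www.qiushibaike.com/'), "\n";                       // other
```

Note that the `?` in the list pattern has to be escaped as `\?`; otherwise it would be read as a quantifier rather than the literal start of the query string.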
That is the overall structure of a crawler. First, a $configs array is defined, which holds the information about the website to be crawled. The crawler is then created from that configuration by calling $spider = new phpspider($configs); and started by calling $spider->start();
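The 'selector' values in the 'fields' array are XPath expressions. As a rough illustration of what such selectors pick out of a page, the following sketch evaluates them with PHP's standard DOMDocument and DOMXPath classes. It is independent of phpspider: the helper extract_field and the sample HTML are hypothetical, and the real page markup may differ.

```php
<?php
// Hypothetical sample markup; the structure of the real page may differ.
$html = '<html><body>'
      . '<div class="author clearfix"><h2>Example Author</h2></div>'
      . '<div id="single-next-link">Example article text</div>'
      . '</body></html>';

// Hypothetical helper, not part of phpspider: returns the trimmed text of the
// first node matched by an XPath selector, or null if nothing matches.
function extract_field(string $html, string $selector): ?string
{
    $doc = new DOMDocument();
    // Suppress the warnings libxml raises for imperfect real-world HTML.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($selector);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// Same field names and selectors as in the configuration.
$fields = array(
    'article_content' => "//*[@id='single-next-link']",
    'article_author'  => "//div[contains(@class,'author')]//h2",
);
foreach ($fields as $name => $selector) {
    echo $name, ': ', extract_field($html, $selector), "\n";
}
```

A field marked 'required' => true would presumably correspond to the case where this helper returns null, meaning the page lacks that piece of data.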
For more details, see the development documentation.