QueryList is a simple, elegant, and highly extensible PHP collection (web scraping) tool based on phpQuery.
1. Uses exactly the same CSS3 DOM selectors as jQuery
2. Offers exactly the same DOM manipulation API as jQuery
3. Provides a general-purpose solution for list collection
4. Includes a powerful HTTP request suite that makes complex requests easy, such as simulated logins, browser spoofing, and HTTP proxies
5. Provides a solution for garbled text caused by encoding problems
6. Offers powerful content filtering, including filtering with jQuery selectors
7. Has a highly modular design and is easy to extend
8. Has an expressive API
9. Has high-quality documentation
10. Has a rich set of plug-ins
11. Has a professional Q&A community and discussion group
Plug-ins make it easy to implement features such as:
1. Multi-threaded collection
2. Collecting pages rendered dynamically by JavaScript (PhantomJS/headless WebKit)
3. Image localization (saving remote images locally)
4. Simulating browser behavior, such as submitting a form
5. Web crawlers
6. ...
PHP >= 7.0
If you are still on PHP 5, or you don't know how to use Composer, you can use QueryList3 instead. QueryList3 supports PHP 5.3 and manual installation.
Install via Composer:
composer require jaeger/querylist
Element operations
Collect the addresses of all images on Nipic (nipic.com):
QueryList::get('http://www.nipic.com')->find('img')->attrs('src');
Collect Baidu search results
$ql = QueryList::get('http://www.baidu.com/s?wd=QueryList');

$ql->find('title')->text(); // Get the page title
$ql->find('meta[name=keywords]')->content; // Get the page's meta keywords

$ql->find('h3>a')->texts(); // Get the list of search result titles
$ql->find('h3>a')->attrs('href'); // Get the list of search result links

$ql->find('img')->src; // Get the link address of the first image
$ql->find('img:eq(1)')->src; // Get the link address of the second image
$ql->find('img')->eq(2)->src; // Get the link address of the third image
// Iterate over all images
$ql->find('img')->map(function($img){
    echo $img->alt; // Print the image's alt attribute
});
More usage
$ql->find('#head')->append('<div>Append content</div>')->find('div')->htmls();
$ql->find('.two')->children('img')->attrs('alt'); // Get all img child nodes under the element with class "two"
// Iterate over all child nodes under the element with class "two"
$data = $ql->find('.two')->children()->map(function ($item){
    // Use is() to determine the node type
    if($item->is('a')){
        return $item->text();
    }elseif($item->is('img'))
    {
        return $item->alt;
    }
});

$ql->find('a')->attr('href', 'newVal')->removeClass('className')->html('newHtml')->...
$ql->find('div > p')->add('div > ul')->filter(':has(a)')->find('p:first')->nextAll()->andSelf()->...
$ql->find('div.old')->replaceWith( $ql->find('div.new')->clone())->appendTo('.trash')->prepend('Deleted')->...
List collection
Collect the titles and links of Baidu search results list:
$data = QueryList::get('http://www.baidu.com/s?wd=QueryList')
    // Set the collection rules
    ->rules([
        'title'=>array('h3','text'),
        'link'=>array('h3>a','href')
    ])
    ->query()->getData();

print_r($data->all());
Collection results:
Array
(
    [0] => Array
        (
            [title] => QueryList | An extremely powerful PHP collection tool based on phpQuery
            [link] => http://www.baidu.com/link?url=GU_YbDT2IHk4ns1tjG2I8_vjmH0SCJEAPuuZN
        )
    [1] => Array
        (
            [title] => PHP uses QueryList to crawl web page content - wb145230 - Blog Park
            [link] => http://www.baidu.com/link?url=zn0DXBnrvIF2ibRVW34KcRVFG1_bCdZvqvwIhUqiXaS
        )
    [2] => Array
        (
            [title] => Introduction - QueryList guidance document
            [link] => http://www.baidu.com/link?url=pSypvMovqS4v2sWeQo5fDBJ4EoYhXYi0Lxx
        )
    //...
)
Transcoding
// Output encoding: UTF-8, input encoding: GB2312
QueryList::get('https://top.etao.com')->encoding('UTF-8','GB2312')->find('a')->texts();

// Output encoding: UTF-8, input encoding: detected automatically
QueryList::get('https://top.etao.com')->encoding('UTF-8')->find('a')->texts();
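To make the input and output encodings concrete, here is a minimal sketch of a GB2312 to UTF-8 conversion using PHP's built-in iconv() function. It only illustrates what transcoding means for the raw bytes; it is an assumption for illustration and not a description of how QueryList performs the conversion internally. The sample bytes are hypothetical input.

```php
<?php
// Illustration only: not QueryList's internal implementation.

// The two characters "你好" as raw GB2312 bytes, as a GB2312 page would deliver them.
$gb2312 = "\xC4\xE3\xBA\xC3";

// Convert from the input encoding (GB2312) to the output encoding (UTF-8).
$utf8 = iconv('GB2312', 'UTF-8', $gb2312);

// Show the resulting UTF-8 bytes in hexadecimal.
echo bin2hex($utf8), PHP_EOL; // prints "e4bda0e5a5bd"
```

Without this step, GB2312 bytes interpreted as UTF-8 show up as garbled characters in the collected text.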
HTTP network operations (GuzzleHttp)
Log in to Sina Weibo with cookies
// Collect a Sina Weibo page that requires login to access
$ql = QueryList::get('http://weibo.com','param1=testvalue & params2=somevalue',[
    'headers' => [
        // Fill in the cookie obtained from the browser
        'Cookie' => 'SINAGLOBAL=546064; wb_cmtLike_2112031=1; wvr=6;....'
    ]
]);
//echo $ql->getHtml();
echo $ql->find('title')->text();
// Output: My homepage Weibo - discover new things anytime and anywhere
Use HTTP proxy
$urlParams = ['param1' => 'testvalue','params2' => 'somevalue'];
$opts = [
    // Set the HTTP proxy
    'proxy' => 'http://222.141.11.17:8118',
    // Set the timeout, in seconds
    'timeout' => 30,
    // Forge HTTP headers
    'headers' => [
        'Referer' => 'https://querylist.cc/',
        'User-Agent' => 'testing/1.0',
        'Accept' => 'application/json',
        'X-Foo' => ['Bar', 'Baz'],
        'Cookie' => 'abc=111;xxx=222'
    ]
];
$ql = QueryList::get('http://httpbin.org/get',$urlParams,$opts);
// echo $ql->getHtml();
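As a rough illustration of what the URL parameter array stands for, the sketch below builds the equivalent query string with PHP's built-in http_build_query(). It assumes that the parameters of a GET request end up in the URL query string; the variable names $query and $url are hypothetical and only used for this example.

```php
<?php
// Illustration only: shows the query string that a parameter array corresponds to.
$urlParams = ['param1' => 'testvalue', 'params2' => 'somevalue'];

// Encode the array as an application/x-www-form-urlencoded query string.
$query = http_build_query($urlParams);

// Append it to the request URL.
$url = 'http://httpbin.org/get?' . $query;

echo $url, PHP_EOL; // prints "http://httpbin.org/get?param1=testvalue&params2=somevalue"
```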
Simulated login
// Log in with a POST request
$ql = QueryList::post('http://xxxx.com/login',[
    'username' => 'admin',
    'password' => '123456'
])->get('http://xxx.com/admin');
// Collect a page that requires login to access
$ql->get('http://xxx.com/admin/page');
//echo $ql->getHtml();
Form operations
Simulate login to GitHub
// Get a QueryList instance
$ql = QueryList::getInstance();
// Get the login form
$form = $ql->get('https://github.com/login')->find('form');

// Fill in the GitHub username and password
$form->find('input[name=login]')->val('your github username or email');
$form->find('input[name=password]')->val('your github password');

// Serialize the form data
$formData = $form->serializeArray();
$postData = [];
foreach ($formData as $item) {
    $postData[$item['name']] = $item['value'];
}

// Submit the login form
$actionUrl = 'https://github.com'.$form->attr('action');
$ql->post($actionUrl,$postData);
// Check whether the login succeeded
// echo $ql->getHtml();
$userName = $ql->find('.header-nav-current-user>.css-truncate-target')->text();
if($userName)
{
    echo 'Login successful! Welcome:'.$userName;
}else{
    echo 'Login failed!';
}
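The loop that turns the serialized form fields into a name => value array can also be written with PHP's built-in array_column() (PHP 5.5 and later). The sketch below uses hypothetical sample data in the shape the example above iterates over (a list of arrays with 'name' and 'value' keys); the 'commit' field is an assumption added for illustration.

```php
<?php
// Hypothetical sample data: one entry per form field, each with 'name' and 'value'.
$formData = [
    ['name' => 'login',    'value' => 'your github username or email'],
    ['name' => 'password', 'value' => 'your github password'],
    ['name' => 'commit',   'value' => 'Sign in'],
];

// Take the 'value' column and index it by the 'name' column.
$postData = array_column($formData, 'value', 'name');

print_r($postData);
```

The result has the same structure as the array built by the foreach loop, so it can be passed to post() in the same way.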
Extending functionality with bind
Define a custom myHttp method:
$ql = QueryList::getInstance();

// Bind a myHttp method to the QueryList object
$ql->bind('myHttp',function ($url){
    // $this is the current QueryList object
    $html = file_get_contents($url);
    $this->setHtml($html);
    return $this;
});

// The method can then be called like this
$data = $ql->myHttp('https://toutiao.io')->find('h3 a')->texts();
print_r($data->all());
Alternatively, encapsulate the implementation in a class and bind it like this:
$ql->bind('myHttp',function ($url){
    return new MyHttp($this,$url);
});
Plug-in usage
Use the PhantomJS plug-in to collect JavaScript dynamically rendered pages:
// Set the path to the PhantomJS binary when installing the plug-in
$ql = QueryList::use(PhantomJs::class,'/usr/local/bin/phantomjs');
// Collect the mobile version of Toutiao
$data = $ql->browser('https://m.toutiao.com')->find('p')->texts();
print_r($data->all());
// Use an HTTP proxy
$ql->browser('https://m.toutiao.com',false,[
    '--proxy' => '192.168.1.42:8080',
    '--proxy-type' => 'http'
]);
Use the CurlMulti plug-in to collect GitHub trending lists with multiple threads:
$ql = QueryList::use(CurlMulti::class);
$ql->curlMulti([
    'https://github.com/trending/php',
    'https://github.com/trending/go',
    //.....more urls
])
// Called each time a task completes successfully
->success(function (QueryList $ql,CurlMulti $curl,$r){
    echo "Current url:{$r['info']['url']} \r\n";
    $data = $ql->find('h3 a')->texts();
    print_r($data->all());
})
// Called each time a task fails
->error(function ($errorInfo,CurlMulti $curl){
    echo "Current url:{$errorInfo['info']['url']} \r\n";
    print_r($errorInfo['error']);
})
->start([
    // Maximum number of concurrent requests
    'maxThread' => 10,
    // Number of retries on error
    'maxTry' => 3,
]);
jae-jae/QueryList-PhantomJS: Use PhantomJS to collect JavaScript dynamically rendered pages
jae-jae/QueryList-CurlMulti: Curl multi-thread collection
jae-jae/QueryList-AbsoluteUrl: Convert URL relative path to absolute path
jae-jae/QueryList-Rule-Google: Google search engine
jae-jae/QueryList-Rule-Baidu: Baidu search engine