This project is a Java-based crawler for WeChat official account articles, built on the WebCollector open-source crawler framework and Spring Boot. It crawls designated official accounts through the Sogou WeChat platform using a choice of proxy strategies, either on a schedule or triggered manually by visiting a URL in the browser.
You can build on this project for secondary development: configure Redis to avoid repeated crawling (the project uses RamCrawler rather than BreadthCrawler), implement an OSS service to dump static resources, or implement the article storage interface to persist crawled content to a database.
Features of this project:

- Built on the WebCollector crawler framework and Spring Boot
- Multiple proxy strategies: a specified list of proxy IPs, no proxy, or the Abuyun proxy, to keep the crawler's IP from being blocked
- A complete crawl-result import flow to support server-side use
- Reserved extension interfaces for a Redis service, an OSS service, and database storage
- Swagger API documentation
Usage:

1. Run `mvn install`
2. Adjust the configuration file `application.yml`; see Configuration below for the specific parameters
3. Start the project via `WxCrawlerApplication`
4. Open a browser and visit http://localhost:11111/wxCrawler?proxyPolicy=none to trigger a crawl task (a programmatic alternative is sketched after these steps)
5. After crawling, files are generated under the path specified by `crawler.weixin.outputPath` and archived with the naming scheme 公众号名_文章名.html (account name_article title)
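The same endpoint can also be called programmatically rather than from the browser; below is a minimal sketch using the JDK 11+ HttpClient, with the endpoint and parameter taken from the steps above:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TriggerCrawl {
    public static void main(String[] args) throws Exception {
        // proxyPolicy accepts none, a comma-separated ip:port list, or abuyun (see Configuration)
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:11111/wxCrawler?proxyPolicy=none"))
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("HTTP " + response.statusCode());
    }
}
```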
Start the project and visit http://localhost:11111/swagger-ui.html to see all of the exposed interfaces. You can also get an introduction to the interfaces by visiting http://localhost:11111/help directly.
The `/parseWxArticleList` and `/parseWxArticleDetail` interfaces are used for internal testing: they parse the article-list JSON and article-detail JSON captured from the WeChat client with Fiddler. Since Sogou WeChat currently indexes only WeChat "subscription accounts" and not "service accounts", official accounts of the service-account type require extra handling. In addition, Sogou WeChat does not include an article's read count and like count, so these also have to be obtained via packet capture. For details, see "Support for crawling service account articles as well as reading and like counts" below.
Configuration:

```yaml
server:
  port: 11111

spring:
  application:
    name: wx-crawl

crawler:
  weixin:
    # Official accounts to crawl; multiple accounts are supported, separated by ";"
    # (currently only subscription accounts are supported)
    accounts: 雪球;缘聚小许
    # Directory where the generated article HTML files are written
    outputPath: D:/article_output
    # Sleep time after each crawl visit, in milliseconds; to avoid being blocked
    # by Sogou and WeChat, a value of at least 3000 is recommended
    sleepTime: 5000
    # Whether to resume crawling from a breakpoint, using Redis to avoid re-crawling
    # Note: already-crawled articles are skipped, so old articles cannot be updated
    resumable: false
    # Proxy policy: no proxy, a specified list of proxy IPs, or the Abuyun proxy
    # e.g.: none | 222.182.56.50:8118,124.235.208.252:443 | abuyun
    proxyPolicy: none
    # Whether to update articles that already exist
    updateArticle: true
  proxy:
    # Abuyun account
    abuyunAccount: xxxx
    # Abuyun password
    abuyunPassword: xxxxx
```
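For example, to switch the crawler to the Abuyun proxy, set proxyPolicy to abuyun and fill in the account credentials; a minimal fragment (credential values are placeholders, and the nesting assumes the layout shown above):

```yaml
crawler:
  weixin:
    proxyPolicy: abuyun
  proxy:
    abuyunAccount: your-abuyun-account
    abuyunPassword: your-abuyun-password
```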
Storage implementation
The article storage interface is `com.xuzp.service.IArticleService`:

```java
/**
 * Save an article
 * @param articleVO
 * @return
 */
ResultBase<String> save(ArticleTransferVO articleVO, Operation operation);

/**
 * Find an article
 * @param articleVO
 * @return
 */
ResultBase<ArticleTransferVO> find(ArticleTransferVO articleVO);
```
The article storage implementation class is `com.xuzp.service.impl.ArticleServiceImpl`; extend it yourself.
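As a starting point, here is a minimal sketch of an implementation backed by an in-memory map; swap the map for your real database access. It assumes `ResultBase` exposes `success`/`fail` factory methods and that `ArticleTransferVO` has a `getUrl()` getter, neither of which is confirmed by this README, so adapt both to the actual classes:

```java
package com.xuzp.service.impl;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.springframework.stereotype.Service;

import com.xuzp.service.IArticleService;
// Imports for ResultBase, ArticleTransferVO and Operation depend on the project layout

// In-memory placeholder for IArticleService; replace the map with real persistence.
@Service
public class ArticleServiceImpl implements IArticleService {

    // Keyed by article URL here; pick a stable unique key for your schema
    private final Map<String, ArticleTransferVO> store = new ConcurrentHashMap<>();

    @Override
    public ResultBase<String> save(ArticleTransferVO articleVO, Operation operation) {
        // Operation could distinguish insert vs. update; ignored in this sketch
        store.put(articleVO.getUrl(), articleVO);
        return ResultBase.success(articleVO.getUrl()); // assumed factory method
    }

    @Override
    public ResultBase<ArticleTransferVO> find(ArticleTransferVO articleVO) {
        ArticleTransferVO found = store.get(articleVO.getUrl());
        return found == null
                ? ResultBase.fail("article not found")  // assumed factory method
                : ResultBase.success(found);
    }
}
```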
Redis extension
The Redis interface is `com.xuzp.service.IRedisService`:

```java
/**
 * Set a key-value pair
 * @param key
 * @param value
 */
void set(String key, String value);

/**
 * Check whether a key exists
 * @param key
 * @return
 */
Boolean exists(String key);

/**
 * Get the value stored under a key
 * @param key
 * @return
 */
Object get(final String key);
```
The implementation class of the Redis interface is `com.xuzp.service.impl.RedisServiceImpl`; extend it yourself.
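A minimal sketch of one possible implementation, assuming spring-boot-starter-data-redis is on the classpath and Redis connection settings are present in application.yml (the project may wire Redis differently):

```java
package com.xuzp.service.impl;

import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Service;

import com.xuzp.service.IRedisService;

// Backed by Spring's StringRedisTemplate; requires a reachable Redis instance.
@Service
public class RedisServiceImpl implements IRedisService {

    private final StringRedisTemplate redisTemplate;

    public RedisServiceImpl(StringRedisTemplate redisTemplate) {
        this.redisTemplate = redisTemplate;
    }

    @Override
    public void set(String key, String value) {
        redisTemplate.opsForValue().set(key, value);
    }

    @Override
    public Boolean exists(String key) {
        return redisTemplate.hasKey(key);
    }

    @Override
    public Object get(final String key) {
        return redisTemplate.opsForValue().get(key);
    }
}
```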
OSS extension
The OSS interface is `com.xuzp.service.IOssService`:

```java
/**
 * Dump static resources (video, audio, images, etc.) hosted on Tencent's servers
 * @param url the resource address
 * @return the new OSS url
 */
ResultBase<String> resourceTranslation(String url);
```
The implementation class is located at `com.xuzp.service.impl.OssServiceImpl`; extend it yourself.
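A hedged sketch of what resourceTranslation could look like: download the remote resource and re-upload it to your own object storage, returning the new URL. The uploadToOss helper is hypothetical, as are the ResultBase factory methods; replace them with your OSS provider's SDK calls (Aliyun OSS, AWS S3, Tencent COS, etc.) and the project's real result type:

```java
package com.xuzp.service.impl;

import java.io.InputStream;
import java.net.URL;
import java.util.UUID;

import org.springframework.stereotype.Service;

import com.xuzp.service.IOssService;

// Sketch: fetch a static resource hosted on Tencent's servers and dump it to your own OSS.
@Service
public class OssServiceImpl implements IOssService {

    @Override
    public ResultBase<String> resourceTranslation(String url) {
        // Derive a new object key; a content hash would be a better choice in practice
        String key = UUID.randomUUID().toString();
        try (InputStream in = new URL(url).openStream()) {
            String newUrl = uploadToOss(key, in); // hypothetical helper, see below
            return ResultBase.success(newUrl);    // assumed factory method
        } catch (Exception e) {
            return ResultBase.fail("resource dump failed: " + e.getMessage());
        }
    }

    // Hypothetical: implement with your provider's SDK, e.g. putObject(bucket, key, stream)
    private String uploadToOss(String key, InputStream in) {
        throw new UnsupportedOperationException("plug in your OSS SDK here");
    }
}
```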
Adjusting the automatic crawl schedule
The scheduled task code is in `com.xuzp.schedule.CrawlScheduledTasks`; adjust the schedule yourself:

```java
/**
 * Scheduled crawler that fetches WeChat official account articles
 */
@Scheduled(cron = "0 30 10,18 * * ?")
public void weixinCrawlTask() {
    crawlerStater(new LazyLoader<Crawler>(){
        @Override
        public WxCrawler newInstance() {
            return wxCrawlerConfig.wxCrawler(null);
        }
    }, WxCrawlerConstant.CRAWL_DEPTH, WX_CRAWLER, "公众号爬虫");
}
```
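For example, to crawl once an hour instead of at 10:30 and 18:30, only the cron expression needs to change (Spring cron fields are: second, minute, hour, day of month, month, day of week):

```java
// Run at the top of every hour; the method body stays the same as above
@Scheduled(cron = "0 0 * * * ?")
public void weixinCrawlTask() {
    crawlerStater(new LazyLoader<Crawler>() {
        @Override
        public WxCrawler newInstance() {
            return wxCrawlerConfig.wxCrawler(null);
        }
    }, WxCrawlerConstant.CRAWL_DEPTH, WX_CRAWLER, "公众号爬虫");
}
```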
Support for crawling service account articles as well as reading and like counts
Crawling currently relies on the Sogou WeChat platform http://weixin.sogou.com/ and is therefore subject to the following restrictions:

- Only WeChat subscription accounts are indexed; articles from WeChat service accounts are not included.
- Only the 10 most recent articles can be crawled; the full historical record is not available.
- Only the basic information of an article is included (title, author, abstract, content, publication time, etc.); the read count and like count are not.
Currently, this project provides two additional interfaces that obtain this article information by parsing data captured with Fiddler:

- `/parseWxArticleList`: parses the article-list JSON obtained from packet capture
- `/parseWxArticleDetail`: parses the article-detail JSON obtained from packet capture; run it after `/parseWxArticleList`
In the future, parsing of Fiddler packet-capture data will be extended to make this fully automatic; stay tuned for subsequent releases :)