sparkler
1.0.0
A web crawler is a bot program that fetches resources from the web in order to build applications such as search engines and knowledge bases. Sparkler (a contraction of Spark-Crawler) is a new web crawler that makes use of recent advancements in distributed computing and information retrieval by bringing together various Apache projects such as Spark, Kafka, Lucene/Solr, Tika, and pf4j. Sparkler is an extensible, highly scalable, high-performance web crawler that is an evolution of Apache Nutch and runs on an Apache Spark cluster.
Sparkler is being proposed to the Apache Incubator. Review the proposal document and share your suggestions here (to be done later, eventually!).
To use Sparkler, install Docker and run the following commands:
# Step 0. Get the image
docker pull ghcr.io/uscdatascience/sparkler/sparkler:main
# Step 1. Create a volume for elastic
docker volume create elastic
# Step 2. Inject seed urls
docker run -v elastic:/elasticsearch-7.17.0/data ghcr.io/uscdatascience/sparkler/sparkler:main inject -id myid -su 'http://www.bbc.com/news'
# Step 3. Start the crawl job
docker run -v elastic:/elasticsearch-7.17.0/data ghcr.io/uscdatascience/sparkler/sparkler:main crawl -id myid -tn 100 -i 2 # id=myid, top 100 URLs, 2 iterations
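If you prefer to supply seed URLs from a file rather than on the command line, the inject step can read one. The commands below are only a hedged sketch: they assume the image's inject entrypoint accepts the same -sf flag that sparkler.sh accepts (see the steps below) and that a file mounted at the hypothetical in-container path /data/seed-urls.txt is readable there.
# Hedged sketch: inject seeds from a mounted file (the in-container path is an assumption)
echo -e "http://www.bbc.com/news\nhttp://example.com" > seed-urls.txt
docker run -v elastic:/elasticsearch-7.17.0/data -v "$PWD/seed-urls.txt:/data/seed-urls.txt" ghcr.io/uscdatascience/sparkler/sparkler:main inject -id myid -sf /data/seed-urls.txt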
Alternatively, to run Sparkler (via sparkler.sh) with a file of seed URLs:
1. Follow Steps 0-1 above.
2. Create a file named seed-urls.txt using the Emacs editor as follows (a sample seed file is shown after these steps):
a. emacs sparkler/bin/seed-urls.txt
b. copy-paste your URLs
c. Ctrl+x Ctrl+s to save
d. Ctrl+x Ctrl+c to quit the editor [Reference: http://mally.stanford.edu/~sr/computing/emacs.html]
* Note: You can also use the Vim or Nano editors, or run: echo -e "http://example1.com\nhttp://example2.com" >> seed-urls.txt
3. Inject the seed URLs using the following command (assuming you are in the sparkler/bin directory):
$ bash sparkler.sh inject -id 1 -sf seed-urls.txt
4. Start the crawl job.
To crawl until all new URLs are exhausted, use -i -1. Example: /data/sparkler/bin/sparkler.sh crawl -id 1 -i -1
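For reference, a seed file is just a plain-text list of URLs, one per line; the entries below are illustrative examples:
# seed-urls.txt
http://www.bbc.com/news
http://example1.com
http://example2.com
The Docker image accepts the same crawl flags (see Step 3 above), so, assuming it passes -i -1 through unchanged, an unbounded crawl there would look like:
docker run -v elastic:/elasticsearch-7.17.0/data ghcr.io/uscdatascience/sparkler/sparkler:main crawl -id myid -i -1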
For any questions or suggestions, feel free to reach us on our mailing list at [email protected]. Alternatively, you can get help on the Slack channel: http://irds.usc.edu/sparkler/#slack