Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications.

Recent highlights include:
- File crawling, downloading, and tracking integrated into CrawlResult
- Support for srcset, picture, and responsive image formats
- Support for file:// paths and raw HTML inputs (raw:)

Play around with these features, or visit our documentation website to learn more.
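As a quick sketch of the file:// and raw: inputs listed above, the snippet below crawls an inline HTML string and a local file. The HTML content and the /tmp/example.html path are placeholders, and this assumes a Crawl4AI release that supports these URL prefixes:

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Crawl an inline HTML string via the raw: prefix (placeholder markup)
        raw_result = await crawler.arun(url="raw:<html><body><h1>Hello</h1></body></html>")
        print(raw_result.markdown)

        # Crawl a local HTML file via a file:// path (placeholder path)
        file_result = await crawler.arun(url="file:///tmp/example.html")
        print(file_result.markdown)

if __name__ == "__main__":
    asyncio.run(main())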
Crawl4AI offers flexible installation options to suit various use cases. You can install it as a Python package or use Docker.
Choose the installation option that best fits your needs:
For basic web crawling and scraping tasks:
pip install crawl4ai
By default, this installs the asynchronous version of Crawl4AI, which uses Playwright for web crawling.
Note: When you install Crawl4AI, the setup script should automatically install and set up Playwright. However, if you run into any Playwright-related errors, you can install it manually using one of the following methods:
Via the command line:
playwright install
If the above doesn't work, try this more specific command:
python -m playwright install chromium
The second method has proven more reliable in some cases.
If you need the synchronous version, which uses Selenium:
pip install crawl4ai[sync]
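If you do install the sync extra, basic usage might look like the sketch below. This assumes the legacy Selenium-based WebCrawler class exposed by the sync install; exact class and method names may differ between releases:

from crawl4ai import WebCrawler  # synchronous, Selenium-based crawler (sync extra)

crawler = WebCrawler()
crawler.warmup()  # prepare the browser driver before the first crawl

result = crawler.run(url="https://www.nbcnews.com/business")
print(result.markdown)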
For contributors who plan to modify the source code:
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .
Deploy your own Crawl4AI instance with one click:
Recommended specs: at least 4 GB of RAM. Choose "professional-xs" or above when deploying for stable operation.
Crawl4AI is available as a Docker image for easy deployment. You can pull it directly from Docker Hub (recommended) or build it from the repository.
# Pull and run from Docker Hub (choose one):
docker pull unclecode/crawl4ai:basic # Basic crawling features
docker pull unclecode/crawl4ai:all # Full installation (ML, LLM support)
docker pull unclecode/crawl4ai:gpu # GPU-enabled version
# Run the container
docker run -p 11235:11235 unclecode/crawl4ai:basic # Replace 'basic' with your chosen version
# In case you want to set platform to arm64
docker run --platform linux/arm64 -p 11235:11235 unclecode/crawl4ai:basic
# In case to allocate more shared memory for the container
docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic
# Clone the repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
# Build the image
# INSTALL_TYPE options: basic, all
docker build -t crawl4ai:local \
  --build-arg INSTALL_TYPE=basic \
  .
# In case you want to set platform to arm64
# INSTALL_TYPE options: basic, all
docker build -t crawl4ai:local \
  --build-arg INSTALL_TYPE=basic \
  --platform linux/arm64 \
  .
# Run your local build
docker run -p 11235:11235 crawl4ai:local
Quick test (works for both options):
import requests

# Submit a crawl job
response = requests.post(
    "http://localhost:11235/crawl",
    json={"urls": "https://example.com", "priority": 10}
)
task_id = response.json()["task_id"]

# Get results
result = requests.get(f"http://localhost:11235/task/{task_id}")
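Because the crawl runs asynchronously inside the container, the result may not be ready on the first request. The sketch below polls the task endpoint until the job finishes; the "status" and "result" fields are assumptions about the response shape, so check the Docker deployment guide for the exact API:

import time
import requests

# Submit a crawl job (same as above)
response = requests.post(
    "http://localhost:11235/crawl",
    json={"urls": "https://example.com", "priority": 10}
)
task_id = response.json()["task_id"]

# Poll until the task reports completion; "status" and "result" are assumed field names
for _ in range(30):
    data = requests.get(f"http://localhost:11235/task/{task_id}").json()
    if data.get("status") == "completed":
        print(data.get("result"))
        break
    time.sleep(1)
else:
    print("Timed out waiting for the crawl to finish")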
For advanced configuration, environment variables, and usage examples, see our Docker deployment guide.
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://www.nbcnews.com/business")
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        js_code = [
            "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
        ]
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            js_code=js_code,
            css_selector=".wide-tease-item__description",
            bypass_cache=True
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True, proxy="http://127.0.0.1:7890") as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            bypass_cache=True
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
The JsonCssExtractionStrategy allows precise extraction of structured data from web pages using CSS selectors.
import asyncio
import json
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_news_teasers():
    schema = {
        "name": "News Teaser Extractor",
        "baseSelector": ".wide-tease-item__wrapper",
        "fields": [
            {
                "name": "category",
                "selector": ".unibrow span[data-testid='unibrow-text']",
                "type": "text",
            },
            {
                "name": "headline",
                "selector": ".wide-tease-item__headline",
                "type": "text",
            },
            {
                "name": "summary",
                "selector": ".wide-tease-item__description",
                "type": "text",
            },
            {
                "name": "time",
                "selector": "[data-testid='wide-tease-date']",
                "type": "text",
            },
            {
                "name": "image",
                "type": "nested",
                "selector": "picture.teasePicture img",
                "fields": [
                    {"name": "src", "type": "attribute", "attribute": "src"},
                    {"name": "alt", "type": "attribute", "attribute": "alt"},
                ],
            },
            {
                "name": "link",
                "selector": "a[href]",
                "type": "attribute",
                "attribute": "href",
            },
        ],
    }

    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            extraction_strategy=extraction_strategy,
            bypass_cache=True,
        )

        assert result.success, "Failed to crawl the page"

        news_teasers = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(news_teasers)} news teasers")
        print(json.dumps(news_teasers[0], indent=2))

if __name__ == "__main__":
    asyncio.run(extract_news_teasers())
For more advanced usage examples, check out the Examples section in our documentation. The next example uses LLMExtractionStrategy to extract structured pricing data with an LLM.
import os
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url='https://openai.com/api/pricing/',
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
                provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'),
                schema=OpenAIModelFee.schema(),
                extraction_type="schema",
                instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
                Do not miss any models in the entire content. One extracted model JSON format should look like this:
                {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
            ),
            bypass_cache=True,
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
Crawl4AI excels at handling complex scenarios, such as crawling multiple pages where content is loaded dynamically via JavaScript. Here's an example of crawling GitHub commits across multiple pages:
import asyncio
import re
from bs4 import BeautifulSoup
from crawl4ai import AsyncWebCrawler

async def crawl_typescript_commits():
    first_commit = ""

    async def on_execution_started(page):
        nonlocal first_commit
        try:
            while True:
                await page.wait_for_selector('li.Box-sc-g0xbh4-0 h4')
                commit = await page.query_selector('li.Box-sc-g0xbh4-0 h4')
                commit = await commit.evaluate('(element) => element.textContent')
                commit = re.sub(r'\s+', '', commit)
                if commit and commit != first_commit:
                    first_commit = commit
                    break
                await asyncio.sleep(0.5)
        except Exception as e:
            print(f"Warning: New content didn't appear after JavaScript execution: {e}")

    async with AsyncWebCrawler(verbose=True) as crawler:
        crawler.crawler_strategy.set_hook('on_execution_started', on_execution_started)

        url = "https://github.com/microsoft/TypeScript/commits/main"
        session_id = "typescript_commits_session"
        all_commits = []

        js_next_page = """
        const button = document.querySelector('a[data-testid="pagination-next-button"]');
        if (button) button.click();
        """

        for page in range(3):  # Crawl 3 pages
            result = await crawler.arun(
                url=url,
                session_id=session_id,
                css_selector="li.Box-sc-g0xbh4-0",
                js=js_next_page if page > 0 else None,
                bypass_cache=True,
                js_only=page > 0
            )

            assert result.success, f"Failed to crawl page {page + 1}"

            soup = BeautifulSoup(result.cleaned_html, 'html.parser')
            commits = soup.select("li")
            all_commits.extend(commits)

            print(f"Page {page + 1}: Found {len(commits)} commits")

        await crawler.crawler_strategy.kill_session(session_id)
        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")

if __name__ == "__main__":
    asyncio.run(crawl_typescript_commits())
This example demonstrates Crawl4AI's ability to handle complex scenarios where content is loaded asynchronously. It crawls multiple pages of GitHub commits, executes JavaScript to load new content, and uses a custom hook to ensure the data has loaded before proceeding.
For more advanced usage examples, check out the Examples section in our documentation.
Crawl4AI is designed with speed as a primary focus. Our goal is to provide the fastest possible response with high-quality data extraction, minimizing the layers of abstraction between the data and the user.
We ran a speed comparison between Crawl4AI and Firecrawl, a paid service. The results demonstrate Crawl4AI's superior performance:
Firecrawl:
Time taken: 7.02 seconds
Content length: 42074 characters
Images found: 49
Crawl4AI (simple crawl):
Time taken: 1.60 seconds
Content length: 18238 characters
Images found: 49
Crawl4AI (with JavaScript execution):
Time taken: 4.64 seconds
Content length: 40869 characters
Images found: 89
As the numbers show, Crawl4AI significantly outperforms Firecrawl: the simple crawl is more than 4x faster, and even with JavaScript execution it finishes sooner while extracting more content and images.
You can find the full comparison code in our repository at docs/examples/crawl4ai_vs_firecrawl.py.
For detailed documentation, including installation instructions, advanced features, and the API reference, visit our documentation website.
For details about our development plans and upcoming features, check out our roadmap.
We welcome contributions from the open-source community. Check out our contribution guidelines for more information.
Crawl4AI is released under the Apache 2.0 License.
For questions, suggestions, or feedback, feel free to reach out.
Happy crawling!
Our mission is to unlock the untapped potential of personal and enterprise data in the digital age. In today's world, individuals and organizations generate vast amounts of valuable digital footprints, yet this data largely remains uncapitalized as a true asset.
Our open-source solution empowers developers and innovators to build tools for data extraction and structuring, laying the foundation for a new era of data ownership. By transforming personal and enterprise data into structured, tradable assets, we create opportunities for individuals to capitalize on their digital footprints and for organizations to unlock the value of their collective knowledge.
This democratization of data is the first step toward a shared data economy, where willing participation in data sharing drives AI advancement while ensuring the benefits flow back to data creators. Through this approach, we are building a future where AI development is powered by authentic human knowledge rather than synthetic alternatives.
To learn more about our vision, the opportunities and challenges ahead, and our solutions, read our full mission statement.