Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications.

Recent highlights include:
- File crawling, downloading, and tracking integrated into CrawlResult
- Support for srcset, picture, and responsive image formats
- Support for file:// paths and raw HTML inputs (raw:)

Play around with these features, or visit our documentation website to learn more.
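As a quick sketch of the file:// and raw: inputs listed above, the snippet below crawls an inline HTML string and a local file. The HTML content and the /tmp/example.html path are placeholders, and this assumes a Crawl4AI release that supports these URL prefixes:

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Crawl an inline HTML string via the raw: prefix (placeholder markup)
        raw_result = await crawler.arun(url="raw:<html><body><h1>Hello</h1></body></html>")
        print(raw_result.markdown)

        # Crawl a local HTML file via a file:// path (placeholder path)
        file_result = await crawler.arun(url="file:///tmp/example.html")
        print(file_result.markdown)

if __name__ == "__main__":
    asyncio.run(main())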
Crawl4AI offers flexible installation options to suit various use cases. You can install it as a Python package or use Docker.
Choose the installation option that best fits your needs:
For basic web crawling and scraping tasks:
pip install crawl4ai
By default, this installs the asynchronous version of Crawl4AI, which uses Playwright for web crawling.
Note: When you install Crawl4AI, the setup script should automatically install and set up Playwright. However, if you run into any Playwright-related errors, you can install it manually using one of the following methods:
Via the command line:
playwright install
If the above doesn't work, try this more specific command:
python -m playwright install chromium
The second method has proven more reliable in some cases.
If you need the synchronous version, which uses Selenium:
pip install crawl4ai[sync]
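If you do install the sync extra, basic usage might look like the sketch below. This assumes the legacy Selenium-based WebCrawler class exposed by the sync install; exact class and method names may differ between releases:

from crawl4ai import WebCrawler  # synchronous, Selenium-based crawler (sync extra)

crawler = WebCrawler()
crawler.warmup()  # prepare the browser driver before the first crawl

result = crawler.run(url="https://www.nbcnews.com/business")
print(result.markdown)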
For contributors who plan to modify the source code:
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .
Deploy your own Crawl4AI instance with one click:
Recommended specs: at least 4 GB of RAM. Choose "professional-xs" or above when deploying for stable operation.
Crawl4AI is available as a Docker image for easy deployment. You can pull it directly from Docker Hub (recommended) or build it from the repository.
# Pull and run from Docker Hub (choose one):
docker pull unclecode/crawl4ai:basic # Basic crawling features
docker pull unclecode/crawl4ai:all # Full installation (ML, LLM support)
docker pull unclecode/crawl4ai:gpu # GPU-enabled version
# Run the container
docker run -p 11235:11235 unclecode/crawl4ai:basic # Replace 'basic' with your chosen version
# In case you want to set platform to arm64
docker run --platform linux/arm64 -p 11235:11235 unclecode/crawl4ai:basic
# In case to allocate more shared memory for the container
docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic
# Clone the repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
# Build the image
# INSTALL_TYPE options: basic, all
docker build -t crawl4ai:local \
  --build-arg INSTALL_TYPE=basic \
  .
# In case you want to set platform to arm64
# INSTALL_TYPE options: basic, all
docker build -t crawl4ai:local \
  --build-arg INSTALL_TYPE=basic \
  --platform linux/arm64 \
  .
# Run your local build
docker run -p 11235:11235 crawl4ai:local
Quick test (works for both options):
import requests

# Submit a crawl job
response = requests.post(
    "http://localhost:11235/crawl",
    json={"urls": "https://example.com", "priority": 10}
)
task_id = response.json()["task_id"]

# Get results
result = requests.get(f"http://localhost:11235/task/{task_id}")
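Because the crawl runs asynchronously inside the container, the result may not be ready on the first request. The sketch below polls the task endpoint until the job finishes; the "status" and "result" fields are assumptions about the response shape, so check the Docker deployment guide for the exact API:

import time
import requests

# Submit a crawl job (same as above)
response = requests.post(
    "http://localhost:11235/crawl",
    json={"urls": "https://example.com", "priority": 10}
)
task_id = response.json()["task_id"]

# Poll until the task reports completion; "status" and "result" are assumed field names
for _ in range(30):
    data = requests.get(f"http://localhost:11235/task/{task_id}").json()
    if data.get("status") == "completed":
        print(data.get("result"))
        break
    time.sleep(1)
else:
    print("Timed out waiting for the crawl to finish")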
For advanced configuration, environment variables, and usage examples, see our Docker deployment guide.
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://www.nbcnews.com/business")
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        js_code = [
            "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
        ]
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            js_code=js_code,
            css_selector=".wide-tease-item__description",
            bypass_cache=True
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True, proxy="http://127.0.0.1:7890") as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            bypass_cache=True
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
The JsonCssExtractionStrategy allows precise extraction of structured data from web pages using CSS selectors.
import asyncio
import json
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_news_teasers():
    schema = {
        "name": "News Teaser Extractor",
        "baseSelector": ".wide-tease-item__wrapper",
        "fields": [
            {
                "name": "category",
                "selector": ".unibrow span[data-testid='unibrow-text']",
                "type": "text",
            },
            {
                "name": "headline",
                "selector": ".wide-tease-item__headline",
                "type": "text",
            },
            {
                "name": "summary",
                "selector": ".wide-tease-item__description",
                "type": "text",
            },
            {
                "name": "time",
                "selector": "[data-testid='wide-tease-date']",
                "type": "text",
            },
            {
                "name": "image",
                "type": "nested",
                "selector": "picture.teasePicture img",
                "fields": [
                    {"name": "src", "type": "attribute", "attribute": "src"},
                    {"name": "alt", "type": "attribute", "attribute": "alt"},
                ],
            },
            {
                "name": "link",
                "selector": "a[href]",
                "type": "attribute",
                "attribute": "href",
            },
        ],
    }

    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            extraction_strategy=extraction_strategy,
            bypass_cache=True,
        )

        assert result.success, "Failed to crawl the page"

        news_teasers = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(news_teasers)} news teasers")
        print(json.dumps(news_teasers[0], indent=2))

if __name__ == "__main__":
    asyncio.run(extract_news_teasers())
For more advanced usage examples, check out the Examples section in our documentation. The next example uses LLMExtractionStrategy to extract structured pricing data with an LLM.
import os
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url='https://openai.com/api/pricing/',
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
                provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'),
                schema=OpenAIModelFee.schema(),
                extraction_type="schema",
                instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
                Do not miss any models in the entire content. One extracted model JSON format should look like this:
                {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
            ),
            bypass_cache=True,
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
Crawl4AI excels at handling complex scenarios, such as crawling multiple pages where content is loaded dynamically via JavaScript. Here's an example of crawling GitHub commits across multiple pages:
import asyncio
import re
from bs4 import BeautifulSoup
from crawl4ai import AsyncWebCrawler

async def crawl_typescript_commits():
    first_commit = ""

    async def on_execution_started(page):
        nonlocal first_commit
        try:
            while True:
                await page.wait_for_selector('li.Box-sc-g0xbh4-0 h4')
                commit = await page.query_selector('li.Box-sc-g0xbh4-0 h4')
                commit = await commit.evaluate('(element) => element.textContent')
                commit = re.sub(r'\s+', '', commit)
                if commit and commit != first_commit:
                    first_commit = commit
                    break
                await asyncio.sleep(0.5)
        except Exception as e:
            print(f"Warning: New content didn't appear after JavaScript execution: {e}")

    async with AsyncWebCrawler(verbose=True) as crawler:
        crawler.crawler_strategy.set_hook('on_execution_started', on_execution_started)

        url = "https://github.com/microsoft/TypeScript/commits/main"
        session_id = "typescript_commits_session"
        all_commits = []

        js_next_page = """
        const button = document.querySelector('a[data-testid="pagination-next-button"]');
        if (button) button.click();
        """

        for page in range(3):  # Crawl 3 pages
            result = await crawler.arun(
                url=url,
                session_id=session_id,
                css_selector="li.Box-sc-g0xbh4-0",
                js=js_next_page if page > 0 else None,
                bypass_cache=True,
                js_only=page > 0
            )

            assert result.success, f"Failed to crawl page {page + 1}"

            soup = BeautifulSoup(result.cleaned_html, 'html.parser')
            commits = soup.select("li")
            all_commits.extend(commits)

            print(f"Page {page + 1}: Found {len(commits)} commits")

        await crawler.crawler_strategy.kill_session(session_id)
        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")

if __name__ == "__main__":
    asyncio.run(crawl_typescript_commits())
This example demonstrates Crawl4AI's ability to handle complex scenarios where content is loaded asynchronously. It crawls multiple pages of GitHub commits, executes JavaScript to load new content, and uses a custom hook to ensure the data has loaded before proceeding.
For more advanced usage examples, check out the Examples section in our documentation.
Crawl4AI is designed with speed as a primary focus. Our goal is to provide the fastest possible response with high-quality data extraction, minimizing the layers of abstraction between the data and the user.
We ran a speed comparison between Crawl4AI and Firecrawl, a paid service. The results demonstrate Crawl4AI's superior performance:
Firecrawl:
Time taken: 7.02 seconds
Content length: 42074 characters
Images found: 49
Crawl4AI (simple crawl):
Time taken: 1.60 seconds
Content length: 18238 characters
Images found: 49
Crawl4AI (with JavaScript execution):
Time taken: 4.64 seconds
Content length: 40869 characters
Images found: 89
As the numbers show, Crawl4AI significantly outperforms Firecrawl: the simple crawl is more than 4x faster, and even with JavaScript execution it finishes sooner while extracting more content and images.
You can find the full comparison code in our repository at docs/examples/crawl4ai_vs_firecrawl.py.
For detailed documentation, including installation instructions, advanced features, and the API reference, visit our documentation website.
For details about our development plans and upcoming features, check out our roadmap.
We welcome contributions from the open-source community. Check out our contribution guidelines for more information.
Crawl4AI is released under the Apache 2.0 License.
For questions, suggestions, or feedback, feel free to reach out.
Happy crawling!
Our mission is to unlock the untapped potential of personal and enterprise data in the digital age. In today's world, individuals and organizations generate vast amounts of valuable digital footprints, yet this data largely remains uncapitalized as a true asset.
Our open-source solution empowers developers and innovators to build tools for data extraction and structuring, laying the foundation for a new era of data ownership. By transforming personal and enterprise data into structured, tradable assets, we create opportunities for individuals to capitalize on their digital footprints and for organizations to unlock the value of their collective knowledge.
This democratization of data is the first step toward a shared data economy, where willing participation in data sharing drives AI advancement while ensuring the benefits flow back to data creators. Through this approach, we are building a future where AI development is powered by authentic human knowledge rather than synthetic alternatives.
To learn more about our vision, the opportunities and challenges ahead, and our solutions, read our full mission statement.