Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications.
Highlights include unified file crawling, downloading, and tracking in CrawlResult; support for srcset, picture, and responsive image formats; and support for file:// paths and raw HTML (raw:). Try these features out and visit our documentation website for details.
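To illustrate the file:// and raw: support mentioned above, here is a minimal, hedged sketch (the local file path is hypothetical, and the exact prefix behavior may depend on the Crawl4AI version you have installed):

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        # Crawl a local HTML file via a file:// URL (the path is hypothetical)
        local_result = await crawler.arun(url="file:///tmp/sample.html")
        print(local_result.markdown[:200])

        # Crawl raw HTML passed inline with the raw: prefix
        raw_result = await crawler.arun(url="raw:<html><body><h1>Hello</h1></body></html>")
        print(raw_result.markdown)

if __name__ == "__main__":
    asyncio.run(main())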
Crawl4AI offers flexible installation options to suit a variety of use cases. You can install it as a Python package or use Docker.
Choose the installation option that best fits your needs:
For basic web crawling and scraping tasks:
pip install crawl4ai
By default, this installs the asynchronous version of Crawl4AI, which uses Playwright for web crawling.
Note: When you install Crawl4AI, the setup script should automatically install and configure Playwright. If you run into any Playwright-related errors, however, you can install it manually using one of the following methods:
Through the command line:
playwright install
If that does not work, try this more specific command:
python -m playwright install chromium
The second method has proven more reliable in some cases.
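If you want to confirm that Playwright's browser launches correctly before running a crawl, a minimal sanity check using Playwright's own Python API (assuming Chromium was installed by one of the commands above) could look like this:

from playwright.sync_api import sync_playwright

# Launch the Chromium build installed by Playwright, open a blank page, and close it.
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("about:blank")
    print("Playwright Chromium launched successfully")
    browser.close()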
If you need the synchronous version, which uses Selenium:
pip install crawl4ai[sync]
對於計劃修改原始程式碼的貢獻者:
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .
Deploy your own Crawl4AI instance with one click:
Recommended specs: at least 4 GB of RAM. Choose the "professional-xs" tier or higher when deploying for stable operation.
Crawl4AI is available as a Docker image for easy deployment. You can pull it directly from Docker Hub (recommended) or build it from the repository.
# Pull and run from Docker Hub (choose one):
docker pull unclecode/crawl4ai:basic # Basic crawling features
docker pull unclecode/crawl4ai:all # Full installation (ML, LLM support)
docker pull unclecode/crawl4ai:gpu # GPU-enabled version
# Run the container
docker run -p 11235:11235 unclecode/crawl4ai:basic # Replace 'basic' with your chosen version
# In case you want to set platform to arm64
docker run --platform linux/arm64 -p 11235:11235 unclecode/crawl4ai:basic
# In case to allocate more shared memory for the container
docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic
# Clone the repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
# Build the image
# INSTALL_TYPE options: basic, all
docker build -t crawl4ai:local \
  --build-arg INSTALL_TYPE=basic \
  .
# In case you want to set platform to arm64
# INSTALL_TYPE options: basic, all
docker build -t crawl4ai:local \
  --build-arg INSTALL_TYPE=basic \
  --platform linux/arm64 \
  .
# Run your local build
docker run -p 11235:11235 crawl4ai:local
Quick test (works for both options):
import requests

# Submit a crawl job
response = requests.post(
    "http://localhost:11235/crawl",
    json={"urls": "https://example.com", "priority": 10}
)
task_id = response.json()["task_id"]

# Get results
result = requests.get(f"http://localhost:11235/task/{task_id}")
For advanced configuration, environment variables, and usage examples, see our Docker deployment guide.
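The quick test above returns a task ID immediately; to read the crawl output you poll the task endpoint until the job finishes. Below is a minimal sketch, assuming the task response carries a status field that becomes "completed" and a result payload; field names may vary by server version, so check the Docker deployment guide for the exact schema:

import time
import requests

task_id = "..."  # the task_id returned by the POST /crawl call above

# Poll the task endpoint until the crawl finishes (assumed response shape).
while True:
    data = requests.get(f"http://localhost:11235/task/{task_id}").json()
    if data.get("status") == "completed":
        print(data.get("result"))
        break
    time.sleep(1)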
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://www.nbcnews.com/business")
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        js_code = [
            "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
        ]
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            js_code=js_code,
            css_selector=".wide-tease-item__description",
            bypass_cache=True
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True, proxy="http://127.0.0.1:7890") as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            bypass_cache=True
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
The JsonCssExtractionStrategy allows precise extraction of structured data from web pages using CSS selectors.
import asyncio
import json
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_news_teasers():
    schema = {
        "name": "News Teaser Extractor",
        "baseSelector": ".wide-tease-item__wrapper",
        "fields": [
            {
                "name": "category",
                "selector": ".unibrow span[data-testid='unibrow-text']",
                "type": "text",
            },
            {
                "name": "headline",
                "selector": ".wide-tease-item__headline",
                "type": "text",
            },
            {
                "name": "summary",
                "selector": ".wide-tease-item__description",
                "type": "text",
            },
            {
                "name": "time",
                "selector": "[data-testid='wide-tease-date']",
                "type": "text",
            },
            {
                "name": "image",
                "type": "nested",
                "selector": "picture.teasePicture img",
                "fields": [
                    {"name": "src", "type": "attribute", "attribute": "src"},
                    {"name": "alt", "type": "attribute", "attribute": "alt"},
                ],
            },
            {
                "name": "link",
                "selector": "a[href]",
                "type": "attribute",
                "attribute": "href",
            },
        ],
    }

    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            extraction_strategy=extraction_strategy,
            bypass_cache=True,
        )

        assert result.success, "Failed to crawl the page"

        news_teasers = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(news_teasers)} news teasers")
        print(json.dumps(news_teasers[0], indent=2))

if __name__ == "__main__":
    asyncio.run(extract_news_teasers())
For more advanced usage examples, check out the Examples section of the documentation.
import os
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url='https://openai.com/api/pricing/',
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
                provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'),
                schema=OpenAIModelFee.schema(),
                extraction_type="schema",
                instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
                Do not miss any models in the entire content. One extracted model JSON format should look like this:
                {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
            ),
            bypass_cache=True,
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
Crawl4AI excels at complex scenarios, such as crawling multiple pages whose content is loaded dynamically via JavaScript. Here is an example of crawling GitHub commits across several pages:
import asyncio
import re
from bs4 import BeautifulSoup
from crawl4ai import AsyncWebCrawler

async def crawl_typescript_commits():
    first_commit = ""

    async def on_execution_started(page):
        nonlocal first_commit
        try:
            while True:
                await page.wait_for_selector('li.Box-sc-g0xbh4-0 h4')
                commit = await page.query_selector('li.Box-sc-g0xbh4-0 h4')
                commit = await commit.evaluate('(element) => element.textContent')
                commit = re.sub(r'\s+', '', commit)
                if commit and commit != first_commit:
                    first_commit = commit
                    break
                await asyncio.sleep(0.5)
        except Exception as e:
            print(f"Warning: New content didn't appear after JavaScript execution: {e}")

    async with AsyncWebCrawler(verbose=True) as crawler:
        crawler.crawler_strategy.set_hook('on_execution_started', on_execution_started)

        url = "https://github.com/microsoft/TypeScript/commits/main"
        session_id = "typescript_commits_session"
        all_commits = []

        js_next_page = """
        const button = document.querySelector('a[data-testid="pagination-next-button"]');
        if (button) button.click();
        """

        for page in range(3):  # Crawl 3 pages
            result = await crawler.arun(
                url=url,
                session_id=session_id,
                css_selector="li.Box-sc-g0xbh4-0",
                js=js_next_page if page > 0 else None,
                bypass_cache=True,
                js_only=page > 0
            )

            assert result.success, f"Failed to crawl page {page + 1}"

            soup = BeautifulSoup(result.cleaned_html, 'html.parser')
            commits = soup.select("li")
            all_commits.extend(commits)

            print(f"Page {page + 1}: Found {len(commits)} commits")

        await crawler.crawler_strategy.kill_session(session_id)
        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")

if __name__ == "__main__":
    asyncio.run(crawl_typescript_commits())
This example demonstrates Crawl4AI's ability to handle complex scenarios where content loads asynchronously. It crawls multiple pages of GitHub commits, executes JavaScript to load new content, and uses a custom hook to ensure the data has loaded before proceeding.
For more advanced usage examples, check out the Examples section of the documentation.
Crawl4AI is designed with speed as a primary concern. Our goal is to provide the fastest possible response with high-quality data extraction, minimizing the layers of abstraction between the data and the user.
We ran a speed comparison between Crawl4AI and Firecrawl, a paid service. The results demonstrate Crawl4AI's superior performance:
Firecrawl:
Time taken: 7.02 seconds
Content length: 42074 characters
Images found: 49
Crawl4AI (simple crawl):
Time taken: 1.60 seconds
Content length: 18238 characters
Images found: 49
Crawl4AI (with JavaScript execution):
Time taken: 4.64 seconds
Content length: 40869 characters
Images found: 89
As you can see, Crawl4AI significantly outperforms Firecrawl.
You can find the full comparison code in our repository at docs/examples/crawl4ai_vs_firecrawl.py.
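As a rough sketch of how a timing like the one above can be reproduced locally (the URL and the use of time.perf_counter are illustrative choices, not part of the published comparison script):

import asyncio
import time
from crawl4ai import AsyncWebCrawler

async def timed_crawl(url: str):
    async with AsyncWebCrawler(verbose=False) as crawler:
        start = time.perf_counter()
        result = await crawler.arun(url=url, bypass_cache=True)
        elapsed = time.perf_counter() - start
        # Report the same kind of numbers shown in the comparison above
        print(f"Time taken: {elapsed:.2f} seconds")
        print(f"Content length: {len(result.markdown)} characters")

if __name__ == "__main__":
    asyncio.run(timed_crawl("https://www.nbcnews.com/business"))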
For detailed documentation, including installation instructions, advanced features, and the API reference, visit our documentation website.
For details about our development plans and upcoming features, check out our roadmap.
We welcome contributions from the open-source community. Check out our contribution guidelines for more information.
Crawl4AI is released under the Apache 2.0 license.
For questions, suggestions, or feedback, feel free to reach out:
Happy crawling!
Our mission is to unlock the untapped potential of personal and enterprise data in the digital age. In today's world, individuals and organizations generate vast amounts of valuable digital footprints, yet this data largely remains uncapitalized as a true asset.
Our open-source solution empowers developers and innovators to build data extraction and structuring tools, laying the foundation for a new era of data ownership. By turning personal and enterprise data into structured, tradeable assets, we are creating opportunities for individuals to capitalize on their digital footprints and for organizations to unlock the value of their collective knowledge.
This democratization of data is the first step toward a shared data economy, in which willing participation in data sharing drives AI advancement while ensuring the benefits flow back to the data creators. Through this approach, we are building a future where AI development is powered by authentic human knowledge rather than synthetic alternatives.
To learn more about our vision, the opportunities and challenges involved, and the path forward, see our full mission statement.