Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications.
Highlights include unified file crawling, downloading, and tracking in CrawlResult; support for srcset, picture, and responsive image formats; and support for file:// paths and raw HTML (raw:). Try these features out and visit our documentation website for details.
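To illustrate the file:// and raw: support mentioned above, here is a minimal, hedged sketch (the local file path is hypothetical, and the exact prefix behavior may depend on the Crawl4AI version you have installed):

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        # Crawl a local HTML file via a file:// URL (the path is hypothetical)
        local_result = await crawler.arun(url="file:///tmp/sample.html")
        print(local_result.markdown[:200])

        # Crawl raw HTML passed inline with the raw: prefix
        raw_result = await crawler.arun(url="raw:<html><body><h1>Hello</h1></body></html>")
        print(raw_result.markdown)

if __name__ == "__main__":
    asyncio.run(main())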
Crawl4AI offers flexible installation options to suit a variety of use cases. You can install it as a Python package or use Docker.
Choose the installation option that best fits your needs:
For basic web crawling and scraping tasks:
pip install crawl4ai
By default, this installs the asynchronous version of Crawl4AI, which uses Playwright for web crawling.
Note: When you install Crawl4AI, the setup script should automatically install and configure Playwright. If you run into any Playwright-related errors, however, you can install it manually using one of the following methods:
Through the command line:
playwright install
If that does not work, try this more specific command:
python -m playwright install chromium
The second method has proven more reliable in some cases.
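If you want to confirm that Playwright's browser launches correctly before running a crawl, a minimal sanity check using Playwright's own Python API (assuming Chromium was installed by one of the commands above) could look like this:

from playwright.sync_api import sync_playwright

# Launch the Chromium build installed by Playwright, open a blank page, and close it.
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("about:blank")
    print("Playwright Chromium launched successfully")
    browser.close()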
If you need the synchronous version, which uses Selenium:
pip install crawl4ai[sync]
對於計劃修改原始程式碼的貢獻者:
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .
Deploy your own Crawl4AI instance with one click:
Recommended specs: at least 4 GB of RAM. Choose the "professional-xs" tier or higher when deploying for stable operation.
Crawl4AI is available as a Docker image for easy deployment. You can pull it directly from Docker Hub (recommended) or build it from the repository.
# Pull and run from Docker Hub (choose one):
docker pull unclecode/crawl4ai:basic # Basic crawling features
docker pull unclecode/crawl4ai:all # Full installation (ML, LLM support)
docker pull unclecode/crawl4ai:gpu # GPU-enabled version
# Run the container
docker run -p 11235:11235 unclecode/crawl4ai:basic # Replace 'basic' with your chosen version
# In case you want to set platform to arm64
docker run --platform linux/arm64 -p 11235:11235 unclecode/crawl4ai:basic
# In case to allocate more shared memory for the container
docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic
# Clone the repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
# Build the image
# INSTALL_TYPE options: basic, all
docker build -t crawl4ai:local \
  --build-arg INSTALL_TYPE=basic \
  .
# In case you want to set platform to arm64
# INSTALL_TYPE options: basic, all
docker build -t crawl4ai:local \
  --build-arg INSTALL_TYPE=basic \
  --platform linux/arm64 \
  .
# Run your local build
docker run -p 11235:11235 crawl4ai:local
Quick test (works for both options):
import requests

# Submit a crawl job
response = requests.post(
    "http://localhost:11235/crawl",
    json={"urls": "https://example.com", "priority": 10}
)
task_id = response.json()["task_id"]

# Get results
result = requests.get(f"http://localhost:11235/task/{task_id}")
For advanced configuration, environment variables, and usage examples, see our Docker deployment guide.
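The quick test above returns a task ID immediately; to read the crawl output you poll the task endpoint until the job finishes. Below is a minimal sketch, assuming the task response carries a status field that becomes "completed" and a result payload; field names may vary by server version, so check the Docker deployment guide for the exact schema:

import time
import requests

task_id = "..."  # the task_id returned by the POST /crawl call above

# Poll the task endpoint until the crawl finishes (assumed response shape).
while True:
    data = requests.get(f"http://localhost:11235/task/{task_id}").json()
    if data.get("status") == "completed":
        print(data.get("result"))
        break
    time.sleep(1)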
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://www.nbcnews.com/business")
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        js_code = [
            "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
        ]
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            js_code=js_code,
            css_selector=".wide-tease-item__description",
            bypass_cache=True
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True, proxy="http://127.0.0.1:7890") as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            bypass_cache=True
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
The JsonCssExtractionStrategy allows precise extraction of structured data from web pages using CSS selectors.
import asyncio
import json
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_news_teasers():
    schema = {
        "name": "News Teaser Extractor",
        "baseSelector": ".wide-tease-item__wrapper",
        "fields": [
            {
                "name": "category",
                "selector": ".unibrow span[data-testid='unibrow-text']",
                "type": "text",
            },
            {
                "name": "headline",
                "selector": ".wide-tease-item__headline",
                "type": "text",
            },
            {
                "name": "summary",
                "selector": ".wide-tease-item__description",
                "type": "text",
            },
            {
                "name": "time",
                "selector": "[data-testid='wide-tease-date']",
                "type": "text",
            },
            {
                "name": "image",
                "type": "nested",
                "selector": "picture.teasePicture img",
                "fields": [
                    {"name": "src", "type": "attribute", "attribute": "src"},
                    {"name": "alt", "type": "attribute", "attribute": "alt"},
                ],
            },
            {
                "name": "link",
                "selector": "a[href]",
                "type": "attribute",
                "attribute": "href",
            },
        ],
    }

    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            extraction_strategy=extraction_strategy,
            bypass_cache=True,
        )

        assert result.success, "Failed to crawl the page"

        news_teasers = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(news_teasers)} news teasers")
        print(json.dumps(news_teasers[0], indent=2))

if __name__ == "__main__":
    asyncio.run(extract_news_teasers())
For more advanced usage examples, check out the Examples section of the documentation.
import os
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url='https://openai.com/api/pricing/',
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
                provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'),
                schema=OpenAIModelFee.schema(),
                extraction_type="schema",
                instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
                Do not miss any models in the entire content. One extracted model JSON format should look like this:
                {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
            ),
            bypass_cache=True,
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
Crawl4AI excels at complex scenarios, such as crawling multiple pages whose content is loaded dynamically via JavaScript. Here is an example of crawling GitHub commits across several pages:
import asyncio
import re
from bs4 import BeautifulSoup
from crawl4ai import AsyncWebCrawler

async def crawl_typescript_commits():
    first_commit = ""

    async def on_execution_started(page):
        nonlocal first_commit
        try:
            while True:
                await page.wait_for_selector('li.Box-sc-g0xbh4-0 h4')
                commit = await page.query_selector('li.Box-sc-g0xbh4-0 h4')
                commit = await commit.evaluate('(element) => element.textContent')
                commit = re.sub(r'\s+', '', commit)
                if commit and commit != first_commit:
                    first_commit = commit
                    break
                await asyncio.sleep(0.5)
        except Exception as e:
            print(f"Warning: New content didn't appear after JavaScript execution: {e}")

    async with AsyncWebCrawler(verbose=True) as crawler:
        crawler.crawler_strategy.set_hook('on_execution_started', on_execution_started)

        url = "https://github.com/microsoft/TypeScript/commits/main"
        session_id = "typescript_commits_session"
        all_commits = []

        js_next_page = """
        const button = document.querySelector('a[data-testid="pagination-next-button"]');
        if (button) button.click();
        """

        for page in range(3):  # Crawl 3 pages
            result = await crawler.arun(
                url=url,
                session_id=session_id,
                css_selector="li.Box-sc-g0xbh4-0",
                js=js_next_page if page > 0 else None,
                bypass_cache=True,
                js_only=page > 0
            )

            assert result.success, f"Failed to crawl page {page + 1}"

            soup = BeautifulSoup(result.cleaned_html, 'html.parser')
            commits = soup.select("li")
            all_commits.extend(commits)

            print(f"Page {page + 1}: Found {len(commits)} commits")

        await crawler.crawler_strategy.kill_session(session_id)
        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")

if __name__ == "__main__":
    asyncio.run(crawl_typescript_commits())
This example demonstrates Crawl4AI's ability to handle complex scenarios where content loads asynchronously. It crawls multiple pages of GitHub commits, executes JavaScript to load new content, and uses a custom hook to ensure the data has loaded before proceeding.
For more advanced usage examples, check out the Examples section of the documentation.
Crawl4AI is designed with speed as a primary concern. Our goal is to provide the fastest possible response with high-quality data extraction, minimizing the layers of abstraction between the data and the user.
We ran a speed comparison between Crawl4AI and Firecrawl, a paid service. The results demonstrate Crawl4AI's superior performance:
Firecrawl:
Time taken: 7.02 seconds
Content length: 42074 characters
Images found: 49
Crawl4AI (simple crawl):
Time taken: 1.60 seconds
Content length: 18238 characters
Images found: 49
Crawl4AI (with JavaScript execution):
Time taken: 4.64 seconds
Content length: 40869 characters
Images found: 89
As you can see, Crawl4AI significantly outperforms Firecrawl.
You can find the full comparison code in our repository at docs/examples/crawl4ai_vs_firecrawl.py.
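As a rough sketch of how a timing like the one above can be reproduced locally (the URL and the use of time.perf_counter are illustrative choices, not part of the published comparison script):

import asyncio
import time
from crawl4ai import AsyncWebCrawler

async def timed_crawl(url: str):
    async with AsyncWebCrawler(verbose=False) as crawler:
        start = time.perf_counter()
        result = await crawler.arun(url=url, bypass_cache=True)
        elapsed = time.perf_counter() - start
        # Report the same kind of numbers shown in the comparison above
        print(f"Time taken: {elapsed:.2f} seconds")
        print(f"Content length: {len(result.markdown)} characters")

if __name__ == "__main__":
    asyncio.run(timed_crawl("https://www.nbcnews.com/business"))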
For detailed documentation, including installation instructions, advanced features, and the API reference, visit our documentation website.
For details about our development plans and upcoming features, check out our roadmap.
We welcome contributions from the open-source community. Check out our contribution guidelines for more information.
Crawl4AI is released under the Apache 2.0 license.
For questions, suggestions, or feedback, feel free to reach out:
Happy crawling!
Our mission is to unlock the untapped potential of personal and enterprise data in the digital age. In today's world, individuals and organizations generate vast amounts of valuable digital footprints, yet this data largely remains uncapitalized as a true asset.
Our open-source solution empowers developers and innovators to build data extraction and structuring tools, laying the foundation for a new era of data ownership. By turning personal and enterprise data into structured, tradeable assets, we are creating opportunities for individuals to capitalize on their digital footprints and for organizations to unlock the value of their collective knowledge.
This democratization of data is the first step toward a shared data economy, in which willing participation in data sharing drives AI advancement while ensuring the benefits flow back to the data creators. Through this approach, we are building a future where AI development is powered by authentic human knowledge rather than synthetic alternatives.
To learn more about our vision, the opportunities and challenges involved, and the path forward, see our full mission statement.