crawlee python

Python

0.4.5

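A web scraping and browser automation library

Crawlee covers your crawling and scraping end-to-end and helps you build reliable scrapers. Fast.

Crawlee for Python is open to early adopters!

Even with the default configuration, your crawlers will appear almost human-like and fly under the radar of modern bot protections. Crawlee gives you the tools to crawl the web for links, scrape data, and store it persistently in machine-readable formats, without having to worry about the technical details. Thanks to its rich configuration options, you can tweak almost any aspect of Crawlee if the default settings do not meet your project's needs.

View the full documentation, guides, and examples on the Crawlee project website.

We also have a TypeScript implementation of Crawlee, which you can explore and use in your projects. Visit our GitHub repository for more information on Crawlee for JS/TS.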

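Installation

We recommend visiting the Introduction tutorial in the Crawlee documentation for more information.

Crawlee is available as the crawlee PyPI package. The core functionality is included in the base package, and additional features are available as optional extras to minimize package size and dependencies. To install Crawlee with all features, run the following command: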

pip install 'crawlee[all]'

Then, install the Playwright dependencies:

playwright install

Verify that Crawlee is installed successfully:

python -c 'import crawlee; print(crawlee.__version__)'
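
The same check can be run from a script that does not fail when the package is missing. The following is a minimal sketch using the standard library's importlib.metadata; the helper name installed_version is hypothetical and is not part of Crawlee.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed.

    `installed_version` is a hypothetical helper for illustration only.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == '__main__':
    found = installed_version('crawlee')
    print(f'crawlee {found}' if found else 'crawlee is not installed')
```

Reading the version from package metadata avoids importing the package itself, which can be convenient in setup scripts.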

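For detailed installation instructions, see the Setup documentation page.

With Crawlee CLI

The quickest way to get started with Crawlee is to use the Crawlee CLI and select one of the prepared templates. First, make sure you have Pipx installed: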

pipx --help

Then, run the CLI and choose from the available templates:

pipx run crawlee create my-crawler

If you already have crawlee installed, you can start it by running:

crawlee create my-crawler

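Examples

Here are some practical examples to help you get started with the different types of crawlers in Crawlee. Each example demonstrates how to set up and run a crawler for a specific use case, whether you need to handle simple HTML pages or interact with JavaScript-heavy sites. A crawler run creates a storage/ directory in your current working directory.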
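
After a run, the stored results can be inspected with a few lines of standard-library code. The sketch below assumes the default dataset is written as one JSON file per item under storage/datasets/default/, and that bookkeeping files start with a double underscore. Both are assumptions about the on-disk layout that the text above does not state, so adjust the path to what you find in your own storage/ directory. The helper name load_items is hypothetical.

```python
import json
from pathlib import Path


def load_items(dataset_dir: Path) -> list[dict]:
    """Load every JSON object stored as a separate file in `dataset_dir`.

    `load_items` is a hypothetical helper, not part of Crawlee.
    """
    items = []
    for path in sorted(dataset_dir.glob('*.json')):
        # Assumption: files such as __metadata__.json are bookkeeping, not items.
        if path.name.startswith('__'):
            continue
        with path.open(encoding='utf-8') as f:
            record = json.load(f)
        if isinstance(record, dict):
            items.append(record)
    return items


if __name__ == '__main__':
    # Assumed default location; not confirmed by the text above.
    for item in load_items(Path('storage') / 'datasets' / 'default'):
        print(item.get('url'), item.get('title'))
```

If the directory does not exist yet, the function simply returns an empty list.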

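BeautifulSoupCrawler

The BeautifulSoupCrawler downloads web pages using an HTTP library and provides HTML-parsed content to the user. By default, it uses HttpxHttpClient for HTTP communication and BeautifulSoup for parsing HTML. It is ideal for projects that need to extract data from HTML content efficiently. Because it does not use a browser, this crawler performs very well. However, if you need to execute client-side JavaScript to get your content, it is not enough, and you will need to use the PlaywrightCrawler. Also, if you want to use this crawler, make sure you install crawlee with the beautifulsoup extra.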

import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler(
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

        # Enqueue all links found on the page.
        await context.enqueue_links()

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
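
To make the link-enqueueing step in the example above less opaque, here is a conceptual sketch of link discovery using only the standard library: collect href values, resolve them against the page URL, and keep those on the same host. It illustrates the general idea and is not Crawlee's implementation. The names LinkCollector and same_host_links are hypothetical, and the same-host filter is an assumption made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the raw href value of every <a> tag (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.hrefs.append(value)


def same_host_links(base_url: str, html: str) -> list[str]:
    """Return unique absolute http(s) links on the same host as `base_url`."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    seen = set()
    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ('http', 'https'):
            continue  # skip mailto:, javascript:, and similar
        if parsed.netloc != base_host or absolute in seen:
            continue  # skip other hosts and duplicates
        seen.add(absolute)
        links.append(absolute)
    return links


if __name__ == '__main__':
    page = '<a href="/docs">Docs</a> <a href="https://example.com/x">External</a>'
    print(same_host_links('https://crawlee.dev', page))
```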

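PlaywrightCrawler

The PlaywrightCrawler uses a headless browser to download web pages and provides an API for data extraction. It is built on Playwright, an automation library designed for managing headless browsers. It excels at retrieving web pages that rely on client-side JavaScript to generate content, and at tasks that require interaction with JavaScript-driven content. For scenarios where JavaScript execution is unnecessary or higher performance is required, consider using the BeautifulSoupCrawler. Also, if you want to use this crawler, make sure you install crawlee with the playwright extra.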

import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': await context.page.title(),
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

        # Enqueue all links found on the page.
        await context.enqueue_links()

    # Run the crawler with the initial list of requests.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
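
Both examples cap the crawl with max_requests_per_crawl. As a rough mental model of what such a cap does, the sketch below walks an in-memory link graph breadth-first and stops once a request budget is used up. It is a toy illustration under stated assumptions, not Crawlee's scheduler: the function bounded_crawl, the links dictionary, and the breadth-first order are all hypothetical.

```python
from collections import deque


def bounded_crawl(start: str, links: dict[str, list[str]], max_requests: int) -> list[str]:
    """Visit pages breadth-first, stopping after `max_requests` pages.

    `links` maps each page to the pages it links to (a stand-in for real HTTP).
    """
    queue = deque([start])
    seen = {start}  # every URL ever enqueued, to avoid duplicates
    visited = []    # URLs actually processed, in order
    while queue and len(visited) < max_requests:
        url = queue.popleft()
        visited.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return visited


if __name__ == '__main__':
    site = {'/': ['/a', '/b'], '/a': ['/c', '/'], '/b': ['/c'], '/c': []}
    print(bounded_crawl('/', site, max_requests=3))
```

With a budget of 3, the toy crawl stops before reaching /c even though it has already been discovered and queued, which is why the comment in the examples suggests removing or raising the limit to crawl all links.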