How to Crawl a Website Using Web Crawler?

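- What can Web Crawler do?
- Web Crawler settings overview
  - Endpoints
    - Create a new job
    - Get sitemap
    - Get the list of aggregate result chunks
    - Get a chunk of the aggregate result
  - Query parameters
- Using Web Crawler in Postman
- Using Web Crawler in Python
  - Getting a list of URLs
  - Getting parsed results
  - Getting HTML results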

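Web Crawler is a built-in feature of our Scraper APIs. It is a tool for discovering target URLs, selecting the relevant content, and delivering it in bulk. It crawls websites in real time and at scale to quickly deliver either all content or only the data you need, based on the criteria you choose.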

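What can Web Crawler do?

Web Crawler can perform three main tasks:

Perform URL discovery;
Crawl all pages on a site;
Index all URLs on a domain.

Use it when you need to crawl through a site and receive parsed data in bulk, or to collect a list of URLs from a specific category or from an entire website.

Web Crawler can return three types of data output: a list of URLs, parsed results, and HTML files. If needed, you can set Web Crawler to upload the results to your cloud storage.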

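Web Crawler settings overview

You can control the crawling scope by adjusting its width and depth with filters. Web Crawler can also use various scraping parameters, such as geo-location and user agent, to increase the success rate of crawling jobs. Most of these scraping parameters depend on the Scraper API you use.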

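Endpoints

To control your crawling job, you need to use different endpoints. You can initiate, stop, and resume a job, get job information, get the list of result chunks, and get the results. Below are the endpoints we use in this crawling tutorial. For more information and output examples, visit our documentation.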

Create a new job

Endpoint: https://ect.oxylabs.io/v1/jobs
Method: POST
Authentication: Basic
Request headers: Content-Type: application/json

Get sitemap

This endpoint delivers the list of URLs found while processing the job.

Endpoint: https://ect.oxylabs.io/v1/jobs/{id}/sitemap
Method: GET
Authentication: Basic

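Get the list of aggregate result chunks

Endpoint: https://ect.oxylabs.io/v1/jobs/{id}/aggregate
Method: GET
Authentication: Basic

The aggregate results can contain a lot of data, so we split them into multiple chunks based on the chunk size you specify. Use this endpoint to get the list of available chunk files.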

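Get a chunk of the aggregate result

Endpoint: https://ect.oxylabs.io/v1/jobs/{id}/aggregate/{chunk}
Method: GET
Authentication: Basic

With this endpoint, you can download a particular chunk of the aggregate result. The contents of the response body depend on the output type you choose.

The result can be one of the following:

An index (a list of URLs)
An aggregate JSON file with all parsed results
An aggregate JSON file with all HTML results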
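
The four endpoints above differ only in the path that follows the job ID, so it can be convenient to build them from one base URL. The following is a minimal sketch; the function names are hypothetical helpers introduced here for illustration and are not part of any Oxylabs client library.

```python
# Base URL taken from the endpoint listings in this article.
BASE_URL = "https://ect.oxylabs.io/v1/jobs"


def create_job_url() -> str:
    """URL for creating a new job (POST)."""
    return BASE_URL


def sitemap_url(job_id: str) -> str:
    """URL for the list of URLs found while processing the job (GET)."""
    return f"{BASE_URL}/{job_id}/sitemap"


def chunk_list_url(job_id: str) -> str:
    """URL for the list of aggregate result chunks (GET)."""
    return f"{BASE_URL}/{job_id}/aggregate"


def chunk_url(job_id: str, chunk: int) -> str:
    """URL for one chunk of the aggregate result (GET)."""
    return f"{BASE_URL}/{job_id}/aggregate/{chunk}"


if __name__ == "__main__":
    job_id = "JOB_ID"  # placeholder: replace with the ID returned when the job is created
    print(create_job_url())
    print(sitemap_url(job_id))
    print(chunk_list_url(job_id))
    print(chunk_url(job_id, 1))
```

All four endpoints use Basic authentication, so the same credentials can be passed with every request.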

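Query parameters

For your convenience, all available parameters are listed in the table below. The same table can also be found in our documentation.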

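Parameter	Description	Default value
`url`	The URL of the starting point.	-
`filters`	These parameters configure the width and depth of the crawling job and determine which URLs should be included in the end result. See the documentation for more information.	-
`filters:crawl`	Specifies which URLs are included in the end result. See the documentation for more information.	-
`filters:process`	Specifies which URLs Web Crawler will scrape. See the documentation for more information.	-
`filters:max_depth`	Determines the maximum length of the URL chains Web Crawler will follow. See the documentation for more information.	`1`
`scrape_params`	These parameters fine-tune the way we perform the scraping jobs. For instance, you may want us to execute JavaScript while crawling a site, or you may prefer us to use proxies from a particular location.	-
`scrape_params:source`	See the documentation for more information.	-
`scrape_params:geo_location`	The geographical location the result should be adapted for. See the documentation for more information.	-
`scrape_params:user_agent_type`	Device type and browser. See the documentation for more information.	`desktop`
`scrape_params:render`	Enables JavaScript rendering. Use it when the target requires JavaScript to load content. To use this feature, set the parameter value to `html`. See the documentation for more information.	-
`output:type_`	The output type. We can return a sitemap (the list of URLs found while crawling) or an aggregate file containing HTML results or parsed data. See the documentation for more information.	-
`upload`	These parameters describe the cloud storage location where you would like us to put the results once we are done. See the documentation for more information.	-
`upload:storage_type`	Defines the cloud storage type. The only valid value is `s3` (for AWS S3). `gcs` (for Google Cloud Storage) is coming soon.	-
`upload:storage_url`	The storage bucket URL.	-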

Using these parameters is straightforward because you pass them in the request payload. The Python section below includes code examples.
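
To show how the table maps onto a request payload, here is a minimal sketch of a payload builder. `build_payload` is a hypothetical helper, not an Oxylabs function. It assumes that the `parent:child` names in the table correspond to nested JSON objects, as they do for `filters`, `scrape_params`, and `output` in the Python examples later in this article; the nesting of `upload` is inferred by analogy, and the bucket value in the usage example is made up.

```python
from typing import Optional, Sequence


def build_payload(
    url: str,
    output_type: str,
    crawl: Sequence[str] = (".*",),
    process: Sequence[str] = (".*",),
    max_depth: int = 1,
    user_agent_type: str = "desktop",
    source: Optional[str] = None,
    geo_location: Optional[str] = None,
    render: Optional[str] = None,
    storage_type: Optional[str] = None,
    storage_url: Optional[str] = None,
) -> dict:
    """Assemble a job payload from the parameters listed in the table."""
    scrape_params = {"user_agent_type": user_agent_type}
    # Optional scraping parameters are only sent when they are set.
    if source is not None:
        scrape_params["source"] = source
    if geo_location is not None:
        scrape_params["geo_location"] = geo_location
    if render is not None:
        scrape_params["render"] = render

    payload = {
        "url": url,
        "filters": {
            "crawl": list(crawl),
            "process": list(process),
            "max_depth": max_depth,
        },
        "scrape_params": scrape_params,
        "output": {"type_": output_type},
    }
    # Assumption: upload settings are nested under "upload" like the other groups.
    if storage_type is not None and storage_url is not None:
        payload["upload"] = {
            "storage_type": storage_type,
            "storage_url": storage_url,
        }
    return payload


if __name__ == "__main__":
    from pprint import pprint

    pprint(build_payload(
        "https://www.amazon.com/",
        "sitemap",
        storage_type="s3",
        storage_url="my-bucket/crawl-results",  # hypothetical bucket path
    ))
```

The resulting dictionary can be passed as the `json` argument of the request that creates the job.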

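Using Web Crawler in Postman

For simplicity, you can use Postman to make crawling requests. Download this Postman collection to try out all the Web Crawler endpoints. Here is a step-by-step video tutorial you can follow:

How to Crawl a Website: Step-by-Step Guide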

Using Web Crawler in Python

To make HTTP requests in Python, we will use the Requests library. Install it by entering the following in your terminal:

pip install requests

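To handle HTML results, we will use the BeautifulSoup4 library to parse the results and make them more readable. This step is optional, but you can install the library with: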

pip install BeautifulSoup4

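Getting a list of URLs

In the following example, we use the sitemap output type to create a job that crawls the Amazon homepage and returns a list of the URLs found on the starting page. With the crawl and process parameters set to ".*", Web Crawler will follow and return any Amazon URL. These two parameters use regular expressions (regex) to determine which URLs should be crawled and processed. Be sure to visit our documentation for more details and useful resources.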
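
Before submitting a job, it can help to check locally which URLs a given pattern list would match. The sketch below does that with Python's `re` module. It is only an approximation: it assumes the patterns are applied with search semantics against the full URL, which this article does not confirm, and the sample URLs and the `/dp/` pattern are made up for illustration.

```python
import re
from typing import List, Sequence, Tuple


def matches_any(url: str, patterns: Sequence[str]) -> bool:
    """Return True if at least one regex pattern matches somewhere in the URL."""
    return any(re.search(pattern, url) for pattern in patterns)


def preview_filters(
    urls: Sequence[str],
    crawl: Sequence[str],
    process: Sequence[str],
) -> Tuple[List[str], List[str]]:
    """Return the URLs matched by the crawl patterns and by the process patterns."""
    crawl_matches = [url for url in urls if matches_any(url, crawl)]
    process_matches = [url for url in urls if matches_any(url, process)]
    return crawl_matches, process_matches


if __name__ == "__main__":
    # Hypothetical sample URLs, used only to illustrate the matching.
    sample_urls = [
        "https://www.amazon.com/",
        "https://www.amazon.com/dp/B000EXAMPLE",
        "https://www.amazon.com/gp/help",
    ]
    crawl_matches, process_matches = preview_filters(
        sample_urls, crawl=[".*"], process=["/dp/"]
    )
    print(crawl_matches)    # every sample URL matches ".*"
    print(process_matches)  # only the URL containing "/dp/"
```

A pattern of ".*" matches every URL, which is why the examples in this article return everything found on the starting page.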

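We don't need to include the source parameter because we aren't scraping content from the URLs yet. Using the json module, we write the data to a .json file, and then we print the structured content with the pprint module. Let's look at the example: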

import json
from pprint import pprint

import requests

# Set the content type to JSON.
headers = {"Content-Type": "application/json"}

# Crawl all URLs inside the target URL.
payload = {
    "url": "https://www.amazon.com/",
    "filters": {
        "crawl": [".*"],
        "process": [".*"],
        "max_depth": 1
    },
    "scrape_params": {
        "user_agent_type": "desktop",
    },
    "output": {
        "type_": "sitemap"
    }
}

# Create a job and store the JSON response.
response = requests.request(
    'POST',
    'https://ect.oxylabs.io/v1/jobs',
    auth=('USERNAME', 'PASSWORD'),  # Your credentials go here.
    headers=headers,
    json=payload,
)

# Write the decoded JSON response to a .json file.
with open('job_sitemap.json', 'w') as f:
    json.dump(response.json(), f)

# Print the decoded JSON response.
pprint(response.json())

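Depending on the request size, the process might take some time. You can make sure the job is finished by checking the job information. Once it is done, send another request to the sitemap endpoint https://ect.oxylabs.io/v1/jobs/{id}/sitemap to return the list of URLs. For example: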

import json
from pprint import pprint

import requests

# Store the JSON response containing URLs (sitemap).
sitemap = requests.request(
    'GET',
    'https://ect.oxylabs.io/v1/jobs/{id}/sitemap',  # Replace {id} with the job ID.
    auth=('USERNAME', 'PASSWORD'),  # Your credentials go here.
)

# Write the decoded JSON response to a .json file.
with open('sitemap.json', 'w') as f:
    json.dump(sitemap.json(), f)

# Print the decoded JSON response.
pprint(sitemap.json())
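
Since a job can take a while, you may prefer to wait in a loop rather than check by hand. The sketch below is a generic polling helper and not an Oxylabs API: `wait_until` is a hypothetical name, and the `check` callable is left to you because this article shows neither the job information endpoint nor the fields of its response.

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    interval: float = 5.0,
    timeout: float = 600.0,
) -> bool:
    """Call check() repeatedly until it returns True or the timeout elapses.

    Returns True if check() succeeded and False if the timeout was reached.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    # Stand-in check that succeeds on the third call. In practice, check()
    # would request the job information and inspect it; the details depend
    # on the response format described in the documentation.
    attempts = iter([False, False, True])
    finished = wait_until(lambda: next(attempts), interval=0.1, timeout=5)
    print("Job finished:", finished)
```

Choosing a polling interval of several seconds keeps the number of status requests low while the job runs.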

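Getting parsed results

To get parsed content, set the output type to parsed. With the example below, we crawl all URLs found on this Amazon page and then parse the content of each URL. This time we use the amazon source because we are scraping content from the specified Amazon page. Let's see all of this put together in Python: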

import json
from pprint import pprint

import requests

# Set the content type to JSON.
headers = {"Content-Type": "application/json"}

# Parse content from the URLs found in the target URL.
payload = {
    "url": "https://www.amazon.com/s?i=electronics-intl-ship&bbn=16225009011&rh=n%3A502394%2Cn%3A281052&dc&qid"
           "=1679564333&rnid=502394&ref=sr_pg_1",
    "filters": {
        "crawl": [".*"],
        "process": [".*"],
        "max_depth": 1
    },
    "scrape_params": {
        "source": "amazon",
        "user_agent_type": "desktop"
    },
    "output": {
        "type_": "parsed"
    }
}

# Create a job and store the JSON response.
response = requests.request(
    'POST',
    'https://ect.oxylabs.io/v1/jobs',
    auth=('USERNAME', 'PASSWORD'),  # Your credentials go here.
    headers=headers,
    json=payload,
)

# Write the decoded JSON response to a .json file.
with open('job_parsed.json', 'w') as f:
    json.dump(response.json(), f)

# Print the decoded JSON response.
pprint(response.json())

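Note that if you want to use the geo_location parameter when scraping Amazon pages, you must set its value to the ZIP/postal code of the preferred location. For more information, visit this page in our documentation.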
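
As an illustration of the note above, a `scrape_params` object with a ZIP code might look like the fragment below. The value `90210` is an arbitrary example, and passing it as a string is an assumption; check the documentation for the exact format it expects.

```python
# Scraping parameters for an Amazon job localized by ZIP/postal code.
scrape_params = {
    "source": "amazon",
    "user_agent_type": "desktop",
    "geo_location": "90210",  # example ZIP code; string type is an assumption
}

print(scrape_params)
```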

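Once the job is complete, you can check how many chunks your request has generated and then download the content of each chunk from this endpoint: https://ect.oxylabs.io/v1/jobs/{id}/aggregate/{chunk}. For instance, the following code snippet prints the first chunk: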

import json
from pprint import pprint

import requests

# Store the JSON response containing parsed results.
parsed_results = requests.request(
    'GET',
    'https://ect.oxylabs.io/v1/jobs/{id}/aggregate/1',  # Replace {id} with the job ID.
    auth=('USERNAME', 'PASSWORD'),  # Your credentials go here.
)

# Write the decoded JSON response to a .json file.
with open('parsed_results_1.json', 'w') as f:
    json.dump(parsed_results.json(), f)

# Print the decoded JSON response.
pprint(parsed_results.json())
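
When a job produces several chunks, the same request can be repeated in a loop. The following is a minimal sketch under a few assumptions: `download_chunks` is a hypothetical helper, the chunk numbers are supplied by you (for example, after looking at the chunk list endpoint, whose response format is not shown in this article), and the HTTP getter is passed in so that `requests.get` or any compatible function can be used.

```python
import json
from typing import Callable, Iterable, List, Tuple

BASE_URL = "https://ect.oxylabs.io/v1/jobs"


def download_chunks(
    job_id: str,
    chunk_numbers: Iterable[int],
    get: Callable,
    auth: Tuple[str, str],
    prefix: str = "parsed_results",
) -> List[str]:
    """Download each listed chunk and save it as <prefix>_<number>.json.

    `get` is any callable with the same interface as requests.get.
    Returns the list of file paths that were written.
    """
    saved = []
    for number in chunk_numbers:
        response = get(f"{BASE_URL}/{job_id}/aggregate/{number}", auth=auth)
        response.raise_for_status()  # Stop early on HTTP errors.
        path = f"{prefix}_{number}.json"
        with open(path, "w", encoding="utf-8") as f:
            json.dump(response.json(), f)
        saved.append(path)
    return saved


# Example usage (requires the Requests library and valid credentials):
#
#   import requests
#   download_chunks("JOB_ID", [1, 2, 3], requests.get, ("USERNAME", "PASSWORD"))
```

Passing the getter as an argument keeps the helper easy to try out with a stand-in function before pointing it at the real endpoint.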

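Getting HTML results

The code for getting HTML results doesn't differ much from the code in the previous section. The main difference is that we set the type_ parameter to html. Let's look at the code sample: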

import json
from pprint import pprint

import requests

# Set the content type to JSON.
headers = {"Content-Type": "application/json"}

# Index HTML results of URLs found in the target URL.
payload = {
    "url": "https://www.amazon.com/s?i=electronics-intl-ship&bbn=16225009011&rh=n%3A502394%2Cn%3A281052&dc&qid"
           "=1679564333&rnid=502394&ref=sr_pg_1",
    "filters": {
        "crawl": [".*"],
        "process": [".*"],
        "max_depth": 1
    },
    "scrape_params": {
        "source": "universal",
        "user_agent_type": "desktop"
    },
    "output": {
        "type_": "html"
    }
}

# Create a job and store the JSON response.
response = requests.request(
    'POST',
    'https://ect.oxylabs.io/v1/jobs',
    auth=('USERNAME', 'PASSWORD'),  # Your credentials go here.
    headers=headers,
    json=payload,
)

# Write the decoded JSON response to a .json file.
with open('job_html.json', 'w') as f:
    json.dump(response.json(), f)

# Print the decoded JSON response.
pprint(response.json())

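Again, you will need to make a request to retrieve each chunk of the result. We use the BeautifulSoup4 library to parse the HTML, but this step is optional. We then write the parsed content to an .html file. The code example below downloads the content of the first chunk: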

import requests
from bs4 import BeautifulSoup

# Store the JSON response containing HTML results.
html_response = requests.request(
    'GET',
    'https://ect.oxylabs.io/v1/jobs/{id}/aggregate/1',  # Replace {id} with the job ID.
    auth=('USERNAME', 'PASSWORD'),  # Your credentials go here.
)

# Parse the HTML content.
soup = BeautifulSoup(html_response.content, 'html.parser')
html_results = soup.prettify()

# Write the HTML results to an .html file (UTF-8 avoids platform-specific encoding errors).
with open('html_results.html', 'w', encoding='utf-8') as f:
    f.write(html_results)

# Print the HTML results.
print(html_results)