
How to Crawl a Website Using Web Crawler?

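- What can Web Crawler do?
- Web Crawler settings overview
  - Endpoints
    - Create a new job
    - Get sitemap
    - Get the list of aggregate result chunks
    - Get a chunk of the aggregate result
  - Query parameters
- Using Web Crawler in Postman
- Using Web Crawler in Python
  - Getting a list of URLs
  - Getting parsed results
  - Getting HTML results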

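Web Crawler is a built-in feature of our Scraper APIs. It is a tool for discovering target URLs, selecting the relevant content, and delivering it in bulk. It crawls websites in real time and at scale, quickly delivering either all content or only the data you need, based on the criteria you choose.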

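What can Web Crawler do?

Web Crawler can perform three main tasks:

Perform URL discovery;
Crawl all pages on a site;
Index all URLs on a domain.

Use it when you need to crawl through a site and receive parsed data in bulk, or when you need to collect a list of URLs from a specific category or from an entire website.

Web Crawler can return three types of data output: a list of URLs, parsed results, and HTML files. If needed, you can set Web Crawler to upload the results to your cloud storage.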

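Web Crawler settings overview

You can easily control the crawling scope by adjusting its width and depth with filters. Web Crawler can also use various scraping parameters, such as geo-location and user agent, to increase the success rate of crawling jobs. Most of these scraping parameters depend on the Scraper API you use.

Endpoints

To control your crawling job, you need to use different endpoints. You can start, stop, and resume a job, get job information, get the list of result chunks, and get the results. Below are the endpoints used in this crawling tutorial. For more information and output examples, visit our documentation.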

Create a new job

Endpoint: https://ect.oxylabs.io/v1/jobs
Method: POST
Authentication: Basic
Request headers: Content-Type: application/json

Get sitemap

This endpoint returns the list of URLs found while processing the job.

Endpoint: https://ect.oxylabs.io/v1/jobs/{id}/sitemap
Method: GET
Authentication: Basic

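Get the list of aggregate result chunks

Endpoint: https://ect.oxylabs.io/v1/jobs/{id}/aggregate
Method: GET
Authentication: Basic

Aggregate results can contain a lot of data, so we split them into multiple chunks based on the chunk size you specify. Use this endpoint to get the list of available chunk files.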

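Get a chunk of the aggregate result

Endpoint: https://ect.oxylabs.io/v1/jobs/{id}/aggregate/{chunk}
Method: GET
Authentication: Basic

With this endpoint, you can download a particular chunk of the aggregate result. The contents of the response body depend on the output type you choose.

The result can be one of the following:

An index (a list of URLs)
An aggregate JSON file with all parsed results
An aggregate JSON file with all HTML results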
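
The endpoints above all share one base URL, so a small helper can assemble them. The following is only a sketch: `BASE_URL` and `job_endpoint` are hypothetical names introduced for illustration and are not part of any Oxylabs library. The paths themselves are the ones listed above.

```python
BASE_URL = "https://ect.oxylabs.io/v1/jobs"  # Base shared by the endpoints listed above.


def job_endpoint(job_id=None, resource=None, chunk=None):
    """Build one of the endpoint URLs from this section (hypothetical helper)."""
    if job_id is None:
        # No job ID yet: this is the URL a new job is POSTed to.
        return BASE_URL
    if resource not in ("sitemap", "aggregate"):
        raise ValueError("resource must be 'sitemap' or 'aggregate'")
    url = f"{BASE_URL}/{job_id}/{resource}"
    if chunk is not None:
        if resource != "aggregate":
            raise ValueError("only aggregate results are split into chunks")
        url = f"{url}/{chunk}"
    return url
```

For instance, `job_endpoint("JOB_ID", "aggregate")` gives the chunk-list URL, and adding `chunk=1` gives the URL of a single chunk.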

Query parameters

For your convenience, all available parameters are listed in the table below. You can also find them in our documentation.

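Parameter	Description	Default value
`url`	The URL of the starting point.	-
`filters`	These parameters configure the width and depth of the crawling job and determine which URLs should be included in the end result. See the documentation for more information.	-
`filters:crawl`	Specifies which URLs Web Crawler will include in the end result. See the documentation for more information.	-
`filters:process`	Specifies which URLs Web Crawler will scrape. See the documentation for more information.	-
`filters:max_depth`	Determines the maximum length of the URL chains Web Crawler will follow. See the documentation for more information.	`1`
`scrape_params`	These parameters fine-tune the way we perform the scraping jobs. For instance, you may want us to execute JavaScript while crawling a site, or you may prefer that we use proxies from a particular location.	-
`scrape_params:source`	See the documentation for more information.	-
`scrape_params:geo_location`	The geographical location that the result should be adapted for. See the documentation for more information.	-
`scrape_params:user_agent_type`	Device type and browser. See the documentation for more information.	`desktop`
`scrape_params:render`	Enables JavaScript rendering. Use it when the target requires JavaScript to load content. To use this feature, set the parameter value to `html`. See the documentation for more information.	-
`output:type_`	The output type. We can return a sitemap (a list of URLs found while crawling) or an aggregate file containing HTML results or parsed data. See the documentation for more information.	-
`upload`	These parameters describe the cloud storage location where you would like us to put the results once we're done. See the documentation for more information.	-
`upload:storage_type`	Defines the cloud storage type. The only valid value is S3 (for AWS S3). GCS (for Google Cloud Storage) is coming soon.	-
`upload:storage_url`	The storage bucket URL.	-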

Using these parameters is straightforward, since you pass them in the request payload. Code examples in Python follow further down.
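
To make the nesting of the parameters concrete, here is a minimal sketch of how the table could map onto a request payload. `build_payload` is a hypothetical helper written for this illustration, not an Oxylabs function; the key names and the three output types (`sitemap`, `parsed`, `html`) are the ones that appear in this article.

```python
def build_payload(url, output_type, crawl=(".*",), process=(".*",), max_depth=1,
                  scrape_params=None, upload=None):
    """Assemble a job payload from the parameters in the table (hypothetical helper)."""
    if output_type not in ("sitemap", "parsed", "html"):
        raise ValueError(f"unsupported output type: {output_type!r}")
    payload = {
        "url": url,
        "filters": {
            "crawl": list(crawl),      # Regex patterns, see filters:crawl.
            "process": list(process),  # Regex patterns, see filters:process.
            "max_depth": max_depth,
        },
        # Start from the documented default and let the caller override or extend it.
        "scrape_params": {"user_agent_type": "desktop", **(scrape_params or {})},
        "output": {"type_": output_type},
    }
    if upload is not None:
        # Expected keys per the table: storage_type and storage_url.
        payload["upload"] = dict(upload)
    return payload
```

Calling `build_payload("https://www.amazon.com/", "sitemap")` produces the same structure as the hand-written sitemap payload later in this tutorial.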

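Using Web Crawler in Postman

For simplicity, you can use Postman to make crawling requests. Download this Postman collection to try out all the endpoints of Web Crawler. Here is a step-by-step video tutorial you can follow:

How to Crawl a Website: Step-by-Step Guide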

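Using Web Crawler in Python

To make HTTP requests in Python, we'll use the Requests library. Install it by entering the following in your terminal:

pip install requests

To handle HTML results, we'll use the BeautifulSoup4 library to parse the results and make them more readable. This step is optional, but you can install the library with:

pip install BeautifulSoup4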

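Getting a list of URLs

In the following example, we use the sitemap parameter to create a job that crawls the Amazon homepage and returns a list of URLs found on the starting page. With the crawl and process parameters set to ".*", Web Crawler will follow and return any Amazon URL. These two parameters use regular expressions (regex) to determine which URLs should be crawled and processed. Be sure to visit our documentation for more details and useful resources.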
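
To get a feel for how a regex narrows a set of URLs, the patterns can be tried locally with Python's `re` module. This is only an illustration: the sample URLs are made up, `matching_urls` is a hypothetical helper, and using `re.search` is an assumption, since this article does not specify exactly how Web Crawler applies the patterns.

```python
import re


def matching_urls(urls, patterns):
    """Return the URLs matched by at least one regex pattern (local illustration only)."""
    compiled = [re.compile(pattern) for pattern in patterns]
    return [url for url in urls if any(regex.search(url) for regex in compiled)]


# Made-up sample URLs, for illustration only.
urls = [
    "https://www.amazon.com/",
    "https://www.amazon.com/gp/help",
    "https://www.amazon.com/dp/EXAMPLE",
]

print(matching_urls(urls, [".*"]))     # ".*" matches every URL in the list.
print(matching_urls(urls, [r"/dp/"]))  # Keeps only the URL containing "/dp/".
```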

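We don't need to include the source parameter because we aren't scraping content from the URLs yet. Using the json module, we write the data to a .json file, and then, with the pprint module, we print the structured content. Let's look at the example: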

import json
from pprint import pprint

import requests

# Set the content type to JSON.
headers = {"Content-Type": "application/json"}

# Crawl all URLs inside the target URL.
payload = {
    "url": "https://www.amazon.com/",
    "filters": {
        "crawl": [".*"],
        "process": [".*"],
        "max_depth": 1
    },
    "scrape_params": {
        "user_agent_type": "desktop",
    },
    "output": {
        "type_": "sitemap"
    }
}

# Create a job and store the JSON response.
response = requests.request(
    'POST',
    'https://ect.oxylabs.io/v1/jobs',
    auth=('USERNAME', 'PASSWORD'),  # Your credentials go here.
    headers=headers,
    json=payload,
)

# Write the decoded JSON response to a .json file.
with open('job_sitemap.json', 'w') as f:
    json.dump(response.json(), f)

# Print the decoded JSON response.
pprint(response.json())

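Depending on the request size, the process might take some time. You can make sure the job is finished by checking the job information. Once it is done, send another request to the sitemap endpoint https://ect.oxylabs.io/v1/jobs/{id}/sitemap to return the list of URLs. For example: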

import json
from pprint import pprint

import requests

# Store the JSON response containing URLs (sitemap).
sitemap = requests.request(
    'GET',
    'https://ect.oxylabs.io/v1/jobs/{id}/sitemap',  # Replace {id} with the job ID.
    auth=('USERNAME', 'PASSWORD'),  # Your credentials go here.
)

# Write the decoded JSON response to a .json file.
with open('sitemap.json', 'w') as f:
    json.dump(sitemap.json(), f)

# Print the decoded JSON response.
pprint(sitemap.json())
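
Since a job may take a while, it can be convenient to poll for completion rather than check by hand. The sketch below is a generic polling loop, not an Oxylabs API: `wait_until` is a hypothetical helper, and the `check` callable is left to you, because the job information endpoint and the fields of its response are not shown in this article.

```python
import time


def wait_until(check, interval=5.0, timeout=600.0, sleep=time.sleep, clock=time.monotonic):
    """Call check() repeatedly until it returns a truthy value or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        result = check()
        if result:
            return result  # The caller decides what "finished" looks like.
        if clock() >= deadline:
            raise TimeoutError("job did not finish in time")
        sleep(interval)  # Pause between checks to avoid flooding the API.
```

Here `check` would be a function that requests the job information and returns something truthy once the job is complete, at which point the sitemap request can be sent.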

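Getting parsed results

To get parsed content, use the parsed parameter. With the example below, we can find all URLs on this Amazon page and then parse the content of each URL. This time, we're using the amazon source because we're scraping content from the specified Amazon page. Here is how it all fits together in Python: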

import json
from pprint import pprint

import requests

# Set the content type to JSON.
headers = {"Content-Type": "application/json"}

# Parse content from the URLs found in the target URL.
payload = {
    "url": "https://www.amazon.com/s?i=electronics-intl-ship&bbn=16225009011&rh=n%3A502394%2Cn%3A281052&dc&qid"
           "=1679564333&rnid=502394&ref=sr_pg_1",
    "filters": {
        "crawl": [".*"],
        "process": [".*"],
        "max_depth": 1
    },
    "scrape_params": {
        "source": "amazon",
        "user_agent_type": "desktop"
    },
    "output": {
        "type_": "parsed"
    }
}

# Create a job and store the JSON response.
response = requests.request(
    'POST',
    'https://ect.oxylabs.io/v1/jobs',
    auth=('USERNAME', 'PASSWORD'),  # Your credentials go here.
    headers=headers,
    json=payload,
)

# Write the decoded JSON response to a .json file.
with open('job_parsed.json', 'w') as f:
    json.dump(response.json(), f)

# Print the decoded JSON response.
pprint(response.json())

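Note that if you want to use the geo_location parameter when scraping Amazon pages, you must set its value to the preferred location's zip/postal code. For more information, visit this page in our documentation.

Once the job is complete, you can check how many chunks your request has generated and then download the content of each chunk from this endpoint: https://ect.oxylabs.io/v1/jobs/{id}/aggregate/{chunk}. For example, the following code snippet prints the first chunk: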

import json
from pprint import pprint

import requests

# Store the JSON response containing parsed results.
parsed_results = requests.request(
    'GET',
    'https://ect.oxylabs.io/v1/jobs/{id}/aggregate/1',  # Replace {id} with the job ID.
    auth=('USERNAME', 'PASSWORD'),  # Your credentials go here.
)

# Write the decoded JSON response to a .json file.
with open('parsed_results_1.json', 'w') as f:
    json.dump(parsed_results.json(), f)

# Print the decoded JSON response.
pprint(parsed_results.json())
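
When a job produces several chunks, the same request can be repeated in a loop. The sketch below assumes that chunks are numbered consecutively starting at 1 (the example above uses 1 for the first chunk) and that you already know the number of chunks from the chunk-list endpoint, whose response format is not shown here. `save_chunks` is a hypothetical helper, and `session` is expected to be a `requests.Session` with `session.auth = ('USERNAME', 'PASSWORD')` already set.

```python
import json


def save_chunks(session, job_id, chunk_count, prefix="parsed_results"):
    """Download chunks 1..chunk_count and write each one to its own .json file."""
    paths = []
    for chunk in range(1, chunk_count + 1):
        url = f"https://ect.oxylabs.io/v1/jobs/{job_id}/aggregate/{chunk}"
        response = session.get(url, timeout=60)
        response.raise_for_status()  # Stop early on HTTP errors such as 401 or 404.
        path = f"{prefix}_{chunk}.json"
        with open(path, "w", encoding="utf-8") as f:
            json.dump(response.json(), f)
        paths.append(path)
    return paths
```

The function returns the list of files it wrote, so the results can be loaded or merged afterwards.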

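Getting HTML results

The code for getting HTML results doesn't differ much from the code in the previous section. The only difference is that we set the type_ parameter to html. Let's look at the code sample: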

import json
from pprint import pprint

import requests

# Set the content type to JSON.
headers = {"Content-Type": "application/json"}

# Index HTML results of URLs found in the target URL.
payload = {
    "url": "https://www.amazon.com/s?i=electronics-intl-ship&bbn=16225009011&rh=n%3A502394%2Cn%3A281052&dc&qid"
           "=1679564333&rnid=502394&ref=sr_pg_1",
    "filters": {
        "crawl": [".*"],
        "process": [".*"],
        "max_depth": 1
    },
    "scrape_params": {
        "source": "universal",
        "user_agent_type": "desktop"
    },
    "output": {
        "type_": "html"
    }
}

# Create a job and store the JSON response.
response = requests.request(
    'POST',
    'https://ect.oxylabs.io/v1/jobs',
    auth=('USERNAME', 'PASSWORD'),  # Your credentials go here.
    headers=headers,
    json=payload,
)

# Write the decoded JSON response to a .json file.
with open('job_html.json', 'w') as f:
    json.dump(response.json(), f)

# Print the decoded JSON response.
pprint(response.json())

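Again, you'll need to make a request to retrieve each chunk of the results. We'll use the BeautifulSoup4 library to parse the HTML, but this step is optional. We then write the parsed content to an .html file. The code example below downloads the content of the first chunk: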

import requests
from bs4 import BeautifulSoup

# Store the JSON response containing HTML results.
html_response = requests.request(
    'GET',
    'https://ect.oxylabs.io/v1/jobs/{id}/aggregate/1',  # Replace {id} with the job ID.
    auth=('USERNAME', 'PASSWORD'),  # Your credentials go here.
)

# Parse the HTML content.
soup = BeautifulSoup(html_response.content, 'html.parser')
html_results = soup.prettify()

# Write the HTML results to an .html file.
with open('html_results.html', 'w', encoding='utf-8') as f:
    f.write(html_results)

# Print the HTML results.
print(html_results)