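Browser instructions

How to define browser instructions for handling complex, dynamic pages when using Web Scraper API.

You can define custom browser instructions that are executed while JavaScript is being rendered.

The simplest way to set up browser instructions is the AI-powered visual browser instruction builder in the Web Scraper API Playground. Read more about it here.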

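Usage

To use browser instructions, provide a list of browser_instructions when creating a job.

Suppose you want to search a website for the term pizza boxes.

Example job parameters look like this: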

{
    "source": "universal",
    "url": "https://www.ebay.com/",
    "render": "html",
    "browser_instructions": [
        {
            "type": "input",
            "value": "pizza boxes",
            "selector": {
                "type": "xpath",
                "value": "//input[@class='gh-tb ui-autocomplete-input']"
            }
        },
        {
            "type": "click",
            "selector": {
                "type": "xpath",
                "value": "//input[@type='submit']"
            }
        },
        {
            "type": "wait",
            "wait_time_s": 5
        }
    ]
}

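Step 1. You must include the "render": "html" parameter, because browser instructions are executed only while JavaScript is being rendered.

Step 2. Describe the browser instructions in the "browser_instructions" field.

The example browser instructions above enter the search term pizza boxes into the search field, click the search button, and wait 5 seconds for the content to load.

The scraping result should look like this: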
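The job parameters above can also be assembled programmatically. The following is a minimal Python sketch, not an official client: build_search_job is a hypothetical helper name, and the XPath selectors are simply the ones from the example, which may not match the live page. It only builds and serializes the payload; how the job is submitted (endpoint, authentication) is not covered here.

```python
import json


def build_search_job(url, term, input_xpath, submit_xpath, wait_s=5):
    """Build job parameters that type a term, click submit, and wait.

    Hypothetical helper for illustration; only the field names
    (source, url, render, browser_instructions, type, value,
    selector, wait_time_s) come from the example above.
    """
    return {
        "source": "universal",
        "url": url,
        # Browser instructions require JavaScript rendering.
        "render": "html",
        "browser_instructions": [
            {
                "type": "input",
                "value": term,
                "selector": {"type": "xpath", "value": input_xpath},
            },
            {
                "type": "click",
                "selector": {"type": "xpath", "value": submit_xpath},
            },
            {"type": "wait", "wait_time_s": wait_s},
        ],
    }


job = build_search_job(
    url="https://www.ebay.com/",
    term="pizza boxes",
    input_xpath="//input[@class='gh-tb ui-autocomplete-input']",
    submit_xpath="//input[@type='submit']",
)

# Serialize to the JSON body that would be sent when creating the job.
print(json.dumps(job, indent=4))
```

Keeping the selectors as arguments makes it easy to reuse the same three-step flow (input, click, wait) on another site by swapping the URL and XPath values.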

{
  "results": [
    {
      "content": "<!doctype html><html>
        Content after executing the instructions
      </html>",
      "created_at": "2023-10-11 11:35:23",
      "updated_at": "2023-10-11 11:36:08",
      "page": 1,
      "url": "https://www.ebay.com/",
      "job_id": "7117835067442906113",
      "status_code": 200
    }
  ]
}
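To work with such a response in code, the HTML can be pulled out of the first entry in results. This is a minimal sketch, assuming the response body has already been decoded into a Python dict. extract_html is a hypothetical helper, the sample dict only mirrors the fields shown above, and treating any status_code other than 200 as a failure is an assumption of this sketch.

```python
def extract_html(response):
    """Return the content of the first result.

    Raises ValueError if there are no results and RuntimeError if the
    result's status_code is not 200 (an assumption of this sketch).
    """
    results = response.get("results", [])
    if not results:
        raise ValueError("response contains no results")
    first = results[0]
    if first.get("status_code") != 200:
        raise RuntimeError(f"unexpected status_code: {first.get('status_code')}")
    return first["content"]


# Sample shaped like the result above, with placeholder HTML.
sample = {
    "results": [
        {
            "content": "<!doctype html><html>Content after executing the instructions</html>",
            "created_at": "2023-10-11 11:35:23",
            "updated_at": "2023-10-11 11:36:08",
            "page": 1,
            "url": "https://www.ebay.com/",
            "job_id": "7117835067442906113",
            "status_code": 200,
        }
    ]
}

html = extract_html(sample)
print(len(html), "characters of HTML")
```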

[Image: the scraped HTML after the instructions were executed]

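Fetching browser resources

We provide a standalone browser instruction for fetching browser resources.

The function is defined here.

Using fetch_resource causes the job to return the content of the first Fetch/XHR resource that matches the provided format, instead of the HTML of the targeted page.

Suppose we want to target a GraphQL resource that is fetched when a product page is visited naturally in a browser. We would provide job information like this: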

{
    "source": "universal",
    "url": "https://www.example.com/product-page/123",
    "render": "html",
    "browser_instructions": [
        {
            "type": "fetch_resource",
            "filter": "/graphql/product-info/123"
        }
    ]
}

These instructions will produce a result like this:

{
  "results": [
    {
      "content": "{'product_id': 123, 'description': '', 'price': 123}",
      "created_at": "2023-10-11 11:35:23",
      "updated_at": "2023-10-11 11:36:08",
      "page": 1,
      "url": "https://example.com/v1/graphql/product-info/123/",
      "job_id": "7117835067442906114",
      "status_code": 200
    }
  ]
}
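With fetch_resource, content holds the body of the matched resource as a string, so it usually needs one more parsing step. The following is a sketch under stated assumptions: GraphQL responses are normally JSON, but the sample above is written with single quotes (a Python-style literal), so the hypothetical parse_resource_content helper tries JSON first and falls back to ast.literal_eval.

```python
import ast
import json


def parse_resource_content(content):
    """Parse the string body of a fetched resource into a Python object."""
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        # Fallback for single-quoted, Python-style literals like the sample.
        # literal_eval only evaluates literals, never arbitrary code.
        return ast.literal_eval(content)


# The content string from the sample result above.
sample_content = "{'product_id': 123, 'description': '', 'price': 123}"

product = parse_resource_content(sample_content)
print(product["product_id"], product["price"])
```

If the fallback also fails, ast.literal_eval raises ValueError or SyntaxError, which a caller can catch to report an unparseable resource.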

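List of supported browser instructions

List of instructions

Status codes

See our response codes outlined here.

Status codes related to instruction validation are documented here.

Errors and warnings

If your browser actions produce errors or warnings, you will find them in the result under the corresponding key: browser_instructions_error or browser_instructions_warnings. For example, if you send the following browser instructions but the expected xpath is not found on the page, the result will contain a warning.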

browser_instructions:

[
    {
        "type": "input", 
        "selector": {
            "type": "xpath",
            "value": "//input[@type='search']"
        },
        "value": "oxylabs"
    }
]

Result:

{
  "results": [
    {
      "content": "<!doctype html><html>
        Content after executing the instructions
      </html>",
      "created_at": "2023-10-11 11:35:23",
      "updated_at": "2023-10-11 11:36:08",
      "browser_instructions_warnings": [
        {
          "action_type": "input",
          "msg": "Could not find selector type `xpath` with value `//input[@type=search]` on the page."
        }
      ],
      "page": 1,
      "url": "https://example.com",
      "job_id": "7117835067442906113",
      "status_code": 200
    }
  ]
}
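A client can surface these diagnostics before trusting content. This is a minimal sketch, assuming the decoded result is a Python dict and that both keys hold lists of objects with action_type and msg, as in the warning example above. The shape of browser_instructions_error is not shown in this document, so handling it the same way is an assumption. collect_instruction_issues is a hypothetical helper.

```python
def collect_instruction_issues(result):
    """Return (level, action_type, msg) tuples found in one result entry."""
    issues = []
    for key, level in (
        ("browser_instructions_error", "error"),
        ("browser_instructions_warnings", "warning"),
    ):
        entries = result.get(key) or []
        # Assumption: tolerate a single object as well as a list of objects.
        if isinstance(entries, dict):
            entries = [entries]
        for entry in entries:
            issues.append((level, entry.get("action_type"), entry.get("msg")))
    return issues


# Sample result entry shaped like the one above (content shortened).
sample_result = {
    "content": "<!doctype html><html></html>",
    "browser_instructions_warnings": [
        {
            "action_type": "input",
            "msg": "Could not find selector type `xpath` with value "
                   "`//input[@type=search]` on the page.",
        }
    ],
    "status_code": 200,
}

for level, action_type, msg in collect_instruction_issues(sample_result):
    print(f"[{level}] {action_type}: {msg}")
```

Note that the sample has status_code 200 even though a warning is present, so checking the status code alone would not reveal that an instruction was skipped.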

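Possible errors and warnings

An unexpected error occurred while translating browser instructions into actions.

An unexpected error occurred while executing the {action.type} browser instruction.

Action {action.type} timed out.

Could not find selector type {selector.type} with value {selector.value} on the page.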
