# Parsing instruction examples

The following HTML snippet is parsed with the example parsing instructions in the sections below.

### Sample HTML

```html

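<div class="product" id="shoes">
    <div class="title">Shoes</div>
    <div class="price">223.12</div>
    <div class="description">
        <ul>
            <li class="description-item">Super</li>
        </ul>
    </div>
</div>
<div class="product" id="pants">
    <div class="title">Pants</div>
    <div class="price">60.12</div>
    <div class="description">
        <ul>
            <li class="description-item">Amazing</li>
            <li class="description-item">Quality</li>
        </ul>
    </div>
</div>
<div class="product" id="socks">
    <div class="title">Socks</div>
    <div class="price">123.12</div>
    <div class="description">
        <ul>
            <li class="description-item">Very</li>
            <li class="description-item">Nice</li>
            <li class="description-item">Socks</li>
        </ul>
    </div>
</div>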
```

### Most basic usage

{% hint style="info" %}
Use case: you want to get all **shoes** **description** **items**.
{% endhint %}

*Example 1. Using XPath to select the Shoes description items.*

```json
{
    "shoes_description": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": [
                    ".//div[@id='shoes']//li[@class='description-item']/text()"
                ]
            }
        ]
    }
}
```

The `xpath` function finds a single item and returns it as a string inside a list:

```json
{
    "shoes_description": [
        "Super"
    ]
}
```

The exact behavior of the `xpath` function is described [**here**](/products/cn/web-scraper-api/features/custom-parser/writing-instructions-manually/list-of-functions.md).

### Nested parsing instructions

{% hint style="info" %}
Use case: you want to parse all information related to shoes. In addition, the parsed result should reflect the document structure of the provided HTML.
{% endhint %}

You are targeting this part of the Sample HTML:

```html

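<div class="product" id="shoes">
    <div class="title">Shoes</div>
    <div class="price">223.12</div>
    <div class="description">
        <ul>
            <li class="description-item">Super</li>
        </ul>
    </div>
</div>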
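```

And you want the parsed result to have the following structure:

```json
{
    "shoes": {
        "title": "Shoes",
        "price": "223.12",
        "description": [
            "Super"
        ]
    }
}
```

The parsing instructions look like this:

*Example 2. Using parsing instructions to parse the* `shoes` *information.*

```json
{
    "shoes": {
        "title": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["//div[@id='shoes']/div[@class='title']/text()"]
                }
            ]
        },
        "price": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["//div[@id='shoes']/div[@class='price']/text()"]
                }
            ]
        },
        "description": {
            "_fns": [
                {
                    "_fn": "xpath",
                    "_args": ["//div[@id='shoes']//li[@class='description-item']/text()"]
                }
            ]
        }
    }
}
```

`xpath_one` works like `xpath`, but instead of returning a list of all matches it **returns the first matched item**.

In the example above, the `shoes` property is the only property defined in the outermost instruction scope. The `shoes` property contains nested parsing instructions. The `shoes` instruction scope does not define a pipeline (the `_fns` property is missing), so the pipelines defined in the `title`, `price`, and `description` scopes use the document being parsed as their input.

In Example 2, `//div[@id='shoes']` is repeated in every XPath expression. You can avoid the repetition by defining a pipeline in the `shoes` scope:

*Example 3. Defining a pipeline in the* `shoes` *scope to avoid repeating the XPath expression.*

```json
{
    "shoes": {
        "_fns": [
            {
                "_fn": "xpath_one",
                "_args": ["//div[@id='shoes']"]
            }
        ],
        "title": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["./div[@class='title']/text()"]
                }
            ]
        },
        "price": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["./div[@class='price']/text()"]
                }
            ]
        },
        "description": {
            "_fns": [
                {
                    "_fn": "xpath",
                    "_args": [".//li[@class='description-item']/text()"]
                }
            ]
        }
    }
}
```

With the parsing instructions from Example 3, Custom Parser will:

1. Process the `shoes._fns` pipeline first, which outputs the `shoes` HTML element;
2. Take the output of the `shoes._fns` pipeline and use it as the input for the pipelines defined in the `title`, `price`, and `description` scopes;
3. Process the `title`, `price`, and `description` pipelines to produce the final values.

The result is identical to that of Example 2:

```json
{
    "shoes": {
        "title": "Shoes",
        "price": "223.12",
        "description": [
            "Super"
        ]
    }
}
```

The main difference between Example 2 and Example 3 is that in Example 3 a pipeline is defined in the `shoes` scope. **This additional pipeline selects the shoes element and passes it to the pipelines found deeper in the instruction hierarchy.**

### List of nested objects

{% hint style="info" %}
**Use case:** previously, you only wanted to parse the `shoes` information. Now you want to parse the information of all products in the HTML.
{% endhint %}

The [**Sample HTML**](/products/cn/web-scraper-api/features/custom-parser/writing-instructions-manually/parsing-instruction-examples.md#sample-html) is again used as the document being parsed. If you want the parsed result to look like this:

```json
{
    "products": [
        {
            "title": "Shoes",
            "price": "223.12",
            "description": [
                "Super"
            ]
        },
        {
            "title": "Pants",
            "price": "60.12",
            "description": [
                "Amazing",
                "Quality"
            ]
        },
        {
            "title": "Socks",
            "price": "123.12",
            "description": [
                "Very",
                "Nice",
                "Socks"
            ]
        }
    ]
}
```

The parsing instructions look like this:

*Example 4. Parsing all products found in the HTML document.*

```json
{
    "products": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": ["//div[@class='product']"]
            }
        ],
        "_items": {
            "title": {
                "_fns": [
                    {
                        "_fn": "xpath_one",
                        "_args": ["./div[@class='title']/text()"]
                    }
                ]
            },
            "price": {
                "_fns": [
                    {
                        "_fn": "xpath_one",
                        "_args": ["./div[@class='price']/text()"]
                    }
                ]
            },
            "description": {
                "_fns": [
                    {
                        "_fn": "xpath",
                        "_args": [".//li[@class='description-item']/text()"]
                    }
                ]
            }
        }
    }
}
```

The parsing instructions look similar to those in Example 3. There are two main differences:

1. `xpath` is used instead of `xpath_one` in the `products._fns` pipeline. The pipeline now outputs a list of all elements that match the provided XPath expression (a list of product elements).
2. The `_items` reserved property indicates that you want to form a list by iterating over the output of the `products._fns` pipeline and **passing each list item separately** down the pipeline scope.

If the `_items` reserved property were not used in the parsing instructions of Example 4, the parsed result would look like this:

```json
{
    "products": {
        "title": [
            "Shoes",
            "Pants",
            "Socks"
        ],
        "price": [
            "223.12",
            "60.12",
            "123.12"
        ],
        "description": [
            [
                "Super"
            ],
            [
                "Amazing",
                "Quality"
            ],
            [
                "Very",
                "Nice",
                "Socks"
            ]
        ]
    }
}
```

{% hint style="warning" %}
`_items` specifies that Custom Parser must pass ***separate list items***, rather than ***the whole list***, down the parsing instructions.
{% endhint %}

### Selecting the Nth element from a list

This section demonstrates the flexibility of pipelines. The same problem can be approached in different ways: several options are available for selecting the Nth element from any list of values.

{% hint style="info" %}
**Use case:** you want to select the second product price on the page.
{% endhint %}

The [**Sample HTML**](#sample-html) is used again as the example. There are several ways to select the price of the 2nd product.

#### Option 1

You can use the XPath `[]` selector and define the selection in the XPath expression.

*Example 5. Selecting the 2nd price with the XPath \`\[]\` selector.*

```json
{
    "second_price": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": [
                    "(//div[@class='price'])[2]/text()"
                ]
            }
        ]
    }
}
```

Result:

```json
{
    "second_price": [
        "60.12"
    ]
}
```

#### Option 2

You can also find all prices with the `xpath` function and pipe them to the `select_nth` function, which selects the nth element from the list of extracted prices. In this example the argument `1` selects the second price, which implies zero-based indexing.

*Example 6. Selecting the 2nd value with the \`select\_nth\` function.*

```json
{
    "second_price": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": [
                    "//div[@class='price']/text()"
                ]
            },
            {
                "_fn": "select_nth",
                "_args": 1
            }
        ]
    }
}
```

Result:

```json
{
    "second_price": "60.12"
}
```

{% hint style="warning" %}
Note that the `select_nth` function returns a single item from the list, whereas the `xpath` function returns a list of items even when only one item is found.
{% endhint %}

#### Option 3

You can use `select_nth` on any list type, including a list of HTML elements:

*Example 7. Select all product HTML elements with* `class="product"` *==> select the 2nd product element from the list ==> extract the price text from the selected product HTML element.*

```json
{
    "second_price": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": ["//div[@class='product']"]
            },
            {
                "_fn": "select_nth",
                "_args": 1
            },
            {
                "_fn": "xpath",
                "_args": ["./div[@class='price']/text()"]
            }
        ]
    }
}
```

Result:

```json
{
    "second_price": ["60.12"]
}
```

### Error handling

Given the following HTML snippet:

```html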

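<div class="product" id="shoes">
    <div class="title">Nice Shoes</div>
    <div class="price">223.12</div>
    <div class="description">
        <ul>
            <li class="description-item">Super</li>
        </ul>
    </div>
</div>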
```

and the following parsing instructions used to parse it:

```json
{
    "product": {
        "_fns": [
            {
                "_fn": "xpath_one",
                "_args": ["//div[@id='shoes']"]
            }
        ],
        "price": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["//div[@class='price']/text()"]
                }
            ]
        },
        "title": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["//div[@class='title']/text()"]
                }
            ]
        },
        "description": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["//div[@class='description']/text()"]
                },
                {
                    "_fn": "convert_to_float"
                }
            ]
        }
    }
}
```

Custom Parser returns a parsed result in which `price` and `title` are parsed normally, but `description` fails because the `convert_to_float` function could not convert the `string` to a `float`:

```json
{
    "product": {
        "price": "223.12",
        "title": "Nice Shoes",
        "description": null
    },
    "_warnings": [
        {
            "_fn": "convert_to_float",
            "_fn_idx": 1,
            "_msg": "Failed to process function.",
            "_path": ".product.description"
        }
    ]
}
```

By default, all errors are counted as warnings and placed in the `_warnings` list. If you want to ignore errors while parsing a field, you can suppress the warnings/errors with the `"_on_error": "suppress"` parameter:

```json
{
    "product": {
        ...,
        "description": {
            "_on_error": "suppress",
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["//div[@class='description']/text()"]
                },
                {
                    "_fn": "convert_to_float"
                }
            ]
        }
    }
}
```

This produces the following result:

```json
{
    "product": {
        "price": "223.12",
        "title": "Nice Shoes",
        "description": null
    }
}
```

### Array of arrays

Custom Parser allows N-dimensional arrays in the parsed result. As an example, we use the following HTML snippet:

```html
<div class="row">
    <div class="column">1</div>
    <div class="column">2</div>
    <div class="column">3</div>
</div>
<div class="row">
    <div class="column">4</div>
    <div class="column">5</div>
    <div class="column">6</div>
</div>
<div class="row">
    <div class="column">7</div>
    <div class="column">8</div>
    <div class="column">9</div>
</div>

```

Suppose you want to parse the document so that the result is a 3x3 two-dimensional array of integers:

```json
{
    "table": [
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]
    ]
}
```

To parse the HTML into the JSON above, you can use the following parsing instructions:

```json
{
    "table": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": ["//div[@class='row']"]
            },
            {
                "_fn": "xpath",
                "_args": [".//div[@class='column']/text()"]
            },
            {
                "_fn": "convert_to_int"
            }
        ]
    }
}
```
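
The array-of-arrays pipeline can be read as three steps applied in sequence: select the row elements, select the column texts inside each row, then convert every string to an integer. The Python sketch below mimics that reading locally with the standard library's `xml.etree.ElementTree`. It is not how Custom Parser is implemented. The `id="table"` wrapper, the sample markup, and the `parse_table` helper are assumptions made for illustration, and the per-row mapping is inferred from the expected 3x3 result rather than stated in the documentation.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical markup: the "row" and "column" class names come from the XPath
# expressions in the parsing instructions; the wrapper element is an assumption.
HTML = """
<div id="table">
  <div class="row">
    <div class="column">1</div><div class="column">2</div><div class="column">3</div>
  </div>
  <div class="row">
    <div class="column">4</div><div class="column">5</div><div class="column">6</div>
  </div>
  <div class="row">
    <div class="column">7</div><div class="column">8</div><div class="column">9</div>
  </div>
</div>
"""


def parse_table(markup: str) -> List[List[int]]:
    """Local imitation of the three pipeline steps (illustrative helper)."""
    root = ET.fromstring(markup.strip())
    # Step 1: comparable to xpath("//div[@class='row']"), a list of row elements.
    rows = root.findall(".//div[@class='row']")
    # Step 2: comparable to xpath(".//div[@class='column']/text()"),
    # applied to each row separately, giving a list of lists of strings.
    texts = [[col.text for col in row.findall(".//div[@class='column']")] for row in rows]
    # Step 3: comparable to convert_to_int, applied to every string.
    return [[int(text) for text in row] for row in texts]


result = {"table": parse_table(HTML)}
print(result)
```

Keeping each step as a separate expression mirrors the order of the functions in `_fns`, which makes it easier to see which step would fail if, for example, a column contained non-numeric text.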