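Parsing instruction examples

See practical examples of Custom Parser parsing instructions: handling nested objects, lists, errors, and arrays of arrays.

The following HTML snippet is parsed with the example parsing instructions introduced below.

Sample HTML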

<body>
    <div id="products">
        <div class="product" id="shoes">
            <div class="title">Shoes</div>
            <div class="price">223.12</div>
            <div class="description">
                <ul>
                    <li class="description-item">Super</li>
                </ul>
            </div>
        </div>
        <div class="product" id="pants">
            <div class="title">Pants</div>
            <div class="price">60.12</div>
            <div class="description">
                <ul>
                    <li class="description-item">Amazing</li>
                    <li class="description-item">Quality</li>
                </ul>
            </div>
        </div>
        <div class="product" id="socks">
            <div class="title">Socks</div>
            <div class="price">123.12</div>
            <div class="description">
                <ul>
                    <li class="description-item">Very</li>
                    <li class="description-item">Nice</li>
                    <li class="description-item">Socks</li>
                </ul>
            </div>
        </div>
    </div>
</body>

Bare minimum instructions

Use case: you want to extract the text of all shoes description items.

Example 1. Selecting the shoes description items with XPath.

{
    "shoes_description": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": [
                    ".//div[@id='shoes']//li[@class='description-item']/text()"
                ]
            }
        ]
    }
}

The xpath function finds a single item and puts it into a list as a string:

{
    "shoes_description": [
        "Super"
    ]
}

The exact behavior of the xpath function is documented separately.
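To see the list-shaped result outside Custom Parser, the following Python sketch runs a comparable query with the standard library's xml.etree.ElementTree. It is only an approximation under stated assumptions: ElementTree implements a subset of XPath and has no text() step, so the sketch selects the li elements and reads their .text instead. The SAMPLE constant is a trimmed, hypothetical copy of the Sample HTML.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the Sample HTML; it is well-formed, so an XML parser can read it.
SAMPLE = """
<body>
  <div id="products">
    <div class="product" id="shoes">
      <div class="title">Shoes</div>
      <div class="description"><ul>
        <li class="description-item">Super</li>
      </ul></div>
    </div>
    <div class="product" id="pants">
      <div class="title">Pants</div>
      <div class="description"><ul>
        <li class="description-item">Amazing</li>
        <li class="description-item">Quality</li>
      </ul></div>
    </div>
  </div>
</body>
"""

root = ET.fromstring(SAMPLE)

# Select the description items of the shoes product only, then read their text.
items = root.findall(".//div[@id='shoes']//li[@class='description-item']")
shoes_description = [li.text for li in items]

print({"shoes_description": shoes_description})
```

Even though only one item matches, the value is still a list, which mirrors the result shown for Example 1.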

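Nested parsing instructions

Use case: you want to parse all information related to shoes. In addition, the parsed result should mirror the document structure of the provided HTML.

You are targeting this part of the Sample HTML: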

<div class="product" id="shoes">
    <div class="title">Shoes</div>
    <div class="price">223.12</div>
    <div class="description">
        <ul>
            <li class="description-item">Super</li>
        </ul>
    </div>
</div>

And you want the parsed result to have the following structure:

{
    "shoes": {
        "title": "Shoes",
        "price": "223.12",
        "description": [
            "Super"
        ]
    }
}

The parsing instructions would look like this.

Example 2. Using parsing instructions to parse the shoes information.

{
    "shoes": {
        "title": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["//div[@id='shoes']/div[@class='title']/text()"]
                }
            ]
        },
        "price": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["//div[@id='shoes']/div[@class='price']/text()"]
                }
            ]
        },
        "description": {
            "_fns": [
                {
                    "_fn": "xpath",
                    "_args": ["//div[@id='shoes']//li[@class='description-item']/text()"]
                }
            ]
        }
    }
}

xpath_one works like xpath, but instead of returning a list of all matches, it returns only the first match.
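The difference between the two functions can be illustrated with a small Python sketch. The helpers below are stand-ins written for this example with xml.etree.ElementTree, not Custom Parser's implementation, and returning None when nothing matches is an assumption of the sketch.

```python
import xml.etree.ElementTree as ET

DOC = ET.fromstring(
    '<div class="product" id="pants">'
    '<div class="title">Pants</div>'
    '<ul>'
    '<li class="description-item">Amazing</li>'
    '<li class="description-item">Quality</li>'
    '</ul>'
    '</div>'
)


def xpath(element, path):
    """Return the text of every match as a list (possibly empty)."""
    return [match.text for match in element.findall(path)]


def xpath_one(element, path):
    """Return the text of the first match only (None if nothing matches: an assumption)."""
    match = element.find(path)
    return None if match is None else match.text


all_items = xpath(DOC, ".//li[@class='description-item']")
first_item = xpath_one(DOC, ".//li[@class='description-item']")

print(all_items)
print(first_item)
```

In this sketch the same path yields a list of two strings through xpath and a single string through xpath_one.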

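In the example above, the shoes property is the only property defined in the outermost instruction scope. The shoes property contains nested parsing instructions.

The shoes instruction scope does not define a pipeline (the _fns property is missing). This means that the pipelines defined in the title, price, and description scopes use the document being parsed as their input.

In Example 2, the XPath expressions repeat //div[@id='shoes']. You can avoid this repetition by defining a pipeline in the shoes scope:

Example 3. Defining a pipeline in the shoes scope to avoid repeating the XPath expression.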

{
    "shoes": {
        "_fns": [
            {
                "_fn": "xpath_one",
                "_args": ["//div[@id='shoes']"]
            }
        ],
        "title": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["./div[@class='title']/text()"]
                }
            ]
        },
        "price": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["./div[@class='price']/text()"]
                }
            ]
        },
        "description": {
            "_fns": [
                {
                    "_fn": "xpath",
                    "_args": [".//li[@class='description-item']/text()"]
                }
            ]
        }
    }
}

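With the parsing instructions from Example 3, Custom Parser will:

Start by processing the shoes._fns pipeline, which outputs the shoes HTML element;
Take the output of the shoes._fns pipeline and use it as the input for the pipelines defined in the title, price, and description scopes;
Process the title, price, and description pipelines to produce the final values.

The result looks identical to the result of Example 2: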

{
    "shoes": {
        "title": "Shoes",
        "price": "223.12",
        "description": [
            "Super"
        ]
    }
}

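The main difference between Example 2 and Example 3 is that in Example 3 a pipeline is defined in the shoes scope. This additional pipeline selects the shoes element and passes it to the pipelines deeper in the instruction hierarchy.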
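The way a scope's pipeline output flows into its nested scopes can be sketched as a tiny recursive interpreter. This is a minimal Python illustration under stated assumptions, not how Custom Parser is implemented: it uses xml.etree.ElementTree, emulates the trailing /text() step by reading .text, and rewrites a leading // as .// because ElementTree rejects absolute paths on an element. The names parse, run_pipeline, and scope_with are hypothetical.

```python
import xml.etree.ElementTree as ET

SHOES_HTML = """
<body>
  <div class="product" id="shoes">
    <div class="title">Shoes</div>
    <div class="price">223.12</div>
    <div class="description"><ul><li class="description-item">Super</li></ul></div>
  </div>
</body>
"""


def xpath(element, path):
    """Return all matches; a trailing /text() step yields the matches' text."""
    want_text = path.endswith("/text()")
    if want_text:
        path = path[: -len("/text()")]  # ElementTree has no text() step
    if path.startswith("/"):
        path = "." + path  # ElementTree rejects absolute paths on an element
    matches = element.findall(path)
    return [m.text for m in matches] if want_text else matches


def xpath_one(element, path):
    results = xpath(element, path)
    return results[0] if results else None


FUNCTIONS = {"xpath": xpath, "xpath_one": xpath_one}


def run_pipeline(fns, value):
    """Feed the value through each function of the pipeline in order."""
    for fn in fns:
        value = FUNCTIONS[fn["_fn"]](value, *fn.get("_args", []))
    return value


def parse(scope, value):
    """Run this scope's pipeline (if any), then pass its output to nested scopes."""
    if "_fns" in scope:
        value = run_pipeline(scope["_fns"], value)
    nested = {k: v for k, v in scope.items() if not k.startswith("_")}
    if not nested:
        return value  # leaf scope: the pipeline output is the final value
    return {key: parse(child, value) for key, child in nested.items()}


def scope_with(name, path):
    """Build a scope that holds a single-function pipeline."""
    return {"_fns": [{"_fn": name, "_args": [path]}]}


# Same shape as the instructions in Example 3.
INSTRUCTIONS = {
    "shoes": {
        **scope_with("xpath_one", "//div[@id='shoes']"),
        "title": scope_with("xpath_one", "./div[@class='title']/text()"),
        "price": scope_with("xpath_one", "./div[@class='price']/text()"),
        "description": scope_with("xpath", ".//li[@class='description-item']/text()"),
    }
}

result = parse(INSTRUCTIONS, ET.fromstring(SHOES_HTML))
print(result)
```

Because the shoes scope runs its own pipeline first, the relative paths in title, price, and description are evaluated against the shoes element rather than the whole document.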

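List of nested objects

Use case: previously, you wanted to parse only the shoes information. Now you want to parse the information of all products in the HTML.

The Sample HTML is again used as the document being parsed.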

If you want the parsed result to look like this:

{
    "products": [
        {
            "title": "Shoes",
            "price": "223.12",
            "description": [
                "Super"
            ]
        },
        {
            "title": "Pants",
            "price": "60.12",
            "description": [
                "Amazing",
                "Quality"
            ]
        },
        {
            "title": "Socks",
            "price": "123.12",
            "description": [
                "Very",
                "Nice",
                "Socks"
            ]
        }
    ]
}

The parsing instructions would look like this:

Example 4. Parsing all products found in the HTML document.

{
    "products": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": ["//div[@class='product']"]
            }
        ],
        "_items": {
            "title": {
                "_fns": [
                    {
                        "_fn": "xpath_one",
                        "_args": ["./div[@class='title']/text()"]
                    }
                ]
            },
            "price": {
                "_fns": [
                    {
                        "_fn": "xpath_one",
                        "_args": ["./div[@class='price']/text()"]
                    }
                ]
            },
            "description": {
                "_fns": [
                    {
                        "_fn": "xpath",
                        "_args": [".//li[@class='description-item']/text()"]
                    }
                ]
            }
        }
    }
}

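The structure of the parsing instructions looks similar to the one in Example 3. However, there are two main exceptions:

xpath is used instead of xpath_one in the products._fns pipeline. The pipeline now outputs a list of all elements that match the provided XPath expression (a list of product elements).
The _items reserved property indicates that you want to form a list by iterating over each item of the products._fns pipeline output and processing each list item separately through the nested scopes.

If the _items reserved property were not used in the parsing instructions of Example 4, the parsed result would look like this:

{
    "products": {
        "title": [
            "Shoes",
            "Pants",
            "Socks"
        ],
        "price": [
            "223.12",
            "60.12",
            "123.12"
        ],
        "description": [
            [
                "Super"
            ],
            [
                "Amazing",
                "Quality"
            ],
            [
                "Very",
                "Nice",
                "Socks"
            ]
        ]
    }
}

_items specifies that Custom Parser must pass the individual list items, rather than the whole list, through the parsing instructions.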
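The effect of _items can be pictured with plain Python comprehensions. The sketch below is an illustration of the two result shapes, not Custom Parser's implementation; it uses xml.etree.ElementTree on a shortened, hypothetical two-product document, and first_text is a helper invented for this example.

```python
import xml.etree.ElementTree as ET

PRODUCTS = ET.fromstring(
    '<div id="products">'
    '<div class="product"><div class="title">Shoes</div><div class="price">223.12</div></div>'
    '<div class="product"><div class="title">Pants</div><div class="price">60.12</div></div>'
    '</div>'
)

# Comparable to the products._fns pipeline: a list of product elements.
products = PRODUCTS.findall("./div[@class='product']")


def first_text(element, path):
    """Text of the first element matching the path."""
    return element.find(path).text


# With _items: each product element goes through the nested scopes on its own,
# so the result is a list with one object per product.
with_items = [
    {
        "title": first_text(product, "./div[@class='title']"),
        "price": first_text(product, "./div[@class='price']"),
    }
    for product in products
]

# Without _items: every nested scope receives the whole list,
# so each field collects the values of all products.
without_items = {
    "title": [first_text(product, "./div[@class='title']") for product in products],
    "price": [first_text(product, "./div[@class='price']") for product in products],
}

print(with_items)
print(without_items)
```

The first shape is a list of objects and the second is an object of lists, which matches the two results contrasted above.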

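Select the N-th element from a list

This section showcases the flexibility of pipelines. The same problem can be solved in different ways.

Use case: there are multiple options for selecting the N-th element from any list of values.

The Sample HTML is again used as an example. You want to select the price of the second product on the page, and you have several options for selecting the 2nd product.

Option 1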

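You can use the XPath [] selector and define the selection in the XPath expression.

Example 5. Using the XPath [] selector to select the 2nd price.

{
    "second_price": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": [
                    "(//div[@class='price'])[2]/text()"
                ]
            }
        ]
    }
}

Result:

{
    "second_price": [
        "60.12"
    ]
}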

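Option 2

You can also use the xpath function to find all prices and pipe them to the select_nth function, which selects the n-th element from the extracted list of prices. The index is zero-based, so an argument of 1 selects the 2nd element.

Example 6. Using the `select_nth` function to select the 2nd value.

{
    "second_price": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": [
                    "//div[@class='price']/text()"
                ]
            },
            {
                "_fn": "select_nth",
                "_args": 1
            }
        ]
    }
}

Result:

{
    "second_price": "60.12"
}

Notice that the select_nth function returns a single item from the list, while the xpath function returns a list of items even if only one item was found.

Option 3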

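You can use select_nth with any list type, including a list of HTML elements:

Example 7. Using select_nth on a list of HTML elements.

{
    "second_price": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": ["//div[@class='product']"]
            },
            {
                "_fn": "select_nth",
                "_args": 1
            },
            {
                "_fn": "xpath",
                "_args": ["./div[@class='price']/text()"]
            }
        ]
    }
}

Select all elements with class="product" ==> select the 2nd product element from the list ==> extract the price text from the selected product HTML element.

{
    "second_price": ["60.12"]
}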
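The three options can be compared side by side in a short Python sketch. It relies on xml.etree.ElementTree, which only supports a subset of XPath, so the first option uses a positional predicate on the product element instead of the parenthesized expression from Example 5. The select_nth helper here is a stand-in written for this illustration; its zero-based index follows Example 6, where an argument of 1 selects the 2nd value.

```python
import xml.etree.ElementTree as ET

PAGE = ET.fromstring(
    '<div id="products">'
    '<div class="product" id="shoes"><div class="price">223.12</div></div>'
    '<div class="product" id="pants"><div class="price">60.12</div></div>'
    '<div class="product" id="socks"><div class="price">123.12</div></div>'
    '</div>'
)


def select_nth(values, index):
    """Pick one item by zero-based index."""
    return values[index]


# Option 1 analogue: a positional predicate in the path (XPath positions start at 1).
option_1 = [d.text for d in PAGE.findall("./div[2]/div[@class='price']")]

# Option 2 analogue: extract every price, then pick index 1 (the 2nd value).
all_prices = [d.text for d in PAGE.findall(".//div[@class='price']")]
option_2 = select_nth(all_prices, 1)

# Option 3 analogue: pick the 2nd product element first, then extract its price.
second_product = select_nth(PAGE.findall("./div[@class='product']"), 1)
option_3 = [d.text for d in second_product.findall("./div[@class='price']")]

print(option_1, option_2, option_3)
```

Options 1 and 3 end with a path query and therefore produce a one-item list, while option 2 ends with the selection step and produces a bare string.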

Error handling

When given the following HTML snippet:

<div class="product" id="shoes">
    <div class="title">Nice Shoes</div>
    <div class="price">223.12</div>
    <div class="description">Super</div>
</div>

And trying to parse it with the following parsing instructions:

{
    "product": {
        "_fns": [
            {
                "_fn": "xpath_one",
                "_args": ["//div[@id='shoes']"]
            }
        ],
        "price": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["//div[@class='price']/text()"]
                }
            ]
        },
        "title": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["//div[@class='title']/text()"]
                }
            ]
        },
        "description": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["//div[@class='description']/text()"]
                },
                {
                    "_fn": "convert_to_float"
                }
            ]
        }
    }
}

Custom Parser returns a parsed result in which price and title are parsed normally, but description fails to parse because the convert_to_float function cannot convert the string to a float:

{
    "product": {
        "price": "223.12",
        "title": "Nice Shoes",
        "description": null
    },
    "_warnings": [
        {
            "_fn": "convert_to_float",
            "_fn_idx": 1,
            "_msg": "Failed to process function.",
            "_path": ".product.description"
        }
    ]
}

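By default, all errors are counted as warnings and placed in the _warnings list. If you want to ignore errors when parsing a field, you can use the "_on_error": "suppress" parameter to suppress the warnings/errors:

{
    "product": {
        ...,
        "description": {
            "_on_error": "suppress",
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["//div[@class='description']/text()"]
                },
                {
                    "_fn": "convert_to_float"
                }
            ]
        }
    }
}

This produces the following result:

{
    "product": {
        "price": "223.12",
        "title": "Nice Shoes",
        "description": null
    }
}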
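Array of arrays

Custom Parser allows N-dimensional arrays in the parsed result. As an example, we use the following HTML snippet:

<div class="row">
    <div class="column">1</div>
    <div class="column">2</div>
    <div class="column">3</div>
</div>
<div class="row">
    <div class="column">4</div>
    <div class="column">5</div>
    <div class="column">6</div>
</div>
<div class="row">
    <div class="column">7</div>
    <div class="column">8</div>
    <div class="column">9</div>
</div>

Let's say you want to parse the document into a two-dimensional 3x3 array of integers:

{
    "table": [
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]
    ]
}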

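To parse the HTML into the JSON above, you can use the following parsing instructions:

{
    "table": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": ["//div[@class='row']"]
            },
            {
                "_fn": "xpath",
                "_args": ["./div[@class='column']/text()"]
            },
            {
                "_fn": "convert_to_int"
            }
        ]
    }
}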
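How a chain of steps can produce a nested array is sketched below in Python. The sketch assumes that a path query applied to a list of elements is evaluated once per element, so the nesting is preserved; that mapping behavior, the text flag, and the to_int_nested helper are assumptions made for this illustration. The rows are wrapped in a body element only so that the standard-library XML parser accepts the snippet.

```python
import xml.etree.ElementTree as ET

TABLE = ET.fromstring(
    "<body>"
    '<div class="row"><div class="column">1</div><div class="column">2</div><div class="column">3</div></div>'
    '<div class="row"><div class="column">4</div><div class="column">5</div><div class="column">6</div></div>'
    '<div class="row"><div class="column">7</div><div class="column">8</div><div class="column">9</div></div>'
    "</body>"
)


def xpath(value, path, text=False):
    """Apply the path to one element, or to every element of a list (keeping the nesting)."""
    if isinstance(value, list):
        return [xpath(item, path, text) for item in value]
    matches = value.findall(path)
    return [m.text for m in matches] if text else matches


def to_int_nested(value):
    """Convert a string, or every string in a nested list, to int."""
    if isinstance(value, list):
        return [to_int_nested(item) for item in value]
    return int(value)


rows = xpath(TABLE, "./div[@class='row']")                 # list of row elements
cells = xpath(rows, "./div[@class='column']", text=True)   # list of lists of strings
table = to_int_nested(cells)                               # list of lists of integers

print({"table": table})
```

Each step keeps the shape produced by the previous one, which is why two path queries in a row yield a two-dimensional result here.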
