Getting Started


Last updated 1 year ago

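How to use Custom Parser

Introduction

To use Custom Parser, provide a set of parsing_instructions when creating a job.

For example, suppose you want to parse the total number of results that Bing Search returns for the search term test:

An example of the job parameters: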

{
    "source": "bing_search",
    "query": "test",
    "parse": true,
    "parsing_instructions": {
        "number_of_results": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": [".//span[@class='sb_count']/text()"]
                }
            ]
        }
    }
}

Step 1. You must provide the "parse": true parameter.

Step 2. Describe the parsing instructions in the "parsing_instructions" field.
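
As a minimal sketch of these two steps, the job parameters can be assembled in Python before they are submitted. The helper name build_job is hypothetical and not part of any API; only the field names are taken from the example job.

```python
import json

def build_job(source, query, parsing_instructions):
    """Assemble job parameters for a Custom Parser job (hypothetical helper)."""
    if not parsing_instructions:
        raise ValueError("parsing_instructions must not be empty")
    return {
        "source": source,
        "query": query,
        "parse": True,  # Step 1: parsing must be switched on
        "parsing_instructions": parsing_instructions,  # Step 2
    }

instructions = {
    "number_of_results": {
        "_fns": [
            {"_fn": "xpath_one", "_args": [".//span[@class='sb_count']/text()"]}
        ]
    }
}

job = build_job("bing_search", "test", instructions)
payload = json.dumps(job, indent=4)  # JSON body ready to be sent with the job request
print(payload)
```

How the payload is sent (endpoint, authentication) is outside the scope of this sketch.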

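The sample parsing instructions above specify that the goal is to parse the number of search results from the scraped document and put the result in the number_of_results field. The instructions for how to parse the field are given by defining a "pipeline":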

"_fns": [
    {
        "_fn": "xpath_one",
        "_args": [".//span[@class='sb_count']/text()"]
    }
]

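A pipeline is a list of data processing functions to execute. The functions run in the order in which they appear in the list, and each one takes the output of the previous function as its input.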
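
The chaining behavior of a pipeline can be sketched in a few lines of Python. This only illustrates the "output of one step feeds the next" idea; first_item and strip_text are stand-in functions invented for the example and are not Custom Parser functions.

```python
def run_pipeline(value, functions):
    """Apply each function in order, feeding every output into the next function."""
    for fn in functions:
        value = fn(value)
    return value

# Stand-in steps: neither is a real Custom Parser function.
def first_item(items):
    return items[0] if items else None

def strip_text(text):
    return text.strip()

result = run_pipeline(["  About 35.700.000.000 results  "], [first_item, strip_text])
print(result)  # About 35.700.000.000 results
```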

The parsed result of the sample bing_search job should look like this:

{
    "results": [
        {
            "content": {
                "number_of_results": "About 35.700.000.000 results",
                "parse_status_code": 12000
            },
            "created_at": "2023-03-24 08:27:16",
            "internal": [],
            "job_id": "7044947765926856705",
            "page": 1,
            "parser_type": "",
            "status_code": 200,
            "updated_at": "2023-03-24 08:27:21",
            "url": "https://www.bing.com/search?form=QBLH&q=test"
        }
    ]
}

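Custom Parser not only extracts text from the scraped HTML, it can also perform basic data processing.

For example, the parsing instructions described earlier extract number_of_results as text, together with extra keywords you may not need. If you want the number of results for query=test as a numeric data type, you can reuse the same parsing instructions and add the amount_from_string function to the existing pipeline: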

{
    "source": "bing_search",
    "query": "test",
    "parse": true,
    "parsing_instructions": {
        "number_of_results": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": [".//span[@class='sb_count']/text()"]
                },
                {
                    "_fn": "amount_from_string"
                }
            ]
        }
    }
}

The parsed result of this sample job should look like this:

{
    "results": [
        {
            "content": {
                "number_of_results": 2190000000,
                "parse_status_code": 12000
            },
            "created_at": "2023-03-24 08:52:21",
            "internal": [],
            "job_id": "7044954079138679809",
            "page": 1,
            "parser_type": "",
            "status_code": 200,
            "updated_at": "2023-03-24 08:52:25",
            "url": "https://www.bing.com/search?form=QBLH&q=test"
        }
    ]
}

As we can see, the document was parsed accurately.
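
To give an intuition for what a text-to-number step does, here is a rough Python sketch. It is not the implementation of amount_from_string, whose exact rules are not described in this guide; the function name amount_from_string_sketch is hypothetical, and it assumes that dots, commas, and spaces inside the number are thousands separators, which would be wrong for decimal values.

```python
import re

def amount_from_string_sketch(text):
    """Pull the first digit group out of text and return it as an int.

    Assumption: '.', ',' and spaces inside the group are thousands separators.
    """
    match = re.search(r"\d[\d.,\s]*", text)
    if match is None:
        return None
    digits = re.sub(r"\D", "", match.group())  # drop every non-digit character
    return int(digits)

print(amount_from_string_sketch("About 35.700.000.000 results"))  # 35700000000
```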

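Tips for writing XPath expressions

The HTML structure may differ between the scraped document and the browser-loaded document

When writing HTML element selection functions, make sure to work with the scraped document rather than the live version of the website loaded in your browser, because the two may differ. The main reason is JavaScript rendering. When a website is opened, your browser loads additional files, such as CSS stylesheets and JavaScript scripts, which can change the structure of the initial HTML document. When parsing scraped HTML, Custom Parser does not load the HTML document the way a browser does (the parser ignores JavaScript instructions), so the HTML tree seen by the parser may differ from the one rendered by the browser.

For example, consider the following HTML document: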

<!doctype html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Document</title>
</head>
<body>
    <div>
        <h3>This is a product</h3>
        <div id="price-container">
            <p>This is the price:</p>
        </div>
        <p>And here is some description</p>
    </div>
    <script>
        const priceContainer = document.querySelector("#price-container");
        const priceElement = document.createElement("p");
        priceElement.textContent = "123";
        priceElement.id = "price"
        priceContainer.insertAdjacentElement("beforeend", priceElement);
    </script>
</body>
</html>

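If you open the document in a browser, it displays the price, which you can select with the XPath expression //p[@id="price"].

Now, if you disable JavaScript rendering in your browser, the page renders without the price element.

The same //p[@id="price"] XPath expression no longer matches the price, because the element was never rendered.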
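
The effect can be reproduced without a browser. The sketch below feeds a trimmed copy of the example document to Python's standard html.parser, which, like a parser that ignores JavaScript, only sees the static markup. The class name IdCollector is made up for the illustration.

```python
from html.parser import HTMLParser

# Trimmed copy of the example document: the price element only exists in the script.
SCRAPED_HTML = """<!doctype html>
<html lang="en">
<body>
    <div id="price-container">
        <p>This is the price:</p>
    </div>
    <script>
        const priceElement = document.createElement("p");
        priceElement.id = "price";
    </script>
</body>
</html>"""

class IdCollector(HTMLParser):
    """Collect the id attribute of every element in the static markup."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id":
                self.ids.append(value)

collector = IdCollector()
collector.feed(SCRAPED_HTML)
print(collector.ids)  # ['price-container']
```

No element with id="price" exists until the script runs, so a selector that targets it has nothing to match.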

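Make sure to write all possible HTML selectors for the target element

For various reasons, the same page scraped twice may come back with different layouts (different User Agents used when scraping, the target website running A/B tests, and so on).

To address this, we recommend defining parsing_instructions for the initially scraped document and immediately testing them against the results of multiple other scrape jobs of the same page type.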
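
One way to run such a test is to collect the "content" objects returned by several jobs and check how often each field came back non-null. The sketch below is only an illustration: field_coverage is a hypothetical helper, and the sample values are made up to stand in for three jobs of the same page type.

```python
def field_coverage(contents):
    """Return, per field, the share of results where the field is not null."""
    totals, filled = {}, {}
    for content in contents:
        for name, value in content.items():
            if name.startswith("_") or name == "parse_status_code":
                continue  # skip metadata such as _warnings
            totals[name] = totals.get(name, 0) + 1
            if value is not None:
                filled[name] = filled.get(name, 0) + 1
    return {name: filled.get(name, 0) / totals[name] for name in totals}

# Made-up "content" objects standing in for three jobs of the same page type.
samples = [
    {"number_of_results": 18000000000, "parse_status_code": 12000},
    {"number_of_results": None, "parse_status_code": 12005},
    {"number_of_results": 2190000000, "parse_status_code": 12000},
]

coverage = field_coverage(samples)
print(coverage)
```

A field with coverage well below 1.0 suggests that its selector misses at least one layout variant.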

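Suggested HTML selector writing flow

1. Scrape the HTML document of the target page using Scraper API.
2. Disable JavaScript and open the scraped HTML locally in your browser. If you disable JavaScript after opening the HTML, make sure to reload the page so that the HTML is loaded again without JavaScript.

How to write parsing instructions

Suppose you have the following page to parse: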

<!doctype html>
<html lang="en">
<head></head>
<body>
<style>
.variant {
  display: flex;
  flex-wrap: nowrap;
}
.variant p {
  white-space: nowrap;
  margin-right: 20px;
}
</style>
<div>
    <h1 id="title">This is a cool product</h1>
    <div id="description-container">
        <h2>This is a product description</h2>
        <ul>
            <li class="description-item">Durable</li>
            <li class="description-item">Nice</li>
            <li class="description-item">Sweet</li>
            <li class="description-item">Spicy</li>
        </ul>
    </div>
    <div id="price-container">
        <h2>Variants</h2>
        <div id="variants">
            <div class="variant">
                <p class="color">Red</p>
                <p class="price">99.99</p>
            </div>
            <div class="variant">
                <p class="color">Green</p>
                <p class="price">87.99</p>
            </div>
            <div class="variant">
                <p class="color">Blue</p>
                <p class="price">65.99</p>
            </div>
            <div class="variant">
                <p class="color">Black</p>
                <p class="price">99.99</p>
            </div>
        </div>
    </div>
</div>
</body>
</html>

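Parsing the product title

Create a new JSON object and assign it a new field.

You can name the field any way you like, with some exceptions (user-defined field names cannot start with an underscore _, for example "_title").

The field name will be displayed in the parsed result.

The new field must hold a value of the JSON object type: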

{
    "title": {}  // defining a title field to be parsed
}

If you provide these instructions to Custom Parser, it will either do nothing or complain that you have not provided any instructions.
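
The naming rules above can be checked locally before a job is submitted. The following sketch is an assumption-laden illustration: validate_fields is a hypothetical helper, and the RESERVED set only lists the reserved keys that appear in this guide, so the real set may be larger.

```python
RESERVED = {"_fns", "_items"}  # reserved keys seen in this guide; the real set may be larger

def validate_fields(instructions, path=""):
    """Return a list of problems found in user-defined field names and values."""
    problems = []
    for name, value in instructions.items():
        where = f"{path}.{name}"
        if name in RESERVED:
            if name == "_items" and isinstance(value, dict):
                problems += validate_fields(value, where)
            continue
        if name.startswith("_"):
            problems.append(f"{where}: field names must not start with an underscore")
        elif not isinstance(value, dict):
            problems.append(f"{where}: value must be a JSON object")
        else:
            problems += validate_fields(value, where)
    return problems

problems = validate_fields({"title": {}, "_title": {}, "price": "text"})
for problem in problems:
    print(problem)
```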

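To actually parse the title into the title field, you must define a data processing pipeline inside the title object, using the reserved _fns property (which is always an array):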

{
    "title": {
        "_fns": []  // defining data processing pipeline for the title field
    }
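}

Then add the xpath_one function to the pipeline, with an XPath expression that selects the text of the title element:

{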
    "title": {
        "_fns": [
            {
                "_fn": "xpath_one",
                "_args": ["//h1/text()"]
            }
        ]
    }
}

The parsing instructions above should produce the following result:

{
    "title": "This is a cool product"
}

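Parsing the description

Similarly, you can define another field in the parsing instructions for the product description container, the description title, and the description items. If the title and items are to be nested under the description object, the instructions should be structured as follows: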

{
    "title": {...},
    "description": { // description container
        "title": {}, // description title
        "items": {} // description items
    } 
}

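This structure implies that description.title and description.items will be parsed relative to the description element. You can define a pipeline for the description field. In this case it makes sense to do that first, because it simplifies the XPath expressions for the description title and items.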

{
    "title": {...},
    "description": {
        "_fns": [
            {
                "_fn": "xpath_one",
                "_args": ["//div[@id='description-container']"]
            }
        ],  // Pipeline result will be used when parsing `title` and `items`.
        "title": {},
        "items": {}
    }
}

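In this example, the description._fns pipeline selects the description-container HTML element, which is then used as the reference point for parsing the description title and items.

To parse the remaining description fields, add two separate pipelines for the description.items and description.title fields: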

{
    "title": {...},
    "description": {
        "_fns": [
            {
                "_fn": "xpath_one",
                "_args": [
                    "//div[@id='description-container']"
                ]
            }
        ],
        "title": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": [
                        "//h2/text()"
                    ]
                }
            ]
        },
        "items": {
            "_fns": [
                {
                    "_fn": "xpath",
                    "_args": [
                        "//li/text()"
                    ]
                }
            ]
        }
    }
}

Note the use of the xpath function instead of xpath_one, which extracts all items that match the XPath expression.
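
The difference between the two functions (first match versus all matches) can be illustrated with Python's standard xml.etree.ElementTree on the description markup. This is an analogy rather than the parser's implementation: ElementTree supports only a subset of XPath and has no text() step, so the helpers first_text and all_texts, both made up here, read the .text attribute instead.

```python
import xml.etree.ElementTree as ET

SNIPPET = """<div id="description-container">
    <h2>This is a product description</h2>
    <ul>
        <li class="description-item">Durable</li>
        <li class="description-item">Nice</li>
        <li class="description-item">Sweet</li>
        <li class="description-item">Spicy</li>
    </ul>
</div>"""

container = ET.fromstring(SNIPPET)

def first_text(node, path):
    """Like xpath_one with a trailing /text(): text of the first match, or None."""
    element = node.find(path)
    return None if element is None else element.text

def all_texts(node, path):
    """Like xpath with a trailing /text(): text of every matching element."""
    return [element.text for element in node.findall(path)]

print(first_text(container, ".//li"))  # Durable
print(all_texts(container, ".//li"))   # ['Durable', 'Nice', 'Sweet', 'Spicy']
```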

The parsing instructions produce the following result:

{
    "title": {...},
    "description": {
        "title": "This is a product description",
        "items": [
            "Durable",
            "Nice",
            "Sweet",
            "Spicy"
        ]
    }
}

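Parsing product variants

The example below shows the target structure if you want to parse information into a product_variants field that contains a list of variant objects. In this case, each variant object has price and color fields.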

{
    "title": {...},
    "description": {...},
    "product_variants": [
        {
            "price": ...,
            "color": ...
        },
        {
            ...
        },
        ...
    ]
}

Start by selecting all product variant elements:

{
    "title": {...},
    "description": {...},
    "product_variants": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": ["//div[@class='variant']"]
            }
        ]
    }
}

To make product_variants a list of JSON objects, you have to use the _items iterator to iterate over the variants that were found:

{
    "title": {...},
    "description": {...},
    "product_variants": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": ["//div[@class='variant']"]
            }
        ],
        "_items": { // with this, you are instructing to process found elements one by one
            // field instructions to be described here
        } 
    }
}

Finally, define the instructions for parsing the color and price fields:

{
    "title": {...},
    "description": {...},
    "product_variants": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": [
                    "//div[@class='variant']"
                ]
            }
        ],
        "_items": {
            "color": {
                "_fns": [
                    {
                        "_fn": "xpath_one",
                        "_args": [
                            // As we are using relative XPath expressions,
                            // make sure XPath starts with a dot (.)
                            ".//p[@class='color']/text()"
                        ]
                    }
                ]
            },
            "price": {
                "_fns": [
                    {
                        "_fn": "xpath_one",
                        "_args": [
                            ".//p[@class='price']/text()"
                        ]
                    }
                ]
            }
        }
    }
}

With product_variants described, the final instructions look like this:

{
    "title": {
        "_fns": [
            {
                "_fn": "xpath_one",
                "_args": [
                    "//h1/text()"
                ]
            }
        ]
    },
    "description": {
        "_fns": [
            {
                "_fn": "xpath_one",
                "_args": [
                    "//div[@id='description-container']"
                ]
            }
        ],
        "title": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": [
                        "//h2/text()"
                    ]
                }
            ]
        },
        "items": {
            "_fns": [
                {
                    "_fn": "xpath",
                    "_args": [
                        "//li/text()"
                    ]
                }
            ]
        }
    },
    "product_variants": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": [
                    "//div[@class='variant']"
                ]
            }
        ],
        "_items": {
            "color": {
                "_fns": [
                    {
                        "_fn": "xpath_one",
                        "_args": [
                            ".//p[@class='color']/text()"
                        ]
                    }
                ]
            },
            "price": {
                "_fns": [
                    {
                        "_fn": "xpath_one",
                        "_args": [
                            ".//p[@class='price']/text()"
                        ]
                    }
                ]
            }
        }
    }
}

They produce the following output:

{
    "title": "This is a cool product",
    "description": {
        "title": "This is a product description",
        "items": [
            "Durable",
            "Nice",
            "Sweet",
            "Spicy"
        ]
    },
    "product_variants": [
        {
            "color": "Red",
            "price": "99.99"
        },
        {
            "color": "Green",
            "price": "87.99"
        },
        {
            "color": "Blue",
            "price": "65.99"
        },
        {
            "color": "Black",
            "price": "99.99"
        }
    ]
}
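
To make the evaluation order concrete, the sketch below is a toy interpreter for instructions of this shape, written with Python's standard library. It is not how Custom Parser is implemented. It assumes a shortened, well-formed copy of the page, supports only xpath and xpath_one, handles text() by hand, and treats expressions starting with // as relative to the current element, which is a simplification. The helper names select, parse, and step are made up.

```python
import xml.etree.ElementTree as ET

PAGE = """<html><body><div>
    <h1 id="title">This is a cool product</h1>
    <div id="description-container">
        <h2>This is a product description</h2>
        <ul><li>Durable</li><li>Nice</li></ul>
    </div>
    <div id="variants">
        <div class="variant"><p class="color">Red</p><p class="price">99.99</p></div>
        <div class="variant"><p class="color">Green</p><p class="price">87.99</p></div>
    </div>
</div></body></html>"""

def select(node, expr):
    """Evaluate a small XPath subset relative to node (the toy 'xpath')."""
    want_text = expr.endswith("/text()")
    if want_text:
        expr = expr[: -len("/text()")]
    if expr.startswith("//"):
        expr = "." + expr  # simplification: search below the current node
    found = node.findall(expr)
    return [element.text for element in found] if want_text else found

def xpath_one(node, expr):
    """Return the first match, or None when nothing matches."""
    matches = select(node, expr)
    return matches[0] if matches else None

FUNCTIONS = {"xpath": select, "xpath_one": xpath_one}

def parse(node, instructions):
    """Run the field's pipeline, then iterate over _items or descend into nested fields."""
    value = node
    for fn in instructions.get("_fns", []):
        value = FUNCTIONS[fn["_fn"]](value, *fn.get("_args", []))
    if "_items" in instructions:
        return [parse(item, instructions["_items"]) for item in value]
    fields = {k: v for k, v in instructions.items() if not k.startswith("_")}
    if fields:
        return {name: parse(value, sub) for name, sub in fields.items()}
    return value

def step(name, expr):
    """Shorthand for one pipeline entry (hypothetical helper)."""
    return {"_fn": name, "_args": [expr]}

INSTRUCTIONS = {
    "title": {"_fns": [step("xpath_one", "//h1/text()")]},
    "description": {
        "_fns": [step("xpath_one", "//div[@id='description-container']")],
        "title": {"_fns": [step("xpath_one", "//h2/text()")]},
        "items": {"_fns": [step("xpath", "//li/text()")]},
    },
    "product_variants": {
        "_fns": [step("xpath", "//div[@class='variant']")],
        "_items": {
            "color": {"_fns": [step("xpath_one", ".//p[@class='color']/text()")]},
            "price": {"_fns": [step("xpath_one", ".//p[@class='price']/text()")]},
        },
    },
}

result = parse(ET.fromstring(PAGE), INSTRUCTIONS)
print(result)
```

The sketch shows the three roles a field can play: a plain pipeline (title), a pipeline whose result becomes the context for nested fields (description), and a pipeline whose results are processed one by one through _items (product_variants).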


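What happens if parsing fails when using Custom Parser?

If Custom Parser fails to process the client-defined parsing instructions, we return the 12005 status code (parsed with warnings).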

{
    "source": "bing_search",
    "query": "test",
    "parse": true,
    "parsing_instructions": {
        "number_of_results": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": [".//span[@class='sb_count']/text()"]
                },
                {
                    "_fn": "amount_from_string"
                }
            ]
        },
        "number_of_organics": {
            "_fns": [
                {
                    "_fn": "xpath",
                    "_args": ["//this-will-not-match-anything"]
                },
                {
                    "_fn": "length"
                }
            ]
        }
    }
}

The client is charged for such results:

{
    "results": [
        {
            "content": {
                "_warnings": [
                    {
                        "_fn": "xpath",
                        "_fn_idx": 0,
                        "_msg": "XPath expressions did not match any data.",
                        "_path": ".number_of_organics"
                    }
                ],
                "number_of_organics": null,
                "number_of_results": 18000000000,
                "parse_status_code": 12005
            },
            "created_at": "2023-03-24 09:46:46",
            "internal": [],
            "job_id": "7044967772475916289",
            "page": 1,
            "parser_type": "",
            "status_code": 200,
            "updated_at": "2023-03-24 09:46:48",
            "url": "https://www.bing.com/search?form=QBLH&q=test"
        }
    ]
}

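If Custom Parser encounters an exception and breaks during the parsing operation, it returns one of these status codes: 12002, 12006, 12007. You will not be charged for these unexpected errors.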
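
On the client side, these rules can be turned into a small check on the "content" object of each result. The sketch below is illustrative only: summarize is a hypothetical helper, and it treats any code other than the four named above as "other" and leaves the charging question open, because this guide does not state it for those codes.

```python
WARNING_CODE = 12005                  # parsed with warnings, charged
FAILURE_CODES = {12002, 12006, 12007}  # parser broke, not charged

def summarize(content):
    """Classify a result's content by its parse_status_code."""
    code = content.get("parse_status_code")
    if code in FAILURE_CODES:
        return {"status": "failed", "charged": False, "warning_paths": []}
    if code == WARNING_CODE:
        paths = [warning["_path"] for warning in content.get("_warnings", [])]
        return {"status": "parsed_with_warnings", "charged": True, "warning_paths": paths}
    # Any other code: charging is not stated in this guide, so leave it as None.
    return {"status": "other", "charged": None, "warning_paths": []}

content = {
    "_warnings": [
        {
            "_fn": "xpath",
            "_fn_idx": 0,
            "_msg": "XPath expressions did not match any data.",
            "_path": ".number_of_organics",
        }
    ],
    "number_of_organics": None,
    "number_of_results": 18000000000,
    "parse_status_code": 12005,
}

summary = summarize(content)
print(summary)
```

The _path values point at the fields whose pipelines produced warnings, which is usually the first place to look when a selector stops matching.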

