Parsing instruction examples
This document presents sample use cases of the Custom Parser.
The following HTML snippet is parsed using example parsing instructions in the upcoming sections.
Sample HTML
Bare minimum
Use case: you want to extract the text from all shoes description items.
Example 1. Shoes description items selection using XPath.
In this sample, the `xpath` function finds a single matching item and returns it as a string inside a list:
The exact `xpath` function behavior is described here.
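As a rough illustration of the "always a list" behavior, the sketch below emulates an `xpath`-style lookup with Python's standard `xml.etree.ElementTree`. It is not the Custom Parser itself: the HTML fragment, the `description-item` class name, and the `xpath_texts` helper are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment; it is NOT the Sample HTML from this page.
HTML = """
<div id="shoes">
  <ul>
    <li class="description-item">Leather upper</li>
  </ul>
</div>
"""

def xpath_texts(root, expression):
    """Illustrative helper: text of every matching element, always as a list."""
    return [element.text for element in root.findall(expression)]

root = ET.fromstring(HTML)
items = xpath_texts(root, ".//li[@class='description-item']")
print(items)  # a list, even though only one element matched
```

Note that `ElementTree` supports only a subset of XPath, so expressions that work here may differ from the ones you would pass to the parser.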
Nested parsing instructions
Use case: you want to parse all information related to shoes. Also, the parsed result should represent the document structure of the provided HTML.
You are targeting this part of the Sample HTML:
And you would like the parsed result to be of the following structure:
Parsing instructions would look as follows.
Example 2. Parsing instructions used to parse `shoes` information.
`xpath_one` works similarly to `xpath`, but instead of returning a list of all matches, it returns the first matched item.
In the example above, the `shoes` property is the only property defined in the outermost instructions scope. The `shoes` property contains nested parsing instructions.
The `shoes` instructions scope does not have a pipeline defined (the `_fns` property is missing). This means pipelines defined in the `title`, `price`, and `description` scopes will use the document-under-parse as a pipeline input.
In Example 2, you can see a repetition of `//div[@id='shoes']` in XPath expressions. The repetition can be avoided by defining a pipeline in the `shoes` scope:
Example 3. Defining a pipeline in the `shoes` scope instructions to avoid XPath expression repetition.
By using the parsing instructions provided in Example 3, Custom Parser will:
1. Start with processing the `shoes._fns` pipeline, which will output the `shoes` HTML element;
2. Take the `shoes._fns` pipeline output and use it as an input for the pipelines defined in the `title`, `price`, and `description` scopes;
3. Process the `title`, `price`, and `description` pipelines to produce final values.
The result will look the same as the result from Example 2:
The main difference between Example 2 and Example 3 is that in Example 3, a pipeline is defined in the `shoes` scope. This additional pipeline selects the `shoes` element and passes it on to the pipelines found deeper in the instructions hierarchy.
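The difference between the two approaches can be sketched in plain Python. This is only an emulation built on `xml.etree.ElementTree`, not the Custom Parser: the HTML fragment, the class names, and the local `xpath_one` helper are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; element ids and class names are assumptions.
HTML = """
<body>
  <div id="shoes">
    <div class="title">Nice shoes</div>
    <div class="price">223.12</div>
    <div class="description">Comfortable</div>
  </div>
</body>
"""

FIELDS = ("title", "price", "description")

def xpath_one(node, expression):
    """Illustrative helper: first element matching `expression`, or None."""
    return node.find(expression)

document = ET.fromstring(HTML)

# Example 2 style: every field pipeline starts from the whole document,
# so the shoes selector is repeated in each expression.
flat = {
    field: xpath_one(document, f".//div[@id='shoes']/div[@class='{field}']").text
    for field in FIELDS
}

# Example 3 style: the scope-level pipeline runs first, and its output
# becomes the input of each nested field pipeline.
shoes = xpath_one(document, ".//div[@id='shoes']")
scoped = {
    field: xpath_one(shoes, f"./div[@class='{field}']").text
    for field in FIELDS
}

print({"shoes": scoped})
```

Both dictionaries hold the same values; the scoped variant simply selects the parent element once.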
List of nested objects
Use case: previously, you wanted to parse only `shoes` information. Now you want to parse the information of all products in the HTML.
The Sample HTML is used again as the document-under-parse.
If you want your parsed result to look like this:
The parsing instructions would look as follows:
Example 4. Parsing all products found in the HTML document.
The parsing instruction structure looks similar to the one in Example 3. However, there are two major differences:
1. `xpath` is used instead of `xpath_one` in the `products._fns` pipeline. The `products._fns` pipeline will now output a list of all elements that match the provided XPath expression (a list of product elements).
2. The `_items` reserved property is used to indicate that you want to form a list by iterating through each item of the `products._fns` pipeline output and processing every list item separately down the pipeline scope.
If the `_items` reserved property wasn't used in the Example 4 parsing instructions, the parsed result would look as follows:
`_items` is used to specify that the Custom Parser must pass separate list items instead of the whole list down the parsing instructions.
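To make the distinction concrete, here is a minimal Python sketch that contrasts per-item processing with passing the whole list down. It only emulates the idea with `xml.etree.ElementTree`; the HTML fragment and helper names are hypothetical, and the output shapes are an assumption rather than a reproduction of what the Custom Parser returns.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment with two products; not the Sample HTML from this page.
HTML = """
<body>
  <div class="product"><div class="title">Shoes</div><div class="price">223.12</div></div>
  <div class="product"><div class="title">Pants</div><div class="price">60.00</div></div>
</body>
"""

FIELDS = ("title", "price")

def field_texts(node, field):
    """Illustrative helper: texts of all descendants with the given class."""
    return [element.text for element in node.findall(f".//div[@class='{field}']")]

document = ET.fromstring(HTML)
products = document.findall(".//div[@class='product']")

# Per-item processing (the role `_items` plays): one object per product.
per_item = [
    {field: field_texts(product, field) for field in FIELDS}
    for product in products
]

# Whole-list processing: each field collects values across all products,
# so the link between a title and its price is lost.
whole_list = {
    field: [text for product in products for text in field_texts(product, field)]
    for field in FIELDS
}

print(per_item)
print(whole_list)
```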
Select the N-th element from a list
This section demonstrates the flexibility of pipelines. The same problem can be approached in different ways.
Multiple options can be used to select the N-th element from a list of any values.
Use case: you want to select the second product price from the page.
The Sample HTML is again used as an example. You have multiple options to select the 2nd product's price.
Option 1
You can utilize the XPath `[]` selector and define the selection directly in the XPath expression.
Example 5. Select 2nd price using XPath [] selector.
Result:
Option 2
You can also use the `xpath` function to find all prices and pipe them to the `select_nth` function, which selects the n-th element from the extracted list of prices.
Example 6. Select the 2nd value using the `select_nth` function.
Result:
Notice how the `select_nth` function returns an item from a list while the `xpath` function returns a list of items, even if a single item is found.
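The list-versus-item distinction can be sketched as follows. The `xpath` and `select_nth` functions here are local Python stand-ins written for this example, not the Custom Parser functions themselves, and the zero-based index is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment with three prices.
HTML = """
<body>
  <div class="product"><div class="price">223.12</div></div>
  <div class="product"><div class="price">60.00</div></div>
  <div class="product"><div class="price">15.49</div></div>
</body>
"""

def xpath(node, expression):
    """Stand-in: texts of all matching elements, always as a list."""
    return [element.text for element in node.findall(expression)]

def select_nth(items, index):
    """Stand-in: a single item from the list (zero-based index assumed here)."""
    return items[index]

document = ET.fromstring(HTML)
prices = xpath(document, ".//div[@class='price']")  # list of strings
second_price = select_nth(prices, 1)                # single string

print(prices)
print(second_price)
```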
Option 3
You can use `select_nth` with any list type, including lists of HTML elements:
Example 7. Selecting all product HTML elements with `class="product"` ==> selecting 2nd product element from the list ==> extracting price text from the selected product HTML element.
Result:
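The three-step pipeline described in Example 7 can be emulated in Python by chaining plain functions. This is a sketch under assumptions: the HTML fragment is hypothetical, the steps are ordinary lambdas rather than Custom Parser functions, and a zero-based index is assumed for picking the second element.

```python
import xml.etree.ElementTree as ET
from functools import reduce

# Hypothetical fragment with three products.
HTML = """
<body>
  <div class="product"><div class="price">223.12</div></div>
  <div class="product"><div class="price">60.00</div></div>
  <div class="product"><div class="price">15.49</div></div>
</body>
"""

document = ET.fromstring(HTML)

# Each step receives the previous step's output, mirroring a pipeline.
pipeline = [
    lambda doc: doc.findall(".//div[@class='product']"),         # all product elements
    lambda products: products[1],                                # 2nd element of the list
    lambda product: product.find("./div[@class='price']").text,  # price text
]

result = reduce(lambda value, step: step(value), pipeline, document)
print(result)
```

The middle step works on a list of elements rather than strings, which is the point of Option 3: the selection step does not care what the list contains.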
Error handling
When given the following HTML snippet:
And trying to parse it with the following parsing instructions:
Custom Parser will return a parsed result where `price` and `title` were parsed normally, but `description` failed to parse because the `convert_to_float` function could not convert a `string` to a `float`:
By default, all errors are counted as warnings and are placed inside the `_warnings` list. If you would like to ignore the errors when parsing a field, you can suppress warnings/errors with the `"_on_error": "suppress"` parameter:
Which will then produce the following result:
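The warn-by-default and suppress behaviors can be illustrated with a small Python emulation. Everything here is an assumption made for the sketch: the `parse` helper, the `convert_to_float` stand-in, the use of `None` for a failed field, and the warning message format are not taken from the Custom Parser.

```python
def convert_to_float(value):
    """Stand-in for a float conversion step; raises ValueError on bad input."""
    return float(value)

def parse(fields):
    """Run each field's pipeline and collect failures as warnings.

    `fields` maps a field name to (raw_value, pipeline, on_error), where
    on_error is None (default: record a warning) or "suppress".
    """
    result, warnings = {}, []
    for name, (value, pipeline, on_error) in fields.items():
        try:
            for step in pipeline:
                value = step(value)
            result[name] = value
        except (ValueError, TypeError) as error:
            result[name] = None  # assumed placeholder for a failed field
            if on_error != "suppress":
                warnings.append(f"{name}: {error}")
    if warnings:
        result["_warnings"] = warnings
    return result

default_result = parse({
    "price": ("223.12", [convert_to_float], None),
    "description": ("Comfortable", [convert_to_float], None),
})
suppressed_result = parse({
    "price": ("223.12", [convert_to_float], None),
    "description": ("Comfortable", [convert_to_float], "suppress"),
})

print(default_result)
print(suppressed_result)
```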
Array of arrays
Custom Parser allows N-dimensional arrays in parsed results. As an example, let’s use the following HTML snippet:
Let's say you want to parse the document so that the result is a 3x3 two-dimensional array of integers:
To parse the HTML into the JSON above, you can use the following parsing instructions:
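For comparison, the same rows-then-cells idea can be sketched in plain Python with `xml.etree.ElementTree`. The table markup and its values below are hypothetical and chosen only to show how nested iteration yields a list of lists; it does not reproduce the parsing instructions.

```python
import xml.etree.ElementTree as ET

# Hypothetical 3x3 table; the real snippet may use different markup and values.
HTML = """
<table>
  <tr><td>1</td><td>2</td><td>3</td></tr>
  <tr><td>4</td><td>5</td><td>6</td></tr>
  <tr><td>7</td><td>8</td><td>9</td></tr>
</table>
"""

table = ET.fromstring(HTML)

# Outer iteration over rows, inner iteration over cells, with an
# integer conversion applied to each cell's text.
matrix = [
    [int(cell.text) for cell in row.findall("./td")]
    for row in table.findall("./tr")
]

print(matrix)
```

Each nesting level in the comprehension corresponds to one dimension of the resulting array.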