Parsing instruction examples

This document presents sample use cases of the Custom Parser.

The following HTML snippet is parsed using example parsing instructions in the upcoming sections.

Sample HTML

<body>
    <div id="products">
        <div class="product" id="shoes">
            <div class="title">Shoes</div>
            <div class="price">223.12</div>
            <div class="description">
                <ul>
                    <li class="description-item">Super</li>
                </ul>
            </div>
        </div>
        <div class="product" id="pants">
            <div class="title">Pants</div>
            <div class="price">60.12</div>
            <div class="description">
                <ul>
                    <li class="description-item">Amazing</li>
                    <li class="description-item">Quality</li>
                </ul>
            </div>
        </div>
        <div class="product" id="socks">
            <div class="title">Socks</div>
            <div class="price">123.12</div>
            <div class="description">
                <ul>
                    <li class="description-item">Very</li>
                    <li class="description-item">Nice</li>
                    <li class="description-item">Socks</li>
                </ul>
            </div>
        </div>
    </div>
</body>

Bare minimum

Use case: you want to extract the text from all shoes description items.

Example 1. Shoes description items selection using XPath.

{
    "shoes_description": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": [
                    ".//div[@id='shoes']//li[@class='description-item']/text()"
                ]
            }
        ]
    }
}

The xpath function finds a single matching item and returns it as a string inside a list:

{
    "shoes_description": [
        "Super"
    ]
}

The exact xpath function behavior is described here.
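For readers who want to check the selection locally, here is a minimal sketch that mimics Example 1 with Python's standard library. It is not how Custom Parser is implemented. It assumes the Sample HTML is well-formed enough for an XML parser, and because `xml.etree.ElementTree` supports only a subset of XPath (no `text()` step), it selects the `li` elements and reads their `.text` instead.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the Sample HTML (well-formed, so an XML parser accepts it).
HTML = """
<body>
  <div id="products">
    <div class="product" id="shoes">
      <div class="description"><ul>
        <li class="description-item">Super</li>
      </ul></div>
    </div>
    <div class="product" id="pants">
      <div class="description"><ul>
        <li class="description-item">Amazing</li>
        <li class="description-item">Quality</li>
      </ul></div>
    </div>
  </div>
</body>
"""

root = ET.fromstring(HTML)

# Equivalent of .//div[@id='shoes']//li[@class='description-item']/text(),
# split into two steps because ElementTree has no text() selector.
shoes = root.find(".//div[@id='shoes']")
shoes_description = [
    li.text for li in shoes.findall(".//li[@class='description-item']")
]

print({"shoes_description": shoes_description})
```

Even though only one item matches, the result is still a list, which mirrors the xpath output shown above.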

Nested parsing instructions

Use case: you want to parse all information related to shoes. Also, the parsed result should represent the document structure of the provided HTML.

You are targeting this part of the Sample HTML:

<div class="product" id="shoes">
    <div class="title">Shoes</div>
    <div class="price">223.12</div>
    <div class="description">
        <ul>
            <li class="description-item">Super</li>
        </ul>
    </div>
</div>

And you would like the parsed result to be of the following structure:

{
    "shoes": {
        "title": "Shoes",
        "price": "223.12",
        "description": [
            "Super"
        ]
    }
}

Parsing instructions would look as follows.

Example 2. Parsing instructions are used to parse shoes information.

{
    "shoes": {
        "title": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["//div[@id='shoes']/div[@class='title']/text()"]
                }
            ]
        },
        "price": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["//div[@id='shoes']/div[@class='price']/text()"]
                }
            ]
        },
        "description": {
            "_fns": [
                {
                    "_fn": "xpath",
                    "_args": ["//div[@id='shoes']//li[@class='description-item']/text()"]
                }
            ]
        }
    }
}

xpath_one works similarly to xpath, but instead of returning a list of all matches, it returns the first matched item.
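The difference between the two functions can be sketched in Python. The helpers `xpath_texts` and `xpath_one_text` below are hypothetical stand-ins written for this illustration, not Custom Parser functions, and returning `None` when nothing matches is an assumption made only for the sketch.

```python
import xml.etree.ElementTree as ET

PRODUCT = ET.fromstring("""
<div class="product" id="pants">
  <div class="title">Pants</div>
  <ul>
    <li class="description-item">Amazing</li>
    <li class="description-item">Quality</li>
  </ul>
</div>
""")


def xpath_texts(element, path):
    """Stand-in for xpath: text of every match, always returned as a list."""
    return [match.text for match in element.findall(path)]


def xpath_one_text(element, path):
    """Stand-in for xpath_one: text of the first match only.

    Returning None on no match is an assumption of this sketch.
    """
    match = element.find(path)
    return None if match is None else match.text


all_items = xpath_texts(PRODUCT, ".//li[@class='description-item']")
first_item = xpath_one_text(PRODUCT, ".//li[@class='description-item']")
title_as_list = xpath_texts(PRODUCT, "./div[@class='title']")

print(all_items)      # list of every match
print(first_item)     # first match only, as a plain string
print(title_as_list)  # a single match is still wrapped in a list
```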

In the example above, the shoes property is the only property defined in the outermost instructions scope. The shoes property contains nested parsing instructions.

The shoes instructions scope does not have a pipeline defined (_fns property is missing). This means pipelines defined in title, price, and description scopes will use the document-under-parse as a pipeline input.

In Example 2, you can see a repetition of //div[@id='shoes'] in XPath expressions. The repetition can be avoided by defining a pipeline in shoes scope:

Example 3. Defining a pipeline in shoes scope instructions to avoid XPath expression repetition.

{
    "shoes": {
        "_fns": [
            {
                "_fn": "xpath_one",
                "_args": ["//div[@id='shoes']"]
            }
        ],
        "title": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["./div[@class='title']/text()"]
                }
            ]
        },
        "price": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["./div[@class='price']/text()"]
                }
            ]
        },
        "description": {
            "_fns": [
                {
                    "_fn": "xpath",
                    "_args": [".//li[@class='description-item']/text()"]
                }
            ]
        }
    }
}

By using the parsing instructions provided in Example 3, Custom Parser will:

  1. Start with processing shoes._fns pipeline, which will output the shoes HTML element;

  2. Take shoes._fns pipeline output and use it as an input for pipelines, defined in title, price, and description scopes;

  3. Process title, price, and description pipelines to produce final values.

The result will look the same as a result from Example 2:

{
    "shoes": {
        "title": "Shoes",
        "price": "223.12",
        "description": [
            "Super"
        ]
    }
}

The main difference between Example 2 and Example 3 is that in Example 3, a pipeline is defined in the shoes scope. This additional pipeline selects the shoes element and passes it on to the pipelines found deeper in the instructions hierarchy.
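The three processing steps of Example 3 can be sketched in plain Python as follows. This only illustrates the scoping idea (the parent pipeline output becomes the input of the child pipelines) under the assumption that `xml.etree.ElementTree` is an acceptable stand-in for the parser; element `.text` is read in place of the unsupported `text()` step.

```python
import xml.etree.ElementTree as ET

DOC = ET.fromstring("""
<body>
  <div id="products">
    <div class="product" id="shoes">
      <div class="title">Shoes</div>
      <div class="price">223.12</div>
      <div class="description"><ul>
        <li class="description-item">Super</li>
      </ul></div>
    </div>
    <div class="product" id="pants">
      <div class="title">Pants</div>
      <div class="price">60.12</div>
    </div>
  </div>
</body>
""")

# Step 1: the pipeline in the shoes scope selects the shoes element.
shoes_el = DOC.find(".//div[@id='shoes']")

# Steps 2 and 3: the child pipelines receive that element as input,
# so their paths are relative ("./...") instead of starting at the document.
result = {
    "shoes": {
        "title": shoes_el.find("./div[@class='title']").text,
        "price": shoes_el.find("./div[@class='price']").text,
        "description": [
            li.text
            for li in shoes_el.findall(".//li[@class='description-item']")
        ],
    }
}

print(result)
```

Because the relative paths are evaluated against the shoes element, the Pants title and price are never considered, even though they share the same class names.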

List of nested objects

Use case: Previously, you wanted to parse only shoes information. Now you want to parse the information of all products in the HTML.

The Sample HTML is used again as the document-under-parse.

If you want your parsed result to look like this:

{
    "products": [
        {
            "title": "Shoes",
            "price": "223.12",
            "description": [
                "Super"
            ]
        },
        {
            "title": "Pants",
            "price": "60.12",
            "description": [
                "Amazing",
                "Quality"
            ]
        },
        {
            "title": "Socks",
            "price": "123.12",
            "description": [
                "Very",
                "Nice",
                "Socks"
            ]
        }
    ]
}

The parsing instructions would look as follows:

Example 4. Parsing all products found in the HTML document.

{
    "products": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": ["//div[@class='product']"]
            }
        ],
        "_items": {
            "title": {
                "_fns": [
                    {
                        "_fn": "xpath_one",
                        "_args": ["./div[@class='title']/text()"]
                    }
                ]
            },
            "price": {
                "_fns": [
                    {
                        "_fn": "xpath_one",
                        "_args": ["./div[@class='price']/text()"]
                    }
                ]
            },
            "description": {
                "_fns": [
                    {
                        "_fn": "xpath",
                        "_args": [".//li[@class='description-item']/text()"]
                    }
                ]
            }
        }
    }
}

The parsing instruction structure looks similar to the one in Example 3. However, there are two major differences:

  1. xpath is used instead of xpath_one in products._fns pipeline. products._fns pipeline will now output a list of all elements that match the provided XPath expression (a list of product elements).

  2. _items reserved property is used to indicate that you want to form a list by iterating through each item of the products._fns pipeline output and passing/processing every list item separately down the pipeline scope.

If _items reserved property wasn't used in Example 4 parsing instructions, the parsed result would look as follows:

{
    "products": {
        "title": [
            "Shoes",
            "Pants",
            "Socks"
        ],
        "price": [
            "223.12",
            "60.12",
            "123.12"
        ],
        "description": [
            [
                "Super"
            ],
            [
                "Amazing",
                "Quality"
            ],
            [
                "Very",
                "Nice",
                "Socks"
            ]
        ]
    }
}

_items is used to specify that the Custom Parser must pass separate list items instead of the whole list down the parsing instructions.
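The two result shapes can be reproduced with a short Python sketch. It is an illustration of the iteration semantics only, assuming `xml.etree.ElementTree` as a stand-in parser; the `texts` helper is hypothetical, and the price field is left out to keep the sketch short.

```python
import xml.etree.ElementTree as ET

DOC = ET.fromstring("""
<div id="products">
  <div class="product" id="shoes">
    <div class="title">Shoes</div>
    <ul><li class="description-item">Super</li></ul>
  </div>
  <div class="product" id="pants">
    <div class="title">Pants</div>
    <ul><li class="description-item">Amazing</li>
        <li class="description-item">Quality</li></ul>
  </div>
</div>
""")


def texts(element, path):
    """Hypothetical helper: text of every element matching path."""
    return [match.text for match in element.findall(path)]


products = DOC.findall(".//div[@class='product']")

# With _items: each product element is processed separately,
# which yields one object per product.
with_items = [
    {
        "title": p.find("./div[@class='title']").text,
        "description": texts(p, ".//li[@class='description-item']"),
    }
    for p in products
]

# Without _items: every field pipeline receives the whole list,
# which yields one list per field.
without_items = {
    "title": [p.find("./div[@class='title']").text for p in products],
    "description": [
        texts(p, ".//li[@class='description-item']") for p in products
    ],
}

print(with_items)
print(without_items)
```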

Select the N-th element from a list

This section demonstrates the flexibility of pipelines. The same problem can be approached in different ways.

Multiple options can be used to select the N-th element from a list of any values.

Use case: you want to select the second product price from the page.

The Sample HTML is again used as an example. You have multiple options to select the 2nd product price.

Option 1

You can utilize the XPath [] selector and define selection in the XPath expression.

Example 5. Select 2nd price using XPath [] selector.

{
    "second_price": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": [
                    "(//div[@class='price'])[2]/text()"
                ]
            }
        ]
    }
}

Result:

{
    "second_price": [
        "60.12"
    ]
}

Option 2

You can also use the xpath function to find all prices and pipe the resulting list to the select_nth function, which selects the n-th element from the extracted list of prices.

Example 6. Select the 2nd value using the `select_nth` function.

{
    "second_price": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": [
                    "//div[@class='price']/text()"
                ]
            },
            {
                "_fn": "select_nth",
                "_args": 1
            }
        ]
    }
}

Result:

{
    "second_price": "60.12"
}

Notice how the select_nth function returns an item from a list while the xpath function returns a list of items, even if a single item is found.
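Note the indexing difference between Options 1 and 2: the XPath predicate `[2]` counts from 1, while Example 6 passes 1 to select_nth to get the second price, which suggests zero-based indexing. A minimal Python sketch of that behavior follows; the `select_nth` function here is a hypothetical stand-in inferred from Example 6, not the Custom Parser implementation.

```python
# What the xpath step of Example 6 returns for the Sample HTML.
prices = ["223.12", "60.12", "123.12"]


def select_nth(items, n):
    """Hypothetical stand-in: zero-based selection, inferred from Example 6."""
    return items[n]


second_price = select_nth(prices, 1)

# The list goes in, a single item (not a list) comes out.
print({"second_price": second_price})
```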

Option 3

You can use select_nth with any list type, including lists of HTML elements:

Example 7. Selecting all product HTML elements with class="product" ==> selecting 2nd product element from the list ==> extracting price text from the selected product HTML element.

{
    "second_price": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": ["//div[@class='product']"]
            },
            {
                "_fn": "select_nth",
                "_args": 1
            },
            {
                "_fn": "xpath",
                "_args": ["./div[@class='price']/text()"]
            }
        ]
    }
}

Result:

{
    "second_price": ["60.12"]
}
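The chain in Example 7 can be pictured as a list of functions where each output feeds the next input. The sketch below assumes `xml.etree.ElementTree` as a stand-in parser and uses plain lambdas in place of the xpath and select_nth functions; it is meant to show the data flow, not the real implementation.

```python
import xml.etree.ElementTree as ET
from functools import reduce

DOC = ET.fromstring("""
<div id="products">
  <div class="product" id="shoes"><div class="price">223.12</div></div>
  <div class="product" id="pants"><div class="price">60.12</div></div>
  <div class="product" id="socks"><div class="price">123.12</div></div>
</div>
""")

# Each step mirrors one entry of the _fns list in Example 7.
pipeline = [
    # xpath: document -> list of product elements
    lambda doc: doc.findall(".//div[@class='product']"),
    # select_nth with 1: list of elements -> second element
    lambda products: products[1],
    # xpath: single element -> list of price strings
    lambda product: [el.text for el in product.findall("./div[@class='price']")],
]

# Feed the document through the steps in order.
second_price = reduce(lambda value, step: step(value), pipeline, DOC)

print({"second_price": second_price})
```

The final step is an xpath-style selection, so the result is a list again, matching the output of Example 7.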

Error handling

When given the following HTML snippet:

<div class="product" id="shoes">
    <div class="title">Shoes</div>
    <div class="price">223.12</div>
    <div class="description">Super</div>
</div>

And trying to parse it with the following parsing instructions:

{
    "product": {
        "_fns": [
            {
                "_fn": "xpath_one",
                "_args": ["//div[@id='shoes']"]
            }
        ],
        "price": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["//div[@class='price']/text()"]
                }
            ]
        },
        "title": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["//div[@class='title']/text()"]
                }
            ]
        },
        "description": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["//div[@class='description']/text()"]
                },
                {
                    "_fn": "convert_to_float"
                }
            ]
        }
    }
}

Custom Parser will return a parsed result in which price and title are parsed normally, but description is null because the convert_to_float function failed to convert the string to a float:

{
    "product": {
        "price": "223.12",
        "title": "Shoes",
        "description": null
    },
    "_warnings": [
        {
            "_fn": "convert_to_float",
            "_fn_idx": 1,
            "_msg": "Failed to process function.",
            "_path": ".product.description"
        }
    ]
}

By default, all errors are counted as warnings and are placed inside the _warnings list. If you would like to ignore the errors when parsing a field, you can suppress them with the "_on_error": "suppress" parameter:

{
    "product": {
        ...,
        "description": {
            "_on_error": "suppress",
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["//div[@class='description']/text()"]
                },
                {
                    "_fn": "convert_to_float"
                }
            ]
        }
    }
}

This produces the following result:

{
    "product": {
        "price": "223.12",
        "title": "Shoes",
        "description": null
    }
}
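The warning and suppress behavior described in this section can be modeled with a small Python sketch. Everything in it is hypothetical: `run_pipeline`, its parameters, and the default mode name "warn" are illustration-only names (the only mode value documented here is "suppress"), and the warning fields simply copy the shape of the _warnings entry shown earlier in this section.

```python
def run_pipeline(path, steps, warnings, on_error="warn"):
    """Run (name, callable) steps in order; return None if any step fails.

    Hypothetical model of the documented behavior, not the real parser.
    """
    value = None
    for idx, (name, fn) in enumerate(steps):
        try:
            value = fn(value)
        except Exception:
            if on_error != "suppress":
                warnings.append({
                    "_fn": name,
                    "_fn_idx": idx,
                    "_msg": "Failed to process function.",
                    "_path": path,
                })
            return None
    return value


steps = [
    ("xpath_one", lambda _: "Super"),  # pretend the XPath step matched "Super"
    ("convert_to_float", float),       # float("Super") raises ValueError
]

reported, silenced = [], []
default_result = run_pipeline(".product.description", steps, reported)
suppressed_result = run_pipeline(
    ".product.description", steps, silenced, on_error="suppress"
)

print(default_result, reported)
print(suppressed_result, silenced)
```

In both modes the field value is null (None in Python); the only difference is whether a warning is recorded.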

Array of arrays

Custom Parser allows N-dimensional arrays in parsed results. As an example, let’s use the following HTML snippet:

<div class="row">
    <div class="column">1</div>
    <div class="column">2</div>
    <div class="column">3</div>
</div>
<div class="row">
    <div class="column">4</div>
    <div class="column">5</div>
    <div class="column">6</div>
</div>
<div class="row">
    <div class="column">7</div>
    <div class="column">8</div>
    <div class="column">9</div>
</div>

Let's say you want to parse the document so that the result is a 3x3 two-dimensional array of integers:

{
    "table": [
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]
    ]
}

To parse the HTML into the JSON above, you can use the following parsing instructions:

{
    "table": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": ["//div[@class='row']"]
            },
            {
                "_fn": "xpath",
                "_args": [".//div[@class='column']/text()"]
            },
            {
                "_fn": "convert_to_int"
            }
        ]
    }
}
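To make the intended result shape concrete, here is a minimal Python sketch that builds the same 3x3 array with a nested list comprehension. It assumes `xml.etree.ElementTree` as a stand-in parser and wraps the rows in a hypothetical `root` element, since an XML parser needs a single top-level element. How Custom Parser itself maps each function over nested lists is not described in this section, so the sketch shows only the target output.

```python
import xml.etree.ElementTree as ET

# The <root> wrapper is added only so the snippet parses as one XML document.
DOC = ET.fromstring("""
<root>
  <div class="row">
    <div class="column">1</div><div class="column">2</div><div class="column">3</div>
  </div>
  <div class="row">
    <div class="column">4</div><div class="column">5</div><div class="column">6</div>
  </div>
  <div class="row">
    <div class="column">7</div><div class="column">8</div><div class="column">9</div>
  </div>
</root>
""")

# Outer level: one entry per row element.
# Inner level: one integer per column inside that row.
table = [
    [int(col.text) for col in row.findall(".//div[@class='column']")]
    for row in DOC.findall(".//div[@class='row']")
]

print({"table": table})
```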
