# Parsing instruction examples

The following HTML snippet is parsed using the example parsing instructions in the upcoming sections.

### Sample HTML

```html

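<div class="product" id="shoes">
    <div class="title">Shoes</div>
    <div class="price">223.12</div>
    <div class="description">
        <ul>
            <li class="description-item">Super</li>
        </ul>
    </div>
</div>
<div class="product">
    <div class="title">Pants</div>
    <div class="price">60.12</div>
    <div class="description">
        <ul>
            <li class="description-item">Amazing</li>
            <li class="description-item">Quality</li>
        </ul>
    </div>
</div>
<div class="product">
    <div class="title">Socks</div>
    <div class="price">123.12</div>
    <div class="description">
        <ul>
            <li class="description-item">Very</li>
            <li class="description-item">Nice</li>
            <li class="description-item">Socks</li>
        </ul>
    </div>
</div>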

```

### Bare minimum

{% hint style="info" %}
Use case: you want to extract the text from all **shoes** **description** **items**.
{% endhint %}

*Example 1. Shoes description items selection using XPath.*

```json
{
    "shoes_description": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": [
                    ".//div[@id='shoes']//li[@class='description-item']/text()"
                ]
            }
        ]
    }
}
```

The `xpath` function will find a single item and put it in a list as a string:

```json
{
    "shoes_description": [
        "Super"
    ]
}
```

The exact `xpath` function behavior is described [**here**](https://developers.oxylabs.io/scraping-solutions/web-scraper-api/features/custom-parser/writing-instructions-manually/list-of-functions).

### Nested parsing instructions

{% hint style="info" %}
Use case: you want to parse all information related to shoes. Also, the parsed result should represent the document structure of the provided HTML.
{% endhint %}

You are targeting this part of the Sample HTML:

```html

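<div class="product" id="shoes">
    <div class="title">Shoes</div>
    <div class="price">223.12</div>
    <div class="description">
        <ul>
            <li class="description-item">Super</li>
        </ul>
    </div>
</div>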

```

And you would like the parsed result to be of the following structure:

```json
{
    "shoes": {
        "title": "Shoes",
        "price": "223.12",
        "description": [
            "Super"
        ]
    }
}
```

Parsing instructions would look as follows.

*Example 2. Parsing instructions are used to parse* `shoes` *information.*

```json
{
    "shoes": {
        "title": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["//div[@id='shoes']/div[@class='title']/text()"]
                }
            ]
        },
        "price": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["//div[@id='shoes']/div[@class='price']/text()"]
                }
            ]
        },
        "description": {
            "_fns": [
                {
                    "_fn": "xpath",
                    "_args": ["//div[@id='shoes']//li[@class='description-item']/text()"]
                }
            ]
        }
    }
}
```

`xpath_one` works similarly to `xpath`, but instead of returning a list of all matches, it **returns the first matched item**.

In the example above, the `shoes` property is the only property defined in the outermost instructions scope. The `shoes` property contains nested parsing instructions.

The `shoes` instructions scope does not have a pipeline defined (the `_fns` property is missing). This means the pipelines defined in the `title`, `price`, and `description` scopes will use the document-under-parse as their pipeline input.

In Example 2, you can see a repetition of `//div[@id='shoes']` in the XPath expressions. The repetition can be avoided by defining a pipeline in the `shoes` scope:

*Example 3. Defining a pipeline in* `shoes` *scope instructions to avoid XPath expression repetition.*

```json
{
    "shoes": {
        "_fns": [
            {
                "_fn": "xpath_one",
                "_args": ["//div[@id='shoes']"]
            }
        ],
        "title": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["./div[@class='title']/text()"]
                }
            ]
        },
        "price": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["./div[@class='price']/text()"]
                }
            ]
        },
        "description": {
            "_fns": [
                {
                    "_fn": "xpath",
                    "_args": [".//li[@class='description-item']/text()"]
                }
            ]
        }
    }
}
```

By using the parsing instructions provided in Example 3, Custom Parser will:

1. Start with processing the `shoes._fns` pipeline, which will output the `shoes` HTML element;
2. Take the `shoes._fns` pipeline output and use it as an input for the pipelines defined in the `title`, `price`, and `description` scopes;
3. Process the `title`, `price`, and `description` pipelines to produce final values.

The result will look the same as the result from Example 2:

```json
{
    "shoes": {
        "title": "Shoes",
        "price": "223.12",
        "description": [
            "Super"
        ]
    }
}
```

The main difference between Example 2 and Example 3 is that in Example 3, a pipeline is defined in the `shoes` scope. **This additional pipeline selects the element of the shoes and passes it on to further pipelines found deeper in the instructions hierarchy.**

### List of nested objects

{% hint style="info" %}
**Use case:** Previously, you wanted to parse only `shoes` information. Now you want to parse the information of all products in the HTML.
{% endhint %}

The [**Sample HTML**](#sample-html) is used again as the document-under-parse. If you want your parsed result to look like this:

```json
{
    "products": [
        {
            "title": "Shoes",
            "price": "223.12",
            "description": [
                "Super"
            ]
        },
        {
            "title": "Pants",
            "price": "60.12",
            "description": [
                "Amazing",
                "Quality"
            ]
        },
        {
            "title": "Socks",
            "price": "123.12",
            "description": [
                "Very",
                "Nice",
                "Socks"
            ]
        }
    ]
}
```

The parsing instructions would look as follows:

*Example 4. Parsing all products found in the HTML document.*

```json
{
    "products": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": ["//div[@class='product']"]
            }
        ],
        "_items": {
            "title": {
                "_fns": [
                    {
                        "_fn": "xpath_one",
                        "_args": ["./div[@class='title']/text()"]
                    }
                ]
            },
            "price": {
                "_fns": [
                    {
                        "_fn": "xpath_one",
                        "_args": ["./div[@class='price']/text()"]
                    }
                ]
            },
            "description": {
                "_fns": [
                    {
                        "_fn": "xpath",
                        "_args": [".//li[@class='description-item']/text()"]
                    }
                ]
            }
        }
    }
}
```

The parsing instruction structure looks similar to the one in Example 3. However, there are two major differences:

1. `xpath` is used instead of `xpath_one` in the `products._fns` pipeline.
   The `products._fns` pipeline will now output a list of all elements that match the provided XPath expression (a list of product elements).
2. The `_items` reserved property is used to indicate that you want to form a list by iterating through each item of the `products._fns` pipeline output and **passing/processing every list item separately** down the pipeline scope.

If the `_items` reserved property wasn't used in the Example 4 parsing instructions, the parsed result would look as follows:

```json
{
    "products": {
        "title": [
            "Shoes",
            "Pants",
            "Socks"
        ],
        "price": [
            "223.12",
            "60.12",
            "123.12"
        ],
        "description": [
            [
                "Super"
            ],
            [
                "Amazing",
                "Quality"
            ],
            [
                "Very",
                "Nice",
                "Socks"
            ]
        ]
    }
}
```

{% hint style="warning" %}
`_items` is used to specify that the Custom Parser must pass ***separate list items*** instead of the ***whole list*** down the parsing instructions.
{% endhint %}

### Select the N-th element from a list

This section demonstrates the flexibility of pipelines: the same problem can be approached in different ways. Multiple options can be used to select the N-th element from a list of any values.

{% hint style="info" %}
**Use case:** you want to select the second product price from the page.
{% endhint %}

The [**Sample HTML**](#sample-html) is again used as an example. You have multiple options to select the 2nd product price.

#### Option 1

You can utilize the XPath `[]` selector and define the selection in the XPath expression.

*Example 5. Select 2nd price using XPath \[] selector.*

```json
{
    "second_price": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": [
                    "(//div[@class='price'])[2]/text()"
                ]
            }
        ]
    }
}
```

Result:

```json
{
    "second_price": [
        "60.12"
    ]
}
```

#### Option 2

You can also use the `xpath` function to find all prices and pipe them to the `select_nth` function, which selects the n-th element from the extracted list of prices. In the example below, the index `1` selects the 2nd price.

*Example 6.
Select the 2nd value using the* `select_nth` *function.*

```json
{
    "second_price": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": [
                    "//div[@class='price']/text()"
                ]
            },
            {
                "_fn": "select_nth",
                "_args": 1
            }
        ]
    }
}
```

Result:

```json
{
    "second_price": "60.12"
}
```

{% hint style="warning" %}
Notice how the `select_nth` function returns an item from a list while the `xpath` function returns a list of items, even if a single item is found.
{% endhint %}

#### Option 3

You can use `select_nth` with any list type, including lists of HTML elements:

*Example 7. Selecting all product HTML elements with* `class="product"` *==> selecting 2nd product element from the list ==> extracting price text from the selected product HTML element*.

```json
{
    "second_price": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": ["//div[@class='product']"]
            },
            {
                "_fn": "select_nth",
                "_args": 1
            },
            {
                "_fn": "xpath",
                "_args": ["./div[@class='price']/text()"]
            }
        ]
    }
}
```

Result:

```json
{
    "second_price": ["60.12"]
}
```

### Error handling

When given the following HTML snippet:

```html

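<div class="product" id="shoes">
    <div class="title">Nice Shoes</div>
    <div class="price">223.12</div>
    <div class="description">
        <ul>
            <li class="description-item">Super</li>
        </ul>
    </div>
</div>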

```

And trying to parse it with the following parsing instructions:

```json
{
    "product": {
        "_fns": [
            {
                "_fn": "xpath_one",
                "_args": ["//div[@id='shoes']"]
            }
        ],
        "price": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["//div[@class='price']/text()"]
                }
            ]
        },
        "title": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["//div[@class='title']/text()"]
                }
            ]
        },
        "description": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["//div[@class='description']/text()"]
                },
                {
                    "_fn": "convert_to_float"
                }
            ]
        }
    }
}
```

Custom Parser will return a parsed result where `price` and `title` were parsed normally, but `description` failed to parse because the `convert_to_float` function could not convert a `string` to a `float`:

```json
{
    "product": {
        "price": "223.12",
        "title": "Nice Shoes",
        "description": null
    },
    "_warnings": [
        {
            "_fn": "convert_to_float",
            "_fn_idx": 1,
            "_msg": "Failed to process function.",
            "_path": ".product.description"
        }
    ]
}
```

By default, all errors are counted as warnings and are placed inside the `_warnings` list. If you would like to ignore the errors when parsing a field, you can suppress its warnings/errors with the `"_on_error": "suppress"` parameter:

```json
{
    "product": {
        ...,
        "description": {
            "_on_error": "suppress",
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["//div[@class='description']/text()"]
                },
                {
                    "_fn": "convert_to_float"
                }
            ]
        }
    }
}
```

This will then produce the following result:

```json
{
    "product": {
        "price": "223.12",
        "title": "Nice Shoes",
        "description": null
    }
}
```

### Array of arrays

Custom Parser allows N-dimensional arrays in parsed results. As an example, let’s use the following HTML snippet:

```html
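<div class="row">
    <div class="column">1</div>
    <div class="column">2</div>
    <div class="column">3</div>
</div>
<div class="row">
    <div class="column">4</div>
    <div class="column">5</div>
    <div class="column">6</div>
</div>
<div class="row">
    <div class="column">7</div>
    <div class="column">8</div>
    <div class="column">9</div>
</div>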
```

Let's say you want to parse the document so that the result is a 3x3 two-dimensional array of integers:

```json
{
    "table": [
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]
    ]
}
```

To parse the HTML into the JSON above, you can use the following parsing instructions:

```json
{
    "table": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": ["//div[@class='row']"]
            },
            {
                "_fn": "xpath",
                "_args": [".//div[@class='column']/text()"]
            },
            {
                "_fn": "convert_to_int"
            }
        ]
    }
}
```
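To see why this pipeline can yield nested arrays, the sketch below emulates the three steps in Python using only the standard library. It is an illustration and not Custom Parser's implementation: it assumes that a function receiving a list is applied to each item separately, which is what the expected result implies. The `HTML` string, `map_nested`, and `column_texts` are hypothetical stand-ins introduced for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the table snippet: three rows, three columns each.
HTML = """
<div class="table">
  <div class="row"><div class="column">1</div><div class="column">2</div><div class="column">3</div></div>
  <div class="row"><div class="column">4</div><div class="column">5</div><div class="column">6</div></div>
  <div class="row"><div class="column">7</div><div class="column">8</div><div class="column">9</div></div>
</div>
"""


def map_nested(fn, value):
    """Apply fn to every non-list item, preserving the list nesting."""
    if isinstance(value, list):
        return [map_nested(fn, item) for item in value]
    return fn(value)


def column_texts(row):
    """Emulates .//div[@class='column']/text() for a single row element."""
    return [col.text for col in row.findall(".//div[@class='column']")]


root = ET.fromstring(HTML.strip())

rows = root.findall(".//div[@class='row']")   # step 1: list of 3 row elements
texts = map_nested(column_texts, rows)        # step 2: list of 3 lists of strings
table = map_nested(int, texts)                # step 3: every string becomes an int

print(table)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Each step keeps the shape produced by the previous one: the second lookup turns every row into its own list, and the conversion only touches the innermost values, so the 3x3 nesting survives to the final result.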