Parsing instruction examples

This document presents sample use cases of the Custom Parser.

The following HTML snippet is parsed using example parsing instructions in the upcoming sections.

Sample HTML

<body>
    <div id="products">
        <div class="product" id="shoes">
            <div class="title">Shoes</div>
            <div class="price">223.12</div>
            <div class="description">
                <ul>
                    <li class="description-item">Super</li>
                </ul>
            </div>
        </div>
        <div class="product" id="pants">
            <div class="title">Pants</div>
            <div class="price">60.12</div>
            <div class="description">
                <ul>
                    <li class="description-item">Amazing</li>
                    <li class="description-item">Quality</li>
                </ul>
            </div>
        </div>
        <div class="product" id="socks">
            <div class="title">Socks</div>
            <div class="price">123.12</div>
            <div class="description">
                <ul>
                    <li class="description-item">Very</li>
                    <li class="description-item">Nice</li>
                    <li class="description-item">Socks</li>
                </ul>
            </div>
        </div>
    </div>
</body>

Bare minimum

Use case: you want to extract the text of all description items of the shoes product.

Example 1. Selecting the shoes description items using XPath.

{
    "shoes_description": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": [
                    ".//div[@id='shoes']//li[@class='description-item']/text()"
                ]
            }
        ]
    }
}

The xpath function finds a single matching item and returns it as a string inside a list:

{
    "shoes_description": [
        "Super"
    ]
}

The exact behavior of the xpath function is described in the List of functions.
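To experiment with the selection locally, the behavior of Example 1 can be approximated in Python. This is only a sketch and not how Custom Parser is implemented: it assumes the standard-library xml.etree.ElementTree module, which supports a subset of XPath without text(), so the text is read from the .text attribute of each match. The helper name xpath_texts is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the Sample HTML (description parts of shoes and pants only).
SAMPLE_HTML = """
<body>
  <div id="products">
    <div class="product" id="shoes">
      <div class="description">
        <ul><li class="description-item">Super</li></ul>
      </div>
    </div>
    <div class="product" id="pants">
      <div class="description">
        <ul>
          <li class="description-item">Amazing</li>
          <li class="description-item">Quality</li>
        </ul>
      </div>
    </div>
  </div>
</body>
"""


def xpath_texts(node, path):
    """Hypothetical helper: text of every element matching `path`, always as a list."""
    return [element.text for element in node.findall(path)]


root = ET.fromstring(SAMPLE_HTML)
shoes_description = xpath_texts(
    root, ".//div[@id='shoes']//li[@class='description-item']"
)
print({"shoes_description": shoes_description})
```

As in Example 1, a single match still comes back wrapped in a list.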

Nested parsing instructions

Use case: you want to parse all information related to shoes, and the parsed result should mirror the structure of the provided HTML.

You are targeting this part of the Sample HTML:

<div class="product" id="shoes">
    <div class="title">Shoes</div>
    <div class="price">223.12</div>
    <div class="description">
        <ul>
            <li class="description-item">Super</li>
        </ul>
    </div>
</div>

And you would like the parsed result to be of the following structure:

{
    "shoes": {
        "title": "Shoes",
        "price": "223.12",
        "description": [
            "Super"
        ]
    }
}

Parsing instructions would look as follows.

Example 2. Parsing instructions are used to parse shoes information.

{
    "shoes": {
        "title": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["//div[@id='shoes']/div[@class='title']/text()"]
                }
            ]
        },
        "price": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["//div[@id='shoes']/div[@class='price']/text()"]
                }
            ]
        },
        "description": {
            "_fns": [
                {
                    "_fn": "xpath",
                    "_args": ["//div[@id='shoes']//li[@class='description-item']/text()"]
                }
            ]
        }
    }
}

xpath_one works similarly to xpath, but instead of returning a list of all matches, it returns the first matched item.

In the example above, the shoes property is the only property defined in the outermost instructions scope. The shoes property contains nested parsing instructions.

The shoes instructions scope does not have a pipeline defined (_fns property is missing). This means pipelines defined in title, price, and description scopes will use the document-under-parse as a pipeline input.

In Example 2, you can see a repetition of //div[@id='shoes'] in XPath expressions. The repetition can be avoided by defining a pipeline in shoes scope:

Example 3. Defining a pipeline in shoes scope instructions to avoid XPath expression repetition.

{
    "shoes": {
        "_fns": [
            {
                "_fn": "xpath_one",
                "_args": ["//div[@id='shoes']"]
            }
        ],
        "title": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["./div[@class='title']/text()"]
                }
            ]
        },
        "price": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["./div[@class='price']/text()"]
                }
            ]
        },
        "description": {
            "_fns": [
                {
                    "_fn": "xpath",
                    "_args": [".//li[@class='description-item']/text()"]
                }
            ]
        }
    }
}

By using the parsing instructions provided in Example 3, Custom Parser will:

  1. Start by processing the shoes._fns pipeline, which outputs the shoes HTML element;

  2. Take the shoes._fns pipeline output and use it as the input for the pipelines defined in the title, price, and description scopes;

  3. Process the title, price, and description pipelines to produce the final values.

The result will look the same as a result from Example 2:

{
    "shoes": {
        "title": "Shoes",
        "price": "223.12",
        "description": [
            "Super"
        ]
    }
}

The main difference between Example 2 and Example 3 is that in Example 3, a pipeline is defined in the shoes scope. This additional pipeline selects the shoes element and passes it on to the pipelines found deeper in the instructions hierarchy.
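The scoping described for Example 3 can be illustrated with a short Python sketch. It is an approximation under stated assumptions, not the Custom Parser implementation: it uses the standard-library xml.etree.ElementTree module, and the variable names are illustrative. The outer selection plays the role of shoes._fns, and the relative paths (starting with ./ or .//) are evaluated against its output rather than the whole document.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the Sample HTML with two products, so the scoping is visible.
SAMPLE_HTML = """
<div id="products">
  <div class="product" id="shoes">
    <div class="title">Shoes</div>
    <div class="price">223.12</div>
    <div class="description">
      <ul><li class="description-item">Super</li></ul>
    </div>
  </div>
  <div class="product" id="pants">
    <div class="title">Pants</div>
    <div class="price">60.12</div>
    <div class="description">
      <ul>
        <li class="description-item">Amazing</li>
        <li class="description-item">Quality</li>
      </ul>
    </div>
  </div>
</div>
"""

root = ET.fromstring(SAMPLE_HTML)

# Step 1: the outer pipeline selects the shoes element once.
shoes_element = root.find(".//div[@id='shoes']")

# Steps 2 and 3: nested pipelines run relative to that element only,
# so nothing from the pants product can leak into the result.
shoes = {
    "title": shoes_element.find("./div[@class='title']").text,
    "price": shoes_element.find("./div[@class='price']").text,
    "description": [
        item.text
        for item in shoes_element.findall(".//li[@class='description-item']")
    ],
}
print({"shoes": shoes})
```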

List of nested objects

Use case: Previously, you wanted to parse only shoes information. Now you want to parse the information of all products in the HTML.

The Sample HTML is used again as the document-under-parse.

Suppose you want your parsed result to look like this:

{
    "products": [
        {
            "title": "Shoes",
            "price": "223.12",
            "description": [
                "Super"
            ]
        },
        {
            "title": "Pants",
            "price": "60.12",
            "description": [
                "Amazing",
                "Quality"
            ]
        },
        {
            "title": "Socks",
            "price": "123.12",
            "description": [
                "Very",
                "Nice",
                "Socks"
            ]
        }
    ]
}

The parsing instructions would look as follows:

Example 4. Parsing all products found in the HTML document.

{
    "products": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": ["//div[@class='product']"]
            }
        ],
        "_items": {
            "title": {
                "_fns": [
                    {
                        "_fn": "xpath_one",
                        "_args": ["./div[@class='title']/text()"]
                    }
                ]
            },
            "price": {
                "_fns": [
                    {
                        "_fn": "xpath_one",
                        "_args": ["./div[@class='price']/text()"]
                    }
                ]
            },
            "description": {
                "_fns": [
                    {
                        "_fn": "xpath",
                        "_args": [".//li[@class='description-item']/text()"]
                    }
                ]
            }
        }
    }
}

The structure of the parsing instructions looks similar to the one in Example 3. However, there are two major differences:

  1. xpath is used instead of xpath_one in the products._fns pipeline. The products._fns pipeline now outputs a list of all elements that match the provided XPath expression (a list of product elements).

  2. The _items reserved property indicates that you want to form a list by iterating through the products._fns pipeline output and processing every list item separately in the nested scopes.

If the _items reserved property were not used in the Example 4 parsing instructions, the parsed result would look as follows:

{
    "products": {
        "title": [
            "Shoes",
            "Pants",
            "Socks"
        ],
        "price": [
            "223.12",
            "60.12",
            "123.12"
        ],
        "description": [
            [
                "Super"
            ],
            [
                "Amazing",
                "Quality"
            ],
            [
                "Very",
                "Nice",
                "Socks"
            ]
        ]
    }
}

_items is used to specify that the Custom Parser must pass separate list items instead of the whole list down the parsing instructions.
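The difference between the two result shapes can be reproduced locally with a small Python sketch. It only mimics the output shapes shown above and makes no claim about how Custom Parser works internally. It assumes the standard-library xml.etree.ElementTree module, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Compact copy of the Sample HTML.
SAMPLE_HTML = """
<div id="products">
  <div class="product" id="shoes">
    <div class="title">Shoes</div><div class="price">223.12</div>
    <div class="description"><ul>
      <li class="description-item">Super</li>
    </ul></div>
  </div>
  <div class="product" id="pants">
    <div class="title">Pants</div><div class="price">60.12</div>
    <div class="description"><ul>
      <li class="description-item">Amazing</li>
      <li class="description-item">Quality</li>
    </ul></div>
  </div>
  <div class="product" id="socks">
    <div class="title">Socks</div><div class="price">123.12</div>
    <div class="description"><ul>
      <li class="description-item">Very</li>
      <li class="description-item">Nice</li>
      <li class="description-item">Socks</li>
    </ul></div>
  </div>
</div>
"""


def first_text(element, path):
    """Hypothetical stand-in for xpath_one on a text node."""
    return element.find(path).text


def all_texts(element, path):
    """Hypothetical stand-in for xpath on text nodes."""
    return [match.text for match in element.findall(path)]


FIELDS = {
    "title": lambda el: first_text(el, "./div[@class='title']"),
    "price": lambda el: first_text(el, "./div[@class='price']"),
    "description": lambda el: all_texts(el, ".//li[@class='description-item']"),
}

root = ET.fromstring(SAMPLE_HTML)
products = root.findall(".//div[@class='product']")

# With _items: each product element is processed separately, giving a list of objects.
with_items = [{name: fn(el) for name, fn in FIELDS.items()} for el in products]

# Without _items: every field receives the whole list, giving an object of lists.
without_items = {name: [fn(el) for el in products] for name, fn in FIELDS.items()}

print(with_items[1])
print(without_items["title"])
```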

Select the N-th element from a list

This section demonstrates the flexibility of pipelines: the same problem can be solved in several ways. Here, the problem is selecting the N-th element from a list of any values.

Use case: you want to select the second product price from the page.

The Sample HTML is again used as the document-under-parse. There are several ways to select the 2nd price.

Option 1

You can use the XPath [] predicate and make the selection inside the XPath expression itself. XPath positions are 1-based, so [2] selects the second match.

Example 5. Select 2nd price using XPath [] selector.

{
    "second_price": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": [
                    "(//div[@class='price'])[2]/text()"
                ]
            }
        ]
    }
}

Result:

{
    "second_price": [
        "60.12"
    ]
}

Option 2

You can also use the xpath function to find all prices and pipe them to the select_nth function, which selects the n-th element from the extracted list of prices. As the example shows, the index is zero-based, so 1 selects the second element.

Example 6. Select the 2nd value using the select_nth function.

{
    "second_price": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": [
                    "//div[@class='price']/text()"
                ]
            },
            {
                "_fn": "select_nth",
                "_args": 1
            }
        ]
    }
}

Result:

{
    "second_price": "60.12"
}

Notice how the select_nth function returns an item from a list while the xpath function returns a list of items, even if a single item is found.

Option 3

You can use select_nth with any list type, including lists of HTML elements:

Example 7. Select all HTML elements with class="product", select the 2nd product element from the list, then extract the price text from the selected element.

{
    "second_price": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": ["//div[@class='product']"]
            },
            {
                "_fn": "select_nth",
                "_args": 1
            },
            {
                "_fn": "xpath",
                "_args": ["./div[@class='price']/text()"]
            }
        ]
    }
}

Result:

{
    "second_price": ["60.12"]
}
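Options 2 and 3 can be mirrored in a few lines of Python to see why one returns a plain string and the other a list. This is a sketch under assumptions: it uses the standard-library xml.etree.ElementTree module and ordinary list indexing in place of select_nth, and it does not cover Option 1, because ElementTree does not support the parenthesized positional predicate used there.

```python
import xml.etree.ElementTree as ET

# Prices from the Sample HTML, one per product.
SAMPLE_HTML = """
<div id="products">
  <div class="product" id="shoes"><div class="price">223.12</div></div>
  <div class="product" id="pants"><div class="price">60.12</div></div>
  <div class="product" id="socks"><div class="price">123.12</div></div>
</div>
"""

root = ET.fromstring(SAMPLE_HTML)

# Option 2: collect all price texts, then pick index 1 (the second item).
all_prices = [el.text for el in root.findall(".//div[@class='price']")]
option_2 = all_prices[1]

# Option 3: pick the second product element first, then run a list-returning
# query on it. The final step is still a list, as in Example 7.
second_product = root.findall(".//div[@class='product']")[1]
option_3 = [el.text for el in second_product.findall("./div[@class='price']")]

print(option_2)
print(option_3)
```

The last function in the pipeline decides the shape of the result: indexing yields a single value, while a list-returning query yields a list even when it has one match.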

Error handling

When given the following HTML snippet:

<div class="product" id="shoes">
    <div class="title">Shoes</div>
    <div class="price">223.12</div>
    <div class="description">Super</div>
</div>

And trying to parse it with the following parsing instructions:

{
    "product": {
        "_fns": [
            {
                "_fn": "xpath_one",
                "_args": ["//div[@id='shoes']"]
            }
        ],
        "price": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["//div[@class='price']/text()"]
                }
            ]
        },
        "title": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["//div[@class='title']/text()"]
                }
            ]
        },
        "description": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["//div[@class='description']/text()"]
                },
                {
                    "_fn": "convert_to_float"
                }
            ]
        }
    }
}

Custom Parser will return a parsed result in which price and title are parsed normally, but description is null because the convert_to_float function could not convert the string "Super" to a float:

{
    "product": {
        "price": "223.12",
        "title": "Shoes",
        "description": null
    },
    "_warnings": [
        {
            "_fn": "convert_to_float",
            "_fn_idx": 1,
            "_msg": "Failed to process function.",
            "_path": ".product.description"
        }
    ]
}

By default, all errors are reported as warnings and placed in the _warnings list. If you would like to ignore errors when parsing a field, you can suppress them with the "_on_error": "suppress" parameter:

{
    "product": {
        ...,
        "description": {
            "_on_error": "suppress",
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["//div[@class='description']/text()"]
                },
                {
                    "_fn": "convert_to_float"
                }
            ]
        }
    }
}

This produces the following result:

{
    "product": {
        "price": "223.12",
        "title": "Shoes",
        "description": null
    }
}
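The warning and suppression behavior shown above can be modeled with a minimal Python sketch. Everything here is an assumption made for illustration: the run_pipeline function, its parameters, and the way stages are represented are hypothetical and are not taken from Custom Parser. Only the shape of the warning entry follows the output shown in this section.

```python
def run_pipeline(value, stages, path, warnings, on_error=None):
    """Hypothetical runner: apply each (name, function) stage in order.

    On failure the field becomes None. A warning is recorded unless
    on_error is "suppress".
    """
    for index, (name, function) in enumerate(stages):
        try:
            value = function(value)
        except Exception:
            if on_error != "suppress":
                warnings.append(
                    {
                        "_fn": name,
                        "_fn_idx": index,
                        "_msg": "Failed to process function.",
                        "_path": path,
                    }
                )
            return None
    return value


# Stand-ins for the description pipeline: the first stage yields the text
# "Super", the second tries to convert it to a float and fails.
description_stages = [
    ("xpath_one", lambda _document: "Super"),
    ("convert_to_float", float),
]

default_warnings = []
default_result = run_pipeline(
    None, description_stages, ".product.description", default_warnings
)

suppressed_warnings = []
suppressed_result = run_pipeline(
    None,
    description_stages,
    ".product.description",
    suppressed_warnings,
    on_error="suppress",
)

print(default_result, default_warnings)
print(suppressed_result, suppressed_warnings)
```

In both runs the field value is None. The only difference is whether the failure is reported.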

Array of arrays

Custom Parser allows N-dimensional arrays in parsed results. As an example, let’s use the following HTML snippet:

<div class="row">
    <div class="column">1</div>
    <div class="column">2</div>
    <div class="column">3</div>
</div>
<div class="row">
    <div class="column">4</div>
    <div class="column">5</div>
    <div class="column">6</div>
</div>
<div class="row">
    <div class="column">7</div>
    <div class="column">8</div>
    <div class="column">9</div>
</div>

Let's say you want to parse the document so that the result is a 3x3 two-dimensional array of integers:

{
    "table": [
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]
    ]
}

To parse the HTML into the JSON above, you can use the following parsing instructions:

{
    "table": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": ["//div[@class='row']"]
            },
            {
                "_fn": "xpath",
                "_args": [".//div[@class='column']/text()"]
            },
            {
                "_fn": "convert_to_int"
            }
        ]
    }
}
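The expected 3x3 result can be checked locally with a Python sketch that follows the same three steps: select the rows, select the column texts within each row, and convert each text to an integer. It assumes the standard-library xml.etree.ElementTree module and is not the Custom Parser implementation. The wrapping <root> element is added only because an XML parser needs a single root.

```python
import xml.etree.ElementTree as ET

# The snippet above, wrapped in a single root element for the XML parser.
TABLE_HTML = """
<root>
  <div class="row">
    <div class="column">1</div>
    <div class="column">2</div>
    <div class="column">3</div>
  </div>
  <div class="row">
    <div class="column">4</div>
    <div class="column">5</div>
    <div class="column">6</div>
  </div>
  <div class="row">
    <div class="column">7</div>
    <div class="column">8</div>
    <div class="column">9</div>
  </div>
</root>
"""

root = ET.fromstring(TABLE_HTML)

# Step 1: a list of row elements.
rows = root.findall(".//div[@class='row']")

# Step 2: for every row, a list of column texts (a list of lists).
texts = [[col.text for col in row.findall(".//div[@class='column']")] for row in rows]

# Step 3: convert every text to an integer, keeping the nesting.
table = [[int(text) for text in row_texts] for row_texts in texts]

print({"table": table})
```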

Last updated 4 months ago
