Getting started

How to use Custom Parser

Intro

To use Custom Parser, please provide a set of parsing_instructions when creating a job.

You can use OxyCopilot in the Web Scraper API Playground on our dashboard to generate parsing instructions automatically. Define schemas, validate data extraction, and export AI-generated parsing instructions in JSON format.

Let's say you want to parse the total number of results Bing Search yields for the search term test:

Example job parameters would look as follows:

{
    "source": "bing_search",
    "query": "test",
    "parse": true,
    "parsing_instructions": {
        "number_of_results": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": [".//span[@class='sb_count']/text()"]
                }
            ]
        }
    }
}

Step 1. You must provide the "parse": true parameter.

Step 2. Parsing instructions should be described in the "parsing_instructions" field.
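
Below is a minimal sketch of submitting such a job from Python, assuming the Realtime integration method (see Integration Methods); the USERNAME and PASSWORD values are placeholders for your own API credentials:

import requests

# The example job parameters from above, expressed as a Python dict.
payload = {
    "source": "bing_search",
    "query": "test",
    "parse": True,
    "parsing_instructions": {
        "number_of_results": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": [".//span[@class='sb_count']/text()"]
                }
            ]
        }
    }
}

# Assumption: the Realtime endpoint; authentication uses your API user.
response = requests.post(
    "https://realtime.oxylabs.io/v1/queries",
    auth=("USERNAME", "PASSWORD"),
    json=payload,
    timeout=180,
)
print(response.json())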

The sample parsing instructions above specify that the aim is to parse the number of search results from the scraped document and put the result in the number_of_results field. The instructions on how to parse the field are given by defining a “pipeline”:

"_fns": [
    {
        "_fn": "xpath_one",
        "_args": [".//span[@class='sb_count']/text()"]
    }
]

The pipeline describes a list of data processing functions to be executed. The functions are executed in the order they appear in the list, each taking the output of the previous function as its input.

In the sample pipeline above, the xpath_one function (see the full list of available functions) is used. It lets you process an HTML document using XPath expressions and XSLT functions. As the function argument, specify the exact path where the target element can be found: .//span[@class='sb_count']. The trailing text() instructs the parser to select the text found in the target element.

The parsed result of the sample job above should look like this:

{
    "results": [
        {
            "content": {
                "number_of_results": "About 35.700.000.000 results",
                "parse_status_code": 12000
            },
            "created_at": "2023-03-24 08:27:16",
            "internal": [],
            "job_id": "7044947765926856705",
            "page": 1,
            "parser_type": "",
            "status_code": 200,
            "updated_at": "2023-03-24 08:27:21",
            "url": "https://www.bing.com/search?form=QBLH&q=test"
        }
    ]
}

Custom Parser not only extracts text from scraped HTML; it can also execute basic data processing functions.

For example, the parsing instructions described previously extract number_of_results as text, including extra words you may not need. If you want the number of results for the given query=test as a numeric data type, you can reuse the same parsing instructions and append the amount_from_string function to the existing pipeline:

{
    "source": "bing_search",
    "query": "test",
    "parse": true,
    "parsing_instructions": {
        "number_of_results": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": [".//span[@class='sb_count']/text()"]
                },
                {
                    "_fn": "amount_from_string"
                }
            ]
        }
    }
}

The parsed result of the sample job above should look like this:

{
    "results": [
        {
            "content": {
                "number_of_results": 2190000000,
                "parse_status_code": 12000
            },
            "created_at": "2023-03-24 08:52:21",
            "internal": [],
            "job_id": "7044954079138679809",
            "page": 1,
            "parser_type": "",
            "status_code": 200,
            "updated_at": "2023-03-24 08:52:25",
            "url": "https://www.bing.com/search?form=QBLH&q=test"
        }
    ]
}

As the result shows, the document was parsed accurately, and number_of_results is now returned as a number rather than a string.

Tips on writing XPath expressions

HTML structure may differ between scraped and browser-loaded document

When writing HTML element selection functions, make sure to work with scraped documents instead of the live website version loaded in your browser, as the two can differ. The main reason is JavaScript rendering: when a website is opened, your browser loads additional resources, such as CSS stylesheets and JavaScript scripts, which can change the structure of the initial HTML document. Custom Parser does not load HTML documents the way browsers do (parsers ignore JavaScript instructions), so the HTML tree the parser sees can differ from the one the browser renders.

As an example, take a look at the following HTML document:

<!doctype html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Document</title>
</head>
<body>
    <div>
        <h3>This is a product</h3>
        <div id="price-container">
            <p>This is the price:</p>
        </div>
        <p>And here is some description</p>
    </div>
    <script>
        const priceContainer = document.querySelector("#price-container");
        const priceElement = document.createElement("p");
        priceElement.textContent = "123";
        priceElement.id = "price"
        priceContainer.insertAdjacentElement("beforeend", priceElement);
    </script>
</body>
</html>

If you open the document in a browser, it shows the price, which you can select using the XPath expression //p[@id="price"].

If you disable JavaScript rendering in the browser, the price is no longer displayed: the same //p[@id="price"] XPath expression no longer matches anything, because the price element is never created.
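
To see this difference locally, here is a small illustrative sketch (it uses Python with lxml rather than Custom Parser itself; lxml, like the parser, does not execute JavaScript, and example.html is a placeholder name for the saved document above):

from lxml import html

# Parse the saved example document; the <script> element is not executed.
with open("example.html") as f:
    tree = html.fromstring(f.read())

# Matches in a JS-enabled browser, but returns nothing here, because the
# <p id="price"> element is only created by JavaScript.
print(tree.xpath('//p[@id="price"]/text()'))  # []

# Statically present elements still match.
print(tree.xpath('//div[@id="price-container"]/p/text()'))  # ['This is the price:']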

Make sure to write all possible HTML selectors for the target element

For various reasons, the same page scraped twice may have different layouts (different User Agents used when scraping, target website doing A/B testing, etc.).

To tackle this problem, we suggest defining parsing_instructions for the initially scraped document and testing these instructions right away with multiple other scraped job results of the same page type.

HTML selector functions (xpath/xpath_one) support selector fallbacks.
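
As a sketch, suppose one page layout exposes the product title as <h1 id="title"> while an alternative layout uses a plain <h1> (both selectors here are hypothetical). Listing several expressions in _args makes them act as fallbacks, tried in the order given:

{
    "title": {
        "_fns": [
            {
                "_fn": "xpath_one",
                "_args": [
                    "//h1[@id='title']/text()",
                    "//h1/text()"
                ]
            }
        ]
    }
}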

Suggested HTML selector writing flow

  1. Scrape the HTML document of the target page using Scraper API.

  2. Disable JavaScript and open the scraped HTML locally in your browser. If you disable JavaScript after the HTML is opened, make sure to reload the page so that it renders without JavaScript.

  3. Use your browser's dev tools to inspect the document and test candidate XPath expressions (for example, with the console's $x() helper).

How to write parsing instructions

Let's say you have the following page to parse:

<!doctype html>
<html lang="en">
<head></head>
<body>
<style>
.variant {
  display: flex;
  flex-wrap: nowrap;
}
.variant p {
  white-space: nowrap;
  margin-right: 20px;
}
</style>
<div>
    <h1 id="title">This is a cool product</h1>
    <div id="description-container">
        <h2>This is a product description</h2>
        <ul>
            <li class="description-item">Durable</li>
            <li class="description-item">Nice</li>
            <li class="description-item">Sweet</li>
            <li class="description-item">Spicy</li>
        </ul>
    </div>
    <div id="price-container">
        <h2>Variants</h2>
        <div id="variants">
            <div class="variant">
                <p class="color">Red</p>
                <p class="price">99.99</p>
            </div>
            <div class="variant">
                <p class="color">Green</p>
                <p class="price">87.99</p>
            </div>
            <div class="variant">
                <p class="color">Blue</p>
                <p class="price">65.99</p>
            </div>
            <div class="variant">
                <p class="color">Black</p>
                <p class="price">99.99</p>
            </div>
        </div>
    </div>
</div>
</body>
</html>

Parse product title

Create a new JSON object and assign a new field to it.

You can name the field any way you prefer, with one exception: a user-defined field name cannot start with an underscore _ (e.g., "_title" is not allowed).

The field name will be displayed in the parsed result.

The new field must hold a value of JSON object type:

{
    "title": {}  // defining a title field to be parsed
} 

If you provide these instructions to Custom Parser as they are, it will either do nothing or complain that you haven't provided any parsing instructions.

To actually parse the title into the title field, you must define a data processing pipeline inside the title object, using the reserved _fns property (which is always of array type):

{
    "title": {
        "_fns": []  // defining data processing pipeline for the title field
    }
}

For Custom Parser to select the text of the title, you can use the HTML selector function xpath_one. To apply the function to the HTML document, add it to the data processing pipeline. A function is defined as a JSON object with the required _fn (function name) and _args (function arguments) fields. See the full list of function definitions here.

{
    "title": {
        "_fns": [
            {
                "_fn": "xpath_one",
                "_args": ["//h1/text()"]
            }
        ]
    }
}

The parsing instructions above should produce the following result:

{
    "title": "This is a cool product"
}

Parse description

Similarly, you can define another field in the parsing instructions for the product description container, its title, and its items. For the title and items of the description to be nested under the description object, structure the instructions as follows:

{
    "title": {...},
    "description": { // description container
        "title": {}, // description title
        "items": {} // description items
    } 
}

The given structure of parsing instructions implies that description.title and description.items will be parsed relative to the description element. Define a pipeline for the description field first, as it simplifies the XPath expressions for the nested title and items fields.

{
    "title": {...},
    "description": {
        "_fns": [
            {
                "_fn": "xpath_one",
                "_args": ["//div[@id='description-container']"]
            }
        ],  // Pipeline result will be used when parsing `title` and `items`.
        "title": {},
        "items": {}
    }
}

In the example, the description._fns pipeline will select the description-container HTML element, which will be used as a reference point for parsing the description title and items.

To parse the remaining description fields, add two separate pipelines, one for description.title and one for description.items:

{
    "title": {...},
    "description": {
        "_fns": [
            {
                "_fn": "xpath_one",
                "_args": [
                    "//div[@id='description-container']"
                ]
            }
        ],
        "title": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": [
                        "//h2/text()"
                    ]
                }
            ]
        },
        "items": {
            "_fns": [
                {
                    "_fn": "xpath",
                    "_args": [
                        "//li/text()"
                    ]
                }
            ]
        }
    }
}

Notice how the xpath function is used instead of xpath_one to extract all items that match the XPath expression.

The parsing instructions produce the following result:

{
    "title": {...},
    "description": {
        "title": "This is description about the product",
        "items": [
            "Durable",
            "Nice",
            "Sweet",
            "Spicy"
        ]
    }
}

Parse product variants

The following example shows the structure of instructions for parsing information into the product_variants field, which will contain a list of variant objects. In this case, each variant object has price and color fields.

{
    "title": {...},
    "description": {...},
    "product_variants": [
        {
            "price": ...,
            "color": ...
        },
        {
            ...
        },
        ...
    ]
}

Start by selecting all product variant elements:

{
    "title": {...},
    "description": {...},
    "product_variants": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": ["//div[@class='variant']"]
            }
        ]
    }
}

To make product_variants a list of JSON objects, iterate through the found variants using the _items iterator:

{
    "title": {...},
    "description": {...},
    "product_variants": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": ["//div[@class='variant']"]
            }
        ],
        "_items": { // with this, you are instructing to process found elements one by one
            // field instructions to be described here
        } 
    }
}

Lastly, define instructions on how to parse the color and price fields:

{
    "title": {...},
    "description": {...},
    "product_variants": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": [
                    "//div[@class='variant']"
                ]
            }
        ],
        "_items": {
            "color": {
                "_fns": [
                    {
                        "_fn": "xpath_one",
                        "_args": [
                            // As we are using relative XPath expressions,
                            // make sure XPath starts with a dot (.)
                            ".//p[@class='color']/text()"
                        ]
                    }
                ]
            },
            "price": {
                "_fns": [
                    {
                        "_fn": "xpath_one",
                        "_args": [
                            ".//p[@class='price']/text()"
                        ]
                    }
                ]
            }
        }
    }
}

With product_variants described, the final instructions will look as follows:

{
    "title": {
        "_fns": [
            {
                "_fn": "xpath_one",
                "_args": [
                    "//h1/text()"
                ]
            }
        ]
    },
    "description": {
        "_fns": [
            {
                "_fn": "xpath_one",
                "_args": [
                    "//div[@id='description-container']"
                ]
            }
        ],
        "title": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": [
                        "//h2/text()"
                    ]
                }
            ]
        },
        "items": {
            "_fns": [
                {
                    "_fn": "xpath",
                    "_args": [
                        "//li/text()"
                    ]
                }
            ]
        }
    },
    "product_variants": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": [
                    "//div[@class='variant']"
                ]
            }
        ],
        "_items": {
            "color": {
                "_fns": [
                    {
                        "_fn": "xpath_one",
                        "_args": [
                            ".//p[@class='color']/text()"
                        ]
                    }
                ]
            },
            "price": {
                "_fns": [
                    {
                        "_fn": "xpath_one",
                        "_args": [
                            ".//p[@class='price']/text()"
                        ]
                    }
                ]
            }
        }
    }
}

These instructions will produce the following output:

{
    "title": "This is a cool product",
    "description": {
        "title": "This is a product description",
        "items": [
            "Durable",
            "Nice",
            "Sweet",
            "Spicy"
        ]
    },
    "product_variants": [
        {
            "color": "Red",
            "price": "99.99"
        },
        {
            "color": "Green",
            "price": "87.99"
        },
        {
            "color": "Blue",
            "price": "65.99"
        },
        {
            "color": "Black",
            "price": "99.99"
        }
    ]
}

You can find more examples of parsing instructions here: Parsing instruction examples.

List of functions to parse results with

Pipeline functions

What happens if parsing fails when using Custom Parser

If Custom Parser fails to process client-defined parsing instructions, we will return the 12005 status code (parsed with warnings). For example, the following job contains an instruction whose XPath expression will not match anything:

{
    "source": "bing_search",
    "query": "test",
    "parse": true,
    "parsing_instructions": {
        "number_of_results": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": [".//span[@class='sb_count']/text()"]
                },
                {
                    "_fn": "amount_from_string"
                }
            ]
        },
        "number_of_organics": {
            "_fns": [
                {
                    "_fn": "xpath",
                    "_args": ["//this-will-not-match-anything"]
                },
                {
                    "_fn": "length"
                }
            ]
        }
    }
}

The client will be charged for such results:

{
    "results": [
        {
            "content": {
                "_warnings": [
                    {
                        "_fn": "xpath",
                        "_fn_idx": 0,
                        "_msg": "XPath expressions did not match any data.",
                        "_path": ".number_of_organics"
                    }
                ],
                "number_of_organics": null,
                "number_of_results": 18000000000,
                "parse_status_code": 12005
            },
            "created_at": "2023-03-24 09:46:46",
            "internal": [],
            "job_id": "7044967772475916289",
            "page": 1,
            "parser_type": "",
            "status_code": 200,
            "updated_at": "2023-03-24 09:46:48",
            "url": "https://www.bing.com/search?form=QBLH&q=test"
        }
    ]
}

If Custom Parser encounters an exception and breaks during the parsing operation, it can return these status codes: 12002, 12006, 12007. You will not be charged for these unexpected errors.
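
As a rough sketch of acting on these codes, continuing from the response object in the earlier Realtime example (field names are taken from the sample responses above):

# Inspect the first result of the job.
result = response.json()["results"][0]
content = result["content"]

if content["parse_status_code"] == 12000:
    print("Parsed successfully")
elif content["parse_status_code"] == 12005:
    # Parsed with warnings: one or more instructions did not match.
    for warning in content.get("_warnings", []):
        print(f"{warning['_path']}: {warning['_msg']}")
else:
    # 12002, 12006, 12007 indicate parser-side errors (not charged).
    print("Parsing failed with status", content["parse_status_code"])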

Status codes

See our status codes outlined here.