# Tips for writing XPath expressions

## HTML structure may differ between scraped and browser-loaded document <a href="#html-structure-may-differ-between-scraped-and-browser-loaded-document" id="html-structure-may-differ-between-scraped-and-browser-loaded-document"></a>

When writing HTML element selection functions, **make sure to work with scraped documents instead of the live version of the website loaded in your browser**, as the two can differ. The main reason is JavaScript rendering. When a website is opened, the browser loads additional resources, such as CSS stylesheets and JavaScript scripts, which can change the structure of the initial HTML document. When parsing scraped HTML, Custom Parser does not load the document the way a browser does (parsers ignore JavaScript instructions), so the HTML tree the parser sees can differ from the one the browser renders.

As an example, take a look at the following HTML document:

```html
<!doctype html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Document</title>
</head>
<body>
    <div>
        <h3>This is a product</h3>
        <div id="price-container">
            <p>This is the price:</p>
        </div>
        <p>And here is some description</p>
    </div>
    <script>
        const priceContainer = document.querySelector("#price-container");
        const priceElement = document.createElement("p");
        priceElement.textContent = "123";
        priceElement.id = "price"
        priceContainer.insertAdjacentElement("beforeend", priceElement);
    </script>
</body>
</html>
```

If you open the document in a browser, it shows the price, which you can select using the XPath expression `//p[@id="price"]`:

<figure><img src="https://63892162-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FzrXw45naRpCZ0Ku9AjY1%2Fuploads%2Fa1Q7Xy0RGkCtyypmJiaz%2Fimage.png?alt=media&#x26;token=b8a517cb-77a5-49c3-9611-d61923657e83" alt=""><figcaption></figcaption></figure>

Now if you disable JavaScript rendering in the browser, the website will render as follows:

<figure><img src="https://63892162-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FzrXw45naRpCZ0Ku9AjY1%2Fuploads%2FAZ0eGq3gY8dklt1AayeD%2Fimage.png?alt=media&#x26;token=0d2254da-65f9-4fde-8f0b-68ab5caddc4b" alt=""><figcaption></figcaption></figure>

The same `//p[@id="price"]` XPath expression no longer matches anything, because the price element is never created without JavaScript.
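The same effect can be reproduced without a browser. The sketch below is an illustration only, not how Custom Parser works internally: it feeds a trimmed copy of the example document to Python's standard `html.parser` module, which, like any parser that does not execute JavaScript, sees only the static markup. `IdCollector` and `collect_ids` are hypothetical helper names.

```python
from html.parser import HTMLParser

# Trimmed copy of the example document above.
RAW_HTML = """
<body>
    <div>
        <h3>This is a product</h3>
        <div id="price-container">
            <p>This is the price:</p>
        </div>
        <p>And here is some description</p>
    </div>
    <script>
        const priceContainer = document.querySelector("#price-container");
        const priceElement = document.createElement("p");
        priceElement.textContent = "123";
        priceElement.id = "price"
        priceContainer.insertAdjacentElement("beforeend", priceElement);
    </script>
</body>
"""


class IdCollector(HTMLParser):
    """Collects the id attribute of every element in the static markup."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id":
                self.ids.append(value)


def collect_ids(html):
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    return collector.ids


ids = collect_ids(RAW_HTML)
print(ids)  # ['price-container']
```

The `price` id appears only inside the script text, so it never becomes an element unless something executes that script.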

## Make sure to write all possible HTML selectors for the target element <a href="#make-sure-to-write-all-possible-html-selectors-for-the-target-element" id="make-sure-to-write-all-possible-html-selectors-for-the-target-element"></a>

For various reasons, the same page scraped twice may have different layouts (different User Agents used when scraping, target website doing A/B testing, etc.).

To tackle this problem, we suggest defining `parsing_instructions` for the initially scraped document and testing these instructions right away with multiple other scraped job results of the same page type.

HTML selector functions (`xpath`/`xpath_one`) support [**selector fallbacks**](https://developers.oxylabs.io/scraping-solutions/web-scraper-api/features/custom-parser/list-of-functions/function-examples#xpath).
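The idea behind a fallback can be sketched in a few lines of Python: try the expressions in order and keep the result of the first one that matches. This is a conceptual illustration, not Custom Parser's implementation; `first_match` and both layouts are hypothetical. It uses the standard `xml.etree.ElementTree` module, which supports only a subset of XPath (for example, no `text()`), so the expressions select elements rather than text nodes. See the linked function reference for the exact fallback syntax in parsing instructions.

```python
import xml.etree.ElementTree as ET


def first_match(root, expressions):
    """Return the matches of the first expression that finds anything."""
    for expression in expressions:
        matches = root.findall(expression)
        if matches:
            return matches
    return []


# Two hypothetical layouts of the same page type.
layout_a = ET.fromstring('<div><p class="price">99.99</p></div>')
layout_b = ET.fromstring('<div><span id="price">99.99</span></div>')

# Ordered from the most common layout to the least common one.
price_selectors = [".//p[@class='price']", ".//span[@id='price']"]

results = [
    [element.text for element in first_match(layout, price_selectors)]
    for layout in (layout_a, layout_b)
]
print(results)  # [['99.99'], ['99.99']]
```

Listing the most common layout first keeps the fallback predictable: later expressions are consulted only when the earlier ones find nothing.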

## Suggested HTML selector writing flow <a href="#suggested-html-selector-writing-flow" id="suggested-html-selector-writing-flow"></a>

1. Scrape the HTML document of the target page using Scraper API.
2. Disable JavaScript and open the scraped HTML locally in your browser. If JavaScript is disabled **after** the HTML is opened, make sure to reload the page so that it is rendered without JavaScript.
3. [**Use browser dev tools**](https://www.computerhope.com/issues/ch002153.htm) to inspect the elements you want to select.

<figure><img src="https://63892162-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FzrXw45naRpCZ0Ku9AjY1%2Fuploads%2FoG0t7V4hrgrtY5EgtqvF%2Fimage.png?alt=media&#x26;token=c53e0220-0cef-420f-974b-7123c49c1187" alt=""><figcaption></figcaption></figure>
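For step 2, the scraped HTML first has to exist as a local file. A minimal sketch, assuming the scraped document is already available as a Python string; `save_scraped_html` and the file name are hypothetical, and fetching the document from Scraper API is out of scope here.

```python
from pathlib import Path


def save_scraped_html(html: str, path: str) -> str:
    """Write scraped HTML to disk and return a file:// URI for the browser."""
    target = Path(path).resolve()
    target.write_text(html, encoding="utf-8")
    return target.as_uri()


if __name__ == "__main__":
    # Placeholder content; in practice this is the scraped document.
    scraped_html = "<!doctype html><html><body><h1>Example</h1></body></html>"
    print(save_scraped_html(scraped_html, "scraped_page.html"))
```

Open the printed URI in a browser with JavaScript disabled to see the document as a parser would.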

### How to write parsing instructions <a href="#how-to-write-parsing-instructions-inlineextension" id="how-to-write-parsing-instructions-inlineextension"></a>

Let's say you have the following page to parse:

```html
<!doctype html>
<html lang="en">
<head></head>
<body>
<style>
.variant {
  display: flex;
  flex-wrap: nowrap;
}
.variant p {
  white-space: nowrap;
  margin-right: 20px;
}
</style>
<div>
    <h1 id="title">This is a cool product</h1>
    <div id="description-container">
        <h2>This is a product description</h2>
        <ul>
            <li class="description-item">Durable</li>
            <li class="description-item">Nice</li>
            <li class="description-item">Sweet</li>
            <li class="description-item">Spicy</li>
        </ul>
    </div>
    <div id="price-container">
        <h2>Variants</h2>
        <div id="variants">
            <div class="variant">
                <p class="color">Red</p>
                <p class="price">99.99</p>
            </div>
            <div class="variant">
                <p class="color">Green</p>
                <p class="price">87.99</p>
            </div>
            <div class="variant">
                <p class="color">Blue</p>
                <p class="price">65.99</p>
            </div>
            <div class="variant">
                <p class="color">Black</p>
                <p class="price">99.99</p>
            </div>
        </div>
    </div>
</div>
</body>
</html>
```

<figure><img src="https://63892162-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FzrXw45naRpCZ0Ku9AjY1%2Fuploads%2FHJ39wJ8MBPxXnS8pjXuF%2Fimage.png?alt=media&#x26;token=ac49bb0c-655e-4bc5-abb2-7b82d7b56a6a" alt=""><figcaption></figcaption></figure>

### Parse product title

Create a new JSON object and assign a new field to it.

You can name the field any way you like, with some exceptions: a user-defined field name cannot start with an underscore `_` (e.g., `"_title"`).

The field name will be displayed in the parsed result.

The new field must hold a value of JSON object type:

```json
{
    "title": {}  // defining a title field to be parsed
} 
```

If you provide these instructions to Custom Parser, it will either do nothing or report that you haven't provided any instructions.

To actually parse the title into the `title` field, you must define a data processing pipeline inside of the `title` object using the reserved `_fns` property (which is always of array type):

```json
{
    "title": {
        "_fns": []  // defining data processing pipeline for the title field
    }
}
```

For Custom Parser to select the text of the title, you can utilize the HTML selector function `xpath_one`. To use the function on the HTML document, it should be added to the data processing pipeline. The function is defined as a JSON object with required `_fn` (function name) and required `_args` (function arguments) fields. See the full list of function definitions [**here**](https://developers.oxylabs.io/scraping-solutions/web-scraper-api/features/custom-parser/writing-instructions-manually/list-of-functions).

```json
{
    "title": {
        "_fns": [
            {
                "_fn": "xpath_one",
                "_args": ["//h1/text()"]
            }
        ]
    }
}
```

The parsing instructions above should produce the following result:

```json
{
    "title": "This is a cool product"
}
```
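If you assemble instructions in code, building them as a native data structure and serializing them avoids hand-written JSON mistakes. Note that the `//` comments in the snippets on this page are explanatory and are not valid JSON. The sketch below is hypothetical: `invalid_field_names` is an illustrative helper that checks the underscore naming rule, and `RESERVED` lists only the reserved properties named on this page, so the real set may be larger.

```python
import json

# Reserved properties mentioned on this page; the real list may be longer.
RESERVED = {"_fns", "_items"}


def invalid_field_names(instructions, path=""):
    """Return paths of user-defined fields whose names start with an underscore."""
    problems = []
    for name, value in instructions.items():
        if name == "_fns":
            continue  # pipeline definition, not a user-defined field
        current = f"{path}.{name}" if path else name
        if name.startswith("_") and name not in RESERVED:
            problems.append(current)
        if isinstance(value, dict):
            problems.extend(invalid_field_names(value, current))
    return problems


parsing_instructions = {
    "title": {
        "_fns": [
            {"_fn": "xpath_one", "_args": ["//h1/text()"]},
        ]
    }
}

problems = invalid_field_names(parsing_instructions)
serialized = json.dumps(parsing_instructions, indent=4)
print(problems)  # []
print(serialized)
```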

### Parse description

Similarly, in parsing instructions, you can define another field where the product description container, description title, and items will be parsed. For the title and the items of the description to be nested under the `description` object, the structure of instructions should be as follows:

```json
{
    "title": {...},
    "description": { // description container
        "title": {}, // description title
        "items": {} // description items
    } 
}
```

The given structure of parsing instructions implies that `description.title` and `description.items` will be parsed relative to the `description` element. You can define a pipeline for the `description` field itself. Here it is defined first, because selecting the container up front simplifies the XPath expression for the description title.

```json
{
    "title": {...},
    "description": {
        "_fns": [
            {
                "_fn": "xpath_one",
                "_args": ["//div[@id='description-container']"]
            }
        ],  // Pipeline result will be used when parsing `title` and `items`.
        "title": {},
        "items": {}
    }
}
```

In the example, the `description._fns` pipeline will select the `description-container` HTML element, which will be used as a reference point for parsing the description title and items.

To parse the remaining description fields, add separate pipelines for the `description.title` and `description.items` fields:

```json
{
    "title": {...},
    "description": {
        "_fns": [
            {
                "_fn": "xpath_one",
                "_args": [
                    "//div[@id='description-container']"
                ]
            }
        ],
        "title": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": [
                        "//h2/text()"
                    ]
                }
            ]
        },
        "items": {
            "_fns": [
                {
                    "_fn": "xpath",
                    "_args": [
                        "//li/text()"
                    ]
                }
            ]
        }
    }
}
```

Notice how the `xpath` function is used instead of `xpath_one` to extract all items that match the XPath expression.

The parsing instructions produce the following result:

```json
{
    "title": {...},
    "description": {
        "title": "This is a product description",
        "items": [
            "Durable",
            "Nice",
            "Sweet",
            "Spicy"
        ]
    }
}
```
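The difference between selecting one match and selecting all matches can be reproduced locally. The sketch below is an analogy rather than Custom Parser itself: it uses Python's standard `xml.etree.ElementTree`, where `find` returns the first match (comparable to `xpath_one`) and `findall` returns every match (comparable to `xpath`). ElementTree does not support `text()`, so the text is read from the `.text` attribute instead.

```python
import xml.etree.ElementTree as ET

page = ET.fromstring("""
<div>
    <h1 id="title">This is a cool product</h1>
    <div id="description-container">
        <h2>This is a product description</h2>
        <ul>
            <li class="description-item">Durable</li>
            <li class="description-item">Nice</li>
            <li class="description-item">Sweet</li>
            <li class="description-item">Spicy</li>
        </ul>
    </div>
</div>
""")

# Counterpart of the `description._fns` pipeline: select the container once.
container = page.find(".//div[@id='description-container']")

description = {
    # `find` returns only the first match, like `xpath_one`.
    "title": container.find(".//h2").text,
    # `findall` returns every match, like `xpath`.
    "items": [item.text for item in container.findall(".//li")],
}
print(description)
```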

### Parse product variants

The following example shows the structure of the result if you want to parse information into the `product_variants` field, which will contain a list of variant objects. In this case, each variant object has `price` and `color` fields.

```json
{
    "title": {...},
    "description": {...},
    "product_variants": [
        {
            "price": ...,
            "color": ...
        },
        {
            ...
        },
        ...
    ]
}
```

Start with selecting all product variant elements:

```json
{
    "title": {...},
    "description": {...},
    "product_variants": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": ["//div[@class='variant']"]
            }
        ]
    }
}
```

To make `product_variants` a list of JSON objects, iterate through the found variants using the `_items` iterator:

```json
{
    "title": {...},
    "description": {...},
    "product_variants": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": ["//div[@class='variant']"]
            }
        ],
        "_items": { // with this, you are instructing to process found elements one by one
            // field instructions to be described here
        } 
    }
}
```

Lastly, define instructions on how to parse the `color` and `price` fields:

```json
{
    "title": {...},
    "description": {...},
    "product_variants": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": [
                    "//div[@class='variant']"
                ]
            }
        ],
        "_items": {
            "color": {
                "_fns": [
                    {
                        "_fn": "xpath_one",
                        "_args": [
                            // As we are using relative XPath expressions,
                            // make sure XPath starts with a dot (.)
                            ".//p[@class='color']/text()"
                        ]
                    }
                ]
            },
            "price": {
                "_fns": [
                    {
                        "_fn": "xpath_one",
                        "_args": [
                            ".//p[@class='price']/text()"
                        ]
                    }
                ]
            }
        }
    }
}
```
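The `_items` iteration with relative expressions behaves much like a loop over the selected elements. The following sketch is an analogy built on Python's standard `xml.etree.ElementTree`, not Custom Parser's implementation, and it uses a shortened copy of the variants markup. Each expression inside the loop starts with a dot, so it is evaluated against the current variant rather than the whole document.

```python
import xml.etree.ElementTree as ET

variants_root = ET.fromstring("""
<div id="variants">
    <div class="variant">
        <p class="color">Red</p>
        <p class="price">99.99</p>
    </div>
    <div class="variant">
        <p class="color">Green</p>
        <p class="price">87.99</p>
    </div>
</div>
""")

product_variants = []
# Counterpart of `product_variants._fns`: select every variant element.
for variant in variants_root.findall(".//div[@class='variant']"):
    # Counterpart of `_items`: parse the fields of one variant at a time.
    product_variants.append({
        "color": variant.find(".//p[@class='color']").text,
        "price": variant.find(".//p[@class='price']").text,
    })
print(product_variants)
```

Without the leading dot, an expression such as `//p[@class='color']` is absolute in standard XPath and is evaluated from the document root instead of the current variant.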

With `product_variants` described, the final instructions will look as follows:

```json
{
    "title": {
        "_fns": [
            {
                "_fn": "xpath_one",
                "_args": [
                    "//h1/text()"
                ]
            }
        ]
    },
    "description": {
        "_fns": [
            {
                "_fn": "xpath_one",
                "_args": [
                    "//div[@id='description-container']"
                ]
            }
        ],
        "title": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": [
                        "//h2/text()"
                    ]
                }
            ]
        },
        "items": {
            "_fns": [
                {
                    "_fn": "xpath",
                    "_args": [
                        "//li/text()"
                    ]
                }
            ]
        }
    },
    "product_variants": {
        "_fns": [
            {
                "_fn": "xpath",
                "_args": [
                    "//div[@class='variant']"
                ]
            }
        ],
        "_items": {
            "color": {
                "_fns": [
                    {
                        "_fn": "xpath_one",
                        "_args": [
                            ".//p[@class='color']/text()"
                        ]
                    }
                ]
            },
            "price": {
                "_fns": [
                    {
                        "_fn": "xpath_one",
                        "_args": [
                            ".//p[@class='price']/text()"
                        ]
                    }
                ]
            }
        }
    }
}
```

These instructions produce the following output:

```json
{
    "title": "This is a cool product",
    "description": {
        "title": "This is a product description",
        "items": [
            "Durable",
            "Nice",
            "Sweet",
            "Spicy"
        ]
    },
    "product_variants": [
        {
            "color": "Red",
            "price": "99.99"
        },
        {
            "color": "Green",
            "price": "87.99"
        },
        {
            "color": "Blue",
            "price": "65.99"
        },
        {
            "color": "Black",
            "price": "99.99"
        }
    ]
}
```

You can find more examples of parsing instructions here: [**Parsing instruction examples**](https://developers.oxylabs.io/scraping-solutions/web-scraper-api/features/custom-parser/writing-instructions-manually/parsing-instruction-examples).
