Web Crawler

Intro

Web Crawler is a Scraper API feature that lets you crawl any site, select useful content, and have it delivered to you in bulk. You can use Web Crawler to perform URL discovery, crawl all pages on a site, index all URLs on a domain, and more.

The tutorial below guides you through the Web Crawler workflow, showing how to retrieve public data from an e-commerce website.

How to Crawl a Website: Step-by-step Guide

Web Crawler uses basic HTTP authentication that requires sending a username and a password.

Before using the Web Crawler, please read how its filters work and how you may be charged when using this functionality in the Filters and Usage statistics sections of this documentation.

Then, get familiar with the available input parameters and API endpoints.

Check out the Integrations section to find request templates and schemas.

If you would like guidance, drop us a line at support@oxylabs.io or contact your account manager.

Endpoints

Web Crawler has a number of endpoints you can use to control the service: initiate, stop, and resume your job, get job info, get the list of result chunks, and get the results.

Create a new job

Use this endpoint to initiate a new Web Crawler job.

  • Endpoint: https://ect.oxylabs.io/v1/jobs

  • Method: POST

  • Authentication: Basic

  • Request headers: Content-Type: application/json

Sample payload:

{
  "url": "https://amazon.com",
  "filters": {
    "crawl": [".*"],
    "process": [".*"],
    "max_depth": 1
  },
  "scrape_params": {
    "source": "universal",
    "user_agent_type": "desktop"
  },
  "output": {
    "type_": "sitemap"
  },
  "upload": {
    "storage_type": "s3",
    "storage_url": "bucket_name"
  }
}
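The request described above can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the function name `build_create_job_request` and the placeholder credentials are assumptions, while the endpoint, method, header, and payload are taken from this section. The request is only prepared here; the network call is left commented out.

```python
import base64
import json
import urllib.request

JOBS_ENDPOINT = "https://ect.oxylabs.io/v1/jobs"


def build_create_job_request(username: str, password: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) the POST request that creates a Web Crawler job.

    The function name is hypothetical; only the endpoint, method, and headers
    come from the documentation.
    """
    # Basic authentication: "username:password", base64-encoded.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        JOBS_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


# The sample payload from this section.
payload = {
    "url": "https://amazon.com",
    "filters": {"crawl": [".*"], "process": [".*"], "max_depth": 1},
    "scrape_params": {"source": "universal", "user_agent_type": "desktop"},
    "output": {"type_": "sitemap"},
    "upload": {"storage_type": "s3", "storage_url": "bucket_name"},
}

# "username" and "password" are placeholders for your own credentials.
request = build_create_job_request("username", "password", payload)

# Sending the request performs a real network call, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     job = json.load(response)
#     print(job["id"])
```

Building the request separately from sending it makes it easy to inspect the URL, headers, and body before any job is actually created.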

Sample response:

{
  "id": "10374369707989137859",
  "client": "username",
  "job_params": {
    "url": "https://amazon.com",
    "filters": {
      "crawl": [".*"],
      "process": [".*"],
      "max_depth": 1
    },
    "scrape_params": {
      "source": "universal",
      "geo_location": null,
      "user_agent_type": "desktop",
      "render": null
    },
    "output": {
      "type": "sitemap",
      "selector": null
    },
    "upload": {
      "storage_type": "s3",
      "storage_url": "bucket_name"
    }
  },
  "_links": [
    {
      "rel": "self",
      "href": "http://ect.oxylabs.io/v1/jobs/10374369707989137859",
      "method": "GET"
    },
    {
      "rel": "stop-indexing",
      "href": "http://ect.oxylabs.io/v1/jobs/10374369707989137859/stop-indexing",
      "method": "POST"
    }
  ],
  "events": [],
  "created_at": "2021-11-19 14:32:01",
  "updated_at": "2021-11-19 14:32:01"
}

Stop a job

Use this endpoint to stop a certain job.

  • Endpoint: https://ect.oxylabs.io/v1/jobs/{id}/stop-indexing

  • Method: POST

  • Authentication: Basic

Resume a job

Use this endpoint to resume a certain job.

  • Endpoint: https://ect.oxylabs.io/v1/jobs/{id}/resume-indexing

  • Method: POST

  • Authentication: Basic

Example response:

null
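Stopping and resuming a job differ only in the last path segment, so both can be prepared by one helper. The sketch below is an assumption-laden illustration: `build_job_action_request` is a hypothetical name, and the job ID shown is the one from the sample response earlier in this document.

```python
import base64
import urllib.request

JOBS_ENDPOINT = "https://ect.oxylabs.io/v1/jobs"
JOB_ACTIONS = ("stop-indexing", "resume-indexing")


def build_job_action_request(username: str, password: str, job_id: str, action: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST request that stops or resumes a job.

    Hypothetical helper; the two action names are the path segments
    documented for the stop and resume endpoints.
    """
    if action not in JOB_ACTIONS:
        raise ValueError(f"action must be one of {JOB_ACTIONS}, got {action!r}")
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        f"{JOBS_ENDPOINT}/{job_id}/{action}",
        headers={"Authorization": f"Basic {token}"},
        method="POST",
    )


# Placeholder credentials and the job ID from the sample response.
stop_request = build_job_action_request("username", "password", "10374369707989137859", "stop-indexing")
resume_request = build_job_action_request("username", "password", "10374369707989137859", "resume-indexing")
print(stop_request.full_url)
print(resume_request.full_url)
```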

Get job information

Use this endpoint to get the job information of an existing job.

  • Endpoint: https://ect.oxylabs.io/v1/jobs/{id}

  • Method: GET

  • Authentication: Basic

Sample response:

{
  "id": "10374369707989137859",
  "client": "username",
  "job_params": {
    "url": "https://amazon.com",
    "filters": {
      "crawl": [],
      "process": [],
      "max_depth": 1
    },
    "scrape_params": {
      "source": "universal",
      "geo_location": null,
      "user_agent_type": "desktop",
      "render": null
    },
    "output": {
      "type": "sitemap",
      "selector": null
    },
    "upload": {
      "storage_type": "s3",
      "storage_url": "bucket_name"
    }
  },
  "_links": [
    {
      "rel": "self",
      "href": "http://ect.oxylabs.io/v1/jobs/10374369707989137859",
      "method": "GET"
    }
  ],
  "events": [
    {
      "event": "job_indexing_finished",
      "status": "done",
      "reason": null,
      "created_at": "2021-11-19 14:32:16"
    },
    {
      "event": "job_results_aggregated",
      "status": "done",
      "reason": null,
      "created_at": "2021-11-19 14:32:17"
    }
  ],
  "created_at": "2021-11-19 14:32:01",
  "updated_at": "2021-11-19 14:32:01"
}
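When polling this endpoint, you will usually want to reduce the `events` array to a simple answer. The sketch below shows one way to do that. It assumes that a `job_results_aggregated` event with status `done` means the results are ready; this is inferred from the sample response above and is not stated in the documentation. The helper names are hypothetical.

```python
from typing import Any, Dict


def latest_event_statuses(job: Dict[str, Any]) -> Dict[str, str]:
    """Map each event name to its most recent status."""
    statuses: Dict[str, str] = {}
    # Timestamps use the "YYYY-MM-DD HH:MM:SS" format, so they sort correctly as strings.
    for event in sorted(job.get("events", []), key=lambda e: e["created_at"]):
        statuses[event["event"]] = event["status"]
    return statuses


def is_job_finished(job: Dict[str, Any]) -> bool:
    """Assumption: the job is finished once results aggregation is reported as done."""
    return latest_event_statuses(job).get("job_results_aggregated") == "done"


# A trimmed copy of the sample response above.
sample_job = {
    "id": "10374369707989137859",
    "events": [
        {"event": "job_indexing_finished", "status": "done", "reason": None,
         "created_at": "2021-11-19 14:32:16"},
        {"event": "job_results_aggregated", "status": "done", "reason": None,
         "created_at": "2021-11-19 14:32:17"},
    ],
}

print(latest_event_statuses(sample_job))
print(is_job_finished(sample_job))
```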

Get sitemap

Use this endpoint to get the list of URLs found while processing the job.

  • Endpoint: https://ect.oxylabs.io/v1/jobs/{id}/sitemap

  • Method: GET

  • Authentication: Basic

Sample response:

{
  "results": [
    {
      "sitemap": [
        "https://www.amazon.com/Apple-MU8F2AM-A-Pencil-Generation/dp/B07K1WWBJK/ref=lp_16225007011_1_1",
        "https://www.amazon.com/HP-Cartridge-Black-3YM57AN-Tri-Color/dp/B08412PTS8/ref=lp_16225007011_1_2",
        "https://www.amazon.com/Seagate-Portable-External-Hard-Drive/dp/B07CRG94G3/ref=lp_16225007011_1_4",
        "https://www.amazon.com/Logitech-MK270-Wireless-Keyboard-Mouse/dp/B079JLY5M5/ref=lp_16225007011_1_6"
      ]
    }
  ]
}
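The sitemap response nests the URL list inside a `results` array. A minimal sketch of flattening it into a single de-duplicated list follows; the function name `extract_sitemap_urls` is an assumption, and the input is a shortened copy of the sample response above.

```python
from typing import Any, Dict, List


def extract_sitemap_urls(response: Dict[str, Any]) -> List[str]:
    """Flatten results[].sitemap[] into one list, keeping first-seen order."""
    urls: List[str] = []
    seen = set()
    for result in response.get("results", []):
        for url in result.get("sitemap", []):
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls


# Shortened copy of the sample response.
sample_response = {
    "results": [
        {
            "sitemap": [
                "https://www.amazon.com/Apple-MU8F2AM-A-Pencil-Generation/dp/B07K1WWBJK/ref=lp_16225007011_1_1",
                "https://www.amazon.com/HP-Cartridge-Black-3YM57AN-Tri-Color/dp/B08412PTS8/ref=lp_16225007011_1_2",
            ]
        }
    ]
}

for url in extract_sitemap_urls(sample_response):
    print(url)
```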

Get the list of aggregate result chunks

Once your crawling job is finished, you can download the aggregate result. The result can be one of the following:

  • An index (a list of URLs);

  • An aggregate file with all parsed results;

  • An aggregate file with all HTML results.

Depending on your crawling preferences and output type, the aggregate result may consist of a lot of data. To make downloading results more manageable, we split the aggregate result into multiple chunks based on the chunk size you specify while submitting your crawling job.

Use this endpoint to get the list of chunk files available for download. You may then proceed to download any individual chunk.

  • Endpoint: https://ect.oxylabs.io/v1/jobs/{id}/aggregate

  • Method: GET

  • Authentication: Basic

Sample response:

{
  "chunks": 3,
  "chunk_urls": [
    {
      "rel": "chunk",
      "href": "http://ect.oxylabs.io/v1/jobs/12116031016250208332/aggregate/1",
      "method": "GET"
    },
    {
      "rel": "chunk",
      "href": "http://ect.oxylabs.io/v1/jobs/12116031016250208332/aggregate/2",
      "method": "GET"
    },
    {
      "rel": "chunk",
      "href": "http://ect.oxylabs.io/v1/jobs/12116031016250208332/aggregate/3",
      "method": "GET"
    }
  ]
}
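A small helper can turn the chunk list into the URLs to download. The sketch below is illustrative only: `chunk_download_urls` is a hypothetical name, and the consistency check between `chunks` and `chunk_urls` is an extra safeguard added here, not documented behavior.

```python
from typing import Any, Dict, List


def chunk_download_urls(aggregate: Dict[str, Any]) -> List[str]:
    """Return the href of every chunk listed in an aggregate response."""
    urls = [
        link["href"]
        for link in aggregate.get("chunk_urls", [])
        if link.get("rel") == "chunk"
    ]
    # Sanity check (an assumption of this sketch): the count should match "chunks".
    if len(urls) != aggregate.get("chunks"):
        raise ValueError("chunk count does not match the number of chunk URLs")
    return urls


# Shortened copy of the sample response above.
sample_aggregate = {
    "chunks": 2,
    "chunk_urls": [
        {"rel": "chunk", "href": "http://ect.oxylabs.io/v1/jobs/12116031016250208332/aggregate/1", "method": "GET"},
        {"rel": "chunk", "href": "http://ect.oxylabs.io/v1/jobs/12116031016250208332/aggregate/2", "method": "GET"},
    ],
}

for href in chunk_download_urls(sample_aggregate):
    print(href)
```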

Get a chunk of the aggregate result

Use this endpoint to download a particular chunk of the aggregate result. The contents of the response body depend on the output type chosen.

  • Endpoint: https://ect.oxylabs.io/v1/jobs/{id}/aggregate/{chunk}

  • Method: GET

  • Authentication: Basic

Query parameters

Below are all the available parameters you can use.

| Parameter | Description | Default value |
| --- | --- | --- |
| url | The URL of the starting point. | - |
| filters | These parameters configure the breadth and depth of the crawling job and determine which URLs are included in the end result. | - |
| filters:crawl | Specifies which URLs Web Crawler will scrape and check for more URLs. See the Filters section for more information. | - |
| filters:process | Specifies which URLs Web Crawler will include in the end result. See the Filters section for more information. | - |
| filters:max_depth | Determines the maximum length of URL chains Web Crawler will follow. See the Filters section for more information. | 1 |
| scrape_params | These parameters fine-tune the way we perform the scraping jobs. For instance, you may want us to execute JavaScript while crawling a site, or you may prefer us to use proxies from a particular location. | - |
| scrape_params:source | Specifies which scraper performs the scraping jobs. See the Scrape_params section for more information. | - |
| scrape_params:geo_location | The geographical location that the result should be adapted for. See the Scrape_params section for more information. | - |
| scrape_params:user_agent_type | Device type and browser. See the Scrape_params section for more information. | desktop |
| scrape_params:render | Enables JavaScript rendering. See the Scrape_params section for more information. | - |
| output:type_ | The output type. We can return a sitemap (list of URLs found while crawling) or an aggregate file containing HTMLs or parsed data. See the Output section for more information. | - |
| upload | These parameters describe the cloud storage location where you would like us to put the result once we're done. See the Upload section for more information. | - |
| upload:storage_type | Defines the cloud storage type. The only valid value is s3 (for AWS S3). gcs (for Google Cloud Storage) is coming soon. | - |
| upload:storage_url | The storage bucket URL. | - |

- required parameter
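To see how the parameters above fit together, here is a minimal sketch of a function that assembles a job payload. The function name `build_job_payload` and its argument names are hypothetical. Treating `geo_location`, `render`, and `upload` as optional is an assumption of this sketch; the defaults for `max_depth` and `user_agent_type` come from the table.

```python
from typing import Any, Dict, List, Optional

OUTPUT_TYPES = ("sitemap", "parsed", "html")


def build_job_payload(
    url: str,
    crawl: List[str],
    process: List[str],
    source: str,
    output_type: str,
    max_depth: int = 1,                  # documented default
    user_agent_type: str = "desktop",    # documented default
    geo_location: Optional[str] = None,  # assumed optional
    render: Optional[str] = None,        # assumed optional
    storage_type: Optional[str] = None,  # assumed optional
    storage_url: Optional[str] = None,   # assumed optional
) -> Dict[str, Any]:
    """Assemble a Web Crawler job payload from individual parameters."""
    if output_type not in OUTPUT_TYPES:
        raise ValueError(f"output_type must be one of {OUTPUT_TYPES}")

    scrape_params: Dict[str, Any] = {"source": source, "user_agent_type": user_agent_type}
    if geo_location is not None:
        scrape_params["geo_location"] = geo_location
    if render is not None:
        scrape_params["render"] = render

    payload: Dict[str, Any] = {
        "url": url,
        "filters": {"crawl": crawl, "process": process, "max_depth": max_depth},
        "scrape_params": scrape_params,
        "output": {"type_": output_type},
    }
    # Only add the upload block when both storage values are provided.
    if storage_type is not None and storage_url is not None:
        payload["upload"] = {"storage_type": storage_type, "storage_url": storage_url}
    return payload


# Reproduces the sample payload from the "Create a new job" section.
payload = build_job_payload(
    url="https://amazon.com",
    crawl=[".*"],
    process=[".*"],
    source="universal",
    output_type="sitemap",
    storage_type="s3",
    storage_url="bucket_name",
)
print(payload)
```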

Scrape_params

You can fine-tune your configuration with a few more Scraper APIs parameters. Most of their values will depend on the type of scraper you use.

geo_location

The geographical location that the result should be adapted for. Your geo_location parameter value format will depend on the source you choose. Visit your chosen source documentation for more information.

render

Enables JavaScript rendering. Use it when the target requires JavaScript to load content. To use this feature, set the parameter value to html.

source

source lets you specify which scraper should be used to perform the scraping jobs while crawling. The parameter value you should use depends on the URL you are submitting. The table below outlines which source value you should use.

IMPORTANT: The list below is incomplete. You will get access to the full list of available source values after signing up for a free trial or making a purchase.

| URL | Source |
| --- | --- |
| Any Amazon URL | amazon |
| Any Bing URL | bing |
| Any Baidu URL | baidu |
| Any Google URL | google |
| Any other URL | universal (for sitemap or HTML output) or universal_ecommerce (for parsed output) |
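The table above can be expressed as a small lookup. The sketch below is a rough heuristic rather than the service's own logic: the function name `pick_source` is hypothetical, and detecting the site by looking for its name among the hostname labels is an assumption that may misclassify unusual domains. It also covers only the incomplete list shown here.

```python
from urllib.parse import urlsplit

# Site name (as it appears in the hostname) mapped to the source value.
SITE_SOURCES = {"amazon": "amazon", "bing": "bing", "baidu": "baidu", "google": "google"}


def pick_source(url: str, output_type: str) -> str:
    """Suggest a source value for a starting URL and an output type."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")  # e.g. "www.amazon.com" -> ["www", "amazon", "com"]
    for site, source in SITE_SOURCES.items():
        if site in labels:
            return source
    # Any other URL: the choice depends on the output type.
    return "universal_ecommerce" if output_type == "parsed" else "universal"


print(pick_source("https://www.amazon.com/dp/B07K1WWBJK", "parsed"))
print(pick_source("https://shop.example.com/catalog", "parsed"))
print(pick_source("https://shop.example.com/catalog", "sitemap"))
```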

user_agent_type

Device type and browser. The complete list can be found here.

Filters

Filters let you control the breadth and depth of your crawling jobs. You will be invoiced for the number of URLs scraped, so it is crucial that you set up your filters correctly. Otherwise, Web Crawler may scrape more URLs than necessary.

process and crawl filters rely solely on regular expressions (regex) to decide whether some action should be performed on a URL (or a result associated with it).

We don't add any process or crawl filters by default. This means that if you don't submit any regular expressions for these filters, no crawling will take place (as we won't follow any URLs) and no results will be included in the sitemap/aggregate result.

Regex value examples

| Regex value | Description |
| --- | --- |
| .* | Matches any number of any characters, except line breaks. Use this expression as a wildcard for zero or more characters. |
| https:\/\/www.amazon.com\/[^\/]*\/[^\/]* | Matches all amazon.com URLs that have no more than two / (slash) symbols in the path. |
| https:\/\/www.amazon.com\/.*\/[A-Z0-9]{10}.* | Matches the domain name, followed by any string of characters, which is then followed by a 10-character-long alphanumeric string of characters, which is again followed by any string of characters. Use this value to match all product URLs on amazon.com. |
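You can try these expressions locally before submitting a job. The sketch below uses Python's `re` module and tests each pattern against the whole URL with `re.fullmatch`. This documentation does not state whether the service anchors patterns to the full URL, so treat that choice as an assumption. The `help_url` value is a made-up example; the product URL comes from the sample sitemap response.

```python
import re
from typing import List

ANY = r".*"
TWO_SLASHES = r"https:\/\/www.amazon.com\/[^\/]*\/[^\/]*"
PRODUCT = r"https:\/\/www.amazon.com\/.*\/[A-Z0-9]{10}.*"


def url_matches(patterns: List[str], url: str) -> bool:
    """True if at least one pattern matches the entire URL (assumed semantics)."""
    return any(re.fullmatch(pattern, url) for pattern in patterns)


product_url = "https://www.amazon.com/Apple-MU8F2AM-A-Pencil-Generation/dp/B07K1WWBJK/ref=lp_16225007011_1_1"
help_url = "https://www.amazon.com/gp/help"  # hypothetical URL for illustration

print(url_matches([ANY], ""))                   # True: .* also matches zero characters
print(url_matches([PRODUCT], product_url))      # True: contains /B07K1WWBJK
print(url_matches([PRODUCT], help_url))         # False: no 10-character product ID
print(url_matches([TWO_SLASHES], help_url))     # True: exactly two slashes in the path
print(url_matches([TWO_SLASHES], product_url))  # False: the path is deeper
```

With prefix matching (`re.match`) instead of full matching, the two-slash pattern would also accept the deeper product URL, so the anchoring behavior matters when you estimate how many URLs a filter will let through.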


process

The process filter lets you specify which URLs should be included in the job result. Every URL we come across will be evaluated for matching the process filters. If it's a match, the URL (or the contents of the URL) will be included in the job result. As a parameter value, please send one or more regular expressions in a JSON array, like this: "process": [".e", ".c", ".t"].

crawl

The crawl filter lets you specify which URLs (apart from the URL of the starting point) are to be scraped and checked for more URLs. In simple terms, every URL we find while crawling is evaluated for matching the crawl filters. If it's a match, we'll scrape the URL in question to look for more URLs. As a parameter value, please send one or more regular expressions in a JSON array, like this: "crawl": [".e", ".c", ".t"].

max_depth

The value of the max_depth filter determines the maximum length of URL chains Web Crawler will follow.

| Value | Description |
| --- | --- |
| -1 | Crawls without any depth limits. |
| 0 | Scrapes the starting page only. |
| 1 | Scrapes all URLs found in the starting page. This is the default setting. |
| 2 | Scrapes all URLs found in the URLs found in the starting page. |
| 3 | Scrapes all URLs found in the URLs found in the URLs found in the starting page. |
| 4 | Scrapes all URLs found in the URLs found in the URLs found in the URLs found in the starting page. |
| n | Scrapes URLs up to n links away from the starting page. |

NOTE: Web Crawler will only crawl URLs in the same domain as the domain of the URL of the starting point. The ability to override this setting is on our roadmap.
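To build intuition for how max_depth, the crawl and process filters, and the same-domain rule interact, the sketch below simulates a crawl over a tiny in-memory link graph. It is a simplified model and not the service's implementation. The assumptions are: patterns must match the whole URL, the starting page is always fetched, a discovered URL is reported when it matches a process filter and expanded when it matches a crawl filter, and "same domain" means an identical hostname. The `shop.example` and `other.example` URLs are made up.

```python
import re
from collections import deque
from typing import Dict, List
from urllib.parse import urlsplit


def simulate_crawl(links: Dict[str, List[str]], start: str, crawl: List[str],
                   process: List[str], max_depth: int) -> List[str]:
    """Return the URLs a depth-limited crawl of `links` would report (simplified model)."""
    crawl_res = [re.compile(p) for p in crawl]
    process_res = [re.compile(p) for p in process]
    start_host = urlsplit(start).hostname

    def matches(patterns, url: str) -> bool:
        return any(p.fullmatch(url) for p in patterns)

    reported: List[str] = []
    seen = {start}
    queue = deque([(start, 0)])  # breadth-first: (url, depth from the starting page)
    while queue:
        url, depth = queue.popleft()
        if matches(process_res, url):
            reported.append(url)
        # The starting page is always fetched; other pages only if they match `crawl`.
        if url != start and not matches(crawl_res, url):
            continue
        # -1 means no depth limit; otherwise stop following links at the limit.
        if max_depth != -1 and depth >= max_depth:
            continue
        for found in links.get(url, []):
            # Skip already-seen URLs and URLs on a different host.
            if found in seen or urlsplit(found).hostname != start_host:
                continue
            seen.add(found)
            queue.append((found, depth + 1))
    return reported


# Hypothetical site: page -> links found on that page.
links = {
    "https://shop.example/": ["https://shop.example/a", "https://shop.example/b",
                              "https://other.example/x"],
    "https://shop.example/a": ["https://shop.example/a/1"],
    "https://shop.example/a/1": ["https://shop.example/a/1/deep"],
}

for depth in (0, 1, 2, -1):
    print(depth, simulate_crawl(links, "https://shop.example/", [".*"], [".*"], depth))
```

Running the loop shows the result growing by one level of links per depth step, which is also how the number of scraped (and invoiced) URLs grows.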

Output

type_

The type_ parameter determines what the output of the Web Crawler job will contain. The output types break down like this:

| Value | Description |
| --- | --- |
| sitemap | A list of URLs. |
| parsed | A JSON file containing an aggregate of parsed results. |
| html | A JSON file containing an aggregate of HTML results. |

Upload

You can upload your results to your own cloud storage. We support Amazon S3 and Google Cloud Storage.

storage_type

storage_type specifies the type of the cloud storage the results are to be uploaded to. We support s3 and gcs storage types.

storage_url

storage_url specifies the name of the bucket the results are to be uploaded to.

Usage statistics

When using Web Crawler, you will be invoiced for the number of scraped URLs. The price per scrape will be the same as your regular Scraper APIs usage.

You can read about checking your usage statistics here.

If you need help understanding how Web Crawler affects your monthly usage statistics, contact us at support@oxylabs.io or reach out to your sales or account management representative.

Integrations

Postman

If you want to try out all of Web Crawler's endpoints, you can download and use this Postman collection.

Swagger / OpenAPI

You can look at our Swagger documentation page, which contains API schema and other useful information. To log in, please use the username and password of your Scraper API.
