Oxylabs Documentation

Crawler

Intro

Crawler is a Scraper API feature that lets you crawl any site, select useful content, and have it delivered to you in bulk. You can use Crawler to perform URL discovery, crawl all pages on a site, index all URLs on a domain, and more.
How to Crawl a Website: Step-by-step Guide
Crawler uses basic HTTP authentication that requires sending a username and a password.
Before using the Crawler, please read how its filters work and how you may be charged when using this functionality in the Filters and Usage statistics sections of this documentation.
Then, get familiar with the available input parameters and API endpoints.
Check out the Integrations section to find request templates and schemas.
If you are lost and would like some guidance, just drop us a line at [email protected] or contact your sales/account management rep.

Endpoints

Crawler has a number of endpoints you can use to control the service: initiate, stop, and resume your job, get job info, get the list of result chunks, and get the results.

Create a new job

Use this endpoint to initiate a new Crawler job.
  • Endpoint: https://ect.oxylabs.io/v1/jobs
  • Method: POST
  • Authentication: Basic
  • Request headers: Content-Type: application/json
Sample payload:
{
  "url": "https://example.com",
  "filters": {
    "crawl": [".*"],
    "process": [".*"],
    "max_depth": -1
  },
  "scrape_params": {
    "source": "universal",
    "user_agent_type": "desktop"
  },
  "output": {
    "type_": "sitemap"
  },
  "upload": {
    "storage_type": "s3",
    "storage_url": "http://s3.amazonaws.com/{bucket_name}/"
  }
}
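As an illustration, here is a minimal Python sketch (using the requests library; the username, password, and bucket name are placeholders) that submits the sample payload above with Basic authentication and reads the job ID from the response:

import requests

# Placeholder credentials - replace with your own.
USERNAME = "username"
PASSWORD = "password"

payload = {
    "url": "https://example.com",
    "filters": {
        "crawl": [".*"],
        "process": [".*"],
        "max_depth": -1
    },
    "scrape_params": {
        "source": "universal",
        "user_agent_type": "desktop"
    },
    "output": {
        "type_": "sitemap"
    },
    "upload": {
        "storage_type": "s3",
        "storage_url": "http://s3.amazonaws.com/{bucket_name}/"
    }
}

# Create a new Crawler job.
response = requests.post(
    "https://ect.oxylabs.io/v1/jobs",
    json=payload,
    auth=(USERNAME, PASSWORD),
)
job_id = response.json()["id"]
print(job_id)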
Sample response:
{
  "id": "10374369707989137859",
  "client": "username",
  "job_params": {
    "url": "https://example.com",
    "filters": {
      "crawl": [".*"],
      "process": [".*"],
      "max_depth": -1
    },
    "scrape_params": {
      "source": "universal",
      "geo_location": null,
      "user_agent_type": "desktop",
      "render": null
    },
    "output": {
      "type": "sitemap",
      "selector": null
    },
    "upload": {
      "storage_type": "s3",
      "storage_url": "http://s3.amazonaws.com/{bucket_name}/"
    }
  },
  "_links": [
    {
      "rel": "self",
      "href": "http://ect.oxylabs.io/v1/jobs/10374369707989137859",
      "method": "GET"
    },
    {
      "rel": "stop-indexing",
      "href": "http://ect.oxylabs.io/v1/jobs/10374369707989137859/stop-indexing",
      "method": "POST"
    }
  ],
  "events": [],
  "created_at": "2021-11-19 14:32:01",
  "updated_at": "2021-11-19 14:32:01"
}

Stop a job

Use this endpoint to stop a specific job.
  • Endpoint: https://ect.oxylabs.io/v1/jobs/{id}/stop-indexing
  • Method: POST
  • Authentication: Basic
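A minimal Python sketch (the credentials and job ID are placeholders) for stopping a job:

import requests

# Placeholders - replace with your own credentials and job ID.
USERNAME = "username"
PASSWORD = "password"
JOB_ID = "10374369707989137859"

# Stop the crawling (indexing) of the given job.
response = requests.post(
    f"https://ect.oxylabs.io/v1/jobs/{JOB_ID}/stop-indexing",
    auth=(USERNAME, PASSWORD),
)
print(response.status_code)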

Resume a job

Use this endpoint to resume a specific job.
  • Endpoint: https://ect.oxylabs.io/v1/jobs/{id}/stop-indexing
  • Method: POST
  • Authentication: Basic
Example response:
null

Get job information

Use this endpoint to get information about an existing job.
  • Endpoint: https://ect.oxylabs.io/v1/jobs/{id}
  • Method: GET
  • Authentication: Basic
Sample response:
{
  "id": "10374369707989137859",
  "client": "username",
  "job_params": {
    "url": "https://example.com",
    "filters": {
      "crawl": [],
      "process": [],
      "max_depth": -1
    },
    "scrape_params": {
      "source": "universal",
      "geo_location": null,
      "user_agent_type": "desktop",
      "render": null
    },
    "output": {
      "type": "sitemap",
      "selector": null
    },
    "upload": {
      "storage_type": "s3",
      "storage_url": "http://s3.amazonaws.com/{bucket_name}/"
    }
  },
  "_links": [
    {
      "rel": "self",
      "href": "http://ect.oxylabs.io/v1/jobs/10374369707989137859",
      "method": "GET"
    }
  ],
  "events": [
    {
      "event": "job_indexing_finished",
      "status": "done",
      "reason": null,
      "created_at": "2021-11-19 14:32:16"
    },
    {
      "event": "job_results_aggregated",
      "status": "done",
      "reason": null,
      "created_at": "2021-11-19 14:32:17"
    }
  ],
  "created_at": "2021-11-19 14:32:01",
  "updated_at": "2021-11-19 14:32:01"
}
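To illustrate, a minimal Python sketch (the credentials and job ID are placeholders; the polling interval is an arbitrary choice) that polls this endpoint until the events list reports that the results have been aggregated:

import time
import requests

# Placeholders - replace with your own credentials and job ID.
USERNAME = "username"
PASSWORD = "password"
JOB_ID = "10374369707989137859"

while True:
    info = requests.get(
        f"https://ect.oxylabs.io/v1/jobs/{JOB_ID}",
        auth=(USERNAME, PASSWORD),
    ).json()

    # Look for the "job_results_aggregated" event with status "done",
    # as shown in the sample response above.
    done = any(
        e["event"] == "job_results_aggregated" and e["status"] == "done"
        for e in info["events"]
    )
    if done:
        break
    time.sleep(10)  # arbitrary polling interval

print("Results are ready to download.")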

Get sitemap

Use this endpoint to get the list of URLs found while processing the job.
  • Endpoint: https://ect.oxylabs.io/v1/jobs/{id}/sitemap
  • Method: GET
  • Authentication: Basic
Sample response:
{
  "results": [
    {
      "sitemap": [
        "https://example.com",
        "https://example.com/url1.html",
        "https://example.com/url2.html",
        "https://example.com/url3.html"
      ]
    }
  ]
}
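A minimal Python sketch (the credentials and job ID are placeholders) that fetches the sitemap and prints every discovered URL:

import requests

USERNAME = "username"   # placeholder
PASSWORD = "password"   # placeholder
JOB_ID = "10374369707989137859"  # placeholder

response = requests.get(
    f"https://ect.oxylabs.io/v1/jobs/{JOB_ID}/sitemap",
    auth=(USERNAME, PASSWORD),
)

# The response wraps the URL list in a "results" array, as shown above.
for result in response.json()["results"]:
    for url in result["sitemap"]:
        print(url)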

Get the list of aggregate result chunks

Once your crawling job is finished, you can download the aggregate result. The result can be one of the following:
  • An index (a list of URLs);
  • An aggregate file with all parsed results;
  • An aggregate file with all HTML results.
Depending on your crawling preferences and output type, the aggregate result may contain a lot of data. To make downloading easier, we split the aggregate result into multiple chunks based on the chunk size you specify when submitting your crawling job.
Use this endpoint to get the list of chunk files available for download. You may then proceed to download any individual chunk.
  • Endpoint: https://ect.oxylabs.io/v1/jobs/{id}/aggregate
  • Method: GET
  • Authentication: Basic
Sample response:
{
  "chunks": 3,
  "chunk_urls": [
    {
      "rel": "chunk",
      "href": "http://ect.oxylabs.io/v1/jobs/12116031016250208332/aggregate/1",
      "method": "GET"
    },
    {
      "rel": "chunk",
      "href": "http://ect.oxylabs.io/v1/jobs/12116031016250208332/aggregate/2",
      "method": "GET"
    },
    {
      "rel": "chunk",
      "href": "http://ect.oxylabs.io/v1/jobs/12116031016250208332/aggregate/3",
      "method": "GET"
    }
  ]
}
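A minimal Python sketch (the credentials and job ID are placeholders) that retrieves the list of available chunks and prints their download URLs:

import requests

USERNAME = "username"   # placeholder
PASSWORD = "password"   # placeholder
JOB_ID = "12116031016250208332"  # placeholder

response = requests.get(
    f"https://ect.oxylabs.io/v1/jobs/{JOB_ID}/aggregate",
    auth=(USERNAME, PASSWORD),
)

chunk_list = response.json()
print(chunk_list["chunks"], "chunks available")
for link in chunk_list["chunk_urls"]:
    print(link["href"])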

Get a chunk of the aggregate result

Use this endpoint to download a particular chunk of the aggregate result. The contents of the response body depend on the output type chosen.
  • Endpoint: https://ect.oxylabs.io/v1/jobs/{id}/aggregate/{chunk}
  • Method: GET
  • Authentication: Basic
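As a sketch (the credentials, job ID, and output file names are placeholders; the exact contents of each chunk depend on the output type you chose), you could download every chunk and write it to disk like this:

import requests

USERNAME = "username"   # placeholder
PASSWORD = "password"   # placeholder
JOB_ID = "12116031016250208332"  # placeholder

# Get the list of available chunks first.
chunk_list = requests.get(
    f"https://ect.oxylabs.io/v1/jobs/{JOB_ID}/aggregate",
    auth=(USERNAME, PASSWORD),
).json()

# Download every chunk and save it to a local file.
for index, link in enumerate(chunk_list["chunk_urls"], start=1):
    chunk = requests.get(link["href"], auth=(USERNAME, PASSWORD))
    with open(f"aggregate_chunk_{index}", "wb") as f:  # placeholder file name
        f.write(chunk.content)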

Query parameters

Below are all the available parameters you can use.
  • url (required): The URL of the starting point.
  • filters: These parameters are used to configure the breadth and depth of the crawling job, as well as determine which URLs should be included in the end result.
  • filters:crawl: Specifies which URLs Crawler will scrape to look for more URLs. See this section for more information.
  • filters:process: Specifies which URLs Crawler will include in the end result. See this section for more information.
  • filters:max_depth: Determines the max length of URL chains Crawler will follow. See this section for more information. Default: -1.
  • scrape_params: These parameters are used to fine-tune the way we perform the scraping jobs. For instance, you may want us to execute JavaScript while crawling a site, or you may prefer us to use proxies from a particular location.
  • scrape_params:source: The scraper that will be used to perform the scraping jobs. See this section for more information.
  • scrape_params:geo_location: The geographical location that the result should be adapted for. See this section for more information.
  • scrape_params:user_agent_type: Device type and browser. See this section for more information. Default: desktop.
  • scrape_params:render: Enables JavaScript rendering. See this section for more information.
  • output:type_: The output type. We can return a sitemap (a list of URLs found while crawling) or an aggregate file containing HTMLs or parsed data. See this section for more information.
  • upload: These parameters describe the cloud storage location where you would like us to put the result once we're done. See this section for more information.
  • upload:storage_type: Defines the cloud storage type. The only valid value is s3 (for AWS S3); gcs (for Google Cloud Storage) is coming soon.
  • upload:storage_url: The storage bucket URL.
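To put these parameters together, here is an illustrative payload (the domain, regular expressions, and bucket name are placeholders) that follows only /category/ pages up to two levels deep, includes only /product/ pages in the result, and uploads parsed output to S3:

{
  "url": "https://example.com",
  "filters": {
    "crawl": ["/category/"],
    "process": ["/product/"],
    "max_depth": 2
  },
  "scrape_params": {
    "source": "universal_ecommerce",
    "user_agent_type": "desktop"
  },
  "output": {
    "type_": "parsed"
  },
  "upload": {
    "storage_type": "s3",
    "storage_url": "http://s3.amazonaws.com/{bucket_name}/"
  }
}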

Scrape_params

You can fine-tune your configuration with a few more Scraper APIs parameters. Most of their values will depend on the type of scraper you use.
  • render: Enables JavaScript rendering. Use when the target requires JavaScript to load content. If you want to use this feature, set the parameter value to html. More info.

geo_location

The geographical location that the result should be adapted for. Your geo_location parameter value format will depend on the source you choose. Visit your chosen source documentation for more information. E.g. if your chosen source is universal_ecommerce, go to E-Commerce Scraper API -> Other Domains -> Parameter Values to find the geo_location parameter values explained.

render

Enables JavaScript rendering. Use when the target requires JavaScript to load content. If you want to use this feature, set the parameter value to html. More info.

source

source lets you specify which scraper should be used to perform the scraping jobs while crawling. The value you should use depends on the URL you are submitting, as outlined below.
IMPORTANT: The below list is incomplete. You will get access to the full list of available source values after signing up for a free trial or making a purchase.
  • Any Amazon URL: amazon
  • Any Bing URL: bing
  • Any Baidu URL: baidu
  • Any Google URL: google
  • Any Idealo URL: idealo
  • Any other URL: universal (for sitemap or HTML output) or universal_ecommerce (for parsed output)
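For example, if your starting point URL is on Amazon, the scrape_params object would (as a sketch) specify the amazon source:

"scrape_params": {
  "source": "amazon",
  "user_agent_type": "desktop"
}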

user_agent_type

Device type and browser. The full list can be found here.

Filters

Filters let you control the breadth and depth of your crawling jobs. You will be invoiced for the number of URLs scraped, so it is crucial to set up your filters correctly; otherwise, Crawler may scrape more URLs than necessary.
The process and crawl filters rely solely on regular expressions (regex) to decide whether an action should be performed on a URL (or the result associated with it). There is no shortage of online material on regex, so we will not go into detail on constructing regular expressions here.
We don't add any process or crawl filters by default. This means that if you don't submit any regular expressions for these filters, no crawling will take place (as we won't follow any URLs) and no results will be included in the sitemap/aggregate result.

process

The process filter lets you specify which URLs should be included in the job result. Every URL we come across will be evaluated against the process filters. If it's a match, the URL (or the contents of the URL) will be included in the job result. As a parameter value, please send one or more regular expressions in a JSON array, like this: "process": [".e", ".c", ".t"].

crawl

The crawl filter lets you specify which URLs (apart from the URL of the starting point) are to be scraped and checked for more URLs. In simple terms, every URL we find while crawling is evaluated against the crawl filters. If it's a match, we'll scrape the URL in question to look for more URLs. As a parameter value, please send one or more regular expressions in a JSON array, like this: "crawl": [".e", ".c", ".t"].
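As an illustrative sketch (the URL patterns are placeholders), the two filters are often combined: crawl tells Crawler which pages to follow for more links, while process tells it which pages to keep in the result:

"filters": {
  "crawl": ["https://example.com/category/.*"],
  "process": ["https://example.com/product/.*"],
  "max_depth": -1
}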

max_depth

The value of the max_depth filter determines the maximum length of URL chains Crawler will follow.
  • -1: Crawls without any depth limits. This is the default setting.
  • 0: Scrapes the starting page only.
  • 1: Scrapes all URLs found on the starting page.
  • 2: Scrapes all URLs found on the URLs found on the starting page.
  • 3: Scrapes all URLs found on the URLs found on the URLs found on the starting page.
  • 4: Scrapes all URLs found on the URLs found on the URLs found on the URLs found on the starting page.
  • n: Scrapes URLs found up to n levels of links away from the starting page.
NOTE: Crawler will only crawl URLs in the same domain as the domain of the URL of the starting point. The ability to override this setting is on our roadmap.

Output

type_

The type_ parameter determines what the output of the Crawler job will contain. The output types break down like this:
  • sitemap: A list of URLs.
  • parsed: A JSON file containing an aggregate of parsed results.
  • html: A JSON file containing an aggregate of HTML results.
IMPORTANT: You can upload your results to your own cloud storage. Currently, we support Amazon S3 and are working on adding Google Cloud Storage as an option.
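As a sketch (the bucket name is a placeholder), an output and upload combination that stores aggregated HTML results in an S3 bucket could look like this:

"output": {
  "type_": "html"
},
"upload": {
  "storage_type": "s3",
  "storage_url": "http://s3.amazonaws.com/{bucket_name}/"
}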

Usage statistics

When using Crawler, you will be invoiced for the number of scraped URLs. The price per scrape will be the same as that of your regular Scraper APIs usage.
You can read about checking your usage statistics here.
Feel free to reach out to us at [email protected] or contact your sales/account management rep if you are unsure about how using Crawler affects your monthly usage statistics.

Integrations

Postman

If you want to try out all of Crawler's endpoints, you can download and use this Postman collection.

Swagger / OpenAPI

You may have a look at our Swagger documentation page, which contains the API schema as well as other useful information.