Web Crawler
Web Crawler is a Scraper API feature that lets you crawl any site, select useful content, and have it delivered to you in bulk. You can use Web Crawler to perform URL discovery, crawl all pages on a site, index all URLs on a domain, and more.
The tutorial below will guide you through the Web Crawler workflow, showcasing how to retrieve public data from an e-commerce website.
How to Crawl a Website: Step-by-step Guide
Before using the Web Crawler, please read how its filters work and how you may be charged when using this functionality in the Filters and Usage statistics sections of this documentation.
Use this endpoint to initiate a new Web Crawler job.
- Endpoint:
https://ect.oxylabs.io/v1/jobs
- Method:
POST
- Authentication:
Basic
- Request headers:
Content-Type: application/json
Sample payload:
{
  "url": "https://amazon.com",
  "filters": {
    "crawl": [".*"],
    "process": [".*"],
    "max_depth": 1
  },
  "scrape_params": {
    "source": "universal",
    "user_agent_type": "desktop"
  },
  "output": {
    "type_": "sitemap"
  },
  "upload": {
    "storage_type": "s3",
    "storage_url": "bucket_name"
  }
}
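For reference, here is a minimal sketch of submitting this payload with Python's requests library (an assumption on our part; any HTTP client works). USERNAME and PASSWORD are placeholders for your Scraper API credentials:

```python
# A minimal sketch of creating a Web Crawler job, assuming the `requests`
# library; USERNAME and PASSWORD are placeholders for your Scraper API credentials.
import requests

USERNAME, PASSWORD = "username", "password"

payload = {
    "url": "https://amazon.com",
    "filters": {"crawl": [".*"], "process": [".*"], "max_depth": 1},
    "scrape_params": {"source": "universal", "user_agent_type": "desktop"},
    "output": {"type_": "sitemap"},
}

response = requests.post(
    "https://ect.oxylabs.io/v1/jobs",
    auth=(USERNAME, PASSWORD),  # Basic authentication
    json=payload,               # sends Content-Type: application/json
)
response.raise_for_status()
job_id = response.json()["id"]  # keep the ID for the endpoints below
print(job_id)
```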
Sample response:
{
  "id": "10374369707989137859",
  "client": "username",
  "job_params": {
    "url": "https://amazon.com",
    "filters": {
      "crawl": [".*"],
      "process": [".*"],
      "max_depth": 1
    },
    "scrape_params": {
      "source": "universal",
      "geo_location": null,
      "user_agent_type": "desktop",
      "render": null
    },
    "output": {
      "type": "sitemap",
      "selector": null
    },
    "upload": {
      "storage_type": "s3",
      "storage_url": "bucket_name"
    }
  },
  "_links": [
    {
      "rel": "self",
      "href": "http://ect.oxylabs.io/v1/jobs/10374369707989137859",
      "method": "GET"
    },
    {
      "rel": "stop-indexing",
      "href": "http://ect.oxylabs.io/v1/jobs/10374369707989137859/stop-indexing",
      "method": "POST"
    }
  ],
  "events": [],
  "created_at": "2021-11-19 14:32:01",
  "updated_at": "2021-11-19 14:32:01"
}
Use this endpoint to stop a specific job.
- Endpoint:
https://ect.oxylabs.io/v1/jobs/{id}/stop-indexing
- Method:
POST
- Authentication:
Basic
Use this endpoint to resume a specific job.
- Endpoint:
https://ect.oxylabs.io/v1/jobs/{id}/resume-indexing
- Method:
POST
- Authentication:
Basic
Sample response:
null
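A minimal sketch of stopping and later resuming a job, again assuming the requests library and the job ID returned when the job was created:

```python
# A minimal sketch of pausing and resuming a crawling job, assuming `requests`;
# job_id is the ID returned when the job was created.
import requests

USERNAME, PASSWORD = "username", "password"
job_id = "10374369707989137859"

base = f"https://ect.oxylabs.io/v1/jobs/{job_id}"

# Stop indexing for the job.
requests.post(f"{base}/stop-indexing", auth=(USERNAME, PASSWORD)).raise_for_status()

# Resume indexing later; a successful call returns null.
requests.post(f"{base}/resume-indexing", auth=(USERNAME, PASSWORD)).raise_for_status()
```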
Use this endpoint to get information about an existing job.
- Endpoint:
https://ect.oxylabs.io/v1/jobs/{id}
- Method:
GET
- Authentication:
Basic
Sample response:
{
  "id": "10374369707989137859",
  "client": "username",
  "job_params": {
    "url": "https://amazon.com",
    "filters": {
      "crawl": [],
      "process": [],
      "max_depth": 1
    },
    "scrape_params": {
      "source": "universal",
      "geo_location": null,
      "user_agent_type": "desktop",
      "render": null
    },
    "output": {
      "type": "sitemap",
      "selector": null
    },
    "upload": {
      "storage_type": "s3",
      "storage_url": "bucket_name"
    }
  },
  "_links": [
    {
      "rel": "self",
      "href": "http://ect.oxylabs.io/v1/jobs/10374369707989137859",
      "method": "GET"
    }
  ],
  "events": [
    {
      "event": "job_indexing_finished",
      "status": "done",
      "reason": null,
      "created_at": "2021-11-19 14:32:16"
    },
    {
      "event": "job_results_aggregated",
      "status": "done",
      "reason": null,
      "created_at": "2021-11-19 14:32:17"
    }
  ],
  "created_at": "2021-11-19 14:32:01",
  "updated_at": "2021-11-19 14:32:01"
}
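The events list is a convenient way to track progress. Below is a minimal polling sketch, assuming the requests library and relying on the job_indexing_finished event shown in the sample response:

```python
# A minimal sketch of polling a job until indexing finishes, assuming `requests`
# and the event names shown in the sample response above.
import time
import requests

USERNAME, PASSWORD = "username", "password"
job_id = "10374369707989137859"

while True:
    info = requests.get(
        f"https://ect.oxylabs.io/v1/jobs/{job_id}",
        auth=(USERNAME, PASSWORD),
    ).json()
    events = {e["event"]: e["status"] for e in info["events"]}
    if events.get("job_indexing_finished") == "done":
        break
    time.sleep(10)  # poll every 10 seconds
```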
Use this endpoint to get the list of URLs found while processing the job.
- Endpoint:
https://ect.oxylabs.io/v1/jobs/{id}/sitemap
- Method:
GET
- Authentication:
Basic
Sample response:
{
  "results": [
    {
      "sitemap": [
        "https://www.amazon.com/Apple-MU8F2AM-A-Pencil-Generation/dp/B07K1WWBJK/ref=lp_16225007011_1_1",
        "https://www.amazon.com/HP-Cartridge-Black-3YM57AN-Tri-Color/dp/B08412PTS8/ref=lp_16225007011_1_2",
        "https://www.amazon.com/Seagate-Portable-External-Hard-Drive/dp/B07CRG94G3/ref=lp_16225007011_1_4",
        "https://www.amazon.com/Logitech-MK270-Wireless-Keyboard-Mouse/dp/B079JLY5M5/ref=lp_16225007011_1_6"
      ]
    }
  ]
}
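A minimal sketch of fetching that response and flattening it into a plain list of URLs, assuming the requests library:

```python
# A minimal sketch of fetching the sitemap and flattening it into one URL list,
# assuming `requests` and the response shape shown above.
import requests

USERNAME, PASSWORD = "username", "password"
job_id = "10374369707989137859"

data = requests.get(
    f"https://ect.oxylabs.io/v1/jobs/{job_id}/sitemap",
    auth=(USERNAME, PASSWORD),
).json()

urls = [url for result in data["results"] for url in result["sitemap"]]
print(f"Found {len(urls)} URLs")
```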
Once your crawling job is finished, you can download the aggregate result. The result can be one of the following:
- An index (a list of URLs);
- An aggregate file with all parsed results;
- An aggregate file with all HTML results.
Depending on your crawling preferences and output type, the aggregate result may consist of a lot of data. To make downloading results more manageable, we split the aggregate result into multiple chunks based on the chunk size you specify while submitting your crawling job.
Use this endpoint to get the list of chunk files available for download. You may then proceed to download any individual chunk.
- Endpoint:
https://ect.oxylabs.io/v1/jobs/{id}/aggregate
- Method:
GET
- Authentication:
Basic
Sample response:
{
  "chunks": 3,
  "chunk_urls": [
    {
      "rel": "chunk",
      "href": "http://ect.oxylabs.io/v1/jobs/12116031016250208332/aggregate/1",
      "method": "GET"
    },
    {
      "rel": "chunk",
      "href": "http://ect.oxylabs.io/v1/jobs/12116031016250208332/aggregate/2",
      "method": "GET"
    },
    {
      "rel": "chunk",
      "href": "http://ect.oxylabs.io/v1/jobs/12116031016250208332/aggregate/3",
      "method": "GET"
    }
  ]
}
Use this endpoint to download a particular chunk of the aggregate result. The contents of the response body depend on the output type chosen.
- Endpoint:
https://ect.oxylabs.io/v1/jobs/{id}/aggregate/{chunk}
- Method:
GET
- Authentication:
Basic
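A minimal sketch that ties the two endpoints together, listing the available chunks and saving each one to disk (assuming the requests library; the file contents depend on the output type you chose, and the local filenames are hypothetical):

```python
# A minimal sketch of downloading every chunk of the aggregate result,
# assuming `requests`; chunk contents depend on the chosen output type.
import requests

USERNAME, PASSWORD = "username", "password"
job_id = "12116031016250208332"

chunk_list = requests.get(
    f"https://ect.oxylabs.io/v1/jobs/{job_id}/aggregate",
    auth=(USERNAME, PASSWORD),
).json()

for i, chunk in enumerate(chunk_list["chunk_urls"], start=1):
    response = requests.get(chunk["href"], auth=(USERNAME, PASSWORD))
    response.raise_for_status()
    with open(f"chunk_{i}", "wb") as f:  # hypothetical local filename
        f.write(response.content)
```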
Below are all the available parameters you can use.
Parameter | Description | Default Value |
---|---|---|
url* | The URL of the starting point. | - |
filters | These parameters are used to configure the breadth and depth of the crawling job, as well as determine which URLs should be included in the end result. | - |
filters:crawl | Specifies which URLs Web Crawler will scrape and check for more URLs. See this section for more information. | - |
filters:process | Specifies which URLs Web Crawler will include in the end result. See this section for more information. | - |
filters:max_depth | Determines the maximum length of URL chains Web Crawler will follow. See this section for more information. | 1 |
scrape_params | These parameters are used to fine-tune the way we perform the scraping jobs. For instance, you may want us to execute JavaScript while crawling a site, or you may prefer us to use proxies from a particular location. | - |
scrape_params:source | Sets the scraper that should be used to perform the scraping jobs. See this section for more information. | - |
scrape_params:geo_location | The geographical location that the result should be adapted for. See this section for more information. | - |
scrape_params:user_agent_type | The device type and browser of the user agent. | desktop |
scrape_params:render | Enables JavaScript rendering. See this section for more information. | - |
output:type_ | The output type. We can return a sitemap (a list of URLs found while crawling) or an aggregate file containing HTMLs or parsed data. See this section for more information. | - |
upload | These parameters are used to describe the cloud storage location where you would like us to put the result once we're done. See this section for more information. | - |
upload:storage_type | The cloud storage type. We support s3 (AWS S3) and gcs (Google Cloud Storage). | - |
upload:storage_url | The storage bucket URL. | - |

* - required parameter
You can fine-tune your configuration with a few more Scraper API parameters. Most of their values will depend on the type of scraper you use.

geo_location

The geographical location that the result should be adapted for. The geo_location parameter value format depends on the source you choose. Visit your chosen source's documentation for more information. For example, if your chosen source is universal_ecommerce, go to E-Commerce Scraper API -> Other Domains -> Parameter Values to find the geo_location parameter values explained.

render

Enables JavaScript rendering. Use it when the target requires JavaScript to load content. If you want to use this feature, set the parameter value to html.

source

The source parameter lets you specify which scraper should be used to perform the scraping jobs while crawling. The parameter value you should use depends on the URL you are submitting. The table below outlines which source value you should use.

IMPORTANT: The list below is incomplete. You will get access to the full list of available source values after signing up for a free trial or making a purchase.

URL | Source |
---|---|
Any Amazon URL | amazon |
Any Bing URL | bing |
Any Baidu URL | baidu |
Any Google URL | google |
Any other URL | universal (for sitemap or HTML output) or universal_ecommerce (for parsed output) |
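For instance, a hypothetical scrape_params object for crawling a JavaScript-heavy site through US-based proxies might look like the fragment below. The geo_location value is illustrative only; check your chosen source's documentation for the accepted format.

```json
{
  "scrape_params": {
    "source": "universal",
    "user_agent_type": "desktop",
    "render": "html",
    "geo_location": "United States"
  }
}
```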
Filters let you control the breadth and depth of your crawling jobs. You will be invoiced for the number of URLs scraped, so it is crucial to set up your filters correctly; otherwise, Web Crawler may scrape more URLs than necessary.

The process and crawl filters rely solely on regular expressions (regex) to decide whether an action should be performed on a URL (or a result associated with it).

We don't add any process or crawl filters by default. This means that if you don't submit any regular expressions for these filters, no crawling will take place (as we won't follow any URLs) and no results will be included in the sitemap/aggregate result.

Regex value | Description |
---|---|
.* | Matches any number of any characters, except line breaks. Use this expression as a wildcard for any string of characters. |
https:\/\/www.amazon.com\/[^\/]*\/[^\/]* | Matches all amazon.com URLs that have no more than two / (slash) symbols in the path. |
https:\/\/www.amazon.com\/.*\/[A-Z0-9]{10}.* | Matches the domain name, followed by any string of characters, then a 10-character alphanumeric string, then any string of characters. Use this value to match all product URLs on amazon.com. |
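You can sanity-check these patterns locally before submitting a job. A minimal sketch using Python's re module with the product-URL pattern from the table above:

```python
# A minimal sketch for testing filter regexes locally, using Python's `re`
# module and the product-URL pattern from the table above.
import re

pattern = re.compile(r"https:\/\/www.amazon.com\/.*\/[A-Z0-9]{10}.*")

urls = [
    "https://www.amazon.com/Seagate-Portable-External-Hard-Drive/dp/B07CRG94G3",
    "https://www.amazon.com/gp/help/customer/display.html",
]
for url in urls:
    # match() requires the pattern to match from the start of the string.
    print(url, "->", bool(pattern.match(url)))
# -> True for the product URL, False for the help page
```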
The process filter lets you specify which URLs should be included in the job result. Every URL we come across will be evaluated for matching the process filters. If it's a match, the URL (or the contents of the URL) will be included in the job result. As a parameter value, please send one or more regular expressions in a JSON array, like this: "process": [".*"].

The crawl filter lets you specify which URLs (apart from the URL of the starting point) are to be scraped and checked for more URLs. In simple terms, every URL we find while crawling is evaluated for matching the crawl filters. If it's a match, we'll scrape the URL in question to look for more URLs. As a parameter value, please send one or more regular expressions in a JSON array, like this: "crawl": [".*"].

The value of the max_depth filter determines the maximum length of URL chains Web Crawler will follow.
Value | Description |
---|---|
-1 | Crawls without any depth limits. |
0 | Scrapes the starting page only. |
1 | Scrapes all URLs found on the starting page. This is the default setting. |
2 | Scrapes all URLs found on the pages reached at depth 1. |
3 | Scrapes all URLs found on the pages reached at depth 2. |
4 | Scrapes all URLs found on the pages reached at depth 3. |
n | Scrapes all URLs up to n levels away from the starting page. |
NOTE: Web Crawler will only crawl URLs in the same domain as the domain of the URL of the starting point. An ability to override this setting is on our roadmap.
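Putting the filters together: a hypothetical filter set that follows amazon.com category-style URLs but only includes product URLs in the result (both patterns come from the regex table above) might look like this:

```json
{
  "filters": {
    "crawl": ["https:\/\/www.amazon.com\/[^\/]*\/[^\/]*"],
    "process": ["https:\/\/www.amazon.com\/.*\/[A-Z0-9]{10}.*"],
    "max_depth": 2
  }
}
```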
The type_ parameter determines what the output of the Web Crawler job will contain. The output types break down like this:

Value | Description |
---|---|
sitemap | A list of URLs. |
parsed | A JSON file containing an aggregate of parsed results. |
html | A JSON file containing an aggregate of HTML results. |
You can upload your results to your own cloud storage. We support Amazon S3 and Google Cloud Storage.

storage_type specifies the type of cloud storage the results are to be uploaded to. We support the s3 and gcs storage types. storage_url specifies the name of the bucket the results are to be uploaded to.

When using Web Crawler, you will be invoiced for the number of scraped URLs. The price per scrape is the same as for your regular Scraper API usage.
Feel free to contact us at [email protected] or your sales/account management representative if you have questions about how using Web Crawler affects your monthly usage statistics.
If you want to try out all of Web Crawler's endpoints, you can download and use this Postman collection.
You can also look at our Swagger documentation page, which contains the API schema and other useful information. To log in, use your Scraper API username and password.