Web Crawler
Intro
Web Crawler is a Scraper API feature that lets you crawl any site, select useful content, and have it delivered to you in bulk. You can use Web Crawler to perform URL discovery, crawl all pages on a site, index all URLs on a domain, and more.
The tutorial below guides you through the Web Crawler workflow by showing how to retrieve public data from an e-commerce website.
Web Crawler uses basic HTTP authentication that requires sending a username and a password.
Before using Web Crawler, please read the Filters and Usage statistics sections of this documentation to learn how its filters work and how you may be charged for this functionality.
Then, get familiar with the available input parameters and API endpoints.
Check out the Integrations section to find request templates and schemas.
If you would like guidance, drop us a line at support@oxylabs.io or contact your account manager.
Endpoints
Web Crawler has a number of endpoints you can use to control the service: initiate, stop, and resume your job, get job info, get the list of result chunks, and get the results.
Create a new job
Use this endpoint to initiate a new Web Crawler job.
Endpoint:
https://ect.oxylabs.io/v1/jobs
Method:
POST
Authentication:
Basic
Request headers:
Content-Type: application/json
Sample payload:
Sample response:
Stop a job
Use this endpoint to stop a specific job.
Endpoint:
https://ect.oxylabs.io/v1/jobs/{id}/stop-indexing
Method:
POST
Authentication:
Basic
Resume a job
Use this endpoint to resume a specific job.
Endpoint:
https://ect.oxylabs.io/v1/jobs/{id}/resume-indexing
Method:
POST
Authentication:
Basic
Example response:
Get job information
Use this endpoint to get information about an existing job.
Endpoint:
https://ect.oxylabs.io/v1/jobs/{id}
Method:
GET
Authentication:
Basic
Sample response:
Get sitemap
Use this endpoint to get the list of URLs found while processing the job.
Endpoint:
https://ect.oxylabs.io/v1/jobs/{id}/sitemap
Method:
GET
Authentication:
Basic
Sample response:
Get the list of aggregate result chunks
Once your crawling job is finished, you can download the aggregate result. The result can be one of the following:
An index (a list of URLs);
An aggregate file with all parsed results;
An aggregate file with all HTML results.
Depending on your crawling preferences and output type, the aggregate result may consist of a lot of data. To make downloading results more manageable, we split the aggregate result into multiple chunks based on the chunk size you specify while submitting your crawling job.
Use this endpoint to get the list of chunk files available for download. You may then proceed to download any individual chunk.
Endpoint:
https://ect.oxylabs.io/v1/jobs/{id}/aggregate
Method:
GET
Authentication:
Basic
Sample response:
Get a chunk of the aggregate result
Use this endpoint to download a particular chunk of the aggregate result. The contents of the response body depend on the output type chosen.
Endpoint:
https://ect.oxylabs.io/v1/jobs/{id}/aggregate/{chunk}
Method:
GET
Authentication:
Basic
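As a quick reference, the per-job endpoints listed above can be assembled with a small helper. This is a minimal sketch, not an official client: the function names and the job ID "abc123" are hypothetical, and the chunk identifier is treated as an opaque value because its format is not specified here.

```python
from urllib.parse import quote

BASE_URL = "https://ect.oxylabs.io/v1/jobs"

# HTTP method and path suffix for each per-job operation listed above.
JOB_OPERATIONS = {
    "info": ("GET", ""),
    "stop": ("POST", "/stop-indexing"),
    "resume": ("POST", "/resume-indexing"),
    "sitemap": ("GET", "/sitemap"),
    "chunks": ("GET", "/aggregate"),
}


def job_endpoint(job_id, operation):
    """Return (method, url) for a per-job operation."""
    method, suffix = JOB_OPERATIONS[operation]
    return method, f"{BASE_URL}/{quote(str(job_id), safe='')}{suffix}"


def chunk_endpoint(job_id, chunk):
    """Return (method, url) for downloading one chunk of the aggregate result."""
    _, aggregate_url = job_endpoint(job_id, "chunks")
    return "GET", f"{aggregate_url}/{quote(str(chunk), safe='')}"


# "abc123" is a made-up job ID used only for illustration.
for operation in JOB_OPERATIONS:
    print(operation, *job_endpoint("abc123", operation))
print("chunk", *chunk_endpoint("abc123", 1))
```

Every request built this way still needs Basic authentication, as noted for each endpoint.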
Query parameters
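Below are all the available parameters you can use. Each entry lists the parameter name, its description, and its default value (- means there is no default).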
url
The URL of the starting point
-
filters
These parameters are used to configure the breadth and depth of the crawling job, as well as determine which URLs should be included in the end result.
-
filters:crawl
Specifies which URLs Web Crawler will scrape and check for more URLs. See this section for more information.
-
filters:process
Specifies which URLs Web Crawler will include in the end result. See this section for more information.
-
filters:max_depth
Determines the max length of URL chains Web Crawler will follow. See this section for more information.
1
scrape_params
These parameters are used to fine-tune the way we perform the scraping jobs. For instance, you may want us to execute Javascript while crawling a site, or you may prefer us to use proxies from a particular location.
-
scrape_params:geo_location
The geographical location that the result should be adapted for. See this section for more information.
-
output:type_
The output type. We can return a sitemap (list of URLs found while crawling) or an aggregate file containing HTMLs or parsed data. See this section for more information.
-
upload
These parameters are used to describe the cloud storage location where you would like us to put the result once we're done. See this section for more information.
-
upload:storage_type
Define the cloud storage type. The only valid value is s3 (for AWS S3). gcs (for Google Cloud Storage) is coming soon.
-
upload:storage_url
The storage bucket URL.
-
- required parameter
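To show how these parameters might fit together, here is a minimal sketch that prepares (but does not send) a job creation request using only the Python standard library. The nested payload layout is an assumption inferred from the colon notation above (for example, filters:max_depth becoming a max_depth key inside filters). The helper name, the https://example.com/ starting URL, and the USERNAME and PASSWORD values are placeholders.

```python
import base64
import json
import urllib.request

JOBS_ENDPOINT = "https://ect.oxylabs.io/v1/jobs"


def build_create_job_request(username, password, payload):
    """Prepare a POST request with Basic authentication and a JSON body."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(
        JOBS_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Assumed nesting: "group:name" in the parameter list becomes payload[group][name].
payload = {
    "url": "https://example.com/",  # placeholder starting point
    "filters": {
        "crawl": [".*"],    # follow every URL found
        "process": [".*"],  # include every URL in the result
        "max_depth": 1,     # the documented default
    },
    "scrape_params": {"source": "universal"},
    "output": {"type_": "sitemap"},
}

request = build_create_job_request("USERNAME", "PASSWORD", payload)
print(request.get_method(), request.full_url)
# Passing `request` to urllib.request.urlopen() would actually submit the job.
```

Building the request separately from sending it makes it easy to inspect the headers and body before any billable crawling starts.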
Scrape_params
You can fine-tune your configuration with a few more Scraper APIs parameters. Most of their values will depend on the type of scraper you use.
geo_location
The geographical location that the result should be adapted for. The format of the geo_location value depends on the source you choose. See the documentation of your chosen source for more information.
render
Enables JavaScript rendering. Use it when the target requires JavaScript to load content. To use this feature, set the parameter value to html.
source
source lets you specify which scraper should be used to perform the scraping jobs while crawling. The parameter value you should use depends on the URL you are submitting. The table below outlines which source value you should use.
Any Amazon URL
amazon
Any Bing URL
bing
Any Google URL
google
Any Google Shopping URL
google_shopping
Any other URL
universal (for Sitemap or HTML output) or universal_ecommerce (for parsed output)
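The table above could be encoded as a small lookup function. This is only a sketch: the hostname checks are a simplifying assumption about what counts as an "Amazon", "Bing", or "Google" URL, the Google Shopping detection in particular is a guess, and the function name is hypothetical.

```python
from urllib.parse import urlparse


def choose_source(url, output_type):
    """Pick a source value for a URL, following the table above."""
    parsed = urlparse(url)
    labels = (parsed.hostname or "").lower().split(".")
    if "amazon" in labels:
        return "amazon"
    if "bing" in labels:
        return "bing"
    if "google" in labels:
        # Assumption: Shopping pages live on a "shopping" subdomain or path.
        if "shopping" in labels or parsed.path.startswith("/shopping"):
            return "google_shopping"
        return "google"
    # Any other URL: the value depends on the requested output type.
    return "universal_ecommerce" if output_type == "parsed" else "universal"


print(choose_source("https://www.amazon.com/dp/B08N5WRWNW", "html"))
print(choose_source("https://example.com/", "parsed"))
```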
user_agent_type
Device type and browser. The complete list can be found here.
Filters
Filters let you control the breadth and depth of your crawling jobs. You will be invoiced for the number of URLs scraped, so it is crucial that you set up your filters correctly. Otherwise, Web Crawler may scrape more URLs than necessary.
process and crawl filters rely solely on regular expressions (regex) to decide whether some action should be performed on a URL (or a result associated with it).
We don't add any process or crawl filters by default. This means that if you don't submit any regular expressions for these filters, no crawling will take place (as we won't follow any URLs) and no results will be included in the sitemap/aggregate result.
Regex value examples
.*
Matches any number of any characters, except line breaks. Use this expression as a wildcard for zero or more characters.
https:\/\/www.amazon.com\/[^\/]*\/[^\/]*
Matches all amazon.com URLs that have no more than two / (slash) symbols in the path.
https:\/\/www.amazon.com\/.*\/[A-Z0-9]{10}.*
Matches the domain name, followed by any string of characters, then a slash and a 10-character string of uppercase letters and digits, which is again followed by any string of characters. Use this value to match all product URLs on amazon.com.
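The example expressions can be tried out locally before submitting a job. The sketch below uses Python's re module and assumes that an expression has to match the whole URL (re.fullmatch); how Web Crawler anchors its matches is not stated here, so treat the results as illustrative. The sample URLs are made up.

```python
import re

WILDCARD = r".*"
TWO_SLASHES = r"https:\/\/www.amazon.com\/[^\/]*\/[^\/]*"
PRODUCT = r"https:\/\/www.amazon.com\/.*\/[A-Z0-9]{10}.*"


def matches(pattern, url):
    """Return True if the pattern matches the entire URL (assumed semantics)."""
    return re.fullmatch(pattern, url) is not None


samples = [
    "https://www.amazon.com/gp/bestsellers",
    "https://www.amazon.com/gp/help/customer",
    "https://www.amazon.com/Example-Product/dp/B08N5WRWNW/ref=sr_1_1",
]

for url in samples:
    print(url)
    print("  wildcard:", matches(WILDCARD, url))
    print("  two slashes:", matches(TWO_SLASHES, url))
    print("  product:", matches(PRODUCT, url))
```

Under this whole-URL assumption, the second expression accepts only paths with exactly two slashes. Note also that the unescaped dots in www.amazon.com match any character, so escaping them as \. gives a stricter expression.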
A few regex-related links that we find helpful:
process
The process filter lets you specify which URLs should be included in the job result. Every URL we come across will be evaluated for matching the process filters. If it's a match, the URL (or the contents of the URL) will be included in the job result. As a parameter value, please send one or more regular expressions in a JSON array, like this: "process": [".e", ".c", ".t"].
crawl
The crawl filter lets you specify which URLs (apart from the URL of the starting point) are to be scraped and checked for more URLs. In simple terms, every URL we find while crawling is evaluated for matching the crawl filters. If it's a match, we'll scrape the URL in question to look for more URLs. As a parameter value, please send one or more regular expressions in a JSON array, like this: "crawl": [".e", ".c", ".t"].
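The difference between the two filters can be illustrated with a short sketch. It is a simplified model rather than the service's actual implementation: it assumes whole-URL matching with re.fullmatch, and the example.com patterns and function names are hypothetical.

```python
import re


def matches_any(url, patterns):
    """True if at least one regular expression matches the whole URL."""
    return any(re.fullmatch(pattern, url) for pattern in patterns)


def classify(url, crawl_patterns, process_patterns):
    """Report whether a URL would be followed and/or included in the result."""
    return {
        "crawl": matches_any(url, crawl_patterns),      # scrape it to find more URLs
        "process": matches_any(url, process_patterns),  # include it in the job result
    }


# Hypothetical filters: follow category pages, keep only product pages.
crawl_patterns = [r"https://example\.com/category/.*"]
process_patterns = [r"https://example\.com/product/\d+"]

print(classify("https://example.com/category/shoes", crawl_patterns, process_patterns))
print(classify("https://example.com/product/42", crawl_patterns, process_patterns))
# With no filters at all, nothing is followed and nothing is included.
print(classify("https://example.com/product/42", [], []))
```

Keeping the crawl patterns narrower than ".*" limits how many URLs are scraped, which matters because scraped URLs are what you are invoiced for.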
max_depth
The value of the max_depth filter determines the max length of URL chains Web Crawler will follow.
-1
Crawls without any depth limits.
0
Scrapes the starting page only.
1
Scrapes all URLs found in the starting page. This is the default setting.
2
Scrapes all URLs found in the URLs found in the starting page.
3
Scrapes all URLs found in the URLs found in the URLs found in the starting page.
4
Scrapes all URLs found in the URLs found in the URLs found in the URLs found in the starting page.
n
Scrapes all URLs that can be reached from the starting page by following up to n links in a row.
NOTE: Web Crawler will only crawl URLs in the same domain as the URL of the starting point. The ability to override this setting is on our roadmap.
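To make the max_depth values concrete, the sketch below runs a depth-limited breadth-first traversal over a small in-memory link graph. It models only the depth limit and the same-domain rule, ignores crawl and process filters, and compares hostnames as a stand-in for "same domain". The graph and the function name are made up for illustration.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(links, start_url, max_depth):
    """Return the URLs scraped for a given max_depth (-1 means unlimited)."""
    start_host = urlparse(start_url).hostname
    scraped = [start_url]
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if max_depth != -1 and depth >= max_depth:
            continue  # do not follow links any deeper
        for found in links.get(url, []):
            # Skip URLs already seen and URLs on a different host.
            if found in seen or urlparse(found).hostname != start_host:
                continue
            seen.add(found)
            scraped.append(found)
            queue.append((found, depth + 1))
    return scraped


# Hypothetical site: each page links to the next, plus one external link.
links = {
    "https://example.com/": ["https://example.com/a", "https://other.example.org/x"],
    "https://example.com/a": ["https://example.com/b"],
    "https://example.com/b": ["https://example.com/c"],
}

for depth in (0, 1, 2, -1):
    print(depth, crawl(links, "https://example.com/", depth))
```

With this graph, each increment of max_depth adds one more page to the result, and the external link is never scraped.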
Output
type_
The type_ parameter determines what the output of the Web Crawler job will contain. The output types break down like this:
sitemap
A list of URLs.
parsed
A JSON file, containing an aggregate of parsed results.
html
A JSON file, containing an aggregate of HTML results.
Upload
You can upload your results to your own cloud storage. We support Amazon S3 and Google Cloud Storage.
storage_type
storage_type specifies the type of the cloud storage the results are to be uploaded to. We support s3 and gcs storage types.
storage_url
storage_url specifies the name of the bucket the results are to be uploaded to.
Usage statistics
When using Web Crawler, you will be invoiced for the number of scraped URLs. The price per scrape will be the same as your regular Scraper APIs usage.
You can read about checking your usage statistics here.
Feel free to contact us at support@oxylabs.io or your sales/account management rep if you need help understanding how using Web Crawler affects your monthly usage statistics.
Integrations
Postman
If you want to try out all of Web Crawler's endpoints, you can download and use this Postman collection.
Swagger / OpenAPI
You can look at our Swagger documentation page, which contains the API schema and other useful information. To log in, please use your Scraper API username and password.