Web Crawler is a Scraper API feature that lets you crawl any site, select useful content, and have it delivered to you in bulk. You can use Web Crawler to perform URL discovery, crawl all pages on a site, index all URLs on a domain, and for other purposes.
Web Crawler uses basic HTTP authentication, which requires sending a username and a password.
Before using Web Crawler, please read how its filters work and how you may be charged when using this functionality in the filter and billing sections of this documentation.
Then, get familiar with the available input parameters and output types.
Check out the resources at the end of this page to find request templates and schemas.
If you would like guidance, drop us a line or contact your account manager.
The tutorial below will guide you through the Web Crawler workflow, showcasing the process of retrieving public data from an e-commerce website.
Web Crawler has a number of endpoints you can use to control the service: initiate, stop, and resume your job, get job information, get the list of URLs found (the sitemap), and download the results.
Use this endpoint to initiate a new Web Crawler job.
Endpoint: https://ect.oxylabs.io/v1/jobs
Method: POST
Authentication: Basic
Request headers: Content-Type: application/json
Sample payload:
Sample response:
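As a rough illustration of how a job can be submitted, the Python sketch below posts a payload assembled from the parameters described later in this document. The exact field names, nesting, and example values (the target URL, the regex patterns, the source) are assumptions made for illustration; verify them against the request schema and the sample payload before use.

    import requests

    # Scraper API credentials (Basic authentication).
    AUTH = ("USERNAME", "PASSWORD")

    # Illustrative payload: crawl a placeholder e-commerce site, follow and
    # include every URL up to two levels deep, and produce a sitemap.
    # Field names and nesting are assumptions; check the request schema.
    payload = {
        "url": "https://www.example-ecommerce-store.com/",
        "filters": {
            "crawl": [".*"],
            "process": [".*"],
            "max_depth": 2,
        },
        "scrape_params": {
            "source": "universal",         # Which scraper to use; see the source parameter below.
            "user_agent_type": "desktop",  # Device type and browser.
        },
        "output": {
            "type_": "sitemap",            # See the type_ parameter below.
        },
    }

    response = requests.post(
        "https://ect.oxylabs.io/v1/jobs",
        json=payload,
        auth=AUTH,
    )
    response.raise_for_status()

    job = response.json()
    print(job)          # The response should identify the newly created job.
    job_id = job["id"]  # Assumes the job id is returned under an "id" key.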
Use this endpoint to stop a certain job.
Endpoint: https://ect.oxylabs.io/v1/jobs/{id}/stop-indexing
Method: POST
Authentication: Basic
Use this endpoint to resume a certain job.
Endpoint: https://ect.oxylabs.io/v1/jobs/{id}/resume-indexing
Method: POST
Authentication: Basic
Example response:
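For example, stopping and later resuming a job could look like this in Python (the job id is a placeholder; use the id returned when the job was created):

    import requests

    AUTH = ("USERNAME", "PASSWORD")
    job_id = "12345678901234567890"  # Placeholder job id.

    # Stop indexing for the job.
    stop = requests.post(
        f"https://ect.oxylabs.io/v1/jobs/{job_id}/stop-indexing",
        auth=AUTH,
    )
    stop.raise_for_status()

    # Resume indexing later.
    resume = requests.post(
        f"https://ect.oxylabs.io/v1/jobs/{job_id}/resume-indexing",
        auth=AUTH,
    )
    resume.raise_for_status()
    print(resume.json())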
Use this endpoint to get the job information of an existing job.
Endpoint: https://ect.oxylabs.io/v1/jobs/{id}
Method: GET
Authentication: Basic
Sample response:
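A common way to use this endpoint is to poll it until the job finishes. The sketch below assumes the response JSON carries a status-like field; the actual field name and values are not reproduced here, so take them from the real response before relying on the condition.

    import time

    import requests

    AUTH = ("USERNAME", "PASSWORD")
    job_id = "12345678901234567890"  # Placeholder job id.

    while True:
        info = requests.get(f"https://ect.oxylabs.io/v1/jobs/{job_id}", auth=AUTH)
        info.raise_for_status()
        data = info.json()
        print(data)  # Inspect the job information returned by the API.

        # The field name and values below are assumptions; adjust them to
        # whatever the real job information response contains.
        if data.get("status") in ("done", "finished", "failed"):
            break
        time.sleep(10)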
Use this endpoint to get the list of URLs found while processing the job.
Endpoint: https://ect.oxylabs.io/v1/jobs/{id}/sitemap
Method: GET
Authentication: Basic
Sample response:
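Retrieving the list of discovered URLs is a plain authenticated GET request, for example:

    import requests

    AUTH = ("USERNAME", "PASSWORD")
    job_id = "12345678901234567890"  # Placeholder job id.

    sitemap = requests.get(
        f"https://ect.oxylabs.io/v1/jobs/{job_id}/sitemap",
        auth=AUTH,
    )
    sitemap.raise_for_status()
    print(sitemap.json())  # The exact shape of the response body is not reproduced here.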
Once your crawling job is finished, you can download the aggregate result. The result can be one of the following:
An index (a list of URLs);
An aggregate file with all parsed results;
An aggregate file with all HTML results.
Depending on your crawling preferences and output type, the aggregate result may consist of a lot of data. To make downloading results more manageable, we split the aggregate result into multiple chunks based on the chunk size you specify while submitting your crawling job.
Use this endpoint to get the list of chunk files available for download. You may then proceed to download any individual chunk.
Endpoint: https://ect.oxylabs.io/v1/jobs/{id}/aggregate
Method: GET
Authentication: Basic
Sample response:
Use this endpoint to download a particular chunk of the aggregate result. The contents of the response body depend on the output type chosen.
Endpoint: https://ect.oxylabs.io/v1/jobs/{id}/aggregate/{chunk}
Method: GET
Authentication: Basic
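Putting the last two endpoints together, the sketch below lists the available chunks and then downloads one of them to a local file. How individual chunks are identified in the listing is an assumption here (a simple numeric identifier is used as a placeholder), so adapt the download URL to what the listing actually returns.

    import requests

    AUTH = ("USERNAME", "PASSWORD")
    job_id = "12345678901234567890"  # Placeholder job id.

    # List the chunks available for download.
    listing = requests.get(
        f"https://ect.oxylabs.io/v1/jobs/{job_id}/aggregate",
        auth=AUTH,
    )
    listing.raise_for_status()
    print(listing.json())  # Inspect the listing to see how chunks are identified.

    # Download a single chunk; "1" is a placeholder chunk identifier.
    chunk = requests.get(
        f"https://ect.oxylabs.io/v1/jobs/{job_id}/aggregate/1",
        auth=AUTH,
        stream=True,
    )
    chunk.raise_for_status()
    with open("chunk_1_result", "wb") as out_file:
        for part in chunk.iter_content(chunk_size=1024 * 1024):
            out_file.write(part)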
Below are all the available parameters you can use.
You can fine-tune your configuration with a few more Scraper APIs parameters. Most of their values will depend on the type of scraper you use.
geo_location
The geographical location that the result should be adapted for. Your geo_location parameter value format will depend on the source you choose. Visit your chosen source documentation for more information.
render
Enables JavaScript rendering. Use this parameter when the target requires JavaScript to load content. If you want to use this feature, set the parameter value to html.
source
source lets you specify which scraper should be used to perform the scraping jobs while crawling. The parameter value you should use depends on the URL you are submitting.
IMPORTANT: You will get access to the full list of available source values after registering or making a purchase.
user_agent_type
Device type and browser. The complete list of supported values can be found in the Scraper API documentation.
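As an illustration, the scraping-related parameters above could be combined in the job payload roughly like this. The scrape_params wrapper key and the specific values are assumptions; verify them against the request schema and your chosen source documentation.

    # Illustrative fragment of a job payload; field names and values are assumptions.
    scrape_params = {
        "source": "universal",            # Which scraper to use; depends on the submitted URL.
        "render": "html",                 # Enable JavaScript rendering.
        "user_agent_type": "desktop",     # Device type and browser.
        "geo_location": "United States",  # Format depends on the chosen source.
    }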
Filters let you control the breadth and depth of your crawling jobs. You will be invoiced for the number of URLs scraped, so it is crucial to set up your filters correctly; otherwise, Web Crawler may scrape more URLs than necessary.
process and crawl filters rely solely on regular expressions (regex) to decide whether some action should be performed on a URL (or a result associated with it).
We don't add any process or crawl filters by default. This means that if you don't submit any regular expressions for these filters, no crawling will take place (as we won't follow any URLs) and no results will be included in the sitemap/aggregate result.
process
The process filter lets you specify which URLs should be included in the job result. Every URL we come across will be evaluated against the process filters. If it's a match, the URL (or the contents of the URL) will be included in the job result. As a parameter value, please send one or more regular expressions in a JSON array, like this: "process": [".e", ".c", ".t"].
crawl
The crawl filter lets you specify which URLs (apart from the URL of the starting point) are to be scraped and checked for more URLs. In simple terms, every URL we find while crawling is evaluated against the crawl filters. If it's a match, we'll scrape the URL in question to look for more URLs. As a parameter value, please send one or more regular expressions in a JSON array, like this: "crawl": [".e", ".c", ".t"].
max_depth
The max_depth filter determines the maximum length of URL chains Web Crawler will follow.
NOTE: Web Crawler will only crawl URLs in the same domain as the starting-point URL. The ability to override this setting is on our roadmap.
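To make the filter mechanics concrete, here is an illustrative configuration for a hypothetical store whose category pages live under /category/ and product pages under /product/: category pages are crawled to discover more links, only product pages are included in the result, and URL chains longer than three links are not followed. The site layout and URL patterns are made up for illustration.

    # Illustrative filters for a hypothetical site layout:
    #   crawl     - follow the site root and category listing pages to find more URLs,
    #   process   - include only product pages in the job result,
    #   max_depth - stop after URL chains of length 3.
    filters = {
        "crawl": ["^https://www\\.example-store\\.com/(category/.*)?$"],
        "process": ["^https://www\\.example-store\\.com/product/.*$"],
        "max_depth": 3,
    }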
type_
The type_ parameter determines what the output of the Web Crawler job will contain: an index (a list of URLs), an aggregate file with all parsed results, or an aggregate file with all HTML results.
You can upload your results to your own cloud storage bucket. We support Amazon S3 and Google Cloud Storage.
storage_type
storage_type specifies the type of cloud storage the results are to be uploaded to. We support the s3 and gcs storage types.
storage_url
storage_url specifies the name of the bucket the results are to be uploaded to.
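A storage configuration fragment might look like the following. The upload wrapper key and the exact nesting are assumptions, and the bucket name is a placeholder.

    # Illustrative cloud storage settings; field nesting is an assumption.
    upload = {
        "storage_type": "s3",                # "s3" for Amazon S3 or "gcs" for Google Cloud Storage.
        "storage_url": "my-results-bucket",  # Name of your bucket (placeholder).
    }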
When using Web Crawler, you will be invoiced for the number of scraped URLs. The price per scrape will be the same as your regular Scraper APIs usage.
You can read more about checking your usage statistics in the Scraper API documentation.
Feel free to contact us or your sales/account management rep if you need help understanding how using Web Crawler affects your monthly usage statistics.
If you want to try out all of Web Crawler's endpoints, you can download and use our Postman collection.
You can also look at our API reference page, which contains the API schema and other useful information. To log in, please use the username and password of your Scraper API user.