Push-Pull is an asynchronous integration method. Upon job submission, you will promptly receive a JSON response containing all job details, including job parameters, ID, and URLs for result download and status checking. Once your job is processed, we will update you via a JSON payload sent to your server, if you provided a callback URL. Results remain available for retrieval for at least 24 hours after completion.
With Push-Pull, you can upload your results directly to your cloud storage (AWS S3 or Google Cloud Storage).
If you prefer not to set up a service for incoming callback notifications, you can simply retrieve your results periodically (polling).
You can also explore how Push-Pull works using Postman.
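The polling approach mentioned above can be sketched as a small loop. This is a minimal illustration, not an official client: `fetch_job` is a hypothetical callable that requests the job's status URL and returns the decoded JSON as a dict, and the interval and timeout defaults are arbitrary assumptions. Only the `status` values described in this document (pending, done, faulted) are relied on.

```python
import time
from typing import Callable

# Status values described in this document: pending, done, faulted.
FINAL_STATUSES = {"done", "faulted"}


def poll_job(
    fetch_job: Callable[[], dict],
    interval: float = 5.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_job() until the job reports a final status or time runs out.

    fetch_job is a hypothetical callable that GETs the job's status URL
    and returns the decoded JSON body as a dict.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job()
        if job.get("status") in FINAL_STATUSES:
            return job
        # Stop before sleeping if the next attempt would exceed the deadline.
        if time.monotonic() + interval > deadline:
            raise TimeoutError("job did not finish within the timeout")
        sleep(interval)
```

Passing `sleep` as a parameter keeps the loop easy to test; in normal use the default `time.sleep` is enough.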
Batch Query
Scraper APIs support submitting up to 5,000 query or url parameter values in a single batch request.
Endpoint
POST https://data.oxylabs.io/v1/queries/batch
The system handles every query or url submitted as a separate job. If you provide a callback URL, you will receive a separate callback for each keyword. Otherwise, our initial response will contain job IDs for all keywords. For example, if you send 50 keywords, we will return 50 unique job IDs.
IMPORTANT: With the /batch endpoint, you can only submit lists of query or url parameter values (depending on the source you use). All other parameters should have singular values.
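The batch rules above (at most 5,000 values, a list only for query or url, singular values elsewhere) can be checked locally before sending anything. The sketch below is an illustration under assumptions: the function names are hypothetical, and it reads the rules literally, so it treats any other list-valued parameter as an error and expects exactly one of query or url to be a list.

```python
from typing import Iterator

MAX_BATCH_SIZE = 5000  # limit stated above
LIST_KEYS = ("query", "url")  # the only parameters that may hold a list


def validate_batch_payload(payload: dict) -> None:
    """Raise ValueError if the payload breaks the batch rules described above."""
    list_keys = [k for k in LIST_KEYS if isinstance(payload.get(k), list)]
    if len(list_keys) != 1:
        raise ValueError("exactly one of 'query' or 'url' must be a list")
    values = payload[list_keys[0]]
    if not 1 <= len(values) <= MAX_BATCH_SIZE:
        raise ValueError(f"'{list_keys[0]}' must hold 1 to {MAX_BATCH_SIZE} values")
    # Literal reading of the rule: every other parameter is singular.
    for key, value in payload.items():
        if key not in LIST_KEYS and isinstance(value, list):
            raise ValueError(f"'{key}' must have a singular value")


def split_into_batches(values: list, common: dict, key: str = "query") -> Iterator[dict]:
    """Yield payloads of at most MAX_BATCH_SIZE values, sharing the other parameters."""
    for start in range(0, len(values), MAX_BATCH_SIZE):
        payload = {**common, key: values[start:start + MAX_BATCH_SIZE]}
        validate_batch_payload(payload)
        yield payload
```

With 12,000 keywords, `split_into_batches` yields three payloads (5,000, 5,000, and 2,000 values), each of which would be sent as its own batch request.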
Input
You need to post query parameters as a JSON payload. Here is how you submit a batch job:
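As a rough sketch of such a submission, the Python below posts the content of a JSON file to the batch endpoint using only the standard library. It assumes HTTP Basic authentication with a username and password, as the callback listener example in this document does; the function names are hypothetical, and the payload file is expected to exist already.

```python
import base64
import json
import urllib.request

BATCH_ENDPOINT = "https://data.oxylabs.io/v1/queries/batch"


def build_batch_request(payload_path: str, username: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a POST request whose body is the JSON file's content."""
    with open(payload_path, "rb") as f:
        body = f.read()
    json.loads(body)  # fail early if the file is not valid JSON
    # Assumption: the API accepts HTTP Basic credentials.
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        BATCH_ENDPOINT,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


def submit_batch(payload_path: str, username: str, password: str) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_batch_request(payload_path, username, password)
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.load(resp)
```

A call such as `submit_batch("keywords.json", "user", "pass1")` would send the file as-is, so the payload format is decided entirely by the JSON file.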
You may notice that the code example above doesn't explain how the JSON payload should be formatted and instead points to a pre-made JSON file. Below is the content of the keywords.json file, containing multiple query parameter values:
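For illustration only, a file of that shape could be generated as follows. Every value here is a placeholder: the keywords, the source name `google_search`, and the `callback_url` parameter name and URL are assumptions, so check the documentation of the scraper you use for the parameters it accepts. The one point taken from the text above is that only query holds a list.

```python
import json

# Hypothetical payload: all values below are placeholders.
keywords_payload = {
    "query": ["adidas", "nike", "reebok"],        # the only list-valued parameter
    "source": "google_search",                    # assumed source name
    "callback_url": "https://your.callback.url",  # assumed parameter name, placeholder URL
}


def write_payload(path: str = "keywords.json") -> str:
    """Write the payload as indented JSON and return the text that was written."""
    text = json.dumps(keywords_payload, indent=2)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return text


if __name__ == "__main__":
    print(write_payload())
```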
For detailed descriptions of the job input parameters, please consult the table below or refer to the specific documentation pages for the scrapers you are interested in.
| Key | Description | Type |
| --- | --- | --- |
| created_at | The datetime the job was created at. | String |
| client_id | The numerical ID associated with the username of the client making the request. | String |
| client_notes | Notes submitted by the client when sending a job. | String |
| content_encoding | Add this parameter if you are downloading images. Learn more here. | String |
| id | The unique ID of the job. | String |
| statuses | The status code of the scraping or parsing job. You can see the status codes described here. | Integer |
| status | The status of the job. pending means the job is still being processed. done means we've completed the job. faulted means we came across errors while trying to complete the job and gave up on it. | String |
| subdomain | The subdomain of the website. | String |
| updated_at | The datetime the job was last updated at. For jobs that are finished (status is done or faulted), this datetime indicates when the job was finished. | String |
| links | The list of links related to the provided input. | JSON Array |
| links:rel | The link type. The self URL contains the metadata of the job, while the results URL contains the job results. | String |
| links:href | The URL to the resource. | String |
| links:method | The HTTP method that should be used to interact with a given URL. | String |
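To show how the link fields fit together, here is a small sketch that picks the results URL out of a job description. The helper names and the sample job are hypothetical. The table lists the key as `links`, while the listener example in this document reads `_links`, so the sketch accepts either spelling rather than assuming one.

```python
from typing import Optional


def find_link(job: dict, rel: str) -> Optional[dict]:
    """Return the first link entry whose rel matches, or None."""
    # Accept both spellings of the key; see the note above.
    links = job.get("_links") or job.get("links") or []
    for link in links:
        if link.get("rel") == rel:
            return link
    return None


def results_url(job: dict) -> Optional[str]:
    """Return the results URL once the job is done, otherwise None."""
    if job.get("status") != "done":
        return None
    link = find_link(job, "results")
    return link["href"] if link else None


# Hypothetical job description, shaped after the table above.
sample_job = {
    "id": "12345",
    "status": "done",
    "_links": [
        {"rel": "self", "href": "https://example.invalid/job/12345", "method": "GET"},
        {"rel": "results", "href": "https://example.invalid/job/12345/results", "method": "GET"},
    ],
}
```

Here `results_url(sample_job)` returns the `href` of the results entry; for a job that is still pending or has faulted it returns `None`.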
Callback
The callback is a POST request we send to your machine, informing you that the data extraction task is completed and providing a URL to download the scraped content. This means that you no longer need to check the job status manually. Once the data is ready, we will let you know, and all you need to do is retrieve it.
Input
# This is a simple Sanic web server with a route listening for callbacks on localhost:8080.
# It will print job results to stdout.
import requests
from pprint import pprint
from sanic import Sanic, response

AUTH_TUPLE = ('user', 'pass1')

# Recent Sanic versions require an application name.
app = Sanic('job_listener_app')


# Define /job_listener endpoint that accepts POST requests.
@app.route('/job_listener', methods=['POST'])
async def job_listener(request):
    try:
        res = request.json
        links = res.get('_links', [])
        for link in links:
            if link['rel'] == 'results':
                # Sanic is async, but requests is synchronous; to take full
                # advantage of Sanic, use aiohttp.
                res_response = requests.request(
                    method='GET',
                    url=link['href'],
                    auth=AUTH_TUPLE,
                )
                pprint(res_response.json())
                break
    except Exception as e:
        print("Listener exception: {}".format(e))
    # Always acknowledge the callback, even if fetching the results failed.
    return response.json(status=200, body={'status': 'ok'})


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)