Push-Pull is an asynchronous integration method. Upon job submission, you will promptly receive a JSON response containing all job details, including job parameters, ID, and URLs for result download and status checking. Once your job is processed, we will update you via a JSON payload sent to your server, if you provided a callback URL. Results remain available for retrieval for at least 24 hours after completion.
With Push-Pull, you can have your results uploaded directly to your cloud storage (AWS S3 or Google Cloud Storage).
If you prefer not to set up a service for incoming callback notifications, you can simply retrieve your results periodically (polling).
You can also explore how Push-Pull works using Postman.
Single Job
This endpoint accepts only a single query or URL value.
POST https://data.oxylabs.io/v1/queries
Provide the job parameters in a JSON payload as shown in the examples below. Python and PHP examples include comments for clarity.
import requests
from pprint import pprint

# Structure payload.
payload = {
    "source": "ENTER_SOURCE_HERE",  # Source you choose, e.g. "universal_ecommerce"
    "url": "https://www.example.com",  # Check the specific source to see whether to use "url" or "query"
    "geo_location": "United States",  # Some sources accept a ZIP code or coordinates
    # "render": "html",  # Uncomment if you want to render JavaScript within the page
    # "render": "png",  # Uncomment if you want to take a screenshot of the scraped web page
    # "parse": True,  # Check which sources support parsed data
    "callback_url": "https://your.callback.url",  # Required if using a callback listener
    "storage_type": "s3",
    "storage_url": "s3://your.storage.bucket.url",
}

# Get response.
response = requests.request(
    "POST",
    "https://data.oxylabs.io/v1/queries",
    auth=("YOUR_USERNAME", "YOUR_PASSWORD"),  # Your credentials go here
    json=payload,
)

# Print prettified response to stdout.
pprint(response.json())
<?php
$params = array(
    'source' => 'ENTER_SOURCE_HERE', // Source you choose, e.g. "universal_ecommerce"
    'url' => 'https://www.example.com', // Check the specific source to see whether to use "url" or "query"
    'geo_location' => 'United States', // Some sources accept a ZIP code or coordinates
    //'render' => 'html', // Uncomment if you want to render JavaScript within the page
    //'render' => 'png', // Uncomment if you want to take a screenshot of the scraped web page
    //'parse' => TRUE, // Check which sources support parsed data
    'callback_url' => 'https://your.callback.url', // Required if using a callback listener
    'storage_type' => 's3',
    'storage_url' => 's3://your.storage.bucket.url'
);

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, "https://data.oxylabs.io/v1/queries");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($params));
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_USERPWD, "YOUR_USERNAME" . ":" . "YOUR_PASSWORD"); // Your credentials go here

$headers = array();
$headers[] = "Content-Type: application/json";
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);

$result = curl_exec($ch);
echo $result;

if (curl_errno($ch)) {
    echo 'Error:' . curl_error($ch);
}
curl_close($ch);
?>
For detailed descriptions of the job input parameters, please consult the table below or refer to the specific documentation pages for the scrapers you are interested in.
The datetime the job was created at.
The numerical ID associated with the username of the client making the request.
Submitted notes by the client when sending a job.
The unique ID of the job.
The status of the job. pending means the job is still being processed. done means we've completed the job. faulted means we encountered errors while trying to complete the job and gave up on it.
The subdomain of the website.
The datetime the job was last updated at. For jobs that are finished (status is done or faulted), this datetime indicates when the job was finished.
The list of links related to the provided input (JSON array).
The link type. self URL contains the metadata of the job, while results URL contains the job results.
The URL to the resource.
The HTTP method that should be used to interact with a given URL.
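To make the link fields concrete, here is a minimal sketch of picking a link out of a job response by its type. The `_links`, `rel`, and `href` names follow the callback listener example in this document; the `method` key name, the helper name `find_link`, and the sample job dictionary are illustrative assumptions, not real API output.

```python
from typing import Optional


def find_link(job: dict, rel: str) -> Optional[dict]:
    """Return the first link entry whose type matches `rel`, or None."""
    for link in job.get("_links", []):
        if link.get("rel") == rel:
            return link
    return None


# Hypothetical, trimmed-down job response used only for illustration.
sample_job = {
    "status": "pending",
    "_links": [
        {
            "rel": "self",
            "href": "http://data.oxylabs.io/v1/queries/12345678900987654321",
            "method": "GET",
        },
        {
            "rel": "results",
            "href": "http://data.oxylabs.io/v1/queries/12345678900987654321/results",
            "method": "GET",
        },
    ],
}

status_link = find_link(sample_job, "self")
results_link = find_link(sample_job, "results")
print(status_link["href"])   # URL for checking the job metadata and status
print(results_link["href"])  # URL for downloading the job results
```

Returning the whole link entry rather than just the URL keeps the HTTP method available to the caller.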
The callback is a POST request we send to your machine, informing you that the data extraction task is completed and providing a URL to download the scraped content. This means that you don't need to check the job status manually. Once the data is ready, we will let you know, and all you need to do is retrieve it.
# This is a simple Sanic web server with a route listening for callbacks on port 8080.
# It will print job results to stdout.
import requests
from pprint import pprint
from sanic import Sanic, response

AUTH_TUPLE = ('user', 'pass1')

app = Sanic('callback_listener')  # Recent Sanic versions require an app name.


# Define /job_listener endpoint that accepts POST requests.
@app.route('/job_listener', methods=['POST'])
async def job_listener(request):
    try:
        res = request.json
        links = res.get('_links', [])
        for link in links:
            if link['rel'] == 'results':
                # Sanic is async, but requests is synchronous; to take full
                # advantage of Sanic, use aiohttp.
                res_response = requests.request(
                    method='GET',
                    url=link['href'],
                    auth=AUTH_TUPLE,
                )
                pprint(res_response.json())
                break
    except Exception as e:
        print("Listener exception: {}".format(e))
    return response.json(status=200, body={'status': 'ok'})


if __name__ == '__main__':
    # Bind to all interfaces so that external callbacks can reach the listener.
    app.run(host='0.0.0.0', port=8080)
If you provided a valid callback URL when submitting your job, we will notify you upon completion by sending a JSON payload to the specified callback URL. This payload will indicate that the job has been completed and its status set to done.
However, if you submitted a job without using the callback service, you can check the job status manually. Retrieve the URL from the href field in the rel:self section of the response message received after job submission. The URL for checking the job status will resemble the following: http://data.oxylabs.io/v1/queries/12345678900987654321. Querying this URL will return the job information, including its current status.
Upon completion of the job, the API will respond with query information in JSON format. The job status will be changed to done, indicating that the job is finished. You can retrieve the content by querying one of the provided links. Additionally, the response will include the timestamp of when the job was last updated, allowing you to track its processing time.
pending: The job is still being processed and has not been completed.
done: The job is completed. You can retrieve the result by querying the URL provided in the href field under the rel:results section, for example: http://data.oxylabs.io/v1/queries/12345678900987654321/results.
faulted: There was an issue with the job, and we couldn't complete it. You are not charged for any faulted jobs.
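The manual status check described above can be sketched as a polling loop. This is a minimal sketch under a few assumptions: the job information JSON exposes the state in a `status` field, `fetch_job` is a caller-supplied function that performs the authenticated GET against the rel:self URL, and the 5-second interval and 60-attempt cap are arbitrary illustrative values rather than recommended settings.

```python
import time
from typing import Callable

# Statuses after which the job will not change any more.
TERMINAL_STATUSES = {"done", "faulted"}


def wait_for_job(
    fetch_job: Callable[[], dict],
    interval: float = 5.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_job until the job is done or faulted, then return the job info.

    Raises TimeoutError if the job is still pending after max_attempts checks.
    """
    for attempt in range(max_attempts):
        job = fetch_job()
        if job.get("status") in TERMINAL_STATUSES:
            return job
        # Still pending: wait before asking again, but not after the last check.
        if attempt < max_attempts - 1:
            sleep(interval)
    raise TimeoutError("job still pending after {} checks".format(max_attempts))


# Usage with a stand-in for the real HTTP call: two pending replies, then done.
replies = iter([{"status": "pending"}, {"status": "pending"}, {"status": "done"}])
final = wait_for_job(lambda: next(replies), interval=0)
print(final["status"])
```

Passing the fetch function in keeps the loop independent of any particular HTTP client, so the same logic can be exercised without network access.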
Retrieve Job Content
Once the job is ready to be retrieved, you can use the URL provided in the response under the rel:results section. The URL will look like this: http://data.oxylabs.io/v1/queries/7173957294344910849/results.
You can retrieve different result types by using the following endpoints:
GET https://data.oxylabs.io/v1/queries/{job_id}/results
GET https://data.oxylabs.io/v1/queries/{job_id}/results?type=raw
GET https://data.oxylabs.io/v1/queries/{job_id}/results?type=parsed
GET https://data.oxylabs.io/v1/queries/{job_id}/results?type=png
Below are code examples demonstrating how to use the /results endpoint:
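A minimal sketch of calling the /results endpoint with the Python standard library follows. The endpoint path and the `type` values (raw, parsed, png) come from the list above; the helper names `results_url` and `fetch_results`, the placeholder credentials, and the use of HTTP Basic authentication (assumed to match the username and password pair used in the job submission example) are assumptions.

```python
import base64
import urllib.request
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "https://data.oxylabs.io/v1/queries"
RESULT_TYPES = {"raw", "parsed", "png"}


def results_url(job_id: str, result_type: Optional[str] = None) -> str:
    """Build the results URL for a job, optionally selecting a result type."""
    url = "{}/{}/results".format(BASE_URL, job_id)
    if result_type is None:
        return url
    if result_type not in RESULT_TYPES:
        raise ValueError("unsupported result type: {}".format(result_type))
    return "{}?{}".format(url, urlencode({"type": result_type}))


def fetch_results(job_id: str, username: str, password: str,
                  result_type: Optional[str] = None) -> bytes:
    """GET the job results and return the raw response body."""
    credentials = "{}:{}".format(username, password).encode()
    token = base64.b64encode(credentials).decode()
    request = urllib.request.Request(
        results_url(job_id, result_type),
        headers={"Authorization": "Basic " + token},
    )
    with urllib.request.urlopen(request) as resp:
        return resp.read()


print(results_url("12345678900987654321", "parsed"))
# To download for real (needs valid credentials and a finished job):
# body = fetch_results("12345678900987654321", "YOUR_USERNAME", "YOUR_PASSWORD")
```

The body is returned as bytes because a png result is binary, while the other types can be decoded as text by the caller.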
The results can be retrieved automatically, without periodically checking the job status, by setting up a callback service. To do that, specify the URL of a server that can accept incoming HTTP(S) requests when you submit a job. When our system completes the job, it will POST a JSON payload to the provided URL, and the callback service will download the results as described in the callback implementation example.
Get Notifier IP Address List
You may want to whitelist the IPs sending you callback messages or get the list of these IPs for other purposes. You can do this by sending a GET request to this endpoint:
GET https://data.oxylabs.io/v1/info/callbacker_ips
The code examples below show how you can access the /callbacker_ips endpoint:
The API will return the list of IPs making callback requests to your system:
{"ips": ["x.x.x.x","y.y.y.y" ]}
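Putting the endpoint and the response shape together, the following is a minimal sketch of fetching the notifier IP list and checking whether an incoming callback comes from one of those addresses. The `ips` key comes from the response shown above; the helper names, the assumption that the endpoint uses the same HTTP Basic credentials as job submission, and the sample addresses (reserved documentation ranges, not real notifier IPs) are illustrative.

```python
import base64
import ipaddress
import json
import urllib.request

CALLBACKER_IPS_URL = "https://data.oxylabs.io/v1/info/callbacker_ips"


def parse_notifier_ips(body: str) -> set:
    """Extract the IP addresses from the endpoint's JSON response body."""
    return {ipaddress.ip_address(ip) for ip in json.loads(body)["ips"]}


def is_notifier(remote_addr: str, allowed: set) -> bool:
    """Return True if a callback's source address is in the allowed set."""
    return ipaddress.ip_address(remote_addr) in allowed


def fetch_notifier_ips(username: str, password: str) -> set:
    """GET the notifier IP list (Basic authentication is assumed here)."""
    credentials = "{}:{}".format(username, password).encode()
    token = base64.b64encode(credentials).decode()
    request = urllib.request.Request(
        CALLBACKER_IPS_URL, headers={"Authorization": "Basic " + token}
    )
    with urllib.request.urlopen(request) as resp:
        return parse_notifier_ips(resp.read().decode("utf-8"))


# Offline usage with placeholder addresses in place of a real response.
sample_body = '{"ips": ["192.0.2.10", "198.51.100.7"]}'
allowed_ips = parse_notifier_ips(sample_body)
print(is_notifier("192.0.2.10", allowed_ips))
print(is_notifier("203.0.113.5", allowed_ips))
```

Parsing the strings into address objects avoids false mismatches caused by differences in textual formatting of the same address.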
Scheduler is a service that you can use to schedule recurring scraping jobs.
It extends the functionality of Push-Pull integration and is best used together with the Cloud integration functionality. Read more in the Scheduler documentation.