Push-Pull: Single Job

Push-Pull is our recommended integration method for reliably handling large amounts of data, including batch queries.

Visit the Oxylabs GitHub repository for a complete working example of Push-Pull integration in Python.

Push-Pull is an asynchronous integration method. When you submit a job, you promptly receive a JSON response containing all job details, including the job parameters, the job ID, and URLs for downloading results and checking status. If you provided a callback URL, we notify you once the job is processed by sending a JSON payload to your server. Results remain available for retrieval for at least 24 hours after completion.

With Push-Pull, you can upload your results directly to your cloud storage (AWS S3 or Google Cloud Storage).

If you prefer not to set up a service for incoming callback notifications, you can simply retrieve your results periodically (polling).

You can also explore how Push-Pull works using Postman.

Single Job

Endpoint

This endpoint accepts only a single query or URL value.

POST https://data.oxylabs.io/v1/queries

Input

Provide the job parameters in a JSON payload, as shown in the example below.

curl --user "user:pass1" \
'https://data.oxylabs.io/v1/queries' \
-H "Content-Type: application/json" \
 -d '{"source": "ENTER_SOURCE_HERE", "url": "https://www.example.com", "geo_location": "United States", "callback_url": "https://your.callback.url", "storage_type": "s3", "storage_url": "s3://your.storage.bucket.url"}'
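The same submission can be sketched in Python using only the standard library. The helper names `build_submit_request` and `submit_job` are illustrative assumptions, not part of any Oxylabs client; the endpoint, Basic authentication, and payload keys are taken from the curl example above.

```python
import base64
import json
import urllib.request

SUBMIT_URL = "https://data.oxylabs.io/v1/queries"


def build_submit_request(username, password, payload):
    """Build (but do not send) the POST request for a single job."""
    # HTTP Basic auth, equivalent to curl's --user "user:pass1".
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        SUBMIT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


def submit_job(username, password, payload, timeout=30):
    """Send the job and return the parsed JSON job description."""
    request = build_submit_request(username, password, payload)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

Calling `submit_job("user", "pass1", {"source": "ENTER_SOURCE_HERE", "url": "https://www.example.com"})` with real credentials and a valid source should return the job information described in the Output section; optional keys such as `callback_url` or `storage_type` can be added to the payload dictionary in the same way.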

Output

The API will respond with a JSON object containing the job information, similar to this:

{
  "callback_url": "https://your.callback.url",
  "client_id": 5,
  "context": [
    {
      "key": "results_language",
      "value": null
    },
    {
      "key": "safe_search",
      "value": null
    },
    {
      "key": "tbm",
      "value": null
    },
    {
      "key": "cr",
      "value": null
    },
    {
      "key": "filter",
      "value": null
    }
  ],
  "created_at": "2019-10-01 00:00:01",
  "domain": "com",
  "geo_location": "United States",
  "id": "12345678900987654321",
  "limit": 10,
  "locale": null,
  "pages": 1,
  "parse": false,
  "render": null,
  "url": "https://www.example.com",
  "source": "universal",
  "start_page": 1,
  "status": "pending",
  "storage_type": "s3",
  "storage_url": "YOUR_BUCKET_NAME/12345678900987654321.json",
  "subdomain": "www",
  "updated_at": "2019-10-01 00:00:01",
  "user_agent_type": "desktop",
  "_links": [
    {
      "rel": "self",
      "href": "http://data.oxylabs.io/v1/queries/12345678900987654321",
      "method": "GET"
    },
    {
      "rel": "results",
      "href": "http://data.oxylabs.io/v1/queries/12345678900987654321/results",
      "method": "GET"
    }
  ]
}
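The `_links` array in this response is where the status and results URLs come from. As a minimal sketch, a small helper (the name `find_link` is an assumption for illustration) can look them up by their `rel` value:

```python
def find_link(job, rel):
    """Return the href of the first `_links` entry with the given rel, or None."""
    for link in job.get("_links", []):
        if link.get("rel") == rel:
            return link.get("href")
    return None


# A trimmed copy of the sample response above.
job = {
    "id": "12345678900987654321",
    "status": "pending",
    "_links": [
        {
            "rel": "self",
            "href": "http://data.oxylabs.io/v1/queries/12345678900987654321",
            "method": "GET",
        },
        {
            "rel": "results",
            "href": "http://data.oxylabs.io/v1/queries/12345678900987654321/results",
            "method": "GET",
        },
    ],
}

status_url = find_link(job, "self")
results_url = find_link(job, "results")
print(status_url)
print(results_url)
```

Looking links up by `rel` rather than by position avoids relying on the order of the array, which the response format does not appear to guarantee.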

Data dictionary

For detailed descriptions of the job input parameters, please consult the table below or refer to the specific documentation pages for the scrapers you are interested in.

Key | Description | Type
created_at | The datetime the job was created at. | String
client_id | The numerical ID associated with the username of the client making the request. | String
client_notes | Submitted notes by the client when sending a job. | String
content_encoding | Add this parameter if you are downloading images. Learn more here. | String
id | The unique ID of the job. | String
statuses | The status code of the scraping or parsing job. You can see the status codes described here. | Integer
status | The status of the job. pending means the job is still being processed. done means we've completed the job. faulted means we came across errors while trying to complete the job and gave up on it. | String
subdomain | The subdomain of the website. | String
updated_at | The datetime the job was last updated at. For jobs that are finished (status is done or faulted), this datetime indicates when the job was finished. | String
_links | The list of links related to the provided input. | JSON Array
_links:rel | The link type. The self URL contains the metadata of the job, while the results URL contains the job results. | String
_links:href | The URL to the resource. | String
_links:method | The HTTP method that should be used to interact with a given URL. | String
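To illustrate how `status`, `created_at`, and `updated_at` relate, the sketch below computes how long a finished job took. It assumes the timestamp format seen in the sample responses (`YYYY-MM-DD HH:MM:SS`); the time zone is not stated in this document, so the values are treated as naive datetimes. The helper names are hypothetical.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed from the sample responses
FINISHED_STATUSES = {"done", "faulted"}


def is_finished(job):
    """True once the job has stopped processing (done or faulted)."""
    return job.get("status") in FINISHED_STATUSES


def processing_seconds(job):
    """Seconds between created_at and updated_at for a finished job, else None."""
    if not is_finished(job):
        # For pending jobs, updated_at does not yet mark the finish time.
        return None
    created = datetime.strptime(job["created_at"], TIMESTAMP_FORMAT)
    updated = datetime.strptime(job["updated_at"], TIMESTAMP_FORMAT)
    return (updated - created).total_seconds()


finished = {
    "status": "done",
    "created_at": "2019-10-01 00:00:01",
    "updated_at": "2019-10-01 00:00:15",
}
print(processing_seconds(finished))  # 14.0
```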

Callback

The callback is a POST request we send to your machine, informing you that the data extraction task is completed and providing a URL to download the scraped content. This means that you no longer need to check the job status manually. Once the data is ready, we will let you know, and all you need to do is retrieve it.

Input

# This is a simple Sanic web server with a route listening for callbacks on localhost:8080.
# It will print job results to stdout.
import requests
from pprint import pprint
from sanic import Sanic, response


AUTH_TUPLE = ('user', 'pass1')

# Recent Sanic versions require an application name.
app = Sanic('job_listener_app')


# Define /job_listener endpoint that accepts POST requests.
@app.route('/job_listener', methods=['POST'])
async def job_listener(request):
    try:
        res = request.json
        links = res.get('_links', [])
        for link in links:
            if link['rel'] == 'results':
                # Sanic is async, but requests is synchronous. To take full
                # advantage of Sanic, use an async HTTP client such as aiohttp.
                res_response = requests.request(
                    method='GET',
                    url=link['href'],
                    auth=AUTH_TUPLE,
                )
                pprint(res_response.json())
                break
    except Exception as e:
        print("Listener exception: {}".format(e))
    return response.json(status=200, body={'status': 'ok'})


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)

Output

{  
   "created_at":"2019-10-01 00:00:01",
   "updated_at":"2019-10-01 00:00:15",
   "locale":null,
   "client_id":163,
   "user_agent_type":"desktop",
   "source":"google_shopping_search",
   "pages":1,
   "subdomain":"www",
   "status":"done",
   "start_page":1,
   "parse":0,
   "render":null,
   "priority":0,
   "ttl":0,
   "origin":"api",
   "persist":true,
   "id":"12345678900987654321",
   "callback_url":"http://your.callback.url/",
   "query":"adidas",
   "domain":"com",
   "limit":10,
   "geo_location":null,
   {...}
   "_links":[
      {  
         "href":"https://data.oxylabs.io/v1/queries/12345678900987654321",
         "method":"GET",
         "rel":"self"
      },
      {  
         "href":"https://data.oxylabs.io/v1/queries/12345678900987654321/results",
         "method":"GET",
         "rel":"results"
      }
   ]
}

Check Job Status

If you provided a valid callback URL when submitting your job, we will notify you upon completion by sending a JSON payload to the specified callback URL. This payload will indicate that the job has been completed and its status set to done.

However, if you submitted a job without using the callback service, you can check the job status manually. Retrieve the URL from the href field of the _links entry with rel set to self in the response you received after job submission. The URL for checking the job status will resemble the following: http://data.oxylabs.io/v1/queries/12345678900987654321. Querying this URL will return the job information, including its current status.

Endpoint

GET https://data.oxylabs.io/v1/queries/{id}

Input

curl --user "user:pass1" \
'https://data.oxylabs.io/v1/queries/12345678900987654321'
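If you check the status manually, the request above is typically repeated until the status is no longer pending. The following is a minimal polling sketch, not an official client: `wait_for_job`, its parameters, and the default interval are assumptions, and `fetch_status` stands for any caller-supplied function that performs the GET request and returns the parsed job information.

```python
import time


def wait_for_job(fetch_status, interval=5.0, max_attempts=60, sleep=time.sleep):
    """Poll until the job is done or faulted, then return the final job info.

    fetch_status is a caller-supplied callable that performs the GET
    request and returns the parsed JSON job information as a dict.
    """
    for attempt in range(max_attempts):
        job = fetch_status()
        if job.get("status") in ("done", "faulted"):
            return job
        # Wait before the next check, except after the final attempt.
        if attempt < max_attempts - 1:
            sleep(interval)
    raise TimeoutError("job did not finish within the polling budget")


# Demonstration with canned responses instead of real HTTP calls.
responses = iter([{"status": "pending"}, {"status": "pending"}, {"status": "done"}])
final = wait_for_job(lambda: next(responses), interval=0, sleep=lambda seconds: None)
print(final["status"])  # done
```

Passing the fetch function in keeps the loop independent of any particular HTTP library. A job that ends as faulted is returned rather than raised, so the caller decides how to handle it.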