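Universal E-Commerce Scraper API

Quick Start

Scraper API is built to take the heavy, tedious work of data retrieval off your hands. You can use it to access all kinds of public web pages. It is designed to crawl page data with little effort on your part while keeping delays and errors to a minimum.

Scraper API uses basic HTTP authentication, which requires sending a username and a password.

This is by far the fastest way to start using Scraper API. You will use the Realtime integration method to send a request to https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html from a United States geo-location and retrieve already parsed data in JSON. If you want to get the HTML page content instead of parsed data, simply remove the parse and parser_type parameters. Do not forget to replace USERNAME and PASSWORD with your proxy user credentials.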

curl --user "USERNAME:PASSWORD" 'https://realtime.oxylabs.io/v1/queries' -H "Content-Type: application/json" -d '{"source": "universal_ecommerce", "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "geo_location": "United States", "parser_type": "ecommerce_product", "parse": true}'

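If you have any questions not covered by this documentation, email support@oxylabs.io to reach your account manager or our support staff.

Integration Methods

Scraper API supports three integration methods, each with its own advantages: Push-Pull, Realtime, and SuperAPI.

Our recommended data extraction method is Push-Pull.

Push-Pull

This is the simplest yet most reliable data delivery method, and the one we recommend. In the Push-Pull scenario, you send us a query, we return a job id, and once the job is done you can use that id to retrieve the content from the /results endpoint. You can check the job status yourself, or you can set up a simple listener that accepts POST queries. In that case, we will send you a callback message once the job is ready to be retrieved. In this particular example, the results will be uploaded automatically to your S3 bucket named YOUR_BUCKET_NAME.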

Single Query

curl --user user:pass1 \
'https://data.oxylabs.io/v1/queries' \
-H "Content-Type: application/json" \
-d '{"source": "universal_ecommerce", "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "callback_url": "https://your.callback.url", "storage_type": "s3", "storage_url": "YOUR_BUCKET_NAME"}'
import requests
from pprint import pprint


# Structure payload.
payload = {
    'source': 'universal_ecommerce',
    'url': 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
    'callback_url': 'https://your.callback.url',
    'storage_type': 's3',
    'storage_url': 'YOUR_BUCKET_NAME'
}

# Get response.
response = requests.request(
    'POST',
    'https://data.oxylabs.io/v1/queries',
    auth=('user', 'pass1'),
    json=payload,
)

# Print prettified response to stdout.
pprint(response.json())
<?php

$params = array(
    'source' => 'universal_ecommerce',
    'url' => 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
    'callback_url' => 'https://your.callback.url',
    'storage_type' => 's3',
    'storage_url' => 'YOUR_BUCKET_NAME'
);

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, "https://data.oxylabs.io/v1/queries");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($params));
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");

$headers = array();
$headers[] = "Content-Type: application/json";
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);

$result = curl_exec($ch);
echo $result;

if (curl_errno($ch)) {
    echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
HTTP method is currently not supported

The API will respond with the query information in JSON format, printed in the response body, similar to this:

{
  "callback_url": "https://your.callback.url",
  "client_id": 5,
  "created_at": "2019-10-01 00:00:01",
  "domain": "com",
  "geo_location": null,
  "id": "12345678900987654321",
  "limit": 10,
  "locale": null,
  "pages": 1,
  "parse": false,
  "render": null,
  "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
  "source": "universal_ecommerce",
  "start_page": 1,
  "status": "pending",
  "storage_type": "s3",
  "storage_url": "YOUR_BUCKET_NAME/12345678900987654321.json",
  "subdomain": "www",
  "updated_at": "2019-10-01 00:00:01",
  "user_agent_type": "desktop",
  "_links": [
    {
      "rel": "self",
      "href": "http://data.oxylabs.io/v1/queries/12345678900987654321",
      "method": "GET"
    },
    {
      "rel": "results",
      "href": "http://data.oxylabs.io/v1/queries/12345678900987654321/results",
      "method": "GET"
    }
  ]
}

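The following endpoint handles single queries for one keyword or URL. The API returns a confirmation message containing the job information, including the job id. You can use that id to check the job status, or you can add a callback_url to the query and ask us to ping your callback endpoint once the scraping task is finished.

POST https://data.oxylabs.io/v1/queries

You need to post the query parameters as data in the JSON body.

Check Job Status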

curl --user user:pass1 'http://data.oxylabs.io/v1/queries/12345678900987654321'
import requests
from pprint import pprint

# Get a response from the stats endpoint.
response = requests.request(
    method='GET',
    url='http://data.oxylabs.io/v1/queries/12345678900987654321',
    auth=('user', 'pass1'),
)

# Print prettified JSON response to stdout.
pprint(response.json())
<?php

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, "http://data.oxylabs.io/v1/queries/12345678900987654321");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "GET");
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");

$result = curl_exec($ch);
echo $result;

if (curl_errno($ch)) {
    echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
HTTP method is currently not supported

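The API will respond with the query information in JSON format, printed in the response body. Note that the job status has changed to done. You can now retrieve the content by querying http://data.oxylabs.io/v1/queries/12345678900987654321/results.

You can also see that the job was updated_at 2019-10-01 00:00:15, so the query took 14 seconds to complete.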

{
  "client_id": 5,
  "created_at": "2019-10-01 00:00:01",
  "domain": "com",
  "geo_location": null,
  "id": "12345678900987654321",
  "limit": 10,
  "locale": null,
  "pages": 1,
  "parse": false,
  "render": null,
  "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
  "source": "universal_ecommerce",
  "start_page": 1,
  "status": "done",
  "subdomain": "www",
  "updated_at": "2019-10-01 00:00:15",
  "user_agent_type": "desktop",
  "_links": [
    {
      "rel": "self",
      "href": "http://data.oxylabs.io/v1/queries/12345678900987654321",
      "method": "GET"
    },
    {
      "rel": "results",
      "href": "http://data.oxylabs.io/v1/queries/12345678900987654321/results",
      "method": "GET"
    }
  ]
}
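
The 14-second figure quoted for this job can be reproduced from the two timestamps in the response. The sketch below assumes the timestamp layout seen in the samples (YYYY-MM-DD HH:MM:SS); job_duration_seconds is an illustrative helper name, not part of the API.

```python
from datetime import datetime

# Timestamp layout used in the sample responses, e.g. "2019-10-01 00:00:15".
TIMESTAMP_FORMAT = '%Y-%m-%d %H:%M:%S'


def job_duration_seconds(job: dict) -> float:
    """Return the seconds between created_at and updated_at of a job."""
    created = datetime.strptime(job['created_at'], TIMESTAMP_FORMAT)
    updated = datetime.strptime(job['updated_at'], TIMESTAMP_FORMAT)
    return (updated - created).total_seconds()


# Values copied from the status response shown above.
job = {
    'created_at': '2019-10-01 00:00:01',
    'updated_at': '2019-10-01 00:00:15',
    'status': 'done',
}
print(job_duration_seconds(job))  # 14.0
```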

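If your query had a callback_url, we will send you a message containing a link to the content once the scraping job is done. If there was no callback_url in the query, you need to check the job status yourself. To do that, use the URL from href under rel:self in the response you received after submitting the query to our API. It should look similar to this: http://data.oxylabs.io/v1/queries/12345678900987654321.

GET https://data.oxylabs.io/v1/queries/{id}

Querying this link returns the job information, including its status. There are three possible status values:

pending: the job is still in the queue and has not been completed.
done: the job is completed; you can retrieve the result by querying the URL from href under rel:results, for example http://data.oxylabs.io/v1/queries/12345678900987654321/results
faulted: something went wrong with the job and we could not complete it, most likely due to a server error on the target site's side.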
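
When no callback_url is set, the three status values translate into a simple polling loop. The following is a minimal sketch, not an official client: wait_for_job and fetch_status are hypothetical names, and fetch_status stands in for whatever function performs the GET request to the job URL and returns the decoded JSON. The interval and the number of checks are arbitrary assumptions.

```python
import time
from typing import Callable


def wait_for_job(fetch_status: Callable[[], dict],
                 interval: float = 5.0,
                 max_checks: int = 60) -> dict:
    """Poll until the job leaves the pending state or the checks run out."""
    for _ in range(max_checks):
        job = fetch_status()
        # done and faulted are both final; pending means "ask again later".
        if job.get('status') in ('done', 'faulted'):
            return job
        time.sleep(interval)
    raise TimeoutError('job still pending after {} checks'.format(max_checks))


# Simulated sequence of status responses, used here instead of real requests.
responses = iter([{'status': 'pending'}, {'status': 'pending'}, {'status': 'done'}])
final = wait_for_job(lambda: next(responses), interval=0)
print(final['status'])  # done
```

In real use, fetch_status would wrap the GET request shown in the Python sample above, and a faulted result should be handled by the caller, for example by resubmitting the query.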

Retrieve Job Content

curl --user user:pass1 'http://data.oxylabs.io/v1/queries/12345678900987654321/results'
import requests
from pprint import pprint

# Get response from the results endpoint.
response = requests.request(
    method='GET',
    url='http://data.oxylabs.io/v1/queries/12345678900987654321/results',
    auth=('user', 'pass1'),
)

# Print the prettified JSON response to stdout.
pprint(response.json())
<?php

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, "http://data.oxylabs.io/v1/queries/12345678900987654321/results");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "GET");
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");

$result = curl_exec($ch);
echo $result;

if (curl_errno($ch)) {
    echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
HTTP method is currently not supported

The API will return the job content:

{
  "results": [
    {
      "content": "<!doctype html><html>
        CONTENT      
      </html>",
      "created_at": "2019-10-01 00:00:01",
      "updated_at": "2019-10-01 00:00:15",
      "page": 1,
      "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
      "job_id": "12345678900987654321",
      "status_code": 200
    }
  ]
}

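Once you know from the job status that the job is ready to be retrieved, you can get it using the URL from href under rel:results in our initial response. It should look similar to this: http://data.oxylabs.io/v1/queries/12345678900987654321/results.

GET https://data.oxylabs.io/v1/queries/{id}/results

By setting up a callback service, results can be retrieved automatically without periodically checking the job status. You need to specify the IP or domain of the server where the callback service runs. When our system completes a job, it sends a message to the provided IP or domain, and the callback service downloads the results as described in the callback implementation example.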
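
Both the status response and the callback message carry the results URL in their _links list, so the lookup can be factored into a small helper. This is a sketch under the assumption that _links always has the shape shown in the samples; find_link is an illustrative name.

```python
from typing import Optional


def find_link(job: dict, rel: str) -> Optional[str]:
    """Return the href of the first _links entry with the given rel, if any."""
    for link in job.get('_links', []):
        if link.get('rel') == rel:
            return link.get('href')
    return None


# Trimmed copy of the job information shown in the samples above.
job = {
    'id': '12345678900987654321',
    'status': 'done',
    '_links': [
        {'rel': 'self',
         'href': 'http://data.oxylabs.io/v1/queries/12345678900987654321',
         'method': 'GET'},
        {'rel': 'results',
         'href': 'http://data.oxylabs.io/v1/queries/12345678900987654321/results',
         'method': 'GET'},
    ],
}
print(find_link(job, 'results'))
```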

Callback

# Please see the code samples in Python and PHP.
# This is a simple Sanic web server with a route listening for callbacks on localhost:8080.
# It will print job results to stdout.
import requests
from pprint import pprint
from sanic import Sanic, response


AUTH_TUPLE = ('user', 'pass1')

# Recent Sanic releases require an application name; this one is arbitrary.
app = Sanic('job_listener_app')


# Define /job_listener endpoint that accepts POST requests.
@app.route('/job_listener', methods=['POST'])
async def job_listener(request):
    try:
        res = request.json
        links = res.get('_links', [])
        for link in links:
            if link['rel'] == 'results':
                # Sanic is async, but requests are synchronous, to fully take
                # advantage of Sanic, use aiohttp.
                res_response = requests.request(
                    method='GET',
                    url=link['href'],
                    auth=AUTH_TUPLE,
                )
                pprint(res_response.json())
                break
    except Exception as e:
        print("Listener exception: {}".format(e))
    return response.json(status=200, body={'status': 'ok'})


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)

<?php
$stdout = fopen('php://stdout', 'w');

if (isset($_POST)) {
    $result = array_merge($_POST, (array) json_decode(file_get_contents('php://input')));

    $ch = curl_init();

    curl_setopt($ch, CURLOPT_URL, "https://data.oxylabs.io/v1/queries/".$result['id'].'/results');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "GET");
    curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");

    $result = curl_exec($ch);
    fwrite($stdout, $result);

    if (curl_errno($ch)) {
        echo 'Error:' . curl_error($ch);
    }
    curl_close ($ch);
}
?>
HTTP method is currently not supported

Sample callback output

{  
   "created_at":"2019-10-01 00:00:01",
   "updated_at":"2019-10-01 00:00:15",
   "locale":null,
   "client_id":163,
   "user_agent_type":"desktop",
   "source":"universal_ecommerce",
   "pages":1,
   "subdomain":"www",
   "status":"done",
   "start_page":1,
   "parse":0,
   "render":null,
   "priority":0,
   "ttl":0,
   "origin":"api",
   "persist":true,
   "id":"12345678900987654321",
   "callback_url":"http://your.callback.url/",
   "url":"https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
   "domain":"de",
   "limit":10,
   "geo_location":null,
   {...}
   "_links":[
      {  
         "href":"https://data.oxylabs.io/v1/queries/12345678900987654321",
         "method":"GET",
         "rel":"self"
      },
      {  
         "href":"https://data.oxylabs.io/v1/queries/12345678900987654321/results",
         "method":"GET",
         "rel":"results"
      }
   ],
}

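A callback is a POST request we send to your machine to inform you that the data extraction task is completed and to provide a URL for downloading the scraped content. This means you no longer need to check the job status manually. Once the data is ready, we will let you know, and all you need to do is retrieve it.

Batch Query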

curl --user user:pass1 'https://data.oxylabs.io/v1/queries/batch' -H 'Content-Type: application/json' \
 -d '@keywords.json'
import requests
import json
from pprint import pprint


# Get payload from file.
with open('keywords.json', 'r') as f:
    payload = json.loads(f.read())

response = requests.request(
    'POST',
    'https://data.oxylabs.io/v1/queries/batch',
    auth=('user', 'pass1'),
    json=payload,
)

# Print prettified response.
pprint(response.json())
<?php

$paramsFile = file_get_contents(realpath("keywords.json"));
$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, "https://data.oxylabs.io/v1/queries/batch");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $paramsFile);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");

$headers = array();
$headers[] = "Content-Type: application/json";
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);

$result = curl_exec($ch);
echo $result;

if (curl_errno($ch)) {
    echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
HTTP method is currently not supported

keywords.json content:

{  
   "url":[  
      "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
      "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
      "https://books.toscrape.com/catalogue/soumission_998/index.html"
   ],
   "source": "universal_ecommerce",
   "callback_url": "https://your.callback.url"
}

The API will respond with the query information in JSON format, printed in the response body, similar to this:

{
  "queries": [
    {
      "callback_url": "https://your.callback.url",
      {...}
      "created_at": "2019-10-01 00:00:01",
      "domain": "com",
      "id": "12345678900987654321",
      {...}
      "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
      "source": "universal_ecommerce",
      {...}
          "rel": "results",
          "href": "http://data.oxylabs.io/v1/queries/12345678900987654321/results",
          "method": "GET"
        }
      ]
    },
    {
      "callback_url": "https://your.callback.url",
      {...}
      "created_at": "2019-10-01 00:00:01",
      "domain": "com",
      "id": "12345678901234567890",
      {...}
      "url": "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
      "source": "universal_ecommerce",
      {...}
          "rel": "results",
          "href": "http://data.oxylabs.io/v1/queries/12345678901234567890/results",
          "method": "GET"
        }
      ]
    },
    {
      "callback_url": "https://your.callback.url",
      {...}
      "created_at": "2019-10-01 00:00:01",
      "domain": "com",
      "id": "01234567899876543210",
      {...}
      "url": "https://books.toscrape.com/catalogue/soumission_998/index.html",
      "source": "universal_ecommerce",
      {...}
          "rel": "results",
          "href": "http://data.oxylabs.io/v1/queries/01234567899876543210/results",
          "method": "GET"
        }
      ]
    }
  ]
}

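Scraper API also accepts multiple keywords per query (for this source, the values of the url parameter), up to 1,000 keywords per batch. The following endpoint submits multiple keywords to the extraction queue.

POST https://data.oxylabs.io/v1/queries/batch

You need to post the query parameters as data in the JSON body.

The system handles every keyword as a separate request. If you provided a callback URL, you will get a separate call for each keyword. Otherwise, our initial response will contain the job ids for all keywords. For example, if you sent 50 keywords, we will return 50 unique job ids.

Important! url is the only parameter that can have multiple values, as in keywords.json above. All other parameters are the same for every query in the batch.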
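
Because a batch is capped at 1,000 values, a longer list of URLs has to be split before submission. The sketch below builds one payload per chunk in the same shape as keywords.json; build_batches is a hypothetical helper, and sending each payload to the batch endpoint is left to the request code shown above.

```python
from typing import Iterator, List, Optional

MAX_BATCH_SIZE = 1000  # Per-batch limit stated in this section.


def build_batches(urls: List[str],
                  source: str = 'universal_ecommerce',
                  callback_url: Optional[str] = None) -> Iterator[dict]:
    """Yield batch payloads that each hold at most MAX_BATCH_SIZE URLs."""
    for start in range(0, len(urls), MAX_BATCH_SIZE):
        payload = {
            'url': urls[start:start + MAX_BATCH_SIZE],
            'source': source,
        }
        # Every other parameter is shared by all queries in the batch.
        if callback_url:
            payload['callback_url'] = callback_url
        yield payload


# Placeholder URLs for illustration only.
urls = ['https://example.com/item/{}'.format(i) for i in range(2500)]
batches = list(build_batches(urls, callback_url='https://your.callback.url'))
print([len(batch['url']) for batch in batches])  # [1000, 1000, 500]
```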

Get Notifier IP Address List

curl --user user:pass1 'https://data.oxylabs.io/v1/info/callbacker_ips'
import requests
from pprint import pprint

# Get response from the callback IPs endpoint.
response = requests.request(
    method='GET',
    url='https://data.oxylabs.io/v1/info/callbacker_ips',
    auth=('user', 'pass1'),
)

# Print the prettified JSON response to stdout.
pprint(response.json())
<?php

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, "https://data.oxylabs.io/v1/info/callbacker_ips");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "GET");
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");

$result = curl_exec($ch);
echo $result;

if (curl_errno($ch)) {
    echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
HTTP method is currently not supported

The API will return the list of IPs that make callback requests to your system:

{
    "ips": [
        "x.x.x.x",
        "y.y.y.y"
    ]
}

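You may want to whitelist the IPs that send you callback messages, or get the list of these IPs for other purposes. You can do so by sending a GET request to this endpoint: https://data.oxylabs.io/v1/info/callbacker_ips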
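
One way to use the list is to reject callback requests that do not come from a listed address. The sketch below assumes the response has the "ips" shape shown above; is_allowed_sender is an illustrative name, and the addresses in the example are documentation placeholders rather than real notifier IPs.

```python
import ipaddress


def is_allowed_sender(remote_addr: str, ips_response: dict) -> bool:
    """Check whether remote_addr appears in the callbacker IP list."""
    allowed = {ipaddress.ip_address(ip) for ip in ips_response.get('ips', [])}
    try:
        return ipaddress.ip_address(remote_addr) in allowed
    except ValueError:
        # A malformed sender address is never allowed.
        return False


# Placeholder addresses from the documentation ranges, not real notifier IPs.
ips_response = {'ips': ['192.0.2.10', '198.51.100.7']}
print(is_allowed_sender('192.0.2.10', ips_response))   # True
print(is_allowed_sender('203.0.113.5', ips_response))  # False
```

Comparing parsed addresses instead of raw strings avoids mismatches caused by formatting differences such as leading whitespace handled upstream or alternative IPv6 spellings.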

Upload to Storage

{
    "Version": "2012-10-17",
    "Id": "Policy1577442634787",
    "Statement": [
        {
            "Sid": "Stmt1577442633719",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::324311890426:user/oxylabs.s3.uploader"
            },
            "Action": "s3:GetBucketLocation",
            "Resource": "arn:aws:s3:::YOUR_BUCKET_NAME"
        },
        {
            "Sid": "Stmt1577442633719",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::324311890426:user/oxylabs.s3.uploader"
            },
            "Action": [
                "s3:PutObject",
                "s3:PutObjectAcl"
            ],
            "Resource": "arn:aws:s3:::YOUR_BUCKET_NAME/*"
        }
    ]
}

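By default, Universal E-Commerce Scraper API job results are stored in our database. This means you need to query our results endpoint and retrieve the content yourself. The custom storage feature lets you store results in your own cloud storage. The advantage of this feature is that you do not need to make extra requests to fetch results; everything goes directly to your storage bucket.

Currently, we only support Amazon S3. If you would like to use a different type of storage, please contact your account manager to discuss the timeline.

To upload job results to your Amazon S3 bucket, you need to set up special permissions. To do that, go to https://s3.console.aws.amazon.com/ > S3 > Storage > Bucket Name (if you don't have one, create a new one) > Permissions > Bucket Policy

You can find the bucket policy in the JSON above. Do not forget to change the bucket name under YOUR_BUCKET_NAME. This policy allows us to write to your bucket, upload files for you, and know the location of the bucket.

To use this feature, you need to specify two additional parameters in your request, storage_type and storage_url. Both are described in the query parameters table.

The upload path looks like this: YOUR_BUCKET_NAME/job_ID.json. You will find the job ID in the response body we send you after you submit a request. In this example, the job ID is 12345678900987654321.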
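
The two storage parameters and the resulting upload path can be expressed as a pair of small helpers. This is only a sketch of the naming rule described above; with_s3_storage and expected_upload_path are hypothetical names.

```python
def with_s3_storage(payload: dict, bucket_name: str) -> dict:
    """Return a copy of the payload with the two storage parameters added."""
    updated = dict(payload)
    updated['storage_type'] = 's3'
    updated['storage_url'] = bucket_name
    return updated


def expected_upload_path(bucket_name: str, job_id: str) -> str:
    """Build the path where the result file is expected to be uploaded."""
    return '{}/{}.json'.format(bucket_name, job_id)


payload = with_s3_storage(
    {
        'source': 'universal_ecommerce',
        'url': 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
    },
    'YOUR_BUCKET_NAME',
)
print(payload['storage_type'], payload['storage_url'])
print(expected_upload_path('YOUR_BUCKET_NAME', '12345678900987654321'))
```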

Realtime

curl --user user:pass1 'https://realtime.oxylabs.io/v1/queries' -H "Content-Type: application/json" \
 -d '{"source": "universal_ecommerce", "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"}'
import requests
from pprint import pprint


# Structure payload.
payload = {
    'source': 'universal_ecommerce',
    'url': 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
}

# Get response.
response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('user', 'pass1'),
    json=payload,
)

# Instead of response with job status and results url, this will return the
# JSON response with results.
pprint(response.json())
<?php

$params = array(
    'source' => 'universal_ecommerce',
    'url' => 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
);

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, "https://realtime.oxylabs.io/v1/queries");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($params));
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");

$headers = array();
$headers[] = "Content-Type: application/json";
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);

$result = curl_exec($ch);
echo $result;

if (curl_errno($ch)) {
    echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
# URL has to be encoded to escape `&` and `=` characters. It is not necessary in this example.

https://realtime.oxylabs.io/v1/queries?source=universal_ecommerce&url=https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html&access_token=12345abcde

Example response body that will be returned on the open connection:

{
  "results": [
    {
      "content": "<html>
      CONTENT
      </html>",
      "created_at": "2019-10-01 00:00:01",
      "updated_at": "2019-10-01 00:00:15",
      "id": null,
      "page": 1,
      "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
      "job_id": "12345678900987654321",
      "status_code": 200
    }
  ]
}

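Data submission is the same as in the Push-Pull method, but with Realtime we return the content on the open connection. You send us a query, the connection remains open, we retrieve the content and bring it to you. The endpoint that handles this is:

POST https://realtime.oxylabs.io/v1/queries

The open connection times out after 150 seconds, so in rare cases of heavy load we may not be able to ensure that the data gets to you.

You need to post the query parameters as data in the JSON body. See the examples for details.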
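
Since the connection can stay open for up to 150 seconds, the client should not give up earlier than that. The sketch below uses only the standard library; the helper names and the 10-second margin are assumptions, and the timeout passed to urlopen applies to individual blocking socket operations rather than to the request as a whole.

```python
import base64
import json
import urllib.request

REALTIME_URL = 'https://realtime.oxylabs.io/v1/queries'
CONNECTION_TIMEOUT = 150  # Seconds; the open-connection limit stated above.


def build_realtime_request(payload: dict, username: str,
                           password: str) -> urllib.request.Request:
    """Prepare a POST request with a JSON body and basic HTTP authentication."""
    credentials = '{}:{}'.format(username, password).encode('utf-8')
    token = base64.b64encode(credentials).decode('ascii')
    return urllib.request.Request(
        REALTIME_URL,
        data=json.dumps(payload).encode('utf-8'),
        headers={
            'Content-Type': 'application/json',
            'Authorization': 'Basic ' + token,
        },
        method='POST',
    )


def send_realtime(request: urllib.request.Request,
                  timeout: float = CONNECTION_TIMEOUT + 10) -> dict:
    """Send the request, waiting slightly longer than the server-side limit."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode('utf-8'))


request = build_realtime_request(
    {
        'source': 'universal_ecommerce',
        'url': 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
    },
    'user',
    'pass1',
)
print(request.get_method(), request.full_url)
```

Calling send_realtime(request) performs the actual network call; it is left out of the example so that the snippet runs without credentials.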

SuperAPI

curl -k \
-x realtime.oxylabs.io:60000 \
-U user:pass1 \
-H "X-Oxylabs-User-Agent-Type: desktop_chrome" \
"https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
import requests
from pprint import pprint

# Define proxy dict. Do not forget to put your real user and pass here as well.
proxies = {
  'http': 'http://user:pass1@realtime.oxylabs.io:60000',
  'https': 'https://user:pass1@realtime.oxylabs.io:60000',
}

response = requests.request(
    'GET',
    'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
    auth=('user', 'pass1'),
    verify=False,  # Or accept our certificate.
    proxies=proxies,
)

# Print result page to stdout
pprint(response.text)

# Save returned HTML to result.html file
with open('result.html', 'w') as f:
    f.write(response.text)
<?php
$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_PROXY, 'realtime.oxylabs.io:60000');
curl_setopt($ch, CURLOPT_PROXYUSERPWD, "user" . ":" . "pass1");
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);

$result = curl_exec($ch);
echo $result;

if (curl_errno($ch)) {
    echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
HTTP method is not supported with SuperAPI

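If you have ever used regular proxies for data scraping, integrating the SuperAPI delivery method will be a breeze. All you need to do is use our entry node as a proxy, authorize with your Scraper API credentials, and ignore certificates. In cURL, that is -k or --insecure. Your data will reach you on the open connection.

GET realtime.oxylabs.io:60000

SuperAPI only supports a handful of parameters, since it only works with the Direct data source, where a full URL is provided. These parameters should be sent as headers. Here is the list of accepted parameters:

X-Oxylabs-User-Agent-Type: there is no way to indicate a specific User-Agent, but you can let us know which browser and platform to use. The list of supported User-Agent types is given under User_Agent_Type.

If you need help setting up SuperAPI, email us at support@oxylabs.io.

Content Type

Scraper API returns raw HTML as well as structured JSON.

Download Images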

import base64
import json
import requests

# Your credentials.
USERNAME = ''
PASSWORD = ''

# Image URL which will be saved to file.
URL_IMAGE = 'https://example.com/image.jpg'

# Realtime URL.
API_URL = f'http://{USERNAME}:{PASSWORD}@realtime.oxylabs.io/v1/queries'


def dump_to_file(filename: str, data: bytes):
    with open(filename, 'wb') as file:
        file.write(data)


def main():
    parameters = {
        'source': 'universal_ecommerce',
        'url': URL_IMAGE,
        'content_encoding': 'base64',
    }
    response = requests.post(API_URL, json=parameters)
    if response.ok:
        data = json.loads(response.text)
        content_base64 = data['results'][0]['content']
        # Decode base64 encoded data into bytes.
        content = base64.b64decode(content_base64)
        dump_to_file('out.jpg', content)


if __name__ == '__main__':
    main()

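You can download images via Scraper API. If you download them through SuperAPI, you can simply save the output to a file with an image extension. For example:

curl -k -x realtime.oxylabs.io:60000 -U user:pass1 "https://example.com/image.jpg" > image.jpg

If you use the Push-Pull or Realtime method, you need to add a content_encoding parameter with the value base64. Once you receive the results, decode the encoded data in content into bytes and save it as an image file. See the Python example above.

Data Sources

Scraper API accepts a URL along with other parameters, such as User-Agent type and proxy location. See the method below, which we call Direct.

Scraper API can render JavaScript while scraping. This lets you get more data from a web page, as well as screenshots.

If you are unsure about any part of the documentation, email us at support@oxylabs.io or contact your account manager.

Direct

In this example, the API retrieves a universal e-commerce product page via Push-Pull. All available parameters are included (though they are not always necessary or compatible within the same request) to give you an idea of how to format your request: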

curl --user user:pass1 \
'https://data.oxylabs.io/v1/queries' \
-H "Content-Type: application/json" \
-d '{"source": "universal_ecommerce", "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "user_agent_type": "mobile", "render": "html", "context": [{"key": "headers", "value": {"Accept-Language": "en-US", "Content-Type": "application/octet-stream", "Custom-Header": "custom header content"}}, {"key": "cookies", "value": [{"key": "NID", "value": "1234567890"}, {"key": "1P_JAR", "value": "0987654321"}]}, {"key": "follow_redirects", "value": true}, {"key": "http_method", "value": "get"}, {"key": "content", "value": "base64EncodedPOSTBody"}, {"key": "successful_status_codes", "value": [303, 808, 909]}]}'
import requests
from pprint import pprint


# Structure payload.
payload = {
    'source': 'universal_ecommerce',
    'url': 'https://www.etsy.com/listing/399423455/big-glass-house-planter-handmade-glass?ref=hp_prn&frs=1',
    'user_agent_type': 'desktop',
    'geo_location': 'United States',
    'parse': True,
    'parser_type': 'ecommerce_product',
    'context': [
        {
            'key': 'session_id',
            'value': '1234567890abcdef'
        },
        {
            'key': 'headers',
            'value': {
                'Accept-Language': 'en-US',
                'Content-Type': 'application/octet-stream',
                'Custom-Header': 'custom header content'
            }
        },
        {
            'key': 'cookies',
            'value': [
                {
                    'key': 'NID',
                    'value': '1234567890'
                },
                {
                    'key': '1P_JAR',
                    'value': '0987654321'
                }
            ]
        },
        {
            'key': 'follow_redirects',
            'value': True
        },
        {
            'key': 'successful_status_codes',
            'value': [303, 808, 909]
        },
        {
            'key': 'http_method',
            'value': 'get'
        },
        {
            'key': 'content',
            'value': 'base64EncodedPOSTBody'
        }
    ],
    'callback_url': 'https://your.callback.url',
}

# Get response.
response = requests.request(
    'POST',
    'https://data.oxylabs.io/v1/queries',
    auth=('user', 'pass1'),
    json=payload,
)

# Print prettified response to stdout.
pprint(response.json())
<?php

$params = [
    'source' => 'universal_ecommerce',
    'url' => 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
    'context' => [
        [
            'key' => 'session_id',
            'value' => '1234567890abcdef'
        ],
        [
            'key' => 'headers',
            'value' => [
                'Accept-Language' => 'en-US',
                'Content-Type' => 'application/octet-stream',
                'Custom-Header' => 'custom header content'
            ],
        ],
        [
            'key' => 'cookies',
            'value' => [
                ['key' => 'NID', 'value' => '1234567890'],
                ['key' => '1P_JAR', 'value' => '0987654321']
            ]
        ],
        [
            'key' => 'follow_redirects',
            'value' => true
        ],
        [
            'key' => 'successful_status_codes',
            'value' => [303, 808, 909]
        ],
        [
            'key' => 'http_method',
            'value' => 'get'
        ],
        [
            'key' => 'content',
            'value' => 'base64EncodedPOSTBody'
        ]
    ]
];

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, "https://data.oxylabs.io/v1/queries");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($params));
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");

$headers = array();
$headers[] = "Content-Type: application/json";
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);

$result = curl_exec($ch);
echo $result;

if (curl_errno($ch)) {
    echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
HTTP method is currently not supported with Push-Pull

Here is the same example performed via Realtime:

curl --user user:pass1 \
'https://realtime.oxylabs.io/v1/queries' \
-H "Content-Type: application/json" \
-d '{"source": "universal_ecommerce", "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "user_agent_type": "mobile", "context": [{"key": "headers", "value": {"Accept-Language": "en-US", "Content-Type": "application/octet-stream", "Custom-Header": "custom header content"}}, {"key": "cookies", "value": [{"key": "NID", "value": "1234567890"}, {"key": "1P_JAR", "value": "0987654321"}]}, {"key": "follow_redirects", "value": true}, {"key": "http_method", "value": "get"}, {"key": "content", "value": "base64EncodedPOSTBody"}, {"key": "successful_status_codes", "value": [303, 808, 909]}]}'
import requests
from pprint import pprint

# Structure payload.
payload = {
    'source': 'universal_ecommerce',
    'url': 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
    'user_agent_type': 'mobile',
    'geo_location': 'United States',
    'context': [
        {
            'key': 'session_id',
            'value': '1234567890abcdef'
        },
        {
            'key': 'headers',
            'value': {
                'Accept-Language': 'en-US',
                'Content-Type': 'application/octet-stream',
                'Custom-Header': 'custom header content'
            }
        },
        {
            'key': 'cookies',
            'value': [
                {
                    'key': 'NID',
                    'value': '1234567890'
                },
                {
                    'key': '1P_JAR',
                    'value': '0987654321'
                }
            ]
        },
        {
            'key': 'follow_redirects',
            'value': True
        },
        {
            'key': 'successful_status_codes',
            'value': [303, 808, 909]
        },
        {
            'key': 'http_method',
            'value': 'get'
        },
        {
            'key': 'content',
            'value': 'base64EncodedPOSTBody'
        }
    ],
}

# Get response.
response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('user', 'pass1'),
    json=payload,
)

# Instead of response with job status and results url, this will return the
# JSON response with the result.
pprint(response.json())
<?php

$params = [
    'source' => 'universal_ecommerce',
    'url' => 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
    'context' => [
        [
            'key' => 'session_id',
            'value' => '1234567890abcdef'
        ],
        [
            'key' => 'headers',
            'value' => [
                'Accept-Language' => 'en-US',
                'Content-Type' => 'application/octet-stream',
                'Custom-Header' => 'custom header content'
            ],
        ],
        [
            'key' => 'cookies',
            'value' => [
                ['key' => 'NID', 'value' => '1234567890'],
                ['key' => '1P_JAR', 'value' => '0987654321']
            ]
        ],
        [
            'key' => 'follow_redirects',
            'value' => true
        ],
        [
            'key' => 'successful_status_codes',
            'value' => [303, 808, 909]
        ],
        [
            'key' => 'http_method',
            'value' => 'get'
        ],
        [
            'key' => 'content',
            'value' => 'base64EncodedPOSTBody'
        ]
    ]
];

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, "https://realtime.oxylabs.io/v1/queries");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($params));
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");

$headers = array();
$headers[] = "Content-Type: application/json";
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);

$result = curl_exec($ch);
echo $result;

if (curl_errno($ch)) {
    echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
# The whole string you submit has to be URL-encoded.

https://realtime.oxylabs.io/v1/queries?source=universal_ecommerce&url=https%3A%2F%2Fstackoverflow.com%2Fquestions%2Ftagged%2Fpython&access_token=12345abcde
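
The encoded URL above can be produced with the standard library instead of encoding it by hand. A minimal sketch, assuming the same three query parameters as in the example (source, url, access_token); build_realtime_get_url is an illustrative name.

```python
from urllib.parse import urlencode

REALTIME_URL = 'https://realtime.oxylabs.io/v1/queries'


def build_realtime_get_url(target_url: str, access_token: str,
                           source: str = 'universal_ecommerce') -> str:
    """Build the GET URL with every query value percent-encoded."""
    query = urlencode({
        'source': source,
        'url': target_url,
        'access_token': access_token,
    })
    return REALTIME_URL + '?' + query


encoded = build_realtime_get_url(
    'https://stackoverflow.com/questions/tagged/python',
    '12345abcde',
)
print(encoded)
```

With these inputs the output matches the encoded URL shown above, including the %3A and %2F escapes in the target URL.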

And here is the same task performed via SuperAPI:

# A GET request could look something like this:
curl -k \
-x http://realtime.oxylabs.io:60000 \
-U user:pass1 \
"https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html" \
-H "X-Oxylabs-Session-Id: 1234567890abcdef" \
-H "X-Oxylabs-Geo-Location: India" \
-H "Accept-Language: en-US" \
-H "Content-Type: application/octet-stream" \
-H "Custom-Header: custom header content" \
-H "Cookie: NID=1234567890; 1P_JAR=0987654321" \
-H "X-Status-Code: 303, 808, 909"

# A POST request would have the same structure but contain a parameter specifying that it is a POST request:
curl -X POST \
-k \
-x http://realtime.oxylabs.io:60000 \
-U user:pass1 "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html" \
-H "X-Oxylabs-Session-Id: 1234567890abcdef" \
-H "X-Oxylabs-Geo-Location: India" \
-H "Custom-Header: custom header content" \
-H "Cookie: NID=1234567890; 1P_JAR=0987654321" \
-H "X-Status-Code: 303, 808, 909"

import requests
from pprint import pprint

# Define proxy dict. Do not forget to put your real user and pass here as well.
proxies = {
  'http': 'http://user:pass1@realtime.oxylabs.io:60000',
  'https': 'https://user:pass1@realtime.oxylabs.io:60000',
}

response = requests.request(
    'GET',
    'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
    auth=('user', 'pass1'),
    verify=False,  # Or accept our certificate.
    proxies=proxies,
)

# Print result page to stdout
pprint(response.text)

# Save returned HTML to result.html file
with open('result.html', 'w') as f:
    f.write(response.text)
<?php
$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_PROXY, 'realtime.oxylabs.io:60000');
curl_setopt($ch, CURLOPT_PROXYUSERPWD, "user" . ":" . "pass1");
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);

$result = curl_exec($ch);
echo $result;

if (curl_errno($ch)) {
    echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
HTTP method is not supported with SuperAPI
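
The SuperAPI cURL samples pass each setting as a request header. The sketch below collects the header names used in those samples into one helper; build_superapi_headers is a hypothetical name, and only headers that appear in the samples are produced. The resulting dictionary could be passed as the headers argument of the Python SuperAPI request shown above.

```python
from typing import Dict, Iterable, Optional


def build_superapi_headers(user_agent_type: Optional[str] = None,
                           session_id: Optional[str] = None,
                           geo_location: Optional[str] = None,
                           cookies: Optional[Dict[str, str]] = None,
                           status_codes: Optional[Iterable[int]] = None) -> Dict[str, str]:
    """Map the settings to the header names used in the cURL samples."""
    headers = {}
    if user_agent_type:
        headers['X-Oxylabs-User-Agent-Type'] = user_agent_type
    if session_id:
        headers['X-Oxylabs-Session-Id'] = session_id
    if geo_location:
        headers['X-Oxylabs-Geo-Location'] = geo_location
    if cookies:
        # Cookies are joined into a single Cookie header, as in the sample.
        headers['Cookie'] = '; '.join(
            '{}={}'.format(name, value) for name, value in cookies.items()
        )
    if status_codes:
        headers['X-Status-Code'] = ', '.join(str(code) for code in status_codes)
    return headers


headers = build_superapi_headers(
    user_agent_type='desktop_chrome',
    session_id='1234567890abcdef',
    geo_location='India',
    cookies={'NID': '1234567890', '1P_JAR': '0987654321'},
    status_codes=[303, 808, 909],
)
for name, value in headers.items():
    print('{}: {}'.format(name, value))
```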

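Push-Pull | Realtime | SuperAPI | Raw HTML | Parsed JSON | JS rendering

The universal_ecommerce source is designed to retrieve content from any URL on the internet. POSTing the parameters in JSON format to the queries endpoint (https://data.oxylabs.io/v1/queries for Push-Pull) submits the specified URL to the extraction queue.

Query Parameters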

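Parameter | Description | Default value
source | Data source | universal_ecommerce
url | Direct URL (link) to the target page | -
user_agent_type | Device type and browser. The full list is available here. | desktop
geo_location | Geo-location of the proxy used to retrieve the data. The full list of supported locations is available here. | -
locale | Locale, as expected in the Accept-Language header. | -
render | Enables JavaScript rendering. Use it when the target needs JavaScript to load its content. Works only with the Push-Pull (also known as Callback) method. There are two available values: html (returns the raw output) and png (returns a Base64-encoded screenshot). | -
content_encoding | Add this parameter if you are downloading images. Learn more here. | base64
context: content | Base64-encoded POST request body. Only useful when http_method is set to post. | -
context: cookies | Pass your own cookies. | -
context: follow_redirects | Indicates whether the scraper should follow redirects (3xx responses with a destination URL) to get the content of the URL at the end of the redirect chain. | -
context: headers | Pass your own headers. | -
context: http_method | Set it to post if you want to make a POST request to your target URL through the E-Commerce Universal Scraper. | get
context: session_id | Use this parameter if you want to keep the same proxy across multiple requests. Set the session to any string you like, and we will assign a proxy to this ID and keep it for up to 10 minutes. After that, if you make another request with the same session ID, a new proxy is assigned to that session ID. | -
context: successful_status_codes | Defines one or more custom HTTP response codes on which we should consider the scrape successful and return the content to you. This may be useful if you want us to return a 503 error page, or in other non-standard cases. | -
callback_url | URL of your callback endpoint | -
parse | true returns structured data, provided the submitted URL points to an e-commerce product page. Combine this parameter with parser_type to use our Adaptive Parser. | false
parser_type | Set the value to ecommerce_product to access the Adaptive Parser. | -
storage_type | Storage service provider. Currently only Amazon S3 is supported: s3. Full implementation details are available on the Upload to Storage page. Works only with the Push-Pull (Callback) method. | -
storage_url | Name of your Amazon S3 bucket. Works only with the Push-Pull (Callback) method. | -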

   - required parameter
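As a rough illustration of how the parameters in the table combine, the Python sketch below assembles a request body. It assumes that the `context` parameters are passed as a list of `{"key": ..., "value": ...}` objects, which this section does not spell out. `build_payload` and its argument names are hypothetical helpers, not part of the API.

```python
import base64


def build_payload(url, post_body=None, session_id=None, success_codes=None):
    """Assemble a universal_ecommerce request body (illustrative only)."""
    payload = {"source": "universal_ecommerce", "url": url}

    # ASSUMPTION: context parameters travel as a list of key/value objects.
    context = []
    if post_body is not None:
        # Per the table, `content` must be Base64-encoded and is only
        # useful when http_method is set to post.
        encoded = base64.b64encode(post_body.encode("utf-8")).decode("ascii")
        context.append({"key": "http_method", "value": "post"})
        context.append({"key": "content", "value": encoded})
    if session_id is not None:
        context.append({"key": "session_id", "value": session_id})
    if success_codes:
        context.append(
            {"key": "successful_status_codes", "value": list(success_codes)}
        )
    if context:
        payload["context"] = context
    return payload


example = build_payload(
    "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    post_body="q=attic",
    session_id="1234567890abcdef",
    success_codes=[503],
)
print(example)
```

The resulting dictionary could then be sent as the JSON body of the POST request, in the same way as the `payload` dictionaries in the existing Python samples.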

Parameter values

Geo_Location

The full list of supported geo-locations is available here. The results are provided in CSV format.

"United Arab Emirates",
"Albania",
"Armenia",
"Angola",
"Argentina",
"Australia",
...
"Uruguay",
"Uzbekistan",
"Venezuela Bolivarian Republic of",
"Viet Nam",
"South Africa",
"Zimbabwe"

HTTP_Method

The E-Commerce Universal Scraper supports two HTTP methods: GET (default) and POST.

"GET",
"POST"

Render

The E-Commerce Universal Scraper API can render JavaScript and return either the rendered HTML document or a PNG screenshot of the web page.

"html",
"png"
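The sketch below shows one way to validate the `render` value before submitting a job and to turn a Base64-encoded screenshot back into PNG bytes. Where the Base64 string sits inside the job result is not covered here, so the helper simply takes the string as an argument. `render_payload` and `decode_screenshot` are hypothetical names, and the sample data is a stand-in rather than a real screenshot.

```python
import base64

ALLOWED_RENDER_VALUES = {"html", "png"}
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first eight bytes of every PNG file


def render_payload(url, render):
    """Build a request body with JavaScript rendering enabled."""
    if render not in ALLOWED_RENDER_VALUES:
        raise ValueError(f"render must be one of {sorted(ALLOWED_RENDER_VALUES)}")
    return {"source": "universal_ecommerce", "url": url, "render": render}


def decode_screenshot(content_b64):
    """Decode a Base64 screenshot string into raw image bytes."""
    return base64.b64decode(content_b64)


# Stand-in for a returned screenshot: a PNG signature plus dummy bytes.
fake_screenshot = base64.b64encode(PNG_SIGNATURE + b"demo").decode("ascii")
image_bytes = decode_screenshot(fake_screenshot)

print(render_payload("https://books.toscrape.com/", "png"))
print(image_bytes.startswith(PNG_SIGNATURE))  # True
```

Writing `image_bytes` to a file opened in binary mode (`'wb'`) would produce the image on disk.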

User_Agent_Type

[
  {
    "user_agent_type": "desktop",
    "description": "Random desktop browser User-Agent"
  },
  {
    "user_agent_type": "desktop_firefox",
    "description": "Random User-Agent of one of the latest versions of desktop Firefox"
  },
  {
    "user_agent_type": "desktop_chrome",
    "description": "Random User-Agent of one of the latest versions of desktop Chrome"
  },
  {
    "user_agent_type": "desktop_opera",
    "description": "Random User-Agent of one of the latest versions of desktop Opera"
  },
  {
    "user_agent_type": "desktop_edge",
    "description": "Random User-Agent of one of the latest versions of desktop Edge"
  },
  {
    "user_agent_type": "desktop_safari",
    "description": "Random User-Agent of one of the latest versions of desktop Safari"
  },
  {
    "user_agent_type": "mobile",
    "description": "Random mobile browser User-Agent"
  },
  {
    "user_agent_type": "mobile_android",
    "description": "Random User-Agent of one of the latest versions of Android browser"
  },
  {
    "user_agent_type": "mobile_ios",
    "description": "Random User-Agent of one of the latest versions of iPhone browser"
  },
  {
    "user_agent_type": "tablet",
    "description": "Random tablet browser User-Agent"
  },
  {
    "user_agent_type": "tablet_android",
    "description": "Random User-Agent of one of the latest versions of Android tablet"
  },
  {
    "user_agent_type": "tablet_ios",
    "description": "Random User-Agent of one of the latest versions of iPad tablet"
  }
]

The full list of user_agent_type values can be downloaded here in JSON format.

Account status

Usage statistics

This query returns all-time statistics. You can get your daily or monthly usage by adding ?group_by=day or ?group_by=month.

curl --user user:pass1 'https://data.oxylabs.io/v2/stats'
import requests
from pprint import pprint

# Get response from stats endpoint.
response = requests.request(
    method='GET',
    url='https://data.oxylabs.io/v2/stats',
    auth=('user', 'pass1'),
)

# Print prettified JSON response to stdout.
pprint(response.json())
<?php

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, "https://data.oxylabs.io/v2/stats");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "GET");
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");

$result = curl_exec($ch);
echo $result;

if (curl_errno($ch)) {
    echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>

Sample output:

{
    "data": {
        "sources": [
            {
                "realtime_results_count": "90",
                "results_count": "10",
                "title": "universal_ecommerce"
            }
        ]
    },
    "meta": {
        "group_by": null
    }
}

You can look up your usage statistics by querying the following endpoint:

GET https://data.oxylabs.io/v2/stats

By default, the API returns all-time usage statistics. Adding ?group_by=month returns monthly statistics, while ?group_by=day returns daily figures.
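A minimal sketch of working with this endpoint without making a network call: it builds the stats URL for an optional `group_by` value and converts the string counters from the sample output into integers. `stats_url` and `counts_by_source` are hypothetical helper names, and the response shape is assumed to match the sample output shown above.

```python
from urllib.parse import urlencode

STATS_URL = "https://data.oxylabs.io/v2/stats"


def stats_url(group_by=None):
    """Return the stats URL, optionally grouped by day or month."""
    if group_by is None:
        return STATS_URL
    if group_by not in ("day", "month"):
        raise ValueError("group_by must be 'day' or 'month'")
    return STATS_URL + "?" + urlencode({"group_by": group_by})


def counts_by_source(stats):
    """Map each source title to its counters, converted from str to int."""
    return {
        source["title"]: {
            "realtime_results_count": int(source["realtime_results_count"]),
            "results_count": int(source["results_count"]),
        }
        for source in stats["data"]["sources"]
    }


# Same structure as the sample output in this section.
sample = {
    "data": {
        "sources": [
            {
                "realtime_results_count": "90",
                "results_count": "10",
                "title": "universal_ecommerce",
            }
        ]
    },
    "meta": {"group_by": None},
}

print(stats_url("month"))
print(counts_by_source(sample))
```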

Limits

curl --user user:pass1 'https://data.oxylabs.io/v2/stats/limits'
import requests
from pprint import pprint

# Get response from stats endpoint.
response = requests.request(
    method='GET',
    url='https://data.oxylabs.io/v2/stats/limits',
    auth=('user', 'pass1'),
)

# Print prettified JSON response to stdout.
pprint(response.json())
<?php

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, "https://data.oxylabs.io/v2/stats/limits");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "GET");
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");

$result = curl_exec($ch);
echo $result;

if (curl_errno($ch)) {
    echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>

Sample output:

{
    "monthly_requests_commitment": 4500000,
    "used_requests": 985000
}

The following endpoint returns your monthly commitment and how much of it you have already used:

GET https://data.oxylabs.io/v2/stats/limits
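For illustration, the snippet below derives the remaining monthly allowance and the share already used from a response shaped like the sample output. `remaining_requests` is a hypothetical helper, and the numbers are the ones from the sample, not real account data.

```python
def remaining_requests(limits):
    """Return (requests left this month, fraction of the commitment used)."""
    commitment = limits["monthly_requests_commitment"]
    used = limits["used_requests"]
    remaining = max(commitment - used, 0)
    used_share = used / commitment if commitment else 0.0
    return remaining, used_share


# Values copied from the sample output in this section.
sample_limits = {"monthly_requests_commitment": 4500000, "used_requests": 985000}

remaining, used_share = remaining_requests(sample_limits)
print(remaining)             # 3515000
print(f"{used_share:.1%}")   # 21.9%
```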

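Response codes

Code | Status | Description
204 | No content | You are trying to retrieve a job that has not been completed yet.
400 | Multiple error messages | Bad request structure. This could be a misspelled parameter or an invalid value. The response body contains a more specific error message.
401 | "Authorization header not provided" / "Invalid authorization header" / "Client not found" | Missing authorization header or incorrect login credentials.
403 | Forbidden | Your account does not have access to this resource.
404 | Not found | The job ID you are looking for is no longer available.
429 | Too many requests | Rate limit exceeded. Please contact your account manager to increase the limit.
500 | Unknown error | Service unavailable.
524 | Timeout | Service unavailable.
612 | Undefined internal error | Something went wrong and we failed the job you submitted. You can retry at no extra cost, as we do not charge for faulted jobs. If that does not work, please get in touch with us.
613 | Faulted after too many retries | We tried scraping the job you submitted, but gave up after reaching our retry limit. You can retry at no extra cost, as we do not charge for faulted jobs. If that does not work, please get in touch with us.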
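One possible way to act on these codes in client code is sketched below. The mapping is an interpretation of the table, not behaviour prescribed by the API. In particular, treating 500 and 524 as "retry later" is an assumption, and `next_action` and its return labels are hypothetical.

```python
def next_action(status_code):
    """Suggest a client-side reaction to a response code from the table."""
    if status_code == 204:
        return "poll_again"    # job not finished yet
    if status_code == 429:
        return "back_off"      # rate limit exceeded
    if status_code in (612, 613):
        return "resubmit"      # faulted jobs are not charged
    if status_code in (500, 524):
        return "retry_later"   # ASSUMPTION: service is temporarily unavailable
    if status_code in (400, 401, 403, 404):
        return "fix_request"   # repeating the same request will not help
    return "unknown"


for code in (204, 400, 429, 524, 613):
    print(code, next_action(code))
```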