Quick Start
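Scraper API is built to take the heavy, tedious work out of data retrieval. You can use it to access all kinds of public web pages and scrape their data with minimal effort.
Scraper API uses basic HTTP authentication, which requires sending a username and password.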
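This is by far the fastest way to start using Scraper API. Using the Realtime integration method, you will send a request to https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html from a United States geo-location and retrieve parsed JSON data. If you would rather get the HTML page content instead of parsed data, simply remove the parse and parser_type parameters. Do not forget to replace USERNAME and PASSWORD with your own user credentials.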
curl --user "USERNAME:PASSWORD" 'https://realtime.oxylabs.io/v1/queries' -H "Content-Type: application/json" -d '{"source": "universal_ecommerce", "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "geo_location": "United States", "parser_type": "ecommerce_product", "parse": true}'
If you have any questions not covered by this documentation, please contact your account manager or our support staff at support@oxylabs.io.
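Integration Methods
Scraper API supports three integration methods, each with its own advantages:
- Push-Pull. With this method, you do not need to keep an active connection to our endpoint in order to retrieve data. After you submit a request, our system can automatically ping your server once the job is done (see Callback). This method saves computing resources and scales easily.
- Realtime. This method requires you to keep an active connection to our endpoint in order to receive the results once the job is done. It can be implemented as a single service, whereas Push-Pull is a two-step process.
- SuperAPI. This method is very similar to Realtime, but instead of posting data to our endpoint you use the HTML crawler as a proxy. To retrieve data, you set up a proxy endpoint and make a GET request to the desired URL. Additional parameters must be added as headers.
Our recommended data extraction method is Push-Pull.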
Push-Pull
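This is the simplest, most reliable, and recommended data delivery method. In the Push-Pull scenario, you send us a query, we return a job id, and once the job is done you use that id to retrieve the content from the /results endpoint. You can check the job status yourself, or you can set up a simple listener that accepts POST requests. That way, we send you a callback message as soon as the job is ready to be retrieved. In this particular example, the results are also uploaded automatically to your S3 bucket named YOUR_BUCKET_NAME.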
Single Query
curl --user user:pass1 \
'https://data.oxylabs.io/v1/queries' \
-H "Content-Type: application/json" \
-d '{"source": "universal_ecommerce", "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "callback_url": "https://your.callback.url", "storage_type": "s3", "storage_url": "YOUR_BUCKET_NAME"}'
import requests
from pprint import pprint
# Structure payload.
payload = {
'source': 'universal_ecommerce',
'url': 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
'callback_url': 'https://your.callback.url',
'storage_type': 's3',
'storage_url': 'YOUR_BUCKET_NAME'
}
# Get response.
response = requests.request(
'POST',
'https://data.oxylabs.io/v1/queries',
auth=('user', 'pass1'),
json=payload,
)
# Print prettified response to stdout.
pprint(response.json())
<?php
$params = array(
'source' => 'universal_ecommerce',
'url' => 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
'callback_url' => 'https://your.callback.url',
'storage_type' => 's3',
'storage_url' => 'YOUR_BUCKET_NAME'
);
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://data.oxylabs.io/v1/queries");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($params));
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");
$headers = array();
$headers[] = "Content-Type: application/json";
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$result = curl_exec($ch);
echo $result;
if (curl_errno($ch)) {
echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
HTTP method is currently not supported
The API responds with the query information in JSON format in the response body, similar to this:
{
"callback_url": "https://your.callback.url",
"client_id": 5,
"created_at": "2019-10-01 00:00:01",
"domain": "com",
"geo_location": null,
"id": "12345678900987654321",
"limit": 10,
"locale": null,
"pages": 1,
"parse": false,
"render": null,
"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
"source": "universal_ecommerce",
"start_page": 1,
"status": "pending",
"storage_type": "s3",
"storage_url": "YOUR_BUCKET_NAME/12345678900987654321.json",
"subdomain": "www",
"updated_at": "2019-10-01 00:00:01",
"user_agent_type": "desktop",
"_links": [
{
"rel": "self",
"href": "http://data.oxylabs.io/v1/queries/12345678900987654321",
"method": "GET"
},
{
"rel": "results",
"href": "http://data.oxylabs.io/v1/queries/12345678900987654321/results",
"method": "GET"
}
]
}
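The https://data.oxylabs.io/v1/queries endpoint handles a single query for one keyword or URL. The API returns a confirmation message containing the job information, including the job id. You can use that id to check the job status, or you can include callback_url in the query to have us ping your callback endpoint once the scraping task is done.
Query parameters must be posted as data in the JSON body.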
Check Job Status
curl --user user:pass1 'http://data.oxylabs.io/v1/queries/12345678900987654321'
import requests
from pprint import pprint
# Get the job information from the job status endpoint.
response = requests.request(
method='GET',
url='http://data.oxylabs.io/v1/queries/12345678900987654321',
auth=('user', 'pass1'),
)
# Print prettified JSON response to stdout.
pprint(response.json())
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://data.oxylabs.io/v1/queries/12345678900987654321");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "GET");
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");
$result = curl_exec($ch);
echo $result;
if (curl_errno($ch)) {
echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
HTTP method is currently not supported
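The API responds with the query information in JSON format in the response body. Note that the job status has changed to done. You can now retrieve the content by querying http://data.oxylabs.io/v1/queries/12345678900987654321/results. You can also see that the job was updated_at 2019-10-01 00:00:15, so the query took 14 seconds to complete.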
{
"client_id": 5,
"created_at": "2019-10-01 00:00:01",
"domain": "com",
"geo_location": null,
"id": "12345678900987654321",
"limit": 10,
"locale": null,
"pages": 1,
"parse": false,
"render": null,
"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
"source": "universal_ecommerce",
"start_page": 1,
"status": "done",
"subdomain": "www",
"updated_at": "2019-10-01 00:00:15",
"user_agent_type": "desktop",
"_links": [
{
"rel": "self",
"href": "http://data.oxylabs.io/v1/queries/12345678900987654321",
"method": "GET"
},
{
"rel": "results",
"href": "http://data.oxylabs.io/v1/queries/12345678900987654321/results",
"method": "GET"
}
]
}
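If your query included a callback_url, we send you a message containing a link to the content once the scraping task is done. If it did not, you need to check the job status yourself. To do that, use the URL in the href under rel:self in the response you received after submitting your query to our API. It should look similar to this: http://data.oxylabs.io/v1/queries/12345678900987654321.
Querying this link returns the job information, including its status. There are three possible status values: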
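pending | The job is still in the queue and has not been completed.
done | The job is complete; you can retrieve the results by querying the URL in the href under rel:results: http://data.oxylabs.io/v1/queries/12345678900987654321/results
faulted | Something went wrong with the job and we could not complete it, most likely due to a server error on the target site's side.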
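The status handling described above can be sketched as a small polling helper. This is only an illustration under stated assumptions: `find_link` and `wait_for_job` are hypothetical names, the polling interval and check limit are arbitrary, and `fetch_job` stands for whatever function you use to GET the rel:self URL and return the decoded JSON.

```python
import time
from typing import Callable, Optional


def find_link(job: dict, rel: str) -> Optional[str]:
    """Return the href of the first entry in _links whose rel matches."""
    for link in job.get("_links", []):
        if link.get("rel") == rel:
            return link.get("href")
    return None


def wait_for_job(
    fetch_job: Callable[[], dict],
    interval: float = 5.0,      # assumption: seconds between status checks
    max_checks: int = 60,       # assumption: give up after this many checks
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll the job until it is done and return the results URL."""
    for _ in range(max_checks):
        job = fetch_job()
        status = job.get("status")
        if status == "done":
            url = find_link(job, "results")
            if url is None:
                raise RuntimeError("job is done but has no results link")
            return url
        if status == "faulted":
            raise RuntimeError("job faulted and will not produce results")
        # Any other status (for example "pending") means: wait and check again.
        sleep(interval)
    raise TimeoutError("job did not finish within the allowed number of checks")
```

Passing `sleep` and `fetch_job` in as arguments keeps the helper independent of any particular HTTP client and easy to exercise without network access.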
Retrieve Job Content
curl --user user:pass1 'http://data.oxylabs.io/v1/queries/12345678900987654321/results'
import requests
from pprint import pprint
# Get the job content from the results endpoint.
response = requests.request(
method='GET',
url='http://data.oxylabs.io/v1/queries/12345678900987654321/results',
auth=('user', 'pass1'),
)
# Print the prettified JSON response to stdout.
pprint(response.json())
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://data.oxylabs.io/v1/queries/12345678900987654321/results");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "GET");
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");
$result = curl_exec($ch);
echo $result;
if (curl_errno($ch)) {
echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
HTTP method is currently not supported
The API returns the job content:
{
"results": [
{
"content": "<!doctype html><html>
CONTENT
</html>",
"created_at": "2019-10-01 00:00:01",
"updated_at": "2019-10-01 00:00:15",
"page": 1,
"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
"job_id": "12345678900987654321",
"status_code": 200
}
]
}
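Once you know from the job status check that the job is ready to be retrieved, you can get it using the URL in the href under rel:results in our initial response. It should look similar to this: http://data.oxylabs.io/v1/queries/12345678900987654321/results.
Results can be retrieved automatically, without periodically checking the job status, by setting up a Callback service. You need to specify the IP or domain of the server running the callback service. When our system completes a job, it sends a message to the provided IP or domain, and the callback service downloads the results, as described in the callback implementation example.
Callback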
# Please see the code samples in Python and PHP.
# This is a simple Sanic web server with a route listening for callbacks on localhost:8080.
# It will print job results to stdout.
import requests
from pprint import pprint
from sanic import Sanic, response

AUTH_TUPLE = ('user', 'pass1')
# Recent Sanic releases require an application name.
app = Sanic('job_listener')

# Define /job_listener endpoint that accepts POST requests.
@app.route('/job_listener', methods=['POST'])
async def job_listener(request):
    try:
        res = request.json
        links = res.get('_links', [])
        for link in links:
            if link['rel'] == 'results':
                # Sanic is async, but requests is synchronous; to take full
                # advantage of Sanic, use aiohttp.
                res_response = requests.request(
                    method='GET',
                    url=link['href'],
                    auth=AUTH_TUPLE,
                )
                pprint(res_response.json())
                break
    except Exception as e:
        print("Listener exception: {}".format(e))
    return response.json(status=200, body={'status': 'ok'})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
<?php
$stdout = fopen('php://stdout', 'w');
if (isset($_POST)) {
$result = array_merge($_POST, (array) json_decode(file_get_contents('php://input')));
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://data.oxylabs.io/v1/queries/".$result['id'].'/results');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "GET");
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");
$result = curl_exec($ch);
fwrite($stdout, $result);
if (curl_errno($ch)) {
echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
}
?>
HTTP method is currently not supported
Sample callback output
{
"created_at":"2019-10-01 00:00:01",
"updated_at":"2019-10-01 00:00:15",
"locale":null,
"client_id":163,
"user_agent_type":"desktop",
"source":"universal_ecommerce",
"pages":1,
"subdomain":"www",
"status":"done",
"start_page":1,
"parse":0,
"render":null,
"priority":0,
"ttl":0,
"origin":"api",
"persist":true,
"id":"12345678900987654321",
"callback_url":"http://your.callback.url/",
"url":"https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
"domain":"de",
"limit":10,
"geo_location":null,
{...}
"_links":[
{
"href":"https://data.oxylabs.io/v1/queries/12345678900987654321",
"method":"GET",
"rel":"self"
},
{
"href":"https://data.oxylabs.io/v1/queries/12345678900987654321/results",
"method":"GET",
"rel":"results"
}
],
}
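A callback is a POST request that we send to your machine to notify you that the data extraction task has been completed, along with a URL for downloading the scraped content. This means you no longer need to check the job status manually. We notify you as soon as the data is ready, and all you need to do is retrieve it.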
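For readers who prefer not to install a web framework, a callback receiver along these lines could also be sketched with the standard library alone. The names `results_url_from_callback`, `CallbackHandler`, and `serve` are hypothetical, and the port is an arbitrary choice; the only thing taken from the sample callback output is the `_links` structure.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Optional


def results_url_from_callback(body: bytes) -> Optional[str]:
    """Extract the rel:results href from a callback request body."""
    payload = json.loads(body)
    for link in payload.get("_links", []):
        if link.get("rel") == "results":
            return link.get("href")
    return None


class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes announced by the client.
        length = int(self.headers.get("Content-Length", 0))
        url = results_url_from_callback(self.rfile.read(length))
        # A real service would now download the content from this URL.
        print("Results ready at:", url)
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Start the listener; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), CallbackHandler).serve_forever()
```

Calling `serve()` starts the listener; the URL of the machine running it is what you would pass as callback_url.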
Batch Query
curl --user user:pass1 'https://data.oxylabs.io/v1/queries/batch' -H 'Content-Type: application/json' \
-d '@keywords.json'
import requests
import json
from pprint import pprint
# Get payload from file.
with open('keywords.json', 'r') as f:
    payload = json.loads(f.read())
response = requests.request(
'POST',
'https://data.oxylabs.io/v1/queries/batch',
auth=('user', 'pass1'),
json=payload,
)
# Print prettified response.
pprint(response.json())
<?php
$paramsFile = file_get_contents(realpath("keywords.json"));
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://data.oxylabs.io/v1/queries/batch");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $paramsFile);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");
$headers = array();
$headers[] = "Content-Type: application/json";
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$result = curl_exec($ch);
echo $result;
if (curl_errno($ch)) {
echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
HTTP method is currently not supported
Contents of keywords.json:
{
"url":[
"https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
"https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
"https://books.toscrape.com/catalogue/soumission_998/index.html"
],
"source": "universal_ecommerce",
"callback_url": "https://your.callback.url"
}
The API responds with the query information in JSON format in the response body, similar to this:
{
"queries": [
{
"callback_url": "https://your.callback.url",
{...}
"created_at": "2019-10-01 00:00:01",
"domain": "com",
"id": "12345678900987654321",
{...}
"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
"source": "universal_ecommerce",
{...}
"rel": "results",
"href": "http://data.oxylabs.io/v1/queries/12345678900987654321/results",
"method": "GET"
}
]
},
{
"callback_url": "https://your.callback.url",
{...}
"created_at": "2019-10-01 00:00:01",
"domain": "com",
"id": "12345678901234567890",
{...}
"url": "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
"source": "universal_ecommerce",
{...}
"rel": "results",
"href": "http://data.oxylabs.io/v1/queries/12345678901234567890/results",
"method": "GET"
}
]
},
{
"callback_url": "https://your.callback.url",
{...}
"created_at": "2019-10-01 00:00:01",
"domain": "com",
"id": "01234567899876543210",
{...}
"url": "https://books.toscrape.com/catalogue/soumission_998/index.html",
"source": "universal_ecommerce",
{...}
"rel": "results",
"href": "http://data.oxylabs.io/v1/queries/01234567899876543210/results",
"method": "GET"
}
]
}
]
}
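In addition, Scraper API accepts multiple URLs per query, up to 1,000 URLs per batch. The https://data.oxylabs.io/v1/queries/batch endpoint submits multiple URLs to the extraction queue.
Query parameters must be posted as data in the JSON body.
The system handles each URL as a separate request. If you provide a callback URL, you will receive a separate call for each URL. Otherwise, our initial response will contain the job id for every URL. For example, if you send 50 URLs, we return 50 unique job ids.
Important! url is the only parameter that can have multiple values. All other parameters are the same for every query in the batch.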
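When a URL list is longer than the batch limit, it has to be split before submission. The sketch below shows one way to do that; `build_batches` is a hypothetical helper, and the payload layout simply mirrors the keywords.json sample (a list under `url` plus shared `source` and optional `callback_url`).

```python
from typing import List, Optional

MAX_BATCH_SIZE = 1000  # per-batch limit stated in this section


def build_batches(
    urls: List[str],
    source: str = "universal_ecommerce",
    callback_url: Optional[str] = None,
    batch_size: int = MAX_BATCH_SIZE,
) -> List[dict]:
    """Split urls into batch payloads of at most batch_size items each."""
    if not 1 <= batch_size <= MAX_BATCH_SIZE:
        raise ValueError("batch_size must be between 1 and 1000")
    payloads = []
    for start in range(0, len(urls), batch_size):
        # Only the URL list differs between batches; other fields are shared.
        payload = {"url": urls[start:start + batch_size], "source": source}
        if callback_url is not None:
            payload["callback_url"] = callback_url
        payloads.append(payload)
    return payloads
```

Each returned dictionary can then be posted as the JSON body of one batch request.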
Get Notifier IP Address List
curl --user user:pass1 'https://data.oxylabs.io/v1/info/callbacker_ips'
import requests
from pprint import pprint
# Get response from the callback IPs endpoint.
response = requests.request(
method='GET',
url='https://data.oxylabs.io/v1/info/callbacker_ips',
auth=('user', 'pass1'),
)
# Print the prettified JSON response to stdout.
pprint(response.json())
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://data.oxylabs.io/v1/info/callbacker_ips");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "GET");
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");
$result = curl_exec($ch);
echo $result;
if (curl_errno($ch)) {
echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
HTTP method is currently not supported
The API returns the list of IPs that make callback requests to your system:
{
"ips": [
"x.x.x.x",
"y.y.y.y"
]
}
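You may want to whitelist the IPs that send you callback messages, or obtain the list of these IPs for other purposes. You can do so with a GET request to this endpoint: https://data.oxylabs.io/v1/info/callbacker_ips.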
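As an illustration of the whitelisting use case, the check below compares an incoming request's address against the `ips` list returned by the endpoint. `is_callbacker_ip` is a hypothetical helper, and the addresses in the usage comment are documentation-range placeholders, not real notifier IPs.

```python
import ipaddress
from typing import Iterable


def is_callbacker_ip(remote_addr: str, allowed_ips: Iterable[str]) -> bool:
    """Return True if remote_addr is one of the allowed notifier IPs."""
    try:
        remote = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Not a valid IP address at all, so it cannot be on the list.
        return False
    # Parsing both sides normalizes different textual forms of the same IP.
    allowed = {ipaddress.ip_address(ip) for ip in allowed_ips}
    return remote in allowed


# Usage with placeholder addresses:
# is_callbacker_ip("203.0.113.10", ["203.0.113.10", "198.51.100.7"])
```

A callback listener could call this with the peer address of each incoming request and reject anything that is not on the list.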
Upload to Storage
{
"Version": "2012-10-17",
"Id": "Policy1577442634787",
"Statement": [
{
"Sid": "Stmt1577442633719",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::324311890426:user/oxylabs.s3.uploader"
},
"Action": "s3:GetBucketLocation",
"Resource": "arn:aws:s3:::YOUR_BUCKET_NAME"
},
{
"Sid": "Stmt1577442633719",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::324311890426:user/oxylabs.s3.uploader"
},
"Action": [
"s3:PutObject",
"s3:PutObjectAcl"
],
"Resource": "arn:aws:s3:::YOUR_BUCKET_NAME/*"
}
]
}
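By default, E-Commerce Scraper API job results are stored in our database. This means you need to query our results endpoint and retrieve the content yourself. The custom storage feature lets you store the results in your own cloud storage instead. The advantage of this feature is that you do not need to make extra requests to get the results: everything goes straight to your bucket.
Currently, we only support Amazon S3. If you would like to use a different type of storage, please contact your account manager to discuss the timeline.
To have job results uploaded to your Amazon S3 bucket, you need to set up special permissions. To do so, go to https://s3.console.aws.amazon.com/ > S3 > Storage > Bucket Name (if you do not have one, create a new one) > Permissions > Bucket Policy.
The bucket policy is shown in the JSON sample in this section. Do not forget to change the bucket name under YOUR_BUCKET_NAME. This policy allows us to write to your bucket, upload files for you, and know the bucket's location.
To use this feature, you need to specify two additional parameters in your request: storage_type and storage_url (see Query Parameters).
The upload path looks like this: YOUR_BUCKET_NAME/job_ID.json. You will find the job ID in the response body we send you after you submit a request. In this example, the job ID is 12345678900987654321.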
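The relationship between the two storage parameters and the resulting upload path can be sketched as follows. Both function names are hypothetical; the path format is the `YOUR_BUCKET_NAME/job_ID.json` pattern described above.

```python
def storage_params(bucket_name: str) -> dict:
    """Extra request parameters that ask for results to be uploaded to S3."""
    return {"storage_type": "s3", "storage_url": bucket_name}


def expected_upload_path(bucket_name: str, job_id: str) -> str:
    """Path where the result file for a given job is expected to appear."""
    return "{}/{}.json".format(bucket_name, job_id)


# Merge the storage parameters into an ordinary query payload.
payload = {
    "source": "universal_ecommerce",
    "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
}
payload.update(storage_params("YOUR_BUCKET_NAME"))
```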
Realtime
curl --user user:pass1 'https://realtime.oxylabs.io/v1/queries' -H "Content-Type: application/json" \
-d '{"source": "universal_ecommerce", "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"}'
import requests
from pprint import pprint
# Structure payload.
payload = {
'source': 'universal_ecommerce',
'url': 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
}
# Get response.
response = requests.request(
'POST',
'https://realtime.oxylabs.io/v1/queries',
auth=('user', 'pass1'),
json=payload,
)
# Instead of response with job status and results url, this will return the
# JSON response with results.
pprint(response.json())
<?php
$params = array(
'source' => 'universal_ecommerce',
'url' => 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
);
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://realtime.oxylabs.io/v1/queries");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($params));
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");
$headers = array();
$headers[] = "Content-Type: application/json";
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$result = curl_exec($ch);
echo $result;
if (curl_errno($ch)) {
echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
# URL has to be encoded to escape `&` and `=` characters. It is not necessary in this example.
https://realtime.oxylabs.io/v1/queries?source=universal_ecommerce&url=https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html&access_token=12345abcde
Example of the response body returned over the open connection:
{
"results": [
{
"content": "<html>
CONTENT
</html>",
"created_at": "2019-10-01 00:00:01",
"updated_at": "2019-10-01 00:00:15",
"id": null,
"page": 1,
"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
"job_id": "12345678900987654321",
"status_code": 200
}
]
}
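Data submission works the same way as with the Push-Pull method, but with Realtime we return the content over the open connection. You send us a query, the connection stays open, we retrieve the content, and we return it to you. The endpoint that handles this is https://realtime.oxylabs.io/v1/queries.
The open connection has a timeout limit of 150 seconds, so in rare cases of heavy load we may not be able to deliver the data to you.
Query parameters must be posted as data in the JSON body. See the examples for details.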
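Because the connection may stay open for up to 150 seconds, the client-side timeout should not be shorter than that. The following standard-library sketch builds such a request; `build_realtime_request` and `fetch_realtime` are hypothetical names, and the 160-second value is an assumption chosen only to sit slightly above the documented limit.

```python
import base64
import json
import urllib.request

REALTIME_URL = "https://realtime.oxylabs.io/v1/queries"
# Assumption: a socket timeout a little above the 150-second server limit.
TIMEOUT_SECONDS = 160


def build_realtime_request(username: str, password: str, payload: dict) -> urllib.request.Request:
    """Build a POST request with a JSON body and HTTP basic authentication."""
    token = base64.b64encode("{}:{}".format(username, password).encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        REALTIME_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
        method="POST",
    )


def fetch_realtime(username: str, password: str, payload: dict) -> dict:
    """Send the query and return the decoded JSON response."""
    request = build_realtime_request(username, password, payload)
    with urllib.request.urlopen(request, timeout=TIMEOUT_SECONDS) as resp:
        return json.load(resp)
```

`fetch_realtime` performs the actual network call, so it needs valid credentials; `build_realtime_request` can be inspected offline.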
SuperAPI
curl -k \
-x realtime.oxylabs.io:60000 \
-U user:pass1 \
-H "X-Oxylabs-User-Agent-Type: desktop_chrome" \
"https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
import requests
from pprint import pprint
# Define proxy dict. Do not forget to put your real user and pass here as well.
proxies = {
'http': 'http://user:pass1@realtime.oxylabs.io:60000',
'https': 'https://user:pass1@realtime.oxylabs.io:60000',
}
response = requests.request(
'GET',
'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
auth=('user', 'pass1'),
verify=False, # Or accept our certificate.
proxies=proxies,
)
# Print result page to stdout
pprint(response.text)
# Save returned HTML to result.html file
with open('result.html', 'w') as f:
    f.write(response.text)
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_PROXY, 'realtime.oxylabs.io:60000');
curl_setopt($ch, CURLOPT_PROXYUSERPWD, "user" . ":" . "pass1");
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
$result = curl_exec($ch);
echo $result;
if (curl_errno($ch)) {
echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
HTTP method is not supported with SuperAPI
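If you have ever used regular proxies for data scraping, integrating the SuperAPI delivery method will be a breeze. All you need to do is use our entry node as a proxy, authorize with your Scraper API credentials, and ignore certificates. In cURL, that is -k or --insecure. Your data will reach you over the open connection.
SuperAPI supports only a handful of parameters, because it works only with the Direct data source, where a full URL is provided. These parameters should be sent as headers. Here is the list of accepted parameters:
X-Oxylabs-User-Agent-Type | There is no way to specify a particular User-Agent string, but you can tell us which browser and platform to use. See User_Agent_Type under Parameter Values for the list of supported values.
If you need help setting up SuperAPI, email support@oxylabs.io.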
Content Types
Scraper API returns raw HTML as well as structured JSON.
Download Images
import base64
import json
import requests
# Your credentials.
USERNAME = ''
PASSWORD = ''
# Image URL which will be saved to file.
URL_IMAGE = 'https://example.com/image.jpg'
# Realtime URL.
API_URL = f'http://{USERNAME}:{PASSWORD}@realtime.oxylabs.io/v1/queries'
def dump_to_file(filename: str, data: bytes):
    with open(filename, 'wb') as file:
        file.write(data)

def main():
    parameters = {
        'source': 'universal_ecommerce',
        'url': URL_IMAGE,
        'content_encoding': 'base64',
    }
    response = requests.post(API_URL, json=parameters)
    if response.ok:
        data = json.loads(response.text)
        content_base64 = data['results'][0]['content']
        # Decode base64 encoded data into bytes.
        content = base64.b64decode(content_base64)
        dump_to_file('out.jpg', content)

if __name__ == '__main__':
    main()
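You can download images through Scraper API. If you download via SuperAPI, you can simply save the output to a file with an image extension. For example:
curl -k -x realtime.oxylabs.io:60000 -U user:pass1 "https://example.com/image.jpg" > image.jpg
If you use the Push-Pull or Realtime method, you need to add a content_encoding parameter with the value base64. After you receive the results, decode the encoded data in content into bytes and save it as an image file. See the Python example in this section.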
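Data Sources
Scraper API accepts a URL along with other parameters such as User-Agent type and proxy location. See the method below, which we call Direct.
Scraper API can render JavaScript while scraping. This lets you get more data from a web page, as well as screenshots.
If you are unsure about any part of this documentation, drop us a line at support@oxylabs.io or contact your account manager.
Direct
In this example, the API retrieves a universal e-commerce product page using the Push-Pull method. All available parameters are included (even though they are not always necessary or compatible within the same request) to show how to format your request: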
curl --user user:pass1 \
'https://data.oxylabs.io/v1/queries' \
-H "Content-Type: application/json" \
-d '{"source": "universal_ecommerce", "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "user_agent_type": "mobile", "render": "html", "context": [{"key": "headers", "value": {"Accept-Language": "en-US", "Content-Type": "application/octet-stream", "Custom-Header": "custom header content"}}, {"key": "cookies", "value": [{"key": "NID", "value": "1234567890"}, {"key": "1P_JAR", "value": "0987654321"}]}, {"key": "follow_redirects", "value": true}, {"key": "http_method", "value": "get"}, {"key": "content", "value": "base64EncodedPOSTBody"}, {"key": "successful_status_codes", "value": [303, 808, 909]}]}'
import requests
from pprint import pprint
# Structure payload.
payload = {
'source': 'universal_ecommerce',
'url': 'https://www.etsy.com/listing/399423455/big-glass-house-planter-handmade-glass?ref=hp_prn&frs=1',
'user_agent_type': 'desktop',
'geo_location': 'United States',
'parse': True,
'parser_type': "ecommerce_product",
'context': [
{
'key': 'session_id',
'value': '1234567890abcdef'
},
{
'key': 'headers', 'value':
{
'Accept-Language': 'en-US',
'Content-Type': 'application/octet-stream',
'Custom-Header': 'custom header content'
}
},
{
'key': 'cookies',
'value': [{
'key': 'NID',
'value': '1234567890'
},
{
'key': '1P_JAR',
'value': '0987654321'
}
]
},
{
'key': 'follow_redirects',
'value': True
},
{
'key': 'successful_status_codes',
'value': [303, 808, 909]
},
{
'key': 'http_method',
'value': 'get'
},
{
'key': 'content',
'value': 'base64EncodedPOSTBody'
}
],
'callback_url': 'https://your.callback.url',
}
# Get response.
response = requests.request(
'POST',
'https://data.oxylabs.io/v1/queries',
auth=('user', 'pass1'),
json=payload,
)
# Print prettified response to stdout.
pprint(response.json())
<?php
$params = [
'source' => 'universal_ecommerce',
'url' => 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
'context' => [
[
'key' => 'session_id',
'value' => '1234567890abcdef'
],
[
'key' => 'headers',
'value' => [
'Accept-Language' => 'en-US',
'Content-Type' => 'application/octet-stream',
'Custom-Header' => 'custom header content'
],
],
[
'key' => 'cookies',
'value' => [
['key' => 'NID', 'value' => '1234567890'],
['key' => '1P_JAR', 'value' => '0987654321']
]
],
[
'key' => 'follow_redirects',
'value' => true
],
[
'key' => 'successful_status_codes',
'value' => [303, 808, 909]
],
[
'key' => 'http_method',
'value' => 'get'
],
[
'key' => 'content',
'value' => 'base64EncodedPOSTBody'
]
]
];
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://data.oxylabs.io/v1/queries");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($params));
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");
$headers = array();
$headers[] = "Content-Type: application/json";
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$result = curl_exec($ch);
echo $result;
if (curl_errno($ch)) {
echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
HTTP method is currently not supported with Push-Pull
Here is the same example performed with Realtime:
curl --user user:pass1 \
'https://realtime.oxylabs.io/v1/queries' \
-H "Content-Type: application/json" \
-d '{"source": "universal_ecommerce", "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "user_agent_type": "mobile", "context": [{"key": "headers", "value": {"Accept-Language": "en-US", "Content-Type": "application/octet-stream", "Custom-Header": "custom header content"}}, {"key": "cookies", "value": [{"key": "NID", "value": "1234567890"}, {"key": "1P_JAR", "value": "0987654321"}]}, {"key": "follow_redirects", "value": true}, {"key": "http_method", "value": "get"}, {"key": "content", "value": "base64EncodedPOSTBody"}, {"key": "successful_status_codes", "value": [303, 808, 909]}]}'
import requests
from pprint import pprint
# Structure payload.
payload = {
'source': 'universal_ecommerce',
'url': 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
'user_agent_type': 'mobile',
'geo_location': 'United States',
'context': [
{
'key': 'session_id',
'value': '1234567890abcdef'
},
{
'key': 'headers', 'value':
{
'Accept-Language': 'en-US',
'Content-Type': 'application/octet-stream',
'Custom-Header': 'custom header content'
}
},
{
'key': 'cookies',
'value': [{
'key': 'NID',
'value': '1234567890'
},
{
'key': '1P_JAR',
'value': '0987654321'
}
]
},
{
'key': 'follow_redirects',
'value': True
},
{
'key': 'successful_status_codes',
'value': [303, 808, 909]
},
{
'key': 'http_method',
'value': 'get'
},
{
'key': 'content',
'value': 'base64EncodedPOSTBody'
}
],
}
# Get response.
response = requests.request(
'POST',
'https://realtime.oxylabs.io/v1/queries',
auth=('user', 'pass1'),
json=payload,
)
# Instead of response with job status and results url, this will return the
# JSON response with the result.
pprint(response.json())
<?php
$params = [
'source' => 'universal_ecommerce',
'url' => 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
'context' => [
[
'key' => 'session_id',
'value' => '1234567890abcdef'
],
[
'key' => 'headers',
'value' => [
'Accept-Language' => 'en-US',
'Content-Type' => 'application/octet-stream',
'Custom-Header' => 'custom header content'
],
],
[
'key' => 'cookies',
'value' => [
['key' => 'NID', 'value' => '1234567890'],
['key' => '1P_JAR', 'value' => '0987654321']
]
],
[
'key' => 'follow_redirects',
'value' => true
],
[
'key' => 'successful_status_codes',
'value' => [303, 808, 909]
],
[
'key' => 'http_method',
'value' => 'get'
],
[
'key' => 'content',
'value' => 'base64EncodedPOSTBody'
]
]
];
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://realtime.oxylabs.io/v1/queries");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($params));
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");
$headers = array();
$headers[] = "Content-Type: application/json";
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$result = curl_exec($ch);
echo $result;
if (curl_errno($ch)) {
echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
# The whole string you submit has to be URL-encoded.
https://realtime.oxylabs.io/v1/queries?source=universal_ecommerce&url=https%3A%2F%2Fstackoverflow.com%2Fquestions%2Ftagged%2Fpython&access_token=12345abcde
And here is the same task performed via SuperAPI:
# A GET request could look something like this:
curl -k \
-x http://realtime.oxylabs.io:60000 \
-U user:pass1 \
"https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html" \
-H "X-Oxylabs-Session-Id: 1234567890abcdef" \
-H "X-Oxylabs-Geo-Location: India" \
-H "Accept-Language: en-US" \
-H "Content-Type: application/octet-stream" \
-H "Custom-Header: custom header content" \
-H "Cookie: NID=1234567890; 1P_JAR=0987654321" \
-H "X-Status-Code: 303, 808, 909"
# A POST request would have the same structure but contain a parameter specifying that it is a POST request:
curl -X POST \
-k \
-x http://realtime.oxylabs.io:60000 \
-U user:pass1 "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html" \
-H "X-Oxylabs-Session-Id: 1234567890abcdef" \
-H "X-Oxylabs-Geo-Location: India" \
-H "Custom-Header: custom header content" \
-H "Cookie: NID=1234567890; 1P_JAR=0987654321" \
-H "X-Status-Code: 303, 808, 909"
import requests
from pprint import pprint
# Define proxy dict. Do not forget to put your real user and pass here as well.
proxies = {
'http': 'http://user:pass1@realtime.oxylabs.io:60000',
'https': 'https://user:pass1@realtime.oxylabs.io:60000',
}
response = requests.request(
'GET',
'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
auth=('user', 'pass1'),
verify=False, # Or accept our certificate.
proxies=proxies,
)
# Print result page to stdout
pprint(response.text)
# Save returned HTML to result.html file
with open('result.html', 'w') as f:
    f.write(response.text)
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_PROXY, 'realtime.oxylabs.io:60000');
curl_setopt($ch, CURLOPT_PROXYUSERPWD, "user" . ":" . "pass1");
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
$result = curl_exec($ch);
echo $result;
if (curl_errno($ch)) {
echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
HTTP method is not supported with SuperAPI
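Integration methods: Push-Pull, Realtime, SuperAPI
Content: raw HTML, parsed JSON, rendered JS
The universal_ecommerce source is designed to retrieve content from any URL on the internet. POST JSON-formatted parameters to https://data.oxylabs.io/v1/queries to submit the specified URL to the extraction queue.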
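Query Parameters
Parameter | Description | Default value
---|---|---
source (required) | Data source. | universal_ecommerce
url (required) | Direct URL (link) to the universal page. | -
user_agent_type | Device type and browser. See User_Agent_Type under Parameter Values for the full list. | desktop
geo_location | Geo-location of the proxy used to retrieve the data. See Geo_Location under Parameter Values for the supported locations. | -
locale | Locale, as expected in the Accept-Language header. | -
render | Enables JavaScript rendering. Use it when the target requires JavaScript to load content. Only works with the Push-Pull (a.k.a. Callback) method. Two values are available: html (get the raw output) and png (get a Base64-encoded screenshot). | -
content_encoding | Add this parameter if you are downloading images. See Download Images for more information. | base64
context: content | Base64-encoded POST request body. Only useful when http_method is set to post. | -
context: cookies | Pass your own cookies. | -
context: follow_redirects | Indicates whether you want the scraper to follow redirects (3xx responses with a destination URL) to get the content of the URL at the end of the redirect chain. | -
context: headers | Pass your own headers. | -
context: http_method | Set it to post if you want the E-Commerce Universal Scraper to make a POST request to your target URL. | get
context: session_id | Use this parameter if you want to use the same proxy for multiple requests. Set your session to any string you like, and we will assign a proxy to this ID and keep it for up to 10 minutes. After that, if you make another request with the same session ID, a new proxy will be assigned to it. | -
context: successful_status_codes | Defines one or more custom HTTP response codes upon which we should consider the scrape successful and return the content to you. May be useful if you want us to return a 503 error page or in some other non-standard cases. | -
callback_url | URL of your callback endpoint. | -
parse | true returns structured data, provided the submitted URL points to an e-commerce product page. Use it together with the parser_type parameter to access our adaptive parser. | false
parser_type | Set the value to ecommerce_product to access the adaptive parser. | -
storage_type | Storage service provider. Currently, only Amazon S3 is supported: s3. See Upload to Storage for the full implementation. Only works with the Push-Pull (Callback) method. | -
storage_url | Name of your Amazon S3 bucket. Only works with the Push-Pull (Callback) method. | -
Parameters marked (required) are mandatory.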
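The `context` parameters are passed as a list of key/value entries, which is easy to get wrong by hand. The helper below is a sketch of how that list might be assembled; `build_context` and its argument names are hypothetical, while the keys and value shapes follow the request examples in this document.

```python
import base64
from typing import Dict, List, Optional


def build_context(
    headers: Optional[Dict[str, str]] = None,
    cookies: Optional[Dict[str, str]] = None,
    session_id: Optional[str] = None,
    follow_redirects: Optional[bool] = None,
    successful_status_codes: Optional[List[int]] = None,
    post_body: Optional[bytes] = None,
) -> List[dict]:
    """Assemble the context list, including only the options that were set."""
    context = []
    if session_id is not None:
        context.append({"key": "session_id", "value": session_id})
    if headers:
        context.append({"key": "headers", "value": dict(headers)})
    if cookies:
        # Cookies are sent as a list of {"key": ..., "value": ...} pairs.
        pairs = [{"key": k, "value": v} for k, v in cookies.items()]
        context.append({"key": "cookies", "value": pairs})
    if follow_redirects is not None:
        context.append({"key": "follow_redirects", "value": follow_redirects})
    if successful_status_codes:
        context.append({"key": "successful_status_codes", "value": list(successful_status_codes)})
    if post_body is not None:
        # content is only meaningful for POST, so both entries are set together.
        context.append({"key": "http_method", "value": "post"})
        encoded = base64.b64encode(post_body).decode("ascii")
        context.append({"key": "content", "value": encoded})
    return context
```

The returned list can be placed under the `context` key of a query payload.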
Parameter Values
Geo_Location
The full list of supported geo-locations is available for download in CSV format.
"United Arab Emirates",
"Albania",
"Armenia",
"Angola",
"Argentina",
"Australia",
...
"Uruguay",
"Uzbekistan",
"Venezuela Bolivarian Republic of",
"Viet Nam",
"South Africa",
"Zimbabwe"
HTTP_Method
The E-Commerce Universal Scraper supports two HTTP methods: GET (default) and POST.
"GET",
"POST"
Render
The E-Commerce Universal Scraper API can render JavaScript and return either the rendered HTML document or a PNG screenshot of the web page.
"html",
"png"
User_Agent_Type
[
{
"user_agent_type": "desktop",
"description": "Random desktop browser User-Agent"
},
{
"user_agent_type": "desktop_firefox",
"description": "Random User-Agent of one of the latest versions of desktop Firefox"
},
{
"user_agent_type": "desktop_chrome",
"description": "Random User-Agent of one of the latest versions of desktop Chrome"
},
{
"user_agent_type": "desktop_opera",
"description": "Random User-Agent of one of the latest versions of desktop Opera"
},
{
"user_agent_type": "desktop_edge",
"description": "Random User-Agent of one of the latest versions of desktop Edge"
},
{
"user_agent_type": "desktop_safari",
"description": "Random User-Agent of one of the latest versions of desktop Safari"
},
{
"user_agent_type": "mobile",
"description": "Random mobile browser User-Agent"
},
{
"user_agent_type": "mobile_android",
"description": "Random User-Agent of one of the latest versions of Android browser"
},
{
"user_agent_type": "mobile_ios",
"description": "Random User-Agent of one of the latest versions of iPhone browser"
},
{
"user_agent_type": "tablet",
"description": "Random tablet browser User-Agent"
},
{
"user_agent_type": "tablet_android",
"description": "Random User-Agent of one of the latest versions of Android tablet"
},
{
"user_agent_type": "tablet_ios",
"description": "Random User-Agent of one of the latest versions of iPad tablet"
}
]
The full list of user_agent_type values is available for download in JSON format.
Account Status
Usage Statistics
This query returns all-time statistics. To get your daily or monthly usage, add ?group_by=day or ?group_by=month.
curl --user user:pass1 'https://data.oxylabs.io/v2/stats'
import requests
from pprint import pprint
# Get response from stats endpoint.
response = requests.request(
method='GET',
url='https://data.oxylabs.io/v2/stats',
auth=('user', 'pass1'),
)
# Print prettified JSON response to stdout.
pprint(response.json())
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://data.oxylabs.io/v2/stats");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "GET");
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");
$result = curl_exec($ch);
echo $result;
if (curl_errno($ch)) {
echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
Sample output:
{
"data": {
"sources": [
{
"realtime_results_count": "90",
"results_count": "10",
"title": "universal_ecommerce"
}
]
},
"meta": {
"group_by": null
}
}
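You can look up your usage statistics by querying the https://data.oxylabs.io/v2/stats endpoint.
By default, the API returns all-time usage statistics. Adding ?group_by=month returns monthly statistics, while ?group_by=day returns daily figures.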
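A small sketch of how the statistics URL could be built while guarding against unsupported grouping values. `stats_url` is a hypothetical helper; only the two `group_by` values named above are accepted.

```python
from typing import Optional
from urllib.parse import urlencode

STATS_URL = "https://data.oxylabs.io/v2/stats"
ALLOWED_GROUPINGS = ("day", "month")


def stats_url(group_by: Optional[str] = None) -> str:
    """Return the stats URL, optionally grouped by day or month."""
    if group_by is None:
        # No grouping: all-time statistics.
        return STATS_URL
    if group_by not in ALLOWED_GROUPINGS:
        raise ValueError("group_by must be 'day' or 'month'")
    return "{}?{}".format(STATS_URL, urlencode({"group_by": group_by}))
```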
Limits
curl --user user:pass1 'https://data.oxylabs.io/v2/stats/limits'
import requests
from pprint import pprint
# Get response from stats endpoint.
response = requests.request(
method='GET',
url='https://data.oxylabs.io/v2/stats/limits',
auth=('user', 'pass1'),
)
# Print prettified JSON response to stdout.
pprint(response.json())
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://data.oxylabs.io/v2/stats/limits");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "GET");
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");
$result = curl_exec($ch);
echo $result;
if (curl_errno($ch)) {
echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
Sample output:
{
"monthly_requests_commitment": 4500000,
"used_requests": 985000
}
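The https://data.oxylabs.io/v2/stats/limits endpoint returns your monthly commitment information, along with how much of it you have already used.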
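As a worked example with the sample output above, the remaining monthly allowance can be derived from the two fields. This assumes that `used_requests` counts against `monthly_requests_commitment`, which the sample suggests but this document does not state outright; `remaining_requests` is a hypothetical helper.

```python
def remaining_requests(limits: dict) -> int:
    """Requests left in the monthly commitment, never below zero."""
    commitment = limits["monthly_requests_commitment"]
    used = limits["used_requests"]
    return max(commitment - used, 0)


sample_limits = {"monthly_requests_commitment": 4500000, "used_requests": 985000}
print(remaining_requests(sample_limits))  # 4500000 - 985000 = 3515000
```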
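Response Codes
Code | Status | Description
---|---|---
204 | No content | You are trying to retrieve a job that has not been completed yet.
400 | Multiple error messages | Bad request structure; this could be a misspelled parameter or an invalid value. The response body will contain a more specific error message.
401 | "Authorization header not provided" / "Invalid authorization header" / "Client not found" | Missing authorization header or incorrect login credentials.
403 | Forbidden | Your account does not have access to this resource.
404 | Not found | The job ID you are looking for no longer exists.
429 | Too many requests | Rate limit exceeded. Please contact your account manager to increase the limit.
500 | Unknown error | Service unavailable.
524 | Timeout | Service unavailable.
612 | Undefined internal error | Something went wrong and we could not process the job you submitted. You can retry at no extra cost, as we do not charge for faulted jobs. If that does not work, please contact us.
613 | Faulted after too many retries | We tried to scrape the job you submitted but gave up after reaching our retry limit. You can retry at no extra cost, as we do not charge for faulted jobs. If that does not work, please contact us.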
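Client code usually needs to turn these codes into a decision. The mapping below is one possible policy, not a prescribed one: treating 429, 500, and 524 as retryable is an assumption made here, while the table itself only states that 612 and 613 can be retried at no extra cost. `classify_response` and the category names are hypothetical.

```python
# Codes the table explicitly describes as retryable at no extra cost.
RETRY_FREE = {612, 613}
# Assumption: transient conditions that are worth retrying after a pause.
RETRY_LATER = {429, 500, 524}
# Request or account problems that a retry will not fix.
CLIENT_ERRORS = {400, 401, 403, 404}


def classify_response(code: int) -> str:
    """Map a response code to one of: ok, not_ready, retry, fix_request, unknown."""
    if code == 204:
        # The job has not finished yet; check again later.
        return "not_ready"
    if 200 <= code < 300:
        return "ok"
    if code in RETRY_FREE or code in RETRY_LATER:
        return "retry"
    if code in CLIENT_ERRORS:
        return "fix_request"
    return "unknown"
```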