推拉
总览
推拉式并非是最简单的集成方法,但它是最可靠的方法,这就是为什么我们推荐实施这种方法,尤其是当您处理大量数据时。
推拉式是一种异步集成方法。这意味着,提交作业后,我们将迅速返回一个包含作业信息(所有提交的作业参数和作业 ID,以及用于下载结果和检查作业状态的 URL)的 JSON。通过这种集成方法,作业提交过程完全独立于下载结果。
在我们处理完您的作业之后,如果您在提交作业时提供了一个 回调 URL,我们将 POST
一个包含更新作业信息的 JSON 有效载荷(包括作业的 status
设置为 done
)到您的服务器。此时,您可以继续从我们的系统中下载结果。我们将结果保留在完成后至少 24 小时 内可供检索。
通过推拉式,您可以将您的结果直接上传到您的云存储 (AWS S3 或 Google 云存储)。
注意:如果您不想麻烦地设置一个接受传入回调通知的服务,则可以尝试每隔几秒钟就得到您的结果(这个概念叫做 轮询).
您也可以尝试通过 Postman 了解推拉式的工作原理。
单一作业
描述
下面这个端点只接受一个 query
或 url
值。
端点
POST https://data.oxylabs.io/v1/queries
输入
您必须以 JSON 有效载荷发送您的作业参数,如以下代码示例所示:
curl --user user:pass1 \
'https://data.oxylabs.io/v1/queries' \
-H "Content-Type: application/json" \
-d '{"source": "ENTER_SOURCE_HERE", "url": "https://www.example.com", "geo_location": "United States", "callback_url": "https://your.callback.url", "storage_type": "s3", "storage_url": "s3://your.storage.bucket.url"}'
import requests
from pprint import pprint
# Structure payload.
payload = {
"source": "ENTER_SOURCE_HERE", # Source you choose e.g. "universal"
"url": "https://www.example.com", # Check speficic source if you should use "url" or "query"
"geo_location": "United States", # Some sources accept zip-code or cooprdinates
#"render" : "html", # Uncomment you want to render JavaScript within the page
#"parse" : true, # Check what sources support parsed data
#"callback_url": "https://your.callback.url", #required if using callback listener
"callback_url": "https://your.callback.url",
"storage_type": "s3",
"storage_url": "s3://your.storage.bucket.url"
}
# Get response.
response = requests.request(
'POST',
'https://data.oxylabs.io/v1/queries',
auth=('YOUR_USERNAME', 'YOUR_PASSWORD'), #Your credentials go here
json=payload,
)
# Print prettified response to stdout.
pprint(response.json())
<?php
$params = array(
'source' => 'ENTER_SOURCE_HERE', //Source you choose e.g. "universal"
'url' => 'https://www.example.com', // Check speficic source if you should use "url" or "query"
'geo_location' => 'United States', //Some sources accept zip-code or cooprdinates
//'render' : 'html', // Uncomment you want to render JavaScript within the page
//'parse' : TRUE, // Check what sources support parsed data
//'callback_url' => 'https://your.callback.url', //required if using callback listener
'callback_url': 'https://your.callback.url',
'storage_type' => 's3',
'storage_url' => 's3://your.storage.bucket.url'
);
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://data.oxylabs.io/v1/queries");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($params));
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_USERPWD, "YOUR_USERNAME" . ":" . "YOUR_PASSWORD"); //Your credentials go here
$headers = array();
$headers[] = "Content-Type: application/json";
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$result = curl_exec($ch);
echo $result;
if (curl_errno($ch)) {
echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Net.Http.Json;
using System.Threading.Tasks;
namespace OxyApi
{
class Program
{
static async Task Main()
{
const string Username = "YOUR_USERNAME";
const string Password = "YOUR_PASSWORD";
var parameters = new Dictionary<string, string>()
{
{ "source", "ENTER_SOURCE_HERE" },
{ "url", "https://example.com" },
{ "geo_location", "United States" },
{ "callback_url", "https://your.callback.url" },
};
var client = new HttpClient();
Uri baseUri = new Uri("https://data.oxylabs.io");
client.BaseAddress = baseUri;
var requestMessage = new HttpRequestMessage(HttpMethod.Post, "/v1/queries");
requestMessage.Content = JsonContent.Create(parameters);
var authenticationString = $"{Username}:{Password}";
var base64EncodedAuthenticationString = Convert.ToBase64String(System.Text.ASCIIEncoding.UTF8.GetBytes(authenticationString));
requestMessage.Headers.Add("Authorization", "Basic " + base64EncodedAuthenticationString);
var response = await client.SendAsync(requestMessage);
var contents = await response.Content.ReadAsStringAsync();
Console.WriteLine(contents);
}
}
}
package main
import (
"bytes"
"encoding/json"
"fmt"
"io/ioutil"
"net/http"
)
func main() {
const Username = "YOUR_USERNAME"
const Password = "YOUR_PASSWORD"
payload := map[string]string{
"source": "ENTER_SOURCE_HERE",
"url": "https://example.com",
"geo_location": "United States",
"callback_url": "https://your.callback.url",
}
jsonValue, _ := json.Marshal(payload)
client := &http.Client{}
request, _ := http.NewRequest("POST",
"https://data.oxylabs.io/v1/queries",
bytes.NewBuffer(jsonValue),
)
request.Header.Add("Content-type", "application/json")
request.SetBasicAuth(Username, Password)
response, _ := client.Do(request)
responseText, _ := ioutil.ReadAll(response.Body)
fmt.Println(string(responseText))
}
import okhttp3.*;
import org.json.JSONObject;
public class Main implements Runnable {
private static final String AUTHORIZATION_HEADER = "Authorization";
public static final String USERNAME = "YOUR_USERNAME";
public static final String PASSWORD = "YOUR_PASSWORD";
public void run() {
JSONObject jsonObject = new JSONObject();
jsonObject.put("source", "ENTER_SOURCE_HERE");
jsonObject.put("url", "https://example.com");
jsonObject.put("geo_location", "United States");
jsonObject.put("callback_url", "https://your.callback.url");
Authenticator authenticator = (route, response) -> {
String credential = Credentials.basic(USERNAME, PASSWORD);
return response
.request()
.newBuilder()
.header(AUTHORIZATION_HEADER, credential)
.build();
};
var client = new OkHttpClient.Builder()
.authenticator(authenticator)
.build();
var mediaType = MediaType.parse("application/json; charset=utf-8");
var body = RequestBody.create(jsonObject.toString(), mediaType);
var request = new Request.Builder()
.url("https://data.oxylabs.io/v1/queries")
.post(body)
.build();
try (var response = client.newCall(request).execute()) {
assert response.body() != null;
System.out.println(response.body().string());
} catch (Exception exception) {
System.out.println("Error: " + exception.getMessage());
}
System.exit(0);
}
public static void main(String[] args) {
new Thread(new Main()).start();
}
}
import fetch from 'node-fetch';
const username = 'YOUR_USERNAME';
const password = 'YOUR_PASSWORD';
const body = {
source: 'ENTER_SOURCE_HERE',
url: 'https://www.example.com',
geo_location: 'United States',
callback_url: 'https://your.callback.url',
};
const response = await fetch('https://data.oxylabs.io/v1/queries', {
method: 'post',
body: JSON.stringify(body),
headers: {
'Content-Type': 'application/json',
'Authorization': 'Basic ' + Buffer.from(`${username}:${password}`).toString('base64'),
}
});
console.log(await response.json());
输出
API 将响应一个包含作业信息的 JSON,具体如下:
{
"callback_url": "https://your.callback.url",
"client_id": 5,
"context": [
{
"key": "results_language",
"value": null
},
{
"key": "safe_search",
"value": null
},
{
"key": "tbm",
"value": null
},
{
"key": "cr",
"value": null
},
{
"key": "filter",
"value": null
}
],
"created_at": "2019-10-01 00:00:01",
"domain": "com",
"geo_location": "United States",
"id": "12345678900987654321",
"limit": 10,
"locale": null,
"pages": 1,
"parse": false,
"render": null,
"url": "https://www.example.com",
"source": "universal",
"start_page": 1,
"status": "pending",
"storage_type": "s3",
"storage_url": "YOUR_BUCKET_NAME/12345678900987654321.json",
"subdomain": "www",
"updated_at": "2019-10-01 00:00:01",
"user_agent_type": "desktop",
"_links": [
{
"rel": "self",
"href": "http://data.oxylabs.io/v1/queries/12345678900987654321",
"method": "GET"
},
{
"rel": "results",
"href": "http://data.oxylabs.io/v1/queries/12345678900987654321/results",
"method": "GET"
}
]
}
检查作业状态
描述
如果您在提交作业时包含一个有效的 callback_url
值。完成作业后,我们将POST
一个 JSON 有效载荷到您指定的回调 URL。JSON 有效载荷将表明作业已经完成,其状态设置为done
。
然而,如果您提交的作业没有callback_url
,则可以自己检查作业状态。要做到这一点,使用 rel
:self
中的 href
URL,取自您提交作业后收到的响应信息。
检查作业状态的 URL 看起来与此类似:http://data.oxylabs.io/v1/queries/12345678900987654321
。查询这个 URL 将返回作业信息,包括其status
。
可能的 status
值共有 3 种:
端点
GET https://data.oxylabs.io/v1/queries/{id}
输入
curl --user user:pass1 \
'http://data.oxylabs.io/v1/queries/12345678900987654321'
import requests
from pprint import pprint
# Get response from stats endpoint.
response = requests.request(
method='GET',
url='http://data.oxylabs.io/v1/queries/12345678900987654321',
auth=('user', 'pass1'),
)
# Print prettified JSON response to stdout.
pprint(response.json())
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://data.oxylabs.io/v1/queries/12345678900987654321");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "GET");
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");
$result = curl_exec($ch);
echo $result;
if (curl_errno($ch)) {
echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Net.Http.Json;
using System.Threading.Tasks;
namespace OxyApi
{
class Program
{
static async Task Main()
{
const string JobId = "12345678900987654321";
const string Username = "YOUR_USERNAME";
const string Password = "YOUR_PASSWORD";
var client = new HttpClient();
Uri baseUri = new Uri("https://data.oxylabs.io");
client.BaseAddress = baseUri;
var requestMessage = new HttpRequestMessage(HttpMethod.Get, $"/v1/queries/{JobId}");
var authenticationString = $"{Username}:{Password}";
var base64EncodedAuthenticationString = Convert.ToBase64String(System.Text.ASCIIEncoding.UTF8.GetBytes(authenticationString));
requestMessage.Headers.Add("Authorization", "Basic " + base64EncodedAuthenticationString);
var response = await client.SendAsync(requestMessage);
var contents = await response.Content.ReadAsStringAsync();
Console.WriteLine(contents);
}
}
}
package main
import (
"fmt"
"io/ioutil"
"net/http"
)
func main() {
const JobId = "12345678900987654321"
const Username = "YOUR_USERNAME"
const Password = "YOUR_PASSWORD"
client := &http.Client{}
request, _ := http.NewRequest("GET",
fmt.Sprintf("https://data.oxylabs.io/v1/queries/%s", JobId),
nil,
)
request.Header.Add("Content-type", "application/json")
request.SetBasicAuth(Username, Password)
response, _ := client.Do(request)
responseText, _ := ioutil.ReadAll(response.Body)
fmt.Println(string(responseText))
}
import okhttp3.*;
public class Main implements Runnable {
private static final String AUTHORIZATION_HEADER = "Authorization";
private static final String JOB_ID = "12345678900987654321";
public static final String USERNAME = "YOUR_USERNAME";
public static final String PASSWORD = "YOUR_PASSWORD";
public void run() {
Authenticator authenticator = (route, response) -> {
String credential = Credentials.basic(USERNAME, PASSWORD);
return response
.request()
.newBuilder()
.header(AUTHORIZATION_HEADER, credential)
.build();
};
var client = new OkHttpClient.Builder()
.authenticator(authenticator)
.build();
var request = new Request.Builder()
.url(String.format("https://data.oxylabs.io/v1/queries/%s", JOB_ID))
.get()
.build();
try (var response = client.newCall(request).execute()) {
assert response.body() != null;
System.out.println(response.body().string());
} catch (Exception exception) {
System.out.println("Error: " + exception.getMessage());
}
System.exit(0);
}
public static void main(String[] args) {
new Thread(new Main()).start();
}
}
import fetch from 'node-fetch';
const jobId = '12345678900987654321';
const username = 'YOUR_USERNAME';
const password = 'YOUR_PASSWORD';
const response = await fetch(`https://data.oxylabs.io/v1/queries/${jobId}`, {
method: 'get',
headers: {
'Content-Type': 'application/json',
'Authorization': 'Basic ' + Buffer.from(`${username}:${password}`).toString('base64'),
}
});
console.log(await response.json());
输出
{
"client_id": 5,
"context": [
{
"key": "results_language",
"value": null
},
{
"key": "safe_search",
"value": null
},
{
"key": "tbm",
"value": null
},
{
"key": "cr",
"value": null
},
{
"key": "filter",
"value": null
}
],
"created_at": "2019-10-01 00:00:01",
"domain": "com",
"geo_location": null,
"id": "12345678900987654321",
"limit": 10,
"locale": null,
"pages": 1,
"parse": false,
"render": null,
"query": "adidas",
"source": "google_shopping_search",
"start_page": 1,
"status": "done",
"subdomain": "www",
"updated_at": "2019-10-01 00:00:15",
"user_agent_type": "desktop",
"_links": [
{
"rel": "self",
"href": "http://data.oxylabs.io/v1/queries/12345678900987654321",
"method": "GET"
},
{
"rel": "results",
"href": "http://data.oxylabs.io/v1/queries/12345678900987654321/results",
"method": "GET"
}
]
}
API 将通过在响应主体中打印 JSON 格式响应查询信息。请注意,作业状态
已被改变为status
。您现在可以通过向 http://data.oxylabs.io/v1/queries/12345678900987654321/results
发送查询来检索内容。您还可以看到,该作业 updated_at
2019-10-01 00:00:15
- 作业花了 14 秒完成。
检索作业内容
描述
在知道可以准备检索作业后,您便可使用作业信息响应 rel
:results
中的 href
URL 来进行 GET
。结果链接看起来像这样:http://data.oxylabs.io/v1/queries/12345678900987654321/results
。
通过设置回调 服务,无需定期检查作业状态即可自动检索结果 。要做到这一点,在提交作业时指定一个能够接受传入 HTTP 请求的服务器的 URL。当我们的系统完成作业时,它将寄送
一个 JSON 有效载荷到所提供的 URL,而回调服务将下载结果,正如在回调执行示例所示。
端点
GET https://data.oxylabs.io/v1/queries/{id}/results
输入
以下代码示例展示了如何使用/results
端点。
curl --user user:pass1 \
'http://data.oxylabs.io/v1/queries/12345678900987654321/results'
import requests
from pprint import pprint
# Get response from stats endpoint.
response = requests.request(
method='GET',
url='http://data.oxylabs.io/v1/queries/12345678900987654321/results',
auth=('user', 'pass1'),
)
# Print prettified JSON response to stdout.
pprint(response.json())
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://data.oxylabs.io/v1/queries/12345678900987654321/results");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "GET");
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");
$result = curl_exec($ch);
echo $result;
if (curl_errno($ch)) {
echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
using System;
using System.Net.Http;
using System.Threading.Tasks;
namespace OxyApi
{
class Program
{
static async Task Main()
{
const string JobId = "12345678900987654321";
const string Username = "YOUR_USERNAME";
const string Password = "YOUR_PASSWORD";
var client = new HttpClient();
Uri baseUri = new Uri("https://data.oxylabs.io");
client.BaseAddress = baseUri;
var requestMessage = new HttpRequestMessage(HttpMethod.Get, $"/v1/queries/{JobId}/results");
var authenticationString = $"{Username}:{Password}";
var base64EncodedAuthenticationString = Convert.ToBase64String(System.Text.ASCIIEncoding.UTF8.GetBytes(authenticationString));
requestMessage.Headers.Add("Authorization", "Basic " + base64EncodedAuthenticationString);
var response = await client.SendAsync(requestMessage);
var contents = await response.Content.ReadAsStringAsync();
Console.WriteLine(contents);
}
}
}
package main
import (
"fmt"
"io/ioutil"
"net/http"
)
func main() {
const JobId = "12345678900987654321"
const Username = "YOUR_USERNAME"
const Password = "YOUR_PASSWORD"
client := &http.Client{}
request, _ := http.NewRequest("GET",
fmt.Sprintf("https://data.oxylabs.io/v1/queries/%s/results", JobId),
nil,
)
request.Header.Add("Content-type", "application/json")
request.SetBasicAuth(Username, Password)
response, _ := client.Do(request)
responseText, _ := ioutil.ReadAll(response.Body)
fmt.Println(string(responseText))
}
import okhttp3.*;
public class Main implements Runnable {
private static final String AUTHORIZATION_HEADER = "Authorization";
private static final String JOB_ID = "12345678900987654321";
public static final String USERNAME = "YOUR_USERNAME";
public static final String PASSWORD = "YOUR_PASSWORD";
public void run() {
Authenticator authenticator = (route, response) -> {
String credential = Credentials.basic(USERNAME, PASSWORD);
return response
.request()
.newBuilder()
.header(AUTHORIZATION_HEADER, credential)
.build();
};
var client = new OkHttpClient.Builder()
.authenticator(authenticator)
.build();
var request = new Request.Builder()
.url(String.format("https://data.oxylabs.io/v1/queries/%s/results", JOB_ID))
.get()
.build();
try (var response = client.newCall(request).execute()) {
assert response.body() != null;
System.out.println(response.body().string());
} catch (Exception exception) {
System.out.println("Error: " + exception.getMessage());
}
System.exit(0);
}
public static void main(String[] args) {
new Thread(new Main()).start();
}
}
import fetch from 'node-fetch';
const jobId = '12345678900987654321';
const username = 'YOUR_USERNAME';
const password = 'YOUR_PASSWORD';
const response = await fetch(`https://data.oxylabs.io/v1/queries/${jobId}/results`, {
method: 'get',
headers: {
'Content-Type': 'application/json',
'Authorization': 'Basic ' + Buffer.from(`${username}:${password}`).toString('base64'),
}
});
console.log(await response.json());
输出
下面的 JSON 文件包含了一个 /results
端点的响应示例:
{
"results": [
{
"content": "<!doctype html><html>
CONTENT
</html>",
"created_at": "2019-10-01 00:00:01",
"updated_at": "2019-10-01 00:00:15",
"page": 1,
"url": "https://www.google.com/search?q=adidas&hl=en&gl=US",
"job_id": "12345678900987654321",
"status_code": 200
}
]
}
回调
该回调是一个我们向您的机器发送的 POST
请求,通知您数据提取任务已经完成,并提供一个 URL 来下载抓取的内容。这意味着,您不再需要手动检查作业状态 。一旦数据到手,我们会通知您,您现在需要做的就是进行检索 。请查看 Python 和 PHP 的代码样本。
# This is a simple Sanic web server with a route listening for callbacks on localhost:8080.
# It will print job results to stdout.
import requests
from pprint import pprint
from sanic import Sanic, response
AUTH_TUPLE = ('user', 'pass1')
app = Sanic()
# Define /job_listener endpoint that accepts POST requests.
@app.route('/job_listener', methods=['POST'])
async def job_listener(request):
try:
res = request.json
links = res.get('_links', [])
for link in links:
if link['rel'] == 'results':
# Sanic is async, but requests are synchronous, to fully take
# advantage of Sanic, use aiohttp.
res_response = requests.request(
method='GET',
url=link['href'],
auth=AUTH_TUPLE,
)
pprint(res_response.json())
break
except Exception as e:
print("Listener exception: {}".format(e))
return response.json(status=200, body={'status': 'ok'})
if __name__ == '__main__':
app.run(host='0.0.0.0', port=8080)
<?php
$stdout = fopen('php://stdout', 'w');
if (isset($_POST)) {
$result = array_merge($_POST, (array) json_decode(file_get_contents('php://input')));
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://data.oxylabs.io/v1/queries/".$result['id'].'/results');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "GET");
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");
$result = curl_exec($ch);
fwrite($stdout, $result);
if (curl_errno($ch)) {
echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
}
?>
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using System;
using System.Collections.Generic;
using System.Net.Http;
namespace OxyApiWeb
{
public class Callback
{
public Link[] _links { get; set; }
}
public class Link
{
public string rel { get; set; }
public string href { get; set; }
}
public class Startup
{
private const string USERNAME = "YOUR_USERNAME";
private const string PASSWORD = "YOUR_PASSWORD";
public Startup(IConfiguration configuration)
{
Configuration = configuration;
client = new HttpClient();
}
public IConfiguration Configuration { get; }
private HttpClient client;
public void ConfigureServices(IServiceCollection services)
{
services.AddControllers();
}
public void Configure(IApplicationBuilder app, IWebHostEnvironment env)
{
if (env.IsDevelopment())
{
app.UseDeveloperExceptionPage();
}
app.UseRouting();
app.UseAuthorization();
app.UseEndpoints(endpoints =>
{
endpoints.MapPost("/job_listener", async context =>
{
var callback = await System.Text.Json.JsonSerializer.DeserializeAsync<Callback>(context.Request.Body);
foreach (var link in callback._links)
{
if (link.rel != "results")
{
continue;
}
var requestMessage = new HttpRequestMessage(HttpMethod.Get, new Uri(link.href));
var authenticationString = $"{USERNAME}:{PASSWORD}";
var base64EncodedAuthenticationString = Convert.ToBase64String(System.Text.ASCIIEncoding.UTF8.GetBytes(authenticationString));
requestMessage.Headers.Add("Authorization", "Basic " + base64EncodedAuthenticationString);
var response = await client.SendAsync(requestMessage);
var contents = await response.Content.ReadAsStringAsync();
Console.WriteLine(contents);
}
var okMessage = new Dictionary<string, string>()
{
{ "message", "ok" }
};
await System.Text.Json.JsonSerializer.SerializeAsync(context.Response.Body, okMessage);
});
});
}
}
}
package main
import (
"fmt"
"github.com/labstack/echo/v4"
"io/ioutil"
"net/http"
)
const Username = "YOUR_USERNAME"
const Password = "YOUR_PASSWORD"
type Callback struct {
Links []Link `json:"_links"`
}
type Link struct {
Href string `json:"href"`
Method string `json:"method"`
Rel string `json:"rel"`
}
func main() {
echoServer := echo.New()
client := &http.Client{}
echoServer.POST("/job_listener", func(context echo.Context) error {
callback := new(Callback)
if err := context.Bind(callback); err != nil {
return err
}
for _, link := range callback.Links {
if link.Rel != "results" {
continue
}
request, _ := http.NewRequest("GET",
link.Href,
nil,
)
request.Header.Add("Content-type", "application/json")
request.SetBasicAuth(Username, Password)
response, _ := client.Do(request)
responseText, _ := ioutil.ReadAll(response.Body)
fmt.Println(string(responseText))
}
return context.JSON(http.StatusOK, map[string]string { "status": "ok" })
})
echoServer.Logger.Fatal(echoServer.Start(":8080"))
}
import okhttp3.*;
import com.sun.net.httpserver.HttpServer;
import org.apache.commons.io.IOUtils;
import org.json.JSONArray;
import org.json.JSONObject;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Objects;
public class Main implements Runnable {
private static final String AUTHORIZATION_HEADER = "Authorization";
public static final String USERNAME = "YOUR_USERNAME";
public static final String PASSWORD = "YOUR_PASSWORD";
public void run() {
HttpServer server = null;
try {
server = HttpServer.create(new InetSocketAddress("0.0.0.0", 8080), 0);
} catch (IOException exception) {
exception.printStackTrace();
System.exit(1);
}
Authenticator authenticator = (route, response) -> {
String credential = Credentials.basic(USERNAME, PASSWORD);
return response
.request()
.newBuilder()
.header(AUTHORIZATION_HEADER, credential)
.build();
};
var client = new OkHttpClient.Builder()
.authenticator(authenticator)
.build();
server.createContext("/job_listener", exchange -> {
var requestBody = IOUtils.toString(exchange.getRequestBody(), StandardCharsets.UTF_8);
JSONObject requestJson = new JSONObject(requestBody);
JSONArray links = requestJson.getJSONArray("_links");
for (var link : links.toList()) {
var linkMap = (Map<?, ?>)link;
if (!Objects.equals(linkMap.get("rel"), "results")) {
continue;
}
var request = new Request.Builder()
.url((String) linkMap.get("href"))
.get()
.build();
try (var response = client.newCall(request).execute()) {
assert response.body() != null;
System.out.println(response.body().string());
} catch (Exception exception) {
System.out.println("Error: " + exception.getMessage());
}
}
var responseJson = new JSONObject();
responseJson.put("status", "ok");
exchange.sendResponseHeaders(200, responseJson.toString().length());
OutputStream responseBody = exchange.getResponseBody();
responseBody.write(responseJson.toString().getBytes());
responseBody.flush();
responseBody.close();
exchange.close();
});
server.setExecutor(null);
server.start();
}
public static void main(String[] args) {
new Thread(new Main()).start();
}
}
import express from 'express'
import fetch from 'node-fetch';
const username = 'YOUR_USERNAME';
const password = 'YOUR_PASSWORD';
const app = express();
app.use(express.json());
app.post('/job_listener', async(request, response) => {
for (const index in request.body._links) {
const link = request.body._links[index];
if (link.rel !== 'results') {
continue;
}
const jobResultResponse = await fetch(link.href, {
method: 'get',
headers: {
'Content-Type': 'application/json',
'Authorization': 'Basic ' + Buffer.from(`${username}:${password}`).toString('base64'),
}
});
console.log(await jobResultResponse.json());
}
response.send({status: 'ok'});
});
app.listen(8080);
样例回调输出
{
"created_at":"2019-10-01 00:00:01",
"updated_at":"2019-10-01 00:00:15",
"locale":null,
"client_id":163,
"user_agent_type":"desktop",
"source":"google_shopping_search",
"pages":1,
"subdomain":"www",
"status":"done",
"start_page":1,
"parse":0,
"render":null,
"priority":0,
"ttl":0,
"origin":"api",
"persist":true,
"id":"12345678900987654321",
"callback_url":"http://your.callback.url/",
"query":"adidas",
"domain":"com",
"limit":10,
"geo_location":null,
{...}
"_links":[
{
"href":"https://data.oxylabs.io/v1/queries/12345678900987654321",
"method":"GET",
"rel":"self"
},
{
"href":"https://data.oxylabs.io/v1/queries/12345678900987654321/results",
"method":"GET",
"rel":"results"
}
],
}
批量查询
描述
爬虫API 支持在单批请求中提交多达 1,000 个 query
或 url
参数值。
该系统将把每项 query
或 url
都作为一个单独作业来处理。如果您提供一个回调 URL,则将获得每个关键词的单独调用。否则,我们的初始响应将包含所有关键词的作业 id
。例如,如果您发送了 50 个关键词,我们将返 50 个独特的作业 id
。
重要事项:通过 /batch
端点,您只能提交query
或url
参数值的列表(取决于您使用的 source)
。所有其他参数都应该有单数值。
端点
POST https://data.oxylabs.io/v1/queries/batch
输入
您必须将查询参数作为 JSON 有效载荷发布。下面列出了提交批量作业的方式:
curl --user user:pass1 \
'https://data.oxylabs.io/v1/queries/batch' \
-H 'Content-Type: application/json' \
-d '@keywords.json'
import requests
import json
from pprint import pprint
# Get payload from file.
with open('keywords.json', 'r') as f:
payload = json.loads(f.read())
response = requests.request(
'POST',
'https://data.oxylabs.io/v1/queries/batch',
auth=('user', 'pass1'),
json=payload,
)
# Print prettified response.
pprint(response.json())
<?php
$paramsFile = file_get_contents(realpath("keywords.json"));
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://data.oxylabs.io/v1/queries/batch");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $paramsFile);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");
$headers = array();
$headers[] = "Content-Type: application/json";
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
$result = curl_exec($ch);
echo $result;
if (curl_errno($ch)) {
echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
using System;
using System.IO;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
namespace OxyApi
{
class Program
{
static async Task Main()
{
const string Username = "YOUR_USERNAME";
const string Password = "YOUR_PASSWORD";
var content = File.ReadAllText(@"C:\path\to\keywords.json");
var client = new HttpClient();
var requestMessage = new HttpRequestMessage(HttpMethod.Post, new Uri("https://data.oxylabs.io/v1/queries/batch"));
requestMessage.Content = new StringContent(content, Encoding.UTF8, "application/json");
var authenticationString = $"{Username}:{Password}";
var base64EncodedAuthenticationString = Convert.ToBase64String(ASCIIEncoding.UTF8.GetBytes(authenticationString));
requestMessage.Headers.Add("Authorization", "Basic " + base64EncodedAuthenticationString);
var response = await client.SendAsync(requestMessage);
var contents = await response.Content.ReadAsStringAsync();
Console.WriteLine(contents);
}
}
}
package main
import (
"bytes"
"fmt"
"io/ioutil"
"net/http"
"os"
)
func main() {
const Username = "YOUR_USERNAME"
const Password = "YOUR_PASSWORD"
content, err := os.ReadFile("keywords.json")
if err != nil {
panic(err)
}
client := &http.Client{}
request, _ := http.NewRequest("POST",
"https://data.oxylabs.io/v1/queries/batch",
bytes.NewBuffer(content),
)
request.Header.Add("Content-type", "application/json")
request.SetBasicAuth(Username, Password)
response, _ := client.Do(request)
responseText, _ := ioutil.ReadAll(response.Body)
fmt.Println(string(responseText))
}
import okhttp3.*;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
public class Main implements Runnable {
private static final String AUTHORIZATION_HEADER = "Authorization";
public static final String USERNAME = "YOUR_USERNAME";
public static final String PASSWORD = "YOUR_PASSWORD";
public void run() {
Path filePath = Path.of("/path/to/keywords.json");
String jsonContent = null;
try {
jsonContent = Files.readString(filePath);
} catch (IOException e) {
throw new RuntimeException(e);
}
Authenticator authenticator = (route, response) -> {
String credential = Credentials.basic(USERNAME, PASSWORD);
return response
.request()
.newBuilder()
.header(AUTHORIZATION_HEADER, credential)
.build();
};
var client = new OkHttpClient.Builder()
.authenticator(authenticator)
.build();
var mediaType = MediaType.parse("application/json; charset=utf-8");
var body = RequestBody.create(jsonContent, mediaType);
var request = new Request.Builder()
.url("https://data.oxylabs.io/v1/queries/batch")
.post(body)
.build();
try (var response = client.newCall(request).execute()) {
assert response.body() != null;
System.out.println(response.body().string());
} catch (Exception exception) {
System.out.println("Error: " + exception.getMessage());
}
System.exit(0);
}
public static void main(String[] args) {
new Thread(new Main()).start();
}
}
import fetch from 'node-fetch';
import fs from 'fs'
const username = 'YOUR_USERNAME';
const password = 'YOUR_PASSWORD';
const payload = fs.readFileSync('keywords.json').toString();
const response = await fetch('https://data.oxylabs.io/v1/queries/batch', {
method: 'post',
body: payload,
headers: {
'Content-Type': 'application/json',
'Authorization': 'Basic ' + Buffer.from(`${username}:${password}`).toString('base64'),
}
});
console.log(await response.json());
您可能注意到,以上代码示例没有解释 JSON 有效载荷的格式化方式,而是指出了一个预先制作的 JSON 文件。下面是 keywords.json
文件的内容,其中包含多个 query
参数值:
{
"query":[
"adidas",
"nike",
"reebok"
],
"source": "google_shopping_search",
"domain": "com",
"callback_url": "https://your.callback.url"
}
...而这里有一个keywords.json
批量输入文件,包含多个 URL:
{
"url":[
"https://example.com/url1.html",
"https://example.com/url2.html",
"https://example.com/url3.html"
],
"source": "universal",
"callback_url": "https://your.callback.url"
}
响应
API 将响应一个 JSON 对象,其中包含每项所创建作业的信息。回复类似于以下:
{
"queries": [
{
"callback_url": "https://your.callback.url",
{...}
"created_at": "2019-10-01 00:00:01",
"domain": "com",
"id": "12345678900987654321",
{...}
"query": "adidas",
"source": "google_shopping_search",
{...}
"rel": "results",
"href": "http://data.oxylabs.io/v1/queries/12345678900987654321/results",
"method": "GET"
}
]
},
{
"callback_url": "https://your.callback.url",
{...}
"created_at": "2019-10-01 00:00:01",
"domain": "com",
"id": "12345678901234567890",
{...}
"query": "nike",
"source": "google_shopping_search",
{...}
"rel": "results",
"href": "http://data.oxylabs.io/v1/queries/12345678901234567890/results",
"method": "GET"
}
]
},
{
"callback_url": "https://your.callback.url",
{...}
"created_at": "2019-10-01 00:00:01",
"domain": "com",
"id": "01234567899876543210",
{...}
"query": "reebok",
"source": "google_shopping_search",
{...}
"rel": "results",
"href": "http://data.oxylabs.io/v1/queries/01234567899876543210/results",
"method": "GET"
}
]
}
]
}
获取通知程序 IP 地址列表
描述
您可能想把向您发送回调信息的 IP 列入白名单,或为其他目的获得这些 IP 的列表。通过GET
这个端点即可:
端点
GET https://data.oxylabs.io/v1/info/callbacker_ips
输入
以下代码示例展示了如何访问 /callbacker_ips
端点:
curl --user user:pass1 \
'https://data.oxylabs.io/v1/info/callbacker_ips'
import requests
from pprint import pprint
# Get response from the callback IPs endpoint.
response = requests.request(
method='GET',
url='https://data.oxylabs.io/v1/info/callbacker_ips',
auth=('user', 'pass1'),
)
# Print prettified JSON response to stdout.
pprint(response.json())
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://data.oxylabs.io/v1/info/callbacker_ips");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "GET");
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");
$result = curl_exec($ch);
echo $result;
if (curl_errno($ch)) {
echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
using System;
using System.Net.Http;
using System.Threading.Tasks;
namespace OxyApi
{
class Program
{
static async Task Main()
{
const string Username = "YOUR_USERNAME";
const string Password = "YOUR_PASSWORD";
var client = new HttpClient();
Uri baseUri = new Uri("https://data.oxylabs.io");
client.BaseAddress = baseUri;
var requestMessage = new HttpRequestMessage(HttpMethod.Get, "/v1/info/callbacker_ips");
var authenticationString = $"{Username}:{Password}";
var base64EncodedAuthenticationString = Convert.ToBase64String(System.Text.ASCIIEncoding.UTF8.GetBytes(authenticationString));
requestMessage.Headers.Add("Authorization", "Basic " + base64EncodedAuthenticationString);
var response = await client.SendAsync(requestMessage);
var contents = await response.Content.ReadAsStringAsync();
Console.WriteLine(contents);
}
}
}
package main
import (
"fmt"
"io/ioutil"
"net/http"
)
func main() {
const Username = "YOUR_USERNAME"
const Password = "YOUR_PASSWORD"
client := &http.Client{}
request, _ := http.NewRequest("GET",
"https://data.oxylabs.io/v1/info/callbacker_ips",
nil,
)
request.Header.Add("Content-type", "application/json")
request.SetBasicAuth(Username, Password)
response, _ := client.Do(request)
responseText, _ := ioutil.ReadAll(response.Body)
fmt.Println(string(responseText))
}
import okhttp3.*;
public class Main implements Runnable {
private static final String AUTHORIZATION_HEADER = "Authorization";
public static final String USERNAME = "YOUR_USERNAME";
public static final String PASSWORD = "YOUR_PASSWORD";
public void run() {
Authenticator authenticator = (route, response) -> {
String credential = Credentials.basic(USERNAME, PASSWORD);
return response
.request()
.newBuilder()
.header(AUTHORIZATION_HEADER, credential)
.build();
};
var client = new OkHttpClient.Builder()
.authenticator(authenticator)
.build();
var request = new Request.Builder()
.url("https://data.oxylabs.io/v1/info/callbacker_ips")
.get()
.build();
try (var response = client.newCall(request).execute()) {
assert response.body() != null;
System.out.println(response.body().string());
} catch (Exception exception) {
System.out.println("Error: " + exception.getMessage());
}
System.exit(0);
}
public static void main(String[] args) {
new Thread(new Main()).start();
}
}
import fetch from 'node-fetch';
const username = 'YOUR_USERNAME';
const password = 'YOUR_PASSWORD';
const response = await fetch('https://data.oxylabs.io/v1/info/callbacker_ips', {
method: 'get',
headers: {
'Content-Type': 'application/json',
'Authorization': 'Basic ' + Buffer.from(`${username}:${password}`).toString('base64'),
}
});
console.log(await response.json());
输出
API 将返回向您的系统发出回调请求的 IP 列表。
{
"ips": [
"x.x.x.x",
"y.y.y.y"
]
}
上传到云存储器
描述
爬虫 API 作业结果存储在我们的存储库中。如上所述,您可以通过 GET
/results
端点而从我们的存储中获得您的结果。
作为一种选择,我们可以将结果上传到您的云存储中。这样,您就不必发出额外请求来获取结果 - 所有内容都直接转到您的存储桶。
目前,我们支持 Amazon S3和 Google 云存储。如果您想使用不同存储类型,请联系您的客户经理以讨论功能交付时间表。
上传路径看起来像这样:YOUR_BUCKET_NAME/job_ID.json
.您可以在提交作业后从我们收到的回复中找到作业 ID。在这个示例 中,作业 ID 是12345678900987654321
。
参数值
Amazon S3
To get your job results uploaded to your Amazon S3 bucket, please set up access permissions for our service. To do that, go to https://s3.console.aws.amazon.com/ > S3 > Storage > Bucket Name (if don't have one, create a new one) > Permissions > Bucket Policy
要将您的作业结果上传到您的 Amazon S3 桶,请为我们的服务设置访问权限。要做到这一点,请进入https://s3.console.aws.amazon.com/ > S3 > Storage > Bucket Name(如果没有,就创建一个新的桶)> Permissions > Bucket Policy
您可以在以下随附的桶策略或在代码示例区找到。
不要忘记在 YOUR_BUCKET_NAME
下修改桶的名称。这项策略允许我们写到您的桶,让您访问上传的文件,并知道桶的位置。
{
"Version": "2012-10-17",
"Id": "Policy1577442634787",
"Statement": [
{
"Sid": "Stmt1577442633719",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::324311890426:user/oxylabs.s3.uploader"
},
"Action": "s3:GetBucketLocation",
"Resource": "arn:aws:s3:::YOUR_BUCKET_NAME"
},
{
"Sid": "Stmt1577442633719",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::324311890426:user/oxylabs.s3.uploader"
},
"Action": [
"s3:PutObject",
"s3:PutObjectAcl"
],
"Resource": "arn:aws:s3:::YOUR_BUCKET_NAME/*"
}
]
}
Google Cloud Storage
为了将您的作业结果上传到您的 Google 云存储桶,请为我们的服务设置特殊权限。要做到这一点,请用 storage.objects.create
权限创建一个自定义角色,并将其分配给 Oxylabs 服务账户的电子邮件 oxyserps-storage@oxyserps-storage.iam.gserviceaccount.com
.
Scheduler
Scheduler 是一项服务,您可以用来安排经常性抓取作业。
它扩展了推拉式集成的功能,最好与上传至存储器功能一起使用。
访问“Scheduler”部分 以了解如何使用这一功能。
Last updated