# LlamaIndex

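The LlamaIndex integration with the [**Oxylabs Web Scraper API**](https://oxylabs.io/products/scraper-api/web) lets you scrape web data and process it with an LLM (large language model) in the same workflow.

## Overview

[**LlamaIndex**](https://docs.llamaindex.ai/en/stable/examples/data_connectors/OxylabsDemo/) is a data framework for building LLM applications on top of external data sources. Use it with the [**Oxylabs Web Scraper API**](https://oxylabs.io/products/scraper-api/web) to:

* Scrape structured data without handling CAPTCHAs, IP blocks, or JS rendering
* Process the results with an LLM in the same pipeline
* Build end-to-end workflows, from extraction to AI-powered output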

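## Getting started

**Create your API user credentials:** Sign up for a free trial or purchase the product in the [**Oxylabs dashboard**](https://dashboard.oxylabs.io/en/registration) to create your API user credentials (`USERNAME` and `PASSWORD`).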

{% hint style="info" %}
If your account needs more than one API user, contact our customer support or send a message through our 24/7 live chat.
{% endhint %}

### Environment setup

This guide uses Python. Install the required libraries with pip:

```
pip install -qU llama-index llama-index-readers-oxylabs llama-index-readers-web python-dotenv
```

Create a `.env` file in your project directory containing your Oxylabs Web Scraper API credentials and your OpenAI API key:

```
OXYLABS_USERNAME=your_API_username
OXYLABS_PASSWORD=your_API_password
OPENAI_API_KEY=your-openai-key
```

Load these environment variables in your Python script:

```python
import os
from dotenv import load_dotenv

load_dotenv()
```
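
`os.getenv` returns `None` for a variable that is not set, so a typo in `.env` may only surface later as a failed request. A minimal sketch of an early check, assuming the three variable names from the `.env` file above; `require_env` is a hypothetical helper written for this guide, not a library function:

```python
import os


def require_env(names, env=None):
    """Return the values of the named variables, or raise if any is unset or empty.

    `require_env` is a hypothetical helper for this guide, not a library function.
    """
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError('Missing environment variables: ' + ', '.join(missing))
    return {name: env[name] for name in names}


if __name__ == '__main__':
    # Call this right after load_dotenv() so that values from .env are visible.
    try:
        require_env(['OXYLABS_USERNAME', 'OXYLABS_PASSWORD', 'OPENAI_API_KEY'])
        print('All credentials are set.')
    except RuntimeError as error:
        print(error)
```

Passing a plain dictionary as `env` makes the check easy to exercise without touching the real environment.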

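## Integration methods

There are two ways to access web content through the Web Scraper API in LlamaIndex:

### Oxylabs readers

The `llama-index-readers-oxylabs` module contains dedicated classes that let you scrape data from various sources: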

| API data source   | Reader class               |
| ----------------- | -------------------------- |
| Google web search | OxylabsGoogleSearchReader  |
| Google search ads | OxylabsGoogleAdsReader     |
| Amazon product    | OxylabsAmazonProductReader |
| Amazon search     | OxylabsAmazonSearchReader  |
| Amazon reviews    | OxylabsAmazonReviewsReader |

For example, you can extract Google search results:

```python
import os
from dotenv import load_dotenv
from llama_index.readers.oxylabs import OxylabsGoogleSearchReader

load_dotenv()
reader = OxylabsGoogleSearchReader(
    os.getenv('OXYLABS_USERNAME'), os.getenv('OXYLABS_PASSWORD')
)
results = reader.load_data({
    'query': 'best pancake recipe',
    'parse': True
})
print(results[0].text)
```
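
Each item returned by `load_data` exposes its content through a `.text` attribute, as the `print` call above shows. When that text is long, a shortened preview can be easier to scan. A minimal sketch, where `preview` is a hypothetical helper and `SimpleNamespace` objects stand in for the reader's documents so the example runs offline:

```python
from types import SimpleNamespace


def preview(documents, max_chars=200):
    """Return one shortened, single-line preview string per document."""
    previews = []
    for document in documents:
        # Collapse newlines and repeated spaces into single spaces.
        text = ' '.join(document.text.split())
        if len(text) > max_chars:
            text = text[:max_chars].rstrip() + '...'
        previews.append(text)
    return previews


if __name__ == '__main__':
    # Stand-in for the list returned by reader.load_data(...).
    sample = [SimpleNamespace(text='Fluffy pancakes\n\nMix flour, eggs and milk.')]
    for line in preview(sample, max_chars=20):
        print(line)
```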

### Oxylabs web reader

With the `OxylabsWebReader` class, you can extract data from any URL:

```python
import os
from dotenv import load_dotenv
from llama_index.readers.web import OxylabsWebReader

load_dotenv()
reader = OxylabsWebReader(
    os.getenv('OXYLABS_USERNAME'), os.getenv('OXYLABS_PASSWORD')
)
results = reader.load_data(
    [
        'https://sandbox.oxylabs.io/products/1',
        'https://sandbox.oxylabs.io/products/2'
    ]
)
for result in results:
    print(result.text + '\n')
```
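
A URL list assembled from user input or a crawl may contain duplicates or malformed entries. A minimal sketch of cleaning such a list before handing it to `load_data`; `clean_urls` is a hypothetical helper, not part of the reader:

```python
from urllib.parse import urlparse


def clean_urls(urls):
    """Drop duplicates and entries that are not absolute http(s) URLs, keeping order."""
    seen = set()
    cleaned = []
    for url in urls:
        url = url.strip()
        parsed = urlparse(url)
        # Keep only absolute URLs with an http or https scheme and a host.
        if parsed.scheme not in ('http', 'https') or not parsed.netloc:
            continue
        if url not in seen:
            seen.add(url)
            cleaned.append(url)
    return cleaned


if __name__ == '__main__':
    candidates = [
        'https://sandbox.oxylabs.io/products/1',
        'https://sandbox.oxylabs.io/products/1',  # duplicate
        'products/2',                             # relative, so it is dropped
    ]
    print(clean_urls(candidates))
```

The result can then be passed as the first argument, for example `reader.load_data(clean_urls(candidates))`.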

## Building a basic AI search agent

Below is an example of a simple AI agent that can search Google and answer questions:

```python
import os
import asyncio
from dotenv import load_dotenv
from llama_index.readers.oxylabs import OxylabsGoogleSearchReader
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.openai import OpenAI

load_dotenv()
reader = OxylabsGoogleSearchReader(
    os.getenv('OXYLABS_USERNAME'), os.getenv('OXYLABS_PASSWORD')
)

def web_search(query: str) -> str:
    """Search Google for the query and return the text of the first document."""
    results = reader.load_data({'query': query, 'parse': True})
    return results[0].text

agent = FunctionAgent(
    tools=[web_search],
    llm=OpenAI(model='gpt-4o-mini'),
    max_function_calls=1,
    system_prompt=(
        'Craft a short Google search query to use with the `web_search` tool. '
        'Analyze the most relevant results and answer the question.'
    )
)

async def main():
    response = await agent.run('How did DeepSeek affect the stock market?')
    print(response)

if __name__ == '__main__':
    asyncio.run(main())
```
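
The `web_search` function above indexes `results[0]` directly, which would raise an `IndexError` if the reader ever returned an empty list, and it passes the full text to the LLM regardless of length. A minimal sketch of a more defensive variant follows. The `make_web_search` factory, the `max_chars` limit of 4000, and the fallback messages are all assumptions made for illustration; the reader is injected so the sketch can run offline against a stub:

```python
from types import SimpleNamespace


def make_web_search(reader, max_chars=4000):
    """Build a `web_search` tool around any object that has a `load_data` method."""

    def web_search(query: str) -> str:
        """Search Google for the query and return the text of the first document."""
        try:
            results = reader.load_data({'query': query, 'parse': True})
        except Exception as error:
            # Broad on purpose: any failure is reported to the agent as plain text.
            return f'Search failed: {error}'
        if not results:
            return 'No results found.'
        # Truncate so that a very long response does not dominate the prompt.
        return results[0].text[:max_chars]

    return web_search


if __name__ == '__main__':
    # Stub standing in for OxylabsGoogleSearchReader.
    stub = SimpleNamespace(
        load_data=lambda payload: [SimpleNamespace(text='result for ' + payload['query'])]
    )
    print(make_web_search(stub)('pancakes'))
```

With a real reader, the returned function could take the place of `web_search` in the agent's `tools` list, for example `tools=[make_web_search(reader)]`.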

## Advanced configuration

### Handling dynamic content

The Web Scraper API can handle JavaScript rendering:

```python
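import os
from dotenv import load_dotenv
from llama_index.readers.web import OxylabsWebReader

load_dotenv()
reader = OxylabsWebReader(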
    os.getenv('OXYLABS_USERNAME'), os.getenv('OXYLABS_PASSWORD')
)

results = reader.load_data(
    [
        'https://quotes.toscrape.com/js/'
    ],
    {'render': 'html'}
)
```

### Setting the user agent type

You can specify a different user agent type:

```python
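import os
from dotenv import load_dotenv
from llama_index.readers.web import OxylabsWebReader

load_dotenv()
reader = OxylabsWebReader(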
    os.getenv('OXYLABS_USERNAME'), os.getenv('OXYLABS_PASSWORD')
)

results = reader.load_data(
    [
        'https://sandbox.oxylabs.io/products/1'
    ],
    {'user_agent_type': 'mobile'}
)
```
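
Both `render` and `user_agent_type` are passed in the dictionary given as the second argument to `load_data`. Assuming the two options can share one dictionary, a small builder keeps unset options out of the payload. This is a sketch; `build_web_payload` is a hypothetical helper, and only the values `'html'` and `'mobile'` come from the examples above:

```python
def build_web_payload(render=None, user_agent_type=None):
    """Return a payload dictionary containing only the options that were set."""
    options = {'render': render, 'user_agent_type': user_agent_type}
    return {key: value for key, value in options.items() if value is not None}


if __name__ == '__main__':
    # Both options from the sections above, combined in one dictionary.
    print(build_web_payload(render='html', user_agent_type='mobile'))
    # Nothing set: an empty dictionary.
    print(build_web_payload())
```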

### Using target-specific parameters

Many target-specific scrapers support additional parameters:

```python
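import os
from dotenv import load_dotenv
from llama_index.readers.oxylabs import OxylabsGoogleSearchReader

load_dotenv()
reader = OxylabsGoogleSearchReader(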
    os.getenv('OXYLABS_USERNAME'),
    os.getenv('OXYLABS_PASSWORD')
)
results = reader.load_data({
    'query': 'iphone',
    'parse': True,
    'domain': 'com',
    'start_page': 2,
    'pages': 3
})
```
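
A payload like the one above can be assembled and sanity-checked before any request is sent. A minimal sketch follows. `build_google_payload` and `covered_pages` are hypothetical helpers, the default values are assumptions made for this example, and `covered_pages` assumes that `pages` counts result pages starting at `start_page`:

```python
def build_google_payload(query, domain='com', start_page=1, pages=1, parse=True):
    """Return a payload dictionary for the Google search reader.

    The default values are assumptions made for this sketch.
    """
    if not query.strip():
        raise ValueError('query must not be empty')
    if start_page < 1 or pages < 1:
        raise ValueError('start_page and pages must be at least 1')
    return {
        'query': query,
        'parse': parse,
        'domain': domain,
        'start_page': start_page,
        'pages': pages,
    }


def covered_pages(payload):
    """List the result pages a payload asks for.

    Assumption: `pages` counts pages starting at `start_page`.
    """
    first = payload['start_page']
    return list(range(first, first + payload['pages']))


if __name__ == '__main__':
    payload = build_google_payload('iphone', start_page=2, pages=3)
    print(payload)
    print(covered_pages(payload))
```

Under that assumption, the payload from the example above would request result pages 2, 3, and 4.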

## Creating a vector index

LlamaIndex is particularly well suited to building vector indexes from web content:

```python
import os
from dotenv import load_dotenv
from llama_index.readers.web import OxylabsWebReader
from llama_index.core import Settings, VectorStoreIndex
from llama_index.llms.openai import OpenAI

load_dotenv()
reader = OxylabsWebReader(
    os.getenv('OXYLABS_USERNAME'), os.getenv('OXYLABS_PASSWORD')
)
documents = reader.load_data([
    'https://sandbox.oxylabs.io/products/1',
    'https://sandbox.oxylabs.io/products/2'
])

# Configure LlamaIndex settings
Settings.llm = OpenAI(model='gpt-4o-mini')

# Create an index
index = VectorStoreIndex.from_documents(documents)

# Query the index
query_engine = index.as_query_engine()
response = query_engine.query('What is the main topic of these pages?')
print(response)
```
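
Conceptually, a vector index represents each document as a numeric vector and answers a query by finding the documents whose vectors lie closest to the query's vector. The toy sketch below illustrates only that idea. It is not how `VectorStoreIndex` is implemented: word counts stand in for a learned embedding model, and `embed`, `cosine`, and `top_match` are hypothetical names:

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (real indexes use a learned model)."""
    return Counter(re.findall(r'[a-z0-9]+', text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_match(query, texts):
    """Return the text whose toy vector is most similar to the query's vector."""
    query_vector = embed(query)
    return max(texts, key=lambda text: cosine(query_vector, embed(text)))


if __name__ == '__main__':
    pages = [
        'The Legend of Zelda is an adventure game',
        'Super Mario is a platform game',
    ]
    print(top_match('platform game with Mario', pages))
```

In the LlamaIndex example above, the embedding, storage, and retrieval steps are handled by `VectorStoreIndex.from_documents` and the query engine.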

