# LlamaIndex

The LlamaIndex integration with the [**Oxylabs Web Scraper API**](https://oxylabs.io/products/scraper-api/web) enables you to scrape and process web data through an LLM (Large Language Model) in the same workflow.

## Overview

[**LlamaIndex**](https://docs.llamaindex.ai/en/stable/examples/data_connectors/OxylabsDemo/) is a data framework designed for building LLM applications with external data sources. Use it with [**Oxylabs Web Scraper API**](https://oxylabs.io/products/scraper-api/web) to:

* Scrape structured data without handling CAPTCHAs, IP blocks, or JS rendering
* Process results with an LLM in the same pipeline
* Build end-to-end workflows from extraction to AI-powered output

## Getting started

**Create your API user credentials:** sign up for a free trial or purchase the product in the [**Oxylabs dashboard**](https://dashboard.oxylabs.io/en/registration) to get your API user credentials (`USERNAME` and `PASSWORD`).

{% hint style="info" %}
If you need more than one API user for your account, please contact our customer support team or message us via the 24/7 live chat.
{% endhint %}

### Environment setup

This guide uses the Python programming language. Install the required libraries, including `python-dotenv` for loading credentials from a `.env` file, using pip:

```
pip install -qU llama-index llama-index-readers-oxylabs llama-index-readers-web python-dotenv
```

Create a `.env` file in your project directory with your Oxylabs Web Scraper API credentials and OpenAI API key:

```
OXYLABS_USERNAME=your_API_username
OXYLABS_PASSWORD=your_API_password
OPENAI_API_KEY=your-openai-key
```

Load these environment variables in your Python script:

```python
import os
from dotenv import load_dotenv

load_dotenv()
```
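Missing credentials otherwise surface only later as authentication errors from the API. To fail fast, you can check the variables right after loading them, as in this sketch (a hypothetical helper, not part of any Oxylabs or LlamaIndex package):

```python
import os

def require_env(*names: str) -> dict:
    """Return the named environment variables, raising early if any are unset."""
    missing = [name for name in names if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in names}
```

Calling `require_env('OXYLABS_USERNAME', 'OXYLABS_PASSWORD')` right after `load_dotenv()` produces a clear error message instead of a failed API request.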

## Integration methods

There are two ways to access web content via Web Scraper API in LlamaIndex:

### Oxylabs Reader

The `llama-index-readers-oxylabs` module provides a dedicated reader class for each supported data source:

| API Data Source    | Reader Class                     |
| ------------------ | -------------------------------- |
| Google Web Search  | `OxylabsGoogleSearchReader`      |
| Google Search Ads  | `OxylabsGoogleAdsReader`         |
| Amazon Product     | `OxylabsAmazonProductReader`     |
| Amazon Search      | `OxylabsAmazonSearchReader`      |
| Amazon Reviews     | `OxylabsAmazonReviewsReader`     |
| YouTube Transcript | `OxylabsYoutubeTranscriptReader` |

For example, you can extract Google search results:

```python
import os
from dotenv import load_dotenv
from llama_index.readers.oxylabs import OxylabsGoogleSearchReader

load_dotenv()
reader = OxylabsGoogleSearchReader(
    os.getenv('OXYLABS_USERNAME'), os.getenv('OXYLABS_PASSWORD')
)
results = reader.load_data({
    'query': 'best pancake recipe',
    'parse': True
})
print(results[0].text)
```

### Oxylabs Web Reader

With the `OxylabsWebReader` class, you can extract data from any URL:

```python
import os
from dotenv import load_dotenv
from llama_index.readers.web import OxylabsWebReader

load_dotenv()
reader = OxylabsWebReader(
    os.getenv('OXYLABS_USERNAME'), os.getenv('OXYLABS_PASSWORD')
)
results = reader.load_data(
    [
        'https://sandbox.oxylabs.io/products/1',
        'https://sandbox.oxylabs.io/products/2'
    ]
)
for result in results:
    print(result.text + '\n')
```
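Scraping jobs can occasionally fail for transient reasons such as timeouts. A simple retry wrapper, sketched below as a hypothetical helper rather than a feature of the reader API, can make batch runs more robust:

```python
import time

def with_retries(fn, attempts: int = 3, delay: float = 1.0):
    """Call fn(), retrying on any exception with a fixed delay between attempts."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay)
```

For example, `with_retries(lambda: reader.load_data(urls))` retries a failed batch up to three times before giving up.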

## Building a basic AI search agent

Below is an example of a simple AI agent that can search Google and answer questions:

```python
import os
import asyncio
from dotenv import load_dotenv
from llama_index.readers.oxylabs import OxylabsGoogleSearchReader
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.openai import OpenAI

load_dotenv()
reader = OxylabsGoogleSearchReader(
    os.getenv('OXYLABS_USERNAME'), os.getenv('OXYLABS_PASSWORD')
)

def web_search(query: str) -> str:
    results = reader.load_data({'query': query, 'parse': True})
    return results[0].text

agent = FunctionAgent(
    tools=[web_search],
    llm=OpenAI(model='gpt-4o-mini'),
    max_function_calls=1,
    system_prompt=(
        'Craft a short Google search query to use with the `web_search` tool. '
        'Analyze the most relevant results and answer the question.'
    )
)

async def main():
    response = await agent.run('How did DeepSeek affect the stock market?')
    print(response)

if __name__ == '__main__':
    asyncio.run(main())
```
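Parsed search results can be long, so you might trim the tool output to keep the prompt within the model's context window. The following is a hypothetical helper, not something LlamaIndex requires:

```python
def truncate_result(text: str, limit: int = 4000) -> str:
    """Trim tool output so the prompt stays within the model's context budget."""
    if len(text) <= limit:
        return text
    return text[:limit].rstrip() + ' …[truncated]'
```

Inside `web_search`, you would then return `truncate_result(results[0].text)` instead of the raw text.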

## Advanced configuration

### Handling dynamic content

Pass `{'render': 'html'}` to have Web Scraper API render JavaScript-heavy pages before returning the content:

```python
reader = OxylabsWebReader(
    os.getenv('OXYLABS_USERNAME'), os.getenv('OXYLABS_PASSWORD')
)

results = reader.load_data(
    [
        'https://quotes.toscrape.com/js/'
    ],
    {'render': 'html'}
)
```

### Setting user agent type

Use the `user_agent_type` parameter to request a specific device type, such as `mobile` or `desktop`:

```python
reader = OxylabsWebReader(
    os.getenv('OXYLABS_USERNAME'), os.getenv('OXYLABS_PASSWORD')
)

results = reader.load_data(
    [
        'https://sandbox.oxylabs.io/products/1'
    ],
    {'user_agent_type': 'mobile'}
)
```

### Using target-specific parameters

Target-specific scrapers accept additional parameters, such as localization and pagination options:

```python
reader = OxylabsGoogleSearchReader(
    os.getenv('OXYLABS_USERNAME'),
    os.getenv('OXYLABS_PASSWORD')
)
results = reader.load_data({
    'query': 'iphone',
    'parse': True,
    'domain': 'com',
    'start_page': 2,
    'pages': 3
})
```
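When issuing many searches with shared settings, a small payload builder keeps the defaults in one place. This is a hypothetical convenience function, not part of the reader API:

```python
def build_payload(query: str, **overrides) -> dict:
    """Merge shared defaults with per-call overrides for load_data()."""
    defaults = {'parse': True, 'domain': 'com'}
    return {**defaults, 'query': query, **overrides}
```

`reader.load_data(build_payload('iphone', start_page=2, pages=3))` then sends the merged parameters in a single request.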

## Creating vector indices

LlamaIndex is particularly useful for building vector indices from web content:

```python
import os
from dotenv import load_dotenv
from llama_index.readers.web import OxylabsWebReader
from llama_index.core import Settings, VectorStoreIndex
from llama_index.llms.openai import OpenAI

load_dotenv()
reader = OxylabsWebReader(
    os.getenv('OXYLABS_USERNAME'), os.getenv('OXYLABS_PASSWORD')
)
documents = reader.load_data([
    'https://sandbox.oxylabs.io/products/1',
    'https://sandbox.oxylabs.io/products/2'
])

# Configure LlamaIndex settings
Settings.llm = OpenAI(model='gpt-4o-mini')

# Create an index
index = VectorStoreIndex.from_documents(documents)

# Query the index
query_engine = index.as_query_engine()
response = query_engine.query('What is the main topic of these pages?')
print(response)
```

