# LlamaIndex

The LlamaIndex integration with the [**Oxylabs Web Scraper API**](https://oxylabs.io/products/scraper-api/web) enables you to scrape and process web data through an LLM (Large Language Model) in the same workflow.

## Overview

[**LlamaIndex**](https://docs.llamaindex.ai/en/stable/examples/data_connectors/OxylabsDemo/) is a data framework designed for building LLM applications with external data sources. Use it with [**Oxylabs Web Scraper API**](https://oxylabs.io/products/scraper-api/web) to:

* Scrape structured data without handling CAPTCHAs, IP blocks, or JS rendering
* Process results with an LLM in the same pipeline
* Build end-to-end workflows from extraction to AI-powered output

## Getting started

**Create your API user credentials:** sign up for a free trial or purchase the product in the [**Oxylabs dashboard**](https://dashboard.oxylabs.io/en/registration); this generates the API user credentials (`USERNAME` and `PASSWORD`) you'll use throughout this guide.

{% hint style="info" %}
If you need more than one API user for your account, please contact customer support or message our 24/7 live chat.
{% endhint %}

### Environment setup

This guide uses Python. Install the required libraries, including `python-dotenv` for loading credentials, with pip:

```bash
pip install -qU llama-index llama-index-readers-oxylabs llama-index-readers-web python-dotenv
```

Create a `.env` file in your project directory with your Oxylabs Web Scraper API credentials and OpenAI API key:

```
OXYLABS_USERNAME=your_API_username
OXYLABS_PASSWORD=your_API_password
OPENAI_API_KEY=your-openai-key
```

Load these environment variables in your Python script:

```python
import os
from dotenv import load_dotenv

load_dotenv()
```
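Before making any API calls, it helps to fail fast when a credential is missing. The following sketch is illustrative (the `require_env` helper is not part of any library):

```python
import os

def require_env(*names: str) -> list[str]:
    """Return the values of the given environment variables, raising if any is unset."""
    missing = [name for name in names if not os.getenv(name)]
    if missing:
        raise EnvironmentError(f"Missing environment variables: {', '.join(missing)}")
    return [os.environ[name] for name in names]

# Example: username, password = require_env('OXYLABS_USERNAME', 'OXYLABS_PASSWORD')
```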

## Integration methods

There are two ways to access web content via the Web Scraper API in LlamaIndex:

### Oxylabs Reader

The `llama-index-readers-oxylabs` module contains specific classes that enable you to scrape data from various sources:

| API Data Source    | Reader Class                     |
| ------------------ | -------------------------------- |
| Google Web Search  | `OxylabsGoogleSearchReader`      |
| Google Search Ads  | `OxylabsGoogleAdsReader`         |
| Amazon Product     | `OxylabsAmazonProductReader`     |
| Amazon Search      | `OxylabsAmazonSearchReader`      |
| Amazon Reviews     | `OxylabsAmazonReviewsReader`     |
| YouTube Transcript | `OxylabsYoutubeTranscriptReader` |

For example, you can extract Google search results:

```python
import os
from dotenv import load_dotenv
from llama_index.readers.oxylabs import OxylabsGoogleSearchReader

load_dotenv()
reader = OxylabsGoogleSearchReader(
    os.getenv('OXYLABS_USERNAME'), os.getenv('OXYLABS_PASSWORD')
)
results = reader.load_data({
    'query': 'best pancake recipe',
    'parse': True
})
print(results[0].text)
```

### Oxylabs Web Reader

With the `OxylabsWebReader` class, you can extract data from any URL:

```python
import os
from dotenv import load_dotenv
from llama_index.readers.web import OxylabsWebReader

load_dotenv()
reader = OxylabsWebReader(
    os.getenv('OXYLABS_USERNAME'), os.getenv('OXYLABS_PASSWORD')
)
results = reader.load_data(
    [
        'https://sandbox.oxylabs.io/products/1',
        'https://sandbox.oxylabs.io/products/2'
    ]
)
for result in results:
    print(result.text + '\n')
```
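Scraping jobs can occasionally fail for transient reasons such as timeouts. A small retry wrapper around any reader's `load_data` call can smooth this over; the `load_with_retries` helper below is an illustrative sketch, not part of the reader API:

```python
import time

def load_with_retries(load_fn, *args, attempts: int = 3, base_delay: float = 1.0):
    """Call load_fn(*args), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return load_fn(*args)
        except Exception:
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            time.sleep(base_delay * 2 ** attempt)

# Usage with any reader, e.g.:
# results = load_with_retries(reader.load_data, ['https://sandbox.oxylabs.io/products/1'])
```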

## Building a basic AI search agent

Below is an example of a simple AI agent that can search Google and answer questions:

```python
import os
import asyncio
from dotenv import load_dotenv
from llama_index.readers.oxylabs import OxylabsGoogleSearchReader
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.openai import OpenAI

load_dotenv()
reader = OxylabsGoogleSearchReader(
    os.getenv('OXYLABS_USERNAME'), os.getenv('OXYLABS_PASSWORD')
)

def web_search(query: str) -> str:
    results = reader.load_data({'query': query, 'parse': True})
    return results[0].text

agent = FunctionAgent(
    tools=[web_search],
    llm=OpenAI(model='gpt-4o-mini'),
    max_function_calls=1,
    system_prompt=(
        'Craft a short Google search query to use with the `web_search` tool. '
        'Analyze the most relevant results and answer the question.'
    )
)

async def main():
    response = await agent.run('How did DeepSeek affect the stock market?')
    print(response)

if __name__ == '__main__':
    asyncio.run(main())
```

## Advanced configuration

### Handling dynamic content

The Web Scraper API can render JavaScript-heavy pages before returning the content. Pass `'render': 'html'` in the parameters:

```python
reader = OxylabsWebReader(
    os.getenv('OXYLABS_USERNAME'), os.getenv('OXYLABS_PASSWORD')
)

results = reader.load_data(
    [
        'https://quotes.toscrape.com/js/'
    ],
    {'render': 'html'}
)
```

### Setting the user agent type

You can specify the device and browser type with the `user_agent_type` parameter:

```python
reader = OxylabsWebReader(
    os.getenv('OXYLABS_USERNAME'), os.getenv('OXYLABS_PASSWORD')
)

results = reader.load_data(
    [
        'https://sandbox.oxylabs.io/products/1'
    ],
    {'user_agent_type': 'mobile'}
)
```

### Using target-specific parameters

Many target-specific scrapers accept additional parameters, such as localization (`domain`) and pagination (`start_page`, `pages`):

```python
reader = OxylabsGoogleSearchReader(
    os.getenv('OXYLABS_USERNAME'),
    os.getenv('OXYLABS_PASSWORD')
)
results = reader.load_data({
    'query': 'iphone',
    'parse': True,
    'domain': 'com',
    'start_page': 2,
    'pages': 3
})
```
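For pagination, `start_page` sets the first results page and `pages` sets how many consecutive pages to scrape, so the request above covers pages 2 through 4. A tiny local sketch of that arithmetic (no API call involved):

```python
def covered_pages(start_page: int = 1, pages: int = 1) -> list[int]:
    """Results pages scraped for the given start_page and pages parameters."""
    return list(range(start_page, start_page + pages))

print(covered_pages(start_page=2, pages=3))  # [2, 3, 4]
```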

## Creating vector indices

LlamaIndex is particularly useful for building vector indices from web content:

```python
import os
from dotenv import load_dotenv
from llama_index.readers.web import OxylabsWebReader
from llama_index.core import Settings, VectorStoreIndex
from llama_index.llms.openai import OpenAI

load_dotenv()
reader = OxylabsWebReader(
    os.getenv('OXYLABS_USERNAME'), os.getenv('OXYLABS_PASSWORD')
)
documents = reader.load_data([
    'https://sandbox.oxylabs.io/products/1',
    'https://sandbox.oxylabs.io/products/2'
])

# Configure LlamaIndex settings
Settings.llm = OpenAI(model='gpt-4o-mini')

# Create an index
index = VectorStoreIndex.from_documents(documents)

# Query the index
query_engine = index.as_query_engine()
response = query_engine.query('What is the main topic of these pages?')
print(response)
```
