LlamaIndex
The LlamaIndex integration with the Oxylabs Web Scraper API lets you scrape web data and process it with an LLM (Large Language Model) in a single workflow.
Overview
LlamaIndex is a data framework designed for building LLM applications with external data sources. Use it with Oxylabs Web Scraper API to:
Scrape structured data without handling CAPTCHAs, IP blocks, or JS rendering
Process results with an LLM in the same pipeline
Build end-to-end workflows from extraction to AI-powered output
Getting started
Create your API user credentials: sign up for a free trial or purchase the product in the Oxylabs dashboard to get your API user credentials (USERNAME and PASSWORD).
Environment setup
This guide uses the Python programming language. Install the required libraries using pip:
pip install -qU llama-index llama-index-readers-oxylabs llama-index-readers-web
Create a .env file in your project directory with your Oxylabs Web Scraper API credentials and OpenAI API key:
OXYLABS_USERNAME=your_API_username
OXYLABS_PASSWORD=your_API_password
OPENAI_API_KEY=your-openai-key
Load these environment variables in your Python script:
import os
from dotenv import load_dotenv
load_dotenv()
Integration methods
There are two ways to access web content via Web Scraper API in LlamaIndex:
Oxylabs Reader
The llama-index-readers-oxylabs module contains dedicated classes that enable you to scrape data from various sources:
Google Web Search: OxylabsGoogleSearchReader
Google Search Ads: OxylabsGoogleAdsReader
Amazon Product: OxylabsAmazonProductReader
Amazon Search: OxylabsAmazonSearchReader
Amazon Reviews: OxylabsAmazonReviewsReader
YouTube Transcript: OxylabsYoutubeTranscriptReader
For example, you can extract Google search results:
import os
from dotenv import load_dotenv
from llama_index.readers.oxylabs import OxylabsGoogleSearchReader

load_dotenv()

reader = OxylabsGoogleSearchReader(
    os.getenv('OXYLABS_USERNAME'), os.getenv('OXYLABS_PASSWORD')
)

results = reader.load_data({
    'query': 'best pancake recipe',
    'parse': True
})

print(results[0].text)
Oxylabs Web Reader
With the OxylabsWebReader class, you can extract data from any URL:
import os
from dotenv import load_dotenv
from llama_index.readers.web import OxylabsWebReader

load_dotenv()

reader = OxylabsWebReader(
    os.getenv('OXYLABS_USERNAME'), os.getenv('OXYLABS_PASSWORD')
)

results = reader.load_data(
    [
        'https://sandbox.oxylabs.io/products/1',
        'https://sandbox.oxylabs.io/products/2'
    ]
)

for result in results:
    print(result.text + '\n')
Building a basic AI search agent
Below is an example of a simple AI agent that can search Google and answer questions:
import os
import asyncio
from dotenv import load_dotenv
from llama_index.readers.oxylabs import OxylabsGoogleSearchReader
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.openai import OpenAI

load_dotenv()

reader = OxylabsGoogleSearchReader(
    os.getenv('OXYLABS_USERNAME'), os.getenv('OXYLABS_PASSWORD')
)

def web_search(query: str) -> str:
    results = reader.load_data({'query': query, 'parse': True})
    return results[0].text

agent = FunctionAgent(
    tools=[web_search],
    llm=OpenAI(model='gpt-4o-mini'),
    max_function_calls=1,
    system_prompt=(
        'Craft a short Google search query to use with the `web_search` tool. '
        'Analyze the most relevant results and answer the question.'
    )
)

async def main():
    response = await agent.run('How did DeepSeek affect the stock market?')
    print(response)

if __name__ == '__main__':
    asyncio.run(main())
Advanced configuration
Handling dynamic content
The Web Scraper API can handle JavaScript rendering:
reader = OxylabsWebReader(
    os.getenv('OXYLABS_USERNAME'), os.getenv('OXYLABS_PASSWORD')
)

results = reader.load_data(
    [
        'https://quotes.toscrape.com/js/'
    ],
    {'render': 'html'}
)
Setting user agent type
Use the user_agent_type parameter to request pages as a specific device type, such as mobile:
reader = OxylabsWebReader(
    os.getenv('OXYLABS_USERNAME'), os.getenv('OXYLABS_PASSWORD')
)

results = reader.load_data(
    [
        'https://sandbox.oxylabs.io/products/1'
    ],
    {'user_agent_type': 'mobile'}
)
Using target-specific parameters
Many target-specific readers accept additional parameters, for example localization and pagination:
reader = OxylabsGoogleSearchReader(
    os.getenv('OXYLABS_USERNAME'),
    os.getenv('OXYLABS_PASSWORD')
)

results = reader.load_data({
    'query': 'iphone',
    'parse': True,
    'domain': 'com',
    'start_page': 2,
    'pages': 3
})
Creating vector indices
LlamaIndex is particularly useful for building vector indices from web content:
import os
from dotenv import load_dotenv
from llama_index.readers.web import OxylabsWebReader
from llama_index.core import Settings, VectorStoreIndex
from llama_index.llms.openai import OpenAI

load_dotenv()

reader = OxylabsWebReader(
    os.getenv('OXYLABS_USERNAME'), os.getenv('OXYLABS_PASSWORD')
)

documents = reader.load_data([
    'https://sandbox.oxylabs.io/products/1',
    'https://sandbox.oxylabs.io/products/2'
])

# Configure LlamaIndex settings
Settings.llm = OpenAI(model='gpt-4o-mini')

# Create an index
index = VectorStoreIndex.from_documents(documents)

# Query the index
query_engine = index.as_query_engine()
response = query_engine.query('What is the main topic of these pages?')
print(response)