
AI-Crawler

Learn how to crawl a website starting from a URL, find relevant pages, and extract data – all guided by your natural language prompt.

Overview

AI-Crawler is a data extraction app that uses advanced AI algorithms to crawl a given domain. It identifies relevant pages based on a natural language prompt and extracts data as structured JSON or Markdown output.

This low-code tool is designed to simplify complex data acquisition tasks, allowing developers and data scientists to focus on analysis rather than building and maintaining custom web scrapers. The AI web crawler offers advanced filtering, schema-based parsing, and seamless integration with various automation pipelines.

You can preview the tool here and integrate it into your workflows via our Python/JavaScript SDKs, MCP server, or one of our third-party integrations.

Key features

  • Start a crawl from any given URL: Point the AI Crawler at any valid web address and use it as the starting point for data extraction.

  • Natural language prompt: Define your data needs in plain English, and the crawl agent will interpret the prompt to find relevant content.

  • AI-assisted URL selection: The AI web crawler intelligently explores the site, identifying and prioritizing pages most aligned with your prompt.

  • Multiple output formats: Choose between structured JSON and Markdown output for seamless integration into automation or AI workflows.

  • Schema-based parsing: For JSON output, you can define a parsing schema in natural language to ensure the extracted data is structured to fit your application.
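
To make schema-based parsing concrete, here is a rough sketch of what a hand-written, OpenAPI-style schema for the products example further below could look like. The exact field layout is an assumption modeled on the JSON output sample in this guide, not the definitive format the API expects; generate_schema(), shown in the code examples, is the safer route.

from oxylabs_ai_studio.apps.ai_crawler import AiCrawler

crawler = AiCrawler(api_key="your_api_key")

# Hand-written OpenAPI-style schema (illustrative sketch; the exact shape
# the API expects may differ, so prefer generate_schema() when in doubt).
schema = {
    "type": "object",
    "properties": {
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "platform": {"type": "string"},
                    "price": {"type": "number"},
                },
            },
        }
    },
}

# Pass the schema to crawl() together with output_format="json",
# as shown in the full code example below.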

Usage

To get started with the AI Crawler, follow this four-step process:

  1. Provide a starting URL of the website you want the web crawler to explore.

  2. Describe the content you want to retrieve using a natural language prompt for the crawl agent.

  3. Select the output format: structured JSON or Markdown.

  4. If using JSON output, provide a schema to guide the AI web crawler in parsing and structuring the extracted data.

Installation

To begin, make sure you have an API key (or get a free trial with 1,000 credits) and Python 3.10+ installed. You can install the oxylabs-ai-studio package using pip:

pip install oxylabs-ai-studio
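
If you'd rather keep the API key out of your source code, a small sketch like the one below reads it from an environment variable you set yourself; the variable name is only a convention chosen for this example, not something the SDK requires.

import os

from oxylabs_ai_studio.apps.ai_crawler import AiCrawler

# Read the key from an environment variable of your choosing
# (OXYLABS_AI_STUDIO_API_KEY is just the name used in this sketch).
api_key = os.environ["OXYLABS_AI_STUDIO_API_KEY"]
crawler = AiCrawler(api_key=api_key)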

Code examples (Python)

The following examples demonstrate how to use the AiCrawler to perform common crawling tasks.

from oxylabs_ai_studio.apps.ai_crawler import AiCrawler
import json

# Initialize the AI Crawler with your API key
crawler = AiCrawler(api_key="your_api_key")

# Generate a schema automatically from natural language
schema = crawler.generate_schema(prompt="want to parse name, platform, price")
print(f"Generated schema: {schema}")

# Crawl a website and extract structured data
url = "https://sandbox.oxylabs.io/products"
result = crawler.crawl(
    url=url,
    user_prompt="Find all Halo games for Xbox",
    output_format="json",
    schema=schema,
    render_javascript=False,
    return_sources_limit=3,
    geo_location="US",
)

# Print the crawl output as JSON
print("Results:")
print(json.dumps(result.data, indent=2))

Learn more about AI-Crawler and the Oxylabs AI Studio Python SDK in our PyPI repository. You can also check out our AI Studio JavaScript SDK guide for JS users.

Request parameters

Parameter               Description                                                      Default Value
url*                    Starting URL to crawl                                            -
user_prompt*            Natural language prompt to guide extraction                      -
output_format           Output format (json or markdown)                                 markdown
schema                  OpenAPI schema for structured extraction (mandatory for JSON)    -
render_javascript       Whether to render JavaScript when loading pages                  False
return_sources_limit    Maximum number of sources to return                              25
geo_location            Proxy location in ISO2 (two-letter country code) format          -

* – mandatory parameters

Output samples

AI-Crawler can return parsed, ready-to-use output that is easy to integrate into your applications.

Here's what its JSON output looks like:

[
  {
    "data": {
      "items": [
        {
          "name": "Halo: Reach",
          "platform": "Xbox platform",
          "price": 84.99
        }
      ]
    },
    "src": "https://sandbox.oxylabs.io/products/141"
  },
  {
    "data": {
      "items": [
        {
          "name": "Halo 3",
          "platform": "Xbox platform",
          "price": 81.99
        }
      ]
    },
    "src": "https://sandbox.oxylabs.io/products/28"
  },
  {
    "data": {
      "items": [
        {
          "name": "Halo: Combat Evolved",
          "platform": "Xbox platform",
          "price": 87.99
        }
      ]
    },
    "src": "https://sandbox.oxylabs.io/products/6"
  }
]
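
Assuming result.data follows the structure of the sample above (a list of per-source objects, each holding a data.items array), a short post-processing sketch like this flattens everything into a single list of products:

# Flatten the per-source results into one list of product dicts.
# Assumes result.data matches the JSON sample shown above.
products = []
for source in result.data:
    for item in source["data"]["items"]:
        products.append({**item, "source_url": source["src"]})

for product in products:
    print(product["name"], product["price"], product["source_url"])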

Alternatively, you can set output_format="markdown" to receive Markdown results instead of parsed JSON.
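
A minimal sketch of the Markdown mode, reusing the crawler from the earlier example, could look like this; the schema is omitted because the parameter table marks it as mandatory only for JSON output.

# Request Markdown instead of structured JSON (no schema needed).
markdown_result = crawler.crawl(
    url="https://sandbox.oxylabs.io/products",
    user_prompt="Find all Halo games for Xbox",
    output_format="markdown",
    return_sources_limit=3,
)
print(markdown_result.data)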

Practical use cases

AI-Crawler is a versatile tool for a wide range of applications, including:

  1. Finding terms of service pages: Quickly locate legal and policy pages across a domain; a short example sketch follows this list.

  2. Gathering pricing pages: Collect pricing details for competitor analysis or market research.

  3. Retrieving all “About” pages: Automatically find and extract company information from a list of websites.

  4. Listing AI-related news articles: Scrape a news site to gather and archive articles on a specific topic.
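
As an illustration of the first use case, a hedged sketch could look like the following; the domain and prompt wording are placeholders rather than a tested example.

# Illustrative sketch: locate a terms of service page on a domain.
# The domain and prompt below are placeholders.
tos_result = crawler.crawl(
    url="https://www.example.com",
    user_prompt="Find the terms of service page and return its full text",
    output_format="markdown",
    return_sources_limit=1,
)
print(tos_result.data)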

