# AI-Crawler

## Overview

[**AI-Crawler**](https://aistudio.oxylabs.io/apps/crawl) is a data extraction app that uses advanced AI algorithms to crawl a given domain. It identifies relevant pages based on a natural language prompt and extracts structured **JSON** or **Markdown** output data.

This low-code tool is designed to simplify complex data acquisition tasks, allowing developers and data scientists to focus on analysis rather than building and maintaining custom web scrapers. The AI web crawler offers advanced filtering, schema-based parsing, and seamless integration with various automation pipelines.

You can preview the tool [**here**](https://aistudio.oxylabs.io/apps/crawl) and integrate it into your workflows via our Python/JavaScript SDKs, MCP server, or one of our third-party integrations.

## Key features

* **Start a crawl from any given URL:** Begin your data extraction from any valid web address, which AI-Crawler uses as its starting point.
* **Natural language prompt:** Define your data needs in plain English, and the crawl agent will interpret the prompt to find relevant content.
* **AI-assisted URL selection:** The AI web crawler intelligently explores the site, identifying and prioritizing pages most aligned with your prompt.
* **Multiple output formats:** Choose between structured JSON or Markdown output for seamless integration into automation or AI workflows.
* **Schema-based parsing:** For JSON output, you can define a parsing schema in natural language to ensure the extracted data is structured to fit your application.

## Usage

To get started with AI-Crawler, follow these four steps:

1. **Provide a starting URL** of the website you want the web crawler to explore.
2. **Describe the content** you want to retrieve using a natural language prompt for the crawl agent.
3. **Select the output format.** Choose between structured JSON or Markdown.
4. **If using JSON output,** provide a schema to guide the AI web crawler in parsing and structuring the extracted data.

### Installation

To begin, make sure you have an API key (or [get a free trial](https://aistudio.oxylabs.io/register) with **1,000 credits**) and Python 3.10+ installed. Install the `oxylabs-ai-studio` package using pip:

```sh
pip install oxylabs-ai-studio
```

### Code examples (Python)

The following example demonstrates how to use the `AiCrawler` class to generate a schema and crawl a website for structured data.

```python
from oxylabs_ai_studio.apps.ai_crawler import AiCrawler
import json

# Initialize the AI Crawler with your API key
crawler = AiCrawler(api_key="your_api_key")

# Generate a schema automatically from natural language
schema = crawler.generate_schema(prompt="want to parse name, platform, price")
print(f"Generated schema: {schema}")

# Crawl a website and extract structured data
url = "https://sandbox.oxylabs.io/products"
result = crawler.crawl(
    url=url,
    user_prompt="Find all Halo games for Xbox",
    output_format="json",
    schema=schema,
    render_javascript=False,
    return_sources_limit=3,
    geo_location="US",
)

# Print the crawl output as JSON
print("Results:")
print(json.dumps(result.data, indent=2))
```

Learn more about AI-Crawler and the Oxylabs AI Studio Python SDK on our [PyPI page](https://pypi.org/project/oxylabs-ai-studio/). JavaScript users can check out our [AI Studio JavaScript SDK](https://github.com/oxylabs/oxylabs-ai-studio-js) guide.

### Request parameters

| Parameter                                                  | Description                                                   | Default Value |
| ---------------------------------------------------------- | ------------------------------------------------------------- | ------------- |
| <mark style="background-color:green;">`url`</mark>         | Starting URL to crawl                                         | –             |
| <mark style="background-color:green;">`user_prompt`</mark> | Natural language prompt to guide extraction                   | –             |
| `output_format`                                            | Output format (`json`, `markdown`)                            | `markdown`    |
| `schema`                                                   | OpenAPI schema for structured extraction (mandatory for JSON) | –             |
| `render_javascript`                                        | Enable JavaScript rendering                                   | `False`       |
| `return_sources_limit`                                     | Max number of sources to return                               | `25`          |
| `geo_location`                                             | Proxy location as a two-letter ISO country code (e.g., `US`)  | –             |

Parameters highlighted in green are mandatory.
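
The constraints in the table can be checked before a request is sent. The following is a minimal sketch, not part of the SDK: `build_crawl_kwargs` is a hypothetical helper that applies the defaults listed above and rejects a JSON request that lacks a schema. The two-letter check on `geo_location` is an assumption based on the parameter description.

```python
from typing import Any, Optional


def build_crawl_kwargs(
    url: str,
    user_prompt: str,
    output_format: str = "markdown",
    schema: Optional[dict] = None,
    render_javascript: bool = False,
    return_sources_limit: int = 25,
    geo_location: Optional[str] = None,
) -> dict[str, Any]:
    """Hypothetical helper: validate parameters against the table above."""
    if not url or not user_prompt:
        raise ValueError("url and user_prompt are mandatory")
    if output_format not in ("json", "markdown"):
        raise ValueError("output_format must be 'json' or 'markdown'")
    if output_format == "json" and schema is None:
        raise ValueError("schema is mandatory when output_format is 'json'")
    # Assumption: geo_location is a two-letter country code such as "US".
    if geo_location is not None and len(geo_location) != 2:
        raise ValueError("geo_location must be a two-letter country code")

    kwargs: dict[str, Any] = {
        "url": url,
        "user_prompt": user_prompt,
        "output_format": output_format,
        "render_javascript": render_javascript,
        "return_sources_limit": return_sources_limit,
    }
    # Optional parameters are only included when they are set.
    if schema is not None:
        kwargs["schema"] = schema
    if geo_location is not None:
        kwargs["geo_location"] = geo_location
    return kwargs
```

The returned dictionary can then be unpacked into the call, for example `crawler.crawl(**build_crawl_kwargs(url, prompt))`.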

#### Output samples

`AI-Crawler` can return parsed, ready-to-use output that is easy to integrate into your applications.

Here's what its JSON output looks like:

```json
[
  {
    "data": {
      "items": [
        {
          "name": "Halo: Reach",
          "platform": "Xbox platform",
          "price": 84.99
        }
      ]
    },
    "src": "https://sandbox.oxylabs.io/products/141"
  },
  {
    "data": {
      "items": [
        {
          "name": "Halo 3",
          "platform": "Xbox platform",
          "price": 81.99
        }
      ]
    },
    "src": "https://sandbox.oxylabs.io/products/28"
  },
  {
    "data": {
      "items": [
        {
          "name": "Halo: Combat Evolved",
          "platform": "Xbox platform",
          "price": 87.99
        }
      ]
    },
    "src": "https://sandbox.oxylabs.io/products/6"
  }
]
```
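
Because each entry pairs a `data` object with its `src` URL, the output is straightforward to flatten into rows for a table or CSV file. The sketch below is illustrative only: `flatten_results` is a hypothetical helper, and the `items` key comes from the schema used in this sample, so a different schema would need a different key.

```python
import json

# Shortened copy of the sample output above (two of the three sources).
raw = """
[
  {"data": {"items": [{"name": "Halo: Reach", "platform": "Xbox platform", "price": 84.99}]},
   "src": "https://sandbox.oxylabs.io/products/141"},
  {"data": {"items": [{"name": "Halo 3", "platform": "Xbox platform", "price": 81.99}]},
   "src": "https://sandbox.oxylabs.io/products/28"}
]
"""


def flatten_results(results):
    """Return one flat dict per extracted item, tagged with its source URL."""
    rows = []
    for entry in results:
        data = entry.get("data") or {}
        for item in data.get("items", []):
            rows.append({**item, "src": entry.get("src")})
    return rows


rows = flatten_results(json.loads(raw))
for row in rows:
    print(f'{row["name"]}: {row["price"]} ({row["src"]})')
```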

Alternatively, you can use `output_format="markdown"` to receive Markdown results instead of parsed JSON.
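
A Markdown crawl uses the same call without a schema. The sketch below assumes the `crawl` signature shown in the Python example earlier; `crawl_markdown` is a hypothetical wrapper, and since the shape of `result.data` for Markdown output is not described on this page, it is returned unmodified.

```python
def crawl_markdown(crawler, url: str, user_prompt: str, limit: int = 3):
    """Hypothetical wrapper: run a crawl that returns Markdown output.

    `crawler` is expected to be an `AiCrawler` instance, or any object
    exposing a compatible `crawl` method.
    """
    result = crawler.crawl(
        url=url,
        user_prompt=user_prompt,
        output_format="markdown",  # no schema is passed for Markdown output
        return_sources_limit=limit,
    )
    return result.data
```

With the SDK installed, this could be called as `crawl_markdown(AiCrawler(api_key="your_api_key"), "https://sandbox.oxylabs.io/products", "Find all Halo games for Xbox")`.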

### Practical use cases

AI-Crawler is a versatile tool for a wide range of applications, including:

1. **Finding terms of service pages:** Quickly locate legal and policy pages across a domain.
2. **Gathering pricing pages:** Collect pricing details for competitor analysis or market research.
3. **Retrieving all “About” pages:** Automatically find and extract company information from a list of websites.
4. **Listing AI-related news articles:** Scrape a news site to gather and archive articles on a specific topic.
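
Each of these use cases comes down to a different `user_prompt`. As a rough sketch, the prompts below are illustrative wording rather than recommended values, and `crawl_requests` is a hypothetical helper that pairs a list of starting URLs with the prompt for one use case.

```python
# Illustrative prompts only; adjust the wording to the target site.
USE_CASE_PROMPTS = {
    "terms_of_service": "Find terms of service and other legal or policy pages",
    "pricing": "Find pages that list product or plan pricing",
    "about": "Find the company's About pages",
    "ai_news": "Find news articles about artificial intelligence",
}


def crawl_requests(urls, use_case):
    """Pair each starting URL with the prompt for the chosen use case."""
    prompt = USE_CASE_PROMPTS[use_case]
    return [
        {"url": url, "user_prompt": prompt, "output_format": "markdown"}
        for url in urls
    ]
```

Each resulting dictionary can be unpacked into a crawl call, for example `crawler.crawl(**request)`.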


