# Scraping Guide for AI

This guide walks through collecting and filtering YouTube data for AI training with [**Web Scraper API's specialized sources**](https://oxylabs.io/products/scraper-api/web/youtube): `youtube_search`, `youtube_search_max`, `youtube_video_trainability`, `youtube_metadata`, `youtube_download`, and `youtube_transcript`.

## Step 1: Search for videos

Start by searching for videos related to your topic of interest.

### Basic search

For a quick search that returns up to 20 results:

```json
{
  "source": "youtube_search",
  "query": "your search term"
}
```
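
As a minimal sketch, a payload like the one above could be sent from Python with the standard library alone. The endpoint URL and the use of HTTP Basic authentication are assumptions that this guide does not state, and `build_request` and `send` are hypothetical helper names; check the integration documentation for your account before relying on them.

```python
import base64
import json
import urllib.request

# Assumption: a realtime endpoint URL; this guide does not name one.
API_URL = "https://realtime.oxylabs.io/v1/queries"


def build_request(payload: dict, username: str, password: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST request carrying the JSON payload."""
    # Assumption: credentials are passed via HTTP Basic authentication.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 180.0) -> dict:
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Building the request performs no network traffic; call send(request) to execute it.
request = build_request(
    {"source": "youtube_search", "query": "your search term"},
    username="USERNAME",
    password="PASSWORD",
)
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any billable call is made.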

### Extended search

For more comprehensive results (up to 700 results):

```json
{
  "source": "youtube_search_max",
  "query": "your search term"
}
```

### Search with filters

Refine your search with filters:

```json
{
  "source": "youtube_search",
  "query": "your search term",
  "type": "video",
  "duration": "4-20",
  "upload_date": "this_month",
  "sort_by": "view_count",
  "hd": true
}
```

{% hint style="info" %}
Use the appropriate filters to narrow down results based on your specific needs. Options include content type (video, channel, playlist), duration, upload date, and quality settings.
{% endhint %}
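
When filters vary between runs, it can help to assemble the payload programmatically. The sketch below uses a hypothetical helper, `build_search_payload`, that copies only the filters you actually set. It passes filter names and values through unchanged and does not validate them; whether `youtube_search_max` accepts the same filters is not stated in this guide.

```python
from typing import Any, Dict


def build_search_payload(query: str, extended: bool = False, **filters: Any) -> Dict[str, Any]:
    """Build a search payload, skipping filters that are left as None."""
    payload: Dict[str, Any] = {
        # Extended search uses the youtube_search_max source.
        "source": "youtube_search_max" if extended else "youtube_search",
        "query": query,
    }
    for name, value in filters.items():
        if value is not None:
            payload[name] = value
    return payload


# Reproduces the filtered search payload shown above.
payload = build_search_payload(
    "your search term",
    type="video",
    duration="4-20",
    upload_date="this_month",
    sort_by="view_count",
    hd=True,
)
```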

## Step 2: Extract video IDs from search results

After receiving search results, extract the **video IDs** for further processing. In the response from `youtube_search` or `youtube_search_max`, video IDs are directly available in the `videoId` field of each result item, as shown in this example response snippet:

```json
{
    "results": [
        {
            "content": [
                {
                    "videoId": "LK9XuImr8Xg",  // This is the video ID you need
                    "thumbnail": {
                        "thumbnails": [
                            {
                                "url": "https://i.ytimg.com/vi/LK9XuImr8Xg/hq720_2.jpg?sqp=-oaymwE2COgCEMoBSFXyq4qpAygIARUAAIhCGABwAcABBvABAfgBtgiAAoAPigIMCAAQARhaIGUoLTAP&rs=AOn4CLDTvqEgoE2ZNfnn3EalF2ujcthVNw",
                                "width": 360,
                                "height": 202
                            }
                        ]
                    },
                    "title": {
                        // title details
                    }
                }
            ]
        }
    ]
}
```

Extract these video IDs into a list for use in subsequent API calls.
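
The extraction step can be sketched as follows, assuming the response has already been decoded into a Python dictionary with the `results` → `content` → `videoId` nesting shown above. The function name `extract_video_ids` is hypothetical; items without a `videoId` (for example, channels or playlists) are skipped and duplicates are dropped.

```python
from typing import Any, Dict, List


def extract_video_ids(response: Dict[str, Any]) -> List[str]:
    """Collect unique video IDs from a search response, preserving order."""
    seen = set()
    video_ids: List[str] = []
    for result in response.get("results", []):
        content = result.get("content")
        if not isinstance(content, list):
            continue  # Skip results whose content is missing or not a list.
        for item in content:
            video_id = item.get("videoId") if isinstance(item, dict) else None
            if video_id and video_id not in seen:
                seen.add(video_id)
                video_ids.append(video_id)
    return video_ids


sample_response = {
    "results": [
        {
            "content": [
                {"videoId": "LK9XuImr8Xg", "title": {}},
                {"title": {}},  # No videoId, so this item is ignored.
                {"videoId": "LK9XuImr8Xg"},  # Duplicate, kept only once.
            ]
        }
    ]
}
video_ids = extract_video_ids(sample_response)
```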

## Step 3: Check AI training eligibility

Before downloading or using videos for AI training, check their eligibility:

```json
{
  "source": "youtube_video_trainability",
  "video_id": "rFNDylrjn_w"
}
```

The response indicates whether the video can be used for AI training, and by whom:

* `["all"]` - Training permitted for all parties
* `["none"]` - No training permitted for any party
* `["party1", "party2", ...]` - Training permitted only for specific parties
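
The three cases above can be checked with a small helper. This is a sketch that assumes you have already pulled the permission list out of the response; where that list sits in the response body is not shown in this guide, and `is_training_permitted` is a hypothetical name.

```python
from typing import Iterable, Optional


def is_training_permitted(permissions: Iterable[str], party: Optional[str] = None) -> bool:
    """Return True if the permission list allows training for the given party."""
    allowed = set(permissions)
    if not allowed or "none" in allowed:
        return False  # Empty or explicit "none": treat as not permitted.
    if "all" in allowed:
        return True
    # Otherwise the list names specific parties.
    return party is not None and party in allowed
```

An empty list is treated as "not permitted" here, which is a deliberately conservative choice rather than documented API behavior.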

## Step 4: Get video metadata

Collect additional information about the videos to further evaluate their quality and relevance:

```json
{
  "source": "youtube_metadata",
  "query": "VIDEO_ID",
  "parse": true
}
```

The response will contain metadata like view counts, comments, ratings, and other metrics that can help you assess content quality.

{% hint style="success" %}
The `parse` parameter must be set to `true` for the metadata source.
{% endhint %}
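
Because `parse` must always be `true` for this source, one option is to generate the metadata payloads from your list of eligible video IDs so the flag cannot be forgotten. The helper name `metadata_payloads` below is hypothetical.

```python
from typing import Any, Dict, Iterable, List


def metadata_payloads(video_ids: Iterable[str]) -> List[Dict[str, Any]]:
    """Build one youtube_metadata payload per video ID, always with parse enabled."""
    return [
        {"source": "youtube_metadata", "query": video_id, "parse": True}
        for video_id in video_ids
    ]


payloads = metadata_payloads(["VIDEO_ID_1", "VIDEO_ID_2"])
```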

## Step 5: Retrieve content from selected videos

After identifying high-quality, trainable videos from their eligibility and metadata, retrieve their content. Downloads and transcripts are separate requests, so the two steps below can run in parallel:

### 5.1 Download video/audio content

```json
{
  "source": "youtube_download",
  "query": "VIDEO_ID",
  "storage_type": "s3",
  "storage_url": "s3://your-bucket/your-folder/"
}
```

Additional options for download:

```json
{
  "source": "youtube_download",
  "query": "VIDEO_ID",
  "storage_type": "s3",
  "storage_url": "s3://your-bucket/your-folder/",
  "context": [
    {
      "key": "download_type",
      "value": "video"
    },
    {
      "key": "video_quality",
      "value": "1080"
    }
  ]
}
```

{% hint style="info" %}
This source is only available via the asynchronous [**Push-Pull integration**](https://developers.oxylabs.io/scraping-solutions/web-scraper-api/integration-methods/push-pull) and [**Cloud Storage**](https://developers.oxylabs.io/scraping-solutions/web-scraper-api/features/result-processing-and-storage/cloud-storage) feature.
{% endhint %}

**Note:**

* Videos can be up to 3 hours in length
* Default resolution is 720p (can be customized)
* You can specify audio-only, video-only, or both
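
The download options above can be assembled with a helper such as the hypothetical `build_download_payload` below. It is a sketch that turns keyword arguments into the `context` key/value list and converts values to strings, matching the `"1080"` form used in the example; it does not validate option names or values.

```python
from typing import Any, Dict


def build_download_payload(
    video_id: str,
    storage_url: str,
    storage_type: str = "s3",
    **options: Any,
) -> Dict[str, Any]:
    """Build a youtube_download payload; extra options become context entries."""
    payload: Dict[str, Any] = {
        "source": "youtube_download",
        "query": video_id,
        "storage_type": storage_type,
        "storage_url": storage_url,
    }
    context = [
        {"key": key, "value": str(value)}
        for key, value in options.items()
        if value is not None
    ]
    if context:
        # Omit "context" entirely when no options are set, as in the basic example.
        payload["context"] = context
    return payload


payload = build_download_payload(
    "VIDEO_ID",
    "s3://your-bucket/your-folder/",
    download_type="video",
    video_quality=1080,
)
```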

### 5.2 Retrieve video transcripts

{% hint style="danger" %}
Transcripts are not the same as closed captions (CC). Not all videos have transcripts available in all languages. If a transcript doesn't exist in your specified language, the API will return a `404` status code.
{% endhint %}

#### **Checking if a video has transcripts:**

The most efficient way to check transcript availability is by examining the video metadata [**(Step 4)**](https://developers.oxylabs.io/overview), which includes these fields:

```json
{
    "is_transcript_available": true,
    "generated_subtitle_languages": [
        "en"
    ],
    "generated_transcript_languages": [
        "en"
    ]
}
```

{% hint style="info" %}
This approach is more cost-effective than making requests that result in `404` errors, which are billable.
{% endhint %}
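
A pre-check along these lines can be sketched in Python, assuming the metadata has been decoded into a dictionary containing the fields shown above. The function name `transcript_available` is hypothetical, and the check only consults `generated_transcript_languages`; if uploader-provided languages are reported in a separate field, that field is not covered here.

```python
from typing import Any, Dict


def transcript_available(metadata: Dict[str, Any], language_code: str = "en") -> bool:
    """Return True if metadata reports a generated transcript in the given language."""
    if not metadata.get("is_transcript_available"):
        return False
    languages = metadata.get("generated_transcript_languages") or []
    return language_code in languages


metadata = {
    "is_transcript_available": True,
    "generated_subtitle_languages": ["en"],
    "generated_transcript_languages": ["en"],
}
has_english = transcript_available(metadata, "en")
has_german = transcript_available(metadata, "de")
```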

If the metadata shows transcripts are available, you can retrieve them with:

```json
{
  "source": "youtube_transcript",
  "query": "VIDEO_ID",
  "context": [
    {
      "key": "language_code",
      "value": "en"
    }
  ]
}
```

To request a manually created (uploader-provided) transcript, add the `transcript_origin` key:

```json
{
  "source": "youtube_transcript",
  "query": "VIDEO_ID",
  "context": [
    {
      "key": "language_code",
      "value": "en"
    },
    {
      "key": "transcript_origin",
      "value": "uploader_provided"
    }
  ]
}
```
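
Both transcript requests above differ only in the optional `transcript_origin` entry, so a single helper can cover them. The name `build_transcript_payload` is hypothetical, and the sketch passes values through without validating them.

```python
from typing import Any, Dict, Optional


def build_transcript_payload(
    video_id: str,
    language_code: str = "en",
    transcript_origin: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a youtube_transcript payload, adding transcript_origin only when given."""
    context = [{"key": "language_code", "value": language_code}]
    if transcript_origin is not None:
        context.append({"key": "transcript_origin", "value": transcript_origin})
    return {"source": "youtube_transcript", "query": video_id, "context": context}


default_payload = build_transcript_payload("VIDEO_ID")
manual_payload = build_transcript_payload("VIDEO_ID", transcript_origin="uploader_provided")
```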

#### **Checking if a video has transcripts (manually):**

On YouTube, click the "..." menu below the video, then look for **"Show transcript"** in the menu options. If this option is missing, the video doesn't have transcripts available. When present, you can click it to view available transcript languages.

## Bulk processing

For efficient processing of multiple videos, use batch endpoints:

```json
{
  "source": "youtube_video_trainability",
  "query": ["VIDEO_ID_1", "VIDEO_ID_2", "VIDEO_ID_3"]
}
```
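
For long ID lists, the IDs can be split into chunks, with one batch payload per chunk. In the sketch below, `batch_payloads` is a hypothetical helper and the default `batch_size` is a placeholder, not a documented limit; set it to whatever maximum applies to your plan.

```python
from typing import Any, Dict, Iterable, List


def batch_payloads(
    video_ids: Iterable[str],
    batch_size: int = 100,  # Placeholder value, not a documented API limit.
    source: str = "youtube_video_trainability",
) -> List[Dict[str, Any]]:
    """Split video IDs into chunks and build one batch payload per chunk."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    ids = list(video_ids)
    return [
        {"source": source, "query": ids[start:start + batch_size]}
        for start in range(0, len(ids), batch_size)
    ]


payloads = batch_payloads(["VIDEO_ID_1", "VIDEO_ID_2", "VIDEO_ID_3"], batch_size=2)
```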

## Best practices

1. Follow the discovery workflow from **search → trainability → metadata → content** to maximize efficiency
2. Narrow down search results before processing individual videos
3. Always verify trainability before using content for AI training
4. Check [**response codes**](https://developers.oxylabs.io/scraping-solutions/web-scraper-api/response-codes) and implement retries for failed requests
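
Retries with exponential backoff can be sketched as below. The helper `call_with_retries` is hypothetical: it expects a callable that performs one request and returns a `(status_code, body)` tuple. Treating only `5xx` statuses as retryable is an assumption; consult the response codes page for the actual list. A `404` from the transcript source is returned immediately rather than retried, since repeating it would only add billable requests.

```python
import time
from typing import Any, Callable, Tuple


def call_with_retries(
    fetch: Callable[[], Tuple[int, Any]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    # Assumption: only server-side (5xx) statuses are worth retrying.
    should_retry: Callable[[int], bool] = lambda status: status >= 500,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, Any]:
    """Call fetch up to max_attempts times, doubling the delay between attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status, body = fetch()
    for attempt in range(1, max_attempts):
        if not should_retry(status):
            break
        # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
        sleep(base_delay * 2 ** (attempt - 1))
        status, body = fetch()
    return status, body
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting.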
