YouTube Scraping Guide for AI

This guide walks through a workflow for collecting and filtering YouTube data for AI training with the Web Scraper API's dedicated sources: youtube_search, youtube_search_max, youtube_video_trainability, youtube_metadata, youtube_download, and youtube_transcript.

Step 1: Search for videos

Start by searching for videos related to your topic of interest.

Basic search

For a quick search that returns up to 20 results:

{
  "source": "youtube_search",
  "query": "your search term"
}

Extended search

For broader coverage (up to 700 results):

{
  "source": "youtube_search_max",
  "query": "your search term"
}

Search with filters

Refine your search with filters:

{
  "source": "youtube_search",
  "query": "your search term",
  "type": "video",
  "duration": "4-20",
  "upload_date": "this_month",
  "sort_by": "view_count",
  "hd": true
}

Use filters to narrow results to what you actually need. Options include content type (video, channel, playlist), duration, upload date, sort order, and quality settings such as HD.
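The request bodies above can also be assembled in code. The following is a minimal Python sketch that uses only the parameter names shown in this guide. The helper name build_search_payload is hypothetical, the helper does not validate filter values, and whether youtube_search_max accepts the same filters as youtube_search is not stated here.

```python
from typing import Any


def build_search_payload(query: str, extended: bool = False, **filters: Any) -> dict[str, Any]:
    """Build a request body for youtube_search or youtube_search_max.

    Filter names (type, duration, upload_date, sort_by, hd) are passed
    through unchanged; their values are not validated by this helper.
    """
    if not query.strip():
        raise ValueError("query must not be empty")
    source = "youtube_search_max" if extended else "youtube_search"
    payload: dict[str, Any] = {"source": source, "query": query}
    # Filters set to None are left out of the request body entirely.
    payload.update({key: value for key, value in filters.items() if value is not None})
    return payload


# Example: the filtered search shown above.
filtered = build_search_payload(
    "your search term",
    type="video",
    duration="4-20",
    upload_date="this_month",
    sort_by="view_count",
    hd=True,
)
```

Building payloads through one function keeps the source name and filter keys in a single place, which helps when the same search is repeated for many topics.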

Step 2: Extract video IDs from search results

After receiving search results, extract the video IDs for further processing. In the response from youtube_search or youtube_search_max, video IDs are directly available in the videoId field of each result item, as shown in this example response snippet:

{
    "results": [
        {
            "content": [
                {
                    "videoId": "LK9XuImr8Xg",  // This is the video ID you need
                    "thumbnail": {
                        "thumbnails": [
                            {
                                "url": "https://i.ytimg.com/vi/LK9XuImr8Xg/hq720_2.jpg?sqp=-oaymwE2COgCEMoBSFXyq4qpAygIARUAAIhCGABwAcABBvABAfgBtgiAAoAPigIMCAAQARhaIGUoLTAP&rs=AOn4CLDTvqEgoE2ZNfnn3EalF2ujcthVNw",
                                "width": 360,
                                "height": 202
                            }
                        ]
                    },
                    "title": {
                        // title details
                    }
                }
            ]
        }
    ]
}

Extract these video IDs into a list for use in subsequent API calls.
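As a rough illustration of this step, the Python sketch below walks the results → content → videoId structure from the example response and returns the IDs in order, without duplicates. The function name extract_video_ids is hypothetical, and skipping items that have no videoId is an assumption about how non-video results appear.

```python
from typing import Any


def extract_video_ids(response: dict[str, Any]) -> list[str]:
    """Collect unique videoId values from a search response, preserving order."""
    seen: set[str] = set()
    video_ids: list[str] = []
    for result in response.get("results", []):
        for item in result.get("content", []):
            # Assumption: items that are not videos may carry no videoId; skip them.
            video_id = item.get("videoId") if isinstance(item, dict) else None
            if video_id and video_id not in seen:
                seen.add(video_id)
                video_ids.append(video_id)
    return video_ids
```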

Step 3: Check AI training eligibility

Before downloading or using videos for AI training, check their eligibility:

{
  "source": "youtube_video_trainability",
  "video_id": "rFNDylrjn_w"
}

The response indicates whether the video can be used for AI training. It takes one of three forms:

["all"] - Training permitted for all parties
["none"] - No training permitted for any party
["party1", "party2", ...] - Training permitted only for specific parties
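The three response forms above can be turned into a simple yes/no decision. The Python sketch below assumes the permission list has already been pulled out of the API response (its exact location in the response is not shown in this guide). The function name can_train and the party name "acme" are hypothetical.

```python
def can_train(permissions: list[str], party: str) -> bool:
    """Return True if `party` may use the video for AI training.

    `permissions` is the list described above: ["all"], ["none"],
    or a list of specific party names.
    """
    if "none" in permissions:
        return False
    if "all" in permissions:
        return True
    # Otherwise training is permitted only for the parties listed.
    return party in permissions


# Example: keep only the videos a given party is allowed to train on.
checked = {"vid1": ["all"], "vid2": ["none"], "vid3": ["acme"]}
trainable = [video_id for video_id, perms in checked.items() if can_train(perms, "acme")]
```

An empty list is treated as "not permitted" here, which is a conservative choice rather than documented behavior.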

Step 4: Get video metadata

Collect additional information about the videos to further evaluate their quality and relevance:

{
  "source": "youtube_metadata",
  "query": "VIDEO_ID",
  "parse": true
}

The response will contain metadata like view counts, comments, ratings, and other metrics that can help you assess content quality.

The parse parameter must be set to true for the metadata source.

Step 5: Retrieve content from selected videos

After identifying high-quality, trainable videos based on their eligibility and metadata, you can retrieve their content. The two retrieval steps below are independent of each other and can run in parallel:

5.1 Download video/audio content

{
  "source": "youtube_download",
  "query": "VIDEO_ID",
  "storage_type": "s3",
  "storage_url": "s3://your-bucket/your-folder/"
}

Additional options for download:

{
  "source": "youtube_download",
  "query": "VIDEO_ID",
  "storage_type": "s3",
  "storage_url": "s3://your-bucket/your-folder/",
  "context": [
    {
      "key": "download_type",
      "value": "video"
    },
    {
      "key": "video_quality",
      "value": "1080"
    }
  ]
}

This source is only available via the asynchronous Push-Pull integration and Cloud Storage feature.

Note:

Videos can be up to 3 hours in length
Default resolution is 720p (can be customized)
You can specify audio-only, video-only, or both
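The download request bodies above differ only in their optional context entries, so a small builder can cover both. This Python sketch uses the keys shown in this guide. The function name build_download_payload is hypothetical, and because only the "video" value of download_type appears above, other values are passed through without validation.

```python
from typing import Any, Optional


def build_download_payload(
    video_id: str,
    storage_url: str,
    storage_type: str = "s3",
    download_type: Optional[str] = None,
    video_quality: Optional[str] = None,
) -> dict[str, Any]:
    """Build a youtube_download request body with optional context entries."""
    payload: dict[str, Any] = {
        "source": "youtube_download",
        "query": video_id,
        "storage_type": storage_type,
        "storage_url": storage_url,
    }
    context = []
    # Only "video" is shown in this guide; other download_type values are
    # not listed here, so they are forwarded as given.
    if download_type is not None:
        context.append({"key": "download_type", "value": download_type})
    if video_quality is not None:
        context.append({"key": "video_quality", "value": video_quality})
    if context:
        payload["context"] = context
    return payload
```

When neither option is given, the context key is omitted so that the API defaults (such as 720p) apply.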

5.2 Retrieve video transcripts

Transcripts are not the same as closed captions (CC). Not all videos have transcripts available in all languages. If a transcript doesn't exist in your specified language, the API will return a 404 status code.

Checking if a video has transcripts:

The most efficient way to check transcript availability is by examining the video metadata (Step 4), which includes these fields:

{
    "is_transcript_available": true,
    "generated_subtitle_languages": [
        "en"
    ],
    "generated_transcript_languages": [
        "en"
    ]
}

Checking the metadata first is more cost-effective than sending transcript requests that return 404, because those failed requests are still billable.
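A check along these lines can gate transcript requests. The Python sketch below reads only the fields shown above and assumes they are passed in as one dictionary (where they sit inside the full parsed metadata response is not shown in this guide). The function name has_generated_transcript is hypothetical, and the check covers only the generated transcript languages, not uploader-provided ones.

```python
from typing import Any


def has_generated_transcript(metadata: dict[str, Any], language_code: str) -> bool:
    """Return True if the metadata lists a generated transcript in `language_code`."""
    if not metadata.get("is_transcript_available"):
        return False
    languages = metadata.get("generated_transcript_languages") or []
    return language_code in languages


# Example with the metadata fields shown above.
sample_metadata = {
    "is_transcript_available": True,
    "generated_subtitle_languages": ["en"],
    "generated_transcript_languages": ["en"],
}
should_request_en = has_generated_transcript(sample_metadata, "en")
```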

If the metadata shows transcripts are available, you can retrieve them with:

{
  "source": "youtube_transcript",
  "query": "VIDEO_ID",
  "context": [
    {
      "key": "language_code",
      "value": "en"
    }
  ]
}

To request a manually created (uploader-provided) transcript, add the transcript_origin key:

{
  "source": "youtube_transcript",
  "query": "VIDEO_ID",
  "context": [
    {
      "key": "language_code",
      "value": "en"
    },
    {
      "key": "transcript_origin",
      "value": "uploader_provided"
    }
  ]
}
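The two transcript request bodies above differ only in the transcript_origin entry. The following Python sketch builds either form using the keys and values shown in this guide. The function name build_transcript_payload and its uploader_provided flag are hypothetical conveniences.

```python
from typing import Any


def build_transcript_payload(
    video_id: str,
    language_code: str = "en",
    uploader_provided: bool = False,
) -> dict[str, Any]:
    """Build a youtube_transcript request body for one video and language."""
    context = [{"key": "language_code", "value": language_code}]
    if uploader_provided:
        # Request the manually created transcript instead of the default one.
        context.append({"key": "transcript_origin", "value": "uploader_provided"})
    return {"source": "youtube_transcript", "query": video_id, "context": context}
```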

Checking if a video has transcripts (manually):

On YouTube, click the "..." menu below the video, then look for "Show transcript" in the menu options. If this option is missing, the video doesn't have transcripts available. When present, you can click it to view available transcript languages.

Bulk processing

For efficient processing of multiple videos, use batch endpoints:

{
  "source": "youtube_video_trainability",
  "query": ["VIDEO_ID_1", "VIDEO_ID_2", "VIDEO_ID_3"]
}
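When the list of video IDs is long, it can be split into several batch requests of the shape shown above. This Python sketch does the splitting. The function name batch_trainability_payloads is hypothetical, and the default batch_size of 100 is an arbitrary placeholder, since this guide does not state a maximum batch size.

```python
from typing import Any, Iterator


def batch_trainability_payloads(
    video_ids: list[str],
    batch_size: int = 100,  # Placeholder value; the real limit is not stated in this guide.
) -> Iterator[dict[str, Any]]:
    """Yield youtube_video_trainability request bodies, one per batch of IDs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(video_ids), batch_size):
        yield {
            "source": "youtube_video_trainability",
            "query": video_ids[start:start + batch_size],
        }
```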

Best practices

Follow the discovery workflow from search → trainability → metadata → content to maximize efficiency
Narrow down search results before processing individual videos
Always verify trainability before using content for AI
Check response codes and implement retries for failed requests
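The last practice, checking response codes and retrying, might look like the Python sketch below. It is transport-agnostic: the caller supplies a send function that submits a payload and returns a (status code, body) pair, so no endpoint or HTTP library is assumed. Treating 429 and 5xx as retryable, and never retrying other 4xx codes such as 404, is an assumption made here to avoid repeating requests that would fail again. All names are hypothetical.

```python
import time
from typing import Any, Callable


def send_with_retries(
    send: Callable[[dict], tuple[int, Any]],
    payload: dict,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, Any]:
    """Call `send(payload)` and retry with exponential backoff on 429 or 5xx."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = send(payload)
        # Assumption: only rate limiting (429) and server errors are worth retrying.
        retryable = status == 429 or status >= 500
        if not retryable or attempt == max_attempts:
            return status, body
        # Wait base_delay, then 2x, 4x, ... before the next attempt.
        sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Passing sleep as a parameter makes the backoff easy to test without real waiting.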


