YouTube Scraping Guide for AI

This guide walks you through the workflow for collecting and filtering YouTube data for AI training purposes using Web Scraper API's specialized sources: youtube_search, youtube_search_max, youtube_video_trainability, youtube_metadata, youtube_download, and youtube_transcript.

Step 1: Search for videos

Start by searching for videos related to your topic of interest.

For a quick search that returns up to 20 results:

{
  "source": "youtube_search",
  "query": "your search term"
}

For more comprehensive results (up to 700 results):

{
  "source": "youtube_search_max",
  "query": "your search term"
}

Search with filters

Refine your search with filters:

{
  "source": "youtube_search",
  "query": "your search term",
  "type": "video",
  "duration": "4-20",
  "upload_date": "this_month",
  "sort_by": "view_count",
  "hd": true
}

Use the appropriate filters to narrow down results based on your specific needs. Options include content type (video, channel, playlist), duration, upload date, and quality settings.
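As a rough illustration, a filtered search payload like the one above can be submitted with a few lines of Python. The endpoint URL and credentials below are placeholders (assumptions, not values documented in this guide); substitute the details from your own Web Scraper API integration.

import requests

# Placeholder endpoint and credentials -- replace with your own
# Web Scraper API integration details (assumed, not documented here).
API_ENDPOINT = "https://example-scraper-api.com/v1/queries"
AUTH = ("YOUR_USERNAME", "YOUR_PASSWORD")

payload = {
    "source": "youtube_search",
    "query": "your search term",
    "type": "video",
    "duration": "4-20",
    "upload_date": "this_month",
    "sort_by": "view_count",
    "hd": True,
}

response = requests.post(API_ENDPOINT, json=payload, auth=AUTH, timeout=60)
response.raise_for_status()
search_results = response.json()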

Step 2: Extract video IDs from search results

After receiving search results, extract the video IDs for further processing. In the response from youtube_search or youtube_search_max, video IDs are directly available in the videoId field of each result item, as shown in this example response snippet:

{
    "results": [
        {
            "content": [
                {
                    "videoId": "LK9XuImr8Xg",  // This is the video ID you need
                    "thumbnail": {
                        "thumbnails": [
                            {
                                "url": "https://i.ytimg.com/vi/LK9XuImr8Xg/hq720_2.jpg?sqp=-oaymwE2COgCEMoBSFXyq4qpAygIARUAAIhCGABwAcABBvABAfgBtgiAAoAPigIMCAAQARhaIGUoLTAP&rs=AOn4CLDTvqEgoE2ZNfnn3EalF2ujcthVNw",
                                "width": 360,
                                "height": 202
                            }
                        ]
                    },
                    "title": {
                        // title details
                    }
                }
            ]
        }
    ]
}

Extract these video IDs into a list for use in subsequent API calls.
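If you are working in Python, a small helper like the sketch below gathers the IDs into a flat list, assuming the results → content → videoId structure shown in the snippet above:

def extract_video_ids(search_response: dict) -> list[str]:
    """Collect videoId values from a youtube_search response.

    Assumes the results -> content -> videoId layout shown in the
    example snippet above.
    """
    video_ids = []
    for result in search_response.get("results", []):
        for item in result.get("content", []):
            video_id = item.get("videoId")
            if video_id:
                video_ids.append(video_id)
    return video_ids

# Example usage with the parsed search response:
# video_ids = extract_video_ids(search_results)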

Step 3: Check AI training eligibility

Before downloading or using videos for AI training, check their eligibility:

{
  "source": "youtube_video_trainability",
  "video_id": "rFNDylrjn_w"
}

The response will indicate whether the video can be used for AI training purposes:

  • ["all"] - Training permitted for all parties

  • ["none"] - No training permitted for any party

  • ["party1", "party2", ...] - Training permitted only for specific parties

Step 4: Get video metadata

Collect additional information about the videos to further evaluate their quality and relevance:

{
  "source": "youtube_metadata",
  "query": "VIDEO_ID",
  "parse": true
}

The response will contain metadata like view counts, comments, ratings, and other metrics that can help you assess content quality.
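As an example, you might keep only videos that clear a minimum view count and already have a transcript, as in the sketch below. The "view_count" key is a placeholder assumption; "is_transcript_available" is the metadata field shown in Step 5.2.

def passes_quality_bar(metadata: dict, min_views: int = 10_000) -> bool:
    """Rough quality filter over parsed youtube_metadata output.

    "view_count" is a placeholder field name (an assumption);
    "is_transcript_available" appears in the metadata example in Step 5.2.
    """
    views = int(metadata.get("view_count", 0))
    has_transcript = metadata.get("is_transcript_available", False)
    return views >= min_views and has_transcript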

Step 5: Retrieve content from selected videos

After identifying high-quality, trainable videos based on their eligibility and metadata, you can proceed with content retrieval. This can be done in two parallel steps:

5.1 Download video/audio content

{
  "source": "youtube_download",
  "query": "VIDEO_ID",
  "storage_type": "s3",
  "storage_url": "s3://your-bucket/your-folder/"
}

Additional options for download:

{
  "source": "youtube_download",
  "query": "VIDEO_ID",
  "storage_type": "s3",
  "storage_url": "s3://your-bucket/your-folder/",
  "context": [
    {
      "key": "download_type",
      "value": "video"
    },
    {
      "key": "video_quality",
      "value": "1080"
    }
  ]
}

This source is only available via the asynchronous Push-Pull integration and Cloud Storage feature.

Note:

  • Videos can be up to 3 hours in length

  • Default resolution is 720p (can be customized)

  • You can specify audio-only, video-only, or both
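To queue downloads for every eligible video, you can generate one payload per ID, reusing the context options shown above. The S3 path below is a placeholder, and the quality value follows the video_quality example earlier in this step.

def build_download_payload(video_id: str, quality: str = "720") -> dict:
    """Build a youtube_download job payload for one video.

    The S3 path is a placeholder; download_type and video_quality use
    the context keys shown in the example above.
    """
    return {
        "source": "youtube_download",
        "query": video_id,
        "storage_type": "s3",
        "storage_url": "s3://your-bucket/your-folder/",
        "context": [
            {"key": "download_type", "value": "video"},
            {"key": "video_quality", "value": quality},
        ],
    }

# Example: one payload per trainable video at 1080p
# payloads = [build_download_payload(vid, quality="1080") for vid in video_ids]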

5.2 Retrieve video transcripts

Checking if a video has transcripts:

The most efficient way to check transcript availability is by examining the video metadata (Step 4), which includes these fields:

{
    "is_transcript_available": true,
    "generated_subtitle_languages": [
        "en"
    ],
    "generated_transcript_languages": [
        "en"
    ]
}

This approach is more cost-effective than making requests that result in 404 errors, which are billable.
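In practice, you can gate transcript requests on those metadata fields, as in this sketch:

def should_request_transcript(metadata: dict, language: str = "en") -> bool:
    """Only request a transcript when the metadata says one exists.

    Uses the is_transcript_available and generated_transcript_languages
    fields shown above, avoiding billable 404 responses.
    """
    if not metadata.get("is_transcript_available", False):
        return False
    languages = metadata.get("generated_transcript_languages", [])
    return language in languages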

If the metadata shows transcripts are available, you can retrieve them with:

{
  "source": "youtube_transcript",
  "query": "VIDEO_ID",
  "context": [
    {
      "key": "language_code",
      "value": "en"
    }
  ]
}

For videos with manually created transcripts, specify:

{
  "source": "youtube_transcript",
  "query": "VIDEO_ID",
  "context": [
    {
      "key": "language_code",
      "value": "en"
    },
    {
      "key": "transcript_origin",
      "value": "uploader_provided"
    }
  ]
}

Checking if a video has transcripts (manually):

On YouTube, click the "..." menu below the video, then look for "Show transcript" in the menu options. If this option is missing, the video doesn't have transcripts available. When present, you can click it to view available transcript languages.

Bulk processing

For efficient processing of multiple videos, use batch endpoints:

{
  "source": "youtube_video_trainability",
  "query": ["VIDEO_ID_1", "VIDEO_ID_2", "VIDEO_ID_3"]
}
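If you have more IDs than you want to send in a single batch, you can split them into fixed-size chunks before submitting. The chunk size below is an arbitrary assumption, not a documented limit.

def chunk_trainability_payloads(video_ids: list[str], chunk_size: int = 10):
    """Yield batch payloads for youtube_video_trainability.

    The chunk size is an arbitrary assumption, not a documented limit.
    """
    for start in range(0, len(video_ids), chunk_size):
        yield {
            "source": "youtube_video_trainability",
            "query": video_ids[start:start + chunk_size],
        }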

Best practices

  1. Follow the discovery workflow from search → trainability → metadata → content to maximize efficiency

  2. Narrow down search results before processing individual videos

  3. Always verify trainability before using content for AI

  4. Check response codes and implement retries for failed requests
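For the last point, a minimal retry loop with exponential backoff might look like the sketch below. Retrying on 429 and 5xx responses is a general pattern rather than API-specific guidance, and the url and auth arguments are placeholders for your own integration details.

import time
import requests

def post_with_retries(payload: dict, url: str, auth: tuple,
                      max_attempts: int = 3) -> requests.Response:
    """POST a job payload, retrying transient failures with backoff.

    Retrying on 429/5xx is a general pattern, not API-specific guidance;
    url and auth are placeholders for your own integration details.
    """
    response = None
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.post(url, json=payload, auth=auth, timeout=60)
            if response.status_code < 500 and response.status_code != 429:
                return response
        except requests.RequestException:
            if attempt == max_attempts:
                raise
        time.sleep(2 ** attempt)  # back off before the next attempt
    return response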
