YouTube Scraping Guide for AI
This guide walks through the workflow for collecting and filtering YouTube data for AI training with the Web Scraper API's specialized sources: youtube_search, youtube_search_max, youtube_video_trainability, youtube_metadata, youtube_download, and youtube_transcript.
Step 1: Search for videos
Start by searching for videos related to your topic of interest.
Basic search
For a quick search that returns up to 20 results:
{
  "source": "youtube_search",
  "query": "your search term"
}
Extended search
For more comprehensive results (up to 700 results):
{
  "source": "youtube_search_max",
  "query": "your search term"
}
Search with filters
Refine your search with filters:
{
  "source": "youtube_search",
  "query": "your search term",
  "type": "video",
  "duration": "4-20",
  "upload_date": "this_month",
  "sort_by": "view_count",
  "hd": true
}
Step 2: Extract video IDs from search results
After receiving search results, extract the video IDs for further processing. In the response from youtube_search or youtube_search_max, video IDs are directly available in the videoId field of each result item, as shown in this example response snippet:
{
    "results": [
        {
            "content": [
                {
                    "videoId": "LK9XuImr8Xg",  // This is the video ID you need
                    "thumbnail": {
                        "thumbnails": [
                            {
                                "url": "https://i.ytimg.com/vi/LK9XuImr8Xg/hq720_2.jpg?sqp=-oaymwE2COgCEMoBSFXyq4qpAygIARUAAIhCGABwAcABBvABAfgBtgiAAoAPigIMCAAQARhaIGUoLTAP&rs=AOn4CLDTvqEgoE2ZNfnn3EalF2ujcthVNw",
                                "width": 360,
                                "height": 202
                            }
                        ]
                    },
                    "title": {
                        // title details
                    }
                }
            ]
        }
    ]
}
Extract these video IDs into a list for use in subsequent API calls.
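As a minimal sketch of that extraction, the Python function below walks the results → content nesting shown in the snippet above and collects each videoId. It assumes the response has already been parsed into a dictionary; the function name and the deduplication step are illustrative choices, not part of the API.

```python
from typing import Any


def extract_video_ids(response: dict[str, Any]) -> list[str]:
    """Collect unique videoId values from a search response, preserving order."""
    seen: set[str] = set()
    video_ids: list[str] = []
    for result in response.get("results", []):
        for item in result.get("content", []):
            video_id = item.get("videoId")
            # Skip items that carry no videoId, and skip repeated IDs.
            if video_id and video_id not in seen:
                seen.add(video_id)
                video_ids.append(video_id)
    return video_ids


# Trimmed-down sample shaped like the response snippet above.
sample = {
    "results": [
        {"content": [{"videoId": "LK9XuImr8Xg"}, {"title": {}}, {"videoId": "LK9XuImr8Xg"}]}
    ]
}
print(extract_video_ids(sample))  # ['LK9XuImr8Xg']
```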
Step 3: Check AI training eligibility
Before downloading or using videos for AI training, check their eligibility:
{
  "source": "youtube_video_trainability",
  "video_id": "rFNDylrjn_w"
}
The response indicates whether the video can be used for AI training:
- ["all"]: Training permitted for all parties
- ["none"]: No training permitted for any party
- ["party1", "party2", ...]: Training permitted only for specific parties
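The three cases above can be turned into a simple eligibility check. This sketch assumes you have already pulled the permission list out of the response (the exact field that holds it is not shown here) and that you know the party name that identifies your organization; `may_train` and the party names are hypothetical.

```python
def may_train(permissions: list[str], party: str) -> bool:
    """Return True if `party` may use the video for AI training."""
    if "none" in permissions:
        return False
    if "all" in permissions:
        return True
    # Otherwise the list names the specific parties that are permitted.
    return party in permissions


print(may_train(["all"], "party1"))                # True
print(may_train(["none"], "party1"))               # False
print(may_train(["party1", "party2"], "party3"))   # False
```

An empty list is treated as "not permitted", which is the conservative choice when the response is missing or unexpected.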
Step 4: Get video metadata 
Collect additional information about the videos to further evaluate their quality and relevance:
{
  "source": "youtube_metadata",
  "query": "VIDEO_ID",
  "parse": true
}
The response contains metadata such as view counts, comments, and ratings, which can help you assess content quality.
The parse parameter must be set to true for the metadata source.
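If you are working from a list of video IDs, you can generate one metadata payload per ID. The sketch below only builds the request bodies shown above; how you submit them depends on your integration, and the helper name is an illustrative assumption.

```python
import json


def build_metadata_payloads(video_ids: list[str]) -> list[dict]:
    """Build one youtube_metadata payload per video ID, always with parse enabled."""
    return [
        {"source": "youtube_metadata", "query": video_id, "parse": True}
        for video_id in video_ids
    ]


payloads = build_metadata_payloads(["LK9XuImr8Xg", "rFNDylrjn_w"])
# Python's True is serialized as JSON true, matching the example above.
print(json.dumps(payloads[0]))
```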
Step 5: Retrieve content from selected videos
After identifying high-quality, trainable videos based on their eligibility and metadata, you can proceed with content retrieval. This can be done in two parallel steps:
5.1 Download video/audio content
{
  "source": "youtube_download",
  "query": "VIDEO_ID",
  "storage_type": "s3",
  "storage_url": "s3://your-bucket/your-folder/"
}
Additional options for download:
{
  "source": "youtube_download",
  "query": "VIDEO_ID",
  "storage_type": "s3",
  "storage_url": "s3://your-bucket/your-folder/",
  "context": [
    {
      "key": "download_type",
      "value": "video"
    },
    {
      "key": "video_quality",
      "value": "1080"
    }
  ]
}
Note:
- Videos can be up to 3 hours in length 
- Default resolution is 720p (can be customized) 
- You can specify audio-only, video-only, or both 
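The two download payloads above differ only in the optional context entries, so a small builder can cover both. This is a sketch: the function name is hypothetical, and it passes download_type and video_quality through unchanged rather than validating them, because only the values "video" and "1080" appear in the examples above.

```python
from typing import Optional


def build_download_payload(
    video_id: str,
    storage_url: str,
    download_type: Optional[str] = None,
    video_quality: Optional[str] = None,
) -> dict:
    """Build a youtube_download payload, adding context entries only when set."""
    payload: dict = {
        "source": "youtube_download",
        "query": video_id,
        "storage_type": "s3",
        "storage_url": storage_url,
    }
    context = []
    if download_type is not None:
        context.append({"key": "download_type", "value": download_type})
    if video_quality is not None:
        context.append({"key": "video_quality", "value": video_quality})
    if context:
        payload["context"] = context
    return payload


basic = build_download_payload("VIDEO_ID", "s3://your-bucket/your-folder/")
custom = build_download_payload(
    "VIDEO_ID", "s3://your-bucket/your-folder/", download_type="video", video_quality="1080"
)
print("context" in basic, len(custom["context"]))  # False 2
```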
5.2 Retrieve video transcripts
Transcripts are not the same as closed captions (CC). Not all videos have transcripts available in all languages. If a transcript doesn't exist in your specified language, the API will return a 404 status code.
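Because a missing language results in a 404, one way to handle it is to try several language codes in order of preference. The sketch below assumes a caller-supplied `fetch` function (hypothetical) that submits a payload and returns the HTTP status code and the parsed body; it also assumes that a successful request returns status 200.

```python
from typing import Callable, Optional


def first_available_transcript(
    video_id: str,
    languages: list[str],
    fetch: Callable[[dict], tuple[int, Optional[dict]]],
) -> Optional[tuple[str, dict]]:
    """Return (language, body) for the first language that has a transcript."""
    for language in languages:
        payload = {
            "source": "youtube_transcript",
            "query": video_id,
            "context": [{"key": "language_code", "value": language}],
        }
        status, body = fetch(payload)
        if status == 404:
            # No transcript in this language; try the next one.
            continue
        if status == 200 and body is not None:
            return language, body
        raise RuntimeError(f"Unexpected status {status} for language {language!r}")
    return None


def demo_fetch(payload: dict) -> tuple[int, Optional[dict]]:
    """Stand-in for a real request: pretends only a German transcript exists."""
    language = payload["context"][0]["value"]
    return (200, {"language": language}) if language == "de" else (404, None)


print(first_available_transcript("VIDEO_ID", ["en", "de"], demo_fetch))
```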
Checking if a video has transcripts:
The most efficient way to check transcript availability is by examining the video metadata (Step 4), which includes these fields:
{
    "is_transcript_available": true,
    "generated_subtitle_languages": [
        "en"
    ],
    "generated_transcript_languages": [
        "en"
    ]
}
If the metadata shows transcripts are available, you can retrieve them with:
{
  "source": "youtube_transcript",
  "query": "VIDEO_ID",
  "context": [
    {
      "key": "language_code",
      "value": "en"
    }
  ]
}
For videos with manually created transcripts, specify:
{
  "source": "youtube_transcript",
  "query": "VIDEO_ID",
  "context": [
    {
      "key": "language_code",
      "value": "en"
    },
    {
      "key": "transcript_origin",
      "value": "uploader_provided"
    }
  ]
}
Checking if a video has transcripts (manually):
On YouTube, click the "..." menu below the video, then look for "Show transcript" in the menu options. If this option is missing, the video doesn't have transcripts available. When present, you can click it to view available transcript languages.
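The metadata-based check described earlier in this step can also be automated. The sketch below takes the metadata fields shown above (is_transcript_available and generated_transcript_languages) and returns a youtube_transcript payload only when a generated transcript exists in the requested language. It assumes you have already located these fields in the metadata response, and it does not cover uploader-provided transcripts; the function name is hypothetical.

```python
from typing import Optional


def transcript_payload_if_available(
    metadata: dict, video_id: str, language: str = "en"
) -> Optional[dict]:
    """Return a transcript payload, or None if the metadata reports no match."""
    if not metadata.get("is_transcript_available"):
        return None
    if language not in metadata.get("generated_transcript_languages", []):
        return None
    return {
        "source": "youtube_transcript",
        "query": video_id,
        "context": [{"key": "language_code", "value": language}],
    }


metadata = {
    "is_transcript_available": True,
    "generated_subtitle_languages": ["en"],
    "generated_transcript_languages": ["en"],
}
print(transcript_payload_if_available(metadata, "VIDEO_ID") is not None)  # True
print(transcript_payload_if_available(metadata, "VIDEO_ID", "de"))        # None
```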
Bulk processing
For efficient processing of multiple videos, use batch endpoints:
{
  "source": "youtube_video_trainability",
  "query": ["VIDEO_ID_1", "VIDEO_ID_2", "VIDEO_ID_3"]
}
Best practices
- Follow the discovery workflow from search → trainability → metadata → content to maximize efficiency 
- Narrow down search results before processing individual videos 
- Always verify trainability before using content for AI 
- Check response codes and implement retries for failed requests 
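For the last point, a retry loop with exponential backoff is one common approach. In this sketch, `send` is a hypothetical callable that submits a request and returns its HTTP status code. Treating 429 and 5xx responses as retryable is an assumption, so adjust it to the codes your integration actually returns.

```python
import time
from typing import Callable


def send_with_retries(
    send: Callable[[], int],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `send` until it returns a non-retryable status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status = 0
    for attempt in range(1, max_attempts + 1):
        status = send()
        retryable = status == 429 or status >= 500  # assumption, see above
        if not retryable:
            return status
        if attempt < max_attempts:
            # Wait 1x, 2x, 4x, ... the base delay between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
    return status


# Demo with canned statuses and a recording sleep, so nothing actually waits.
statuses = iter([500, 429, 200])
delays: list[float] = []
final_status = send_with_retries(lambda: next(statuses), sleep=delays.append)
print(final_status, delays)  # 200 [1.0, 2.0]
```

Non-retryable results such as the 404 returned for a missing transcript are handed back immediately, so the caller can handle them without waiting.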