YouTube is one of the world's largest public datasets. For researchers, marketers, and data scientists, the platform is a goldmine for understanding user behavior, tracking trends, and seeing what competitors are doing. You can use this data for everything from sentiment analysis on product reviews to breaking down a competitor's content strategy.
But getting this data means choosing one of two options: using the official API or scraping the site directly. This guide focuses on the latter: how to manually scrape YouTube and build a reliable solution that can extract metadata, comments, transcripts, and search results.
How to Scrape YouTube Titles & Views in 10 Lines?
Let's start with an "instant win". You don't need a complex framework to get basic data. When you load any YouTube video, the page's HTML source embeds large JSON objects in <script> tags, which the site uses to render the page.
The most important object is ytInitialPlayerResponse, which holds core video metadata. You can use a simple Python script with requests and re (regex) to grab and parse this data.
Here’s a minimal script to get the title, channel, and view count:
import requests
import re
import json
def quick_scrape(url):
try:
html = requests.get(
url,
headers={
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/141.0.0.0 Safari/537.36"
},
).text
match = re.search(r"var ytInitialPlayerResponse\s*=\s*({.+?});", html)
data = json.loads(match.group(1))["videoDetails"]
print(f"Title: {data['title']}")
print(f"Channel: {data['author']}")
print(f"Views: {data['viewCount']}")
except Exception as e:
print(f"Scrape failed: {e}")
# --- Example Usage ---
video_url = "https://www.youtube.com/watch?v=F8NKVhkZZWI"
quick_scrape(video_url)
This method is fast, but it's also brittle. If YouTube changes its page structure, the regex will break. A more stable approach involves more precise parsing, which we'll cover later.
But first, let's address why we're doing this manually instead of using the official API.
YouTube Data API vs. Manual Scraping: Which is Better?
Developers have two primary methods for accessing YouTube data: the official API and web extraction. The API is the sanctioned route, and many production use cases are still best served by the official API, partnerships, or approved data providers. Web extraction can be useful when you have permission and a clear compliance posture, or when you need data that isn't available through documented API endpoints.
For data-intensive projects, the best choice depends on the data you need, how close to real time it must be, and what compliance constraints apply.
The YouTube Data API v3 is Google's official, structured method, providing documented endpoints. On the other hand, manual scraping is a data retrieval technique that involves writing a program to access web pages directly. This second approach requires reverse-engineering but offers far greater flexibility.
For most data-intensive applications, the choice becomes clear when you compare them directly:
| Feature | YouTube Data API v3 | Manual scraping |
|---|---|---|
| Data richness | Provides structured but limited metadata. Often lacks full comment threads (with all nested replies), hidden video tags, and other granular details. | Can access virtually any data point visible on the web page, including real-time view counts, complete comment threads, and other non-API-exposed metadata. |
| Real-time access | Data can be subject to delays. Real-time metrics like like/dislike ratios or rapidly changing view counts are often unavailable or cached. | Fetches data directly as it is displayed to a live user, providing true real-time access to engagement metrics as they happen. |
| Rate limits | Strictly enforced via a daily quota system (e.g., 10,000 free units per day). Different requests consume different amounts of quota, making large-scale scraping costly or impossible. | There is no published quota like the API, but practical limits still apply, including rate limiting, throttling, CAPTCHAs, and enforcement actions. Build with conservative pacing, caching, retries with backoff, and clear compliance constraints. |
| Cost at scale | Free for low-volume use, but costs can escalate quickly as API quota usage increases beyond the free tier. | The primary costs are for infrastructure, mainly high-quality proxy services. At scale, this is often more cost-effective than paying for API quotas. |
| Stability | Highly stable and versioned. Endpoints are documented and unlikely to change without notice, leading to low maintenance. | Inherently brittle. The scraper can break whenever YouTube updates its website layout or internal API structure, requiring ongoing monitoring and maintenance. |
This comparison explains why this guide covers web extraction techniques. The official API is often the best option for production workflows, especially when it meets your requirements. When it does not, and you have permission plus a clear compliance posture, web extraction can help you collect specific page level data, comments, transcripts, and search results.
This leads to an important question: is this allowed?
Is It Legal to Scrape YouTube Data?
Before writing any code, we need to address the legal and ethical questions.
Here's the short answer: it's a legal “grey area”.
On one hand, YouTube's Terms of Service (ToS) explicitly prohibit automated data collection, stating that developers "must not... directly or indirectly, scrape".
On the other hand, U.S. courts have repeatedly ruled that scraping publicly accessible data (anything not behind a login) does not violate the Computer Fraud and Abuse Act (CFAA).
This means that while you are violating the ToS, you are likely not breaking the law.
So, what's the real risk?
For most ethical projects, the risk isn't a lawsuit – it's technical blocking. Platforms have largely given up on legal battles and are now in a technical "arms race" to detect and block scrapers. Their goal is to stop high-volume activity that harms their service.
For the pragmatic developer, the challenge isn't about courtroom legality. It's about how to avoid detection.
Note: this information is for educational purposes only and does not constitute legal advice. Web scraping laws are complex and constantly evolving. Always consult with a legal professional to ensure your project is compliant.
With that handled, let's set up the tools for the job.
Prerequisites: Environment and Strategy
Before you write your first scraper, let's get your environment ready and establish the core technical strategy for this guide.
Setting Up Your Python Environment
First, let's get your Python environment set up. This guide assumes you have Python 3.9 or newer. We highly recommend using a virtual environment to manage project dependencies and avoid conflicts.
# Create and navigate to your project directory
mkdir youtube-scraper
cd youtube-scraper
# Create a virtual environment
python -m venv venv
# Activate the virtual environment
# On Windows: venv\Scripts\activate
# On macOS/Linux: source venv/bin/activate
Next, we'll install the core library for our scripts.
# Install the required library
pip install requests
We will use the following libraries:
- requests – this is the most popular library for making HTTP requests in Python. It simplifies the process of sending requests to web pages and APIs and handling the responses.
- re and json – these libraries are part of Python's standard library and don't require separate installation. re is for regular expressions (to find the data), and json is for parsing it.
Choosing Your Method: HTTP vs. Headless
Now for a crucial decision: should you use simple HTTP requests or a "headless" browser?
- HTTP requests (e.g., requests) – this method is what we're using in this guide. You make direct requests to the server, get the HTML or JSON, and parse it. It is extremely fast, lightweight, and scalable. Its main weakness is that it can't run JavaScript.
- Headless browsers (e.g., Playwright, Selenium) – a headless browser is a real web browser (like Chrome) that runs in the background. It's much slower and more resource-intensive, but it can execute JavaScript, click buttons, and interact with the page.
Rule of thumb: always try direct HTTP/JSON parsing first – it's typically orders of magnitude faster and cheaper to run. As we'll see, YouTube's modern architecture, which relies on internal JSON APIs, actually makes this easier for us.
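For comparison, here's roughly what the headless alternative looks like. This is a minimal sketch using Playwright (not used elsewhere in this guide; it assumes you've run pip install playwright and playwright install chromium). Note how much heavier it is than a single requests.get() for the same result:
from playwright.sync_api import sync_playwright

def headless_title(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)  # real Chromium, just no visible window
        page = browser.new_page()
        page.goto(url, wait_until="domcontentloaded")
        title = page.title()  # JavaScript has already run at this point
        browser.close()
        return title

print(headless_title("https://www.youtube.com/watch?v=F8NKVhkZZWI"))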
With your environment ready and your HTTP-first strategy set, you can now move on to the single most important skill for a modern scraper: reverse-engineering.
Further reading: How Proxies Help You Scale AI Web Scraping and Data Collection, and How to Scrape Google Search Results Using Python in 2025: Code and No-Code Solutions.
Reverse-Engineering YouTube's API Calls
Modern web scraping is less about parsing messy HTML and more about becoming a digital detective. The most valuable data on dynamic sites like YouTube isn't in the initial HTML; it's loaded dynamically via internal APIs.
Learning to find and replicate these API calls is the single most important skill for a modern scraper, and it's the technique we'll use for the rest of this guide.
Understanding YouTube's JSON Data Loading
Websites like YouTube are sophisticated JavaScript applications. When you visit a URL, the server sends a minimal HTML "shell" along with JavaScript. This JavaScript then executes in your browser, making further requests to internal APIs to fetch the actual content (video details, comments, etc.) in a clean, structured JSON format.
Our goal is to bypass the browser rendering process and query these internal APIs directly.
Finding API Calls with DevTools
Our primary tool for this detective work is the Network tab in any modern browser's developer tools (accessible via F12 or Ctrl+Shift+I).
Here's the basic workflow:
- Open a YouTube video in an incognito/private browser window.
- Open the Developer Tools and select the Network tab.
- To isolate data requests, click the filter button and select Fetch/XHR. This shows only the API requests made by the page.
You'll see a list of requests the page makes to YouTube's servers to get video, playback, and recommended data.

Analyzing Dynamic Content Requests (Pagination)
This technique is even more powerful for "infinite scroll" content like comments.
- On the YouTube page, perform an action that loads new data (e.g., scroll down the comments).
- Observe the new requests in the Network tab. You'll likely see a POST request to an endpoint like /youtubei/v1/next.
- Click this request.
- In the Response tab, you'll find the clean JSON data for the newly loaded comments.
- In the Payload tab, you'll see what your browser sent to get it – crucially, this often includes a continuation token.

This reveals the core mechanism: YouTube's internal API uses continuation tokens for pagination. You can get any paginated data by finding the first token and then repeatedly calling the appropriate endpoint (/next, /browse, /search) with the new token found in each response.
Now, let's apply this concept to build your scrapers.
Building the YouTube Scrapers
Now that you understand how YouTube loads data and how to find its internal APIs, let's build the Python scripts to scrape specific data types: video metadata, comments, transcripts, channel information, and search results.
Scraping Video Metadata
The 10-line script was a good start, but for more reliable results you need to parse both ytInitialPlayerResponse (for core metadata) and ytInitialData (for engagement data like likes).
The following script is a complete solution. It fetches a video page, extracts both JSON objects, and then carefully parses them, combining the data into a single, clean dictionary.
import requests
import json
import re
def get_video_details(video_url):
"""
Extract comprehensive video data from a YouTube video URL.
Args:
video_url (str): YouTube video URL or video ID
Returns:
dict|None: Complete video metadata or None if extraction fails
"""
# Extract video ID from URL or use as-is if already an ID
if "youtube.com" in video_url or "youtu.be" in video_url:
video_id = extract_video_id(video_url)
else:
video_id = video_url
url = f"https://www.youtube.com/watch?v={video_id}"
headers = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/141.0.0.0 Safari/537.36"
}
response = requests.get(url, headers=headers)
if response.status_code != 200:
return None
html = response.text
# Extract embedded JSON data from HTML
player_response = extract_player_response(html)
initial_data = extract_initial_data(html)
if not player_response:
return None
# Parse and structure the extracted data
video_data = parse_video_data(player_response, initial_data, video_id)
return video_data
def extract_video_id(url):
"""
Extract 11-character video ID from various YouTube URL formats.
Supports formats:
- https://www.youtube.com/watch?v=VIDEO_ID
- https://youtu.be/VIDEO_ID
- https://www.youtube.com/embed/VIDEO_ID
Args:
url (str): YouTube URL
Returns:
str: 11-character video ID or original string if no match
"""
patterns = [
r"(?:v=|\/)([0-9A-Za-z_-]{11}).*",
r"(?:embed\/)([0-9A-Za-z_-]{11})",
r"(?:watch\?v=)([0-9A-Za-z_-]{11})",
]
for pattern in patterns:
match = re.search(pattern, url)
if match:
return match.group(1)
return url
def extract_player_response(html):
"""
Extract ytInitialPlayerResponse JSON object from YouTube page HTML.
Contains core video metadata like title, description, duration.
Args:
html (str): YouTube page HTML content
Returns:
dict|None: Parsed player response data or None if not found
"""
pattern = r"var ytInitialPlayerResponse\s*=\s*({.+?});"
match = re.search(pattern, html, re.DOTALL)
if match:
try:
return json.loads(match.group(1))
except json.JSONDecodeError:
return None
return None
def extract_initial_data(html):
"""
Extract ytInitialData JSON object from YouTube page HTML.
Contains engagement data like likes, comments, and UI elements.
Args:
html (str): YouTube page HTML content
Returns:
dict|None: Parsed initial data or None if not found
"""
pattern = r"var ytInitialData\s*=\s*({.+?});"
match = re.search(pattern, html, re.DOTALL)
if match:
try:
return json.loads(match.group(1))
except json.JSONDecodeError:
return None
return None
def parse_video_data(player_response, initial_data, video_id):
"""
Parse and structure video data from extracted JSON objects.
Combines data from both ytInitialPlayerResponse and ytInitialData
to create comprehensive video metadata structure.
Args:
player_response (dict): YouTube player response data
initial_data (dict): YouTube initial page data
video_id (str): Video ID
Returns:
dict: Structured video metadata
"""
def safe_get(obj, path, default=None):
"""Safely navigate nested dictionary paths."""
try:
for key in path:
obj = obj[key]
return obj
except (KeyError, IndexError, TypeError):
return default
video_details = safe_get(player_response, ["videoDetails"], {})
microformat = safe_get(
player_response, ["microformat", "playerMicroformatRenderer"], {}
)
# Structure core video information
data = {
"video_id": video_id,
"title": video_details.get("title"),
"url": f"https://www.youtube.com/watch?v={video_id}",
"channel_name": video_details.get("author"),
"channel_id": video_details.get("channelId"),
"channel_url": f"https://www.youtube.com/channel/{video_details.get('channelId')}",
"description": video_details.get("shortDescription"),
"length_seconds": int(video_details.get("lengthSeconds", 0)),
"duration_text": format_duration(int(video_details.get("lengthSeconds", 0))),
"view_count": int(video_details.get("viewCount", 0)),
"view_count_text": f"{int(video_details.get('viewCount', 0)):,} views",
"publish_date": safe_get(microformat, ["publishDate"]),
"upload_date": safe_get(microformat, ["uploadDate"]),
"category": safe_get(microformat, ["category"]),
"keywords": video_details.get("keywords", []),
"thumbnail_default": safe_get(
video_details, ["thumbnail", "thumbnails", 0, "url"]
),
"thumbnail_high": safe_get(
video_details, ["thumbnail", "thumbnails", -1, "url"]
),
"is_live_content": video_details.get("isLiveContent", False),
"is_private": video_details.get("isPrivate", False),
"is_unlisted": safe_get(microformat, ["isUnlisted"], False),
"is_family_safe": safe_get(microformat, ["isFamilySafe"], True),
"available_countries": safe_get(microformat, ["availableCountries"], []),
}
# Extract engagement metrics (likes, comments)
if initial_data:
engagement = extract_engagement_metrics(initial_data)
data.update(engagement)
return data
def extract_engagement_metrics(initial_data):
"""
Extract like and comment counts from YouTube page data.
Navigates complex nested structure to find engagement metrics.
Like counts are buried 7 levels deep in button view models.
Comment counts are in engagement panel headers.
Args:
initial_data (dict): YouTube initial page data
Returns:
dict: Engagement metrics with 'like_count' and 'comment_count' keys
"""
engagement = {
"like_count": None,
"comment_count": None,
}
try:
# Extract like count from video primary info renderer
contents = initial_data.get("contents", {})
two_column = contents.get("twoColumnWatchNextResults", {})
results = two_column.get("results", {}).get("results", {})
contents_list = results.get("contents", [])
for content in contents_list:
if "videoPrimaryInfoRenderer" in content:
video_info = content["videoPrimaryInfoRenderer"]
# Navigate to like button through nested view models
video_actions = video_info.get("videoActions", {})
menu_renderer = video_actions.get("menuRenderer", {})
top_buttons = menu_renderer.get("topLevelButtons", [])
if top_buttons:
first_button = top_buttons[0]
# Deep navigation through segmented button structure
if "segmentedLikeDislikeButtonViewModel" in first_button:
seg_button = first_button["segmentedLikeDislikeButtonViewModel"]
like_vm = seg_button.get("likeButtonViewModel", {})
like_vm_inner = like_vm.get("likeButtonViewModel", {})
toggle_vm = like_vm_inner.get("toggleButtonViewModel", {})
toggle_vm_inner = toggle_vm.get("toggleButtonViewModel", {})
default_vm = toggle_vm_inner.get("defaultButtonViewModel", {})
button_vm = default_vm.get("buttonViewModel", {})
title = button_vm.get("title", "")
if title:
engagement["like_count"] = title
# Extract comment count from engagement panels
engagement_panels = initial_data.get("engagementPanels", [])
for panel in engagement_panels:
if "engagementPanelSectionListRenderer" in panel:
panel_renderer = panel["engagementPanelSectionListRenderer"]
panel_id = panel_renderer.get("panelIdentifier", "")
# Find comments panel by identifier
if "comment" in panel_id.lower():
header = panel_renderer.get("header", {})
title_header = header.get("engagementPanelTitleHeaderRenderer", {})
contextual_info = title_header.get("contextualInfo", {})
runs = contextual_info.get("runs", [])
if runs:
comment_count = runs[0].get("text", "")
if comment_count:
engagement["comment_count"] = comment_count
except Exception:
pass
return engagement
def format_duration(seconds):
"""
Convert duration in seconds to human-readable format.
Args:
seconds (int): Duration in seconds
Returns:
str: Formatted duration (HH:MM:SS or MM:SS)
"""
hours = seconds // 3600
minutes = (seconds % 3600) // 60
secs = seconds % 60
if hours > 0:
return f"{hours}:{minutes:02d}:{secs:02d}"
else:
return f"{minutes}:{secs:02d}"
def main():
"""Demonstrate video data extraction with example video."""
video_url = "https://www.youtube.com/watch?v=F8NKVhkZZWI"
print("🎬 YouTube Video Details Scraper")
video_data = get_video_details(video_url)
if video_data:
print(f"Extracted: {video_data['title']}")
print(f" Channel: {video_data['channel_name']}")
print(
f" Views: {video_data['view_count_text']} | Duration: {video_data['duration_text']}"
)
# Save complete data to JSON file
with open("video_details_complete.json", "w", encoding="utf-8") as f:
json.dump(video_data, f, indent=2, ensure_ascii=False)
print("Saved to video_details_complete.json")
else:
print("Failed to extract video details")
if __name__ == "__main__":
main()
Running this script provides a rich JSON output with nearly every key piece of metadata available on the video page.
{
"video_id": "F8NKVhkZZWI",
"title": "What are AI Agents?",
"url": "https://www.youtube.com/watch?v=F8NKVhkZZWI",
"channel_name": "IBM Technology",
"channel_id": "UCKWaEZ-_VweaEx1j62do_vQ",
"description": "Want to see Maya Murad explain more about AI Agents? ...",
"length_seconds": 748,
"duration_text": "12:28",
"view_count": 1601204,
"view_count_text": "1,601,204 views",
"publish_date": "2024-07-15T03:00:13-07:00",
"category": "Education",
"keywords": [
"IBM",
"Artificial Intelligence",
"AI",
"AI agents",
...
],
"thumbnail_high": "https://i.ytimg.com/vi/F8NKVhkZZWI/maxresdefault.jpg",
"like_count": "30K",
"comment_count": "998"
}
This is great for a single video, but what about content that loads dynamically, like comments? For that, you need to master continuation tokens.
The Key to Scaling: Understanding Continuation Tokens
Parsing the initial page is great for one-off data, but what about dynamic content like comments, search results, or related videos? YouTube loads these items as you scroll.
It does this using continuation tokens.
Here's the process:
- The initial page (ytInitialData) contains the first batch of items (e.g., comments) and a continuation token.
- When you scroll down, the browser makes a POST request to an internal API endpoint (like https://www.youtube.com/youtubei/v1/next).
- The body of this POST request contains the continuation token.
- The API response is a JSON object containing the next batch of items... and a new continuation token for the next page.
- This process repeats until the API returns no new token, meaning you've reached the end.
To scrape dynamically, you must simulate this process: find the initial token, make a POST request to the correct endpoint, parse the response, and loop. All our following scripts for comments, search, and channels use this exact technique.
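In skeleton form, that loop looks like this. It's a minimal sketch: the endpoint name, payload shape, and the two parser callbacks (extract_items, extract_next_token) are placeholders you adapt per data type, exactly as the concrete scripts below do.
import requests
import time

INNERTUBE_CONTEXT = {"client": {"clientName": "WEB", "clientVersion": "2.20251022.01.00"}}

def paginate(endpoint, first_token, extract_items, extract_next_token, max_pages=10):
    """Generic continuation-token loop: POST, parse, repeat until no token remains."""
    items, token = [], first_token
    for _ in range(max_pages):
        response = requests.post(
            f"https://www.youtube.com/youtubei/v1/{endpoint}",
            json={"context": INNERTUBE_CONTEXT, "continuation": token},
            headers={"content-type": "application/json"},
        )
        if response.status_code != 200:
            break
        data = response.json()
        items.extend(extract_items(data))   # caller-supplied parser for this data type
        token = extract_next_token(data)    # caller-supplied token finder
        if not token:                       # no new token means we've reached the end
            break
        time.sleep(1)                       # conservative pacing
    return items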
Scraping Comments
Scraping comments is a perfect example of using continuation tokens. The ytInitialData on a video page only contains the first 1-2 comments. To get them all, you must paginate.
First, we find the initial token by inspecting the ytInitialData from the page's HTML. We can look for a continuationItemRenderer to find the token that will load the first batch of comments.

Once we have that first token, we make a POST request to the /youtubei/v1/next endpoint. The payload for this request is simple, just a context and our continuation token.

The response to this request will give us a new batch of comments and, crucially, the next continuation token, which we then use to make the next request.
The following script automates this entire loop:
import requests
import json
import re
import time
def parse_comments_from_framework_updates(response_data):
"""
Extract comment data from YouTube's frameworkUpdates structure.
YouTube stores comment data in frameworkUpdates.entityBatchUpdate.mutations
where each mutation contains a commentEntityPayload with full comment details.
Args:
response_data (dict): YouTube API response containing frameworkUpdates
Returns:
list: List of comment dictionaries with structured data
"""
comments = []
try:
# Navigate to framework updates structure
framework_updates = response_data.get("frameworkUpdates", {})
entity_batch = framework_updates.get("entityBatchUpdate", {})
mutations = entity_batch.get("mutations", [])
for mutation in mutations:
payload = mutation.get("payload", {})
# Process comment entity payloads
if "commentEntityPayload" in payload:
comment_payload = payload["commentEntityPayload"]
# Extract nested comment data
properties = comment_payload.get("properties", {})
author = comment_payload.get("author", {})
toolbar = comment_payload.get("toolbar", {})
# Parse comment content
content = properties.get("content", {})
comment_text = content.get("content", "")
# Parse author information
author_name = author.get("displayName", "")
# Parse metadata
published_time = properties.get("publishedTime", "")
# Parse engagement metrics
like_count = toolbar.get("likeCountNotliked", "0").strip()
if not like_count:
like_count = "0"
reply_count = toolbar.get("replyCount", "")
comment = {
"author": author_name,
"text": comment_text,
"published_time": published_time,
"like_count": like_count,
"reply_count": reply_count,
}
# Only include comments with actual text content
if comment_text:
comments.append(comment)
except Exception:
pass
return comments
def get_initial_continuation_token(video_id):
"""
Extract the initial continuation token from video page HTML.
This token is needed to make the first API request for comments.
Found in the engagement panels section of ytInitialData.
Args:
video_id (str): YouTube video ID
Returns:
str|None: Continuation token or None if not found
"""
url = f"https://www.youtube.com/watch?v={video_id}"
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
response = requests.get(url, headers=headers)
if response.status_code != 200:
return None
html = response.text
pattern = r"var ytInitialData\s*=\s*({.+?});"
match = re.search(pattern, html, re.DOTALL)
if not match:
return None
try:
initial_data = json.loads(match.group(1))
engagement_panels = initial_data.get("engagementPanels", [])
# Search for comments panel in engagement panels
for panel in engagement_panels:
if "engagementPanelSectionListRenderer" in panel:
panel_renderer = panel["engagementPanelSectionListRenderer"]
panel_id = panel_renderer.get("panelIdentifier", "")
# Identify comments panel by ID
if "comment" in panel_id.lower():
content = panel_renderer.get("content", {})
section_renderer = content.get("sectionListRenderer", {})
contents = section_renderer.get("contents", [])
# Extract continuation token from item section
for item in contents:
if "itemSectionRenderer" in item:
item_section = item["itemSectionRenderer"]
section_contents = item_section.get("contents", [])
for content_item in section_contents:
if "continuationItemRenderer" in content_item:
continuation_endpoint = content_item[
"continuationItemRenderer"
].get("continuationEndpoint", {})
continuation_command = continuation_endpoint.get(
"continuationCommand", {}
)
token = continuation_command.get("token")
if token:
return token
except (KeyError, TypeError, json.JSONDecodeError):
pass
return None
def get_continuation_token_from_response(response_data):
"""
Extract continuation token for next page from API response.
Args:
response_data (dict): YouTube API response
Returns:
str|None: Next page continuation token or None if no more pages
"""
try:
on_response = response_data.get("onResponseReceivedEndpoints", [])
for endpoint in on_response:
# Check reload continuation items
if "reloadContinuationItemsCommand" in endpoint:
items = endpoint["reloadContinuationItemsCommand"].get(
"continuationItems", []
)
for item in items:
if "continuationItemRenderer" in item:
cont = item["continuationItemRenderer"].get(
"continuationEndpoint", {}
)
token = cont.get("continuationCommand", {}).get("token")
if token:
return token
# Check append continuation items
if "appendContinuationItemsAction" in endpoint:
items = endpoint["appendContinuationItemsAction"].get(
"continuationItems", []
)
for item in items:
if "continuationItemRenderer" in item:
cont = item["continuationItemRenderer"].get(
"continuationEndpoint", {}
)
token = cont.get("continuationCommand", {}).get("token")
if token:
return token
except (KeyError, TypeError, IndexError):
pass
return None
def fetch_comments(video_id, continuation_token=None, max_comments=100):
"""
Fetch comments from a YouTube video with pagination support.
Args:
video_id (str): YouTube video ID
continuation_token (str, optional): Starting continuation token
max_comments (int): Maximum number of comments to fetch (default: 100)
Returns:
list: List of comment dictionaries
"""
if not continuation_token:
continuation_token = get_initial_continuation_token(video_id)
if not continuation_token:
return []
all_comments = []
headers = {
"content-type": "application/json",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
}
data = {
"context": {
"client": {
"clientName": "WEB",
"clientVersion": "2.20251022.01.00",
}
},
"continuation": continuation_token,
}
page = 0
# Paginate through comments until limit reached
while len(all_comments) < max_comments:
page += 1
response = requests.post(
"https://www.youtube.com/youtubei/v1/next", headers=headers, json=data
)
if response.status_code != 200:
break
result = response.json()
# Extract comments from current page
comments = parse_comments_from_framework_updates(result)
if not comments:
break
all_comments.extend(comments)
# Get continuation token for next page
next_token = get_continuation_token_from_response(result)
if not next_token:
break
data["continuation"] = next_token
# Rate limiting to avoid being blocked
if page < 10:
time.sleep(1)
return all_comments[:max_comments]
def extract_video_id(url):
"""
Extract 11-character video ID from various YouTube URL formats.
Args:
url (str): YouTube URL
Returns:
str: Video ID or original string if no match
"""
patterns = [
r"(?:v=|\/)([0-9A-Za-z_-]{11}).*",
r"(?:embed\/)([0-9A-Za-z_-]{11})",
r"(?:watch\?v=)([0-9A-Za-z_-]{11})",
]
for pattern in patterns:
match = re.search(pattern, url)
if match:
return match.group(1)
return url
def main():
"""Demonstrate comment extraction with example video."""
video_url = "https://www.youtube.com/watch?v=F8NKVhkZZWI"
video_id = extract_video_id(video_url)
print("💬 YouTube Comments Scraper")
print(f"Video ID: {video_id}")
extracted_comments = fetch_comments(video_id, max_comments=100)
print(f"Fetched {len(extracted_comments)} comments")
if not extracted_comments:
print("No comments found")
return
output_data = {
"video_id": video_id,
"total_comments": len(extracted_comments),
"comments": extracted_comments,
}
with open("video_comments.json", "w", encoding="utf-8") as output_file:
json.dump(output_data, output_file, indent=2, ensure_ascii=False)
print("Saved to video_comments.json")
if __name__ == "__main__":
main()
Here is a sample of the clean JSON output from this script:
{
"video_id": "F8NKVhkZZWI",
"total_comments": 100,
"comments": [
{
"author": "@MrBa****",
"text": "I like the way she takes a minute to explain new terminologies before moving on with the overall idea/explanation. Excellent teacher!",
"published_time": "10 months ago",
"like_count": "282",
"reply_count": 2
},
{
"author": "@eagl*****",
"text": "Give this lady a vacation! this is her way of asking it out loud",
"published_time": "1 year ago",
"like_count": "739",
"reply_count": 13
},
...
]
}
This same pagination logic applies to scraping search results and channel video lists, but the next data type, transcripts, uses a different (and easier) endpoint.
Scraping Transcripts (Captions)
Transcripts are incredibly valuable for topic modeling and keyword analysis. Fortunately, they are loaded differently than comments or metadata. They are fetched from a dedicated internal API endpoint, /api/timedtext, which makes them relatively straightforward to scrape once you find the correct request.
You can find this endpoint using the browser DevTools:
- Open the Network tab (filtered to Fetch/XHR).
- On the YouTube video page, click the Subtitles/closed captions (CC) button.
- A new request to /api/timedtext will appear in the Network tab. This is the one we need to replicate.

The easiest way to replicate this request is to:
- Right-click on the timedtext request in the Network tab.
- Select "Copy" -> "Copy as cURL".

You can then use an online tool like curlconverter.com to convert that cURL command directly into Python requests code.
Important warning: code generated this way is tied to one specific browser session. Many parameters (ei, expire, signature, pot, etc.) are temporary and session-specific. A reliable, production-ready scraper would need to first load the main video page, extract these dynamic parameters using regex or other parsing methods, and then make the /api/timedtext request with those fresh values.
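If you do want to attempt the manual route, a rough sketch of the first half of that process – pulling a fresh, session-valid caption track URL out of the player response – might look like this. It assumes the player response still exposes a captionTracks list (field names change over time), and even then the returned URL may be rejected without additional session-bound parameters.
import requests
import re
import json

def list_caption_tracks(video_id):
    """Fetch a fresh player response and pull caption track URLs out of it."""
    html = requests.get(
        f"https://www.youtube.com/watch?v={video_id}",
        headers={"User-Agent": "Mozilla/5.0"},
    ).text
    match = re.search(r"var ytInitialPlayerResponse\s*=\s*({.+?});", html, re.DOTALL)
    if not match:
        return []
    player = json.loads(match.group(1))
    tracks = (
        player.get("captions", {})
        .get("playerCaptionsTracklistRenderer", {})
        .get("captionTracks", [])
    )
    # Each track (if present) typically carries a languageCode and a baseUrl
    # pointing at /api/timedtext with freshly generated parameters.
    return [{"language": t.get("languageCode"), "url": t.get("baseUrl")} for t in tracks]

print(list_caption_tracks("F8NKVhkZZWI"))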
So, what's the solution?
Given the complexity and brittleness of manually extracting the dynamic parameters for the timedtext API, using a dedicated library is highly recommended for scraping transcripts reliably.
The youtube-transcript-api is a popular and well-maintained choice.
First, install it:
pip install youtube-transcript-api
Then, fetching a transcript becomes incredibly simple:
import json
from youtube_transcript_api import YouTubeTranscriptApi
def get_transcript(video_id, language="en"):
"""
Get transcript for a YouTube video.
Args:
video_id (str): YouTube video ID
language (str): Language code (default: 'en')
Returns:
list: Transcript data or None if failed
"""
try:
api = YouTubeTranscriptApi()
transcript = api.fetch(video_id, languages=[language])
# Convert to simple list of dictionaries
return [{"text": item.text, "start": item.start} for item in transcript]
except Exception as e:
print(f"Failed: {e}")
return None
def save_transcript(video_id, transcript_data):
"""Save transcript in JSON and text formats."""
if not transcript_data:
return
# Save JSON
with open(f"{video_id}_transcript.json", "w", encoding="utf-8") as f:
json.dump(transcript_data, f, indent=2, ensure_ascii=False)
# Save plain text
with open(f"{video_id}_transcript.txt", "w", encoding="utf-8") as f:
for item in transcript_data:
f.write(f"{item['text']}\n")
print(f"Saved transcript for {video_id}")
def main():
video_id = "F8NKVhkZZWI"
print(f"Getting transcript for: {video_id}")
transcript = get_transcript(video_id)
if transcript:
print(f"Found {len(transcript)} segments")
save_transcript(video_id, transcript)
else:
print("No transcript found")
if __name__ == "__main__":
main()
The output is a clean JSON object containing every line of the transcript with precise timestamps:
[
{
"text": "2024 will be the year of AI agents.",
"start": 0.456
},
{
"text": "So what are AI agents?",
"start": 4.152
},
{
"text": "And to start explaining that,",
"start": 5.806
},
// ... more transcript segments
]
This library approach handles finding the API, managing parameters, and dealing with potential errors (like transcripts being disabled), making it the most stable and recommended solution for obtaining transcript data.
Scraping Channel Metadata
Scraping data from a channel page, like the About tab or the list of videos, uses the same continuation token principle we saw with comments, but primarily utilizes the /youtubei/v1/browse endpoint.
Getting the rich metadata from the About tab (join date, total views, description, links) requires a specific approach. When you load a channel's main page, the initial data (ytInitialData) contains multiple continuation tokens for different sections. Through reverse-engineering (i.e., testing which token fetches which data), we've found that a specific token is needed to request the About tab content via the /browse endpoint.

The following script demonstrates how to find this specific token and use it to fetch and parse the channel's About metadata.
import requests
import json
import re
def get_channel_about(channel_url):
"""Extract channel data from YouTube channel URL."""
base_url = channel_url.replace("/about", "").rstrip("/")
headers = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/141.0.0.0 Safari/537.36"
}
response = requests.get(base_url, headers=headers)
if response.status_code != 200:
return None
html = response.text
token = extract_about_continuation_token(html)
if not token:
return None
channel_data = fetch_about_with_token(token)
return channel_data
def extract_about_continuation_token(html):
"""Extract continuation token from YouTube page data."""
pattern = r"var ytInitialData\s*=\s*({.+?});"
match = re.search(pattern, html, re.DOTALL)
if not match:
return None
try:
initial_data = json.loads(match.group(1))
header = initial_data.get("header", {})
page_header_renderer = header.get("pageHeaderRenderer", {})
content = page_header_renderer.get("content", {})
page_header_view_model = content.get("pageHeaderViewModel", {})
token = find_continuation_token_in_object(page_header_view_model)
return token
except (json.JSONDecodeError, KeyError, IndexError):
return None
def find_continuation_token_in_object(obj):
"""Recursively find continuation token in nested data."""
if isinstance(obj, dict):
if "continuationCommand" in obj:
token = obj["continuationCommand"].get("token", "")
if token:
return token
for value in obj.values():
result = find_continuation_token_in_object(value)
if result:
return result
elif isinstance(obj, list):
for item in obj:
result = find_continuation_token_in_object(item)
if result:
return result
return None
def fetch_about_with_token(token):
"""Fetch channel about data using continuation token."""
headers = {
"content-type": "application/json",
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/141.0.0.0 Safari/537.36",
"x-youtube-client-name": "1",
"x-youtube-client-version": "2.20251023.01.00",
}
json_data = {
"context": {
"client": {
"clientName": "WEB",
"clientVersion": "2.20251023.01.00",
}
},
"continuation": token,
}
response = requests.post(
"https://www.youtube.com/youtubei/v1/browse",
params={"prettyPrint": "false"},
headers=headers,
json=json_data,
)
if response.status_code != 200:
return None
result = response.json()
channel_data = parse_about_response(result)
return channel_data
def parse_about_response(response_data):
"""Parse channel data from YouTube API response."""
data = {
"channel_id": None,
"description": None,
"subscriber_count": None,
"video_count": None,
"view_count": None,
"joined_date": None,
"country": None,
"links": [],
"custom_url": None,
}
try:
on_response = response_data.get("onResponseReceivedEndpoints", [])
if not on_response:
return data
first_endpoint = on_response[0]
append_action = first_endpoint.get("appendContinuationItemsAction", {})
continuation_items = append_action.get("continuationItems", [])
if not continuation_items:
return data
first_item = continuation_items[0]
about_renderer = first_item.get("aboutChannelRenderer", {})
if not about_renderer:
return data
metadata = about_renderer.get("metadata", {})
about_vm = metadata.get("aboutChannelViewModel", {})
if not about_vm:
return data
data["channel_id"] = about_vm.get("channelId")
data["description"] = about_vm.get("description")
data["subscriber_count"] = about_vm.get("subscriberCountText")
data["view_count"] = about_vm.get("viewCountText")
data["country"] = about_vm.get("country")
data["custom_url"] = about_vm.get("canonicalChannelUrl")
data["video_count"] = about_vm.get("videoCountText")
joined_obj = about_vm.get("joinedDateText", {})
if isinstance(joined_obj, dict):
data["joined_date"] = joined_obj.get("content", "")
links = about_vm.get("links", [])
for link_obj in links:
if "channelExternalLinkViewModel" in link_obj:
link_vm = link_obj["channelExternalLinkViewModel"]
title_obj = link_vm.get("title", {})
title = title_obj.get("content", "")
link_content = link_vm.get("link", {})
url = link_content.get("content", "")
if not url:
command_runs = link_content.get("commandRuns", [])
if command_runs:
on_tap = command_runs[0].get("onTap", {})
innertube = on_tap.get("innertubeCommand", {})
url_endpoint = innertube.get("urlEndpoint", {})
url = url_endpoint.get("url", "")
if title and url:
data["links"].append({"title": title, "url": url})
except (KeyError, TypeError, IndexError):
pass
return data
def main():
"""Extract channel data and save to JSON file."""
channel_url = "https://www.youtube.com/@freecodecamp"
channel_data = get_channel_about(channel_url)
if channel_data:
with open("channel_data.json", "w", encoding="utf-8") as f:
json.dump(channel_data, f, indent=2, ensure_ascii=False)
return channel_data
if __name__ == "__main__":
main()
Here is a JSON output from this script:
{
"channel_id": "UC8butISFwT-Wl7EV0hUK0BQ",
"description": "Learn math, programming, and computer science for free. A 501(c)(3) tax-exempt charity. We also run a free learning interactive platform at freecodecamp.org",
"subscriber_count": "11.2M subscribers",
"video_count": "1,952 videos",
"view_count": "914,076,805 views",
"joined_date": "Joined Dec 16, 2014",
"country": "United States",
"links": [
{
"title": "SUPPORT OUR CHARITY",
"url": "donate.freecodecamp.org"
},
{
"title": "LEARN TO CODE WITH OUR FRIENDS",
"url": "scrimba.com/fcc"
}
],
"custom_url": "http://www.youtube.com/@freecodecamp"
}
Scraping other parts of a channel page, like the full list of videos or playlists, follows a similar pattern: find the appropriate initial continuation token in ytInitialData for that section (e.g., the video grid) and then paginate using the /browse endpoint to retrieve subsequent batches.
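As an illustration, here's a rough sketch for the Videos tab. It assumes the grid exposes items as videoRenderer or gridVideoRenderer objects (this varies by layout) and that the grid's continuation token is the last continuationCommand on the page – both worth verifying in DevTools before relying on it.
import requests
import re
import json

CONTEXT = {"client": {"clientName": "WEB", "clientVersion": "2.20251023.01.00"}}
HEADERS = {"content-type": "application/json", "user-agent": "Mozilla/5.0"}

def find_all(obj, key, found=None):
    """Recursively collect every value stored under the given key."""
    found = [] if found is None else found
    if isinstance(obj, dict):
        if key in obj:
            found.append(obj[key])
        for value in obj.values():
            find_all(value, key, found)
    elif isinstance(obj, list):
        for item in obj:
            find_all(item, key, found)
    return found

def scrape_channel_videos(handle, pages=3):
    """Collect basic video info from a channel's Videos tab, paginating via /browse."""
    html = requests.get(
        f"https://www.youtube.com/{handle}/videos",
        headers={"User-Agent": HEADERS["user-agent"]},
    ).text
    data = json.loads(
        re.search(r"var ytInitialData\s*=\s*({.+?});", html, re.DOTALL).group(1)
    )
    videos = []
    for page in range(pages):
        # Depending on the layout, grid items may be videoRenderer or gridVideoRenderer
        videos += find_all(data, "videoRenderer") + find_all(data, "gridVideoRenderer")
        if page == pages - 1:
            break
        tokens = find_all(data, "continuationCommand")
        if not tokens:
            break
        data = requests.post(
            "https://www.youtube.com/youtubei/v1/browse",
            headers=HEADERS,
            json={"context": CONTEXT, "continuation": tokens[-1].get("token")},
        ).json()
    return [
        {
            "video_id": v.get("videoId"),
            "title": (v.get("title", {}).get("runs") or [{}])[0].get("text"),
        }
        for v in videos
    ]

print(scrape_channel_videos("@freecodecamp", pages=2)[:5])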
Scraping Search Results
Scraping YouTube Search Engine Results Pages (SERPs) is crucial for keyword research, rank tracking, and competitor analysis. This process uses the /youtubei/v1/search internal API endpoint and follows the pagination logic we've established.

The key difference for the /search endpoint is how the initial request and subsequent pagination requests are made:
- First page – you send a POST request to /youtubei/v1/search including the query term in the JSON payload.
- Subsequent pages – the first response contains a continuation token. For all following pages, you send a POST request to the same /search endpoint, but this time you include the continuation token in the payload instead of the query.
The following script demonstrates this process. It fetches multiple pages of search results and parses data for both regular videos (videoRenderer) and YouTube Shorts (shortsLockupViewModel) found on the page.
import requests
import json
import time
def get_nested_value(obj, path, default=None):
"""Safely navigate nested dictionary paths."""
try:
for key in path:
obj = obj[key]
return obj
except (KeyError, IndexError, TypeError):
return default
def fetch_page(query=None, continuation=None):
"""Fetch search results page from YouTube API."""
request_payload = {
"context": {
"client": {
"clientName": "WEB",
"clientVersion": "2.20251021.01.00",
}
}
}
if query:
request_payload["query"] = query
elif continuation:
request_payload["continuation"] = continuation
return requests.post(
"https://www.youtube.com/youtubei/v1/search",
headers={
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/141.0.0.0 Safari/537.36"
},
json=request_payload,
).json()
def extract_content_renderers(response_data):
"""Recursively extract video and short renderers from response data."""
video_renderers = []
short_renderers = []
if isinstance(response_data, dict):
if "videoRenderer" in response_data:
video_renderers.append(response_data["videoRenderer"])
if "shortsLockupViewModel" in response_data:
short_renderers.append(response_data["shortsLockupViewModel"])
for nested_data in response_data.values():
found_videos, found_shorts = extract_content_renderers(nested_data)
video_renderers.extend(found_videos)
short_renderers.extend(found_shorts)
elif isinstance(response_data, list):
for list_item in response_data:
found_videos, found_shorts = extract_content_renderers(list_item)
video_renderers.extend(found_videos)
short_renderers.extend(found_shorts)
return video_renderers, short_renderers
def parse_video(video_renderer):
"""Parse video renderer data into structured format."""
try:
video_id = video_renderer.get("videoId")
channel_id = get_nested_value(
video_renderer,
[
"ownerText",
"runs",
0,
"navigationEndpoint",
"browseEndpoint",
"browseId",
],
)
is_verified = any(
"Verified" in str(badge) for badge in video_renderer.get("ownerBadges", [])
)
return {
"video_id": video_id,
"title": get_nested_value(video_renderer, ["title", "runs", 0, "text"]),
"url": f"https://youtube.com/watch?v={video_id}",
"type": "video",
"channel_name": get_nested_value(
video_renderer, ["ownerText", "runs", 0, "text"]
),
"channel_id": channel_id,
"is_verified": is_verified,
"views": get_nested_value(video_renderer, ["viewCountText", "simpleText"]),
"published": get_nested_value(
video_renderer, ["publishedTimeText", "simpleText"]
),
"duration": get_nested_value(video_renderer, ["lengthText", "simpleText"]),
"thumbnail": get_nested_value(
video_renderer, ["thumbnail", "thumbnails", -1, "url"]
),
}
except (KeyError, TypeError, IndexError):
return None
def parse_short(short_renderer):
"""Parse short renderer data into structured format."""
try:
video_id = get_nested_value(
short_renderer,
["onTap", "innertubeCommand", "reelWatchEndpoint", "videoId"],
)
title = get_nested_value(
short_renderer, ["overlayMetadata", "primaryText", "content"]
)
views = get_nested_value(
short_renderer, ["overlayMetadata", "secondaryText", "content"]
)
thumbnail = get_nested_value(
short_renderer,
[
"onTap",
"innertubeCommand",
"reelWatchEndpoint",
"thumbnail",
"thumbnails",
0,
"url",
],
)
if not thumbnail:
thumbnail = get_nested_value(
short_renderer, ["thumbnail", "sources", 0, "url"]
)
return {
"video_id": video_id,
"title": title,
"url": f"https://youtube.com/shorts/{video_id}",
"views": views,
"thumbnail": thumbnail,
"type": "short",
}
except (KeyError, TypeError, IndexError):
return None
def filter_and_parse_renderers(renderers, parser_func):
"""Filter and parse renderers using provided parser function."""
return [
parsed_item
for parsed_item in (parser_func(renderer) for renderer in renderers)
if parsed_item and parsed_item.get("video_id")
]
def process_page_data(api_response):
"""Process API response to extract videos and shorts."""
video_renderers, short_renderers = extract_content_renderers(api_response)
videos = filter_and_parse_renderers(video_renderers, parse_video)
shorts = filter_and_parse_renderers(short_renderers, parse_short)
return videos, shorts
def get_continuation_token(response_data):
"""Extract continuation token for next page from response data."""
if "onResponseReceivedCommands" in response_data:
for command in response_data["onResponseReceivedCommands"]:
if "appendContinuationItemsAction" in command:
for continuation_item in command["appendContinuationItemsAction"].get(
"continuationItems", []
):
if "continuationItemRenderer" in continuation_item:
return continuation_item["continuationItemRenderer"][
"continuationEndpoint"
]["continuationCommand"]["token"]
if "contents" in response_data:
try:
page_contents = response_data["contents"]["twoColumnSearchResultsRenderer"][
"primaryContents"
]["sectionListRenderer"]["contents"]
for content_item in page_contents:
if "continuationItemRenderer" in content_item:
return content_item["continuationItemRenderer"][
"continuationEndpoint"
]["continuationCommand"]["token"]
except KeyError:
pass
return None
def scrape_page(query=None, continuation_token=None):
"""Scrape a single page of search results."""
api_response = fetch_page(query=query, continuation=continuation_token)
videos, shorts = process_page_data(api_response)
next_token = get_continuation_token(api_response)
return videos, shorts, next_token
def scrape(query, pages=5, delay=1):
"""Scrape multiple pages of YouTube search results."""
all_content = []
token = None
for page in range(1, pages + 1):
videos, shorts, token = scrape_page(
query=query if page == 1 else None, continuation_token=token
)
all_content.extend(videos + shorts)
if not token and page < pages:
break
if page < pages:
time.sleep(delay)
return all_content
def main():
"""Search YouTube and save results to JSON file."""
search_query = "what are ai agents"
pages_to_scrape = 3
scraped_content = scrape(search_query, pages=pages_to_scrape)
unique_content = {
content_item["video_id"]: content_item for content_item in scraped_content
}
all_results = list(unique_content.values())
video_results = [result for result in all_results if result["type"] == "video"]
short_results = [result for result in all_results if result["type"] == "short"]
output_data = {
"query": search_query,
"total_items": len(all_results),
"total_videos": len(video_results),
"total_shorts": len(short_results),
"timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
"content": all_results,
}
with open("youtube_results.json", "w", encoding="utf-8") as output_file:
json.dump(output_data, output_file, indent=2, ensure_ascii=False)
return output_data
if __name__ == "__main__":
main()
This script provides a clean list of content items from the SERP, each marked with its type ('video' or 'short'). Here's a sample of the resulting JSON:
{
"query": "what are ai agents",
"total_items": 83,
"total_videos": 53,
"total_shorts": 30,
"content": [
{
"video_id": "F8NKVhkZZWI",
"title": "What are AI Agents?",
"url": "https://youtube.com/watch?v=F8NKVhkZZWI",
"type": "video",
"channel_name": "IBM Technology",
"channel_id": "UCKWaEZ-_VweaEx1j62do_vQ",
"is_verified": true,
"views": "1,602,255 views",
"published": "1 year ago",
"duration": "12:29",
"thumbnail": "https://i.ytimg.com/vi/F8NKVhkZZWI/hq720.jpg?sqp=..."
},
{
"video_id": "kKm_0eLmbzQ",
"title": "AI Agents For Small Business - Mark Zuckerberg",
"url": "https://youtube.com/shorts/kKm_0eLmbzQ",
"views": "1.7M views",
"thumbnail": "https://i.ytimg.com/vi/kKm_0eLmbzQ/frame0.jpg",
"type": "short"
}
// ... more video and short results ...
]
}
Scaling Challenges: Avoiding YouTube Blocks
The scripts we've built work well for a few requests. However, trying to scrape thousands or millions of pages will quickly run into YouTube's anti-bot defenses. Understanding these defenses is key to building a reliable scraper that operates at scale.
YouTube's Defenses
Websites like YouTube employ sophisticated, multi-layered systems to detect and block automated traffic. The primary techniques include:
- IP-based rate limiting. This is the most common and effective defense. YouTube's servers monitor the number of requests from a single IP address over time. If the frequency exceeds a certain threshold (which is much lower for bots than humans), that IP address will be temporarily or permanently blocked. You can learn more about what happens when your IP has been banned.
- Header & TLS fingerprinting. Anti-bot systems analyze the HTTP headers sent with each request. Missing a standard browser User-Agent or sending thousands of requests with the exact same headers is a clear sign of automation. Advanced systems can even analyze the unique signature of the TLS handshake (the initial secure connection setup) to identify non-browser clients.
- Behavioral analysis. Detection systems look for inhuman patterns, such as requesting pages faster than a human could navigate, always following the exact same path, or failing to request resources like images and CSS that a real browser would.
The Core Problem
While all these defenses are relevant, IP-based rate limiting is the fundamental challenge that must be overcome for any scraping project to succeed beyond a trivial number of requests. No matter how perfectly a scraper mimics browser headers or behavior, if all its requests originate from the same server IP address, it will inevitably be identified and blocked.
This sets the stage for the only viable solution for reliable, large-scale scraping: routing your traffic through a high-quality proxy network.
The Solution: Using a High-Quality Proxy Network
The fundamental solution to IP blocking is routing scraper traffic through a proxy network. A proxy server acts as an intermediary: your request goes to the proxy, which forwards it to YouTube using its own IP address. To YouTube, the request appears to come from the proxy, not your scraper's server.
By using a large pool of different proxy IPs, a scraper can distribute its requests, making it look like thousands of different real users accessing the site normally. This avoids triggering IP-based rate limits. However, the quality of these proxy IPs is critical.
Why Residential & Mobile Proxies?
Not all proxies are created equal. Datacenter proxies are cheap but use IPs from commercial hosting providers; these IP ranges are easily identified and often pre-emptively blocked by sites like YouTube. For reliable scraping, you need proxies that blend in with real user traffic:
- Residential proxies. These use IP addresses assigned by Internet Service Providers (ISPs) to real home internet connections. Traffic from a residential proxy is indistinguishable from that of a genuine human user, leading to significantly higher success rates.
- Mobile proxies. For the most challenging targets, mobile IPs (from carriers like T-Mobile, Vodafone, etc.) offer the highest level of trust and are less likely to encounter blocks or CAPTCHAs.
Live Proxies specializes in providing high-quality residential and mobile proxy pools designed specifically for demanding data extraction tasks, offering IPs across key geographic locations.
Session Control: Rotating vs. Sticky IPs
Different scraping tasks require different approaches to IP management. Live Proxies offers flexible session control:
- Rotating sessions. By default, each new request you send can go through a different IP address from our vast pool. This is ideal for high-volume, stateless tasks like scraping metadata from thousands of different video pages, maximizing anonymity and distribution.
- Sticky sessions. For tasks requiring a consistent identity (like navigating pagination for comments or search results), you need to maintain the same IP address for multiple consecutive requests. Live Proxies allows you to create sticky sessions that provide a stable IP for up to 60 minutes. This prevents YouTube from breaking your session mid-task.
Key Advantage: Private IP Allocation
A common frustration with proxy providers is the “bad neighbor” problem – if another customer using the same shared IPs gets them blocked on YouTube, your requests fail too.
Live Proxies mitigates this with private IP allocation for our B2B and enterprise users. We ensure the pool of IPs you are allocated for a specific target (like YouTube) isn't simultaneously used by another customer on that same target. This significantly reduces the risk of IP contamination and is key to achieving consistent, high success rates.
Integrating Live Proxies in Python
Using Live Proxies with Python's requests library is straightforward. You'll receive credentials including a host (like b2b.liveproxies.io), port, username, and password.
Here’s how to integrate a sticky session proxy into your Python scripts:
import requests
import json
# --- Live Proxies Configuration (Replace with YOUR credentials) ---
# Example for a sticky session: username includes a session ID (e.g., 's12345')
PROXY_USER = "YOUR_USERNAME-sid-s12345"
PROXY_PASS = "YOUR_PASSWORD"
PROXY_HOST = "b2b.liveproxies.io" # Or your specific assigned gateway
PROXY_PORT = "7383" # Check your assigned port
# Format for the requests library
proxy_url = f"http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}"
proxies = {"http": proxy_url, "https": proxy_url} # Handle both HTTP and HTTPS traffic
# --- End Configuration ---
def get_video_with_proxy(url):
"""Example function modified to use proxies"""
try:
headers = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/141.0.0.0 Safari/537.36"
}
# Add the 'proxies' argument and a timeout
response = requests.get(
url,
headers=headers,
proxies=proxies,
timeout=15, # Increase timeout slightly for proxies
verify=False, # Often needed for HTTPS via proxies, handle with care
)
response.raise_for_status() # Check for HTTP errors (like 407 Proxy Auth Required)
# Use a test URL like httpbin.org/ip to confirm the proxy IP
print(f"Success! Request made from IP: {response.json().get('origin')}")
# Replace with actual data processing for YouTube URLs
except requests.exceptions.ProxyError as e:
print(
f"Proxy Error: Could not connect to proxy. Check credentials/host/port. Details: {e}"
)
except requests.exceptions.Timeout:
print(f"Timeout Error: The request timed out connecting via proxy.")
except requests.exceptions.RequestException as e:
print(f"Request Error: {e}")
# --- Example Usage with a test URL ---
test_url = "https://httpbin.org/ip"
print(f"Testing proxy connection to {test_url}...")
get_video_with_proxy(test_url)
# To use a ROTATING session (new IP per request), simply remove the '-sid-...' part:
# PROXY_USER = "YOUR_USERNAME"
With a reliable proxy network integrated, your scraper is now equipped to handle YouTube's defenses and operate effectively at scale. For maximum performance on specific, high-demand platforms like YouTube, consider exploring proxies specifically optimized for those targets.
Scaling Your Scraper and Managing Data
The scripts we've built, combined with a quality proxy network, form the core of a YouTube scraper. But to handle thousands or millions of URLs reliably and make sense of the collected data, you need to think about system architecture and data processing.
Asynchronous Scraping
Traditional Python code (like our requests examples) is synchronous: it makes a request and waits idly for the server's response before doing anything else. This network waiting time is the biggest bottleneck.
Asynchronous programming solves this. Using Python's built-in asyncio library along with an async HTTP client like httpx or aiohttp, your program can initiate hundreds of requests concurrently. While waiting for one response, it starts or handles others, dramatically increasing throughput on a single machine.
Use case: The go-to for any project scraping more than a few hundred pages where speed is a priority. It maximizes the efficiency of your scraper and proxy usage.
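For illustration, here is a minimal asynchronous sketch using asyncio and httpx. It assumes httpx is installed, the proxy URL placeholder follows the same format as in the previous section, and the URL list is your own; it is a sketch of the pattern, not a drop-in scraper.

import asyncio
import httpx

# Placeholder proxy URL in the same format shown earlier (replace with your credentials)
PROXY_URL = "http://YOUR_USERNAME:YOUR_PASSWORD@b2b.liveproxies.io:7383"

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/141.0.0.0 Safari/537.36"
}

async def fetch(client, url, semaphore):
    # The semaphore caps concurrency so you don't overwhelm your proxy pool
    async with semaphore:
        try:
            response = await client.get(url, timeout=15)
            response.raise_for_status()
            return url, response.text
        except httpx.HTTPError as e:
            return url, f"Error: {e}"

async def scrape_many(urls, max_concurrency=20):
    semaphore = asyncio.Semaphore(max_concurrency)
    # Note: httpx versions before 0.26 use proxies= instead of proxy=
    async with httpx.AsyncClient(headers=HEADERS, proxy=PROXY_URL) as client:
        tasks = [fetch(client, url, semaphore) for url in urls]
        return await asyncio.gather(*tasks)

# --- Example usage ---
# results = asyncio.run(scrape_many(["https://www.youtube.com/watch?v=F8NKVhkZZWI"]))

The semaphore is the key design choice: it lets you tune concurrency to what your proxy plan and target can sustain, instead of firing every request at once.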
Distributed Scraping
Even an async scraper on one machine has limits. For enterprise-level scraping (millions of pages daily), you need to distribute the work across multiple machines (workers).
This is typically done using a distributed task queue framework like Celery, paired with a message broker like Redis.
- Producer. One part of your system identifies URLs to scrape and adds them as "jobs" to the Redis queue.
- Workers (consumers). Multiple separate Python processes (potentially on different servers) constantly watch the queue. Each worker grabs a job (a URL), executes your scraping logic (using proxies), and stores the result.
This architecture provides:
- Scalability – need more speed? Just launch more worker processes.
- Reliability – if one worker crashes, the job can be automatically retried by another.
- Decoupling – the process of finding URLs is separate from the scraping itself.
Use case: Necessary for very large-scale, ongoing scraping operations requiring high throughput and fault tolerance.
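As a rough sketch of that architecture, assuming Celery and Redis are installed and a Redis server is running on localhost (the scraping and storage helpers below are placeholders, not functions defined elsewhere in this guide):

# tasks.py -- a minimal Celery worker sketch
from celery import Celery

# Redis acts as the message broker (and here also as the result backend)
app = Celery(
    "youtube_scraper",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)

def scrape_video_page(video_url):
    # Placeholder: your actual scraping logic (requests + proxies + JSON parsing)
    return {"url": video_url}

def store_result(data):
    # Placeholder: write to your database or file store
    print(f"Stored: {data}")

@app.task(bind=True, max_retries=3, default_retry_delay=60)
def scrape_video_task(self, video_url):
    try:
        data = scrape_video_page(video_url)
        store_result(data)
        return video_url
    except Exception as exc:
        # Failed jobs are retried automatically, up to max_retries
        raise self.retry(exc=exc)

# Producer side: push jobs onto the queue from anywhere in your codebase
# for url in urls_to_scrape:
#     scrape_video_task.delay(url)

# Start workers with:  celery -A tasks worker --concurrency=8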
Data Storage Options
Raw scraped data (often JSON) needs structure before analysis. Choosing the right storage depends on scale and use case:
- CSV files – simple, portable, and great for smaller datasets or quick analysis. Becomes unwieldy and slow for millions of records.
- Databases – essential for larger, ongoing projects:
- PostgreSQL – excellent for structured, relational data (e.g., storing channels, videos, and comments in separate tables). Enforces data integrity.
- MongoDB – great for flexibility. Can store the raw JSON directly, making initial saving easy. Schema is flexible but requires more discipline during analysis.
- Parquet files – the best choice for very large datasets intended for analysis. It's a compressed, columnar format optimized for fast reading with tools like Pandas or data warehouses (BigQuery, Snowflake). Often used as an intermediary format.
Recommendation: Start simple (CSV/database). Scale to Parquet/data warehouses as needed.
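For example, here is a minimal sketch of saving scraped records with pandas. It assumes pandas is installed (plus pyarrow or fastparquet for Parquet support); the records themselves are invented for illustration.

import pandas as pd

# Hypothetical scraped records; your scraper would produce these
scraped_records = [
    {"video_id": "F8NKVhkZZWI", "title": "Example video", "view_count": 1600000},
    {"video_id": "abc123xyz00", "title": "Another video", "view_count": 53000},
]

df = pd.DataFrame(scraped_records)

# Small datasets: CSV is fine and easy to inspect
df.to_csv("videos.csv", index=False)

# Larger, analysis-oriented datasets: Parquet (requires pyarrow or fastparquet)
df.to_parquet("videos.parquet", index=False)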
Data Cleaning with Pandas
The Pandas library is the Python standard for data manipulation. Loading your scraped data into a Pandas DataFrame allows you to easily clean:
- Type conversion – change view counts like '1.6M views' into the integer 1600000.
- Date parsing – convert '1 year ago' or ISO date strings into proper datetime objects.
- Handling missing data – fill or drop records with missing fields.
- Text processing – clean comment text, extract keywords from descriptions using regex, etc. (Relevant for data parsing).
Typical workflow: scrape -> store raw data -> load into Pandas -> clean/transform -> store clean data -> analyze.
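Here is a minimal cleaning sketch along those lines. The column names and the K/M/B suffix convention are assumptions for illustration, not the output of any specific scraper in this guide.

import re
import pandas as pd

df = pd.DataFrame({
    "title": ["Video A", "Video B"],
    "views": ["1.6M views", "53K views"],           # raw scraped strings
    "published": ["2024-05-01T10:00:00Z", None],    # ISO dates, some missing
})

def parse_views(raw):
    """Convert strings like '1.6M views' into integers (assumes K/M/B suffixes)."""
    if not isinstance(raw, str):
        return None
    match = re.search(r"([\d.]+)\s*([KMB]?)", raw.replace(",", ""))
    if not match:
        return None
    number, suffix = float(match.group(1)), match.group(2)
    multiplier = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}[suffix]
    return int(number * multiplier)

df["views"] = df["views"].apply(parse_views)                          # type conversion
df["published"] = pd.to_datetime(df["published"], errors="coerce")    # date parsing
df = df.dropna(subset=["views"])                                      # handle missing data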
Building Resilient Scrapers
Building the scraper is just the first step. To run it reliably over time, especially at scale, you need to plan for inevitable failures and monitor performance.
Your scripts will eventually encounter errors – network issues, YouTube changing its layout, unexpected data formats, or proxy failures. A resilient system anticipates and handles these gracefully:
- Robust error handling. Use try...except blocks extensively. Catch specific exceptions (e.g., requests.RequestException, json.JSONDecodeError, proxy errors) and decide whether to retry the request, skip the URL, or log the error and move on. Don't let one bad URL crash your entire process.
- Monitoring and alerting. You can't fix problems you don't know exist. Log key metrics continuously:
- Success rate. Track the percentage of HTTP 200 responses versus errors (4xx, 5xx, timeouts).
- Latency. Monitor the average time taken per request. A sudden increase might indicate network issues or impending blocks.
- Items scraped. Track the number of videos, comments, etc., processed per hour or day. A sudden drop signals a problem.
- Proxy performance. Monitor success rates per proxy or provider.
Set up alerts (using tools like Sentry, Datadog, Grafana, or simple email/Slack notifications via your script) for significant drops in success rate, spikes in specific errors (like 429s or 404s), or a complete halt in data collection. This allows you to quickly detect when YouTube makes a change that breaks your scraper or when your proxy pool has issues.
- Exponential backoff. If you receive rate-limiting errors (like HTTP 429 Too Many Requests) or temporary server errors (like 503 Service Unavailable), don't immediately hammer the server with retries. Implement an exponential backoff strategy: wait 2 seconds, then 4, then 8, then 16 before retrying that specific request. This is polite to YouTube's servers and is often necessary to recover automatically from temporary blocks. Libraries like backoff can simplify this logic in Python; a minimal hand-rolled sketch follows this list.
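Here is that hand-rolled sketch using requests. It reuses a headers/proxies setup like the one shown earlier; the retryable status codes and retry count are arbitrary examples, not fixed rules.

import time
import requests

def fetch_with_backoff(url, headers=None, proxies=None, max_retries=5):
    """Retry a request with exponential backoff on 429s, 5xx errors, and network failures."""
    delay = 2  # seconds; doubles after each failed attempt (2, 4, 8, 16, ...)
    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(url, headers=headers, proxies=proxies, timeout=15)
            if response.status_code not in (429, 500, 502, 503):
                # Success, or a non-retryable error (e.g., 404) the caller should handle
                return response
            print(f"Attempt {attempt}: got HTTP {response.status_code}, backing off {delay}s")
        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt}: request failed ({e}), backing off {delay}s")
        time.sleep(delay)
        delay *= 2
    raise RuntimeError(f"Giving up on {url} after {max_retries} attempts")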
Applying Scraped Data: Business Use Cases
Collecting YouTube data is only useful if it leads to real insights. By combining the scraping techniques above with data analysis, you can extract real business value for YouTube-specific goals (and the same proxy infrastructure supports a wide variety of use cases on other platforms). Here are three common applications for the YouTube data we've gathered:
1. Competitor Channel Intelligence
Understanding your competitors' content strategy, performance, and audience engagement is crucial.
- Input. A list of competitor channel URLs.
- Scrape. Use the script that collects channel metadata to get their video lists and the script for video metadata for performance metrics on each video. Use the transcript scraper for topic analysis.
- Outputs & KPIs:
- Upload cadence. How frequently do they post content?
- Content mix. Ratio of long-form videos vs. Shorts?
- Performance benchmarks. Typical views/engagement at 7/30 days?
- Topic analysis. Key themes/keywords in titles, descriptions, transcripts?
- Top performing content. What resonates most with their audience?
2. Brand Mention & Comment Monitoring
Track brand perception, identify customer service issues, and engage with your audience by monitoring comments.
- Input. Your brand's video URLs, plus URLs of videos mentioning your brand.
- Scrape. Use the comment scraper regularly (e.g., daily/weekly) on target videos.
- Outputs & KPIs:
- Sentiment analysis. Track positive/negative/neutral trends over time.
- Emerging issues. Identify frequent negative keywords (e.g., "doesn't work", "bug found").
- Feature requests. Find comments suggesting improvements.
- Engagement opportunities. Flag comments needing a response from your team.
3. Keyword Research & SERP Tracking
Understand the video landscape for your target keywords, track your channel's visibility, and identify content gaps.
- Input. A list of relevant keywords for your industry.
- Scrape. Use the search results scraper regularly (e.g., daily/weekly) for each keyword, storing video ranks.
- Outputs & KPIs:
- Average rank. Track your channel's average position over time.
- Keyword visibility. See which keywords your videos rank for.
- Competitor SERP presence. Analyze which competitors dominate results.
- Content gaps. Identify keywords with weak search results.
- SERP volatility. Track how much results change for specific keywords.
Further reading: How to Scrape Walmart: Data, Prices, Products, and Reviews (2025) and How to Scrape Yahoo Finance Using Python and Other Tools.
Conclusion
This guide explained the core technique of scraping YouTube: reverse-engineering its internal API calls. By parsing embedded JSON data and mastering the continuation token system, you can access the rich, real-time data essential for in-depth analysis. However, implementing these methods highlights a significant challenge: scaling requires overcoming robust anti-blocking measures.
Therefore, integrating high-quality residential or mobile proxies, such as those from Live Proxies, isn't just an option; it's crucial infrastructure. Combined with intelligent session management and smart coding practices, proxies provide the reliability needed for large-scale data collection.
Your Next Steps: Experiment with the provided scripts. Once you understand the concepts, integrate a high-quality proxy solution to maximize the value of YouTube data scraping for your projects.
FAQs
Can I scrape YouTube without the data API?
Yes, absolutely. This guide focuses entirely on methods that bypass the official API by directly interacting with the website's internal structure (embedded JSON) and private API endpoints (like /next, /browse, /search), using techniques like continuation tokens. This allows access to richer, real-time data but requires handling anti-scraping measures, primarily through proxies.
Can I scrape private or unlisted videos?
No. Scraping should only target publicly available content. Private videos require login authentication, and attempting to bypass this violates terms of service and privacy. Unlisted videos are only accessible via direct URL and won't appear in scraped search results or channel feeds. Ethical scraping respects privacy and access controls.
Is downloading videos the same as scraping metadata?
No, they're different. This guide focuses on scraping metadata (textual information about videos, comments, channels, etc.). Downloading the actual video files (e.g., .mp4) is typically done with different tools (like yt-dlp) and is outside the scope of this guide. For data analysis, you almost always need the metadata, not the video file itself.
How often should I re-scrape YouTube data?
It depends entirely on your goal:
- Rank tracking (SERPs). Daily or even hourly for competitive keywords.
- Comment monitoring. Daily for newly uploaded videos, perhaps weekly or monthly for older, less active ones.
- Competitor uploads. Daily or weekly.
- Video stats (views/likes). These change rapidly, especially for new videos. You might re-scrape popular videos hourly for the first day, then daily, then weekly. Monitor the rate of change to determine the optimal frequency.
Why do view/like counts sometimes seem inconsistent?
YouTube counts aren't always updated globally in real-time due to caching and system delays. You might also scrape a formatted string (like "1.6M") and later get a precise number ("1,602,123"). This is normal. Proper data cleaning (like using Pandas, as mentioned earlier) is needed to normalize these values (e.g., convert "1.6M" to 1600000) for accurate analysis.
How can I scrape all comments from a video with millions?
Paginating through millions of comments can be extremely slow and resource-intensive. Consider sampling strategies:
- Scrape the first N pages of Top Comments (default sort).
- Scrape the first N pages of Newest First comments.
- Scrape comments posted within specific time windows (e.g., first 24 hours, recent week). For many sentiment or topic analyses, a representative sample is often sufficient.
What if a video doesn't have a transcript?
This is common. Creators might not upload manual captions, and auto-generated ones might be unavailable or disabled. Your scraping script (especially if using a library like youtube-transcript-api) should handle this gracefully and record that the transcript is missing, rather than crashing or leaving a null value.
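A minimal sketch of that graceful handling with youtube-transcript-api might look like the following; note that the exact call depends on the library version you have installed, so treat this as a pattern rather than a guaranteed API.

from youtube_transcript_api import YouTubeTranscriptApi

def get_transcript_or_none(video_id):
    """Return the transcript text, or None if no transcript is available."""
    try:
        # Older releases expose a static get_transcript(); newer (1.x) releases
        # use an instance-based fetch method instead -- check your installed version.
        segments = YouTubeTranscriptApi.get_transcript(video_id)
        return " ".join(segment["text"] for segment in segments)
    except Exception as e:
        # Typically raised when transcripts are disabled or not found; record the gap
        print(f"No transcript for {video_id}: {e}")
        return None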
How do I get results for a specific country or language?
YouTube heavily personalizes results based on location and language. To get results as seen in a specific region (e.g., Germany):
- Use a proxy server located in that specific country/city. High-quality residential or mobile proxies often allow precise geo-targeting options.
- Send appropriate Accept-Language HTTP headers (e.g., de-DE,de;q=0.9) with your requests, as in the sketch below.
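For illustration, here is a minimal sketch combining both. The search query and header values are examples; geo-targeting of the proxy itself is configured through your provider, so the proxy URL below is just the generic placeholder used earlier.

import requests

# Placeholder proxy; consult your provider's docs for country-level targeting syntax
proxy_url = "http://YOUR_USERNAME:YOUR_PASSWORD@b2b.liveproxies.io:7383"
proxies = {"http": proxy_url, "https": proxy_url}

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/141.0.0.0 Safari/537.36",
    "Accept-Language": "de-DE,de;q=0.9",  # request German-language results
}

response = requests.get(
    "https://www.youtube.com/results",
    params={"search_query": "kopfhörer test"},  # example German-market query
    headers=headers,
    proxies=proxies,
    timeout=15,
)
print(response.status_code)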




