Facebook holds billions of posts, comments, and marketplace listings, data that can reveal market trends, sentiment shifts, and pricing opportunities. But extracting it isn't straightforward. Facebook blocks standard HTTP requests, serves degraded HTML to bots, and shows login walls when you try to scroll through feeds.
This guide walks through scraping Facebook at scale, covering posts, comments, groups, and Marketplace listings, using Python, TLS browser emulation, and API reverse engineering. You'll get working code and learn which Facebook scraping tools bypass detection systems and handle anti-bot measures.
How to scrape Facebook in 10 lines?
Let’s start with a direct task: scraping a specific post URL. This requires an HTTP client that handles TLS handshake emulation, not full browser automation.
The code
It fetches the HTML and parses metadata using regex:
import re
from curl_cffi import requests

# Fetch HTML and extract data using regex
html = requests.get(
    "https://www.facebook.com/zuck/posts/two-decades-many-awesome-projects-and-even-more-plaid-shirts-grateful-for-20-yea/10117222347025301/",
    impersonate="chrome",
).text

author = re.search(r'"actors":\[{"__typename":"User","name":"([^"]+)"', html).group(1)
text = re.search(r'<meta property="og:description" content="([^"]*)"', html).group(1)
reactions = re.search(r'"reaction_count":{"count":(\d+)', html).group(1)
comments = re.search(r'"total_count":(\d+)', html).group(1)
shares = re.search(r'"share_count":{"count":(\d+)', html).group(1)

print(
    f"Author: {author}\nPost Text: {text}\nReactions: {reactions}\nComments: {comments}\nShares: {shares}"
)
Output
Author: Mark Zuckerberg
Post Text: Two decades, many awesome projects, and even more plaid shirts. Grateful for 20 years of building the future together!
Reactions: 190863
Comments: 51086
Shares: 3704
Why this works
Standard Python requests fail because Facebook analyzes the TLS fingerprint during the SSL handshake. The platform checks the JA3 signature and serves a degraded HTML shell to any handshake that doesn't match a known browser version, and this happens before cookies are even checked, so you can't fake your way past it with session tokens alone.
curl_cffi solves this by handling browser emulation at the C-level, replicating Chrome's TLS handshake to bypass the bot detection layer. The impersonate="chrome" parameter configures the library to match Chrome's exact TLS fingerprint.
The limitation
This approach works for direct post URLs (permalinks). It does not handle discovery.
This 10-line approach works for testing and small-scale extraction. Scraping at volume requires IP rotation and proxy management to avoid rate limits. It’s covered in the "How to avoid blocks" section below.
When scrolling through a profile page or group feed, Facebook displays a login wall after a few posts. On Marketplace category or search pages, scrolling extends to approximately 20-30 listings before the login prompt appears. API reverse engineering bypasses these scroll limitations by replicating the backend requests directly.
However, direct URLs to individual posts and group posts work without authentication and provide access to full content, including post text and comments.

For posts and groups, the safe approach is to use SERP discovery (Google Search) to find direct URLs, then use this extraction method to parse them. This decouples the architecture and isolates the scraping infrastructure.
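Before handing a discovered URL to the extractor, it's worth confirming it's a direct permalink rather than a feed or profile page. A minimal sketch (the helper name and pattern list are illustrative, not a Facebook API):

```python
import re

# Patterns that identify direct Facebook post permalinks (illustrative subset)
POST_PATTERNS = [
    r"facebook\.com/[^/]+/posts/\d+",
    r"facebook\.com/story\.php\?.*story_fbid=\d+",
    r"facebook\.com/groups/[^/]+/posts/\d+",
]

def is_post_permalink(url: str) -> bool:
    """Return True if the URL points at a single post, not a feed."""
    return any(re.search(p, url) for p in POST_PATTERNS)

print(is_post_permalink("https://www.facebook.com/zuck/posts/10117222347025301/"))
print(is_post_permalink("https://www.facebook.com/zuck"))
```

Running the check up front keeps the extraction stage from wasting requests on pages that will only return a login wall.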
Is Facebook scraping legal and safe?
Scraping publicly accessible Facebook content while logged out is generally lower risk, as Meta’s Terms primarily govern logged-in usage. However, Meta still enforces technical and behavioral restrictions, and IPs exhibiting automated patterns may be blocked regardless of login state.
Stay logged out. If content requires a login, don't scrape it. Scraping behind authentication violates Meta's Terms of Service.
Isolate your scraping infrastructure. Don't scrape from IPs connected to your personal Facebook account. Meta tracks request patterns and links them to profiles. Run your scraper through residential proxies on separate servers.
Minimize personal data storage. Storing names or IDs without consent creates GDPR/CCPA liability. For sentiment analysis, retain the comment text but exclude usernames and profile links.
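A sanitization pass can enforce this before anything hits disk. A minimal sketch, assuming comments arrive as dicts with `author`, `id`, and `text` keys (the field names mirror the comment structure used later in this guide):

```python
def sanitize_comment(comment: dict) -> dict:
    """Keep sentiment-relevant fields; drop PII before storage."""
    # Assumed input shape: {"id": ..., "author": ..., "text": ..., "reactions": ..., "timestamp": ...}
    return {
        "text": comment.get("text"),
        "reactions": comment.get("reactions", 0),
        "timestamp": comment.get("timestamp"),
    }

raw = {"id": "Y29t...", "author": "Jane Doe", "text": "Great initiative!", "reactions": 23}
clean = sanitize_comment(raw)
print(clean)  # no "author" or "id" keys remain
```

Apply it at ingestion time, not as a later cleanup job, so identifying fields never reach storage in the first place.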
What Facebook data has real value?
Scraping full HTML bloats storage and slows processing. Focus on structured fields:
- Reaction breakdowns. Total reaction counts are vanity metrics. The distribution matters. Comparing Like vs. Angry vs. Sad provides sentiment signals without running NLP on comment text.
- Marketplace pricing + location. Raw prices are noisy, but pairing prices with geolocation reveals arbitrage opportunities. You can map regional price variances (iPhones cost less in rural areas than in cities) and spot underpriced items.
- Engagement ratios. Look at comments vs. shares, not totals alone. High comments with low shares indicate controversy or arguments. High shares with low comments indicate endorsement. This reveals the virality pattern.
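The ratio heuristic above can be sketched as a simple classifier. The thresholds here are illustrative assumptions, not Facebook-derived constants; tune them against your own data:

```python
def classify_engagement(comments: int, shares: int) -> str:
    """Label a post by its comment/share balance (illustrative thresholds)."""
    if comments == 0 and shares == 0:
        return "dormant"
    ratio = comments / max(shares, 1)
    if ratio > 3:
        return "controversial"  # lots of arguing, little sharing
    if ratio < 0.5:
        return "endorsed"       # shared widely, little discussion
    return "balanced"

print(classify_engagement(comments=900, shares=100))  # controversial
print(classify_engagement(comments=40, shares=500))   # endorsed
```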
Further reading: 8 Best Proxies for AI Tools and Scalable Data Collection in 2026 and How Live Proxies Help Prevent IP Bans in Large-Scale Web Scraping.
How to scrape Facebook posts and comments reliably?
You can't scrape profile timelines or group feeds directly, as Facebook blocks unauthenticated scrolling with login walls after a few posts. But direct URLs to specific posts provide access to full content, including comments.
Facebook splits post data into 2 layers. Post content arrives in the initial HTML, but comments load asynchronously via GraphQL calls.

The approach: decouple discovery from extraction
The 10-line method works for known URLs. Finding those URLs requires a different strategy.
The workflow splits into 3 stages:
- Discovery. Use Google search operators to find Facebook post URLs without hitting login walls. This bypasses Facebook's feed pagination limits.
- Extraction. Use curl_cffi with regex to parse post metadata (timestamps, text, engagement metrics) from the discovered URLs.
- Comment pagination. Query Facebook's GraphQL API directly to paginate through comments without rendering the full UI.

Setup and environment
This builds on the approach above, adding automated discovery to find URLs at scale.
Run this code in an isolated virtual environment. You need Python 3.10+ and Google Chrome.
Create the environment:
python -m venv fb_scraper
source fb_scraper/bin/activate # Windows: fb_scraper\Scripts\activate
Install dependencies:
pip install seleniumbase curl-cffi
seleniumbase install chromedriver
SeleniumBase handles search automation, and curl_cffi handles TLS fingerprinting.
Implementation
Save the following code as facebook_scraper.py. This script uses FacebookURLCollector for Google-based discovery and scrape_comments_graphql for backend data retrieval.
import re
import time
import random
import json
import urllib.parse
from datetime import datetime
from typing import List, Dict

from curl_cffi import requests
from seleniumbase import Driver

# Optional: uncomment to use Live Proxies
# PROXY_URL = "http://username:[email protected]:7383"

# GraphQL doc_id for comment pagination
GRAPHQL_COMMENT_DOC_ID = "9442468175864664"


class FacebookURLCollector:
    """
    Discovers Facebook post URLs using Google search.
    This bypasses Facebook's login wall on profile feeds.
    """

    def __init__(self, use_proxy: bool = False):
        self.use_proxy = use_proxy
        self.search_queries_used = 0

    def _get_driver(self):
        """Initialize browser for Google search."""
        driver_args = {
            "uc": True,         # Undetected Chrome
            "headless": False,  # Set True for production
            "incognito": True,
            "disable_js": False,
        }
        if self.use_proxy and 'PROXY_URL' in globals():
            # Parse proxy for SeleniumBase format
            proxy_parts = PROXY_URL.replace("http://", "").split("@")
            if len(proxy_parts) == 2:
                creds, host_port = proxy_parts
                driver_args["proxy"] = host_port
                driver_args["proxy_user"] = creds.split(":")[0]
                driver_args["proxy_pass"] = creds.split(":")[1]
        return Driver(**driver_args)

    def search_page_posts(
        self,
        page_name: str,
        max_results: int = 10,
        days_back: int = 30
    ) -> List[str]:
        """
        Search Google for Facebook post URLs from a specific page.

        Args:
            page_name: Facebook page name or username
            max_results: Maximum number of post URLs to collect
            days_back: Search within last N days

        Returns:
            List of Facebook post URLs
        """
        urls = []
        # Time partitioning: search recent periods first
        time_ranges = [
            ("past week", 7),
            ("past month", 30),
            ("past year", 365)
        ]
        driver = self._get_driver()
        try:
            for period_name, period_days in time_ranges:
                if len(urls) >= max_results:
                    break
                # Build Google search query
                # This finds post URLs, not the main page
                query = f'site:facebook.com "{page_name}" (posts OR videos OR photos) -inurl:photos_albums -inurl:groups'
                search_url = f"https://www.google.com/search?q={urllib.parse.quote(query)}&num=20&tbs=qdr:{'w' if period_days == 7 else 'm' if period_days == 30 else 'y'}"
                print(f"Searching Google for {period_name} posts...")
                driver.get(search_url)
                time.sleep(random.uniform(3, 5))
                # Extract Facebook URLs from search results
                # Google wraps URLs, need to decode
                links = driver.find_elements("css selector", "a[href*='facebook.com']")
                for link in links:
                    href = link.get_attribute("href")
                    if not href:
                        continue
                    # Decode Google redirect URL if needed
                    if "/url?q=" in href:
                        match = re.search(r'/url\?q=([^&]+)', href)
                        if match:
                            href = urllib.parse.unquote(match.group(1))
                    # Filter for actual post URLs
                    if self._is_valid_post_url(href, page_name):
                        if href not in urls:
                            urls.append(href)
                            print(f"Found: {href}")
                    if len(urls) >= max_results:
                        break
                # Anti-detection delay between Google searches
                self.search_queries_used += 1
                time.sleep(random.uniform(10, 15))
        finally:
            driver.quit()
        return urls[:max_results]

    def _is_valid_post_url(self, url: str, page_name: str) -> bool:
        """
        Validate if URL is a Facebook post URL for the target page.

        Valid patterns:
        - facebook.com/{page}/posts/{id}
        - facebook.com/{page}/photos/{id}
        - facebook.com/{page}/videos/{id}
        - facebook.com/story.php?story_fbid={id}&id={page_id}
        """
        if not url or "facebook.com" not in url:
            return False
        # Must contain page name or be story.php
        if page_name.lower() not in url.lower() and "story.php" not in url:
            return False
        # Exclude non-post URLs
        exclude_patterns = [
            "/about", "/photos_albums", "/events", "/reviews",
            "/community", "/likes", "/followers", "/groups/",
            "/marketplace/", "/watch/", "/reel/"
        ]
        if any(pattern in url for pattern in exclude_patterns):
            return False
        # Include valid post patterns
        include_patterns = [
            r'/posts/\d+',
            r'/photos/[^/]+/\d+',
            r'/videos/\d+',
            r'/story\.php\?.*story_fbid=\d+',
            r'/permalink\.php\?.*story_fbid=\d+'
        ]
        return any(re.search(pattern, url) for pattern in include_patterns)


class FacebookPostScraper:
    """
    Scrapes Facebook post data from direct URLs.
    Uses curl_cffi for browser TLS emulation.
    """

    def __init__(self, use_proxy: bool = False):
        self.use_proxy = use_proxy
        self.session = requests.Session()
        # Configure session
        if use_proxy and 'PROXY_URL' in globals():
            self.session.proxies = {"http": PROXY_URL, "https": PROXY_URL}

    def scrape_post(self, url: str, comment_limit: int = 0) -> Dict:
        """
        Scrape a single Facebook post.

        Args:
            url: Direct Facebook post URL
            comment_limit: Number of comments to fetch (0 = skip comments)

        Returns:
            Dict with post data
        """
        try:
            response = self.session.get(
                url,
                impersonate="chrome",
                timeout=30
            )
            html = response.text
            # Check if blocked
            if "login" in response.url.lower() or "checkpoint" in html.lower():
                print(f"Blocked or redirected for {url}")
                return {"url": url, "error": "blocked"}
            # Parse post data from HTML
            post_data = self._parse_post_html(html, url)
            # Fetch comments if requested
            if comment_limit > 0 and post_data.get("feedback_id"):
                comments = self._fetch_comments(
                    feedback_id=post_data["feedback_id"],
                    limit=comment_limit
                )
                post_data["comments"] = comments
                post_data["comment_count_fetched"] = len(comments)
            return post_data
        except Exception as e:
            print(f"Error scraping {url}: {e}")
            return {"url": url, "error": str(e)}

    def _parse_post_html(self, html: str, url: str) -> Dict:
        """Extract post metadata from HTML using regex."""
        data = {"url": url}
        # Post ID from URL or HTML
        post_id_match = re.search(r'/posts/(\d+)', url)
        if post_id_match:
            data["post_id"] = post_id_match.group(1)
        else:
            # Try story_fbid
            story_match = re.search(r'story_fbid=(\d+)', url)
            if story_match:
                data["post_id"] = story_match.group(1)
        # Author name
        author_patterns = [
            r'"actors":\[{"__typename":"User","name":"([^"]+)"',
            r'"owner":{"__typename":"User","name":"([^"]+)"',
            r'<meta property="og:title" content="([^"]+)"'
        ]
        for pattern in author_patterns:
            match = re.search(pattern, html)
            if match:
                data["author"] = match.group(1)
                break
        # Post text/description
        text_patterns = [
            r'<meta property="og:description" content="([^"]*)"',
            r'"message":{"text":"([^"]*)"',
            r'"comet_sections".*"message":{"text":"([^"]*)"'
        ]
        for pattern in text_patterns:
            match = re.search(pattern, html)
            if match:
                # Decode escaped unicode
                text = match.group(1).replace('\\n', '\n')
                text = bytes(text, 'utf-8').decode('unicode_escape')
                data["text"] = text
                break
        # Timestamp
        timestamp_match = re.search(r'"publish_time":(\d+)', html)
        if timestamp_match:
            data["timestamp"] = int(timestamp_match.group(1))
            data["datetime"] = datetime.fromtimestamp(
                int(timestamp_match.group(1))
            ).isoformat()
        # Engagement metrics
        metrics = {}
        # Total reactions
        reaction_match = re.search(r'"reaction_count":{"count":(\d+)', html)
        if reaction_match:
            metrics["reactions"] = int(reaction_match.group(1))
        # Comments
        comment_matches = re.findall(r'"total_count":(\d+)', html)
        if comment_matches:
            # First total_count is usually comments
            metrics["comments"] = int(comment_matches[0])
        # Shares
        share_match = re.search(r'"share_count":{"count":(\d+)', html)
        if share_match:
            metrics["shares"] = int(share_match.group(1))
        data["engagement"] = metrics
        # Reaction breakdown
        reaction_types = ["LIKE", "LOVE", "CARE", "HAHA", "WOW", "SAD", "ANGRY"]
        reaction_breakdown = {}
        for reaction_type in reaction_types:
            pattern = rf'"key":"{reaction_type}".*?"reaction_count":(\d+)'
            match = re.search(pattern, html)
            if match:
                reaction_breakdown[reaction_type.lower()] = int(match.group(1))
        if reaction_breakdown:
            data["reaction_breakdown"] = reaction_breakdown
        # Feedback ID for GraphQL comment fetching
        feedback_match = re.search(r'"feedback":{"id":"([^"]+)"', html)
        if feedback_match:
            data["feedback_id"] = feedback_match.group(1)
        return data

    def _fetch_comments(self, feedback_id: str, limit: int = 10) -> List[Dict]:
        """
        Fetch comments via Facebook GraphQL API.

        Args:
            feedback_id: Post feedback ID from HTML
            limit: Number of comments to fetch

        Returns:
            List of comment dicts
        """
        comments = []
        cursor = None
        while len(comments) < limit:
            try:
                # Build GraphQL variables
                variables = {
                    "display_comments_feedback_context": None,
                    "display_comments_context_enable_comment": None,
                    "display_comments_context_is_ad_preview": None,
                    "display_comments_context_is_aggregated_share": None,
                    "display_comments_context_is_story_set": None,
                    "feedLocation": "PERMALINK",
                    "feedbackSource": 2,
                    "focusCommentID": None,
                    "gridMediaWidth": 230,
                    "scale": 1,
                    "useDefaultActor": False,
                    "id": feedback_id,
                    "privacySelectorRenderLocation": "COMET_STREAM",
                    "renderLocation": "permalink",
                    "serializedPreloadedQueryName": "CometFocusedStoryViewUFIQuery",
                    "storyID": None,
                    "__relay_internal__pv__IsWorkUserrelayprovider": False,
                    "commentsAfterCount": limit - len(comments),
                    "commentsAfterCursor": cursor,
                    "commentsBeforeCount": None,
                    "commentsBeforeCursor": None,
                    "commentsIntentToken": "RANKED_UNFILTERED_CHRONOLOGICAL",
                    "feedContext": None,
                    "feedbackContext": {
                        "feedLocation": "PERMALINK",
                        "feedbackSource": 2,
                        "groupID": None,
                        "storyID": None
                    },
                    "filterNonCommentItem": True
                }
                data = {
                    "av": "0",
                    "__user": "0",
                    "__a": "1",
                    "__req": "1",
                    "__hs": "",
                    "dpr": "1",
                    "__ccg": "UNKNOWN",
                    "__rev": "",
                    "__s": "",
                    "__hsi": "",
                    "__dyn": "",
                    "__csr": "",
                    "fb_dtsg": "NA",
                    "jazoest": "NA",
                    "lsd": "NA",
                    "__spin_r": "",
                    "__spin_b": "trunk",
                    "__spin_t": str(int(time.time())),
                    "fb_api_caller_class": "RelayModern",
                    "fb_api_req_friendly_name": "CometFocusedStoryViewUFIQuery",
                    "variables": json.dumps(variables),
                    "server_timestamps": "true",
                    "doc_id": GRAPHQL_COMMENT_DOC_ID
                }
                response = self.session.post(
                    "https://www.facebook.com/api/graphql/",
                    data=data,
                    headers={
                        "Content-Type": "application/x-www-form-urlencoded",
                        "Accept": "*/*",
                        "Origin": "https://www.facebook.com",
                        "Referer": "https://www.facebook.com/"
                    },
                    impersonate="chrome",
                    timeout=30
                )
                # Parse response
                text = response.text
                # GraphQL response may have non-JSON prefix
                json_start = text.find("{")
                if json_start == -1:
                    break
                result = json.loads(text[json_start:])
                # Navigate to comments data
                edges = (
                    result.get("data", {})
                    .get("node", {})
                    .get("display_comments", {})
                    .get("edges", [])
                )
                if not edges:
                    break
                for edge in edges:
                    node = edge.get("node", {})
                    comment = {
                        "id": node.get("id"),
                        "author": node.get("author", {}).get("name"),
                        "text": node.get("body", {}).get("text"),
                        "timestamp": node.get("created_time"),
                        "reactions": node.get("feedback", {}).get("reactors", {}).get("count", 0)
                    }
                    comments.append(comment)
                    if len(comments) >= limit:
                        break
                # Check pagination
                page_info = (
                    result.get("data", {})
                    .get("node", {})
                    .get("display_comments", {})
                    .get("page_info", {})
                )
                cursor = page_info.get("end_cursor")
                if not page_info.get("has_next_page") or not cursor:
                    break
                # Delay between GraphQL requests
                time.sleep(random.uniform(1, 2))
            except Exception as e:
                print(f"Error fetching comments: {e}")
                break
        return comments[:limit]


def scrape_facebook_page(
    page_url: str,
    post_limit: int = 10,
    comments_per_post: int = 0,
    use_proxy: bool = False
) -> List[Dict]:
    """
    Main workflow: discover post URLs then scrape them.

    Args:
        page_url: Facebook page URL
        post_limit: Number of posts to scrape
        comments_per_post: Comments per post to fetch
        use_proxy: Whether to use Live Proxies

    Returns:
        List of scraped post data
    """
    # Extract page name from URL
    page_name = page_url.rstrip("/").split("/")[-1]
    print(f"Starting scrape for page: {page_name}")
    print(f"Target: {post_limit} posts, {comments_per_post} comments/post")
    # Stage 1: Discover URLs via Google
    collector = FacebookURLCollector(use_proxy=use_proxy)
    post_urls = collector.search_page_posts(
        page_name=page_name,
        max_results=post_limit
    )
    print(f"\nFound {len(post_urls)} post URLs")
    if not post_urls:
        print("No post URLs discovered")
        return []
    # Stage 2: Scrape each post
    scraper = FacebookPostScraper(use_proxy=use_proxy)
    results = []
    for i, url in enumerate(post_urls, 1):
        print(f"\n[{i}/{len(post_urls)}] Scraping: {url}")
        result = scraper.scrape_post(
            url=url,
            comment_limit=comments_per_post
        )
        if "error" not in result:
            results.append(result)
            print(f"  ✓ Success: {result.get('author', 'Unknown')} - {result.get('engagement', {}).get('reactions', 0)} reactions")
        else:
            print(f"  ✗ Failed: {result['error']}")
        # Anti-detection delay between posts
        if i < len(post_urls):
            delay = random.uniform(2, 4)
            print(f"  Waiting {delay:.1f}s...")
            time.sleep(delay)
    return results


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(
        description="Scrape Facebook posts and comments using Google discovery"
    )
    parser.add_argument(
        "--url", "-u",
        required=True,
        help="Facebook page URL (e.g., https://www.facebook.com/narendramodi)"
    )
    parser.add_argument(
        "--posts", "-p",
        type=int,
        default=10,
        help="Number of posts to scrape"
    )
    parser.add_argument(
        "--comments", "-c",
        type=int,
        default=0,
        help="Comments per post to fetch"
    )
    parser.add_argument(
        "--proxy",
        action="store_true",
        help="Use Live Proxies for rotation"
    )
    parser.add_argument(
        "--output", "-o",
        help="Output JSON filename"
    )
    args = parser.parse_args()
    # Run scraper
    results = scrape_facebook_page(
        page_url=args.url,
        post_limit=args.posts,
        comments_per_post=args.comments,
        use_proxy=args.proxy
    )
    # Save results
    output_file = args.output or f"facebook_{args.url.rstrip('/').split('/')[-1]}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)
    print(f"\n{'='*60}")
    print("Scraping complete!")
    print(f"Total posts scraped: {len(results)}")
    print(f"Output saved to: {output_file}")
Running the scraper
Execute the script with your target page:
Basic scrape (posts only):
python facebook_scraper.py -u https://www.facebook.com/narendramodi -p 10
Deep scrape (posts + 10 comments each):
python facebook_scraper.py -u https://www.facebook.com/narendramodi -p 10 -c 10
Parameters:
- --url (-u) – Target Facebook page URL
- --posts (-p) – Number of posts to discover via Google
- --comments (-c) – Number of comments to extract per post (default: 0)
- --proxy – Route requests through Live Proxies (optional)
- --output (-o) – Custom output filename (optional)
The script outputs a JSON file with post metadata and comments.
What you get back
Here's the output structure, which is nested JSON with post text, engagement metrics, reaction breakdowns, and comments:
{
  "url": "https://www.facebook.com/narendramodi/posts/...",
  "post_id": "1671969867459607",
  "author": "Narendra Modi",
  "text": "Had a wonderful interaction with students...",
  "timestamp": 1735154400,
  "datetime": "2024-12-25T10:00:00",
  "engagement": {
    "reactions": 45231,
    "comments": 2184,
    "shares": 892
  },
  "reaction_breakdown": {
    "like": 38000,
    "love": 5231
  },
  "feedback_id": "ZmVlZGJhY2s6...",
  "comments": [
    {
      "id": "Y29tbWVudDoxMjM0NTY=",
      "author": "User Name",
      "text": "Great initiative!",
      "timestamp": 1735158000,
      "reactions": 23
    }
  ],
  "comment_count_fetched": 10
}
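Once saved, downstream analysis is plain JSON handling. A sketch that ranks scraped posts by reactions, using the structure above (inline sample records stand in for a real output file):

```python
# Sample records mirroring the output structure above
posts = [
    {"post_id": "1", "author": "A", "engagement": {"reactions": 120, "comments": 30, "shares": 4}},
    {"post_id": "2", "author": "B", "engagement": {"reactions": 4500, "comments": 210, "shares": 90}},
]
# In practice, load the saved file instead, e.g. json.load(open("facebook_page.json"))

top = sorted(posts, key=lambda p: p["engagement"].get("reactions", 0), reverse=True)
for p in top:
    print(p["post_id"], p["engagement"]["reactions"])
```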
Why regex over DOM parsers
Regex scans Facebook's large server-rendered React payloads faster than building a full parse tree with BeautifulSoup. Pre-compiled patterns with re.compile() reduce overhead during high-volume extraction.
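A sketch of the pre-compiled pattern approach, reusing the same engagement patterns the scraper applies per page:

```python
import re

# Compile once at module load; reuse across thousands of documents
REACTIONS = re.compile(r'"reaction_count":{"count":(\d+)')
SHARES = re.compile(r'"share_count":{"count":(\d+)')

def extract_metrics(html: str) -> dict:
    """Run pre-compiled patterns over a raw HTML payload."""
    out = {}
    if (m := REACTIONS.search(html)):
        out["reactions"] = int(m.group(1))
    if (m := SHARES.search(html)):
        out["shares"] = int(m.group(1))
    return out

sample = '... "reaction_count":{"count":190863} ... "share_count":{"count":3704} ...'
print(extract_metrics(sample))  # {'reactions': 190863, 'shares': 3704}
```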
For discovery, the script uses time partitioning: it queries Google over distinct date ranges (past week, past month, past year) to reach older posts without re-fetching duplicate results.
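Each window maps to Google's `tbs=qdr:` URL parameter (w = week, m = month, y = year). A minimal sketch of the query construction:

```python
import urllib.parse

# Google's date-restrict codes: w = past week, m = past month, y = past year
QDR_CODES = {7: "w", 30: "m", 365: "y"}

def build_search_url(page_name: str, days: int) -> str:
    """Build a site-restricted Google query for one time window."""
    query = f'site:facebook.com "{page_name}" (posts OR videos OR photos)'
    return (
        "https://www.google.com/search?q="
        + urllib.parse.quote(query)
        + f"&num=20&tbs=qdr:{QDR_CODES[days]}"
    )

for days in (7, 30, 365):
    print(build_search_url("narendramodi", days))
```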
The GRAPHQL_COMMENT_DOC_ID identifies the query type. If comment extraction fails, open your browser's Network tab, filter for graphql, and inspect the payload variables to find the updated doc_id.

How to scrape Facebook groups (public) without breaking rules?
Groups use a different URL structure. You can't access feeds directly, so find group post URLs through search engines instead.

Group posts look like this: https://www.facebook.com/groups/{group_id}/posts/{post_id}/
The approach: search index discovery
The method mirrors the page scraper. A headless browser searches Google for site:facebook.com/groups/{group_name}, finds direct post URLs, then the same regex patterns extract the post data.
Building the scraper
Save this as facebook_group_scraper.py.
It uses the same time-partitioning idea as above: searching different time windows (past week, past month) retrieves more results without triggering rate limits.
import re
import time
import random
import json
import html as html_lib
import urllib.parse
from typing import List, Dict

from curl_cffi import requests
from seleniumbase import Driver

# Optional proxy
# PROXY_URL = "http://username:[email protected]:7383"


class FacebookGroupScraper:
    """
    Scrapes public Facebook group posts discovered via Google.
    """

    def __init__(self, use_proxy: bool = False):
        self.use_proxy = use_proxy

    def discover_group_posts(
        self,
        group_name: str,
        max_results: int = 10
    ) -> List[str]:
        """
        Search Google for public group post URLs.

        Args:
            group_name: Facebook group name or keyword
            max_results: Number of post URLs to find

        Returns:
            List of group post URLs
        """
        urls = []
        driver_args = {
            "uc": True,
            "headless": False,  # Set True for production
            "incognito": True,
        }
        if self.use_proxy and 'PROXY_URL' in globals():
            proxy_parts = PROXY_URL.replace("http://", "").split("@")
            if len(proxy_parts) == 2:
                creds, host_port = proxy_parts
                driver_args["proxy"] = host_port
                driver_args["proxy_user"] = creds.split(":")[0]
                driver_args["proxy_pass"] = creds.split(":")[1]
        driver = Driver(**driver_args)
        try:
            # Multiple search queries for better coverage
            queries = [
                f'site:facebook.com/groups "{group_name}" "/posts/"',
                f'site:facebook.com/groups/{group_name} "/posts/"',
                f'"{group_name}" site:facebook.com/groups "posts"'
            ]
            for query in queries:
                if len(urls) >= max_results:
                    break
                search_url = f"https://www.google.com/search?q={urllib.parse.quote(query)}&num=20"
                print(f"Searching: {query}")
                driver.get(search_url)
                time.sleep(random.uniform(3, 5))
                links = driver.find_elements("css selector", "a[href*='facebook.com/groups/']")
                for link in links:
                    href = link.get_attribute("href")
                    if not href:
                        continue
                    # Decode Google redirect
                    if "/url?q=" in href:
                        match = re.search(r'/url\?q=([^&]+)', href)
                        if match:
                            href = urllib.parse.unquote(match.group(1))
                    # Validate group post URL
                    if self._is_valid_group_post_url(href):
                        if href not in urls:
                            urls.append(href)
                            print(f"Found: {href}")
                    if len(urls) >= max_results:
                        break
                time.sleep(random.uniform(8, 12))
        finally:
            driver.quit()
        return urls[:max_results]

    def _is_valid_group_post_url(self, url: str) -> bool:
        """Check if URL is a Facebook group post."""
        if not url or "facebook.com/groups/" not in url:
            return False
        # Must be a direct post URL
        patterns = [
            r'/groups/\d+/posts/\d+',
            r'/groups/[^/]+/posts/\d+'
        ]
        return any(re.search(pattern, url) for pattern in patterns)

    def scrape_group_post(self, url: str) -> Dict:
        """
        Scrape a single group post from direct URL.
        Uses same TLS browser emulation as page scraper.
        """
        session = requests.Session()
        if self.use_proxy and 'PROXY_URL' in globals():
            session.proxies = {"http": PROXY_URL, "https": PROXY_URL}
        try:
            response = session.get(
                url,
                impersonate="chrome",
                timeout=30
            )
            html = response.text
            if "login" in response.url.lower():
                return {"url": url, "error": "login_wall"}
            data = {"url": url}
            # Group post ID
            post_match = re.search(r'/posts/(\d+)', url)
            if post_match:
                data["post_id"] = post_match.group(1)
            # Author
            author_patterns = [
                r'"actors":\[{"__typename":"User","name":"([^"]+)"',
                r'"author":{"__typename":"User","name":"([^"]+)"',
                r'"owner":{"name":"([^"]+)"'
            ]
            for pattern in author_patterns:
                match = re.search(pattern, html)
                if match:
                    data["author"] = html_lib.unescape(match.group(1))
                    break
            # Post text
            text_patterns = [
                r'<meta property="og:description" content="([^"]*)"',
                r'"message":{"text":"([^"]*)"',
                r'"story":{.*?"message":{"text":"([^"]*)"'
            ]
            for pattern in text_patterns:
                match = re.search(pattern, html)
                if match:
                    text = html_lib.unescape(match.group(1))
                    data["text"] = text.replace('\\n', '\n')
                    break
            # Group name
            group_patterns = [
                r'<title>([^<]+) \| Facebook</title>',
                r'"community":{"name":"([^"]+)"'
            ]
            for pattern in group_patterns:
                match = re.search(pattern, html)
                if match:
                    data["group_name"] = html_lib.unescape(match.group(1))
                    break
            # Timestamp
            time_match = re.search(r'"publish_time":(\d+)', html)
            if time_match:
                data["timestamp"] = int(time_match.group(1))
            # Engagement
            reaction_match = re.search(r'"reaction_count":{"count":(\d+)', html)
            comment_match = re.search(r'"total_count":(\d+)', html)
            data["engagement"] = {
                "reactions": int(reaction_match.group(1)) if reaction_match else 0,
                "comments": int(comment_match.group(1)) if comment_match else 0
            }
            return data
        except Exception as e:
            return {"url": url, "error": str(e)}

    def scrape_group(
        self,
        group_url: str,
        post_limit: int = 10
    ) -> List[Dict]:
        """
        Main workflow for scraping group posts.

        Args:
            group_url: Facebook group URL or identifier
            post_limit: Number of posts to scrape

        Returns:
            List of scraped posts
        """
        # Extract group name from URL
        group_name = group_url.rstrip("/").split("/")[-1]
        print(f"Discovering posts for group: {group_name}")
        post_urls = self.discover_group_posts(group_name, post_limit)
        print(f"\nFound {len(post_urls)} group posts")
        results = []
        for i, url in enumerate(post_urls, 1):
            print(f"[{i}/{len(post_urls)}] Scraping: {url}")
            result = self.scrape_group_post(url)
            if "error" not in result:
                results.append(result)
                print(f"  ✓ {result.get('author')} in {result.get('group_name')}")
            time.sleep(random.uniform(2, 4))
        return results


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(
        description="Scrape public Facebook group posts via Google discovery"
    )
    parser.add_argument(
        "--url",
        required=True,
        help="Facebook group URL (e.g., https://www.facebook.com/groups/dogspotting)"
    )
    parser.add_argument(
        "--posts",
        type=int,
        default=10,
        help="Number of posts to scrape"
    )
    parser.add_argument(
        "--proxy",
        action="store_true",
        help="Use Live Proxies"
    )
    parser.add_argument(
        "--output",
        help="Output filename"
    )
    args = parser.parse_args()
    scraper = FacebookGroupScraper(use_proxy=args.proxy)
    results = scraper.scrape_group(args.url, args.posts)
    output_file = args.output or f"group_posts_{args.posts}.json"
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)
    print(f"\nDone! Scraped {len(results)} posts")
    print(f"Saved to: {output_file}")
Running it
Run the script with your target group:
python facebook_group_scraper.py --url "https://www.facebook.com/groups/dogspotting" --posts 10
Example response
{
  "url": "https://www.facebook.com/groups/dogspotting/posts/10164914842549467/",
  "post_id": "10164914842549467",
  "author": "Sarah Johnson",
  "text": "Spotted this good boy at Central Park today! Anyone know what breed mix he might be?",
  "group_name": "Dogspotting Society",
  "timestamp": 1735123456,
  "engagement": {
    "reactions": 234,
    "comments": 67
  }
}
How this bypasses Facebook's restrictions
Google's index is the key here. This approach only accesses public URLs that Google already indexed, never the Facebook feed itself.
What doesn't work
Google's index takes time to update, so this method retrieves historical data more reliably than real-time monitoring (posts from the last 10 minutes won't appear). Only public groups work; search engines can't see private group content.
Google rate-limits your searches. The script handles basic detection, but high-volume scraping needs rotating proxies or a SERP API. The regex patterns may need occasional updates if Facebook changes its variable names (like reaction_count to feedback_count). This doesn't happen often, but it does happen.
How to scrape Facebook Marketplace?
Marketplace displays approximately 20-30 listings before showing a login wall when scrolling while logged out.

The approach: API reverse engineering
Marketplace is a Single Page Application (SPA). When you scroll, it fetches data through https://www.facebook.com/api/graphql/.
The implementation uses API reverse engineering, observing the GraphQL requests the browser makes and replicating them directly in code. By copying the request format from your browser's Network tab, you can fetch listings programmatically without hitting the scroll-based login wall.
Implementation
Here's what it does: it searches a specific Marketplace category within a location radius.
How it works:
- Takes a category URL (like facebook.com/marketplace/category/electronics)
- Extracts the category slug (electronics)
- Sends GraphQL queries with that category + your lat/long coordinates
- Returns listings within a 65km radius
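The slug extraction in the second step is plain URL parsing. A minimal sketch (the function name is illustrative):

```python
from urllib.parse import urlparse

def extract_category_slug(category_url: str) -> str:
    """Pull the category slug from a Marketplace category URL."""
    path = urlparse(category_url).path.rstrip("/")
    # e.g. /marketplace/category/electronics -> electronics
    return path.split("/")[-1]

print(extract_category_slug("https://www.facebook.com/marketplace/category/electronics"))  # electronics
```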
Important: Facebook rotates API tokens and query IDs (doc_id) frequently. Before running the script, copy fresh values from your browser:
- Open Facebook Marketplace in a logged-out browser
- Open DevTools → Network tab
- Scroll through some listings and filter the Network tab for graphql
- Click on a GraphQL request and copy values from the Payload/Headers tabs
- Update COOKIES, HEADERS, BASE_DATA, and DOC_ID in the code
These values are browser fingerprints and session tokens used for public data access; no login is required.
from curl_cffi import requests
import json
import time
import random
import argparse
from urllib.parse import urlparse

# Optional: Uncomment to use Live Proxies
# PROXY_URL = "http://username:[email protected]:7383"

# IMPORTANT: Update these values from your browser's DevTools
# Open Facebook Marketplace, inspect a GraphQL request, and copy fresh values
COOKIES = {
    # These are examples. Replace with values from a real Marketplace session.
    'wd': '2560x1271',
    'datr': 'YOUR_DATR_TOKEN',
    'sb': 'YOUR_SB_TOKEN',
    'ps_l': '1',
    'ps_n': '1',
    'fr': 'YOUR_FR_TOKEN',
    'xs': '0%3Aabc123',
    'c_user': '0',  # Logged-out sessions may omit this
}

HEADERS = {
    'accept': '*/*',
    'accept-language': 'en-US,en;q=0.9',
    'content-type': 'application/x-www-form-urlencoded',
    'origin': 'https://www.facebook.com',
    'referer': 'https://www.facebook.com/marketplace/',
    'sec-ch-prefers-color-scheme': 'light',
    'sec-ch-ua': '"Google Chrome";v="131", "Chromium";v="131", "Not_A Brand";v="24"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-model': '""',
    'sec-ch-ua-platform': '"Windows"',
    'sec-ch-ua-platform-version': '"10.0.0"',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'x-asbd-id': '129477',
    'x-fb-friendly-name': 'CometMarketplaceCategoryContentPaginationQuery',
    'x-fb-lsd': 'YOUR_LSD_TOKEN',
}

# Base request data from DevTools
BASE_DATA = {
    'av': '0',
    '__user': '0',
    '__a': '1',
    '__req': '1',
    '__hs': 'YOUR_HS_VALUE',
    'dpr': '1',
    '__ccg': 'GOOD',
    '__rev': 'YOUR_REV',
    '__s': 'YOUR_S_VALUE',
    '__hsi': 'YOUR_HSI',
    '__dyn': 'YOUR_DYN',
    '__csr': 'YOUR_CSR',
    'fb_dtsg': 'NA',
    'jazoest': '22000',
    'lsd': 'YOUR_LSD_TOKEN',
    '__spin_r': 'YOUR_SPIN_R',
    '__spin_b': 'trunk',
    '__spin_t': str(int(time.time())),
    'fb_api_caller_class': 'RelayModern',
    'fb_api_req_friendly_name': 'CometMarketplaceCategoryContentPaginationQuery',
    'server_timestamps': 'true',
}

# GraphQL document ID. Update this from DevTools.
DOC_ID = '1234567890123456'


def marketplace_search(
    category_url: str = None,
    query: str = None,
    lat: float = 37.7749,
    long: float = -122.4194,
    radius_km: int = 65,
    limit: int = 50,
    min_price: int = None,
    max_price: int = None,
    condition: str = None,
    use_proxy: bool = False
):
    """
    Search Facebook Marketplace by category or keyword.

    Args:
        category_url: Marketplace category URL (e.g., electronics)
        query: Search keyword (optional, overrides category browsing)
        lat: Latitude for location-based results
        long: Longitude for location-based results
        radius_km: Search radius in kilometers
        limit: Maximum listings to retrieve
        min_price: Minimum price filter
        max_price: Maximum price filter
        condition: Item condition (e.g., "new", "used_good")
        use_proxy: Whether to use Live Proxies

    Returns:
        List of listing data
    """
    # Extract category slug from URL if provided
    category_slug = None
    if category_url:
        parsed = urlparse(category_url)
        path_parts = parsed.path.strip('/').split('/')
        if 'category' in path_parts:
            category_idx = path_parts.index('category')
            if category_idx + 1 < len(path_parts):
                category_slug = path_parts[category_idx + 1]

    session = requests.Session()
    if use_proxy and 'PROXY_URL' in globals():
        session.proxies = {"http": PROXY_URL, "https": PROXY_URL}

    listings = []
    cursor = None

    while len(listings) < limit:
        try:
            # Build GraphQL variables
            variables = {
                "buyLocation": {
                    "latitude": lat,
                    "longitude": long
                },
                "categoryIDArray": [],
                "count": min(24, limit - len(listings)),
                "cursor": cursor,
                "deliveryMethod": "NO_SHIPPING",
                "filters": {
                    "sort_by": "BEST_MATCH"
                },
                "latitude": lat,
                "longitude": long,
                "query": query or "",
                "radius": radius_km * 1000,  # Convert km to meters
                "scale": 1
            }

            # Add category if browsing by category
            if category_slug:
                variables["category"] = category_slug
                variables["categoryIDArray"] = [category_slug]

            # Add filters
            if min_price is not None:
                variables["filters"]["minPrice"] = min_price
            if max_price is not None:
                variables["filters"]["maxPrice"] = max_price
            if condition:
                variables["filters"]["condition"] = condition

            # Build request data
            data = BASE_DATA.copy()
            data['variables'] = json.dumps(variables)
            data['doc_id'] = DOC_ID
            data['__spin_t'] = str(int(time.time()))

            # Make GraphQL request
            response = session.post(
                'https://www.facebook.com/api/graphql/',
                cookies=COOKIES,
                headers=HEADERS,
                data=data,
                impersonate="chrome",
                timeout=30
            )

            # Parse response
            text = response.text
            json_start = text.find('{')
            if json_start == -1:
                print("No JSON found in response")
                break
            result = json.loads(text[json_start:])

            # Navigate to marketplace listings
            # GraphQL response structure can vary; inspect with DevTools
            data_node = result.get('data', {})

            # Common response patterns
            edges = []
            if 'viewer' in data_node:
                # Pattern 1: viewer -> marketplace_feed
                edges = (
                    data_node.get('viewer', {})
                    .get('marketplace_feed', {})
                    .get('feed_units', {})
                    .get('edges', [])
                )
            elif 'marketplace_search' in data_node:
                # Pattern 2: direct marketplace_search
                edges = (
                    data_node.get('marketplace_search', {})
                    .get('feed_units', {})
                    .get('edges', [])
                )

            if not edges:
                print("No listings found or response structure changed")
                break

            # Extract listing data
            for edge in edges:
                node = edge.get('node', {})

                # Extract common fields
                listing = {
                    'id': node.get('id'),
                    'title': node.get('marketplace_listing_title') or node.get('title'),
                    'price': _extract_price(node),
                    'location': _extract_location(node),
                    'image_url': _extract_image(node),
                    'listing_url': _extract_url(node),
                    'condition': node.get('condition'),
                    'seller_name': _extract_seller(node)
                }

                # Only add complete listings
                if listing['title'] and listing['id']:
                    listings.append(listing)
                    print(f"[{len(listings)}/{limit}] {listing['title']} - {listing['price']}")

                if len(listings) >= limit:
                    break

            # Pagination
            page_info = {}
            if 'viewer' in data_node:
                page_info = (
                    data_node.get('viewer', {})
                    .get('marketplace_feed', {})
                    .get('feed_units', {})
                    .get('page_info', {})
                )
            elif 'marketplace_search' in data_node:
                page_info = (
                    data_node.get('marketplace_search', {})
                    .get('feed_units', {})
                    .get('page_info', {})
                )

            cursor = page_info.get('end_cursor')
            if not page_info.get('has_next_page') or not cursor:
                print("No more pages")
                break

            # Delay between requests to avoid rate limits
            time.sleep(random.uniform(2, 4))

        except Exception as e:
            print(f"Error: {e}")
            break

    return listings[:limit]


def _extract_price(node):
    """Extract price from listing node."""
    # Try multiple field patterns; "or {}" guards against null JSON values
    price_patterns = [
        (node.get('listing_price') or {}).get('formatted_amount'),
        node.get('price_text'),
        node.get('formatted_price')
    ]
    for price in price_patterns:
        if price:
            return price
    return "N/A"


def _extract_location(node):
    """Extract location string."""
    location_patterns = [
        (node.get('location_text') or {}).get('text'),
        (((node.get('location') or {}).get('reverse_geocode') or {}).get('city_page') or {}).get('display_name'),
        ((node.get('marketplace_listing_location') or {}).get('location') or {}).get('city')
    ]
    for location in location_patterns:
        if location:
            return location
    return "Unknown"


def _extract_image(node):
    """Extract primary image URL."""
    photos = node.get('listing_photos') or [{}]  # guard against empty photo lists
    image_patterns = [
        ((node.get('primary_listing_photo') or {}).get('image') or {}).get('uri'),
        (photos[0].get('image') or {}).get('uri'),
        (node.get('image') or {}).get('uri')
    ]
    for image in image_patterns:
        if image:
            return image
    return None


def _extract_url(node):
    """Extract listing URL."""
    if node.get('id'):
        return f"https://www.facebook.com/marketplace/item/{node['id']}/"
    if node.get('story_key'):
        return f"https://www.facebook.com/marketplace/item/{node['story_key']}/"
    return None


def _extract_seller(node):
    """Extract seller name."""
    seller_patterns = [
        (node.get('marketplace_listing_seller') or {}).get('name'),
        (node.get('seller') or {}).get('name')
    ]
    for seller in seller_patterns:
        if seller:
            return seller
    return "Unknown"


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Search Facebook Marketplace by category or keyword"
    )
    parser.add_argument(
        "--url",
        help="Marketplace category URL (e.g., https://www.facebook.com/marketplace/category/electronics)"
    )
    parser.add_argument(
        "--query",
        help="Search keyword (e.g., 'iPhone')"
    )
    parser.add_argument(
        "--lat",
        type=float,
        default=37.7749,
        help="Latitude (default: San Francisco)"
    )
    parser.add_argument(
        "--long",
        type=float,
        default=-122.4194,
        help="Longitude (default: San Francisco)"
    )
    parser.add_argument(
        "--radius",
        type=int,
        default=65,
        help="Search radius in kilometers"
    )
    parser.add_argument(
        "--min-price",
        type=int,
        help="Minimum price"
    )
    parser.add_argument(
        "--max-price",
        type=int,
        help="Maximum price"
    )
    parser.add_argument(
        "--condition",
        choices=["new", "used_good", "used_like_new", "used_fair"],
        help="Item condition"
    )
    parser.add_argument(
        "--limit",
        type=int,
        default=50,
        help="Maximum listings to scrape"
    )
    parser.add_argument(
        "--proxy",
        action="store_true",
        help="Use Live Proxies"
    )
    parser.add_argument(
        "--output",
        default="marketplace_results.json",
        help="Output JSON filename"
    )
    args = parser.parse_args()

    # Require either URL or query
    if not args.url and not args.query:
        parser.error("Either --url or --query is required")

    results = marketplace_search(
        category_url=args.url,
        query=args.query,
        lat=args.lat,
        long=args.long,
        radius_km=args.radius,
        limit=args.limit,
        min_price=args.min_price,
        max_price=args.max_price,
        condition=args.condition,
        use_proxy=args.proxy
    )

    with open(args.output, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)

    print(f"\nDone! Found {len(results)} listings")
    print(f"Saved to: {args.output}")
How to use it
You have two search modes:
Category mode – Browse a specific category:
python facebook_marketplace_scraper.py --url "https://www.facebook.com/marketplace/category/electronics" --limit 50
Keyword mode – Search for specific items:
python facebook_marketplace_scraper.py --query "iPhone" --limit 50
Advanced filtering:
# Search for furniture in New York within 10km, priced $100-$500, in used condition
python facebook_marketplace_scraper.py \
--query "furniture" \
--lat 40.7128 --long -74.0060 \
--radius 10 \
--min-price 100 --max-price 500 \
--condition used_good \
--limit 20
Available arguments:
- --url – Category URL, for example .../category/electronics
- --query – Search keyword, for example iPhone or Canon
- --lat / --long – Search center coordinates. Default: San Francisco
- --radius – Search radius in kilometers. Default: 65
- --min-price / --max-price – Price range in dollars
- --condition – Item condition: new, used_good, used_like_new, or used_fair
- --limit – Maximum items to scrape. Default: 50
- --output – Output filename. Default: marketplace_results.json
You must provide either --url or --query (or both for filtered category browsing).
Sample output
Facebook returns location as a city string (like "New York, NY") instead of exact coordinates. The output looks like this:
[
{
"id": "743141558781810",
"title": "Sony A7III Mirrorless Camera Body",
"price": "$1,200",
"location": "San Francisco, CA",
"image_url": "https://scontent.xx.fbcdn.net/...",
"listing_url": "https://www.facebook.com/marketplace/item/743141558781810/",
"condition": "used_good",
"seller_name": "John Smith"
}
]
Limitations
Keyword search is less reliable than category browsing. Facebook's search algorithm may return semantically similar items instead of exact matches.
Session data expires quickly. Cookies (c_user, xs) are invalidated if your IP changes, and the query IDs (doc_id) rotate when Facebook updates the Marketplace frontend. Re-extract fresh values from DevTools if you start getting empty responses.
The GraphQL endpoint is sensitive to rate limits. Use the built-in delays and consider rotating proxies for larger crawls.
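One way to tell an expired session apart from a genuinely empty result is to inspect the response body before burning more requests. A rough sketch; the error shapes checked here are assumptions, so compare them against real responses in DevTools:

```python
import json

def classify_response(text):
    """Roughly classify a GraphQL response body: ok, expired, or empty."""
    start = text.find('{')
    if start == -1:
        return "expired"  # HTML shell instead of JSON
    try:
        payload = json.loads(text[start:])
    except json.JSONDecodeError:
        return "expired"
    if payload.get("errors"):  # e.g., a stale doc_id or lsd token
        return "expired"
    if not payload.get("data"):
        return "empty"
    return "ok"

print(classify_response('{"data": {"marketplace_search": {}}}'))  # ok
```

On "expired", stop and re-extract COOKIES, HEADERS, and DOC_ID; on "empty", a slower retry or a different category may be enough.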
How to avoid blocks in Facebook scraping?
The scripts above work for small tests. For larger volumes, IP rotation and request timing become essential.
Use residential proxies
Cloud server IPs (AWS, GCP, Azure) don't work because they're datacenter addresses. Facebook flags them immediately.
When choosing a residential proxy provider, look for features built specifically for scraping. Live Proxies offers residential and mobile IPs with the following:
- Private allocation. Provides a dedicated IP pool exclusively for your account. Unlike shared pools, where other users can burn IPs with aggressive scraping, your allocated IPs aren't being used on the same targets by another customer. You control the IP reputation. See pricing options.
- Sticky sessions. Lock an IP for up to 24 hours by adding a session ID to your proxy username (e.g., sid:1). Use this when a task needs a consistent identity, such as a Google search session or a sequence of Marketplace pagination requests.
- Rotating sessions. Assigns a fresh IP for every request. Use this for high-volume discovery tasks like searching Marketplace listings across multiple cities or fetching hundreds of independent post URLs.
Live Proxies also offers rotating mobile proxies, where IPs come from mobile carriers like T-Mobile and Verizon. These work for stricter targets, but residential proxies are usually sufficient for Facebook's public surfaces.
Add proxies to the scripts
Get your proxy credentials from the Live Proxies dashboard.
For B2C users, the format is:
IP:PORT:USERNAME-ACCESS_CODE-SID:PASSWORD
For B2B users, the format is:
b2b.liveproxies.io:7383:username-access_code-sid:password
Uncomment the PROXY_URL line at the top of any script:
# PROXY CONFIGURATION
PROXY_URL = "http://username:[email protected]:7383"
The scripts automatically route all traffic through your residential proxy pool.
Request timing strategy
Space out requests even with proxies. Facebook tracks request frequency per IP.
For discovery (Google search or Marketplace browsing):
- Use 5–10 second delays between requests
- Use sticky sessions (one IP per browsing session)
- Randomize delays: time.sleep(random.uniform(5, 10))
For extraction (fetching individual post URLs):
- Use 1–2 second delays between requests
- Rotate IPs more aggressively, since requests are independent
- These delays are already built into the scripts
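Both timing profiles can live in one helper so each script simply names its phase. A minimal sketch:

```python
import random
import time

DELAYS = {
    "discovery": (5, 10),   # Google search / Marketplace browsing
    "extraction": (1, 2),   # independent post URL fetches
}

def polite_sleep(phase):
    """Sleep a randomized interval appropriate for the scraping phase."""
    low, high = DELAYS[phase]
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay

# Example: call between Marketplace pagination requests
# polite_sleep("discovery")
```

Randomized intervals matter as much as the averages: fixed delays produce a metronome-like request pattern that is itself a bot signal.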
Why curl_cffi matters with residential proxies
Residential IPs only work if your connection appears legitimate. Standard Python requests still fail because Facebook checks the TLS handshake before looking at the IP reputation.
The scripts use curl_cffi with impersonate="chrome" throughout. This replicates a real Chrome TLS fingerprint and combines with the residential IP to match both connection-layer and network-layer expectations.
Detecting blocks
Facebook blocks show up as:
- Soft blocks – API returns empty data or {"data": null}
- CAPTCHA – Browser displays an "unusual traffic" warning
- HTTP errors – 429 (rate limit) or 403 (forbidden)
If you hit a soft block, wait 10–15 minutes and rotate to a different proxy session. If you hit repeated CAPTCHAs, reduce your rate further.
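These three block types can be detected programmatically, so a scraper knows whether to back off or rotate. A heuristic sketch; the marker strings are assumptions to adjust against real blocked responses:

```python
def detect_block(status_code, body):
    """Map a response to a block type, or None if it looks healthy."""
    if status_code == 429:
        return "rate_limit"
    if status_code == 403:
        return "forbidden"
    if "unusual traffic" in body.lower():
        return "captcha"
    if '"data":null' in body.replace(" ", "") or not body.strip():
        return "soft_block"
    return None

print(detect_block(200, '{"data": null}'))  # soft_block
```

A reasonable policy: on soft_block, pause 10–15 minutes and switch proxy sessions; on rate_limit, double your delays; on captcha, reduce the overall request rate before resuming.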
For more on handling IP bans, see your IP has been banned.
How to model, clean, and store Facebook scraping data?
The scripts save raw JSON with display-formatted strings. Before analysis, clean the data:
- Normalize values. Strip currency symbols from prices ("$1,200" → 1200.0), convert Unix timestamps to ISO 8601 format (1767231091 → 2026-01-01T01:31:31Z), and remove null fields to reduce file size.
- Use JSON Lines. Save data as .jsonl (one JSON object per line) instead of standard JSON arrays. If your scraper crashes, you only lose the last partial line, not the entire file.
- Handle duplicates. Use post_id or id as your unique key. When saving to a database, use upserts so posts re-scraped later overwrite older records with fresh comment counts.
- For storage, document databases such as MongoDB (or PostgreSQL with JSONB) handle nested data structures natively, while relational schemas work better after normalizing comments into separate tables.
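The cleaning steps above can be sketched in a few lines. Field names follow the Marketplace scraper's output, and the timestamp handling assumes Unix seconds:

```python
import json
import re
from datetime import datetime, timezone

def clean_listing(raw):
    """Normalize one scraped listing for analysis."""
    cleaned = {k: v for k, v in raw.items() if v is not None}  # drop null fields
    if "price" in cleaned:
        digits = re.sub(r"[^\d.]", "", cleaned["price"])  # "$1,200" -> "1200"
        cleaned["price"] = float(digits) if digits else None
    if "timestamp" in cleaned:  # Unix seconds -> ISO 8601 (UTC)
        cleaned["timestamp"] = datetime.fromtimestamp(
            cleaned["timestamp"], tz=timezone.utc
        ).strftime("%Y-%m-%dT%H:%M:%SZ")
    return cleaned

def append_jsonl(path, record):
    """Append one record per line; a crash loses at most the last line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

print(clean_listing({"price": "$1,200", "seller_name": None, "id": "743"}))
# {'price': 1200.0, 'id': '743'}
```

For upserts, use the cleaned record's id as the key, for example MongoDB's update_one(..., upsert=True) or PostgreSQL's INSERT ... ON CONFLICT.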
How to scale Facebook scraping safely?
Running one script works for small projects. Scaling to thousands of posts requires distributing the load.
Run multiple scripts with different proxy sessions
Don't run parallel threads from one script; Facebook detects the burst of requests from a single process, even behind a proxy. Instead, run multiple independent scripts from separate terminals or servers.
With Live Proxies, assign different session IDs:
- Script 1: username-access_code-sid:1:password
- Script 2: username-access_code-sid:2:password
- Script 3: username-access_code-sid:3:password
Each script receives its own IP. If one is blocked, the others continue running.
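A small launcher sketch makes the per-script session IDs concrete. Usernames, passwords, and the host are placeholders; the start offsets implement the 30–60 second stagger this section recommends:

```python
import random

def worker_configs(n, base_user="username-access_code", password="password",
                   host="b2b.liveproxies.io:7383"):
    """Build a proxy URL and a staggered start offset for each worker script."""
    configs = []
    for sid in range(1, n + 1):
        configs.append({
            # One sticky session per script means one IP per script
            "proxy_url": f"http://{base_user}-sid:{sid}:{password}@{host}",
            # Stagger starts so the scripts' request patterns don't align
            "start_offset": (sid - 1) * random.uniform(30, 60),
        })
    return configs

for cfg in worker_configs(3):
    print(cfg["proxy_url"])
```

Each config then seeds one independent process (separate terminal, container, or server), never parallel threads inside one script.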
Split discovery from extraction
Finding URLs (Google search) is slow. Extracting data (fetching posts) is fast. Don't do both in one workflow at scale.
Run two separate processes:
- Discovery script finds URLs, writes them to a text file
- Extraction scripts read from that file and scrape the posts
If the extraction hits rate limits, you can pause it without losing your URL list.
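A plain text file is enough to decouple the two phases. This sketch (the filenames are arbitrary) lets extraction resume where it left off without re-running discovery:

```python
URL_FILE = "discovered_urls.txt"   # written by the discovery script
DONE_FILE = "scraped_urls.txt"     # written by the extraction script

def save_urls(urls):
    """Discovery phase: append newly found post URLs to the queue."""
    with open(URL_FILE, "a", encoding="utf-8") as f:
        for url in urls:
            f.write(url + "\n")

def pending_urls():
    """Extraction phase: everything in the queue not yet scraped."""
    try:
        with open(DONE_FILE, encoding="utf-8") as f:
            done = {line.strip() for line in f}
    except FileNotFoundError:
        done = set()
    with open(URL_FILE, encoding="utf-8") as f:
        return [u.strip() for u in f if u.strip() and u.strip() not in done]
```

After each successful scrape, the extraction script appends the URL to DONE_FILE, so a pause or crash costs nothing but the in-flight request.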
Space out requests across scripts
Don't have all scripts start at the same time. Stagger them by 30–60 seconds. Add random delays so their request patterns don't align.
Watch for error rates
If you start seeing lots of HTTP 403 or 429 errors, slow down. Add longer delays or reduce the number of concurrent scripts.
For enterprise-scale requirements, see Live Proxies B2B & Enterprise solutions.
What are the best Facebook scraping use cases?
Here are some real-world applications:
Marketplace arbitrage
Facebook Marketplace is a goldmine for mispriced assets, particularly in categories where seller urgency outpaces buyer discovery. The Marketplace scraper can identify listings priced below regional medians, for example, a $1,200 camera listed for $800 in a low-demand area. Resellers use this to map price disparities across cities and find profitable flips.
Brand sentiment tracking
Public comments on brand pages provide an unfiltered, real-time barometer of customer satisfaction. So, use the Post Scraper to ingest comments from a competitor’s page. Instead of just counting reactions, build a Weighted Sentiment Score: track the ratio of "Angry/Sad" vs. "Love/Care". A sudden spike in negative engagement often predicts product defects or PR crises days before they appear in official reports.
Market research in groups
Public groups are where users discuss problems and recommend products. So, use the Group Discovery Script (the Google-based scraper above) to identify posts in groups related to your niche, for example, dog owners discussing food brands or photographers comparing camera gear. Extracting the post text reveals the language people use to describe their needs, which is useful for product positioning or keyword research.
For more scraping guides, check out our tutorials on social media scraping, Instagram scraping, and X/Twitter scraping.
Further reading: How Proxies Help You Scale AI Web Scraping and Data Collection and How to Scrape Walmart: Data, Prices, Products, and Reviews (2025).
Conclusion
Facebook scraping requires combining TLS browser emulation, API reverse engineering, and residential proxies. Direct URL scraping works for single posts. Google discovery solves the feed login wall for pages and groups. Marketplace requires GraphQL replication because scrolling hits a login barrier after 20-30 listings.
Start with the 10-line example to verify your setup works, then scale to the full implementations as needed.
Next steps: Try the code on small datasets first, monitor your success rates, and adjust delays if you hit rate limits. For production scraping at volume, residential proxies with private allocation are essential.
FAQs
Can I scrape private or closed Facebook groups?
No. Private groups require login, which violates Facebook's ToS and results in account bans or legal risk. The Group Discovery Script only works on public groups.
Can I scrape emails from Facebook groups?
No. Facebook doesn't expose user emails in public HTML. Any method claiming to extract emails is either fake or requires account access, which violates the rules described above.
How often should I re-scrape pages?
It depends on page activity. For high-traffic news pages, every 4–6 hours works with residential proxies. For groups or Marketplace categories, once or twice daily is usually enough. Watch for rate limits and back off if you see soft blocks.
Why do some pages suddenly return empty data?
This is a soft block. Facebook serves an empty HTML shell because it flagged your IP or TLS fingerprint. Rotate your proxy session and verify curl_cffi is still using impersonate="chrome".
Do proxies actually help with Facebook scraping?
Yes, but only residential proxies. Facebook recognizes datacenter IPs (AWS, DigitalOcean, etc.) immediately. Residential proxies from providers like Live Proxies assign real ISP IPs, which match Facebook's trust expectations for consumer traffic.
What's the safest way to store scraped data?
Use JSON Lines (.jsonl) format to prevent file corruption if your scraper crashes. Legally, store only the fields you need and avoid retaining personal identifiers unless you have a clear compliance basis.