Live Proxies

How to Scrape Facebook: Comments, Posts, Groups, Marketplace, and Scraping Tools in 2026 (Guide)

Scrape Facebook posts, comments, groups, and Marketplace more reliably in 2026 with practical methods, code examples, and tools for safer large-scale data extraction.


Live Proxies Editorial Team

Content Manager


17 March 2026

Facebook holds billions of posts, comments, and Marketplace listings: data that can reveal market trends, sentiment shifts, and pricing opportunities. But extracting it isn't straightforward. Facebook blocks standard HTTP requests, serves degraded HTML to bots, and throws up login walls when you try to scroll through feeds.

This guide walks through scraping Facebook at scale: posts, comments, groups, and Marketplace listings, using Python, TLS browser emulation, and API reverse engineering. You'll get working code and learn which Facebook scraping tools bypass detection systems and handle anti-bot measures.

How to scrape Facebook in 10 lines?

Let’s start with a direct task: scraping a specific post URL. This requires an HTTP client that handles TLS handshake emulation, not full browser automation.

The code

The script fetches the HTML and parses metadata with regex:

import re
from curl_cffi import requests

# Fetch HTML and extract data using regex
html = requests.get(
    "https://www.facebook.com/zuck/posts/two-decades-many-awesome-projects-and-even-more-plaid-shirts-grateful-for-20-yea/10117222347025301/",
    impersonate="chrome",
).text
author = re.search(r'"actors":\[{"__typename":"User","name":"([^"]+)"', html).group(1)
text = re.search(r'<meta property="og:description" content="([^"]*)"', html).group(1)
reactions = re.search(r'"reaction_count":{"count":(\d+)', html).group(1)
comments = re.search(r'"total_count":(\d+)', html).group(1)
shares = re.search(r'"share_count":{"count":(\d+)', html).group(1)

print(
    f"Author:    {author}\nPost Text: {text}\nReactions: {reactions}\nComments:  {comments}\nShares:    {shares}"
)

Output

Author:    Mark Zuckerberg
Post Text: Two decades, many awesome projects, and even more plaid shirts. Grateful for 20 years of building the future together!
Reactions: 190863
Comments:  51086
Shares:    3704

Why this works

Standard Python requests fail because Facebook analyzes the TLS fingerprint during the SSL handshake. The platform checks the JA3 signature and serves a degraded HTML shell to any handshake that doesn't match a known browser version, and this happens before cookies are even checked, so you can't fake your way past it with session tokens alone.

curl_cffi solves this by performing the emulation at the C level, replicating Chrome's TLS handshake to pass the bot-detection layer. The impersonate="chrome" parameter configures the library to match Chrome's exact TLS fingerprint.

The limitation

This approach works for direct URL access (also known as permalinks). It does not support discovery.

This 10-line approach works for testing and small-scale extraction. Scraping at volume requires IP rotation and proxy management to avoid rate limits. It’s covered in the "How to avoid blocks" section below.
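The rotation itself can be sketched in a few lines. Below is a minimal round-robin sketch, assuming a pool of proxy endpoints; the URLs and credentials are placeholders, not real endpoints:

```python
import itertools

# Placeholder proxy endpoints -- substitute your own pool
PROXY_POOL = [
    "http://user:[email protected]:7383",
    "http://user:[email protected]:7383",
    "http://user:[email protected]:7383",
]

# Round-robin iterator: each request gets the next proxy in the pool
_rotation = itertools.cycle(PROXY_POOL)

def next_proxies() -> dict:
    """Return a proxies dict for the next HTTP request."""
    proxy = next(_rotation)
    return {"http": proxy, "https": proxy}

# Usage with curl_cffi (network call left commented out):
# from curl_cffi import requests
# html = requests.get(url, impersonate="chrome", proxies=next_proxies()).text

print(next_proxies()["https"])  # first proxy in the pool
```

Round-robin is the simplest policy; production scrapers often also retire proxies that return blocks and add per-proxy cooldowns.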

When scrolling through a profile page or group feed, Facebook displays a login wall after a few posts. On Marketplace category or search pages, scrolling extends to approximately 20-30 listings before the login prompt appears. API reverse engineering bypasses these scroll limitations by replicating the backend requests directly.

However, direct URLs to individual posts and group posts work without authentication and provide access to full content, including post text and comments.

For posts and groups, the safe approach is to use SERP discovery (Google Search) to find direct URLs, then use this extraction method to parse them. This decouples the architecture and isolates the scraping infrastructure.

Is Facebook scraping legal and safe?

Scraping publicly accessible Facebook content while logged out is generally lower risk, as Meta’s Terms primarily govern logged-in usage. However, Meta still enforces technical and behavioral restrictions, and IPs exhibiting automated patterns may be blocked regardless of login state.

Stay logged out. If content requires a login, don't scrape it. Scraping behind authentication violates their ToS.

Isolate your scraping infrastructure. Don't scrape from IPs connected to your personal Facebook account. Meta tracks request patterns and links them to profiles. Run your scraper through residential proxies on separate servers.

Minimize personal data storage. Storing names or IDs without consent creates GDPR/CCPA liability. For sentiment analysis, retain the comment text but exclude usernames and profile links.
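As a sketch of that retention policy, a small filter can strip identifying fields before anything hits disk. The field names below match the comment dicts produced by the scraper later in this guide; adjust them to your own schema:

```python
def anonymize_comment(comment: dict) -> dict:
    """Keep only the fields needed for sentiment analysis,
    dropping usernames, IDs, and profile links."""
    return {
        "text": comment.get("text", ""),
        "timestamp": comment.get("timestamp"),
        "reactions": comment.get("reactions", 0),
    }

raw = {
    "id": "Y29tbWVudDoxMjM0NTY=",
    "author": "User Name",
    "text": "Great initiative!",
    "timestamp": 1735158000,
    "reactions": 23,
}

clean = anonymize_comment(raw)
assert "author" not in clean and "id" not in clean
print(clean["text"])  # Great initiative!
```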

What Facebook data has real value?

Scraping full HTML bloats storage and slows processing. Focus on structured fields:

  • Reaction breakdowns. Total reaction counts are vanity metrics. The distribution matters. Comparing Like vs. Angry vs. Sad provides sentiment signals without running NLP on comment text.

  • Marketplace pricing + location. Raw prices are noisy, but pairing prices with geolocation reveals arbitrage opportunities. You can map regional price variances (iPhones cost less in rural areas than in cities) and spot underpriced items.

  • Engagement ratios. Look at comments vs. shares, not totals alone. High comments with low shares indicate controversy or arguments. High shares with low comments indicate endorsement. This reveals the virality pattern.
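These ratio heuristics are straightforward to encode. A minimal sketch; the 2x thresholds and label names are arbitrary choices for illustration, not part of any Facebook API:

```python
def sentiment_signal(breakdown: dict) -> float:
    """Negative-reaction share: (angry + sad) / total reactions.
    A crude but NLP-free sentiment proxy."""
    total = sum(breakdown.values())
    if total == 0:
        return 0.0
    return (breakdown.get("angry", 0) + breakdown.get("sad", 0)) / total

def virality_pattern(comments: int, shares: int) -> str:
    """Classify engagement shape: debate vs. endorsement."""
    if comments == 0 and shares == 0:
        return "flat"
    if comments > 2 * shares:
        return "controversy"   # people argue more than they share
    if shares > 2 * comments:
        return "endorsement"   # people share more than they argue
    return "mixed"

print(sentiment_signal({"like": 80, "angry": 15, "sad": 5}))  # 0.2
print(virality_pattern(comments=2184, shares=892))            # controversy
```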

Further reading: 8 Best Proxies for AI Tools and Scalable Data Collection in 2026 and How Live Proxies Help Prevent IP Bans in Large-Scale Web Scraping.

How to scrape Facebook posts and comments reliably?

You can't scrape profile timelines or group feeds directly, as Facebook blocks unauthenticated scrolling with login walls after a few posts. But direct URLs to specific posts provide access to full content, including comments.

Facebook splits post data into two layers: post content arrives in the initial HTML, while comments load asynchronously via GraphQL calls.


The approach: decouple discovery from extraction

The 10-line method works for known URLs. Finding those URLs requires a different strategy.

The workflow splits into 3 stages:

  • Discovery. Use Google search operators to find Facebook post URLs without hitting login walls. This bypasses Facebook's feed pagination limits.

  • Extraction. Use curl_cffi with regex to parse post metadata (timestamps, text, engagement metrics) from the discovered URLs.

  • Comment pagination. Query Facebook's GraphQL API directly to paginate through comments without rendering the full UI.


Setup and environment

This builds on the approach above, adding automated discovery to find URLs at scale.

Run this code in an isolated virtual environment. You need Python 3.10+ and Google Chrome.

Create the environment:

python -m venv fb_scraper
source fb_scraper/bin/activate  # Windows: fb_scraper\Scripts\activate

Install dependencies:

pip install seleniumbase curl-cffi
seleniumbase install chromedriver

SeleniumBase handles search automation, and curl_cffi handles TLS fingerprinting.

Implementation

Save the following code as facebook_scraper.py. This script uses FacebookURLCollector for Google-based discovery and scrape_comments_graphql for backend data retrieval.

import re
import time
import random
import json
import base64
import urllib.parse
from datetime import datetime, timedelta
from typing import List, Dict, Optional
from curl_cffi import requests
from seleniumbase import Driver

# Optional: uncomment to use Live Proxies
# PROXY_URL = "http://username:[email protected]:7383"

# GraphQL doc_id for comment pagination
GRAPHQL_COMMENT_DOC_ID = "9442468175864664"

class FacebookURLCollector:
    """
    Discovers Facebook post URLs using Google search.
    This bypasses Facebook's login wall on profile feeds.
    """
    
    def __init__(self, use_proxy: bool = False):
        self.use_proxy = use_proxy
        self.search_queries_used = 0
        
    def _get_driver(self):
        """Initialize browser for Google search."""
        driver_args = {
            "uc": True,  # Undetected Chrome
            "headless": False,  # Set True for production
            "incognito": True,
            "disable_js": False,
        }
        
        if self.use_proxy and 'PROXY_URL' in globals():
            # Parse proxy for SeleniumBase format
            proxy_parts = PROXY_URL.replace("http://", "").split("@")
            if len(proxy_parts) == 2:
                creds, host_port = proxy_parts
                driver_args["proxy"] = host_port
                driver_args["proxy_user"] = creds.split(":")[0]
                driver_args["proxy_pass"] = creds.split(":")[1]
        
        return Driver(**driver_args)
    
    def search_page_posts(
        self, 
        page_name: str, 
        max_results: int = 10,
        days_back: int = 30
    ) -> List[str]:
        """
        Search Google for Facebook post URLs from a specific page.
        
        Args:
            page_name: Facebook page name or username
            max_results: Maximum number of post URLs to collect
            days_back: Search within last N days
            
        Returns:
            List of Facebook post URLs
        """
        urls = []
        
        # Time partitioning: search recent periods first
        time_ranges = [
            ("past week", 7),
            ("past month", 30),
            ("past year", 365)
        ]
        
        driver = self._get_driver()
        
        try:
            for period_name, period_days in time_ranges:
                if len(urls) >= max_results:
                    break
                    
                # Build Google search query
                # This finds post URLs, not the main page
                query = f'site:facebook.com "{page_name}" (posts OR videos OR photos) -inurl:photos_albums -inurl:groups'
                search_url = f"https://www.google.com/search?q={urllib.parse.quote(query)}&num=20&tbs=qdr:{'w' if period_days == 7 else 'm' if period_days == 30 else 'y'}"
                
                print(f"Searching Google for {period_name} posts...")
                driver.get(search_url)
                time.sleep(random.uniform(3, 5))
                
                # Extract Facebook URLs from search results
                # Google wraps URLs, need to decode
                links = driver.find_elements("css selector", "a[href*='facebook.com']")
                
                for link in links:
                    href = link.get_attribute("href")
                    if not href:
                        continue
                    
                    # Decode Google redirect URL if needed
                    if "/url?q=" in href:
                        match = re.search(r'/url\?q=([^&]+)', href)
                        if match:
                            href = urllib.parse.unquote(match.group(1))
                    
                    # Filter for actual post URLs
                    if self._is_valid_post_url(href, page_name):
                        if href not in urls:
                            urls.append(href)
                            print(f"Found: {href}")
                            
                            if len(urls) >= max_results:
                                break
                
                # Anti-detection delay between Google searches
                self.search_queries_used += 1
                time.sleep(random.uniform(10, 15))
                
        finally:
            driver.quit()
        
        return urls[:max_results]
    
    def _is_valid_post_url(self, url: str, page_name: str) -> bool:
        """
        Validate if URL is a Facebook post URL for the target page.
        
        Valid patterns:
        - facebook.com/{page}/posts/{id}
        - facebook.com/{page}/photos/{id}
        - facebook.com/{page}/videos/{id}
        - facebook.com/story.php?story_fbid={id}&id={page_id}
        """
        if not url or "facebook.com" not in url:
            return False
        
        # Must contain page name or be story.php
        if page_name.lower() not in url.lower() and "story.php" not in url:
            return False
        
        # Exclude non-post URLs
        exclude_patterns = [
            "/about", "/photos_albums", "/events", "/reviews",
            "/community", "/likes", "/followers", "/groups/",
            "/marketplace/", "/watch/", "/reel/"
        ]
        
        if any(pattern in url for pattern in exclude_patterns):
            return False
        
        # Include valid post patterns
        include_patterns = [
            r'/posts/\d+',
            r'/photos/[^/]+/\d+',
            r'/videos/\d+',
            r'/story\.php\?.*story_fbid=\d+',
            r'/permalink\.php\?.*story_fbid=\d+'
        ]
        
        return any(re.search(pattern, url) for pattern in include_patterns)


class FacebookPostScraper:
    """
    Scrapes Facebook post data from direct URLs.
    Uses curl_cffi for browser TLS emulation.
    """
    
    def __init__(self, use_proxy: bool = False):
        self.use_proxy = use_proxy
        self.session = requests.Session()
        
        # Configure session
        if use_proxy and 'PROXY_URL' in globals():
            self.session.proxies = {"http": PROXY_URL, "https": PROXY_URL}
    
    def scrape_post(self, url: str, comment_limit: int = 0) -> Dict:
        """
        Scrape a single Facebook post.
        
        Args:
            url: Direct Facebook post URL
            comment_limit: Number of comments to fetch (0 = skip comments)
            
        Returns:
            Dict with post data
        """
        try:
            response = self.session.get(
                url,
                impersonate="chrome",
                timeout=30
            )
            html = response.text
            
            # Check if blocked
            if "login" in response.url.lower() or "checkpoint" in html.lower():
                print(f"Blocked or redirected for {url}")
                return {"url": url, "error": "blocked"}
            
            # Parse post data from HTML
            post_data = self._parse_post_html(html, url)
            
            # Fetch comments if requested
            if comment_limit > 0 and post_data.get("feedback_id"):
                comments = self._fetch_comments(
                    feedback_id=post_data["feedback_id"],
                    limit=comment_limit
                )
                post_data["comments"] = comments
                post_data["comment_count_fetched"] = len(comments)
            
            return post_data
            
        except Exception as e:
            print(f"Error scraping {url}: {e}")
            return {"url": url, "error": str(e)}
    
    def _parse_post_html(self, html: str, url: str) -> Dict:
        """Extract post metadata from HTML using regex."""
        data = {"url": url}
        
        # Post ID from URL or HTML
        post_id_match = re.search(r'/posts/(\d+)', url)
        if post_id_match:
            data["post_id"] = post_id_match.group(1)
        else:
            # Try story_fbid
            story_match = re.search(r'story_fbid=(\d+)', url)
            if story_match:
                data["post_id"] = story_match.group(1)
        
        # Author name
        author_patterns = [
            r'"actors":\[{"__typename":"User","name":"([^"]+)"',
            r'"owner":{"__typename":"User","name":"([^"]+)"',
            r'<meta property="og:title" content="([^"]+)"'
        ]
        for pattern in author_patterns:
            match = re.search(pattern, html)
            if match:
                data["author"] = match.group(1)
                break
        
        # Post text/description
        text_patterns = [
            r'<meta property="og:description" content="([^"]*)"',
            r'"message":{"text":"([^"]*)"',
            r'"comet_sections".*"message":{"text":"([^"]*)"'
        ]
        for pattern in text_patterns:
            match = re.search(pattern, html)
            if match:
                # Decode escape sequences (\n, \uXXXX) without
                # mangling any literal non-ASCII characters
                text = match.group(1).replace('\\n', '\n')
                if '\\u' in text or '\\x' in text:
                    text = text.encode('ascii', 'backslashreplace').decode('unicode_escape')
                data["text"] = text
                break
        
        # Timestamp
        timestamp_match = re.search(r'"publish_time":(\d+)', html)
        if timestamp_match:
            data["timestamp"] = int(timestamp_match.group(1))
            data["datetime"] = datetime.fromtimestamp(
                int(timestamp_match.group(1))
            ).isoformat()
        
        # Engagement metrics
        metrics = {}
        
        # Total reactions
        reaction_match = re.search(r'"reaction_count":{"count":(\d+)', html)
        if reaction_match:
            metrics["reactions"] = int(reaction_match.group(1))
        
        # Comments
        comment_matches = re.findall(r'"total_count":(\d+)', html)
        if comment_matches:
            # First total_count is usually comments
            metrics["comments"] = int(comment_matches[0])
        
        # Shares
        share_match = re.search(r'"share_count":{"count":(\d+)', html)
        if share_match:
            metrics["shares"] = int(share_match.group(1))
        
        data["engagement"] = metrics
        
        # Reaction breakdown
        reaction_types = ["LIKE", "LOVE", "CARE", "HAHA", "WOW", "SAD", "ANGRY"]
        reaction_breakdown = {}
        
        for reaction_type in reaction_types:
            pattern = rf'"key":"{reaction_type}".*?"reaction_count":(\d+)'
            match = re.search(pattern, html)
            if match:
                reaction_breakdown[reaction_type.lower()] = int(match.group(1))
        
        if reaction_breakdown:
            data["reaction_breakdown"] = reaction_breakdown
        
        # Feedback ID for GraphQL comment fetching
        feedback_match = re.search(r'"feedback":{"id":"([^"]+)"', html)
        if feedback_match:
            data["feedback_id"] = feedback_match.group(1)
        
        return data
    
    def _fetch_comments(self, feedback_id: str, limit: int = 10) -> List[Dict]:
        """
        Fetch comments via Facebook GraphQL API.
        
        Args:
            feedback_id: Post feedback ID from HTML
            limit: Number of comments to fetch
            
        Returns:
            List of comment dicts
        """
        comments = []
        cursor = None
        
        while len(comments) < limit:
            try:
                # Build GraphQL variables
                variables = {
                    "display_comments_feedback_context": None,
                    "display_comments_context_enable_comment": None,
                    "display_comments_context_is_ad_preview": None,
                    "display_comments_context_is_aggregated_share": None,
                    "display_comments_context_is_story_set": None,
                    "feedLocation": "PERMALINK",
                    "feedbackSource": 2,
                    "focusCommentID": None,
                    "gridMediaWidth": 230,
                    "scale": 1,
                    "useDefaultActor": False,
                    "id": feedback_id,
                    "privacySelectorRenderLocation": "COMET_STREAM",
                    "renderLocation": "permalink",
                    "serializedPreloadedQueryName": "CometFocusedStoryViewUFIQuery",
                    "storyID": None,
                    "__relay_internal__pv__IsWorkUserrelayprovider": False,
                    "commentsAfterCount": limit - len(comments),
                    "commentsAfterCursor": cursor,
                    "commentsBeforeCount": None,
                    "commentsBeforeCursor": None,
                    "commentsIntentToken": "RANKED_UNFILTERED_CHRONOLOGICAL",
                    "feedContext": None,
                    "feedbackContext": {
                        "feedLocation": "PERMALINK",
                        "feedbackSource": 2,
                        "groupID": None,
                        "storyID": None
                    },
                    "filterNonCommentItem": True,
                    "privacySelectorRenderLocation": "COMET_STREAM",
                    "renderLocation": "permalink"
                }
                
                data = {
                    "av": "0",
                    "__user": "0",
                    "__a": "1",
                    "__req": "1",
                    "__hs": "",
                    "dpr": "1",
                    "__ccg": "UNKNOWN",
                    "__rev": "",
                    "__s": "",
                    "__hsi": "",
                    "__dyn": "",
                    "__csr": "",
                    "fb_dtsg": "NA",
                    "jazoest": "NA",
                    "lsd": "NA",
                    "__spin_r": "",
                    "__spin_b": "trunk",
                    "__spin_t": str(int(time.time())),
                    "fb_api_caller_class": "RelayModern",
                    "fb_api_req_friendly_name": "CometFocusedStoryViewUFIQuery",
                    "variables": json.dumps(variables),
                    "server_timestamps": "true",
                    "doc_id": GRAPHQL_COMMENT_DOC_ID
                }
                
                response = self.session.post(
                    "https://www.facebook.com/api/graphql/",
                    data=data,
                    headers={
                        "Content-Type": "application/x-www-form-urlencoded",
                        "Accept": "*/*",
                        "Origin": "https://www.facebook.com",
                        "Referer": "https://www.facebook.com/"
                    },
                    impersonate="chrome",
                    timeout=30
                )
                
                # Parse response
                text = response.text
                
                # GraphQL response may have non-JSON prefix
                json_start = text.find("{")
                if json_start == -1:
                    break
                    
                result = json.loads(text[json_start:])
                
                # Navigate to comments data
                edges = (
                    result.get("data", {})
                    .get("node", {})
                    .get("display_comments", {})
                    .get("edges", [])
                )
                
                if not edges:
                    break
                
                for edge in edges:
                    node = edge.get("node", {})
                    comment = {
                        "id": node.get("id"),
                        "author": node.get("author", {}).get("name"),
                        "text": node.get("body", {}).get("text"),
                        "timestamp": node.get("created_time"),
                        "reactions": node.get("feedback", {}).get("reactors", {}).get("count", 0)
                    }
                    comments.append(comment)
                    
                    if len(comments) >= limit:
                        break
                
                # Check pagination
                page_info = (
                    result.get("data", {})
                    .get("node", {})
                    .get("display_comments", {})
                    .get("page_info", {})
                )
                
                cursor = page_info.get("end_cursor")
                if not page_info.get("has_next_page") or not cursor:
                    break
                
                # Delay between GraphQL requests
                time.sleep(random.uniform(1, 2))
                
            except Exception as e:
                print(f"Error fetching comments: {e}")
                break
        
        return comments[:limit]


def scrape_facebook_page(
    page_url: str,
    post_limit: int = 10,
    comments_per_post: int = 0,
    use_proxy: bool = False
) -> List[Dict]:
    """
    Main workflow: discover post URLs then scrape them.
    
    Args:
        page_url: Facebook page URL
        post_limit: Number of posts to scrape
        comments_per_post: Comments per post to fetch
        use_proxy: Whether to use Live Proxies
        
    Returns:
        List of scraped post data
    """
    # Extract page name from URL
    page_name = page_url.rstrip("/").split("/")[-1]
    
    print(f"Starting scrape for page: {page_name}")
    print(f"Target: {post_limit} posts, {comments_per_post} comments/post")
    
    # Stage 1: Discover URLs via Google
    collector = FacebookURLCollector(use_proxy=use_proxy)
    post_urls = collector.search_page_posts(
        page_name=page_name,
        max_results=post_limit
    )
    
    print(f"\nFound {len(post_urls)} post URLs")
    
    if not post_urls:
        print("No post URLs discovered")
        return []
    
    # Stage 2: Scrape each post
    scraper = FacebookPostScraper(use_proxy=use_proxy)
    results = []
    
    for i, url in enumerate(post_urls, 1):
        print(f"\n[{i}/{len(post_urls)}] Scraping: {url}")
        
        result = scraper.scrape_post(
            url=url,
            comment_limit=comments_per_post
        )
        
        if "error" not in result:
            results.append(result)
            print(f"  ✓ Success: {result.get('author', 'Unknown')} - {result.get('engagement', {}).get('reactions', 0)} reactions")
        else:
            print(f"  ✗ Failed: {result['error']}")
        
        # Anti-detection delay between posts
        if i < len(post_urls):
            delay = random.uniform(2, 4)
            print(f"  Waiting {delay:.1f}s...")
            time.sleep(delay)
    
    return results


if __name__ == "__main__":
    import argparse
    
    parser = argparse.ArgumentParser(
        description="Scrape Facebook posts and comments using Google discovery"
    )
    parser.add_argument(
        "--url", "-u",
        required=True,
        help="Facebook page URL (e.g., https://www.facebook.com/narendramodi)"
    )
    parser.add_argument(
        "--posts", "-p",
        type=int,
        default=10,
        help="Number of posts to scrape"
    )
    parser.add_argument(
        "--comments", "-c",
        type=int,
        default=0,
        help="Comments per post to fetch"
    )
    parser.add_argument(
        "--proxy",
        action="store_true",
        help="Use Live Proxies for rotation"
    )
    parser.add_argument(
        "--output", "-o",
        help="Output JSON filename"
    )
    
    args = parser.parse_args()
    
    # Run scraper
    results = scrape_facebook_page(
        page_url=args.url,
        post_limit=args.posts,
        comments_per_post=args.comments,
        use_proxy=args.proxy
    )
    
    # Save results
    output_file = args.output or f"facebook_{args.url.rstrip('/').split('/')[-1]}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
    
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)
    
    print(f"\n{'='*60}")
    print(f"Scraping complete!")
    print(f"Total posts scraped: {len(results)}")
    print(f"Output saved to: {output_file}")

Running the scraper

Execute the script with your target page:

Basic scrape (posts only):

python facebook_scraper.py -u https://www.facebook.com/narendramodi -p 10

Deep scrape (posts + 10 comments each):

python facebook_scraper.py -u https://www.facebook.com/narendramodi -p 10 -c 10

Parameters:

  • --url (-u) – Target Facebook page URL

  • --posts (-p) – Number of posts to discover via Google

  • --comments (-c) – Number of comments to extract per post (default: 0)

  • --proxy – Route requests through the configured proxy (optional)

  • --output (-o) – Custom output filename (optional)

The script outputs a JSON file with post metadata and comments.

What you get back

Here's the output structure, which is nested JSON with post text, engagement metrics, reaction breakdowns, and comments:

{
  "url": "https://www.facebook.com/narendramodi/posts/...",
  "post_id": "1671969867459607",
  "author": "Narendra Modi",
  "text": "Had a wonderful interaction with students...",
  "timestamp": 1735154400,
  "datetime": "2024-12-25T10:00:00",
  "engagement": {
    "reactions": 45231,
    "comments": 2184,
    "shares": 892
  },
  "reaction_breakdown": {
    "like": 38000,
    "love": 5231
  },
  "feedback_id": "ZmVlZGJhY2s6...",
  "comments": [
    {
      "id": "Y29tbWVudDoxMjM0NTY=",
      "author": "User Name",
      "text": "Great initiative!",
      "timestamp": 1735158000,
      "reactions": 23
    }
  ],
  "comment_count_fetched": 10
}

Why regex over DOM parsers

Regex scans Facebook's large server-rendered React payloads faster than building a full parse tree with BeautifulSoup. Pre-compiling patterns with re.compile() cuts per-document overhead during high-volume extraction.
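A minimal sketch of the pre-compiled approach, reusing the same patterns as the extraction code above (the sample HTML fragment is illustrative):

```python
import re

# Compile once at module load; reuse across thousands of documents
PATTERNS = {
    "author": re.compile(r'"actors":\[{"__typename":"User","name":"([^"]+)"'),
    "reactions": re.compile(r'"reaction_count":{"count":(\d+)'),
    "shares": re.compile(r'"share_count":{"count":(\d+)'),
}

def extract_fields(html: str) -> dict:
    """Run each precompiled pattern over the raw HTML string."""
    out = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(html)
        if match:
            out[name] = match.group(1)
    return out

sample = ('"actors":[{"__typename":"User","name":"Mark Zuckerberg"'
          ' "reaction_count":{"count":190863')
print(extract_fields(sample))
```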

For discovery, the script uses time partitioning, which means querying Google for distinct date ranges (past 7 days, past 30 days). This retrieves historical data without duplicate results.
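The partitioning reduces to mapping a look-back window onto Google's tbs=qdr: parameter, the same mapping the discovery class builds inline. A small standalone sketch (the query string is illustrative):

```python
import urllib.parse

def google_search_url(query: str, days_back: int) -> str:
    """Build a Google search URL restricted to a recent window
    via the tbs=qdr: date-range parameter."""
    if days_back <= 7:
        qdr = "w"   # past week
    elif days_back <= 30:
        qdr = "m"   # past month
    else:
        qdr = "y"   # past year
    return (
        "https://www.google.com/search?"
        f"q={urllib.parse.quote(query)}&num=20&tbs=qdr:{qdr}"
    )

print(google_search_url('site:facebook.com "nasa" "/posts/"', days_back=7))
```

Searching the narrow window first and widening only when needed keeps results fresh and avoids re-fetching the same URLs across runs.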

The GRAPHQL_COMMENT_DOC_ID identifies the query type. If comment extraction fails, open your browser's Network tab, filter for graphql, and inspect the payload variables to find the updated doc_id.


How to scrape Facebook groups (public) without breaking rules?

Groups use a different URL structure. You can't access feeds directly, so find group post URLs through search engines instead.


Group posts look like this: https://www.facebook.com/groups/{group_id}/posts/{post_id}/

The approach: search index discovery

The method mirrors the page scraper. A headless browser searches Google for site:facebook.com/groups/{group_name}, finds direct post URLs, then the same regex patterns extract the post data.

Building the scraper

Save this as facebook_group_scraper.py.

It uses the same time-partitioning approach as above: searching different time windows (past week, past month) to retrieve more results without triggering rate limits.

import re
import time
import random
import json
import html as html_lib
import urllib.parse
from typing import List, Dict
from curl_cffi import requests
from seleniumbase import Driver

# Optional proxy
# PROXY_URL = "http://username:[email protected]:7383"

class FacebookGroupScraper:
    """
    Scrapes public Facebook group posts discovered via Google.
    """
    
    def __init__(self, use_proxy: bool = False):
        self.use_proxy = use_proxy
        
    def discover_group_posts(
        self,
        group_name: str,
        max_results: int = 10
    ) -> List[str]:
        """
        Search Google for public group post URLs.
        
        Args:
            group_name: Facebook group name or keyword
            max_results: Number of post URLs to find
            
        Returns:
            List of group post URLs
        """
        urls = []
        
        driver_args = {
            "uc": True,
            "headless": False,  # Set True for production
            "incognito": True,
        }
        
        if self.use_proxy and 'PROXY_URL' in globals():
            proxy_parts = PROXY_URL.replace("http://", "").split("@")
            if len(proxy_parts) == 2:
                creds, host_port = proxy_parts
                driver_args["proxy"] = host_port
                driver_args["proxy_user"] = creds.split(":")[0]
                driver_args["proxy_pass"] = creds.split(":")[1]
        
        driver = Driver(**driver_args)
        
        try:
            # Multiple search queries for better coverage
            queries = [
                f'site:facebook.com/groups "{group_name}" "/posts/"',
                f'site:facebook.com/groups/{group_name} "/posts/"',
                f'"{group_name}" site:facebook.com/groups "posts"'
            ]
            
            for query in queries:
                if len(urls) >= max_results:
                    break
                    
                search_url = f"https://www.google.com/search?q={urllib.parse.quote(query)}&num=20"
                
                print(f"Searching: {query}")
                driver.get(search_url)
                time.sleep(random.uniform(3, 5))
                
                links = driver.find_elements("css selector", "a[href*='facebook.com/groups/']")
                
                for link in links:
                    href = link.get_attribute("href")
                    if not href:
                        continue
                    
                    # Decode Google redirect
                    if "/url?q=" in href:
                        match = re.search(r'/url\?q=([^&]+)', href)
                        if match:
                            href = urllib.parse.unquote(match.group(1))
                    
                    # Validate group post URL
                    if self._is_valid_group_post_url(href):
                        if href not in urls:
                            urls.append(href)
                            print(f"Found: {href}")
                            
                            if len(urls) >= max_results:
                                break
                
                time.sleep(random.uniform(8, 12))
                
        finally:
            driver.quit()
        
        return urls[:max_results]
    
    def _is_valid_group_post_url(self, url: str) -> bool:
        """Check if URL is a Facebook group post."""
        if not url or "facebook.com/groups/" not in url:
            return False
        
        # Must be a direct post URL
        patterns = [
            r'/groups/\d+/posts/\d+',
            r'/groups/[^/]+/posts/\d+'
        ]
        
        return any(re.search(pattern, url) for pattern in patterns)
    
    def scrape_group_post(self, url: str) -> Dict:
        """
        Scrape a single group post from direct URL.
        Uses same TLS browser emulation as page scraper.
        """
        session = requests.Session()
        
        if self.use_proxy and 'PROXY_URL' in globals():
            session.proxies = {"http": PROXY_URL, "https": PROXY_URL}
        
        try:
            response = session.get(
                url,
                impersonate="chrome",
                timeout=30
            )
            html = response.text
            
            if "login" in response.url.lower():
                return {"url": url, "error": "login_wall"}
            
            data = {"url": url}
            
            # Group post ID
            post_match = re.search(r'/posts/(\d+)', url)
            if post_match:
                data["post_id"] = post_match.group(1)
            
            # Author
            author_patterns = [
                r'"actors":\[{"__typename":"User","name":"([^"]+)"',
                r'"author":{"__typename":"User","name":"([^"]+)"',
                r'"owner":{"name":"([^"]+)"'
            ]
            for pattern in author_patterns:
                match = re.search(pattern, html)
                if match:
                    data["author"] = html_lib.unescape(match.group(1))
                    break
            
            # Post text
            text_patterns = [
                r'<meta property="og:description" content="([^"]*)"',
                r'"message":{"text":"([^"]*)"',
                r'"story":{.*?"message":{"text":"([^"]*)"'
            ]
            for pattern in text_patterns:
                match = re.search(pattern, html)
                if match:
                    text = html_lib.unescape(match.group(1))
                    data["text"] = text.replace('\\n', '\n')
                    break
            
            # Group name
            group_patterns = [
                r'<title>([^<]+) \| Facebook</title>',
                r'"community":{"name":"([^"]+)"'
            ]
            for pattern in group_patterns:
                match = re.search(pattern, html)
                if match:
                    data["group_name"] = html_lib.unescape(match.group(1))
                    break
            
            # Timestamp
            time_match = re.search(r'"publish_time":(\d+)', html)
            if time_match:
                data["timestamp"] = int(time_match.group(1))
            
            # Engagement
            reaction_match = re.search(r'"reaction_count":{"count":(\d+)', html)
            comment_match = re.search(r'"total_count":(\d+)', html)
            
            data["engagement"] = {
                "reactions": int(reaction_match.group(1)) if reaction_match else 0,
                "comments": int(comment_match.group(1)) if comment_match else 0
            }
            
            return data
            
        except Exception as e:
            return {"url": url, "error": str(e)}
    
    def scrape_group(
        self,
        group_url: str,
        post_limit: int = 10
    ) -> List[Dict]:
        """
        Main workflow for scraping group posts.
        
        Args:
            group_url: Facebook group URL or identifier
            post_limit: Number of posts to scrape
            
        Returns:
            List of scraped posts
        """
        # Extract group name from URL
        group_name = group_url.rstrip("/").split("/")[-1]
        
        print(f"Discovering posts for group: {group_name}")
        post_urls = self.discover_group_posts(group_name, post_limit)
        
        print(f"\nFound {len(post_urls)} group posts")
        
        results = []
        for i, url in enumerate(post_urls, 1):
            print(f"[{i}/{len(post_urls)}] Scraping: {url}")
            
            result = self.scrape_group_post(url)
            if "error" not in result:
                results.append(result)
                print(f"  ✓ {result.get('author')} in {result.get('group_name')}")
            
            time.sleep(random.uniform(2, 4))
        
        return results


if __name__ == "__main__":
    import argparse
    
    parser = argparse.ArgumentParser(
        description="Scrape public Facebook group posts via Google discovery"
    )
    parser.add_argument(
        "--url",
        required=True,
        help="Facebook group URL (e.g., https://www.facebook.com/groups/dogspotting)"
    )
    parser.add_argument(
        "--posts",
        type=int,
        default=10,
        help="Number of posts to scrape"
    )
    parser.add_argument(
        "--proxy",
        action="store_true",
        help="Use Live Proxies"
    )
    parser.add_argument(
        "--output",
        help="Output filename"
    )
    
    args = parser.parse_args()
    
    scraper = FacebookGroupScraper(use_proxy=args.proxy)
    results = scraper.scrape_group(args.url, args.posts)
    
    output_file = args.output or f"group_posts_{args.posts}.json"
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)
    
    print(f"\nDone! Scraped {len(results)} posts")
    print(f"Saved to: {output_file}")

Running it

Run the script with your target group:

python facebook_group_scraper.py --url "https://www.facebook.com/groups/dogspotting" --posts 10

Example response

{
    "url": "https://www.facebook.com/groups/dogspotting/posts/10164914842549467/",
    "post_id": "10164914842549467",
    "author": "Sarah Johnson",
    "text": "Spotted this good boy at Central Park today! Anyone know what breed mix he might be?",
    "group_name": "Dogspotting Society",
    "timestamp": 1735123456,
    "engagement": {
        "reactions": 234,
        "comments": 67
    }
}

How this bypasses Facebook's restrictions

Google's index is the key here. This approach only accesses public URLs that Google already indexed, never the Facebook feed itself.

What doesn't work

Google's index takes time to update, so this method retrieves historical data more reliably than real-time monitoring (posts from the last 10 minutes won't appear). Only public groups work; search engines can't see private group content.

Google rate-limits your searches. The script handles basic detection, but high-volume scraping needs rotating proxies or a SERP API. The regex patterns may need occasional updates if Facebook changes their variable names (like reaction_count to feedback_count). This doesn't happen often, but it does happen.
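When a key does get renamed, a small fallback helper keeps extraction working. first_match here is a hypothetical utility that mirrors the multi-pattern loops used throughout this guide:

```python
import re
from typing import Optional

# Sketch: try a list of regex patterns in order and return the first capture.
# Guards against Facebook renaming JSON keys (e.g., reaction_count -> feedback_count).
def first_match(patterns: list, html: str) -> Optional[str]:
    for pattern in patterns:
        match = re.search(pattern, html)
        if match:
            return match.group(1)
    return None

html = '{"feedback_count":{"count":42}}'
reactions = first_match(
    [r'"reaction_count":{"count":(\d+)', r'"feedback_count":{"count":(\d+)'],
    html,
)
```

Adding the new pattern to the list is a one-line fix instead of a scraper rewrite.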

How to scrape Facebook Marketplace?

When you scroll while logged out, Marketplace displays roughly 20-30 listings before a login wall blocks further browsing.

The approach: API reverse engineering

Marketplace is a Single Page Application (SPA). When you scroll, it fetches data through https://www.facebook.com/api/graphql/.

The implementation uses API reverse engineering, observing the GraphQL requests the browser makes and replicating them directly in code. By copying the request format from your browser's Network tab, you can fetch listings programmatically without hitting the scroll-based login wall.

Implementation

Here's what it does: it searches a specific Marketplace category within a location radius.

How it works:

  • Takes a category URL (like facebook.com/marketplace/category/electronics)

  • Extracts the category slug (electronics)

  • Sends GraphQL queries with that category + your lat/long coordinates

  • Returns listings within a 65km radius

Important: Facebook rotates API tokens and query IDs (doc_id) frequently. Before running the script, copy fresh values from your browser:

  • Open Facebook Marketplace in a logged-out browser

  • Open DevTools → Network tab

  • Scroll through some listings and filter the Network tab for graphql

  • Click on a GraphQL request and copy values from the Payload/Headers tabs

  • Update COOKIES, HEADERS, BASE_DATA, and DOC_ID in the code

These values are just browser fingerprints and session tokens for public data access; no login is required.

from curl_cffi import requests
import json
import time
import random
import argparse
from urllib.parse import urlparse

# Optional: Uncomment to use Live Proxies
# PROXY_URL = "http://username:[email protected]:7383"

# IMPORTANT: Update these values from your browser's DevTools
# Open Facebook Marketplace, inspect a GraphQL request, and copy fresh values
COOKIES = {
    # These are examples. Replace with values from a real Marketplace session.
    'wd': '2560x1271',
    'datr': 'YOUR_DATR_TOKEN',
    'sb': 'YOUR_SB_TOKEN',
    'ps_l': '1',
    'ps_n': '1',
    'fr': 'YOUR_FR_TOKEN',
    'xs': '0%3Aabc123',
    'c_user': '0',  # Logged-out sessions may omit this
}

HEADERS = {
    'accept': '*/*',
    'accept-language': 'en-US,en;q=0.9',
    'content-type': 'application/x-www-form-urlencoded',
    'origin': 'https://www.facebook.com',
    'referer': 'https://www.facebook.com/marketplace/',
    'sec-ch-prefers-color-scheme': 'light',
    'sec-ch-ua': '"Google Chrome";v="131", "Chromium";v="131", "Not_A Brand";v="24"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-model': '""',
    'sec-ch-ua-platform': '"Windows"',
    'sec-ch-ua-platform-version': '"10.0.0"',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'x-asbd-id': '129477',
    'x-fb-friendly-name': 'CometMarketplaceCategoryContentPaginationQuery',
    'x-fb-lsd': 'YOUR_LSD_TOKEN',
}

# Base request data from DevTools
BASE_DATA = {
    'av': '0',
    '__user': '0',
    '__a': '1',
    '__req': '1',
    '__hs': 'YOUR_HS_VALUE',
    'dpr': '1',
    '__ccg': 'GOOD',
    '__rev': 'YOUR_REV',
    '__s': 'YOUR_S_VALUE',
    '__hsi': 'YOUR_HSI',
    '__dyn': 'YOUR_DYN',
    '__csr': 'YOUR_CSR',
    'fb_dtsg': 'NA',
    'jazoest': '22000',
    'lsd': 'YOUR_LSD_TOKEN',
    '__spin_r': 'YOUR_SPIN_R',
    '__spin_b': 'trunk',
    '__spin_t': str(int(time.time())),
    'fb_api_caller_class': 'RelayModern',
    'fb_api_req_friendly_name': 'CometMarketplaceCategoryContentPaginationQuery',
    'server_timestamps': 'true',
}

# GraphQL document ID. Update this from DevTools.
DOC_ID = '1234567890123456'


def marketplace_search(
    category_url: str = None,
    query: str = None,
    lat: float = 37.7749,
    long: float = -122.4194,
    radius_km: int = 65,
    limit: int = 50,
    min_price: int = None,
    max_price: int = None,
    condition: str = None,
    use_proxy: bool = False
):
    """
    Search Facebook Marketplace by category or keyword.
    
    Args:
        category_url: Marketplace category URL (e.g., electronics)
        query: Search keyword (optional, overrides category browsing)
        lat: Latitude for location-based results
        long: Longitude for location-based results
        radius_km: Search radius in kilometers
        limit: Maximum listings to retrieve
        min_price: Minimum price filter
        max_price: Maximum price filter
        condition: Item condition (e.g., "new", "used_good")
        use_proxy: Whether to use Live Proxies
        
    Returns:
        List of listing data
    """
    
    # Extract category slug from URL if provided
    category_slug = None
    if category_url:
        parsed = urlparse(category_url)
        path_parts = parsed.path.strip('/').split('/')
        if 'category' in path_parts:
            category_idx = path_parts.index('category')
            if category_idx + 1 < len(path_parts):
                category_slug = path_parts[category_idx + 1]
    
    session = requests.Session()
    
    if use_proxy and 'PROXY_URL' in globals():
        session.proxies = {"http": PROXY_URL, "https": PROXY_URL}
    
    listings = []
    cursor = None
    
    while len(listings) < limit:
        try:
            # Build GraphQL variables
            variables = {
                "buyLocation": {
                    "latitude": lat,
                    "longitude": long
                },
                "categoryIDArray": [],
                "count": min(24, limit - len(listings)),
                "cursor": cursor,
                "deliveryMethod": "NO_SHIPPING",
                "filters": {
                    "sort_by": "BEST_MATCH"
                },
                "latitude": lat,
                "longitude": long,
                "query": query or "",
                "radius": radius_km * 1000,  # Convert km to meters
                "scale": 1
            }
            
            # Add category if browsing by category
            if category_slug:
                variables["category"] = category_slug
                variables["categoryIDArray"] = [category_slug]
            
            # Add filters
            if min_price is not None:
                variables["filters"]["minPrice"] = min_price
            if max_price is not None:
                variables["filters"]["maxPrice"] = max_price
            if condition:
                variables["filters"]["condition"] = condition
            
            # Build request data
            data = BASE_DATA.copy()
            data['variables'] = json.dumps(variables)
            data['doc_id'] = DOC_ID
            data['__spin_t'] = str(int(time.time()))
            
            # Make GraphQL request
            response = session.post(
                'https://www.facebook.com/api/graphql/',
                cookies=COOKIES,
                headers=HEADERS,
                data=data,
                impersonate="chrome",
                timeout=30
            )
            
            # Parse response
            text = response.text
            json_start = text.find('{')
            if json_start == -1:
                print("No JSON found in response")
                break
            
            result = json.loads(text[json_start:])
            
            # Navigate to marketplace listings
            # GraphQL response structure can vary; inspect with DevTools
            data_node = result.get('data', {})
            
            # Common response patterns
            edges = []
            if 'viewer' in data_node:
                # Pattern 1: viewer -> marketplace_feed
                edges = (
                    data_node.get('viewer', {})
                    .get('marketplace_feed', {})
                    .get('feed_units', {})
                    .get('edges', [])
                )
            elif 'marketplace_search' in data_node:
                # Pattern 2: direct marketplace_search
                edges = (
                    data_node.get('marketplace_search', {})
                    .get('feed_units', {})
                    .get('edges', [])
                )
            
            if not edges:
                print("No listings found or response structure changed")
                break
            
            # Extract listing data
            for edge in edges:
                node = edge.get('node', {})
                
                # Extract common fields via the module-level helpers below
                listing = {
                    'id': node.get('id'),
                    'title': node.get('marketplace_listing_title') or node.get('title'),
                    'price': _extract_price(node),
                    'location': _extract_location(node),
                    'image_url': _extract_image(node),
                    'listing_url': _extract_url(node),
                    'condition': node.get('condition'),
                    'seller_name': _extract_seller(node)
                }
                
                # Only add complete listings
                if listing['title'] and listing['id']:
                    listings.append(listing)
                    print(f"[{len(listings)}/{limit}] {listing['title']} - {listing['price']}")
                
                if len(listings) >= limit:
                    break
            
            # Pagination
            page_info = {}
            if 'viewer' in data_node:
                page_info = (
                    data_node.get('viewer', {})
                    .get('marketplace_feed', {})
                    .get('feed_units', {})
                    .get('page_info', {})
                )
            elif 'marketplace_search' in data_node:
                page_info = (
                    data_node.get('marketplace_search', {})
                    .get('feed_units', {})
                    .get('page_info', {})
                )
            
            cursor = page_info.get('end_cursor')
            if not page_info.get('has_next_page') or not cursor:
                print("No more pages")
                break
            
            # Delay between requests to avoid rate limits
            time.sleep(random.uniform(2, 4))
            
        except Exception as e:
            print(f"Error: {e}")
            break
    
    return listings[:limit]


def _extract_price(node):
    """Extract price from listing node."""
    # Try multiple price field patterns
    price_patterns = [
        (node.get('listing_price') or {}).get('formatted_amount'),
        node.get('price_text'),
        node.get('formatted_price')
    ]
    
    for price in price_patterns:
        if price:
            return price
    
    return "N/A"


def _extract_location(node):
    """Extract location string."""
    location_patterns = [
        (node.get('location_text') or {}).get('text'),
        (((node.get('location') or {}).get('reverse_geocode') or {}).get('city_page') or {}).get('display_name'),
        ((node.get('marketplace_listing_location') or {}).get('location') or {}).get('city')
    ]
    
    for location in location_patterns:
        if location:
            return location
    
    return "Unknown"


def _extract_image(node):
    """Extract primary image URL."""
    photos = node.get('listing_photos') or [{}]
    image_patterns = [
        ((node.get('primary_listing_photo') or {}).get('image') or {}).get('uri'),
        ((photos[0] or {}).get('image') or {}).get('uri'),
        (node.get('image') or {}).get('uri')
    ]
    
    for image in image_patterns:
        if image:
            return image
    
    return None


def _extract_url(node):
    """Extract listing URL."""
    if node.get('id'):
        return f"https://www.facebook.com/marketplace/item/{node['id']}/"
    
    if node.get('story_key'):
        return f"https://www.facebook.com/marketplace/item/{node['story_key']}/"
    
    return None


def _extract_seller(node):
    """Extract seller name."""
    seller_patterns = [
        (node.get('marketplace_listing_seller') or {}).get('name'),
        (node.get('seller') or {}).get('name')
    ]
    
    for seller in seller_patterns:
        if seller:
            return seller
    
    return "Unknown"


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Search Facebook Marketplace by category or keyword"
    )
    parser.add_argument(
        "--url",
        help="Marketplace category URL (e.g., https://www.facebook.com/marketplace/category/electronics)"
    )
    parser.add_argument(
        "--query",
        help="Search keyword (e.g., 'iPhone')"
    )
    parser.add_argument(
        "--lat",
        type=float,
        default=37.7749,
        help="Latitude (default: San Francisco)"
    )
    parser.add_argument(
        "--long",
        type=float,
        default=-122.4194,
        help="Longitude (default: San Francisco)"
    )
    parser.add_argument(
        "--radius",
        type=int,
        default=65,
        help="Search radius in kilometers"
    )
    parser.add_argument(
        "--min-price",
        type=int,
        help="Minimum price"
    )
    parser.add_argument(
        "--max-price",
        type=int,
        help="Maximum price"
    )
    parser.add_argument(
        "--condition",
        choices=["new", "used_good", "used_like_new", "used_fair"],
        help="Item condition"
    )
    parser.add_argument(
        "--limit",
        type=int,
        default=50,
        help="Maximum listings to scrape"
    )
    parser.add_argument(
        "--proxy",
        action="store_true",
        help="Use Live Proxies"
    )
    parser.add_argument(
        "--output",
        default="marketplace_results.json",
        help="Output JSON filename"
    )
    
    args = parser.parse_args()
    
    # Require either URL or query
    if not args.url and not args.query:
        parser.error("Either --url or --query is required")
    
    
    results = marketplace_search(
        category_url=args.url,
        query=args.query,
        lat=args.lat,
        long=args.long,
        radius_km=args.radius,
        limit=args.limit,
        min_price=args.min_price,
        max_price=args.max_price,
        condition=args.condition,
        use_proxy=args.proxy
    )
    
    with open(args.output, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)
    
    print(f"\nDone! Found {len(results)} listings")
    print(f"Saved to: {args.output}")

How to use it

You have 2 search modes:

Category mode – Browse a specific category:

python facebook_marketplace_scraper.py --url "https://www.facebook.com/marketplace/category/electronics" --limit 50

Keyword mode – Search for specific items:

python facebook_marketplace_scraper.py --query "iPhone" --limit 50

Advanced filtering:

# Search for furniture in New York within 10km, priced $100-$500, in used condition
python facebook_marketplace_scraper.py \
  --query "furniture" \
  --lat 40.7128 --long -74.0060 \
  --radius 10 \
  --min-price 100 --max-price 500 \
  --condition used_good \
  --limit 20

Available arguments:

  • --url – Category URL, for example .../category/electronics
  • --query – Search keyword, for example iPhone or Canon
  • --lat / --long – Search center coordinates. Default: San Francisco
  • --radius – Search radius in kilometers. Default: 65
  • --min-price / --max-price – Price range in dollars
  • --condition – Item condition: new, used_good, used_like_new, or used_fair
  • --limit – Maximum items to scrape. Default: 50
  • --output – Output filename. Default: marketplace_results.json

You must provide either --url or --query (or both for filtered category browsing).

Sample output

Facebook returns location as a city string (like "New York, NY") instead of exact coordinates. The output looks like this:

[
  {
    "id": "743141558781810",
    "title": "Sony A7III Mirrorless Camera Body",
    "price": "$1,200",
    "location": "San Francisco, CA",
    "image_url": "https://scontent.xx.fbcdn.net/...",
    "listing_url": "https://www.facebook.com/marketplace/item/743141558781810/",
    "condition": "used_good",
    "seller_name": "John Smith"
  }
]

Limitations

Keyword search is less reliable than category browsing. Facebook's search algorithm may return semantically similar items instead of exact matches.

Session data expires quickly. Cookies (c_user, xs) invalidate if your IP changes, and the query IDs (doc_id) rotate when Facebook updates the Marketplace frontend. Re-extract fresh values from DevTools if you start getting empty responses.
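A quick way to tell stale tokens apart from a genuinely empty result is to classify the raw response body before parsing further. This is a hedged sketch; the exact error shapes Facebook returns vary, so treat these checks as heuristics:

```python
import json

# Sketch: classify a raw GraphQL response body so the scraper can tell
# "tokens expired / query ID rotated" apart from "genuinely no listings".
def check_graphql_response(text: str) -> str:
    json_start = text.find('{')
    if json_start == -1:
        return "blocked"           # HTML error page, likely a hard block
    try:
        result = json.loads(text[json_start:])
    except json.JSONDecodeError:
        return "blocked"
    if result.get("errors"):
        return "stale_tokens"      # refresh DOC_ID / cookies from DevTools
    if result.get("data") in (None, {}):
        return "stale_tokens"
    return "ok"

status = check_graphql_response('{"data": null}')
```

Call it right after the POST; on "stale_tokens", stop and re-extract fresh values rather than hammering the endpoint.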

The GraphQL endpoint is sensitive to rate limits. Use the built-in delays and consider rotating proxies for larger crawls.

How to avoid blocks in Facebook scraping?

The scripts above work for small tests. For larger volumes, IP rotation and request timing become essential.

Use residential proxies

Cloud server IPs (AWS, GCP, Azure) don't work because they're datacenter addresses. Facebook flags them immediately.

When choosing a residential proxy provider, look for features built specifically for scraping.

Live Proxies offers residential and mobile IPs with these features:

  • Private allocation. Provides a dedicated IP pool exclusively for your account. Unlike shared pools, where other users can burn IPs with aggressive scraping, your allocated IPs aren't being used on the same targets by another customer. You control the IP reputation. See pricing options.

  • Sticky sessions. Lock an IP for up to 24 hours by adding a session ID to your proxy username (e.g., sid:1). Use this when a task needs a consistent identity, such as a Google search session or a sequence of Marketplace pagination requests.

  • Rotating sessions. Assigns a fresh IP for every request. Use this for high-volume discovery tasks like searching Marketplace listings across multiple cities or fetching hundreds of independent post URLs.

Live Proxies also offers rotating mobile proxies, where IPs come from mobile carriers like T-Mobile and Verizon. These work for stricter targets, but residential proxies are usually sufficient for Facebook's public surfaces.

Add proxies to the scripts

Get your proxy credentials from the Live Proxies dashboard.

For B2C users, the format is:

IP:PORT:USERNAME-ACCESS_CODE-SID:PASSWORD

For B2B users, the format is:

b2b.liveproxies.io:7383:username-access_code-sid:password

Uncomment the PROXY_URL line at the top of any script:

# PROXY CONFIGURATION
PROXY_URL = "http://username:[email protected]:7383"

The scripts automatically route all traffic through your residential proxy pool.

Request timing strategy

Space out requests even with proxies. Facebook tracks request frequency per IP.

For discovery (Google search or Marketplace browsing):

  • Use 5–10 second delays between requests

  • Use sticky sessions (one IP per browsing session)

  • Randomize delays: time.sleep(random.uniform(5, 10))

For extraction (fetching individual post URLs):

  • Use 1–2 second delays between requests

  • Rotate IPs more aggressively, since requests are independent

  • These delays are already built into the scripts

Why curl_cffi matters with residential proxies

Residential IPs only work if your connection appears legitimate. Standard Python requests still fail because Facebook checks the TLS handshake before looking at the IP reputation.

The scripts use curl_cffi with impersonate="chrome" throughout. This replicates a real Chrome TLS fingerprint and combines with the residential IP to match both connection-layer and network-layer expectations.

Detecting blocks

Facebook blocks show up as:

  • Soft blocks – API returns empty data or {"data": null}

  • CAPTCHA – Browser displays "unusual traffic" warning

  • HTTP errors – 429 (rate limit) or 403 (forbidden)

If you hit a soft block, wait 10–15 minutes and rotate to a different proxy session. If you hit repeated CAPTCHAs, reduce your rate further.
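The wait-and-rotate policy above can be sketched as a pure function that maps a response to a recommended action. The thresholds are this guide's suggestions, not Facebook-documented values:

```python
# Sketch: map the block signals above to a recommended (action, wait_seconds).
# The caller sleeps for `wait` and switches its proxy session on "rotate".
def classify_block(status_code: int, body: str):
    if status_code == 429:
        return ("rotate", 600)      # rate limited: wait ~10 min, new session
    if status_code == 403:
        return ("rotate", 60)       # forbidden: IP likely burned, switch fast
    if '"data": null' in body or not body.strip():
        return ("rotate", 600)      # soft block: 10-15 minute cooldown
    return ("continue", 0)

action, wait = classify_block(200, '{"data": null}')
```

Keeping the decision logic pure makes it easy to unit-test and tune without touching the network code.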

For more on handling IP bans, see your IP has been banned.

How to model, clean, and store Facebook scraping data?

The scripts save raw JSON with display-formatted strings. Before analysis, clean the data:

  • Normalize values. Strip currency symbols from prices ("$1,200" → 1200.0), convert Unix timestamps to ISO 8601 format (1767231091 → 2026-01-01T01:31:31Z), and remove null fields to reduce file size.

  • Use JSON Lines. Save data as .jsonl (one JSON object per line) instead of standard JSON arrays. If your scraper crashes, you only lose the last partial line, not the entire file.

  • Handle duplicates. Use post_id or id as your unique key. When saving to a database, use upserts to update existing posts if they were re-scraped later with new comment counts.

  • For storage, document databases (MongoDB or PostgreSQL with JSONB) handle nested data structures natively, while relational databases work better after normalizing comments into separate tables.
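A minimal sketch of those cleaning steps, using the field names from the sample outputs above: price string to float, Unix timestamp to ISO 8601, null removal, and JSONL append with ID-based dedupe.

```python
import json
import re
from datetime import datetime, timezone

# Sketch: normalize a raw listing and append it to a JSONL file,
# skipping IDs already written. Field names match the sample output above.
def clean_listing(raw: dict) -> dict:
    cleaned = {k: v for k, v in raw.items() if v is not None}  # drop nulls
    price = cleaned.get("price")
    if isinstance(price, str):
        digits = re.sub(r"[^\d.]", "", price)                  # "$1,200" -> "1200"
        cleaned["price"] = float(digits) if digits else None
    ts = cleaned.get("timestamp")
    if isinstance(ts, int):
        cleaned["timestamp"] = datetime.fromtimestamp(
            ts, tz=timezone.utc
        ).strftime("%Y-%m-%dT%H:%M:%SZ")
    return cleaned

def append_jsonl(listing: dict, path: str, seen: set) -> bool:
    if listing["id"] in seen:                                  # dedupe by id
        return False
    seen.add(listing["id"])
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(listing, ensure_ascii=False) + "\n")
    return True

row = clean_listing({"id": "743", "price": "$1,200", "timestamp": 1767231091, "condition": None})
```

For database upserts, the same `id` key becomes your unique constraint.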

How to scale Facebook scraping safely?

Running one script works for small projects. Scaling to thousands of posts requires distributing the load.

Run multiple scripts with different proxy sessions

Don't run parallel threads from one script; Facebook detects the burst of requests from a single process, even behind a proxy. Instead, run multiple independent scripts from separate terminals or servers.

With Live Proxies, assign different session IDs:

  • Script 1: username-access_code-sid:1:password

  • Script 2: username-access_code-sid:2:password

  • Script 3: username-access_code-sid:3:password

Each script receives its own IP. If one is blocked, the others continue running.

Split discovery from extraction

Finding URLs (Google search) is slow. Extracting data (fetching posts) is fast. Don't do both in one workflow at scale.

Run 2 separate processes:

  • Discovery script finds URLs, writes them to a text file
  • Extraction scripts read from that file and scrape the posts

If the extraction hits rate limits, you can pause it without losing your URL list.
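The handoff can be as simple as two functions sharing a text file; the filenames here are hypothetical:

```python
# Sketch: discovery and extraction hand off work through a plain text file.
# Filenames are hypothetical; each process runs on its own schedule.

def save_discovered_urls(urls, path="discovered_urls.txt"):
    # Discovery process: append newly found post URLs, one per line
    with open(path, "a", encoding="utf-8") as f:
        for url in urls:
            f.write(url + "\n")

def load_pending_urls(path="discovered_urls.txt", done_path="scraped_urls.txt"):
    # Extraction process: read the queue, skip URLs already scraped
    try:
        with open(done_path, encoding="utf-8") as f:
            done = {line.strip() for line in f}
    except FileNotFoundError:
        done = set()
    with open(path, encoding="utf-8") as f:
        return [u for u in (line.strip() for line in f) if u and u not in done]
```

Because the queue file is append-only, a crashed extraction run can resume from where it left off.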

Space out requests across scripts

Don't have all scripts start at the same time. Stagger them by 30–60 seconds. Add random delays so their request patterns don't align.

Watch for error rates

If you start seeing lots of HTTP 403 or 429 errors, slow down. Add longer delays or reduce the number of concurrent scripts.

For enterprise-scale requirements, see Live Proxies B2B & Enterprise solutions.

What are the best Facebook scraping use cases?

Here are some real-world applications:

Marketplace arbitrage

Facebook Marketplace is a goldmine for mispriced assets, particularly in categories where seller urgency outpaces buyer discovery. The Marketplace scraper can identify listings priced below regional medians, for example, a $1,200 camera listed for $800 in a low-demand area. Resellers use this to map price disparities across cities and find profitable flips.
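As a sketch, flagging listings priced below a fraction of their city's median takes only a few lines, assuming prices were already normalized to floats during cleaning:

```python
from statistics import median

# Sketch: flag listings priced well below the median for their city.
# Assumes prices were normalized to floats (see the cleaning section above).
def find_underpriced(listings, threshold=0.7):
    by_city = {}
    for item in listings:
        by_city.setdefault(item["location"], []).append(item["price"])
    medians = {city: median(prices) for city, prices in by_city.items()}
    return [
        item for item in listings
        if item["price"] < threshold * medians[item["location"]]
    ]

deals = find_underpriced([
    {"id": "1", "location": "San Francisco, CA", "price": 1200.0},
    {"id": "2", "location": "San Francisco, CA", "price": 1150.0},
    {"id": "3", "location": "San Francisco, CA", "price": 600.0},
])
```

Tune the threshold per category; commodity electronics tolerate a tighter cutoff than one-off furniture.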

Brand sentiment tracking

Public comments on brand pages provide an unfiltered, real-time barometer of customer satisfaction. So, use the Post Scraper to ingest comments from a competitor’s page. Instead of just counting reactions, build a Weighted Sentiment Score: track the ratio of "Angry/Sad" vs. "Love/Care". A sudden spike in negative engagement often predicts product defects or PR crises days before they appear in official reports.

Market research in groups

Public groups are where users discuss problems and recommend products. So, use the Group Discovery Script (the Google-based scraper above) to identify posts in groups related to your niche, for example, dog owners discussing food brands or photographers comparing camera gear. Extracting the post text reveals the language people use to describe their needs, which is useful for product positioning or keyword research.

For more scraping guides, check out our tutorials on social media scraping, Instagram scraping, and X/Twitter scraping.

Further reading: How Proxies Help You Scale AI Web Scraping and Data Collection and How to Scrape Walmart: Data, Prices, Products, and Reviews (2025).

Conclusion

Facebook scraping requires combining TLS browser emulation, API reverse engineering, and residential proxies. Direct URL scraping works for single posts. Google discovery solves the feed login wall for pages and groups. Marketplace requires GraphQL replication because scrolling hits a login barrier after 20–30 listings.

Start with the 10-line example to verify your setup works, then scale to the full implementations as needed.

Next steps: Try the code on small datasets first, monitor your success rates, and adjust delays if you hit rate limits. For production scraping at volume, residential proxies with private allocation are essential.

FAQs

Can I scrape private or closed Facebook groups?

No. Private groups require login, which violates Facebook's ToS and results in account bans or legal risk. The Group Discovery Script only works on public groups.

Can I scrape emails from Facebook groups?

No. Facebook doesn't expose user emails in public HTML. Any method claiming to extract emails is either fake or requires account access, which violates the rules described above.

How often should I re-scrape pages?

It depends on page activity. For high-traffic news pages, every 4–6 hours works with residential proxies. For groups or Marketplace categories, once or twice daily is usually enough. Watch for rate limits and back off if you see soft blocks.

Why do some pages suddenly return empty data?

This is a soft block. Facebook serves an empty HTML shell because it flagged your IP or TLS fingerprint. Rotate your proxy session and verify curl_cffi is still using impersonate="chrome".
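Soft blocks are easy to catch programmatically. A minimal sketch, where the size threshold and the `"actors"` marker are heuristics (a real post page embeds that key in its metadata, as in the regex example earlier), not official values:

```python
# Sketch: detect a soft block (empty HTML shell) and rotate the session.
# The size threshold and marker string are heuristics, not official values.

def looks_soft_blocked(html: str) -> bool:
    """A real post page is large and contains post metadata."""
    return len(html) < 5000 or '"actors"' not in html

def next_session(current_sid: int) -> int:
    """Move to a fresh proxy session ID to get a new IP."""
    return current_sid + 1

sid = 1
html = "<html></html>"  # stand-in for a blocked response
if looks_soft_blocked(html):
    sid = next_session(sid)
print(sid)
```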

Do proxies actually help with Facebook scraping?

Yes, but only residential proxies. Facebook recognizes datacenter IPs (AWS, DigitalOcean, etc.) immediately. Residential proxies from providers like Live Proxies assign real ISP IPs, which match Facebook's trust expectations for consumer traffic.

What's the safest way to store scraped data?

Use JSON Lines (.jsonl) format to prevent file corruption if your scraper crashes. Legally, store only the fields you need and avoid retaining personal identifiers unless you have a clear compliance basis.
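The crash-safety benefit comes from appending one complete line per record. A minimal sketch, with a placeholder filename and record:

```python
# Sketch: append each scraped record as one JSON line, so a crash
# corrupts at most the final line, never the whole file.
import json
from pathlib import Path

def append_record(path: str, record: dict) -> None:
    """Write one record per line; flush immediately for crash safety."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
        f.flush()

append_record("posts.jsonl", {"author": "Mark Zuckerberg", "reactions": 1500})
print(Path("posts.jsonl").read_text(encoding="utf-8").splitlines()[-1])
```

Each line parses independently, so a partial final line from a crash can simply be skipped on the next read.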