Live Proxies

How to Scrape Glassdoor Jobs and Reviews with Python in 2026

Learn how to scrape Glassdoor jobs and reviews with Python in 2026 using curl_cffi, BFF endpoints, pagination, cookies, and proxy-ready workflows.

Live Proxies Editorial Team

Content Manager

How To

8 April 2026

Glassdoor is hard to scrape. It sits behind Cloudflare, which protects 22% of all websites and uses TLS fingerprinting to identify non-browser clients. Glassdoor also requires session cookies on its API endpoints and shows a login wall on reviews. Standard Python HTTP libraries are more likely to be challenged due to detectable TLS and HTTP characteristics, though depending on request frequency and IP reputation, they may still work for small-scale testing.

This guide has 2 Python scrapers that handle these problems. curl_cffi impersonates Chrome's TLS fingerprint. A homepage visit through the same session picks up the required cookies. The review API can return structured data using only session cookies, though access may vary by region, IP reputation, and platform changes. Both scrapers call Glassdoor's undocumented BFF endpoints and return structured JSON. No browser automation needed.

TL;DR

  • This guide includes 2 ready-to-use Python scrapers: one for jobs, one for reviews.

  • Both use curl_cffi to bypass Cloudflare's TLS fingerprinting. No Selenium or browser automation needed.

  • Glassdoor's undocumented /bff/ endpoints return structured JSON, including reviews without login.

  • For production use, you need rotating residential proxies.

What Data Can You Scrape From Glassdoor?

You can scrape job listings, company reviews, salaries, interview questions, and company overviews. This guide covers jobs and reviews.

  • Job listings – title, company, location, salary range, posting age (relative, e.g., "3d ago"), easy-apply status, full job descriptions

  • Company reviews – overall rating, sub-ratings (work-life balance, culture & values, compensation & benefits, career opportunities, diversity & inclusion, senior management), pros, cons, advice to management, recommendation status, CEO approval, business outlook

  • Salary data – reported salaries by role and location

  • Interview questions – questions reported by candidates, difficulty ratings, outcomes

  • Company overviews – headquarters, size, revenue, industry, CEO approval rating

All of these are served through /bff/ endpoints. The salary, interview, and overview endpoints work differently and are not covered in this guide.

Is It Legal to Scrape Glassdoor?

Scraping public data is not illegal under US federal law, but Glassdoor's Terms of Service prohibit automated access. Some content is behind a login wall, which makes it more complicated legally. If you use scraped data commercially, consult a lawyer.

Practical guidelines:

  • Do not create accounts or submit credentials to access gated content.

  • Do not collect personal identifying information beyond what is publicly displayed.

  • If you collect data for AI training, check the EU AI Act and California's AB 2013 for data-sourcing requirements.

Why Do People Scrape Glassdoor in 2026?

Most teams scrape Glassdoor for one of 5 reasons: finding sales leads from hiring signals, tracking competitor headcount and strategy, benchmarking employee sentiment, collecting salary data, or building research datasets. The data is useful but not perfect. Glassdoor reviews are self-selected, and companies sometimes encourage positive reviews. Treat it as a signal, not ground truth.

  • Lead generation and job aggregation. Sales teams use job listing data to identify who is actively hiring. Smaller job boards scrape listings to fill their own platforms.

  • Competitive intelligence. Startups scrape competitor job listings to track hiring patterns. If a competitor posts 15 machine learning roles in a month, you know where they are investing.

  • HR and talent intelligence. Companies scrape competitor reviews to benchmark their own ratings. Low Glassdoor scores make hiring harder, so HR teams track competitor sentiment.

  • Market research and salary benchmarking. Compensation analysts need salary data across roles and geographies. Glassdoor is one of the biggest self-reported salary datasets.

  • Academic research. Labor economists and organizational researchers use Glassdoor review data to study workplace culture, pay equity, and management practices.

How to Scrape Glassdoor with Python Step by Step?

Install curl_cffi for TLS impersonation, open a session to get cookies, then POST to Glassdoor's undocumented API endpoints.

  1. Install curl_cffi (the only dependency)

  2. Create a session with impersonate="chrome"

  3. Visit the Glassdoor homepage to collect session cookies

  4. POST to the BFF API endpoint with your search parameters

  5. Parse the JSON response and save to CSV or JSON

What Do You Need to Get Started?

You need Python 3.10+ and curl_cffi. If you are new to Python web scraping, work through a general introduction to the basics first.

Why not requests or httpx? Because Cloudflare's TLS fingerprinting sits in front of Glassdoor. Your request headers might say Chrome, but if the underlying TLS handshake identifies as a Python HTTP client, Cloudflare rejects it.

Glassdoor's Cloudflare protection page. The User-Agent is a standard Chrome string, but the TLS fingerprint exposed the automation.

curl_cffi solves this. It uses curl-impersonate internally, a patched curl build that links against the same TLS libraries as real browsers.

pip install curl_cffi

That is the only dependency.

How do you set up the project?

Create a curl_cffi session that impersonates Chrome's TLS fingerprint:

from curl_cffi import requests as curl_requests

session = curl_requests.Session(impersonate="chrome")
session.headers.update({
    "accept": "*/*",
    "accept-language": "en-GB,en;q=0.7",
    "origin": "https://www.glassdoor.com",
    "referer": "https://www.glassdoor.com/",
    "sec-fetch-dest": "empty",
    "sec-fetch-mode": "cors",
    "sec-fetch-site": "same-origin",
})

The impersonate="chrome" flag handles TLS and HTTP/2 fingerprinting. The headers set origin, referer, and sec-fetch-* values that Glassdoor's API expects. The full scraper also sets sec-ch-ua, sec-ch-ua-mobile, sec-ch-ua-platform, and a full user-agent string. The gist has the complete header set.
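The client-hint headers mentioned above can live in a plain dict that you merge into the session. The header names come from the paragraph above; the values below are illustrative examples, not the gist's exact set, and should match the Chrome version that `impersonate="chrome"` mimics so the hints do not contradict the TLS story:

```python
# Illustrative values only -- align these with the Chrome build curl_cffi
# impersonates, or the client hints will disagree with the TLS fingerprint.
EXTRA_HEADERS = {
    "sec-ch-ua": '"Chromium";v="131", "Google Chrome";v="131", "Not_A Brand";v="24"',
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": '"Windows"',
    "user-agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
    ),
}

# Merge into the session created above:
# session.headers.update(EXTRA_HEADERS)
```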

Further reading: How to Scrape Facebook: Comments, Posts, Groups, Marketplace, and Scraping Tools in 2026 (Guide) and How to Scrape YouTube: A Complete Guide to Videos, Comments, and Transcripts (2026).

How do you make the first API call?

Before calling any API, you need cookies. Glassdoor's BFF endpoints reject requests without a valid session. Skip this step, and you get 403 or empty responses. Visit the homepage first:

base_url = "https://www.glassdoor.com"
session.get(base_url, timeout=15)
# Now session.cookies contains the tokens we need

Once the session has cookies, you can call the job search API. Here is the API call and response in Chrome DevTools:

Chrome DevTools showing the jobSearchResultsQuery endpoint and its JSON response.

The Python equivalent:

search_url = f"{base_url}/job-search-next/bff/jobSearchResultsQuery"

payload = {
    "excludeJobListingIds": [],
    "keyword": "data engineer",
    "locationId": 1147401,      # New York
    "locationType": "CITY",
    "numJobsToShow": 30,
    "pageNumber": 0,
    "pageCursor": "",
    "pageType": "SERP",
    "seoUrl": True,
    "seoFriendlyUrlInput": "new-york-data-engineer-jobs",
    "parameterUrlInput": "IL.0,8_IC1147401_KO9,22",
    "queryString": "",
    "includeIndeedJobAttributes": True,  # includes Indeed cross-posts in results
    "filterParams": [
        {"filterKey": "sortBy", "values": "date_desc"}
    ],
}

resp = session.post(
    search_url,
    json=payload,
    headers={"content-type": "application/json"},
    timeout=20,
)
resp.raise_for_status()
data = resp.json()

The parameterUrlInput string encodes the location and keyword character positions in Glassdoor's internal URL format. The full scraper generates this automatically. If you are building from scratch, copy it from a real Glassdoor search URL in your browser. There is also an originalPageUrl field that combines the SEO slug and parameter URL. Check the gist for how it is built.

The response comes back as nested JSON. The job listings live at data["data"]["jobListings"]["jobListings"] – the key appears twice. Each listing has this structure:

# Each item in the list looks like:
# {"jobview": {"header": {...}, "job": {...}, "overview": {...}}}

jobs = []
for item in data["data"]["jobListings"]["jobListings"]:
    header = item["jobview"]["header"]
    jobs.append({
        "title": header["jobTitleText"],           # "Senior Data Engineer"
        "company": header["employerNameFromSearch"], # "Google"
        "location": header["locationName"],          # "New York, NY"
    })

Note the field names: jobTitleText, not jobTitle. employerNameFromSearch, not employerName.

How do you save the results?

The minimal CSV export:

import csv

if jobs:
    with open("glassdoor_jobs.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=jobs[0].keys())
        writer.writeheader()
        writer.writerows(jobs)

The full scraper uses Python's dataclasses for the data model and adds JSON export that strips fields starting with _.
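A sketch of that pattern, assuming a hypothetical field set (the gist's actual dataclass has more fields): a dataclass holds each job, and an export helper drops any field whose name starts with an underscore before serializing.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class Job:
    title: str
    company: str
    location: str
    _raw: dict = field(default_factory=dict)  # internal, stripped on export

def export_dict(job: Job) -> dict:
    """Drop fields starting with '_' before JSON export."""
    return {k: v for k, v in asdict(job).items() if not k.startswith("_")}

job = Job("Senior Data Engineer", "Google", "New York, NY", _raw={"jobview": {}})
print(json.dumps(export_dict(job)))
```

Keeping the raw API payload in an underscore field is handy for debugging without leaking it into your CSV or JSON output.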

Full scraper run in the terminal:

2 pages scraped, 60 jobs collected, saved to CSV.

The output CSV:

Scraped job data with title, company, location, salary range, posted date, and company rating.

The full job scraper with filters, city resolution, pagination, and export is in one file: glassdoor_scraper.py on GitHub Gist. Download and run:

python glassdoor_scraper.py --keyword "data engineer" --city London --site co.uk --pages 3

For production use, add --proxy and --proxy-sticky flags (explained in the proxy section).

The reviews scraper works the same way. Different endpoint, different payload.

How Does Glassdoor Serve Data?

Glassdoor is a Next.js application. The initial HTML includes some server-rendered content and a __NEXT_DATA__ JSON blob, but the actual job listings and reviews come from separate API calls that the front-end makes after page load.

The difference matters at scale. A headless browser uses 200 MB+ of memory per instance for a full page load. An API call returns a few KB of JSON.

When do you need a browser? Only for Glassdoor page types where no API endpoint exists. Here are the main alternatives:

  • Patchright – drop-in Playwright replacement with automation fingerprints removed. Same API, one-line install change.

  • Camoufox – custom Firefox build with fingerprint spoofing at the C++ level (WebGL, fonts, WebRTC). Note: it had a year-long maintenance gap and is still stabilizing.

  • Pydoll – connects directly to Chrome DevTools Protocol, no WebDriver. Human-like mouse movements and shadow DOM support. More setup than Patchright, but no navigator.webdriver leak.

How does Glassdoor pagination work?

Glassdoor's job search API uses both pageNumber and pageCursor. You need to send both. After each page, the response includes a paginationCursors array. Pass the last cursor to the next request:

# Extract cursor for the next page
cursors = data["data"]["jobListings"]["paginationCursors"]
if isinstance(cursors, list) and len(cursors) > 1:
    next_cursor = cursors[-1].get("cursor", "")

The reviews API is simpler. It uses page numbers directly. You increment page from 1 upward, with a configurable pageSize. Note: the reviews API is 1-indexed (page: 1 is the first page), unlike the job search API which is 0-indexed (pageNumber: 0).

payload = {
    # ... other fields ...
    "page": 2,
    "pageSize": 10,
}

In both cases, add a random delay between pages to avoid rate limiting.

How to Scrape Glassdoor Reviews with Python?

POST to the employer-reviews BFF endpoint with an employerId (from the typeahead API) and a dynamicProfileId (from the employer overview page). The response has ratings, pros, cons, and employment details.

Glassdoor requires sign-in to browse reviews in a browser, but the API endpoint does not.

What fields does the reviews API return?

A single Glassdoor review with the fields the scraper extracts: rating, job title, location, pros, cons, and recommendation status.

The reviews BFF API lives at /bff/employer-profile-mono/employer-reviews and returns these fields per review:

Field Description
reviewId Unique identifier
reviewDateTime When the review was posted
ratingOverall 1 to 5 star rating
ratingWorkLifeBalance 1 to 5 sub-rating
ratingCultureAndValues 1 to 5 sub-rating
ratingDiversityAndInclusion 1 to 5 sub-rating
ratingCareerOpportunities 1 to 5 sub-rating
ratingCompensationAndBenefits 1 to 5 sub-rating
ratingSeniorLeadership 1 to 5 sub-rating
summary Review headline
pros What the reviewer liked
cons What the reviewer disliked
advice Advice to management
jobTitle Reviewer's job title (object with id and text)
location Reviewer's location (object with name)
employmentStatus REGULAR or PART_TIME
isCurrentJob Boolean
ratingRecommendToFriend POSITIVE, NEGATIVE, or NO_OPINION
ratingCeo APPROVE, DISAPPROVE, or NO_OPINION
ratingBusinessOutlook POSITIVE, NEGATIVE, or NO_OPINION
lengthOfEmployment Integer

The scraper converts the 3-value fields (ratingRecommendToFriend, ratingCeo, ratingBusinessOutlook) to booleans: POSITIVE/APPROVE becomes True, everything else becomes False. If you need the NO_OPINION vs NEGATIVE distinction, modify _parse_review() to keep the raw string.
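A sketch of that conversion (the function name and output keys here are illustrative, not the gist's exact `_parse_review`). It also unwraps the object-valued fields: `jobTitle` carries its value under `"text"`, `location` under `"name"`:

```python
def parse_review(raw: dict) -> dict:
    """Flatten one review: unwrap object fields, booleanize 3-value fields."""
    positive = {"POSITIVE", "APPROVE"}
    return {
        "review_id": str(raw.get("reviewId", "")),
        "rating": raw.get("ratingOverall"),
        "job_title": (raw.get("jobTitle") or {}).get("text", ""),
        "location": (raw.get("location") or {}).get("name", ""),
        "recommend": raw.get("ratingRecommendToFriend") in positive,
        "ceo_approval": raw.get("ratingCeo") in positive,
        "business_outlook": raw.get("ratingBusinessOutlook") in positive,
    }

sample = {
    "reviewId": 101, "ratingOverall": 4,
    "jobTitle": {"id": 61502, "text": "Program Manager"},
    "location": {"name": "Seattle, WA"},
    "ratingRecommendToFriend": "POSITIVE",
    "ratingCeo": "NO_OPINION",            # collapses to False
    "ratingBusinessOutlook": "NEGATIVE",  # also False
}
print(parse_review(sample))
```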

The employer-reviews endpoint in DevTools showing the raw JSON structure.

How do you get the employerId?

Use Glassdoor's typeahead API:

company = "Amazon"
url = (
    f"{base_url}/searchsuggest/typeahead"
    "?numSuggestions=8&source=GD_V2&version=NEW&rf=full"
    f"&fallback=token&input={company}"
)

resp = session.get(url, timeout=15)
resp.raise_for_status()
for item in resp.json():
    if item.get("category", "").lower() in ("company", "employer", "employers"):
        employer_id = item.get("employerId") or item.get("id")  # 6036 for Amazon
        break

How do you get the dynamicProfileId?

Visit the employer's overview page and extract it from the embedded JSON:

import re

# Basic slug from company name -- covers most cases. The scraper falls back if this URL 404s.
slug = company_name.replace(" ", "-").replace("&", "and")
overview_url = f"{base_url}/Overview/Working-at-{slug}-EI_IE{employer_id}.htm"
resp = session.get(overview_url, timeout=20)

match = re.search(r'"dynamicProfileId"\s*:\s*(\d+)', resp.text)
if not match:
    raise ValueError("dynamicProfileId not found. The overview page HTML may have changed")
profile_id = int(match.group(1))  # 733609 for Amazon

The full scraper falls back to profileId and the reviews page HTML if the overview page regex misses.

Then you can fetch reviews:

reviews_url = f"{base_url}/bff/employer-profile-mono/employer-reviews"

payload = {
    "applyDefaultCriteria": True,
    "employerId": 6036,
    "dynamicProfileId": 733609,
    "employmentStatuses": ["REGULAR", "PART_TIME"],
    "jobTitle": None,
    "goc": None,
    "location": {},
    "onlyCurrentEmployees": False,
    "overallRating": None,          # Always None -- filter ratings client-side after fetching
    "pageSize": 10,
    "page": 1,                      # Increment for pagination
    "sort": "DATE",
    "textSearch": "",
    "worldwideFilter": False,
    "enableKeywordSearch": False,    # Set to True when textSearch is non-empty
    "defaultLanguage": "eng",
    "language": "eng",
}

resp = session.post(reviews_url, json=payload, timeout=20)
resp.raise_for_status()
reviews = resp.json()["data"]["employerReviews"]["reviews"]

To paginate, increment page and repeat. Stop when the response returns an empty reviews array.
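That loop can be sketched like this, with the page fetch injected as a callable so the pacing and stop condition stay visible (`fetch_page` stands in for the POST shown above):

```python
import random
import time

def fetch_all_reviews(fetch_page, max_pages=50, delay=(1.5, 3.5)):
    """Call fetch_page(1), fetch_page(2), ... until a page comes back empty.

    fetch_page(page) should POST the payload with that page number and
    return the parsed reviews list. max_pages is a safety cap.
    """
    collected = []
    for page in range(1, max_pages + 1):
        reviews = fetch_page(page)
        if not reviews:
            break  # empty array means we are past the last page
        collected.extend(reviews)
        time.sleep(random.uniform(*delay))  # pace requests between pages
    return collected
```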

How do you clean and deduplicate review data?

Watch for these in production:

The jobTitle field is not a string. It is an object like {"id": 61502, "text": "Program Manager"}. Extract jobTitle["text"].

The location field is also an object. Extract location["name"] (not "text" like jobTitle).

Glassdoor's pagination can return overlapping results. Track reviewId values you have already seen:

seen_ids = set()
cleaned_reviews = []
for review in raw_reviews:
    rid = str(review["reviewId"])
    if rid not in seen_ids:
        seen_ids.add(rid)
        cleaned_reviews.append(_parse_review(review))

Strip HTML from text fields. The pros, cons, and advice fields can contain HTML entities and tags. Use regex and html.unescape() to clean them:

from html import unescape
import re

def strip_html(text):
    text = re.sub(r"<[^>]+>", " ", text)
    text = unescape(text)
    return re.sub(r"\s+", " ", text).strip()

The full reviews scraper with employer lookup, profile ID resolution, all filters, and export logic: glassdoor_reviews_scraper.py on GitHub Gist. Download and run:

python glassdoor_reviews_scraper.py --company Amazon --pages 3 --format json

Add --proxy and --proxy-sticky for production use. Same setup as the job scraper.

How to Scrape Jobs From Glassdoor?

POST to the jobSearchResultsQuery BFF endpoint with keyword, location, and filter parameters. The API supports sorting, salary range, remote/on-site, posting age, and regional sites.

How do you build the search query?

Glassdoor job search page with filter options.

These filters are supported by the API:

CLI Flag API Parameter Values
--sort sortBy relevant, date
--work-type remoteWorkType remote, onsite
--easy-apply applicationType flag (no value needed)
--rating minRating any, 1, 2, 3, 4
--min-salary minSalary integer
--max-salary maxSalary integer
--posted fromAge any, 1d, 3d, 1w, 2w, 1m

The Values column shows what you type on the CLI. The scraper maps these to internal API values automatically (e.g., date becomes date_desc, remote becomes 1).

These go into the filterParams array in the payload:

filters = [{"filterKey": "sortBy", "values": "date_desc"}]

if work_type == "remote":
    filters.append({"filterKey": "remoteWorkType", "values": "1"})

if min_salary:
    filters.append({"filterKey": "minSalary", "values": str(min_salary)})

Multi-region support. Glassdoor has regional sites on different domains: glassdoor.com, glassdoor.co.in, glassdoor.sg, glassdoor.co.uk, and others. Each domain has its own country ID for API payloads. Using a proxy geolocated to the target region helps avoid Cloudflare challenges. The scraper maps domains to country IDs:

SITES = {
    "co.in": 115,   # India
    "sg": 217,      # Singapore
    "com": 1,       # US
    "co.uk": 2,     # UK
    "ca": 3,        # Canada
    "com.au": 4,    # Australia
    # Also: "de": 96, "fr": 84, "com.hk": 103, "co.nz": 160
}

City name resolution. Glassdoor uses internal location IDs. You can resolve city names to IDs with their autocomplete endpoint:

url = f"{base_url}/findPopularLocationAjax.htm?maxLocationsToReturn=10&term=London"
resp = session.get(url, timeout=15)
top = resp.json()[0]
location_id = int(top["locationId"])       # 2671300 for London
loc_code = top.get("locationType", "C")    # "C" city, "S" state, "N" country, "M" metro

LOC_TYPE_MAP = {"C": "CITY", "S": "STATE", "N": "COUNTRY", "M": "METRO"}
location_type = LOC_TYPE_MAP.get(loc_code, "CITY")

Pass location_type (mapped to "CITY", "STATE", "COUNTRY", or "METRO") in the search payload's locationType field. If you search for "California" and send locationType: "CITY", you get wrong or empty results.

How do you get full job descriptions?

The search API returns summary data. For full job descriptions, there is a separate details endpoint:

details_url = f"{base_url}/job-listing/api/job-details"

params = {
    "jobListingId": job_id,
    "pageTypeEnum": "SERP",
    "countryId": str(SITES[site]),
}
if job_link:
    params["queryString"] = job_link   # from search results header.jobLink

resp = session.get(details_url, params=params, timeout=20)
description = resp.json().get("description", "")

The queryString parameter expects the jobLink path from the search results. Include it when available. Some responses return errors without it.

What Are the Best Glassdoor Scraping Tools Without Python?

If you do not want to write code, there are hosted scrapers and browser extensions. They work, but with limits.

Hosted scrapers and no-code tools

Apify has 15+ community-built Glassdoor actors (none official). They cover jobs, reviews, and salaries. Several have already been deprecated, and the active ones have open bug reports about data caps and scraping failures. You depend on community developers to fix them when Glassdoor changes.

Octoparse is a desktop scraping app with optional cloud execution on paid plans (from $119/month). It has pre-built Glassdoor templates for jobs and reviews. You configure extraction rules in a point-and-click UI. The free plan is local-only with no proxy rotation, which means Glassdoor will block you quickly.

Good for one-time pulls or teams that do not write Python. For ongoing scraping, these tools cost more per record. A Python scraper with rotating residential proxies is cheaper at volume because you pay for bandwidth, not result count.

Browser extensions

Chrome extensions like Instant Data Scraper and Data Miner auto-detect tabular data on visible pages. They work inside your browser session, so if you are logged in they can see reviews. But they only scrape the rendered DOM, not the structured JSON from BFF API endpoints. They do not scale to multiple companies or pages, and they have no proxy rotation.

Python frameworks

If you want a batteries-included alternative to assembling curl_cffi and parsing manually, Scrapling is a full scraping framework with built-in TLS impersonation, Cloudflare bypass, and adaptive element tracking. It has 3 fetcher types: Fetcher (HTTP-only via curl_cffi), StealthyFetcher (stealth browser with fingerprint spoofing), and DynamicFetcher (full browser automation).

How Do You Avoid Blocks During Glassdoor Scraping?

You get blocked because of request pacing, proxy issues, or session problems. In a 2026 survey by Apify and The Web Scraping Club, 59% of respondents said Cloudflare was the most common anti-bot system on their target sites.

How fast can you send requests?

Glassdoor does not always return 429 when throttling. It can respond with 200 and empty data instead. Check your results, not just status codes.

The scrapers default to 1.5–3.5 seconds of random delay between requests.

import time, random

time.sleep(random.uniform(1.5, 3.5))

What proxies do you need?

Without rotation, a single IP gets flagged quickly. You need rotating residential proxies, not datacenter.

Cloudflare blocks datacenter IP ranges. Residential proxies often achieve higher acceptance rates because they originate from ISP-issued IP ranges, though performance depends on IP quality and request behavior. A pool of 200 rotating proxies gives you around 300 unique IPs over 30 days.

Mobile proxies from carriers like T-Mobile are another option if residential IPs start getting challenged on specific regional sites.

With Live Proxies, adding a session ID suffix (-1, -2) pins requests to one IP for up to 24 hours. Remove the suffix to rotate per request:

# Example credentials -- replace with your own from the Live Proxies dashboard.
# Sticky session for cookie initialization (same IP for the whole setup)
# Note: "https" value starts with http:// -- this is the proxy protocol, not the target.
proxies_sticky = {
    "http": "http://LV00000000-ACCESSCODE-1:[email protected]:7383",
    "https": "http://LV00000000-ACCESSCODE-1:[email protected]:7383",
}

# Rotating session for data collection (new IP per request, no session ID)
proxies_rotating = {
    "http": "http://LV00000000-ACCESSCODE:[email protected]:7383",
    "https": "http://LV00000000-ACCESSCODE:[email protected]:7383",
}

# Use sticky proxy so cookies stay on one IP
session = curl_requests.Session(impersonate="chrome", proxies=proxies_sticky)
# Set browser headers before the homepage visit (see project setup section for the full set)
session.headers.update({"origin": base_url, "referer": f"{base_url}/"})
session.get(base_url, timeout=15)

# Then switch to rotating for the actual scraping
session.proxies = proxies_rotating

Note: The examples use b2b.liveproxies.io:7383 (B2B endpoint). Your hostname may differ depending on your plan. Check your dashboard for the correct hostname.

Some providers share IPs across customers hitting the same sites. Those IPs get blocked fast.

If your proxy is not working, the two most common causes are:

  1. Sticky vs rotating. If you are still on the sticky proxy after the initial session setup, you are sending all traffic through one IP. Switch to the rotating proxy dict after the initial session.get(base_url).

  2. Credentials and port. Verify your auth string and port against your provider's dashboard.

How do sessions and cookies work?

Key behaviors to know:

  • Cookie lifetime. Sessions last several hours but can be invalidated if you send too many requests too fast.

  • Cookie-IP binding. If a session cookie was set on one IP and later requests come from a different one, this may increase challenge probability, particularly when combined with inconsistent request patterns. The scrapers handle this with separate --proxy-sticky and --proxy flags.

  • Detecting expiry vs. block. A 403 after many successful requests usually means session expiry. Create a new session and visit the homepage again to fix it. A 403 on the first request means your IP is blocked. Switch to a different proxy.
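The last bullet can be encoded as a small helper. The heuristic mirrors the text above; the function name and the zero-success threshold are our own:

```python
def classify_403(successes_so_far: int) -> str:
    """Interpret a 403 from Glassdoor based on how far the session got.

    "blocked_ip": the very first request failed -- switch to another proxy.
    "expired_session": failures after successes -- rebuild the session and
    visit the homepage again to pick up fresh cookies.
    """
    return "blocked_ip" if successes_so_far == 0 else "expired_session"
```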

What Are Common Glassdoor Scraping Errors and How Do You Fix Them?

Most errors come from stale sessions, wrong API parameters, or response structure changes. Here are the ones you will hit first.

Response structure changes

If the API response structure changes, use defensive extraction with fallback paths:

def extract_jobs(data):
    """Try multiple paths -- Glassdoor sometimes nests differently between requests."""
    for path in [
        lambda d: d["data"]["jobListings"]["jobListings"],
        lambda d: d["data"]["jobListings"],
        lambda d: d["jobListings"]["jobListings"],
        # The full scraper checks additional paths -- see glassdoor_scraper.py
    ]:
        try:
            result = path(data)
            if isinstance(result, list) and result:
                return result
        except (KeyError, AttributeError, TypeError):
            continue
    return []

Timeouts

The scrapers catch and log exceptions but do not retry. Add exponential backoff if you need retries:

import logging
import time

from curl_cffi.requests.exceptions import RequestException

log = logging.getLogger(__name__)

for attempt in range(3):
    try:
        resp = session.post(url, json=payload, timeout=20)
        if resp.status_code == 403:
            log.error("403 -- session burned, start a new session instead of retrying")
            break  # do not retry 403s, the session is dead
        resp.raise_for_status()
        break
    except RequestException as e:
        if attempt < 2:
            time.sleep(2 ** attempt)  # wait 1s, then 2s
        else:
            log.error("Failed after 3 attempts: %s", e)
            raise

Rate limiting

If your scraper suddenly returns zero results on 200 responses, you are being throttled. Stop, wait 10–15 minutes, and restart with a fresh session and a different proxy IP.
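A minimal detector for that pattern, flagging consecutive empty-but-successful pages (the two-page threshold is an assumption; tune it for your workload):

```python
class ThrottleDetector:
    """Flag throttling after N consecutive 200 responses with zero results."""

    def __init__(self, threshold: int = 2):
        self.threshold = threshold
        self.empty_streak = 0

    def record(self, status_code: int, results: list) -> bool:
        """Feed each page's status and parsed results; True means back off."""
        if status_code == 200 and not results:
            self.empty_streak += 1
        else:
            self.empty_streak = 0
        return self.empty_streak >= self.threshold
```

When `record` returns True, stop, wait, and restart with a fresh session and a different proxy IP as described above.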

Common API parameter errors

  • 400 Bad Request on job details. You are missing the queryString parameter. Pass header.jobLink from search results.

  • Reviews API returning 400. The API does not support server-side rating filtering. Always send None and filter client-side after fetching.

  • Empty jobTitle or employerName fields. Wrong field names. The correct names are jobTitleText and employerNameFromSearch.

  • dynamicProfileId not found. The overview page structure may have changed. The scraper has fallback patterns; see the reviews section above.

  • 403 or empty responses after extended scraping. Your session cookies have expired. Create a new session and visit the homepage again on a sticky proxy IP.

How Do You Store and Validate Scraped Glassdoor Data?

Both scrapers support --format csv, --format json, and --format both. Use CSV for spreadsheets, JSON for code.

What quality checks should you run?

After scraping, check the data before you use it.

  • Every job should have job_id, title, and company. Every review should have review_id and date. If these are missing, something went wrong with the parsing.

  • Overall ratings are stored as strings from "1.0" to "5.0". Some reviews have an empty string instead. That means the reviewer did not provide a rating. The scraper converts 0 and null API values to empty strings. Do not treat them as zero in your analysis.

  • Pagination can return overlapping results, so check for duplicate IDs across pages. The scrapers track seen IDs in a set, but verify this in your output if you modify the code.

  • Text fields like pros, cons, and job descriptions sometimes have leftover HTML tags. Both scrapers run strip_html on these fields, but scan your output if you are adding new fields.
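The first two checks can run as a single pass over the output. The snake_case keys here assume the scrapers' export format; adjust them to your actual columns:

```python
def check_reviews(reviews):
    """Return (rows missing required fields, average of non-empty ratings)."""
    missing = [r for r in reviews if not r.get("review_id") or not r.get("date")]
    # Ratings are strings "1.0".."5.0", or "" when the reviewer gave none.
    # Skip empty strings when averaging -- never treat them as zero.
    rated = [float(r["rating"]) for r in reviews if r.get("rating")]
    avg = sum(rated) / len(rated) if rated else None
    return missing, avg

rows = [
    {"review_id": "1", "date": "2026-01-03", "rating": "4.0"},
    {"review_id": "2", "date": "2026-01-04", "rating": ""},  # no rating given
]
missing, avg = check_reviews(rows)
print(missing, avg)
```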

How to Scale Glassdoor Scraping Safely for Larger Datasets?

Scale horizontally with independent sessions, not vertically with more concurrent requests. Each scraper instance gets its own proxy and session. It handles a different company or location and runs at a conservative request rate. This way you avoid shared-state problems from concurrent requests in one process.

What is the best concurrency strategy?

Use different session IDs (-1, -2, -3) for --proxy-sticky so each instance gets cookies on a different dedicated IP, then rotates through the shared pool for data collection:

# Terminal 1
python glassdoor_scraper.py --city London --keyword "data engineer" --site co.uk --pages 5 \
  --proxy http://LV00000000-ACCESSCODE:[email protected]:7383 \
  --proxy-sticky http://LV00000000-ACCESSCODE-1:[email protected]:7383

# Terminal 2
python glassdoor_scraper.py --city "New York" --keyword "data engineer" --site com --pages 5 \
  --proxy http://LV00000000-ACCESSCODE:[email protected]:7383 \
  --proxy-sticky http://LV00000000-ACCESSCODE-2:[email protected]:7383

# Terminal 3
python glassdoor_reviews_scraper.py --company Amazon --pages 10 --site com \
  --proxy http://LV00000000-ACCESSCODE:[email protected]:7383 \
  --proxy-sticky http://LV00000000-ACCESSCODE-3:[email protected]:7383

Each sticky session holds for up to 24 hours. After the initial setup, all instances share the rotating pool.

What should you monitor?

If your success rate per page drops below ~90%, something changed. A sudden increase in response time usually means you are being throttled. More than ~5% empty pages suggests your session or proxy has been flagged. If unique records per page drops, you have a pagination problem.

For a daily health-check script, use a stable IP. A static residential proxy stays the same for weeks. That eliminates IP rotation as a variable. If the check fails, you know it is an API change.
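A health summary over per-page stats makes those thresholds concrete. The record shape and threshold comments are illustrative, not part of the scrapers:

```python
def page_health(pages):
    """pages: list of {"records": int, "elapsed": float}, one dict per page."""
    total = len(pages) or 1  # avoid division by zero on an empty run
    empty = sum(1 for p in pages if p["records"] == 0)
    return {
        "success_rate": (len(pages) - empty) / total,  # alert below ~0.90
        "empty_ratio": empty / total,                  # alert above ~0.05
        "avg_elapsed": sum(p["elapsed"] for p in pages) / total,  # watch spikes
    }

stats = page_health([
    {"records": 30, "elapsed": 1.0},
    {"records": 0, "elapsed": 5.0},  # empty page, slow response
])
print(stats)
```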

What Should You Do Next?

Download both scrapers: glassdoor_scraper.py for jobs and glassdoor_reviews_scraper.py for reviews. They work without proxies for initial testing.

For production, add rotating residential proxies. The scrapers already support --proxy and --proxy-sticky flags, so the switch is one line. Live Proxies supports the sticky and rotating setup shown above.

Further reading: How to Scrape Walmart: Data, Prices, Products, and Reviews and How to Scrape Yahoo Finance Using Python and Other Tools.

Conclusion

That covers both scrapers. The approach is the same for jobs and reviews – curl_cffi handles the TLS part, you grab cookies from the homepage, and then hit the BFF endpoints directly. No browser needed.

If you are just testing, proxies are optional. But for real scraping across many companies, you will need rotating residential proxies. Use a sticky one for the cookie setup, then switch to rotating. And keep in mind, these are undocumented endpoints. They can break anytime without warning.

FAQs

Does Glassdoor have an API?

Glassdoor shut down its official public API in 2021. There is no supported API for external developers. The scrapers in this guide use Glassdoor's internal BFF (Backend for Frontend) endpoints. These are undocumented APIs that the front-end calls to load data. They can change without notice.

What Glassdoor data is safest to collect?

Job titles, company names, salary ranges, ratings, review text. Avoid anything that could identify individual reviewers. The French CNIL fined Kaspr EUR 240,000 in 2024 for scraping LinkedIn contact details. If you collect data for AI training, the EU AI Act (enforcement from August 2026) and California's AB 2013 require documenting training data sources.

Why do scrapers get different results for the same company page?

Glassdoor personalizes results based on IP geolocation, session cookies, A/B test assignment, and CDN caching. Different sessions can get different response structures or slightly stale data. If you are comparing regional sites (.com vs .co.in), the data is actually different. Each regional site has its own dataset.

Should you store raw API responses as well as parsed data?

Yes, if storage costs permit. When you later discover a field you did not extract, you can re-parse the raw JSON without re-scraping. The scrapers do not save raw responses. Add json.dump(resp.json(), f) before parsing if you want them. Raw responses may include fields you should not keep under GDPR.

Do you need proxies to scrape Glassdoor?

For a few test requests from your home IP, no. For sustained scraping, yes. Cloudflare's rate limits and IP reputation checks flag a single IP quickly. Use a sticky proxy for the initial cookie setup, then switch to rotating proxies for data collection.

How do you handle missing or empty fields in scraped Glassdoor data?

Sub-ratings, advice to management, and salary ranges are commonly blank. The scrapers output empty strings instead of 0 for missing ratings. Filter these out before averaging.

Can you scrape Glassdoor without logging in?

Yes. The BFF API endpoints serve data without login credentials. You only need session cookies from a homepage visit. curl_cffi handles this automatically.

Can you scrape Glassdoor salary data with Python?

The job scraper extracts salary ranges (min and max pay) from search results. Neither scraper covers Glassdoor's dedicated salary pages. Those use a different API endpoint with a different request structure. If you need detailed salary breakdowns by role and location, you need to reverse-engineer the salary BFF endpoint separately.

What is the difference between scraping Glassdoor with Selenium and the API approach in this guide?

Selenium and Playwright scrape the rendered page. The API approach is faster and uses less memory because you skip the full page load. The downside is that undocumented endpoints can break without warning.

How do you handle Glassdoor API changes or breaking updates?

The scraper's multi-path extraction handles minor response structure changes. For major changes (new endpoint URL, different payload format), open browser DevTools, find the new BFF request, and update the endpoint and payload in the scraper.