Quick verdict: Glassdoor uses DataDome at the edge + a soft login wall on full reviews/salaries. Public preview data (top of page) is scrape-able with rotating residential proxies + Playwright. Full review text and detailed salary data require an account. Use Playwright (not requests) because Glassdoor is heavily JS-rendered. Plan for ~30-40% block rate without LTE mobile proxies.
| Data type | Login needed? | Difficulty |
|---|---|---|
| Job listings (search results) | No | Medium — DataDome on /Job/ |
| Company overview (size, industry, HQ) | No | Easy |
| Average salary by role (preview) | No | Medium |
| Detailed salary breakdown (full distribution) | Yes | Hard |
| Top 3 reviews (preview) | No | Medium |
| Full reviews (paginated) | Yes | Hard — account ban risk |
| Interview questions | Partial | Hard |
Public data (no login) follows the hiQ v. LinkedIn precedent: legal in the US, but contractually prohibited by Glassdoor's ToS. Scraping behind a login is far riskier: account creation under false pretenses, clear ToS violations, and possible CFAA exposure because the data sits behind an access gate. For commercial use, Glassdoor offers a paid API for partners.
Never scrape personal information of reviewers (even if visible). GDPR / CCPA risk.
Glassdoor uses DataDome (the same protection as Reddit and Hermes). DataDome scores requests at four layers; see bypass DataDome for the full breakdown. For Glassdoor specifically:
`requests` fails the TLS fingerprint check. Install Playwright and its Chromium build:

```shell
pip install playwright
python -m playwright install chromium
```
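Before the full script, it helps to be able to tell a DataDome challenge page from real content, so you rotate the session instead of parsing a block page. A minimal guard — the marker strings are assumptions based on DataDome's usual captcha-delivery domain and cookie name, not guaranteed for Glassdoor:

```python
def looks_blocked(html: str) -> bool:
    """Heuristic: does this HTML look like a DataDome challenge page?"""
    markers = (
        "captcha-delivery.com",  # assumed: DataDome typically serves its challenge from this domain
        "datadome",              # assumed: cookie/script name that shows up in block pages
    )
    lowered = html.lower()
    return any(m in lowered for m in markers)
```

Call it on `page.content()` right after navigation and bail out (new proxy session, new context) rather than scraping a challenge page.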
```python
from playwright.sync_api import sync_playwright
import time, random, json

PROXY_USER = "your_user"
PROXY_PASS = "your_pass"
PROXY_HOST = "gw.spyderproxy.com"
PROXY_PORT = 8000

def scrape_glassdoor_jobs(query, location, max_pages=3):
    session_id = random.randint(0, 100000)
    jobs = []
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            proxy={
                "server": f"http://{PROXY_HOST}:{PROXY_PORT}",
                "username": f"{PROXY_USER}-session-{session_id}",
                "password": PROXY_PASS,
            },
        )
        ctx = browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                       "(KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
            viewport={"width": 1920, "height": 1080},
            locale="en-US",
        )
        page = ctx.new_page()
        url = f"https://www.glassdoor.com/Job/{location}-{query}-jobs.htm"
        page.goto(url, wait_until="domcontentloaded", timeout=45000)
        time.sleep(random.uniform(3, 6))
        for _ in range(max_pages):
            page_jobs = extract_jobs_from_page(page)
            jobs.extend(page_jobs)
            # Scroll to load more (Glassdoor uses infinite scroll on jobs)
            page.evaluate("window.scrollBy(0, 1500)")
            time.sleep(random.uniform(2, 4))
        browser.close()
    return jobs

def extract_jobs_from_page(page):
    """Pull job cards out of the rendered listing page."""
    # Glassdoor's class names are CSS-module suffixed (e.g. JobsList_jobListItem__xxxxx),
    # so match on the stable prefix rather than the exact class.
    cards = page.query_selector_all("li[class*='JobsList_jobListItem']")
    out = []
    for c in cards:
        title_el = c.query_selector("a[class*='JobCard_jobTitle']")
        company_el = c.query_selector("span[class*='EmployerProfile_compactEmployerName']")
        location_el = c.query_selector("div[class*='JobCard_location']")
        salary_el = c.query_selector("div[class*='JobCard_salaryEstimate']")
        if not title_el:
            continue
        out.append({
            "title": title_el.inner_text().strip(),
            "company": company_el.inner_text().strip() if company_el else None,
            "location": location_el.inner_text().strip() if location_el else None,
            "salary_estimate": salary_el.inner_text().strip() if salary_el else None,
            "url": title_el.get_attribute("href"),
        })
    return out

if __name__ == "__main__":
    jobs = scrape_glassdoor_jobs("python-developer", "San-Francisco-CA", max_pages=3)
    print(json.dumps(jobs[:5], indent=2))
```

Without login, Glassdoor shows the top 3 reviews per company. Bigger samples need an account.
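The jobs URL in the script expects pre-slugged values like `python-developer` and `San-Francisco-CA`. A small helper for building them — the URL pattern itself is inferred from the example above, not a documented format:

```python
import re

def slugify(text: str) -> str:
    """Collapse runs of non-alphanumerics into single hyphens for Glassdoor-style URL slugs."""
    return re.sub(r"[^A-Za-z0-9]+", "-", text.strip()).strip("-")

# slugify("python developer")  -> "python-developer"
# slugify("San Francisco, CA") -> "San-Francisco-CA"
```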
```python
def _text(el, selector):
    """inner_text of the first match under el, or None."""
    node = el.query_selector(selector)
    return node.inner_text() if node else None

def scrape_company_preview(company_url):
    """Public preview: top 3 reviews + overall rating."""
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            proxy={
                "server": f"http://{PROXY_HOST}:{PROXY_PORT}",
                "username": f"{PROXY_USER}-session-{random.randint(0, 99999)}",
                "password": PROXY_PASS,
            },
        )
        page = browser.new_page()
        page.goto(company_url, wait_until="domcontentloaded")
        time.sleep(random.uniform(3, 6))
        reviews = page.query_selector_all("[data-test='employer-review']")
        result = {
            "rating": _text(page, "[data-test='rating']"),
            "review_count": _text(page, "[data-test='reviewCount']"),
            "reviews": [{
                "headline": _text(r, "h2"),
                "rating": _text(r, "[data-test='review-rating']"),
                "pros": _text(r, "[data-test='pros']"),
                "cons": _text(r, "[data-test='cons']"),
            } for r in reviews[:3]],
        }
        browser.close()
    return result
```

After 3-5 page views as an anonymous user, Glassdoor often shows a "Get more reviews" interstitial that requires login. Hitting it kills your scrape. Mitigations:
- Reset identity between companies: clear cookies (`ctx.clear_cookies()`) or open a fresh browser context on a new proxy session.

Logging in to scrape more is a ToS violation, risks account bans, and creates personal-data exposure (your account is tied to the scraping). Not recommended.
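The rotation mechanic can be as simple as never reusing a proxy session ID across companies. A sketch building on the `-session-` username convention the scripts above already use (the gateway format is taken from those examples, not from any provider's docs):

```python
import itertools

class SessionRotator:
    """Hand out a fresh sticky-session proxy username per company."""
    def __init__(self, user: str):
        self.user = user
        self._counter = itertools.count(1)

    def next_username(self) -> str:
        # A new session ID makes the gateway assign a new exit IP,
        # so each company page sees a first-time visitor.
        return f"{self.user}-session-{next(self._counter)}"
```

Launch a new browser (or at least a new context after `ctx.clear_cookies()`) with each username, so cookies and exit IP rotate together.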
Glassdoor + DataDome is a tough target. Recommended: stick to the public preview data, run Playwright through rotating residential (or LTE mobile) proxies with a fresh session per company, keep human-like delays between actions, and stay out of the logged-in area.
Related: scrape Indeed, bypass DataDome, scrape LinkedIn safely.