Quick verdict: Indeed is protected by Cloudflare plus custom anti-bot rules, and plain `requests` gets a 403 within seconds. The reliable pattern in 2026: rotating residential proxies + realistic headers + 2-5s delays + parsing the JSON embedded in the page (Indeed embeds job data as JSON inside a script tag rather than exposing it in the rendered HTML). Plan for a ~20-30% block rate even with that setup, and build retries accordingly.
Scraping public Indeed listings is supported by hiQ v. LinkedIn (9th Cir., 2022): accessing public-facing, non-login data is likely not unauthorized access under the CFAA. Indeed's ToS still prohibits automated access, but that is a contract issue, not a criminal one. The practical risks are an account ban (if logged in) and an IP ban (always). Never scrape behind a login. Respect the robots.txt sections that explicitly disallow crawling. For commercial use, Indeed offers an official paid API, which is the preferred option for production.
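Checking robots.txt before crawling is easy to automate with the standard library. A minimal sketch; the rules below are hypothetical, for illustration only — fetch and parse the site's real /robots.txt in practice:

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Check an already-fetched robots.txt body against a target URL."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

# Hypothetical rules for illustration -- not Indeed's actual robots.txt.
RULES = """\
User-agent: *
Disallow: /rc/clk
Disallow: /viewjob
"""
```
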
Search URL pattern:
https://www.indeed.com/jobs?q=KEYWORDS&l=LOCATION&start=OFFSET

- q — job title or keywords (URL-encoded)
- l — location (city, state, ZIP)
- start — pagination offset (10 results per page)
- fromage=N — posted within the last N days (1, 3, 7, 14)
- jt=fulltime — job type (fulltime, parttime, contract, internship, temporary)
- radius=N — search radius in miles

Example: https://www.indeed.com/jobs?q=python+developer&l=Remote&fromage=3&jt=fulltime
For non-US markets, swap the domain: www.indeed.co.uk, de.indeed.com, au.indeed.com, etc.
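A small helper can assemble the search URL for any market. A sketch; the domain map covers only the examples named above, and other markets follow the same pattern:

```python
from urllib.parse import urlencode

# Country code -> Indeed domain, per the examples above.
INDEED_DOMAINS = {
    "us": "www.indeed.com",
    "uk": "www.indeed.co.uk",
    "de": "de.indeed.com",
    "au": "au.indeed.com",
}

def search_url(query, location, country="us", page=0, **filters):
    """Build an Indeed search URL; extra filters (fromage, jt, radius) pass through."""
    params = {"q": query, "l": location, "start": page * 10, **filters}
    return f"https://{INDEED_DOMAINS[country]}/jobs?{urlencode(params)}"
```
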
Use Premium Residential ($2.75/GB, sticky sessions up to 8 hours) for Indeed. Sticky sessions matter because they keep one IP across a paginated search, and at 200-500 KB per page load (HTML plus embedded JSON), 1 GB covers roughly 3,000 pages.
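The bandwidth math is easy to sanity-check. A quick estimator; the $2.75/GB rate is the plan price quoted above, and the 300 KB average page size is an assumption inside the stated 200-500 KB range:

```python
def scrape_cost(pages, avg_page_kb=300, usd_per_gb=2.75):
    """Estimate proxy bandwidth cost in USD for a scrape run."""
    gb = pages * avg_page_kb / 1_000_000  # KB -> GB (decimal units)
    return gb * usd_per_gb
```

At those defaults, ~3,000 pages costs about $2.50 in bandwidth.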
```python
import requests, time, random, json, re
from urllib.parse import urlencode
from bs4 import BeautifulSoup

PROXY_USER = "your_username"
PROXY_PASS = "your_password"
PROXY_GW = "gw.spyderproxy.com:8000"

def proxy_for_session(session_id):
    """Sticky-session proxy: same IP for this session_id for up to 8h."""
    return {
        "http": f"http://{PROXY_USER}-session-{session_id}:{PROXY_PASS}@{PROXY_GW}",
        "https": f"http://{PROXY_USER}-session-{session_id}:{PROXY_PASS}@{PROXY_GW}",
    }

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}
```
```python
def search_indeed(query, location="Remote", max_pages=5):
    session_id = random.randint(0, 100000)
    proxies = proxy_for_session(session_id)
    s = requests.Session()
    s.headers.update(HEADERS)
    jobs = []
    for page in range(max_pages):
        params = {"q": query, "l": location, "start": page * 10}
        url = f"https://www.indeed.com/jobs?{urlencode(params)}"
        try:
            r = s.get(url, proxies=proxies, timeout=20)
            if r.status_code != 200:
                print(f"  page {page}: {r.status_code} (rotating session)")
                session_id = random.randint(0, 100000)
                proxies = proxy_for_session(session_id)
            else:
                jobs.extend(parse_indeed_page(r.text))
        except requests.RequestException as e:
            print(f"  page {page}: {e}")
        time.sleep(random.uniform(2.0, 5.0))  # human-like delay, even after a block
    return jobs
```
```python
def parse_indeed_page(html):
    """Indeed embeds job data as JSON in window.mosaic.providerData."""
    m = re.search(
        r'window\.mosaic\.providerData\["mosaic-provider-jobcards"\]\s*=\s*(\{.+?\});',
        html, re.DOTALL)
    if not m:
        # fall back to HTML parsing if the JSON path changes
        return parse_indeed_html_fallback(html)
    data = json.loads(m.group(1))
    results = (data.get("metaData", {})
                   .get("mosaicProviderJobCardsModel", {})
                   .get("results", []))
    return [{
        "title": r.get("title"),
        "company": r.get("company"),
        "location": r.get("formattedLocation"),
        "salary": (r.get("salarySnippet") or {}).get("text"),  # None-safe
        "snippet": r.get("snippet"),
        "job_key": r.get("jobkey"),
        "url": f"https://www.indeed.com/viewjob?jk={r.get('jobkey')}",
    } for r in results]
```
```python
def parse_indeed_html_fallback(html):
    soup = BeautifulSoup(html, "lxml")
    jobs = []
    for c in soup.select("div.job_seen_beacon"):
        title = c.select_one("h2.jobTitle span")
        company = c.select_one('[data-testid="company-name"]')
        location = c.select_one('[data-testid="text-location"]')
        jobs.append({
            "title": title.get("title", "") if title else "",
            "company": company.get_text(strip=True) if company else "",
            "location": location.get_text(strip=True) if location else "",
        })
    return jobs
```
```python
if __name__ == "__main__":
    jobs = search_indeed("python developer", "San Francisco, CA", max_pages=3)
    print(f"Got {len(jobs)} jobs")
    for j in jobs[:5]:
        print(j)
```

Indeed serves the job list as server-rendered HTML, but the same data is embedded as a JavaScript object (window.mosaic.providerData) in a <script> tag. Two reasons to parse that JSON instead of the rendered HTML:
- The job_seen_beacon class names change every few months; the JSON keys have been stable for years.
- The JSON already carries structured fields (jobkey, salary text, snippet) that are tedious to extract from the markup.

For full job descriptions, hit the detail URL:
```python
def fetch_job_detail(job_key, session, proxies):
    url = f"https://www.indeed.com/viewjob?jk={job_key}"
    r = session.get(url, proxies=proxies, timeout=20)
    if r.status_code != 200:
        return None
    soup = BeautifulSoup(r.text, "lxml")
    desc = soup.select_one("#jobDescriptionText")
    posted = soup.select_one('[data-testid="job-posted-date"]')
    return {
        "description": desc.get_text(separator="\n", strip=True) if desc else None,
        "posted_date": posted.get_text(strip=True) if posted else None,
    }
```

Plan for ~20-30% transient failures even with the right setup. Build retries with fresh sessions on 403/429/503.
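A retry wrapper along these lines pairs naturally with the sticky-session helper defined earlier. A sketch, not a drop-in: the backoff constants are arbitrary, and `fetch` is any callable you supply that takes a session id and returns a response-like object:

```python
import random
import time

BLOCK_CODES = {403, 429, 503}  # Cloudflare challenge / rate-limit responses

def fetch_with_retries(fetch, max_tries=4, base_delay=2.0):
    """Retry a fetch with a fresh sticky-session id on every blocked attempt.

    `fetch` takes a session id and returns an object with a .status_code
    attribute; drawing a new id makes the proxy gateway assign a new exit IP.
    Backoff doubles per attempt, with jitter scaled to base_delay.
    """
    for attempt in range(max_tries):
        session_id = random.randint(0, 100000)
        resp = fetch(session_id)
        if resp.status_code not in BLOCK_CODES:
            return resp
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    return None  # still blocked after max_tries
```
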
Indeed also exposes an RSS feed: https://rss.indeed.com/rss?q=...&l=... — less complete than the search pages, but no anti-bot.

Related: scrape LinkedIn safely, scrape Glassdoor, bypass Cloudflare.