spyderproxy

How to Scrape Indeed Job Postings (2026 Python Guide)

Alex R. | Published Sun May 10 2026

Quick verdict: Indeed is protected by Cloudflare + custom anti-bot rules. Plain requests gets a 403 within seconds. The reliable pattern in 2026: rotating residential proxies + realistic headers + 2-5s delays + parsing the JSON inside the page (Indeed embeds job data as JSON in a script tag, not in HTML). Plan for ~20-30% block rate even with that setup — build retries.

Scraping public Indeed listings is broadly covered by the reasoning in hiQ v. LinkedIn (9th Cir., 2022): accessing public-facing, non-login data is not unauthorized "protected computer" access under the CFAA. Indeed's ToS still prohibits automated access, but that is a contract issue, not a criminal one. The practical risks are an account ban (if logged in) and an IP ban (always). Never scrape behind a login. Respect the robots.txt sections that explicitly disallow crawling. For commercial use, Indeed offers an official API (paid) — the preferred route for production.

Indeed URL Structure

Search URL pattern:

https://www.indeed.com/jobs?q=KEYWORDS&l=LOCATION&start=OFFSET
  • q — job title or keywords (URL-encoded)
  • l — location (city, state, ZIP)
  • start — pagination offset (10 per page)
  • fromage=N — posted within N days (1, 3, 7, 14)
  • jt=fulltime — job type (fulltime, parttime, contract, internship, temporary)
  • radius=N — search radius in miles

Example: https://www.indeed.com/jobs?q=python+developer&l=Remote&fromage=3&jt=fulltime

For non-US markets, swap the domain: www.indeed.co.uk, de.indeed.com, au.indeed.com, etc.
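The parameters above compose mechanically, so a small URL builder (function name and signature are my own) keeps the encoding correct across markets:

```python
from urllib.parse import urlencode

def build_search_url(query, location, start=0, domain="www.indeed.com", **filters):
    """Assemble an Indeed search URL; extra filters (fromage, jt, radius)
    pass through as-is. `domain` swaps in non-US Indeed sites."""
    params = {"q": query, "l": location, "start": start, **filters}
    return f"https://{domain}/jobs?{urlencode(params)}"

print(build_search_url("python developer", "Remote", fromage=3, jt="fulltime"))
# https://www.indeed.com/jobs?q=python+developer&l=Remote&start=0&fromage=3&jt=fulltime
```

urlencode handles the space-to-plus encoding, so keywords can be passed as plain strings.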

Anti-Bot Layers

  1. Cloudflare at the edge — rate limits, Bot Fight Mode, occasional Turnstile
  2. Header validation — missing User-Agent, Accept-Language, or Accept-Encoding triggers a block
  3. IP reputation — datacenter IPs blocked instantly. Static residential gets ~100-500 requests before rate limits
  4. Behavioral — requests faster than 1/second from one IP look bot-like; randomized 2-5s delays look human
  5. Cookie tracking — Indeed sets session cookies; reusing them across IPs is a flag. Fresh session per fresh IP.

Proxy Setup

Use Premium Residential ($2.75/GB, sticky sessions up to 8 hours) for Indeed. Sessions matter — Indeed pages load 200-500KB each (HTML + embedded JSON), so 1 GB covers ~3,000 page loads.
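The ~3,000-pages-per-GB figure is just arithmetic on those page sizes; a quick sanity check, assuming an average near the middle of the 200-500 KB range:

```python
def pages_per_gb(avg_page_kb=330):
    """How many page loads fit in 1 GB of proxy traffic at a given
    average page size. 330 KB is an assumed midpoint of the 200-500 KB range."""
    return 1_000_000 // avg_page_kb  # 1 GB ~ 1,000,000 KB (decimal)

print(pages_per_gb())     # 3030
print(pages_per_gb(500))  # worst case: 2000
```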

PROXY_USER = "your_username"
PROXY_PASS = "your_password"
PROXY_GW = "gw.spyderproxy.com:8000"

def proxy_for_session(session_id):
    """Sticky-session proxy: same IP for this session_id for up to 8h."""
    return {
        "http":  f"http://{PROXY_USER}-session-{session_id}:{PROXY_PASS}@{PROXY_GW}",
        "https": f"http://{PROXY_USER}-session-{session_id}:{PROXY_PASS}@{PROXY_GW}",
    }

A Working Indeed Scraper

import requests, time, random, json, re
from urllib.parse import urlencode
from bs4 import BeautifulSoup

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}

def search_indeed(query, location="Remote", max_pages=5):
    session_id = random.randint(0, 100000)
    proxies = proxy_for_session(session_id)
    s = requests.Session()
    s.headers.update(HEADERS)

    jobs = []
    for page in range(max_pages):
        params = {"q": query, "l": location, "start": page * 10}
        url = f"https://www.indeed.com/jobs?{urlencode(params)}"
        try:
            r = s.get(url, proxies=proxies, timeout=20)
            if r.status_code != 200:
                print(f"  page {page}: {r.status_code} (rotating session)")
                session_id = random.randint(0, 100000)
                proxies = proxy_for_session(session_id)
                continue
            jobs.extend(parse_indeed_page(r.text))
        except requests.RequestException as e:
            print(f"  page {page}: {e}")

        time.sleep(random.uniform(2.0, 5.0))
    return jobs


def parse_indeed_page(html):
    """Indeed embeds job data as JSON in window.mosaic.providerData."""
    m = re.search(
        r'window\.mosaic\.providerData\["mosaic-provider-jobcards"\]\s*=\s*(\{.*?\});',
        html,
        re.DOTALL,  # the embedded JSON spans multiple lines
    )
    if not m:
        # fall back to HTML parsing if JSON path changes
        return parse_indeed_html_fallback(html)

    data = json.loads(m.group(1))
    results = data.get("metaData", {}).get("mosaicProviderJobCardsModel", {}).get("results", [])
    return [{
        "title": r.get("title"),
        "company": r.get("company"),
        "location": r.get("formattedLocation"),
        "salary": (r.get("salarySnippet") or {}).get("text"),  # snippet may be null
        "snippet": r.get("snippet"),
        "job_key": r.get("jobkey"),
        "url": f"https://www.indeed.com/viewjob?jk={r.get('jobkey')}",
    } for r in results]


def parse_indeed_html_fallback(html):
    soup = BeautifulSoup(html, "lxml")
    cards = soup.select("div.job_seen_beacon")

    def text(card, selector):
        el = card.select_one(selector)
        return el.get_text(strip=True) if el else ""

    return [{
        "title": (c.select_one("h2.jobTitle span") or {}).get("title", ""),
        "company": text(c, '[data-testid="company-name"]'),
        "location": text(c, '[data-testid="text-location"]'),
    } for c in cards]


if __name__ == "__main__":
    jobs = search_indeed("python developer", "San Francisco, CA", max_pages=3)
    print(f"Got {len(jobs)} jobs")
    for j in jobs[:5]:
        print(j)

Why JSON Parsing Beats HTML Selectors

Indeed renders the job list HTML server-side from a JavaScript object (window.mosaic.providerData) embedded in a <script> tag. Two reasons to parse the JSON instead of the rendered HTML:

  1. Stability: the JSON schema changes far less often than the CSS selectors. job_seen_beacon changes every few months; the JSON keys are years-stable.
  2. Completeness: the JSON has fields the HTML hides (full description, posted-date, employer reviews, etc.).
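The extraction can be exercised offline against a stub of the embedded script tag (the sample JSON below is fabricated for illustration; the regex mirrors the one in parse_indeed_page):

```python
import json
import re

# Fabricated stand-in for an Indeed search page's embedded script tag.
SAMPLE = ('<script>window.mosaic.providerData["mosaic-provider-jobcards"] = '
          '{"metaData": {"mosaicProviderJobCardsModel": {"results": '
          '[{"title": "Data Engineer", "jobkey": "abc123"}]}}};</script>')

PATTERN = re.compile(
    r'window\.mosaic\.providerData\["mosaic-provider-jobcards"\]\s*=\s*(\{.*?\});',
    re.DOTALL,
)

m = PATTERN.search(SAMPLE)
data = json.loads(m.group(1))
results = data["metaData"]["mosaicProviderJobCardsModel"]["results"]
print(results[0]["title"], results[0]["jobkey"])  # Data Engineer abc123
```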

Scraping Individual Job Pages

For full job descriptions, hit the detail URL:

def fetch_job_detail(job_key, session, proxies):
    url = f"https://www.indeed.com/viewjob?jk={job_key}"
    r = session.get(url, proxies=proxies, timeout=20)
    if r.status_code != 200:
        return None

    soup = BeautifulSoup(r.text, "lxml")
    desc = soup.select_one("#jobDescriptionText")
    posted = soup.select_one('[data-testid="job-posted-date"]')
    return {
        "description": desc.get_text(separator="\n", strip=True) if desc else None,
        "posted_date": posted.get_text(strip=True) if posted else None,
    }

Rate Limits in Practice

  • 1 IP, 1 req/sec: ~200-500 successful requests before block
  • 1 IP, 1 req/3-5s: ~1,000-2,000 successful requests before block
  • Rotating residential, 1 req/2-5s: effectively unlimited (rotation handles per-IP limit)

Plan for ~20-30% transient failures even with the right setup. Build retries with fresh sessions on 403/429/503.
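That retry advice can be factored into a small wrapper. Here `fetch` is a hypothetical callable standing in for one proxied page request — it takes a sticky-session id and returns (status_code, body):

```python
import random
import time

RETRYABLE = {403, 429, 503}

def fetch_with_retries(fetch, max_attempts=4, base_delay=2.0):
    """Call fetch(session_id) until it succeeds. Each attempt uses a fresh
    sticky-session id (i.e. a fresh IP) and backs off exponentially with
    jitter on retryable status codes. Returns the body, or None on failure."""
    for attempt in range(max_attempts):
        session_id = random.randint(0, 100_000)  # fresh IP via new session id
        status, body = fetch(session_id)
        if status == 200:
            return body
        if status not in RETRYABLE:
            break  # 404 etc. -- retrying won't help
        time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.0))
    return None
```

Plugging in a real fetcher means closing over the requests.Session and proxy_for_session from the scraper above.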

Official Alternatives

  • Indeed Publisher API: the official path. Free for non-commercial; paid tiers for volume. docs.indeed.com
  • RSS feeds: Indeed publishes per-search RSS at https://rss.indeed.com/rss?q=...&l=... — less complete but no anti-bot
  • Aggregators: if you need all major job sites, services like Adzuna or The Muse API consolidate Indeed + others

Related: scrape LinkedIn safely, scrape Glassdoor, bypass Cloudflare.