spyderproxy

How to Scrape Indeed Job Postings (2026 Python Guide)

Alex R. | Published Sun May 10 2026

Quick verdict: Indeed is protected by Cloudflare + custom anti-bot rules. Plain requests gets a 403 within seconds. The reliable pattern in 2026: rotating residential proxies + realistic headers + 2-5s delays + parsing the JSON inside the page (Indeed embeds job data as JSON in a script tag, not in HTML). Plan for ~20-30% block rate even with that setup — build retries.

Scraping public Indeed listings is broadly covered by the reasoning in hiQ v. LinkedIn (9th Cir., 2022): accessing public-facing, non-login data is not unauthorized "protected computer" access under the CFAA. Indeed's ToS still prohibits automated access, but that is a contract issue, not a criminal one. The practical risks are an account ban (if logged in) and an IP ban (always). Never scrape behind a login. Respect the robots.txt sections that explicitly disallow crawling. For commercial use, Indeed offers an official API (paid) — the preferred route for production.

Indeed URL Structure

Search URL pattern:

https://www.indeed.com/jobs?q=KEYWORDS&l=LOCATION&start=OFFSET
  • q — job title or keywords (URL-encoded)
  • l — location (city, state, ZIP)
  • start — pagination offset (10 per page)
  • fromage=N — posted within N days (1, 3, 7, 14)
  • jt=fulltime — job type (fulltime, parttime, contract, internship, temporary)
  • radius=N — search radius in miles

Example: https://www.indeed.com/jobs?q=python+developer&l=Remote&fromage=3&jt=fulltime

For non-US markets, swap the domain: www.indeed.co.uk, de.indeed.com, au.indeed.com, etc.
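The parameters above compose mechanically, so a small URL builder (function name and signature are my own) keeps the encoding correct across markets:

```python
from urllib.parse import urlencode

def build_search_url(query, location, start=0, domain="www.indeed.com", **filters):
    """Assemble an Indeed search URL; extra filters (fromage, jt, radius)
    pass through as-is. `domain` swaps in non-US Indeed sites."""
    params = {"q": query, "l": location, "start": start, **filters}
    return f"https://{domain}/jobs?{urlencode(params)}"

print(build_search_url("python developer", "Remote", fromage=3, jt="fulltime"))
# https://www.indeed.com/jobs?q=python+developer&l=Remote&start=0&fromage=3&jt=fulltime
```

urlencode handles the space-to-plus encoding, so keywords can be passed as plain strings.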

Anti-Bot Layers

  1. Cloudflare at the edge — rate limits, Bot Fight Mode, occasional Turnstile
  2. Header validation — missing User-Agent, Accept-Language, or Accept-Encoding triggers a block
  3. IP reputation — datacenter IPs blocked instantly. Static residential gets ~100-500 requests before rate limits
  4. Behavioral — requests faster than 1/second from one IP look bot-like; randomized 2-5s delays look human
  5. Cookie tracking — Indeed sets session cookies; reusing them across IPs is a flag. Fresh session per fresh IP.

Proxy Setup

Use Premium Residential ($2.75/GB, sticky sessions up to 8 hours) for Indeed. Sessions matter — Indeed pages load 200-500KB each (HTML + embedded JSON), so 1 GB covers ~3,000 page loads.
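The ~3,000-pages-per-GB figure is just arithmetic on those page sizes; a quick sanity check, assuming an average near the middle of the 200-500 KB range:

```python
def pages_per_gb(avg_page_kb=330):
    """How many page loads fit in 1 GB of proxy traffic at a given
    average page size. 330 KB is an assumed midpoint of the 200-500 KB range."""
    return 1_000_000 // avg_page_kb  # 1 GB ~ 1,000,000 KB (decimal)

print(pages_per_gb())     # 3030
print(pages_per_gb(500))  # worst case: 2000
```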

PROXY_USER = "your_username"
PROXY_PASS = "your_password"
PROXY_GW = "gw.spyderproxy.com:8000"

def proxy_for_session(session_id):
    """Sticky-session proxy: same IP for this session_id for up to 8h."""
    return {
        "http":  f"http://{PROXY_USER}-session-{session_id}:{PROXY_PASS}@{PROXY_GW}",
        "https": f"http://{PROXY_USER}-session-{session_id}:{PROXY_PASS}@{PROXY_GW}",
    }

A Working Indeed Scraper

import requests, time, random, json, re
from urllib.parse import urlencode
from bs4 import BeautifulSoup

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}

def search_indeed(query, location="Remote", max_pages=5):
    session_id = random.randint(0, 100000)
    proxies = proxy_for_session(session_id)
    s = requests.Session()
    s.headers.update(HEADERS)

    jobs = []
    for page in range(max_pages):
        params = {"q": query, "l": location, "start": page * 10}
        url = f"https://www.indeed.com/jobs?{urlencode(params)}"
        try:
            r = s.get(url, proxies=proxies, timeout=20)
            if r.status_code != 200:
                print(f"  page {page}: {r.status_code} (rotating session)")
                session_id = random.randint(0, 100000)
                proxies = proxy_for_session(session_id)
                continue
            jobs.extend(parse_indeed_page(r.text))
        except requests.RequestException as e:
            print(f"  page {page}: {e}")

        time.sleep(random.uniform(2.0, 5.0))
    return jobs


def parse_indeed_page(html):
    """Indeed embeds job data as JSON in window.mosaic.providerData."""
    m = re.search(
        r'window\.mosaic\.providerData\["mosaic-provider-jobcards"\]\s*=\s*(\{.*?\});',
        html,
        re.DOTALL,  # the embedded JSON spans multiple lines
    )
    if not m:
        # fall back to HTML parsing if JSON path changes
        return parse_indeed_html_fallback(html)

    data = json.loads(m.group(1))
    results = data.get("metaData", {}).get("mosaicProviderJobCardsModel", {}).get("results", [])
    return [{
        "title": r.get("title"),
        "company": r.get("company"),
        "location": r.get("formattedLocation"),
        "salary": (r.get("salarySnippet") or {}).get("text"),  # snippet may be null
        "snippet": r.get("snippet"),
        "job_key": r.get("jobkey"),
        "url": f"https://www.indeed.com/viewjob?jk={r.get('jobkey')}",
    } for r in results]


def parse_indeed_html_fallback(html):
    soup = BeautifulSoup(html, "lxml")
    cards = soup.select("div.job_seen_beacon")

    def text(card, selector):
        el = card.select_one(selector)
        return el.get_text(strip=True) if el else ""

    return [{
        "title": (c.select_one("h2.jobTitle span") or {}).get("title", ""),
        "company": text(c, '[data-testid="company-name"]'),
        "location": text(c, '[data-testid="text-location"]'),
    } for c in cards]


if __name__ == "__main__":
    jobs = search_indeed("python developer", "San Francisco, CA", max_pages=3)
    print(f"Got {len(jobs)} jobs")
    for j in jobs[:5]:
        print(j)

Why JSON Parsing Beats HTML Selectors

Indeed renders the job list HTML server-side from a JavaScript object (window.mosaic.providerData) embedded in a <script> tag. Two reasons to parse the JSON instead of the rendered HTML:

  1. Stability: the JSON schema changes far less often than the CSS selectors. job_seen_beacon changes every few months; the JSON keys are years-stable.
  2. Completeness: the JSON has fields the HTML hides (full description, posted-date, employer reviews, etc.).
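The extraction can be exercised offline against a stub of the embedded script tag (the sample JSON below is fabricated for illustration; the regex mirrors the one in parse_indeed_page):

```python
import json
import re

# Fabricated stand-in for an Indeed search page's embedded script tag.
SAMPLE = ('<script>window.mosaic.providerData["mosaic-provider-jobcards"] = '
          '{"metaData": {"mosaicProviderJobCardsModel": {"results": '
          '[{"title": "Data Engineer", "jobkey": "abc123"}]}}};</script>')

PATTERN = re.compile(
    r'window\.mosaic\.providerData\["mosaic-provider-jobcards"\]\s*=\s*(\{.*?\});',
    re.DOTALL,
)

m = PATTERN.search(SAMPLE)
data = json.loads(m.group(1))
results = data["metaData"]["mosaicProviderJobCardsModel"]["results"]
print(results[0]["title"], results[0]["jobkey"])  # Data Engineer abc123
```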

Scraping Individual Job Pages

For full job descriptions, hit the detail URL:

def fetch_job_detail(job_key, session, proxies):
    url = f"https://www.indeed.com/viewjob?jk={job_key}"
    r = session.get(url, proxies=proxies, timeout=20)
    if r.status_code != 200:
        return None

    soup = BeautifulSoup(r.text, "lxml")
    desc = soup.select_one("#jobDescriptionText")
    posted = soup.select_one('[data-testid="job-posted-date"]')
    return {
        "description": desc.get_text(separator="\n", strip=True) if desc else None,
        "posted_date": posted.get_text(strip=True) if posted else None,
    }

Rate Limits in Practice

  • 1 IP, 1 req/sec: ~200-500 successful requests before block
  • 1 IP, 1 req/3-5s: ~1,000-2,000 successful requests before block
  • Rotating residential, 1 req/2-5s: effectively unlimited (rotation handles per-IP limit)

Plan for ~20-30% transient failures even with the right setup. Build retries with fresh sessions on 403/429/503.
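That retry advice can be factored into a small wrapper. Here `fetch` is a hypothetical callable standing in for one proxied page request — it takes a sticky-session id and returns (status_code, body):

```python
import random
import time

RETRYABLE = {403, 429, 503}

def fetch_with_retries(fetch, max_attempts=4, base_delay=2.0):
    """Call fetch(session_id) until it succeeds. Each attempt uses a fresh
    sticky-session id (i.e. a fresh IP) and backs off exponentially with
    jitter on retryable status codes. Returns the body, or None on failure."""
    for attempt in range(max_attempts):
        session_id = random.randint(0, 100_000)  # fresh IP via new session id
        status, body = fetch(session_id)
        if status == 200:
            return body
        if status not in RETRYABLE:
            break  # 404 etc. -- retrying won't help
        time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.0))
    return None
```

Plugging in a real fetcher means closing over the requests.Session and proxy_for_session from the scraper above.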

Official Alternatives

  • Indeed Publisher API: the official path. Free for non-commercial; paid tiers for volume. docs.indeed.com
  • RSS feeds: Indeed publishes per-search RSS at https://rss.indeed.com/rss?q=...&l=... — less complete but no anti-bot
  • Aggregators: if you need all major job sites, services like Adzuna or The Muse API consolidate Indeed + others

Related: scrape LinkedIn safely, scrape Glassdoor, bypass Cloudflare.