
How to Scrape LinkedIn Data Safely (2026): Python Tutorial

SpyderProxy Team | Published 2026-04-21

LinkedIn is the single richest source of B2B data on the open web — 1 billion+ professional profiles, 67 million+ companies, and millions of active job postings. It's also one of the most aggressively defended sites against scrapers. Their own paid API (Sales Navigator, Recruiter) is expensive and has hard limits; for lead generation, competitive intelligence, recruiting, and market research teams, scraping LinkedIn data with Python is often the only realistic option.

This tutorial walks you through doing it properly: the legality (short version: under hiQ v. LinkedIn, Ninth Circuit 2022, scraping public data does not violate the CFAA), the tooling (Playwright + residential proxies), code examples for profiles/companies/jobs, and the specific anti-bot patterns LinkedIn uses and how to stay under them.

Before we start — if you scrape behind the login wall, LinkedIn will detect and ban accounts. Everything in this guide focuses on the logged-out public surface, which is all most scraping use cases actually need.

Is It Legal to Scrape LinkedIn?

This is the most googled question in the field, so let's settle it.

  • hiQ Labs v. LinkedIn (Ninth Circuit, 2022) — the case that defines this. hiQ scraped publicly visible LinkedIn profiles to sell a workforce analytics product. LinkedIn sued under the Computer Fraud and Abuse Act (CFAA). The Ninth Circuit ruled that scraping publicly available data is not a CFAA violation. The Supreme Court vacated the earlier ruling and sent the case back for reconsideration, and the Ninth Circuit reaffirmed in 2022. The parties ultimately settled out of court later that year.
  • LinkedIn v. Mantheos (2024) and similar — LinkedIn has continued to sue scrapers under ToS breach, DMCA, and misappropriation claims. Most are settled, and most involve scrapers accessing LinkedIn while logged in or bypassing technical measures.
  • GDPR/CCPA — LinkedIn profiles contain personal data. Scraping public data for legitimate interests (B2B outreach, academic research) is defensible; building a consumer marketing database from scraped profiles is not.

Practical summary:

  • Safe: Scraping publicly visible profiles, companies, and job postings for lead generation, recruiting research, market analysis.
  • Risky: Scraping behind LinkedIn login, bypassing captchas, republishing scraped profiles commercially.
  • Don't: Selling scraped LinkedIn data as-is, ignoring GDPR data subject rights, scraping private/connection-only data.

Not legal advice, and jurisdiction matters. If the operation is commercial and material, have a lawyer review it.

What Data You Can Actually Scrape (Without Login)

The public LinkedIn surface — no login required — includes:

  • Public profile snippets — name, headline, location, current company, short summary (visible on linkedin.com/in/username for profiles set to public).
  • Company pages — name, industry, size, headquarters, employee count range, about text, recent posts (visible on linkedin.com/company/slug).
  • Job postings — title, company, location, salary (if listed), description, posted date (visible on linkedin.com/jobs).
  • Search results — the public job search and company search pages, plus the Google-indexed "people" results.

What needs login (avoid scraping this): full work history, skills, endorsements, recommendations, mutual connections, who's viewed your profile, messaging.

Tools You'll Need

Install:

pip install playwright requests beautifulsoup4 pandas lxml
playwright install chromium

And a residential proxy plan. LinkedIn is the strictest site we cover on this blog when it comes to proxy quality. Datacenter IPs are blocked at the CDN layer within 1–2 requests. Free proxies are banned across the LinkedIn network within minutes. You need clean residential or ISP IPs.

We recommend rotating residential (SpyderProxy Premium $2.75/GB) for profile and company scraping, and static ISP ($3.90/day) for logged-in scenarios where you need a stable IP for the session.
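Before spending bandwidth on LinkedIn itself, it's worth verifying the gateway works. A minimal sketch of a helper that builds a `requests`-style proxies dict from the gateway details used throughout this guide — the hostname, port, and `-session-` username suffix mirror the Playwright examples below; treat them as placeholders and confirm against your dashboard:

```python
def spyderproxy_requests_config(username, password,
                                host="proxy.spyderproxy.com",
                                port=10000,
                                session_id=None):
    """Build a `proxies` dict for the requests library.

    On rotating pools, a fresh session ID in the username maps to a
    fresh residential IP; omit it to let the gateway rotate on its own.
    Hostname/port/suffix are placeholders from this guide's examples.
    """
    user = username if session_id is None else f"{username}-session-{session_id}"
    proxy_url = f"http://{user}:{password}@{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}

# Sanity-check the exit IP before pointing anything at LinkedIn:
# import requests, random
# proxies = spyderproxy_requests_config("YOUR_USERNAME", "YOUR_PASSWORD",
#                                       session_id=random.randint(1, 10_000_000))
# print(requests.get("https://api.ipify.org", proxies=proxies, timeout=15).text)
```

Note this `requests` call is only for checking the exit IP — for LinkedIn pages themselves you still want a headless browser, as covered below.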

Step 1: Scrape a Public Company Page

Company pages are the simplest starting point — they have structured data, clean markup, and no per-profile rate limits.

from playwright.sync_api import sync_playwright
import json

PROXY = {
    "server": "http://proxy.spyderproxy.com:10000",
    "username": "YOUR_USERNAME",
    "password": "YOUR_PASSWORD",
}

def scrape_company(slug: str):
    url = f"https://www.linkedin.com/company/{slug}/"
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True, proxy=PROXY)
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/147.0.0.0 Safari/537.36",
            viewport={"width": 1280, "height": 900},
        )
        page = context.new_page()
        page.goto(url, timeout=30000, wait_until="domcontentloaded")

        # Extract JSON-LD schema (LinkedIn publishes Organization schema on company pages)
        ld_json_elements = page.locator('script[type="application/ld+json"]').all()
        org_data = None
        for el in ld_json_elements:
            try:
                data = json.loads(el.inner_html())
                if data.get("@type") == "Organization":
                    org_data = data
                    break
            except Exception:
                continue

        # Extract from meta tags as fallback
        meta = {}
        meta["name"] = page.locator('meta[property="og:title"]').get_attribute("content") or ""
        meta["description"] = page.locator('meta[property="og:description"]').get_attribute("content") or ""

        # On-page text snippets (industry, size, HQ are in the sidebar)
        about_text = ""
        about_el = page.locator('.core-section-container__content').first
        if about_el.count():
            about_text = about_el.inner_text()

        browser.close()
        return {
            "slug": slug,
            "url": url,
            "ld_json": org_data,
            "og_title": meta["name"],
            "og_description": meta["description"],
            "about_snippet": about_text[:500],
        }

if __name__ == "__main__":
    data = scrape_company("microsoft")
    print(json.dumps(data, indent=2, default=str)[:1500])

Key insight: LinkedIn publishes JSON-LD schema markup on public pages. That's a gift — structured, typed data with none of the fragility of HTML selectors. Always check <script type="application/ld+json"> first.
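The one wrinkle with JSON-LD is that the payload isn't always a flat object — it can be a bare node, a list of nodes, or an object wrapping everything in an `@graph` array. A small, hypothetical helper (`find_schema_node` is not from any library) that handles all three shapes when hunting for a node type:

```python
def find_schema_node(data, schema_type):
    """Recursively search parsed JSON-LD for a node of the given @type.

    Handles the three shapes JSON-LD commonly takes: a bare object,
    a list of objects, and an object carrying an @graph array.
    """
    if isinstance(data, list):
        for item in data:
            found = find_schema_node(item, schema_type)
            if found is not None:
                return found
        return None
    if not isinstance(data, dict):
        return None
    node_type = data.get("@type")
    # @type itself may be a string or a list of strings
    if node_type == schema_type or (isinstance(node_type, list) and schema_type in node_type):
        return data
    return find_schema_node(data.get("@graph", []), schema_type)
```

Used after `json.loads(el.inner_html())` in the scrapers above, this replaces the hand-rolled `@type` checks with one reusable lookup.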

Step 2: Scrape Job Postings

Job search is the most valuable LinkedIn scrape for most use cases. It powers lead lists, competitive intelligence (who's hiring for what roles), and market wage data.

from urllib.parse import urlencode

def scrape_jobs(keywords: str, location: str, max_jobs: int = 50):
    params = {
        "keywords": keywords,
        "location": location,
        "trk": "public_jobs_jobs-search-bar_search-submit",
    }
    url = f"https://www.linkedin.com/jobs/search?{urlencode(params)}"
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True, proxy=PROXY)
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/147.0.0.0 Safari/537.36",
        )
        page = context.new_page()
        page.goto(url, timeout=30000, wait_until="domcontentloaded")
        page.wait_for_selector('.jobs-search__results-list li', timeout=15000)

        jobs = []
        # Scroll to load more jobs (LinkedIn lazy-loads ~25 per scroll)
        for _ in range(max_jobs // 25 + 1):
            cards = page.locator('.jobs-search__results-list li').all()
            for card in cards:
                try:
                    title = card.locator('.base-search-card__title').inner_text().strip()
                    company = card.locator('.base-search-card__subtitle').inner_text().strip()
                    location = card.locator('.job-search-card__location').inner_text().strip()
                    link = card.locator('a.base-card__full-link').get_attribute("href") or ""
                    link = link.split("?")[0]  # strip tracking params
                    posted = card.locator('time').get_attribute("datetime") if card.locator('time').count() else ""
                    job = {"title": title, "company": company, "location": location, "url": link, "posted": posted}
                    if job not in jobs:
                        jobs.append(job)
                    if len(jobs) >= max_jobs:
                        break
                except Exception:
                    continue
            if len(jobs) >= max_jobs:
                break
            page.mouse.wheel(0, 3000)
            page.wait_for_timeout(1500)

        browser.close()
        return jobs[:max_jobs]

if __name__ == "__main__":
    jobs = scrape_jobs("python developer", "San Francisco", max_jobs=30)
    for j in jobs[:5]:
        print(j)

Step 3: Scrape Public Profile Snippets

The public "snippet" view of a LinkedIn profile (no login) gives you: name, headline, current company, location, and a short summary. That's usually enough for lead enrichment and research.

def scrape_profile(profile_slug: str):
    # profile_slug = the username part of linkedin.com/in/{slug}
    url = f"https://www.linkedin.com/in/{profile_slug}/"
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True, proxy=PROXY)
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/147.0.0.0 Safari/537.36",
        )
        page = context.new_page()
        try:
            page.goto(url, timeout=30000, wait_until="domcontentloaded")
        except Exception:
            browser.close()
            return None

        # If we were redirected to login wall, bail
        if "authwall" in page.url or "login" in page.url:
            browser.close()
            return {"slug": profile_slug, "error": "authwall"}

        # Pull JSON-LD Person schema if present
        import json as _json
        ld_data = None
        for el in page.locator('script[type="application/ld+json"]').all():
            try:
                d = _json.loads(el.inner_html())
                if d.get("@type") == "Person" or d.get("@graph"):
                    ld_data = d
                    break
            except Exception:
                continue

        og_title = page.locator('meta[property="og:title"]').get_attribute("content") or ""
        og_desc = page.locator('meta[property="og:description"]').get_attribute("content") or ""

        browser.close()
        return {
            "slug": profile_slug,
            "url": url,
            "og_title": og_title,
            "og_description": og_desc,
            "ld_json": ld_data,
        }

LinkedIn will redirect about 10–30% of profile requests to an "authwall" page pushing you to sign in. That's normal on the public surface — just skip those and keep going. Rotating residential IPs lowers the authwall rate significantly.
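One way to fold that advice into code — a hedged sketch of a retry wrapper around any fetch function in this guide. It assumes `fetch` returns `None` on hard failure and `{"error": "authwall"}` on a login redirect, matching `scrape_profile` above, and that each call goes out through a fresh proxy session (the per-request rotation shown in the next step):

```python
import random
import time

def fetch_with_authwall_retry(fetch, slug, max_attempts=3, base_delay=2.0):
    """Retry `fetch(slug)` when it comes back authwalled.

    Illustrative helper, not part of any library. Each retry should
    run through a fresh proxy session so the authwall verdict resets.
    """
    for attempt in range(max_attempts):
        result = fetch(slug)
        if result is None:
            return None  # hard failure (timeout/network) — don't hammer
        if result.get("error") != "authwall":
            return result  # got real data
        # jittered backoff before the next attempt on a new IP
        time.sleep(base_delay * (attempt + 1) + random.uniform(0, base_delay))
    return {"slug": slug, "error": "authwall"}  # give up, log it, move on
```

The key property: retries target the IP, not the profile — the same slug usually loads fine from a different residential exit.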

Step 4: Rotate Proxies Per Request

This is the single most important setting for LinkedIn scraping. LinkedIn tracks per-IP request volume aggressively. A single IP making 50+ LinkedIn requests will start hitting captchas and authwalls within minutes.

import random, time

def get_fresh_proxy():
    """Generate a new SpyderProxy session — each new session ID = new residential IP"""
    session = random.randint(1, 10_000_000)
    return {
        "server": "http://proxy.spyderproxy.com:10000",
        "username": f"YOUR_USERNAME-session-{session}",
        "password": "YOUR_PASSWORD",
    }

companies = ["microsoft", "google", "amazon", "meta", "apple", "netflix"]
results = []
for slug in companies:
    proxy = get_fresh_proxy()  # fresh IP for every company
    data = scrape_company_with_proxy(slug, proxy)
    if data:
        results.append(data)
    time.sleep(random.uniform(3, 7))  # jitter, don't be a regular pulse

Pattern: fresh session ID per request on SpyderProxy's rotating residential pool. This gives you a new IP per request with zero setup overhead. If you need the same IP across a multi-step flow (e.g., job search → click through to detail page), use a sticky session with a fixed session ID for up to 24 hours.
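For the sticky case, a sketch of the counterpart to `get_fresh_proxy` above — it pins one session ID (and therefore one exit IP) across a multi-step flow. The `-session-{id}` username convention mirrors the rotating example; confirm the exact sticky-session syntax against your provider's docs:

```python
import random

def sticky_proxy(session_id):
    """Same session ID on every call = same exit IP for the whole flow.

    Username suffix convention is an assumption carried over from the
    rotating example; check your provider's sticky-session docs.
    """
    return {
        "server": "http://proxy.spyderproxy.com:10000",
        "username": f"YOUR_USERNAME-session-{session_id}",
        "password": "YOUR_PASSWORD",
    }

# One IP across a multi-step flow: job search page -> each detail page.
session = random.randint(1, 10_000_000)
proxy = sticky_proxy(session)
# Pass this same `proxy` dict to p.chromium.launch(proxy=proxy) for every
# step in the flow; generate a new session ID when the flow is done.
```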

Step 5: Parse and Save to CSV

import pandas as pd

jobs = scrape_jobs("backend engineer", "remote", max_jobs=200)
df = pd.DataFrame(jobs)

# Clean up
df["company"] = df["company"].str.strip()
df["title"] = df["title"].str.strip()
df = df.drop_duplicates(subset=["url"])
df["scraped_at"] = pd.Timestamp.now()

df.to_csv("linkedin_backend_jobs.csv", index=False, encoding="utf-8")
print(f"Saved {len(df)} jobs")

LinkedIn's Anti-Bot Challenges (And How to Handle Them)

In rough order of how often you'll see them:

  1. Authwall redirect — most common. LinkedIn decides your session is suspicious and sends you to a login/signup page. Resolution: rotate to a fresh residential IP. Don't retry from the same IP.
  2. Captcha — usually Arkose Labs FunCaptcha. Shown when a session is really suspicious. Resolution: rotate IP and clear browser state (new context in Playwright).
  3. 999 status response — LinkedIn's signature "you've been throttled" code. Only shown to scrapers. Rotate IP and back off for 15+ minutes from that IP range.
  4. IP range bans — entire /24 blocks banned when one IP in them scrapes too aggressively. This is why you don't use datacenter IPs — one bad actor taints 254 other IPs. Residential IPs are scattered, so range bans don't compound.
  5. Account bans (if you're scraping logged-in, which we advise against). Permanent, tied to your person-identified account, impossible to appeal.
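A hypothetical classifier that maps a response to the block types above, so a crawler loop can pick the right reaction. The URL substrings (`authwall`, `/login`, `/signup`, `checkpoint/challenge`) are heuristics based on the signals described in this section, not documented constants — tune them against real traffic:

```python
def classify_block(status, final_url):
    """Classify a LinkedIn response so the caller can react appropriately.

    Heuristic sketch: 999 is the throttle code described above; the URL
    substrings are assumptions about where redirects land.
    """
    if status == 999:
        return "throttled"   # back off 15+ minutes on this IP range
    if "authwall" in final_url or "/login" in final_url or "/signup" in final_url:
        return "authwall"    # rotate to a fresh IP; don't retry from this one
    if "checkpoint/challenge" in final_url:
        return "captcha"     # rotate IP and start a clean browser context
    if status >= 400:
        return "error"       # ordinary HTTP failure (404 profile gone, etc.)
    return "ok"
```

In a Playwright loop this would run on `response.status` and `page.url` after each navigation, with "throttled" and "captcha" feeding back into the proxy-rotation logic from Step 4.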

Proxy Types for LinkedIn: Which to Use

| Proxy type | Best for | SpyderProxy price | LinkedIn ban rate |
| --- | --- | --- | --- |
| Rotating residential | Profile snippets, job listings, companies | Premium $2.75/GB, Budget $1.75/GB | Low (~5% authwall) |
| Static residential (ISP) | Stable sessions, logged-in-like scraping | $3.90/day | Very low with careful pacing |
| Mobile/LTE | Heavy production, hardest cases | $2/IP | Effectively zero |
| Datacenter | Don't | $1.50/proxy/mo | Near 100% |

The Premium Residential pool is our recommendation for 90% of LinkedIn scraping. When that's not enough, step up to LTE Mobile.

Common Mistakes That Get LinkedIn Scrapers Banned

  • Using requests instead of a headless browser. LinkedIn detects this in 1–2 requests.
  • Scraping logged in on your personal account. Permanent account ban, and the ban extends to accounts on the same device/browser fingerprint.
  • Not rotating IPs. A single residential IP can handle ~30–50 LinkedIn requests before throttling. Rotate aggressively.
  • Scraping at a perfectly regular cadence. Real users have irregular patterns. Always add 2–8 seconds of jitter between requests.
  • Using the default Chromium User-Agent. LinkedIn's bot detection fingerprints the UA string, WebGL, canvas, and audio context. Use stealth plugins.
  • Hitting a sequentially numbered range of profile IDs. Real user traffic has entropy. Pattern-detection algorithms notice sequential access immediately.
  • Ignoring robots.txt out of spite. You can technically scrape public data regardless of robots, but respecting it limits your legal exposure significantly.
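Two of those mistakes — sequential access order and perfectly regular cadence — can be avoided with one small scheduling helper. A sketch (the function name and delay bounds are illustrative, not from any library):

```python
import random

def randomized_schedule(slugs, min_delay=2.0, max_delay=8.0, seed=None):
    """Shuffle the work queue and attach a jittered delay to each item.

    Returns (slug, delay_seconds) pairs; `seed` exists only to make
    the output reproducible in tests.
    """
    rng = random.Random(seed)
    order = list(slugs)
    rng.shuffle(order)  # kill sequential-ID access patterns
    return [(slug, rng.uniform(min_delay, max_delay)) for slug in order]

# Driving a scraper with it:
# import time
# for slug, delay in randomized_schedule(profile_slugs):
#     scrape_profile(slug)
#     time.sleep(delay)  # irregular pulse, like a human
```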

Alternatives: Commercial LinkedIn Scraping APIs

If you don't want to build the scraper, there are managed services: ProxyCurl, Phantombuster, Apify's LinkedIn scrapers, and LinkedIn's own Sales Navigator API (if you qualify). Pricing ranges from $0.01/profile on ProxyCurl up to $1,000+/month for managed Sales Navigator access. These are fine for low volume and prototype projects.

For production at scale (10K+ profiles/month), self-hosted Playwright + SpyderProxy residential is typically 3–5× cheaper than the commercial APIs and gives you control over freshness, fields, and rate.
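That cost gap is easy to sanity-check. A back-of-envelope sketch using the Premium Residential rate quoted in this article; the average page weight is an assumption (headless Chromium with images blocked) — measure your own before budgeting:

```python
def cost_per_profile_selfhosted(price_per_gb=2.75, avg_page_mb=1.5):
    """Rough bandwidth cost per scraped profile.

    price_per_gb: the $2.75/GB Premium Residential rate from this article.
    avg_page_mb: ASSUMED page weight per profile load — varies widely
    with image/media blocking; measure your own traffic.
    """
    pages_per_gb = 1024 / avg_page_mb
    return price_per_gb / pages_per_gb
```

At these assumptions the bandwidth cost lands around $0.004 per profile versus roughly $0.01 per profile for the cheapest managed API quoted above — proxy bandwidth is the dominant cost, so blocking images and media directly cuts your per-profile price.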

Frequently Asked Questions

Is scraping LinkedIn legal?

Scraping publicly accessible LinkedIn data has been held not to violate the US Computer Fraud and Abuse Act under hiQ Labs v. LinkedIn (Ninth Circuit, 2022). It does violate LinkedIn's Terms of Service, so LinkedIn can sue in civil court for ToS breach, DMCA, and similar claims — they have, and most cases settle. For commercial B2B use cases, most operators treat public scraping as legally defensible; consult a lawyer for your specific jurisdiction.

Will my LinkedIn account get banned if I scrape?

If you scrape while logged in, yes — quickly, and permanently. If you scrape the public, logged-out surface from residential proxies, LinkedIn can block those specific IPs but they can't ban an account they can't identify. Never use your real LinkedIn account for scraping.

What's the difference between scraping public vs logged-in LinkedIn data?

Public data (what anyone can see without logging in) includes company pages, job listings, and short profile snippets. Logged-in data includes full work histories, skills, endorsements, and connection graphs. Public scraping is legally defensible; logged-in scraping is both ToS-violating and technically detectable via account patterns.

Why do LinkedIn scrapers keep getting blocked?

Almost always one of three reasons: (1) using datacenter or free proxies that LinkedIn identifies instantly, (2) not rotating IPs frequently enough — a single IP handles ~30–50 requests before throttling, (3) making requests at perfectly regular intervals instead of randomized jitter.

What proxies work best for LinkedIn?

Rotating residential proxies are the baseline. SpyderProxy Premium Residential at $2.75/GB keeps the authwall rate around 5% for logged-out scraping. For heaviest workloads, LTE mobile proxies at $2/IP effectively eliminate blocks but at higher cost.

How many LinkedIn profiles can I scrape per day?

With rotating residential proxies and proper pacing (3–7 seconds per request, fresh IP every request), a single worker realistically does 5,000–15,000 public profile snippets per day. Parallel workers with separate IP sessions scale this linearly.
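Those numbers fall out of simple arithmetic. A sketch using the 3–7 second pacing from this guide, discounted by an assumed success rate to account for authwall skips (the 10–30% range from the profile section):

```python
def daily_throughput(min_delay=3.0, max_delay=7.0, hours=24, success_rate=0.8):
    """Back-of-envelope request budget for one worker.

    success_rate is an ASSUMED discount for authwalled/skipped requests;
    pick it from your own observed authwall rate.
    """
    avg_delay = (min_delay + max_delay) / 2   # ~5 s between requests
    requests = hours * 3600 / avg_delay       # ~17,280 requests/day
    return int(requests * success_rate)
```

At a 20% skip rate that's roughly 13,800 usable profiles per day per worker — the top of the range quoted above; heavier authwall rates pull it toward the bottom.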

Can I scrape LinkedIn Sales Navigator?

Sales Navigator is behind login, shows more data, and has more aggressive anti-bot. Scraping it risks permanent account bans and is much riskier legally. Use the official Sales Navigator API if you need Sales Navigator data; it's the right tool for that job.

Should I use Selenium or Playwright for LinkedIn?

Playwright. Better stealth, better async, cleaner proxy config. Pair with playwright-stealth to hide the typical headless-browser fingerprints.

Does LinkedIn publish structured data I can scrape?

Yes — company pages and public profiles include JSON-LD Organization and Person schema markup. Always check <script type="application/ld+json"> first — it's cleaner, typed, and more stable than HTML selectors.

Bottom Line

LinkedIn scraping in 2026 is technically and legally workable when you stick to public data, use residential proxies, and rotate aggressively. The stack that works: Playwright for rendering, rotating residential IPs (SpyderProxy Premium at $2.75/GB), randomized request pacing, and JSON-LD extraction as your primary data path.

Avoid: logged-in scraping, datacenter IPs, regular-pulse pacing, and ignoring the authwall — rotate through it, don't fight it.

For production-scale LinkedIn pipelines, SpyderProxy rotating residential handles the vast majority of workloads. Scale up to LTE mobile when you need the hardest anti-bot insulation.
