LinkedIn is the single richest source of B2B data on the open web — 1 billion+ professional profiles, 67 million+ companies, and millions of active job postings. It's also one of the most aggressively defended sites against scrapers. Their own paid API (Sales Navigator, Recruiter) is expensive and has hard limits; for lead generation, competitive intelligence, recruiting, and market research teams, scraping LinkedIn data with Python is often the only realistic option.
This tutorial walks you through doing it properly: the legality (short version: hiQ v. LinkedIn, 2022, held that scraping public data does not violate the CFAA), the tooling (Playwright + residential proxies), code examples for profiles/companies/jobs, and the specific anti-bot patterns LinkedIn uses and how to stay under them.
Before we start — if you scrape behind the login wall, LinkedIn will detect and ban accounts. Everything in this guide focuses on the logged-out public surface, which is all most scraping use cases actually need.
Is scraping LinkedIn legal? It's the most googled question in the field, so let's settle it.

Practical summary: scraping public, logged-out LinkedIn pages has been held not to violate the US Computer Fraud and Abuse Act (hiQ Labs v. LinkedIn, Ninth Circuit, 2022), but it does breach LinkedIn's Terms of Service, which exposes you to civil claims.

Not legal advice, and jurisdiction matters. If the operation is commercial and material, have a lawyer review it.
The public LinkedIn surface — no login required — includes:

- Profiles set to public (`linkedin.com/in/username`)
- Company pages (`linkedin.com/company/slug`)
- Job listings (`linkedin.com/jobs`)

What needs login (avoid scraping this): full work history, skills, endorsements, recommendations, mutual connections, who's viewed your profile, messaging.
Install:
```bash
pip install playwright requests beautifulsoup4 pandas lxml
playwright install chromium
```
And a residential proxy plan. LinkedIn is the strictest site we cover on this blog when it comes to proxy quality. Datacenter IPs are blocked at the CDN layer within 1–2 requests. Free proxies are banned across the LinkedIn network within minutes. You need clean residential or ISP IPs.
We recommend rotating residential (SpyderProxy Premium $2.75/GB) for profile and company scraping, and static ISP ($3.90/day) for logged-in scenarios where you need a stable IP for the session.
Company pages are the simplest starting point — they have structured data, clean markup, and no per-profile rate limits.
```python
from playwright.sync_api import sync_playwright
import json

PROXY = {
    "server": "http://proxy.spyderproxy.com:10000",
    "username": "YOUR_USERNAME",
    "password": "YOUR_PASSWORD",
}

def scrape_company(slug: str):
    url = f"https://www.linkedin.com/company/{slug}/"
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True, proxy=PROXY)
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/147.0.0.0 Safari/537.36",
            viewport={"width": 1280, "height": 900},
        )
        page = context.new_page()
        page.goto(url, timeout=30000, wait_until="domcontentloaded")

        # Extract JSON-LD schema (LinkedIn publishes Organization schema on company pages)
        org_data = None
        for el in page.locator('script[type="application/ld+json"]').all():
            try:
                data = json.loads(el.inner_html())
                if isinstance(data, dict) and data.get("@type") == "Organization":
                    org_data = data
                    break
            except Exception:
                continue

        # Extract from Open Graph meta tags as a fallback
        meta = {}
        meta["name"] = page.locator('meta[property="og:title"]').get_attribute("content") or ""
        meta["description"] = page.locator('meta[property="og:description"]').get_attribute("content") or ""

        # On-page text snippets (industry, size, HQ are in the sidebar)
        about_text = ""
        about_el = page.locator('.core-section-container__content').first
        if about_el.count():
            about_text = about_el.inner_text()

        browser.close()
        return {
            "slug": slug,
            "url": url,
            "ld_json": org_data,
            "og_title": meta["name"],
            "og_description": meta["description"],
            "about_snippet": about_text[:500],
        }

if __name__ == "__main__":
    data = scrape_company("microsoft")
    print(json.dumps(data, indent=2, default=str)[:1500])
```
Key insight: LinkedIn publishes JSON-LD schema markup on public pages. That's a gift — structured, typed data with none of the fragility of HTML selectors. Always check `<script type="application/ld+json">` first.
Job search is the most valuable LinkedIn scrape for most use cases. It powers lead lists, competitive intelligence (who's hiring for what roles), and market wage data.
```python
from urllib.parse import urlencode

def scrape_jobs(keywords: str, location: str, max_jobs: int = 50):
    params = {
        "keywords": keywords,
        "location": location,
        "trk": "public_jobs_jobs-search-bar_search-submit",
    }
    url = f"https://www.linkedin.com/jobs/search?{urlencode(params)}"
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True, proxy=PROXY)
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/147.0.0.0 Safari/537.36",
        )
        page = context.new_page()
        page.goto(url, timeout=30000, wait_until="domcontentloaded")
        page.wait_for_selector('.jobs-search__results-list li', timeout=15000)

        jobs = []
        # Scroll to load more jobs (LinkedIn lazy-loads ~25 per scroll)
        for _ in range(max_jobs // 25 + 1):
            for card in page.locator('.jobs-search__results-list li').all():
                try:
                    title = card.locator('.base-search-card__title').inner_text().strip()
                    company = card.locator('.base-search-card__subtitle').inner_text().strip()
                    job_location = card.locator('.job-search-card__location').inner_text().strip()
                    link = card.locator('a.base-card__full-link').get_attribute("href") or ""
                    link = link.split("?")[0]  # strip tracking params
                    posted = card.locator('time').get_attribute("datetime") if card.locator('time').count() else ""
                    job = {"title": title, "company": company, "location": job_location, "url": link, "posted": posted}
                    if job not in jobs:
                        jobs.append(job)
                    if len(jobs) >= max_jobs:
                        break
                except Exception:
                    continue
            if len(jobs) >= max_jobs:
                break
            page.mouse.wheel(0, 3000)
            page.wait_for_timeout(1500)

        browser.close()
        return jobs[:max_jobs]

if __name__ == "__main__":
    jobs = scrape_jobs("python developer", "San Francisco", max_jobs=30)
    for j in jobs[:5]:
        print(j)
```
The public "snippet" view of a LinkedIn profile (no login) gives you: name, headline, current company, location, and a short summary. That's usually enough for lead enrichment and research.
```python
# json, sync_playwright, and PROXY are imported/defined in the earlier examples
def scrape_profile(profile_slug: str):
    # profile_slug = the username part of linkedin.com/in/{slug}
    url = f"https://www.linkedin.com/in/{profile_slug}/"
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True, proxy=PROXY)
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/147.0.0.0 Safari/537.36",
        )
        page = context.new_page()
        try:
            page.goto(url, timeout=30000, wait_until="domcontentloaded")
        except Exception:
            browser.close()
            return None

        # If we were redirected to the login wall, bail
        if "authwall" in page.url or "login" in page.url:
            browser.close()
            return {"slug": profile_slug, "error": "authwall"}

        # Pull JSON-LD Person schema if present
        ld_data = None
        for el in page.locator('script[type="application/ld+json"]').all():
            try:
                d = json.loads(el.inner_html())
                if isinstance(d, dict) and (d.get("@type") == "Person" or d.get("@graph")):
                    ld_data = d
                    break
            except Exception:
                continue

        og_title = page.locator('meta[property="og:title"]').get_attribute("content") or ""
        og_desc = page.locator('meta[property="og:description"]').get_attribute("content") or ""
        browser.close()
        return {
            "slug": profile_slug,
            "url": url,
            "og_title": og_title,
            "og_description": og_desc,
            "ld_json": ld_data,
        }
```
LinkedIn will redirect about 10–30% of profile requests to an "authwall" page pushing you to sign in. That's normal on the public surface — just skip those and keep going. Rotating residential IPs lowers the authwall rate significantly.
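One way to make that skip-and-continue policy systematic is a small retry wrapper. A sketch, assuming your fetch function has the same return shape as `scrape_profile` above (a dict, `{"error": "authwall"}`, or `None`) and picks a fresh proxy IP on every call:

```python
import random
import time

def fetch_with_retries(fetch, slug, max_attempts=3, base_delay=3.0):
    """Retry a profile fetch when it errors out or hits the authwall."""
    for attempt in range(1, max_attempts + 1):
        result = fetch(slug)
        # A real payload: not None and no authwall error -- we're done.
        if result is not None and result.get("error") != "authwall":
            return result
        if attempt < max_attempts:
            # Linear backoff with jitter so retries don't form a regular pulse.
            time.sleep(random.uniform(0.5, 1.5) * base_delay * attempt)
    return None  # still failing after max_attempts -- log it and move on
```

Because each retry goes out on a new IP, a second or third attempt often recovers an authwalled slug; anything still failing after that is cheaper to skip than to fight.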
This is the single most important part of the setup for LinkedIn scraping. LinkedIn tracks per-IP request rates aggressively: a single IP making 50+ LinkedIn requests will start hitting CAPTCHAs and authwalls within minutes.
```python
import random, time

def get_fresh_proxy():
    """Generate a new SpyderProxy session — each new session ID = a new residential IP."""
    session = random.randint(1, 10_000_000)
    return {
        "server": "http://proxy.spyderproxy.com:10000",
        "username": f"YOUR_USERNAME-session-{session}",
        "password": "YOUR_PASSWORD",
    }

companies = ["microsoft", "google", "amazon", "meta", "apple", "netflix"]
results = []
for slug in companies:
    proxy = get_fresh_proxy()  # fresh IP for every company
    # scrape_company from earlier, adapted to accept a proxy dict
    data = scrape_company_with_proxy(slug, proxy)
    if data:
        results.append(data)
    time.sleep(random.uniform(3, 7))  # jitter, don't be a regular pulse
```
Pattern: fresh session ID per request on SpyderProxy's rotating residential pool. This gives you a new IP per request with zero setup overhead. If you need the same IP across a multi-step flow (e.g., job search → click through to detail page), use a sticky session with a fixed session ID for up to 24 hours.
```python
import pandas as pd

jobs = scrape_jobs("backend engineer", "remote", max_jobs=200)
df = pd.DataFrame(jobs)

# Clean up
df["company"] = df["company"].str.strip()
df["title"] = df["title"].str.strip()
df = df.drop_duplicates(subset=["url"])
df["scraped_at"] = pd.Timestamp.now()

df.to_csv("linkedin_backend_jobs.csv", index=False, encoding="utf-8")
print(f"Saved {len(df)} jobs")
```
LinkedIn's defenses, in rough order of how often you'll see them: authwall redirects pushing you to sign in, CAPTCHAs once a single IP passes a few dozen requests, CDN-layer blocks on datacenter and free-proxy IP ranges, and throttling of traffic that arrives at perfectly regular intervals. Which proxy type you use determines how hard each of these hits:
| Proxy type | Best for | SpyderProxy price | LinkedIn ban rate |
|---|---|---|---|
| Rotating residential | Profile snippets, job listings, companies | Premium $2.75/GB, Budget $1.75/GB | Low (~5% authwall) |
| Static residential (ISP) | Stable sessions, logged-in-like scraping | $3.90/day | Very low with careful pacing |
| Mobile/LTE | Heavy production, hardest cases | $2/IP | Effectively zero |
| Datacenter | Don't | $1.50/proxy/mo | Near 100% |
The Premium Residential pool is our recommendation for 90% of LinkedIn scraping. When that's not enough, step up to LTE Mobile.
The most common self-inflicted failure is using plain `requests` instead of a headless browser: LinkedIn detects this within 1–2 requests.

If you don't want to build the scraper, there are managed services: ProxyCurl, Phantombuster, Apify's LinkedIn scrapers, and LinkedIn's own Sales Navigator API (if you qualify). Pricing ranges from $0.01/profile on ProxyCurl up to $1,000+/month for managed Sales Navigator access. These are fine for low volume and prototype projects.
For production at scale (10K+ profiles/month), self-hosted Playwright + SpyderProxy residential is typically 3–5× cheaper than the commercial APIs and gives you control over freshness, fields, and rate.
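The back-of-envelope math behind that comparison, with illustrative numbers only (profiles-per-GB depends heavily on how much of each page you load, so measure your own pipeline before committing):

```python
def residential_cost_per_1k(profiles_per_gb: float, price_per_gb: float) -> float:
    """Rough bandwidth cost of 1,000 profile snippets on rotating residential."""
    return 1000 / profiles_per_gb * price_per_gb

# Assumption: ~2,000 snippets/GB at the $2.75/GB rate quoted in this guide.
self_hosted = residential_cost_per_1k(2000, 2.75)  # ≈ $1.38 per 1k
managed_api = 0.01 * 1000                          # $0.01/profile -> $10 per 1k
print(f"self-hosted ≈ ${self_hosted:.2f}/1k vs managed ${managed_api:.0f}/1k")
```

At these assumed numbers the gap is even wider than 3–5×; heavier pages (more bandwidth per snippet) pull the self-hosted figure back toward that range.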
**Is scraping LinkedIn legal?** Scraping publicly accessible LinkedIn data has been held not to violate the US Computer Fraud and Abuse Act under hiQ Labs v. LinkedIn (Ninth Circuit, 2022). It does violate LinkedIn's Terms of Service, so LinkedIn can sue in civil court for ToS breach, DMCA, and similar claims — they have, and most cases settle. For commercial B2B use cases, most operators treat public scraping as legally defensible; consult a lawyer for your specific jurisdiction.
**Will LinkedIn ban my account for scraping?** If you scrape while logged in, yes — quickly, and permanently. If you scrape the public, logged-out surface from residential proxies, LinkedIn can block those specific IPs, but they can't ban an account they can't identify. Never use your real LinkedIn account for scraping.
**What counts as public LinkedIn data?** Public data (what anyone can see without logging in) includes company pages, job listings, and short profile snippets. Logged-in data includes full work histories, skills, endorsements, and connection graphs. Public scraping is legally defensible; logged-in scraping is both ToS-violating and technically detectable via account patterns.
**Why does my LinkedIn scraper keep getting blocked?** Almost always one of three reasons: (1) using datacenter or free proxies that LinkedIn identifies instantly, (2) not rotating IPs frequently enough — a single IP handles ~30–50 requests before throttling, (3) making requests at perfectly regular intervals instead of with randomized jitter.
**What proxies do I need?** Rotating residential proxies are the baseline. SpyderProxy Premium Residential at $2.75/GB keeps the authwall rate around 5% for logged-out scraping. For the heaviest workloads, LTE mobile proxies at $2/IP effectively eliminate blocks, at higher cost.
**How many profiles can I scrape per day?** With rotating residential proxies and proper pacing (3–7 seconds per request, fresh IP every request), a single worker realistically does 5,000–15,000 public profile snippets per day. Parallel workers with separate IP sessions scale this linearly.
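That estimate is just arithmetic on the pacing numbers. A sketch, where the per-page fetch time and authwall rate are assumptions to measure in your own pipeline:

```python
def daily_throughput(mean_delay_s: float, mean_fetch_s: float,
                     authwall_rate: float) -> int:
    """Estimate successful snippets per worker per day from pacing numbers."""
    seconds_per_attempt = mean_delay_s + mean_fetch_s
    attempts_per_day = 86_400 / seconds_per_attempt
    return round(attempts_per_day * (1 - authwall_rate))

# 3-7s jitter (mean 5s), an assumed ~4s page load, and a 5% authwall rate:
print(daily_throughput(5.0, 4.0, 0.05))  # 9120 -- inside the 5k-15k range
```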
**Can I scrape Sales Navigator?** Sales Navigator is behind login, shows more data, and has more aggressive anti-bot. Scraping it risks permanent account bans and is much riskier legally. Use the official Sales Navigator API if you need Sales Navigator data; it's the right tool for that job.
**Playwright or Selenium?** Playwright. Better stealth, better async, cleaner proxy config. Pair it with playwright-stealth to hide the typical headless-browser fingerprints.
**Does LinkedIn publish structured data?** Yes — company pages and public profiles include JSON-LD Organization and Person schema markup. Always check `<script type="application/ld+json">` first — it's cleaner, typed, and more stable than HTML selectors.
LinkedIn scraping in 2026 is technically and legally workable when you stick to public data, use residential proxies, and rotate aggressively. The stack that works: Playwright for rendering, rotating residential IPs (SpyderProxy Premium at $2.75/GB), randomized request pacing, and JSON-LD extraction as your primary data path.
Avoid: logged-in scraping, datacenter IPs, regular-pulse pacing, and ignoring the authwall — rotate through it, don't fight it.
For production-scale LinkedIn pipelines, SpyderProxy rotating residential handles the vast majority of workloads. Scale up to LTE mobile when you need the hardest anti-bot insulation.