
How to Scrape a Website That Requires Login: Python Tutorial (2026)

Alex R. | Published Apr 10, 2026 | 10 min read

To scrape a website that requires login in Python, you have four main approaches: use the requests library with session objects to maintain cookies after a POST login, automate the browser login flow with Selenium or Playwright, export cookies from your browser and inject them into your scraping script, or authenticate directly against the site's API with a token. The best method depends on whether the site uses simple form-based auth, JavaScript-rendered login, or token-based authentication.

This guide walks through five proven methods for scraping behind a login wall, complete with Python code examples, proxy integration for each approach, and advice on which technique to use based on the target site's authentication type. Whether you're scraping an internal dashboard, a supplier portal, or a member-only research platform, one of these methods will work.

When Do You Need to Scrape Behind a Login?

Many of the most valuable data sources on the web sit behind authentication walls. Here are the most common scenarios where you need to log in before scraping:

  • Internal dashboards and analytics tools — Business intelligence platforms, CRM systems, and reporting tools that only show data to authenticated users. Companies often need to extract this data for custom reporting or migration projects.
  • Member-only content and personalized pricing — Wholesale supplier portals, industry databases, and subscription platforms that display different pricing or content based on your account level.
  • Social media platforms — Logged-in profiles on platforms like LinkedIn, Facebook, and Instagram expose significantly more data than public views. Profile details, contact information, and engagement metrics are often gated behind authentication.
  • E-commerce supplier portals — B2B platforms, wholesale distributors, and dropshipping suppliers that require login to view inventory, pricing tiers, and product catalogs.
  • Research and academic platforms — Databases like PubMed, JSTOR, IEEE, and industry-specific analytics tools that require institutional or paid login to access full content.

In all of these cases, your scraper needs to handle the authentication step before it can access the target pages. The method you choose depends on how the site implements its login system.

Method 1: requests + Session (Simplest Approach)

The requests.Session() object is the simplest way to scrape behind a login. It automatically persists cookies across requests, so once you POST valid credentials to the login endpoint, all subsequent requests in that session are authenticated.

When to Use This Method

This works for sites with standard HTML form-based login — a <form> that POSTs username and password fields to a server endpoint. There is no JavaScript rendering required for the login page itself.

Code Example

import requests

session = requests.Session()

# Route through SpyderProxy residential proxies
session.proxies = {
    "http": "http://user:[email protected]:11000",
    "https": "http://user:[email protected]:11000"
}

# Step 1: POST login credentials
login_url = "https://example.com/login"
login_data = {
    "username": "myuser",
    "password": "mypass"
}
response = session.post(login_url, data=login_data)

# Step 2: Verify login succeeded (most sites redirect on success)
if "/dashboard" in response.url:
    print("Login successful")
else:
    print("Login may have failed: check for CSRF tokens or a CAPTCHA")

# Step 3: Scrape authenticated pages
dashboard = session.get("https://example.com/dashboard")
profile = session.get("https://example.com/account/settings")
print(dashboard.text)

How It Works

The Session object stores cookies set by the server after a successful login. When the server responds to your POST with a Set-Cookie header containing a session ID, the Session object captures it and includes it in every subsequent request automatically. This is the same mechanism your browser uses.
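You can see that Set-Cookie mechanism in miniature with the standard library's cookie parser (the header value below is made up for illustration):

```python
from http.cookies import SimpleCookie

# A Set-Cookie header like the one a login endpoint sends back
header = "session_id=abc123def456; Path=/; HttpOnly"
cookie = SimpleCookie()
cookie.load(header)

# This is the value a requests.Session captures and re-sends automatically
print(cookie["session_id"].value)  # abc123def456
```

requests does this parsing for you behind the scenes; inspecting `session.cookies` after a login POST shows exactly which cookies the server handed back.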

When This Method Fails

  • JavaScript-rendered login pages — If the login form is rendered by JavaScript (React, Vue, Angular), there is no HTML form to POST to directly.
  • CSRF tokens — Many sites include a hidden CSRF token in the login form. You need to GET the login page first, extract the token, then include it in your POST data.
  • CAPTCHA on login — If the login page has a CAPTCHA, requests alone cannot solve it.
  • Two-factor authentication (2FA) — Session-based POST cannot handle SMS or authenticator codes.

Handling CSRF Tokens

from bs4 import BeautifulSoup

# First, GET the login page to extract the CSRF token
login_page = session.get("https://example.com/login")
soup = BeautifulSoup(login_page.text, "html.parser")
csrf_token = soup.find("input", {"name": "csrf_token"})["value"]

# Include the token in your POST
login_data = {
    "username": "myuser",
    "password": "mypass",
    "csrf_token": csrf_token
}
session.post("https://example.com/login", data=login_data)

Method 2: Selenium Browser Automation

When the login page relies on JavaScript — dynamically rendered forms, single-page applications, or interactive CAPTCHA elements — you need a real browser. Selenium automates Chrome or Firefox and can interact with any login page the same way a human does.

Code Example with Proxy

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Configure Chrome with a proxy
# Note: --proxy-server does not accept user:pass credentials; use an
# IP-allowlisted endpoint or a proxy-auth extension for authenticated proxies
options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=http://geo.spyderproxy.com:11000")
options.add_argument("--disable-blink-features=AutomationControlled")

driver = webdriver.Chrome(options=options)

# Navigate to login page
driver.get("https://example.com/login")

# Fill in credentials
driver.find_element(By.ID, "username").send_keys("myuser")
driver.find_element(By.ID, "password").send_keys("mypass")
driver.find_element(By.ID, "login-btn").click()

# Wait for login to complete (page redirect)
WebDriverWait(driver, 10).until(
    EC.url_contains("/dashboard")
)

# Now scrape authenticated pages
driver.get("https://example.com/dashboard")
page_source = driver.page_source

# Extract data from page_source with BeautifulSoup
from bs4 import BeautifulSoup
soup = BeautifulSoup(page_source, "html.parser")
data = soup.find_all("div", class_="data-row")

driver.quit()

Handling Wait Times

Login pages often have animations, redirects, or AJAX calls that take time to complete. Avoid fixed time.sleep() calls for waiting on page loads — use WebDriverWait with explicit conditions instead:

# Wait for a specific element to appear after login
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CLASS_NAME, "dashboard-content"))
)

# Wait for URL to change
WebDriverWait(driver, 10).until(
    EC.url_contains("/dashboard")
)

Reducing CAPTCHA Frequency

Selenium with datacenter IPs triggers CAPTCHAs frequently because sites know those IPs belong to cloud providers. Residential proxies dramatically reduce CAPTCHA rates because the requests come from real ISP IP addresses that look like normal users. SpyderProxy's residential pool of 130M+ IPs makes each Selenium session appear to originate from a genuine household connection.

Method 3: Playwright (Recommended for 2026)

Playwright is the modern replacement for Selenium and is the recommended tool for browser automation in 2026. It is faster, has built-in auto-wait functionality, and offers native per-context proxy support — meaning each browser context can use a different proxy without restarting the browser.

Why Playwright Over Selenium

  • Auto-wait — Playwright automatically waits for elements to be ready before interacting, eliminating most timing issues
  • Per-context proxies — Each browser context can have its own proxy, perfect for multi-account scraping
  • Faster execution — Playwright drives the browser over a persistent WebSocket connection, avoiding the per-command HTTP round trips of Selenium's WebDriver protocol
  • Better stealth — Harder for sites to detect as automation compared to Selenium

Code Example

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)

    # Create context with proxy
    context = browser.new_context(proxy={
        "server": "http://geo.spyderproxy.com:11000",
        "username": "user",
        "password": "pass"
    })

    page = context.new_page()

    # Navigate and login
    page.goto("https://example.com/login")
    page.fill("#username", "myuser")
    page.fill("#password", "mypass")
    page.click("#login-btn")

    # Auto-waits for navigation to complete
    page.wait_for_url("**/dashboard")

    # Scrape the authenticated page
    content = page.content()

    # Extract specific data
    titles = page.query_selector_all(".item-title")
    for title in titles:
        print(title.inner_text())

    # Navigate to more authenticated pages
    page.goto("https://example.com/account/orders")
    orders = page.content()

    browser.close()

For a detailed comparison of all three browser automation tools, see our guide on Puppeteer vs Playwright vs Selenium.

Method 4: Cookie Injection

When the login flow is too complex to automate — OAuth redirects, multi-step 2FA, visual CAPTCHAs, or biometric verification — you can log in manually once in your browser and then export the session cookies for use in your scraping script.

Step 1: Export Cookies from Your Browser

Open Chrome DevTools (F12), go to Application > Cookies, and find the cookies for the target domain. Look for session-related cookies (commonly named session_id, auth_token, JSESSIONID, or similar). Alternatively, use the EditThisCookie browser extension to export all cookies for a domain in JSON format.

Step 2: Inject into requests

import requests

session = requests.Session()

# Set proxy
session.proxies = {
    "http": "http://user:[email protected]:11000",
    "https": "http://user:[email protected]:11000"
}

# Inject cookies from your browser session
session.cookies.set("session_id", "abc123def456", domain="example.com")
session.cookies.set("auth_token", "eyJhbGciOi...", domain="example.com")

# Now scrape as an authenticated user
response = session.get("https://example.com/dashboard")
print(response.status_code)  # 200 if cookies are valid
print(response.text)

Step 3: Inject into Playwright

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    context = browser.new_context(proxy={
        "server": "http://geo.spyderproxy.com:11000",
        "username": "user",
        "password": "pass"
    })

    # Add cookies before navigating
    context.add_cookies([
        {"name": "session_id", "value": "abc123", "domain": "example.com", "path": "/"},
        {"name": "auth_token", "value": "xyz789", "domain": "example.com", "path": "/"},
    ])

    page = context.new_page()
    page.goto("https://example.com/dashboard")
    print(page.content())
    browser.close()

When to Use Cookie Injection

  • The login requires OAuth (Google, Facebook, GitHub sign-in)
  • The site has 2FA that you cannot automate
  • CAPTCHA persists even with residential proxies
  • The login flow is multi-step or involves redirects across multiple domains

The main downside is that cookies expire. Depending on the site, you may need to manually re-login and re-export cookies every few hours, days, or weeks.

Method 5: API Authentication with Tokens

Many modern websites — especially single-page applications built with React, Vue, or Angular — use JWT (JSON Web Tokens) or API keys for authentication instead of session cookies. The frontend logs in through an API endpoint and receives a token that is included in the Authorization header of subsequent requests.

How to Find the API Token

  1. Open Chrome DevTools and go to the Network tab
  2. Log in to the site manually
  3. Look for the login API request (usually a POST to /api/auth/login or similar)
  4. Check the Response for a token field (often called access_token, token, or jwt)
  5. Check subsequent requests — look for an Authorization: Bearer ... header

Code Example

import requests

# Option A: Login via API to get token
login_response = requests.post(
    "https://api.example.com/auth/login",
    json={"email": "[email protected]", "password": "mypass"},
    proxies={"https": "http://user:[email protected]:11000"}
)
token = login_response.json()["access_token"]

# Option B: Use token intercepted from DevTools
token = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..."

# Use the token in headers
headers = {"Authorization": f"Bearer {token}"}

# Scrape API endpoints directly
response = requests.get(
    "https://api.example.com/data/products",
    headers=headers,
    proxies={"https": "http://user:[email protected]:11000"}
)
data = response.json()
print(f"Found {len(data['items'])} products")

Why This Is the Best Method (When Available)

API-based scraping is faster than browser automation because you skip rendering HTML entirely. You get structured JSON responses that are easy to parse, and the requests are lightweight. If a site uses API-based auth, always prefer this method over Selenium or Playwright.

Why Proxies Matter for Authenticated Scraping

Authenticated scraping puts you at higher risk of detection and bans than anonymous scraping. Here is why proxies are essential:

  • Login pattern monitoring — Sites track how many login attempts come from a single IP. Multiple logins from the same address trigger security alerts, account locks, or IP bans.
  • Session fingerprinting — If you scrape thousands of pages from a single IP after logging in, the site sees abnormal behavior from that user-IP combination and may flag the account.
  • Residential IPs look like real users — Datacenter IPs are immediately suspicious. Residential proxies use IP addresses assigned to real ISP customers, making your scraping traffic indistinguishable from normal users.
  • Sticky sessions maintain authentication — When scraping behind a login, you need the same IP address for the entire authenticated session. Rotating IPs mid-session would invalidate your cookies. Sticky sessions solve this.

SpyderProxy offers residential proxies with 130M+ IPs across 190+ countries. The budget residential plan supports sticky sessions up to 24 hours, which is ideal for authenticated scraping workflows where you need IP consistency for the duration of a login session. The proxy endpoint is geo.spyderproxy.com:11000 for all examples in this guide.

Common Pitfalls and How to Avoid Them

Even with the right method, there are several traps that catch beginners. Here is how to handle each one:

CSRF Token Errors

If your POST login returns a 403 or redirects back to the login page, the site likely uses CSRF protection. Always GET the login page first, extract the hidden CSRF token from the HTML, and include it in your POST data. See the code example in Method 1 above.

Rate Limiting

Add random delays between requests (1–5 seconds for most sites). Authenticated sessions are monitored more closely than anonymous traffic. Use time.sleep(random.uniform(1, 5)) between page requests.
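A tiny helper (the name is mine) that applies that jitter and returns the delay it used:

```python
import random
import time

def polite_sleep(min_s=1.0, max_s=5.0):
    """Sleep for a random interval between min_s and max_s seconds to
    avoid the fixed request cadence that rate limiters look for."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay
```

Call polite_sleep() between each session.get(...) so consecutive requests never arrive on a machine-regular schedule.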

Session Expiration

Session cookies expire after a set time (often 15 minutes to 24 hours). Your scraper should detect when a session has expired — typically a redirect to the login page or a 401 status code — and automatically re-login. Build a re-authentication function into your scraper.
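One way to sketch that re-authentication wrapper (the function name, the 401 check, and the "/login" redirect check are assumptions — match them to your target site's actual expired-session signals):

```python
def fetch_with_reauth(session, url, login, max_retries=1):
    """GET url with a requests-style session. If the session looks
    expired (a 401 status or a bounce back to the login page), call
    login(session) to re-authenticate, then retry the request."""
    response = session.get(url)
    for _ in range(max_retries):
        expired = response.status_code == 401 or "/login" in response.url
        if not expired:
            break
        login(session)  # e.g. re-POST credentials as in Method 1
        response = session.get(url)
    return response
```

Passing the login step in as a callable keeps the retry logic independent of whether you authenticate with a form POST, injected cookies, or an API token.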

JavaScript Challenges

If you get empty responses or a page full of <noscript> tags, the site requires JavaScript rendering. Switch from requests to Playwright or Selenium. There is no workaround for JS-rendered content with HTTP-only libraries.

Two-Factor Authentication

2FA cannot be automated in most cases. Use the cookie injection method (Method 4): log in manually with 2FA, export the session cookies, and inject them into your script. Some sites offer app-based TOTP tokens that can be generated programmatically using the pyotp library if you have the secret key.
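pyotp's TOTP(secret).now() implements the standard TOTP algorithm (RFC 6238). For illustration, here is a minimal standard-library sketch of the same computation — note it takes a raw-byte secret, whereas pyotp itself expects the base32-encoded secret string sites give you:

```python
import hashlib
import hmac
import struct
import time

def totp(secret, for_time=None, digits=6, step=30):
    """RFC 6238 TOTP: HMAC-SHA1 over the 30-second time counter,
    dynamically truncated to a short numeric code."""
    counter = int((time.time() if for_time is None else for_time) // step)
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

# RFC 6238 test vector: at t=59s the counter is 1, giving code 287082
print(totp(b"12345678901234567890", for_time=59))  # 287082
```

In practice you would call totp (or pyotp) with the site's TOTP secret right before filling the 2FA field in your Playwright login flow.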

Frequently Asked Questions

Is it legal to scrape behind a login?

The legality depends on the website's Terms of Service and your jurisdiction. US courts have held that scraping publicly accessible data does not violate the CFAA (hiQ v. LinkedIn), but logging into a site typically means agreeing to Terms of Service that restrict automated access, and hiQ itself ultimately lost on breach-of-contract grounds. Always review the site's terms, scrape only data you have legitimate access to, and consult legal counsel for commercial projects.

Which method is the fastest for scraping?

The requests + Session method (Method 1) is the fastest because it makes lightweight HTTP requests without rendering a browser. API token authentication (Method 5) is equally fast. Playwright is faster than Selenium for browser-based approaches. Use requests when possible, and switch to browser automation only when the site requires JavaScript.

Do I need proxies for authenticated scraping?

Yes, for any project beyond a handful of pages. Sites monitor login IPs closely and will lock accounts or ban IPs that show scraping patterns. Residential proxies with sticky sessions are the standard setup for authenticated scraping at scale. Even for small projects, proxies protect your main IP from being flagged.

How do I handle CAPTCHA on login pages?

Residential proxies significantly reduce CAPTCHA frequency because the requests come from real ISP addresses. For persistent CAPTCHAs, use the cookie injection method: solve the CAPTCHA manually during login, then export and reuse the session cookies in your scraper. This avoids needing to solve CAPTCHAs programmatically.

Can I scrape websites that use two-factor authentication?

Yes, using the cookie injection method. Log in manually (including the 2FA step), export the session cookies from your browser, and inject them into your Python script. The cookies remain valid until they expire. If you have the TOTP secret key, you can also generate 2FA codes programmatically with the pyotp library and combine it with Playwright automation.

Conclusion

Scraping behind a login is straightforward once you choose the right method for the target site's authentication type. Start with requests.Session() for simple HTML forms, move to Playwright for JavaScript-heavy logins, and fall back to cookie injection for complex auth flows like OAuth and 2FA. In every case, pair your scraper with residential proxies to avoid IP bans and keep your sessions stable.

For more scraping guides, check out Python Requests with Rotating Proxies, How to Scrape Amazon Without Getting Blocked, and our web scraping use cases page for proxy configuration examples.

Need Proxies for Authenticated Scraping?

SpyderProxy residential proxies offer 130M+ IPs with sticky sessions up to 24 hours — perfect for maintaining authenticated scraping sessions without getting banned.