Alex R.
Apr 10, 2026
To scrape a website that requires login in Python, you have four main approaches: use the requests library with session objects to maintain cookies after a POST login, use Selenium or Playwright to automate the browser login flow, export cookies from your browser and inject them into your scraping script, or authenticate against the site's API and send the token in an Authorization header. The best method depends on whether the site uses simple form-based auth, JavaScript-rendered login, or token-based authentication.
This guide walks through five proven methods for scraping behind a login wall, complete with Python code examples, proxy integration for each approach, and advice on which technique to use based on the target site's authentication type. Whether you're scraping an internal dashboard, a supplier portal, or a member-only research platform, one of these methods will work.
Many of the most valuable data sources on the web sit behind authentication walls — internal dashboards, supplier portals, member-only research platforms, and similar gated services all require you to log in before you can reach the data.
In all of these cases, your scraper needs to handle the authentication step before it can access the target pages. The method you choose depends on how the site implements its login system.
The requests.Session() object is the simplest way to scrape behind a login. It automatically persists cookies across requests, so once you POST valid credentials to the login endpoint, all subsequent requests in that session are authenticated.
This works for sites with standard HTML form-based login — a <form> that POSTs username and password fields to a server endpoint. There is no JavaScript rendering required for the login page itself.
import requests
session = requests.Session()
# Route through SpyderProxy residential proxies (rated among the top residential proxy providers in 2026)
session.proxies = {
    "http": "http://user:[email protected]:11000",
    "https": "http://user:[email protected]:11000"
}

# Step 1: POST login credentials
login_url = "https://example.com/login"
login_data = {
    "username": "myuser",
    "password": "mypass"
}
response = session.post(login_url, data=login_data)

# Step 2: Verify login succeeded
if response.url == "https://example.com/dashboard":
    print("Login successful")

# Step 3: Scrape authenticated pages
dashboard = session.get("https://example.com/dashboard")
profile = session.get("https://example.com/account/settings")
print(dashboard.text)
The Session object stores cookies set by the server after a successful login. When the server responds to your POST with a Set-Cookie header containing a session ID, the Session object captures it and includes it in every subsequent request automatically. This is the same mechanism your browser uses.
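Comparing the redirect URL works when the site redirects on success, but a check that also inspects the page body is more robust. This is a sketch — the `/login` and "logout" markers are assumptions you should adjust for your target site:

```python
def login_succeeded(response) -> bool:
    """Heuristic check that a requests response represents a logged-in page.

    The markers below are examples -- adjust them for your target site.
    """
    # Being bounced back to the login page is a clear failure signal
    if "/login" in response.url:
        return False
    # A logout link usually only appears for authenticated users
    body = response.text.lower()
    return "logout" in body or "sign out" in body
```

Call this right after `session.post(login_url, data=login_data)` and abort (or retry) before scraping if it returns False.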
from bs4 import BeautifulSoup
# First, GET the login page to extract the CSRF token
login_page = session.get("https://example.com/login")
soup = BeautifulSoup(login_page.text, "html.parser")
csrf_token = soup.find("input", {"name": "csrf_token"})["value"]
# Include the token in your POST
login_data = {
    "username": "myuser",
    "password": "mypass",
    "csrf_token": csrf_token
}
session.post("https://example.com/login", data=login_data)
When the login page relies on JavaScript — dynamically rendered forms, single-page applications, or interactive CAPTCHA elements — you need a real browser. Selenium automates Chrome or Firefox and can interact with any login page the same way a human does.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Configure Chrome with proxy
options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=http://geo.spyderproxy.com:11000")
options.add_argument("--disable-blink-features=AutomationControlled")
driver = webdriver.Chrome(options=options)
# Navigate to login page
driver.get("https://example.com/login")
# Fill in credentials
driver.find_element(By.ID, "username").send_keys("myuser")
driver.find_element(By.ID, "password").send_keys("mypass")
driver.find_element(By.ID, "login-btn").click()
# Wait for login to complete (page redirect)
WebDriverWait(driver, 10).until(
    EC.url_contains("/dashboard")
)
# Now scrape authenticated pages
driver.get("https://example.com/dashboard")
page_source = driver.page_source
# Extract data from page_source with BeautifulSoup
from bs4 import BeautifulSoup
soup = BeautifulSoup(page_source, "html.parser")
data = soup.find_all("div", class_="data-row")
driver.quit()
Login pages often have animations, redirects, or AJAX calls that take time to complete. Never use a fixed time.sleep() to wait for the page to load — use WebDriverWait with explicit conditions:
# Wait for a specific element to appear after login
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CLASS_NAME, "dashboard-content"))
)
# Wait for URL to change
WebDriverWait(driver, 10).until(
    EC.url_contains("/dashboard")
)
Selenium with datacenter IPs triggers CAPTCHAs frequently because sites know those IPs belong to cloud providers. Residential proxies dramatically reduce CAPTCHA rates because the requests come from real ISP IP addresses that look like normal users. SpyderProxy's residential pool of 130M+ IPs makes each Selenium session appear to originate from a genuine household connection.
Playwright is the modern replacement for Selenium and is the recommended tool for browser automation in 2026. It is faster, has built-in auto-wait functionality, and offers native per-context proxy support — meaning each browser context can use a different proxy without restarting the browser.
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)

    # Create context with proxy
    context = browser.new_context(proxy={
        "server": "http://geo.spyderproxy.com:11000",
        "username": "user",
        "password": "pass"
    })
    page = context.new_page()

    # Navigate and login
    page.goto("https://example.com/login")
    page.fill("#username", "myuser")
    page.fill("#password", "mypass")
    page.click("#login-btn")

    # Auto-waits for navigation to complete
    page.wait_for_url("**/dashboard")

    # Scrape the authenticated page
    content = page.content()

    # Extract specific data
    titles = page.query_selector_all(".item-title")
    for title in titles:
        print(title.inner_text())

    # Navigate to more authenticated pages
    page.goto("https://example.com/account/orders")
    orders = page.content()

    browser.close()
For a detailed comparison of all three browser automation tools, see our guide on Puppeteer vs Playwright vs Selenium.
When the login flow is too complex to automate — OAuth redirects, multi-step 2FA, visual CAPTCHAs, or biometric verification — you can log in manually once in your browser and then export the session cookies for use in your scraping script.
Open Chrome DevTools (F12), go to Application > Cookies, and find the cookies for the target domain. Look for session-related cookies (commonly named session_id, auth_token, JSESSIONID, or similar). Alternatively, use a cookie-export browser extension such as EditThisCookie (or a similar tool available for your browser) to export all cookies for a domain in JSON format.
import requests
session = requests.Session()
# Set proxy
session.proxies = {
    "http": "http://user:[email protected]:11000",
    "https": "http://user:[email protected]:11000"
}
# Inject cookies from your browser session
session.cookies.set("session_id", "abc123def456", domain="example.com")
session.cookies.set("auth_token", "eyJhbGciOi...", domain="example.com")
# Now scrape as an authenticated user
response = session.get("https://example.com/dashboard")
print(response.status_code) # 200 if cookies are valid
print(response.text)
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch()
    context = browser.new_context(proxy={
        "server": "http://geo.spyderproxy.com:11000",
        "username": "user",
        "password": "pass"
    })

    # Add cookies before navigating
    context.add_cookies([
        {"name": "session_id", "value": "abc123", "domain": "example.com", "path": "/"},
        {"name": "auth_token", "value": "xyz789", "domain": "example.com", "path": "/"},
    ])
    page = context.new_page()
    page.goto("https://example.com/dashboard")
    print(page.content())
    browser.close()
The main downside is that cookies expire. Depending on the site, you may need to manually re-login and re-export cookies every few hours, days, or weeks.
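To stretch the time between manual re-exports, persist the exported cookies to disk and reload them on each run instead of pasting them into the script. A minimal sketch using plain JSON (the file name and cookie names here are placeholders):

```python
import json

def save_cookies(cookies: dict, path: str) -> None:
    # Store name -> value pairs so the next run can skip the manual export
    with open(path, "w") as f:
        json.dump(cookies, f)

def load_cookies(path: str) -> dict:
    with open(path) as f:
        return json.load(f)
```

With requests, you can then restore a session with `session.cookies.update(load_cookies("cookies.json"))`; when the site starts redirecting you to the login page, re-export and overwrite the file.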
Many modern websites — especially single-page applications built with React, Vue, or Angular — use JWT (JSON Web Tokens) or API keys for authentication instead of session cookies. The frontend logs in through an API endpoint and receives a token that is included in the Authorization header of subsequent requests.
To find the token, open DevTools and watch the Network tab while logging in: locate the authentication request (a POST to /api/auth/login or similar), find the token in the JSON response (commonly named access_token, token, or jwt), and note that the frontend sends it in an Authorization: Bearer ... header on every subsequent API call.
import requests
# Option A: Login via API to get token
login_response = requests.post(
    "https://api.example.com/auth/login",
    json={"email": "[email protected]", "password": "mypass"},
    proxies={"https": "http://user:[email protected]:11000"}
)
token = login_response.json()["access_token"]
# Option B: Use token intercepted from DevTools
token = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..."
# Use the token in headers
headers = {"Authorization": f"Bearer {token}"}
# Scrape API endpoints directly
response = requests.get(
    "https://api.example.com/data/products",
    headers=headers,
    proxies={"https": "http://user:[email protected]:11000"}
)
data = response.json()
print(f"Found {len(data['items'])} products")
API-based scraping is faster than browser automation because you skip rendering HTML entirely. You get structured JSON responses that are easy to parse, and the requests are lightweight. If a site uses API-based auth, always prefer this method over Selenium or Playwright.
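One more advantage of JWTs: they embed their own expiry, so you can decode the payload (no signature verification needed just to read it) and know exactly when to re-authenticate. A standard-library sketch:

```python
import base64
import json

def jwt_expiry(token: str) -> int:
    """Return the `exp` claim (a Unix timestamp) from a JWT's payload."""
    payload_b64 = token.split(".")[1]
    # JWTs strip base64 padding; restore it before decoding
    payload_b64 += "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload["exp"]
```

Compare the result against time.time() and re-run the login request shortly before the token lapses, rather than waiting for a 401.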
Authenticated scraping puts you at higher risk of detection and bans than anonymous scraping. Here is why proxies are essential:
SpyderProxy offers residential proxies with 130M+ IPs across 190+ countries. The budget residential plan supports sticky sessions up to 24 hours, which is ideal for authenticated scraping workflows where you need IP consistency for the duration of a login session. The proxy endpoint is geo.spyderproxy.com:11000 for all examples in this guide.
Even with the right method, there are several traps that catch beginners. Here is how to handle each one:
If your POST login returns a 403 or redirects back to the login page, the site likely uses CSRF protection. Always GET the login page first, extract the hidden CSRF token from the HTML, and include it in your POST data. See the code example in Method 1 above.
Add random delays between requests (1–5 seconds for most sites). Authenticated sessions are monitored more closely than anonymous traffic. Use time.sleep(random.uniform(1, 5)) between page requests.
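A small helper keeps the jitter in one place; the default bounds below are just example values to tune per site:

```python
import random
import time

def throttle(min_s: float = 1.0, max_s: float = 5.0) -> float:
    """Sleep for a random interval between min_s and max_s; return the delay used."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay
```

Call `throttle()` before each `session.get(...)` so every request path in your scraper gets the same rate limiting.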
Session cookies expire after a set time (often 15 minutes to 24 hours). Your scraper should detect when a session has expired — typically a redirect to the login page or a 401 status code — and automatically re-login. Build a re-authentication function into your scraper.
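The detection step can be a small predicate your request loop runs on every response. The `/login` marker is an assumption — swap in whatever your target site does when a session lapses:

```python
def session_expired(response) -> bool:
    # A 401, or a redirect that landed back on the login page,
    # means the session cookie is no longer valid
    return response.status_code == 401 or "/login" in response.url
```

When it returns True, call your login routine again and retry the request once before giving up.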
If you get empty responses or a page full of <noscript> tags, the site requires JavaScript rendering. Switch from requests to Playwright or Selenium. There is no workaround for JS-rendered content with HTTP-only libraries.
2FA cannot be automated in most cases. Use the cookie injection method (Method 4): log in manually with 2FA, export the session cookies, and inject them into your script. Some sites offer app-based TOTP tokens that can be generated programmatically using the pyotp library if you have the secret key.
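If installing pyotp is not an option, TOTP itself (RFC 6238) is small enough to implement with the standard library. This sketch assumes you have the base32 secret the site showed you when you enrolled the authenticator app:

```python
import base64
import hmac
import struct
import time

def totp(secret_b32: str, digits: int = 6, period: int = 30, now=None) -> str:
    """Generate an RFC 6238 TOTP code from a base32-encoded secret."""
    key = base64.b32decode(secret_b32, casefold=True)
    counter = int((time.time() if now is None else now) // period)
    digest = hmac.new(key, struct.pack(">Q", counter), "sha1").digest()
    offset = digest[-1] & 0x0F  # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)
```

Feed the result into the 2FA field of your Selenium or Playwright login flow, e.g. `page.fill("#otp", totp(secret))`.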
The legality depends on the website's Terms of Service and your jurisdiction. Scraping publicly accessible data has been upheld in US courts (hiQ v. LinkedIn), but logging into a site may involve agreeing to ToS that restrict automated access. Always review the site's terms, scrape only data you have legitimate access to, and consult legal counsel for commercial projects.
The requests + Session method (Method 1) is the fastest because it makes lightweight HTTP requests without rendering a browser. API token authentication (Method 5) is equally fast. Playwright is faster than Selenium for browser-based approaches. Use requests when possible, and switch to browser automation only when the site requires JavaScript.
Yes, for any project beyond a handful of pages. Sites monitor login IPs closely and will lock accounts or ban IPs that show scraping patterns. Residential proxies with sticky sessions are the standard setup for authenticated scraping at scale. Even for small projects, proxies protect your main IP from being flagged.
Residential proxies significantly reduce CAPTCHA frequency because the requests come from real ISP addresses. For persistent CAPTCHAs, use the cookie injection method: solve the CAPTCHA manually during login, then export and reuse the session cookies in your scraper. This avoids needing to solve CAPTCHAs programmatically.
Yes, using the cookie injection method. Log in manually (including the 2FA step), export the session cookies from your browser, and inject them into your Python script. The cookies remain valid until they expire. If you have the TOTP secret key, you can also generate 2FA codes programmatically with the pyotp library and combine it with Playwright automation.
Scraping behind a login is straightforward once you choose the right method for the target site's authentication type. Start with requests.Session() for simple HTML forms, move to Playwright for JavaScript-heavy logins, and fall back to cookie injection for complex auth flows like OAuth and 2FA. In every case, pair your scraper with residential proxies to avoid IP bans and keep your sessions stable.
For more scraping guides, check out Python Requests with Rotating Proxies, How to Scrape Amazon Without Getting Blocked, and our web scraping use cases page for proxy configuration examples.