How to Scrape Twitter (X) Data with Python (2026): Full Tutorial

SpyderProxy Team | Published 2026-04-21

X (formerly Twitter) closed its free public API tier in February 2023 and priced paid tiers so high that most hobby and mid-scale projects simply can't afford it — the $100/month "Basic" tier caps at 10,000 tweets, and "Pro" at $5,000/month is overkill for nearly everyone. That's why scraping Twitter/X data with Python has quietly become the default for market researchers, brand-monitoring tools, crypto sentiment trackers, and academics in 2026.

This tutorial walks you through the full pipeline: how to fetch tweets, profiles, search results, and trends from X without hitting an API bill, how to handle the aggressive rate limits and bot detection, and how residential proxies turn a fragile scraper into a production-grade data pipeline.

We'll use Playwright (more reliable than requests on X in 2026), JSON extraction from embedded state, and rotating residential IPs. Full code examples throughout.

Is Scraping Twitter/X Legal?

The short answer: scraping publicly accessible X data is generally legal in the US, UK, and EU, but specific uses matter. Key points to know:

  • hiQ v. LinkedIn (2022, Ninth Circuit) established that scraping publicly available data doesn't violate the Computer Fraud and Abuse Act. That ruling has been extended by lower courts to cover similar platforms, including X.
  • X's Terms of Service prohibit automated access without the API. Violating ToS is not a crime, but X can ban your accounts and block your IPs, and they have sued scrapers (Bright Data in 2023, multiple dismissed and refiled).
  • GDPR and CCPA apply when tweets include personal data. You can scrape public posts, but storing and processing them at scale may trigger obligations — especially if users are in the EU.
  • Copyright applies to tweet content. You can quote, analyze, and aggregate, but republishing verbatim feeds can create legal exposure.

Practically: if you're scraping publicly visible tweets for sentiment analysis, academic research, brand monitoring, or competitive intelligence, you're on solid ground. If you're scraping private or protected accounts, building a database of personal data, or republishing content commercially, talk to a lawyer first.

What You Can Actually Scrape from X in 2026

X exposes quite a lot without requiring login, even after their 2023 "rate limit" changes:

  • Tweets and replies from any public account — text, timestamp, engagement counts (likes, retweets, views, bookmarks), media URLs.
  • Profiles — bio, follower/following counts, join date, verified status, location if public, pinned tweet.
  • Search results — any query you can run in the X search bar, including operators like from:username, since:2026-01-01, min_faves:100.
  • Trending topics — by geographic region.
  • Lists — public lists and their members.
  • Communities — public communities and posts within them.

What requires login (and therefore real account risk): older tweets beyond the "guest view" limit, likes on a profile, bookmarks, DMs (obviously). The rest is scrapable without authentication if you have decent proxies.

Tools You'll Need

Install these first:

pip install playwright requests beautifulsoup4 pandas
playwright install chromium

Why Playwright instead of requests? X is now a JavaScript-heavy SPA (React + Redux under the hood). A plain requests.get() returns a mostly-empty HTML shell. Playwright renders the JS, lets us wait for tweets to load, and gives us access to the embedded JSON state — which is the cleanest way to extract structured data.

You'll also want a residential proxy plan. X's anti-bot is aggressive against datacenter IPs — you'll hit login walls within a few dozen requests from a datacenter range. Residential IPs (SpyderProxy Budget at $1.75/GB or Premium at $2.75/GB) look like real mobile users on home internet and get 10× the request budget before challenges kick in.
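The article later passes a session token in the proxy username to control rotation; gateway syntax varies by provider, so check the SpyderProxy docs for the exact format. As a sketch of that convention, a small helper can build the proxy dict that Playwright expects (the `-session-` suffix format is an assumption here):

```python
import random

def make_proxy(base_user, password, session_id=None):
    # Assumed convention: a sticky-session token appended to the proxy
    # username. Verify the exact syntax against your provider's docs.
    username = base_user if session_id is None else f"{base_user}-session-{session_id}"
    return {
        "server": "http://proxy.spyderproxy.com:10000",
        "username": username,
        "password": password,
    }

# Sticky session: reuse one token when you need scroll continuity
sticky = make_proxy("YOUR_USERNAME", "YOUR_PASSWORD", session_id="job42")

# Rotating: a fresh random token per task gets a fresh IP
rotating = make_proxy("YOUR_USERNAME", "YOUR_PASSWORD",
                      session_id=str(random.randint(1, 999999)))
```

Passing either dict as the `proxy=` argument to `chromium.launch()` is all Playwright needs.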

Step 1: Scrape a Public Profile

Let's start with the simplest case: pulling basic profile data for a single user.

from playwright.sync_api import sync_playwright
import json

PROXY = {
    "server": "http://proxy.spyderproxy.com:10000",
    "username": "YOUR_USERNAME",
    "password": "YOUR_PASSWORD",
}

def scrape_profile(username: str):
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            proxy=PROXY,
        )
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/147.0.0.0 Safari/537.36",
            viewport={"width": 1280, "height": 900},
            locale="en-US",
        )
        page = context.new_page()
        page.goto(f"https://x.com/{username}", timeout=30000)
        page.wait_for_selector('[data-testid="UserName"]', timeout=15000)

        # Extract basic profile info
        name = page.inner_text('[data-testid="UserName"]')
        bio_el = page.locator('[data-testid="UserDescription"]')
        bio = bio_el.inner_text() if bio_el.count() else ""

        stats_elements = page.locator('[role="presentation"] a').all()
        followers = following = ""
        for el in stats_elements:
            text = el.inner_text()
            href = el.get_attribute("href") or ""
            if href.endswith("/verified_followers") or href.endswith("/followers"):
                followers = text.split("\n")[0]
            elif href.endswith("/following"):
                following = text.split("\n")[0]

        browser.close()
        return {
            "username": username,
            "name": name,
            "bio": bio,
            "followers": followers,
            "following": following,
        }

if __name__ == "__main__":
    data = scrape_profile("elonmusk")
    print(json.dumps(data, indent=2))

Key points:

  • User-Agent matters — X blocks default Playwright UA strings. Use a current Chrome fingerprint.
  • Proxy is mandatory for non-trivial volume — a single profile fetch might succeed from your own IP, but after two or three you'll hit a login wall.
  • Use data-testid attributes, not CSS class names. X generates class names via CSS-in-JS and they change every deploy.

Step 2: Scrape Tweets from a User Timeline

The trick is scrolling. X uses infinite scroll and loads only ~20 tweets at a time, so you have to keep scrolling to fetch more.

from playwright.sync_api import sync_playwright
import time

PROXY = {
    "server": "http://proxy.spyderproxy.com:10000",
    "username": "YOUR_USERNAME",
    "password": "YOUR_PASSWORD",
}

def scrape_user_tweets(username: str, max_tweets: int = 50):
    tweets = []
    seen = set()
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True, proxy=PROXY)
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/147.0.0.0 Safari/537.36",
            locale="en-US",
        )
        page = context.new_page()
        page.goto(f"https://x.com/{username}", timeout=30000)
        page.wait_for_selector('article[data-testid="tweet"]', timeout=20000)

        prev_count = -1
        stall_count = 0
        while len(tweets) < max_tweets and stall_count < 3:
            articles = page.locator('article[data-testid="tweet"]').all()
            for a in articles:
                try:
                    tweet_id = a.locator('a[href*="/status/"]').first.get_attribute("href")
                    if not tweet_id or tweet_id in seen:
                        continue
                    seen.add(tweet_id)
                    text_el = a.locator('[data-testid="tweetText"]')
                    text = text_el.inner_text() if text_el.count() else ""
                    time_el = a.locator('time')
                    ts = time_el.get_attribute("datetime") if time_el.count() else ""
                    tweets.append({
                        "id": tweet_id.split("/status/")[-1],
                        "url": f"https://x.com{tweet_id}",
                        "text": text,
                        "timestamp": ts,
                    })
                except Exception:
                    continue
            if len(tweets) == prev_count:
                stall_count += 1
            else:
                stall_count = 0
            prev_count = len(tweets)
            page.mouse.wheel(0, 3000)
            time.sleep(1.5)
        browser.close()
    return tweets[:max_tweets]

if __name__ == "__main__":
    tweets = scrape_user_tweets("elonmusk", max_tweets=30)
    for t in tweets[:5]:
        print(t)

Note the stall counter — if we scroll three times without loading new tweets, we've hit the end (or X is rate-limiting us). Bailing early prevents infinite loops.

Step 3: Scrape Search Results and Hashtags

The search endpoint accepts X's full operator syntax: from:user, since:YYYY-MM-DD, until:YYYY-MM-DD, min_faves:N, lang:en, and so on.
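Operator strings are easy to mistype, so it can help to assemble them programmatically. A small helper (hypothetical, but producing the same URL pattern the scraper uses) that composes operators and URL-encodes the query:

```python
from urllib.parse import quote_plus

def build_search_url(query, tab="live", **operators):
    # Append X search operators (from:, since:, until:, min_faves:, lang:)
    # to the free-text query, then URL-encode the whole string.
    parts = [query] + [f"{k}:{v}" for k, v in operators.items()]
    q = " ".join(p for p in parts if p)
    return f"https://x.com/search?q={quote_plus(q)}&src=typed_query&f={tab}"

url = build_search_url('"web scraping"', tab="live", min_faves="50", lang="en")
```

Keyword arguments keep their order in Python 3.7+, so the operators appear in the query in the order you write them.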

import time
from urllib.parse import quote_plus

def scrape_search(query: str, max_results: int = 100, tab: str = "live"):
    """tab: 'live' (recent tweets), 'top' (top tweets), 'people', 'photos', 'videos'"""
    url = f"https://x.com/search?q={quote_plus(query)}&src=typed_query&f={tab}"
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True, proxy=PROXY)
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/147.0.0.0 Safari/537.36",
        )
        page = context.new_page()
        page.goto(url, timeout=30000)
        page.wait_for_selector('article[data-testid="tweet"]', timeout=20000)

        # Same scroll-and-collect pattern as Step 2
        tweets = []
        seen = set()
        for _ in range(20):
            articles = page.locator('article[data-testid="tweet"]').all()
            for a in articles:
                try:
                    href = a.locator('a[href*="/status/"]').first.get_attribute("href")
                    if href and href not in seen:
                        seen.add(href)
                        text_el = a.locator('[data-testid="tweetText"]')
                        text = text_el.inner_text() if text_el.count() else ""
                        time_el = a.locator('time')
                        ts = time_el.get_attribute("datetime") if time_el.count() else ""
                        author = href.split("/")[1] if href.startswith("/") else ""
                        tweets.append({
                            "url": f"https://x.com{href}",
                            "author": author,
                            "text": text,
                            "timestamp": ts,
                        })
                        if len(tweets) >= max_results:
                            browser.close()
                            return tweets
                except Exception:
                    continue
            page.mouse.wheel(0, 3000)
            time.sleep(1.5)
        browser.close()
    return tweets

# Examples
scrape_search('"web scraping" min_faves:50 lang:en', max_results=50)
scrape_search("#SEO since:2026-04-01 until:2026-04-21", max_results=100)
scrape_search("from:openai", max_results=30, tab="top")

Step 4: Handle Rate Limits and Anti-Bot Challenges

X's defenses are, in order of severity:

  1. Guest rate limit — anonymous users can view a limited number of tweets per session. You'll see "Rate limit exceeded" at ~100–200 requests from a single IP.
  2. Login wall — X redirects to /i/flow/login when it doesn't trust the session. Usually triggered by datacenter IPs or by aggressive request rates.
  3. Arkose Labs FunCaptcha — the rotating 3D puzzle. Shown on suspicious login flows and sometimes on signup. Not typically on guest browse.
  4. IP bans — temporary (30 min – 2 hour) or permanent soft blocks on specific IPs.

The two-line fix for 95% of these: rotating residential proxies + a rotation delay.

import time, random

# In your scraper loop — rotate every N requests by getting a new proxy session
for i, username in enumerate(usernames):
    if i % 20 == 0:
        # Trigger session rotation on SpyderProxy (new IP per session ID)
        PROXY["username"] = f"YOUR_USERNAME-session-{random.randint(1, 999999)}"
    data = scrape_profile(username)
    time.sleep(random.uniform(2, 5))  # jitter

SpyderProxy's rotating residential pool returns a fresh IP on each new session ID — pass a unique session token in the username and you get per-request rotation, or reuse the same token for up to 24h of sticky sessions when you need scroll continuity. See the residential proxy docs for the exact syntax.

Step 5: Extract Structured JSON from Embedded State

X ships a big chunk of structured JSON inside <script id="__NEXT_DATA__"> or in an inline window.__INITIAL_STATE__ block on some routes. If you can grab this, you skip DOM scraping entirely and get clean objects.

import re, json

def extract_initial_state(html: str):
    # Next.js payload: <script id="__NEXT_DATA__">{...}</script>
    m = re.search(r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.+?)</script>', html, re.DOTALL)
    if m:
        return json.loads(m.group(1))
    # Fallback: inline state object on some routes
    m = re.search(r'window\.__INITIAL_STATE__\s*=\s*({.+?});', html, re.DOTALL)
    if m:
        return json.loads(m.group(1))
    return None

# Inside Playwright
html = page.content()
state = extract_initial_state(html)
if state:
    # traverse the object — shape changes over time, explore with a JSON viewer
    print(json.dumps(state, indent=2)[:2000])

This is the highest-fidelity path — you get exact follower counts, precise engagement numbers, and metadata that the rendered UI truncates. The downside is the JSON shape changes across X deploys; when yours breaks, log a snapshot and re-map the path.
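To make those re-mappings less painful, traverse the state defensively so a missing key returns a default instead of crashing the run. A minimal path getter, shown against a made-up example shape (the real JSON will differ):

```python
def dig(obj, *path, default=None):
    # Walk nested dicts/lists; return `default` instead of raising when a
    # key or index is missing, so shape drift degrades gracefully.
    for key in path:
        try:
            obj = obj[key]
        except (KeyError, IndexError, TypeError):
            return default
    return obj

# Hypothetical shape for illustration only
state = {"props": {"pageProps": {"user": {"followers_count": 1234}}}}
count = dig(state, "props", "pageProps", "user", "followers_count")       # 1234
missing = dig(state, "props", "pageProps", "legacy", "name", default="")  # ""
```

When a lookup starts returning the default, that is your signal to log a snapshot and re-map the path.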

Step 6: Save to CSV (and Dedupe)

import pandas as pd

tweets = scrape_search("#SEO since:2026-04-01", max_results=500)
df = pd.DataFrame(tweets)
df = df.drop_duplicates(subset=["url"])

# Optional: enrich with extra computed columns
df["scraped_at"] = pd.Timestamp.now()
df["author"] = df["url"].str.extract(r"x\.com/([^/]+)/", expand=False)  # username from the tweet URL
df.to_csv("x_seo_tweets.csv", index=False, encoding="utf-8")
print(f"Saved {len(df)} unique tweets")
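If the scrape runs on a schedule, you also want to dedupe across runs, not just within one. A sketch of an incremental append (the `append_dedup` helper and file layout are illustrative, not part of any library):

```python
import os
import pandas as pd

def append_dedup(new_rows, path, key="url"):
    # Merge new rows into an existing CSV, keeping the first-seen copy of
    # each key so the original scrape timestamps are preserved.
    new_df = pd.DataFrame(new_rows)
    if os.path.exists(path):
        old_df = pd.read_csv(path)
        merged = pd.concat([old_df, new_df], ignore_index=True)
    else:
        merged = new_df
    merged = merged.drop_duplicates(subset=[key], keep="first")
    merged.to_csv(path, index=False, encoding="utf-8")
    return merged

# Usage: append_dedup(tweets, "x_seo_tweets.csv")
```

Rows seen in an earlier run win, so re-scraping an overlapping date range never duplicates or overwrites existing data.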

Residential vs Mobile Proxies for X

For X specifically:

  • Rotating residential (SpyderProxy Premium $2.75/GB or Budget $1.75/GB) — best general-purpose choice. Fresh IP per session, clean reputation on real ISPs, works for 95% of scraping workloads.
  • Mobile/LTE proxies ($2/IP) — go here when residential gets blocked. X treats mobile IPs as highest-trust: carrier-grade NAT puts thousands of real users behind each IP, so blocking one would hit them all. Overkill for most hobby projects but essential for high-volume commercial scrapers.
  • Static residential/ISP ($3.90/day) — when you need the same IP for hours (e.g., logged-in scraping, persistent session tracking).
  • Datacenter — don't. X identifies commercial hosting ranges in seconds.

Common Mistakes That Get Your Scraper Banned

  • Using requests instead of a headless browser. X's HTML is a JS shell. A requests.get() response is almost empty.
  • Reusing the same IP for 1,000 requests. Rotate every 20–50 requests maximum.
  • No randomized delay. A perfectly regular 1.0-second cadence is the biggest bot tell there is.
  • Default Playwright User-Agent. Always override with a current Chrome string.
  • Not setting viewport. Bot detection scripts fingerprint unusual viewport sizes. Stick to common resolutions (1280×900, 1920×1080).
  • Scraping while logged in. Tempting for fuller data, but X bans accounts aggressively for automation. Only do this with burner accounts you can afford to lose.
  • Ignoring the Retry-After response header. When X does rate-limit you, it tells you exactly how long to wait.
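The last point deserves a snippet: honor Retry-After when the server sends it, and fall back to capped exponential backoff with jitter otherwise. The base and cap values below are illustrative defaults, not X-specific numbers:

```python
import random

def backoff_seconds(attempt, retry_after=None, base=2.0, cap=120.0):
    # Prefer the server's explicit Retry-After (in seconds) when present.
    if retry_after is not None:
        try:
            return float(retry_after)
        except ValueError:
            pass  # Retry-After may also be an HTTP date; not handled in this sketch
    # Otherwise: exponential backoff with full jitter, capped.
    return random.uniform(0, min(cap, base * (2 ** attempt)))

wait = backoff_seconds(3)                 # somewhere in [0, 16)
wait_explicit = backoff_seconds(0, "30")  # 30.0
```

Full jitter (random over the whole window rather than a fixed delay) also breaks up the perfectly regular cadence called out above.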

Should You Use a Commercial Twitter/X Scraping API Instead?

If you don't want to maintain the browser automation yourself, there are third-party X scraping APIs on RapidAPI and elsewhere. Pricing is typically $30–$300/month for 10K–1M requests. They work fine for low-volume and prototype-stage projects. For production at scale, a self-hosted scraper + residential proxies is usually 5–10× cheaper per tweet and gives you control over rate, freshness, and fields.

Frequently Asked Questions

Is scraping Twitter/X against the law?

Scraping publicly accessible data on X is not a crime in the US, UK, or EU under current precedent (hiQ v. LinkedIn, Ninth Circuit 2022). It does violate X's Terms of Service, which means X can ban your account and block your IPs, and they can file civil suits — they have, though most have been dismissed. Private/protected accounts, bypassing technical barriers, and commercial republication raise more serious legal questions.

Can I scrape X without the paid API?

Yes. Public tweets, profiles, search results, and trends are all accessible from the unauthenticated web interface. You need a headless browser (Playwright or Selenium) because X is a single-page application, plus residential proxies to get past the per-IP rate limits.

Why do my scrapers get blocked so fast?

Three main reasons: (1) you're scraping from a datacenter IP — X identifies these instantly; (2) you have no randomization in your request timing; (3) your User-Agent is the default Playwright string or a generic bot UA. Fix all three and most blocks disappear.

How many tweets can I scrape per hour?

With rotating residential proxies and a 2–5 second delay between requests, a single concurrency-1 scraper realistically does 500–1,500 tweets per hour. Parallelism scales this linearly if each worker uses a different residential IP. Expect to hit rate limits at 10,000+ tweets per hour without careful rotation.
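Those numbers fall straight out of the delay arithmetic. A quick sanity check (back-of-envelope only: it ignores page-load time, retries, and rate-limit pauses, so real throughput runs lower):

```python
def tweets_per_hour(avg_delay_s, tweets_per_request=1.0, workers=1):
    # Requests per hour per worker = 3600 / average delay between requests.
    return int(3600 / avg_delay_s * tweets_per_request * workers)

# One worker, 3.5 s average delay, one tweet per request: ~1,000/hour,
# squarely inside the 500-1,500 range quoted above.
single = tweets_per_hour(3.5)
# Each residential-IP worker adds linearly.
four_workers = tweets_per_hour(3.5, workers=4)
```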

Should I use Selenium or Playwright for X?

Playwright. It has better async support, cleaner APIs, and better proxy integration than Selenium. Also, Playwright's auto-wait logic handles X's dynamic rendering without all the WebDriverWait boilerplate.

Can I scrape tweet replies and thread context?

Yes — navigate to the tweet URL (/status/TWEET_ID) and scrape the article elements that appear beneath the main tweet. These are replies. X loads them lazily with scroll, same pattern as the user timeline example above.

What's the cheapest way to scrape X at scale?

Self-hosted Playwright workers + SpyderProxy Budget Residential at $1.75/GB. A typical tweet scrape is 50–150 KB of traffic, so $1.75 buys you roughly 7,000–20,000 tweets worth of bandwidth. That's an order of magnitude cheaper than any commercial X scraping API.
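The arithmetic behind that estimate, as a sketch (the 50–150 KB per tweet figure is the assumption stated above; real payload sizes vary with media and embedded state):

```python
def tweets_per_gb(kb_per_tweet):
    # 1 GB of proxy bandwidth is roughly 1,000,000 KB of transfer.
    return int(1_000_000 / kb_per_tweet)

light = tweets_per_gb(50)    # lean text-only pages: ~20,000 tweets per GB
heavy = tweets_per_gb(150)   # media-heavy pages: ~6,600 tweets per GB
# At $1.75/GB, that spans roughly the 7,000-20,000 range quoted above.
```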

Does X detect headless Chrome?

Yes, but you can hide it. Set --disable-blink-features=AutomationControlled, override navigator.webdriver to undefined, and use a real Chrome profile path. Stealth plugins such as playwright-stealth (Python) or playwright-extra with its stealth plugin (Node) do most of this automatically.
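A sketch of the manual approach: the Blink flag is a standard Chromium argument and `add_init_script` is a standard Playwright context method; treat the rest as a starting point, not a complete stealth setup.

```python
def stealth_launch_kwargs(proxy=None):
    # Arguments for Playwright's chromium.launch(). The Blink flag removes
    # the most common automation fingerprint; navigator.webdriver itself is
    # patched per-context with the init script below.
    kwargs = {
        "headless": True,
        "args": ["--disable-blink-features=AutomationControlled"],
    }
    if proxy:
        kwargs["proxy"] = proxy
    return kwargs

# Runs before any page script, so navigator.webdriver reads as undefined
WEBDRIVER_PATCH = """
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
"""

# Usage:
#   browser = p.chromium.launch(**stealth_launch_kwargs(PROXY))
#   context = browser.new_context(...)
#   context.add_init_script(WEBDRIVER_PATCH)
```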

How often does X change its HTML?

Major structural changes every 2–4 months, minor CSS class changes weekly. Always target data-testid attributes (e.g., tweet, tweetText, UserName) because these are stable across deploys in a way CSS class names aren't.

Bottom Line

Scraping X in 2026 is a solved problem as long as you use the right tools: Playwright for JS rendering, rotating residential proxies to defeat per-IP rate limits, and data-testid selectors for stability. Don't fight the platform — let residential IPs do the work of looking like real users, add randomized delays, and rotate sessions every 20–50 requests.

For production workloads, SpyderProxy Premium Residential at $2.75/GB with rotating sessions handles the vast majority of X scraping without a single 403. When you hit the hardest cases — logged-in scraping, heavy commercial volume, or Arkose-protected flows — step up to LTE Mobile proxies at $2/IP and the block rate drops to effectively zero.
