X (formerly Twitter) closed its free public API tier in February 2023 and priced paid tiers so high that most hobby and mid-scale projects simply can't afford it — the $100/month "Basic" tier caps at 10,000 tweets, and "Pro" at $5,000/month is overkill for nearly everyone. That's why scraping Twitter/X data with Python has quietly become the default for market researchers, brand-monitoring tools, crypto sentiment trackers, and academics in 2026.
This tutorial walks you through the full pipeline: how to fetch tweets, profiles, search results, and trends from X without hitting an API bill, how to handle the aggressive rate limits and bot detection, and how residential proxies turn a fragile scraper into a production-grade data pipeline.
We'll use Playwright (more reliable than requests on X in 2026), JSON extraction from embedded state, and rotating residential IPs. Full code examples throughout.
The short answer: scraping publicly accessible X data is generally legal in the US, UK, and EU, but how you collect and use the data matters. Practically: if you're scraping publicly visible tweets for sentiment analysis, academic research, brand monitoring, or competitive intelligence, you're on solid ground. If you're scraping private or protected accounts, building a database of personal data, or republishing content commercially, talk to a lawyer first.
X exposes quite a lot without requiring login, even after its 2023 "rate limit" changes: public profiles, individual tweets, recent timeline tweets, trends, and search with the full operator syntax (from:username, since:2026-01-01, min_faves:100). What requires login, and therefore real account risk: older tweets beyond the "guest view" limit, likes on a profile, bookmarks, and DMs (obviously). The rest is scrapable without authentication if you have decent proxies.
Install these first:
```shell
pip install playwright requests beautifulsoup4 pandas
playwright install chromium
```
Why Playwright instead of requests? X is now a JavaScript-heavy SPA (React + Redux under the hood). A plain requests.get() returns a mostly-empty HTML shell. Playwright renders the JS, lets us wait for tweets to load, and gives us access to the embedded JSON state — which is the cleanest way to extract structured data.
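You can verify this yourself with a quick sanity check. The helper below is hypothetical (not part of the pipeline); it just tests whether an HTML string contains the tweet/profile markup that only appears after JavaScript rendering:

```python
def looks_rendered(html: str) -> bool:
    """True if the HTML contains rendered tweet or profile markup."""
    return 'data-testid="tweet"' in html or 'data-testid="UserName"' in html

# Spot-check with requests (no JS execution):
#   resp = requests.get("https://x.com/elonmusk", timeout=15)
#   looks_rendered(resp.text)  # typically False: you get the empty React shell
```

If that check comes back False on a plain HTTP fetch, there is nothing for BeautifulSoup to parse, which is exactly why the rest of this tutorial drives a real browser.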
You'll also want a residential proxy plan. X's anti-bot is aggressive against datacenter IPs — you'll hit login walls within a few dozen requests from a datacenter range. Residential IPs (SpyderProxy Budget at $1.75/GB or Premium at $2.75/GB) look like real mobile users on home internet and get 10× the request budget before challenges kick in.
Let's start with the simplest case: pulling basic profile data for a single user.
```python
from playwright.sync_api import sync_playwright
import json

PROXY = {
    "server": "http://proxy.spyderproxy.com:10000",
    "username": "YOUR_USERNAME",
    "password": "YOUR_PASSWORD",
}

def scrape_profile(username: str):
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            proxy=PROXY,
        )
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/147.0.0.0 Safari/537.36",
            viewport={"width": 1280, "height": 900},
            locale="en-US",
        )
        page = context.new_page()
        page.goto(f"https://x.com/{username}", timeout=30000)
        page.wait_for_selector('[data-testid="UserName"]', timeout=15000)

        # Extract basic profile info
        name = page.inner_text('[data-testid="UserName"]')
        bio_el = page.locator('[data-testid="UserDescription"]')
        bio = bio_el.inner_text() if bio_el.count() else ""

        # Follower/following counts live in profile links ending in /followers and /following
        stats_elements = page.locator('[role="presentation"] a').all()
        followers = following = ""
        for el in stats_elements:
            text = el.inner_text()
            href = el.get_attribute("href") or ""
            if href.endswith("/verified_followers") or href.endswith("/followers"):
                followers = text.split("\n")[0]
            elif href.endswith("/following"):
                following = text.split("\n")[0]

        browser.close()
        return {
            "username": username,
            "name": name,
            "bio": bio,
            "followers": followers,
            "following": following,
        }

if __name__ == "__main__":
    data = scrape_profile("elonmusk")
    print(json.dumps(data, indent=2))
```
Key point: target data-testid attributes, not CSS class names. X generates class names via CSS-in-JS and they change every deploy.

Next up: a user's tweet timeline. Scrolling is the trick here. X uses infinite scroll that loads only ~20 tweets per page, so you have to keep scrolling to fetch more.
```python
from playwright.sync_api import sync_playwright
import time

PROXY = {
    "server": "http://proxy.spyderproxy.com:10000",
    "username": "YOUR_USERNAME",
    "password": "YOUR_PASSWORD",
}

def scrape_user_tweets(username: str, max_tweets: int = 50):
    tweets = []
    seen = set()
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True, proxy=PROXY)
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/147.0.0.0 Safari/537.36",
            locale="en-US",
        )
        page = context.new_page()
        page.goto(f"https://x.com/{username}", timeout=30000)
        page.wait_for_selector('article[data-testid="tweet"]', timeout=20000)

        prev_count = -1
        stall_count = 0
        while len(tweets) < max_tweets and stall_count < 3:
            articles = page.locator('article[data-testid="tweet"]').all()
            for a in articles:
                try:
                    tweet_id = a.locator('a[href*="/status/"]').first.get_attribute("href")
                    if not tweet_id or tweet_id in seen:
                        continue
                    seen.add(tweet_id)
                    text_el = a.locator('[data-testid="tweetText"]')
                    text = text_el.inner_text() if text_el.count() else ""
                    time_el = a.locator('time')
                    ts = time_el.get_attribute("datetime") if time_el.count() else ""
                    tweets.append({
                        "id": tweet_id.split("/status/")[-1],
                        "url": f"https://x.com{tweet_id}",
                        "text": text,
                        "timestamp": ts,
                    })
                except Exception:
                    continue
            # Stall detection: three scrolls with no new tweets means we're done
            if len(tweets) == prev_count:
                stall_count += 1
            else:
                stall_count = 0
            prev_count = len(tweets)
            page.mouse.wheel(0, 3000)
            time.sleep(1.5)
        browser.close()
    return tweets[:max_tweets]

if __name__ == "__main__":
    tweets = scrape_user_tweets("elonmusk", max_tweets=30)
    for t in tweets[:5]:
        print(t)
```
Note the stall counter — if we scroll three times without loading new tweets, we've hit the end (or X is rate-limiting us). Bailing early prevents infinite loops.
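One refinement worth making: the loop above scrolls a fixed 3,000 pixels and sleeps exactly 1.5 seconds every iteration, and perfectly uniform timing is an easy automation signal. A sketch of a jittered replacement (hypothetical helper, built on the same page.mouse.wheel API used above):

```python
import random
import time

def human_scroll(page, steps: int = 1, base_px: int = 3000):
    """Scroll with randomized distance and pauses instead of a fixed cadence."""
    for _ in range(steps):
        # Vary the scroll distance by up to +/-500px
        page.mouse.wheel(0, base_px + random.randint(-500, 500))
        # Vary the pause between 1.0 and 2.5 seconds
        time.sleep(random.uniform(1.0, 2.5))
```

Swap `page.mouse.wheel(0, 3000); time.sleep(1.5)` for `human_scroll(page)` in any of the scrapers in this article.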
The search endpoint accepts X's full operator syntax: from:user, since:YYYY-MM-DD, until:YYYY-MM-DD, min_faves:N, lang:en, and so on.
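Composing these operator strings by hand gets error-prone, so here is a small optional helper (hypothetical, not required by the scraper below) that assembles them from keyword arguments:

```python
def build_query(text: str = "", **operators) -> str:
    """Compose an X search query from free text plus operators.

    Reserved words like `from` can be passed via a dict:
    build_query(**{"from": "openai"}).
    """
    parts = []
    if text:
        # Quote multi-word phrases so X treats them as an exact match
        parts.append(f'"{text}"' if " " in text else text)
    parts.extend(f"{key}:{value}" for key, value in operators.items())
    return " ".join(parts)

# build_query("web scraping", min_faves=50, lang="en")
# -> '"web scraping" min_faves:50 lang:en'
```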
```python
from urllib.parse import quote_plus
from playwright.sync_api import sync_playwright
import time

# Reuses the PROXY dict from the earlier examples

def scrape_search(query: str, max_results: int = 100, tab: str = "live"):
    """tab: 'live' (recent tweets), 'top' (top tweets), 'people', 'photos', 'videos'"""
    url = f"https://x.com/search?q={quote_plus(query)}&src=typed_query&f={tab}"
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True, proxy=PROXY)
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/147.0.0.0 Safari/537.36",
        )
        page = context.new_page()
        page.goto(url, timeout=30000)
        page.wait_for_selector('article[data-testid="tweet"]', timeout=20000)

        # Same scroll-and-collect pattern as Step 2
        tweets = []
        seen = set()
        for _ in range(20):
            articles = page.locator('article[data-testid="tweet"]').all()
            for a in articles:
                try:
                    href = a.locator('a[href*="/status/"]').first.get_attribute("href")
                    if href and href not in seen:
                        seen.add(href)
                        text_el = a.locator('[data-testid="tweetText"]')
                        text = text_el.inner_text() if text_el.count() else ""
                        ts = a.locator('time').get_attribute("datetime") if a.locator('time').count() else ""
                        author = href.split("/")[1] if href.startswith("/") else ""
                        tweets.append({"url": f"https://x.com{href}", "author": author, "text": text, "timestamp": ts})
                        if len(tweets) >= max_results:
                            browser.close()
                            return tweets
                except Exception:
                    continue
            page.mouse.wheel(0, 3000)
            time.sleep(1.5)
        browser.close()
    return tweets

# Examples
scrape_search('"web scraping" min_faves:50 lang:en', max_results=50)
scrape_search("#SEO since:2026-04-01 until:2026-04-21", max_results=100)
scrape_search("from:openai", max_results=30, tab="top")
```
X's defenses are, in order of severity: per-IP rate limiting on guest sessions, redirects to /i/flow/login when it doesn't trust the session (usually triggered by datacenter IPs or aggressive request rates), and outright IP blocks. The two-line fix for 95% of these: rotating residential proxies plus a randomized rotation delay.
```python
import time, random

# In your scraper loop: rotate every N requests by starting a new proxy session
for i, username in enumerate(usernames):
    if i % 20 == 0:
        # Trigger session rotation on SpyderProxy (new IP per session ID)
        PROXY["username"] = f"YOUR_USERNAME-session-{random.randint(1, 999999)}"
    data = scrape_profile(username)
    time.sleep(random.uniform(2, 5))  # jitter
```
SpyderProxy's rotating residential pool returns a fresh IP on each new session ID — pass a unique session token in the username and you get per-request rotation, or reuse the same token for up to 24h of sticky sessions when you need scroll continuity. See the residential proxy docs for the exact syntax.
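The rotating-versus-sticky distinction can be wrapped in two small factory functions. This is a sketch that follows the username-suffix format from the rotation example above; the exact session syntax is an assumption, so confirm it against the provider docs:

```python
import uuid

PROXY_SERVER = "http://proxy.spyderproxy.com:10000"
BASE_USER = "YOUR_USERNAME"  # placeholder, as in the earlier examples

def rotating_proxy() -> dict:
    # Fresh session ID per call: the provider assigns a new exit IP each time
    return {
        "server": PROXY_SERVER,
        "username": f"{BASE_USER}-session-{uuid.uuid4().hex[:8]}",
        "password": "YOUR_PASSWORD",
    }

def sticky_proxy(session_id: str) -> dict:
    # Reusing one session ID keeps the same exit IP (up to ~24h),
    # which you need when a single scrape involves lots of scrolling
    return {
        "server": PROXY_SERVER,
        "username": f"{BASE_USER}-session-{session_id}",
        "password": "YOUR_PASSWORD",
    }
```

Use rotating_proxy() for one-shot profile fetches and sticky_proxy("job42") for the timeline and search scrapers, where losing the IP mid-scroll would reset the infinite-scroll state.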
X ships a big chunk of structured JSON inside <script id="__NEXT_DATA__"> or in an inline window.__INITIAL_STATE__ block on some routes. If you can grab this, you skip DOM scraping entirely and get clean objects.
```python
import re, json

def extract_initial_state(html: str):
    # Next-style embedded JSON: <script id="__NEXT_DATA__" ...>{...}</script>
    m = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
    if m:
        return json.loads(m.group(1))
    # Fallback: inline Redux-style state assignment
    m = re.search(r'window\.__INITIAL_STATE__\s*=\s*({.+?});', html, re.DOTALL)
    if m:
        return json.loads(m.group(1))
    return None

# Inside Playwright
html = page.content()
state = extract_initial_state(html)
if state:
    # traverse the object: the shape changes over time, explore with a JSON viewer
    print(json.dumps(state, indent=2)[:2000])
```
This is the highest-fidelity path — you get exact follower counts, precise engagement numbers, and metadata that the rendered UI truncates. The downside is the JSON shape changes across X deploys; when yours breaks, log a snapshot and re-map the path.
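Because the shape shifts, it pays to traverse the state defensively rather than chaining raw indexing that throws on the first missing key. A small helper for that (the example path in the comment is hypothetical; map the real one from a dumped snapshot):

```python
def dig(obj, *path, default=None):
    """Walk a nested dict/list by keys and indices; return default on any missing step."""
    for step in path:
        try:
            obj = obj[step]
        except (KeyError, IndexError, TypeError):
            return default
    return obj

# Hypothetical path, for illustration only; the real shape changes across X deploys:
# followers = dig(state, "props", "pageProps", "user", "followers_count", default=0)
```

When a deploy moves a field, dig() returns your default instead of crashing the whole pipeline, which makes the breakage show up in your data quality checks rather than your error logs.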
```python
import pandas as pd

tweets = scrape_search("#SEO since:2026-04-01", max_results=500)
df = pd.DataFrame(tweets)
df = df.drop_duplicates(subset=["url"])

# Optional: enrich with extra computed columns
df["scraped_at"] = pd.Timestamp.now()
df["author_from_url"] = df["url"].str.extract(r"x\.com/([^/]+)/")  # username segment of the tweet URL

df.to_csv("x_seo_tweets.csv", index=False, encoding="utf-8")
print(f"Saved {len(df)} unique tweets")
```
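If you run the scrape on a schedule, you'll want each run to merge into the existing CSV instead of overwriting it. A sketch of an append-with-dedup step (the function name and defaults are my own, not from the original):

```python
import os
import pandas as pd

def append_dedup(new_rows, path="x_seo_tweets.csv", key="url"):
    """Merge a fresh batch into an existing CSV, keeping the first copy of each tweet."""
    df = pd.DataFrame(new_rows)
    if os.path.exists(path):
        # Put the old rows first so keep="first" preserves the earliest scrape of a tweet
        df = pd.concat([pd.read_csv(path), df], ignore_index=True)
    df = df.drop_duplicates(subset=[key], keep="first")
    df.to_csv(path, index=False, encoding="utf-8")
    return len(df)
```

Keying on the tweet URL means re-scraping an overlapping date range is harmless; duplicates just collapse away.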
For X specifically:

Don't use requests instead of a headless browser. X's HTML is a JS shell; a requests.get() response is almost empty.

Respect the Retry-After response header. When X does rate-limit you, it tells you exactly how long to wait.

If you don't want to maintain the browser automation yourself, there are third-party X scraping APIs on RapidAPI and elsewhere. Pricing is typically $30–$300/month for 10K–1M requests. They work fine for low-volume and prototype-stage projects. For production at scale, a self-hosted scraper plus residential proxies is usually 5–10× cheaper per tweet and gives you control over rate, freshness, and fields.
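The Retry-After advice translates into a small delay calculator: honor the header when X sends one, otherwise fall back to exponential backoff with jitter. A sketch (Playwright normalizes response header names to lowercase, hence the double lookup):

```python
import random

def backoff_delay(headers: dict, attempt: int, cap: int = 300) -> float:
    """Seconds to wait before retrying: use Retry-After if present, else exponential backoff."""
    retry_after = headers.get("retry-after") or headers.get("Retry-After")
    if retry_after and str(retry_after).strip().isdigit():
        return min(int(retry_after), cap)
    # 2^attempt seconds plus up to 1s of jitter, capped at 5 minutes
    return min(2 ** attempt + random.uniform(0, 1), cap)

# Usage sketch with a Playwright response:
#   resp = page.goto(url)
#   if resp and resp.status == 429:
#       time.sleep(backoff_delay(resp.headers, attempt))
```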
Scraping publicly accessible data on X is not a crime in the US, UK, or EU under current precedent (hiQ v. LinkedIn, Ninth Circuit 2022). It does violate X's Terms of Service, which means X can ban your account and block your IPs, and they can file civil suits — they have, though most have been dismissed. Private/protected accounts, bypassing technical barriers, and commercial republication raise more serious legal questions.
Yes, you can scrape X without an API key. Public tweets, profiles, search results, and trends are all accessible from the unauthenticated web interface. You need a headless browser (Playwright or Selenium) because X is a single-page application, plus residential proxies to get past the per-IP rate limits.
If your scraper keeps getting blocked, there are three main reasons: (1) you're scraping from a datacenter IP, which X identifies instantly; (2) you have no randomization in your request timing; (3) your User-Agent is the default Playwright string or a generic bot UA. Fix all three and most blocks disappear.
With rotating residential proxies and a 2–5 second delay between requests, a single concurrency-1 scraper realistically does 500–1,500 tweets per hour. Parallelism scales this linearly if each worker uses a different residential IP. Expect to hit rate limits at 10,000+ tweets per hour without careful rotation.
Between Playwright and Selenium, pick Playwright. It has better async support, cleaner APIs, and better proxy integration. Also, Playwright's auto-wait logic handles X's dynamic rendering without all the WebDriverWait boilerplate.
Yes, you can scrape a tweet's replies: navigate to the tweet URL (/status/TWEET_ID) and scrape the article elements that appear beneath the main tweet; those are the replies. X loads them lazily with scroll, the same pattern as the user timeline example above.
The cheapest workable stack is self-hosted Playwright workers plus SpyderProxy Budget Residential at $1.75/GB. A typical tweet scrape transfers 50–150 KB, so one gigabyte buys roughly 7,000–20,000 tweets' worth of bandwidth. That's an order of magnitude cheaper than any commercial X scraping API.
Yes, X can detect vanilla Playwright, but you can hide it. Set --disable-blink-features=AutomationControlled, override navigator.webdriver to undefined, and use a real Chrome profile path. playwright-stealth and the playwright-extra stealth plugin do most of this automatically.
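If you'd rather not pull in a stealth plugin, the two most important overrides are a few lines of drop-in hardening for the launch and new_context calls used throughout this article (a minimal sketch; the plugins cover many more signals):

```python
# Chromium flag that removes the most common automation fingerprint
STEALTH_ARGS = ["--disable-blink-features=AutomationControlled"]

# Hide navigator.webdriver, the first thing most detection scripts check
STEALTH_JS = "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"

# Usage in any scraper above:
#   browser = p.chromium.launch(headless=True, proxy=PROXY, args=STEALTH_ARGS)
#   page = context.new_page()
#   page.add_init_script(STEALTH_JS)  # runs before any page script loads
```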
X ships major structural changes every 2–4 months and minor CSS class churn weekly. Always target data-testid attributes (e.g., tweet, tweetText, UserName), because these stay stable across deploys in a way CSS class names don't.
Scraping X in 2026 is a solved problem as long as you use the right tools: Playwright for JS rendering, rotating residential proxies to defeat per-IP rate limits, and data-testid selectors for stability. Don't fight the platform — let residential IPs do the work of looking like real users, add randomized delays, and rotate sessions every 20–50 requests.
For production workloads, SpyderProxy Premium Residential at $2.75/GB with rotating sessions handles the vast majority of X scraping without a single 403. When you hit the hardest cases — logged-in scraping, heavy commercial volume, or Arkose-protected flows — step up to LTE Mobile proxies at $2/IP and the block rate drops to effectively zero.