To scrape Reddit, the easiest approach is to append .json to any Reddit URL and parse the JSON response with Python's requests library. For larger scrapes, use the official PRAW library with a Reddit API key (60 requests per minute), or scrape the public site through rotating residential proxies to bypass rate limits when collecting at scale. Always rotate user agents, add delays between requests, and respect Reddit's terms of service.
This guide covers three working methods in 2026: the simple JSON endpoint approach, the official PRAW API client, and proxy-based scraping for high-volume jobs. It includes complete Python code for scraping subreddits, posts, comments, and user pages, plus best practices to avoid bans.
Reddit has roughly 100,000 active subreddits and over 50 million daily active users. The data is uniquely valuable because it represents authentic, opinionated user-generated content across nearly every topic imaginable. Common scraping use cases include sentiment analysis, trend detection, market research, and building datasets for personal research projects.
Reddit offers an official API with documented endpoints, but as of 2024 it has rate limits (60 requests per minute for authenticated users, 10 per minute unauthenticated) and pricing for high-volume commercial use. The API is the cleanest, most stable way to get structured data — but if you need more requests than the rate limit allows, you must scrape the public site directly.
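Whichever access method you choose, it helps to enforce the rate limit client-side rather than waiting for errors. A minimal sketch of a sleep-based throttle — the `Throttle` class and its names are illustrative, not part of any Reddit library:

```python
import time

class Throttle:
    """Block so calls are spaced at least `interval` seconds apart."""

    def __init__(self, requests_per_minute=60):
        self.interval = 60.0 / requests_per_minute
        self.last_call = 0.0

    def wait(self):
        # Sleep only for the remainder of the interval since the last call
        elapsed = time.monotonic() - self.last_call
        if elapsed < self.interval:
            time.sleep(self.interval - elapsed)
        self.last_call = time.monotonic()

# Usage: call throttle.wait() immediately before every request
throttle = Throttle(requests_per_minute=60)
```

Calling `wait()` in a tight loop then spaces requests evenly instead of bursting them, which also looks less bot-like to Reddit's detection.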
The good news: Reddit's website serves the same data as the API in JSON form. Just append .json to almost any URL:
- https://www.reddit.com/r/python/.json → top posts in r/python
- https://www.reddit.com/r/python/top.json?t=week → top posts this week
- https://www.reddit.com/r/python/comments/abc123/post_slug.json → full thread with comments
- https://www.reddit.com/user/spez/.json → user activity
- https://www.reddit.com/search.json?q=python&sort=new → search results

This unauthenticated endpoint has stricter rate limits than the API, but with proper proxy rotation you can scale it to millions of requests per day.
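Every one of these endpoints wraps its results in the same "Listing" envelope: a `kind` field plus a `data` object holding `children` (the posts) and `before`/`after` pagination cursors. A sketch of pulling fields out of that shape — the sample below is a hand-built miniature of a real response, not live data:

```python
# Minimal sample mirroring the envelope Reddit's .json endpoints return
sample = {
    "kind": "Listing",
    "data": {
        "after": "t3_abc123",   # cursor for the next page (null on the last page)
        "children": [
            {"kind": "t3", "data": {"id": "abc123", "title": "Example post", "score": 42}},
        ],
    },
}

# Posts live under data.children[*].data
titles = [child["data"]["title"] for child in sample["data"]["children"]]
print(titles)                    # ['Example post']
print(sample["data"]["after"])   # t3_abc123
```

Every snippet in this guide drills into that same `["data"]["children"]` path.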
Install the required packages:
pip install requests beautifulsoup4 praw
Optional but recommended for proxy rotation:
pip install "requests[socks]"
This is the no-API-key approach. Simple, and fine for small scrapes, but limited by Reddit's IP-based rate limit (roughly 10 requests per minute per IP without authentication).
```python
import requests
import time

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"
}

def get_subreddit_posts(subreddit, sort="hot", limit=25):
    """Fetch posts from a subreddit. sort: hot, new, top, rising."""
    url = f"https://www.reddit.com/r/{subreddit}/{sort}.json?limit={limit}"
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    data = response.json()

    posts = []
    for child in data["data"]["children"]:
        post = child["data"]
        posts.append({
            "id": post["id"],
            "title": post["title"],
            "author": post["author"],
            "subreddit": post["subreddit"],
            "score": post["score"],
            "num_comments": post["num_comments"],
            "created_utc": post["created_utc"],
            "url": post["url"],
            "permalink": f"https://www.reddit.com{post['permalink']}",
            "selftext": post.get("selftext", ""),
        })
    return posts

# Example
posts = get_subreddit_posts("python", sort="top", limit=50)
for p in posts[:5]:
    print(f"[{p['score']:>5}] {p['title']}")
    print(f"        by u/{p['author']} | {p['num_comments']} comments")
    print(f"        {p['permalink']}\n")
```
Reddit uses cursor-based pagination via the after parameter. To fetch the next page, pass the last post's name (e.g., t3_abc123):
```python
def paginate_subreddit(subreddit, max_posts=500, sort="new"):
    posts = []
    after = None
    while len(posts) < max_posts:
        url = f"https://www.reddit.com/r/{subreddit}/{sort}.json?limit=100"
        if after:
            url += f"&after={after}"
        r = requests.get(url, headers=HEADERS, timeout=10)
        r.raise_for_status()
        data = r.json()["data"]
        children = data["children"]
        if not children:
            break
        posts.extend(c["data"] for c in children)
        after = data.get("after")
        if not after:
            break
        time.sleep(2)  # Respect rate limit
    return posts[:max_posts]
```
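Unauthenticated requests will occasionally be rejected with HTTP 429 mid-pagination. A simple retry wrapper with exponential backoff keeps a long run resilient — the function name and defaults here are my own, a sketch rather than a library API:

```python
import time

def with_backoff(fn, max_attempts=4, base_delay=1.0):
    """Call fn(); on failure, sleep base_delay * 2**attempt and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt))

# Usage inside the pagination loop (replaces the bare requests.get):
# r = with_backoff(lambda: requests.get(url, headers=HEADERS, timeout=10))
```

Wrapping only the network call keeps the pagination logic unchanged while transient errors are absorbed.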
If you can register a Reddit app and get an API key, PRAW is the cleanest way to scrape. It handles auth, pagination, and rate limiting for you.
Create a "script"-type app at reddit.com/prefs/apps and set the redirect URI to http://localhost:8080. Then authenticate:

```python
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_SECRET",
    user_agent="my-scraper/0.1 by /u/YOUR_USERNAME",
)

subreddit = reddit.subreddit("python")
for post in subreddit.top(time_filter="month", limit=100):
    print(f"[{post.score}] {post.title}")
    print(f"    by {post.author} | {post.num_comments} comments")
    print(f"    https://reddit.com{post.permalink}\n")
```
```python
def get_all_comments(submission_id):
    submission = reddit.submission(id=submission_id)
    submission.comments.replace_more(limit=None)  # Expand every "load more" stub
    comments = []
    for comment in submission.comments.list():
        comments.append({
            "id": comment.id,
            "author": str(comment.author),
            "body": comment.body,
            "score": comment.score,
            "created_utc": comment.created_utc,
            "parent_id": comment.parent_id,
        })
    return comments

comments = get_all_comments("abc123")
print(f"Got {len(comments)} comments")
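The `parent_id` prefix tells you where each comment sits in the tree: a `t3_` parent is the submission itself (a top-level comment), while a `t1_` parent is another comment (a reply). A small sketch of splitting the flat list on that prefix — the sample comments below are made up for illustration:

```python
def split_by_depth(comments):
    """Separate top-level comments (parent is the t3_ submission) from replies."""
    top_level = [c for c in comments if c["parent_id"].startswith("t3_")]
    replies = [c for c in comments if c["parent_id"].startswith("t1_")]
    return top_level, replies

sample = [
    {"id": "c1", "parent_id": "t3_abc123", "body": "a top-level comment"},
    {"id": "c2", "parent_id": "t1_c1", "body": "a reply to c1"},
]
top, replies = split_by_depth(sample)
print(len(top), len(replies))  # 1 1
```

The same prefix logic lets you rebuild the full thread tree from the flat list `get_all_comments` returns.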
When you need to scrape thousands of pages per minute — well beyond what the API or unauthenticated JSON endpoint allows — you must rotate residential or mobile proxies. Each request goes through a different IP, so Reddit's per-IP rate limiter sees a fresh client each time.
```python
import requests
import random
import time

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Firefox/121.0",
]

def scrape_subreddit_with_proxy(subreddit, sort="hot", limit=100):
    proxy = {
        "http": "http://USER:[email protected]:8080",
        "https": "http://USER:[email protected]:8080",
    }
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    url = f"https://www.reddit.com/r/{subreddit}/{sort}.json?limit={limit}"
    r = requests.get(url, headers=headers, proxies=proxy, timeout=15)
    r.raise_for_status()
    return r.json()["data"]["children"]

# Distribute requests over 50 subreddits
SUBS = ["python", "javascript", "programming", "webdev", "datascience"]  # ...

all_posts = []
for sub in SUBS:
    try:
        posts = scrape_subreddit_with_proxy(sub, "top", 100)
        all_posts.extend(p["data"] for p in posts)
        time.sleep(random.uniform(1, 3))  # Random jitter
    except Exception as e:
        print(f"Failed {sub}: {e}")
        continue

print(f"Collected {len(all_posts)} posts across {len(SUBS)} subreddits")
```
Most rotating residential proxy providers (including SpyderProxy) auto-rotate the IP on each request when you hit the gateway endpoint — you do not need to manage a proxy pool yourself.
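Because the gateway rotates IPs for you, scaling up is mostly a matter of running fetches concurrently. A sketch using a thread pool — the fetch function is injected as a parameter so the pattern stands alone; in a real run you would pass `scrape_subreddit_with_proxy`:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_many(subreddits, fetch, max_workers=10):
    """Run fetch(subreddit) across a thread pool, collecting results and errors."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, sub): sub for sub in subreddits}
        for future in as_completed(futures):
            sub = futures[future]
            try:
                results[sub] = future.result()
            except Exception as e:
                errors[sub] = e  # Keep failures separate so one bad sub doesn't kill the run
    return results, errors

# Usage:
# results, errors = fetch_many(SUBS, lambda s: scrape_subreddit_with_proxy(s, "top", 100))
```

Keep `max_workers` modest; even with rotating IPs, hammering Reddit from one process raises your overall failure rate.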
Best practices to avoid bans:

- Rotate user agents — never send the default `requests` UA (instant ban).
- Add randomized delays: `time.sleep(random.uniform(1, 3))` between requests. Reddit's bot detection considers patterns, not just IPs.
- Send browser-like headers: `Accept`, `Accept-Language`, `Accept-Encoding`, and `Referer` should all match a real browser request.
- On rate-limit responses, back off exponentially (e.g., `time.sleep(60 * attempt)`) and try a different proxy.
- Respect the paths Reddit's `/robots.txt` excludes.
- Scrape incrementally: track `created_utc` so repeat runs skip posts you have already collected.

To save scraped posts to CSV:

```python
import csv

with open("reddit_posts.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "title", "author", "score", "num_comments", "created_utc", "permalink"])
    writer.writeheader()
    for p in all_posts:
        writer.writerow({
            "id": p["id"],
            "title": p["title"],
            "author": p["author"],
            "score": p["score"],
            "num_comments": p["num_comments"],
            "created_utc": p["created_utc"],
            "permalink": p["permalink"],
        })
```
Or to JSON:

```python
import json

with open("reddit_posts.json", "w", encoding="utf-8") as f:
    json.dump(all_posts, f, indent=2, ensure_ascii=False)
```
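Overlapping scrapes (pagination re-runs, multiple sort orders, multiple subreddit lists) will collect the same post more than once, so dedupe by `id` before exporting. A small sketch:

```python
def dedupe_posts(posts):
    """Keep the first occurrence of each post id, preserving order."""
    seen = set()
    unique = []
    for p in posts:
        if p["id"] not in seen:
            seen.add(p["id"])
            unique.append(p)
    return unique

# Usage before either export step:
# all_posts = dedupe_posts(all_posts)
```

For long-running jobs, persist the `seen` set (e.g., to a file or SQLite) so repeat runs skip already-collected posts entirely.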
Reddit's Data API Terms govern programmatic access. Public data scraping is generally tolerated when you respect rate limits and avoid commercial redistribution without permission.
For purely personal research, sentiment analysis, or trend detection (data not redistributed), responsible scraping with proper rate limits and proxies is widely accepted practice.
Reddit is one of the richest open-data platforms on the web, and scraping it is more accessible than most people think. For small projects, append .json to any URL and parse with requests. For structured access at moderate scale, use PRAW with an API key. For large-scale collection, rotate residential proxies and follow good bot hygiene — randomized user agents, delays, and exponential backoff — to stay below Reddit's detection thresholds. Combine all three approaches as your project grows, and always cache aggressively to minimize wasted requests.