How to Scrape Reddit with Python and Proxies (2026 Guide)

To scrape Reddit, the easiest approach is to append .json to any Reddit URL and parse the JSON response with Python's requests library. For larger scrapes, use the official PRAW library with a Reddit API key (60 requests per minute), or scrape the public site through rotating residential proxies to bypass rate limits when collecting at scale. Always rotate user agents, add delays between requests, and respect Reddit's terms of service.

This guide covers three working methods in 2026: the simple JSON endpoint approach, the official PRAW API client, and proxy-based HTML scraping for high-volume jobs. Includes complete Python code for scraping subreddits, posts, comments, and user pages, plus best practices to avoid bans.

Why Scrape Reddit?

Reddit has roughly 100,000 active subreddits and over 50 million daily active users. The data is uniquely valuable because it represents authentic, opinionated user-generated content across nearly every topic imaginable. Common scraping use cases include:

Market research — What are r/investing or r/wallstreetbets discussing this week?
Brand monitoring — Track mentions of your product across all subreddits.
Sentiment analysis — Build datasets for training NLP models on real conversational text.
Trend detection — Identify rising topics in r/all or specific niches before they go mainstream.
Competitor research — See what users say about competing products in dedicated subreddits.
Academic research — Study online communities, misinformation spread, or social dynamics.

Reddit API vs Scraping the Site

Reddit offers an official API with documented endpoints, but as of 2024 it has rate limits (60 requests per minute for authenticated users, 10 per minute unauthenticated) and pricing for high-volume commercial use. The API is the cleanest, most stable way to get structured data — but if you need more requests than the rate limit allows, you must scrape the public site directly.

The good news: Reddit's website serves the same data as the API in JSON form. Just append .json to almost any URL:

https://www.reddit.com/r/python/.json → top posts in r/python
https://www.reddit.com/r/python/top.json?t=week → top posts this week
https://www.reddit.com/r/python/comments/abc123/post_slug.json → full thread with comments
https://www.reddit.com/user/spez/.json → user activity
https://www.reddit.com/search.json?q=python&sort=new → search results

This unauthenticated endpoint has stricter rate limits than the API, but with proper proxy rotation you can scale it to millions of requests per day.

Python Setup

Install the required packages:

pip install requests beautifulsoup4 praw

Optional but recommended for proxy rotation:

pip install requests[socks]

Method 1: Scrape Reddit JSON Endpoints (Easiest)

This is the no-API-key approach. Simple, works for small scrapes, but limited by Reddit's IP-based rate limit (~30 requests per minute per IP without authentication).

import requests
import time

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"
}

def get_subreddit_posts(subreddit, sort="hot", limit=25):
    """Fetch posts from a subreddit. sort: hot, new, top, rising."""
    url = f"https://www.reddit.com/r/{subreddit}/{sort}.json?limit={limit}"
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    data = response.json()

    posts = []
    for child in data["data"]["children"]:
        post = child["data"]
        posts.append({
            "id": post["id"],
            "title": post["title"],
            "author": post["author"],
            "subreddit": post["subreddit"],
            "score": post["score"],
            "num_comments": post["num_comments"],
            "created_utc": post["created_utc"],
            "url": post["url"],
            "permalink": f"https://www.reddit.com{post['permalink']}",
            "selftext": post.get("selftext", ""),
        })
    return posts

# Example
posts = get_subreddit_posts("python", sort="top", limit=50)
for p in posts[:5]:
    print(f"[{p['score']:>5}] {p['title']}")
    print(f"        by u/{p['author']} | {p['num_comments']} comments")
    print(f"        {p['permalink']}
")

Pagination

Reddit uses cursor-based pagination via the after parameter. To fetch the next page, pass the last post's name (e.g., t3_abc123):

def paginate_subreddit(subreddit, max_posts=500, sort="new"):
    posts = []
    after = None
    while len(posts) < max_posts:
        url = f"https://www.reddit.com/r/{subreddit}/{sort}.json?limit=100"
        if after:
            url += f"&after={after}"
        r = requests.get(url, headers=HEADERS, timeout=10)
        r.raise_for_status()
        data = r.json()["data"]
        children = data["children"]
        if not children:
            break
        posts.extend(c["data"] for c in children)
        after = data.get("after")
        if not after:
            break
        time.sleep(2)  # Respect rate limit
    return posts[:max_posts]

Method 2: PRAW (Official Reddit API Wrapper)

If you can register a Reddit app and get an API key, PRAW is the cleanest way to scrape. It handles auth, pagination, and rate limiting for you.

Register a Reddit App

Visit reddit.com/prefs/apps and click create an app.
Choose script as the app type.
Set redirect URI to http://localhost:8080.
Copy the client ID (under the app name) and secret.

Scrape with PRAW

import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_SECRET",
    user_agent="my-scraper/0.1 by /u/YOUR_USERNAME",
)

subreddit = reddit.subreddit("python")

for post in subreddit.top(time_filter="month", limit=100):
    print(f"[{post.score}] {post.title}")
    print(f"  by {post.author} | {post.num_comments} comments")
    print(f"  https://reddit.com{post.permalink}
")

Scrape Comments with PRAW

def get_all_comments(submission_id):
    submission = reddit.submission(id=submission_id)
    submission.comments.replace_more(limit=None)  # Expand "load more"
    comments = []
    for comment in submission.comments.list():
        comments.append({
            "id": comment.id,
            "author": str(comment.author),
            "body": comment.body,
            "score": comment.score,
            "created_utc": comment.created_utc,
            "parent_id": comment.parent_id,
        })
    return comments

comments = get_all_comments("abc123")
print(f"Got {len(comments)} comments")

Method 3: HTML Scraping at Scale with Proxies

When you need to scrape thousands of pages per minute — well beyond what the API or unauthenticated JSON endpoint allows — you must rotate residential or mobile proxies. Each request goes through a different IP, so Reddit's per-IP rate limiter sees a fresh client each time.

import requests
from bs4 import BeautifulSoup
import random
import time

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Firefox/121.0",
]

def scrape_subreddit_with_proxy(subreddit, sort="hot", limit=100):
    proxy = {
        "http": "http://USER:[email protected]:8080",
        "https": "http://USER:[email protected]:8080",
    }
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    url = f"https://www.reddit.com/r/{subreddit}/{sort}.json?limit={limit}"
    r = requests.get(url, headers=headers, proxies=proxy, timeout=15)
    r.raise_for_status()
    return r.json()["data"]["children"]

# Distribute requests over 50 subreddits
SUBS = ["python", "javascript", "programming", "webdev", "datascience"]  # ...
all_posts = []
for sub in SUBS:
    try:
        posts = scrape_subreddit_with_proxy(sub, "top", 100)
        all_posts.extend(p["data"] for p in posts)
        time.sleep(random.uniform(1, 3))  # Random jitter
    except Exception as e:
        print(f"Failed {sub}: {e}")
        continue

print(f"Collected {len(all_posts)} posts across {len(SUBS)} subreddits")

Most rotating residential proxy providers (including SpyderProxy) auto-rotate the IP on each request when you hit the gateway endpoint — you do not need to manage a proxy pool yourself.

Best Practices to Avoid Bans

Rotate User-Agents — Use a list of real, recent browser UA strings. Never use the default Python requests UA (instant ban).
Add delays — Even with proxies, add time.sleep(random.uniform(1, 3)) between requests. Reddit's bot detection considers patterns, not just IPs.
Use residential or mobile IPs — Datacenter proxies are flagged faster on Reddit. Residential and mobile IPs blend in with real users.
Set proper headers — Accept, Accept-Language, Accept-Encoding, and Referer should all match a real browser request.
Handle rate limit responses — If you get HTTP 429, back off exponentially (time.sleep(60 * attempt)) and try a different proxy.
Respect robots.txt — Reddit allows scraping with rate limits; avoid hammering endpoints that /robots.txt excludes.
Cache aggressively — If you scrape the same subreddits hourly, cache results and only fetch new posts since the last created_utc.

Storing Scraped Data

CSV Export

import csv

with open("reddit_posts.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "title", "author", "score", "num_comments", "created_utc", "permalink"])
    writer.writeheader()
    for p in all_posts:
        writer.writerow({
            "id": p["id"],
            "title": p["title"],
            "author": p["author"],
            "score": p["score"],
            "num_comments": p["num_comments"],
            "created_utc": p["created_utc"],
            "permalink": p["permalink"],
        })

JSON Export

import json

with open("reddit_posts.json", "w", encoding="utf-8") as f:
    json.dump(all_posts, f, indent=2, ensure_ascii=False)

Legal and Ethical Considerations

Reddit's Data API Terms govern programmatic access. Public data scraping is generally tolerated when you respect rate limits and avoid commercial redistribution without permission. Specifically:

Do not scrape personal information (PII) from user profiles.
Do not redistribute or resell raw Reddit data without a commercial agreement.
Do not use scraped data to train ML models intended for commercial release without negotiating an API license.
Honor any user's deletion: if a post or comment is removed, do not republish it.
Add attribution when displaying Reddit content publicly.

For purely personal research, sentiment analysis, or trend detection (data not redistributed), responsible scraping with proper rate limits and proxies is widely accepted practice.

Conclusion

Reddit is one of the richest open-data platforms on the web, and scraping it is more accessible than most people think. For small projects, append .json to any URL and parse with requests. For structured access at moderate scale, use PRAW with an API key. For large-scale collection, rotate residential proxies and follow good bot hygiene — randomized user agents, delays, and exponential backoff — to stay below Reddit's detection thresholds. Combine all three approaches as your project grows, and always cache aggressively to minimize wasted requests.