Scrape Reddit Without the API (2026)

Reddit's era of unrestricted free API access ended in 2023. The official API still exists, it is still free for non-commercial use, and it is still the right starting point for a personal project. But for anything commercial, at scale, or plugged into an AI agent that needs to read 50 subreddits on a schedule, you will hit the same three problems developers have been hitting since the pricing change:

The free tier rate limits are aggressive (100 requests per minute with OAuth, far fewer without).
Commercial use triggers a $0.24 per 1,000 calls pricing tier that adds up fast.
OAuth setup, app review, and user agent policing all add friction for what should be a simple HTTP fetch.
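To put the second point in numbers, here is a rough back-of-the-envelope sketch. It assumes the $0.24 per 1,000 calls rate quoted above, a flat 30-day month, and a hypothetical polling schedule of 50 subreddits checked every five minutes with one call per check. The function name and parameters are illustrative only.

```python
# Assumption: the $0.24 per 1,000 calls rate quoted in this article.
PRICE_PER_1000_CALLS = 0.24


def monthly_api_cost(subreddits: int, polls_per_hour: float,
                     calls_per_poll: int = 1, days: int = 30) -> float:
    """Estimate the monthly bill in dollars for a fixed polling schedule."""
    calls = subreddits * polls_per_hour * 24 * days * calls_per_poll
    return calls / 1000 * PRICE_PER_1000_CALLS


# Hypothetical schedule: 50 subreddits, one call each, every 5 minutes.
cost = monthly_api_cost(subreddits=50, polls_per_hour=12)
print(f"${cost:,.2f} per month")
```

That schedule works out to 432,000 calls a month, or a little over $100. The figure scales linearly, so fetching comment threads for each new post multiplies it quickly.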

This guide covers the three techniques developers actually use to get Reddit data without jumping through those hoops, ordered from easiest to most resilient under scale, with runnable Python for each. At the end I explain when it makes sense to stop writing your own scraper and hand the problem to a data API.

#Method 1: The `.json` endpoint trick

The single most underused feature of Reddit is that every page responds to a .json suffix with structured JSON. No API key. No OAuth. No app registration. Append .json to any subreddit, post, or user URL and you get the same data Reddit's frontend uses.

import requests

def fetch_subreddit(subreddit: str, sort: str = "hot", limit: int = 25):
    url = f"https://www.reddit.com/r/{subreddit}/{sort}.json"
    headers = {
        "User-Agent": "MyResearchBot/1.0 (by /u/myuser)",
    }
    params = {"limit": limit, "raw_json": 1}
    res = requests.get(url, headers=headers, params=params, timeout=10)
    res.raise_for_status()
    posts = res.json()["data"]["children"]
    return [
        {
            "id": p["data"]["id"],
            "title": p["data"]["title"],
            "author": p["data"]["author"],
            "score": p["data"]["score"],
            "num_comments": p["data"]["num_comments"],
            "url": p["data"]["url"],
            "selftext": p["data"]["selftext"],
            "created_utc": p["data"]["created_utc"],
        }
        for p in posts
    ]

posts = fetch_subreddit("MachineLearning", sort="top", limit=100)
for p in posts[:5]:
    print(p["score"], p["title"])

Each child's data dict is the full post object: selftext for text posts, url for link posts, score, comment count, author, created time, everything a logged-in user would see on the subreddit page. The function above keeps only a handful of those fields.
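Two of those fields usually need a little post-processing: created_utc is a Unix timestamp, and whether selftext or url is the interesting field depends on the post type. The sketch below is one way to normalise that, assuming the is_self flag on the post object marks text posts. The normalize_post helper and the sample values are made up for illustration.

```python
from datetime import datetime, timezone


def normalize_post(raw: dict) -> dict:
    """Reduce one post's "data" dict to an id, a type, its content, and a datetime."""
    is_text = bool(raw.get("is_self"))  # assumption: is_self marks text posts
    return {
        "id": raw["id"],
        "kind": "text" if is_text else "link",
        "content": raw.get("selftext", "") if is_text else raw.get("url", ""),
        # created_utc is seconds since the epoch, in UTC
        "created_at": datetime.fromtimestamp(raw["created_utc"], tz=timezone.utc),
    }


# Made-up sample shaped like one entry of data.children[i]["data"]
sample = {
    "id": "abc123",
    "is_self": True,
    "selftext": "Hello",
    "url": "https://www.reddit.com/r/test/comments/abc123/",
    "created_utc": 1700000000.0,
}
post = normalize_post(sample)
print(post["kind"], post["created_at"].isoformat())
```

Keeping the timestamp timezone-aware avoids the classic bug where a local-time datetime is silently compared against UTC values.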

#The catch

Reddit aggressively rate-limits unauthenticated traffic. Expect roughly 30 requests per minute per IP before responses start to fail or degrade. The limit for this path is not documented, you should not count on response headers to warn you, and enforcement is not always a clean 429: you can also get a 200 with a degraded response body instead of a clear error. If your scraper suddenly starts returning fewer posts per call than you asked for, that is the likely cause.

Three fixes, in order of effort:

Slow down. time.sleep(2) between calls is enough for most projects.
Set a proper User-Agent. Reddit actually reads the User-Agent string and penalises generic or suspicious ones. Keep it identifiable and honest, as in the examples here, rather than rotating through fake browser strings.
Target old.reddit.com. The old subdomain serves the same JSON with lighter protection and no JavaScript-rendered HTML fallbacks.
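The "slow down" fix can be made a little more robust than a bare sleep. The sketch below is one possible shape, not a prescribed one: a Pacer that enforces a minimum gap between calls, an exponential backoff helper for when a 429 does arrive, and a heuristic for spotting the silent degradation described above. All three names are hypothetical, and the heuristic (a short page that still carries an after token) is an assumption that can produce false positives.

```python
import random
import time


class Pacer:
    """Enforce a minimum gap between calls. Clock and sleep are injectable for testing."""

    def __init__(self, min_interval: float = 2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self) -> float:
        """Sleep just long enough to respect min_interval; return the seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
        if delay:
            self._sleep(delay)
        self._last = self._clock()
        return delay


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0, jitter: float = 0.0) -> float:
    """Exponential backoff: base * 2**attempt, capped, plus up to `jitter` random seconds."""
    return min(cap, base * (2 ** attempt)) + random.uniform(0, jitter)


def looks_throttled(status_code: int, returned: int, requested: int, after) -> bool:
    """Heuristic only: a 429, or a short page even though more results exist."""
    if status_code == 429:
        return True
    return status_code == 200 and returned < requested and after is not None
```

In use, you would call pacer.wait() before each requests.get, and when looks_throttled returns True, sleep for backoff_delay(attempt) before retrying with attempt incremented.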

For small-scale research or a side project reading one or two subreddits, Method 1 is all you need. Zero infrastructure, zero cost, working code in under 30 lines.

#Method 2: Pagination and comment threads

The .json endpoint paginates with an after token. Each response carries the token in data.after, and you pass it back as the after query param on the next call. A null token means you have reached the end of the listing.

import requests
import time

def fetch_all_posts(subreddit: str, sort: str = "top", pages: int = 5):
    """Fetch up to `pages` pages of 100 posts each by following the after token."""
    headers = {"User-Agent": "MyResearchBot/1.0 (by /u/myuser)"}
    url = f"https://www.reddit.com/r/{subreddit}/{sort}.json"
    out = []
    after = None  # no token on the first request
    for _ in range(pages):
        params = {"limit": 100, "raw_json": 1}
        if after:
            params["after"] = after
        res = requests.get(url, headers=headers, params=params, timeout=10)
        res.raise_for_status()
        body = res.json()["data"]
        out.extend(body["children"])
        after = body.get("after")
        if not after:
            break  # null token: end of the listing
        time.sleep(2)  # pause between pages to stay under the rate limit
    return out

Comments live under a different URL pattern: https://www.reddit.com/r/{subreddit}/comments/{post_id}.json. The response is a two-element array: [0] is the post, [1] is the top-level comments tree. Replies nest inside each comment's replies field.

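def fetch_comments(subreddit: str, post_id: str):
    """Return a flat list of the comments Reddit loads upfront for one post."""
    url = f"https://www.reddit.com/r/{subreddit}/comments/{post_id}.json"
    headers = {"User-Agent": "MyResearchBot/1.0 (by /u/myuser)"}
    res = requests.get(url, headers=headers, timeout=10)
    res.raise_for_status()
    data = res.json()
    # data[0] is the post listing, data[1] is the comment listing
    comments_tree = data[1]["data"]["children"]
    return _flatten(comments_tree)

def _flatten(nodes, depth=0):
    """Walk the comment tree depth-first, recording each comment's nesting depth."""
    out = []
    for n in nodes:
        if n["kind"] != "t1":
            continue  # skip anything that is not a comment, e.g. "more" stubs
        d = n["data"]
        out.append({
            "id": d["id"],
            "author": d["author"],
            "body": d["body"],
            "score": d["score"],
            "depth": depth,
        })
        # replies is a nested listing when present and an empty string otherwise
        replies = d.get("replies")
        if isinstance(replies, dict):
            out.extend(_flatten(replies["data"]["children"], depth + 1))
    return out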

Comment trees on large threads can run to thousands of nodes. For replies it does not load upfront, Reddit inserts a "more" stub (kind: "more") in place of the missing comments. You either ignore those stubs (fine for most analytics) or follow the children list in each one by making additional requests. Doing it properly is fiddly.
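If you do want to follow the stubs, the first step is knowing which comment IDs are missing. The sketch below only collects those IDs from a tree shaped like the one _flatten walks. The actual follow-up requests are left out, and the collect_more_ids name and the sample tree are made up for illustration.

```python
def collect_more_ids(nodes: list) -> list:
    """Gather the comment IDs listed in every "more" stub, at any depth."""
    ids = []
    for n in nodes:
        if n["kind"] == "more":
            # each stub lists the IDs of the comments it stands in for
            ids.extend(n["data"].get("children", []))
        elif n["kind"] == "t1":
            replies = n["data"].get("replies")
            if isinstance(replies, dict):  # an empty string means no replies
                ids.extend(collect_more_ids(replies["data"]["children"]))
    return ids


# Made-up tree: one comment with a reply and a nested stub, plus a top-level stub
tree = [
    {"kind": "t1", "data": {"id": "c1", "replies": {"data": {"children": [
        {"kind": "t1", "data": {"id": "c2", "replies": ""}},
        {"kind": "more", "data": {"children": ["c3", "c4"]}},
    ]}}}},
    {"kind": "more", "data": {"children": ["c5"]}},
]
missing = collect_more_ids(tree)
print(len(missing), "comments not loaded upfront")
```

The length of that list is also a cheap way to decide whether a thread is worth the extra requests at all.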

#Method 3: Proxy rotation for scale

Past ~10,000 requests per day or when your IP gets soft-banned, you need rotating residential proxies. The cheap path is a commercial proxy service: ScraperAPI, Scrape.do, Bright Data, or similar. You send your request to their endpoint, they rotate the exit IP, and you pay per request.

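import requests

def fetch_with_proxy(target_url: str, proxy_key: str):
    # The vendor fetches target_url on your behalf from a rotating exit IP.
    res = requests.get(
        "https://api.scraperapi.com",  # HTTPS keeps the API key out of plaintext
        params={"api_key": proxy_key, "url": target_url, "country_code": "us"},
        timeout=30,  # proxied requests are slower than direct ones
    )
    res.raise_for_status()
    return res.json()  # assumes target_url is a .json endpoint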

This unblocks you immediately, but you are now paying a proxy vendor per request on top of whatever you pay for servers and for the engineering time to maintain the scraper. Unless scraping Reddit is itself your product, you are burning engineering capacity on infrastructure that has no product value.

#The maintenance tax

Everything above works in April 2026. None of it is guaranteed to work in July 2026. Reddit has been tightening access progressively since the API pricing change. They can turn off .json responses for unauthenticated traffic tomorrow (they already did it briefly in 2023 before reversing). They can tighten rate limits. They can block common data center IP ranges.

If your product depends on Reddit data, the question is not "can I write a scraper" but "how many engineering hours per month am I willing to spend keeping the scraper running." For personal projects, the answer is "zero because I will fix it on a Saturday." For a commercial product, the answer is usually "fewer than I think," because every hour spent on scraper maintenance is an hour not spent on the actual product.

#When to hand the problem to an API

DIY stops paying off when you want any of the following:

More than one platform (Reddit + Twitter + YouTube, etc.)
Guaranteed uptime on the data layer
A support contract when something breaks
Native AI agent access via MCP rather than HTTP glue
Historical backfill that goes beyond what .json returns

CreatorCrawl covers Reddit alongside TikTok, Instagram, YouTube, LinkedIn, and Twitter/X under one API key. The Reddit endpoints today include:

Subreddit details (subscribers, description, rules)
Subreddit posts (hot, top, new, rising with pagination)
Subreddit search
Post comments (full nested tree, flattened or recursive)
Cross-subreddit search

curl "https://creatorcrawl.com/api/v1/reddit/subreddit/posts?subreddit=MachineLearning&sort=top" \
  -H "x-api-key: YOUR_API_KEY"

Because it is the same API that covers the other five platforms, an AI agent plugged in via the MCP server can search Reddit, cross-reference with Twitter, and pull YouTube comments without needing three different integrations. That collapses the maintenance tax on all of it to zero.

Pricing is pay-as-you-go credits. 50 credits free on signup, credits never expire, no subscription. For most teams the economics beat building and maintaining a Reddit scraper by a wide margin once you count engineering hours honestly.

#Decision table

| Use case | Best approach |
| --- | --- |
| One-off research, <1,000 posts | Method 1, `.json` endpoint |
| Personal project, ongoing | PRAW (official API, free for non-commercial) |
| Commercial product, Reddit only | Method 1 or 2 + proxy rotation |
| Commercial product, multi-platform | Data API like CreatorCrawl |
| AI agent reading Reddit | MCP server (CreatorCrawl or similar) |
| Historical data dump | Arctic Shift (academic torrents) |

#Where to go from here

Reddit is still one of the more scrape-friendly major platforms in 2026. You can get a long way with .json endpoints and a polite user agent. The honest decision point is when scraper maintenance starts eating hours you would rather spend on the actual product. At that line, a multi-platform data API with an SLA becomes the cheaper option once you cost engineering time properly.

If you want to try the managed path, sign up for CreatorCrawl. 50 credits free, no card, native MCP for Claude and Cursor.
