Scrape Reddit Without the API (2026)
Reddit's free API days ended in 2023. The official API still exists, it is still technically free for non-commercial use, and it is still the right starting point for a personal project. But for anything commercial, at scale, or plugged into an AI agent that needs to read 50 subreddits on a schedule, you will run into the same three problems developers have been running into since the pricing change:
- The free tier rate limits are tight (100 queries per minute per OAuth client ID under the terms Reddit published in 2023).
- Commercial use triggers a $0.24 per 1,000 calls pricing tier that stacks up fast.
- OAuth setup, app review, and user agent policing all add friction for what should be a simple HTTP fetch.
This guide covers the three techniques developers actually use to get Reddit data without jumping through those hoops, ordered from easiest to most resilient under scale, with runnable Python for each. At the end I explain when it makes sense to stop writing your own scraper and hand the problem to a data API.
Method 1: The .json endpoint trick
The single most underused feature of Reddit is that every page responds to a .json suffix with structured JSON. No API key. No OAuth. No app registration. Append .json to any subreddit, post, or user URL and you get the same data Reddit's frontend uses.
import requests

def fetch_subreddit(subreddit: str, sort: str = "hot", limit: int = 25):
    url = f"https://www.reddit.com/r/{subreddit}/{sort}.json"
    headers = {
        "User-Agent": "MyResearchBot/1.0 (by /u/myuser)",
    }
    params = {"limit": limit, "raw_json": 1}
    res = requests.get(url, headers=headers, params=params, timeout=10)
    res.raise_for_status()
    posts = res.json()["data"]["children"]
    return [
        {
            "id": p["data"]["id"],
            "title": p["data"]["title"],
            "author": p["data"]["author"],
            "score": p["data"]["score"],
            "num_comments": p["data"]["num_comments"],
            "url": p["data"]["url"],
            "selftext": p["data"]["selftext"],
            "created_utc": p["data"]["created_utc"],
        }
        for p in posts
    ]

posts = fetch_subreddit("MachineLearning", sort="top", limit=100)
for p in posts[:5]:
    print(p["score"], p["title"])
You get the full post object back. selftext for text posts, url for link posts, score, comment count, author, created time, everything a logged-in user would see on the subreddit page.
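The same suffix works on post and user URLs, not only on subreddit listings. The helper below is a small sketch of that rewrite. `to_json_url` is a name made up for this example, and it only handles the common cases of trailing slashes and query strings.

```python
from urllib.parse import urlsplit, urlunsplit

def to_json_url(page_url: str) -> str:
    """Turn a Reddit page URL into its .json equivalent (illustrative helper)."""
    parts = urlsplit(page_url)
    # Drop a trailing slash so the suffix lands on the last path segment.
    path = parts.path.rstrip("/")
    if not path.endswith(".json"):
        path += ".json"
    # Keep the query string (e.g. ?t=week) and fragment untouched.
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))

print(to_json_url("https://www.reddit.com/r/python/top/?t=week"))
```

Paste a URL from the browser address bar, pass it through the helper, and feed the result to `requests.get` with the same headers as above.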
The catch
Reddit aggressively rate-limits unauthenticated traffic. Expect roughly 30 requests per minute per IP before throttling kicks in. The limit is not documented, it is not reliably reported in the response headers, and it is not always enforced with a clear error: besides 429 responses, you can get a 200 with a degraded response body. If your scraper suddenly starts returning fewer posts per call than you asked for, that is what is happening.
Three fixes, in order of effort:
- Slow down. time.sleep(2) between calls is enough for most projects.
- Rotate user agents. Reddit actually reads the User-Agent string and penalises suspicious ones. Make it identifiable and honest.
- Target old.reddit.com. The old subdomain serves the same JSON with lighter protection and no JavaScript rendering of HTML fallbacks.
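The "slow down" fix can be made adaptive by retrying on 429 with exponential backoff. The sketch below is one way to do it, not Reddit-specific guidance. `get_with_backoff` is a hypothetical helper, the retry count and base delay are arbitrary, and it assumes a `Retry-After` header, when present, holds a whole number of seconds.

```python
import time

def get_with_backoff(get, url, *, retries=4, base_delay=2.0, sleep=time.sleep, **kwargs):
    """Call get(url, **kwargs), retrying on HTTP 429 with exponential backoff.

    `get` is any callable shaped like requests.get. `sleep` is injectable
    so the retry schedule can be exercised without actually waiting.
    """
    for attempt in range(retries + 1):
        res = get(url, **kwargs)
        if res.status_code != 429:
            return res
        if attempt == retries:
            break
        # Prefer the server's hint when it is a plain number of seconds,
        # otherwise double the delay on every attempt: 2s, 4s, 8s, ...
        hint = res.headers.get("Retry-After", "")
        delay = float(hint) if hint.isdigit() else base_delay * (2 ** attempt)
        sleep(delay)
    return res  # still 429 after all retries; the caller decides what to do
```

With `requests` installed, the call looks like `get_with_backoff(requests.get, url, headers=headers, timeout=10)`. This only reacts to explicit 429s. It does not detect a 200 with a degraded body.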
For small-scale research or a side project reading one or two subreddits, Method 1 is all you need. Zero infrastructure, zero cost, working code in under 30 lines.
Method 2: Pagination and comment threads
The .json endpoint paginates with an after token. Each response carries the token in data.after, and you pass it back as the after query parameter on the next call. When data.after is null, you have reached the last page.
import requests
import time

def fetch_all_posts(subreddit: str, sort: str = "top", pages: int = 5):
    headers = {"User-Agent": "MyResearchBot/1.0 (by /u/myuser)"}
    out = []
    after = None
    for _ in range(pages):
        params = {"limit": 100, "raw_json": 1}
        if after:
            params["after"] = after
        url = f"https://www.reddit.com/r/{subreddit}/{sort}.json"
        res = requests.get(url, headers=headers, params=params, timeout=10)
        res.raise_for_status()
        body = res.json()["data"]
        out.extend(body["children"])
        after = body.get("after")
        if not after:
            break
        time.sleep(2)
    return out
Comments live under a different URL pattern: https://www.reddit.com/r/{subreddit}/comments/{post_id}.json. The response is a two-element array: [0] is the post, [1] is the top-level comments tree. Replies nest inside each comment's replies field.
def fetch_comments(subreddit: str, post_id: str):
    url = f"https://www.reddit.com/r/{subreddit}/comments/{post_id}.json"
    headers = {"User-Agent": "MyResearchBot/1.0 (by /u/myuser)"}
    res = requests.get(url, headers=headers, timeout=10)
    res.raise_for_status()
    data = res.json()
    # data[0] is the post itself, data[1] is the comment listing
    comments_tree = data[1]["data"]["children"]
    return _flatten(comments_tree)

def _flatten(nodes, depth=0):
    out = []
    for n in nodes:
        # Keep only comments (kind "t1"); this skips "more" stubs
        if n["kind"] != "t1":
            continue
        d = n["data"]
        out.append({
            "id": d["id"],
            "author": d["author"],
            "body": d["body"],
            "score": d["score"],
            "depth": depth,
        })
        # replies is an empty string when there are none, a listing dict otherwise
        replies = d.get("replies")
        if isinstance(replies, dict):
            out.extend(_flatten(replies["data"]["children"], depth + 1))
    return out
Comment trees on large threads can run to thousands of nodes. Reddit uses a more stub (kind: "more") for deeply nested replies it does not load upfront. You either ignore those (fine for most analytics) or follow the children list in each more stub by making additional requests. Doing it properly is fiddly.
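Before deciding whether the stubs are worth chasing, it helps to know how much is hidden behind them. The sketch below walks the same tree shape as `_flatten` and gathers the comment IDs listed in every `more` stub. `collect_more_ids` is a name invented for this example, and how you then fetch those IDs is left open.

```python
def collect_more_ids(nodes):
    """Return the comment IDs listed in every "more" stub of a comment tree."""
    ids = []
    for n in nodes:
        if n["kind"] == "more":
            # A stub carries the IDs of the comments it stands in for.
            ids.extend(n["data"].get("children", []))
        elif n["kind"] == "t1":
            # Recurse into loaded replies; stubs can sit at any depth.
            replies = n["data"].get("replies")
            if isinstance(replies, dict):
                ids.extend(collect_more_ids(replies["data"]["children"]))
    return ids
```

Calling `len(collect_more_ids(data[1]["data"]["children"]))` on a fetched thread gives a rough count of unloaded comments, which is usually enough to decide whether ignoring them skews your analysis.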
Method 3: Proxy rotation for scale
Past ~10,000 requests per day or when your IP gets soft-banned, you need rotating residential proxies. The cheap path is a commercial proxy service: ScraperAPI, Scrape.do, Bright Data, or similar. You send your request to their endpoint, they rotate the exit IP, and you pay per request.
import requests

def fetch_with_proxy(target_url: str, proxy_key: str):
    # The vendor fetches target_url through a rotated exit IP and relays the body
    res = requests.get(
        "http://api.scraperapi.com",
        params={"api_key": proxy_key, "url": target_url, "country_code": "us"},
        timeout=30,
    )
    res.raise_for_status()
    return res.json()
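Some providers hand you plain proxy endpoints instead of a forwarding API, in which case the rotation happens on your side. A minimal sketch of that variant follows, assuming you already hold a list of proxy URLs. `make_rotating_get` is a hypothetical helper and the proxy addresses in the usage note are placeholders. The `proxies` mapping is the standard `requests` keyword argument.

```python
import itertools

def make_rotating_get(get, proxy_urls):
    """Wrap a requests.get-style callable so each call uses the next proxy."""
    pool = itertools.cycle(proxy_urls)  # round-robin, wraps around forever

    def rotating_get(url, **kwargs):
        proxy = next(pool)
        # requests expects one entry per URL scheme in the proxies mapping
        return get(url, proxies={"http": proxy, "https": proxy}, **kwargs)

    return rotating_get
```

Usage would look like `get = make_rotating_get(requests.get, ["http://user:pass@proxy1.example.com:8000", "http://user:pass@proxy2.example.com:8000"])`, after which `get(url, headers=headers, timeout=10)` behaves like `requests.get` with a different exit IP per call. Round-robin is the simplest policy. It does not retire proxies that have been banned.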
This unblocks you immediately, but you are now paying a proxy vendor per request on top of whatever you pay for servers and the engineering time to maintain the scraper. If all you need is to read public Reddit data for your product, you are burning engineering capacity on infrastructure that has no product value.
The maintenance tax
Everything above works in April 2026. None of it is guaranteed to work in July 2026. Reddit has been tightening access progressively since the API pricing change. They can turn off .json responses for unauthenticated traffic tomorrow (they already did it briefly in 2023 before reversing). They can tighten rate limits. They can block common data center IP ranges.
If your product depends on Reddit data, the question is not "can I write a scraper" but "how many engineering hours per month am I willing to spend keeping the scraper running." For personal projects, the answer is "zero because I will fix it on a Saturday." For a commercial product, the answer is usually "fewer than I think," because every hour spent on scraper maintenance is an hour not spent on the actual product.
When to hand the problem to an API
The point at which DIY stops paying off is when you want any of:
- More than one platform (Reddit + Twitter + YouTube, etc.)
- Guaranteed uptime on the data layer
- A support contract when something breaks
- Native AI agent access via MCP rather than HTTP glue
- Historical backfill that goes beyond what .json returns
CreatorCrawl covers Reddit alongside TikTok, Instagram, YouTube, LinkedIn, and Twitter/X under one API key. The Reddit endpoints today include:
- Subreddit details (subscribers, description, rules)
- Subreddit posts (hot, top, new, rising with pagination)
- Subreddit search
- Post comments (full nested tree, flattened or recursive)
- Cross-subreddit search
curl "https://creatorcrawl.com/api/v1/reddit/subreddit/posts?subreddit=MachineLearning&sort=top" \
-H "x-api-key: YOUR_API_KEY"
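For consistency with the earlier examples, the same call can be sketched in Python. The URL, query parameters, and header name are copied from the curl command above. The function names are invented here, and the response schema is not described in this article, so the sketch simply returns the parsed JSON.

```python
import json
import urllib.parse
import urllib.request

BASE = "https://creatorcrawl.com/api/v1/reddit/subreddit/posts"

def build_posts_request(subreddit: str, sort: str, api_key: str):
    """Build (but do not send) the request shown in the curl example."""
    query = urllib.parse.urlencode({"subreddit": subreddit, "sort": sort})
    return urllib.request.Request(f"{BASE}?{query}", headers={"x-api-key": api_key})

def fetch_posts(subreddit: str, sort: str, api_key: str):
    """Send the request and return the decoded JSON body as-is."""
    req = build_posts_request(subreddit, sort, api_key)
    with urllib.request.urlopen(req, timeout=30) as res:
        return json.load(res)
```

Splitting the builder from the sender keeps the URL construction checkable without a network call or a real key.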
Because it is the same API that covers the other five platforms, an AI agent plugged in via the MCP server can search Reddit, cross-reference with Twitter, and pull YouTube comments without needing three different integrations. That collapses the maintenance tax on all of it to zero.
Pricing is pay-as-you-go credits. 250 credits free on signup, credits never expire, no subscription. For most teams the economics beat building and maintaining a Reddit scraper by a wide margin once you count engineering hours honestly.
Decision table
| Use case | Best approach |
|---|---|
| One-off research, <1,000 posts | Method 1, .json endpoint |
| Personal project, ongoing | PRAW (official API, free for non-commercial) |
| Commercial product, Reddit only | Method 1 or 2 + proxy rotation |
| Commercial product, multi-platform | Data API like CreatorCrawl |
| AI agent reading Reddit | MCP server (CreatorCrawl or similar) |
| Historical data dump | Arctic Shift (academic torrents) |
Where to go from here
Reddit is still one of the more scrape-friendly major platforms in 2026. You can get a long way with .json endpoints and a polite user agent. The honest decision point is when scraper maintenance starts eating hours you would rather spend on the actual product. At that line, a multi-platform data API with an SLA becomes the cheaper option once you cost engineering time properly.
If you want to try the managed path, sign up for CreatorCrawl. 250 credits free, no card, native MCP for Claude and Cursor.