How to Build an MCP Server for Web Scraping (and Why You Might Not Want To)
If you are an AI engineer building an agent that reads from the web, you have probably hit the same wall. The agent can call your scraper, but the response shape changes every time the target site renames a div. The agent loops on retries. You end up writing a per-site adapter layer that looks like a normal scraper framework, except slightly worse. MCP, the Model Context Protocol Anthropic shipped in late 2024, exists for exactly this gap. This post walks through how to build an MCP server for web scraping, what the gotchas are, and the honest tradeoffs of rolling your own versus plugging into a managed one.
What MCP actually changes for scrapers
A scraper without MCP looks like a REST API that an agent has to learn through prompt examples. A scraper with MCP looks like a typed function the agent already knows how to call. The protocol forces three things that matter for agent use:
- Tool discovery. The client asks the server "what can I do" and gets a typed list back. The agent reads it once and knows the surface area.
- Schema-validated arguments. The agent does not freestyle the request body. It fills in fields the server declared.
- Structured errors. Failures come back as protocol-level errors, not 500s embedded in HTML. The agent can branch on them.
For a scraping use case this is the difference between an agent that loops forever on a flaky endpoint and an agent that retries the right number of times and then surfaces a clean message to the user.
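Concretely, here is a rough sketch of that flow from the agent side, using the official TypeScript client SDK against a scraper server like the one built in the next section. The URL, tool name, and arguments are placeholders, not a real endpoint.

import { Client } from '@modelcontextprotocol/sdk/client/index.js'
import { StreamableHTTPClientTransport } from '@modelcontextprotocol/sdk/client/streamableHttp.js'

// Connect to a hypothetical scraper server over Streamable HTTP.
const client = new Client({ name: 'demo-agent', version: '1.0.0' })
await client.connect(new StreamableHTTPClientTransport(new URL('http://localhost:3000/mcp')))

// Tool discovery: one request returns every tool with its name, description, and input schema.
const { tools } = await client.listTools()
console.log(tools.map((t) => t.name))

// Schema-validated call: the arguments are exactly the fields the server declared.
const result = await client.callTool({
  name: 'fetch_url',
  arguments: { url: 'https://example.com' },
})
console.log(result.content)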
The minimum useful MCP scraper
Here is the smallest server that does something real: Node, TypeScript, the official SDK, and Express to carry the Streamable HTTP transport.
import express from 'express'
import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js'
import { StreamableHTTPServerTransport } from '@modelcontextprotocol/sdk/server/streamableHttp.js'
import { z } from 'zod'

// Build a fresh server per request: the stateless Streamable HTTP pattern.
function buildServer() {
  const server = new McpServer({ name: 'simple-scraper', version: '1.0.0' })
  server.tool(
    'fetch_url',
    'Fetch a URL and return the cleaned text content.',
    { url: z.string().url().describe('Absolute URL to fetch') },
    async ({ url }) => {
      const res = await fetch(url, {
        headers: { 'user-agent': 'Mozilla/5.0 (scraper-bot)' },
      })
      if (!res.ok) {
        throw new Error(`Upstream ${res.status} ${res.statusText}`)
      }
      const html = await res.text()
      // Crude tag stripping: fine for a demo, not for real parsing.
      const text = html.replace(/<[^>]+>/g, ' ').replace(/\s+/g, ' ').trim()
      return { content: [{ type: 'text', text: text.slice(0, 50_000) }] }
    },
  )
  return server
}

const app = express()
app.use(express.json())
app.post('/mcp', async (req, res) => {
  const transport = new StreamableHTTPServerTransport({ sessionIdGenerator: undefined })
  await buildServer().connect(transport)
  await transport.handleRequest(req, res, req.body)
})
app.listen(3000)
That is a working MCP server listening on http://localhost:3000/mcp. Point any MCP client at it and the agent can fetch URLs. It is also nowhere near production ready.
What you still need to add
The minimum example is missing every hard part of real scraping. Here is the list of things you discover after launch, in roughly the order they bite.
1. Auth at the right layer
Your MCP server has to authenticate the client (so you can bill them or rate limit them) and your scraping backend has to authenticate to the target (proxies, cookies, captchas). The MCP spec covers the first half. The second half is yours.
The cleanest pattern is to pass a client key as an HTTP header on the MCP transport, then look it up server-side, as sketched below. Most MCP clients support custom headers for remote servers in their config.
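A sketch of that lookup, bolted onto the Express app from the minimal example. The key store here is a stand-in; the only real requirement is that the check runs before the request reaches the MCP transport.

import type { Request, Response, NextFunction } from 'express'

// Stand-in for your real key store: swap for a database or cache lookup.
async function lookupApiKey(key: string): Promise<{ accountId: string } | null> {
  return key === process.env.DEMO_API_KEY ? { accountId: 'demo' } : null
}

// Register this before the /mcp route so unauthenticated requests never hit the protocol layer.
async function requireApiKey(req: Request, res: Response, next: NextFunction) {
  const account = await lookupApiKey(req.header('x-api-key') ?? '')
  if (!account) {
    res.status(401).json({ error: 'missing or invalid API key' })
    return
  }
  next()
}

app.use('/mcp', requireApiKey)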
2. Per-platform tools, not one generic fetch tool
Agents are bad at "fetch this URL and parse this thing". They are good at "get the recent posts for this TikTok handle". You want a tool per concrete action. A profile tool. A search tool. A comments tool. Each one has a typed input and a typed output. The agent then composes them.
This is the single biggest design decision. If you take the lazy fetch_url route, you end up reimplementing the agent inside your own service to parse HTML. If you go granular, the agent does the work it is good at, which is reasoning about which tool to call.
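Here is what a granular tool can look like, assuming the McpServer instance and zod import from the minimal example. The tool name, the parameters, and the fetchRecentPosts backend are illustrative, not a real platform API.

// Placeholder backend: in reality this is where the proxies and headless browsers live.
async function fetchRecentPosts(handle: string, limit: number) {
  return Array.from({ length: limit }, (_, i) => ({ handle, id: `post-${i}`, caption: '', likeCount: 0 }))
}

// A per-platform tool: typed input, structured output, one concrete action.
server.tool(
  'tiktok_recent_posts',
  'Get the most recent posts for a TikTok handle.',
  {
    handle: z.string().min(1).describe('TikTok handle without the @'),
    limit: z.number().int().min(1).max(50).default(10).describe('How many posts to return'),
  },
  async ({ handle, limit }) => {
    const posts = await fetchRecentPosts(handle, limit)
    // Serialize clean JSON, not raw HTML, so the agent reasons over stable fields.
    return { content: [{ type: 'text', text: JSON.stringify(posts) }] }
  },
)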
3. Rate limiting that the protocol can see
When you rate limit, do not just return a 429 in your transport. Translate it into a protocol-level error with a retry_after hint. Agents handle this gracefully. They do not handle "the HTTP layer disappeared".
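One way to express that with the SDK is a tool-level error result that carries the hint in its payload, so it travels through the protocol instead of dying at the HTTP layer. The naive in-memory limiter and the tool itself are illustrative, and the snippet assumes the server and zod setup from the minimal example.

// Naive rolling-window limiter: 30 calls per minute per account. Swap for Redis in production.
const callLog = new Map<string, number[]>()
function checkRateLimit(account: string, limit = 30, windowMs = 60_000) {
  const now = Date.now()
  const recent = (callLog.get(account) ?? []).filter((t) => now - t < windowMs)
  callLog.set(account, recent.concat(now))
  return { allowed: recent.length < limit, retryAfterSeconds: Math.ceil(windowMs / 1000) }
}

server.tool(
  'fetch_url_limited',
  'Fetch a URL, refusing politely when the caller is over their rate limit.',
  { url: z.string().url() },
  async ({ url }) => {
    const verdict = checkRateLimit('demo-account')
    if (!verdict.allowed) {
      // Tool-level error with a retry hint the agent can branch on, not a bare 429.
      return {
        isError: true,
        content: [{ type: 'text', text: JSON.stringify({ error: 'rate_limited', retry_after: verdict.retryAfterSeconds }) }],
      }
    }
    const res = await fetch(url)
    return { content: [{ type: 'text', text: (await res.text()).slice(0, 50_000) }] }
  },
)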
4. Idempotent tool calls
Agents retry. They retry more than humans do because the cost of retrying is one token, not one minute of attention. Make sure your scrape endpoints are idempotent or at least cheap to repeat. Charge per request, not per click. Cache aggressively on identical inputs.
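A minimal version of that caching, keyed on the tool name plus its serialized arguments. The in-memory map and the five-minute TTL are placeholders for whatever cache you actually run.

// Cache identical scrape requests so agent retries are cheap and effectively idempotent.
// Key assumes callers pass arguments in a stable order; normalize if yours do not.
const scrapeCache = new Map<string, { at: number; payload: string }>()
const TTL_MS = 5 * 60_000

async function cachedScrape(tool: string, args: Record<string, unknown>, run: () => Promise<string>) {
  const key = `${tool}:${JSON.stringify(args)}`
  const hit = scrapeCache.get(key)
  if (hit && Date.now() - hit.at < TTL_MS) return hit.payload
  const payload = await run()
  scrapeCache.set(key, { at: Date.now(), payload })
  return payload
}

// Usage inside a tool handler, where scrapeProfile is whatever does the real work:
// const text = await cachedScrape('tiktok_profile', { handle }, () => scrapeProfile(handle))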
5. Real anti-bot infrastructure
The hard part of scraping was never the HTTP request. It was the proxies, the headless browsers, the captcha solving, the residential rotation, the JavaScript challenge handling, the TLS fingerprinting. None of that goes away because you wrapped it in MCP. If anything, MCP makes the gap more visible. Agents pound on the endpoint and any flakiness shows up immediately.
6. Structured outputs the agent can reason over
Returning raw HTML to an LLM is a waste of tokens and a recipe for hallucinations. Return clean JSON. Define the schema. Strip the fields agents do not need. Your tool description should match the actual response shape, character for character.
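One way to keep the description and the payload in sync is to declare the output shape once with zod, validate against it before returning, and derive the description from the same place. The fields here are illustrative.

import { z } from 'zod'

// Declare the output shape once; the tool description should describe exactly this.
const ProfileOutput = z.object({
  handle: z.string(),
  displayName: z.string(),
  followerCount: z.number().int(),
  bio: z.string().nullable(),
})
type ProfileOutput = z.infer<typeof ProfileOutput>

// Validate before returning: zod's object parsing drops keys the schema does not name,
// which is exactly the "strip what the agent does not need" step.
function toToolPayload(raw: unknown): string {
  const profile: ProfileOutput = ProfileOutput.parse(raw)
  return JSON.stringify(profile)
}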
When to build versus when to buy
Here is the honest read after shipping several MCP scrapers in production.
Build your own if:
- The target site is internal or under your control.
- You only need one or two tools and the volume is low.
- You enjoy maintaining a proxy farm.
Buy a managed one if:
- You need cross-platform coverage (TikTok, Instagram, YouTube, LinkedIn, Reddit, Twitter, etc.).
- Your agent's reliability is a product feature, not a side project.
- You would rather spend engineering hours on the agent's reasoning, not on rotating IPs.
That second category is exactly what CreatorCrawl is. We run an MCP server at https://app.creatorcrawl.com/api/mcp that exposes 60+ typed tools across the major social platforms. Each one returns clean structured JSON. The proxies, browsers, retries, and platform-specific quirks are our problem, not yours.
You connect it from any MCP-compatible client with a single config entry:
{
  "mcpServers": {
    "creatorcrawl": {
      "url": "https://app.creatorcrawl.com/api/mcp",
      "headers": { "x-api-key": "your_api_key_here" }
    }
  }
}
Same protocol, no infra to run, no maintenance when a platform updates their frontend.
TLDR
If you are scraping the internal network at your company, build your own MCP server. The example above is a real starting point. If you are scraping public social platforms for an agent product, you will burn months on the infrastructure. The managed route is the right call.
Sign up for free and connect any MCP client to live social data in under a minute. Or read the MCP docs for the full tool list.