
How to Build an MCP Server for Web Scraping (and Why You Might Not Want To)

by Simon Balfe

If you are an AI engineer building an agent that reads from the web, you have probably hit the same wall. The agent can call your scraper, but the response shape changes every time the target site renames a div. The agent loops on retries. You end up writing a per-site adapter layer that looks like a normal scraper framework, except slightly worse. MCP, the Model Context Protocol Anthropic shipped in late 2024, exists for exactly this gap. This post walks through how to build an MCP server for web scraping, what the gotchas are, and the honest tradeoffs of rolling your own versus plugging into a managed one.

#What MCP actually changes for scrapers

A scraper without MCP looks like a REST API that an agent has to learn through prompt examples. A scraper with MCP looks like a typed function the agent already knows how to call. The protocol forces three things that matter for agent use:

  1. Tool discovery. The client asks the server "what can I do" and gets a typed list back. The agent reads it once and knows the surface area.
  2. Schema-validated arguments. The agent does not freestyle the request body. It fills in fields the server declared.
  3. Structured errors. Failures come back as protocol-level errors, not 500s embedded in HTML. The agent can branch on them.

For a scraping use case this is the difference between an agent that loops forever on a flaky endpoint and an agent that retries the right number of times and then surfaces a clean message to the user.

#The minimum useful MCP scraper

Here is the smallest server that does something real: Node and TypeScript, using the official @modelcontextprotocol/sdk.

import { createServer } from 'node:http'
import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js'
import { StreamableHTTPServerTransport } from '@modelcontextprotocol/sdk/server/streamableHttp.js'
import { z } from 'zod'

const server = new McpServer({ name: 'simple-scraper', version: '1.0.0' })

server.tool(
  'fetch_url',
  'Fetch a URL and return the cleaned text content.',
  {
    url: z.string().url().describe('Absolute URL to fetch'),
  },
  async ({ url }) => {
    const res = await fetch(url, {
      headers: { 'user-agent': 'Mozilla/5.0 (scraper-bot)' },
    })
    if (!res.ok) {
      throw new Error(`Upstream ${res.status} ${res.statusText}`)
    }
    const html = await res.text()
    // Crude tag-strip: fine for a demo, not for real extraction
    const text = html.replace(/<[^>]+>/g, ' ').replace(/\s+/g, ' ').trim()
    return { content: [{ type: 'text', text: text.slice(0, 50_000) }] }
  },
)

// Stateless mode: no session IDs, every request stands alone
const transport = new StreamableHTTPServerTransport({ sessionIdGenerator: undefined })
await server.connect(transport)
createServer((req, res) => transport.handleRequest(req, res)).listen(3000)

That is a working MCP server. Point any MCP client at it and the agent can fetch URLs. It is also nowhere near production ready.

#What you still need to add

The minimum example is missing every hard part of real scraping. Here is the list of things you discover after launch, in roughly the order they bite.

#1. Auth at the right layer

Your MCP server has to authenticate the client (so you can bill them or rate limit them) and your scraping backend has to authenticate to the target (proxies, cookies, captchas). The MCP spec covers the first half. The second half is yours.

The cleanest pattern is to pass a client key as an HTTP header on the MCP transport, then look it up server-side. Every modern MCP client supports custom headers in its config.
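The server-side half of that lookup can be sketched like this. The key table, header name, and `ClientRecord` fields are illustrative; in production the table would live in a database and the result would feed your billing and rate-limit logic.

```typescript
// Hypothetical client-key table; a real one lives in a database.
type ClientRecord = { account: string; rateLimitPerMin: number }

const API_KEYS = new Map<string, ClientRecord>([
  ['demo_key_1', { account: 'acme', rateLimitPerMin: 60 }],
])

// Resolve the client from the transport's HTTP headers, or null if
// the key is missing or unknown. The caller decides how to reject.
function authenticateClient(
  headers: Record<string, string | undefined>,
): ClientRecord | null {
  const key = headers['x-api-key']
  if (!key) return null
  return API_KEYS.get(key) ?? null
}
```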

#2. Per-platform tools, not one generic fetch tool

Agents are bad at "fetch this URL and parse this thing". They are good at "get the recent posts for this TikTok handle". You want a tool per concrete action. A profile tool. A search tool. A comments tool. Each one has a typed input and a typed output. The agent then composes them.

This is the single biggest design decision. If you take the lazy fetch_url route, you end up reimplementing the agent inside your own service to parse HTML. If you go granular, the agent does the work it is good at, which is reasoning about which tool to call.
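A granular tool surface might look like the sketch below, with one typed entry per concrete action. The tool names, fields, and JSON Schema shapes are illustrative, not a real platform API.

```typescript
// One tool per concrete action, each with a declared input shape.
type ToolSpec = {
  description: string
  inputSchema: object
}

const tools: Record<string, ToolSpec> = {
  tiktok_profile: {
    description: 'Get profile metadata for a TikTok handle.',
    inputSchema: {
      type: 'object',
      properties: { handle: { type: 'string' } },
      required: ['handle'],
    },
  },
  tiktok_recent_posts: {
    description: 'Get the most recent posts for a TikTok handle.',
    inputSchema: {
      type: 'object',
      properties: {
        handle: { type: 'string' },
        limit: { type: 'number', default: 10 },
      },
      required: ['handle'],
    },
  },
}
```

The agent sees two narrow, self-describing tools instead of one generic fetch, so the reasoning about which to call happens where it belongs: in the model.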

#3. Rate limiting that the protocol can see

When you rate limit, do not just return a 429 in your transport. Translate it into a protocol-level error with a retry_after hint. Agents handle this gracefully. They do not handle "the HTTP layer disappeared".
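One way to sketch that translation: map the rate limit into a JSON-RPC error object with a retry hint in `data`. The code -32000 sits in the range JSON-RPC 2.0 reserves for implementation-defined server errors; the `retry_after` field is our own convention, not part of the MCP spec.

```typescript
// Translate a rate limit into a protocol-level error the agent can
// branch on, instead of a bare HTTP 429.
type RpcError = {
  code: number
  message: string
  data?: { retry_after?: number }
}

function rateLimitToRpcError(retryAfterSeconds: number): RpcError {
  return {
    // -32000: implementation-defined server error range in JSON-RPC 2.0
    code: -32000,
    message: `Rate limited, retry in ${retryAfterSeconds}s`,
    data: { retry_after: retryAfterSeconds },
  }
}
```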

#4. Idempotent tool calls

Agents retry. They retry more than humans do because the cost of retrying is one token, not one minute of attention. Make sure your scrape endpoints are idempotent or at least cheap to repeat. Charge per request, not per click. Cache aggressively on identical inputs.
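A minimal sketch of that caching, assuming an in-memory map (a real deployment would use Redis or similar). Keys are the tool name plus the arguments serialized with sorted keys, so a retry of the same call hits the cache instead of the scraper.

```typescript
// Cache identical tool inputs so agent retries are cheap.
const cache = new Map<string, { value: unknown; expires: number }>()

// Sort keys before serializing so {a, b} and {b, a} hit the same entry
function cacheKey(tool: string, args: Record<string, unknown>): string {
  const sorted = Object.keys(args).sort().map((k) => [k, args[k]])
  return `${tool}:${JSON.stringify(sorted)}`
}

async function cached<T>(
  tool: string,
  args: Record<string, unknown>,
  ttlMs: number,
  run: () => Promise<T>,
): Promise<T> {
  const key = cacheKey(tool, args)
  const hit = cache.get(key)
  if (hit && hit.expires > Date.now()) return hit.value as T
  const value = await run()
  cache.set(key, { value, expires: Date.now() + ttlMs })
  return value
}
```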

#5. Real anti-bot infrastructure

The hard part of scraping was never the HTTP request. It was the proxies, the headless browsers, the captcha solving, the residential rotation, the JavaScript challenge handling, the TLS fingerprinting. None of that goes away because you wrapped it in MCP. If anything, MCP makes the gap more visible. Agents pound on the endpoint and any flakiness shows up immediately.

#6. Structured outputs the agent can reason over

Returning raw HTML to an LLM is a waste of tokens and a recipe for hallucinations. Return clean JSON. Define the schema. Strip the fields agents do not need. Your tool description should match the actual response shape, character for character.
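One simple way to enforce that: project the raw scraped record down to exactly the fields the tool description promises, so the agent never sees undocumented keys. The field names below are illustrative.

```typescript
// The declared output surface; anything outside it gets dropped.
const PROFILE_FIELDS = ['handle', 'displayName', 'followers', 'bio'] as const

type Profile = { [K in (typeof PROFILE_FIELDS)[number]]?: unknown }

// Copy only the declared fields from the raw scrape result
function toProfileOutput(raw: Record<string, unknown>): Profile {
  const out: Profile = {}
  for (const field of PROFILE_FIELDS) {
    if (field in raw) out[field] = raw[field]
  }
  return out
}
```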

#When to build versus when to buy

Here is the honest read after shipping several MCP scrapers in production.

Build your own if:

  • The target site is internal or under your control.
  • You only need one or two tools and the volume is low.
  • You enjoy maintaining a proxy farm.

Buy a managed one if:

  • You need cross-platform coverage (TikTok, Instagram, YouTube, LinkedIn, Reddit, Twitter, etc.).
  • Your agent's reliability is a product feature, not a side project.
  • You would rather spend engineering hours on the agent's reasoning, not on rotating IPs.

That second category is exactly what CreatorCrawl is. We run an MCP server at https://app.creatorcrawl.com/api/mcp that exposes 60+ typed tools across the major social platforms. Each one returns clean structured JSON. The proxies, browsers, retries, and platform-specific quirks are our problem, not yours.

You connect it in any MCP-compatible client with one line of config:

{
  "mcpServers": {
    "creatorcrawl": {
      "url": "https://app.creatorcrawl.com/api/mcp",
      "headers": { "x-api-key": "your_api_key_here" }
    }
  }
}

Same protocol, no infra to run, no maintenance when a platform updates their frontend.

#TLDR

If you are scraping the internal network at your company, build your own MCP server. The example above is a real starting point. If you are scraping public social platforms for an agent product, you will burn months on the infrastructure. The managed route is the right call.

Sign up for free and connect any MCP client to live social data in under a minute. Or read the MCP docs for the full tool list.

