How to Build AI Data Pipelines with Cloudflare's New /crawl API (Step-by-Step Guide)

Build AI-ready data pipelines with Cloudflare's /crawl API. Step-by-step guide with code, pricing breakdown, and comparison to Firecrawl and Crawl4AI.


Web Data Extraction Just Got a $5/Month Upgrade

Every AI application has a data problem. Language models are only as useful as the information you feed them, and most of that information lives on websites in documentation pages, product catalogs, competitor blogs, pricing tables, and knowledge bases that were never designed for machine consumption.

Until now, extracting that data reliably meant choosing between expensive SaaS tools (Firecrawl at $47/month for 100K pages), complex self-hosted setups (Crawl4AI requiring your own infrastructure), or brittle custom scripts that break every time a site changes its layout.

On March 10, 2026, Cloudflare launched /crawl, a single API endpoint that ingests entire websites and returns clean HTML, Markdown, or structured JSON. The announcement from @CloudflareDev generated over 2 million impressions and 8,600 bookmarks in 24 hours. The value proposition is straightforward: one POST request, one job ID, and all the content from a website lands in your pipeline.

This guide walks through building production AI data pipelines with /crawl, from basic extraction to RAG ingestion, competitive monitoring, and structured data workflows.

How Cloudflare /crawl Works: The Two-Step Process

The /crawl endpoint follows an asynchronous pattern that developers will recognize from any job-based API.


Step 1: Start the Job

Send a POST request with a target URL and your parameters:

curl -X POST \
  "https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl" \
  -H "Authorization: Bearer {api_token}" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "limit": 100,
    "formats": ["markdown", "html"],
    "render": true
  }'

The API immediately returns a job ID. No waiting, no blocking.

Step 2: Fetch Results

Poll the job ID with GET requests. Results stream in as pages are processed, with cursor-based pagination for large sites:

curl "https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/{job_id}" \
  -H "Authorization: Bearer {api_token}"

Each page in the response includes the URL, title, HTTP status, and content in your requested formats.

URL Discovery

The crawler automatically finds pages from three sources:

  1. The starting URL itself

  2. The site's XML sitemap

  3. Links discovered on each page during the job

It respects robots.txt by default and identifies itself as a bot, a point Cloudflare Product Manager Kathy Liao emphasized when addressing community concerns about the tool's ethics.

Key Parameters That Shape Your Pipeline

Understanding /crawl's parameters is the difference between a useful pipeline and an expensive mess.

render: true vs render: false

This single parameter changes everything about cost and capability:

| Parameter | What Happens | Best For | Cost |
| --- | --- | --- | --- |
| render: true | Full headless Chrome execution | SPAs (React, Vue, Angular), JS-rendered content | Browser time billed |
| render: false | Simple HTTP fetch, no JS | Static sites, docs, blogs, server-rendered HTML | Free during beta |

The render: false mode is the standout feature for pipeline builders. During the beta period, it costs nothing and performs a lightweight HTTP fetch. For documentation sites, static blogs, and server-rendered pages, this is all you need.

Filtering with includePatterns and excludePatterns

Control exactly which pages enter your pipeline:

{
  "url": "https://docs.example.com",
  "includePatterns": ["/api/**", "/guides/**"],
  "excludePatterns": ["/api/legacy/**", "/changelog/**"]
}

Wildcards use * (single segment) and ** (all segments). Exclude rules always override include rules.
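The wildcard semantics can be illustrated with a small local matcher. This is a re-implementation for reasoning about your patterns, not Cloudflare's actual matching code, and edge cases (such as whether `/api/**` matches `/api` itself) are assumptions:

```python
import re

def _to_regex(pattern: str) -> re.Pattern:
    """Translate a crawl pattern: ** crosses path segments, * stays within one."""
    out, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("^" + "".join(out) + "$")

def allowed(path: str, include=None, exclude=None) -> bool:
    """Decide whether a path enters the pipeline; exclude always overrides include."""
    if exclude and any(_to_regex(p).match(path) for p in exclude):
        return False
    if include:
        return any(_to_regex(p).match(path) for p in include)
    return True
```

Running your pattern lists through a helper like this before launching a job is a cheap way to avoid paying for pages you never wanted.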

maxAge and modifiedSince for Incremental Pipelines

These two parameters enable efficient differential processing:

  • maxAge: Controls how long results cache in Cloudflare R2 storage. Repeat jobs within this window return cached results instantly, consuming no browser time.

  • modifiedSince: Accepts a Unix timestamp. Only pages modified after that date are fetched. Combined with caching, this creates efficient incremental update pipelines.

Structured JSON Extraction with AI

The most powerful feature for AI pipelines is structured extraction. Provide a prompt or JSON schema, and Cloudflare's Workers AI extracts structured data from each page:

{
  "url": "https://shop.example.com",
  "formats": ["json"],
  "jsonOptions": {
    "prompt": "Extract the product name, price, and availability",
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "product",
        "schema": {
          "type": "object",
          "properties": {
            "name": { "type": "string" },
            "price": { "type": "number" },
            "in_stock": { "type": "boolean" }
          }
        }
      }
    }
  }
}

This eliminates the need to write custom parsers for each website. The AI handles varying page layouts and extracts data into a consistent schema.
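Because LLM-based extraction can occasionally return malformed records, it is worth validating each extracted object downstream before it enters your pipeline. Here is a minimal type check against the schema above; a full validator such as the `jsonschema` package would be more robust:

```python
def _is_type(value, t: str) -> bool:
    # JSON "number" should not accept booleans, which are ints in Python
    if t == "number":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    return isinstance(value, {"string": str, "boolean": bool}[t])

def conforms(item: dict, schema: dict) -> bool:
    """Check that every schema property is present with the declared type."""
    props = schema.get("properties", {})
    return all(
        key in item and _is_type(item[key], spec["type"])
        for key, spec in props.items()
    )

product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "in_stock": {"type": "boolean"},
    },
}
```

Records that fail the check can be routed to a retry queue or logged for manual review instead of silently corrupting your dataset.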

Pricing Breakdown: Why /crawl Changes the Economics

Cloudflare's pricing model is fundamentally different from competitors — and that difference matters at scale.

Free Plan (Workers Free)

  • 5 crawl jobs per day

  • 100 pages per job

  • 2 minutes of browser rendering time per day

  • render: false mode: free during beta

Paid Plan (Workers Paid $5/month)

  • Unlimited crawl jobs

  • 100 pages per job (same limit)

  • 10 hours of browser rendering time included

  • Additional browser time: $2.00 per hour

  • render: false mode: free during beta, then standard Workers pricing

Cost-per-Page Comparison

Here is where the math gets interesting:

| Tool | Pricing Model | Cost for 10,000 pages/month | Notes |
| --- | --- | --- | --- |
| Cloudflare /crawl (render: true) | Time-based ($5/mo + $2/hr overage) | $5–$12 | Depends on page complexity |
| Cloudflare /crawl (render: false) | Free during beta | $0 | Static sites only |
| Firecrawl Standard | Per-page ($47/mo for 100K) | $47 | Fixed monthly commitment |
| Firecrawl Growth | Per-page ($97/mo for 500K) | $97 | Better per-page rate |
| Crawl4AI | Self-hosted (infrastructure costs) | $20–$100+ | Depends on your hosting |
| Jina Reader | Per-page (free tier + paid) | $0–$20 | Single-page only, no multi-page crawling |

For teams already in the Cloudflare ecosystem, /crawl's time-based billing is dramatically cheaper than per-page alternatives. If a 100-page job takes 5 minutes of browser time, the 10 included hours cover about 120 such jobs, roughly 12,000 pages per month on the base $5 plan.
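That estimate is simple arithmetic, shown here under the assumption that browser time scales linearly with page count:

```python
def pages_per_month(minutes_per_100_pages: float, included_hours: float = 10) -> int:
    """Pages covered by the included browser time, assuming uniform page cost."""
    jobs = (included_hours * 60) / minutes_per_100_pages  # 100-page jobs per month
    return int(jobs * 100)
```

Heavier JavaScript pages push `minutes_per_100_pages` up and the monthly capacity down, which is exactly why render: false is worth using wherever the target site allows it.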

Building a RAG Pipeline with Cloudflare /crawl

The most common AI pipeline use case is Retrieval-Augmented Generation (RAG): feeding a knowledge base to an LLM so it answers questions with accurate, grounded information.


Step 1: Ingest Documentation in Markdown

curl -X POST \
  "https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl" \
  -H "Authorization: Bearer {api_token}" \
  -d '{
    "url": "https://docs.yourproduct.com",
    "limit": 100,
    "formats": ["markdown"],
    "render": false,
    "includePatterns": ["/docs/**"]
  }'

Markdown is the ideal format for RAG because it preserves heading structure (useful for chunking) while stripping navigation, footers, and other boilerplate.

Step 2: Chunk and Vectorize

Once you have the Markdown content, split it into semantic chunks (typically 500–1,000 tokens per chunk, split at heading boundaries) and generate embeddings using your preferred model (OpenAI text-embedding-3-small, Cohere embed-v4, etc.).
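A heading-aware chunker is only a few lines of Python. This sketch tracks the heading path for use as metadata; it is an illustration, not a library API, and splits purely at headings rather than at a token budget:

```python
import re

def chunk_markdown(md: str) -> list[dict]:
    """Split Markdown at heading boundaries, carrying the heading path along."""
    chunks, path, buf = [], [], []

    def flush():
        text = "\n".join(buf).strip()
        if text:
            chunks.append({"heading_path": " > ".join(path), "text": text})
        buf.clear()

    for line in md.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            flush()  # close out the previous section
            level = len(m.group(1))
            # truncate the path to the parent level, then append this heading
            path[:] = path[: level - 1] + [m.group(2).strip()]
        buf.append(line)
    flush()
    return chunks
```

In a real pipeline you would further split any chunk that exceeds your embedding model's context budget, keeping the same heading path on each piece.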

Step 3: Store in a Vector Database

Load the chunks and embeddings into Pinecone, Weaviate, Qdrant, or Cloudflare's own Vectorize service. Include the source URL and heading path as metadata for attribution.

Step 4: Query with Context

When a user asks a question, embed the query, retrieve the top-k relevant chunks, and inject them into your LLM prompt as context. The result: accurate answers grounded in your actual documentation, not hallucinated content.
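Retrieval itself reduces to a similarity search over the stored embeddings. A dependency-free sketch of the top-k step (a real vector database replaces this brute-force scan at scale):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], index: list[tuple], k: int = 3) -> list:
    """index: (chunk_metadata, embedding) pairs; returns the k closest chunks."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [meta for meta, _ in scored[:k]]
```

The retrieved chunks, with their source URLs and heading paths, are what you interpolate into the LLM prompt as context.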

Keeping the Pipeline Fresh

Use modifiedSince to run differential updates on a schedule:

# Only fetch pages modified since your last run
curl -X POST ... -d '{
  "url": "https://docs.yourproduct.com",
  "modifiedSince": 1742860800,
  "formats": ["markdown"],
  "render": false
}'

This keeps your knowledge base current without re-processing the entire site on every run.

Automated Competitive Intelligence Pipeline

Another high-value pipeline: monitoring competitor websites for changes in pricing, positioning, and product features.

The Workflow

  1. Initial baseline: Full extraction of competitor product and pricing pages using structured JSON extraction

  2. Scheduled differential runs: Daily or weekly jobs using modifiedSince to detect changes

  3. Change detection: Compare new extractions against your stored baseline to identify price changes, new feature announcements, or positioning shifts

  4. Alert and report: Feed detected changes into a notification system (Slack webhook, email alert, dashboard update)

The structured JSON extraction is key here. Instead of comparing raw HTML (which changes constantly due to minor template updates), you compare structured data fields (product name, price, feature list), giving you meaningful signal instead of noise.
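The comparison step can then be a plain dictionary diff over the structured extractions, keyed by page URL. This helper is a sketch of that idea:

```python
def diff_records(old: dict, new: dict) -> dict:
    """Return {url: {field: (old_value, new_value)}} for every changed field."""
    changes = {}
    for url, new_rec in new.items():
        old_rec = old.get(url, {})
        delta = {
            field: (old_rec.get(field), value)
            for field, value in new_rec.items()
            if old_rec.get(field) != value
        }
        if delta:
            changes[url] = delta
    return changes
```

The non-empty entries in the result are exactly what you forward to a Slack webhook or dashboard.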

Cloudflare /crawl vs Firecrawl vs Crawl4AI: When to Use Each

Choose Cloudflare /crawl When:

  • You are already in the Cloudflare ecosystem (Workers, R2, KV, Vectorize)

  • You need high-volume extraction at the lowest possible cost

  • Your targets are primarily static or server-rendered sites (leverage the free render: false mode)

  • You want a managed service with minimal infrastructure overhead

Choose Firecrawl When:

  • Developer experience is your top priority (polished SDKs, better documentation)

  • You need advanced AI features out of the box (LLM extraction, screenshots, structured mapping)

  • You prefer predictable per-page billing over time-based billing

  • You do not want to manage Cloudflare accounts or Workers

Choose Crawl4AI When:

  • You need complete control over your extraction infrastructure

  • Budget is tight but you have engineering capacity (self-hosted, open-source, 61,000+ GitHub stars)

  • You are building training datasets and need to scale without rate limits

  • You operate in a regulated environment that requires all data processing on your own servers

Choose Jina Reader When:

  • You need single-page conversion to LLM-friendly Markdown

  • Simplicity is paramount (prepend https://r.jina.ai/ to any URL)

  • You do not need multi-page crawling or batch processing

The Ethics Debate: Cloudflare Selling the Lock and the Lockpick

The announcement sparked heated debate in the developer community. Cloudflare, the company that built its reputation on bot protection, is now selling a tool for programmatic web content extraction. One site reliability engineer's tweet calling it "the biggest betrayal in tech this year" went viral with 496,000 impressions.

Cloudflare's position is unambiguous: /crawl identifies as a bot, respects robots.txt, and does not bypass any anti-bot protections. If a site owner blocks bots, the extraction fails. Content owners retain full control.

This is an important distinction. The tool is designed for legitimate use cases (extracting your own content, building knowledge bases from public documentation, monitoring publicly available pricing), not for circumventing access controls.

Current Limitations to Know Before Building

No Image Extraction

The /crawl endpoint returns text content only (HTML, Markdown, JSON). For screenshots or visual content capture, you need the separate /screenshot endpoint.

No Bot Protection Bypass

If a target site uses CAPTCHAs, Bot Fight Mode, or Cloudflare's own challenge pages, the job will fail. This is intentional: /crawl is not designed for adversarial extraction.

Open Beta Stability

The API is in open beta. Some developers report "Crawl job not found" errors immediately after job creation. Production pipelines should include retry logic and error handling.
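A simple exponential-backoff wrapper covers the transient "Crawl job not found" case; the attempt count and delays here are arbitrary defaults, not values Cloudflare recommends:

```python
import time

def with_retries(fetch, attempts: int = 5, base_delay: float = 1.0):
    """Call fetch() with exponential backoff; re-raise after the final attempt.

    'Crawl job not found' immediately after job creation usually resolves
    within a few seconds, so a short backoff is typically enough.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```

Wrap the polling call (not the job-creation call) in this helper, so a failed lookup never re-launches the crawl and double-bills browser time.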

Page Limits

Both free and paid plans cap jobs at 100 pages. For larger sites, you need multiple jobs with different starting URLs or pattern filters. This is a meaningful constraint for enterprise-scale extraction.

The /crawl API addresses a pain point that every team building AI applications has encountered: getting clean, structured data from websites at scale. Before this API, the typical approach involved setting up headless browsers, managing proxy rotation, handling JavaScript rendering, dealing with rate limiting, and parsing HTML into a usable format. Each of these steps is a potential failure point, and maintaining the infrastructure to do this reliably costs significant engineering time.

For email platforms that need to enrich contact data with company information, the /crawl API opens up new possibilities. Maylee could use it to automatically extract key information from a prospect's website, such as their product offering, team size, technology stack, and recent news, to help users craft more relevant and personalized email responses. The structured output format means this data can be fed directly into templates without manual parsing.

The integration with Cloudflare's broader AI platform, including Workers AI, AI Gateway, and Vectorize, creates a powerful end-to-end pipeline for AI data processing. You can crawl pages, extract text, generate embeddings, store them in a vector database, and serve search results, all within Cloudflare's infrastructure. For teams already using Cloudflare for CDN and DNS, adding AI data pipelines to the same platform simplifies both the architecture and the billing.

Rate limiting and ethical crawling are important considerations that Cloudflare has addressed directly in the API design. The /crawl endpoint respects robots.txt by default, implements automatic rate limiting to avoid overwhelming target servers, and provides clear documentation on responsible use. For organizations building commercial data pipelines, this built-in compliance reduces the legal and ethical risks associated with web scraping.

The technical implementation reveals thoughtful engineering decisions. The API uses Cloudflare's global network of data centers to distribute crawl requests geographically, which means crawling appears to originate from the edge node closest to the target server. This improves performance and reduces the likelihood of IP-based blocking, since requests come from Cloudflare's well-known and generally whitelisted IP ranges.

For AI application developers, the output format is designed to be directly consumable by language models. The API returns cleaned text, extracted metadata, and structured links in a JSON format that can be fed directly into an LLM prompt or a vector embedding pipeline. This eliminates the common preprocessing step where developers spend significant effort cleaning HTML, removing navigation elements, and extracting the actual content from the page structure.

The pricing model follows Cloudflare's typical approach of offering a generous free tier with pay-as-you-go scaling. The free tier includes enough capacity for development and small-scale production use, while the $5/month Workers Paid plan scales with usage. For startups building AI-powered email tools like Maylee that need to enrich contact and company data from web sources, the free tier provides enough capacity to validate the use case before committing to a paid plan.

Security considerations are worth highlighting for enterprise users. All crawled data passes through Cloudflare's infrastructure, which means Cloudflare has access to the crawled content. For organizations with strict data handling requirements, this is an important factor to evaluate. The API does support custom headers and authentication tokens for crawling protected pages, but the data still transits through Cloudflare's network. Teams handling highly sensitive data may prefer self-hosted crawling solutions despite the additional infrastructure overhead.

Getting Started: From Zero to Working Pipeline in 15 Minutes

  1. Create a Cloudflare account at dash.cloudflare.com (free)

  2. Generate an API token with Browser Rendering permissions

  3. Copy your Account ID from the Workers dashboard

  4. Run your first extraction using the curl commands above

  5. Upgrade to Workers Paid ($5/month) when you exceed free-tier limits

The official documentation at developers.cloudflare.com/browser-rendering covers all parameters, output formats, and advanced configuration.

The long-term strategic implications of Cloudflare's /crawl API extend beyond simple web scraping. By making structured web data easily accessible, Cloudflare is positioning itself as the data infrastructure layer for AI applications, complementing its existing role as the network infrastructure layer for web applications. Companies that build their AI data pipelines on Cloudflare's platform benefit from the same reliability, global distribution, and security infrastructure that powers millions of websites. For email platforms like Maylee that need reliable access to web data for features like link preview generation, sender reputation checking, and contact enrichment, building on a platform with Cloudflare's uptime track record provides a level of reliability that self-hosted solutions struggle to match.

For teams already using Cloudflare Workers for serverless compute, the /crawl API integrates seamlessly into existing deployment pipelines. A Worker can trigger a crawl, process the results, store them in R2 or a D1 database, and serve them via an API endpoint, all within Cloudflare's platform and without managing any servers. This serverless-first approach to AI data pipelines represents the direction the industry is heading.

The ability to turn any website into structured, AI-ready data with a single API call is a fundamental building block for the next generation of AI applications. Tools like Maylee, which use AI to understand and classify incoming emails, depend on the same principle: extracting meaning from unstructured content and turning it into actionable intelligence. Cloudflare just made the web-data side of that equation dramatically more accessible.

Cloudflare /crawl API FAQ: Common Questions Answered

How much does Cloudflare /crawl cost?

The free plan allows 5 jobs per day with 100 pages each. The paid plan costs $5/month and includes 10 hours of browser rendering time. The render:false mode (no JavaScript execution) is free during the beta period. Additional browser time costs $2.00 per hour.

Can Cloudflare /crawl handle JavaScript-rendered websites?

Yes. Set render:true to launch a full headless Chrome instance that executes JavaScript before extracting content. This handles React, Vue, Angular, and other SPA frameworks. Set render:false for static sites to save cost.

How does Cloudflare /crawl compare to Firecrawl?

Cloudflare /crawl uses time-based billing (roughly $5-12/month for 10,000 pages) while Firecrawl charges per page ($47/month for 100,000 pages). Cloudflare is cheaper at scale but has less polished SDKs and documentation. Firecrawl offers better developer experience and built-in AI extraction features.

Does Cloudflare /crawl respect robots.txt?

Yes. The tool identifies itself as a bot and respects robots.txt by default. It does not bypass CAPTCHAs, Bot Fight Mode, or other anti-bot protections. If a site blocks bots, the extraction will fail.

What output formats does Cloudflare /crawl support?

The API returns content in HTML, Markdown, and structured JSON. The JSON format supports AI-powered extraction using prompts or JSON schemas, allowing you to extract specific data fields like product names, prices, and descriptions without writing custom parsers.

Can I use Cloudflare /crawl for building RAG pipelines?

Yes, this is one of the primary use cases. Extract documentation or knowledge base content in Markdown format, chunk it by heading structure, vectorize the chunks, and store them in a vector database. Use the modifiedSince parameter for efficient incremental updates.

What are the page limits for Cloudflare /crawl?

Both free and paid plans cap individual jobs at 100 pages. For larger sites, you need to run multiple jobs with different starting URLs or use includePatterns to target specific sections of a site.

Is Cloudflare /crawl stable enough for production use?

The API is currently in open beta. Some developers have reported intermittent "Crawl job not found" errors. For production pipelines, implement retry logic, error handling, and fallback mechanisms. The render:false mode appears more stable than render:true.

© 2026 Maylee. All rights reserved.