CloudProxy: Self-Hosting Your Way Past Cloudflare's Bot Detection

Hook

Cloudflare blocks over 180 billion bot requests every day, yet a 569-star TypeScript project claims to bypass it using nothing more than a headless browser and some clever fingerprint masking.

Context

Web scraping hit a wall when Cloudflare became ubiquitous. What started as simple rate limiting evolved into sophisticated JavaScript challenges that analyze browser fingerprints, TLS signatures, and behavioral patterns to distinguish humans from bots. Traditional HTTP clients like requests or axios fail immediately—Cloudflare's "Checking your browser" page never resolves because there's no JavaScript engine to execute the challenge.

The scraping ecosystem split into two camps: pay-per-request commercial services charging $1-5 per thousand requests, or DIY solutions requiring browser automation expertise. CloudProxy emerged as a middle path—a self-hosted proxy server that handles the Cloudflare dance so your lightweight HTTP client doesn't have to. It's particularly popular in the sneaker bot community where margins are thin and API costs add up quickly.

Technical Insight

CloudProxy's architecture is deceptively simple: it's an HTTP server that proxies requests through Puppeteer-controlled Chrome instances equipped with stealth plugins. When you send a request to CloudProxy's REST API, it doesn't make a direct HTTP call—it launches (or reuses) a headless browser, navigates to your target URL, waits for Cloudflare's JavaScript challenge to auto-solve, then extracts the resulting cookies and HTML.

The stealth plugin is critical here. Vanilla Puppeteer is trivially detectable—Cloudflare checks for properties like navigator.webdriver, Chrome DevTools Protocol artifacts, and inconsistencies in the JavaScript environment. The puppeteer-extra-plugin-stealth package patches over 30 detection vectors, making the automated browser indistinguishable from a human-operated one (at least to Cloudflare's current detection methods).

Here's how you'd integrate CloudProxy into a scraping workflow:

import axios from 'axios';

const CLOUDPROXY_URL = 'http://localhost:8191/v1';

// First, create a session for persistent cookies
const sessionResponse = await axios.post(`${CLOUDPROXY_URL}`, {
  cmd: 'sessions.create',
  session: 'my_sneaker_session'
});

// Make a request through CloudProxy
const response = await axios.post(`${CLOUDPROXY_URL}`, {
  cmd: 'request.get',
  url: 'https://cloudflare-protected-site.com/product-page',
  session: 'my_sneaker_session',
  maxTimeout: 60000
});

// CloudProxy returns cookies, headers, and HTML
const { solution } = response.data;
console.log(solution.status); // HTTP status code
console.log(solution.cookies); // Array of cookies
console.log(solution.response); // Page HTML

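// Now reuse these cookies with a fast HTTP client.
// Cloudflare ties the clearance cookie to the browser's User-Agent (and IP),
// so send the same User-Agent the browser used. This assumes the solution
// object exposes a userAgent field alongside the cookies.
const cookies = solution.cookies
  .map((c: { name: string; value: string }) => `${c.name}=${c.value}`)
  .join('; ');

const fastRequest = await axios.get(
  'https://cloudflare-protected-site.com/api/products',
  { headers: { Cookie: cookies, 'User-Agent': solution.userAgent } }
);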

The session-based architecture is where CloudProxy shines for sequential requests. Creating a session keeps the Chrome instance alive with persistent cookies, so subsequent requests to the same domain skip the challenge for as long as the clearance cookie stays valid: you already proved you're "human" on the first request. This reduces a 5-10 second challenge-solving process to a sub-second cookie-authenticated request.
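To decide whether a cached result can still be reused, a client could check the clearance cookie's expiry before skipping the browser. This is a minimal sketch, assuming the returned cookies follow Puppeteer's shape (`expires` as a Unix timestamp in seconds, `-1` for session cookies) and that the cookie of interest is Cloudflare's `cf_clearance`. The helper names and the safety margin are hypothetical.

```typescript
// Cookie shape assumed to match what Puppeteer reports; only the fields used here.
interface ProxyCookie {
  name: string;
  value: string;
  expires: number; // Unix time in seconds; -1 means a session cookie
}

// Build a Cookie header value from the returned cookie array.
function toCookieHeader(cookies: ProxyCookie[]): string {
  return cookies.map((c) => `${c.name}=${c.value}`).join('; ');
}

// True if a cf_clearance cookie exists and will not expire within marginSec.
function hasFreshClearance(
  cookies: ProxyCookie[],
  nowMs: number,
  marginSec = 60
): boolean {
  const clearance = cookies.find((c) => c.name === 'cf_clearance');
  if (!clearance) return false;
  if (clearance.expires < 0) return true; // session cookie: no fixed expiry
  return clearance.expires - marginSec > nowMs / 1000;
}
```

A caller might run `hasFreshClearance(solution.cookies, Date.now())` before each fast request and fall back to a new `request.get` through CloudProxy when it returns false.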

Under the hood, CloudProxy maintains a session registry mapping session IDs to Puppeteer browser instances. When a session expires or you explicitly destroy it, the browser closes and releases memory. This design choice—persistent browsers rather than one-shot instances—is a double-edged sword. It dramatically improves performance for batch scraping (think monitoring 50 products on the same site) but punishes you with 300-500MB RAM per concurrent session.
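The registry idea can be sketched as a map from session ID to a browser handle. This illustrates the pattern and is not CloudProxy's actual source: `SessionRegistry`, `BrowserLike`, and the injected `launch` factory are hypothetical names, and the browser is reduced to anything with an async `close()`.

```typescript
// Anything that can be shut down; a Puppeteer Browser would satisfy this.
interface BrowserLike {
  close(): Promise<void>;
}

class SessionRegistry<B extends BrowserLike> {
  // Storing the launch promise means concurrent callers share one browser.
  private sessions = new Map<string, Promise<B>>();
  private launch: () => Promise<B>;

  // launch is injected so the registry does not depend on a browser library.
  constructor(launch: () => Promise<B>) {
    this.launch = launch;
  }

  // Reuse the browser for a known session ID, otherwise start a new one.
  getOrCreate(id: string): Promise<B> {
    let pending = this.sessions.get(id);
    if (!pending) {
      pending = this.launch();
      this.sessions.set(id, pending);
    }
    return pending;
  }

  // Close the browser and forget the session; resolves to false if the ID is unknown.
  async destroy(id: string): Promise<boolean> {
    const pending = this.sessions.get(id);
    if (!pending) return false;
    this.sessions.delete(id); // forget the entry first so it cannot be reused mid-close
    await (await pending).close();
    return true;
  }

  get size(): number {
    return this.sessions.size;
  }
}
```

Each entry in such a map holds a whole browser process, which is why the memory cost scales with the number of live sessions and not with the number of requests.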

The POST request implementation reveals another architectural decision:

const postResponse = await axios.post(`${CLOUDPROXY_URL}`, {
  cmd: 'request.post',
  url: 'https://cloudflare-protected-site.com/api/search',
  postData: 'query=limited+edition&sort=price',
  session: 'my_sneaker_session'
});

Notice CloudProxy doesn't just return the challenge-solved cookies—it executes the entire POST request within the browser context. This handles sites that validate the POST request itself with additional JavaScript checks, though it means you're still paying the full browser automation cost even when cookies might suffice.

Gotcha

CloudProxy's beta status isn't just a disclaimer—it's a warning. The API has undergone breaking changes, and session management quirks surface in production. Sessions sometimes fail to destroy cleanly, leading to zombie Chrome processes consuming memory until you manually kill them. The project's documentation acknowledges these issues but offers no timeline for stability.
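On the client side, one way to reduce leaked sessions is to always send `sessions.destroy` from a `finally` block, even when the work in between throws. The sketch below assumes a `send` function that posts a command object to CloudProxy (for example, a thin wrapper over `axios.post`). `withSession`, `Command`, and `Send` are hypothetical names. This does not fix server-side cleanup failures; it only guarantees that the client asks.

```typescript
// A command payload as posted to the /v1 endpoint, e.g. { cmd: 'request.get', ... }.
type Command = { cmd: string; session?: string; [key: string]: unknown };

// Transport is injected; in practice it could wrap axios.post(CLOUDPROXY_URL, command).
type Send = (command: Command) => Promise<unknown>;

// Create a session, run the caller's work, and always request destruction afterwards.
async function withSession<T>(
  send: Send,
  session: string,
  work: (session: string) => Promise<T>
): Promise<T> {
  await send({ cmd: 'sessions.create', session });
  try {
    return await work(session);
  } finally {
    // Runs on success and on failure, so a thrown error cannot skip cleanup.
    await send({ cmd: 'sessions.destroy', session });
  }
}
```

If the destroy call itself fails, its error replaces the original one, so production code would likely log and swallow cleanup errors separately.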

Memory consumption is the real killer for scaling. Each Chrome instance devours 300-500MB baseline, and complex JavaScript-heavy pages can push that to 1GB+. Scraping 10 different domains concurrently, with one session per domain, therefore needs roughly 3-5GB at baseline and 10GB or more on heavy pages, and that's before considering your application's own memory footprint. I've seen CloudProxy installations brought to their knees by 20 concurrent sessions on a 16GB server: the OOM killer starts terminating processes, and your scraping pipeline collapses.
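The per-session figures make it easy to run a capacity check before launching sessions. This back-of-the-envelope sketch uses the numbers quoted in this article (300-500MB baseline, about 1GB for heavy pages). The function names and the reserved headroom are hypothetical, and real usage varies by site.

```typescript
interface MemoryEstimate {
  lowMb: number;   // every session at the 300MB baseline
  highMb: number;  // every session at the 500MB baseline
  heavyMb: number; // every session on a JavaScript-heavy page (~1GB)
}

// Estimate total browser memory for a number of concurrent sessions.
function estimateSessionMemory(sessions: number): MemoryEstimate {
  return {
    lowMb: sessions * 300,
    highMb: sessions * 500,
    heavyMb: sessions * 1024,
  };
}

// How many sessions fit in totalMb after reserving memory for the OS and your app.
function maxSessions(totalMb: number, reservedMb: number, perSessionMb = 500): number {
  return Math.max(0, Math.floor((totalMb - reservedMb) / perSessionMb));
}
```

With 16GB total, 2GB reserved, and 1GB per heavy session, `maxSessions(16384, 2048, 1024)` gives 14, which is consistent with 20 heavy sessions exhausting such a server.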

Cloudflare's detection methods also evolve. What works today may fail tomorrow. CloudProxy's effectiveness depends entirely on the stealth plugin keeping pace with Cloudflare's detection updates, and there's an inherent cat-and-mouse dynamic. Sites using Cloudflare's most aggressive "I'm Under Attack" mode or human-verification CAPTCHAs will still block you—CloudProxy can't solve visual puzzles or hCaptcha challenges. You're bypassing JavaScript challenges, not all challenges.
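Because a bypass can silently stop working, it helps to check whether a response is still a challenge page instead of assuming success. The sketch below is a heuristic. The status codes and text markers are assumptions based on commonly seen Cloudflare interstitials (the "Checking your browser" message mentioned earlier is one of them) and would need updating as those pages change. `looksBlocked` is a hypothetical helper, not part of CloudProxy.

```typescript
// Text fragments assumed to appear on challenge or CAPTCHA interstitials.
const CHALLENGE_MARKERS = ['checking your browser', 'just a moment', 'hcaptcha'];

// Flag a response as blocked only when both the status and the body look like a challenge.
// Requiring both keeps ordinary 403s from the origin and normal pages from being flagged.
function looksBlocked(status: number, html: string): boolean {
  const suspiciousStatus = status === 403 || status === 503;
  const lower = html.toLowerCase();
  const hasMarker = CHALLENGE_MARKERS.some((marker) => lower.includes(marker));
  return suspiciousStatus && hasMarker;
}
```

A pipeline could call `looksBlocked(solution.status, solution.response)` after each request and raise an alert when the rate of blocked responses rises, instead of discovering the breakage from empty data.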

Verdict

Use if: You're scraping Cloudflare-protected sites at low-to-medium volume (under 10 concurrent sessions), have dedicated server resources with at least 8GB RAM, want to avoid per-request API costs, and need session persistence for sequential requests to the same domains. It's particularly compelling for hobbyist projects, sneaker bots monitoring specific drops, or data collection where you control the infrastructure and can tolerate occasional instability.

Skip if: You need production-grade stability with guaranteed uptime, plan to scrape at high concurrency (20+ simultaneous targets), operate in memory-constrained environments like serverless functions or small VPS instances, or face advanced CAPTCHA challenges that require human solving. In those cases, bite the bullet and pay for commercial services like ScraperAPI or Bright Data; they've already solved the scaling and reliability problems you'll spend months debugging with CloudProxy.