Building CAPTCHA-Solving Agents with Multi-Modal LLMs: A Pattern Study

Hook

CAPTCHAs were designed to prove you're human by presenting challenges computers can't solve. Now, a 200-line Python project demonstrates how vision-language models can crack them using nothing but carefully crafted prompts.

Context

For two decades, CAPTCHAs have served as the internet's gatekeepers, distinguishing humans from bots through visual puzzles that automated systems supposedly couldn't solve. Text distortion, mathematical equations, and image recognition tasks became ubiquitous friction points across the web. But the emergence of multi-modal large language models—systems that can process both images and text—has fundamentally shifted this landscape.

The i-am-a-bot repository by Aashiq Ramachandran represents an interesting inflection point in this evolution. Rather than building complex computer vision pipelines with specialized OCR models, segmentation algorithms, and machine learning classifiers, it demonstrates how modern vision LLMs can solve certain CAPTCHAs through pure prompt engineering. More importantly for developers, it showcases a clean pattern for orchestrating multiple specialized agents from a single foundation model—a technique increasingly relevant as multi-modal AI becomes commoditized through APIs like Google's Vertex AI, OpenAI's GPT-4 Vision, and Anthropic's Claude.

Technical Insight

The architecture follows a defensive pipeline pattern where each stage validates and narrows the problem space before passing work downstream. Rather than immediately attempting to solve every image, the system employs three distinct agent types, each implemented as a prompt template paired with the Gemini Pro Vision model.

The entry point is a validator agent that determines whether an input image is actually a CAPTCHA. This seems redundant until you consider production scenarios where users might submit arbitrary images, or where you're scraping pages and need to detect CAPTCHA challenges dynamically. The validator returns a boolean decision based on visual analysis.
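As a rough illustration of that boolean decision, here is a minimal sketch of how a validator might turn a free-text model reply into True or False. The `ValidatorAgent` name, the yes/no prompt wording, and the `FakeModel` stub are assumptions made for this example and are not taken from the repository; only the `generate_content([prompt, image])` call shape mirrors the snippets shown in this article.

```python
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Stand-in for a model response; only the .text attribute is used."""
    text: str


class FakeModel:
    """Hypothetical offline stub so the sketch runs without API access."""

    def __init__(self, reply: str):
        self.reply = reply

    def generate_content(self, parts):
        # Ignores its input and returns the canned reply.
        return FakeResponse(self.reply)


class ValidatorAgent:
    """Hypothetical validator: asks a yes/no question and parses the reply."""

    PROMPT = (
        "Is this image a CAPTCHA challenge? "
        "Answer with exactly one word: yes or no."
    )

    def __init__(self, model):
        self.model = model

    def is_captcha(self, image_bytes: bytes) -> bool:
        response = self.model.generate_content([self.PROMPT, image_bytes])
        # Tolerate case, surrounding whitespace, and trailing punctuation.
        answer = response.text.strip().lower().rstrip(".!")
        if answer in ("yes", "no"):
            return answer == "yes"
        # Anything else is treated as a failure rather than guessed at.
        raise ValueError(f"Unexpected validator reply: {response.text!r}")
```

Raising on an ambiguous reply, instead of defaulting to False, keeps a confused model from silently skipping real CAPTCHAs; whether that is the right default depends on the caller.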

Next, a classifier agent categorizes the CAPTCHA type. Here's where the prompt engineering becomes interesting. The system defines an enum of CAPTCHA types and asks the model to classify:

from enum import Enum

class CaptchaType(Enum):
    TEXT = "text"
    MATHEMATICAL = "mathematical"
    IMAGE_ROTATION = "image_rotation"
    IMAGE_PUZZLE = "image_puzzle"
    IMAGE_SELECTION = "image_selection"

class ClassifyAgent:
    # Assumes self.model is a vision model client configured elsewhere (e.g. in __init__).
    def classify(self, image_bytes):
        prompt = f"""
        Analyze this image and classify the type of CAPTCHA.
        Return ONLY one of these values: {', '.join(t.value for t in CaptchaType)}

        - TEXT: distorted letters/numbers to read
        - MATHEMATICAL: equations to solve
        - IMAGE_ROTATION: images to rotate correctly
        - IMAGE_PUZZLE: puzzle pieces to arrange
        - IMAGE_SELECTION: select images matching criteria
        """
        # Send the prompt and the image to the vision model in one request
        response = self.model.generate_content([prompt, image_bytes])
        # Enum values are lowercase but the descriptions above use uppercase names,
        # so normalize case before the lookup; an unrecognized reply raises ValueError.
        return CaptchaType(response.text.strip().lower())

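This structured-output approach, which constrains the LLM to specific enum values rather than freeform text, is crucial for reliable agent orchestration. The prompt explicitly lists the valid outputs and describes each type, relying on the model's visual understanding to categorize without traditional feature engineering. Note that the constraint lives only in the prompt: nothing forces the model to comply, and it is the CaptchaType(...) constructor that actually rejects an out-of-vocabulary reply by raising ValueError.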
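Because the model's reply is still free text, a tolerant parsing step can make the enum lookup less brittle. The following is a minimal sketch under that assumption; `parse_captcha_type` is a hypothetical helper, not a function from the repository, and the enum is repeated here only so the example runs on its own.

```python
from enum import Enum
from typing import Optional


class CaptchaType(Enum):
    TEXT = "text"
    MATHEMATICAL = "mathematical"
    IMAGE_ROTATION = "image_rotation"
    IMAGE_PUZZLE = "image_puzzle"
    IMAGE_SELECTION = "image_selection"


def parse_captcha_type(raw: str) -> Optional[CaptchaType]:
    """Map a raw model reply to a CaptchaType, or None if it does not match."""
    # Trim whitespace, then stray quotes, backticks, and periods at either end.
    cleaned = raw.strip().strip("\"'`.").lower()
    # Treat spaces and hyphens as underscores ("Image Rotation" -> "image_rotation").
    cleaned = cleaned.replace(" ", "_").replace("-", "_")
    try:
        return CaptchaType(cleaned)
    except ValueError:
        # The reply was a sentence or an unknown label; let the caller decide.
        return None
```

Returning None instead of raising lets the orchestrator choose between retrying the classification and giving up, which is a design choice rather than a requirement.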

Finally, specialized solver agents handle the actual CAPTCHA breaking. The text solver prompts the model to extract distorted characters, while the math solver requests equation solutions. Here's the text solver implementation pattern:

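class TextSolverAgent:
    # Assumes self.model is a vision model client configured elsewhere (e.g. in __init__).
    def solve(self, image_bytes):
        """Return the characters the model reads from a text CAPTCHA image."""
        prompt = """
        This image contains a text-based CAPTCHA with distorted characters.
        Carefully analyze the image and extract the exact text shown.
        Return ONLY the text characters with no explanation or additional words.

        Important:
        - Pay attention to similar-looking characters (0 vs O, 1 vs l)
        - Ignore background noise and focus on the main text
        - Preserve capitalization exactly as shown
        """
        # Send the prompt and the image to the vision model in one request
        response = self.model.generate_content([prompt, image_bytes])
        # Strip surrounding whitespace only; case is preserved because it matters here
        return response.text.strip()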

The prompts include specific instructions about edge cases (character confusion, noise handling) that emerged from testing. This iterative prompt refinement—starting simple and adding constraints based on failure modes—mirrors how you'd refine any algorithm.

The central orchestration happens in a Solve class that manages Google Cloud authentication and chains these agents together. It's essentially a state machine: validate → classify → route to appropriate solver. The pattern is simple but extensible—adding a new CAPTCHA type requires implementing another solver agent and updating the classifier's enum.
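The validate, classify, route flow described above can be sketched as a small dispatcher. This is an illustrative reconstruction, not the repository's Solve class: the `Pipeline` and `UnsupportedCaptcha` names are hypothetical, authentication is omitted, and the three stages are plain callables so the example runs without any API access.

```python
from enum import Enum
from typing import Callable, Dict


class CaptchaType(Enum):
    TEXT = "text"
    MATHEMATICAL = "mathematical"
    IMAGE_ROTATION = "image_rotation"
    IMAGE_PUZZLE = "image_puzzle"
    IMAGE_SELECTION = "image_selection"


class UnsupportedCaptcha(Exception):
    """Raised when the input is not a CAPTCHA or no solver is registered."""


class Pipeline:
    """Hypothetical orchestrator: validate -> classify -> route to a solver."""

    def __init__(
        self,
        validate: Callable[[bytes], bool],
        classify: Callable[[bytes], CaptchaType],
        solvers: Dict[CaptchaType, Callable[[bytes], str]],
    ):
        self.validate = validate
        self.classify = classify
        self.solvers = solvers

    def solve(self, image_bytes: bytes) -> str:
        # Stage 1: reject inputs that are not CAPTCHAs at all.
        if not self.validate(image_bytes):
            raise UnsupportedCaptcha("input is not a CAPTCHA")
        # Stage 2: decide which kind of challenge this is.
        kind = self.classify(image_bytes)
        # Stage 3: dispatch to the solver registered for that kind, if any.
        solver = self.solvers.get(kind)
        if solver is None:
            raise UnsupportedCaptcha(f"no solver for {kind.value}")
        return solver(image_bytes)


# Stub stages standing in for the model-backed agents.
pipeline = Pipeline(
    validate=lambda img: len(img) > 0,
    classify=lambda img: CaptchaType.TEXT,
    solvers={CaptchaType.TEXT: lambda img: "x7Kp2"},
)
```

With this shape, supporting a new CAPTCHA type means adding one enum member and one entry in the `solvers` dictionary, which matches the extensibility claim above.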

What's particularly educational here is how the same foundation model (Gemini Pro Vision) serves all roles through different prompts. Traditional approaches would require separate models: one for object detection, another for OCR, perhaps a third for mathematical expression recognition. Multi-modal LLMs collapse this complexity, trading model specialization for prompt specialization. The cost is API calls and latency; the benefit is dramatically simpler code and faster iteration on new CAPTCHA types.

Gotcha

The repository's README honestly acknowledges that rotation, puzzle, and image-selection CAPTCHAs—despite being detectable—cannot be solved. This isn't just a missing feature; it reveals a fundamental limitation of using conversational AI APIs for spatial reasoning tasks. Vision LLMs excel at describing images and extracting text but struggle with geometric transformations and precise coordinate-based manipulations. You can't reliably ask Gemini to "rotate this image 45 degrees clockwise" or "identify the exact angle needed" because these models don't have deterministic spatial reasoning—they're statistical pattern matchers optimized for language, not geometry.

The production readiness is questionable. There's no error handling for API failures, no retry logic, and no rate limiting to prevent bill shock from thousands of concurrent requests. The code assumes happy paths: that Vertex AI responds quickly, that images are well-formed, and that the model always returns parseable output. In reality, LLM APIs can time out, return malformed output, or hallucinate unexpected responses. For any real deployment, you'd need to wrap each agent call in retry logic with exponential backoff, validate response formats against a schema, and implement circuit breakers.
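To make the retry suggestion concrete, here is a minimal sketch of exponential backoff with jitter around an arbitrary call. The `call_with_retry` helper and its parameter defaults are assumptions for illustration and are not part of the repository; it retries on any exception for brevity, whereas a real deployment would likely retry only on transient errors such as timeouts. Schema validation and circuit breakers are not shown.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the caller.
            # The delay doubles on each attempt (0.5s, 1s, 2s, ...) and gets up
            # to 10% random jitter so parallel workers do not retry in lockstep.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay * 0.1))
```

An agent call would then be wrapped as `call_with_retry(lambda: agent.solve(image_bytes))`, where `agent` is whichever solver the pipeline selected. The injectable `sleep` parameter exists so the backoff schedule can be tested without actually waiting.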

Cost is another hidden gotcha. Gemini Pro Vision is billed per request, and this pipeline makes three API calls per CAPTCHA (validate, classify, solve). At scale, those costs compound quickly. A bot solving hundreds of CAPTCHAs hourly could rack up significant bills compared to specialized OCR libraries like Tesseract or EasyOCR, which run locally for free. The trade-off between development speed and operational cost isn't discussed.
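A back-of-the-envelope estimate shows how the three-calls-per-CAPTCHA multiplier feeds into spend. The function below is a hypothetical sketch, and the price constant is a placeholder chosen for illustration only, not a real Gemini or Vertex AI rate; substitute the current published pricing before drawing conclusions.

```python
def hourly_api_cost(
    captchas_per_hour: int,
    price_per_call: float,
    calls_per_captcha: int = 3,
) -> float:
    """Estimated hourly spend when every CAPTCHA triggers one call per stage."""
    return captchas_per_hour * calls_per_captcha * price_per_call


# PLACEHOLDER value for illustration; not an actual published price.
HYPOTHETICAL_PRICE_PER_CALL = 0.0025

estimate = hourly_api_cost(500, HYPOTHETICAL_PRICE_PER_CALL)
print(f"500 CAPTCHAs/hour -> 1500 calls/hour -> ${estimate:.2f}/hour (placeholder price)")
```

The estimate is an upper bound for this pipeline shape: an image rejected by the validator costs one call, and an unsupported type costs two.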

Verdict

Use if: You're learning how to architect multi-agent systems with vision LLMs and want a clear, minimal example without framework bloat. The codebase is excellent for understanding prompt engineering patterns, agent orchestration, and how to structure defensive pipelines. Also consider it for one-off automation tasks where you need to bypass simple text or math CAPTCHAs in controlled environments: research projects, personal scripts, or testing your own CAPTCHA implementations.

Skip if: You need production-grade CAPTCHA solving with reliability guarantees, cost controls, and support for complex visual puzzles. The lack of error handling, limited CAPTCHA type coverage, and per-call API costs make it unsuitable for anything beyond experimentation. Also skip if you're considering this for unethical purposes: bypassing CAPTCHAs at scale to abuse services undermines internet security and violates most terms of service. For legitimate accessibility needs, use audio CAPTCHA alternatives or commercial services with proper legal frameworks.
