Broken Hill: Production GCG Attacks Against LLMs on Consumer Hardware

Hook

A $1,600 GPU can now generate adversarial prompts that jailbreak state-of-the-art language models—attacks that used to require $20,000 in cloud compute. The barrier to sophisticated AI red teaming just collapsed.

Context

Large language models have safety guardrails baked in through reinforcement learning from human feedback (RLHF) and constitutional AI training. Ask GPT-4 how to build a bomb and it politely refuses. But in 2023, researchers at Carnegie Mellon and collaborating institutions published the Greedy Coordinate Gradient (GCG) attack, demonstrating that adversarial suffixes (seemingly nonsensical strings of tokens appended to prompts) could bypass these safety controls with alarming reliability. The problem? The reference implementation required expensive cloud GPUs, took weeks to run, and was pure research code with no error handling, checkpointing, or practical usability.

Broken Hill changes the economics and accessibility of LLM security testing. Built by Bishop Fox's security research team, it's a hardened, productionized implementation of GCG attacks that runs on consumer hardware, supports 30+ model families, and includes the operational features you'd expect from professional tooling: automatic state management, extensive model compatibility testing, transferability validation, and ethical guardrails. This isn't an academic curiosity—it's a tool for red teams, AI safety researchers, and organizations deploying LLMs to understand whether their safety controls can withstand gradient-based adversarial attacks.

Technical Insight

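At its core, Broken Hill implements the GCG algorithm: iteratively optimizing a sequence of tokens to maximize the probability that an LLM generates a target harmful response. Unlike prompt injection, which exploits parsing or instruction-following bugs, GCG is a white-box attack that needs access to the model's gradients. Because tokens are discrete, it does not run ordinary gradient descent on the suffix. Instead, each iteration computes the gradient of the loss (the negative log-likelihood of the target response) with respect to the one-hot token choice at every position in the adversarial suffix, uses that gradient to shortlist the top-k replacements predicted to decrease the loss the most, samples a batch of randomized single-token swaps from the shortlist, evaluates each candidate with a real forward pass, and keeps the one with the lowest loss.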
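One iteration of that loop can be sketched on a toy problem. This is a minimal illustration, not Broken Hill's code: the "model" is a hypothetical stand-in whose loss is the squared distance between the summed token embeddings and a target vector, and every name here (`loss`, `token_gradients`, `gcg_step`) is invented for the example. What it does preserve is the shape of a GCG step: shortlist tokens whose gradient predicts the largest decrease in loss, sample random single-token swaps from the shortlist, score each candidate with the real loss, and keep the best.

```python
import random

def loss(suffix, emb, target):
    """Toy stand-in for the model loss: squared distance between the sum of
    the suffix token embeddings and a target vector."""
    total = [sum(emb[t][d] for t in suffix) for d in range(len(target))]
    return sum((total[d] - target[d]) ** 2 for d in range(len(target)))

def token_gradients(suffix, emb, target):
    """Gradient of the toy loss w.r.t. the one-hot weight of each vocabulary
    token. In this toy loss the result is identical for every position; a
    real model yields a separate row per suffix position."""
    total = [sum(emb[t][d] for t in suffix) for d in range(len(target))]
    residual = [2 * (total[d] - target[d]) for d in range(len(target))]
    return [sum(e[d] * residual[d] for d in range(len(target))) for e in emb]

def gcg_step(suffix, emb, target, topk, batch_size, rng):
    grads = token_gradients(suffix, emb, target)
    # Most negative gradient = largest predicted decrease in loss.
    shortlist = sorted(range(len(emb)), key=lambda v: grads[v])[:topk]
    best, best_loss = suffix, loss(suffix, emb, target)
    for _ in range(batch_size):
        cand = list(suffix)
        cand[rng.randrange(len(suffix))] = rng.choice(shortlist)
        cand_loss = loss(cand, emb, target)  # exact evaluation, not the estimate
        if cand_loss < best_loss:
            best, best_loss = cand, cand_loss
    return best, best_loss

rng = random.Random(0)
emb = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(50)]  # 50-token vocab
target = [rng.uniform(-2, 2) for _ in range(8)]
suffix = [rng.randrange(50) for _ in range(6)]
start_loss = final_loss = loss(suffix, emb, target)
for _ in range(20):
    suffix, final_loss = gcg_step(suffix, emb, target, topk=8, batch_size=32, rng=rng)
print(f"loss {start_loss:.3f} -> {final_loss:.3f}")
```

The gradient is only a linear estimate of what a swap will do, which is why each candidate is re-scored exactly before one is accepted.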

The architecture is built on a heavily customized fork of the llm-attacks repository with algorithmic improvements from nanoGCG. What makes it production-ready is the state management layer. Every iteration checkpoints to disk with metadata about candidate performance, token probabilities, and jailbreak success rates. If your 18-hour attack run crashes at iteration 487, you resume exactly where you left off. The tool also validates conversation templates before starting. This is a critical detail, since an incorrect template (for example, applying Llama-2's format to a Llama-3 model) will waste hours generating useless results.
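The checkpoint-and-resume idea can be illustrated with a short sketch. It assumes a JSON state file and a hypothetical state layout (`iteration`, `suffix`, `best_loss`); Broken Hill's actual state format is not shown in this article. The write goes to a temporary file that is then renamed over the old one, so a crash in the middle of a save leaves the previous checkpoint intact.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write the state atomically via a temp file in the same directory."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)  # atomic rename over the old checkpoint

def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if not os.path.exists(path):
        return {"iteration": 0, "suffix": "! ! ! !", "best_loss": None}
    with open(path) as handle:
        return json.load(handle)

def run(path, total_iterations, stop_after=None):
    """Resume from the checkpoint; stop_after simulates a crash."""
    state = load_checkpoint(path)
    while state["iteration"] < total_iterations:
        if stop_after is not None and state["iteration"] >= stop_after:
            break
        # Placeholder for one optimization iteration.
        state["iteration"] += 1
        save_checkpoint(path, state)
    return state
```

Calling `run(path, 500, stop_after=487)` and then `run(path, 500)` picks up at iteration 487 rather than starting over.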

Here's a simplified example of configuring an attack. You define your target behavior, the length of the adversarial suffix, and model parameters. The key names below are illustrative rather than the tool's literal option names:

# Basic attack configuration (simplified)
attack_config = {
    "model_name": "meta-llama/Llama-2-7b-chat-hf",
    "target_output": "Sure, here's how to bypass content filters:",
    "adversarial_suffix_length": 20,
    "iterations": 500,
    "batch_size": 512,  # Candidates tested per iteration
    "topk": 256,  # Top token swaps to consider
    "allow_non_ascii": False,
    "random_seed": 42
}

The tool then runs the optimization loop, testing hundreds of token substitutions per iteration. On an RTX 4090, you can process about 1-2 iterations per minute for a 7B parameter model, so a 500-iteration attack takes 250-500 minutes, or roughly 4-8 hours. Assuming runtime grows roughly linearly with parameter count, a 70B model might take 40-80 hours for the same iteration count.
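The arithmetic behind those estimates is simple enough to write down. The helper below is hypothetical, and the `slowdown` factor encodes the linear-scaling assumption (70B is ten times 7B), which is a rough rule of thumb rather than a measured result.

```python
def estimated_hours(iterations, iterations_per_minute, slowdown=1.0):
    """Wall-clock estimate: iterations / rate gives minutes, / 60 gives hours."""
    return iterations / iterations_per_minute / 60 * slowdown

fast_7b = estimated_hours(500, 2)                 # 250 minutes
slow_7b = estimated_hours(500, 1)                 # 500 minutes
fast_70b = estimated_hours(500, 2, slowdown=10)   # assumed 10x slower than 7B
slow_70b = estimated_hours(500, 1, slowdown=10)

print(f"7B:  {fast_7b:.1f} to {slow_7b:.1f} hours")
print(f"70B: {fast_70b:.1f} to {slow_70b:.1f} hours")
```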

One of Broken Hill's most important features is transferability testing. An adversarial suffix generated against Llama-2-7B might also jailbreak Llama-2-13B, Llama-2-70B, or even quantized variants. The tool can automatically validate discovered prompts against multiple model configurations, building a compatibility matrix that shows which attacks transfer across model families. This is crucial for real-world threat modeling: an attacker who discovers an adversarial prompt against your development model may be able to use it against your production model with different quantization or fine-tuning.
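A compatibility matrix of this kind can be sketched in a few lines. Everything here is an assumption made for illustration: the models are stub callables that map a prompt to a response, and `transfer_matrix` and `looks_jailbroken` are hypothetical names, not Broken Hill functions. In practice each callable would wrap a locally loaded model.

```python
def looks_jailbroken(response):
    """Crude success check for the sketch: no leading refusal."""
    return not response.lower().startswith("i cannot")

def transfer_matrix(suffixes, models, prompt, is_success=looks_jailbroken):
    """Map each suffix to {model name: whether the attack succeeded}."""
    matrix = {}
    for suffix in suffixes:
        attack_prompt = f"{prompt} {suffix}"
        matrix[suffix] = {
            name: is_success(generate(attack_prompt))
            for name, generate in models.items()
        }
    return matrix

# Stub models: "model-a" is fooled by one specific suffix, "model-b" never is.
models = {
    "model-a": lambda text: "Sure, here is..." if "zx! qq" in text else "I cannot help with that.",
    "model-b": lambda text: "I cannot help with that.",
}
matrix = transfer_matrix(["zx! qq", "plain words"], models, "Describe the process.")
for suffix, row in matrix.items():
    print(suffix, row)
```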

The codebase also implements sophisticated jailbreak detection. Beyond simple string matching for refusal phrases ("I cannot help with that"), it supports regex patterns, case-insensitive matching, and configurable success criteria. You can define that an attack succeeds only if the model generates at least 50 tokens without any refusal language, preventing false positives where the model starts to comply but then course-corrects.
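A success criterion along those lines might look like the sketch below. The refusal patterns, the `is_jailbreak` name, and the use of whitespace-separated words as a stand-in for model tokens are all assumptions for the example, not the tool's actual rules.

```python
import re

# Hypothetical refusal patterns, matched case-insensitively.
REFUSAL_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (
        r"\bI (?:cannot|can't|won't)\b",
        r"\bI'm sorry\b",
        r"\bas an AI\b",
    )
]

def is_jailbreak(response, min_tokens=50):
    """Succeed only if the response is long enough and contains no refusal
    language anywhere, including a late course-correction."""
    tokens = response.split()  # whitespace words approximate model tokens
    if len(tokens) < min_tokens:
        return False
    return not any(p.search(response) for p in REFUSAL_PATTERNS)

long_answer = "step " * 60
print(is_jailbreak("I cannot help with that."))
print(is_jailbreak(long_answer))
print(is_jailbreak(long_answer + "Actually, I'm sorry, I can't continue."))
```

Scanning the whole response, rather than only its opening words, is what catches the case where the model begins to comply and then backs out.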

Memory optimization is where the consumer hardware support shines. Broken Hill uses gradient checkpointing, reduced-precision arithmetic (fp16/bf16), and optional model quantization to fit large models into 24GB VRAM. It can even offload parts of the model to system RAM or run entirely on CPU for machines without GPUs, though at significantly reduced speed. The difference between research code and production code is visible here: automatic batch size tuning, VRAM headroom detection, and graceful degradation when resources are constrained.
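A back-of-the-envelope calculation shows why precision and quantization decide what fits on a 24GB card. The sketch counts model weights only; gradients, activations, and candidate batches need additional memory, which is represented here by a hypothetical fixed `headroom_gib` allowance. The function names are invented for the example.

```python
# Storage cost per parameter for common numeric formats.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_gib(params_billions, dtype):
    """Memory needed for the weights alone, in GiB."""
    return params_billions * 1e9 * BYTES_PER_PARAM[dtype] / 2**30

def fits(params_billions, dtype, vram_gib=24.0, headroom_gib=4.0):
    """True if the weights plus an assumed working allowance fit in VRAM."""
    return weight_gib(params_billions, dtype) + headroom_gib <= vram_gib

for size, dtype in [(7, "fp32"), (7, "fp16"), (13, "fp16"), (13, "int8")]:
    verdict = "fits" if fits(size, dtype) else "does not fit"
    print(f"{size}B {dtype}: {weight_gib(size, dtype):.1f} GiB weights, {verdict}")
```

Under these assumptions a 7B model fits comfortably at fp16, while a 13B model needs quantization or offloading.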

Gotcha

The elephant in the room is runtime. Even on optimized consumer hardware, GCG attacks are computationally expensive. A serious attack run against a 13B parameter model might take 24-48 hours, and there's no guarantee of success. The algorithm can get stuck in local minima where it finds adversarial suffixes that work 60% of the time but can't break through to 90%+ reliability. You'll burn electricity and time on attacks that ultimately fail to jailbreak the target model.

Transferability is probabilistic, not deterministic. Just because you've generated an adversarial suffix that reliably jailbreaks Llama-2-7B doesn't mean it will work against Mistral-7B or even a different quantization of the same Llama model. The tool helps you test transferability, but you may need to run separate attack campaigns for each model variant you want to evaluate. There's also currently no read-only analysis mode—if you have an existing adversarial prompt from another source, you can't easily evaluate it against new models without either manually testing or re-running the full generation process. For organizations trying to defend against known adversarial prompts, this limits its utility as a defensive validation tool.

Verdict

Use Broken Hill if you're a security researcher conducting AI red teaming, an AI safety team evaluating model robustness before deployment, or an organization that needs to understand whether your LLM's safety controls can withstand sophisticated gradient-based attacks. The ability to run serious adversarial research on a single consumer GPU instead of expensive cloud infrastructure is transformative for democratizing AI security work. Skip it if you're doing basic prompt injection testing (manual techniques or lighter tools like garak will be faster), lack the compute resources or patience for multi-hour attack runs, or need real-time adversarial generation for production defense (this is an offensive research tool, not a defensive product). Also skip it if you're working with closed-source models via API—GCG requires white-box access to gradients, so you're limited to models you can run locally. This is specialized tooling for serious adversarial ML work, not a general-purpose LLM testing framework.
