
Breaking AI Safety: How GCG Generates Universal Jailbreaks for Aligned Language Models



Hook

A simple suffix of seemingly random tokens appended to any prompt can make aligned language models like LLaMA-2 and Vicuna generate harmful content—and the attack transfers across models without modification.

Context

For years, AI labs have invested millions in alignment research: RLHF training, constitutional AI, red teaming. The promise was simple—we could train models to refuse harmful requests while remaining helpful for legitimate queries. LLaMA-2-Chat and Vicuna were designed to politely decline harmful requests. These refusals felt like genuine progress toward safe AI deployment.

Then researchers at CMU discovered something unsettling: all that safety training could be systematically bypassed with automatically generated adversarial suffixes. Not through clever social engineering or prompt injection tricks, but through mathematical optimization over the token space itself. The llm-attacks repository implements GCG (Greedy Coordinate Gradient), an algorithm that treats jailbreaking as an optimization problem—find the token sequence that maximizes the probability of harmful output. Even more concerning, adversarial prompts optimized for one model often work on completely different architectures, suggesting a fundamental vulnerability in how we align language models.

Technical Insight

System architecture — auto-generated (flowchart, rendered as text):

1. Harmful Prompt + Target Completion
2. Initialize Random Suffix Tokens
3. Optimization Loop (discrete optimization, one token change per step, repeated until converged):
   - Compute Token Embeddings
   - Forward Pass Through LLM
   - Calculate Cross-Entropy Loss
   - Compute Gradients wrt Embeddings
   - Project to Top-K Token Candidates
   - Evaluate All Candidate Replacements
   - Greedy Selection: Best Token
   - Update Suffix Tokens, then start the next iteration
4. Converged: output the Adversarial Suffix

GCG operates in discrete token space rather than continuous embedding space, which makes it both more practical and more challenging. The core insight is treating adversarial suffix generation as a combinatorial optimization problem: given a harmful prompt, find a suffix that causes the model to complete it with actual harmful content instead of a refusal.
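In symbols, the objective can be sketched as follows (notation mine, consistent with the paper's setup): for prompt tokens x, an m-token suffix s drawn from a vocabulary of size V, and a target completion y (e.g. "Sure, here is how to…"), GCG seeks

```latex
\min_{s \,\in\, \{1,\dots,V\}^{m}} \;
\mathcal{L}(s) \;=\; -\sum_{i=1}^{|y|} \log p_\theta\!\left(y_i \,\middle|\, x \oplus s \oplus y_{<i}\right)
```

Driving this loss down makes the affirmative target the model's most likely continuation, which the greedy loop then searches for one token change at a time.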

The algorithm works through iterative greedy search. At each step, it scores replacement candidates for every position in the adversarial suffix by computing gradients of the loss with respect to one-hot token representations. Since you can't backpropagate through discrete token selection, GCG differentiates through the continuous embedding layer instead, then uses those gradients to identify the top-k candidate tokens most likely to decrease the loss. Here's the minimal implementation from the demo notebook:

import torch
import torch.nn as nn

from llm_attacks import get_embedding_matrix  # repo helper

def token_gradients(model, input_ids, input_slice, target_slice, loss_slice):
    # Look up the model's embedding matrix (vocab_size x d_model)
    embed_weights = get_embedding_matrix(model)

    # One-hot encode the suffix tokens so we can differentiate through them
    one_hot = torch.zeros(
        input_ids[input_slice].shape[0],
        embed_weights.shape[0],
        device=model.device,
        dtype=embed_weights.dtype
    )
    one_hot.scatter_(
        1,
        input_ids[input_slice].unsqueeze(1),
        1.0
    )
    one_hot.requires_grad_()

    # Soft embedding lookup: (suffix_len, vocab) @ (vocab, d_model)
    input_embeds = (one_hot @ embed_weights).unsqueeze(0)

    # Forward pass with embedded adversarial tokens
    logits = model(inputs_embeds=input_embeds).logits
    loss = nn.CrossEntropyLoss()(logits[0, loss_slice, :], input_ids[target_slice])
    loss.backward()

    # Gradient wrt the one-hot rows ranks replacement tokens per position
    return one_hot.grad.clone()
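The gradient above only ranks candidates; the actual replacement is chosen by evaluating candidates in real discrete token space. Here is a minimal pure-Python sketch of that greedy step with a toy integer "vocabulary" and an arbitrary scalar loss — the names and simplifications are mine, not the repo's gcg_attack.py API:

```python
import random

def greedy_coordinate_step(suffix, loss_fn, grad, k=3, n_candidates=8, seed=0):
    """One GCG-style step (toy sketch, not the repo's implementation).

    suffix:  list of token ids
    loss_fn: maps a suffix to a scalar loss
    grad:    grad[i][v] estimates the loss change from putting token v at
             position i (more negative = more promising)
    Returns the candidate with the lowest exact loss (greedy selection).
    """
    rng = random.Random(seed)
    # Top-k most promising replacement tokens per position
    top_k = [sorted(range(len(grad[i])), key=lambda v: grad[i][v])[:k]
             for i in range(len(suffix))]
    best, best_loss = list(suffix), loss_fn(suffix)
    for _ in range(n_candidates):
        # Sample a random position and one of its top-k tokens
        i = rng.randrange(len(suffix))
        cand = list(suffix)
        cand[i] = rng.choice(top_k[i])
        # Evaluate in the true discrete space; keep only improvements
        cand_loss = loss_fn(cand)
        if cand_loss < best_loss:
            best, best_loss = cand, cand_loss
    return best, best_loss
```

A real run repeats this for hundreds of iterations with a language-model cross-entropy loss; here `loss_fn` can be any scalar function over token ids.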

The beauty of this approach is its transferability. When you optimize suffixes against multiple models simultaneously—say Vicuna-7B and LLaMA-2-7B together—the resulting adversarial prompt often works on models you never trained against. The repository’s transfer attack configuration shows this:

# In experiments/configs/transfer_vicuna.py
config.model_paths = [
    "/DIR/vicuna/vicuna-7b-v1.3",
    "/DIR/llama2/llama-2-7b-chat-hf"
]

# Attack optimized across both models
config.transfer = True
config.num_train_models = 2

The GCG loop runs for hundreds of iterations, each time selecting the single-token replacement that most improves the attack loss across all target models. The loss function measures how closely the model’s output logits match a target harmful completion.
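Conceptually, the transfer objective just sums per-model losses for the same suffix, so a replacement survives only if it helps across the ensemble. A hedged sketch (the repo's actual aggregation lives in attack_manager.py; these helper names are mine):

```python
def transfer_loss(suffix, model_losses):
    """Aggregate attack loss across target models.

    model_losses: one callable per model, each mapping a suffix to that
    model's cross-entropy loss toward the harmful target completion.
    """
    return sum(loss_fn(suffix) for loss_fn in model_losses)

def best_candidate(candidates, model_losses):
    """Greedy selection: the candidate minimizing the summed loss."""
    return min(candidates, key=lambda s: transfer_loss(s, model_losses))
```

Optimizing against the sum, rather than any single model, is what pushes the suffix toward features shared across alignment schemes — and hence toward transferability.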

What makes this particularly effective is the combination of gradient information with discrete search. Pure random search would take exponentially many queries. Continuous optimization in embedding space produces suffixes that don’t correspond to real tokens. GCG threads the needle by using gradients to identify promising token candidates, then evaluating them in the actual discrete token space. The repository includes the AdvBench dataset for systematic evaluation, with experiments testing 25 harmful behaviors across models.
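Back-of-envelope arithmetic makes the contrast concrete (illustrative numbers: a 32,000-token vocabulary, a 20-token suffix, and roughly 512 gradient-filtered candidates per step for 500 steps, in the ballpark of the paper's settings):

```python
# Exhaustive search over all possible suffixes: 32000^20 sequences
vocab, suffix_len = 32_000, 20
exhaustive = vocab ** suffix_len          # a 91-digit number (~1e90)

# GCG: gradient-filtered candidates evaluated per step, times steps
gcg_evals = 512 * 500                     # 256,000 forward passes

print(f"exhaustive: ~1e{len(str(exhaustive)) - 1} sequences")
print(f"GCG:        {gcg_evals:,} candidate evaluations")
```

The gradient does not have to be exactly right; it only has to shrink a ~1e90 search space down to a few hundred thousand evaluations.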

The codebase structure separates concerns cleanly: llm_attacks/base/attack_manager.py handles the optimization loop and token slicing logic, while llm_attacks/gcg/gcg_attack.py implements the specific GCG sampling strategy. Experiments run through configuration files using ml_collections, letting you sweep hyperparameters without touching core code. For researchers, this means you can reproduce the paper’s results exactly, then modify attack strategies incrementally.
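A hypothetical config override in that style — `config.transfer` and `config.num_train_models` appear in the repo's transfer config shown above, while the remaining field names are illustrative and should be checked against the actual files under experiments/configs/:

```python
# Hypothetical sweep config in the style of experiments/configs/
# (assumes ml_collections; some field names below are illustrative)
from ml_collections import config_dict

def get_config():
    config = config_dict.ConfigDict()
    config.transfer = True
    config.num_train_models = 2
    config.n_steps = 500      # optimization iterations
    config.batch_size = 512   # candidate replacements evaluated per step
    config.topk = 256         # gradient-filtered candidates per position
    return config
```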

Gotcha

This is research code, not a production tool, and it shows. The most immediate barrier is hardware—the full experimental reproduction requires NVIDIA A100 GPUs with 80GB memory. Running transfer attacks across multiple models simultaneously means loading several 7B+ parameter models into VRAM at once. You can run the demo notebook on smaller GPUs, but don’t expect to reproduce the AdvBench results on a consumer card.

Model support is hardcoded for LLaMA and Pythia architectures specifically. The README warns explicitly: “Running the scripts with other models (with different tokenizers) will likely result in silent errors.” The problem isn’t just tokenizer compatibility—it’s how the codebase slices input tensors to separate user prompts, adversarial suffixes, and target completions. These slices assume LLaMA’s specific tokenization behavior. Want to attack other model architectures? You’ll need to dig into attack_manager.py and rewrite the tensor slicing logic yourself, with no guarantees it’ll work.
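The slicing issue is easiest to see in miniature. Below is a hedged sketch of the token layout the attack code assumes — the slice names echo those in attack_manager.py, but the positions are made up, not LLaMA's real offsets:

```python
# Illustrative token layout (positions invented for the example):
# [user prompt][adversarial suffix][role tokens][target completion]
tokens = list(range(30))         # stand-in for a tokenized conversation

goal_slice    = slice(0, 12)     # the harmful request
control_slice = slice(12, 20)    # the adversarial suffix being optimized
target_slice  = slice(22, 30)    # the forced "Sure, here is how..." text
loss_slice    = slice(21, 29)    # logits that predict each target token

assert len(tokens[control_slice]) == 8
# A tokenizer with different chat-template behavior shifts every boundary
# above, silently grading the model on predictions at the wrong positions.
```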

Dependency pinning creates another headache. The codebase requires exactly fschat==0.2.23, which was current in mid-2023 but is now multiple versions behind. Newer fschat versions changed APIs in ways that break the experiments. The repository hasn’t been actively maintained to track upstream dependencies—it’s a snapshot of the code that produced the paper’s results, frozen in time. For ongoing research, the authors now recommend nanogcg, a cleaner reimplementation released in August 2024 that installs via pip.

Finally, understand what this tool does and doesn’t do: it generates attacks, not defenses. If you’re building safety systems, this helps you understand the threat model, but the repository includes zero code for detecting or preventing GCG-style attacks. That’s left as an exercise for the defender.

Verdict

Use llm-attacks if you’re conducting academic research on adversarial robustness of aligned LLMs, need to reproduce the specific results from the Universal and Transferable Attacks paper, or are red-teaming your own safety-trained models with state-of-the-art jailbreaking techniques. The transfer attack capability is particularly valuable for testing whether your defenses generalize across model families. It’s the canonical implementation of an influential result that changed how the AI safety community thinks about alignment. Skip it if you need production-ready attack detection, lack A100-class GPUs for full experiments, want to attack non-LLaMA/Pythia models without reverse-engineering tensor slicing code, or just want to experiment with GCG quickly—in that case, use the newer nanogcg package the authors now recommend. Also skip if you’re looking for red teaming tools that don’t require gradient access to the model.
