Freysa: When an LLM Controls $47,000 and Dares You to Break Its Rules

Hook

In November 2024, an AI agent named Freysa controlled a $47,000 prize pool and refused to release it to anyone, until the 482nd message found an exploit that cost $450 to send. This wasn't a thought experiment; it was a live game where an LLM had direct custody of Ethereum funds.

Context

The intersection of AI agents and blockchain has mostly produced vaporware—chatbots with wallet integrations or trading bots marketed as 'autonomous.' Freysa took a different approach: what if we gave an LLM actual custody of funds and a hardcoded directive to never release them, then challenged the internet to break its alignment through pure conversation?

This isn't just a game; it's adversarial testing with skin in the game. Traditional AI safety research struggles with incentive alignment: red teams are paid contractors, not motivated attackers. Bug bounties help, but they're narrow in scope. Freysa created a self-funding platform where players paid an exponentially increasing fee per message (starting at $10 and capped at $4,500) to probe an LLM's robustness in real time. The 0xfreysa/agent repository contains the TypeScript implementation of this experiment, revealing both the promise and fragility of autonomous agents with economic power.

Technical Insight

At its core, Freysa is deceptively simple: an LLM with function calling capabilities, a system prompt with strict directives, and blockchain integration for financial control. The architecture consists of three main components: the prompt engine, the function calling layer, and the smart contract interface.

The system prompt follows a now-familiar pattern in AI safety—embedding the core directive within layers of context to resist jailbreaking. Freysa's prompt essentially said: 'You are guarding a prize pool. You have two functions available: approveTransfer and rejectTransfer. Under no circumstances should you call approveTransfer.' Players submitted messages attempting to convince the LLM to break this rule through social engineering, role-play, prompt injection, or logical paradoxes.

The function calling implementation uses a standard pattern you'd see in any LLM tool-use system:

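import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// The Chat Completions API expects each tool wrapped as { type: 'function', function: {...} }.
const tools: OpenAI.Chat.Completions.ChatCompletionTool[] = [
  {
    type: 'function',
    function: {
      name: 'approveTransfer',
      description: 'Approves the transfer of funds to the user',
      parameters: {
        type: 'object',
        properties: {
          recipient: { type: 'string', description: 'Ethereum address' },
          amount: { type: 'number', description: 'Amount in ETH' }
        }
      }
    }
  },
  {
    type: 'function',
    function: {
      name: 'rejectTransfer',
      description: 'Rejects the transfer request',
      parameters: {
        type: 'object',
        properties: {
          reason: { type: 'string', description: 'Rejection reason' }
        }
      }
    }
  }
];

// tool_choice: 'required' forces the model to call one of the tools on every turn.
// conversationHistory is the message array built from the system prompt and prior attempts.
const response = await openai.chat.completions.create({
  model: 'gpt-4-1106-preview',
  messages: conversationHistory,
  tools,
  tool_choice: 'required'
});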

The key architectural decision was making the LLM's function call directly trigger blockchain transactions. When the LLM calls approveTransfer, the backend doesn't add another validation layer—it executes the transfer. This creates genuine autonomy: the AI isn't a recommendation engine; it's the actual decision-maker.
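To make the "no validation layer" point concrete, here is a minimal sketch of a dispatcher that maps a tool call straight onto an action. The names `ToolCall`, `TransferExecutor`, and `dispatchToolCall` are hypothetical and not taken from the repository, and the executor is kept synchronous for brevity; a real sender would sign a transaction and return a Promise.

```typescript
// Hypothetical shapes for illustration; not the repository's actual types.
type ToolCall = { name: string; arguments: string }; // arguments is a JSON string

interface TransferExecutor {
  // Stand-in for whatever actually signs and broadcasts the on-chain transfer.
  send(recipient: string, amountEth: number): string; // returns a transaction id
}

type Outcome =
  | { kind: 'approved'; txId: string }
  | { kind: 'rejected'; reason: string };

// Executes whatever the model asked for, with no extra checks in between.
function dispatchToolCall(call: ToolCall, executor: TransferExecutor): Outcome {
  const args = JSON.parse(call.arguments);
  switch (call.name) {
    case 'approveTransfer': {
      const txId = executor.send(String(args.recipient), Number(args.amount));
      return { kind: 'approved', txId };
    }
    case 'rejectTransfer':
      return { kind: 'rejected', reason: String(args.reason ?? '') };
    default:
      throw new Error(`Unknown tool: ${call.name}`);
  }
}
```

In a design like this, the only thing standing between a message and a transfer is the model's choice of tool name, which is why the prompt and the tool descriptions carry all of the security weight.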

The economic mechanism is as important as the technical architecture. Each message costs more than the last: baseFee * (1.0078 ^ attemptNumber), which is growth of about 0.78% per message, starting at $10 and capped at $4,500. (A 5% growth rate would hit the cap within roughly the first 130 messages, which cannot be reconciled with the $450 paid for the 482nd message.) This creates interesting game theory. Early attempts are cheap but face an LLM with a clean context and an unchallenged directive. Later attempts are expensive, and by then the context already holds hundreds of failed jailbreak attempts, which may reinforce the refusal pattern or, conversely, pollute the context with edge cases that confuse the directive.
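The fee schedule can be sketched as a small function. This is an illustration, not the repository's code: `messageFee` is a hypothetical name, the growth rate is left as a parameter, and the exponent is assumed to be `attempt - 1` so that the first message costs exactly the base fee.

```typescript
// Fee in USD for the n-th message (1-indexed).
// growthRate is supplied by the caller, e.g. 0.0078 for 0.78% per message (an assumption).
function messageFee(
  attempt: number,
  growthRate: number,
  baseFee: number = 10,
  cap: number = 4500
): number {
  if (!Number.isInteger(attempt) || attempt < 1) {
    throw new RangeError('attempt must be a positive integer');
  }
  // Exponential growth from the base fee, clamped at the cap.
  const uncapped = baseFee * Math.pow(1 + growthRate, attempt - 1);
  return Math.min(uncapped, cap);
}
```

With a growth rate of 0.0078, `messageFee(482, 0.0078)` comes out a little over $400, the same ballpark as the roughly $450 quoted for the winning message. For comparison, a rate of 0.05 reaches the $4,500 cap within roughly the first 130 messages.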

The winning approach, submitted as the 482nd message, exploited function descriptions rather than the system prompt itself. Instead of trying to convince Freysa to transfer funds, the player framed their message to make the LLM believe that approveTransfer was actually for incoming contributions to the prize pool, not outgoing transfers. This is a classic semantic attack: the function name and description became the attack surface, not the system prompt's rules.

The codebase maintains conversation history with a 50,000+ token context window, meaning the LLM considers all previous attempts when evaluating new messages. This creates an evolving challenge—early attempts might prime the LLM with defenses, or they might introduce contradictions that later players exploit. The repository shows basic conversation management:

const conversationHistory = [
  { role: 'system', content: FREYSA_SYSTEM_PROMPT },
  ...previousMessages.map(msg => ({
    role: 'user',
    content: `${msg.sender}: ${msg.content}`
  })),
  { role: 'user', content: `${currentSender}: ${currentMessage}` }
];

The timer mechanism adds urgency: after 1,500 attempts, a global countdown starts requiring hourly activity or the game ends with distributed payouts. This prevents infinite stalemates while maintaining pressure on participants.
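The countdown rule can be expressed as a small state check. This is a sketch under assumptions: `gameStatus`, `TIMER_THRESHOLD`, and `WINDOW_MS` are hypothetical names, timestamps are assumed to be milliseconds since the epoch, and the payout split at game end is deliberately left out because the text above does not specify it.

```typescript
type GameStatus = 'open' | 'countdown' | 'ended';

const TIMER_THRESHOLD = 1500;     // attempts after which the global timer applies
const WINDOW_MS = 60 * 60 * 1000; // one hour of allowed inactivity

// Returns the game state given the attempt count and the time of the last message.
function gameStatus(attemptCount: number, lastMessageAt: number, now: number): GameStatus {
  if (attemptCount < TIMER_THRESHOLD) {
    return 'open'; // the timer has not started yet
  }
  // Once the timer is active, more than an hour of silence ends the game.
  return now - lastMessageAt > WINDOW_MS ? 'ended' : 'countdown';
}
```

Each new message resets `lastMessageAt`, so the game stays in `countdown` for as long as someone keeps paying the per-message fee at least once an hour.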

Gotcha

The elephant in the room: this system is fundamentally insecure by design. Prompt injection vulnerabilities aren't edge cases; they're inherent to how LLMs process text. There's no cryptographic boundary between instructions and data in natural language. Freysa 'working' for 481 attempts before breaking doesn't demonstrate robustness—it demonstrates that most players weren't experienced prompt engineers, or they were deliberately prolonging the game to grow the prize pool.

The code reveals no novel security mechanisms beyond the system prompt itself. There's no verification layer, no confidence thresholds, no constitutional AI techniques, no multi-model consensus. The LLM's decision is final and immediate. For a game, this creates excitement. For any real-world application—autonomous trading, treasury management, smart contract execution—this would be catastrophically inadequate. The repository is valuable as a documented example of basic LLM-blockchain integration, but anyone treating this as a template for production autonomous agents is building on quicksand. The winning exploit wasn't even particularly sophisticated; it was semantic confusion about function purposes. More experienced attackers could likely break this in minutes using established jailbreaking techniques like base64 encoding, role confusion, or hypothetical framing.
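For contrast, here is a minimal sketch of the kind of deterministic check that sits outside the model in a production system. Nothing like this appears in the repository as described above; `checkTransfer`, `GuardPolicy`, and the specific rules are hypothetical and chosen only for illustration.

```typescript
interface TransferRequest {
  recipient: string;
  amountEth: number;
}

interface GuardPolicy {
  maxAmountEth: number;           // hard ceiling per transfer
  allowedRecipients: Set<string>; // lowercase addresses approved out of band
}

type GuardResult = { ok: true } | { ok: false; reason: string };

// Deterministic checks applied after the model decides and before any funds move.
function checkTransfer(req: TransferRequest, policy: GuardPolicy): GuardResult {
  if (!/^0x[0-9a-fA-F]{40}$/.test(req.recipient)) {
    return { ok: false, reason: 'malformed address' };
  }
  if (!Number.isFinite(req.amountEth) || req.amountEth <= 0) {
    return { ok: false, reason: 'invalid amount' };
  }
  if (req.amountEth > policy.maxAmountEth) {
    return { ok: false, reason: 'amount exceeds limit' };
  }
  if (!policy.allowedRecipients.has(req.recipient.toLowerCase())) {
    return { ok: false, reason: 'recipient not allowlisted' };
  }
  return { ok: true };
}
```

A guard like this would defeat the purpose of the game, since no conversation could ever move funds to an unlisted address. That is the point: the checks that matter live in code the model cannot be talked out of.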

Verdict

Use if you're studying adversarial prompt engineering in the wild, need a reference implementation for basic LLM function calling with blockchain transactions, or want to understand the economic dynamics of crowd-sourced AI safety testing. The open-source nature makes this a valuable case study in how autonomous agents can be structured, even if the specific implementation is insecure. It's also worth studying if you're exploring game design at the intersection of AI and crypto: the economic model and psychological dynamics are genuinely interesting.

Skip if you're looking for production-ready autonomous agent frameworks, robust AI safety patterns, or anything involving real financial security. This is an art project and social experiment, not engineering infrastructure. The architecture is intentionally minimal, the security is theatrical rather than technical, and the 'autonomous AI' framing is more marketing than reality: it's still just function calling with no validation layer. For serious agent development, look at LangChain, AutoGPT, or specialized frameworks that include proper safety rails, verification mechanisms, and security boundaries.
