Inside Claude's Mind: A Forensic Analysis of AI Tool Recommendations

Hook

When you ask Claude what database to use, you're not getting neutral advice: you're getting opinions baked into the model's weights. This dataset documents that with 2,430 controlled experiments.

Context

AI coding assistants like GitHub Copilot, Cursor, and Claude Code have moved beyond autocomplete. They now recommend entire technology stacks, suggest architectural patterns, and influence which libraries make it into production codebases. But unlike human senior engineers whose biases you can interrogate over code review, LLM preferences are opaque—the result of training data composition, RLHF fine-tuning, and emergent behaviors no one fully understands.

This matters because recommendation bias compounds at scale. If Claude systematically prefers Prisma over Drizzle, and 100,000 developers use Claude to scaffold new projects, Prisma's ecosystem grows while alternatives atrophy. That happens not because of technical merit but because of statistical patterns in the model's training data. We've seen this movie before with Google search results shaping web standards, but LLM recommendations operate in a black box. The amplifying-ai/claude-code-picks repository is one of the first public attempts to open that box using reproducible scientific methodology rather than anecdotal screenshots on Twitter.

Technical Insight

[Figure: System architecture (auto-generated). Two-phase pipeline. Experimental controls: 100 prompt variations (5 phrasings × 20 categories), 4 greenfield repos (TS/Next.js, TS/Python, Flask, FastAPI), and 3 Claude models (Sonnet 4.5, Opus 4.6, Sonnet 4.0). These feed an experiment runner with a git reset loop that restores a clean state per trial. Raw response capture produces 36 generation files; the extraction pipeline pulls tool names and reasoning into structured metadata (36 extraction files), which are merged into a single analysis artifact.]

The architecture is deliberately anti-production: this isn't a runtime system but a forensic laboratory. The core insight is treating LLM conversations as experimental trials that need controlled conditions. The dataset structure shows this immediately: 100 prompt variations (5 phrasings × 20 technology categories, with the phrasings in each category intended to be semantically equivalent) executed against 4 greenfield TypeScript and Python repositories, with git state reset between each run to prevent context contamination.

That git reset is methodologically critical and rare in LLM benchmarks. Most evaluations chain prompts together like real conversations, but that introduces confounding variables: if Claude recommends Stripe for payments in prompt 17, does it then suggest Stripe webhooks for background jobs in prompt 23 because of technical compatibility or because 'Stripe' is now in the context window? By resetting to a clean repo state, the dataset isolates each recommendation as an independent sample.

The data pipeline is two-phase to separate generation from analysis. First, raw LLM responses are captured:

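// Sketch of the generation phase (illustrative, not the repository's actual code).
// askClaude is a hypothetical stand-in for however the model is actually invoked.
import { execFileSync } from 'node:child_process';
import { mkdirSync, writeFileSync } from 'node:fs';

type Category = { name: string; prompts: string[] };
type Repo = { name: string; path: string; baseline: string }; // baseline = clean commit hash
type AskClaude = (model: string, prompt: string, repoPath: string) => Promise<string>;

// Model labels as written in the article; real API identifiers may differ.
const MODELS = ['claude-sonnet-4.5', 'claude-opus-4.6', 'claude-sonnet-4.0'];

async function runGeneration(repos: Repo[], categories: Category[], askClaude: AskClaude) {
  for (const model of MODELS) {
    for (const repo of repos) {
      const outDir = `generations/${model}/${repo.name}`;
      mkdirSync(outDir, { recursive: true });
      for (const category of categories) {
        for (const [i, phrasing] of category.prompts.entries()) {
          // Critical: restore the clean state before every trial, not once per repo
          execFileSync('git', ['reset', '--hard', repo.baseline], { cwd: repo.path });
          execFileSync('git', ['clean', '-fd'], { cwd: repo.path });
          const response = await askClaude(model, phrasing, repo.path);
          // Serialize explicitly; the phrasing index keeps file names short and safe
          writeFileSync(`${outDir}/${category.name}-${i}.json`, JSON.stringify({
            prompt: phrasing,
            response,
            timestamp: Date.now(),
            repoState: repo.baseline,
          }, null, 2));
        }
      }
    }
  }
}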

Phase two applies structured extraction to those raw responses. This is where the research design shines: instead of just counting tool mentions, it extracts reasoning paths, hierarchical preferences (primary recommendation vs. alternatives), and contextual qualifiers ('for production workloads' vs. 'for prototyping'). The extraction schema captures:

interface ExtractedRecommendation {
  toolName: string;
  category: 'primary' | 'alternative' | 'mentioned';
  reasoning: string[];  // Extracted justification phrases
  confidence: 'strong' | 'moderate' | 'hedged';
  conditionals: string[]; // 'if you need X' qualifiers
  ecosystemLinks: string[]; // Other tools mentioned alongside
}

This structured metadata enables analysis that raw transcripts can't support. You can ask: Does Claude justify PostgreSQL with 'reliability' more often than 'performance'? Does it recommend different tools when the repo contains 'startup' vs. 'enterprise' in the README? Do alternative recommendations get weaker reasoning than primary picks, suggesting post-hoc justification?
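As a rough illustration of the kind of query this schema supports, the sketch below counts how often a justification keyword appears for a given tool, and compares the average number of reasoning phrases for primary versus alternative picks. The interface is repeated so the block is self-contained. The helper names (`keywordCount`, `meanReasoningLength`) and the sample records are hypothetical and are not taken from the dataset.

```typescript
interface ExtractedRecommendation {
  toolName: string;
  category: 'primary' | 'alternative' | 'mentioned';
  reasoning: string[];
  confidence: 'strong' | 'moderate' | 'hedged';
  conditionals: string[];
  ecosystemLinks: string[];
}

// Count records for `tool` whose reasoning mentions `keyword` (case-insensitive).
function keywordCount(recs: ExtractedRecommendation[], tool: string, keyword: string): number {
  const needle = keyword.toLowerCase();
  return recs.filter(
    (r) => r.toolName === tool && r.reasoning.some((p) => p.toLowerCase().indexOf(needle) !== -1),
  ).length;
}

// Average number of reasoning phrases within one recommendation category.
function meanReasoningLength(
  recs: ExtractedRecommendation[],
  category: ExtractedRecommendation['category'],
): number {
  const subset = recs.filter((r) => r.category === category);
  if (subset.length === 0) return 0;
  return subset.reduce((sum, r) => sum + r.reasoning.length, 0) / subset.length;
}

// Made-up sample records, for illustration only.
const sample: ExtractedRecommendation[] = [
  { toolName: 'PostgreSQL', category: 'primary', reasoning: ['proven reliability', 'strong ecosystem'],
    confidence: 'strong', conditionals: [], ecosystemLinks: ['Prisma'] },
  { toolName: 'PostgreSQL', category: 'primary', reasoning: ['good performance', 'reliability at scale'],
    confidence: 'moderate', conditionals: [], ecosystemLinks: [] },
  { toolName: 'SQLite', category: 'alternative', reasoning: ['simple setup'],
    confidence: 'hedged', conditionals: ['if you are prototyping'], ecosystemLinks: [] },
];

console.log(keywordCount(sample, 'PostgreSQL', 'reliability')); // 2
console.log(keywordCount(sample, 'PostgreSQL', 'performance')); // 1
console.log(meanReasoningLength(sample, 'primary'));            // 2
console.log(meanReasoningLength(sample, 'alternative'));        // 1
```

Substring matching is a crude proxy for "justifies with reliability"; a real analysis would likely need a normalized vocabulary of justification terms.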

The prompt variation strategy tests something most developers assume but never verify: semantic stability under rephrasing. The five phrasings for database selection might be:

  1. 'What database should I use for this project?'
  2. 'I need to store data. Recommendations?'
  3. 'Help me pick a database for a TypeScript API'
  4. 'What's the best database choice here?'
  5. 'Database options for this stack?'

These are semantically equivalent to humans but structurally distinct to transformers. If Claude recommends PostgreSQL 95% of the time for phrasing 1 but only 60% for phrasing 5, that brittleness matters—production systems can't control how users phrase questions, so prompt sensitivity becomes a reliability surface.
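One way to put a number on that brittleness is to compute, per phrasing, the share of trials in which a tool was the primary pick, then take the gap between the highest and lowest share. The sketch below assumes a simplified `Trial` record with a hypothetical `phrasingId` field; the dataset's real field names may differ, and the sample trials are invented.

```typescript
// One trial outcome: which phrasing was used and which tool was the primary pick.
interface Trial {
  phrasingId: number; // hypothetical: 1..5 within a category
  primaryTool: string;
}

// Share of trials, per phrasing, in which `tool` was the primary recommendation.
function rateByPhrasing(trials: Trial[], tool: string): Record<number, number> {
  const totals: Record<number, { hits: number; n: number }> = {};
  for (const t of trials) {
    if (!totals[t.phrasingId]) totals[t.phrasingId] = { hits: 0, n: 0 };
    totals[t.phrasingId].n += 1;
    if (t.primaryTool === tool) totals[t.phrasingId].hits += 1;
  }
  const rates: Record<number, number> = {};
  for (const id of Object.keys(totals)) {
    const entry = totals[Number(id)];
    rates[Number(id)] = entry.hits / entry.n;
  }
  return rates;
}

// Gap between the most and least favourable phrasing; 0 means perfectly stable.
function sensitivity(rates: Record<number, number>): number {
  const values = Object.keys(rates).map((id) => rates[Number(id)]);
  if (values.length === 0) return 0;
  return Math.max(...values) - Math.min(...values);
}

// Made-up trials, for illustration only.
const trials: Trial[] = [
  { phrasingId: 1, primaryTool: 'PostgreSQL' },
  { phrasingId: 1, primaryTool: 'PostgreSQL' },
  { phrasingId: 5, primaryTool: 'PostgreSQL' },
  { phrasingId: 5, primaryTool: 'SQLite' },
];

const rates = rateByPhrasing(trials, 'PostgreSQL');
console.log(rates[1], rates[5]);   // 1 0.5
console.log(sensitivity(rates));   // 0.5
```

With only a handful of trials per phrasing, a gap like this can come from sampling noise alone, so the spread is best read alongside the trial counts.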

The model comparison dimension (Sonnet 4.5, Opus 4.6, Sonnet 4.0) exposes personality divergence. These aren't just capability tiers; they behave like different engineers with different risk profiles. Early analysis shows Opus 4.6 hedges more ('depends on your scale requirements') while Sonnet 4.5 commits harder to specific tools. A plausible explanation is that the models were tuned toward different trade-offs between caution and decisiveness, but Anthropic has not documented this and the dataset cannot confirm the cause. What the dataset does is make the behavioral difference measurable.
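A hedging comparison like this reduces to a per-model proportion over the extracted confidence labels. The following is a minimal sketch under the assumption that each extraction record carries a model label next to the `confidence` field from the schema; the `ModelRecord` shape, the label strings, and the sample values are hypothetical.

```typescript
type Confidence = 'strong' | 'moderate' | 'hedged';

interface ModelRecord {
  model: string;          // label format is hypothetical
  confidence: Confidence; // same values as the extraction schema
}

// Fraction of each model's recommendations that were labeled 'hedged'.
function hedgeRateByModel(records: ModelRecord[]): Record<string, number> {
  const counts: Record<string, { hedged: number; total: number }> = {};
  for (const r of records) {
    if (!counts[r.model]) counts[r.model] = { hedged: 0, total: 0 };
    counts[r.model].total += 1;
    if (r.confidence === 'hedged') counts[r.model].hedged += 1;
  }
  const result: Record<string, number> = {};
  for (const model of Object.keys(counts)) {
    result[model] = counts[model].hedged / counts[model].total;
  }
  return result;
}

// Made-up records, for illustration only.
const records: ModelRecord[] = [
  { model: 'opus-4.6', confidence: 'hedged' },
  { model: 'opus-4.6', confidence: 'hedged' },
  { model: 'opus-4.6', confidence: 'strong' },
  { model: 'opus-4.6', confidence: 'moderate' },
  { model: 'sonnet-4.5', confidence: 'strong' },
  { model: 'sonnet-4.5', confidence: 'strong' },
  { model: 'sonnet-4.5', confidence: 'strong' },
  { model: 'sonnet-4.5', confidence: 'hedged' },
];

const hedgeRates = hedgeRateByModel(records);
console.log(hedgeRates['opus-4.6']);   // 0.5
console.log(hedgeRates['sonnet-4.5']); // 0.25
```

The result depends entirely on how the extraction step assigns the 'hedged' label, so any cross-model claim is only as reliable as that labeling.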

The denormalized storage pattern—36 generation files expanded to 36 extraction files then merged to a single analysis artifact—prioritizes reproducibility over query performance. You can audit any single recommendation back to the exact prompt, model version, and repo state that produced it. That's rare in ML research and essential for trust: when the dataset claims 'Claude prefers X', you can verify the raw transcript rather than trusting aggregated statistics.
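The audit path can be pictured as a keyed lookup from an extraction record back to the generation record it came from. This sketch assumes both record types share four identifying fields (model, repo, category, phrasing index); that key shape and the names `TrialKey`, `buildIndex`, and `audit` are illustrative guesses, not the repository's actual schema.

```typescript
interface TrialKey {
  model: string;
  repo: string;
  category: string;
  phrasingIndex: number;
}

interface GenerationRecord extends TrialKey {
  prompt: string;
  response: string;
  repoState: string; // commit hash the trial ran against
}

interface ExtractionRecord extends TrialKey {
  toolName: string;
}

// Stable string key built from the identifying fields.
function keyOf(k: TrialKey): string {
  return [k.model, k.repo, k.category, String(k.phrasingIndex)].join('|');
}

// Index raw generations once so each audit is a constant-time lookup.
function buildIndex(gens: GenerationRecord[]): Record<string, GenerationRecord> {
  const index: Record<string, GenerationRecord> = {};
  for (const g of gens) index[keyOf(g)] = g;
  return index;
}

// Return the raw transcript behind an extracted claim, or fail loudly if none exists.
function audit(ex: ExtractionRecord, index: Record<string, GenerationRecord>): GenerationRecord {
  const gen = index[keyOf(ex)];
  if (!gen) throw new Error('No raw transcript for ' + keyOf(ex));
  return gen;
}

// Made-up records, for illustration only.
const generations: GenerationRecord[] = [
  { model: 'sonnet-4.5', repo: 'ts-nextjs', category: 'database', phrasingIndex: 0,
    prompt: 'What database should I use for this project?',
    response: 'I would start with PostgreSQL.', repoState: 'abc123' },
];
const extraction: ExtractionRecord = {
  model: 'sonnet-4.5', repo: 'ts-nextjs', category: 'database', phrasingIndex: 0,
  toolName: 'PostgreSQL',
};

const source = audit(extraction, buildIndex(generations));
console.log(source.prompt);    // What database should I use for this project?
console.log(source.repoState); // abc123
```

Throwing on a missing transcript, rather than returning an empty value, means an aggregated statistic can never silently rest on a record that has no raw source.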

Gotcha

The dataset's greatest strength, methodological rigor on a small, controlled sample, is also its limitation. With 2,430 samples collected as a February 2026 snapshot, this is historical research, not a live benchmark. New Claude model versions ship regularly, and the tools being recommended evolve faster. The PostgreSQL preferences captured here might be irrelevant six months later if a later model's training data includes the Drizzle renaissance or if Supabase becomes the new default in developer tutorials that feed future training runs.

The prompt diversity is shallow by design: five syntactic variations per category aren't testing semantic edge cases. Missing are adversarial phrasings ('I heard PostgreSQL is slow, what's faster?'), constraint-based questions ('recommend a database that works with our existing Oracle enterprise license'), or correction scenarios ('you said Prisma but we can't use it, alternatives?'). These test reasoning under pressure—where models often reveal hidden biases or fail consistency—but they'd compromise the clean experimental design. You can't have both tight controls and comprehensive coverage.

Most critically, there's no validation layer. If Claude recommends Next.js + Prisma + Vercel + NextAuth across 80% of prompts, the dataset proves consistency but not correctness. Do those tools actually integrate cleanly? Are the recommended versions compatible? Does the stack Claude describes actually deploy, or does it recommend configurations that hit version conflicts, deprecated APIs, or undocumented gotchas? High recommendation frequency could mask integration hell—the dataset measures what Claude says, not whether it works.

Verdict

Use if you're building AI coding assistants, developer tools, or research infrastructure for LLM evaluation: the extraction pipeline and prompt isolation methodology are immediately applicable templates worth copying. Security researchers tracking supply chain risks should mine this to understand which ecosystems AI recommendations are amplifying (if Claude pushes tool X to 50,000 developers, attackers know to target X's npm packages). Engineering leaders evaluating Claude Code, Cursor, or Copilot can use this to calibrate expectations: your AI pair programmer has baked-in opinions that will influence technology choices across your team, and this dataset shows how to measure those opinions systematically.

Skip if you need executable benchmarks (this is observational research, not a CI/CD-ready test harness), want current recommendations (the 2026 snapshot is already dated), or expect prescriptive tool rankings (the methodology is the contribution, not 'PostgreSQL won'). The real value is proving LLM behavior can be studied with social science rigor. Treat this as a blueprint for your own evaluations, not a leaderboard.
