Inside AI Product Bench: Why Two LLMs Disagree on Half Their Product Recommendations

Hook

Ask ChatGPT and Google AI Mode for the same product recommendation three times each, and they'll disagree with each other, and sometimes with themselves, about half the time. This dataset puts numbers on it.

Context

As LLMs become the new search interface, businesses face an uncomfortable reality: product visibility is no longer deterministic. Traditional SEO operates on stable rankings—optimize your content, climb to position three, stay there until the algorithm changes. But when ChatGPT recommends laptops or Google AI Mode suggests kitchen appliances, there's no SERP to track, no position to monitor. The same query can yield entirely different products across runs, models, or even successive calls to the same model.

The ai-product-bench repository attempts to quantify this chaos. Created by amplifying-ai, it's not a benchmarking framework but a dataset artifact: 3,806 product recommendations collected from ChatGPT and Google AI Mode across 132 query variations, each run three times per model. The core finding—a 47.3% agreement rate between models—transforms vague anxieties about LLM inconsistency into measurable evidence. For researchers studying recommender system reliability, SEO professionals navigating AI search, and product teams wondering if their offerings will surface in conversational interfaces, this dataset provides a temporal snapshot of how two major LLMs behave when asked "What should I buy?"

Technical Insight

[Figure: System architecture (auto-generated). Data Collection: manual prompts to the ChatGPT API and Google AI Mode produce raw LLM responses stored as JSONL files. Analysis Layer: extract entities into products.jsonl (2,074 recommendations), group by query_id, and run the overlap calculation to compute metrics into analysis.json (consistency stats). The static HTML dashboard loads the data via fetch and renders client-side visualizations.]

The repository's architecture is deliberately minimal: three JSONL files and a static HTML dashboard. This isn't a criticism—it's a design choice that prioritizes data portability over infrastructure complexity. The responses/ directory contains raw LLM outputs, while products.jsonl represents the extracted product entities with metadata, and analysis.json holds pre-computed consistency metrics.

The most interesting structural decision is the relational schema hidden within flat files. Each product entry links to its source response via query_id and response_id, enabling join operations despite the JSONL format:

// Reconstructing which products came from which query.
// Assumes an ES module context (top-level await) and the file layout described above.
const parseJsonl = text =>
  text.trim().split('\n').filter(Boolean).map(line => JSON.parse(line));

// Raw responses: loaded for joins on response_id, not used in the overlap calculation below
const responses = await fetch('responses/chatgpt_responses.jsonl')
  .then(r => r.text())
  .then(parseJsonl);

const products = await fetch('products.jsonl')
  .then(r => r.text())
  .then(parseJsonl);

// Group product names by query and model; Sets drop repeats across the three runs
const queryGroups = products.reduce((acc, product) => {
  const key = product.query_id;
  if (!acc[key]) acc[key] = { chatgpt: new Set(), google: new Set() };
  // Rows whose model label is neither "chatgpt" nor "google" are skipped
  acc[key][product.model]?.add(product.product_name);
  return acc;
}, {});

// Calculate Jaccard similarity (a 0 to 1 ratio) per query
const overlapStats = Object.entries(queryGroups).map(([queryId, models]) => {
  const intersection = [...models.chatgpt].filter(p => models.google.has(p)).length;
  const union = new Set([...models.chatgpt, ...models.google]).size;
  return { queryId, jaccard: union === 0 ? 0 : intersection / union };
});

This code reveals the dataset's analytical potential—you can compute Jaccard similarity, track which products appear consistently across runs, or identify queries where models diverge most dramatically. The three-run design is particularly valuable for measuring intra-model variance. If ChatGPT recommends different products across three identical runs of "best budget laptop," that's evidence of temperature-driven randomness or retrieval-augmented generation pulling from different sources.
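The intra-model variance mentioned above could be measured roughly as follows. This is a minimal sketch, not the repository's method: it assumes the field names from the article's sample record (query_id, model, run, product_name), and both the sample rows and the helper name runConsistency are hypothetical.

```javascript
// Hypothetical rows shaped like the article's sample record (not real dataset values).
const sampleRows = [
  { query_id: 'q1', model: 'chatgpt', run: 1, product_name: 'A' },
  { query_id: 'q1', model: 'chatgpt', run: 1, product_name: 'B' },
  { query_id: 'q1', model: 'chatgpt', run: 2, product_name: 'A' },
  { query_id: 'q1', model: 'chatgpt', run: 2, product_name: 'C' },
  { query_id: 'q1', model: 'chatgpt', run: 3, product_name: 'A' },
  { query_id: 'q1', model: 'chatgpt', run: 3, product_name: 'B' },
];

// For each (query, model) pair, report the share of distinct products
// that appeared in every run of that pair.
function runConsistency(rows) {
  const groups = new Map(); // "query|model" -> Map(run -> Set of product names)
  for (const row of rows) {
    const key = `${row.query_id}|${row.model}`;
    if (!groups.has(key)) groups.set(key, new Map());
    const runs = groups.get(key);
    if (!runs.has(row.run)) runs.set(row.run, new Set());
    runs.get(row.run).add(row.product_name);
  }

  const results = [];
  for (const [key, runs] of groups) {
    const runSets = [...runs.values()];
    const allProducts = new Set(runSets.flatMap(s => [...s]));
    const stable = [...allProducts].filter(p => runSets.every(s => s.has(p)));
    const [queryId, model] = key.split('|');
    results.push({
      queryId,
      model,
      runs: runSets.length,
      distinctProducts: allProducts.size,
      stableShare: stable.length / allProducts.size,
    });
  }
  return results;
}

console.log(runConsistency(sampleRows));
```

A stableShare near 1 would mean a model repeats itself across runs; a low value means most products show up in only some runs. The sketch treats product names as exact strings, so naming variants will understate stability.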

The analysis.json file pre-computes overlap metrics, but its schema is surprisingly sparse—just aggregate percentages without query-level granularity. A production eval system would track per-category consistency (do models agree more on headphones than laptops?), temporal drift (did agreement rates change between December and January?), and citation diversity (how many unique sources does each model cite?).
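Per-category consistency, the first of those missing metrics, can be approximated from products.jsonl alone. The sketch below is illustrative and rests on a loud assumption: that the category is the part of query_id before the first underscore (as in "laptop_budget_001"). The article does not confirm that convention, and the sample rows and function names are hypothetical.

```javascript
// Hypothetical rows; only query_id, model, and product_name are used.
const sampleRows = [
  { query_id: 'laptop_budget_001', model: 'chatgpt', product_name: 'A' },
  { query_id: 'laptop_budget_001', model: 'chatgpt', product_name: 'B' },
  { query_id: 'laptop_budget_001', model: 'google', product_name: 'B' },
  { query_id: 'laptop_budget_001', model: 'google', product_name: 'C' },
  { query_id: 'headphones_anc_001', model: 'chatgpt', product_name: 'X' },
  { query_id: 'headphones_anc_001', model: 'google', product_name: 'X' },
];

// ASSUMPTION: the category is the part of query_id before the first underscore.
const categoryOf = queryId => queryId.split('_')[0];

// Jaccard similarity of two Sets (0 when both are empty).
function jaccard(a, b) {
  const union = new Set([...a, ...b]);
  if (union.size === 0) return 0;
  const shared = [...a].filter(x => b.has(x)).length;
  return shared / union.size;
}

// Mean per-query Jaccard between the two models, grouped by category.
function categoryAgreement(rows) {
  const byQuery = new Map(); // query_id -> { chatgpt: Set, google: Set }
  for (const { query_id, model, product_name } of rows) {
    if (!byQuery.has(query_id)) {
      byQuery.set(query_id, { chatgpt: new Set(), google: new Set() });
    }
    const sets = byQuery.get(query_id);
    if (Object.hasOwn(sets, model)) sets[model].add(product_name);
  }

  const byCategory = new Map(); // category -> list of per-query Jaccard scores
  for (const [queryId, sets] of byQuery) {
    const category = categoryOf(queryId);
    if (!byCategory.has(category)) byCategory.set(category, []);
    byCategory.get(category).push(jaccard(sets.chatgpt, sets.google));
  }

  const means = {};
  for (const [category, scores] of byCategory) {
    means[category] = scores.reduce((sum, x) => sum + x, 0) / scores.length;
  }
  return means;
}

console.log(categoryAgreement(sampleRows));
```

Averaging per-query scores weights every query equally; pooling all products per category before computing Jaccard would answer a slightly different question.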

What's conspicuously absent is the collection methodology. No prompts are versioned, no API parameters documented. This matters because LLM behavior is extraordinarily sensitive to phrasing. "What's the best budget laptop?" versus "Recommend an affordable laptop" might yield 20% different products. Temperature settings (presumably 0.7-1.0 for ChatGPT's default) introduce randomness, while top_p nucleus sampling affects which tokens even get considered. Without these parameters, the dataset is unreproducible—useful for one-time analysis but unsuitable for longitudinal studies.
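One way to close this gap would be to write a small manifest record next to every collected response. The sketch below is purely hypothetical: none of the field names, the buildRunManifest helper, or the example values come from the repository; they only illustrate what a reproducible collection log might capture.

```javascript
// Hypothetical manifest builder: none of these field names come from the repository.
function buildRunManifest({ queryId, model, run, prompt, params = {} }) {
  return {
    query_id: queryId,
    model,
    run,
    prompt,                                   // exact prompt text, stored verbatim
    temperature: params.temperature ?? null,  // null = not set explicitly
    top_p: params.top_p ?? null,
    collected_at: new Date().toISOString(),   // timestamp for temporal drift analysis
  };
}

// One JSONL line per request keeps the manifest joinable on query_id + model + run.
const manifestLine = JSON.stringify(buildRunManifest({
  queryId: 'laptop_budget_001',
  model: 'chatgpt',
  run: 2,
  prompt: "What's the best budget laptop?",
  params: { temperature: 0.7 },               // illustrative value only
}));

console.log(manifestLine);
```

Recording null rather than omitting unset parameters makes it explicit that a default was in effect, which matters if provider defaults change later.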

The HTML dashboard (index.html) loads analysis.json and renders it with Chart.js, entirely client-side. It's a 200-line proof-of-concept that visualizes the 47.3% statistic but offers no interactivity. For researchers wanting to drill down—"Show me queries where agreement dropped below 30%" or "Which products appeared in ChatGPT responses but never Google's?"—you'll need to write your own scripts against the JSONL files.
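Both drill-down questions can be answered with a few lines against the product rows. The following is a sketch under the article's field names (query_id, model, product_name) and the "chatgpt"/"google" model labels used in the earlier snippet; the sample rows and function names are hypothetical.

```javascript
// Hypothetical rows shaped like the article's sample record.
const sampleRows = [
  { query_id: 'q1', model: 'chatgpt', product_name: 'A' },
  { query_id: 'q1', model: 'chatgpt', product_name: 'B' },
  { query_id: 'q1', model: 'chatgpt', product_name: 'C' },
  { query_id: 'q1', model: 'google', product_name: 'C' },
  { query_id: 'q1', model: 'google', product_name: 'D' },
  { query_id: 'q1', model: 'google', product_name: 'E' },
  { query_id: 'q2', model: 'chatgpt', product_name: 'X' },
  { query_id: 'q2', model: 'google', product_name: 'X' },
];

// Collect each model's distinct product names per query.
function setsByQuery(rows) {
  const byQuery = new Map();
  for (const { query_id, model, product_name } of rows) {
    if (!byQuery.has(query_id)) {
      byQuery.set(query_id, { chatgpt: new Set(), google: new Set() });
    }
    const sets = byQuery.get(query_id);
    if (Object.hasOwn(sets, model)) sets[model].add(product_name);
  }
  return byQuery;
}

// "Show me queries where agreement dropped below 30%"
function lowAgreementQueries(rows, threshold = 0.3) {
  const flagged = [];
  for (const [queryId, { chatgpt, google }] of setsByQuery(rows)) {
    const union = new Set([...chatgpt, ...google]);
    const shared = [...chatgpt].filter(p => google.has(p)).length;
    const jaccard = union.size === 0 ? 0 : shared / union.size;
    if (jaccard < threshold) flagged.push({ queryId, jaccard });
  }
  return flagged;
}

// "Which products appeared in ChatGPT responses but never Google's?"
function chatgptOnlyProducts(rows) {
  const seenByGoogle = new Set(
    rows.filter(r => r.model === 'google').map(r => r.product_name)
  );
  const chatgptNames = new Set(
    rows.filter(r => r.model === 'chatgpt').map(r => r.product_name)
  );
  return [...chatgptNames].filter(name => !seenByGoogle.has(name));
}

console.log(lowAgreementQueries(sampleRows));
console.log(chatgptOnlyProducts(sampleRows));
```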

The data structure itself is clean:

{
  "query_id": "laptop_budget_001",
  "model": "chatgpt",
  "run": 2,
  "product_name": "Lenovo IdeaPad 3",
  "product_url": "https://www.lenovo.com/...",
  "cited_source": "TechRadar",
  "position": 3,
  "explanation": "Great value with Ryzen 5..."
}

The cited_source field is gold for citation analysis. If 60% of ChatGPT's product links point to CNET and Wirecutter (the latter owned by The New York Times), while Google AI Mode favors YouTube creators, that would hint at the influence of business relationships, though citation shares alone cannot prove it. The position field lets you study whether models bury certain brands in later recommendations, a positional bias that mirrors traditional search ranking.
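Both analyses reduce to simple aggregations over the product rows. Here is a minimal sketch using the cited_source and position fields from the sample record; the rows, their values, and the function names are hypothetical and say nothing about what the real dataset contains.

```javascript
// Hypothetical rows; the sources and positions are made up for illustration.
const sampleRows = [
  { model: 'chatgpt', product_name: 'Lenovo IdeaPad 3', cited_source: 'TechRadar', position: 3 },
  { model: 'chatgpt', product_name: 'Lenovo IdeaPad 3', cited_source: 'TechRadar', position: 1 },
  { model: 'chatgpt', product_name: 'Acer Aspire 5', cited_source: 'CNET', position: 2 },
  { model: 'chatgpt', product_name: 'Acer Aspire 5', cited_source: 'Wirecutter', position: 2 },
  { model: 'google', product_name: 'Acer Aspire 5', cited_source: 'YouTube', position: 1 },
  { model: 'google', product_name: 'Lenovo IdeaPad 3', cited_source: 'CNET', position: 4 },
];

// Share of each model's citations that go to each source.
function citationShare(rows) {
  const counts = {}; // model -> source -> count
  for (const { model, cited_source } of rows) {
    if (!cited_source) continue; // skip rows without a citation
    counts[model] ??= {};
    counts[model][cited_source] = (counts[model][cited_source] ?? 0) + 1;
  }
  const shares = {};
  for (const [model, bySource] of Object.entries(counts)) {
    const total = Object.values(bySource).reduce((sum, n) => sum + n, 0);
    shares[model] = Object.fromEntries(
      Object.entries(bySource).map(([source, n]) => [source, n / total])
    );
  }
  return shares;
}

// Mean list position of each product for one model (lower = shown earlier).
function meanPositionByProduct(rows, model) {
  const totals = {}; // product_name -> { sum, count }
  for (const row of rows) {
    if (row.model !== model || typeof row.position !== 'number') continue;
    totals[row.product_name] ??= { sum: 0, count: 0 };
    totals[row.product_name].sum += row.position;
    totals[row.product_name].count += 1;
  }
  return Object.fromEntries(
    Object.entries(totals).map(([name, t]) => [name, t.sum / t.count])
  );
}

console.log(citationShare(sampleRows));
console.log(meanPositionByProduct(sampleRows, 'chatgpt'));
```

Studying brands rather than individual products would need a brand field or a name-to-brand mapping, neither of which appears in the sample record.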

One clever quasi-feature: by capturing recommendations as structured entities rather than raw text, the dataset enables entity resolution research. "Apple MacBook Air M2" and "MacBook Air (M2, 2023)" are the same product, but string matching won't catch it. Researchers could use this dataset to train deduplication models or test fuzzy matching heuristics.
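A crude fuzzy-matching baseline for this problem is token-level Jaccard similarity on normalized names. The sketch below is one illustrative heuristic, not anything the repository ships; the 0.5 threshold and the function names are arbitrary assumptions.

```javascript
// Lowercase, strip punctuation, and split a product name into a Set of tokens.
function nameTokens(name) {
  return new Set(
    name.toLowerCase().replace(/[^a-z0-9]+/g, ' ').trim().split(' ').filter(Boolean)
  );
}

// Jaccard similarity between the token sets of two product names.
function tokenJaccard(nameA, nameB) {
  const a = nameTokens(nameA);
  const b = nameTokens(nameB);
  const union = new Set([...a, ...b]);
  if (union.size === 0) return 0;
  const shared = [...a].filter(t => b.has(t)).length;
  return shared / union.size;
}

// ASSUMPTION: 0.5 is an arbitrary cut-off chosen for illustration.
function sameProduct(nameA, nameB, threshold = 0.5) {
  return tokenJaccard(nameA, nameB) >= threshold;
}

console.log(sameProduct('Apple MacBook Air M2', 'MacBook Air (M2, 2023)'));
console.log(sameProduct('Apple MacBook Air M2', 'MacBook Air M1'));
```

The two calls show both the appeal and the fragility of the approach: the first pair shares three of five tokens and clears the threshold, while the second pair differs only in the chip generation and falls just under it. A small change in threshold would merge distinct models, which is why labeled pairs from a dataset like this one are useful for tuning.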

Gotcha

The elephant in the room is ground truth, or the complete absence of it. The dataset measures inter-model agreement, which is consistency without any measure of accuracy. If ChatGPT and Google AI Mode both recommend the same mediocre laptop 100% of the time, that's high consistency but zero quality. Conversely, if they disagree because one suggests the objectively better product, that disagreement is valuable. Without human preference labels ("Which recommendation was more helpful?") or expert judgments ("Which laptop actually fits this query?"), the 47.3% statistic floats in a vacuum. You can't distinguish signal from noise.

The methodology gaps compound this limitation. We don't know the collection date beyond "likely late 2024," which matters when providers ship new model versions without notice and retrieval systems index the web continuously. We don't know if queries were run sequentially or in parallel, whether sessions were cleared between runs (affecting context windows), or if responses were sampled from the same API endpoints used by production users. The lack of prompts is particularly painful: if queries were phrased as "Best X for Y" versus "I need X that does Y," the consistency rates could plausibly differ by 30 percentage points, and nothing in the repository lets you check.

The dataset's single-domain focus (consumer electronics and household products) limits generalizability. B2B software recommendations likely show different patterns—fewer products to choose from but more nuanced feature requirements. Medical or financial advice would involve regulatory constraints that product recommendations don't face. You can't extrapolate "LLMs agree 47% on laptops" to "LLMs agree 47% on chemotherapy drugs." The repository implicitly claims to benchmark "AI product recommendations" but actually benchmarks "two models on consumer goods in Q4 2024."

Finally, this is a dead-end dataset—no collection scripts, no pipeline to refresh data, no eval harness to extend coverage. If you wanted to add Claude or Gemini, you'd need to reverse-engineer the entire methodology from 2,074 product entries. If you wanted to track how consistency evolves as models update, you'd build the infrastructure from scratch. The repository provides the fish, not the fishing rod.

Verdict

Use if: You're researching LLM non-determinism in search-augmented generation and need empirical data showing real models disagreeing on real queries. The 3,806 product entries with preserved citations are valuable for citation bias analysis, especially if you're studying how LLMs favor certain publishers or affiliate networks. SEO professionals pivoting to "LLM visibility optimization" will find the structured product data useful for reverse-engineering what makes products surface consistently. It's also a solid teaching dataset for students learning entity resolution or building recommendation consistency metrics: small enough to process locally, messy enough to be realistic.

Skip if: You need reproducible benchmarks (no collection scripts, no prompts, no API parameters), ground truth for accuracy evaluation (measures agreement not quality), or infrastructure for ongoing monitoring (static snapshot from Q4 2024 with no refresh mechanism). Avoid if you're outside consumer products; the insights don't transfer to B2B, medical, or financial domains. Also skip if you expect a real benchmarking framework like HELM or lm-evaluation-harness; this is a dataset contribution, not eval tooling. The 47.3% statistic is the headline, but the lack of methodology means you're building everything else yourself.