DICES Dataset: When AI Safety Ratings Reveal More About Us Than The AI

Hook

When Google asked 173 demographically diverse raters to judge the same AI conversations, they didn't get consensus—they got a mirror reflecting how gender, race, and geography shape what we consider 'safe.'

Context

Most AI safety datasets treat harmful content as a binary classification problem: either a conversation is safe or it isn't. This approach powers the content moderation systems behind chatbots from ChatGPT to customer service bots, typically trained on datasets where a single annotator or small group decides what's toxic, what's acceptable, and what crosses the line.

But there's a fundamental flaw in this approach: safety isn't objective. What feels threatening to a woman might not register to a man. What seems innocuous in California might be offensive in another cultural context. By collapsing these diverse perspectives into a single label, we're not just losing nuance; we're embedding the biases of whoever happened to annotate the training data.

Google Research's DICES (Diversity In Conversational AI Evaluation for Safety) dataset was created to expose and measure this subjectivity. Released as a research tool rather than training data, DICES contains 1,340 adversarial conversations between humans and a dialog model, each rated by dozens to over a hundred demographically diverse evaluators. The goal isn't to train better classifiers. It's to understand how different populations perceive AI safety differently, and to build evaluation frameworks that acknowledge rather than erase this diversity.

Technical Insight

DICES consists of two datasets with fundamentally different design philosophies around demographic representation. Dataset 990 contains 990 conversations, each rated 60 to 70 times, drawn from a pool of 173 raters balanced across gender (Man/Woman) and geographic locale (US/India). Dataset 350 takes a more granular approach: 350 conversations, each rated by all 123 raters, who are balanced across gender, race (Asian, Black, White), and ethnicity (Hispanic or Latino/Not Hispanic or Latino). The design decision to replicate ratings this extensively, far beyond the 3 to 5 annotations typical in ML datasets, is what makes DICES useful for statistical analysis of inter-rater disagreement.

The dataset structure encodes ratings as distributions rather than single labels. Each conversation includes the raw vote counts across rating categories, enabling analysis of consensus versus polarization. For example, a conversation might receive 45 'Not okay' ratings, 15 'Okay' ratings, and 10 'I'm not sure' ratings from Dataset 990's raters. This distribution tells a story that a majority-vote label of 'Not okay' would erase: about 21% of raters judged the conversation 'Okay', and another 14% were uncertain.
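To make the arithmetic concrete, here is a minimal sketch that turns the hypothetical vote counts from the example into shares and shows what a majority-vote label would discard. The counts are the illustrative ones from the paragraph above, not values read from the dataset.

```python
from collections import Counter

# Hypothetical vote counts for one conversation (illustrative, not real data).
votes = Counter({"Not okay": 45, "Okay": 15, "I'm not sure": 10})

total = sum(votes.values())
shares = {label: count / total for label, count in votes.items()}

# The single label a majority-vote pipeline would keep.
majority_label, majority_count = votes.most_common(1)[0]

# Fraction of raters whose judgment that label does not reflect.
minority_share = 1 - majority_count / total

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"Majority label: {majority_label}")
print(f"Raters not represented by it: {minority_share:.1%}")
```

Keeping `shares` alongside the majority label is the whole point: more than a third of the raters in this example did not choose the label that would end up in a conventional dataset.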

Here's how you might load and analyze rating distributions in Python:

import pandas as pd
import numpy as np

# Load Dataset 350 (the file name is a placeholder; point it at your local copy)
df = pd.read_csv('350_dataset.csv')

# Calculate rating entropy to measure disagreement
def calculate_entropy(row):
    # Extract vote counts for each rating category.
    # These column names are assumed; check them against the CSV header.
    votes = [
        row['Q_overall_num_not_okay'],
        row['Q_overall_num_okay'],
        row['Q_overall_num_unsure']
    ]
    total = sum(votes)
    if total == 0:
        return 0
    
    # Calculate Shannon entropy
    probs = [v/total for v in votes if v > 0]
    return -sum(p * np.log2(p) for p in probs)

df['rating_entropy'] = df.apply(calculate_entropy, axis=1)

# High entropy = high disagreement
high_disagreement = df.nlargest(10, 'rating_entropy')

# Compare mean 'Not okay' counts across demographic splits
for demographic in ['rater_gender', 'rater_race', 'rater_ethnicity']:
    if demographic in df.columns:
        print(f"\nMean 'Not okay' count by {demographic}:")
        print(df.groupby(demographic)['Q_overall_num_not_okay'].mean())
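Raw entropy values are hard to interpret on their own, so it can help to check the measure against known extremes. The sketch below is a standalone, standard-library version of the same Shannon entropy idea, with one addition that is not in the snippet above: it divides by the maximum possible entropy so scores fall between 0 (unanimous) and 1 (evenly split). The function name `normalized_entropy` and the vote counts are illustrative assumptions.

```python
import math

def normalized_entropy(counts):
    """Shannon entropy of a vote distribution, scaled to the range [0, 1]."""
    total = sum(counts)
    if total == 0 or len(counts) < 2:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    entropy = sum(-p * math.log2(p) for p in probs)
    # Divide by the largest entropy possible with this many categories.
    return entropy / math.log2(len(counts))

# Made-up vote counts in the order [not okay, okay, unsure].
unanimous = normalized_entropy([70, 0, 0])     # every rater agrees
even_split = normalized_entropy([20, 20, 20])  # maximum disagreement
example = normalized_entropy([45, 15, 10])     # the earlier worked example

print(f"unanimous:  {unanimous:.2f}")
print(f"even split: {even_split:.2f}")
print(f"example:    {example:.2f}")
```

Normalizing also makes scores comparable if you later analyze a question with a different number of answer options.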

The conversations themselves are adversarial by design—human agents were explicitly instructed to probe the dialog model's boundaries and attempt to elicit unsafe responses. This creates edge-case scenarios rather than typical interactions. A conversation might start innocuously ('Tell me about cooking') then gradually steer toward sensitive topics, testing whether the model maintains safety guardrails under social engineering attempts.

Crucially, DICES includes rich demographic metadata for each rater, not just aggregate statistics. You can filter ratings by specific intersectional identities (e.g., 'Hispanic or Latino' + 'Woman' + 'US') to analyze how specific demographic combinations perceive safety differently. This granularity enables research questions like: Do safety perceptions differ more by gender or geography? Are certain types of content universally considered unsafe, while others show high demographic variance?
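As a rough illustration of intersectional filtering, the sketch below works on a handful of made-up per-rating records. The field names (`rater_gender`, `rater_locale`, `rater_ethnicity`, `rating`), the values, and the helper `unsafe_share` are assumptions for illustration; the real files may use different column names and answer labels.

```python
# Toy per-rating records: one row per (rater, conversation) judgment.
# Field names and values are illustrative, not the dataset's actual schema.
ratings = [
    {"rater_gender": "Woman", "rater_locale": "US",
     "rater_ethnicity": "Hispanic or Latino", "rating": "Not okay"},
    {"rater_gender": "Woman", "rater_locale": "US",
     "rater_ethnicity": "Hispanic or Latino", "rating": "Okay"},
    {"rater_gender": "Woman", "rater_locale": "US",
     "rater_ethnicity": "Not Hispanic or Latino", "rating": "Not okay"},
    {"rater_gender": "Man", "rater_locale": "US",
     "rater_ethnicity": "Hispanic or Latino", "rating": "Okay"},
    {"rater_gender": "Man", "rater_locale": "US",
     "rater_ethnicity": "Not Hispanic or Latino", "rating": "Okay"},
    {"rater_gender": "Man", "rater_locale": "US",
     "rater_ethnicity": "Not Hispanic or Latino", "rating": "Not okay"},
]

def unsafe_share(rows, **criteria):
    """Share of 'Not okay' ratings among rows matching every criterion."""
    matched = [r for r in rows if all(r[k] == v for k, v in criteria.items())]
    if not matched:
        return None  # no ratings from this subgroup
    return sum(r["rating"] == "Not okay" for r in matched) / len(matched)

# One intersectional subgroup versus single-attribute groups.
subgroup = unsafe_share(
    ratings,
    rater_gender="Woman",
    rater_locale="US",
    rater_ethnicity="Hispanic or Latino",
)
women = unsafe_share(ratings, rater_gender="Woman")
men = unsafe_share(ratings, rater_gender="Man")
missing = unsafe_share(ratings, rater_locale="India")

print(f"Hispanic or Latino + Woman + US: {subgroup}")
print(f"All women: {women:.2f}, all men: {men:.2f}")
print(f"Subgroup with no ratings: {missing}")
```

Returning `None` for an empty subgroup matters in practice: the finer the intersection, the fewer raters fall into it, and a share computed from one or two raters should not be read as a group-level finding.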

The dataset also includes question-level ratings beyond just overall safety: whether the conversation contained graphic depictions of harm, unfair generalizations, whether the dialog model was evasive, and more. These dimensions allow for multifaceted safety analysis rather than a single safety score. A conversation might be rated as containing unfair generalizations but not graphic harm, with different demographic groups showing different sensitivities to each dimension.
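One way to look at several safety dimensions at once is a small group-by-dimension table of flag rates. The sketch below uses invented records and hypothetical dimension keys (`unfair_generalization`, `graphic_harm`) standing in for the dataset's question columns; the helper `flag_rates` is likewise an assumption, not part of any DICES tooling.

```python
from collections import defaultdict

# Toy records: each rater's answers on two dimensions for one conversation.
# Keys are hypothetical stand-ins for the dataset's question-level columns.
answers = [
    {"rater_gender": "Woman", "unfair_generalization": True, "graphic_harm": False},
    {"rater_gender": "Woman", "unfair_generalization": True, "graphic_harm": False},
    {"rater_gender": "Man", "unfair_generalization": False, "graphic_harm": False},
    {"rater_gender": "Man", "unfair_generalization": True, "graphic_harm": False},
]
dimensions = ["unfair_generalization", "graphic_harm"]

def flag_rates(rows, group_key, dims):
    """Per-group fraction of raters who flagged each dimension."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[group_key]].append(row)
    return {
        group: {d: sum(r[d] for r in members) / len(members) for d in dims}
        for group, members in grouped.items()
    }

rates = flag_rates(answers, "rater_gender", dimensions)

# Absolute gap between the two groups on each dimension.
gaps = {d: abs(rates["Woman"][d] - rates["Man"][d]) for d in dimensions}

for group, by_dim in rates.items():
    print(group, by_dim)
print("Gaps:", gaps)
```

In this toy data the groups agree on graphic harm and differ on unfair generalizations, which is the kind of dimension-specific sensitivity a single overall score would hide.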

Gotcha

The most significant limitation isn't in the data itself but in how it's often misunderstood: DICES is not training data. The adversarial nature and small size (1,340 conversations total) make it unsuitable for training production safety classifiers. Researchers sometimes attempt to use it as fine-tuning data for toxicity detection models, which misses the entire point—the value is in analyzing the diversity of human judgment, not in the conversations themselves.

The demographic categories, while more comprehensive than most datasets, still flatten human identity into discrete buckets. Gender is encoded as binary Man/Woman, excluding non-binary identities. Race categories (Asian, Black, White) use broad groupings that obscure significant within-group diversity: 'Asian' encompasses vastly different cultural contexts from East Asia to South Asia to Southeast Asia. The dataset captures snapshots of demographic identity but can't account for intersectionality's full complexity or how individual raters' lived experiences beyond these categories shape their safety perceptions.

Additionally, because conversations are adversarial rather than naturalistic, findings may not generalize to typical user interactions. A safety classifier optimized for DICES' edge cases might over-flag benign everyday conversations, while one calibrated for typical usage might fail on the adversarial scenarios DICES highlights.

Verdict

Use if: You're researching how demographic factors influence AI safety perceptions and need statistically robust data with high rating replication per item. Use if you're developing evaluation frameworks for conversational AI that need to account for diverse perspectives rather than assuming consensus. Use if you're benchmarking safety systems and want to understand performance across different demographic groups rather than just aggregate accuracy.

Skip if: You need training data for safety classifiers. This is evaluation data with intentionally adversarial conversations that won't reflect typical user interactions. Skip if you require current data reflecting 2024's conversational AI landscape; the dataset represents a specific model and time period. Skip if you need production-ready safety monitoring; DICES is a research tool for understanding safety perception, not a real-time moderation system.