PII Detective: Finding Sensitive Data by Reading the Room (Not the Records)
Hook
What if you could find personally identifiable information across petabytes of data without ever looking at the data itself? PII Detective proves that sometimes the column name tells you everything you need to know.
Context
Data governance teams face a painful paradox: you need to know where PII lives before you can protect it, but scanning for PII means exposing it to the scanning tools themselves. Traditional approaches fall into two camps, both problematic. Regex-based scanners look for patterns like Social Security numbers or email addresses, but they’re brittle: miss one format variation and you’ve got a compliance gap. They also can’t understand context; a column named ‘employee_tax_id’ screams PII even if you haven’t seen a single value. On the other end, cloud-native solutions like GCP’s Sensitive Data Protection (formerly Cloud DLP) offer ML-powered content inspection that’s genuinely good at finding PII in actual data. The catch? At scale, you’re paying per gigabyte scanned. A moderately sized data warehouse can rack up thousands of dollars in scanning costs, and that’s before you’ve masked a single field.
PII Detective takes a radically different approach: it asks an LLM to make educated guesses based purely on metadata—table names, column names, and schemas. No content scanning. No sampling. Just contextual reasoning about whether ‘user_email_address’ or ‘customer_ssn_encrypted’ might contain sensitive data. For organizations with reasonably consistent naming conventions, this metadata-only approach delivers 80-90% of the value at less than 1% of the cost. The trick is pairing LLM inference with human-in-the-loop validation, so you catch the false positives before applying irreversible masking policies to production tables.
Technical Insight
PII Detective is built as a Next.js application that orchestrates a three-stage pipeline: metadata extraction, LLM-powered classification, and policy enforcement. The architecture is deliberately simple—Postgres tracks detection state, OpenAI provides the reasoning engine, and native platform features (BigQuery Policy Tags or Snowflake Masking Policies) handle the actual data protection.
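To make the pipeline concrete, the state passed between the three stages can be modeled roughly as follows. This is a sketch: the type and field names are illustrative assumptions, not taken from PII Detective's actual codebase.

```typescript
// Illustrative types for the three-stage pipeline state.
// All names here are assumptions, not PII Detective's real schema.
type PiiLikelihood = 'high' | 'medium' | 'low';
type PiiType =
  | 'email' | 'ssn' | 'phone' | 'name'
  | 'address' | 'financial' | 'health' | 'none';
type ValidationStatus = 'pending' | 'approved' | 'rejected' | 'reclassified';

// Stage 1 output: metadata only, no row data.
interface ColumnMetadata {
  schemaName: string;
  tableName: string;
  columnName: string;
  dataType: string;
}

// Stage 2 output, persisted to Postgres until a steward reviews it (stage 3).
interface Classification {
  column: ColumnMetadata;
  piiLikelihood: PiiLikelihood;
  piiType: PiiType;
  reasoning: string;
  status: ValidationStatus; // starts 'pending' until human review
}

const example: Classification = {
  column: {
    schemaName: 'crm',
    tableName: 'users',
    columnName: 'user_email_address',
    dataType: 'STRING',
  },
  piiLikelihood: 'high',
  piiType: 'email',
  reasoning: 'Column name explicitly references an email address',
  status: 'pending',
};
```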
The metadata extraction phase connects to your data warehouse and pulls table/column information without touching row data. For BigQuery, this means querying INFORMATION_SCHEMA views; for Snowflake, it’s equivalent system tables. The extracted metadata gets structured into prompts that give the LLM maximum context:
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const prompt = `Analyze these database columns and identify potential PII:

Table: ${tableName}
Schema: ${schemaName}
Columns:
${columns.map(col => `- ${col.name} (${col.dataType})`).join('\n')}

For each column, respond with:
1. pii_likelihood: high/medium/low
2. pii_type: email/ssn/phone/name/address/financial/health/none
3. reasoning: brief explanation

Consider naming patterns, data types, and contextual clues.`;

const response = await openai.chat.completions.create({
  model: 'gpt-4',
  messages: [{ role: 'user', content: prompt }],
  temperature: 0.2, // lower temperature for more consistent classification
});
The LLM’s responses get parsed and stored in Postgres with a validation status. This is where human-in-the-loop becomes critical. The application presents a review interface where data stewards see the AI’s suggestions alongside the column metadata. They can approve, reject, or reclassify before any policies get applied. This validation layer prevents the nightmare scenario where an LLM misreads ‘user_id’ as PII and you accidentally mask your primary key.
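The article doesn't show the parsing step, but a tolerant parser for the numbered-list format the prompt requests might look like this sketch (field names follow the prompt; the function itself is an assumption):

```typescript
interface ParsedClassification {
  piiLikelihood: string;
  piiType: string;
  reasoning: string;
}

// Parses one column's answer in the "1. pii_likelihood: ..." format
// the prompt requests. Returns null if any required field is missing,
// so malformed replies can be flagged for re-prompting instead of stored.
function parseClassification(reply: string): ParsedClassification | null {
  const field = (name: string): string | null => {
    const match = reply.match(new RegExp(`${name}:\\s*(.+)`, 'i'));
    return match ? match[1].trim() : null;
  };
  const piiLikelihood = field('pii_likelihood');
  const piiType = field('pii_type');
  const reasoning = field('reasoning');
  if (!piiLikelihood || !piiType || !reasoning) return null;
  return {
    piiLikelihood: piiLikelihood.toLowerCase(),
    piiType: piiType.toLowerCase(),
    reasoning,
  };
}
```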
Once validated, PII Detective applies masking policies using platform-native features. For BigQuery, it creates and assigns Policy Tags that trigger column-level security. For Snowflake, it generates and applies Masking Policies. The masking function itself uses SHA256 hashing rather than encryption or tokenization:
-- Snowflake masking policy example
CREATE OR REPLACE MASKING POLICY pii_mask AS (val STRING)
RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('DATA_STEWARD', 'COMPLIANCE_ADMIN')
      THEN val            -- Authorized roles see raw data
    ELSE SHA2(val, 256)   -- Everyone else sees hash
  END;

ALTER TABLE users MODIFY COLUMN email
  SET MASKING POLICY pii_mask;
The SHA256 choice is clever for analytics use cases. Unlike encryption, hashing is deterministic: the same input always produces the same output. This means analysts can still perform joins on hashed email addresses or count distinct hashed user IDs without ever seeing the plaintext. They can’t directly reverse the hash, but they can still extract statistical insights. One caveat: unsalted hashes of low-entropy values such as SSNs or phone numbers are vulnerable to dictionary attacks, so treat this as obfuscation rather than strong cryptographic protection. For filtering operations, analysts can hash their filter values client-side and compare against the hashed column.
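That client-side trick is easy to reproduce. A sketch in Node, assuming the warehouse applied SHA2(val, 256) with hex output and no salt:

```typescript
import { createHash } from 'node:crypto';

// Mirrors Snowflake's SHA2(val, 256): deterministic, hex-encoded SHA-256.
function sha256Hex(value: string): string {
  return createHash('sha256').update(value).digest('hex');
}

// Determinism is what keeps joins and COUNT(DISTINCT ...) working:
// the same plaintext always maps to the same 64-character hex digest,
// which an analyst can use in a WHERE clause against the masked column.
const probe = sha256Hex('alice@example.com');
```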
The cost economics are striking. A BigQuery dataset with 500 tables and 10,000 columns might cost $3-5 in OpenAI API calls to classify—GPT-4 charges about $0.03 per 1K tokens, and metadata is token-cheap. The equivalent content-based scan using GCP’s Sensitive Data Protection could easily hit $2,000-5,000 depending on data volume. The tradeoff is coverage: metadata analysis won’t catch PII hiding in poorly named columns or free-text fields, but for most structured data warehouses with reasonable naming hygiene, the hit rate is surprisingly good.
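The back-of-envelope math behind that estimate can be sketched as follows. The per-column token count and the blended rate are illustrative assumptions (columns are batched per table, so the marginal tokens per column are small):

```typescript
// Rough cost model for metadata-only classification.
// Assumptions: ~12 marginal tokens per column (batched per table),
// a blended rate of $0.03 per 1K tokens. Both numbers are illustrative.
function estimateClassificationCostUsd(
  columns: number,
  tokensPerColumn: number = 12,
  usdPer1kTokens: number = 0.03,
): number {
  return (columns * tokensPerColumn * usdPer1kTokens) / 1000;
}

// 10,000 columns -> 10,000 * 12 * $0.03 / 1,000 = $3.60,
// consistent with the $3-5 range quoted above.
const warehouseEstimate = estimateClassificationCostUsd(10_000);
```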
The system’s state management in Postgres tracks not just current classifications but also the history of human decisions. This creates a feedback loop where you can analyze which types of columns the LLM consistently gets wrong, then refine your prompts or add explicit rules. Over time, organizations build up a knowledge base of validated PII patterns specific to their data culture.
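One way to mine that decision history is a per-category disagreement report. A sketch as a pure function; the record shape is an assumption about what the Postgres history table might hold:

```typescript
// Assumed shape of one row of the human-decision history.
interface DecisionRecord {
  piiType: string;        // the LLM's predicted category
  humanApproved: boolean; // did the data steward agree?
}

// Computes the steward-disagreement rate per predicted PII type,
// surfacing the categories where prompts or rules need refinement.
function disagreementRates(records: DecisionRecord[]): Map<string, number> {
  const totals = new Map<string, { n: number; rejected: number }>();
  for (const r of records) {
    const t = totals.get(r.piiType) ?? { n: 0, rejected: 0 };
    t.n += 1;
    if (!r.humanApproved) t.rejected += 1;
    totals.set(r.piiType, t);
  }
  const rates = new Map<string, number>();
  for (const [type, t] of totals.entries()) {
    rates.set(type, t.rejected / t.n);
  }
  return rates;
}
```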
Gotcha
The fundamental limitation is right there in the architecture: PII Detective only sees what you tell it through names. If your data warehouse has columns named ‘field_1’, ‘col_xyz’, or ‘misc_data’, the LLM is flying blind. It can’t detect a social security number living in a column called ‘reference_code’ no matter how sophisticated the prompt engineering. This makes it a poor fit for data platforms with inconsistent naming conventions or legacy systems where columns accumulated cryptic names over decades. You need at least moderate naming hygiene for this approach to work.
The human-in-the-loop requirement is both a feature and a friction point. Yes, it prevents false positives from auto-masking critical columns. But it also means PII Detective isn’t a ‘set and forget’ solution—someone needs to review classifications, especially in the initial deployment. For organizations with thousands of tables, that first-pass review can be time-consuming. The tool doesn’t currently offer bulk approval workflows or confidence-based auto-approval, so high-confidence classifications still queue up for human review alongside borderline cases.
Another gotcha: the SHA256 masking approach doesn’t work for all use cases. If analysts need to perform range queries (find all users with income between X and Y) or phonetic matching (find names that sound like ‘Smith’), hashing breaks those operations. The deterministic property only preserves exact equality comparisons. For these scenarios, you’d need to integrate with more sophisticated masking solutions like format-preserving encryption or tokenization, which PII Detective doesn’t currently handle. And because the masking happens at the warehouse layer, you’re locked into whatever masking primitives your platform provides—Snowflake and BigQuery have different capabilities and limitations.
Verdict
Use PII Detective if you’re working with structured data warehouses (BigQuery or Snowflake specifically) that follow reasonable naming conventions, you need to bootstrap PII discovery without a five-figure budget, and you can dedicate data steward time to validation workflows. It’s particularly strong for teams that want to leverage platform-native masking features but need help identifying which columns deserve protection. The metadata-only approach makes it practical to scan large datasets frequently—you could run weekly sweeps to catch newly created PII columns without worrying about scanning costs. Skip it if your data has poor naming hygiene, you need content-level semantic detection (catching PII in free-text fields or JSON blobs), you require fully automated classification without human oversight, or you work with data platforms outside the BigQuery/Snowflake ecosystem. Also skip if your masking requirements go beyond simple hashing—analysts who need format-preserving encryption, realistic synthetic data, or reversible tokenization will need heavier tooling. This is a lightweight starter tool for organizations building data governance muscle, not an enterprise-grade replacement for comprehensive DLP platforms.