PII Detective: Why Scanning Column Names Beats Scanning Data for Privacy Compliance
Hook
Scanning every row of a 10-billion-record table for PII could cost you $500 in API calls. Scanning just the column names? Less than a penny.
Context
Data privacy regulations like GDPR and CCPA have turned PII detection from a compliance checkbox into an operational nightmare. Traditional approaches fall into two camps: regex-based content scanning that's thorough but prohibitively expensive at scale, or manual audits that can't keep pace with the dozens of tables your data team ships each week.
The dirty secret of most PII detection tools is that they're built for security teams who scan infrequently, not data teams managing hundreds of pipelines. Tools like Google's DLP API will dutifully scan every cell in your warehouse, identify PII with impressive accuracy, and send you a bill that makes your CFO question the value of data governance. Meanwhile, your analysts are creating new user_email and customer_phone columns every sprint, and your manual review process is always six weeks behind. PII Detective takes a different bet: in well-managed data warehouses, the column name tells you everything you need to know—and LLMs are eerily good at understanding context that simple pattern matching misses.
Technical Insight
The core architectural insight of PII Detective is that modern data warehouses already contain the signal you need for PII detection: metadata. Instead of scanning content, it queries your warehouse's information schema to extract table and column names, then feeds this metadata to an LLM with a carefully crafted prompt asking a simple question: "Which of these columns likely contains personally identifiable information?"
Here's the actual workflow, stripped to essentials. First, PII Detective connects to your warehouse (BigQuery or Snowflake) and pulls metadata:
// Simplified BigQuery metadata extraction
const tables = await bigquery.query(`
SELECT
table_catalog,
table_schema,
table_name,
column_name,
data_type
FROM
\`${projectId}.${datasetId}.INFORMATION_SCHEMA.COLUMNS\`
`);
// Package for LLM analysis
const metadata = tables[0].map(row => ({
table: `${row.table_schema}.${row.table_name}`,
column: row.column_name,
type: row.data_type
}));
This metadata gets sent to OpenAI with a prompt that asks the LLM to reason about context. The genius is in what the LLM can infer that regex cannot. A column named email is obviously PII, but what about contact_email versus email_template? What about ssn_last_4 versus full_ssn? The LLM understands semantic context in ways that pattern matching never will. It knows that users.phone_number is probably PII while restaurants.phone_number is publicly listed business information.
Once the LLM flags potential PII columns, PII Detective doesn't immediately apply masking policies. This is where the human-in-the-loop design becomes critical. The application stores suggestions in a Postgres database with a status field:
interface PIIDetection {
id: string;
project_id: string;
dataset_id: string;
table_name: string;
column_name: string;
confidence: number;
status: 'pending' | 'approved' | 'rejected';
detected_at: Date;
reviewed_at?: Date;
reviewed_by?: string;
}
Users review these suggestions through a Next.js interface, approving or rejecting each one. Only approved detections trigger actual masking policies. This validation step is what makes PII Detective production-ready rather than a risky automation experiment.
When a detection is approved, PII Detective applies native platform masking. For BigQuery, it creates policy tags:
// Apply BigQuery policy tag for approved PII column
async function applyBigQueryMask(
projectId: string,
datasetId: string,
tableName: string,
columnName: string
) {
const policyTagId = 'projects/PROJECT/locations/LOCATION/taxonomies/TAXONOMY/policyTags/TAG';
await bigquery.query(`
ALTER TABLE \`${projectId}.${datasetId}.${tableName}\`
ALTER COLUMN ${columnName}
SET OPTIONS (policy_tags = ["${policyTagId}"])
`);
}
For Snowflake, it uses dynamic data masking with SHA256 hashing:
// Snowflake DDM policy application
await snowflake.execute(`
ALTER TABLE ${database}.${schema}.${table}
MODIFY COLUMN ${column}
SET MASKING POLICY pii_sha256_mask
`);
The masking policy itself is typically defined once and reused:
CREATE MASKING POLICY pii_sha256_mask AS (val STRING)
RETURNS STRING ->
CASE
WHEN CURRENT_ROLE() IN ('ADMIN', 'COMPLIANCE') THEN val
ELSE SHA2(val, 256)
END;
This architecture means PII Detective never builds its own masking layer—it orchestrates your existing platform security features. If you already have BigQuery policy tags or Snowflake masking policies configured, PII Detective just applies them systematically based on LLM suggestions and human validation. The cost savings are dramatic: analyzing metadata for 1,000 tables might consume 100K tokens (~$0.20 with GPT-4), while content-scanning those same tables could mean processing billions of tokens.
Gotcha
The metadata-only approach has an obvious Achilles heel: garbage column names. If your data warehouse is full of columns named field_1, data_col, attribute_42, or varchar_column, PII Detective will be nearly useless. The LLM cannot divine that field_7 contains social security numbers if the metadata provides zero semantic signal. This isn't a minor limitation—it's a fundamental dependency on good data hygiene.
You'll also hit walls around platform support and permissions. Only BigQuery and Snowflake are currently supported, which means Redshift, Databricks, Postgres, or MySQL users are out of luck without writing custom connectors. More problematically, PII Detective requires admin-level permissions to both discover schemas and apply masking policies. In many organizations, getting BigQuery Admin or Snowflake ACCOUNTADMIN privileges for an application is a non-starter, even if that application is managing PII protection. The security team will rightfully point out that if PII Detective itself is compromised, an attacker would have keys to the kingdom. You'll need to have serious conversations about service account permissions, secret management, and whether the application should run in a hardened environment with audit logging.
Verdict
Use PII Detective if you manage well-structured data warehouses on BigQuery or Snowflake where column naming conventions are consistent and semantic. The human-in-the-loop workflow makes it perfect for data governance teams who want AI assistance accelerating their review process without the risk of automated false positives breaking analytics workflows. It's especially valuable if you're shipping dozens of new tables per month and manual PII audits have become a bottleneck. The cost savings over content-scanning approaches make this a no-brainer for large-scale deployments where even moderate DLP API usage would run into thousands of dollars monthly.
Skip it if your data warehouse has poor naming conventions—you'll need content scanning regardless of cost. Also skip if you're on Redshift, Databricks, or any platform beyond BigQuery and Snowflake, or if your security policies prohibit granting the broad admin permissions this tool requires. Finally, reconsider if you need real-time PII detection for streaming data or user-generated content; PII Detective is designed for batch analysis of warehouse schemas, not live data pipelines. For those use cases, you're better off with traditional DLP tools or building custom detection into your ingestion layer.