FOCA: Mining Organizational Secrets from Public Document Metadata

Hook

A single leaked PDF can expose your organization's internal network structure, software versions, and employee names—and search engines have indexed millions of them.

Context

Before FOCA emerged from Telefónica's ElevenPaths security research team, penetration testers manually downloaded documents from target organizations and ran standalone metadata extraction tools one file at a time. The process was tedious: craft search engine queries, download Office files and PDFs, extract metadata with separate utilities (EXIF readers for images, format-specific tools for documents), then manually correlate findings to build an organizational profile.

The insight that FOCA codified is that metadata isn't just technical trivia—it's intelligence. When Microsoft Word embeds the author's username, or when a PDF preserves the full Windows file path from the creator's machine, these artifacts reveal usernames, domain naming conventions, software versions, and internal directory structures. Multiply this across hundreds of publicly indexed documents, and you can fingerprint an organization's technology stack without ever touching their network. FOCA automated this entire reconnaissance workflow into a single application, turning what was a multi-day manual process into a push-button operation.

Technical Insight

[Figure: System architecture (auto-generated). Components: Search Engines (Google/Bing/DuckDuckGo), Discovery Module (Query Orchestration), Document Downloader (HTTP Retrieval), Metadata Analyzer (Format-Specific Extraction), SQL Server (Normalized Schema), Intelligence Engine (Pattern Recognition), Desktop UI (Results Dashboard). Data-flow labels: URLs by filetype, document URLs, raw files, users/software/paths, aggregated data, fingerprint insights, query results.]

FOCA's architecture follows a three-stage pipeline: discovery, acquisition, and extraction. The discovery phase queries multiple search engines simultaneously using provider-specific APIs and scraping techniques. For each target domain, it constructs specialized search queries to find document filetypes—the classic "site:example.com filetype:pdf" pattern, but orchestrated across Google, Bing, and DuckDuckGo in parallel to maximize coverage and work around individual rate limits.
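
To make the discovery step concrete, here is a minimal sketch of how the "site:... filetype:..." queries could be generated for one target domain. The class name `DorkBuilder`, its file type list, and the `q=` parameter name are assumptions made for illustration; they are not taken from FOCA's source, and each search provider expects its own request format.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DorkBuilder
{
    // Illustrative list only; not FOCA's actual set of supported extensions.
    public static readonly string[] DefaultFileTypes =
        { "pdf", "doc", "docx", "xls", "xlsx", "ppt", "pptx" };

    // Builds one "site:<domain> filetype:<ext>" query per distinct file type.
    public static List<string> BuildQueries(string domain, IEnumerable<string> fileTypes)
    {
        if (string.IsNullOrWhiteSpace(domain))
            throw new ArgumentException("Domain must not be empty.", nameof(domain));

        var cleanDomain = domain.Trim();
        return fileTypes
            .Select(t => t.Trim().TrimStart('.').ToLowerInvariant()) // ".DOCX" -> "docx"
            .Where(t => t.Length > 0)
            .Distinct()
            .Select(t => $"site:{cleanDomain} filetype:{t}")
            .ToList();
    }

    // URL-encodes a query for use as a request parameter.
    // The parameter name "q" is an assumption; providers differ.
    public static string ToQueryString(string query)
    {
        return "q=" + Uri.EscapeDataString(query);
    }
}
```

Calling `DorkBuilder.BuildQueries("example.com", DorkBuilder.DefaultFileTypes)` yields one query per file type, which an orchestrator could then hand to each search provider in turn.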

The application's core is built around a SQL Server backend that normalizes disparate metadata formats into a unified schema. This isn't just storage—it's the foundation for pattern recognition. When FOCA extracts metadata from a hundred PDFs and discovers that 80% were created by users following the pattern "firstname.lastname", that's actionable intelligence about the organization's email naming convention. The database schema includes tables for users, folders, printers, software versions, and passwords (discovered through pattern matching in document text), with foreign key relationships that enable correlation queries.
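
The naming-convention inference described above can be sketched in a few lines. This is an in-memory illustration, not FOCA's database logic: the class `NamingConventionGuesser` and its two patterns are hypothetical, and a real tool would need more patterns and a way to handle ambiguous names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class NamingConventionGuesser
{
    // Hypothetical pattern set; extend as needed.
    private static readonly KeyValuePair<string, Regex>[] Patterns =
    {
        new KeyValuePair<string, Regex>("firstname.lastname", new Regex(@"^[a-z]+\.[a-z]+$")),
        new KeyValuePair<string, Regex>("firstname_lastname", new Regex(@"^[a-z]+_[a-z]+$")),
    };

    // Returns the label of the first matching pattern, or "other".
    public static string Classify(string username)
    {
        var normalized = (username ?? string.Empty).Trim().ToLowerInvariant();
        foreach (var pattern in Patterns)
        {
            if (pattern.Value.IsMatch(normalized))
                return pattern.Key;
        }
        return "other";
    }

    // Returns the fraction (0..1) of usernames that fall under each label.
    public static Dictionary<string, double> Shares(IEnumerable<string> usernames)
    {
        var labels = usernames.Select(Classify).ToList();
        if (labels.Count == 0)
            return new Dictionary<string, double>();

        return labels
            .GroupBy(label => label)
            .ToDictionary(g => g.Key, g => (double)g.Count() / labels.Count);
    }
}
```

If most extracted usernames land under a single label, that label is a reasonable guess at the organization's account naming convention.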

Here's what the metadata extraction process looks like conceptually (simplified from FOCA's actual implementation):

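using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Packaging;              // Package, PackageProperties
using System.Linq;
using System.Text.RegularExpressions;
using iTextSharp.text.pdf;              // PdfReader (assumes iTextSharp 5.x)

// MetadataResult, ParsePdfDate, ExtractCustomProperties, ExtractExtendedProperty
// and ExtractEmailPatterns are illustrative helpers that are not shown here.
public class DocumentAnalyzer
{
    public MetadataResult ExtractMetadata(string filePath)
    {
        // Start with an empty list so formats without custom properties (e.g. PDF) are safe
        var result = new MetadataResult { InternalPaths = new List<string>() };
        var fileType = Path.GetExtension(filePath).ToLowerInvariant();

        switch (fileType)
        {
            case ".pdf":
                using (var reader = new PdfReader(filePath))
                {
                    // Every entry in the PDF Info dictionary is optional, so check each key
                    var info = reader.Info;
                    result.Author = info.ContainsKey("Author") ? info["Author"] : null;
                    result.Creator = info.ContainsKey("Creator") ? info["Creator"] : null;
                    result.Producer = info.ContainsKey("Producer") ? info["Producer"] : null;
                    if (info.ContainsKey("CreationDate"))
                        result.CreationDate = ParsePdfDate(info["CreationDate"]);
                }
                break;

            case ".docx":
            case ".xlsx":
            case ".pptx":
                // Office Open XML formats are ZIP archives
                using (var package = Package.Open(filePath, FileMode.Open, FileAccess.Read))
                {
                    var coreProps = package.PackageProperties;
                    result.Author = coreProps.Creator;
                    result.LastModifiedBy = coreProps.LastModifiedBy;
                    result.CreationDate = coreProps.Created;

                    // Company is not a core property (Category is unrelated); it lives in
                    // docProps/app.xml, so it is read through a hypothetical helper here
                    result.Company = ExtractExtendedProperty(package, "Company");

                    // Extract custom properties that often contain sensitive paths:
                    // local user profiles or UNC paths (\\server\share\...)
                    var customProps = ExtractCustomProperties(package);
                    result.InternalPaths = customProps
                        .Where(p => p.Value != null &&
                            (p.Value.IndexOf(@":\Users\", StringComparison.OrdinalIgnoreCase) >= 0 ||
                             p.Value.StartsWith(@"\\")))
                        .Select(p => p.Value)
                        .ToList();
                }
                break;
        }

        // Pattern matching for email addresses and usernames
        result.ExtractedEmails = ExtractEmailPatterns(result);
        result.ExtractedUsernames = InferUsernamesFromPaths(result.InternalPaths);

        return result;
    }

    private List<string> InferUsernamesFromPaths(List<string> paths)
    {
        // Extract usernames from paths like "C:\Users\john.doe\Documents"
        var regex = new Regex(@"[A-Za-z]:\\Users\\([^\\]+)", RegexOptions.IgnoreCase);
        return paths
            .Select(p => regex.Match(p))
            .Where(m => m.Success)
            .Select(m => m.Groups[1].Value)
            .Distinct()
            .ToList();
    }
}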
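
The snippet above calls a `ParsePdfDate` helper without showing it. PDF stores dates as strings of the form `D:YYYYMMDDHHmmSSOHH'mm'`, where everything after the year is optional. The following is one possible sketch of such a parser, not FOCA's own code; the class name `PdfDates` is hypothetical, and it returns a `DateTimeOffset?`, which may differ from what the original helper returns.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PdfDates
{
    // Groups: 1 year, 2 month, 3 day, 4 hour, 5 minute, 6 second,
    //         7 offset sign (Z, + or -), 8 offset hours, 9 offset minutes.
    private static readonly Regex Pattern = new Regex(
        @"^(?:D:)?([0-9]{4})([0-9]{2})?([0-9]{2})?([0-9]{2})?([0-9]{2})?([0-9]{2})?" +
        @"(?:([Zz+\-])(?:([0-9]{2})'?([0-9]{2})?'?)?)?$");

    // Returns null when the string is missing, malformed, or out of range.
    public static DateTimeOffset? Parse(string raw)
    {
        if (string.IsNullOrWhiteSpace(raw)) return null;

        var match = Pattern.Match(raw.Trim());
        if (!match.Success) return null;

        // Missing components fall back to the defaults given by the caller.
        int Part(int group, int fallback) =>
            match.Groups[group].Success ? int.Parse(match.Groups[group].Value) : fallback;

        var offset = TimeSpan.Zero;
        var sign = match.Groups[7].Value;
        if (sign == "+" || sign == "-")
        {
            offset = new TimeSpan(Part(8, 0), Part(9, 0), 0);
            if (sign == "-") offset = offset.Negate();
        }

        try
        {
            return new DateTimeOffset(
                Part(1, 1), Part(2, 1), Part(3, 1),
                Part(4, 0), Part(5, 0), Part(6, 0), offset);
        }
        catch (ArgumentOutOfRangeException)
        {
            return null; // e.g. month 13 or an offset beyond +/-14 hours
        }
    }
}
```

Keeping the time zone offset matters for reconnaissance: it can hint at the region where a document was produced.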

The extraction logic goes beyond simple metadata reading. FOCA performs heuristic analysis to extract network information—printer names embedded in document properties reveal naming conventions, embedded OLE objects can contain cached credentials, and application version strings inform vulnerability assessments. The tool even attempts to identify users, servers, and folders by parsing full file paths that Office applications sometimes embed in custom properties.
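
As a rough illustration of the path parsing described above, the sketch below pulls a server and share name out of UNC paths and a username out of Windows profile paths. The types `PathInsight` and `PathParser` are hypothetical names chosen for this example, and the two regular expressions cover only the most common path shapes.

```csharp
using System;
using System.Text.RegularExpressions;

public sealed class PathInsight
{
    public string Server { get; set; }    // from \\server\share\...
    public string Share { get; set; }
    public string Username { get; set; }  // from C:\Users\<name>\...
}

public static class PathParser
{
    // Matches the \\server\share prefix of a UNC path.
    private static readonly Regex Unc = new Regex(@"^\\\\([^\\]+)\\([^\\]+)");

    // Matches modern and legacy Windows profile folders.
    private static readonly Regex Profile = new Regex(
        @"^[A-Za-z]:\\(?:Users|Documents and Settings)\\([^\\]+)",
        RegexOptions.IgnoreCase);

    // Returns an insight object; fields stay null when nothing was recognized.
    public static PathInsight Parse(string path)
    {
        var insight = new PathInsight();
        if (string.IsNullOrWhiteSpace(path)) return insight;

        var trimmed = path.Trim();
        var unc = Unc.Match(trimmed);
        if (unc.Success)
        {
            insight.Server = unc.Groups[1].Value;
            insight.Share = unc.Groups[2].Value;
        }

        var profile = Profile.Match(trimmed);
        if (profile.Success)
        {
            insight.Username = profile.Groups[1].Value;
        }
        return insight;
    }
}
```

Collecting the distinct `Server` values across many documents gives a first list of internal hostnames without touching the target network.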

What makes FOCA's approach powerful is aggregation. A single document revealing "Adobe Acrobat 9.0" isn't particularly valuable, but when 200 documents show consistent use of outdated software versions, it suggests an organization-wide update lag that could indicate exploitable vulnerabilities. The SQL backend enables queries like "show me all unique usernames extracted from documents created in the last 6 months"—the kind of temporal analysis that transforms individual data points into organizational intelligence.
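
The two aggregations mentioned above (software counts and recently active usernames) can be approximated in memory with LINQ. This sketch stands in for the SQL queries, which the article does not show; `DocumentRecord` and `Aggregator` are hypothetical names, and the field names are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class DocumentRecord
{
    public string Author { get; set; }
    public string Producer { get; set; }   // e.g. "Adobe Acrobat 9.0"
    public DateTime? Created { get; set; }
}

public static class Aggregator
{
    // Counts documents per producer string, most common first.
    public static List<KeyValuePair<string, int>> SoftwareCounts(IEnumerable<DocumentRecord> docs)
    {
        return docs
            .Where(d => !string.IsNullOrWhiteSpace(d.Producer))
            .GroupBy(d => d.Producer.Trim(), StringComparer.OrdinalIgnoreCase)
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
            .OrderByDescending(kv => kv.Value)
            .ThenBy(kv => kv.Key, StringComparer.OrdinalIgnoreCase)
            .ToList();
    }

    // Unique authors of documents created within the last `months` months.
    public static List<string> RecentAuthors(IEnumerable<DocumentRecord> docs, DateTime now, int months = 6)
    {
        var cutoff = now.AddMonths(-months);
        return docs
            .Where(d => d.Created.HasValue && d.Created.Value >= cutoff)
            .Where(d => !string.IsNullOrWhiteSpace(d.Author))
            .Select(d => d.Author.Trim())
            .Distinct(StringComparer.OrdinalIgnoreCase)
            .OrderBy(a => a, StringComparer.OrdinalIgnoreCase)
            .ToList();
    }
}
```

Passing `now` as a parameter instead of reading the clock keeps the temporal filter deterministic and easy to test.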

The application also implements a DNS and network fingerprinting module that takes discovered metadata (like server names or email domains) and performs additional reconnaissance—reverse DNS lookups, Shodan queries, and network range identification. This secondary analysis layer attempts to map the digital footprint beyond just document metadata, though this component depends on third-party services that may no longer be accessible or configured.

Gotcha

FOCA's biggest limitation is its Windows-only architecture with heavyweight dependencies. You'll need .NET Framework 4.7.1, the Visual C++ 2010 redistributables, and a SQL Server instance (2014 or later). SQL Server Express may not be enough in some configurations, because its database size cap can be reached during large scans. This infrastructure requirement is substantial compared to modern Python-based OSINT tools that run anywhere with minimal setup.

The search engine dependency is increasingly problematic. Google has aggressively limited automated queries, Bing's API terms restrict usage, and DuckDuckGo doesn't officially support bulk scraping. What worked reliably when FOCA was actively developed now frequently hits rate limits or CAPTCHAs. The tool's effectiveness has degraded as search providers closed the doors on automated OSINT gathering. You'll find yourself manually supplementing FOCA's results or pre-collecting URLs through other means.

Maintenance appears stalled. The last significant commits were years ago, and modern document formats (especially cloud-native formats from Google Workspace or Microsoft 365) aren't fully supported. Organizations increasingly strip metadata during upload or use privacy-conscious document workflows, reducing the intelligence FOCA can gather. The tool represents a snapshot of mid-2010s OSINT methodology that hasn't evolved with contemporary document security practices.

Verdict

Use if: You're conducting authorized penetration tests on organizations with legacy document repositories, already have Windows infrastructure with SQL Server available, and need comprehensive metadata correlation beyond what command-line tools provide. FOCA excels at aggregating patterns across large document collections when you can work around search engine limitations.

Skip if: You need cross-platform support, want a maintained tool with modern format support, lack the SQL Server infrastructure, or are working in environments where search engines effectively block automated discovery. For most modern use cases, a combination of metagoofil for extraction and custom Python scripts for correlation will be more flexible and maintainable than FOCA's monolithic Windows approach. Also skip if you're not conducting explicitly authorized security assessments: metadata mining occupies legal gray areas that require proper authorization.
