Elasticdump: The Swiss Army Knife for Elasticsearch Migrations

Hook

Elasticdump is downloaded more than 100,000 times a month to solve a problem that Elasticsearch's own tooling should handle but doesn't, at least not without a shared filesystem or a cloud storage repository.

Context

Elasticsearch's official snapshot and restore API is powerful but constraining. It requires a shared filesystem across all nodes or expensive cloud storage configurations like S3 with specific IAM policies. For developers who need to copy an index between two clusters, export data to CSV for analysis, or simply back up a development environment to a local file, the official tooling feels like bringing a tank to a knife fight.

This gap created elasticdump. Originally built as a simple Node.js script to dump Elasticsearch indices to JSON files, it evolved into the de facto community standard for ad-hoc Elasticsearch and OpenSearch data movement. With nearly 8,000 GitHub stars and integration into countless CI/CD pipelines, elasticdump fills the void between "I need to move this data right now" and "I need to architect a proper snapshot repository." It's the tool you reach for when official solutions demand infrastructure you don't have or time you don't want to spend.

Technical Insight

Elasticdump's architecture reveals intelligent design choices for handling Elasticsearch's scale challenges. At its core, it's a streaming ETL pipeline built on Node.js that leverages Elasticsearch's scroll API for pagination. When you dump an index, elasticdump doesn't load everything into memory—it requests batches (default 100 documents) using scroll contexts, processes them, and writes to the output destination.
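
The batching loop can be pictured with a small in-memory simulation. This is only a sketch: scrollBatches and dumpToLines are hypothetical names, and a plain array stands in for the scroll context that a real cluster would hold open between requests.

```javascript
// Simulated scroll: yields one batch of at most `limit` documents at a time,
// the way successive scroll requests return successive pages.
function* scrollBatches(docs, limit = 100) {
  for (let offset = 0; offset < docs.length; offset += limit) {
    yield docs.slice(offset, offset + limit);
  }
}

// Passes every document to `writeLine`, working through one batch at a time.
function dumpToLines(docs, limit, writeLine) {
  let written = 0;
  for (const batch of scrollBatches(docs, limit)) {
    for (const doc of batch) {
      writeLine(JSON.stringify(doc)); // one JSON document per output line
      written += 1;
    }
  }
  return written;
}

// Example: 250 fake documents are read as batches of 100, 100, and 50.
const fakeDocs = Array.from({ length: 250 }, (_, i) => ({ _id: String(i) }));
const batchSizes = [...scrollBatches(fakeDocs, 100)].map((b) => b.length);
const lines = [];
const total = dumpToLines(fakeDocs, 100, (line) => lines.push(line));
```

The point of the generator is that the writer never needs more than one batch at once; raising the batch size trades memory for fewer round trips.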

Here's a basic migration between two clusters:

elasticdump \
  --input=http://production.es.local:9200/my-index \
  --output=http://staging.es.local:9200/my-index \
  --type=data

But the real intelligence emerges in how elasticdump separates concerns. Elasticsearch indices aren't just data—they're mappings, analyzers, aliases, and templates. Elasticdump treats these as discrete data types, requiring separate operations:

# First: Copy analyzer definitions (mappings may reference them)
elasticdump \
  --input=http://source:9200/my-index \
  --output=http://dest:9200/my-index \
  --type=analyzer

# Second: Copy the mapping schema
elasticdump \
  --input=http://source:9200/my-index \
  --output=http://dest:9200/my-index \
  --type=mapping

# Third: Copy the actual documents
elasticdump \
  --input=http://source:9200/my-index \
  --output=http://dest:9200/my-index \
  --type=data

# Finally: Copy aliases
elasticdump \
  --input=http://source:9200/my-index \
  --output=http://dest:9200/my-index \
  --type=alias

This separation prevents the common migration failure where data arrives before its mapping, causing dynamic mapping conflicts. It's tedious but architecturally sound.
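
For scripted migrations, the sequence of invocations can be generated rather than typed by hand. The helper below is a hypothetical sketch, not part of elasticdump; it places analyzers before mappings on the assumption that a mapping may reference custom analyzers, which therefore need to exist first.

```javascript
// Order matters: analysis settings first, then mappings that may reference
// them, then documents, then aliases that point at the finished index.
const TYPE_ORDER = ['analyzer', 'mapping', 'data', 'alias'];

// Builds one argument list per elasticdump invocation (hypothetical helper).
function migrationPlan(sourceUrl, destUrl, index) {
  return TYPE_ORDER.map((type) => [
    `--input=${sourceUrl}/${index}`,
    `--output=${destUrl}/${index}`,
    `--type=${type}`,
  ]);
}

const plan = migrationPlan('http://source:9200', 'http://dest:9200', 'my-index');
```

Each entry in plan could then be handed to child_process.spawnSync('elasticdump', args, { stdio: 'inherit' }), stopping at the first non-zero exit status so that documents are never copied after a failed mapping step.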

Version 6.1.0 introduced overlapping promise processing, a performance optimization that trades ordering guarantees for speed. Instead of waiting for batch N to complete before starting batch N+1, elasticdump now processes multiple batches concurrently:

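// Conceptual view of overlapping batch processing (not elasticdump's actual source)
const limit = options.limit || 100;            // documents per batch
const concurrency = options.concurrency || 1;  // batches in flight at once

// Pre-6.1.0: sequential, one batch at a time
for (const batch of batches) {
  await processBatch(batch);
}

// Post-6.1.0: up to `concurrency` batches in flight, completion order not guaranteed
const queue = [...batches];
const worker = async () => {
  while (queue.length > 0) {
    await processBatch(queue.shift());
  }
};
await Promise.all(Array.from({ length: concurrency }, worker));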

This parallelization can cut migration time by 70% on well-provisioned clusters, but it means document order at the destination is no longer guaranteed to match the source. For most use cases (Elasticsearch scores and sorts at query time anyway), this doesn't matter. For time-series data where insertion order affects internal shard structure, it might.
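
The ordering side effect is easy to reproduce outside Elasticsearch. The following sketch is illustrative only: runOverlapping and writeBatch are hypothetical names, and the delays are invented stand-ins for bulk requests of uneven latency.

```javascript
// Runs `task` over `items` with at most `concurrency` calls in flight and
// records the order in which items finish.
async function runOverlapping(items, concurrency, task) {
  const queue = items.map((item, index) => ({ item, index }));
  const finished = [];
  const worker = async () => {
    while (queue.length > 0) {
      const { item, index } = queue.shift();
      await task(item);
      finished.push(index);
    }
  };
  await Promise.all(Array.from({ length: concurrency }, worker));
  return finished;
}

// Simulated bulk write whose latency varies per batch (values are made up).
const delaysMs = [60, 10, 10, 10];
const writeBatch = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Both are promises for the completion order of the four batches.
const sequentialOrder = runOverlapping(delaysMs, 1, writeBatch);
const overlappedOrder = runOverlapping(delaysMs, 2, writeBatch);
```

With a concurrency of 1 the batches finish in submission order. With a concurrency of 2 the slow first batch is overtaken by the later ones, which is the same effect that reorders documents at the destination.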

Elasticdump's query filtering demonstrates another architectural strength—you can export subsets using Elasticsearch's Query DSL:

elasticdump \
  --input=http://source:9200/logs \
  --output=/backups/error-logs.json \
  --searchBody='{"query":{"term":{"level":"ERROR"}}}' \
  --type=data
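
Hand-written JSON inside shell quotes is a frequent source of errors, so when the export is driven from a script it can help to serialize the query programmatically. A minimal sketch, assuming a hypothetical helper named filteredExportArgs and reusing only the flags shown above:

```javascript
// Builds elasticdump arguments for a filtered export (hypothetical helper).
// JSON.stringify guarantees the --searchBody value is valid JSON.
function filteredExportArgs(inputUrl, outputPath, query) {
  return [
    `--input=${inputUrl}`,
    `--output=${outputPath}`,
    `--searchBody=${JSON.stringify({ query })}`,
    '--type=data',
  ];
}

const args = filteredExportArgs(
  'http://source:9200/logs',
  '/backups/error-logs.json',
  { term: { level: 'ERROR' } }
);
```

Passing the array directly to child_process.spawn avoids a shell altogether, so no quoting or escaping of the query is needed.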

This turns elasticdump into more than a backup tool—it's a data export pipeline. Combined with stdout/stdin support, you can pipe through jq for transformations, gzip for compression, or AWS CLI for direct S3 uploads:

elasticdump \
  --input=http://localhost:9200/my-index \
  --output=$ \
  --type=data \
  | gzip \
  | aws s3 cp - s3://my-bucket/backup-$(date +%Y%m%d).json.gz
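
Reading such a compressed dump back is straightforward in Node.js. The sketch below assumes each line of the dump is one JSON object with the document body under _source; the sample records are invented, and readGzippedDump is a hypothetical helper name.

```javascript
const zlib = require('zlib');

// Parses a gzipped, newline-delimited dump into an array of objects.
function readGzippedDump(gzBuffer) {
  return zlib
    .gunzipSync(gzBuffer)
    .toString('utf8')
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line));
}

// Invented sample records standing in for real dump output.
const sampleLines = [
  { _index: 'my-index', _id: '1', _source: { level: 'ERROR' } },
  { _index: 'my-index', _id: '2', _source: { level: 'INFO' } },
].map((doc) => JSON.stringify(doc)).join('\n') + '\n';

const records = readGzippedDump(zlib.gzipSync(Buffer.from(sampleLines)));
const errorIds = records
  .filter((r) => r._source.level === 'ERROR')
  .map((r) => r._id);
```

gunzipSync holds the whole file in memory, so it suits small dumps only; for large ones, zlib.createGunzip() combined with the readline module would process the file line by line.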

The Docker image (elasticdump/elasticsearch-dump) wraps all this functionality for containerized environments, making it trivial to add to Kubernetes CronJobs or CI/CD pipelines without installing Node.js dependencies.

One lesser-known feature is file splitting for massive indices. The --fileSize option automatically chunks output:

elasticdump \
  --input=http://source:9200/huge-index \
  --output=/backups/huge-index.json \
  --fileSize=1gb \
  --type=data

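This writes a series of numbered chunk files instead of one huge file (recent versions appear to name them along the lines of huge-index.split-0.json, huge-index.split-1.json, and so on, though the exact scheme may differ between releases), avoiding single-file size limits and enabling parallel restoration later.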
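
Conceptually, size-based splitting amounts to starting a new file whenever the next line would push the current one past the limit. The sketch below illustrates that idea and is not elasticdump's implementation; parseSize and splitBySize are hypothetical helpers, and the use of powers of 1024 for the unit suffixes is an assumption.

```javascript
// Converts strings such as "10mb" or "1gb" to bytes (hypothetical parser;
// handles only the b/kb/mb/gb/tb suffixes, using powers of 1024).
function parseSize(text) {
  const match = /^(\d+(?:\.\d+)?)(b|kb|mb|gb|tb)$/i.exec(text.trim());
  if (!match) throw new Error(`Unrecognized size: ${text}`);
  const units = { b: 0, kb: 1, mb: 2, gb: 3, tb: 4 };
  return Math.floor(Number(match[1]) * 1024 ** units[match[2].toLowerCase()]);
}

// Groups lines into chunks, opening a new chunk when the next line would
// push the current one past `maxBytes`.
function splitBySize(lines, maxBytes) {
  const chunks = [[]];
  let used = 0;
  for (const line of lines) {
    const size = Buffer.byteLength(line, 'utf8') + 1; // +1 for the newline
    if (used + size > maxBytes && used > 0) {
      chunks.push([]);
      used = 0;
    }
    chunks[chunks.length - 1].push(line);
    used += size;
  }
  return chunks;
}

// Three 5-byte lines (4 characters plus newline) under a 10-byte limit.
const chunked = splitBySize(['aaaa', 'bbbb', 'cccc'], parseSize('10b'));
```

A line larger than the limit still gets a chunk of its own here, since a single document cannot be split across files without breaking the one-document-per-line format.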

Gotcha

Elasticdump's biggest footgun is version churn. The project has gone through six major versions (1.x → 6.x), each with breaking changes. If you have backup scripts from 2018, they might fail silently or produce corrupted data with current versions. The 6.1.0 ordering change is particularly insidious—your migration completes successfully, but subtle data inconsistencies appear later in time-series workflows or when you depend on document _id patterns.

Performance on truly large datasets (100GB+ indices) exposes elasticdump's single-threaded Node.js limitations. While the scroll API is efficient, a Go-based alternative like esm can be 3-5x faster with lower memory overhead. Elasticsearch's native snapshot/restore also wins decisively at scale—it operates at the Lucene segment level, copying raw index files rather than re-indexing JSON documents. For production clusters with terabytes of data, elasticdump becomes the slow path.

The tool also doesn't handle authentication edge cases gracefully. Amazon OpenSearch Service (formerly Amazon Elasticsearch Service) with IAM roles, Elastic Cloud with API keys, and self-hosted clusters with client certificates all require different flag combinations. The documentation covers these, but error messages when you get it wrong are often opaque Node.js stack traces rather than actionable guidance. You'll spend time Googling "ECONNREFUSED" or "401 Unauthorized" to discover you needed --httpAuthFile instead of --headers.

Verdict

Use if: You need quick, flexible data movement between Elasticsearch/OpenSearch clusters without configuring snapshot repositories. Perfect for development workflows, CI/CD index seeding, exporting subsets to CSV for analysis, or one-off migrations where infrastructure setup time exceeds data transfer time. The Docker image makes it invaluable in containerized environments, and the query filtering turns it into a powerful data export pipeline. If your indices are under 50GB and you value scriptability over raw performance, elasticdump is the right tool.

Skip if: You're managing production clusters with hundreds of gigabytes of data—Elasticsearch's native snapshot/restore is orders of magnitude faster and more reliable. Also skip if you need guaranteed document ordering (stick to pre-6.1.0 versions or use the Reindex API), require transactional guarantees with resume capability, or if you're locked into old Node.js versions (pre-v10). For high-frequency automated backups, invest in proper snapshot repository infrastructure rather than scripting elasticdump cron jobs that will eventually fail silently and leave you without valid backups when you need them most.
