repo2txt: The Dead-Simple Tool for Feeding Entire Codebases to GPT

Hook

Every developer working with GPT-4 has hit the same wall: you want to ask a question about your entire codebase, but manually copying files into a prompt is a tedious nightmare. What if you could flatten any repository into a single, LLM-ready text file in one command?

Context

The rise of large language models with expanding context windows has created a new workflow problem. GPT-4 Turbo accepts up to 128K tokens, Claude can process entire novels, but getting your codebase into these models remains surprisingly manual. You could copy-paste files individually, lose track of what you've included, and waste time formatting. Or you could write a bash one-liner with find and cat, but then you're including node_modules, .git directories, and binary files that bloat the output.

repo2txt emerged to solve this specific friction point. It's not trying to be a comprehensive code analysis platform or a fancy documentation generator. Instead, it does one thing: traverse a repository, respect sensible ignore patterns, and concatenate everything into a single file that you can immediately paste into ChatGPT or Claude. The tool gained traction because it's exactly as complex as it needs to be—no authentication flows, no cloud services, no configuration hell. Just point it at a directory and get a text file.

Technical Insight

The architecture of repo2txt is refreshingly straightforward, which is precisely why it works. At its core, it's a directory walker with a filtering layer and a concatenation output stage. The tool starts by loading ignore patterns from a config.json file that ships with sensible defaults: common directories like node_modules, .git, __pycache__, and file extensions like .pyc, .jpg, .exe.
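
To make the "walker plus filter" idea concrete, here is a minimal sketch of such a traversal. It is an illustration under assumptions, not repo2txt's actual source: the `collect_files` helper and the two constants are hypothetical names, and pruning directories through `os.walk` is just one reasonable way to do it.

```python
import os
from pathlib import Path

# Assumed defaults, taken from the examples named in the text above.
IGNORE_DIRS = {"node_modules", ".git", "__pycache__"}
IGNORE_EXTENSIONS = (".pyc", ".jpg", ".exe")

def collect_files(root: str) -> list[Path]:
    """Walk root and return the files that survive the ignore filters."""
    kept = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune ignored directories in place so os.walk never descends into them.
        dirnames[:] = sorted(d for d in dirnames if d not in IGNORE_DIRS)
        for name in sorted(filenames):
            if not name.endswith(IGNORE_EXTENSIONS):
                kept.append(Path(dirpath) / name)
    return kept
```

Pruning `dirnames` in place matters for speed: a large node_modules tree is skipped entirely instead of being walked and then discarded.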

Here's how you use it in practice:

# Install via pip
pip install repo2txt

# Basic usage - creates output.txt in current directory
repo2txt /path/to/your/project

# Specify custom output file
repo2txt /path/to/your/project -o analysis.txt

# Generate a Word document instead
repo2txt /path/to/your/project -o output.docx

The filtering system is where repo2txt shines in its simplicity. It uses three layers of ignore patterns: directory names (like 'node_modules'), file extensions (like '.min.js'), and specific filenames (like '.DS_Store'). The config.json file is just a JSON object with arrays for each category, making it trivial to customize. Want to exclude test files? Add 'test' to the ignore_dirs array or '.test.js' to ignore_extensions.
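
The three layers can be sketched as a single predicate. This is a hedged illustration rather than the tool's real code: `ignore_dirs` and `ignore_extensions` are the key names mentioned above, while `ignore_files` and the `should_ignore` helper are assumptions made for the example.

```python
import json
from pathlib import Path

# Hypothetical config mirroring the three categories described above.
# "ignore_files" is an assumed key name; the article only names the other two.
CONFIG_JSON = """
{
  "ignore_dirs": ["node_modules", ".git", "__pycache__"],
  "ignore_extensions": [".pyc", ".jpg", ".exe", ".min.js"],
  "ignore_files": [".DS_Store"]
}
"""

def should_ignore(path: Path, config: dict) -> bool:
    """Return True if any of the three ignore layers matches the path."""
    # Layer 1: any parent directory whose name is in ignore_dirs.
    if any(part in config["ignore_dirs"] for part in path.parts[:-1]):
        return True
    # Layer 2: exact filename match.
    if path.name in config["ignore_files"]:
        return True
    # Layer 3: suffix match; endswith handles multi-part
    # extensions such as ".min.js".
    return path.name.endswith(tuple(config["ignore_extensions"]))

config = json.loads(CONFIG_JSON)
```

Matching with `endswith` rather than `Path.suffix` is a deliberate choice here, since `Path("app.min.js").suffix` is only `.js`.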

The output format is deliberately basic but functional. First, it generates a tree structure showing the directory hierarchy of included files. Then it iterates through each file, adding a header with the file path and dumping the raw contents. The separator between files is just a line of dashes, making it easy for both humans and LLMs to parse:

project/
├── src/
│   ├── main.py
│   └── utils.py
└── README.md

--- project/README.md ---
# My Project
This is the readme content...

--- project/src/main.py ---
import utils

def main():
    print("Hello world")
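
The per-file part of that format is simple enough to sketch in a few lines. The `render_sections` helper below is a hypothetical name, and the exact header string is inferred from the sample above rather than taken from the tool's source; the tree at the top is left out for brevity.

```python
def render_sections(files: dict[str, str]) -> str:
    """Join files into one string, each preceded by a '--- path ---' header."""
    chunks = []
    # Alphabetical order keeps the output deterministic between runs.
    for path in sorted(files):
        chunks.append(f"--- {path} ---\n{files[path].rstrip()}\n")
    return "\n".join(chunks)

sample = render_sections({
    "project/src/main.py": "import utils\n",
    "project/README.md": "# My Project\n",
})
print(sample)
```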

The implementation uses Python's pathlib for cross-platform path handling and the standard library's file I/O. For DOCX generation, it optionally imports python-docx, but only when needed—if you just want text output, you don't need any external dependencies. This design decision keeps the tool lightweight and reduces installation friction.
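
The optional-dependency pattern described here usually looks something like the sketch below. The `write_output` function is a hypothetical stand-in, not the tool's actual function; the `python-docx` calls (`Document`, `add_paragraph`, `save`) are that library's real API.

```python
from pathlib import Path

def write_output(text: str, out_path: str) -> str:
    """Write text to out_path; return the format used ('docx' or 'txt')."""
    target = Path(out_path)
    if target.suffix.lower() == ".docx":
        # Deferred import: python-docx is only required on this branch.
        try:
            from docx import Document
        except ImportError as exc:
            raise RuntimeError(
                "DOCX output needs python-docx: pip install python-docx"
            ) from exc
        doc = Document()
        for line in text.splitlines():
            doc.add_paragraph(line)
        doc.save(str(target))
        return "docx"
    # Plain-text path uses only the standard library.
    target.write_text(text, encoding="utf-8")
    return "txt"
```

Because the import sits inside the DOCX branch, a user who only ever writes .txt files never triggers it.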

One clever detail is how it handles encoding. The tool attempts to read files as UTF-8 and silently skips files that fail to decode. This prevents crashes on binary files that slip through the ignore filters, though it means you won't get error messages about problematic files. It's a pragmatic trade-off: fail gracefully rather than requiring perfect ignore configuration.
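
A minimal sketch of that "try UTF-8, skip on failure" behavior, assuming a hypothetical helper named `read_text_or_skip`:

```python
from pathlib import Path
from typing import Optional

def read_text_or_skip(path: Path) -> Optional[str]:
    """Return the file's text, or None if it is not valid UTF-8."""
    try:
        return path.read_text(encoding="utf-8")
    except UnicodeDecodeError:
        # Silent skip: the caller simply omits this file from the output.
        return None
```

Returning `None` instead of raising is what makes the skip silent; a stricter variant could log the path before moving on.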

The entire process is single-threaded and synchronous, which is actually fine for its use case. You're not running this in a CI/CD pipeline or processing thousands of repositories. You're running it once on your local project before starting a ChatGPT session. The performance is dominated by disk I/O anyway, and parallelizing file reads on a single disk rarely helps.

Gotcha

The simplicity that makes repo2txt elegant also creates boundaries. The most obvious limitation is binary file handling. If a .png or .woff2 file makes it past your ignore filters, it is either silently skipped (when UTF-8 decoding fails) or dumped into the output as garbage characters (when the bytes happen to decode). There's no magic detection that says "this is binary, let me note that in the output instead." You need to maintain your ignore patterns carefully.
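
For comparison, this is roughly what such a detection step could look like if you added it in your own wrapper. It is not part of repo2txt; the NUL-byte check is a common heuristic rather than a guarantee, and both function names are hypothetical.

```python
def looks_binary(data: bytes, sample_size: int = 8192) -> bool:
    """Heuristic: treat data as binary if a NUL byte appears in the first chunk."""
    return b"\x00" in data[:sample_size]

def section_body(name: str, data: bytes) -> str:
    """Return file text, or a short note when the file looks binary or is not UTF-8."""
    if looks_binary(data):
        return f"[binary file omitted: {name}, {len(data)} bytes]"
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return f"[undecodable file omitted: {name}, {len(data)} bytes]"
```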

Large repositories expose the lack of optimization. If you point repo2txt at a 10,000-file monorepo, even with good ignore patterns, you might generate a 50MB text file that exceeds every LLM's context window. The tool doesn't warn you, doesn't chunk the output, and doesn't help you understand token counts. You'll only discover the problem when you try to paste into ChatGPT and hit the limit. More sophisticated tools like repopack include token counting and can split outputs intelligently.
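
Until you reach for one of those tools, a rough size check is easy to bolt on yourself. The sketch below uses the common rule of thumb of about four characters per token for English text; that ratio is an approximation that varies by model and by language, and the function names and the 128K default are assumptions for illustration.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate based on character count."""
    return int(len(text) / chars_per_token)

def fits_context(text: str, context_tokens: int = 128_000) -> bool:
    """Return True if the estimated token count fits the given window."""
    return estimate_tokens(text) <= context_tokens
```

By this estimate a 50MB file is on the order of 12 million tokens, roughly a hundred times a 128K window, which is why the problem only shows up at paste time.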

The output format is also non-negotiable. You can't specify a custom template, can't group files by type, can't add syntax highlighting markers, and can't control the order of files beyond alphabetical traversal. If you want to add prompt engineering instructions like "Focus on the src/ directory" or include a table of contents with file summaries, you'll need to manually edit the output file or wrap repo2txt in your own scripts.
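
Wrapping the tool can be as small as a post-processing step on the generated file. This sketch assumes the output file already exists and makes no claims about repo2txt's internals; `prepend_instructions` is a hypothetical helper.

```python
from pathlib import Path

def prepend_instructions(output_file: str, instructions: str) -> None:
    """Rewrite output_file so it starts with the given prompt instructions."""
    path = Path(output_file)
    original = path.read_text(encoding="utf-8")
    # A blank line separates the instructions from the generated dump.
    path.write_text(f"{instructions.strip()}\n\n{original}", encoding="utf-8")
```

For example, calling `prepend_instructions("output.txt", "Focus on the src/ directory.")` after a run puts the instruction at the top of the file before you paste it.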

Verdict

Use if: You need to quickly package a small-to-medium codebase (under 5,000 files) for LLM analysis, you want zero configuration complexity, you're working with standard text-based code repositories, or you need a simple tool you can audit in 10 minutes and trust with proprietary code. It's perfect for asking ChatGPT to review your weekend project, explain an unfamiliar codebase, or suggest refactoring opportunities.

Skip if: You're working with massive monorepos that need intelligent chunking, you need binary file handling or encoding detection, you want advanced features like token counting or custom output templates, or you're building an automated pipeline that requires robust error handling and progress reporting. For those cases, look at repopack or build a custom solution with proper streaming and chunking logic.