
repo2txt: The Brutally Simple Tool for Feeding Your Codebase to ChatGPT

Hook

Every developer who’s tried to ask ChatGPT to review their codebase has faced the same annoying copy-paste ritual: manually combining dozens of files into a single prompt. repo2txt eliminates that tedious workflow in one command.

Context

The explosion of large language models has created a new developer workflow: treating AI as a pair programmer, code reviewer, or documentation generator. But there’s a fundamental mismatch between how we organize code (distributed across hundreds of files and directories) and how LLMs consume it (as monolithic text blocks). Early adopters resorted to manual copy-pasting, shell script hacks with find and cat, or writing custom Python scripts to concatenate files.

repo2txt emerged as a dedicated solution to this specific problem. It’s not trying to be a documentation generator, build tool, or comprehensive codebase analyzer. Instead, it does exactly one thing: recursively walk a directory tree, filter out noise, and produce a single text or Word document containing your entire codebase with clear file delineations and directory structure context. Think of it as a specialized serialization tool optimized for the LLM use case, where you need to package code context efficiently while staying under token limits.

Technical Insight

[System architecture (auto-generated diagram): CLI arguments and a config.json file feed the filtering engine, which applies its layers to excluded paths, file paths, and file contents + paths. An os.walk-based directory traversal drives a file content reader and a tree structure generator. Depending on the --txt or --docx flag, a plain text writer or a python-docx Word document writer produces repo2txt_output.txt or repo2txt_output.docx.]

Under the hood, repo2txt is refreshingly straightforward—about 300 lines of Python that prioritize pragmatism over architectural complexity. The core architecture revolves around three components: a recursive directory traversal engine, a multi-layered filtering system, and dual-format output generators.

The filtering mechanism deserves attention because it’s where most of the tool’s flexibility lives. repo2txt implements a cascading exclusion system that checks each file against four separate filter layers. First, it reads a config.json file at the repository root that can specify ignore patterns:

{
  "ignore_files": ["package-lock.json", ".env"],
  "ignore_dirs": ["node_modules", ".git", "dist"],
  "ignore_extensions": [".pyc", ".jpg", ".png"],
  "ignore_settings_files": true
}

Second, it respects command-line arguments that override or supplement the config. Third, it has hardcoded sensible defaults for common noise (like .DS_Store or Thumbs.db). Finally, it can optionally skip configuration files entirely with the ignore_settings_files flag. This layered approach means you can create project-specific configs while still having global overrides for one-off exports.
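The cascade can be sketched roughly like this (function and field names are illustrative, not repo2txt's actual internals):

```python
import os

# Hardcoded noise that is always skipped (illustrative subset)
DEFAULT_IGNORED = {".DS_Store", "Thumbs.db"}

def should_ignore(filename, config, cli_args):
    """Apply the filter layers in order; any hit excludes the file."""
    ext = os.path.splitext(filename)[1]
    if filename in DEFAULT_IGNORED:                       # hardcoded defaults
        return True
    if filename in config.get("ignore_files", []):        # config.json layer
        return True
    if ext in config.get("ignore_extensions", []):
        return True
    if ext in cli_args.get("ignore_ext", []):             # CLI override layer
        return True
    if config.get("ignore_settings_files") and filename == "config.json":
        return True                                       # settings-file flag
    return False
```

The order of checks barely matters here since any single hit excludes the file; what matters is that config, CLI, and defaults all feed the same decision.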

The directory traversal uses Python’s os.walk() with a pre-filtering step that eliminates ignored directories before recursion, avoiding the performance hit of descending into massive node_modules folders. For each included file, the tool reads contents and wraps them in clearly delineated sections:

def format_file_content(filepath, relative_path, content):
    separator = "=" * 80
    return f"{separator}\nFile: {relative_path}\n{separator}\n{content}\n\n"
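The pre-filtering trick the traversal relies on is mutating os.walk's directory list in place, so ignored trees are never entered at all. A minimal sketch (ignore_dirs stands in for the merged exclusion set):

```python
import os

def walk_repo(root, ignore_dirs):
    """Yield file paths, pruning ignored directories before descent."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Assigning to dirnames[:] tells os.walk not to recurse into these
        dirnames[:] = [d for d in dirnames if d not in ignore_dirs]
        for name in filenames:
            yield os.path.join(dirpath, name)
```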

What’s clever here is the inclusion of a visual directory tree at the beginning of the output file. This gives LLMs structural context about the codebase organization before diving into file contents—critical for understanding imports, module relationships, and architectural patterns. The tree generation uses a simple recursive algorithm that builds an ASCII representation similar to the tree command.
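A tree generator along those lines fits in a dozen lines; this is a sketch of the approach, not repo2txt's exact code:

```python
import os

def build_tree(root, prefix=""):
    """Return a tree-command-style ASCII listing of a directory."""
    lines = []
    entries = sorted(os.listdir(root))
    for i, name in enumerate(entries):
        last = i == len(entries) - 1
        connector = "└── " if last else "├── "
        lines.append(prefix + connector + name)
        path = os.path.join(root, name)
        if os.path.isdir(path):
            # Continue the vertical rule only if siblings follow
            extension = "    " if last else "│   "
            lines.extend(build_tree(path, prefix + extension))
    return lines
```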

For Word document output, repo2txt leverages the python-docx library to create structured documents with styled headings and code blocks. Each file becomes a distinct section with heading styles, making the DOCX output more navigable than a raw text dump:

from docx import Document
from docx.shared import Pt
from docx.enum.style import WD_STYLE_TYPE

doc = Document()
doc.add_heading('Repository Contents', 0)

# The default template ships no 'Code' style, so create a monospace one
code_style = doc.styles.add_style('Code', WD_STYLE_TYPE.PARAGRAPH)
code_style.font.name = 'Courier New'
code_style.font.size = Pt(9)

for file_path, content in files:  # files: list of (path, content) pairs
    doc.add_heading(file_path, level=2)
    code_paragraph = doc.add_paragraph(content)
    code_paragraph.style = doc.styles['Code']

This dual-format approach recognizes different use cases: text files for direct LLM API calls or terminal-based workflows, and DOCX for human review, annotation, or presentation contexts where formatting matters.

The tool’s output can be piped directly into LLM prompts or saved for repeated use. A typical workflow looks like this:

# Install
pip install repo2txt

# Generate from current directory
repo2txt -o codebase.txt

# Generate with custom filters
repo2txt -o output.docx --ignore-dirs build,dist --ignore-ext .log,.tmp

# Then feed to an LLM
cat codebase.txt | pbcopy  # Copy to clipboard on macOS

The simplicity is both a strength and weakness. There’s no incremental processing, streaming, or chunking—it reads everything into memory, formats it, and writes it out. For small-to-medium projects (under 10,000 lines), this is perfectly fine and completes in seconds. For larger codebases, you’ll quickly discover the limitations.

Gotcha

The most glaring limitation is the complete absence of size awareness. repo2txt will happily try to concatenate a 500MB repository into a single text file, consume all your RAM, and either crash or produce an output file that exceeds every LLM’s context window by orders of magnitude. There’s no token counting, no file size limits, no warnings when you’re about to create a 100MB text file that no model can actually process. You’re flying blind until you hit the wall.
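Until the tool grows such guards, a pre-flight check is easy to bolt on yourself. A rough sketch using the common ~4-characters-per-token heuristic (both constants are assumptions; tune them to your model, or swap in a real tokenizer):

```python
import os

CHARS_PER_TOKEN = 4          # rough heuristic, not an exact tokenizer
CONTEXT_BUDGET = 128_000     # assumed context window; adjust per model

def check_output_size(path):
    """Estimate token count of an export and warn if it likely overflows."""
    size = os.path.getsize(path)
    approx_tokens = size // CHARS_PER_TOKEN
    if approx_tokens > CONTEXT_BUDGET:
        print(f"Warning: ~{approx_tokens:,} tokens exceeds the budget "
              f"of {CONTEXT_BUDGET:,}")
    return approx_tokens
```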

Binary file handling is similarly naive. The tool doesn’t detect or skip binary files—it attempts to read them as text, which produces garbled output and corrupts the final document. A repository containing images, PDFs, or compiled binaries will generate unusable output unless you meticulously configure exclusion patterns for every binary type. There’s no magic number detection or MIME type checking.
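A common workaround is the null-byte heuristic git itself uses: if the first few kilobytes contain a NUL byte, treat the file as binary and skip it. A sketch:

```python
def looks_binary(path, sample_size=8192):
    """Heuristic binary check: a NUL byte in the first chunk means binary."""
    with open(path, "rb") as f:
        return b"\x00" in f.read(sample_size)
```

It is not foolproof (UTF-16 text trips it, some binaries lack early NULs), but it catches images, PDFs, and compiled artifacts without maintaining extension lists.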

The filtering system, while flexible, isn’t git-aware. It won’t automatically respect your .gitignore file, meaning you need to manually duplicate those exclusion rules in config.json. This creates a maintenance burden and potential drift between what’s version-controlled and what gets exported. Symbolic links aren’t handled specially either—following them could create duplicate content or even infinite loops in pathological cases.
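One stopgap is generating the config from .gitignore. The sketch below handles only the simple cases—bare names, trailing-slash directories, and `*.ext` patterns—and deliberately skips negations and nested globs that config.json cannot express:

```python
def gitignore_to_config(gitignore_text):
    """Naive .gitignore -> config.json mapping for simple patterns only."""
    ignore_dirs, ignore_files, ignore_exts = [], [], []
    for line in gitignore_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line.startswith("!"):
            continue  # skip blanks, comments, and negations we can't express
        if line.endswith("/"):
            ignore_dirs.append(line.rstrip("/"))
        elif line.startswith("*."):
            ignore_exts.append(line[1:])          # "*.log" -> ".log"
        else:
            ignore_files.append(line)
    return {"ignore_dirs": ignore_dirs,
            "ignore_files": ignore_files,
            "ignore_extensions": ignore_exts}
```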

Finally, there’s zero consideration for sensitive data. The tool will blindly include .env files, private keys, API tokens, or any other secrets unless you explicitly exclude them. There’s no warning system, no pattern matching for common secret formats, and no integration with tools like git-secrets or truffleHog. If you’re sharing the output with an LLM API, you might accidentally leak credentials.
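A lightweight pre-share scan for a few well-known token shapes catches the worst offenders. The pattern list here is a small illustrative subset—real scanners like truffleHog ship hundreds of rules:

```python
import re

# Illustrative subset of secret patterns; nowhere near exhaustive
SECRET_PATTERNS = {
    "AWS access key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "GitHub token": re.compile(r"ghp_[A-Za-z0-9]{36}"),
    "Private key header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}

def scan_for_secrets(text):
    """Return the names of any secret patterns found in the text."""
    return [name for name, pat in SECRET_PATTERNS.items() if pat.search(text)]
```

Running it over the generated output before pasting into an LLM prompt is cheap insurance, though it is no substitute for excluding .env files in the first place.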

Verdict

Use if: You’re working with small-to-medium codebases (under 5,000 files or 50MB) that you want to quickly package for one-off LLM analysis, code reviews, or documentation generation. It’s perfect for side projects, interview take-homes, or educational repositories where you need a no-setup solution that just works. The dual-format output is genuinely useful if you’re bouncing between human review and LLM consumption.

Skip if: You’re dealing with large monorepos, need token budget management, require git-aware filtering, or want safety guards against accidentally including secrets. For production workflows, consider repomix (which adds token counting and security scanning) or gpt-repository-loader (which handles chunking strategies for large codebases). If you’re building this into CI/CD pipelines or need programmatic control, the LangChain document loaders offer better integration points. repo2txt is a sharp knife for specific jobs, not a Swiss Army chainsaw.
