repo2file: The Minimalist's Guide to Feeding Your Codebase to LLMs
Hook
Every developer has copy-pasted code into ChatGPT one file at a time. What if your entire repository—structure, context, and all—could be a single paste away?
Context
The rise of large language models like GPT-4 and Claude has fundamentally changed how developers debug, refactor, and architect code. These models excel at understanding codebases when given sufficient context, but there's a friction point: LLMs consume text through chat interfaces, not git repositories. The typical workflow involves manually copying files, explaining directory structures, and hoping you've included enough context for the model to understand your problem.
This creates a tedious ritual. You're debugging a React component, so you paste the component file. The LLM asks about your state management, so you paste your Redux store. It needs to see your API layer, so you copy three more files. Twenty minutes later, you've reconstructed half your repository in a chat window, poorly organized and missing critical context. repo2file was built to solve this exact problem: intelligently flatten a repository into a single, LLM-consumable artifact that preserves both structure and content, with zero configuration overhead.
Technical Insight
At its core, repo2file is a masterclass in doing one thing exceptionally well. The entire tool is a single Python script that leverages Python's pathlib for cross-platform filesystem traversal and fnmatch for .gitignore pattern matching. The architecture is deliberately simple: walk the directory tree, filter out ignored files, generate a visual tree structure, then concatenate file contents with clear delimiters.
The magic lies in its .gitignore handling. Rather than requiring users to manually exclude node_modules, .git, or build artifacts, repo2file parses your existing .gitignore file using shell-style wildcard patterns. This means if you're already practicing good version control hygiene, the tool automatically produces clean output without accidentally dumping 50,000 dependency files into your LLM context.
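To make the idea concrete, here is a minimal sketch of .gitignore-style filtering with fnmatch. It is not repo2file's actual code: the names load_patterns and is_ignored are hypothetical, and matching each path component is an assumption made for illustration. Note that fnmatch only approximates git's rules; it has no notion of negation (!) or root-anchored patterns.

```python
import fnmatch
from pathlib import PurePosixPath

def load_patterns(gitignore_text):
    """Return the non-empty, non-comment lines of a .gitignore as patterns."""
    patterns = []
    for line in gitignore_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line.rstrip("/"))  # "build/" -> "build"
    return patterns

def is_ignored(rel_path, patterns):
    """True if any path component, or the whole relative path, matches a pattern."""
    rel = PurePosixPath(rel_path)
    candidates = list(rel.parts) + [rel.as_posix()]
    return any(fnmatch.fnmatch(c, p) for c in candidates for p in patterns)

# Hypothetical .gitignore contents, for illustration only.
patterns = load_patterns("# deps\nnode_modules/\n*.pyc\n\ndist\n")
print(patterns)
print(is_ignored("node_modules/react/index.js", patterns))
print(is_ignored("src/app.py", patterns))
```

Checking every component is what lets a bare directory name like node_modules exclude everything beneath it.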
Here's the typical usage pattern:
python repo2file.py /path/to/your/project output.txt --extensions .py .js .jsx
This generates output structured in two sections. First, a tree visualization:
project/
├── src/
│   ├── components/
│   │   ├── Header.jsx
│   │   └── Footer.jsx
│   ├── utils/
│   │   └── api.js
│   └── app.py
└── README.md
Followed by delimited file contents:
=== src/components/Header.jsx ===
import React from 'react';

export default function Header() {
  return <header>My App</header>;
}
=== src/components/Footer.jsx ===
...
The extension filtering is particularly clever for polyglot projects. If you're debugging a Python backend issue, you can exclude frontend files entirely with --extensions .py, reducing token count and keeping the LLM focused. For full-stack context, omit the flag and capture everything.
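A command line like the one shown earlier could be wired up with argparse roughly as follows. This is a sketch under assumptions: build_parser and wanted are hypothetical names, and the real script may define its arguments differently.

```python
import argparse
from pathlib import Path

def build_parser():
    parser = argparse.ArgumentParser(description="Flatten a repo into one text file")
    parser.add_argument("repo", type=Path)
    parser.add_argument("output", type=Path)
    # Zero or more suffixes; omitting the flag leaves extensions as None.
    parser.add_argument("--extensions", nargs="*", default=None)
    return parser

def wanted(path, extensions):
    """Keep every file when no filter is given, else only matching suffixes."""
    return not extensions or Path(path).suffix in extensions

args = build_parser().parse_args(
    ["/path/to/your/project", "output.txt", "--extensions", ".py", ".js"]
)
print(args.extensions)
print(wanted("src/app.py", args.extensions))
print(wanted("src/components/Header.jsx", args.extensions))
print(wanted("README.md", None))
```

Treating a missing flag as "no filter" keeps the full-stack case a plain two-argument invocation.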
The tool's decision to generate both tree structure and file contents serves a specific purpose in LLM interactions. When you paste this output into Claude or ChatGPT, the model first sees the high-level architecture (the tree), then can reference specific files as it formulates responses. This mirrors how humans navigate codebases: understand the layout, then drill into specifics.
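For the curious, a tree like this can be rendered from a flat list of relative paths in a few lines. The sketch below is an illustration, not the tool's own renderer: render_tree is a hypothetical name, and the ordering (directories first, then files, each alphabetical) is an assumption.

```python
def render_tree(paths, root_name="project"):
    """Render relative POSIX-style file paths as a box-drawing tree."""
    nested = {}
    for p in paths:
        node = nested
        for part in p.split("/"):
            node = node.setdefault(part, {})

    lines = [root_name + "/"]

    def walk(node, prefix):
        # Directories (non-empty dicts) first, then files, each alphabetically.
        names = sorted(node, key=lambda n: (not node[n], n))
        for i, name in enumerate(names):
            last = i == len(names) - 1
            label = name + "/" if node[name] else name
            lines.append(prefix + ("└── " if last else "├── ") + label)
            walk(node[name], prefix + ("    " if last else "│   "))

    walk(nested, "")
    return "\n".join(lines)

print(render_tree([
    "src/components/Header.jsx",
    "src/components/Footer.jsx",
    "src/utils/api.js",
    "src/app.py",
    "README.md",
]))
```

Because only file paths go in, every leaf is a file; empty directories never appear in the rendered tree.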
One implementation detail worth noting: repo2file reads all files into memory before writing output. For a typical web application (50-200 source files), this is negligible. The script processes files in a single pass, making it fast enough that you won't notice the delay. Here's a simplified version of the core logic:
from pathlib import Path
import fnmatch

# parse_gitignore and format_tree_entry are helpers not shown in this simplified excerpt.

def should_ignore(path, gitignore_patterns):
    # Shell-style wildcard match against each .gitignore pattern.
    for pattern in gitignore_patterns:
        if fnmatch.fnmatch(str(path), pattern):
            return True
    return False

def dump_repo(root_path, extensions=None):
    root_path = Path(root_path)  # accept str or Path so the "/" join below works
    gitignore = parse_gitignore(root_path / '.gitignore')
    tree = []
    contents = []
    for path in root_path.rglob('*'):
        rel = path.relative_to(root_path)  # relative paths, as in the sample output
        if path.is_file() and not should_ignore(rel, gitignore):
            if extensions and path.suffix not in extensions:
                continue
            tree.append(format_tree_entry(rel))
            contents.append(f"=== {rel} ===\n{path.read_text()}\n")
    return '\n'.join(tree) + '\n\n' + '\n'.join(contents)
This straightforward approach means there's no complex state management, no async processing, no dependency injection—just pure, readable Python that does exactly what it promises.
The zero-dependency philosophy is intentional. Tools that require pip installing multiple packages create friction: virtual environment setup, dependency conflicts, version mismatches. By relying solely on Python's standard library, repo2file works on any machine with Python 3.6+ installed, from your laptop to a locked-down enterprise server to a minimal Docker container.
Gotcha
The elephant in the room is token counting. LLMs have context windows measured in tokens (roughly 4 characters per token for English text), and repo2file doesn't help you stay within those limits. GPT-4 Turbo supports 128k tokens and Claude 3 supports 200k, but a moderately sized codebase can easily exceed both. You're flying blind: the tool will happily generate a 500k-token file that neither model can accept, and you won't know until you try to paste it and hit an error.
This becomes particularly painful with monorepos or projects with extensive documentation. A Next.js application with comprehensive JSDoc comments, markdown docs, and test files might generate 300k+ tokens even after .gitignore filtering. Your only recourse is to add exclusion patterns manually or narrow the --extensions filter, which means learning your repository's token footprint through trial and error.
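Until the tool grows a token counter, a rough pre-flight check is easy to bolt on yourself. The sketch below uses the 4-characters-per-token rule of thumb, which is only an approximation (real tokenizers vary by model and content). The function names and the 25% reserve for the conversation itself are assumptions made for illustration.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate from character count (rule of thumb only)."""
    return -(-len(text) // chars_per_token)  # ceiling division

def check_budget(text, context_window, reserve=0.25):
    """Return (estimated_tokens, fits), keeping a share of the window free for the chat."""
    tokens = estimate_tokens(text)
    return tokens, tokens <= context_window * (1 - reserve)

dump = "x" * 600_000  # stand-in for the contents of output.txt
print(check_budget(dump, 128_000))
print(check_budget(dump, 200_000))
```

For an accurate count you would swap estimate_tokens for the tokenizer that matches your target model.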
The extension filtering, while useful, is also quite blunt. It selects files by suffix only, so it can't single out test files: a file such as api.test.js carries the same .js suffix as the code it tests. You can't prioritize recently changed files or focus on a specific module. If you need "only the authentication system" from a large codebase, you'll need to either run repo2file on a subdirectory or manually edit the output file, and neither is elegant.
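If you are willing to patch the script or wrap it, a name-based filter closes most of this gap. The following is a hypothetical sketch, not a repo2file feature: the keep function, the only_under parameter, and the pattern list are all assumptions you would adapt to your own conventions.

```python
import fnmatch
from pathlib import PurePosixPath

# Hypothetical patterns; adjust to your project's testing conventions.
EXCLUDE = ["*.test.js", "*.spec.ts", "test_*.py", "__tests__"]

def keep(rel_path, only_under=None, exclude=EXCLUDE):
    """Keep a file unless a path component matches an exclude pattern
    or the file lies outside the optional only_under directory."""
    rel = PurePosixPath(rel_path)
    if only_under and PurePosixPath(only_under) not in rel.parents:
        return False
    return not any(fnmatch.fnmatch(part, pat) for part in rel.parts for pat in exclude)

files = [
    "src/auth/login.js",
    "src/auth/login.test.js",
    "src/cart/cart.js",
    "src/auth/__tests__/helpers.js",
]
print([f for f in files if keep(f, only_under="src/auth")])
```

Matching on the full file name rather than the suffix is what lets a pattern like *.test.js separate tests from the code they cover.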
Finally, binary file handling could be smarter. The tool attempts to read everything as text, which means encountering a stray image or compiled binary will either crash the script or inject garbage into your output. While .gitignore typically prevents this, repositories that version control assets or builds might hit this edge case.
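A common guard against this is to sniff each file before including it. The sketch below is one way it could look, under stated assumptions: read_text_or_none is a hypothetical helper, the NUL-byte check is a heuristic rather than a guarantee, and UTF-8 is assumed as the source encoding.

```python
from pathlib import Path
import tempfile

def read_text_or_none(path, sniff_bytes=8192):
    """Return the file's text, or None if it looks binary or is not valid UTF-8."""
    data = Path(path).read_bytes()
    if b"\x00" in data[:sniff_bytes]:  # NUL bytes rarely appear in text files
        return None
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None

# Demonstration with throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    text_file = Path(tmp, "app.py")
    text_file.write_text("print('hi')\n", encoding="utf-8")
    image_file = Path(tmp, "logo.png")
    image_file.write_bytes(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR")
    print(read_text_or_none(text_file))
    print(read_text_or_none(image_file))
```

Files that come back as None can be skipped or listed in the tree without their contents, so one stray asset no longer aborts the whole dump.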
Verdict
Use if: You're working on small-to-medium codebases (under 50k lines of code) and need to quickly share context with Claude, ChatGPT, or similar LLMs for debugging, code review, or architectural discussions. It's perfect for side projects, microservices, or isolated modules where you want zero setup friction. If you're already disciplined about .gitignore hygiene, this tool will feel like magic: run once, paste into your LLM, and get meaningful feedback about your entire codebase.
Skip if: You're dealing with large monorepos, need token budget management, or require sophisticated filtering beyond file extensions. For those scenarios, look at gpt-repository-loader (includes token counting) or repomix (supports multiple output formats and chunking strategies). Also skip if you need ongoing LLM integration. This is a one-shot tool, not a framework for building RAG systems or automated code analysis pipelines. It solves the manual context-sharing problem beautifully but won't help with programmatic LLM workflows.