Building a YouTube Video Summarizer with LlamaIndex and OpenAI: A Code Walkthrough

Hook

The average YouTube video reportedly runs about 11.7 minutes, yet many developers watch at 2x speed and still wish they could skip straight to the insights. What if you could compress any video into a structured summary with no playback required?

Context

YouTube has become a de facto knowledge base for many developers, with tutorials, conference talks, and technical deep-dives uploaded constantly. But video is an inefficient medium for information retrieval: you can't skim a 45-minute conference talk the way you'd scan a blog post, and searching within video content remains clumsy despite YouTube's best efforts.

The yt-sum project tackles this friction head-on by treating video transcripts as queryable documents. Created by Efrain Hernandez-Mendoza (yencarnacion), this Python tool downloads YouTube transcripts, feeds them through OpenAI's GPT models via LlamaIndex, and produces both automated summaries and an interactive REPL for asking follow-up questions. It's a straightforward solution that sidesteps the complexity of audio processing by relying on YouTube's existing caption infrastructure, then applies modern LLM techniques to extract insights.

Technical Insight

The architecture is refreshingly simple: a bash script orchestrates the workflow, calling Python scripts that handle transcript extraction and LLM interactions. The main entry point, go.sh, takes a YouTube URL and produces an HTML summary file. Under the hood, it uses the youtube_transcript_api library to pull captions, then sends them to OpenAI through LlamaIndex's abstraction layer.
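
To make the hand-off from captions to the LLM concrete, here is a minimal sketch of the flattening step that would sit between the download and the model call. It assumes each caption segment is a dict with `text`, `start`, and `duration` keys, which is the shape youtube_transcript_api has historically returned. `segments_to_text` and `_format` are hypothetical helper names for illustration, not functions from yt-sum.

```python
from typing import Iterable, Mapping


def segments_to_text(segments: Iterable[Mapping], every_seconds: float = 60.0) -> str:
    """Join caption segments into paragraphs, one per `every_seconds` window.

    Each paragraph starts with an [mm:ss] marker so an answer can be traced
    back to a position in the video.
    """
    paragraphs = []
    current = []
    window_start = 0.0
    for seg in segments:
        text = " ".join(str(seg["text"]).split())  # collapse line breaks inside captions
        if not text:
            continue
        # Close the current paragraph once the window is full.
        if current and seg["start"] - window_start >= every_seconds:
            paragraphs.append(_format(window_start, current))
            current = []
        if not current:
            window_start = float(seg["start"])
        current.append(text)
    if current:
        paragraphs.append(_format(window_start, current))
    return "\n\n".join(paragraphs)


def _format(start: float, parts: list) -> str:
    minutes, seconds = divmod(int(start), 60)
    return f"[{minutes:02d}:{seconds:02d}] " + " ".join(parts)


# Hand-written sample data standing in for a downloaded transcript.
sample = [
    {"text": "welcome to\nthe talk", "start": 0.0, "duration": 2.5},
    {"text": "today we cover indexing", "start": 2.5, "duration": 3.0},
    {"text": "first, embeddings", "start": 65.0, "duration": 2.0},
]
print(segments_to_text(sample))
```

Keeping the timestamps is a design choice rather than a requirement: it costs a few extra tokens per paragraph but lets you jump to the relevant moment when a summary claim looks suspicious.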

What makes yt-sum interesting isn't novel architecture—it's the thoughtful prompt engineering. The tool includes a template system that asks pointed questions designed to extract maximum value from video content. Looking at the codebase, you'll find prompts that ask the LLM to identify key assertions, explain concepts for different technical levels, and surface practical implications. This transforms raw transcripts into structured knowledge.
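
A template system like this can be sketched in a few lines. The question wording below is illustrative only and is not copied from the yt-sum repository; `QUESTIONS`, `PROMPT`, and `build_prompts` are hypothetical names.

```python
from string import Template

# Hypothetical question templates; yt-sum's actual prompts may differ.
QUESTIONS = {
    "assertions": Template("List the key assertions made in this video."),
    "explain": Template("Explain the main concepts for a $level audience."),
    "implications": Template(
        "What are the practical implications for a $level practitioner?"
    ),
}

PROMPT = Template(
    "You are summarizing a video transcript.\n\n"
    "Transcript:\n$transcript\n\n"
    "Question: $question\n"
)


def build_prompts(transcript: str, level: str = "beginner") -> dict:
    """Return one fully rendered prompt per question template."""
    prompts = {}
    for name, question in QUESTIONS.items():
        rendered_question = question.substitute(level=level)
        prompts[name] = PROMPT.substitute(
            transcript=transcript, question=rendered_question
        )
    return prompts


prompts = build_prompts("...transcript text...", level="senior")
print(prompts["explain"])
```

Keeping the questions in a dictionary means a new angle (for example, "what did the speaker leave out?") is one extra entry rather than a code change.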

Here's how the interactive REPL works once you've generated an initial summary:

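# repl.py - simplified excerpt
# Note: GPTSimpleVectorIndex and index.query() belong to early llama_index
# releases; later versions changed this API, so an old version must be pinned.
import os

import openai
from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader

# Raises KeyError at startup if the environment variable is not set
openai.api_key = os.environ['OPENAI_API_KEY']

# Load every file in ./data (the saved transcript) as documents
documents = SimpleDirectoryReader('data').load_data()
# Embed the documents and build an in-memory vector index
index = GPTSimpleVectorIndex.from_documents(documents)

# Interactive query loop
while True:
    query = input("Ask a question about the video (or 'quit'): ")
    if query.lower() == 'quit':
        break

    # Retrieve the most relevant chunks and have the model answer from them
    response = index.query(query)
    print(f"\nAnswer: {response}\n")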

This pattern of indexing documents with LlamaIndex and querying them interactively is the core value proposition. LlamaIndex handles the embedding generation, vector storage, and retrieval-augmented generation behind the scenes. You're essentially building a chatbot grounded in a single video's content with about a dozen lines of code. No model is trained or fine-tuned; the relevant transcript chunks are retrieved and passed to the model as context with each question.
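
To show roughly what "retrieval-augmented" means here, the following toy sketch replaces real embeddings with plain word counts and picks the transcript chunk most similar to a question. It is a stand-in for the idea, not LlamaIndex's implementation, and every function name in it is hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 40) -> list:
    """Split text into chunks of roughly `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real system uses a model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list, top_k: int = 1) -> list:
    """Return the `top_k` chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, context: list) -> str:
    """Assemble the prompt that would be sent to the LLM."""
    joined = "\n---\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"
```

Only the retrieved chunks end up in the prompt, which is why a follow-up question costs far fewer tokens than summarizing the full transcript.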

The workflow separation is clever: go.sh handles the expensive one-time summarization, which sends the full transcript through the model, while repl.py lets you ask targeted questions that only need a few retrieved chunks as context. This two-phase approach makes sense economically, since you're not re-summarizing the entire transcript with every query. The HTML output from the summary phase also serves as a standalone artifact you can archive or share without needing to re-run the tool.

One architectural decision worth noting: the tool saves transcripts locally before processing them. This means you can experiment with different prompts or LLMs without re-downloading captions from YouTube each time. It's a small detail that reflects practical engineering: the workflow is easy to retry and iterate on from the start.
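
The save-then-process idea can be sketched as a small cache wrapper. The JSON-per-video layout, the `transcripts` directory, and the `load_or_fetch` name are assumptions made for this example; the article does not say which on-disk format yt-sum uses.

```python
import json
from pathlib import Path
from typing import Callable


def load_or_fetch(
    video_id: str,
    fetch: Callable[[str], list],
    cache_dir: str = "transcripts",  # hypothetical location
) -> list:
    """Return cached transcript segments, calling `fetch` only on a cache miss."""
    path = Path(cache_dir) / f"{video_id}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    segments = fetch(video_id)  # e.g. a function that downloads the captions
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(segments, ensure_ascii=False), encoding="utf-8")
    return segments
```

Passing the downloader in as `fetch` keeps the cache logic independent of any particular transcript library and makes it easy to exercise with a fake function.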

The HTML output generation is bare-bones but functional. The script wraps summaries in basic HTML tags, making them readable in any browser. For a personal tool, this is perfect—no JavaScript dependencies, no build process, just plain HTML files you can grep or archive. Production tools might opt for Markdown or JSON, but HTML is immediately human-readable without additional tooling.
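
A dependency-free renderer in that spirit might look like the sketch below. The page layout and the `summary_to_html` name are assumptions for illustration and are not taken from the project.

```python
import html


def summary_to_html(title: str, sections: dict) -> str:
    """Render {heading: body} pairs as a minimal standalone HTML page."""
    parts = [
        "<!DOCTYPE html>",
        '<html><head><meta charset="utf-8">',
        f"<title>{html.escape(title)}</title></head><body>",
        f"<h1>{html.escape(title)}</h1>",
    ]
    for heading, body in sections.items():
        parts.append(f"<h2>{html.escape(heading)}</h2>")
        # Escape the text so LLM output containing < or & cannot break the page.
        for paragraph in body.split("\n\n"):
            if paragraph.strip():
                parts.append(f"<p>{html.escape(paragraph.strip())}</p>")
    parts.append("</body></html>")
    return "\n".join(parts)


page = summary_to_html("Demo talk", {"Key points": "First point.\n\nSecond point."})
print(page)
```

Escaping matters more than it might seem: transcripts of programming talks are full of angle brackets and ampersands.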

Gotcha

The biggest limitation is the hard dependency on OpenAI's API. There's no fallback to local models, and the code directly imports OpenAI-specific implementations from LlamaIndex. If you're processing dozens of videos, the costs add up quickly—a one-hour video with a detailed transcript could easily consume 10,000+ tokens for summarization alone. The tool provides no cost estimation or token counting before making API calls.
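
A back-of-the-envelope estimate is easy to add before any API call. The sketch below relies on two rough rules of thumb, about four characters (or 0.75 words) per token for English text and a speaking pace of about 150 words per minute. Both are assumptions, not tokenizer output, and the price is left as a parameter because rates change.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token count from character length (heuristic, not a tokenizer)."""
    return math.ceil(len(text) / chars_per_token)


def tokens_for_duration(
    minutes: float,
    words_per_minute: float = 150.0,  # assumed speaking pace
    words_per_token: float = 0.75,    # assumed words-per-token ratio
) -> int:
    """Estimate transcript tokens from video length and speaking rate."""
    return math.ceil(minutes * words_per_minute / words_per_token)


def estimate_cost(tokens: int, price_per_1k: float) -> float:
    """Cost in the same currency as `price_per_1k` (supply the current rate)."""
    return tokens / 1000 * price_per_1k


print(tokens_for_duration(60))
```

Under these assumptions a one-hour talk comes to about 12,000 input tokens, which is consistent with the 10,000+ figure above and does not yet count the tokens in the generated summary.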

Error handling is minimal. The README explicitly mentions issues with URLs containing certain special characters, and there's no graceful degradation when videos lack transcripts. If a video only has auto-generated captions in a language the tool doesn't expect, or if the transcript API changes its response format, you'll likely see cryptic stack traces rather than helpful error messages. The code assumes happy-path execution—transcripts exist, the API key is valid, the network is stable. Production use would require wrapping this in additional error handling and validation logic.
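
One piece of that validation, turning a messy URL into a video ID or a readable error, can be sketched with the standard library alone. The 11-character ID pattern is a widely observed convention rather than a documented guarantee, and `extract_video_id` is a hypothetical helper that does not come from yt-sum.

```python
import re
from urllib.parse import parse_qs, urlparse

# Assumption: video IDs are 11 characters from the URL-safe base64 alphabet.
VIDEO_ID = re.compile(r"[A-Za-z0-9_-]{11}")


def extract_video_id(url: str) -> str:
    """Return the video ID, or raise ValueError with a readable message."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host == "youtube.com" or host.endswith(".youtube.com"):
        if parsed.path == "/watch":
            # parse_qs copes with extra parameters such as &t=42s or &list=...
            candidate = parse_qs(parsed.query).get("v", [""])[0]
        elif parsed.path.startswith(("/embed/", "/shorts/")):
            candidate = parsed.path.split("/")[2]
        else:
            candidate = ""
    else:
        raise ValueError(f"Not a YouTube URL: {url!r}")
    if not VIDEO_ID.fullmatch(candidate):
        raise ValueError(f"Could not find a video ID in {url!r}")
    return candidate
```

On the shell side, wrapping the URL in single quotes when passing it to a script keeps characters such as `&` and `?` from being interpreted by bash, which is a common source of trouble with URLs that carry extra parameters.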

The tool also inherits all the limitations of transcript-based analysis. Visual demonstrations, code shown on screen, or whiteboard diagrams are completely invisible to the summarization process. For videos where visual content carries significant meaning—think live coding sessions or architecture diagram walkthroughs—you're getting an incomplete picture. The summaries can only be as good as the spoken content captured in captions.

Verdict

Use if: You're a researcher, student, or developer who regularly consumes long-form YouTube educational content and already has an OpenAI API key. This tool excels at processing conference talks, technical tutorials, and lecture series where the audio track contains most of the value. It's perfect for building a personal knowledge base from video content or quickly deciding whether a 2-hour workshop is worth watching in full. The interactive REPL is genuinely useful for fact-checking or finding specific details without scrubbing through video timelines.

Skip if: You need a production-ready solution, want to avoid API costs, or primarily watch visually-oriented content. The lack of error handling and documentation makes this unsuitable for team deployment or user-facing applications. If you're cost-sensitive, combining youtube-transcript-api with a local LLM through Ollama would give you similar functionality without per-request charges. Also skip if you need to process videos without existing captions, since this tool won't generate transcripts from audio.
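
For the local-model route, a minimal sketch of the Ollama side might look like this. It assumes an Ollama server on its default local port with a model already pulled. The model name `llama3`, the prompt wording, and both function names are placeholders for illustration.

```python
import json
import urllib.request


def build_ollama_request(
    transcript: str,
    model: str = "llama3",                  # placeholder model name
    host: str = "http://localhost:11434",   # Ollama's default local address
) -> urllib.request.Request:
    """Build (but do not send) a summarization request for a local Ollama server."""
    payload = {
        "model": model,
        "prompt": f"Summarize the key points of this transcript:\n\n{transcript}",
        "stream": False,  # ask for one JSON object instead of a token stream
    }
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize(transcript: str, **kwargs) -> str:
    """Send the request; needs a running Ollama server with the model available."""
    with urllib.request.urlopen(build_ollama_request(transcript, **kwargs)) as resp:
        return json.loads(resp.read())["response"]
```

Long transcripts may exceed a local model's context window, so in practice this would likely be combined with chunking rather than sending the whole transcript in one prompt.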
