Prompt Budgeting: How to Compress YouTube Transcripts to Use Fewer Tokens
Try it now: Paste any YouTube URL and get subtitles free
Get Subtitles →

A one-hour YouTube video can produce a transcript of 10,000 words or more. Feed that directly into GPT-4o or Claude and you are burning through tokens fast: running up costs, eating into context windows, and often getting worse responses because the model is drowning in filler.
Prompt budgeting is the practice of trimming, compressing, and restructuring your input so that you send only what the model actually needs. In this guide we will cover five practical techniques for compressing YouTube transcripts before sending them to an LLM, with code examples you can use right away.
Why Prompt Budgeting Matters
Every token you send to an LLM costs money and occupies context window space. Here is why that matters:
- Cost. OpenAI charges per token for both input and output. A 15,000-token transcript costs roughly five times as much in input tokens as a 3,000-token summary of the same content, whatever the model's per-token rate.
- Context limits. Even models with large context windows (128K tokens for GPT-4o, 200K for Claude) perform better with focused input. Research on the "lost in the middle" effect consistently shows that models struggle to use relevant information when it is buried deep in a long context.
- Response quality. Shorter, more focused prompts tend to produce better outputs. When you ask a model to "summarize this transcript" and hand it 15,000 tokens of filler-heavy speech, you get a weaker summary than if you hand it 4,000 tokens of cleaned, structured content.
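To see how the cost point above plays out, here is a back-of-the-envelope sketch. The per-1K-token prices below are placeholders chosen for illustration, not current OpenAI rates; check your provider's pricing page for real numbers.

```python
def prompt_cost(input_tokens: int, output_tokens: int,
                price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Estimate the dollar cost of a single LLM call."""
    return (input_tokens / 1000) * price_in_per_1k + \
           (output_tokens / 1000) * price_out_per_1k

# Hypothetical prices: $0.005 per 1K input tokens, $0.015 per 1K output tokens.
full = prompt_cost(15_000, 500, 0.005, 0.015)
compressed = prompt_cost(3_000, 500, 0.005, 0.015)
print(f"full transcript: ${full:.4f}, compressed: ${compressed:.4f}")
```

At these placeholder rates the compressed prompt costs roughly a quarter as much per call; multiplied across hundreds of transcripts, the difference adds up quickly.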
Use our token counter tool to check exactly how many tokens your transcript uses before and after applying these techniques.
Technique 1 — Remove Filler Words
Spoken language is full of filler: "um", "uh", "you know", "like", "basically", "actually", "sort of", "kind of". These words carry no meaning and inflate your token count by 5-15% in a typical transcript.
Here is a simple Python script to strip common fillers:
```python
import re

def remove_fillers(text: str) -> str:
    """Remove common filler words from a transcript."""
    # \b word boundaries stop the patterns from matching inside
    # words like "summary" or "column" (both contain "um").
    fillers = [
        r'\bum\b', r'\buh\b', r'\blike\b', r'\byou know\b',
        r'\bbasically\b', r'\bactually\b', r'\bsort of\b',
        r'\bkind of\b', r'\bI mean\b', r'\bright\?',
    ]
    cleaned = re.sub('|'.join(fillers), '', text, flags=re.IGNORECASE)
    # Tidy up the commas and spaces left behind by the deletions.
    cleaned = re.sub(r'\s+,', ',', cleaned)
    cleaned = re.sub(r',\s*,', ',', cleaned)
    cleaned = re.sub(r'\s{2,}', ' ', cleaned)
    return cleaned.strip()

# Example usage
raw = "So um basically what I want to talk about, you know, is like the importance of, uh, prompt budgeting."
print(remove_fillers(raw))
# Output: "So what I want to talk about, is the importance of, prompt budgeting."
```
This is a blunt instrument — it can occasionally remove "like" or "right" when they are used meaningfully — but for most transcripts the trade-off is worth it. A typical one-hour transcript drops from around 8,000 tokens to 7,000 tokens with filler removal alone.
Technique 2 — Two-Pass Summarization
For very long transcripts that exceed your target token budget even after cleaning, a two-pass summarization approach is highly effective:
- First pass: Split the transcript into chunks (e.g., 3,000 tokens each). Send each chunk to the model with a prompt like "Summarize the key points of this section in 3-5 bullet points."
- Second pass: Combine all the chunk summaries into a single document. Send that combined summary to the model with your actual analytical prompt.
This approach is sometimes called "map-reduce summarization." The first pass (map) extracts key information from each section independently. The second pass (reduce) synthesizes the extracted information into a coherent whole.
The token savings are dramatic: a 15,000-token transcript might produce 2,000 tokens of chunk summaries, which then produce a 500-token final summary. That is a 97% reduction while preserving the essential content.
The trade-off is that you are making multiple API calls, so while the final prompt is cheaper, the total cost includes the summarization passes. For most use cases, the improved response quality more than justifies the approach.
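The two passes can be sketched as a generic map-reduce loop. The `call_llm` function below is a placeholder for whatever API client you use (for example the OpenAI or Anthropic SDK); it is injected as a parameter so the structure stays clear and testable without a live API:

```python
def map_reduce_summarize(chunks: list[str], call_llm) -> str:
    """Two-pass summarization: summarize each chunk, then synthesize."""
    # Map: summarize every chunk independently.
    map_prompt = "Summarize the key points of this section in 3-5 bullet points:\n\n"
    chunk_summaries = [call_llm(map_prompt + chunk) for chunk in chunks]
    # Reduce: combine the chunk summaries and summarize the combination.
    combined = "\n\n".join(chunk_summaries)
    reduce_prompt = "Synthesize these section summaries into one coherent summary:\n\n"
    return call_llm(reduce_prompt + combined)

# Example with a stub standing in for a real API call:
fake_llm = lambda prompt: f"[summary of {len(prompt)} chars]"
print(map_reduce_summarize(["chunk one", "chunk two"], fake_llm))
```

Swapping `fake_llm` for a real client call is the only change needed to run this against an actual model.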
Technique 3 — Chunked Extraction
Instead of summarizing everything, you can extract only the specific information you need from each chunk. This works well when you have a targeted question rather than a general "summarize this" request.
```python
import tiktoken

def chunk_text(text: str, max_tokens: int = 3000, model: str = "gpt-4o") -> list[str]:
    """Split text into chunks that fit within a token limit."""
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    chunks = []
    for i in range(0, len(tokens), max_tokens):
        chunk_tokens = tokens[i:i + max_tokens]
        chunks.append(enc.decode(chunk_tokens))
    return chunks

# Example: split a transcript into processable chunks
with open("transcript.txt") as f:
    transcript = f.read()
chunks = chunk_text(transcript, max_tokens=3000)
print(f"Split into {len(chunks)} chunks")
```
You then process each chunk with a focused extraction prompt. For example, if you are looking for action items mentioned in a meeting recording, you send each chunk with "List any action items, deadlines, or commitments mentioned in this section" and then combine the results.
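Combining the per-chunk results is mostly a matter of concatenating and de-duplicating lines. A minimal sketch, assuming each chunk's response comes back as a newline-separated bullet list:

```python
def merge_extractions(chunk_results: list[str]) -> list[str]:
    """Merge bullet lists from multiple chunks, dropping exact duplicates."""
    seen, merged = set(), []
    for result in chunk_results:
        for line in result.splitlines():
            item = line.strip().lstrip("-* ").strip()  # drop bullet markers
            if item and item.lower() not in seen:
                seen.add(item.lower())
                merged.append(item)
    return merged

# Two chunks that both mention the same deadline:
print(merge_extractions([
    "- Ship the report by Friday\n- Email the client",
    "- Ship the report by Friday\n- Book the venue",
]))
# Output: ['Ship the report by Friday', 'Email the client', 'Book the venue']
```

Exact-match de-duplication misses paraphrased repeats; a final LLM pass over the merged list can clean those up if it matters.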
Technique 4 — Structured Extraction Prompt
One of the biggest sources of wasted tokens is vague prompting. Instead of sending your entire transcript with "summarize this," use a structured extraction prompt that tells the model exactly what to pull out.
```python
# Instead of this:
prompt = f"Summarize the following transcript:\n\n{transcript}"

# Use this:
prompt = f"""Extract the following from this transcript:

1. MAIN TOPICS (list 3-5 key topics discussed)
2. KEY CLAIMS (list any factual claims or statistics mentioned)
3. ACTION ITEMS (list any recommendations or next steps)
4. NOTABLE QUOTES (list 2-3 direct quotes worth preserving)

Transcript:
{transcript}"""
```
This does not reduce the input tokens, but it dramatically reduces the output tokens and improves the usefulness of the response. Instead of a rambling paragraph summary, you get structured data you can actually work with.
The structured approach also pairs well with chunked extraction. Apply the same structured prompt to each chunk, then combine the extracted data in a final pass.
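A side benefit of the numbered-heading format is that the response is easy to parse back into structured data. A minimal parser, assuming the model echoes the `1. MAIN TOPICS` style headings from the prompt (models usually do, but this is not guaranteed):

```python
import re

def parse_sections(response: str) -> dict[str, str]:
    """Split a structured response on numbered ALL-CAPS headings."""
    parts = re.split(r'^\d+\.\s+([A-Z ]+)\s*$', response, flags=re.MULTILINE)
    # re.split yields [preamble, heading, body, heading, body, ...]
    return {h.strip(): b.strip() for h, b in zip(parts[1::2], parts[2::2])}

sample = """1. MAIN TOPICS
- prompt budgeting
2. ACTION ITEMS
- remove filler words"""
print(parse_sections(sample))
# Output: {'MAIN TOPICS': '- prompt budgeting', 'ACTION ITEMS': '- remove filler words'}
```

If you need stricter guarantees, ask the model for JSON output instead and parse it with `json.loads`.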
Technique 5 — Pre-Processing Pipeline
The most effective approach combines multiple techniques into a single pipeline. Here is a workflow you can adapt:
- Download the transcript from YouTube (use SubtitlesYT or any subtitle tool)
- Clean the text: remove filler words, collapse whitespace, fix obvious speech-to-text errors
- Measure the token count (use our token counter to check)
- Decide your strategy based on the token count:
- Under 4,000 tokens: send directly with a structured extraction prompt
- 4,000-15,000 tokens: use chunked extraction with a structured prompt
- Over 15,000 tokens: use two-pass summarization, then structured extraction
- Extract the information you need using the chosen strategy
- Merge the results from all chunks into your final output
This pipeline ensures you are never sending more tokens than necessary while preserving the information that matters.
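The decision step in the pipeline above is easy to encode. This sketch uses a rough words-to-tokens heuristic (about 1.3 tokens per English word) so it has no dependencies; swap in a real tokenizer such as `tiktoken` for exact counts:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~1.3 tokens per whitespace-separated word."""
    return int(len(text.split()) * 1.3)

def choose_strategy(token_count: int) -> str:
    """Pick a compression strategy from the pipeline's token thresholds."""
    if token_count < 4_000:
        return "direct structured extraction"
    if token_count <= 15_000:
        return "chunked extraction with structured prompt"
    return "two-pass summarization, then structured extraction"

print(choose_strategy(estimate_tokens("word " * 2000)))
# Output: direct structured extraction
```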
Savings Comparison
| Technique | Token Reduction | Effort Level | Best For |
|---|---|---|---|
| Remove Filler Words | 5-15% | Low | Quick cleanup of any transcript |
| Two-Pass Summarization | 80-97% | Medium | Very long transcripts, general summaries |
| Chunked Extraction | 60-85% | Medium | Targeted questions, specific info retrieval |
| Structured Prompt | 0% input / 50-70% output | Low | Better outputs without changing input size |
| Full Pipeline | 85-97% | High | Production workflows, cost-sensitive applications |
Start with Technique 1 (filler removal) since it takes almost no effort. If your transcript is still too long, layer on chunked extraction or two-pass summarization depending on whether you need specific information or a general overview.
Want to check your token count before and after applying these techniques? Use our token counter to measure exactly how many tokens your transcript uses across different models. And if you need to download a transcript first, grab subtitles here in TXT, SRT, or VTT format.
Ready to download subtitles? Paste a URL and get started.
Get Subtitles →