# Audio-Only Podcast Optimization Plan ## Executive Summary This document outlines the optimization strategy for audio-only podcasts in ALwrity's Podcast Maker. The goal is to maximize the character throughput per API request while maintaining cost efficiency and audio quality. --- ## 1. Current Cost Analysis ### 1.1 Pricing Structure | Service | Provider | Cost Formula | Notes | |---------|----------|--------------|-------| | **TTS (Audio)** | Minimax Speech-02-HD (WaveSpeed) | $0.05 per 1,000 chars | Exact billing per character | | **Voice Clone** | Minimax Voice Clone | $0.50 per clone | One-time if using custom voice | | **Research** | Exa Neural Search | $0.005 per query | + ~$0.001 for LLM insight extraction | | **Avatar** | Ideogram Character | $0.10 per image | Only if AI-generated | ### 1.2 Cost Examples | Podcast Duration | Characters (est.) | TTS Cost | Total Cost (audio-only) | |------------------|-------------------|----------|--------------------------| | 1 minute | 750 | $0.04 | $0.07 | | 3 minutes | 2,250 | $0.11 | $0.14 | | 5 minutes | 3,750 | $0.19 | $0.22 | | 10 minutes | 7,500 | $0.38 | $0.41 | --- ## 2. Technical Constraints ### 2.1 API Limits **Backend**: `main_audio_generation.py` (line 100) ```python if len(text) > 10000: raise ValueError(f"Text is too long ({len(text)} characters). Maximum is 10,000 characters.") ``` **Current Limit**: 10,000 characters per single API request ### 2.2 Scene-Based Architecture - Each scene = 1 API call - Default scene length: 45 seconds (`scene_length_target` knob) - Audio is generated per scene, then concatenated --- ## 3. Optimization Strategies ### 3.1 Strategy 1: Fewer, Longer Scenes **Problem**: More scenes = more API calls = higher costs **Solution**: - Increase `scene_length_target` from 45s to 60s or 90s - Fewer scenes for the same podcast duration **Impact**: | Duration | Scenes (45s) | Scenes (60s) | Scenes (90s) | API Call Savings | |----------|-------------|--------------|--------------|------------------| | 5 min | 7 | 5 | 3 | 57% fewer calls | | 10 min | 13 | 10 | 7 | 46% fewer calls | ### 3.2 Strategy 2: Per-Scene Character Budgeting **Current behavior**: Each scene text is sent separately to TTS API **Optimization options**: 1. **Text Concatenation**: Combine multiple scene texts with `<#x#>` pause markers ```python # Example: Combine scenes with pause markers combined_text = "Scene 1 text.<#x#>Scene 2 text.<#x#>Scene 3 text." ``` - Risk: May hit 10,000 char limit faster - Benefit: Single API call for multiple scenes 2. **Smart Chunking**: Dynamically batch scenes based on character count ```python MAX_CHARS_PER_REQUEST = 9500 # Leave buffer # Group scenes until approaching limit ``` ### 3.3 Strategy 3: Voice Settings for Longer Content **Speed factor impacts**: - Speed 0.8 = 25% more content per same duration - Speed 1.2 = 20% less content **Recommendation**: Use speed 0.9-1.0 for optimal quality/cost balance ### 3.4 Strategy 4: Audio-Only Mode Skip **For audio-only podcasts** (no video): 1. **Skip avatar generation** - Save $0.10 per speaker 2. **Skip video rendering** - Save $0.30 per scene 3. **Skip scene images** - Save $0.04-$0.10 per scene **Estimated savings for 5-min, 5-scene audio podcast**: | Component | Cost | Audio-Only Savings | |-----------|------|---------------------| | Avatar | $0.10 | $0.10 | | Video (5 scenes) | $1.50 | $1.50 | | Images (5 scenes) | $0.20-$0.50 | $0.20-$0.50 | | **Total** | $1.80-$2.10 | **$1.80-$2.10** | --- ## 4. Implementation Plan ### 4.1 Phase 1: User-Facing Controls (Frontend) #### 4.1.1 Add "Audio Only" Toggle - Location: `CreateModal.tsx` or `PodcastConfiguration.tsx` - Options: `Audio Only` | `Video Only` | `Audio + Video` - When enabled: Skip avatar, image, video generation - Pass `audio_only: true` or `video_only: true` to backend #### 4.1.2 Cost Preview Updates - Show cost comparison based on selected mode - Display potential savings for audio-only vs video ### 4.2 Phase 2: Script Editor UI (NEW - CRITICAL) #### 4.2.1 Three Mode UI Strategy The script editor needs to adapt based on the podcast mode: | Mode | Script Editor UI | Available Actions | |------|------------------|-------------------| | **Audio Only** | Single audio-optimized script | Generate Audio only | | **Video Only** | Current video script editor | Generate Audio + Image + Video | | **Audio + Video** | Two tabs: "Audio Script" + "Video Script" | Full generation options | #### 4.2.2 Implementation Details **File:** `frontend/src/components/PodcastMaker/ScriptEditor/ScriptEditor.tsx` **New Component Structure:** ```typescript interface ScriptEditorProps { // ... existing props audioOnlyMode: boolean; // Audio-only podcast videoOnlyMode: boolean; // Video-only podcast (current behavior) audioScript?: Script; // Audio-optimized script (3-4 scenes, more lines) videoScript?: Script; // Video-optimized script (current) onAudioScriptChange?: (script: Script) => void; onVideoScriptChange?: (script: Script) => void; } ``` **UI Layout:** ``` ┌─────────────────────────────────────────────────────────────┐ │ Script Editor [Audio] [Video] tabs (if both) ├─────────────────────────────────────────────────────────────┤ │ Mode: Audio-Only │ │ ┌─────────────────────────────────────────────────────┐ │ │ │ Scene 1: Introduction (90s) [Edit]│ │ │ │ Host: Welcome to today's episode... │ │ │ │ Host: Today we're diving deep into... │ │ │ │ ... (6-10 lines per scene for audio) │ │ │ └─────────────────────────────────────────────────────┘ │ │ │ │ [Generate Audio] $0.04 │ └─────────────────────────────────────────────────────────────┘ ``` #### 4.2.3 Tab Implementation for Audio + Video Mode **When both Audio and Video are selected:** 1. Show two tabs in script editor: - **Tab 1: "Audio Script"** - Audio-optimized (fewer scenes, more content) - **Tab 2: "Video Script"** - Current video script (more scenes, visual) 2. Each tab has independent: - Scene structure - Edit capabilities - Generation buttons 3. Generation actions differ by tab: - Audio Tab: "Generate Audio" button only - Video Tab: "Generate Audio" + "Generate Image" + "Generate Video" #### 4.2.4 Backend Script Generation Updates **Script generation endpoint changes:** ```python # In PodcastScriptRequest model class PodcastScriptRequest(BaseModel): # ... existing fields audio_only: bool = False # Generate audio-optimized script video_only: bool = False # Generate video-optimized script (current) # If both False AND audio/video mode is "both", generate both scripts ``` **Prompt Selection Logic:** ```python if request.audio_only: prompt = AUDIO_ONLY_PROMPT # 3-4 scenes, 6-10 lines/scene elif request.video_only: prompt = VIDEO_PROMPT # Current 5-6 scenes, 2-4 lines/scene else: # Generate both scripts with respective prompts audio_prompt = AUDIO_ONLY_PROMPT video_prompt = VIDEO_PROMPT ``` ### 4.3 Phase 3: Backend Script Generation (AI Prompts) #### 4.2.1 Two-Tier Script Generation Strategy **Current Behavior (Video Podcast):** - Existing prompt in `backend/api/podcast/handlers/script.py` (lines 125-151) - Optimized for video with shorter scenes (2-4 lines per scene) - 5-6 scenes max for visual storytelling - Less content per scene to match video duration **New Audio-Only Mode:** - New prompt optimized for audio-only content - More content-dense, information-rich - Fewer scenes with MORE content per scene - Maximizes use of research data - Reduces API calls while delivering more value #### 4.2.2 Audio-Only Script Prompt **Location:** `backend/api/podcast/handlers/script.py` **New Prompt for Audio-Only:** ```python AUDIO_ONLY_PROMPT = """Create a DEEP, content-rich podcast script optimized for AUDIO-ONLY delivery. {f"RESEARCH DATA (Use extensively - this is audio only, more content is better): {research_context[:3000]}" if research_context else "No research available - generate general content"} {f"BIBLE: {bible_context[:1500]}" if bible_context else ""} {f"{analysis_context}" if analysis_context else ""} Topic: "{request.idea}" Duration: {request.duration_minutes} min | Speakers: {request.speakers} MODE: AUDIO-ONLY (no video constraints - maximize content density) COST OPTIMIZATION (Audio-Only): - 3-4 scenes MAX for entire episode (fewer scenes = fewer API calls) - EACH scene should have 6-10 LINES (more content per scene) - Each line: 3-5 sentences, information-dense - Include: facts, statistics, examples, insights from research - NO visual descriptions needed (save tokens for content) - Make every line deliver unique value STRUCTURE per scene: - scene_id: string - title: short descriptive title - duration: seconds (target {request.duration_minutes*60 // 3}-{request.duration_minutes*60 // 4} per scene) - emotion: neutral|happy|excited|serious|curious|confident - lines: array of {{speaker, text, emphasis}} - speaker: "Host" or "Guest" - text: 3-5 sentences, rich with facts/insights - emphasis: true|false for important points Return JSON with scenes array. """ ``` **Key Differences:** | Aspect | Video (Current) | Audio-Only (New) | |--------|------------------|------------------| | Scenes | 5-6 | 3-4 | | Lines/Scene | 2-4 | 6-10 | | Sentences/Line | 1-3 | 3-5 | | Research Usage | 1,200 chars | 3,000 chars | | Focus | Visual storytelling | Content density | | API Calls | More (lower cost/scene) | Fewer (higher cost/scene) | #### 4.2.3 Implementation Details **File:** `backend/api/podcast/handlers/script.py` 1. Add `audio_only: bool` parameter to `PodcastScriptRequest` 2. Conditionally select prompt based on `audio_only` flag 3. For audio-only: - Use expanded research context (3,000 chars vs 1,200) - Request more lines per scene - Fewer total scenes - More content per line ### 4.4 Phase 4: Backend Optimizations #### 4.3.1 Smart Scene Batching - File: `backend/api/podcast/handlers/audio.py` - Logic: Group scenes with total chars < 9000 - Add pause markers between scenes #### 4.3.2 Audio-Only Flag in Project - Model: Add `audio_only: bool` to project settings - Skip: Avatar generation, image generation, video rendering ### 4.4 Phase 4: Cost Calculation Updates #### 4.4.1 Update Frontend Estimation - File: `frontend/src/services/podcastApi.ts` - Formula updates: ```typescript const estimatedApiCalls = Math.ceil(totalChars / 9500); const ttsCost = estimatedApiCalls * 0.05; ``` --- ## 5. Technical Details ### 5.1 Files to Modify | File | Changes | |------|---------| | `frontend/src/components/PodcastMaker/types.ts` | Add `audio_only`, `video_only`, `podcast_mode` to project settings | | `frontend/src/components/PodcastMaker/CreateModal.tsx` | Add mode toggle (Audio/Video/Both) | | `frontend/src/services/podcastApi.ts` | Update cost estimation for each mode | | `frontend/src/components/PodcastMaker/ScriptEditor/ScriptEditor.tsx` | Add tab support for Audio + Video mode | | `frontend/src/components/PodcastMaker/ScriptEditor/SceneEditor.tsx` | Conditional action buttons per mode | | `backend/api/podcast/models.py` | Add `audio_only`, `video_only` fields to request model | | `backend/api/podcast/handlers/script.py` | Add audio-only + video-only prompts, return both scripts when needed | | `backend/api/podcast/handlers/audio.py` | Implement smart batching | ### 5.2 API Endpoints ```python # PodcastScriptRequest model changes class PodcastScriptRequest(BaseModel): idea: str duration_minutes: int speakers: int research: Optional[Dict] = None bible: Optional[Dict] = None analysis: Optional[Dict] = None outline: Optional[Dict] = None # NEW FIELDS: audio_only: bool = False # Generate audio-optimized script video_only: bool = False # Generate video-optimized script (current) # Both False = generate both scripts for audio+video mode # Response includes both scripts when needed class PodcastScriptResponse(BaseModel): audio_script: Optional[Script] = None # Audio-optimized video_script: Optional[Script] = None # Video-optimized ``` ### 5.3 Database Schema ```python # In PodcastProject model audio_only: bool = False scene_length_target: int = 60 # seconds ``` --- ## 6. User Experience ### 6.1 Create Phase - Mode Toggle ``` ┌─────────────────────────────────────────────────────────────┐ │ 🎙️ Create New Podcast │ ├─────────────────────────────────────────────────────────────┤ │ Duration: [5] minutes Speakers: [1] [2] │ │ │ │ Podcast Mode: │ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │ │ Audio Only │ │ Video Only │ │ Audio+Video │ │ │ │ ($0.22) │ │ ($2.02) │ │ ($2.24) │ │ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │ │ │ Est. Cost: $0.22 (audio only) vs $2.02 (with video) │ └─────────────────────────────────────────────────────────────┘ ``` ### 6.2 Script Editor - Audio Only Mode ``` ┌─────────────────────────────────────────────────────────────┐ │ Script Editor │ ├─────────────────────────────────────────────────────────────┤ │ 📻 Audio-Only Mode │ │ ┌─────────────────────────────────────────────────────┐ │ │ │ Scene 1: Introduction (90s) [Edit]│ │ │ Host: Welcome to today's episode on AI... │ │ │ Host: Today we're diving deep into how AI... │ │ │ Host: I'm excited to share three key insights... │ │ │ ... (6-10 lines for audio) │ │ │ │ │ │ Scene 2: Main Topic (120s) [Edit]│ │ │ ... │ │ └─────────────────────────────────────────────────────┘ │ │ │ │ [Generate Audio] $0.04 [Generate Image] Disabled │ │ [Generate Video] Disabled │ └─────────────────────────────────────────────────────────────┘ ``` ### 6.3 Script Editor - Video Only Mode (Current) ``` ┌─────────────────────────────────────────────────────────────┐ │ Script Editor │ ├─────────────────────────────────────────────────────────────┤ │ 🎬 Video Mode │ │ ┌─────────────────────────────────────────────────────┐ │ │ │ Scene 1: Intro (30s) [Image] [Audio] [V] │ │ │ Scene 2: Hook (30s) [Image] [Audio] [V] │ │ │ Scene 3: Content (45s) [Image] [Audio] [V] │ │ │ Scene 4: Example (30s) [Image] [Audio] [V] │ │ │ Scene 5: CTA (15s) [Image] [Audio] [V] │ │ └─────────────────────────────────────────────────────┘ │ │ │ │ [Generate Audio] $0.19 [Generate Image] $0.10 │ │ [Generate Video] $1.50 │ └─────────────────────────────────────────────────────────────┘ ``` ### 6.4 Script Editor - Audio + Video Mode (Both) ``` ┌─────────────────────────────────────────────────────────────┐ │ Script Editor [Audio] [Video] │ ├─────────────────────────────────────────────────────────────┤ │ ┌─────────────────────────────────────────────────────┐ │ │ │ [Audio] Tab | [Video] Tab │ │ │ ├─────────────────────────────────────────────────────┤ │ │ │ Audio Script: │ │ │ │ Scene 1: Intro (90s) - 8 lines │ │ │ │ Scene 2: Deep Dive (120s) - 10 lines │ │ │ │ │ │ │ │ [Generate Audio] $0.04 │ │ │ └─────────────────────────────────────────────────────┘ │ └─────────────────────────────────────────────────────────────┘ OR ┌─────────────────────────────────────────────────────────────┐ │ Script Editor [Audio] [Video] │ ├─────────────────────────────────────────────────────────────┤ │ ┌─────────────────────────────────────────────────────┐ │ │ │ [Audio] Tab | [Video] Tab │ │ │ ├─────────────────────────────────────────────────────┤ │ │ │ Video Script: │ │ │ │ Scene 1: Intro (30s) [Img] [Aud] [Vid] │ │ │ │ Scene 2: Hook (30s) [Img] [Aud] [Vid] │ │ │ │ Scene 3: Content (45s) [Img] [Aud] [Vid] │ │ │ │ │ │ │ │ [Generate Audio] [Generate Image] [Generate Video] │ │ │ └─────────────────────────────────────────────────────┘ │ └─────────────────────────────────────────────────────────────┘ ``` ### 6.5 Cost Comparison UI | Mode | Scenes | Lines/Scene | TTS Cost | Video Cost | Total | |------|--------|-------------|----------|------------|-------| | Audio Only | 3-4 | 6-10 | $0.19 | $0 | **$0.22** | | Video Only | 5-6 | 2-4 | $0.19 | $1.50 | **$1.69** | | Audio+Video | 3-4 + 5-6 | varies | $0.19 | $1.50 | **$1.72** | --- ## 7. Testing Plan ### 7.1 Unit Tests 1. Test character count calculation 2. Test scene batching logic (under 10k chars) 3. Test cost estimation accuracy ### 7.2 Integration Tests 1. Generate audio for 10-minute podcast with 5 scenes 2. Verify all scenes generate correctly 3. Verify cost tracking in database ### 7.3 Performance Tests 1. Measure time for batched vs sequential API calls 2. Verify no timeout issues with longer text --- ## 8. Success Metrics | Metric | Target | Current | |--------|--------|---------| | API calls per 5-min podcast | 5 | 7 | | Cost per 5-min audio podcast | $0.22 | $0.22 + video | | User-visible savings | 50%+ | N/A | | Scene length default | 60s | 45s | --- ## 9. Appendix: Related Files ### Backend - `backend/services/llm_providers/main_audio_generation.py` - TTS cost calculation - `backend/api/podcast/handlers/audio.py` - Audio generation endpoint - `backend/api/podcast/handlers/script.py` - Script generation - `backend/services/subscription/pricing_service.py` - Pricing configuration ### Frontend - `frontend/src/services/podcastApi.ts` - Cost estimation - `frontend/src/components/PodcastMaker/CreateModal.tsx` - Create UI - `frontend/src/components/PodcastMaker/types.ts` - Type definitions --- ## Document History | Version | Date | Author | Changes | |---------|------|--------|---------| | 1.0 | 2026-04-08 | ALwrity Team | Initial document creation | --- *This document serves as the reference for audio-only podcast optimization in ALwrity Podcast Maker.*