Audio-Only Podcast Optimization Plan
Executive Summary
This document outlines the optimization strategy for audio-only podcasts in ALwrity's Podcast Maker. The goal is to maximize the character throughput per API request while maintaining cost efficiency and audio quality.
1. Current Cost Analysis
1.1 Pricing Structure
| Service | Provider | Cost Formula | Notes |
|---|---|---|---|
| TTS (Audio) | Minimax Speech-02-HD (WaveSpeed) | $0.05 per 1,000 chars | Exact billing per character |
| Voice Clone | Minimax Voice Clone | $0.50 per clone | One-time if using custom voice |
| Research | Exa Neural Search | $0.005 per query | + ~$0.001 for LLM insight extraction |
| Avatar | Ideogram Character | $0.10 per image | Only if AI-generated |
1.2 Cost Examples
| Podcast Duration | Characters (est.) | TTS Cost | Total Cost (audio-only) |
|---|---|---|---|
| 1 minute | 750 | $0.04 | $0.07 |
| 3 minutes | 2,250 | $0.11 | $0.14 |
| 5 minutes | 3,750 | $0.19 | $0.22 |
| 10 minutes | 7,500 | $0.38 | $0.41 |
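The TTS column follows from the per-character rate in Section 1.1. A minimal sketch of that arithmetic, assuming about 750 characters per spoken minute (the ratio implied by the table) and half-up rounding to cents; the function names are illustrative and not taken from the codebase:

```python
from decimal import Decimal, ROUND_HALF_UP

TTS_RATE_PER_1000_CHARS = Decimal("0.05")  # from the pricing table in 1.1
CHARS_PER_MINUTE = 750                     # assumption implied by the table in 1.2


def estimate_chars(duration_minutes: int) -> int:
    """Estimate script length in characters for a given duration."""
    return duration_minutes * CHARS_PER_MINUTE


def tts_cost(chars: int) -> Decimal:
    """TTS cost in dollars, billed per character, rounded half-up to cents."""
    raw = TTS_RATE_PER_1000_CHARS * chars / 1000
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


for minutes in (1, 3, 5, 10):
    chars = estimate_chars(minutes)
    print(f"{minutes} min: {chars} chars -> ${tts_cost(chars)}")
```

Decimal is used instead of float so that values such as 0.0375 and 0.375 round the same way the table does.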
2. Technical Constraints
2.1 API Limits
Backend: main_audio_generation.py (line 100)
if len(text) > 10000:
    raise ValueError(f"Text is too long ({len(text)} characters). Maximum is 10,000 characters.")
Current Limit: 10,000 characters per single API request
2.2 Scene-Based Architecture
- Each scene = 1 API call
- Default scene length: 45 seconds (scene_length_target knob)
- Audio is generated per scene, then concatenated
3. Optimization Strategies
3.1 Strategy 1: Fewer, Longer Scenes
Problem: More scenes = more API calls = higher costs
Solution:
- Increase scene_length_target from 45s to 60s or 90s
- Fewer scenes for the same podcast duration
Impact:
| Duration | Scenes (45s) | Scenes (60s) | Scenes (90s) | API Call Savings (45s vs 90s) |
|---|---|---|---|---|
| 5 min | 7 | 5 | 3 | 57% fewer calls |
| 10 min | 13 | 10 | 7 | 46% fewer calls |
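The scene counts above can be reproduced with a short calculation. This sketch assumes the count is the duration divided by the scene length, rounded to the nearest whole scene (the rounding that matches the table); both function names are hypothetical:

```python
def scene_count(duration_minutes: int, scene_length_seconds: int) -> int:
    """Number of scenes, rounded to the nearest whole scene (minimum 1)."""
    exact = duration_minutes * 60 / scene_length_seconds
    return max(1, int(exact + 0.5))


def call_savings_percent(baseline_scenes: int, new_scenes: int) -> int:
    """Percentage of API calls avoided relative to the baseline scene count."""
    return round((baseline_scenes - new_scenes) / baseline_scenes * 100)


for minutes in (5, 10):
    counts = {length: scene_count(minutes, length) for length in (45, 60, 90)}
    savings = call_savings_percent(counts[45], counts[90])
    print(f"{minutes} min: {counts} -> {savings}% fewer calls at 90s")
```

If the real planner rounds up instead of to the nearest scene, the 90s column for a 5-minute podcast would be 4 rather than 3, so the rounding rule is worth confirming before the numbers are shown to users.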
3.2 Strategy 2: Per-Scene Character Budgeting
Current behavior: Each scene text is sent separately to TTS API
Optimization options:
- Text Concatenation: Combine multiple scene texts with <#x#> pause markers
      # Example: Combine scenes with pause markers
      combined_text = "Scene 1 text.<#x#>Scene 2 text.<#x#>Scene 3 text."
  - Risk: May hit 10,000 char limit faster
  - Benefit: Single API call for multiple scenes
- Smart Chunking: Dynamically batch scenes based on character count
      MAX_CHARS_PER_REQUEST = 9500  # Leave buffer
      # Group scenes until approaching limit
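The Smart Chunking option can be sketched as a greedy grouping pass. This is a minimal illustration, not the existing handler code: batch_scenes and PAUSE_MARKER are hypothetical names, and the concrete marker value <#0.5#> assumes the x in <#x#> is a pause length in seconds, which should be checked against the provider documentation.

```python
from typing import List

MAX_CHARS_PER_REQUEST = 9500   # buffer below the 10,000-char backend limit
PAUSE_MARKER = "<#0.5#>"       # assumption: x in <#x#> is a pause in seconds


def batch_scenes(
    scene_texts: List[str],
    max_chars: int = MAX_CHARS_PER_REQUEST,
    pause: str = PAUSE_MARKER,
) -> List[str]:
    """Greedily join consecutive scenes until the next one would exceed max_chars."""
    batches: List[str] = []
    current = ""
    for text in scene_texts:
        if len(text) > max_chars:
            # A single oversized scene cannot be batched; the caller must split it.
            raise ValueError(f"Scene has {len(text)} chars, limit is {max_chars}")
        candidate = text if not current else current + pause + text
        if len(candidate) <= max_chars:
            current = candidate
        else:
            batches.append(current)
            current = text
    if current:
        batches.append(current)
    return batches


scenes = ["a" * 4000, "b" * 4000, "c" * 4000]
print([len(batch) for batch in batch_scenes(scenes)])
```

The pause marker is counted against the budget here, on the conservative assumption that the API limit applies to the full request text including markers.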
3.3 Strategy 3: Voice Settings for Longer Content
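Speed factor impacts (assuming audio duration scales inversely with the speed factor; TTS cost is unchanged because billing is per character):
- Speed 0.8 = the same text plays about 25% longer (1 / 0.8 = 1.25), so less content fits in a fixed duration
- Speed 1.2 = the same text plays about 17% shorter (1 / 1.2 ≈ 0.83), so about 20% more content fits in a fixed duration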
Recommendation: Use speed 0.9-1.0 for optimal quality/cost balance
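A rough way to reason about the speed factor is to treat playback duration as inversely proportional to it. The sketch below makes that assumption explicit and reuses the 750 characters per minute estimate from Section 1.2; the function name is hypothetical and the real engine may not scale perfectly linearly.

```python
CHARS_PER_MINUTE_AT_1X = 750  # assumption taken from the estimates in 1.2


def estimated_duration_seconds(chars: int, speed: float = 1.0) -> float:
    """Approximate audio length, assuming duration scales as 1 / speed."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return chars / CHARS_PER_MINUTE_AT_1X * 60 / speed


for speed in (0.8, 0.9, 1.0, 1.2):
    seconds = estimated_duration_seconds(750, speed)
    print(f"speed {speed}: {seconds:.1f}s for 750 chars")
```

Because billing is per character, changing the speed alters how long the audio runs for a given script, not what the TTS request costs.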
3.4 Strategy 4: Audio-Only Mode Skip
For audio-only podcasts (no video):
- Skip avatar generation - Save $0.10 per speaker
- Skip video rendering - Save $0.30 per scene
- Skip scene images - Save $0.04-$0.10 per scene
Estimated savings for 5-min, 5-scene audio podcast:
| Component | Cost | Audio-Only Savings |
|---|---|---|
| Avatar | $0.10 | $0.10 |
| Video (5 scenes) | $1.50 | $1.50 |
| Images (5 scenes) | $0.20-$0.50 | $0.20-$0.50 |
| Total | $1.80-$2.10 | $1.80-$2.10 |
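The savings table is a sum of per-speaker and per-scene unit costs. A minimal sketch using the figures quoted above, kept in integer cents to avoid float rounding; the constant and function names are illustrative only:

```python
from typing import Tuple

AVATAR_CENTS_PER_SPEAKER = 10          # $0.10 per avatar
VIDEO_CENTS_PER_SCENE = 30             # $0.30 per rendered scene
IMAGE_CENTS_PER_SCENE = (4, 10)        # $0.04-$0.10 per scene image


def audio_only_savings_cents(scenes: int, speakers: int = 1) -> Tuple[int, int]:
    """Return the (low, high) savings in cents from skipping avatar, video and images."""
    fixed = speakers * AVATAR_CENTS_PER_SPEAKER + scenes * VIDEO_CENTS_PER_SCENE
    low = fixed + scenes * IMAGE_CENTS_PER_SCENE[0]
    high = fixed + scenes * IMAGE_CENTS_PER_SCENE[1]
    return low, high


low, high = audio_only_savings_cents(scenes=5, speakers=1)
print(f"Savings: ${low / 100:.2f}-${high / 100:.2f}")
```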
4. Implementation Plan
4.1 Phase 1: User-Facing Controls (Frontend)
4.1.1 Add "Audio Only" Toggle
- Location: CreateModal.tsx or PodcastConfiguration.tsx
- Options: Audio Only | Video Only | Audio + Video
- When enabled: Skip avatar, image, video generation
- Pass audio_only: true or video_only: true to backend
4.1.2 Cost Preview Updates
- Show cost comparison based on selected mode
- Display potential savings for audio-only vs video
4.2 Phase 2: Script Editor UI (NEW - CRITICAL)
4.2.1 Three Mode UI Strategy
The script editor needs to adapt based on the podcast mode:
| Mode | Script Editor UI | Available Actions |
|---|---|---|
| Audio Only | Single audio-optimized script | Generate Audio only |
| Video Only | Current video script editor | Generate Audio + Image + Video |
| Audio + Video | Two tabs: "Audio Script" + "Video Script" | Full generation options |
4.2.2 Implementation Details
File: frontend/src/components/PodcastMaker/ScriptEditor/ScriptEditor.tsx
New Component Structure:
interface ScriptEditorProps {
  // ... existing props
  audioOnlyMode: boolean;   // Audio-only podcast
  videoOnlyMode: boolean;   // Video-only podcast (current behavior)
  audioScript?: Script;     // Audio-optimized script (3-4 scenes, more lines)
  videoScript?: Script;     // Video-optimized script (current)
  onAudioScriptChange?: (script: Script) => void;
  onVideoScriptChange?: (script: Script) => void;
}
UI Layout:
┌─────────────────────────────────────────────────────────────┐
│ Script Editor [Audio] [Video] tabs (if both)
├─────────────────────────────────────────────────────────────┤
│ Mode: Audio-Only │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Scene 1: Introduction (90s) [Edit]│ │
│ │ Host: Welcome to today's episode... │ │
│ │ Host: Today we're diving deep into... │ │
│ │ ... (6-10 lines per scene for audio) │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ [Generate Audio] $0.04 │
└─────────────────────────────────────────────────────────────┘
4.2.3 Tab Implementation for Audio + Video Mode
When both Audio and Video are selected:
- Show two tabs in script editor:
  - Tab 1: "Audio Script" - Audio-optimized (fewer scenes, more content)
  - Tab 2: "Video Script" - Current video script (more scenes, visual)
- Each tab has independent:
  - Scene structure
  - Edit capabilities
  - Generation buttons
- Generation actions differ by tab:
  - Audio Tab: "Generate Audio" button only
  - Video Tab: "Generate Audio" + "Generate Image" + "Generate Video"
4.2.4 Backend Script Generation Updates
Script generation endpoint changes:
# In PodcastScriptRequest model
class PodcastScriptRequest(BaseModel):
    # ... existing fields
    audio_only: bool = False  # Generate audio-optimized script
    video_only: bool = False  # Generate video-optimized script (current)
    # If both False AND audio/video mode is "both", generate both scripts
Prompt Selection Logic:
if request.audio_only:
    prompt = AUDIO_ONLY_PROMPT  # 3-4 scenes, 6-10 lines/scene
elif request.video_only:
    prompt = VIDEO_PROMPT  # Current 5-6 scenes, 2-4 lines/scene
else:
    # Generate both scripts with respective prompts
    audio_prompt = AUDIO_ONLY_PROMPT
    video_prompt = VIDEO_PROMPT
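The two boolean flags encode three modes, which leaves one combination (both True) undefined. One way to make that explicit is to resolve the flags into a single mode first. This is a sketch under that assumption; PodcastMode, resolve_mode and prompts_for are hypothetical names, and the prompts are represented by their constant names only.

```python
from enum import Enum
from typing import List


class PodcastMode(Enum):
    AUDIO_ONLY = "audio_only"
    VIDEO_ONLY = "video_only"
    BOTH = "both"


def resolve_mode(audio_only: bool, video_only: bool) -> PodcastMode:
    """Map the two request flags to one mode, rejecting the contradictory case."""
    if audio_only and video_only:
        raise ValueError("audio_only and video_only are mutually exclusive")
    if audio_only:
        return PodcastMode.AUDIO_ONLY
    if video_only:
        return PodcastMode.VIDEO_ONLY
    return PodcastMode.BOTH


def prompts_for(mode: PodcastMode) -> List[str]:
    """Names of the prompt constants to run for a given mode."""
    if mode is PodcastMode.AUDIO_ONLY:
        return ["AUDIO_ONLY_PROMPT"]
    if mode is PodcastMode.VIDEO_ONLY:
        return ["VIDEO_PROMPT"]
    return ["AUDIO_ONLY_PROMPT", "VIDEO_PROMPT"]


print(prompts_for(resolve_mode(audio_only=True, video_only=False)))
```

Since both flags default to False, a caller that sends neither would receive both scripts under this scheme, which may differ from what existing video-only clients expect.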
4.3 Phase 3: Backend Script Generation (AI Prompts)
4.3.1 Two-Tier Script Generation Strategy
Current Behavior (Video Podcast):
- Existing prompt in backend/api/podcast/handlers/script.py (lines 125-151)
- Optimized for video with shorter scenes (2-4 lines per scene)
- 5-6 scenes max for visual storytelling
- Less content per scene to match video duration
New Audio-Only Mode:
- New prompt optimized for audio-only content
- More content-dense, information-rich
- Fewer scenes with MORE content per scene
- Maximizes use of research data
- Reduces API calls while delivering more value
4.3.2 Audio-Only Script Prompt
Location: backend/api/podcast/handlers/script.py
New Prompt for Audio-Only:
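# f-string: build this inside the handler, where request, research_context,
# bible_context and analysis_context are in scope
AUDIO_ONLY_PROMPT = f"""Create a DEEP, content-rich podcast script optimized for AUDIO-ONLY delivery.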
{f"RESEARCH DATA (Use extensively - this is audio only, more content is better): {research_context[:3000]}" if research_context else "No research available - generate general content"}
{f"BIBLE: {bible_context[:1500]}" if bible_context else ""}
{f"{analysis_context}" if analysis_context else ""}
Topic: "{request.idea}"
Duration: {request.duration_minutes} min | Speakers: {request.speakers}
MODE: AUDIO-ONLY (no video constraints - maximize content density)
COST OPTIMIZATION (Audio-Only):
- 3-4 scenes MAX for entire episode (fewer scenes = fewer API calls)
- EACH scene should have 6-10 LINES (more content per scene)
- Each line: 3-5 sentences, information-dense
- Include: facts, statistics, examples, insights from research
- NO visual descriptions needed (save tokens for content)
- Make every line deliver unique value
STRUCTURE per scene:
- scene_id: string
- title: short descriptive title
- duration: seconds (target {request.duration_minutes*60 // 4}-{request.duration_minutes*60 // 3} per scene)
- emotion: neutral|happy|excited|serious|curious|confident
- lines: array of {{speaker, text, emphasis}}
  - speaker: "Host" or "Guest"
  - text: 3-5 sentences, rich with facts/insights
  - emphasis: true|false for important points
Return JSON with scenes array.
"""
Key Differences:
| Aspect | Video (Current) | Audio-Only (New) |
|---|---|---|
| Scenes | 5-6 | 3-4 |
| Lines/Scene | 2-4 | 6-10 |
| Sentences/Line | 1-3 | 3-5 |
| Research Usage | 1,200 chars | 3,000 chars |
| Focus | Visual storytelling | Content density |
| API Calls | More (lower cost/scene) | Fewer (higher cost/scene) |
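The differences in the table can be captured as data rather than scattered through the prompt logic. The following is a sketch only: ScriptProfile and the helper functions are hypothetical, and the values are copied from the table above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ScriptProfile:
    scenes: Tuple[int, int]              # (min, max) scenes per episode
    lines_per_scene: Tuple[int, int]     # (min, max) lines per scene
    sentences_per_line: Tuple[int, int]  # (min, max) sentences per line
    research_chars: int                  # research context passed to the prompt


VIDEO_PROFILE = ScriptProfile((5, 6), (2, 4), (1, 3), 1200)
AUDIO_ONLY_PROFILE = ScriptProfile((3, 4), (6, 10), (3, 5), 3000)


def scene_duration_range(duration_minutes: int, profile: ScriptProfile) -> Tuple[int, int]:
    """(shortest, longest) target seconds per scene for the profile's scene range."""
    total_seconds = duration_minutes * 60
    min_scenes, max_scenes = profile.scenes
    return total_seconds // max_scenes, total_seconds // min_scenes


def trim_research(research_context: str, profile: ScriptProfile) -> str:
    """Truncate research text to the profile's character allowance."""
    return research_context[: profile.research_chars]


print(scene_duration_range(5, AUDIO_ONLY_PROFILE))
```

For a 5-minute audio-only episode this yields a 75 to 100 second target per scene, which lines up with the 90s scenes shown in the UI mockups.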
4.3.3 Implementation Details
File: backend/api/podcast/handlers/script.py
- Add audio_only: bool parameter to PodcastScriptRequest
- Conditionally select prompt based on audio_only flag
- For audio-only:
  - Use expanded research context (3,000 chars vs 1,200)
  - Request more lines per scene
  - Fewer total scenes
  - More content per line
4.4 Phase 4: Backend Optimizations
4.4.1 Smart Scene Batching
- File: backend/api/podcast/handlers/audio.py
- Logic: Group scenes with total chars < 9,500 (the MAX_CHARS_PER_REQUEST buffer from Section 3.2)
- Add pause markers between scenes
4.4.2 Audio-Only Flag in Project
- Model: Add audio_only: bool to project settings
- Skip: Avatar generation, image generation, video rendering
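4.5 Phase 5: Cost Calculation Updates
4.5.1 Update Frontend Estimation
- File: frontend/src/services/podcastApi.ts
- Formula updates:
      // Request count follows the 9,500-char batching budget
      const estimatedApiCalls = Math.ceil(totalChars / 9500);
      // TTS is billed per character ($0.05 per 1,000 chars), not per request
      const ttsCost = (totalChars / 1000) * 0.05;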
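Section 1.1 states that TTS is billed per character, so the estimate needs to keep two quantities apart: the number of requests (driven by the 9,500-character budget) and the cost (driven by total characters). A sketch of that separation, written in Python here; the same arithmetic carries over to podcastApi.ts, and estimate_tts is a hypothetical name.

```python
import math
from typing import Dict

MAX_CHARS_PER_REQUEST = 9500          # batching budget from section 3.2
TTS_DOLLARS_PER_1000_CHARS = 0.05     # rate from section 1.1


def estimate_tts(total_chars: int) -> Dict[str, float]:
    """Estimate request count and TTS cost for a script of total_chars characters."""
    api_calls = math.ceil(total_chars / MAX_CHARS_PER_REQUEST)
    tts_cost = round(total_chars / 1000 * TTS_DOLLARS_PER_1000_CHARS, 4)
    return {"api_calls": api_calls, "tts_cost": tts_cost}


print(estimate_tts(3750))
```

Multiplying the request count by $0.05 would price a 3,750-character script at $0.05, whereas the per-character rate gives about $0.19, matching the table in Section 1.2.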
5. Technical Details
5.1 Files to Modify
| File | Changes |
|---|---|
| frontend/src/components/PodcastMaker/types.ts | Add audio_only, video_only, podcast_mode to project settings |
| frontend/src/components/PodcastMaker/CreateModal.tsx | Add mode toggle (Audio/Video/Both) |
| frontend/src/services/podcastApi.ts | Update cost estimation for each mode |
| frontend/src/components/PodcastMaker/ScriptEditor/ScriptEditor.tsx | Add tab support for Audio + Video mode |
| frontend/src/components/PodcastMaker/ScriptEditor/SceneEditor.tsx | Conditional action buttons per mode |
| backend/api/podcast/models.py | Add audio_only, video_only fields to request model |
| backend/api/podcast/handlers/script.py | Add audio-only + video-only prompts, return both scripts when needed |
| backend/api/podcast/handlers/audio.py | Implement smart batching |
5.2 API Endpoints
from typing import Dict, Optional

from pydantic import BaseModel

# PodcastScriptRequest model changes
class PodcastScriptRequest(BaseModel):
    idea: str
    duration_minutes: int
    speakers: int
    research: Optional[Dict] = None
    bible: Optional[Dict] = None
    analysis: Optional[Dict] = None
    outline: Optional[Dict] = None
    # NEW FIELDS:
    audio_only: bool = False  # Generate audio-optimized script
    video_only: bool = False  # Generate video-optimized script (current)
    # Both False = generate both scripts for audio+video mode

# Response includes both scripts when needed
# (Script is the existing script model; its definition is not shown here)
class PodcastScriptResponse(BaseModel):
    audio_script: Optional[Script] = None  # Audio-optimized
    video_script: Optional[Script] = None  # Video-optimized
5.3 Database Schema
# In PodcastProject model
audio_only: bool = False
scene_length_target: int = 60 # seconds
6. User Experience
6.1 Create Phase - Mode Toggle
┌─────────────────────────────────────────────────────────────┐
│ 🎙️ Create New Podcast │
├─────────────────────────────────────────────────────────────┤
│ Duration: [5] minutes Speakers: [1] [2] │
│ │
│ Podcast Mode: │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Audio Only │ │ Video Only │ │ Audio+Video │ │
│ │ ($0.22) │ │ ($2.02) │ │ ($2.24) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ Est. Cost: $0.22 (audio only) vs $2.02 (with video) │
└─────────────────────────────────────────────────────────────┘
6.2 Script Editor - Audio Only Mode
┌─────────────────────────────────────────────────────────────┐
│ Script Editor │
├─────────────────────────────────────────────────────────────┤
│ 📻 Audio-Only Mode │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Scene 1: Introduction (90s) [Edit]│
│ │ Host: Welcome to today's episode on AI... │
│ │ Host: Today we're diving deep into how AI... │
│ │ Host: I'm excited to share three key insights... │
│ │ ... (6-10 lines for audio) │
│ │ │
│ │ Scene 2: Main Topic (120s) [Edit]│
│ │ ... │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ [Generate Audio] $0.04 [Generate Image] Disabled │
│ [Generate Video] Disabled │
└─────────────────────────────────────────────────────────────┘
6.3 Script Editor - Video Only Mode (Current)
┌─────────────────────────────────────────────────────────────┐
│ Script Editor │
├─────────────────────────────────────────────────────────────┤
│ 🎬 Video Mode │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Scene 1: Intro (30s) [Image] [Audio] [V] │
│ │ Scene 2: Hook (30s) [Image] [Audio] [V] │
│ │ Scene 3: Content (45s) [Image] [Audio] [V] │
│ │ Scene 4: Example (30s) [Image] [Audio] [V] │
│ │ Scene 5: CTA (15s) [Image] [Audio] [V] │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ [Generate Audio] $0.19 [Generate Image] $0.10 │
│ [Generate Video] $1.50 │
└─────────────────────────────────────────────────────────────┘
6.4 Script Editor - Audio + Video Mode (Both)
┌─────────────────────────────────────────────────────────────┐
│ Script Editor [Audio] [Video] │
├─────────────────────────────────────────────────────────────┤
│ ┌─────────────────────────────────────────────────────┐ │
│ │ [Audio] Tab | [Video] Tab │ │
│ ├─────────────────────────────────────────────────────┤ │
│ │ Audio Script: │ │
│ │ Scene 1: Intro (90s) - 8 lines │ │
│ │ Scene 2: Deep Dive (120s) - 10 lines │ │
│ │ │ │
│ │ [Generate Audio] $0.04 │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
OR
┌─────────────────────────────────────────────────────────────┐
│ Script Editor [Audio] [Video] │
├─────────────────────────────────────────────────────────────┤
│ ┌─────────────────────────────────────────────────────┐ │
│ │ [Audio] Tab | [Video] Tab │ │
│ ├─────────────────────────────────────────────────────┤ │
│ │ Video Script: │ │
│ │ Scene 1: Intro (30s) [Img] [Aud] [Vid] │ │
│ │ Scene 2: Hook (30s) [Img] [Aud] [Vid] │ │
│ │ Scene 3: Content (45s) [Img] [Aud] [Vid] │ │
│ │ │ │
│ │ [Generate Audio] [Generate Image] [Generate Video] │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
6.5 Cost Comparison UI
| Mode | Scenes | Lines/Scene | TTS Cost | Video Cost | Total |
|---|---|---|---|---|---|
| Audio Only | 3-4 | 6-10 | $0.19 | $0 | $0.22 |
| Video Only | 5-6 | 2-4 | $0.19 | $1.50 | $1.72 |
| Audio+Video | 3-4 + 5-6 | varies | $0.19 | $1.50 | $1.72 |
7. Testing Plan
7.1 Unit Tests
- Test character count calculation
- Test scene batching logic (under 10k chars)
- Test cost estimation accuracy
7.2 Integration Tests
- Generate audio for 10-minute podcast with 5 scenes
- Verify all scenes generate correctly
- Verify cost tracking in database
7.3 Performance Tests
- Measure time for batched vs sequential API calls
- Verify no timeout issues with longer text
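The batched versus sequential comparison can be prototyped against a stub before touching the real API. In this sketch fake_tts is a stand-in that only simulates a fixed per-request delay, timed_generation is a hypothetical helper, and the <#0.5#> marker value is an assumption; real measurements would need the actual TTS client.

```python
import time
from typing import Callable, List, Tuple


def fake_tts(text: str) -> bytes:
    """Stand-in for the real TTS call; sleeps to mimic per-request overhead."""
    time.sleep(0.01)
    return text.encode("utf-8")


def timed_generation(
    payloads: List[str], tts: Callable[[str], bytes] = fake_tts
) -> Tuple[int, float]:
    """Run one TTS call per payload and return (call_count, elapsed_seconds)."""
    start = time.perf_counter()
    for payload in payloads:
        tts(payload)
    return len(payloads), time.perf_counter() - start


scenes = [f"Scene {i} text." for i in range(1, 6)]
sequential_calls, sequential_time = timed_generation(scenes)
batched_calls, batched_time = timed_generation(["<#0.5#>".join(scenes)])
print(f"sequential: {sequential_calls} calls, {sequential_time:.3f}s")
print(f"batched:    {batched_calls} calls, {batched_time:.3f}s")
```

The stub does not model how synthesis time grows with text length, so it only illustrates the harness shape, not the expected speedup.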
8. Success Metrics
| Metric | Target | Current |
|---|---|---|
| API calls per 5-min podcast | 5 | 7 |
| Cost per 5-min audio podcast | $0.22 | $0.22 + video |
| User-visible savings | 50%+ | N/A |
| Scene length default | 60s | 45s |
9. Appendix: Related Files
Backend
- backend/services/llm_providers/main_audio_generation.py - TTS cost calculation
- backend/api/podcast/handlers/audio.py - Audio generation endpoint
- backend/api/podcast/handlers/script.py - Script generation
- backend/services/subscription/pricing_service.py - Pricing configuration
Frontend
- frontend/src/services/podcastApi.ts - Cost estimation
- frontend/src/components/PodcastMaker/CreateModal.tsx - Create UI
- frontend/src/components/PodcastMaker/types.ts - Type definitions
Document History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2026-04-08 | ALwrity Team | Initial document creation |
This document serves as the reference for audio-only podcast optimization in ALwrity Podcast Maker.