ajaysi 0732887c09 Analyzing your idea with AI...

2026-04-19 13:21:36 +05:30

Audio-Only Podcast Optimization Plan

Executive Summary

This document outlines the optimization strategy for audio-only podcasts in ALwrity's Podcast Maker. The goal is to maximize character throughput per API request while preserving cost efficiency and audio quality.

1. Current Cost Analysis

1.1 Pricing Structure

Service	Provider	Cost Formula	Notes
TTS (Audio)	Minimax Speech-02-HD (WaveSpeed)	$0.05 per 1,000 chars	Exact billing per character
Voice Clone	Minimax Voice Clone	$0.50 per clone	One-time if using custom voice
Research	Exa Neural Search	$0.005 per query	+ ~$0.001 for LLM insight extraction
Avatar	Ideogram Character	$0.10 per image	Only if AI-generated

1.2 Cost Examples

Podcast Duration	Characters (est.)	TTS Cost	Total Cost (audio-only)
1 minute	750	$0.04	$0.07
3 minutes	2,250	$0.11	$0.14
5 minutes	3,750	$0.19	$0.22
10 minutes	7,500	$0.38	$0.41
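
The TTS column follows directly from the per-character rate in Section 1.1. A minimal sketch, assuming roughly 750 characters per minute of speech (the estimate implied by the table) and half-up rounding to whole cents; the function names are illustrative and not part of the codebase:

```python
from decimal import Decimal, ROUND_HALF_UP

CHARS_PER_MINUTE = 750             # assumption: estimate implied by the table
TTS_RATE_PER_1K = Decimal("0.05")  # USD per 1,000 characters (Section 1.1)


def estimate_chars(duration_minutes: int) -> int:
    """Estimated script length for a given episode duration."""
    return duration_minutes * CHARS_PER_MINUTE


def estimate_tts_cost(duration_minutes: int) -> Decimal:
    """Estimated TTS cost in USD, rounded half-up to cents."""
    chars = estimate_chars(duration_minutes)
    cost = Decimal(chars) * TTS_RATE_PER_1K / Decimal(1000)
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
```

Decimal arithmetic is used so that values such as $0.1875 round to $0.19 as in the table. The Total Cost column includes costs beyond TTS and is not modeled here.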

2. Technical Constraints

2.1 API Limits

Backend: main_audio_generation.py (line 100)

if len(text) > 10000:
    raise ValueError(f"Text is too long ({len(text)} characters). Maximum is 10,000 characters.")

Current Limit: 10,000 characters per single API request

2.2 Scene-Based Architecture

Each scene = 1 API call
Default scene length: 45 seconds (scene_length_target knob)
Audio is generated per scene, then concatenated

3. Optimization Strategies

3.1 Strategy 1: Fewer, Longer Scenes

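Problem: More scenes = more API calls = more per-request overhead. Note that TTS itself is billed per character (Section 1.1), so the per-character cost does not change.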

Solution:

Increase scene_length_target from 45s to 60s or 90s
Fewer scenes for the same podcast duration

Impact:

Duration	Scenes (45s)	Scenes (60s)	Scenes (90s)	API Call Savings (45s vs 90s)
5 min	7	5	3	57% fewer calls
10 min	13	10	7	46% fewer calls
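
The scene counts above appear to be the episode length divided by scene_length_target, rounded to the nearest whole scene, with savings measured from the 45s default to 90s. A sketch under that assumption (the helper names are hypothetical):

```python
def scene_count(duration_minutes: int, scene_length_target: int) -> int:
    """Number of scenes, i.e. TTS calls, for an episode (nearest whole scene, at least one)."""
    return max(1, round(duration_minutes * 60 / scene_length_target))


def call_savings_pct(duration_minutes: int, baseline_s: int = 45, target_s: int = 90) -> int:
    """Percentage of API calls saved by moving from baseline_s to target_s scenes."""
    before = scene_count(duration_minutes, baseline_s)
    after = scene_count(duration_minutes, target_s)
    return round(100 * (before - after) / before)
```

With these definitions a 5-minute episode drops from 7 calls to 3, and a 10-minute episode from 13 to 7. Rounding to the nearest scene means individual scenes run slightly over or under the target length.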

3.2 Strategy 2: Per-Scene Character Budgeting

Current behavior: Each scene text is sent separately to TTS API

Optimization options:

Text Concatenation: Combine multiple scene texts with <#x#> pause markers, where x is the pause length in seconds
```
# Example: Combine scenes with pause markers (0.5 s is an illustrative value)
combined_text = "Scene 1 text.<#0.5#>Scene 2 text.<#0.5#>Scene 3 text."
```
- Risk: May hit 10,000 char limit faster
- Benefit: Single API call for multiple scenes

Smart Chunking: Dynamically batch scenes based on character count

MAX_CHARS_PER_REQUEST = 9500  # Leave buffer
# Group scenes until approaching limit
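
The grouping step can be sketched as a greedy pass over the scene texts. This is an illustrative outline rather than the project's implementation: batch_scenes and the 0.5-second pause value are assumptions, and the pause marker length is counted against the budget.

```python
from typing import List

MAX_CHARS_PER_REQUEST = 9500  # buffer below the 10,000-character API limit
PAUSE_MARKER = "<#0.5#>"      # assumption: illustrative pause between scenes


def batch_scenes(
    scene_texts: List[str],
    max_chars: int = MAX_CHARS_PER_REQUEST,
    pause: str = PAUSE_MARKER,
) -> List[str]:
    """Greedily join consecutive scenes into as few requests as fit the budget."""
    batches: List[str] = []
    current = ""
    for text in scene_texts:
        if len(text) > max_chars:
            raise ValueError(f"Scene is too long ({len(text)} characters).")
        # Joining adds the pause marker, so measure the combined string.
        candidate = text if not current else current + pause + text
        if len(candidate) <= max_chars:
            current = candidate
        else:
            batches.append(current)
            current = text
    if current:
        batches.append(current)
    return batches
```

Scene order is preserved, so the resulting audio files can still be concatenated in sequence. A scene that exceeds the budget on its own is rejected rather than split, which keeps the sketch simple.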

3.3 Strategy 3: Voice Settings for Longer Content

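Speed factor impacts (assuming the speed factor scales speaking rate linearly):

Speed 0.8 = the same text plays 25% longer, so a fixed duration needs 20% fewer characters
Speed 1.2 = the same text plays about 17% shorter, so a fixed duration needs 20% more characters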

Recommendation: Use speed 0.9-1.0 for optimal quality/cost balance
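
As a rough check on how speed interacts with the character budget, the sketch below assumes the speed factor scales speaking rate linearly and reuses the estimate of about 750 characters per minute from Section 1.2. Both are assumptions, not documented provider behavior, and the helper names are hypothetical.

```python
BASE_CHARS_PER_SECOND = 750 / 60  # 12.5 chars/s at speed 1.0 (estimate from Section 1.2)


def audio_seconds(num_chars: int, speed: float = 1.0) -> float:
    """Estimated audio length in seconds for a given amount of text."""
    return num_chars / (BASE_CHARS_PER_SECOND * speed)


def chars_for_duration(seconds: float, speed: float = 1.0) -> int:
    """Estimated number of characters needed to fill a fixed duration."""
    return round(seconds * BASE_CHARS_PER_SECOND * speed)
```

Under these assumptions, one minute of audio needs about 600 characters at speed 0.8 and about 900 at speed 1.2, so slower speech lowers the billed characters per minute of output.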

3.4 Strategy 4: Skip Visual Generation in Audio-Only Mode

For audio-only podcasts (no video):

Skip avatar generation - Save $0.10 per speaker
Skip video rendering - Save $0.30 per scene
Skip scene images - Save $0.04-$0.10 per scene

Estimated savings for 5-min, 5-scene audio podcast:

Component	Cost	Audio-Only Savings
Avatar	$0.10	$0.10
Video (5 scenes)	$1.50	$1.50
Images (5 scenes)	$0.20-$0.50	$0.20-$0.50
Total	$1.80-$2.10	$1.80-$2.10
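
The totals in this table are the sum of the per-item savings listed above. A small sketch of that arithmetic, using the stated unit prices; the constant and function names are illustrative:

```python
from decimal import Decimal
from typing import Tuple

AVATAR_COST = Decimal("0.10")                              # per avatar
VIDEO_COST_PER_SCENE = Decimal("0.30")                     # per scene
IMAGE_COST_PER_SCENE = (Decimal("0.04"), Decimal("0.10"))  # (low, high) per scene


def audio_only_savings(num_scenes: int, num_avatars: int = 1) -> Tuple[Decimal, Decimal]:
    """Return the (low, high) savings in USD from skipping avatar, video and images."""
    fixed = AVATAR_COST * num_avatars + VIDEO_COST_PER_SCENE * num_scenes
    low = fixed + IMAGE_COST_PER_SCENE[0] * num_scenes
    high = fixed + IMAGE_COST_PER_SCENE[1] * num_scenes
    return low, high
```

For five scenes and one avatar this reproduces the $1.80-$2.10 range in the Total row.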

4. Implementation Plan

4.1 Phase 1: User-Facing Controls (Frontend)

4.1.1 Add "Audio Only" Toggle

Location: CreateModal.tsx or PodcastConfiguration.tsx
Options: Audio Only | Video Only | Audio + Video
When Audio Only is selected: Skip avatar, image, video generation
Pass audio_only: true or video_only: true to the backend (both false for Audio + Video)

4.1.2 Cost Preview Updates

Show cost comparison based on selected mode
Display potential savings for audio-only vs video

4.2 Phase 2: Script Editor UI (NEW - CRITICAL)

4.2.1 Three Mode UI Strategy

The script editor needs to adapt based on the podcast mode:

Mode	Script Editor UI	Available Actions
Audio Only	Single audio-optimized script	Generate Audio only
Video Only	Current video script editor	Generate Audio + Image + Video
Audio + Video	Two tabs: "Audio Script" + "Video Script"	Full generation options

4.2.2 Implementation Details

File: frontend/src/components/PodcastMaker/ScriptEditor/ScriptEditor.tsx

New Component Structure:

interface ScriptEditorProps {
  // ... existing props
  audioOnlyMode: boolean;    // Audio-only podcast
  videoOnlyMode: boolean;    // Video-only podcast (current behavior)
  audioScript?: Script;      // Audio-optimized script (3-4 scenes, more lines)
  videoScript?: Script;      // Video-optimized script (current)
  onAudioScriptChange?: (script: Script) => void;
  onVideoScriptChange?: (script: Script) => void;
}

UI Layout:

┌─────────────────────────────────────────────────────────────┐
│  Script Editor                              [Audio] [Video] tabs (if both)
├─────────────────────────────────────────────────────────────┤
│  Mode: Audio-Only                                          │
│  ┌─────────────────────────────────────────────────────┐  │
│  │ Scene 1: Introduction (90s)                     [Edit]│  │
│  │   Host: Welcome to today's episode...                 │  │
│  │   Host: Today we're diving deep into...               │  │
│  │   ... (6-10 lines per scene for audio)                │  │
│  └─────────────────────────────────────────────────────┘  │
│                                                             │
│  [Generate Audio] $0.04                                   │
└─────────────────────────────────────────────────────────────┘

4.2.3 Tab Implementation for Audio + Video Mode

When both Audio and Video are selected:

Show two tabs in script editor:
- Tab 1: "Audio Script" - Audio-optimized (fewer scenes, more content)
- Tab 2: "Video Script" - Current video script (more scenes, visual)
Each tab has independent:
- Scene structure
- Edit capabilities
- Generation buttons
Generation actions differ by tab:
- Audio Tab: "Generate Audio" button only
- Video Tab: "Generate Audio" + "Generate Image" + "Generate Video"

4.2.4 Backend Script Generation Updates

Script generation endpoint changes:

# In PodcastScriptRequest model
class PodcastScriptRequest(BaseModel):
    # ... existing fields
    audio_only: bool = False      # Generate audio-optimized script
    video_only: bool = False     # Generate video-optimized script (current)
    # If both False AND audio/video mode is "both", generate both scripts

Prompt Selection Logic:

if request.audio_only:
    prompt = AUDIO_ONLY_PROMPT  # 3-4 scenes, 6-10 lines/scene
elif request.video_only:
    prompt = VIDEO_PROMPT        # Current 5-6 scenes, 2-4 lines/scene
else:
    # Generate both scripts with respective prompts
    audio_prompt = AUDIO_ONLY_PROMPT
    video_prompt = VIDEO_PROMPT
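
The selection logic above can be wrapped in a small function that also rejects the case where both flags are set, which the two-boolean design otherwise allows. This is a sketch only: select_prompts is a hypothetical helper, the prompt strings are placeholders, and the dictionary keys mirror the audio_script and video_script response fields.

```python
from typing import Dict

# Placeholders standing in for the real prompt templates in script.py
AUDIO_ONLY_PROMPT = "audio-only prompt"
VIDEO_PROMPT = "video prompt"


def select_prompts(audio_only: bool = False, video_only: bool = False) -> Dict[str, str]:
    """Map the request flags to the prompt(s) that need to be run."""
    if audio_only and video_only:
        raise ValueError("audio_only and video_only are mutually exclusive")
    if audio_only:
        return {"audio_script": AUDIO_ONLY_PROMPT}
    if video_only:
        return {"video_script": VIDEO_PROMPT}
    # Both flags False: Audio + Video mode, so both scripts are generated.
    return {"audio_script": AUDIO_ONLY_PROMPT, "video_script": VIDEO_PROMPT}
```

Returning a mapping keyed by script type lets the caller run one generation per entry without branching on the mode again.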

4.3 Phase 3: Backend Script Generation (AI Prompts)

4.3.1 Two-Tier Script Generation Strategy

Current Behavior (Video Podcast):

Existing prompt in backend/api/podcast/handlers/script.py (lines 125-151)
Optimized for video with shorter scenes (2-4 lines per scene)
5-6 scenes max for visual storytelling
Less content per scene to match video duration

New Audio-Only Mode:

New prompt optimized for audio-only content
More content-dense, information-rich
Fewer scenes with MORE content per scene
Maximizes use of research data
Reduces API calls while delivering more value

4.3.2 Audio-Only Script Prompt

Location: backend/api/podcast/handlers/script.py

New Prompt for Audio-Only:

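# f-string: build this inside the handler, where request, research_context, bible_context and analysis_context are in scope
AUDIO_ONLY_PROMPT = f"""Create a DEEP, content-rich podcast script optimized for AUDIO-ONLY delivery.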

{f"RESEARCH DATA (Use extensively - this is audio only, more content is better): {research_context[:3000]}" if research_context else "No research available - generate general content"}

{f"BIBLE: {bible_context[:1500]}" if bible_context else ""}
{f"{analysis_context}" if analysis_context else ""}

Topic: "{request.idea}"
Duration: {request.duration_minutes} min | Speakers: {request.speakers}
MODE: AUDIO-ONLY (no video constraints - maximize content density)

COST OPTIMIZATION (Audio-Only):
- 3-4 scenes MAX for entire episode (fewer scenes = fewer API calls)
- EACH scene should have 6-10 LINES (more content per scene)
- Each line: 3-5 sentences, information-dense
- Include: facts, statistics, examples, insights from research
- NO visual descriptions needed (save tokens for content)
- Make every line deliver unique value

STRUCTURE per scene:
- scene_id: string
- title: short descriptive title
- duration: seconds (target {request.duration_minutes*60 // 4}-{request.duration_minutes*60 // 3} per scene)
- emotion: neutral|happy|excited|serious|curious|confident
- lines: array of {{speaker, text, emphasis}}
  - speaker: "Host" or "Guest"
  - text: 3-5 sentences, rich with facts/insights
  - emphasis: true|false for important points

Return JSON with scenes array.
"""
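
The per-scene duration target in the prompt comes from dividing the episode length by the 3-4 scene range. A short sketch of that calculation, with the lower bound listed first; scene_duration_range is a hypothetical helper, not an existing function:

```python
from typing import Tuple


def scene_duration_range(
    duration_minutes: int, min_scenes: int = 3, max_scenes: int = 4
) -> Tuple[int, int]:
    """Return (shortest, longest) target scene length in seconds."""
    total_seconds = duration_minutes * 60
    # More scenes give shorter scenes, so max_scenes yields the lower bound.
    return total_seconds // max_scenes, total_seconds // min_scenes
```

For a 5-minute episode this gives a target of 75-100 seconds per scene.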

Key Differences:

Aspect	Video (Current)	Audio-Only (New)
Scenes	5-6	3-4
Lines/Scene	2-4	6-10
Sentences/Line	1-3	3-5
Research Usage	1,200 chars	3,000 chars
Focus	Visual storytelling	Content density
API Calls	More (lower cost/scene)	Fewer (higher cost/scene)

4.3.3 Implementation Details

File: backend/api/podcast/handlers/script.py

Add audio_only: bool parameter to PodcastScriptRequest
Conditionally select prompt based on audio_only flag
For audio-only:
- Use expanded research context (3,000 chars vs 1,200)
- Request more lines per scene
- Fewer total scenes
- More content per line

4.4 Phase 4: Backend Optimizations

4.4.1 Smart Scene Batching

File: backend/api/podcast/handlers/audio.py
Logic: Group scenes while the combined text, including pause markers, stays within 9,500 characters (MAX_CHARS_PER_REQUEST)
Add pause markers between scenes

4.4.2 Audio-Only Flag in Project

Model: Add audio_only: bool to project settings
Skip: Avatar generation, image generation, video rendering
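
One way to express the skip rule is to derive the list of pipeline steps from the project flag. The step names and the generation_steps helper below are hypothetical and only illustrate the idea:

```python
from typing import List


def generation_steps(audio_only: bool) -> List[str]:
    """Return the ordered generation steps for a project."""
    steps = ["script", "audio"]
    if not audio_only:
        # Visual steps are only needed when the project includes video.
        steps += ["avatar", "images", "video"]
    return steps
```

Keeping the decision in one place avoids scattering audio_only checks across the avatar, image and video handlers.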

4.5 Phase 5: Cost Calculation Updates

4.5.1 Update Frontend Estimation

File: frontend/src/services/podcastApi.ts

Formula updates:

const estimatedApiCalls = Math.ceil(totalChars / 9500);
// TTS is billed per character ($0.05 per 1,000 chars), not per request
const ttsCost = (totalChars / 1000) * 0.05;
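
Because Section 1.1 states that TTS is billed per character, the request count and the TTS cost are separate quantities: the first depends on the per-request budget, the second only on total characters. A Python sketch of the same estimate, assuming ideal packing of text into requests (scene-based batching may need more calls); estimate_tts is a hypothetical name:

```python
import math
from decimal import Decimal, ROUND_HALF_UP
from typing import Dict, Union

MAX_CHARS_PER_REQUEST = 9500       # buffer below the 10,000-character limit
TTS_RATE_PER_1K = Decimal("0.05")  # USD per 1,000 characters (Section 1.1)


def estimate_tts(total_chars: int) -> Dict[str, Union[int, Decimal]]:
    """Estimate the request count and the TTS cost for a script."""
    api_calls = math.ceil(total_chars / MAX_CHARS_PER_REQUEST)
    cost = (Decimal(total_chars) * TTS_RATE_PER_1K / 1000).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP
    )
    return {"api_calls": api_calls, "tts_cost": cost}
```

A 3,750-character script needs one request and costs about $0.19, while a 20,000-character script needs three requests and costs $1.00.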

5. Technical Details

5.1 Files to Modify

File	Changes
`frontend/src/components/PodcastMaker/types.ts`	Add `audio_only`, `video_only`, `podcast_mode` to project settings
`frontend/src/components/PodcastMaker/CreateModal.tsx`	Add mode toggle (Audio/Video/Both)
`frontend/src/services/podcastApi.ts`	Update cost estimation for each mode
`frontend/src/components/PodcastMaker/ScriptEditor/ScriptEditor.tsx`	Add tab support for Audio + Video mode
`frontend/src/components/PodcastMaker/ScriptEditor/SceneEditor.tsx`	Conditional action buttons per mode
`backend/api/podcast/models.py`	Add `audio_only`, `video_only` fields to request model
`backend/api/podcast/handlers/script.py`	Add audio-only + video-only prompts, return both scripts when needed
`backend/api/podcast/handlers/audio.py`	Implement smart batching

5.2 API Endpoints

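# PodcastScriptRequest model changes
from typing import Dict, Optional
from pydantic import BaseModel  # assumes the existing models are Pydantic-based

class PodcastScriptRequest(BaseModel):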
    idea: str
    duration_minutes: int
    speakers: int
    research: Optional[Dict] = None
    bible: Optional[Dict] = None
    analysis: Optional[Dict] = None
    outline: Optional[Dict] = None
    # NEW FIELDS:
    audio_only: bool = False      # Generate audio-optimized script
    video_only: bool = False      # Generate video-optimized script (current)
    # Both False = generate both scripts for audio+video mode

# Response includes both scripts when needed
class PodcastScriptResponse(BaseModel):
    audio_script: Optional[Script] = None   # Audio-optimized
    video_script: Optional[Script] = None   # Video-optimized

5.3 Database Schema

# In PodcastProject model
audio_only: bool = False
scene_length_target: int = 60  # seconds

6. User Experience

6.1 Create Phase - Mode Toggle

┌─────────────────────────────────────────────────────────────┐
│  🎙️ Create New Podcast                                     │
├─────────────────────────────────────────────────────────────┤
│  Duration: [5] minutes   Speakers: [1] [2]                   │
│                                                             │
│  Podcast Mode:                                              │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐          │
│  │ Audio Only  │ │ Video Only  │ │ Audio+Video │          │
│  │   ($0.22)   │ │   ($2.02)   │ │   ($2.24)   │          │
│  └─────────────┘ └─────────────┘ └─────────────┘          │
│                                                             │
│  Est. Cost: $0.22 (audio only) vs $2.02 (with video)       │
└─────────────────────────────────────────────────────────────┘

6.2 Script Editor - Audio Only Mode

┌─────────────────────────────────────────────────────────────┐
│  Script Editor                                              │
├─────────────────────────────────────────────────────────────┤
│  📻 Audio-Only Mode                                         │
│  ┌─────────────────────────────────────────────────────┐    │
│  │ Scene 1: Introduction (90s)                     [Edit]│
│  │   Host: Welcome to today's episode on AI...         │
│  │   Host: Today we're diving deep into how AI...      │
│  │   Host: I'm excited to share three key insights...  │
│  │   ... (6-10 lines for audio)                        │
│  │                                                      │
│  │ Scene 2: Main Topic (120s)                      [Edit]│
│  │   ...                                               │
│  └─────────────────────────────────────────────────────┘    │
│                                                             │
│  [Generate Audio] $0.04      [Generate Image] Disabled    │
│  [Generate Video] Disabled                                   │
└─────────────────────────────────────────────────────────────┘

6.3 Script Editor - Video Only Mode (Current)

┌─────────────────────────────────────────────────────────────┐
│  Script Editor                                              │
├─────────────────────────────────────────────────────────────┤
│  🎬 Video Mode                                               │
│  ┌─────────────────────────────────────────────────────┐    │
│  │ Scene 1: Intro (30s)          [Image] [Audio] [V] │
│  │ Scene 2: Hook (30s)            [Image] [Audio] [V]  │
│  │ Scene 3: Content (45s)         [Image] [Audio] [V]  │
│  │ Scene 4: Example (30s)         [Image] [Audio] [V]  │
│  │ Scene 5: CTA (15s)             [Image] [Audio] [V]   │
│  └─────────────────────────────────────────────────────┘    │
│                                                             │
│  [Generate Audio] $0.19   [Generate Image] $0.10           │
│  [Generate Video] $1.50                                     │
└─────────────────────────────────────────────────────────────┘

6.4 Script Editor - Audio + Video Mode (Both)

┌─────────────────────────────────────────────────────────────┐
│  Script Editor                             [Audio] [Video] │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────────────────────────────────────────────┐  │
│  │ [Audio] Tab | [Video] Tab                           │  │
│  ├─────────────────────────────────────────────────────┤  │
│  │ Audio Script:                                        │  │
│  │   Scene 1: Intro (90s) - 8 lines                   │  │
│  │   Scene 2: Deep Dive (120s) - 10 lines              │  │
│  │                                                      │  │
│  │ [Generate Audio] $0.04                              │  │
│  └─────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
OR
┌─────────────────────────────────────────────────────────────┐
│  Script Editor                             [Audio] [Video] │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────────────────────────────────────────────┐  │
│  │ [Audio] Tab | [Video] Tab                           │  │
│  ├─────────────────────────────────────────────────────┤  │
│  │ Video Script:                                       │  │
│  │   Scene 1: Intro (30s)    [Img] [Aud] [Vid]         │  │
│  │   Scene 2: Hook (30s)      [Img] [Aud] [Vid]        │  │
│  │   Scene 3: Content (45s)   [Img] [Aud] [Vid]        │  │
│  │                                                      │  │
│  │ [Generate Audio] [Generate Image] [Generate Video]  │  │
│  └─────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

6.5 Cost Comparison UI

Mode	Scenes	Lines/Scene	TTS Cost	Video Cost	Total
Audio Only	3-4	6-10	$0.19	$0	$0.22
Video Only	5-6	2-4	$0.19	$1.50	$1.69
Audio+Video	3-4 + 5-6	varies	$0.19	$1.50	$1.72

7. Testing Plan

7.1 Unit Tests

Test character count calculation
Test scene batching logic (under 10k chars)
Test cost estimation accuracy

7.2 Integration Tests

Generate audio for 10-minute podcast with 5 scenes
Verify all scenes generate correctly
Verify cost tracking in database

7.3 Performance Tests

Measure time for batched vs sequential API calls
Verify no timeout issues with longer text

8. Success Metrics

Metric	Target	Current
API calls per 5-min podcast	5	7
Cost per 5-min audio podcast	$0.22	$0.22 + video
User-visible savings	50%+	N/A
Scene length default	60s	45s

Backend

backend/services/llm_providers/main_audio_generation.py - TTS cost calculation
backend/api/podcast/handlers/audio.py - Audio generation endpoint
backend/api/podcast/handlers/script.py - Script generation
backend/services/subscription/pricing_service.py - Pricing configuration

Frontend

frontend/src/services/podcastApi.ts - Cost estimation
frontend/src/components/PodcastMaker/CreateModal.tsx - Create UI
frontend/src/components/PodcastMaker/types.ts - Type definitions

Document History

Version	Date	Author	Changes
1.0	2026-04-08	ALwrity Team	Initial document creation

This document serves as the reference for audio-only podcast optimization in ALwrity Podcast Maker.
