WIP: AI Podcast Maker and YouTube Creator Studio integration

2025-12-10 09:37:55 +05:30
parent 31f078c763
commit 81590cf4db
75 changed files with 11879 additions and 1380 deletions
--- a/docs/PODCAST_API_CALL_ANALYSIS.md
+++ b/docs/PODCAST_API_CALL_ANALYSIS.md
@@ -0,0 +1,295 @@
+# Podcast Maker External API Call Analysis
+
+## Overview
+This document catalogs every external API call made during the podcast creation workflow and describes how the call count scales with duration, number of speakers, and other factors.
+
+---
+
+## External API Providers
+
+1. **Gemini (Google)** - LLM for story setup and script generation
+2. **Google Grounding** - Research via Gemini's native search grounding
+3. **Exa** - Alternative neural search provider for research
+4. **WaveSpeed** - API gateway for:
+   - **Minimax Speech 02 HD** - Text-to-Speech (TTS)
+   - **InfiniteTalk** - Avatar animation (image + audio → video)
+
+---
+
+## Workflow Phases & API Calls
+
+### Phase 1: Project Creation (`createProject`)
+
+**External API Calls:**
+1. **Gemini LLM** - Story setup generation
+   - **Endpoint**: `/api/story/generate-setup`
+   - **Backend**: `storyWriterApi.generateStorySetup()`
+   - **Service**: `backend/services/story_writer/service_components/setup.py`
+   - **Function**: `llm_text_gen()` → Gemini API
+   - **Calls per project**: **1 call**
+   - **Scaling**: Fixed (1 call regardless of duration)
+
+2. **Research Config** (Optional)
+   - **Endpoint**: `/api/research-config`
+   - **Calls per project**: **0-1 call** (cached)
+   - **Scaling**: Fixed
+
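+**Total Phase 1**: **1-2 external API calls** (fixed). The summary totals count only the Gemini call, because the research-config request is optional and cached.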
+
+---
+
+### Phase 2: Research (`runResearch`)
+
+**External API Calls:**
+1. **Google Grounding** (via Gemini) OR **Exa Neural Search**
+   - **Endpoint**: `/api/blog/research/start` → async task
+   - **Backend**: `blogWriterApi.startResearch()`
+   - **Service**: `backend/services/blog_writer/research/research_service.py`
+   - **Provider Selection**:
+     - **Google Grounding**: Uses Gemini's native Google Search grounding
+     - **Exa**: Direct Exa API calls
+   - **Calls per research**: **1 call** (handles all keywords in one request)
+   - **Scaling**: 
+     - **Fixed per research operation** (1 call regardless of number of queries)
+     - **Queries are batched** into a single research request
+     - **Number of queries**: Typically 1-6 (from `mapPersonaQueries`)
+
+**Polling Calls:**
+- **Internal task polling**: `blogWriterApi.pollResearchStatus()`
+- **Not external API calls** (internal task status checks)
+- **Polling frequency**: Every 2.5 seconds, max 120 attempts (5 minutes)
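
The polling budget above (2.5-second interval, 120 attempts, about 5 minutes) can be sketched as a small loop. This is only an illustration, not the project's actual `pollResearchStatus()` code: the `fetch_status` callable and the `"completed"` / `"failed"` status names are assumptions made for the example.

```python
import time
from typing import Callable, Optional

POLL_INTERVAL_SECONDS = 2.5
MAX_ATTEMPTS = 120  # 120 attempts * 2.5 s = 300 s = 5 minutes


def poll_until_done(
    fetch_status: Callable[[], str],
    interval: float = POLL_INTERVAL_SECONDS,
    max_attempts: int = MAX_ATTEMPTS,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Return the terminal status, or None if the attempt budget runs out.

    `fetch_status` stands in for an internal task-status request; it is a
    hypothetical hook, not a real client method.
    """
    for attempt in range(max_attempts):
        status = fetch_status()
        if status in ("completed", "failed"):  # assumed terminal status names
            return status
        if attempt < max_attempts - 1:
            sleep(interval)  # no need to wait after the final attempt
    return None
```

Injecting `sleep` keeps the loop testable without real delays. Each iteration is an internal status check, so none of them adds to the external call count.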
+
+**Total Phase 2**: **1 external API call** (fixed per research operation)
+
+---
+
+### Phase 3: Script Generation (`generateScript`)
+
+**External API Calls:**
+1. **Gemini LLM** - Story outline generation
+   - **Endpoint**: `/api/story/generate-outline`
+   - **Backend**: `storyWriterApi.generateOutline()`
+   - **Service**: `backend/services/story_writer/service_components/outline.py`
+   - **Function**: `llm_text_gen()` → Gemini API
+   - **Calls per script**: **1 call**
+   - **Scaling**: 
+     - **Fixed per script generation** (1 call regardless of duration)
+     - **Duration affects output length** (more scenes), but not number of API calls
+
+**Total Phase 3**: **1 external API call** (fixed)
+
+---
+
+### Phase 4: Audio Rendering (`renderSceneAudio`)
+
+**External API Calls:**
+1. **WaveSpeed → Minimax Speech 02 HD** - Text-to-Speech
+   - **Endpoint**: `/api/story/generate-audio`
+   - **Backend**: `storyWriterApi.generateAIAudio()`
+   - **Service**: `backend/services/wavespeed/client.py::generate_speech()`
+   - **External API**: WaveSpeed API → Minimax Speech 02 HD
+   - **Calls per scene**: **1 call per scene**
+   - **Scaling with duration**:
+     - **Number of scenes** = `Math.ceil((duration * 60) / scene_length_target)`
+     - **Default scene_length_target**: 45 seconds
+     - **Example calculations**:
+       - 5 minutes → `ceil(300 / 45)` = **7 scenes** = **7 TTS calls**
+       - 10 minutes → `ceil(600 / 45)` = **14 scenes** = **14 TTS calls**
+       - 15 minutes → `ceil(900 / 45)` = **20 scenes** = **20 TTS calls**
+       - 30 minutes → `ceil(1800 / 45)` = **40 scenes** = **40 TTS calls**
+   - **Scaling with speakers**:
+     - **Fixed per scene** (1 call per scene regardless of speakers)
+     - **Speakers affect text splitting** (lines per speaker), but not API calls
+   - **Text length per call**:
+     - **Characters per scene** ≈ `(scene_length_target * 15)` (assuming ~15 chars/second)
+     - **5-minute podcast**: ~675 chars/scene × 7 scenes = ~4,725 total chars
+     - **30-minute podcast**: ~675 chars/scene × 40 scenes = ~27,000 total chars
+
+**Total Phase 4**: **N external API calls** where **N = number of scenes**
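
The scene-count and character arithmetic above can be reproduced with a short sketch. The function and constant names here are illustrative rather than taken from the codebase; the 45-second default and the ~15 characters/second figure are the values stated above.

```python
import math

DEFAULT_SCENE_LENGTH_TARGET = 45  # seconds per scene (default stated above)
CHARS_PER_SECOND = 15             # rough speech density assumed above


def scene_count(duration_minutes: float,
                scene_length_target: int = DEFAULT_SCENE_LENGTH_TARGET) -> int:
    """Number of scenes, and therefore TTS calls, for a given duration."""
    return math.ceil((duration_minutes * 60) / scene_length_target)


def estimated_characters(duration_minutes: float,
                         scene_length_target: int = DEFAULT_SCENE_LENGTH_TARGET) -> int:
    """Approximate total characters sent to TTS across all scenes."""
    chars_per_scene = scene_length_target * CHARS_PER_SECOND
    return scene_count(duration_minutes, scene_length_target) * chars_per_scene


if __name__ == "__main__":
    for minutes in (5, 10, 15, 30):
        print(minutes, scene_count(minutes), estimated_characters(minutes))
```

Because the scene count is rounded up, the character estimate treats the last scene as full length, so it slightly overstates the text volume whenever the duration is not a multiple of the scene length.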
+
+---
+
+### Phase 5: Video Rendering (`generateVideo`) - Optional
+
+**External API Calls:**
+1. **WaveSpeed → InfiniteTalk** - Avatar animation
+   - **Endpoint**: `/api/podcast/render/video`
+   - **Backend**: `podcastApi.generateVideo()`
+   - **Service**: `backend/services/wavespeed/infinitetalk.py::animate_scene_with_voiceover()`
+   - **External API**: WaveSpeed API → InfiniteTalk
+   - **Calls per scene**: **1 call per scene** (if video is generated)
+   - **Scaling with duration**:
+     - **Same as audio rendering**: 1 call per scene
+     - **5 minutes**: **7 video calls**
+     - **10 minutes**: **14 video calls**
+     - **15 minutes**: **20 video calls**
+     - **30 minutes**: **40 video calls**
+   - **Scaling with speakers**:
+     - **Fixed per scene** (1 call per scene regardless of speakers)
+     - **Avatar image is provided** (not generated per speaker)
+
+**Polling Calls:**
+- **Internal task polling**: `podcastApi.pollTaskStatus()`
+- **Not external API calls** (internal task status checks)
+- **Polling frequency**: Every 2.5 seconds until completion (can take up to 10 minutes per video)
+
+**Total Phase 5**: **N external API calls** where **N = number of scenes** (if video is enabled)
+
+---
+
+## Summary: Total External API Calls
+
+### Minimum Workflow (No Video, 5-minute podcast)
+1. Project Creation: **1 call** (Gemini - story setup)
+2. Research: **1 call** (Google Grounding or Exa)
+3. Script Generation: **1 call** (Gemini - outline)
+4. Audio Rendering: **7 calls** (Minimax TTS - 7 scenes)
+5. Video Rendering: **0 calls** (not enabled)
+
+**Total**: **10 external API calls** for a 5-minute podcast
+
+### Full Workflow (With Video, 5-minute podcast)
+1. Project Creation: **1 call** (Gemini - story setup)
+2. Research: **1 call** (Google Grounding or Exa)
+3. Script Generation: **1 call** (Gemini - outline)
+4. Audio Rendering: **7 calls** (Minimax TTS - 7 scenes)
+5. Video Rendering: **7 calls** (InfiniteTalk - 7 scenes)
+
+**Total**: **17 external API calls** for a 5-minute podcast
+
+### Scaling with Duration
+
+| Duration | Scenes | Audio Calls | Video Calls | Total (Audio Only) | Total (Audio + Video) |
+|----------|--------|-------------|-------------|-------------------|----------------------|
+| 5 min    | 7      | 7           | 7           | 10                | 17                   |
+| 10 min   | 14     | 14          | 14          | 17                | 31                   |
+| 15 min   | 20     | 20          | 20          | 23                | 43                   |
+| 30 min   | 40     | 40          | 40          | 43                | 83                   |
+
+**Formula**: 
+- **Scenes** = `ceil((duration_minutes * 60) / scene_length_target)`
+- **Total (Audio Only)** = `3 + scenes` (3 fixed + N scenes)
+- **Total (Audio + Video)** = `3 + (scenes * 2)` (3 fixed + N audio + N video)
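
As a quick check of the table, the formulas can be written as one function. This is a minimal sketch; `total_external_calls` is a hypothetical helper name, not something defined in the repository.

```python
import math

FIXED_CALLS = 3  # story setup + research + script outline


def total_external_calls(duration_minutes: float,
                         include_video: bool = False,
                         scene_length_target: int = 45) -> int:
    """Total external API calls for one podcast, per the formulas above."""
    scenes = math.ceil((duration_minutes * 60) / scene_length_target)
    calls_per_scene = 2 if include_video else 1  # TTS, plus video if enabled
    return FIXED_CALLS + scenes * calls_per_scene


if __name__ == "__main__":
    for minutes in (5, 10, 15, 30):
        print(minutes,
              total_external_calls(minutes),
              total_external_calls(minutes, include_video=True))
```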
+
+---
+
+## Scaling Factors
+
+### 1. Duration
+- **Impact**: Linear scaling of rendering calls (audio + video)
+- **Fixed calls**: 3 (setup, research, script)
+- **Variable calls**: `2 * scenes` (if video enabled) or `1 * scenes` (audio only)
+- **Scene count formula**: `ceil((duration * 60) / scene_length_target)`
+
+### 2. Number of Speakers
+- **Impact**: **No impact on external API calls**
+- **Reason**: 
+  - Text is split into lines per speaker **before** API calls
+  - Each scene makes **1 TTS call** regardless of speaker count
+  - Video uses **1 avatar image** (not per speaker)
+
+### 3. Scene Length Target
+- **Impact**: Affects number of scenes (and thus rendering calls)
+- **Default**: 45 seconds
+- **Shorter scenes** = More scenes = More API calls
+- **Longer scenes** = Fewer scenes = Fewer API calls
+
+### 4. Research Provider
+- **Impact**: **No impact on call count**
+- **Google Grounding**: 1 call (batched)
+- **Exa**: 1 call (batched)
+- **Both**: Same number of calls
+
+### 5. Video Generation
+- **Impact**: **Doubles rendering calls** (adds 1 call per scene)
+- **Audio only**: `N` calls (N = scenes)
+- **Audio + Video**: `2N` calls (N audio + N video)
+
+---
+
+## Cost Implications
+
+### API Call Costs (Estimated)
+
+1. **Gemini LLM** (Story Setup & Script):
+   - **Setup**: ~2,000 tokens → ~$0.001-0.002
+   - **Outline**: ~3,000-5,000 tokens → ~$0.002-0.005
+   - **Total**: ~$0.003-0.007 per podcast
+
+2. **Google Grounding** (Research):
+   - **Per research**: ~1,200 tokens → ~$0.001-0.002
+   - **Fixed cost** regardless of query count
+
+3. **Exa Neural Search** (Alternative):
+   - **Per research**: ~$0.005 (flat rate)
+   - **Fixed cost** regardless of query count
+
+4. **Minimax TTS** (Audio):
+   - **Rate**: ~$0.05 per 1,000 characters
+   - **5-minute podcast**: ~4,725 chars → ~$0.24
+   - **30-minute podcast**: ~27,000 chars → ~$1.35
+   - **Scales linearly with duration**
+
+5. **InfiniteTalk** (Video):
+   - **Rate**: ~$0.03-0.06 per second of video (depending on resolution)
+   - **5-minute podcast**: 7 scenes × 45s × $0.03 = ~$9.45 (at the lower-bound rate)
+   - **30-minute podcast**: 40 scenes × 45s × $0.03 = ~$54.00 (at the lower-bound rate)
+   - **Scales linearly with duration**
+
+### Total Cost Examples
+
+| Duration | Audio Only | Audio + Video (720p) |
+|----------|-----------|---------------------|
+| 5 min    | ~$0.25    | ~$9.70              |
+| 10 min   | ~$0.48    | ~$19.40             |
+| 15 min   | ~$0.69    | ~$27.70             |
+| 30 min   | ~$1.36    | ~$55.40             |
+
+**Note**: Costs are estimates and may vary based on actual API pricing, text length, and video resolution.
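
The rendering costs can be estimated from the per-unit rates listed above. The sketch below assumes the stated figures (~$0.05 per 1,000 TTS characters, ~15 characters/second, and the lower-bound video rate of $0.03 per second); the function names are illustrative and the rates are estimates, not confirmed provider pricing.

```python
import math

TTS_RATE_PER_1K_CHARS = 0.05  # USD, estimate stated above
VIDEO_RATE_PER_SECOND = 0.03  # USD, lower bound of the stated range
CHARS_PER_SECOND = 15         # assumed speech density


def _scenes(duration_minutes: float, scene_length_target: int) -> int:
    return math.ceil((duration_minutes * 60) / scene_length_target)


def audio_cost(duration_minutes: float, scene_length_target: int = 45) -> float:
    """Estimated TTS cost in USD, treating every scene as full length."""
    chars = _scenes(duration_minutes, scene_length_target) * scene_length_target * CHARS_PER_SECOND
    return chars / 1000 * TTS_RATE_PER_1K_CHARS


def video_cost(duration_minutes: float, scene_length_target: int = 45) -> float:
    """Estimated avatar-video cost in USD, treating every scene as full length."""
    seconds = _scenes(duration_minutes, scene_length_target) * scene_length_target
    return seconds * VIDEO_RATE_PER_SECOND


if __name__ == "__main__":
    for minutes in (5, 10, 15, 30):
        audio = audio_cost(minutes)
        print(f"{minutes} min: audio ~${audio:.2f}, audio + video ~${audio + video_cost(minutes):.2f}")
```

These figures cover rendering only. The fixed Gemini and research calls add roughly one cent per podcast according to the estimates above.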
+
+---
+
+## Optimization Opportunities
+
+1. **Batch TTS Calls**: Currently 1 call per scene. Could batch multiple scenes if API supports it.
+2. **Cache Research Results**: Already implemented for exact keyword matches.
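+3. **Parallel Rendering**: Scenes are independent, so rendering could run in parallel across scenes. Within a scene, video must wait for audio, because InfiniteTalk takes the rendered audio as input.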
+4. **Scene Length Optimization**: Longer scenes = fewer API calls (but may reduce quality).
+5. **Video Optional**: Video generation doubles rendering calls and dominates cost (~$9.45 of video vs. ~$0.24 of audio for a 5-minute podcast) - make it optional/on-demand.
+
+---
+
+## Internal vs External Calls
+
+### Internal (Not Counted as External)
+- Preflight validation checks (`/api/billing/preflight`)
+- Task status polling (`/api/story/task/{taskId}/status`)
+- Project persistence (`/api/podcast/projects/*`)
+- Content asset library (`/api/content-assets/*`)
+
+### External (Counted)
+- Gemini LLM (story setup, script generation)
+- Google Grounding (research)
+- Exa (research alternative)
+- WaveSpeed → Minimax TTS (audio)
+- WaveSpeed → InfiniteTalk (video)
+
+---
+
+## Conclusion
+
+**Key Findings:**
+1. **Fixed overhead**: 3 external API calls per podcast (setup, research, script)
+2. **Variable overhead**: 1-2 calls per scene (audio, optionally video)
+3. **Duration is the primary scaling factor** for rendering calls
+4. **Number of speakers does NOT affect API call count**
+5. **Video generation doubles rendering API calls**
+
+**Recommendations:**
+- Monitor API call counts and costs per podcast duration
+- Consider batching strategies for TTS calls if supported
+- Make video generation optional/on-demand to reduce costs
+- Optimize scene length to balance quality vs. API call count
+
+
+