# Podcast Maker External API Call Analysis

## Overview

This document analyzes all external API calls made during the podcast creation workflow and how they scale with duration, number of speakers, and other factors.

---

## External API Providers

1. **Gemini (Google)** - LLM for story setup and script generation
2. **Google Grounding** - Research via Gemini's native search grounding
3. **Exa** - Alternative neural search provider for research
4. **WaveSpeed** - API gateway for:
   - **Minimax Speech 02 HD** - Text-to-Speech (TTS)
   - **InfiniteTalk** - Avatar animation (image + audio → video)

---

## Workflow Phases & API Calls

### Phase 1: Project Creation (`createProject`)

**External API Calls:**
1. **Gemini LLM** - Story setup generation
   - **Endpoint**: `/api/story/generate-setup`
   - **Backend**: `storyWriterApi.generateStorySetup()`
   - **Service**: `backend/services/story_writer/service_components/setup.py`
   - **Function**: `llm_text_gen()` → Gemini API
   - **Calls per project**: **1 call**
   - **Scaling**: Fixed (1 call regardless of duration)

2. **Research Config** (Optional)
   - **Endpoint**: `/api/research-config`
   - **Calls per project**: **0-1 call** (cached)
   - **Scaling**: Fixed

**Total Phase 1**: **1-2 external API calls** (fixed; the optional, cached Research Config call is not included in the summary totals, which count only the Gemini call)

---

### Phase 2: Research (`runResearch`)

**External API Calls:**
1. **Google Grounding** (via Gemini) OR **Exa Neural Search**
   - **Endpoint**: `/api/blog/research/start` → async task
   - **Backend**: `blogWriterApi.startResearch()`
   - **Service**: `backend/services/blog_writer/research/research_service.py`
   - **Provider Selection**:
     - **Google Grounding**: Uses Gemini's native Google Search grounding
     - **Exa**: Direct Exa API calls
   - **Calls per research**: **1 call** (handles all keywords in one request)
   - **Scaling**:
     - **Fixed per research operation** (1 call regardless of number of queries)
     - **Queries are batched** into a single research request
     - **Number of queries**: Typically 1-6 (from `mapPersonaQueries`)

**Polling Calls:**
- **Internal task polling**: `blogWriterApi.pollResearchStatus()`
- **Not external API calls** (internal task status checks)
- **Polling frequency**: Every 2.5 seconds, max 120 attempts (5 minutes)
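
The polling behaviour described above (a status check every 2.5 seconds, giving up after 120 attempts, i.e. 300 seconds) can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `poll_task`, the `get_status` callable, and the status names `"completed"` and `"failed"` are assumptions made for the example.

```python
import time
from typing import Callable

# Terminal status names are assumptions; the real task API may use different ones.
TERMINAL_STATES = {"completed", "failed"}


def poll_task(
    get_status: Callable[[], str],
    interval_s: float = 2.5,
    max_attempts: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Check an internal task status until it is terminal or attempts run out."""
    for _ in range(max_attempts):
        status = get_status()  # internal status check, not an external API call
        if status in TERMINAL_STATES:
            return status
        sleep(interval_s)  # wait before the next status check
    raise TimeoutError(
        f"task not finished after {max_attempts} checks "
        f"({max_attempts * interval_s:.0f} s)"
    )


# Usage with a stubbed status source and a no-op sleep, so nothing is really called.
statuses = iter(["pending", "running", "completed"])
print(poll_task(lambda: next(statuses), sleep=lambda _seconds: None))
```

Passing `sleep` as a parameter keeps the loop testable without real delays; none of these status checks count toward the external call totals.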

**Total Phase 2**: **1 external API call** (fixed per research operation)

---

### Phase 3: Script Generation (`generateScript`)

**External API Calls:**
1. **Gemini LLM** - Story outline generation
   - **Endpoint**: `/api/story/generate-outline`
   - **Backend**: `storyWriterApi.generateOutline()`
   - **Service**: `backend/services/story_writer/service_components/outline.py`
   - **Function**: `llm_text_gen()` → Gemini API
   - **Calls per script**: **1 call**
   - **Scaling**:
     - **Fixed per script generation** (1 call regardless of duration)
     - **Duration affects output length** (more scenes), but not number of API calls

**Total Phase 3**: **1 external API call** (fixed)

---

### Phase 4: Audio Rendering (`renderSceneAudio`)

**External API Calls:**
1. **WaveSpeed → Minimax Speech 02 HD** - Text-to-Speech
   - **Endpoint**: `/api/story/generate-audio`
   - **Backend**: `storyWriterApi.generateAIAudio()`
   - **Service**: `backend/services/wavespeed/client.py::generate_speech()`
   - **External API**: WaveSpeed API → Minimax Speech 02 HD
   - **Calls per scene**: **1 call per scene**
   - **Scaling with duration**:
     - **Number of scenes** = `Math.ceil((duration * 60) / scene_length_target)`, with `duration` in minutes and `scene_length_target` in seconds
     - **Default scene_length_target**: 45 seconds
     - **Example calculations**:
       - 5 minutes → `ceil(300 / 45)` = **7 scenes** = **7 TTS calls**
       - 10 minutes → `ceil(600 / 45)` = **14 scenes** = **14 TTS calls**
       - 15 minutes → `ceil(900 / 45)` = **20 scenes** = **20 TTS calls**
       - 30 minutes → `ceil(1800 / 45)` = **40 scenes** = **40 TTS calls**
   - **Scaling with speakers**:
     - **Fixed per scene** (1 call per scene regardless of speakers)
     - **Speakers affect text splitting** (lines per speaker), but not API calls
   - **Text length per call**:
     - **Characters per scene** ≈ `(scene_length_target * 15)` (assuming ~15 chars/second)
     - **5-minute podcast**: ~675 chars/scene × 7 scenes = ~4,725 total chars
     - **30-minute podcast**: ~675 chars/scene × 40 scenes = ~27,000 total chars
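
The scene-count and text-length arithmetic above can be checked with a short sketch. The function names are hypothetical and chosen for this example; the 45-second default and the ~15 characters/second rate are taken from the figures in this section.

```python
import math

CHARS_PER_SECOND = 15  # rough speaking-rate assumption used in this document


def scene_count(duration_minutes: float, scene_length_target: int = 45) -> int:
    """Number of scenes, which equals the number of TTS calls."""
    return math.ceil(duration_minutes * 60 / scene_length_target)


def tts_text_estimate(duration_minutes: float, scene_length_target: int = 45) -> dict:
    """Estimate TTS call count and text volume for one podcast."""
    scenes = scene_count(duration_minutes, scene_length_target)
    chars_per_scene = scene_length_target * CHARS_PER_SECOND
    return {
        "tts_calls": scenes,
        "chars_per_scene": chars_per_scene,
        "total_chars": scenes * chars_per_scene,
    }


for minutes in (5, 10, 15, 30):
    print(minutes, tts_text_estimate(minutes))
```

Because every scene is assumed to run the full target length, the character totals are upper-bound estimates; the last scene of a real script is usually shorter.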

**Total Phase 4**: **N external API calls** where **N = number of scenes**

---

### Phase 5: Video Rendering (`generateVideo`) - Optional

**External API Calls:**
1. **WaveSpeed → InfiniteTalk** - Avatar animation
   - **Endpoint**: `/api/podcast/render/video`
   - **Backend**: `podcastApi.generateVideo()`
   - **Service**: `backend/services/wavespeed/infinitetalk.py::animate_scene_with_voiceover()`
   - **External API**: WaveSpeed API → InfiniteTalk
   - **Calls per scene**: **1 call per scene** (if video is generated)
   - **Scaling with duration**:
     - **Same as audio rendering**: 1 call per scene
     - **5 minutes**: **7 video calls**
     - **10 minutes**: **14 video calls**
     - **15 minutes**: **20 video calls**
     - **30 minutes**: **40 video calls**
   - **Scaling with speakers**:
     - **Fixed per scene** (1 call per scene regardless of speakers)
     - **Avatar image is provided** (not generated per speaker)

**Polling Calls:**
- **Internal task polling**: `podcastApi.pollTaskStatus()`
- **Not external API calls** (internal task status checks)
- **Polling frequency**: Every 2.5 seconds until completion (can take up to 10 minutes per video)

**Total Phase 5**: **N external API calls** where **N = number of scenes** (if video is enabled)

---

## Summary: Total External API Calls

### Minimum Workflow (No Video, 5-minute podcast)
1. Project Creation: **1 call** (Gemini - story setup)
2. Research: **1 call** (Google Grounding or Exa)
3. Script Generation: **1 call** (Gemini - outline)
4. Audio Rendering: **7 calls** (Minimax TTS - 7 scenes)
5. Video Rendering: **0 calls** (not enabled)

**Total**: **10 external API calls** for a 5-minute podcast

### Full Workflow (With Video, 5-minute podcast)
1. Project Creation: **1 call** (Gemini - story setup)
2. Research: **1 call** (Google Grounding or Exa)
3. Script Generation: **1 call** (Gemini - outline)
4. Audio Rendering: **7 calls** (Minimax TTS - 7 scenes)
5. Video Rendering: **7 calls** (InfiniteTalk - 7 scenes)

**Total**: **17 external API calls** for a 5-minute podcast

### Scaling with Duration

| Duration | Scenes | Audio Calls | Video Calls | Total (Audio Only) | Total (Audio + Video) |
|----------|--------|-------------|-------------|-------------------|----------------------|
| 5 min | 7 | 7 | 7 | 10 | 17 |
| 10 min | 14 | 14 | 14 | 17 | 31 |
| 15 min | 20 | 20 | 20 | 23 | 43 |
| 30 min | 40 | 40 | 40 | 43 | 83 |

**Formula**:
- **Scenes** = `ceil((duration_minutes * 60) / scene_length_target)`
- **Total (Audio Only)** = `3 + scenes` (3 fixed + N scenes)
- **Total (Audio + Video)** = `3 + (scenes * 2)` (3 fixed + N audio + N video)
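
As a quick check, the formula can be expressed as a small function that reproduces the table rows. The name `total_external_calls` and its parameters are made up for this sketch; the constant 3 is the fixed setup, research, and script calls counted in the summary.

```python
import math

FIXED_CALLS = 3  # story setup + research + script outline


def total_external_calls(
    duration_minutes: float,
    scene_length_target: int = 45,
    with_video: bool = False,
) -> int:
    """Fixed calls plus one TTS call (and optionally one video call) per scene."""
    scenes = math.ceil(duration_minutes * 60 / scene_length_target)
    calls_per_scene = 2 if with_video else 1
    return FIXED_CALLS + scenes * calls_per_scene


# Prints one row per duration: minutes, audio-only total, audio + video total.
for minutes in (5, 10, 15, 30):
    print(
        minutes,
        total_external_calls(minutes),
        total_external_calls(minutes, with_video=True),
    )
```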

---

## Scaling Factors

### 1. Duration
- **Impact**: Linear scaling of rendering calls (audio + video), in whole-scene steps because of the `ceil`
- **Fixed calls**: 3 (setup, research, script)
- **Variable calls**: `2 * scenes` (if video enabled) or `1 * scenes` (audio only)
- **Scene count formula**: `ceil((duration * 60) / scene_length_target)`

### 2. Number of Speakers
- **Impact**: **No impact on external API calls**
- **Reason**:
  - Text is split into lines per speaker **before** API calls
  - Each scene makes **1 TTS call** regardless of speaker count
  - Video uses **1 avatar image** (not per speaker)

### 3. Scene Length Target
- **Impact**: Affects number of scenes (and thus rendering calls)
- **Default**: 45 seconds
- **Shorter scenes** = More scenes = More API calls
- **Longer scenes** = Fewer scenes = Fewer API calls

### 4. Research Provider
- **Impact**: **No impact on call count**
- **Google Grounding**: 1 call (batched)
- **Exa**: 1 call (batched)
- **Both**: Same number of calls

### 5. Video Generation
- **Impact**: **Doubles rendering calls** (adds 1 call per scene)
- **Audio only**: `N` calls (N = scenes)
- **Audio + Video**: `2N` calls (N audio + N video)

---

## Cost Implications

### API Call Costs (Estimated)

1. **Gemini LLM** (Story Setup & Script):
   - **Setup**: ~2,000 tokens → ~$0.001-0.002
   - **Outline**: ~3,000-5,000 tokens → ~$0.002-0.005
   - **Total**: ~$0.003-0.007 per podcast

2. **Google Grounding** (Research):
   - **Per research**: ~1,200 tokens → ~$0.001-0.002
   - **Fixed cost** regardless of query count

3. **Exa Neural Search** (Alternative):
   - **Per research**: ~$0.005 (flat rate)
   - **Fixed cost** regardless of query count

4. **Minimax TTS** (Audio):
   - **Rate**: ~$0.05 per 1,000 characters (~$0.03 per 675-character scene)
   - **5-minute podcast**: ~4,725 chars → ~$0.24
   - **30-minute podcast**: ~27,000 chars → ~$1.35
   - **Scales linearly with duration**

5. **InfiniteTalk** (Video):
   - **Rate**: ~$0.03-0.06 per second of video (depending on resolution); the examples here use $0.03
   - **5-minute podcast**: 7 scenes × 45s × $0.03 = ~$9.45
   - **30-minute podcast**: 40 scenes × 45s × $0.03 = ~$54.00
   - **Scales linearly with duration**

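### Total Cost Examples

| Duration | Audio Only | Audio + Video (720p) |
|----------|-----------|---------------------|
| 5 min | ~$0.25 | ~$9.70 |
| 10 min | ~$0.50 | ~$19.40 |
| 15 min | ~$0.70 | ~$27.70 |
| 30 min | ~$1.35 | ~$55.35 |

Figures are rounded to the nearest $0.05 and combine the per-provider estimates above: ~$0.01 of fixed LLM and research cost, TTS at ~$0.05 per 1,000 characters, and video at $0.03 per second with the default 45-second scenes.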
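
The per-provider estimates can be combined into a rough cost model. This is a sketch under the assumptions stated in this section, not actual pricing: the rate constants are the estimated figures quoted above (using the low-end $0.03/second video rate), `FIXED_USD` is a rounded stand-in for the LLM and research costs, and `estimate_cost` is a hypothetical helper.

```python
import math

TTS_USD_PER_1K_CHARS = 0.05   # estimated Minimax TTS rate from this section
VIDEO_USD_PER_SECOND = 0.03   # low end of the estimated $0.03-0.06 range
FIXED_USD = 0.01              # rounded LLM + research cost (assumption)
CHARS_PER_SECOND = 15         # rough speaking-rate assumption


def estimate_cost(
    duration_minutes: float,
    scene_length_target: int = 45,
    with_video: bool = False,
) -> float:
    """Rough USD estimate, assuming every scene runs the full target length."""
    scenes = math.ceil(duration_minutes * 60 / scene_length_target)
    rendered_seconds = scenes * scene_length_target
    total_chars = rendered_seconds * CHARS_PER_SECOND
    cost = FIXED_USD + total_chars / 1000 * TTS_USD_PER_1K_CHARS
    if with_video:
        cost += rendered_seconds * VIDEO_USD_PER_SECOND
    return cost


for minutes in (5, 10, 15, 30):
    audio = estimate_cost(minutes)
    video = estimate_cost(minutes, with_video=True)
    print(f"{minutes:>2} min  audio only ${audio:.2f}  audio + video ${video:.2f}")
```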

**Note**: Costs are estimates and may vary based on actual API pricing, text length, and video resolution.

---

## Optimization Opportunities

1. **Batch TTS Calls**: Currently 1 call per scene. Multiple scenes could be batched into one call if the API supports it.
2. **Cache Research Results**: Already implemented for exact keyword matches.
3. **Parallel Rendering**: Scenes are independent, so rendering could run concurrently across scenes (each scene's video still needs that scene's audio first).
4. **Scene Length Optimization**: Longer scenes = fewer API calls (but may reduce quality).
5. **Video Optional**: Video generation doubles rendering calls and dominates cost (~$9.45 of video vs. ~$0.24 of audio for a 5-minute podcast) - make it optional/on-demand.

---

## Internal vs External Calls

### Internal (Not Counted as External)
- Preflight validation checks (`/api/billing/preflight`)
- Task status polling (`/api/story/task/{taskId}/status`)
- Project persistence (`/api/podcast/projects/*`)
- Content asset library (`/api/content-assets/*`)

### External (Counted)
- Gemini LLM (story setup, script generation)
- Google Grounding (research)
- Exa (research alternative)
- WaveSpeed → Minimax TTS (audio)
- WaveSpeed → InfiniteTalk (video)

---

## Conclusion

**Key Findings:**
1. **Fixed overhead**: 3 external API calls per podcast (setup, research, script)
2. **Variable overhead**: 1-2 calls per scene (audio, optionally video)
3. **Duration is the primary scaling factor** for rendering calls
4. **Number of speakers does NOT affect API call count**
5. **Video generation doubles rendering API calls**

**Recommendations:**
- Monitor API call counts and costs per podcast duration
- Consider batching strategies for TTS calls if supported
- Make video generation optional/on-demand to reduce costs
- Optimize scene length to balance quality vs. API call count