# Podcast Maker External API Call Analysis

## Overview
This document analyzes all external API calls made during the podcast creation workflow and how they scale with duration, number of speakers, and other factors.

---

## External API Providers

1. **Gemini (Google)** - LLM for story setup and script generation
2. **Google Grounding** - Research via Gemini's native search grounding
3. **Exa** - Alternative neural search provider for research
4. **WaveSpeed** - API gateway for:
   - **Minimax Speech 02 HD** - Text-to-Speech (TTS)
   - **InfiniteTalk** - Avatar animation (image + audio → video)

---

## Workflow Phases & API Calls

### Phase 1: Project Creation (`createProject`)

**External API Calls:**
1. **Gemini LLM** - Story setup generation
   - **Endpoint**: `/api/story/generate-setup`
   - **Backend**: `storyWriterApi.generateStorySetup()`
   - **Service**: `backend/services/story_writer/service_components/setup.py`
   - **Function**: `llm_text_gen()` → Gemini API
   - **Calls per project**: **1 call**
   - **Scaling**: Fixed (1 call regardless of duration)

2. **Research Config** (Optional)
   - **Endpoint**: `/api/research-config`
   - **Calls per project**: **0-1 call** (cached)
   - **Scaling**: Fixed

**Total Phase 1**: **1-2 external API calls** (fixed). The summary totals count only the Gemini call, treating the research config lookup as cached.

---

### Phase 2: Research (`runResearch`)

**External API Calls:**
1. **Google Grounding** (via Gemini) OR **Exa Neural Search**
   - **Endpoint**: `/api/blog/research/start` → async task
   - **Backend**: `blogWriterApi.startResearch()`
   - **Service**: `backend/services/blog_writer/research/research_service.py`
   - **Provider Selection**:
     - **Google Grounding**: Uses Gemini's native Google Search grounding
     - **Exa**: Direct Exa API calls
   - **Calls per research**: **1 call** (handles all keywords in one request)
   - **Scaling**:
     - **Fixed per research operation** (1 call regardless of number of queries)
     - **Queries are batched** into a single research request
     - **Number of queries**: Typically 1-6 (from `mapPersonaQueries`)

**Polling Calls:**
- **Internal task polling**: `blogWriterApi.pollResearchStatus()`
- **Not external API calls** (internal task status checks)
- **Polling frequency**: Every 2.5 seconds, max 120 attempts (5 minutes)
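
The polling budget above (2.5 s × 120 attempts = 300 s) can be sketched as a small loop. This is a minimal illustration, not the actual `pollResearchStatus()` implementation: the `fetch_status` callable and the status strings `"completed"` and `"failed"` are hypothetical stand-ins for whatever the internal task endpoint returns.

```python
import time
from typing import Callable


def poll_until_done(
    fetch_status: Callable[[], str],
    interval_s: float = 2.5,
    max_attempts: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll an internal task-status check until it reaches a terminal state.

    The status values used here are assumptions for illustration only.
    """
    for _ in range(max_attempts):
        status = fetch_status()
        if status in ("completed", "failed"):
            return status
        # Wait before the next internal status check (no external API call).
        sleep(interval_s)
    return "timeout"


# Demo with a fake status source and a no-op sleep, so nothing actually waits.
fake_states = iter(["pending", "running", "completed"])
print(poll_until_done(lambda: next(fake_states), sleep=lambda _: None))
```

Injecting `sleep` keeps the loop testable without real delays; none of these iterations count toward the external call totals.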

**Total Phase 2**: **1 external API call** (fixed per research operation)

---

### Phase 3: Script Generation (`generateScript`)

**External API Calls:**
1. **Gemini LLM** - Story outline generation
   - **Endpoint**: `/api/story/generate-outline`
   - **Backend**: `storyWriterApi.generateOutline()`
   - **Service**: `backend/services/story_writer/service_components/outline.py`
   - **Function**: `llm_text_gen()` → Gemini API
   - **Calls per script**: **1 call**
   - **Scaling**:
     - **Fixed per script generation** (1 call regardless of duration)
     - **Duration affects output length** (more scenes), but not number of API calls

**Total Phase 3**: **1 external API call** (fixed)

---

### Phase 4: Audio Rendering (`renderSceneAudio`)

**External API Calls:**
1. **WaveSpeed → Minimax Speech 02 HD** - Text-to-Speech
   - **Endpoint**: `/api/story/generate-audio`
   - **Backend**: `storyWriterApi.generateAIAudio()`
   - **Service**: `backend/services/wavespeed/client.py::generate_speech()`
   - **External API**: WaveSpeed API → Minimax Speech 02 HD
   - **Calls per scene**: **1 call per scene**
   - **Scaling with duration**:
     - **Number of scenes** = `Math.ceil((duration * 60) / scene_length_target)`
     - **Default scene_length_target**: 45 seconds
     - **Example calculations**:
       - 5 minutes → `ceil(300 / 45)` = **7 scenes** = **7 TTS calls**
       - 10 minutes → `ceil(600 / 45)` = **14 scenes** = **14 TTS calls**
       - 15 minutes → `ceil(900 / 45)` = **20 scenes** = **20 TTS calls**
       - 30 minutes → `ceil(1800 / 45)` = **40 scenes** = **40 TTS calls**
   - **Scaling with speakers**:
     - **Fixed per scene** (1 call per scene regardless of speakers)
     - **Speakers affect text splitting** (lines per speaker), but not API calls
   - **Text length per call**:
     - **Characters per scene** ≈ `(scene_length_target * 15)` (assuming ~15 chars/second)
     - **5-minute podcast**: ~675 chars/scene × 7 scenes = ~4,725 total chars
     - **30-minute podcast**: ~675 chars/scene × 40 scenes = ~27,000 total chars

**Total Phase 4**: **N external API calls** where **N = number of scenes**
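
The scene-count and character estimates above can be reproduced with a few lines of Python. This is a sketch of the arithmetic only; the function names are illustrative and do not correspond to functions in the codebase, and the 15 chars/second figure is the document's rough assumption.

```python
import math

CHARS_PER_SECOND = 15  # rough speech-rate assumption used in this document


def scene_count(duration_minutes: float, scene_length_target: int = 45) -> int:
    """Number of scenes, and therefore TTS calls, for a given duration."""
    return math.ceil(duration_minutes * 60 / scene_length_target)


def estimated_tts_chars(duration_minutes: float, scene_length_target: int = 45) -> int:
    """Approximate total characters sent to TTS across all scenes."""
    chars_per_scene = scene_length_target * CHARS_PER_SECOND
    return scene_count(duration_minutes, scene_length_target) * chars_per_scene


for minutes in (5, 10, 15, 30):
    print(minutes, scene_count(minutes), estimated_tts_chars(minutes))
# 5 7 4725
# 10 14 9450
# 15 20 13500
# 30 40 27000
```

Because the scene count is rounded up, the character totals are slight overestimates whenever the duration is not a multiple of the scene length.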

---

### Phase 5: Video Rendering (`generateVideo`) - Optional

**External API Calls:**
1. **WaveSpeed → InfiniteTalk** - Avatar animation
   - **Endpoint**: `/api/podcast/render/video`
   - **Backend**: `podcastApi.generateVideo()`
   - **Service**: `backend/services/wavespeed/infinitetalk.py::animate_scene_with_voiceover()`
   - **External API**: WaveSpeed API → InfiniteTalk
   - **Calls per scene**: **1 call per scene** (if video is generated)
   - **Scaling with duration**:
     - **Same as audio rendering**: 1 call per scene
     - **5 minutes**: **7 video calls**
     - **10 minutes**: **14 video calls**
     - **15 minutes**: **20 video calls**
     - **30 minutes**: **40 video calls**
   - **Scaling with speakers**:
     - **Fixed per scene** (1 call per scene regardless of speakers)
     - **Avatar image is provided** (not generated per speaker)

**Polling Calls:**
- **Internal task polling**: `podcastApi.pollTaskStatus()`
- **Not external API calls** (internal task status checks)
- **Polling frequency**: Every 2.5 seconds until completion (can take up to 10 minutes per video)

**Total Phase 5**: **N external API calls** where **N = number of scenes** (if video is enabled)

---

## Summary: Total External API Calls

### Minimum Workflow (No Video, 5-minute podcast)
1. Project Creation: **1 call** (Gemini - story setup)
2. Research: **1 call** (Google Grounding or Exa)
3. Script Generation: **1 call** (Gemini - outline)
4. Audio Rendering: **7 calls** (Minimax TTS - 7 scenes)
5. Video Rendering: **0 calls** (not enabled)

**Total**: **10 external API calls** for a 5-minute podcast

### Full Workflow (With Video, 5-minute podcast)
1. Project Creation: **1 call** (Gemini - story setup)
2. Research: **1 call** (Google Grounding or Exa)
3. Script Generation: **1 call** (Gemini - outline)
4. Audio Rendering: **7 calls** (Minimax TTS - 7 scenes)
5. Video Rendering: **7 calls** (InfiniteTalk - 7 scenes)

**Total**: **17 external API calls** for a 5-minute podcast

### Scaling with Duration

| Duration | Scenes | Audio Calls | Video Calls | Total (Audio Only) | Total (Audio + Video) |
|----------|--------|-------------|-------------|-------------------|----------------------|
| 5 min    | 7      | 7           | 7           | 10                | 17                   |
| 10 min   | 14     | 14          | 14          | 17                | 31                   |
| 15 min   | 20     | 20          | 20          | 23                | 43                   |
| 30 min   | 40     | 40          | 40          | 43                | 83                   |

**Formula**:
- **Scenes** = `ceil((duration_minutes * 60) / scene_length_target)`
- **Total (Audio Only)** = `3 + scenes` (3 fixed + N scenes)
- **Total (Audio + Video)** = `3 + (scenes * 2)` (3 fixed + N audio + N video)
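
As a quick check, the formulas above can be evaluated directly to regenerate the scaling table. The helper name below is illustrative, not part of the codebase.

```python
import math

FIXED_CALLS = 3  # story setup + research + outline


def total_external_calls(
    duration_minutes: float,
    with_video: bool,
    scene_length_target: int = 45,
) -> int:
    """Total external API calls: 3 fixed plus 1 (audio) or 2 (audio + video) per scene."""
    scenes = math.ceil(duration_minutes * 60 / scene_length_target)
    calls_per_scene = 2 if with_video else 1
    return FIXED_CALLS + calls_per_scene * scenes


for minutes in (5, 10, 15, 30):
    print(
        minutes,
        total_external_calls(minutes, with_video=False),
        total_external_calls(minutes, with_video=True),
    )
# 5 10 17
# 10 17 31
# 15 23 43
# 30 43 83
```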

---

## Scaling Factors

### 1. Duration
- **Impact**: Linear scaling of rendering calls (audio + video)
- **Fixed calls**: 3 (setup, research, script)
- **Variable calls**: `2 * scenes` (if video enabled) or `1 * scenes` (audio only)
- **Scene count formula**: `ceil((duration * 60) / scene_length_target)`

### 2. Number of Speakers
- **Impact**: **No impact on external API calls**
- **Reason**:
  - Text is split into lines per speaker **before** API calls
  - Each scene makes **1 TTS call** regardless of speaker count
  - Video uses **1 avatar image** (not per speaker)

### 3. Scene Length Target
- **Impact**: Affects number of scenes (and thus rendering calls)
- **Default**: 45 seconds
- **Shorter scenes** = More scenes = More API calls
- **Longer scenes** = Fewer scenes = Fewer API calls
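
To make the trade-off concrete, the sketch below sweeps a few scene length targets for a 10-minute podcast. Only the 45-second default comes from this document; the other targets are hypothetical values chosen for illustration.

```python
import math


def scenes_for(duration_minutes: float, scene_length_target: int) -> int:
    """Scene count for a given duration and target scene length in seconds."""
    return math.ceil(duration_minutes * 60 / scene_length_target)


for target in (30, 45, 60, 90):
    n = scenes_for(10, target)
    print(f"{target}s scenes -> {n} TTS calls, {2 * n} rendering calls with video")
# 30s scenes -> 20 TTS calls, 40 rendering calls with video
# 45s scenes -> 14 TTS calls, 28 rendering calls with video
# 60s scenes -> 10 TTS calls, 20 rendering calls with video
# 90s scenes -> 7 TTS calls, 14 rendering calls with video
```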

### 4. Research Provider
- **Impact**: **No impact on call count**
- **Google Grounding**: 1 call (batched)
- **Exa**: 1 call (batched)
- **Either provider**: Same number of calls

### 5. Video Generation
- **Impact**: **Doubles rendering calls** (adds 1 call per scene)
- **Audio only**: `N` calls (N = scenes)
- **Audio + Video**: `2N` calls (N audio + N video)

---

## Cost Implications

### API Call Costs (Estimated)

1. **Gemini LLM** (Story Setup & Script):
   - **Setup**: ~2,000 tokens → ~$0.001-0.002
   - **Outline**: ~3,000-5,000 tokens → ~$0.002-0.005
   - **Total**: ~$0.003-0.007 per podcast

2. **Google Grounding** (Research):
   - **Per research**: ~1,200 tokens → ~$0.001-0.002
   - **Fixed cost** regardless of query count

3. **Exa Neural Search** (Alternative):
   - **Per research**: ~$0.005 (flat rate)
   - **Fixed cost** regardless of query count

4. **Minimax TTS** (Audio):
   - **Rate**: ~$0.05 per 1,000 characters (~$0.034 per 675-character scene)
   - **5-minute podcast**: ~4,725 chars → ~$0.24
   - **30-minute podcast**: ~27,000 chars → ~$1.35
   - **Scales linearly with duration**

5. **InfiniteTalk** (Video):
   - **Rate**: ~$0.03-0.06 per second of video (depending on resolution), or ~$1.35-2.70 per 45-second scene
   - **5-minute podcast**: 7 scenes × 45s × $0.03 = ~$9.45 (at the $0.03 lower bound)
   - **30-minute podcast**: 40 scenes × 45s × $0.03 = ~$54.00 (at the $0.03 lower bound)
   - **Scales linearly with duration**

### Total Cost Examples

| Duration | Audio Only | Audio + Video (720p) |
|----------|-----------|---------------------|
| 5 min    | ~$0.25    | ~$9.70              |
| 10 min   | ~$0.48    | ~$19.40             |
| 15 min   | ~$0.69    | ~$27.70             |
| 30 min   | ~$1.36    | ~$55.40             |

**Note**: Costs are estimates and may vary based on actual API pricing, text length, and video resolution.
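
The per-provider rendering estimates above can be reproduced with a short calculation. This sketch uses only the rates quoted in this document (~$0.05 per 1,000 TTS characters, $0.03 per second of video, ~15 chars/second); it omits the small fixed LLM and research costs (roughly $0.01 per podcast), and the function name is illustrative.

```python
import math

TTS_USD_PER_1K_CHARS = 0.05   # estimated rate from this document
VIDEO_USD_PER_SECOND = 0.03   # lower bound of the estimated $0.03-0.06 range
CHARS_PER_SECOND = 15         # rough speech-rate assumption


def rendering_cost(
    duration_minutes: float,
    with_video: bool,
    scene_length_target: int = 45,
) -> dict:
    """Estimated TTS and video cost in USD, billed per full scene."""
    scenes = math.ceil(duration_minutes * 60 / scene_length_target)
    billed_seconds = scenes * scene_length_target
    tts = billed_seconds * CHARS_PER_SECOND / 1000 * TTS_USD_PER_1K_CHARS
    video = billed_seconds * VIDEO_USD_PER_SECOND if with_video else 0.0
    return {"tts": tts, "video": video}


for minutes in (5, 30):
    cost = rendering_cost(minutes, with_video=True)
    print(minutes, round(cost["tts"], 2), round(cost["video"], 2))
# 5 0.24 9.45
# 30 1.35 54.0
```

At these rates video accounts for the large majority of the total, so the choice of video rate (and resolution) matters far more than any of the fixed calls.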

---

## Optimization Opportunities

1. **Batch TTS Calls**: Currently 1 call per scene. Could batch multiple scenes if API supports it.
2. **Cache Research Results**: Already implemented for exact keyword matches.
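3. **Parallel Rendering**: Scenes could be rendered concurrently. Within a single scene, the video call must wait for that scene's audio, since InfiniteTalk takes the avatar image plus the audio as input.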
4. **Scene Length Optimization**: Longer scenes = fewer API calls (but may reduce quality).
5. **Video Optional**: Video generation doubles rendering calls and dominates cost (~$9.45 of video vs. ~$0.24 of audio for a 5-minute podcast), so keep it optional/on-demand.
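
One way to approach per-scene parallelism is bounded concurrency with `asyncio`. The sketch below assumes scenes can be rendered independently and that the rendering call is available as a coroutine; `render_audio` and `fake_tts` are hypothetical stand-ins, not functions from the WaveSpeed client, and the concurrency limit of 4 is an arbitrary example value.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def render_scenes(
    scenes: Sequence[str],
    render_audio: Callable[[str], Awaitable[str]],
    max_concurrency: int = 4,
) -> list:
    """Render scenes concurrently, capping the number of in-flight calls."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def render_one(scene: str) -> str:
        async with semaphore:
            return await render_audio(scene)

    # gather preserves input order, so results line up with the scenes.
    return await asyncio.gather(*(render_one(s) for s in scenes))


async def fake_tts(scene: str) -> str:
    """Hypothetical stand-in for a per-scene TTS call."""
    await asyncio.sleep(0)
    return f"{scene}.mp3"


print(asyncio.run(render_scenes(["scene-1", "scene-2", "scene-3"], fake_tts)))
# ['scene-1.mp3', 'scene-2.mp3', 'scene-3.mp3']
```

This changes wall-clock time only. The number of external API calls stays at one per scene, and the cap should respect whatever rate limits the provider enforces.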

---

## Internal vs External Calls

### Internal (Not Counted as External)
- Preflight validation checks (`/api/billing/preflight`)
- Task status polling (`/api/story/task/{taskId}/status`)
- Project persistence (`/api/podcast/projects/*`)
- Content asset library (`/api/content-assets/*`)

### External (Counted)
- Gemini LLM (story setup, script generation)
- Google Grounding (research)
- Exa (research alternative)
- WaveSpeed → Minimax TTS (audio)
- WaveSpeed → InfiniteTalk (video)

---

## Conclusion

**Key Findings:**
1. **Fixed overhead**: 3 external API calls per podcast (setup, research, script)
2. **Variable overhead**: 1-2 calls per scene (audio, optionally video)
3. **Duration is the primary scaling factor** for rendering calls
4. **Number of speakers does NOT affect API call count**
5. **Video generation doubles rendering API calls**

**Recommendations:**
- Monitor API call counts and costs per podcast duration
- Consider batching strategies for TTS calls if supported
- Make video generation optional/on-demand to reduce costs
- Optimize scene length to balance quality vs. API call count