Files

ajaysi 81590cf4db WIP: AI Podcast Maker and YouTube Creator Studio integration

2025-12-10 09:37:55 +05:30

11 KiB

Raw Blame History

Podcast Maker External API Call Analysis

Overview

This document analyzes all external API calls made during the podcast creation workflow and how they scale with duration, number of speakers, and other factors.

External API Providers

Gemini (Google) - LLM for story setup and script generation
Google Grounding - Research via Gemini's native search grounding
Exa - Alternative neural search provider for research
WaveSpeed - API gateway for:
- Minimax Speech 02 HD - Text-to-Speech (TTS)
- InfiniteTalk - Avatar animation (image + audio → video)

Workflow Phases & API Calls

Phase 1: Project Creation (`createProject`)

External API Calls:

Gemini LLM - Story setup generation
- Endpoint: /api/story/generate-setup
- Backend: storyWriterApi.generateStorySetup()
- Service: backend/services/story_writer/service_components/setup.py
- Function: llm_text_gen() → Gemini API
- Calls per project: 1 call
- Scaling: Fixed (1 call regardless of duration)
Research Config (Optional)
- Endpoint: /api/research-config
- Calls per project: 0-1 call (cached)
- Scaling: Fixed

Total Phase 1: 1-2 external API calls (fixed)

Phase 2: Research (`runResearch`)

External API Calls:

Google Grounding (via Gemini) OR Exa Neural Search
- Endpoint: /api/blog/research/start → async task
- Backend: blogWriterApi.startResearch()
- Service: backend/services/blog_writer/research/research_service.py
- Provider Selection:
  - Google Grounding: Uses Gemini's native Google Search grounding
  - Exa: Direct Exa API calls
- Calls per research: 1 call (handles all keywords in one request)
- Scaling:
  - Fixed per research operation (1 call regardless of number of queries)
  - Queries are batched into a single research request
  - Number of queries: Typically 1-6 (from mapPersonaQueries)

Polling Calls:

Internal task polling: blogWriterApi.pollResearchStatus()
Not external API calls (internal task status checks)
Polling frequency: Every 2.5 seconds, max 120 attempts (5 minutes)

Total Phase 2: 1 external API call (fixed per research operation)

Phase 3: Script Generation (`generateScript`)

External API Calls:

Gemini LLM - Story outline generation
- Endpoint: /api/story/generate-outline
- Backend: storyWriterApi.generateOutline()
- Service: backend/services/story_writer/service_components/outline.py
- Function: llm_text_gen() → Gemini API
- Calls per script: 1 call
- Scaling:
  - Fixed per script generation (1 call regardless of duration)
  - Duration affects output length (more scenes), but not number of API calls

Total Phase 3: 1 external API call (fixed)

Phase 4: Audio Rendering (`renderSceneAudio`)

External API Calls:

WaveSpeed → Minimax Speech 02 HD - Text-to-Speech
- Endpoint: /api/story/generate-audio
- Backend: storyWriterApi.generateAIAudio()
- Service: backend/services/wavespeed/client.py::generate_speech()
- External API: WaveSpeed API → Minimax Speech 02 HD
- Calls per scene: 1 call per scene
- Scaling with duration:
  - Number of scenes = Math.ceil((duration * 60) / scene_length_target)
  - Default scene_length_target: 45 seconds
  - Example calculations:
    - 5 minutes → ceil(300 / 45) = 7 scenes = 7 TTS calls
    - 10 minutes → ceil(600 / 45) = 14 scenes = 14 TTS calls
    - 15 minutes → ceil(900 / 45) = 20 scenes = 20 TTS calls
    - 30 minutes → ceil(1800 / 45) = 40 scenes = 40 TTS calls
- Scaling with speakers:
  - Fixed per scene (1 call per scene regardless of speakers)
  - Speakers affect text splitting (lines per speaker), but not API calls
- Text length per call:
  - Characters per scene ≈ (scene_length_target * 15) (assuming ~15 chars/second)
  - 5-minute podcast: ~675 chars/scene × 7 scenes = ~4,725 total chars
  - 30-minute podcast: ~675 chars/scene × 40 scenes = ~27,000 total chars

Total Phase 4: N external API calls where N = number of scenes

Phase 5: Video Rendering (`generateVideo`) - Optional

External API Calls:

WaveSpeed → InfiniteTalk - Avatar animation
- Endpoint: /api/podcast/render/video
- Backend: podcastApi.generateVideo()
- Service: backend/services/wavespeed/infinitetalk.py::animate_scene_with_voiceover()
- External API: WaveSpeed API → InfiniteTalk
- Calls per scene: 1 call per scene (if video is generated)
- Scaling with duration:
  - Same as audio rendering: 1 call per scene
  - 5 minutes: 7 video calls
  - 10 minutes: 14 video calls
  - 15 minutes: 20 video calls
  - 30 minutes: 40 video calls
- Scaling with speakers:
  - Fixed per scene (1 call per scene regardless of speakers)
  - Avatar image is provided (not generated per speaker)

Polling Calls:

Internal task polling: podcastApi.pollTaskStatus()
Not external API calls (internal task status checks)
Polling frequency: Every 2.5 seconds until completion (can take up to 10 minutes per video)

Total Phase 5: N external API calls where N = number of scenes (if video is enabled)

Summary: Total External API Calls

Minimum Workflow (No Video, 5-minute podcast)

Project Creation: 1 call (Gemini - story setup)
Research: 1 call (Google Grounding or Exa)
Script Generation: 1 call (Gemini - outline)
Audio Rendering: 7 calls (Minimax TTS - 7 scenes)
Video Rendering: 0 calls (not enabled)

Total: 10 external API calls for a 5-minute podcast

Full Workflow (With Video, 5-minute podcast)

Project Creation: 1 call (Gemini - story setup)
Research: 1 call (Google Grounding or Exa)
Script Generation: 1 call (Gemini - outline)
Audio Rendering: 7 calls (Minimax TTS - 7 scenes)
Video Rendering: 7 calls (InfiniteTalk - 7 scenes)

Total: 17 external API calls for a 5-minute podcast

Scaling with Duration

Duration	Scenes	Audio Calls	Video Calls	Total (Audio Only)	Total (Audio + Video)
5 min	7	7	7	10	17
10 min	14	14	14	17	31
15 min	20	20	20	23	43
30 min	40	40	40	43	83

Formula:

Scenes = ceil((duration_minutes * 60) / scene_length_target)
Total (Audio Only) = 3 + scenes (3 fixed + N scenes)
Total (Audio + Video) = 3 + (scenes * 2) (3 fixed + N audio + N video)

Scaling Factors

1. Duration

Impact: Linear scaling of rendering calls (audio + video)
Fixed calls: 3 (setup, research, script)
Variable calls: 2 * scenes (if video enabled) or 1 * scenes (audio only)
Scene count formula: ceil((duration * 60) / scene_length_target)

2. Number of Speakers

Impact: No impact on external API calls
Reason:
- Text is split into lines per speaker before API calls
- Each scene makes 1 TTS call regardless of speaker count
- Video uses 1 avatar image (not per speaker)

3. Scene Length Target

Impact: Affects number of scenes (and thus rendering calls)
Default: 45 seconds
Shorter scenes = More scenes = More API calls
Longer scenes = Fewer scenes = Fewer API calls

4. Research Provider

Impact: No impact on call count
Google Grounding: 1 call (batched)
Exa: 1 call (batched)
Both: Same number of calls

5. Video Generation

Impact: Doubles rendering calls (adds 1 call per scene)
Audio only: N calls (N = scenes)
Audio + Video: 2N calls (N audio + N video)

Cost Implications

API Call Costs (Estimated)

Gemini LLM (Story Setup & Script):
- Setup: ~2,000 tokens → ~$0.001-0.002
- Outline: ~3,000-5,000 tokens → ~$0.002-0.005
- Total: ~$0.003-0.007 per podcast
Google Grounding (Research):
- Per research: ~1,200 tokens → ~$0.001-0.002
- Fixed cost regardless of query count
Exa Neural Search (Alternative):
- Per research: ~$0.005 (flat rate)
- Fixed cost regardless of query count
Minimax TTS (Audio):
- Per scene: ~$0.05 per 1,000 characters
- 5-minute podcast: ~4,725 chars → ~$0.24
- 30-minute podcast: ~27,000 chars → ~$1.35
- Scales linearly with duration
InfiniteTalk (Video):
- Per scene: ~$0.03-0.06 per second (depending on resolution)
- 5-minute podcast: 7 scenes × 45s × $0.03 = ~$9.45
- 30-minute podcast: 40 scenes × 45s × $0.03 = ~$54.00
- Scales linearly with duration

Total Cost Examples

Duration	Audio Only	Audio + Video (720p)
5 min	~$0.25	~$9.50
10 min	~$0.50	~$19.00
15 min	~$0.75	~$28.50
30 min	~$1.50	~$57.00

Note: Costs are estimates and may vary based on actual API pricing, text length, and video resolution.

Optimization Opportunities

Batch TTS Calls: Currently 1 call per scene. Could batch multiple scenes if API supports it.
Cache Research Results: Already implemented for exact keyword matches.
Parallel Rendering: Audio and video rendering could be parallelized per scene.
Scene Length Optimization: Longer scenes = fewer API calls (but may reduce quality).
Video Optional: Video generation doubles costs - make it optional/on-demand.

Internal vs External Calls

Internal (Not Counted as External)

Preflight validation checks (/api/billing/preflight)
Task status polling (/api/story/task/{taskId}/status)
Project persistence (/api/podcast/projects/*)
Content asset library (/api/content-assets/*)

External (Counted)

Gemini LLM (story setup, script generation)
Google Grounding (research)
Exa (research alternative)
WaveSpeed → Minimax TTS (audio)
WaveSpeed → InfiniteTalk (video)

Conclusion

Key Findings:

Fixed overhead: 3 external API calls per podcast (setup, research, script)
Variable overhead: 1-2 calls per scene (audio, optionally video)
Duration is the primary scaling factor for rendering calls
Number of speakers does NOT affect API call count
Video generation doubles rendering API calls

Recommendations:

Monitor API call counts and costs per podcast duration
Consider batching strategies for TTS calls if supported
Make video generation optional/on-demand to reduce costs
Optimize scene length to balance quality vs. API call count

11 KiB Raw Blame History Unescape Escape

Podcast Maker External API Call Analysis

Overview

External API Providers

Workflow Phases & API Calls

Phase 1: Project Creation (createProject)

Phase 2: Research (runResearch)

Phase 3: Script Generation (generateScript)

Phase 4: Audio Rendering (renderSceneAudio)

Phase 5: Video Rendering (generateVideo) - Optional

Summary: Total External API Calls

Minimum Workflow (No Video, 5-minute podcast)

Full Workflow (With Video, 5-minute podcast)

Scaling with Duration

Scaling Factors

1. Duration

2. Number of Speakers

3. Scene Length Target

4. Research Provider

5. Video Generation

Cost Implications

API Call Costs (Estimated)

Total Cost Examples

Optimization Opportunities

Internal vs External Calls

Internal (Not Counted as External)

External (Counted)

Conclusion

11 KiB

Raw Blame History

Phase 1: Project Creation (`createProject`)

Phase 2: Research (`runResearch`)

Phase 3: Script Generation (`generateScript`)

Phase 4: Audio Rendering (`renderSceneAudio`)

Phase 5: Video Rendering (`generateVideo`) - Optional