Files
ALwrity/docs/PODCAST_API_CALL_ANALYSIS.md

11 KiB
Raw Blame History

Podcast Maker External API Call Analysis

Overview

This document analyzes all external API calls made during the podcast creation workflow and how they scale with duration, number of speakers, and other factors.


External API Providers

  1. Gemini (Google) - LLM for story setup and script generation
  2. Google Grounding - Research via Gemini's native search grounding
  3. Exa - Alternative neural search provider for research
  4. WaveSpeed - API gateway for:
    • Minimax Speech 02 HD - Text-to-Speech (TTS)
    • InfiniteTalk - Avatar animation (image + audio → video)

Workflow Phases & API Calls

Phase 1: Project Creation (createProject)

External API Calls:

  1. Gemini LLM - Story setup generation

    • Endpoint: /api/story/generate-setup
    • Backend: storyWriterApi.generateStorySetup()
    • Service: backend/services/story_writer/service_components/setup.py
    • Function: llm_text_gen() → Gemini API
    • Calls per project: 1 call
    • Scaling: Fixed (1 call regardless of duration)
  2. Research Config (Optional)

    • Endpoint: /api/research-config
    • Calls per project: 0-1 call (cached)
    • Scaling: Fixed

Total Phase 1: 1-2 external API calls (fixed)


Phase 2: Research (runResearch)

External API Calls:

  1. Google Grounding (via Gemini) OR Exa Neural Search
    • Endpoint: /api/blog/research/start → async task
    • Backend: blogWriterApi.startResearch()
    • Service: backend/services/blog_writer/research/research_service.py
    • Provider Selection:
      • Google Grounding: Uses Gemini's native Google Search grounding
      • Exa: Direct Exa API calls
    • Calls per research: 1 call (handles all keywords in one request)
    • Scaling:
      • Fixed per research operation (1 call regardless of number of queries)
      • Queries are batched into a single research request
      • Number of queries: Typically 1-6 (from mapPersonaQueries)

Polling Calls:

  • Internal task polling: blogWriterApi.pollResearchStatus()
  • Not external API calls (internal task status checks)
  • Polling frequency: Every 2.5 seconds, max 120 attempts (5 minutes)

Total Phase 2: 1 external API call (fixed per research operation)


Phase 3: Script Generation (generateScript)

External API Calls:

  1. Gemini LLM - Story outline generation
    • Endpoint: /api/story/generate-outline
    • Backend: storyWriterApi.generateOutline()
    • Service: backend/services/story_writer/service_components/outline.py
    • Function: llm_text_gen() → Gemini API
    • Calls per script: 1 call
    • Scaling:
      • Fixed per script generation (1 call regardless of duration)
      • Duration affects output length (more scenes), but not number of API calls

Total Phase 3: 1 external API call (fixed)


Phase 4: Audio Rendering (renderSceneAudio)

External API Calls:

  1. WaveSpeed → Minimax Speech 02 HD - Text-to-Speech
    • Endpoint: /api/story/generate-audio
    • Backend: storyWriterApi.generateAIAudio()
    • Service: backend/services/wavespeed/client.py::generate_speech()
    • External API: WaveSpeed API → Minimax Speech 02 HD
    • Calls per scene: 1 call per scene
    • Scaling with duration:
      • Number of scenes = Math.ceil((duration * 60) / scene_length_target)
      • Default scene_length_target: 45 seconds
      • Example calculations:
        • 5 minutes → ceil(300 / 45) = 7 scenes = 7 TTS calls
        • 10 minutes → ceil(600 / 45) = 14 scenes = 14 TTS calls
        • 15 minutes → ceil(900 / 45) = 20 scenes = 20 TTS calls
        • 30 minutes → ceil(1800 / 45) = 40 scenes = 40 TTS calls
    • Scaling with speakers:
      • Fixed per scene (1 call per scene regardless of speakers)
      • Speakers affect text splitting (lines per speaker), but not API calls
    • Text length per call:
      • Characters per scene(scene_length_target * 15) (assuming ~15 chars/second)
      • 5-minute podcast: ~675 chars/scene × 7 scenes = ~4,725 total chars
      • 30-minute podcast: ~675 chars/scene × 40 scenes = ~27,000 total chars

Total Phase 4: N external API calls where N = number of scenes


Phase 5: Video Rendering (generateVideo) - Optional

External API Calls:

  1. WaveSpeed → InfiniteTalk - Avatar animation
    • Endpoint: /api/podcast/render/video
    • Backend: podcastApi.generateVideo()
    • Service: backend/services/wavespeed/infinitetalk.py::animate_scene_with_voiceover()
    • External API: WaveSpeed API → InfiniteTalk
    • Calls per scene: 1 call per scene (if video is generated)
    • Scaling with duration:
      • Same as audio rendering: 1 call per scene
      • 5 minutes: 7 video calls
      • 10 minutes: 14 video calls
      • 15 minutes: 20 video calls
      • 30 minutes: 40 video calls
    • Scaling with speakers:
      • Fixed per scene (1 call per scene regardless of speakers)
      • Avatar image is provided (not generated per speaker)

Polling Calls:

  • Internal task polling: podcastApi.pollTaskStatus()
  • Not external API calls (internal task status checks)
  • Polling frequency: Every 2.5 seconds until completion (can take up to 10 minutes per video)

Total Phase 5: N external API calls where N = number of scenes (if video is enabled)


Summary: Total External API Calls

Minimum Workflow (No Video, 5-minute podcast)

  1. Project Creation: 1 call (Gemini - story setup)
  2. Research: 1 call (Google Grounding or Exa)
  3. Script Generation: 1 call (Gemini - outline)
  4. Audio Rendering: 7 calls (Minimax TTS - 7 scenes)
  5. Video Rendering: 0 calls (not enabled)

Total: 10 external API calls for a 5-minute podcast

Full Workflow (With Video, 5-minute podcast)

  1. Project Creation: 1 call (Gemini - story setup)
  2. Research: 1 call (Google Grounding or Exa)
  3. Script Generation: 1 call (Gemini - outline)
  4. Audio Rendering: 7 calls (Minimax TTS - 7 scenes)
  5. Video Rendering: 7 calls (InfiniteTalk - 7 scenes)

Total: 17 external API calls for a 5-minute podcast

Scaling with Duration

Duration Scenes Audio Calls Video Calls Total (Audio Only) Total (Audio + Video)
5 min 7 7 7 10 17
10 min 14 14 14 17 31
15 min 20 20 20 23 43
30 min 40 40 40 43 83

Formula:

  • Scenes = ceil((duration_minutes * 60) / scene_length_target)
  • Total (Audio Only) = 3 + scenes (3 fixed + N scenes)
  • Total (Audio + Video) = 3 + (scenes * 2) (3 fixed + N audio + N video)

Scaling Factors

1. Duration

  • Impact: Linear scaling of rendering calls (audio + video)
  • Fixed calls: 3 (setup, research, script)
  • Variable calls: 2 * scenes (if video enabled) or 1 * scenes (audio only)
  • Scene count formula: ceil((duration * 60) / scene_length_target)

2. Number of Speakers

  • Impact: No impact on external API calls
  • Reason:
    • Text is split into lines per speaker before API calls
    • Each scene makes 1 TTS call regardless of speaker count
    • Video uses 1 avatar image (not per speaker)

3. Scene Length Target

  • Impact: Affects number of scenes (and thus rendering calls)
  • Default: 45 seconds
  • Shorter scenes = More scenes = More API calls
  • Longer scenes = Fewer scenes = Fewer API calls

4. Research Provider

  • Impact: No impact on call count
  • Google Grounding: 1 call (batched)
  • Exa: 1 call (batched)
  • Both: Same number of calls

5. Video Generation

  • Impact: Doubles rendering calls (adds 1 call per scene)
  • Audio only: N calls (N = scenes)
  • Audio + Video: 2N calls (N audio + N video)

Cost Implications

API Call Costs (Estimated)

  1. Gemini LLM (Story Setup & Script):

    • Setup: ~2,000 tokens → ~$0.001-0.002
    • Outline: ~3,000-5,000 tokens → ~$0.002-0.005
    • Total: ~$0.003-0.007 per podcast
  2. Google Grounding (Research):

    • Per research: ~1,200 tokens → ~$0.001-0.002
    • Fixed cost regardless of query count
  3. Exa Neural Search (Alternative):

    • Per research: ~$0.005 (flat rate)
    • Fixed cost regardless of query count
  4. Minimax TTS (Audio):

    • Per scene: ~$0.05 per 1,000 characters
    • 5-minute podcast: ~4,725 chars → ~$0.24
    • 30-minute podcast: ~27,000 chars → ~$1.35
    • Scales linearly with duration
  5. InfiniteTalk (Video):

    • Per scene: ~$0.03-0.06 per second (depending on resolution)
    • 5-minute podcast: 7 scenes × 45s × $0.03 = ~$9.45
    • 30-minute podcast: 40 scenes × 45s × $0.03 = ~$54.00
    • Scales linearly with duration

Total Cost Examples

Duration Audio Only Audio + Video (720p)
5 min ~$0.25 ~$9.50
10 min ~$0.50 ~$19.00
15 min ~$0.75 ~$28.50
30 min ~$1.50 ~$57.00

Note: Costs are estimates and may vary based on actual API pricing, text length, and video resolution.


Optimization Opportunities

  1. Batch TTS Calls: Currently 1 call per scene. Could batch multiple scenes if API supports it.
  2. Cache Research Results: Already implemented for exact keyword matches.
  3. Parallel Rendering: Audio and video rendering could be parallelized per scene.
  4. Scene Length Optimization: Longer scenes = fewer API calls (but may reduce quality).
  5. Video Optional: Video generation doubles costs - make it optional/on-demand.

Internal vs External Calls

Internal (Not Counted as External)

  • Preflight validation checks (/api/billing/preflight)
  • Task status polling (/api/story/task/{taskId}/status)
  • Project persistence (/api/podcast/projects/*)
  • Content asset library (/api/content-assets/*)

External (Counted)

  • Gemini LLM (story setup, script generation)
  • Google Grounding (research)
  • Exa (research alternative)
  • WaveSpeed → Minimax TTS (audio)
  • WaveSpeed → InfiniteTalk (video)

Conclusion

Key Findings:

  1. Fixed overhead: 3 external API calls per podcast (setup, research, script)
  2. Variable overhead: 1-2 calls per scene (audio, optionally video)
  3. Duration is the primary scaling factor for rendering calls
  4. Number of speakers does NOT affect API call count
  5. Video generation doubles rendering API calls

Recommendations:

  • Monitor API call counts and costs per podcast duration
  • Consider batching strategies for TTS calls if supported
  • Make video generation optional/on-demand to reduce costs
  • Optimize scene length to balance quality vs. API call count