Podcast Maker Best Practices

This guide is implementation-aware: every recommendation below is based on how the current Podcast Maker APIs actually behave in frontend and backend code.

1) Start with budget-safe defaults (preflight-first workflow)

Podcast Maker runs preflight validation before major steps (analysis, research, script generation, TTS preview, and full TTS render). Use that as your workflow guardrail:

  1. Analyze idea first
  2. Approve a small set of research queries
  3. Generate script
  4. Preview voice on short excerpts
  5. Render full scene audio
  6. Generate scene videos
  7. Combine final assets

Why this matters:

  • If credits/limits are insufficient, preflight fails fast before expensive operations.
  • Video generation also runs server-side animation validation and returns subscription-friendly errors for insufficient credits.
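
The fail-fast behavior described above can be sketched as a simple gate around each step. This is a minimal illustration, not the actual Podcast Maker client: `PreflightResult`, `run_pipeline`, and both callables are hypothetical stand-ins for the real API calls, and the sketch applies a check to every step for simplicity.

```python
from dataclasses import dataclass
from typing import Callable

# Step names follow the workflow order listed in this section.
PIPELINE_STEPS = [
    "analysis",
    "research",
    "script",
    "tts_preview",
    "tts_render",
    "video",
    "combine",
]


@dataclass
class PreflightResult:
    """Hypothetical shape of a preflight response."""
    allowed: bool
    reason: str = ""


def run_pipeline(
    preflight: Callable[[str], PreflightResult],
    execute: Callable[[str], None],
) -> list[str]:
    """Run steps in order and stop at the first failed preflight check.

    Returns the names of the steps that actually ran.
    """
    completed: list[str] = []
    for step in PIPELINE_STEPS:
        result = preflight(step)
        if not result.allowed:
            # Fail fast: skip this step and everything after it.
            print(f"Preflight blocked '{step}': {result.reason}")
            break
        execute(step)
        completed.append(step)
    return completed
```

Because later steps never start once a check fails, credits are not spent on audio or video for a script that cannot be finished.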

2) Duration vs. scene-count tradeoffs (cost + reliability)

The stack defaults to a 45-second scene target, and the cost-estimate logic scales scene count approximately as:

  • scene_count ≈ ceil(duration_minutes * 60 / scene_length_target_seconds)

Practical recommendations:

  • 5–8 min episodes: target 5–8 scenes.
  • 10–15 min episodes: target 8–14 scenes.
  • Increase scene_length_target when you need fewer API calls and faster completion.
  • Keep script concise because per-scene TTS has a 10,000-character max (long text gets truncated by frontend before render).

Rule of thumb:

  • More scenes = better pacing granularity but more TTS/video calls.
  • Fewer scenes = cheaper/faster pipeline, but each scene must carry more narrative weight.
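
The scene-count estimate from this section can be checked with a few lines of arithmetic. This is a sketch of the stated formula only; the function name and the input validation are illustrative assumptions, not the backend's cost-estimate code.

```python
import math

DEFAULT_SCENE_LENGTH_SECONDS = 45  # default scene target stated in this guide


def estimate_scene_count(
    duration_minutes: float,
    scene_length_target_seconds: float = DEFAULT_SCENE_LENGTH_SECONDS,
) -> int:
    """scene_count ≈ ceil(duration_minutes * 60 / scene_length_target_seconds)."""
    if duration_minutes <= 0 or scene_length_target_seconds <= 0:
        raise ValueError("duration and scene length must be positive")
    return math.ceil(duration_minutes * 60 / scene_length_target_seconds)


print(estimate_scene_count(8))       # 480 / 45 = 10.67, rounds up to 11
print(estimate_scene_count(8, 60))   # 480 / 60 = 8
print(estimate_scene_count(15, 75))  # 900 / 75 = 12
```

At the 45-second default, an 8-minute episode yields 11 scenes, so hitting a lower scene count means raising the scene length target rather than relying on the default.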

3) Voice strategy: preview first, render second

Use a two-pass voice workflow:

Pass A: Preview and lock voice profile

Use preview on short, representative lines (intro, data-heavy line, CTA) to validate:

  • voice identity
  • speed
  • emotion
  • pronunciation behavior (especially numbers/statistics)

Pass B: Full scene render with tuned knobs

When rendering scene audio, adjust only the knobs that matter:

  • voice_id (or custom_voice_id for cloned voice)
  • speed (default 1.0 is usually safest for timing)
  • emotion (scene-level emotion is supported)
  • english_normalization (keep enabled for number-heavy scripts)
  • audio format controls (sample_rate, bitrate, channel, format, language_boost) only when distribution requires them

Also note:

  • The frontend injects pause markers and strips markdown before TTS for better natural rhythm.
  • Use short lines; 2–4 per scene is a good operational target, per the script generation guidance.
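
One way to keep render settings disciplined is to build them through a single helper that applies the safe defaults and only includes audio format controls when they are passed explicitly. The knob names come from this guide; the function name, the dictionary shape, and the choice to raise on over-long text (rather than truncate) are assumptions made for illustration.

```python
MAX_TTS_CHARS = 10_000  # per-scene TTS limit stated in this guide
AUDIO_FORMAT_KEYS = {"sample_rate", "bitrate", "channel", "format", "language_boost"}


def build_tts_options(
    text,
    *,
    voice_id=None,
    custom_voice_id=None,
    speed=1.0,
    emotion="neutral",
    english_normalization=True,
    **audio_format,
):
    """Assemble a hypothetical per-scene TTS options dict."""
    if len(text) > MAX_TTS_CHARS:
        # Surface the problem early so the scene can be split by hand.
        raise ValueError(f"scene text is {len(text)} chars; max is {MAX_TTS_CHARS}")
    if not (voice_id or custom_voice_id):
        raise ValueError("provide voice_id or custom_voice_id")
    unknown = set(audio_format) - AUDIO_FORMAT_KEYS
    if unknown:
        raise ValueError(f"unknown audio format options: {sorted(unknown)}")

    options = {
        "text": text,
        "speed": speed,
        "emotion": emotion,
        "english_normalization": english_normalization,
    }
    # In this sketch a cloned voice takes precedence over a stock voice.
    if custom_voice_id:
        options["custom_voice_id"] = custom_voice_id
    else:
        options["voice_id"] = voice_id
    options.update(audio_format)  # only the controls that were passed
    return options
```

Calling `build_tts_options("Welcome back.", voice_id="host_a")` returns only the core knobs; passing `sample_rate=44100` adds that single control without touching the rest.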

4) Research quality: when to use Exa config options

Use Exa config knobs intentionally, not by default.

Search type

  • auto: default for most projects.
  • keyword: use when topic vocabulary is stable/specific.
  • neural: use when you need semantic discovery across mixed phrasing.

Domain filters

Use either include or exclude domains (not both).

  • Prefer exa_include_domains for compliance/brand-safe sourcing.
  • Use exa_exclude_domains to remove noisy/untrusted sources.

If both are sent, frontend and backend sanitization keeps the include-domain list and drops the conflicting exclude-domain list.
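
The include-wins rule can be mirrored on the client before a request is sent. The sketch below is an approximation of that behavior under the assumption that both filters are plain lists of domain strings; `sanitize_domain_filters` is a hypothetical helper, not the project's sanitizer.

```python
def sanitize_domain_filters(config: dict) -> dict:
    """Return a copy of config that carries at most one domain filter type."""
    cleaned = dict(config)
    include = [d.strip() for d in cleaned.pop("exa_include_domains", None) or [] if d.strip()]
    exclude = [d.strip() for d in cleaned.pop("exa_exclude_domains", None) or [] if d.strip()]

    if include:
        # Include intent wins; the exclude list is dropped.
        cleaned["exa_include_domains"] = include
    elif exclude:
        cleaned["exa_exclude_domains"] = exclude
    return cleaned
```

Sanitizing locally makes the dropped filter visible in your own code instead of leaving it as a silent server-side adjustment.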

max_sources, category, and freshness

  • Increase max_sources only when synthesis quality is poor at default depth.
  • Use date_range (e.g. last month/quarter/year) for trend-sensitive topics.
  • Turn on statistics-oriented options when the episode needs hard numbers.

Query operations

  • Approve only the strongest queries before running research.
  • Empty query sets are rejected server-side.
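
Since empty query sets are rejected server-side, it is cheap to catch the same condition before calling research. A minimal sketch, assuming each candidate query is a dict with hypothetical `query` and `approved` fields:

```python
def select_approved_queries(queries: list[dict]) -> list[str]:
    """Return approved, non-empty, de-duplicated query strings.

    Raises ValueError when nothing usable is left, which is the case
    the server would reject.
    """
    approved: list[str] = []
    for item in queries:
        text = (item.get("query") or "").strip()
        if item.get("approved") and text and text not in approved:
            approved.append(text)
    if not approved:
        raise ValueError("approve at least one non-empty query before running research")
    return approved
```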

5) Avatar + image prompt strategy for visual consistency

Consistency is strongest when you anchor scene images to a persistent base avatar.

Recommended approach:

  1. Create/upload a presenter avatar once per project.
  2. Reuse that avatar as base_avatar_url for scene images.
  3. Keep one shared style nucleus across prompts (lighting, environment, host look, framing).
  4. Change only scene-specific context (topic, emotion, supporting visual motif).

Important implementation notes:

  • If base_avatar_url is provided, image generation uses character-consistency flow; if the base avatar cannot be loaded, image generation fails (no silent fallback).
  • Keep scene emotion aligned to visual lighting cues for continuity.
  • For presenter generation, keep the speaker count realistic (supported range is 1–2).
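
The "shared style nucleus plus scene-specific context" approach can be expressed as a small prompt builder. Everything here other than `base_avatar_url` is illustrative: `StyleNucleus`, `build_scene_image_request`, and the request shape are assumptions for the sketch, not the image API's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StyleNucleus:
    """Visual traits that stay fixed for every scene in a project."""
    lighting: str
    environment: str
    host_look: str
    framing: str

    def as_prompt(self) -> str:
        return ", ".join([self.host_look, self.environment, self.lighting, self.framing])


def build_scene_image_request(
    style: StyleNucleus,
    base_avatar_url: str,
    topic: str,
    emotion: str,
    motif: str,
) -> dict:
    """Combine the fixed style with the parts that change per scene."""
    if not base_avatar_url:
        # There is no silent fallback, so refuse to build the request.
        raise ValueError("base_avatar_url is required for character-consistent images")
    scene_part = f"topic: {topic}, emotion: {emotion}, visual motif: {motif}"
    return {
        "base_avatar_url": base_avatar_url,
        "prompt": f"{style.as_prompt()}. {scene_part}",
    }
```

Keeping the nucleus frozen means a scene can only vary in topic, emotion, and motif, which is the variation this section recommends.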

6) Script and scene structure that survives production

Generate script with full context:

  • analysis (audience/type/keywords)
  • selected outline
  • research payload
  • bible/persona context

Then enforce editorial constraints before render:

  • Remove filler and repeated lines.
  • Ensure each scene has a single narrative job.
  • Keep line lengths short enough for natural TTS breathing.
  • Verify emotion tag is valid (neutral, happy, excited, serious, curious, confident) to avoid fallback normalization.
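
A pre-render check for emotion tags might look like the following. The valid set is taken from the list above; the scene shape (`id` and `emotion` keys) and the exact-match comparison are assumptions, and the sketch reports problems instead of guessing what the fallback would pick.

```python
VALID_EMOTIONS = frozenset(
    {"neutral", "happy", "excited", "serious", "curious", "confident"}
)


def find_invalid_emotions(scenes: list[dict]) -> dict:
    """Map scene id -> offending emotion value for every invalid tag."""
    problems = {}
    for scene in scenes:
        emotion = scene.get("emotion")
        if not isinstance(emotion, str) or emotion not in VALID_EMOTIONS:
            problems[scene["id"]] = emotion
    return problems
```

An empty result means every scene carries a supported tag and no normalization should be needed at render time.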

7) Project save/resume + asset-library workflows

Treat a podcast as a resumable production artifact.

Save/resume

  • Persist state to project APIs throughout the workflow (analysis, research, script, render jobs, knobs, final video URL).
  • Use project list filtering/sorting to resume active work quickly.
  • Handle duplicate-idea conflicts by reopening existing project IDs instead of cloning work.

Asset library workflow

  • Save generated and uploaded assets (audio/avatar/images) into the content asset library with project metadata.
  • Use consistent tags (podcast, project id, scene id) so assets are searchable and reusable.
  • Reuse previously approved host avatars and voice samples across episodes to reduce generation churn.
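
Consistent tagging is easier to enforce when tags are generated rather than typed. The `project:` and `scene:` prefixes below are a hypothetical convention chosen for this sketch; the asset library does not prescribe a tag format in this guide.

```python
from typing import Optional


def build_asset_tags(project_id: str, scene_id: Optional[str] = None) -> list[str]:
    """Return the tag list for a podcast asset.

    Project-level assets (for example a host avatar) omit the scene tag.
    """
    tags = ["podcast", f"project:{project_id}"]
    if scene_id is not None:
        tags.append(f"scene:{scene_id}")
    return tags
```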

8) Video and dubbing execution strategy

Video

  • Pass only a supported video resolution (480p or 720p).
  • Poll task status (video generation is asynchronous and can take up to ~10 minutes).
  • Use a mask image only when you need a controlled motion region.
  • Generate all scene videos before starting combine to avoid failed final assembly.
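
Polling an asynchronous video task is usually a loop with a deadline. The sketch below assumes a caller-supplied `get_status` function and hypothetical status strings (`"completed"`, `"failed"`); the 600-second default reflects the roughly 10-minute upper bound mentioned above.

```python
import time
from typing import Callable


def wait_for_video_task(
    get_status: Callable[[], str],
    *,
    timeout_seconds: float = 600.0,
    poll_interval_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the task reaches a terminal status or the deadline passes."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in ("completed", "failed"):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"video task still '{status}' after {timeout_seconds}s")
        sleep(poll_interval_seconds)
```

Running this once per scene and starting the combine step only when every call returns `"completed"` follows the ordering recommended in this section.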

Dubbing

  • Use quality=low for fast/cheap exploration.
  • Use quality=high + use_voice_clone=true when voice identity matters.
  • Keep speed within 0.5–2.0 and voice clone accuracy within 0.1–1.0.
  • For voice cloning, feed a clean 10–60s sample for best identity retention.
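
The dubbing ranges can be checked before a request is submitted. This sketch reads the limits as speed 0.5 to 2.0, clone accuracy 0.1 to 1.0, and sample length 10 to 60 seconds; the parameter names `voice_clone_accuracy` and `sample_seconds` are assumptions, and the function only reports problems rather than correcting values.

```python
from typing import Optional


def validate_dubbing_settings(
    *,
    speed: float = 1.0,
    use_voice_clone: bool = False,
    voice_clone_accuracy: Optional[float] = None,
    sample_seconds: Optional[float] = None,
) -> list[str]:
    """Return a list of human-readable problems; empty means OK."""
    problems: list[str] = []
    if not 0.5 <= speed <= 2.0:
        problems.append(f"speed {speed} is outside 0.5-2.0")
    if use_voice_clone:
        # Clone-specific checks only apply when cloning is requested.
        if voice_clone_accuracy is not None and not 0.1 <= voice_clone_accuracy <= 1.0:
            problems.append(f"voice clone accuracy {voice_clone_accuracy} is outside 0.1-1.0")
        if sample_seconds is not None and not 10 <= sample_seconds <= 60:
            problems.append(f"voice sample of {sample_seconds}s is outside 10-60s")
    return problems
```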

Common failure modes and fixes

For broader platform issues, see the main Troubleshooting Guide.

| Failure mode | Why it happens | Fix |
| --- | --- | --- |
| Preflight blocked (analysis/research/script/TTS/video) | Insufficient credits or operation limits | Run lighter settings first: fewer scenes, lower duration, fewer research queries; then retry. |
| Research request rejected | No approved queries selected | Approve at least one non-empty query before running Exa research. |
| Research config mismatch | Include + exclude domains both supplied | Use only one domain filter type per run. |
| Scene audio cuts off | Scene text exceeded TTS max characters | Reduce scene length/lines; split long scene into two scenes. |
| Avatar-consistent image generation fails | base_avatar_url is broken/inaccessible | Re-upload avatar or switch to a valid project image URL; retry scene generation. |
| Video task fails quickly | Invalid media URL, unsupported resolution, missing assets | Verify audio/image URLs are valid and use only 480p/720p. |
| Final combine video fails | One or more scene video files missing/invalid | Confirm every scene has a completed video task before combine. |
| Dubbing quality sounds robotic | Low quality mode or weak source audio | Switch to high quality and/or use voice cloning with a cleaner sample. |
| Voice clone results are unstable | Poor sample or extreme accuracy/speed settings | Use clean 10–60s sample; keep accuracy near default and speed near 1.0. |
| Save appears inconsistent across sessions | Save failed and only partial local fallback exists | Trigger explicit save after each major step and verify project reload from API. |