7.9 KiB
Podcast Maker Best Practices
This guide is implementation-aware: every recommendation below is based on how the current Podcast Maker APIs actually behave in frontend and backend code.
1) Start with budget-safe defaults (preflight-first workflow)
Podcast Maker runs preflight validation before major steps (analysis, research, script generation, TTS preview, and full TTS render). Use that as your workflow guardrail:
- Analyze idea first
- Approve a small set of research queries
- Generate script
- Preview voice on short excerpts
- Render full scene audio
- Generate scene videos
- Combine final assets
Why this matters:
- If credits/limits are insufficient, preflight fails fast before expensive operations.
- Video generation also runs server-side animation validation and returns subscription-friendly errors for insufficient credits.
2) Duration vs. scene-count tradeoffs (cost + reliability)
The stack defaults to a 45s scene target and cost estimate logic effectively scales scene count as:
scene_count ≈ ceil(duration_minutes * 60 / scene_length_target_seconds)
Practical recommendations:
- 5–8 min episodes: target 5–8 scenes.
- 10–15 min episodes: target 8–14 scenes.
- Increase
scene_length_targetwhen you need fewer API calls and faster completion. - Keep script concise because per-scene TTS has a 10,000-character max (long text gets truncated by frontend before render).
Rule of thumb:
- More scenes = better pacing granularity but more TTS/video calls.
- Fewer scenes = cheaper/faster pipeline, but each scene must carry more narrative weight.
3) Voice strategy: preview first, render second
Use a two-pass voice workflow:
Pass A: Preview and lock voice profile
Use preview on short, representative lines (intro, data-heavy line, CTA) to validate:
- voice identity
- speed
- emotion
- pronunciation behavior (especially numbers/statistics)
Pass B: Full scene render with tuned knobs
When rendering scene audio, adjust only the knobs that matter:
voice_id(orcustom_voice_idfor cloned voice)speed(default 1.0 is usually safest for timing)emotion(scene-level emotion is supported)english_normalization(keep enabled for number-heavy scripts)- audio format controls (
sample_rate,bitrate,channel,format,language_boost) only when distribution requires them
Also note:
- The frontend injects pause markers and strips markdown before TTS for better natural rhythm.
- Use short lines (2–4 per scene is a good operational target from script generation guidance).
4) Research quality: when to use Exa config options
Use Exa config knobs intentionally, not by default.
Search type
auto: default for most projects.keyword: use when topic vocabulary is stable/specific.neural: use when you need semantic discovery across mixed phrasing.
Domain filters
Use either include or exclude domains (not both).
- Prefer
exa_include_domainsfor compliance/brand-safe sourcing. - Use
exa_exclude_domainsto remove noisy/untrusted sources.
If both are sent, the backend/frontend sanitize behavior will prefer include-domain intent and drop the conflicting side.
max_sources, category, and freshness
- Increase
max_sourcesonly when synthesis quality is poor at default depth. - Use
date_range(e.g. last month/quarter/year) for trend-sensitive topics. - Turn on statistics-oriented options when the episode needs hard numbers.
Query operations
- Always approve only the strongest queries before running research.
- Empty query sets are rejected server-side.
5) Avatar + image prompt strategy for visual consistency
Consistency is strongest when you anchor scene images to a persistent base avatar.
Recommended approach:
- Create/upload a presenter avatar once per project.
- Reuse that avatar as
base_avatar_urlfor scene images. - Keep one shared style nucleus across prompts (lighting, environment, host look, framing).
- Change only scene-specific context (topic, emotion, supporting visual motif).
Important implementation notes:
- If
base_avatar_urlis provided, image generation uses character-consistency flow; if the base avatar cannot be loaded, image generation fails (no silent fallback). - Keep scene emotion aligned to visual lighting cues for continuity.
- For presenter generation, keep speakers realistic (supported range is 1–2).
6) Script and scene structure that survives production
Generate script with full context:
- analysis (audience/type/keywords)
- selected outline
- research payload
- bible/persona context
Then enforce editorial constraints before render:
- Remove filler and repeated lines.
- Ensure each scene has a single narrative job.
- Keep line lengths short enough for natural TTS breathing.
- Verify emotion tag is valid (
neutral,happy,excited,serious,curious,confident) to avoid fallback normalization.
7) Project save/resume + asset-library workflows
Treat a podcast as a resumable production artifact.
Save/resume
- Persist state to project APIs throughout the workflow (analysis, research, script, render jobs, knobs, final video URL).
- Use project list filtering/sorting to resume active work quickly.
- Handle duplicate-idea conflicts by reopening existing project IDs instead of cloning work.
Asset library workflow
- Save generated and uploaded assets (audio/avatar/images) into the content asset library with project metadata.
- Use consistent tags (
podcast, project id, scene id) so assets are searchable and reusable. - Reuse previously approved host avatars and voice samples across episodes to reduce generation churn.
8) Video and dubbing execution strategy
Video
- Only pass supported video resolution (
480por720p). - Poll task status (video generation is asynchronous and can take up to ~10 minutes).
- Use mask image only when you need controlled motion region.
- Generate all scene videos before starting combine to avoid failed final assembly.
Dubbing
- Use
quality=lowfor fast/cheap exploration. - Use
quality=high+use_voice_clone=truewhen voice identity matters. - Keep
speedin 0.5–2.0 and voice clone accuracy in 0.1–1.0. - For voice cloning, feed a clean 10–60s sample for best identity retention.
Common failure modes and fixes
For broader platform issues, see the main Troubleshooting Guide.
| Failure mode | Why it happens | Fix |
|---|---|---|
| Preflight blocked (analysis/research/script/TTS/video) | Insufficient credits or operation limits | Run lighter settings first: fewer scenes, lower duration, fewer research queries; then retry. |
| Research request rejected | No approved queries selected | Approve at least one non-empty query before running Exa research. |
| Research config mismatch | Include + exclude domains both supplied | Use only one domain filter type per run. |
| Scene audio cuts off | Scene text exceeded TTS max characters | Reduce scene length/lines; split long scene into two scenes. |
| Avatar-consistent image generation fails | base_avatar_url is broken/inaccessible |
Re-upload avatar or switch to a valid project image URL; retry scene generation. |
| Video task fails quickly | Invalid media URL, unsupported resolution, missing assets | Verify audio/image URLs are valid and use only 480p/720p. |
| Final combine video fails | One or more scene video files missing/invalid | Confirm every scene has a completed video task before combine. |
| Dubbing quality sounds robotic | Low quality mode or weak source audio | Switch to high quality and/or use voice cloning with a cleaner sample. |
| Voice clone results are unstable | Poor sample or extreme accuracy/speed settings | Use clean 10–60s sample; keep accuracy near default and speed near 1.0. |
| Save appears inconsistent across sessions | Save failed and only partial local fallback exists | Trigger explicit save after each major step and verify project reload from API. |