Story Writer Video Generation Enhancement Plan
Current State Analysis
Current Video Generation
- Provider: HuggingFace (tencent/HunyuanVideo via fal-ai)
- Issues:
- Unreliable API responses
- Limited quality control
- No audio synchronization
- Single provider dependency
- Poor error handling
Current Audio Generation
- Provider: gTTS (Google Text-to-Speech)
- Limitations:
- Robotic, non-natural voice
- No brand voice consistency
- Limited language options
- No emotion control
- Cannot clone user's voice
Current Story Writer Workflow
- User creates story outline with scenes
- Each scene has audio_narration text
- Audio generated via gTTS per scene
- Video generated via HuggingFace per scene
- Videos compiled into final story video
Location: backend/api/story_writer/ and frontend/src/components/StoryWriter/
Proposed Enhancements
Core Principles
Provider Abstraction:
- Users should NOT see provider names (HuggingFace, WaveSpeed, etc.)
- All provider routing/switching happens automatically in the background
- Users only see user-friendly options like "Standard Quality" or "Premium Quality"
- System automatically selects best available provider based on user's subscription and credits
Preserve Existing Options:
- gTTS remains available as free fallback when credits run out
- HuggingFace remains available as fallback option
- All existing functionality preserved
- New features are additions, not replacements
Cost Transparency:
- All buttons show cost information in tooltips
- Users make informed decisions before generating
- No surprise costs
1. Provider-Agnostic Video Generation System
1.1 Smart Provider Routing
Backend Implementation (backend/services/llm_providers/main_video_generation.py):
from typing import Optional, Tuple


def ai_video_generate(
    prompt: str,
    quality: str = "standard",  # "standard" (480p), "high" (720p), "premium" (1080p)
    duration: int = 5,
    audio_file_path: Optional[str] = None,
    *,
    user_id: str,  # keyword-only: a required parameter cannot follow defaulted ones positionally
    **kwargs,
) -> bytes:
    """
    Unified video generation entry point.

    Automatically routes to the best available provider:
    - WaveSpeed WAN 2.5 (primary, if credits available)
    - HuggingFace (fallback, if WaveSpeed unavailable)

    Users never see provider names - only quality options.
    """
    # 1. Check user subscription and credits
    # 2. Select best available provider automatically
    # 3. Route to appropriate provider function
    # 4. Handle fallbacks transparently
    pass
def _select_video_provider(
    user_id: str,
    quality: str,
    pricing_service: PricingService,
) -> Tuple[str, str]:
    """
    Automatically select best video provider.

    Returns: (provider_name, model_name)

    Selection logic:
    1. Check user credits/subscription
    2. Prefer WaveSpeed if available and credits sufficient
    3. Fallback to HuggingFace if WaveSpeed unavailable
    4. Return error if no providers available
    """
    # Implementation details...
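The selection logic in the docstring above could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: `ProviderStatus` and `select_video_provider` are hypothetical names, and the availability and credit values would in practice come from PricingService and provider health checks. The WaveSpeed model names are taken from the pricing entries later in this plan.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ProviderStatus:
    """Hypothetical snapshot of what the router knows before choosing."""
    wavespeed_available: bool
    huggingface_available: bool
    credits_remaining: float  # dollars
    estimated_cost: float     # dollars, for the requested generation


# Model identifiers as listed in the pricing section of this plan.
WAVESPEED_MODELS = {
    "standard": "wan-2.5-480p",
    "high": "wan-2.5-720p",
    "premium": "wan-2.5-1080p",
}


def select_video_provider(quality: str, status: ProviderStatus) -> Tuple[str, str]:
    """Return (provider_name, model_name), preferring WaveSpeed."""
    if quality not in WAVESPEED_MODELS:
        raise ValueError(f"Unknown quality: {quality!r}")
    # Prefer WaveSpeed when it is up and the user can afford the request.
    if status.wavespeed_available and status.credits_remaining >= status.estimated_cost:
        return ("wavespeed", WAVESPEED_MODELS[quality])
    # Otherwise fall back to the existing HuggingFace model.
    if status.huggingface_available:
        return ("huggingface", "tencent/HunyuanVideo")
    raise RuntimeError("No video provider is currently available")
```

Keeping the decision in a pure function like this makes the routing easy to unit test without calling either provider.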
Key Features:
- Automatic provider selection (users don't choose)
- Seamless fallback between providers
- Quality-based options (Standard/High/Premium) instead of provider names
- Cost-aware routing (uses cheapest available option)
- Transparent error handling
Quality Mapping:
- Standard Quality (480p): $0.05/second - Uses WaveSpeed 480p or HuggingFace
- High Quality (720p): $0.10/second - Uses WaveSpeed 720p
- Premium Quality (1080p): $0.15/second - Uses WaveSpeed 1080p
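As a quick check of the per-second rates above, the per-scene cost is simply rate times duration. The sketch below assumes a hypothetical helper name, `video_cost_per_scene`; only the rates come from this plan.

```python
# Per-second rates from the quality mapping above (dollars).
RATE_PER_SECOND = {"standard": 0.05, "high": 0.10, "premium": 0.15}


def video_cost_per_scene(quality: str, duration_seconds: int) -> float:
    """Cost in dollars for one scene, rounded to cents to avoid float noise."""
    return round(RATE_PER_SECOND[quality] * duration_seconds, 2)


# A 5-second Standard scene costs $0.25, the figure used in the tooltips.
print(video_cost_per_scene("standard", 5))
```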
Cost Optimization:
- Default to Standard Quality (480p) for cost-effectiveness
- Allow upgrade to High/Premium for final export
- Pre-flight validation prevents waste
- Automatic fallback to free options when credits exhausted
2. Enhanced Audio Generation with Voice Cloning
2.1 User-Friendly Voice Selection
Key Principle: Users choose between "AI Clone Voice" or "Default Voice" (gTTS) - no provider names shown.
Backend Implementation (backend/services/story_writer/audio_generation_service.py):
class StoryAudioGenerationService:
    def generate_scene_audio(
        self,
        scene: Dict[str, Any],
        user_id: str,
        use_ai_voice: bool = False,  # User's choice: AI Clone or Default
        **kwargs,
    ) -> Dict[str, Any]:
        """
        Generate audio with automatic provider selection.

        If use_ai_voice=True:
        - Try persona voice clone (if trained)
        - Try Minimax voice clone (if credits available)
        - Fallback to gTTS if no credits

        If use_ai_voice=False:
        - Use gTTS (always free, always available)
        """
        if use_ai_voice:
            # Try AI voice options
            if self._has_persona_voice(user_id):
                return self._generate_with_persona_voice(scene, user_id)
            elif self._has_credits_for_voice_clone(user_id):
                return self._generate_with_minimax_voice_clone(scene, user_id)
            else:
                # Fallback to gTTS with notification
                logger.info(f"Credits exhausted, falling back to gTTS for user {user_id}")
                return self._generate_with_gtts(scene, **kwargs)
        else:
            # User explicitly chose default voice
            return self._generate_with_gtts(scene, **kwargs)
Voice Options in Story Setup:
- Default Voice (gTTS): Free, always available, robotic but functional
- AI Clone Voice: Natural, human-like, requires credits ($0.02/minute)
Cost Considerations:
- Voice training: One-time cost (~$0.75) - only if user wants to train custom voice
- Voice generation: ~$0.02 per minute (only when AI Clone Voice selected)
- gTTS: Always free, always available as fallback
- Automatic fallback to gTTS when credits exhausted (with user notification)
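The audio figures above translate into a small calculation. This is a sketch only: `narration_cost` is a hypothetical helper, and the two constants are the approximate prices quoted in this section.

```python
VOICE_CLONE_RATE_PER_MINUTE = 0.02  # approximate, from this plan
VOICE_TRAINING_ONE_TIME = 0.75      # approximate, from this plan


def narration_cost(total_seconds: float, use_ai_voice: bool,
                   needs_training: bool = False) -> float:
    """Estimated narration cost in dollars; the default gTTS voice is free."""
    if not use_ai_voice:
        return 0.0
    cost = (total_seconds / 60.0) * VOICE_CLONE_RATE_PER_MINUTE
    if needs_training:
        # Training is charged once, only when the user trains a custom voice.
        cost += VOICE_TRAINING_ONE_TIME
    return round(cost, 2)
```

For example, five minutes of AI Clone narration would come to about $0.10, or about $0.85 if the one-time training is included.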
3. Enhanced Story Setup UI
3.1 Video Generation Settings (Provider-Agnostic)
Location: frontend/src/components/StoryWriter/Phases/StorySetup/GenerationSettingsSection.tsx
User-Friendly Settings (No Provider Names):
interface VideoGenerationSettings {
// Quality selection (NOT provider selection)
videoQuality: 'standard' | 'high' | 'premium'; // Maps to 480p/720p/1080p
// Duration
videoDuration: 5 | 10; // seconds
// Cost estimation (shown in tooltip)
estimatedCostPerScene: number;
totalEstimatedCost: number;
// Provider routing happens automatically in backend
// Users never see "WaveSpeed" or "HuggingFace"
}
UI Components:
- Quality selector: "Standard" / "High" / "Premium" (with cost in tooltip)
- Duration selector: 5s (default) / 10s (premium)
- Cost tooltip: Shows estimated cost per scene and total
- Pre-flight validation warnings
- No provider selector - routing is automatic
Tooltip Example:
Standard Quality (480p)
├─ Cost: $0.25 per scene (5 seconds)
├─ Quality: Good for previews and testing
└─ Provider: Automatically selected based on credits
3.2 Audio Generation Settings (Simple Choice)
New Settings:
interface AudioGenerationSettings {
// Simple user choice - no provider names
voiceType: 'default' | 'ai_clone'; // "Default Voice" or "AI Clone Voice"
// Only shown if ai_clone selected
voiceTrainingStatus: 'not_trained' | 'training' | 'ready' | 'failed';
// Existing gTTS settings (preserved)
audioLang: string;
audioSlow: boolean;
audioRate: number;
}
UI Components:
- Voice Type Selector:
- "Default Voice (gTTS)" - Free, always available
- "AI Clone Voice" - Natural, $0.02/minute (with cost tooltip)
- Voice training section (only if AI Clone Voice selected)
- Existing gTTS settings (preserved for Default Voice)
- Cost per minute display in tooltip
Tooltip for "AI Clone Voice":
AI Clone Voice
├─ Cost: $0.02 per minute
├─ Quality: Natural, human-like narration
├─ Fallback: Automatically uses Default Voice if credits exhausted
└─ Training: One-time $0.75 to train your custom voice (optional)
Tooltip for "Default Voice":
Default Voice (gTTS)
├─ Cost: Free
├─ Quality: Standard text-to-speech
└─ Always Available: Works even when credits exhausted
4. New "Animate Scene" Feature in Outline Phase
4.1 Per-Scene Animation Preview
Location: frontend/src/components/StoryWriter/Phases/StoryOutline.tsx
Feature: Add "Animate Scene" hover option alongside existing scene actions
Implementation:
- Add to OutlineHoverActions component
- Appears on hover over scene cards
- Only generates for single scene (never bulk)
- Uses cheapest option (480p/Standard Quality) to give users a feel
- Shows cost in tooltip before generation
UI Component:
// In OutlineHoverActions.tsx
const sceneHoverActions = [
// Existing actions...
{
icon: <PlayArrowIcon />,
label: 'Animate Scene',
action: 'animate-scene',
tooltip: `Animate this scene with video\nCost: ~$0.25 (5 seconds, Standard Quality)\nPreview only - uses cheapest option`,
onClick: handleAnimateScene,
},
];
Backend Endpoint:
@router.post("/animate-scene-preview")
async def animate_scene_preview(
    request: SceneAnimationRequest,
    current_user: Dict[str, Any] = Depends(get_current_user),
) -> SceneAnimationResponse:
    """
    Generate preview animation for a single scene.

    Always uses cheapest option (480p/Standard Quality).
    Per-scene only - never bulk generation.
    """
    # 1. Validate single scene only
    # 2. Use Standard Quality (480p) - cheapest option
    # 3. Generate video with automatic provider routing
    # 4. Return preview video URL
    pass
Cost Management:
- Always uses Standard Quality (480p) - $0.25 per scene
- Pre-flight validation before generation
- Clear cost display in tooltip
- Per-scene only prevents bulk waste
5. New "Animate Story with VoiceOver" Button in Writing Phase
5.1 Complete Story Animation
Location: frontend/src/components/StoryWriter/Phases/StoryWriting.tsx
Feature: New button alongside existing HuggingFace video options
Implementation:
- Add button in Writing phase toolbar
- Generates complete animated story with synchronized voiceover
- Uses user's voice preference from Setup (AI Clone or Default)
- Shows comprehensive cost breakdown in tooltip
- Pre-flight validation before generation
UI Component:
<Button
variant="contained"
startIcon={<SmartDisplayIcon />}
onClick={handleAnimateStoryWithVoiceOver}
disabled={!state.storyContent || isGenerating}
title={`Animate Story with VoiceOver\n\nCost Breakdown:\n- Video: $${videoCost} (${scenes.length} scenes × $${costPerScene})\n- Audio: $${audioCost} (${totalAudioMinutes} minutes)\n- Total: $${totalCost}\n\nQuality: ${state.videoQuality}\nVoice: ${state.voiceType === 'ai_clone' ? 'AI Clone' : 'Default'}`}
>
Animate Story with VoiceOver
</Button>
Backend Endpoint:
@router.post("/animate-story-with-voiceover")
async def animate_story_with_voiceover(
    request: StoryAnimationRequest,
    current_user: Dict[str, Any] = Depends(get_current_user),
) -> StoryAnimationResponse:
    """
    Generate complete animated story with synchronized voiceover.

    Uses user's quality and voice preferences from Setup.
    """
    # 1. Pre-flight validation (cost, credits, limits)
    # 2. Generate audio for all scenes (using user's voice preference)
    # 3. Generate videos for all scenes (using user's quality preference)
    # 4. Synchronize audio with video
    # 5. Compile into final story video
    # 6. Return video URL and cost breakdown
    pass
Cost Tooltip Example:
Animate Story with VoiceOver
Cost Breakdown:
├─ Video (Standard Quality): $2.50
│ └─ 10 scenes × $0.25 per scene
├─ Audio (AI Clone Voice): $1.00
│ └─ 50 minutes total × $0.02/minute
└─ Total: $3.50
Settings:
├─ Quality: Standard (480p)
├─ Voice: AI Clone Voice
└─ Duration: 5 seconds per scene
⚠️ This will use $3.50 of your monthly credits
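The breakdown in this tooltip could be computed along the following lines. The function name `story_cost_breakdown` and its parameters are hypothetical; the rates are the ones quoted elsewhere in this plan.

```python
from typing import Dict

RATE_PER_SECOND = {"standard": 0.05, "high": 0.10, "premium": 0.15}
VOICE_CLONE_RATE_PER_MINUTE = 0.02


def story_cost_breakdown(num_scenes: int, quality: str, seconds_per_scene: int,
                         audio_minutes: float, use_ai_voice: bool) -> Dict[str, float]:
    """Return video, audio, and total cost in dollars for a whole story."""
    video = round(num_scenes * seconds_per_scene * RATE_PER_SECOND[quality], 2)
    # The default (gTTS) voice is free, so audio only costs with AI Clone.
    audio = round(audio_minutes * VOICE_CLONE_RATE_PER_MINUTE, 2) if use_ai_voice else 0.0
    return {"video": video, "audio": audio, "total": round(video + audio, 2)}


# Inputs from the tooltip example: 10 scenes, Standard, 5 s each, 50 audio minutes.
print(story_cost_breakdown(10, "standard", 5, 50, True))
```

With the tooltip's inputs this yields $2.50 for video, $1.00 for audio, and $3.50 in total.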
Implementation Phases
Phase 1: Provider-Agnostic Video System (Week 1-2)
Priority: HIGH - Solves immediate HuggingFace issues with provider abstraction
Tasks:
- ✅ Create WaveSpeed API client (backend/services/wavespeed/client.py)
- ✅ Add WAN 2.5 text-to-video function
- ✅ Implement smart provider routing in main_video_generation.py
- ✅ Add quality-based selection (Standard/High/Premium)
- ✅ Preserve HuggingFace as fallback option
- ✅ Update hd_video.py with provider routing
- ✅ Add pre-flight cost validation
- ✅ Update frontend with quality selector (remove provider names)
- ✅ Add cost tooltips to all buttons
- ✅ Update subscription limits
- ✅ Testing and error handling
Files to Modify:
- backend/services/llm_providers/main_video_generation.py (add routing logic)
- backend/api/story_writer/utils/hd_video.py (use quality-based API)
- backend/api/story_writer/routes/video_generation.py
- frontend/src/components/StoryWriter/Phases/StorySetup/GenerationSettingsSection.tsx (quality selector)
- frontend/src/components/StoryWriter/components/HdVideoSection.tsx
- backend/services/subscription/pricing_service.py
Success Criteria:
- Video generation works reliably with automatic provider routing
- Users see quality options, not provider names
- HuggingFace preserved as fallback
- Cost tracking accurate
- Pre-flight validation prevents waste
- Error messages clear and actionable
Phase 2: Voice Cloning Integration (Week 3-4)
Priority: MEDIUM - Enhances audio quality with simple user choice
Tasks:
- ✅ Create Minimax API client (backend/services/minimax/voice_clone.py)
- ✅ Add voice training endpoint
- ✅ Add voice generation endpoint
- ✅ Update audio_generation_service.py with "AI Clone" vs "Default" logic
- ✅ Preserve gTTS as always-available fallback
- ✅ Add automatic fallback when credits exhausted
- ✅ Update Story Setup with simple voice type selector
- ✅ Add cost tooltips to voice options
- ✅ Add voice preview and testing (if AI Clone selected)
- ✅ Ensure gTTS always works even when credits exhausted
Files to Create:
- backend/services/minimax/voice_clone.py
- backend/services/story_writer/voice_management_service.py
Files to Modify:
- backend/services/story_writer/audio_generation_service.py (add voice type logic)
- frontend/src/components/StoryWriter/Phases/StorySetup/GenerationSettingsSection.tsx (voice type selector)
- backend/models/story_models.py (add voice type field)
Success Criteria:
- Users see simple choice: "Default Voice" or "AI Clone Voice"
- gTTS always available as fallback
- Automatic fallback when credits exhausted
- Cost tracking accurate
- Voice quality significantly better than gTTS when AI Clone used
Phase 3: New Features - Animate Scene & Animate Story (Week 5-6)
Priority: MEDIUM - Add preview and complete animation features
Tasks:
- ✅ Add "Animate Scene" hover option in Outline phase
- ✅ Implement per-scene animation preview (cheapest option only)
- ✅ Add "Animate Story with VoiceOver" button in Writing phase
- ✅ Implement complete story animation with voiceover
- ✅ Add comprehensive cost tooltips to all buttons
- ✅ Add pre-flight validation for all animation features
- ✅ Ensure per-scene only (no bulk generation in Outline)
- ✅ Update documentation
- ✅ User testing and feedback
Files to Create:
- backend/api/story_writer/routes/scene_animation.py (new endpoint)
- frontend/src/components/StoryWriter/components/AnimateSceneButton.tsx
Files to Modify:
- frontend/src/components/StoryWriter/Phases/StoryOutlineParts/OutlineHoverActions.tsx (add Animate Scene)
- frontend/src/components/StoryWriter/Phases/StoryWriting.tsx (add Animate Story button)
- backend/api/story_writer/routes/video_generation.py (add story animation endpoint)
Success Criteria:
- "Animate Scene" works in Outline (per-scene, cheapest option)
- "Animate Story with VoiceOver" works in Writing phase
- All buttons show cost in tooltips
- Pre-flight validation prevents waste
- Good user experience
Phase 4: Integration & Optimization (Week 7-8)
Priority: MEDIUM - Polish and optimize
Tasks:
- ✅ Integrate audio with video (synchronized videos)
- ✅ Improve error handling and retry logic
- ✅ Add progress indicators
- ✅ Optimize cost calculations
- ✅ Add usage analytics
- ✅ Update documentation
- ✅ User testing and feedback
Success Criteria:
- Smooth end-to-end workflow
- Cost-effective for users
- Reliable generation
- Excellent user experience
- All features work seamlessly together
Cost Management & Prevention of Waste
Pre-Flight Validation
Implementation: backend/services/subscription/preflight_validator.py
Checks Before Generation:
- User has sufficient subscription tier
- Estimated cost within monthly budget
- Video generation limit not exceeded
- Audio generation limit not exceeded
- Total story cost reasonable (<$5 for typical story)
Validation Flow:
def validate_story_generation(
    pricing_service: PricingService,
    user_id: str,
    num_scenes: int,
    video_resolution: str,
    video_duration: int,
    use_voice_clone: bool,
) -> Tuple[bool, str, Dict[str, Any]]:
    """
    Pre-flight validation before story generation.

    Returns: (allowed, message, cost_breakdown)
    """
    # Calculate estimated costs
    video_cost_per_scene = get_wavespeed_cost(video_resolution, video_duration)
    audio_cost_per_scene = get_voice_clone_cost() if use_voice_clone else 0.0
    total_estimated_cost = (video_cost_per_scene + audio_cost_per_scene) * num_scenes

    # Check limits
    limits = pricing_service.get_user_limits(user_id)
    current_usage = pricing_service.get_current_usage(user_id)

    # Validation logic...
    return (allowed, message, cost_breakdown)
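One way the elided validation step might look is sketched below. It is an assumption-laden illustration: `check_preflight`, its parameters, and the message wording are all hypothetical, and the real checks would read limits and usage from PricingService.

```python
from typing import Any, Dict, Tuple


def check_preflight(estimated_cost: float, budget_remaining: float,
                    videos_used: int, video_limit: int,
                    num_scenes: int) -> Tuple[bool, str, Dict[str, Any]]:
    """Return (allowed, message, details) without starting any generation."""
    details = {
        "estimated_cost": round(estimated_cost, 2),
        "budget_remaining": round(budget_remaining, 2),
        "videos_requested": num_scenes,
        "videos_remaining": max(video_limit - videos_used, 0),
    }
    # Reject before any provider is called, so no credits are spent.
    if estimated_cost > budget_remaining:
        return (False, "Estimated cost exceeds remaining monthly budget", details)
    if videos_used + num_scenes > video_limit:
        return (False, "Video generation limit would be exceeded", details)
    return (True, "OK", details)
```

Returning the details alongside the verdict lets the frontend show the same numbers in its cost tooltip or upgrade prompt.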
Cost Estimation Display
Frontend Implementation:
- Real-time cost calculator in Story Setup
- Per-scene cost breakdown
- Total story cost estimate
- Monthly budget remaining
- Warning if approaching limits
UI Example:
Video Generation Cost Estimate:
├─ Resolution: 720p ($0.10/second)
├─ Duration: 5 seconds per scene
├─ Scenes: 10
└─ Total: $5.00
Audio Generation Cost Estimate:
├─ Provider: Voice Clone ($0.02/minute)
├─ Average: 30 seconds per scene
├─ Scenes: 10
└─ Total: $0.10 (10 scenes × 30 seconds = 5 minutes)
Total Estimated Cost: $5.10
Monthly Budget Remaining: $44.00
Usage Tracking
Enhanced Tracking:
- Track video generation per scene
- Track audio generation per scene
- Track total story cost
- Alert users approaching limits
- Provide cost breakdown in analytics
Pricing Integration
WaveSpeed WAN 2.5 Pricing
Add to pricing_service.py:
# WaveSpeed WAN 2.5 Text-to-Video
{
"provider": APIProvider.VIDEO, # Or new WAVESPEED provider
"model_name": "wan-2.5-480p",
"cost_per_second": 0.05,
"description": "WaveSpeed WAN 2.5 Text-to-Video (480p)"
},
{
"provider": APIProvider.VIDEO,
"model_name": "wan-2.5-720p",
"cost_per_second": 0.10,
"description": "WaveSpeed WAN 2.5 Text-to-Video (720p)"
},
{
"provider": APIProvider.VIDEO,
"model_name": "wan-2.5-1080p",
"cost_per_second": 0.15,
"description": "WaveSpeed WAN 2.5 Text-to-Video (1080p)"
}
Minimax Voice Clone Pricing
Add to pricing_service.py:
# Minimax Voice Clone
{
"provider": APIProvider.AUDIO, # New provider type
"model_name": "minimax-voice-clone-train",
"cost_per_request": 0.75, # One-time training cost
"description": "Minimax Voice Clone Training"
},
{
"provider": APIProvider.AUDIO,
"model_name": "minimax-voice-clone-generate",
"cost_per_minute": 0.02, # Per minute of generated audio
"description": "Minimax Voice Clone Generation"
}
Subscription Tier Limits
Update subscription limits:
- Free: 3 stories/month, 480p only, gTTS only
- Basic: 10 stories/month, up to 720p, voice clone available
- Pro: 50 stories/month, up to 1080p, voice clone included
- Enterprise: Unlimited, all features
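The tier limits above could be captured in a small lookup table. This is a sketch, not the existing subscription model: `TierLimits`, `TIERS`, and `is_allowed` are hypothetical names, and "all features" for Enterprise is read here as 1080p plus voice cloning.

```python
from dataclasses import dataclass
from typing import Optional

QUALITY_ORDER = ["standard", "high", "premium"]  # 480p, 720p, 1080p


@dataclass(frozen=True)
class TierLimits:
    stories_per_month: Optional[int]  # None means unlimited
    max_quality: str
    voice_clone: bool


TIERS = {
    "free": TierLimits(3, "standard", False),
    "basic": TierLimits(10, "high", True),
    "pro": TierLimits(50, "premium", True),
    "enterprise": TierLimits(None, "premium", True),
}


def is_allowed(tier: str, quality: str, use_ai_voice: bool,
               stories_this_month: int) -> bool:
    """Check a generation request against the user's tier limits."""
    limits = TIERS[tier]
    if limits.stories_per_month is not None and stories_this_month >= limits.stories_per_month:
        return False
    if QUALITY_ORDER.index(quality) > QUALITY_ORDER.index(limits.max_quality):
        return False
    if use_ai_voice and not limits.voice_clone:
        return False
    return True
```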
Technical Architecture
Backend Services
backend/services/
├── wavespeed/
│ ├── __init__.py
│ ├── client.py # WaveSpeed API client
│ ├── wan25_video.py # WAN 2.5 video generation
│ └── models.py # Request/response models
├── minimax/
│ ├── __init__.py
│ ├── client.py # Minimax API client
│ ├── voice_clone.py # Voice cloning service
│ └── models.py
└── story_writer/
├── audio_generation_service.py # Updated with voice clone
└── video_generation_service.py # Updated with WaveSpeed
Frontend Components
frontend/src/components/StoryWriter/
├── Phases/StorySetup/
│ └── GenerationSettingsSection.tsx # Enhanced with new settings
├── components/
│ ├── HdVideoSection.tsx # Updated for WaveSpeed
│ ├── VoiceTrainingSection.tsx # NEW: Voice training UI
│ └── CostEstimationDisplay.tsx # NEW: Cost calculator
└── hooks/
└── useStoryGenerationCost.ts # NEW: Cost calculation hook
Error Handling & User Experience
Error Scenarios
- WaveSpeed API Failure:
- Retry with exponential backoff (3 attempts)
- Fallback to HuggingFace if available
- Clear error message with cost refund notice
- Voice Clone Training Failure:
- Provide specific error (audio quality, length, format)
- Suggest improvements
- Allow retry with different audio
- Cost Limit Exceeded:
- Pre-flight validation prevents this
- Show upgrade prompt
- Suggest reducing scenes/resolution
- Audio/Video Mismatch:
- Validate audio length matches video duration
- Auto-trim or extend audio
- Warn user before generation
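The WaveSpeed failure scenario (three attempts with exponential backoff, then a HuggingFace fallback) might be sketched as below. `generate_with_fallback` and its parameters are hypothetical; the `primary` and `fallback` callables stand in for the real provider calls, and the injectable `sleep` exists only to keep tests fast.

```python
import time
from typing import Callable, Optional


def generate_with_fallback(primary: Callable[[], bytes],
                           fallback: Optional[Callable[[], bytes]] = None,
                           attempts: int = 3,
                           base_delay: float = 1.0,
                           sleep: Callable[[float], None] = time.sleep) -> bytes:
    """Retry the primary provider with exponential backoff, then fall back."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return primary()
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
            if attempt < attempts - 1:
                # Delays double each time: base_delay, 2x, 4x, ...
                sleep(base_delay * (2 ** attempt))
    if fallback is not None:
        return fallback()
    raise RuntimeError("Video generation failed after retries") from last_error
```

A caller would pass the WaveSpeed request as `primary` and the HuggingFace request as `fallback`, and surface the final `RuntimeError` as the clear error message described above.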
User Feedback
- Progress indicators for all operations
- Clear cost breakdowns
- Quality previews before final generation
- Regeneration options with cost tracking
- Usage analytics dashboard
Testing Plan
Unit Tests
- WaveSpeed API client
- Voice clone service
- Cost calculation
- Pre-flight validation
Integration Tests
- End-to-end story generation
- Audio + video synchronization
- Error handling and fallbacks
- Subscription limit enforcement
User Acceptance Tests
- Story generation workflow
- Voice training process
- Cost estimation accuracy
- Error recovery
Success Metrics
Technical Metrics
- Video generation success rate >95%
- Audio generation success rate >98%
- Average generation time per scene <30s
- API error rate <2%
Business Metrics
- User satisfaction with video quality
- Cost per story (target: <$5 for 10-scene story)
- Voice clone adoption rate
- Story completion rate
User Experience Metrics
- Time to generate story
- Error recovery time
- User understanding of costs
- Feature discovery rate
Provider Management Strategy
Always-Available Options
- gTTS: Always available, always free, works even when credits exhausted
- HuggingFace: Preserved as fallback option, works when WaveSpeed unavailable
Automatic Provider Routing
- Primary: WaveSpeed WAN 2.5 (when credits available)
- Fallback: HuggingFace (when WaveSpeed unavailable or credits exhausted)
- Audio Fallback: gTTS (always available, always free)
User Experience
- Users never see provider names
- System automatically selects best available option
- Seamless fallback when credits exhausted
- Clear notifications when fallback occurs
- No user intervention required
No Deprecation
- HuggingFace: Kept as permanent fallback option
- gTTS: Kept as permanent free option
- All existing functionality preserved
- New features are additions, not replacements
Next Steps
- Week 1: Set up WaveSpeed API access and credentials
- Week 1: Implement provider-agnostic routing system
- Week 2: Integrate into Story Writer with quality-based UI
- Week 3: Implement voice cloning with simple "AI Clone" vs "Default" choice
- Week 4: Add voice training UI (only if AI Clone selected)
- Week 5: Add "Animate Scene" hover option in Outline
- Week 6: Add "Animate Story with VoiceOver" button in Writing
- Week 7-8: Testing, optimization, and polish
Key Design Principles
- Provider Abstraction: Users never see provider names - only quality/voice options
- Preserve Existing: gTTS and HuggingFace remain available as fallbacks
- Cost Transparency: All buttons show costs in tooltips
- Automatic Fallback: System automatically uses free options when credits exhausted
- Per-Scene Only: Outline phase only allows per-scene generation (no bulk)
- User-Friendly: Simple choices like "Standard Quality" not "WaveSpeed 480p"
Risk Mitigation
| Risk | Mitigation |
|---|---|
| WaveSpeed API changes | Version pinning, abstraction layer |
| Cost overruns | Strict pre-flight validation |
| Voice quality issues | Quality checks, fallback options |
| User confusion | Clear UI, tooltips, documentation |
| Integration complexity | Phased rollout, extensive testing |
Document Version: 1.0
Last Updated: January 2025
Priority: HIGH - Immediate Implementation