# Story Writer Video Generation Enhancement Plan

---

## Current State Analysis

### Current Video Generation
- **Provider**: HuggingFace (tencent/HunyuanVideo via fal-ai)
- **Issues**:
  - Unreliable API responses
  - Limited quality control
  - No audio synchronization
  - Single provider dependency
  - Poor error handling

### Current Audio Generation
- **Provider**: gTTS (Google Text-to-Speech)
- **Limitations**:
  - Robotic, unnatural voice
  - No brand voice consistency
  - Limited language options
  - No emotion control
  - Cannot clone user's voice

### Current Story Writer Workflow
1. User creates story outline with scenes
2. Each scene has `audio_narration` text
3. Audio generated via gTTS per scene
4. Video generated via HuggingFace per scene
5. Videos compiled into final story video

**Location**: `backend/api/story_writer/` and `frontend/src/components/StoryWriter/`

---

## Proposed Enhancements

### Core Principles

**Provider Abstraction**:
- Users should NOT see provider names (HuggingFace, WaveSpeed, etc.)
- All provider routing/switching happens automatically in the background
- Users only see user-friendly options like "Standard Quality" or "Premium Quality"
- System automatically selects the best available provider based on the user's subscription and credits

**Preserve Existing Options**:
- gTTS remains available as a free fallback when credits run out
- HuggingFace remains available as a fallback option
- All existing functionality preserved
- New features are additions, not replacements

**Cost Transparency**:
- All buttons show cost information in tooltips
- Users make informed decisions before generating
- No surprise costs

---

### 1. Provider-Agnostic Video Generation System

#### 1.1 Smart Provider Routing

**Backend Implementation** (`backend/services/llm_providers/main_video_generation.py`):

```python
from typing import Optional, Tuple


def ai_video_generate(
    prompt: str,
    quality: str = "standard",  # "standard" (480p), "high" (720p), "premium" (1080p)
    duration: int = 5,
    audio_file_path: Optional[str] = None,
    *,
    user_id: str,  # keyword-only: a required parameter cannot follow defaulted ones
    **kwargs,
) -> bytes:
    """
    Unified video generation entry point.

    Automatically routes to best available provider:
    - WaveSpeed WAN 2.5 (primary, if credits available)
    - HuggingFace (fallback, if WaveSpeed unavailable)

    Users never see provider names - only quality options.
    """
    # 1. Check user subscription and credits
    # 2. Select best available provider automatically
    # 3. Route to appropriate provider function
    # 4. Handle fallbacks transparently
    pass


def _select_video_provider(
    user_id: str,
    quality: str,
    pricing_service: "PricingService",
) -> Tuple[str, str]:
    """
    Automatically select best video provider.

    Returns: (provider_name, model_name)

    Selection logic:
    1. Check user credits/subscription
    2. Prefer WaveSpeed if available and credits sufficient
    3. Fall back to HuggingFace if WaveSpeed unavailable
    4. Return error if no providers available
    """
    # Implementation details...
```

**Key Features**:
- Automatic provider selection (users don't choose)
- Seamless fallback between providers
- Quality-based options (Standard/High/Premium) instead of provider names
- Cost-aware routing (uses cheapest available option)
- Transparent error handling

**Quality Mapping**:
- **Standard Quality** (480p): $0.05/second - Uses WaveSpeed 480p or HuggingFace
- **High Quality** (720p): $0.10/second - Uses WaveSpeed 720p
- **Premium Quality** (1080p): $0.15/second - Uses WaveSpeed 1080p

**Cost Optimization**:
- Default to Standard Quality (480p) for cost-effectiveness
- Allow upgrade to High/Premium for final export
- Pre-flight validation prevents waste
- Automatic fallback to free options when credits exhausted

---

### 2. Enhanced Audio Generation with Voice Cloning

#### 2.1 User-Friendly Voice Selection

**Key Principle**: Users choose between "AI Clone Voice" or "Default Voice" (gTTS) - no provider names shown.

**Backend Implementation** (`backend/services/story_writer/audio_generation_service.py`):

```python
from typing import Any, Dict


class StoryAudioGenerationService:
    def generate_scene_audio(
        self,
        scene: Dict[str, Any],
        user_id: str,
        use_ai_voice: bool = False,  # User's choice: AI Clone or Default
        **kwargs,
    ) -> Dict[str, Any]:
        """
        Generate audio with automatic provider selection.

        If use_ai_voice=True:
        - Try persona voice clone (if trained)
        - Try Minimax voice clone (if credits available)
        - Fall back to gTTS if no credits

        If use_ai_voice=False:
        - Use gTTS (always free, always available)
        """
        if use_ai_voice:
            # Try AI voice options
            if self._has_persona_voice(user_id):
                return self._generate_with_persona_voice(scene, user_id)
            elif self._has_credits_for_voice_clone(user_id):
                return self._generate_with_minimax_voice_clone(scene, user_id)
            else:
                # Fall back to gTTS with notification
                logger.info(f"Credits exhausted, falling back to gTTS for user {user_id}")
                return self._generate_with_gtts(scene, **kwargs)
        else:
            # User explicitly chose default voice
            return self._generate_with_gtts(scene, **kwargs)
```

**Voice Options in Story Setup**:
- **Default Voice (gTTS)**: Free, always available, robotic but functional
- **AI Clone Voice**: Natural, human-like, requires credits ($0.02/minute)

**Cost Considerations**:
- Voice training: One-time cost (~$0.75) - only if user wants to train a custom voice
- Voice generation: ~$0.02 per minute (only when AI Clone Voice selected)
- gTTS: Always free, always available as fallback
- Automatic fallback to gTTS when credits exhausted (with user notification)

---

### 3. Enhanced Story Setup UI

#### 3.1 Video Generation Settings (Provider-Agnostic)

**Location**: `frontend/src/components/StoryWriter/Phases/StorySetup/GenerationSettingsSection.tsx`

**User-Friendly Settings** (No Provider Names):

```typescript
interface VideoGenerationSettings {
  // Quality selection (NOT provider selection)
  videoQuality: 'standard' | 'high' | 'premium'; // Maps to 480p/720p/1080p

  // Duration
  videoDuration: 5 | 10; // seconds

  // Cost estimation (shown in tooltip)
  estimatedCostPerScene: number;
  totalEstimatedCost: number;

  // Provider routing happens automatically in backend
  // Users never see "WaveSpeed" or "HuggingFace"
}
```

**UI Components**:
- Quality selector: "Standard" / "High" / "Premium" (with cost in tooltip)
- Duration selector: 5s (default) / 10s (premium)
- Cost tooltip: Shows estimated cost per scene and total
- Pre-flight validation warnings
- **No provider selector** - routing is automatic

**Tooltip Example**:

```
Standard Quality (480p)
├─ Cost: $0.25 per scene (5 seconds)
├─ Quality: Good for previews and testing
└─ Provider: Automatically selected based on credits
```

#### 3.2 Audio Generation Settings (Simple Choice)

**New Settings**:

```typescript
interface AudioGenerationSettings {
  // Simple user choice - no provider names
  voiceType: 'default' | 'ai_clone'; // "Default Voice" or "AI Clone Voice"

  // Only shown if ai_clone selected
  voiceTrainingStatus: 'not_trained' | 'training' | 'ready' | 'failed';

  // Existing gTTS settings (preserved)
  audioLang: string;
  audioSlow: boolean;
  audioRate: number;
}
```

**UI Components**:
- **Voice Type Selector**:
  - "Default Voice (gTTS)" - Free, always available
  - "AI Clone Voice" - Natural, $0.02/minute (with cost tooltip)
- Voice training section (only if AI Clone Voice selected)
- Existing gTTS settings (preserved for Default Voice)
- Cost per minute display in tooltip

**Tooltip for "AI Clone Voice"**:

```
AI Clone Voice
├─ Cost: $0.02 per minute
├─ Quality: Natural, human-like narration
├─ Fallback: Automatically uses Default Voice if credits exhausted
└─ Training: One-time $0.75 to train your custom voice (optional)
```

**Tooltip for "Default Voice"**:

```
Default Voice (gTTS)
├─ Cost: Free
├─ Quality: Standard text-to-speech
└─ Always Available: Works even when credits exhausted
```

---

### 4. New "Animate Scene" Feature in Outline Phase

#### 4.1 Per-Scene Animation Preview

**Location**: `frontend/src/components/StoryWriter/Phases/StoryOutline.tsx`

**Feature**: Add "Animate Scene" hover option alongside existing scene actions

**Implementation**:
- Add to `OutlineHoverActions` component
- Appears on hover over scene cards
- Only generates for a single scene (never bulk)
- Uses cheapest option (480p/Standard Quality) to give users a feel for the result
- Shows cost in tooltip before generation

**UI Component**:

```typescript
// In OutlineHoverActions.tsx
const sceneHoverActions = [
  // Existing actions...
  {
    icon: undefined, // TODO: icon element
    label: 'Animate Scene',
    action: 'animate-scene',
    tooltip: `Animate this scene with video\nCost: ~$0.25 (5 seconds, Standard Quality)\nPreview only - uses cheapest option`,
    onClick: handleAnimateScene,
  },
];
```

**Backend Endpoint**:

```python
@router.post("/animate-scene-preview")
async def animate_scene_preview(
    request: SceneAnimationRequest,
    current_user: Dict[str, Any] = Depends(get_current_user),
) -> SceneAnimationResponse:
    """
    Generate preview animation for a single scene.
    Always uses cheapest option (480p/Standard Quality).
    Per-scene only - never bulk generation.
    """
    # 1. Validate single scene only
    # 2. Use Standard Quality (480p) - cheapest option
    # 3. Generate video with automatic provider routing
    # 4. Return preview video URL
    pass
```

**Cost Management**:
- Always uses Standard Quality (480p) - $0.25 per scene
- Pre-flight validation before generation
- Clear cost display in tooltip
- Per-scene only prevents bulk waste

---

### 5. New "Animate Story with VoiceOver" Button in Writing Phase

#### 5.1 Complete Story Animation

**Location**: `frontend/src/components/StoryWriter/Phases/StoryWriting.tsx`

**Feature**: New button alongside existing HuggingFace video options

**Implementation**:
- Add button in Writing phase toolbar
- Generates complete animated story with synchronized voiceover
- Uses user's voice preference from Setup (AI Clone or Default)
- Shows comprehensive cost breakdown in tooltip
- Pre-flight validation before generation

**UI Component**:

```typescript
```

**Backend Endpoint**:

```python
@router.post("/animate-story-with-voiceover")
async def animate_story_with_voiceover(
    request: StoryAnimationRequest,
    current_user: Dict[str, Any] = Depends(get_current_user),
) -> StoryAnimationResponse:
    """
    Generate complete animated story with synchronized voiceover.
    Uses user's quality and voice preferences from Setup.
    """
    # 1. Pre-flight validation (cost, credits, limits)
    # 2. Generate audio for all scenes (using user's voice preference)
    # 3. Generate videos for all scenes (using user's quality preference)
    # 4. Synchronize audio with video
    # 5. Compile into final story video
    # 6. Return video URL and cost breakdown
    pass
```

**Cost Tooltip Example**:

```
Animate Story with VoiceOver

Cost Breakdown:
├─ Video (Standard Quality): $2.50
│  └─ 10 scenes × $0.25 per scene
├─ Audio (AI Clone Voice): $1.00
│  └─ 50 minutes total × $0.02/minute
└─ Total: $3.50

Settings:
├─ Quality: Standard (480p)
├─ Voice: AI Clone Voice
└─ Duration: 5 seconds per scene

⚠️ This will use $3.50 of your monthly credits
```

---

## Implementation Phases

### Phase 1: Provider-Agnostic Video System (Week 1-2)

**Priority**: HIGH - Solves immediate HuggingFace issues with provider abstraction

**Tasks**:
1. ✅ Create WaveSpeed API client (`backend/services/wavespeed/client.py`)
2. ✅ Add WAN 2.5 text-to-video function
3. ✅ Implement smart provider routing in `main_video_generation.py`
4. ✅ Add quality-based selection (Standard/High/Premium)
5. ✅ Preserve HuggingFace as fallback option
6. ✅ Update `hd_video.py` with provider routing
7. ✅ Add pre-flight cost validation
8. ✅ Update frontend with quality selector (remove provider names)
9. ✅ Add cost tooltips to all buttons
10. ✅ Update subscription limits
11. ✅ Testing and error handling

**Files to Modify**:
- `backend/services/llm_providers/main_video_generation.py` (add routing logic)
- `backend/api/story_writer/utils/hd_video.py` (use quality-based API)
- `backend/api/story_writer/routes/video_generation.py`
- `frontend/src/components/StoryWriter/Phases/StorySetup/GenerationSettingsSection.tsx` (quality selector)
- `frontend/src/components/StoryWriter/components/HdVideoSection.tsx`
- `backend/services/subscription/pricing_service.py`

**Success Criteria**:
- Video generation works reliably with automatic provider routing
- Users see quality options, not provider names
- HuggingFace preserved as fallback
- Cost tracking accurate
- Pre-flight validation prevents waste
- Error messages clear and actionable

---

### Phase 2: Voice Cloning Integration (Week 3-4)

**Priority**: MEDIUM - Enhances audio quality with simple user choice

**Tasks**:
1. ✅ Create Minimax API client (`backend/services/minimax/voice_clone.py`)
2. ✅ Add voice training endpoint
3. ✅ Add voice generation endpoint
4. ✅ Update `audio_generation_service.py` with "AI Clone" vs "Default" logic
5. ✅ Preserve gTTS as always-available fallback
6. ✅ Add automatic fallback when credits exhausted
7. ✅ Update Story Setup with simple voice type selector
8. ✅ Add cost tooltips to voice options
9. ✅ Add voice preview and testing (if AI Clone selected)
10. ✅ Ensure gTTS always works even when credits exhausted

**Files to Create**:
- `backend/services/minimax/voice_clone.py`
- `backend/services/story_writer/voice_management_service.py`

**Files to Modify**:
- `backend/services/story_writer/audio_generation_service.py` (add voice type logic)
- `frontend/src/components/StoryWriter/Phases/StorySetup/GenerationSettingsSection.tsx` (voice type selector)
- `backend/models/story_models.py` (add voice type field)

**Success Criteria**:
- Users see simple choice: "Default Voice" or "AI Clone Voice"
- gTTS always available as fallback
- Automatic fallback when credits exhausted
- Cost tracking accurate
- Voice quality significantly better than gTTS when AI Clone used

---

### Phase 3: New Features - Animate Scene & Animate Story (Week 5-6)

**Priority**: MEDIUM - Add preview and complete animation features

**Tasks**:
1. ✅ Add "Animate Scene" hover option in Outline phase
2. ✅ Implement per-scene animation preview (cheapest option only)
3. ✅ Add "Animate Story with VoiceOver" button in Writing phase
4. ✅ Implement complete story animation with voiceover
5. ✅ Add comprehensive cost tooltips to all buttons
6. ✅ Add pre-flight validation for all animation features
7. ✅ Ensure per-scene only (no bulk generation in Outline)
8. ✅ Update documentation
9. ✅ User testing and feedback

**Files to Create**:
- `backend/api/story_writer/routes/scene_animation.py` (new endpoint)
- `frontend/src/components/StoryWriter/components/AnimateSceneButton.tsx`

**Files to Modify**:
- `frontend/src/components/StoryWriter/Phases/StoryOutlineParts/OutlineHoverActions.tsx` (add Animate Scene)
- `frontend/src/components/StoryWriter/Phases/StoryWriting.tsx` (add Animate Story button)
- `backend/api/story_writer/routes/video_generation.py` (add story animation endpoint)

**Success Criteria**:
- "Animate Scene" works in Outline (per-scene, cheapest option)
- "Animate Story with VoiceOver" works in Writing phase
- All buttons show cost in tooltips
- Pre-flight validation prevents waste
- Good user experience

---

### Phase 4: Integration & Optimization (Week 7-8)

**Priority**: MEDIUM - Polish and optimize

**Tasks**:
1. ✅ Integrate audio with video (synchronized videos)
2. ✅ Improve error handling and retry logic
3. ✅ Add progress indicators
4. ✅ Optimize cost calculations
5. ✅ Add usage analytics
6. ✅ Update documentation
7. ✅ User testing and feedback

**Success Criteria**:
- Smooth end-to-end workflow
- Cost-effective for users
- Reliable generation
- Excellent user experience
- All features work seamlessly together

---

## Cost Management & Prevention of Waste

### Pre-Flight Validation

**Implementation**: `backend/services/subscription/preflight_validator.py`

**Checks Before Generation**:
1. User has sufficient subscription tier
2. Estimated cost within monthly budget
3. Video generation limit not exceeded
4. Audio generation limit not exceeded
5. Total story cost reasonable (<$5 for typical story)

**Validation Flow**:

```python
from typing import Any, Dict, Tuple


def validate_story_generation(
    pricing_service: "PricingService",
    user_id: str,
    num_scenes: int,
    video_resolution: str,
    video_duration: int,
    use_voice_clone: bool,
) -> Tuple[bool, str, Dict[str, Any]]:
    """
    Pre-flight validation before story generation.

    Returns: (allowed, message, cost_breakdown)
    """
    # Calculate estimated costs
    video_cost_per_scene = get_wavespeed_cost(video_resolution, video_duration)
    audio_cost_per_scene = get_voice_clone_cost() if use_voice_clone else 0.0
    total_estimated_cost = (video_cost_per_scene + audio_cost_per_scene) * num_scenes

    # Check limits
    limits = pricing_service.get_user_limits(user_id)
    current_usage = pricing_service.get_current_usage(user_id)

    # Validation logic...

    return (allowed, message, cost_breakdown)
```

### Cost Estimation Display

**Frontend Implementation**:
- Real-time cost calculator in Story Setup
- Per-scene cost breakdown
- Total story cost estimate
- Monthly budget remaining
- Warning if approaching limits

**UI Example**:

```
Video Generation Cost Estimate:
├─ Resolution: 720p ($0.10/second)
├─ Duration: 5 seconds per scene
├─ Scenes: 10
└─ Total: $5.00

Audio Generation Cost Estimate:
├─ Voice: AI Clone Voice ($0.02/minute)
├─ Average: 30 seconds per scene
├─ Scenes: 10
└─ Total: $0.10

Total Estimated Cost: $5.10
Monthly Budget Remaining: $44.00
```

### Usage Tracking

**Enhanced Tracking**:
- Track video generation per scene
- Track audio generation per scene
- Track total story cost
- Alert users approaching limits
- Provide cost breakdown in analytics

---

## Pricing Integration

### WaveSpeed WAN 2.5 Pricing

**Add to `pricing_service.py`**:

```python
# WaveSpeed WAN 2.5 Text-to-Video
{
    "provider": APIProvider.VIDEO,  # Or new WAVESPEED provider
    "model_name": "wan-2.5-480p",
    "cost_per_second": 0.05,
    "description": "WaveSpeed WAN 2.5 Text-to-Video (480p)"
},
{
    "provider": APIProvider.VIDEO,
    "model_name": "wan-2.5-720p",
    "cost_per_second": 0.10,
    "description": "WaveSpeed WAN 2.5 Text-to-Video (720p)"
},
{
    "provider": APIProvider.VIDEO,
    "model_name": "wan-2.5-1080p",
    "cost_per_second": 0.15,
    "description": "WaveSpeed WAN 2.5 Text-to-Video (1080p)"
}
```

### Minimax Voice Clone Pricing

**Add to `pricing_service.py`**:

```python
# Minimax Voice Clone
{
    "provider": APIProvider.AUDIO,  # New provider type
    "model_name": "minimax-voice-clone-train",
    "cost_per_request": 0.75,  # One-time training cost
    "description": "Minimax Voice Clone Training"
},
{
    "provider": APIProvider.AUDIO,
    "model_name": "minimax-voice-clone-generate",
    "cost_per_minute": 0.02,  # Per minute of generated audio
    "description": "Minimax Voice Clone Generation"
}
```

### Subscription Tier Limits

**Update subscription limits**:
- **Free**: 3 stories/month, 480p only, gTTS only
- **Basic**: 10 stories/month, up to 720p, voice clone available
- **Pro**: 50 stories/month, up to 1080p, voice clone included
- **Enterprise**: Unlimited, all features

---

## Technical Architecture

### Backend Services

```
backend/services/
├── wavespeed/
│   ├── __init__.py
│   ├── client.py                    # WaveSpeed API client
│   ├── wan25_video.py               # WAN 2.5 video generation
│   └── models.py                    # Request/response models
├── minimax/
│   ├── __init__.py
│   ├── client.py                    # Minimax API client
│   ├── voice_clone.py               # Voice cloning service
│   └── models.py
└── story_writer/
    ├── audio_generation_service.py  # Updated with voice clone
    └── video_generation_service.py  # Updated with WaveSpeed
```

### Frontend Components

```
frontend/src/components/StoryWriter/
├── Phases/StorySetup/
│   └── GenerationSettingsSection.tsx  # Enhanced with new settings
├── components/
│   ├── HdVideoSection.tsx             # Updated for WaveSpeed
│   ├── VoiceTrainingSection.tsx       # NEW: Voice training UI
│   └── CostEstimationDisplay.tsx      # NEW: Cost calculator
└── hooks/
    └── useStoryGenerationCost.ts      # NEW: Cost calculation hook
```

---

## Error Handling & User Experience

### Error Scenarios

1. **WaveSpeed API Failure**:
   - Retry with exponential backoff (3 attempts)
   - Fall back to HuggingFace if available
   - Clear error message with cost refund notice

2. **Voice Clone Training Failure**:
   - Provide specific error (audio quality, length, format)
   - Suggest improvements
   - Allow retry with different audio

3. **Cost Limit Exceeded**:
   - Pre-flight validation prevents this
   - Show upgrade prompt
   - Suggest reducing scenes/resolution

4. **Audio/Video Mismatch**:
   - Validate audio length matches video duration
   - Auto-trim or extend audio
   - Warn user before generation

### User Feedback
- Progress indicators for all operations
- Clear cost breakdowns
- Quality previews before final generation
- Regeneration options with cost tracking
- Usage analytics dashboard

---

## Testing Plan

### Unit Tests
- WaveSpeed API client
- Voice clone service
- Cost calculation
- Pre-flight validation

### Integration Tests
- End-to-end story generation
- Audio + video synchronization
- Error handling and fallbacks
- Subscription limit enforcement

### User Acceptance Tests
- Story generation workflow
- Voice training process
- Cost estimation accuracy
- Error recovery

---

## Success Metrics

### Technical Metrics
- Video generation success rate >95%
- Audio generation success rate >98%
- Average generation time per scene <30s
- API error rate <2%

### Business Metrics
- User satisfaction with video quality
- Cost per story (target: <$5 for 10-scene story)
- Voice clone adoption rate
- Story completion rate

### User Experience Metrics
- Time to generate story
- Error recovery time
- User understanding of costs
- Feature discovery rate

---

## Provider Management Strategy

### Always-Available Options
- **gTTS**: Always available, always free, works even when credits exhausted
- **HuggingFace**: Preserved as fallback option, works when WaveSpeed unavailable

### Automatic Provider Routing
- **Primary**: WaveSpeed WAN 2.5 (when credits available)
- **Fallback**: HuggingFace (when WaveSpeed unavailable or credits exhausted)
- **Audio Fallback**: gTTS (always available, always free)

### User Experience
- Users never see provider names
- System automatically selects best available option
- Seamless fallback when credits exhausted
- Clear notifications when fallback occurs
- No user intervention required

### No Deprecation
- **HuggingFace**: Kept as permanent fallback option
- **gTTS**: Kept as permanent free option
- All existing functionality preserved
- New features are additions, not replacements

---

## Next Steps

1. **Week 1**: Set up WaveSpeed API access and credentials
2. **Week 1**: Implement provider-agnostic routing system
3. **Week 2**: Integrate into Story Writer with quality-based UI
4. **Week 3**: Implement voice cloning with simple "AI Clone" vs "Default" choice
5. **Week 4**: Add voice training UI (only if AI Clone selected)
6. **Week 5**: Add "Animate Scene" hover option in Outline
7. **Week 6**: Add "Animate Story with VoiceOver" button in Writing
8. **Week 7-8**: Testing, optimization, and polish

## Key Design Principles

1. **Provider Abstraction**: Users never see provider names - only quality/voice options
2. **Preserve Existing**: gTTS and HuggingFace remain available as fallbacks
3. **Cost Transparency**: All buttons show costs in tooltips
4. **Automatic Fallback**: System automatically uses free options when credits exhausted
5. **Per-Scene Only**: Outline phase only allows per-scene generation (no bulk)
6. **User-Friendly**: Simple choices like "Standard Quality" not "WaveSpeed 480p"

---

## Risk Mitigation

| Risk | Mitigation |
|------|------------|
| WaveSpeed API changes | Version pinning, abstraction layer |
| Cost overruns | Strict pre-flight validation |
| Voice quality issues | Quality checks, fallback options |
| User confusion | Clear UI, tooltips, documentation |
| Integration complexity | Phased rollout, extensive testing |

---

*Document Version: 1.0*
*Last Updated: January 2025*
*Priority: HIGH - Immediate Implementation*
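
---

## Appendix: Cost Estimate Sketch

The rates quoted in this plan ($0.05, $0.10, and $0.15 per second of video for Standard, High, and Premium quality; $0.02 per minute of AI Clone Voice audio; gTTS free) can be combined into a single per-story estimate. The following is a minimal sketch of that arithmetic, not the actual `pricing_service.py` or `preflight_validator.py` code. The function `estimate_story_cost`, the two rate constants, and the `narration_seconds_per_scene` parameter are hypothetical names introduced here for illustration only.

```python
from typing import Dict

# Rates copied from this plan; the constant names are illustrative assumptions.
VIDEO_RATE_PER_SECOND: Dict[str, float] = {
    "standard": 0.05,  # 480p
    "high": 0.10,      # 720p
    "premium": 0.15,   # 1080p
}
AI_VOICE_RATE_PER_MINUTE = 0.02  # Default Voice (gTTS) is free


def estimate_story_cost(
    num_scenes: int,
    quality: str = "standard",
    video_duration: int = 5,
    narration_seconds_per_scene: float = 0.0,
    use_ai_voice: bool = False,
) -> Dict[str, float]:
    """Return the estimated video, audio, and total cost in dollars for one story."""
    if quality not in VIDEO_RATE_PER_SECOND:
        raise ValueError(f"Unknown quality: {quality!r}")
    if num_scenes < 0 or video_duration < 0 or narration_seconds_per_scene < 0:
        raise ValueError("Counts and durations must be non-negative")

    # Video is billed per second of generated footage, per scene.
    video = VIDEO_RATE_PER_SECOND[quality] * video_duration * num_scenes

    # AI voice is billed per minute of narration; the default voice costs nothing.
    audio_minutes = narration_seconds_per_scene * num_scenes / 60.0
    audio = AI_VOICE_RATE_PER_MINUTE * audio_minutes if use_ai_voice else 0.0

    return {
        "video": round(video, 2),
        "audio": round(audio, 2),
        "total": round(video + audio, 2),
    }


if __name__ == "__main__":
    # 10 scenes at High quality (720p), 5 s each, 30 s of AI-voice narration per scene.
    print(estimate_story_cost(10, "high", 5, 30, use_ai_voice=True))
```

For 10 scenes at High quality with 5 seconds of video and 30 seconds of AI Clone Voice narration each, this works out to $5.00 for video plus $0.10 for audio (5 minutes × $0.02), or $5.10 in total. Keeping the rates in one table would let the Setup tooltips and the pre-flight check share the same numbers, though where that table should live is a design decision this sketch does not settle.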