Files
ALwrity/docs/story writer/STORY_WRITER_VIDEO_ENHANCEMENT.md

25 KiB
Raw Permalink Blame History

Story Writer Video Generation Enhancement Plan


Current State Analysis

Current Video Generation

  • Provider: HuggingFace (tencent/HunyuanVideo via fal-ai)
  • Issues:
    • Unreliable API responses
    • Limited quality control
    • No audio synchronization
    • Single provider dependency
    • Poor error handling

Current Audio Generation

  • Provider: gTTS (Google Text-to-Speech)
  • Limitations:
    • Robotic, non-natural voice
    • No brand voice consistency
    • Limited language options
    • No emotion control
    • Cannot clone user's voice

Current Story Writer Workflow

  1. User creates story outline with scenes
  2. Each scene has audio_narration text
  3. Audio generated via gTTS per scene
  4. Video generated via HuggingFace per scene
  5. Videos compiled into final story video

Location: backend/api/story_writer/ and frontend/src/components/StoryWriter/


Proposed Enhancements

Core Principles

Provider Abstraction:

  • Users should NOT see provider names (HuggingFace, WaveSpeed, etc.)
  • All provider routing/switching happens automatically in the background
  • Users only see user-friendly options like "Standard Quality" or "Premium Quality"
  • System automatically selects best available provider based on user's subscription and credits

Preserve Existing Options:

  • gTTS remains available as free fallback when credits run out
  • HuggingFace remains available as fallback option
  • All existing functionality preserved
  • New features are additions, not replacements

Cost Transparency:

  • All buttons show cost information in tooltips
  • Users make informed decisions before generating
  • No surprise costs

1. Provider-Agnostic Video Generation System

1.1 Smart Provider Routing

Backend Implementation (backend/services/llm_providers/main_video_generation.py):

def ai_video_generate(
    prompt: str,
    quality: str = "standard",  # "standard" (480p), "high" (720p), "premium" (1080p)
    duration: int = 5,
    audio_file_path: Optional[str] = None,
    user_id: str,
    **kwargs,
) -> bytes:
    """
    Unified video generation entry point.
    Automatically routes to best available provider:
    - WaveSpeed WAN 2.5 (primary, if credits available)
    - HuggingFace (fallback, if WaveSpeed unavailable)
    
    Users never see provider names - only quality options.
    """
    # 1. Check user subscription and credits
    # 2. Select best available provider automatically
    # 3. Route to appropriate provider function
    # 4. Handle fallbacks transparently
    pass

def _select_video_provider(
    user_id: str,
    quality: str,
    pricing_service: PricingService,
) -> Tuple[str, str]:
    """
    Automatically select best video provider.
    Returns: (provider_name, model_name)
    
    Selection logic:
    1. Check user credits/subscription
    2. Prefer WaveSpeed if available and credits sufficient
    3. Fallback to HuggingFace if WaveSpeed unavailable
    4. Return error if no providers available
    """
    # Implementation details...

Key Features:

  • Automatic provider selection (users don't choose)
  • Seamless fallback between providers
  • Quality-based options (Standard/High/Premium) instead of provider names
  • Cost-aware routing (uses cheapest available option)
  • Transparent error handling

Quality Mapping:

  • Standard Quality (480p): $0.05/second - Uses WaveSpeed 480p or HuggingFace
  • High Quality (720p): $0.10/second - Uses WaveSpeed 720p
  • Premium Quality (1080p): $0.15/second - Uses WaveSpeed 1080p

Cost Optimization:

  • Default to Standard Quality (480p) for cost-effectiveness
  • Allow upgrade to High/Premium for final export
  • Pre-flight validation prevents waste
  • Automatic fallback to free options when credits exhausted

2. Enhanced Audio Generation with Voice Cloning

2.1 User-Friendly Voice Selection

Key Principle: Users choose between "AI Clone Voice" or "Default Voice" (gTTS) - no provider names shown.

Backend Implementation (backend/services/story_writer/audio_generation_service.py):

class StoryAudioGenerationService:
    def generate_scene_audio(
        self,
        scene: Dict[str, Any],
        user_id: str,
        use_ai_voice: bool = False,  # User's choice: AI Clone or Default
        **kwargs,
    ) -> Dict[str, Any]:
        """
        Generate audio with automatic provider selection.
        
        If use_ai_voice=True:
            - Try persona voice clone (if trained)
            - Try Minimax voice clone (if credits available)
            - Fallback to gTTS if no credits
        
        If use_ai_voice=False:
            - Use gTTS (always free, always available)
        """
        if use_ai_voice:
            # Try AI voice options
            if self._has_persona_voice(user_id):
                return self._generate_with_persona_voice(scene, user_id)
            elif self._has_credits_for_voice_clone(user_id):
                return self._generate_with_minimax_voice_clone(scene, user_id)
            else:
                # Fallback to gTTS with notification
                logger.info(f"Credits exhausted, falling back to gTTS for user {user_id}")
                return self._generate_with_gtts(scene, **kwargs)
        else:
            # User explicitly chose default voice
            return self._generate_with_gtts(scene, **kwargs)

Voice Options in Story Setup:

  • Default Voice (gTTS): Free, always available, robotic but functional
  • AI Clone Voice: Natural, human-like, requires credits ($0.02/minute)

Cost Considerations:

  • Voice training: One-time cost (~$0.75) - only if user wants to train custom voice
  • Voice generation: ~$0.02 per minute (only when AI Clone Voice selected)
  • gTTS: Always free, always available as fallback
  • Automatic fallback to gTTS when credits exhausted (with user notification)

3. Enhanced Story Setup UI

3.1 Video Generation Settings (Provider-Agnostic)

Location: frontend/src/components/StoryWriter/Phases/StorySetup/GenerationSettingsSection.tsx

User-Friendly Settings (No Provider Names):

interface VideoGenerationSettings {
  // Quality selection (NOT provider selection)
  videoQuality: 'standard' | 'high' | 'premium';  // Maps to 480p/720p/1080p
  
  // Duration
  videoDuration: 5 | 10;  // seconds
  
  // Cost estimation (shown in tooltip)
  estimatedCostPerScene: number;
  totalEstimatedCost: number;
  
  // Provider routing happens automatically in backend
  // Users never see "WaveSpeed" or "HuggingFace"
}

UI Components:

  • Quality selector: "Standard" / "High" / "Premium" (with cost in tooltip)
  • Duration selector: 5s (default) / 10s (premium)
  • Cost tooltip: Shows estimated cost per scene and total
  • Pre-flight validation warnings
  • No provider selector - routing is automatic

Tooltip Example:

Standard Quality (480p)
├─ Cost: $0.25 per scene (5 seconds)
├─ Quality: Good for previews and testing
└─ Provider: Automatically selected based on credits

3.2 Audio Generation Settings (Simple Choice)

New Settings:

interface AudioGenerationSettings {
  // Simple user choice - no provider names
  voiceType: 'default' | 'ai_clone';  // "Default Voice" or "AI Clone Voice"
  
  // Only shown if ai_clone selected
  voiceTrainingStatus: 'not_trained' | 'training' | 'ready' | 'failed';
  
  // Existing gTTS settings (preserved)
  audioLang: string;
  audioSlow: boolean;
  audioRate: number;
}

UI Components:

  • Voice Type Selector:
    • "Default Voice (gTTS)" - Free, always available
    • "AI Clone Voice" - Natural, $0.02/minute (with cost tooltip)
  • Voice training section (only if AI Clone Voice selected)
  • Existing gTTS settings (preserved for Default Voice)
  • Cost per minute display in tooltip

Tooltip for "AI Clone Voice":

AI Clone Voice
├─ Cost: $0.02 per minute
├─ Quality: Natural, human-like narration
├─ Fallback: Automatically uses Default Voice if credits exhausted
└─ Training: One-time $0.75 to train your custom voice (optional)

Tooltip for "Default Voice":

Default Voice (gTTS)
├─ Cost: Free
├─ Quality: Standard text-to-speech
└─ Always Available: Works even when credits exhausted

4. New "Animate Scene" Feature in Outline Phase

4.1 Per-Scene Animation Preview

Location: frontend/src/components/StoryWriter/Phases/StoryOutline.tsx

Feature: Add "Animate Scene" hover option alongside existing scene actions

Implementation:

  • Add to OutlineHoverActions component
  • Appears on hover over scene cards
  • Only generates for single scene (never bulk)
  • Uses cheapest option (480p/Standard Quality) to give users a feel
  • Shows cost in tooltip before generation

UI Component:

// In OutlineHoverActions.tsx
const sceneHoverActions = [
  // Existing actions...
  {
    icon: <PlayArrowIcon />,
    label: 'Animate Scene',
    action: 'animate-scene',
    tooltip: `Animate this scene with video\nCost: ~$0.25 (5 seconds, Standard Quality)\nPreview only - uses cheapest option`,
    onClick: handleAnimateScene,
  },
];

Backend Endpoint:

@router.post("/animate-scene-preview")
async def animate_scene_preview(
    request: SceneAnimationRequest,
    current_user: Dict[str, Any] = Depends(get_current_user),
) -> SceneAnimationResponse:
    """
    Generate preview animation for a single scene.
    Always uses cheapest option (480p/Standard Quality).
    Per-scene only - never bulk generation.
    """
    # 1. Validate single scene only
    # 2. Use Standard Quality (480p) - cheapest option
    # 3. Generate video with automatic provider routing
    # 4. Return preview video URL
    pass

Cost Management:

  • Always uses Standard Quality (480p) - $0.25 per scene
  • Pre-flight validation before generation
  • Clear cost display in tooltip
  • Per-scene only prevents bulk waste

5. New "Animate Story with VoiceOver" Button in Writing Phase

5.1 Complete Story Animation

Location: frontend/src/components/StoryWriter/Phases/StoryWriting.tsx

Feature: New button alongside existing HuggingFace video options

Implementation:

  • Add button in Writing phase toolbar
  • Generates complete animated story with synchronized voiceover
  • Uses user's voice preference from Setup (AI Clone or Default)
  • Shows comprehensive cost breakdown in tooltip
  • Pre-flight validation before generation

UI Component:

<Button
  variant="contained"
  startIcon={<SmartDisplayIcon />}
  onClick={handleAnimateStoryWithVoiceOver}
  disabled={!state.storyContent || isGenerating}
  title={`Animate Story with VoiceOver\n\nCost Breakdown:\n- Video: $${videoCost} (${scenes.length} scenes × $${costPerScene})\n- Audio: $${audioCost} (${totalAudioMinutes} minutes)\n- Total: $${totalCost}\n\nQuality: ${state.videoQuality}\nVoice: ${state.voiceType === 'ai_clone' ? 'AI Clone' : 'Default'}`}
>
  Animate Story with VoiceOver
</Button>

Backend Endpoint:

@router.post("/animate-story-with-voiceover")
async def animate_story_with_voiceover(
    request: StoryAnimationRequest,
    current_user: Dict[str, Any] = Depends(get_current_user),
) -> StoryAnimationResponse:
    """
    Generate complete animated story with synchronized voiceover.
    Uses user's quality and voice preferences from Setup.
    """
    # 1. Pre-flight validation (cost, credits, limits)
    # 2. Generate audio for all scenes (using user's voice preference)
    # 3. Generate videos for all scenes (using user's quality preference)
    # 4. Synchronize audio with video
    # 5. Compile into final story video
    # 6. Return video URL and cost breakdown
    pass

Cost Tooltip Example:

Animate Story with VoiceOver

Cost Breakdown:
├─ Video (Standard Quality): $2.50
│  └─ 10 scenes × $0.25 per scene
├─ Audio (AI Clone Voice): $1.00
│  └─ 50 minutes total × $0.02/minute
└─ Total: $3.50

Settings:
├─ Quality: Standard (480p)
├─ Voice: AI Clone Voice
└─ Duration: 5 seconds per scene

⚠️ This will use $3.50 of your monthly credits

Implementation Phases

Phase 1: Provider-Agnostic Video System (Week 1-2)

Priority: HIGH - Solves immediate HuggingFace issues with provider abstraction

Tasks:

  1. Create WaveSpeed API client (backend/services/wavespeed/client.py)
  2. Add WAN 2.5 text-to-video function
  3. Implement smart provider routing in main_video_generation.py
  4. Add quality-based selection (Standard/High/Premium)
  5. Preserve HuggingFace as fallback option
  6. Update hd_video.py with provider routing
  7. Add pre-flight cost validation
  8. Update frontend with quality selector (remove provider names)
  9. Add cost tooltips to all buttons
  10. Update subscription limits
  11. Testing and error handling

Files to Modify:

  • backend/services/llm_providers/main_video_generation.py (add routing logic)
  • backend/api/story_writer/utils/hd_video.py (use quality-based API)
  • backend/api/story_writer/routes/video_generation.py
  • frontend/src/components/StoryWriter/Phases/StorySetup/GenerationSettingsSection.tsx (quality selector)
  • frontend/src/components/StoryWriter/components/HdVideoSection.tsx
  • backend/services/subscription/pricing_service.py

Success Criteria:

  • Video generation works reliably with automatic provider routing
  • Users see quality options, not provider names
  • HuggingFace preserved as fallback
  • Cost tracking accurate
  • Pre-flight validation prevents waste
  • Error messages clear and actionable

Phase 2: Voice Cloning Integration (Week 3-4)

Priority: MEDIUM - Enhances audio quality with simple user choice

Tasks:

  1. Create Minimax API client (backend/services/minimax/voice_clone.py)
  2. Add voice training endpoint
  3. Add voice generation endpoint
  4. Update audio_generation_service.py with "AI Clone" vs "Default" logic
  5. Preserve gTTS as always-available fallback
  6. Add automatic fallback when credits exhausted
  7. Update Story Setup with simple voice type selector
  8. Add cost tooltips to voice options
  9. Add voice preview and testing (if AI Clone selected)
  10. Ensure gTTS always works even when credits exhausted

Files to Create:

  • backend/services/minimax/voice_clone.py
  • backend/services/story_writer/voice_management_service.py

Files to Modify:

  • backend/services/story_writer/audio_generation_service.py (add voice type logic)
  • frontend/src/components/StoryWriter/Phases/StorySetup/GenerationSettingsSection.tsx (voice type selector)
  • backend/models/story_models.py (add voice type field)

Success Criteria:

  • Users see simple choice: "Default Voice" or "AI Clone Voice"
  • gTTS always available as fallback
  • Automatic fallback when credits exhausted
  • Cost tracking accurate
  • Voice quality significantly better than gTTS when AI Clone used

Phase 3: New Features - Animate Scene & Animate Story (Week 5-6)

Priority: MEDIUM - Add preview and complete animation features

Tasks:

  1. Add "Animate Scene" hover option in Outline phase
  2. Implement per-scene animation preview (cheapest option only)
  3. Add "Animate Story with VoiceOver" button in Writing phase
  4. Implement complete story animation with voiceover
  5. Add comprehensive cost tooltips to all buttons
  6. Add pre-flight validation for all animation features
  7. Ensure per-scene only (no bulk generation in Outline)
  8. Update documentation
  9. User testing and feedback

Files to Create:

  • backend/api/story_writer/routes/scene_animation.py (new endpoint)
  • frontend/src/components/StoryWriter/components/AnimateSceneButton.tsx

Files to Modify:

  • frontend/src/components/StoryWriter/Phases/StoryOutlineParts/OutlineHoverActions.tsx (add Animate Scene)
  • frontend/src/components/StoryWriter/Phases/StoryWriting.tsx (add Animate Story button)
  • backend/api/story_writer/routes/video_generation.py (add story animation endpoint)

Success Criteria:

  • "Animate Scene" works in Outline (per-scene, cheapest option)
  • "Animate Story with VoiceOver" works in Writing phase
  • All buttons show cost in tooltips
  • Pre-flight validation prevents waste
  • Good user experience

Phase 4: Integration & Optimization (Week 7-8)

Priority: MEDIUM - Polish and optimize

Tasks:

  1. Integrate audio with video (synchronized videos)
  2. Improve error handling and retry logic
  3. Add progress indicators
  4. Optimize cost calculations
  5. Add usage analytics
  6. Update documentation
  7. User testing and feedback

Success Criteria:

  • Smooth end-to-end workflow
  • Cost-effective for users
  • Reliable generation
  • Excellent user experience
  • All features work seamlessly together

Cost Management & Prevention of Waste

Pre-Flight Validation

Implementation: backend/services/subscription/preflight_validator.py

Checks Before Generation:

  1. User has sufficient subscription tier
  2. Estimated cost within monthly budget
  3. Video generation limit not exceeded
  4. Audio generation limit not exceeded
  5. Total story cost reasonable (<$5 for typical story)

Validation Flow:

def validate_story_generation(
    pricing_service: PricingService,
    user_id: str,
    num_scenes: int,
    video_resolution: str,
    video_duration: int,
    use_voice_clone: bool,
) -> Tuple[bool, str, Dict[str, Any]]:
    """
    Pre-flight validation before story generation.
    Returns: (allowed, message, cost_breakdown)
    """
    # Calculate estimated costs
    video_cost_per_scene = get_wavespeed_cost(video_resolution, video_duration)
    audio_cost_per_scene = get_voice_clone_cost() if use_voice_clone else 0.0
    
    total_estimated_cost = (video_cost_per_scene + audio_cost_per_scene) * num_scenes
    
    # Check limits
    limits = pricing_service.get_user_limits(user_id)
    current_usage = pricing_service.get_current_usage(user_id)
    
    # Validation logic...
    return (allowed, message, cost_breakdown)

Cost Estimation Display

Frontend Implementation:

  • Real-time cost calculator in Story Setup
  • Per-scene cost breakdown
  • Total story cost estimate
  • Monthly budget remaining
  • Warning if approaching limits

UI Example:

Video Generation Cost Estimate:
├─ Resolution: 720p ($0.10/second)
├─ Duration: 5 seconds per scene
├─ Scenes: 10
└─ Total: $5.00

Audio Generation Cost Estimate:
├─ Provider: Voice Clone ($0.02/minute)
├─ Average: 30 seconds per scene
├─ Scenes: 10
└─ Total: $1.00

Total Estimated Cost: $6.00
Monthly Budget Remaining: $44.00

Usage Tracking

Enhanced Tracking:

  • Track video generation per scene
  • Track audio generation per scene
  • Track total story cost
  • Alert users approaching limits
  • Provide cost breakdown in analytics

Pricing Integration

WaveSpeed WAN 2.5 Pricing

Add to pricing_service.py:

# WaveSpeed WAN 2.5 Text-to-Video
{
    "provider": APIProvider.VIDEO,  # Or new WAVESPEED provider
    "model_name": "wan-2.5-480p",
    "cost_per_second": 0.05,
    "description": "WaveSpeed WAN 2.5 Text-to-Video (480p)"
},
{
    "provider": APIProvider.VIDEO,
    "model_name": "wan-2.5-720p",
    "cost_per_second": 0.10,
    "description": "WaveSpeed WAN 2.5 Text-to-Video (720p)"
},
{
    "provider": APIProvider.VIDEO,
    "model_name": "wan-2.5-1080p",
    "cost_per_second": 0.15,
    "description": "WaveSpeed WAN 2.5 Text-to-Video (1080p)"
}

Minimax Voice Clone Pricing

Add to pricing_service.py:

# Minimax Voice Clone
{
    "provider": APIProvider.AUDIO,  # New provider type
    "model_name": "minimax-voice-clone-train",
    "cost_per_request": 0.75,  # One-time training cost
    "description": "Minimax Voice Clone Training"
},
{
    "provider": APIProvider.AUDIO,
    "model_name": "minimax-voice-clone-generate",
    "cost_per_minute": 0.02,  # Per minute of generated audio
    "description": "Minimax Voice Clone Generation"
}

Subscription Tier Limits

Update subscription limits:

  • Free: 3 stories/month, 480p only, gTTS only
  • Basic: 10 stories/month, up to 720p, voice clone available
  • Pro: 50 stories/month, up to 1080p, voice clone included
  • Enterprise: Unlimited, all features

Technical Architecture

Backend Services

backend/services/
├── wavespeed/
│   ├── __init__.py
│   ├── client.py              # WaveSpeed API client
│   ├── wan25_video.py        # WAN 2.5 video generation
│   └── models.py              # Request/response models
├── minimax/
│   ├── __init__.py
│   ├── client.py              # Minimax API client
│   ├── voice_clone.py         # Voice cloning service
│   └── models.py
└── story_writer/
    ├── audio_generation_service.py  # Updated with voice clone
    └── video_generation_service.py   # Updated with WaveSpeed

Frontend Components

frontend/src/components/StoryWriter/
├── Phases/StorySetup/
│   └── GenerationSettingsSection.tsx  # Enhanced with new settings
├── components/
│   ├── HdVideoSection.tsx              # Updated for WaveSpeed
│   ├── VoiceTrainingSection.tsx        # NEW: Voice training UI
│   └── CostEstimationDisplay.tsx        # NEW: Cost calculator
└── hooks/
    └── useStoryGenerationCost.ts        # NEW: Cost calculation hook

Error Handling & User Experience

Error Scenarios

  1. WaveSpeed API Failure:

    • Retry with exponential backoff (3 attempts)
    • Fallback to HuggingFace if available
    • Clear error message with cost refund notice
  2. Voice Clone Training Failure:

    • Provide specific error (audio quality, length, format)
    • Suggest improvements
    • Allow retry with different audio
  3. Cost Limit Exceeded:

    • Pre-flight validation prevents this
    • Show upgrade prompt
    • Suggest reducing scenes/resolution
  4. Audio/Video Mismatch:

    • Validate audio length matches video duration
    • Auto-trim or extend audio
    • Warn user before generation

User Feedback

  • Progress indicators for all operations
  • Clear cost breakdowns
  • Quality previews before final generation
  • Regeneration options with cost tracking
  • Usage analytics dashboard

Testing Plan

Unit Tests

  • WaveSpeed API client
  • Voice clone service
  • Cost calculation
  • Pre-flight validation

Integration Tests

  • End-to-end story generation
  • Audio + video synchronization
  • Error handling and fallbacks
  • Subscription limit enforcement

User Acceptance Tests

  • Story generation workflow
  • Voice training process
  • Cost estimation accuracy
  • Error recovery

Success Metrics

Technical Metrics

  • Video generation success rate >95%
  • Audio generation success rate >98%
  • Average generation time per scene <30s
  • API error rate <2%

Business Metrics

  • User satisfaction with video quality
  • Cost per story (target: <$5 for 10-scene story)
  • Voice clone adoption rate
  • Story completion rate

User Experience Metrics

  • Time to generate story
  • Error recovery time
  • User understanding of costs
  • Feature discovery rate

Provider Management Strategy

Always-Available Options

  • gTTS: Always available, always free, works even when credits exhausted
  • HuggingFace: Preserved as fallback option, works when WaveSpeed unavailable

Automatic Provider Routing

  • Primary: WaveSpeed WAN 2.5 (when credits available)
  • Fallback: HuggingFace (when WaveSpeed unavailable or credits exhausted)
  • Audio Fallback: gTTS (always available, always free)

User Experience

  • Users never see provider names
  • System automatically selects best available option
  • Seamless fallback when credits exhausted
  • Clear notifications when fallback occurs
  • No user intervention required

No Deprecation

  • HuggingFace: Kept as permanent fallback option
  • gTTS: Kept as permanent free option
  • All existing functionality preserved
  • New features are additions, not replacements

Next Steps

  1. Week 1: Set up WaveSpeed API access and credentials
  2. Week 1: Implement provider-agnostic routing system
  3. Week 2: Integrate into Story Writer with quality-based UI
  4. Week 3: Implement voice cloning with simple "AI Clone" vs "Default" choice
  5. Week 4: Add voice training UI (only if AI Clone selected)
  6. Week 5: Add "Animate Scene" hover option in Outline
  7. Week 6: Add "Animate Story with VoiceOver" button in Writing
  8. Week 7-8: Testing, optimization, and polish

Key Design Principles

  1. Provider Abstraction: Users never see provider names - only quality/voice options
  2. Preserve Existing: gTTS and HuggingFace remain available as fallbacks
  3. Cost Transparency: All buttons show costs in tooltips
  4. Automatic Fallback: System automatically uses free options when credits exhausted
  5. Per-Scene Only: Outline phase only allows per-scene generation (no bulk)
  6. User-Friendly: Simple choices like "Standard Quality" not "WaveSpeed 480p"

Risk Mitigation

Risk Mitigation
WaveSpeed API changes Version pinning, abstraction layer
Cost overruns Strict pre-flight validation
Voice quality issues Quality checks, fallback options
User confusion Clear UI, tooltips, documentation
Integration complexity Phased rollout, extensive testing

Document Version: 1.0
Last Updated: January 2025
Priority: HIGH - Immediate Implementation