Story Writer Video Generation Enhancement Plan
Executive Summary
This document outlines the immediate enhancement plan for ALwrity's Story Writer: make WaveSpeed AI models the primary video generation path in place of the unreliable HuggingFace integration (which stays available as a fallback), and add professional voice cloning alongside the basic gTTS audio. The change delivers immediate value to users while resolving current technical issues.
Current State Analysis
Current Video Generation
- Provider: HuggingFace (tencent/HunyuanVideo via fal-ai)
- Issues:
- Unreliable API responses
- Limited quality control
- No audio synchronization
- Single provider dependency
- Poor error handling
Current Audio Generation
- Provider: gTTS (Google Text-to-Speech)
- Limitations:
- Robotic, non-natural voice
- No brand voice consistency
- Limited language options
- No emotion control
- Cannot clone user's voice
Current Story Writer Workflow
- User creates story outline with scenes
- Each scene has audio_narration text
- Audio generated via gTTS per scene
- Video generated via HuggingFace per scene
- Videos compiled into final story video
Location: backend/api/story_writer/ and frontend/src/components/StoryWriter/
Proposed Enhancements
Core Principles
Provider Abstraction:
- Users should NOT see provider names (HuggingFace, WaveSpeed, etc.)
- All provider routing/switching happens automatically in the background
- Users only see user-friendly options like "Standard Quality" or "Premium Quality"
- System automatically selects best available provider based on user's subscription and credits
Preserve Existing Options:
- gTTS remains available as free fallback when credits run out
- HuggingFace remains available as fallback option
- All existing functionality preserved
- New features are additions, not replacements
Cost Transparency:
- All buttons show cost information in tooltips
- Users make informed decisions before generating
- No surprise costs
1. Provider-Agnostic Video Generation System
1.1 Smart Provider Routing
Backend Implementation (backend/services/llm_providers/main_video_generation.py):
from typing import Optional, Tuple

def ai_video_generate(
    prompt: str,
    user_id: str,  # required, so it must come before the defaulted parameters
    quality: str = "standard",  # "standard" (480p), "high" (720p), "premium" (1080p)
    duration: int = 5,
    audio_file_path: Optional[str] = None,
    **kwargs,
) -> bytes:
    """
    Unified video generation entry point.
    Automatically routes to best available provider:
    - WaveSpeed WAN 2.5 (primary, if credits available)
    - HuggingFace (fallback, if WaveSpeed unavailable)
    Users never see provider names - only quality options.
    """
    # 1. Check user subscription and credits
    # 2. Select best available provider automatically
    # 3. Route to appropriate provider function
    # 4. Handle fallbacks transparently
    pass

def _select_video_provider(
    user_id: str,
    quality: str,
    pricing_service: PricingService,
) -> Tuple[str, str]:
    """
    Automatically select best video provider.
    Returns: (provider_name, model_name)
    Selection logic:
    1. Check user credits/subscription
    2. Prefer WaveSpeed if available and credits sufficient
    3. Fallback to HuggingFace if WaveSpeed unavailable
    4. Return error if no providers available
    """
    # Implementation details...
Key Features:
- Automatic provider selection (users don't choose)
- Seamless fallback between providers
- Quality-based options (Standard/High/Premium) instead of provider names
- Cost-aware routing (uses cheapest available option)
- Transparent error handling
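The selection order described above (WaveSpeed first, HuggingFace as fallback, an error otherwise) could be sketched roughly as follows. This is only an illustration under stated assumptions: `ProviderStatus`, `select_video_provider`, and the provider strings are hypothetical names, and the real implementation would read credits and availability from the pricing service rather than from a plain dataclass.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ProviderStatus:
    """Hypothetical snapshot of what the router knows at request time."""
    wavespeed_available: bool
    huggingface_available: bool
    credits_usd: float


# Model names mirror the pricing entries listed later in this plan.
WAVESPEED_MODELS = {
    "standard": "wan-2.5-480p",
    "high": "wan-2.5-720p",
    "premium": "wan-2.5-1080p",
}


def select_video_provider(
    quality: str, estimated_cost_usd: float, status: ProviderStatus
) -> Tuple[str, str]:
    """Return (provider_name, model_name) following the plan's selection order."""
    if quality not in WAVESPEED_MODELS:
        raise ValueError(f"Unknown quality option: {quality}")
    # Prefer WaveSpeed when it is reachable and the user can afford the request.
    if status.wavespeed_available and status.credits_usd >= estimated_cost_usd:
        return ("wavespeed", WAVESPEED_MODELS[quality])
    # Otherwise fall back to the existing HuggingFace path (assumed free here).
    if status.huggingface_available:
        return ("huggingface", "tencent/HunyuanVideo")
    raise RuntimeError("No video provider available")
```

Keeping the decision in a pure function like this makes the routing easy to unit test without calling either provider.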
Quality Mapping:
- Standard Quality (480p): $0.05/second - Uses WaveSpeed 480p or HuggingFace
- High Quality (720p): $0.10/second - Uses WaveSpeed 720p
- Premium Quality (1080p): $0.15/second - Uses WaveSpeed 1080p
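As a quick check of the per-second rates above, the per-scene video cost is simply the rate multiplied by the clip duration. The helper name `scene_video_cost` below is hypothetical; `Decimal` is used only to keep the currency arithmetic exact.

```python
from decimal import Decimal

# Per-second rates taken from the quality mapping above.
PRICE_PER_SECOND = {
    "standard": Decimal("0.05"),  # 480p
    "high": Decimal("0.10"),      # 720p
    "premium": Decimal("0.15"),   # 1080p
}


def scene_video_cost(quality: str, duration_seconds: int) -> Decimal:
    """Cost in USD of one scene at the given quality and duration."""
    return PRICE_PER_SECOND[quality] * duration_seconds
```

For example, `scene_video_cost("standard", 5)` gives 0.25, which matches the $0.25 per-scene figure used in the tooltips elsewhere in this plan.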
Cost Optimization:
- Default to Standard Quality (480p) for cost-effectiveness
- Allow upgrade to High/Premium for final export
- Pre-flight validation prevents waste
- Automatic fallback to free options when credits exhausted
2. Enhanced Audio Generation with Voice Cloning
2.1 User-Friendly Voice Selection
Key Principle: Users choose between "AI Clone Voice" or "Default Voice" (gTTS) - no provider names shown.
Backend Implementation (backend/services/story_writer/audio_generation_service.py):
import logging
from typing import Any, Dict

logger = logging.getLogger(__name__)

class StoryAudioGenerationService:
    def generate_scene_audio(
        self,
        scene: Dict[str, Any],
        user_id: str,
        use_ai_voice: bool = False,  # User's choice: AI Clone or Default
        **kwargs,
    ) -> Dict[str, Any]:
        """
        Generate audio with automatic provider selection.
        If use_ai_voice=True:
        - Try persona voice clone (if trained)
        - Try Minimax voice clone (if credits available)
        - Fallback to gTTS if no credits
        If use_ai_voice=False:
        - Use gTTS (always free, always available)
        """
        if use_ai_voice:
            # Try AI voice options
            if self._has_persona_voice(user_id):
                return self._generate_with_persona_voice(scene, user_id)
            elif self._has_credits_for_voice_clone(user_id):
                return self._generate_with_minimax_voice_clone(scene, user_id)
            else:
                # Fallback to gTTS with notification
                logger.info(f"Credits exhausted, falling back to gTTS for user {user_id}")
                return self._generate_with_gtts(scene, **kwargs)
        else:
            # User explicitly chose default voice
            return self._generate_with_gtts(scene, **kwargs)
Voice Options in Story Setup:
- Default Voice (gTTS): Free, always available, robotic but functional
- AI Clone Voice: Natural, human-like, requires credits ($0.02/minute)
Cost Considerations:
- Voice training: One-time cost (~$0.75) - only if user wants to train custom voice
- Voice generation: ~$0.02 per minute (only when AI Clone Voice selected)
- gTTS: Always free, always available as fallback
- Automatic fallback to gTTS when credits exhausted (with user notification)
3. Enhanced Story Setup UI
3.1 Video Generation Settings (Provider-Agnostic)
Location: frontend/src/components/StoryWriter/Phases/StorySetup/GenerationSettingsSection.tsx
User-Friendly Settings (No Provider Names):
interface VideoGenerationSettings {
// Quality selection (NOT provider selection)
videoQuality: 'standard' | 'high' | 'premium'; // Maps to 480p/720p/1080p
// Duration
videoDuration: 5 | 10; // seconds
// Cost estimation (shown in tooltip)
estimatedCostPerScene: number;
totalEstimatedCost: number;
// Provider routing happens automatically in backend
// Users never see "WaveSpeed" or "HuggingFace"
}
UI Components:
- Quality selector: "Standard" / "High" / "Premium" (with cost in tooltip)
- Duration selector: 5s (default) / 10s (premium)
- Cost tooltip: Shows estimated cost per scene and total
- Pre-flight validation warnings
- No provider selector - routing is automatic
Tooltip Example:
Standard Quality (480p)
├─ Cost: $0.25 per scene (5 seconds)
├─ Quality: Good for previews and testing
└─ Provider: Automatically selected based on credits
3.2 Audio Generation Settings (Simple Choice)
New Settings:
interface AudioGenerationSettings {
// Simple user choice - no provider names
voiceType: 'default' | 'ai_clone'; // "Default Voice" or "AI Clone Voice"
// Only shown if ai_clone selected
voiceTrainingStatus: 'not_trained' | 'training' | 'ready' | 'failed';
// Existing gTTS settings (preserved)
audioLang: string;
audioSlow: boolean;
audioRate: number;
}
UI Components:
- Voice Type Selector:
- "Default Voice (gTTS)" - Free, always available
- "AI Clone Voice" - Natural, $0.02/minute (with cost tooltip)
- Voice training section (only if AI Clone Voice selected)
- Existing gTTS settings (preserved for Default Voice)
- Cost per minute display in tooltip
Tooltip for "AI Clone Voice":
AI Clone Voice
├─ Cost: $0.02 per minute
├─ Quality: Natural, human-like narration
├─ Fallback: Automatically uses Default Voice if credits exhausted
└─ Training: One-time $0.75 to train your custom voice (optional)
Tooltip for "Default Voice":
Default Voice (gTTS)
├─ Cost: Free
├─ Quality: Standard text-to-speech
└─ Always Available: Works even when credits exhausted
4. New "Animate Scene" Feature in Outline Phase
4.1 Per-Scene Animation Preview
Location: frontend/src/components/StoryWriter/Phases/StoryOutline.tsx
Feature: Add "Animate Scene" hover option alongside existing scene actions
Implementation:
- Add to OutlineHoverActions component
- Appears on hover over scene cards
- Only generates for single scene (never bulk)
- Uses cheapest option (480p/Standard Quality) to give users a feel
- Shows cost in tooltip before generation
UI Component:
// In OutlineHoverActions.tsx
const sceneHoverActions = [
// Existing actions...
{
icon: <PlayArrowIcon />,
label: 'Animate Scene',
action: 'animate-scene',
tooltip: `Animate this scene with video\nCost: ~$0.25 (5 seconds, Standard Quality)\nPreview only - uses cheapest option`,
onClick: handleAnimateScene,
},
];
Backend Endpoint:
@router.post("/animate-scene-preview")
async def animate_scene_preview(
    request: SceneAnimationRequest,
    current_user: Dict[str, Any] = Depends(get_current_user),
) -> SceneAnimationResponse:
    """
    Generate preview animation for a single scene.
    Always uses cheapest option (480p/Standard Quality).
    Per-scene only - never bulk generation.
    """
    # 1. Validate single scene only
    # 2. Use Standard Quality (480p) - cheapest option
    # 3. Generate video with automatic provider routing
    # 4. Return preview video URL
    pass
Cost Management:
- Always uses Standard Quality (480p) - $0.25 per scene
- Pre-flight validation before generation
- Clear cost display in tooltip
- Per-scene only prevents bulk waste
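The "single scene, cheapest option" rule above could be enforced before any provider call with a small guard like the one below. This is a sketch only: `ScenePreviewRequest` and `normalize_preview_request` are hypothetical names, and the real `SceneAnimationRequest` model may have a different shape.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScenePreviewRequest:
    """Hypothetical request shape for the preview endpoint."""
    scene_ids: List[str]
    quality: str = "standard"
    duration_seconds: int = 5


def normalize_preview_request(req: ScenePreviewRequest) -> ScenePreviewRequest:
    """Reject bulk requests and pin the preview to the cheapest settings."""
    if len(req.scene_ids) != 1:
        raise ValueError("Preview animation is per-scene only")
    # Ignore any requested upgrade: previews always use Standard Quality, 5 seconds.
    return ScenePreviewRequest(
        scene_ids=list(req.scene_ids), quality="standard", duration_seconds=5
    )
```

Overriding the quality on the server side, rather than trusting the client, keeps the preview cost fixed at the advertised $0.25.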
5. New "Animate Story with VoiceOver" Button in Writing Phase
5.1 Complete Story Animation
Location: frontend/src/components/StoryWriter/Phases/StoryWriting.tsx
Feature: New button alongside existing HuggingFace video options
Implementation:
- Add button in Writing phase toolbar
- Generates complete animated story with synchronized voiceover
- Uses user's voice preference from Setup (AI Clone or Default)
- Shows comprehensive cost breakdown in tooltip
- Pre-flight validation before generation
UI Component:
<Button
variant="contained"
startIcon={<SmartDisplayIcon />}
onClick={handleAnimateStoryWithVoiceOver}
disabled={!state.storyContent || isGenerating}
title={`Animate Story with VoiceOver\n\nCost Breakdown:\n- Video: $${videoCost} (${scenes.length} scenes × $${costPerScene})\n- Audio: $${audioCost} (${totalAudioMinutes} minutes)\n- Total: $${totalCost}\n\nQuality: ${state.videoQuality}\nVoice: ${state.voiceType === 'ai_clone' ? 'AI Clone' : 'Default'}`}
>
Animate Story with VoiceOver
</Button>
Backend Endpoint:
@router.post("/animate-story-with-voiceover")
async def animate_story_with_voiceover(
    request: StoryAnimationRequest,
    current_user: Dict[str, Any] = Depends(get_current_user),
) -> StoryAnimationResponse:
    """
    Generate complete animated story with synchronized voiceover.
    Uses user's quality and voice preferences from Setup.
    """
    # 1. Pre-flight validation (cost, credits, limits)
    # 2. Generate audio for all scenes (using user's voice preference)
    # 3. Generate videos for all scenes (using user's quality preference)
    # 4. Synchronize audio with video
    # 5. Compile into final story video
    # 6. Return video URL and cost breakdown
    pass
Cost Tooltip Example:
Animate Story with VoiceOver
Cost Breakdown:
├─ Video (Standard Quality): $2.50
│ └─ 10 scenes × $0.25 per scene
├─ Audio (AI Clone Voice): $1.00
│ └─ 50 minutes total × $0.02/minute
└─ Total: $3.50
Settings:
├─ Quality: Standard (480p)
├─ Voice: AI Clone Voice
└─ Duration: 5 seconds per scene
⚠️ This will use $3.50 of your monthly credits
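The breakdown in the tooltip above can be reproduced with a small calculation. The function name `story_cost_breakdown` and its parameters are hypothetical; the rates and quantities in the usage note are the ones shown in the tooltip.

```python
from decimal import Decimal
from typing import Dict


def story_cost_breakdown(
    num_scenes: int,
    seconds_per_scene: int,
    video_price_per_second: Decimal,
    audio_minutes: Decimal,
    audio_price_per_minute: Decimal,
) -> Dict[str, Decimal]:
    """Return video, audio, and total cost in USD for a whole story."""
    video = video_price_per_second * seconds_per_scene * num_scenes
    audio = audio_price_per_minute * audio_minutes
    return {"video": video, "audio": audio, "total": video + audio}
```

With 10 scenes of 5 seconds at $0.05/second and 50 minutes of audio at $0.02/minute, this yields $2.50 for video, $1.00 for audio, and $3.50 in total, matching the example.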
Implementation Phases
Phase 1: Provider-Agnostic Video System (Week 1-2)
Priority: HIGH - Solves immediate HuggingFace issues with provider abstraction
Tasks:
- ✅ Create WaveSpeed API client (backend/services/wavespeed/client.py)
- ✅ Add WAN 2.5 text-to-video function
- ✅ Implement smart provider routing in main_video_generation.py
- ✅ Add quality-based selection (Standard/High/Premium)
- ✅ Preserve HuggingFace as fallback option
- ✅ Update hd_video.py with provider routing
- ✅ Add pre-flight cost validation
- ✅ Update frontend with quality selector (remove provider names)
- ✅ Add cost tooltips to all buttons
- ✅ Update subscription limits
- ✅ Testing and error handling
Files to Modify:
- backend/services/llm_providers/main_video_generation.py (add routing logic)
- backend/api/story_writer/utils/hd_video.py (use quality-based API)
- backend/api/story_writer/routes/video_generation.py
- frontend/src/components/StoryWriter/Phases/StorySetup/GenerationSettingsSection.tsx (quality selector)
- frontend/src/components/StoryWriter/components/HdVideoSection.tsx
- backend/services/subscription/pricing_service.py
Success Criteria:
- Video generation works reliably with automatic provider routing
- Users see quality options, not provider names
- HuggingFace preserved as fallback
- Cost tracking accurate
- Pre-flight validation prevents waste
- Error messages clear and actionable
Phase 2: Voice Cloning Integration (Week 3-4)
Priority: MEDIUM - Enhances audio quality with simple user choice
Tasks:
- ✅ Create Minimax API client (backend/services/minimax/voice_clone.py)
- ✅ Add voice training endpoint
- ✅ Add voice generation endpoint
- ✅ Update audio_generation_service.py with "AI Clone" vs "Default" logic
- ✅ Preserve gTTS as always-available fallback
- ✅ Add automatic fallback when credits exhausted
- ✅ Update Story Setup with simple voice type selector
- ✅ Add cost tooltips to voice options
- ✅ Add voice preview and testing (if AI Clone selected)
- ✅ Ensure gTTS always works even when credits exhausted
Files to Create:
- backend/services/minimax/voice_clone.py
- backend/services/story_writer/voice_management_service.py
Files to Modify:
- backend/services/story_writer/audio_generation_service.py (add voice type logic)
- frontend/src/components/StoryWriter/Phases/StorySetup/GenerationSettingsSection.tsx (voice type selector)
- backend/models/story_models.py (add voice type field)
Success Criteria:
- Users see simple choice: "Default Voice" or "AI Clone Voice"
- gTTS always available as fallback
- Automatic fallback when credits exhausted
- Cost tracking accurate
- Voice quality significantly better than gTTS when AI Clone used
Phase 3: New Features - Animate Scene & Animate Story (Week 5-6)
Priority: MEDIUM - Add preview and complete animation features
Tasks:
- ✅ Add "Animate Scene" hover option in Outline phase
- ✅ Implement per-scene animation preview (cheapest option only)
- ✅ Add "Animate Story with VoiceOver" button in Writing phase
- ✅ Implement complete story animation with voiceover
- ✅ Add comprehensive cost tooltips to all buttons
- ✅ Add pre-flight validation for all animation features
- ✅ Ensure per-scene only (no bulk generation in Outline)
- ✅ Update documentation
- ✅ User testing and feedback
Files to Create:
- backend/api/story_writer/routes/scene_animation.py (new endpoint)
- frontend/src/components/StoryWriter/components/AnimateSceneButton.tsx
Files to Modify:
- frontend/src/components/StoryWriter/Phases/StoryOutlineParts/OutlineHoverActions.tsx (add Animate Scene)
- frontend/src/components/StoryWriter/Phases/StoryWriting.tsx (add Animate Story button)
- backend/api/story_writer/routes/video_generation.py (add story animation endpoint)
Success Criteria:
- "Animate Scene" works in Outline (per-scene, cheapest option)
- "Animate Story with VoiceOver" works in Writing phase
- All buttons show cost in tooltips
- Pre-flight validation prevents waste
- Good user experience
Phase 4: Integration & Optimization (Week 7-8)
Priority: MEDIUM - Polish and optimize
Tasks:
- ✅ Integrate audio with video (synchronized videos)
- ✅ Improve error handling and retry logic
- ✅ Add progress indicators
- ✅ Optimize cost calculations
- ✅ Add usage analytics
- ✅ Update documentation
- ✅ User testing and feedback
Success Criteria:
- Smooth end-to-end workflow
- Cost-effective for users
- Reliable generation
- Excellent user experience
- All features work seamlessly together
Cost Management & Prevention of Waste
Pre-Flight Validation
Implementation: backend/services/subscription/preflight_validator.py
Checks Before Generation:
- User has sufficient subscription tier
- Estimated cost within monthly budget
- Video generation limit not exceeded
- Audio generation limit not exceeded
- Total story cost reasonable (<$5 for typical story)
Validation Flow:
from typing import Any, Dict, Tuple

def validate_story_generation(
    pricing_service: PricingService,
    user_id: str,
    num_scenes: int,
    video_resolution: str,
    video_duration: int,
    use_voice_clone: bool,
) -> Tuple[bool, str, Dict[str, Any]]:
    """
    Pre-flight validation before story generation.
    Returns: (allowed, message, cost_breakdown)
    """
    # Calculate estimated costs
    video_cost_per_scene = get_wavespeed_cost(video_resolution, video_duration)
    audio_cost_per_scene = get_voice_clone_cost() if use_voice_clone else 0.0
    total_estimated_cost = (video_cost_per_scene + audio_cost_per_scene) * num_scenes
    # Check limits
    limits = pricing_service.get_user_limits(user_id)
    current_usage = pricing_service.get_current_usage(user_id)
    # Validation logic...
    return (allowed, message, cost_breakdown)
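The elided validation step could look roughly like the sketch below, which compares an already-computed estimate against the remaining budget and the video limit. The function `preflight_check`, its parameters, and the messages are assumptions for illustration; the actual limit fields returned by the pricing service are not specified in this plan.

```python
from decimal import Decimal
from typing import Any, Dict, Tuple


def preflight_check(
    estimated_cost: Decimal,
    budget_remaining: Decimal,
    videos_requested: int,
    videos_used: int,
    video_limit: int,
) -> Tuple[bool, str, Dict[str, Any]]:
    """Return (allowed, message, cost_breakdown) without calling any provider."""
    breakdown = {
        "estimated_cost": estimated_cost,
        "budget_remaining": budget_remaining,
        "videos_after": videos_used + videos_requested,
    }
    # Budget check first: it is the most common reason to block a request.
    if estimated_cost > budget_remaining:
        return (False, "Estimated cost exceeds remaining monthly budget", breakdown)
    if videos_used + videos_requested > video_limit:
        return (False, "Video generation limit would be exceeded", breakdown)
    return (True, "OK", breakdown)
```

Returning the breakdown even on rejection lets the frontend show the user exactly which number caused the block.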
Cost Estimation Display
Frontend Implementation:
- Real-time cost calculator in Story Setup
- Per-scene cost breakdown
- Total story cost estimate
- Monthly budget remaining
- Warning if approaching limits
UI Example:
Video Generation Cost Estimate:
├─ Resolution: 720p ($0.10/second)
├─ Duration: 5 seconds per scene
├─ Scenes: 10
└─ Total: $5.00
Audio Generation Cost Estimate:
├─ Provider: Voice Clone ($0.02/minute)
├─ Average: 30 seconds per scene
├─ Scenes: 10
└─ Total: $0.10
Total Estimated Cost: $5.10
Monthly Budget Remaining: $44.90
Usage Tracking
Enhanced Tracking:
- Track video generation per scene
- Track audio generation per scene
- Track total story cost
- Alert users approaching limits
- Provide cost breakdown in analytics
Pricing Integration
WaveSpeed WAN 2.5 Pricing
Add to pricing_service.py:
# WaveSpeed WAN 2.5 Text-to-Video
{
"provider": APIProvider.VIDEO, # Or new WAVESPEED provider
"model_name": "wan-2.5-480p",
"cost_per_second": 0.05,
"description": "WaveSpeed WAN 2.5 Text-to-Video (480p)"
},
{
"provider": APIProvider.VIDEO,
"model_name": "wan-2.5-720p",
"cost_per_second": 0.10,
"description": "WaveSpeed WAN 2.5 Text-to-Video (720p)"
},
{
"provider": APIProvider.VIDEO,
"model_name": "wan-2.5-1080p",
"cost_per_second": 0.15,
"description": "WaveSpeed WAN 2.5 Text-to-Video (1080p)"
}
Minimax Voice Clone Pricing
Add to pricing_service.py:
# Minimax Voice Clone
{
"provider": APIProvider.AUDIO, # New provider type
"model_name": "minimax-voice-clone-train",
"cost_per_request": 0.75, # One-time training cost
"description": "Minimax Voice Clone Training"
},
{
"provider": APIProvider.AUDIO,
"model_name": "minimax-voice-clone-generate",
"cost_per_minute": 0.02, # Per minute of generated audio
"description": "Minimax Voice Clone Generation"
}
Subscription Tier Limits
Update subscription limits:
- Free: 3 stories/month, 480p only, gTTS only
- Basic: 10 stories/month, up to 720p, voice clone available
- Pro: 50 stories/month, up to 1080p, voice clone included
- Enterprise: Unlimited, all features
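The tier limits above could be expressed as a lookup table plus a check, along these lines. The names `TIER_LIMITS` and `is_request_allowed` are hypothetical, and treating Enterprise as capped at 1080p is an assumption (1080p is simply the highest resolution this plan mentions).

```python
from typing import Dict, Optional, TypedDict


class TierLimits(TypedDict):
    stories_per_month: Optional[int]  # None means unlimited
    max_resolution: int               # vertical resolution in pixels
    voice_clone: bool


TIER_LIMITS: Dict[str, TierLimits] = {
    "free": {"stories_per_month": 3, "max_resolution": 480, "voice_clone": False},
    "basic": {"stories_per_month": 10, "max_resolution": 720, "voice_clone": True},
    "pro": {"stories_per_month": 50, "max_resolution": 1080, "voice_clone": True},
    # Assumption: "all features" is read as the highest listed resolution.
    "enterprise": {"stories_per_month": None, "max_resolution": 1080, "voice_clone": True},
}


def is_request_allowed(
    tier: str, resolution: int, use_voice_clone: bool, stories_this_month: int
) -> bool:
    """Check a story request against the tier's monthly, resolution, and voice limits."""
    limits = TIER_LIMITS[tier]
    cap = limits["stories_per_month"]
    if cap is not None and stories_this_month >= cap:
        return False
    if resolution > limits["max_resolution"]:
        return False
    if use_voice_clone and not limits["voice_clone"]:
        return False
    return True
```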
Technical Architecture
Backend Services
backend/services/
├── wavespeed/
│ ├── __init__.py
│ ├── client.py # WaveSpeed API client
│ ├── wan25_video.py # WAN 2.5 video generation
│ └── models.py # Request/response models
├── minimax/
│ ├── __init__.py
│ ├── client.py # Minimax API client
│ ├── voice_clone.py # Voice cloning service
│ └── models.py
└── story_writer/
├── audio_generation_service.py # Updated with voice clone
└── video_generation_service.py # Updated with WaveSpeed
Frontend Components
frontend/src/components/StoryWriter/
├── Phases/StorySetup/
│ └── GenerationSettingsSection.tsx # Enhanced with new settings
├── components/
│ ├── HdVideoSection.tsx # Updated for WaveSpeed
│ ├── VoiceTrainingSection.tsx # NEW: Voice training UI
│ └── CostEstimationDisplay.tsx # NEW: Cost calculator
└── hooks/
└── useStoryGenerationCost.ts # NEW: Cost calculation hook
Error Handling & User Experience
Error Scenarios
- WaveSpeed API Failure:
  - Retry with exponential backoff (3 attempts)
  - Fallback to HuggingFace if available
  - Clear error message with cost refund notice
- Voice Clone Training Failure:
  - Provide specific error (audio quality, length, format)
  - Suggest improvements
  - Allow retry with different audio
- Cost Limit Exceeded:
  - Pre-flight validation prevents this
  - Show upgrade prompt
  - Suggest reducing scenes/resolution
- Audio/Video Mismatch:
  - Validate audio length matches video duration
  - Auto-trim or extend audio
  - Warn user before generation
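The WaveSpeed failure handling above (three attempts with exponential backoff, then a HuggingFace fallback) might be sketched as follows. `generate_with_fallback` and its parameters are hypothetical; the provider calls are passed in as plain callables so the sketch does not assume any particular client API, and the 1-second base delay is an assumption.

```python
import time
from typing import Callable, Optional


def generate_with_fallback(
    primary: Callable[[], bytes],
    fallback: Optional[Callable[[], bytes]] = None,
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Try the primary provider with backoff, then the fallback, then fail."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return primary()
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
            # Wait 1s, 2s, ... between attempts; no wait after the final one.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    if fallback is not None:
        return fallback()
    raise RuntimeError("Video generation failed on all providers") from last_error
```

Injecting `sleep` keeps the function testable without real delays; production code would leave the default.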
User Feedback
- Progress indicators for all operations
- Clear cost breakdowns
- Quality previews before final generation
- Regeneration options with cost tracking
- Usage analytics dashboard
Testing Plan
Unit Tests
- WaveSpeed API client
- Voice clone service
- Cost calculation
- Pre-flight validation
Integration Tests
- End-to-end story generation
- Audio + video synchronization
- Error handling and fallbacks
- Subscription limit enforcement
User Acceptance Tests
- Story generation workflow
- Voice training process
- Cost estimation accuracy
- Error recovery
Success Metrics
Technical Metrics
- Video generation success rate >95%
- Audio generation success rate >98%
- Average generation time per scene <30s
- API error rate <2%
Business Metrics
- User satisfaction with video quality
- Cost per story (target: <$5 for 10-scene story)
- Voice clone adoption rate
- Story completion rate
User Experience Metrics
- Time to generate story
- Error recovery time
- User understanding of costs
- Feature discovery rate
Provider Management Strategy
Always-Available Options
- gTTS: Always available, always free, works even when credits exhausted
- HuggingFace: Preserved as fallback option, works when WaveSpeed unavailable
Automatic Provider Routing
- Primary: WaveSpeed WAN 2.5 (when credits available)
- Fallback: HuggingFace (when WaveSpeed unavailable or credits exhausted)
- Audio Fallback: gTTS (always available, always free)
User Experience
- Users never see provider names
- System automatically selects best available option
- Seamless fallback when credits exhausted
- Clear notifications when fallback occurs
- No user intervention required
No Deprecation
- HuggingFace: Kept as permanent fallback option
- gTTS: Kept as permanent free option
- All existing functionality preserved
- New features are additions, not replacements
Next Steps
- Week 1: Set up WaveSpeed API access and credentials
- Week 1: Implement provider-agnostic routing system
- Week 2: Integrate into Story Writer with quality-based UI
- Week 3: Implement voice cloning with simple "AI Clone" vs "Default" choice
- Week 4: Add voice training UI (only if AI Clone selected)
- Week 5: Add "Animate Scene" hover option in Outline
- Week 6: Add "Animate Story with VoiceOver" button in Writing
- Week 7-8: Testing, optimization, and polish
Key Design Principles
- Provider Abstraction: Users never see provider names - only quality/voice options
- Preserve Existing: gTTS and HuggingFace remain available as fallbacks
- Cost Transparency: All buttons show costs in tooltips
- Automatic Fallback: System automatically uses free options when credits exhausted
- Per-Scene Only: Outline phase only allows per-scene generation (no bulk)
- User-Friendly: Simple choices like "Standard Quality" not "WaveSpeed 480p"
Risk Mitigation
| Risk | Mitigation |
|---|---|
| WaveSpeed API changes | Version pinning, abstraction layer |
| Cost overruns | Strict pre-flight validation |
| Voice quality issues | Quality checks, fallback options |
| User confusion | Clear UI, tooltips, documentation |
| Integration complexity | Phased rollout, extensive testing |
Document Version: 1.0
Last Updated: January 2025
Priority: HIGH - Immediate Implementation