4.2 KiB
4.2 KiB
Text-to-Video Implementation Plan - Phase 1
Goal
Implement WaveSpeed text-to-video support in the unified ai_video_generate() entry point with modular, maintainable code structure.
Proposed Architecture
Modular Structure (Following Image Generation Pattern)
backend/services/llm_providers/
├── main_video_generation.py # Unified entry point (already exists)
└── video_generation/ # NEW: Modular video generation services
├── __init__.py
├── base.py # Base classes/interfaces
└── wavespeed_provider.py # WaveSpeed text-to-video models
├── HunyuanVideoService # HunyuanVideo-1.5
├── LTX2ProService # LTX-2 Pro
├── LTX2FastService # LTX-2 Fast
└── LTX2RetakeService # LTX-2 Retake
Implementation Strategy
Step 1: Create Base Structure
- Create
video_generation/directory - Create
base.pywith base classes/interfaces - Create
wavespeed_provider.pywith service classes
Step 2: Implement First Model (HunyuanVideo-1.5)
- Create
HunyuanVideoServiceclass - Implement model-specific logic
- Add progress callback support
- Return metadata dict
Step 3: Integrate into Unified Entry Point
- Add
_generate_text_to_video_wavespeed()function - Route to appropriate service based on model
- Handle async/sync properly
Step 4: Test and Validate
- Test with one model
- Verify all features work
- Ensure backward compatibility
Step 5: Add Remaining Models
- Follow same pattern for LTX-2 Pro, Fast, Retake
- Reuse common logic
- Model-specific differences only
Model Selection
Recommended Starting Model: HunyuanVideo-1.5
- Most commonly used
- Good documentation availability
- Standard parameters
Alternative: Any model you prefer - we'll follow the same pattern.
Service Class Structure
class HunyuanVideoService:
"""Service for HunyuanVideo-1.5 text-to-video generation."""
MODEL_PATH = "wavespeed-ai/hunyuan-video-1.5/text-to-video"
MODEL_NAME = "hunyuan-video-1.5"
def __init__(self, client: Optional[WaveSpeedClient] = None):
self.client = client or WaveSpeedClient()
async def generate_video(
self,
prompt: str,
duration: int = 5,
resolution: str = "720p",
negative_prompt: Optional[str] = None,
seed: Optional[int] = None,
audio_base64: Optional[str] = None,
enable_prompt_expansion: bool = True,
progress_callback: Optional[Callable[[float, str], None]] = None,
**kwargs
) -> Dict[str, Any]:
"""
Generate video using HunyuanVideo-1.5.
Returns:
Dict with video_bytes, prompt, duration, model_name, cost, etc.
"""
# 1. Validate inputs
# 2. Build payload
# 3. Submit to WaveSpeed
# 4. Poll with progress callbacks
# 5. Download video
# 6. Return metadata dict
Integration Points
Unified Entry Point
# In main_video_generation.py
async def _generate_text_to_video_wavespeed(
prompt: str,
model: str = "hunyuan-video-1.5",
progress_callback: Optional[Callable[[float, str], None]] = None,
**kwargs
) -> Dict[str, Any]:
"""Route to appropriate WaveSpeed text-to-video service."""
from .video_generation.wavespeed_provider import get_wavespeed_text_to_video_service
service = get_wavespeed_text_to_video_service(model)
return await service.generate_video(
prompt=prompt,
progress_callback=progress_callback,
**kwargs
)
Next Steps
- Wait for Model Documentation - You'll provide documentation for the first model
- Create Base Structure - Set up directory and base classes
- Implement First Model - HunyuanVideo-1.5 (or your chosen model)
- Test - Verify functionality
- Add Remaining Models - Follow same pattern
Questions
- Which model should we start with? (Recommended: HunyuanVideo-1.5)
- Do you have the model documentation ready? (API endpoints, parameters, response format)
- Any specific requirements for the first model? (Parameters, features, etc.)