Text-to-Video Implementation Plan - Phase 1

Goal

Implement WaveSpeed text-to-video support in the unified ai_video_generate() entry point with a modular, maintainable code structure.

Proposed Architecture

Modular Structure (Following Image Generation Pattern)

backend/services/llm_providers/
├── main_video_generation.py          # Unified entry point (already exists)
└── video_generation/                 # NEW: Modular video generation services
    ├── __init__.py
    ├── base.py                       # Base classes/interfaces
    └── wavespeed_provider.py         # WaveSpeed text-to-video models
        ├── HunyuanVideoService       # HunyuanVideo-1.5
        ├── LTX2ProService            # LTX-2 Pro
        ├── LTX2FastService           # LTX-2 Fast
        └── LTX2RetakeService         # LTX-2 Retake

Implementation Strategy

Step 1: Create Base Structure

Create video_generation/ directory
Create base.py with base classes/interfaces
Create wavespeed_provider.py with service classes
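
As a starting point for base.py, the shared interface could look like the sketch below. BaseTextToVideoService, ProgressCallback, and report() are hypothetical names chosen for illustration; only the generate_video() signature and the (float, str) callback shape come from this plan.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, Optional

# Callback shape used throughout this plan: (percent_complete, status_message).
ProgressCallback = Callable[[float, str], None]


class BaseTextToVideoService(ABC):
    """Hypothetical common interface for all text-to-video services."""

    # Subclasses set these to the provider-specific values.
    MODEL_PATH: str = ""
    MODEL_NAME: str = ""

    @abstractmethod
    async def generate_video(
        self,
        prompt: str,
        progress_callback: Optional[ProgressCallback] = None,
        **kwargs: Any,
    ) -> Dict[str, Any]:
        """Generate a video and return a metadata dict."""

    @staticmethod
    def report(
        callback: Optional[ProgressCallback], percent: float, message: str
    ) -> None:
        """Invoke the progress callback only if the caller supplied one."""
        if callback is not None:
            callback(percent, message)
```

Each model service would subclass this and implement generate_video(), so the entry point can treat all models the same way.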

Step 2: Implement First Model (HunyuanVideo-1.5)

Create HunyuanVideoService class
Implement model-specific logic
Add progress callback support
Return metadata dict
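
The progress callback support could be built around a small polling helper like the one sketched here. The helper name, the "status"/"progress"/"error" keys, and the "completed"/"failed" values are assumptions for illustration; the real response format has to come from the model documentation.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Dict, Optional

ProgressCallback = Callable[[float, str], None]


async def poll_until_complete(
    fetch_status: Callable[[], Awaitable[Dict[str, Any]]],
    progress_callback: Optional[ProgressCallback] = None,
    interval_seconds: float = 2.0,
    timeout_seconds: float = 600.0,
) -> Dict[str, Any]:
    """Call fetch_status() repeatedly until the job completes, fails, or times out.

    The response keys and status values used here are assumed, not taken
    from the WaveSpeed documentation.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        result = await fetch_status()
        status = result.get("status")
        if status == "completed":
            if progress_callback:
                progress_callback(100.0, "Video generation complete")
            return result
        if status == "failed":
            error = result.get("error", "unknown error")
            raise RuntimeError(f"Video generation failed: {error}")
        # Still running: forward whatever progress the provider reported.
        if progress_callback:
            progress_callback(float(result.get("progress", 0.0)), f"Status: {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError("Timed out waiting for video generation")
        await asyncio.sleep(interval_seconds)
```

Passing fetch_status in as a callable keeps the helper independent of the client, so it can be exercised in tests with a fake status source.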

Step 3: Integrate into Unified Entry Point

Add _generate_text_to_video_wavespeed() function
Route to appropriate service based on model
Handle the async/sync boundary correctly (the service methods are async)
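
If the entry point has to be callable from synchronous code, one option is a small bridge like the sketch below. run_sync is a hypothetical helper, not existing project code, and whether ai_video_generate() is itself sync or async is not stated in this plan.

```python
import asyncio
import threading
from typing import Any, Coroutine, Dict, TypeVar

T = TypeVar("T")


def run_sync(coro: Coroutine[Any, Any, T]) -> T:
    """Run a coroutine to completion from synchronous code."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No event loop is running in this thread, so asyncio.run() is safe.
        return asyncio.run(coro)

    # A loop is already running in this thread and asyncio.run() would raise.
    # Run the coroutine on its own loop in a worker thread instead. Note that
    # join() blocks the calling thread until the coroutine finishes.
    outcome: Dict[str, Any] = {}

    def _worker() -> None:
        try:
            outcome["value"] = asyncio.run(coro)
        except BaseException as exc:  # re-raised in the calling thread below
            outcome["error"] = exc

    thread = threading.Thread(target=_worker)
    thread.start()
    thread.join()
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```

Callers that are already async should simply await the service instead; the bridge is only for sync call sites.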

Step 4: Test and Validate

Test with one model
Verify all features work (progress callbacks, returned metadata dict)
Ensure backward compatibility

Step 5: Add Remaining Models

Follow the same pattern for LTX-2 Pro, LTX-2 Fast, and LTX-2 Retake
Reuse common logic
Implement only the model-specific differences

Model Selection

Recommended Starting Model: HunyuanVideo-1.5

Most commonly used
Good documentation availability
Standard parameters

Alternative: Any model you prefer - we'll follow the same pattern.

Service Class Structure

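from typing import Any, Callable, Dict, Optional

# WaveSpeedClient is assumed to be the project's existing WaveSpeed API client;
# import it from its module in this codebase.


class HunyuanVideoService: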
    """Service for HunyuanVideo-1.5 text-to-video generation."""
    
    MODEL_PATH = "wavespeed-ai/hunyuan-video-1.5/text-to-video"
    MODEL_NAME = "hunyuan-video-1.5"
    
    def __init__(self, client: Optional[WaveSpeedClient] = None):
        self.client = client or WaveSpeedClient()
    
    async def generate_video(
        self,
        prompt: str,
        duration: int = 5,
        resolution: str = "720p",
        negative_prompt: Optional[str] = None,
        seed: Optional[int] = None,
        audio_base64: Optional[str] = None,
        enable_prompt_expansion: bool = True,
        progress_callback: Optional[Callable[[float, str], None]] = None,
        **kwargs
    ) -> Dict[str, Any]:
        """
        Generate video using HunyuanVideo-1.5.
        
        Returns:
            Dict with video_bytes, prompt, duration, model_name, cost, etc.
        """
        # 1. Validate inputs
        # 2. Build payload
        # 3. Submit to WaveSpeed
        # 4. Poll with progress callbacks
        # 5. Download video
        # 6. Return metadata dict
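
Steps 1 and 2 above could be factored into one helper, sketched here. The allowed resolutions and durations and the payload key names are assumptions for illustration only; the real limits and field names must be taken from the HunyuanVideo-1.5 documentation once it is available.

```python
from typing import Any, Dict, Optional

# Assumed limits, for illustration only (not from the model documentation).
ALLOWED_RESOLUTIONS = {"480p", "720p"}
ALLOWED_DURATIONS = {5, 8}


def build_payload(
    prompt: str,
    duration: int = 5,
    resolution: str = "720p",
    negative_prompt: Optional[str] = None,
    seed: Optional[int] = None,
) -> Dict[str, Any]:
    """Validate inputs and build the request body (key names are assumed)."""
    if not prompt or not prompt.strip():
        raise ValueError("prompt must be a non-empty string")
    if duration not in ALLOWED_DURATIONS:
        raise ValueError(f"duration must be one of {sorted(ALLOWED_DURATIONS)}")
    if resolution not in ALLOWED_RESOLUTIONS:
        raise ValueError(f"resolution must be one of {sorted(ALLOWED_RESOLUTIONS)}")

    payload: Dict[str, Any] = {
        "prompt": prompt.strip(),
        "duration": duration,
        "resolution": resolution,
    }
    # Optional fields are sent only when the caller provided them.
    if negative_prompt:
        payload["negative_prompt"] = negative_prompt
    if seed is not None:
        payload["seed"] = seed
    return payload
```

Failing fast on bad inputs here avoids submitting a request that the provider would reject anyway.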

Integration Points

Unified Entry Point

# In main_video_generation.py
from typing import Any, Callable, Dict, Optional

async def _generate_text_to_video_wavespeed(
    prompt: str,
    model: str = "hunyuan-video-1.5",
    progress_callback: Optional[Callable[[float, str], None]] = None,
    **kwargs
) -> Dict[str, Any]:
    """Route to appropriate WaveSpeed text-to-video service."""
    from .video_generation.wavespeed_provider import get_wavespeed_text_to_video_service
    
    service = get_wavespeed_text_to_video_service(model)
    return await service.generate_video(
        prompt=prompt,
        progress_callback=progress_callback,
        **kwargs
    )
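
The get_wavespeed_text_to_video_service() factory imported above is not defined elsewhere in this plan. A minimal sketch, assuming a simple name-to-class registry in wavespeed_provider.py, might look like this; the stub class stands in for the real service so the example runs on its own.

```python
from typing import Dict, Type


class HunyuanVideoService:
    """Stand-in for the real service class described in this plan."""

    MODEL_NAME = "hunyuan-video-1.5"


# Registry of supported models. The LTX-2 services would be registered here
# as they are implemented.
_SERVICES: Dict[str, Type] = {
    HunyuanVideoService.MODEL_NAME: HunyuanVideoService,
}


def get_wavespeed_text_to_video_service(model: str):
    """Return a service instance for the given model name."""
    try:
        service_cls = _SERVICES[model]
    except KeyError:
        supported = ", ".join(sorted(_SERVICES))
        raise ValueError(
            f"Unsupported WaveSpeed text-to-video model: {model!r}. "
            f"Supported: {supported}"
        ) from None
    return service_cls()
```

With a registry, adding a model in Step 5 becomes a one-line change, and unknown model names fail with a clear message instead of an AttributeError further down.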

Next Steps

Wait for Model Documentation - You'll provide documentation for the first model
Create Base Structure - Set up directory and base classes
Implement First Model - HunyuanVideo-1.5 (or your chosen model)
Test - Verify functionality
Add Remaining Models - Follow same pattern

Questions

Which model should we start with? (Recommended: HunyuanVideo-1.5)
Do you have the model documentation ready? (API endpoints, parameters, response format)
Any specific requirements for the first model? (Parameters, features, etc.)
