Files
ALwrity/docs/Video Studio/IMAGE_TO_VIDEO_REQUIREMENTS_ANALYSIS.md

13 KiB

Image-to-Video Unified Generation - Requirements Analysis

Overview

This document analyzes all image-to-video operations across Story Writer, Podcast Maker, Video Studio, and Image Studio to ensure the unified ai_video_generate() implementation supports all existing features and requirements.

Current Image-to-Video Operations

1. Standard Image-to-Video (WAN 2.5 / Kandinsky 5 Pro)

Used By:

  • Image Studio Transform Service
  • Video Studio Service

Current Status: Uses unified ai_video_generate() with operation_type="image-to-video"

Features:

  • Input: Image (bytes or base64) + text prompt
  • Optional: Audio file (for synchronization), negative prompt, seed
  • Duration: 5 or 10 seconds
  • Resolution: 480p, 720p, 1080p
  • Models: alibaba/wan-2.5/image-to-video, wavespeed/kandinsky5-pro/image-to-video
  • Prompt expansion: Optional (enabled by default)

Requirements:

  • Pre-flight validation (subscription limits)
  • Usage tracking
  • File saving to disk
  • Asset library integration
  • Progress callbacks (for async operations)
  • Metadata return (cost, duration, resolution, dimensions)

Implementation Status: COMPLETE


2. Kling Animation (Scene Animation) ⚠️

Used By:

  • Story Writer (/api/story/animate-scene-preview)

Current Status: Uses separate animate_scene_image() function (NOT using unified entry point)

Features:

  • Input: Image (bytes) + scene data + story context
  • Special: Uses LLM to generate animation prompt from scene data
  • Duration: 5 or 10 seconds
  • Guidance scale: 0.0-1.0 (default: 0.5)
  • Optional: Negative prompt
  • Model: kwaivgi/kling-v2.5-turbo-std/image-to-video
  • Resume support: Yes (via resume_scene_animation())

Key Differences from Standard:

  1. LLM Prompt Generation: Automatically generates animation prompt using LLM from scene data
  2. Different Model: Uses Kling v2.5 Turbo Std (not WAN 2.5)
  3. Guidance Scale: Has guidance_scale parameter (WAN 2.5 doesn't)
  4. Resume Support: Can resume failed/timeout operations

Requirements:

  • Pre-flight validation (subscription limits)
  • Usage tracking
  • File saving to disk
  • Asset library integration
  • Progress callbacks (currently synchronous)
  • Metadata return (cost, duration, prompt, prediction_id)

Current Implementation:

# backend/services/wavespeed/kling_animation.py
def animate_scene_image(
    image_bytes: bytes,
    scene_data: Dict[str, Any],
    story_context: Dict[str, Any],
    user_id: str,
    duration: int = 5,
    guidance_scale: float = 0.5,
    negative_prompt: Optional[str] = None,
) -> Dict[str, Any]:
    # 1. Generate animation prompt using LLM
    animation_prompt = generate_animation_prompt(scene_data, story_context, user_id)
    
    # 2. Submit to WaveSpeed Kling model
    prediction_id = client.submit_image_to_video(KLING_MODEL_PATH, payload)
    
    # 3. Poll for completion
    result = client.poll_until_complete(prediction_id, timeout_seconds=240)
    
    # 4. Download video and return
    return {video_bytes, prompt, duration, model_name, cost, provider, prediction_id}

Decision Needed:

  • Option A: Keep separate (recommended) - Different model, LLM prompt generation, guidance_scale
  • Option B: Integrate into unified entry point - Add model="kling-v2.5-turbo-std" support

Recommendation: Keep separate for now, but ensure it follows same patterns (pre-flight, usage tracking, file saving).


3. InfiniteTalk (Talking Avatar with Audio) ⚠️

Used By:

  • Story Writer (/api/story/animate-scene-voiceover)
  • Podcast Maker (/api/podcast/render/video)
  • Image Studio Transform Studio (Talking Avatar feature)

Current Status: Uses separate animate_scene_with_voiceover() function (NOT using unified entry point)

Features:

  • Input: Image (bytes) + Audio (bytes) - BOTH REQUIRED
  • Optional: Prompt (for expression/style), mask_image (for animatable regions), seed
  • Resolution: 480p or 720p only
  • Model: wavespeed-ai/infinitetalk
  • Special: Audio-driven lip-sync animation (different from standard image-to-video)

Key Differences from Standard:

  1. Audio Required: Must have audio file (for lip-sync)
  2. Different Model: Uses InfiniteTalk (not WAN 2.5)
  3. Limited Resolution: Only 480p or 720p (no 1080p)
  4. Different Use Case: Talking avatar (person speaking) vs. scene animation
  5. Different Pricing: $0.03/s (480p) or $0.06/s (720p) vs. WAN 2.5 pricing

Requirements:

  • Pre-flight validation (subscription limits)
  • Usage tracking
  • File saving to disk
  • Asset library integration
  • Progress callbacks (for async operations)
  • Metadata return (cost, duration, prompt, prediction_id)

Current Implementation:

# backend/services/wavespeed/infinitetalk.py
def animate_scene_with_voiceover(
    image_bytes: bytes,
    audio_bytes: bytes,  # REQUIRED
    scene_data: Dict[str, Any],
    story_context: Dict[str, Any],
    user_id: str,
    resolution: str = "720p",
    prompt_override: Optional[str] = None,
    mask_image_bytes: Optional[bytes] = None,
    seed: Optional[int] = -1,
) -> Dict[str, Any]:
    # 1. Generate prompt (or use override)
    animation_prompt = prompt_override or _generate_simple_infinitetalk_prompt(...)
    
    # 2. Submit to WaveSpeed InfiniteTalk
    prediction_id = client.submit_image_to_video(INFINITALK_MODEL_PATH, payload)
    
    # 3. Poll for completion (up to 10 minutes)
    result = client.poll_until_complete(prediction_id, timeout_seconds=600)
    
    # 4. Download video and return
    return {video_bytes, prompt, duration, model_name, cost, provider, prediction_id}

Decision Needed:

  • Option A: Keep separate (recommended) - Different model, requires audio, different use case
  • Option B: Integrate into unified entry point - Add operation_type="talking-avatar" or model="infinitetalk" support

Recommendation: Keep separate for now, but ensure it follows same patterns (pre-flight, usage tracking, file saving).


Unified Entry Point Current Support

Supported Operations

Standard Image-to-Video:

  • WAN 2.5 (alibaba/wan-2.5/image-to-video)
  • Kandinsky 5 Pro (wavespeed/kandinsky5-pro/image-to-video)
  • Pre-flight validation
  • Usage tracking
  • Progress callbacks
  • Metadata return
  • File saving (handled by calling services)
  • Asset library integration (handled by calling services)

Not Supported (Keep Separate)

Kling Animation:

  • Different model (kwaivgi/kling-v2.5-turbo-std/image-to-video)
  • LLM prompt generation requirement
  • Guidance scale parameter
  • Resume support

InfiniteTalk:

  • Different model (wavespeed-ai/infinitetalk)
  • Requires audio (not optional)
  • Different use case (talking avatar vs. scene animation)
  • Limited resolution (480p/720p only)

Requirements Checklist

Core Requirements (All Operations)

Requirement Standard (WAN 2.5) Kling Animation InfiniteTalk
Pre-flight validation
Usage tracking
File saving
Asset library
Progress callbacks (sync)
Metadata return
Error handling
Resume support

Feature-Specific Requirements

Feature Standard (WAN 2.5) Kling Animation InfiniteTalk
Image input
Text prompt (LLM-generated) (optional)
Audio input (optional) (required)
Duration control (5/10s) (5/10s) (audio-driven)
Resolution options (480p/720p/1080p) (model default) (480p/720p)
Negative prompt
Seed control
Guidance scale
Mask image
Prompt expansion

Gaps and Recommendations

No Gaps Found for Standard Image-to-Video

The unified ai_video_generate() implementation fully supports all requirements for:

  • Image Studio Transform Service
  • Video Studio Service

Both services are correctly using the unified entry point and all features work as expected.

Reasoning:

  1. Different model with different parameters (guidance_scale)
  2. Requires LLM prompt generation (adds complexity)
  3. Has resume support (not in unified entry point)
  4. Different use case (scene animation vs. general image-to-video)

Action: Ensure it follows same patterns:

  • Pre-flight validation (already done)
  • Usage tracking (already done)
  • File saving (already done)
  • Asset library (already done)
  • ⚠️ Consider adding progress callbacks for async operations

Reasoning:

  1. Different model with different requirements (audio required)
  2. Different use case (talking avatar vs. scene animation)
  3. Different pricing model
  4. Limited resolution options

Action: Ensure it follows same patterns:

  • Pre-flight validation (already done)
  • Usage tracking (already done)
  • File saving (already done)
  • Asset library (already done)
  • Progress callbacks (already done)

Verification Checklist

Image Studio

  • Uses unified ai_video_generate() for image-to-video
  • Pre-flight validation works
  • Usage tracking works
  • File saving works
  • Asset library integration works
  • All parameters supported (prompt, duration, resolution, audio, negative_prompt, seed)

Video Studio

  • Uses unified ai_video_generate() for image-to-video
  • Pre-flight validation works
  • Usage tracking works
  • File saving works
  • Asset library integration works
  • All parameters supported

Story Writer ⚠️

  • Standard image-to-video: Uses unified entry point (via hd_video.py - but that's text-to-video)
  • Kling animation: Uses separate function (keep separate)
  • InfiniteTalk: Uses separate function (keep separate)
  • All operations have pre-flight validation
  • All operations have usage tracking
  • All operations save files
  • All operations save to asset library

Podcast Maker ⚠️

  • InfiniteTalk: Uses separate function (keep separate)
  • Pre-flight validation works
  • Usage tracking works
  • File saving works
  • Asset library integration (via podcast service)
  • Progress callbacks work (async polling)

Conclusion

Standard Image-to-Video is Complete

The unified ai_video_generate() implementation fully supports all requirements for standard image-to-video operations used by:

  • Image Studio
  • Video Studio

⚠️ Specialized Operations Should Stay Separate

Kling Animation and InfiniteTalk are specialized operations with:

  • Different models
  • Different requirements (audio for InfiniteTalk, LLM prompts for Kling)
  • Different use cases (talking avatar vs. scene animation)

Recommendation: Keep these separate but ensure they follow the same patterns:

  • Pre-flight validation
  • Usage tracking
  • File saving
  • Asset library integration
  • Progress callbacks (where applicable)

Next Steps

  1. Confirmed: Standard image-to-video unified generation is complete
  2. Confirmed: All existing features and requirements are supported
  3. ⚠️ Note: Kling and InfiniteTalk are intentionally separate (different models/use cases)
  4. Ready: Proceed with Phase 1 (text-to-video implementation)

Testing Recommendations

Before proceeding with text-to-video, verify:

  1. Image Studio:

    • Image-to-video generation works
    • All parameters work (prompt, duration, resolution, audio, negative_prompt, seed)
    • File saving works
    • Asset library integration works
    • Pre-flight validation blocks exceeded limits
    • Usage tracking works
  2. Video Studio:

    • Image-to-video generation works
    • All parameters work
    • File saving works
    • Asset library integration works
    • Pre-flight validation works
    • Usage tracking works
  3. Story Writer (Kling & InfiniteTalk):

    • Kling animation works (separate function)
    • InfiniteTalk works (separate function)
    • Both have pre-flight validation
    • Both have usage tracking
    • Both save files and assets
  4. Podcast Maker (InfiniteTalk):

    • InfiniteTalk works (separate function)
    • Pre-flight validation works
    • Usage tracking works
    • File saving works
    • Async polling works