Files
ALwrity/docs/IMAGE_TO_VIDEO_REQUIREMENTS_ANALYSIS.md
ajaysi b134e9dc7e Added video studio router and endpoints. Added research router and endpoints. Added youtube router and endpoints. Added onboarding utils router and endpoints. Added onboarding utils service. Added onboarding utils models. Added onboarding utils routes. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils. Added onboarding utils utils.
2026-01-01 17:56:25 +05:30

13 KiB

Image-to-Video Unified Generation - Requirements Analysis

Overview

This document analyzes all image-to-video operations across Story Writer, Podcast Maker, Video Studio, and Image Studio to ensure the unified ai_video_generate() implementation supports all existing features and requirements.

Current Image-to-Video Operations

1. Standard Image-to-Video (WAN 2.5 / Kandinsky 5 Pro)

Used By:

  • Image Studio Transform Service
  • Video Studio Service

Current Status: Uses unified ai_video_generate() with operation_type="image-to-video"

Features:

  • Input: Image (bytes or base64) + text prompt
  • Optional: Audio file (for synchronization), negative prompt, seed
  • Duration: 5 or 10 seconds
  • Resolution: 480p, 720p, 1080p
  • Models: alibaba/wan-2.5/image-to-video, wavespeed/kandinsky5-pro/image-to-video
  • Prompt expansion: Optional (enabled by default)

Requirements:

  • Pre-flight validation (subscription limits)
  • Usage tracking
  • File saving to disk
  • Asset library integration
  • Progress callbacks (for async operations)
  • Metadata return (cost, duration, resolution, dimensions)

Implementation Status: COMPLETE


2. Kling Animation (Scene Animation) ⚠️

Used By:

  • Story Writer (/api/story/animate-scene-preview)

Current Status: Uses separate animate_scene_image() function (NOT using unified entry point)

Features:

  • Input: Image (bytes) + scene data + story context
  • Special: Uses LLM to generate animation prompt from scene data
  • Duration: 5 or 10 seconds
  • Guidance scale: 0.0-1.0 (default: 0.5)
  • Optional: Negative prompt
  • Model: kwaivgi/kling-v2.5-turbo-std/image-to-video
  • Resume support: Yes (via resume_scene_animation())

Key Differences from Standard:

  1. LLM Prompt Generation: Automatically generates animation prompt using LLM from scene data
  2. Different Model: Uses Kling v2.5 Turbo Std (not WAN 2.5)
  3. Guidance Scale: Has guidance_scale parameter (WAN 2.5 doesn't)
  4. Resume Support: Can resume failed/timeout operations

Requirements:

  • Pre-flight validation (subscription limits)
  • Usage tracking
  • File saving to disk
  • Asset library integration
  • Progress callbacks (currently synchronous)
  • Metadata return (cost, duration, prompt, prediction_id)

Current Implementation:

# backend/services/wavespeed/kling_animation.py
def animate_scene_image(
    image_bytes: bytes,
    scene_data: Dict[str, Any],
    story_context: Dict[str, Any],
    user_id: str,
    duration: int = 5,
    guidance_scale: float = 0.5,
    negative_prompt: Optional[str] = None,
) -> Dict[str, Any]:
    # 1. Generate animation prompt using LLM
    animation_prompt = generate_animation_prompt(scene_data, story_context, user_id)
    
    # 2. Submit to WaveSpeed Kling model
    prediction_id = client.submit_image_to_video(KLING_MODEL_PATH, payload)
    
    # 3. Poll for completion
    result = client.poll_until_complete(prediction_id, timeout_seconds=240)
    
    # 4. Download video and return
    return {video_bytes, prompt, duration, model_name, cost, provider, prediction_id}

Decision Needed:

  • Option A: Keep separate (recommended) - Different model, LLM prompt generation, guidance_scale
  • Option B: Integrate into unified entry point - Add model="kling-v2.5-turbo-std" support

Recommendation: Keep separate for now, but ensure it follows same patterns (pre-flight, usage tracking, file saving).


3. InfiniteTalk (Talking Avatar with Audio) ⚠️

Used By:

  • Story Writer (/api/story/animate-scene-voiceover)
  • Podcast Maker (/api/podcast/render/video)
  • Image Studio Transform Studio (Talking Avatar feature)

Current Status: Uses separate animate_scene_with_voiceover() function (NOT using unified entry point)

Features:

  • Input: Image (bytes) + Audio (bytes) - BOTH REQUIRED
  • Optional: Prompt (for expression/style), mask_image (for animatable regions), seed
  • Resolution: 480p or 720p only
  • Model: wavespeed-ai/infinitetalk
  • Special: Audio-driven lip-sync animation (different from standard image-to-video)

Key Differences from Standard:

  1. Audio Required: Must have audio file (for lip-sync)
  2. Different Model: Uses InfiniteTalk (not WAN 2.5)
  3. Limited Resolution: Only 480p or 720p (no 1080p)
  4. Different Use Case: Talking avatar (person speaking) vs. scene animation
  5. Different Pricing: $0.03/s (480p) or $0.06/s (720p) vs. WAN 2.5 pricing

Requirements:

  • Pre-flight validation (subscription limits)
  • Usage tracking
  • File saving to disk
  • Asset library integration
  • Progress callbacks (for async operations)
  • Metadata return (cost, duration, prompt, prediction_id)

Current Implementation:

# backend/services/wavespeed/infinitetalk.py
def animate_scene_with_voiceover(
    image_bytes: bytes,
    audio_bytes: bytes,  # REQUIRED
    scene_data: Dict[str, Any],
    story_context: Dict[str, Any],
    user_id: str,
    resolution: str = "720p",
    prompt_override: Optional[str] = None,
    mask_image_bytes: Optional[bytes] = None,
    seed: Optional[int] = -1,
) -> Dict[str, Any]:
    # 1. Generate prompt (or use override)
    animation_prompt = prompt_override or _generate_simple_infinitetalk_prompt(...)
    
    # 2. Submit to WaveSpeed InfiniteTalk
    prediction_id = client.submit_image_to_video(INFINITALK_MODEL_PATH, payload)
    
    # 3. Poll for completion (up to 10 minutes)
    result = client.poll_until_complete(prediction_id, timeout_seconds=600)
    
    # 4. Download video and return
    return {video_bytes, prompt, duration, model_name, cost, provider, prediction_id}

Decision Needed:

  • Option A: Keep separate (recommended) - Different model, requires audio, different use case
  • Option B: Integrate into unified entry point - Add operation_type="talking-avatar" or model="infinitetalk" support

Recommendation: Keep separate for now, but ensure it follows same patterns (pre-flight, usage tracking, file saving).


Unified Entry Point Current Support

Supported Operations

Standard Image-to-Video:

  • WAN 2.5 (alibaba/wan-2.5/image-to-video)
  • Kandinsky 5 Pro (wavespeed/kandinsky5-pro/image-to-video)
  • Pre-flight validation
  • Usage tracking
  • Progress callbacks
  • Metadata return
  • File saving (handled by calling services)
  • Asset library integration (handled by calling services)

Not Supported (Keep Separate)

Kling Animation:

  • Different model (kwaivgi/kling-v2.5-turbo-std/image-to-video)
  • LLM prompt generation requirement
  • Guidance scale parameter
  • Resume support

InfiniteTalk:

  • Different model (wavespeed-ai/infinitetalk)
  • Requires audio (not optional)
  • Different use case (talking avatar vs. scene animation)
  • Limited resolution (480p/720p only)

Requirements Checklist

Core Requirements (All Operations)

Requirement Standard (WAN 2.5) Kling Animation InfiniteTalk
Pre-flight validation
Usage tracking
File saving
Asset library
Progress callbacks (sync)
Metadata return
Error handling
Resume support

Feature-Specific Requirements

Feature Standard (WAN 2.5) Kling Animation InfiniteTalk
Image input
Text prompt (LLM-generated) (optional)
Audio input (optional) (required)
Duration control (5/10s) (5/10s) (audio-driven)
Resolution options (480p/720p/1080p) (model default) (480p/720p)
Negative prompt
Seed control
Guidance scale
Mask image
Prompt expansion

Gaps and Recommendations

No Gaps Found for Standard Image-to-Video

The unified ai_video_generate() implementation fully supports all requirements for:

  • Image Studio Transform Service
  • Video Studio Service

Both services are correctly using the unified entry point and all features work as expected.

Reasoning:

  1. Different model with different parameters (guidance_scale)
  2. Requires LLM prompt generation (adds complexity)
  3. Has resume support (not in unified entry point)
  4. Different use case (scene animation vs. general image-to-video)

Action: Ensure it follows same patterns:

  • Pre-flight validation (already done)
  • Usage tracking (already done)
  • File saving (already done)
  • Asset library (already done)
  • ⚠️ Consider adding progress callbacks for async operations

Reasoning:

  1. Different model with different requirements (audio required)
  2. Different use case (talking avatar vs. scene animation)
  3. Different pricing model
  4. Limited resolution options

Action: Ensure it follows same patterns:

  • Pre-flight validation (already done)
  • Usage tracking (already done)
  • File saving (already done)
  • Asset library (already done)
  • Progress callbacks (already done)

Verification Checklist

Image Studio

  • Uses unified ai_video_generate() for image-to-video
  • Pre-flight validation works
  • Usage tracking works
  • File saving works
  • Asset library integration works
  • All parameters supported (prompt, duration, resolution, audio, negative_prompt, seed)

Video Studio

  • Uses unified ai_video_generate() for image-to-video
  • Pre-flight validation works
  • Usage tracking works
  • File saving works
  • Asset library integration works
  • All parameters supported

Story Writer ⚠️

  • Standard image-to-video: Uses unified entry point (via hd_video.py - but that's text-to-video)
  • Kling animation: Uses separate function (keep separate)
  • InfiniteTalk: Uses separate function (keep separate)
  • All operations have pre-flight validation
  • All operations have usage tracking
  • All operations save files
  • All operations save to asset library

Podcast Maker ⚠️

  • InfiniteTalk: Uses separate function (keep separate)
  • Pre-flight validation works
  • Usage tracking works
  • File saving works
  • Asset library integration (via podcast service)
  • Progress callbacks work (async polling)

Conclusion

Standard Image-to-Video is Complete

The unified ai_video_generate() implementation fully supports all requirements for standard image-to-video operations used by:

  • Image Studio
  • Video Studio

⚠️ Specialized Operations Should Stay Separate

Kling Animation and InfiniteTalk are specialized operations with:

  • Different models
  • Different requirements (audio for InfiniteTalk, LLM prompts for Kling)
  • Different use cases (talking avatar vs. scene animation)

Recommendation: Keep these separate but ensure they follow the same patterns:

  • Pre-flight validation
  • Usage tracking
  • File saving
  • Asset library integration
  • Progress callbacks (where applicable)

Next Steps

  1. Confirmed: Standard image-to-video unified generation is complete
  2. Confirmed: All existing features and requirements are supported
  3. ⚠️ Note: Kling and InfiniteTalk are intentionally separate (different models/use cases)
  4. Ready: Proceed with Phase 1 (text-to-video implementation)

Testing Recommendations

Before proceeding with text-to-video, verify:

  1. Image Studio:

    • Image-to-video generation works
    • All parameters work (prompt, duration, resolution, audio, negative_prompt, seed)
    • File saving works
    • Asset library integration works
    • Pre-flight validation blocks exceeded limits
    • Usage tracking works
  2. Video Studio:

    • Image-to-video generation works
    • All parameters work
    • File saving works
    • Asset library integration works
    • Pre-flight validation works
    • Usage tracking works
  3. Story Writer (Kling & InfiniteTalk):

    • Kling animation works (separate function)
    • InfiniteTalk works (separate function)
    • Both have pre-flight validation
    • Both have usage tracking
    • Both save files and assets
  4. Podcast Maker (InfiniteTalk):

    • InfiniteTalk works (separate function)
    • Pre-flight validation works
    • Usage tracking works
    • File saving works
    • Async polling works