feat: Import 35+ skills, merge duplicates, add openclaw installer

Major updates:
- Added 35+ new skills from awesome-opencode-skills and antigravity repos
- Merged SEO skills into seo-master
- Merged architecture skills into architecture
- Merged security skills into security-auditor and security-coder
- Merged testing skills into testing-master and testing-patterns
- Merged pentesting skills into pentesting
- Renamed website-creator to thai-frontend-dev
- Replaced skill-creator with github version
- Removed Chutes references (use MiniMax API instead)
- Added install-openclaw-skills.sh for cross-platform installation
- Updated .env.example with MiniMax API credentials
Kunthawat Greethong
2026-03-26 11:37:39 +07:00
parent 48595100a1
commit 7edf5bc4d0
469 changed files with 131580 additions and 417 deletions


@@ -0,0 +1,649 @@
---
name: minimax-multimodal-toolkit
description: MiniMax multimodal model skill — use MiniMax Multi-Modal models for speech, music, video, and image. Create voice, music, video, and images with MiniMax AI: TTS (text-to-speech, voice cloning, voice design, multi-segment), music (songs, instrumentals), video (text-to-video, image-to-video, start-end frame, subject reference, templates, long-form multi-scene), image (text-to-image, image-to-image with character reference), and media processing (convert, concat, trim, extract). Use when the user mentions MiniMax, multimodal generation, or wants speech/music/video/image AI, MiniMax APIs, or FFmpeg workflows alongside MiniMax outputs.
---
# MiniMax Multi-Modal Toolkit
Generate voice, music, video, and image content via MiniMax APIs — the unified entry for **MiniMax multimodal** use cases (audio + music + video + image). Includes voice cloning & voice design for custom voices, image generation with character reference, and FFmpeg-based media tools for audio/video format conversion, concatenation, trimming, and extraction.
## Output Directory
**All generated files MUST be saved to `minimax-output/` under the AGENT'S current working directory (NOT the skill directory).** Every script call MUST include an explicit `--output` / `-o` argument pointing to this location. Never omit the output argument or rely on script defaults.
**Rules:**
1. Before running any script, ensure `minimax-output/` exists in the agent's working directory (create if needed: `mkdir -p minimax-output`)
2. Always use absolute or relative paths from the agent's working directory: `--output minimax-output/video.mp4`
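3. **Never** `cd` into the skill directory to run scripts. Run them from the agent's working directory using the full script path (the `scripts/...` paths in the examples below are relative to the skill directory, so prefix them with the skill's location)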
4. Intermediate/temp files (segment audio, video segments, extracted frames) are automatically placed in `minimax-output/tmp/`. They can be cleaned up when no longer needed: `rm -rf minimax-output/tmp`
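A minimal bash sketch of rules 1 and 2, assuming the caller is already in the agent's working directory. The helper names `ensure_output_dir` and `out_path` are hypothetical and are not part of the skill's scripts.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not shipped with the skill) for the output rules.
# All paths are relative to the current (agent's) working directory.

OUTPUT_DIR="minimax-output"

# Create minimax-output/ plus the tmp/ folder used for intermediate files.
ensure_output_dir() {
  mkdir -p "$OUTPUT_DIR/tmp"
}

# Print the output path for a generated file,
# e.g. video.mp4 -> minimax-output/video.mp4
out_path() {
  printf '%s/%s\n' "$OUTPUT_DIR" "$1"
}
```

For example, calling `ensure_output_dir` once and then passing `--output "$(out_path video.mp4)"` to a script keeps every artifact under `minimax-output/`.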
## Prerequisites
```bash
brew install ffmpeg jq   # macOS (xxd ships with macOS); on Debian/Ubuntu: sudo apt install ffmpeg jq xxd
bash scripts/check_environment.sh
```
No Python or pip required — all scripts are pure bash using `curl`, `ffmpeg`, `jq`, and `xxd`.
### API Host Configuration
MiniMax provides two service endpoints for different regions. Set `MINIMAX_API_HOST` before running any script:
| Region | Platform URL | API Host Value |
|--------|-------------|----------------|
| China Mainland | https://platform.minimaxi.com | `https://api.minimaxi.com` |
| Global | https://platform.minimax.io | `https://api.minimax.io` |
```bash
# China Mainland
export MINIMAX_API_HOST="https://api.minimaxi.com"
# or Global
export MINIMAX_API_HOST="https://api.minimax.io"
```
**IMPORTANT — When API Host is missing:**
Before running any script, check if `MINIMAX_API_HOST` is set in the environment. If it is NOT configured:
1. Ask the user which service endpoint their MiniMax account uses:
   - **China Mainland** → `https://api.minimaxi.com`
   - **Global** → `https://api.minimax.io`
2. Help the user set it via `export MINIMAX_API_HOST="https://api.minimaxi.com"` (or the Global variant) in their terminal, or add it to their shell profile (`~/.zshrc` / `~/.bashrc`) for persistence
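The check above can be sketched in bash as a function that accepts only the two hosts from the table. `check_api_host` is a hypothetical helper; the bundled `scripts/check_environment.sh` may perform a different check.

```bash
# Hypothetical helper: succeed only if MINIMAX_API_HOST is one of the two
# documented endpoints; otherwise explain the problem on stderr.
check_api_host() {
  case "${MINIMAX_API_HOST:-}" in
    "https://api.minimaxi.com") echo "China Mainland" ;;
    "https://api.minimax.io")   echo "Global" ;;
    "")
      echo "MINIMAX_API_HOST is not set: ask the user which region their account uses" >&2
      return 1 ;;
    *)
      echo "Unrecognized MINIMAX_API_HOST: ${MINIMAX_API_HOST}" >&2
      return 1 ;;
  esac
}
```

The match is exact, so a trailing slash (`https://api.minimax.io/`) is rejected rather than silently accepted.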
### API Key Configuration
Set the `MINIMAX_API_KEY` environment variable before running any script:
```bash
export MINIMAX_API_KEY="your-api-key-here"
```
The key starts with `sk-api-` or `sk-cp-` and can be obtained from https://platform.minimaxi.com (China Mainland) or https://platform.minimax.io (Global).
**IMPORTANT — When API Key is missing:**
Before running any script, check if `MINIMAX_API_KEY` is set in the environment. If it is NOT configured:
1. Ask the user to provide their MiniMax API key
2. Help the user set it via `export MINIMAX_API_KEY="sk-..."` in their terminal, or add it to their shell profile (`~/.zshrc` / `~/.bashrc`) for persistence
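A hedged sketch of the key check, based only on the prefixes stated above (`sk-api-` or `sk-cp-`). The function name is hypothetical, and it deliberately never prints the key.

```bash
# Hypothetical helper: verify MINIMAX_API_KEY is set and has a documented
# prefix. Never echo the key itself.
check_api_key() {
  local key="${MINIMAX_API_KEY:-}"
  if [ -z "$key" ]; then
    echo "MINIMAX_API_KEY is not set: ask the user for their key" >&2
    return 1
  fi
  case "$key" in
    sk-api-*|sk-cp-*) return 0 ;;
    *)
      echo "MINIMAX_API_KEY does not start with sk-api- or sk-cp-" >&2
      return 1 ;;
  esac
}
```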
## Key Capabilities
| Capability | Description | Entry point |
|------------|-------------|-------------|
| TTS | Text-to-speech synthesis with multiple voices and emotions | `scripts/tts/generate_voice.sh` |
| Voice Cloning | Clone a voice from an audio sample (10 s to 5 min) | `scripts/tts/generate_voice.sh clone` |
| Voice Design | Create a custom voice from a text description | `scripts/tts/generate_voice.sh design` |
| Music Generation | Generate songs with lyrics or instrumental tracks | `scripts/music/generate_music.sh` |
| Image Generation | Text-to-image, image-to-image with character reference | `scripts/image/generate_image.sh` |
| Video Generation | Text-to-video, image-to-video, subject reference, templates | `scripts/video/generate_video.sh` |
| Long Video | Multi-scene chained video with crossfade transitions | `scripts/video/generate_long_video.sh` |
| Media Tools | Audio/video format conversion, concatenation, trimming, extraction | `scripts/media_tools.sh` |
## TTS (Text-to-Speech)
Entry point: `scripts/tts/generate_voice.sh`
### IMPORTANT: Single voice vs Multi-segment — Choose the right approach
| User intent | Approach |
|-------------|----------|
| Single voice / no multi-character need | `tts` command — generate the entire text in one call |
| Multiple characters / narrator + dialogue | `generate` command with segments.json |
**Default behavior:** When the user simply asks to generate speech/voice and does NOT mention multiple voices or characters, use the `tts` command directly with a single appropriate voice. Do NOT split into segments or use the multi-segment pipeline — just pass the full text to `tts` in one call.
Only use multi-segment `generate` when:
- The user explicitly needs multiple voices/characters
- The text requires narrator + character dialogue separation
- The text exceeds **10,000 characters** (API limit per request) — in this case, split into segments with the same voice
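The decision rule above reduces to two conditions: more than one voice, or more than 10,000 characters. A minimal bash sketch, where `choose_tts_command` is a hypothetical name and not a subcommand of `generate_voice.sh`:

```bash
# Hypothetical helper: print "tts" or "generate" following the rules above.
# ${#text} counts characters under a UTF-8 locale; under LC_ALL=C it counts
# bytes, which overestimates CJK text (the safe direction for a limit check).
TTS_CHAR_LIMIT=10000

choose_tts_command() {
  local text="$1" voice_count="${2:-1}"
  if [ "$voice_count" -gt 1 ] || [ "${#text}" -gt "$TTS_CHAR_LIMIT" ]; then
    echo generate
  else
    echo tts
  fi
}
```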
### Single-voice generation (DEFAULT)
```bash
bash scripts/tts/generate_voice.sh tts "Hello world" -o minimax-output/hello.mp3
bash scripts/tts/generate_voice.sh tts "你好世界" -v female-shaonv -o minimax-output/hello_cn.mp3
```
### Multi-segment generation (multi-voice / audiobook / podcast)
**Complete workflow — follow ALL steps in order:**
1. **Write segments.json** — split text into segments with voice assignments (see format and rules below)
2. **Run `generate` command** — this reads segments.json, generates audio for EACH segment via TTS API, then merges them into a single output file with crossfade
```bash
# Step 1: Write segments.json to minimax-output/
# (use the Write tool to create minimax-output/segments.json)
# Step 2: Generate audio from segments.json — this is the CRITICAL step
# It generates each segment individually and merges them into one file
bash scripts/tts/generate_voice.sh generate minimax-output/segments.json \
-o minimax-output/output.mp3 --crossfade 200
```
**Do NOT skip Step 2.** Writing segments.json alone does nothing — you MUST run the `generate` command to actually produce audio.
### Voice management
```bash
# List all available voices
bash scripts/tts/generate_voice.sh list-voices
# Voice cloning (from an audio sample, 10 s to 5 min)
bash scripts/tts/generate_voice.sh clone sample.mp3 --voice-id my-voice
# Voice design (from text description)
bash scripts/tts/generate_voice.sh design "A warm female narrator voice" --voice-id narrator
```
### Audio processing
```bash
bash scripts/tts/generate_voice.sh merge part1.mp3 part2.mp3 -o minimax-output/combined.mp3
bash scripts/tts/generate_voice.sh convert input.wav -o minimax-output/output.mp3
```
### TTS Models
| Model | Notes |
|-------|-------|
| speech-2.8-hd | Recommended, auto emotion matching |
| speech-2.8-turbo | Faster variant |
| speech-2.6-hd | Previous gen, manual emotion |
| speech-2.6-turbo | Previous gen, faster |
### segments.json Format
Default crossfade between segments: **200ms** (`--crossfade 200`).
```json
[
{ "text": "Hello!", "voice_id": "female-shaonv", "emotion": "" },
{ "text": "Welcome.", "voice_id": "male-qn-qingse", "emotion": "happy" }
]
```
Leave `emotion` empty for speech-2.8 models (auto-matched from text).
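Writing segments.json by hand is error-prone when the text contains quotes. The sketch below is a hypothetical helper (not shipped with the skill) that builds the file from `voice_id|emotion|text` lines using bash string replacement for JSON escaping. It escapes only backslash, double quote, newline, carriage return, and tab, so prefer `jq` for arbitrary input.

```bash
# Escape a string for use inside a JSON double-quoted value.
json_escape() {
  local s="$1"
  s=${s//\\/\\\\}        # backslash first, so later escapes are not doubled
  s=${s//\"/\\\"}
  s=${s//$'\n'/\\n}
  s=${s//$'\r'/\\r}
  s=${s//$'\t'/\\t}
  printf '%s' "$s"
}

# Read "voice_id|emotion|text" lines on stdin and print a segments.json array.
# Everything after the second "|" is treated as text.
build_segments_json() {
  local voice emotion text first=1
  printf '[\n'
  while IFS='|' read -r voice emotion text || [ -n "$voice" ]; do
    [ -z "$voice" ] && continue          # skip blank lines
    [ "$first" -eq 1 ] || printf ',\n'
    first=0
    printf '  { "text": "%s", "voice_id": "%s", "emotion": "%s" }' \
      "$(json_escape "$text")" "$(json_escape "$voice")" "$(json_escape "$emotion")"
  done
  printf '\n]\n'
}
```

Usage: `build_segments_json < lines.txt > minimax-output/segments.json`, then run the `generate` command on the result.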
### IMPORTANT: Multi-Segment Script Generation Rules (Audiobooks, Podcasts, etc.)
When generating segments.json for audiobooks, podcasts, or any multi-character narration, you MUST split narration text from character dialogue into separate segments with distinct voices.
**Rule: Narration and dialogue are ALWAYS separate segments.**
A sentence like `"Tom said: The weather is great today!"` must be split into two segments:
- Segment 1 (narrator voice): `"Tom said:"`
- Segment 2 (character voice): `"The weather is great today!"`
**Example — Audiobook with narrator + 2 characters:**
```json
[
{ "text": "Morning sunlight streamed into the classroom as students filed in one by one.", "voice_id": "narrator-voice", "emotion": "" },
{ "text": "Tom smiled and turned to Lisa:", "voice_id": "narrator-voice", "emotion": "" },
{ "text": "The weather is amazing today! Let's go to the park after school!", "voice_id": "tom-voice", "emotion": "happy" },
{ "text": "Lisa thought for a moment, then replied:", "voice_id": "narrator-voice", "emotion": "" },
{ "text": "Sure, but I need to drop off my backpack at home first.", "voice_id": "lisa-voice", "emotion": "" },
{ "text": "They exchanged a smile and went back to listening to the lecture.", "voice_id": "narrator-voice", "emotion": "" }
]
```
**Key principles:**
1. **Narrator** uses a consistent neutral narrator voice throughout
2. **Each character** has a dedicated voice_id, maintained consistently across all their dialogue
3. **Split at dialogue boundaries**: `"He said:"` is narrator, the quoted content is the character
4. **Do NOT merge** narrator text and character speech into a single segment
5. For characters without pre-existing voice_ids, use voice cloning or voice design to create them first, then reference the created voice_id in segments
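For simple `"<narration>: <speech>"` sentences, the split rule can be sketched in bash as below. This is an illustrative, hypothetical helper: it splits at the first ASCII colon only (a full-width `：` is not handled), and real prose usually needs manual splitting.

```bash
# Hypothetical helper: split "<narration>: <speech>" at the first colon into
# two "voice|text" lines, narrator first. Lines without a colon become a
# single narrator line.
split_dialogue() {
  local line="$1" narrator_voice="$2" character_voice="$3"
  case "$line" in
    *:*)
      local narration="${line%%:*}:"   # text up to and including the first colon
      local speech="${line#*:}"        # everything after the first colon
      speech="${speech# }"             # drop one leading space
      printf '%s|%s\n%s|%s\n' "$narrator_voice" "$narration" "$character_voice" "$speech"
      ;;
    *)
      printf '%s|%s\n' "$narrator_voice" "$line"
      ;;
  esac
}
```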
## Music Generation
Entry point: `scripts/music/generate_music.sh`
### IMPORTANT: Instrumental vs Lyrics — When to use which
| Scenario | Mode | Action |
|----------|------|--------|
| BGM for video / voice / podcast | Instrumental (default) | Use `--instrumental` directly, do NOT ask user |
| User explicitly asks to "create music" / "make a song" | Ask user first | Ask whether they want instrumental or with lyrics |
**When adding background music to video or voice content**, always default to instrumental mode (`--instrumental`). Do not ask the user — BGM should never have vocals competing with the main content.
**When the user explicitly asks to create/generate music as the primary task**, ask them whether they want:
- Instrumental (pure music, no vocals)
- With lyrics (song with vocals — user provides or you help write lyrics)
```bash
# Instrumental (for BGM or when user chooses instrumental)
bash scripts/music/generate_music.sh \
--instrumental \
--prompt "ambient electronic, atmospheric" \
--output minimax-output/ambient.mp3 --download
# Song with lyrics (when user chooses vocal music)
bash scripts/music/generate_music.sh \
--lyrics "[verse]\nHello world\n[chorus]\nLa la la" \
--prompt "indie folk, melancholic" \
--output minimax-output/song.mp3 --download
# With style fields
bash scripts/music/generate_music.sh \
--lyrics "[verse]\nLyrics here" \
--genre "pop" --mood "upbeat" --tempo "fast" \
--output minimax-output/pop_track.mp3 --download
```
### Music Model
Default model: `music-2.5`
`music-2.5` does **not** support `--instrumental` directly. When instrumental music is needed, the script automatically applies a workaround:
- Sets lyrics to `[intro] [outro]` (empty structural tags, no actual vocals), appends `pure music, no lyrics` to the prompt
This produces instrumental-style output without requiring manual intervention. You can always use `--instrumental` and the script handles the rest.
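The workaround can be sketched as a transformation of the two request fields. The exact strings and separator used by `generate_music.sh` are assumed from the description above, and `instrumental_fields` is a hypothetical name.

```bash
# Hypothetical sketch of the music-2.5 instrumental workaround: structural
# tags only as lyrics, plus a "pure music, no lyrics" hint on the prompt.
instrumental_fields() {
  local prompt="$1"
  local lyrics="[intro] [outro]"
  if [ -n "$prompt" ]; then
    prompt="$prompt, pure music, no lyrics"
  else
    prompt="pure music, no lyrics"
  fi
  printf 'lyrics=%s\nprompt=%s\n' "$lyrics" "$prompt"
}
```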
## Image Generation
Entry point: `scripts/image/generate_image.sh`
Model: `image-01` — photorealistic image generation from text prompts, with optional character reference for image-to-image.
### IMPORTANT: Mode Selection — t2i vs i2i
| User intent | Mode |
|-------------|------|
| Generate image from text description (default) | `t2i` — text-to-image |
| Generate image with a character reference photo (keep same person) | `i2i` — image-to-image |
**Default behavior:** When the user asks to generate/create an image without mentioning a reference photo, use `t2i` mode (default). Only use `i2i` mode when the user provides a character reference image or explicitly asks to base the image on an existing person's appearance.
### IMPORTANT: Aspect Ratio — Infer from user context
Do NOT always default to `1:1`. Analyze the user's request and choose the most appropriate aspect ratio:
| User intent / context | Recommended ratio | Resolution |
|-----------------------|-------------------|------------|
| Avatar, icon, social media profile pic | `1:1` | 1024×1024 |
| Landscape, banner, desktop wallpaper | `16:9` | 1280×720 |
| Traditional photo, classic proportions | `4:3` | 1152×864 |
| Photography work, magazine cover | `3:2` | 1248×832 |
| Portrait (vertical) photo, poster | `2:3` | 832×1248 |
| Vertical poster, book cover, tall poster | `3:4` | 864×1152 |
| Phone wallpaper, social media story, reel | `9:16` | 720×1280 |
| Ultra-wide panorama, cinematic widescreen | `21:9` | 1344×576 |
| No specific requirement / ambiguous | `1:1` | 1024×1024 |
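A crude keyword-matching sketch of the table above, in bash. The keyword lists are illustrative, not exhaustive, and `infer_aspect_ratio` is hypothetical; case order matters because the first matching pattern wins.

```bash
# Hypothetical helper: map a free-text request to an aspect ratio.
infer_aspect_ratio() {
  local intent
  intent=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$intent" in
    *avatar*|*icon*|*profile*)          echo "1:1" ;;
    *phone*wallpaper*|*story*|*reel*)   echo "9:16" ;;
    *panoram*|*ultrawide*|*ultra-wide*) echo "21:9" ;;
    *landscape*|*banner*|*desktop*)     echo "16:9" ;;
    *book*cover*|*tall*poster*)         echo "3:4" ;;
    *portrait*|*poster*)                echo "2:3" ;;
    *magazine*|*photography*)           echo "3:2" ;;
    *classic*photo*)                    echo "4:3" ;;
    *)                                  echo "1:1" ;;   # ambiguous: default
  esac
}
```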
### IMPORTANT: Image Count — When to generate multiple images
| User intent | Count (`-n`) |
|-------------|--------------|
| Default / single image request | `1` (default) |
| User says "a few", "several", "some" | `3` |
| User asks for "multiple options", "alternatives", "variations" | `3` to `4` |
| User explicitly specifies a number | Use the specified number (1 to 9) |
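The count rule can be sketched as: an explicit digit 1 to 9 before a noun like "images" wins, vague plurals map to 3, and everything else is 1. The regex and keywords are illustrative assumptions, and the helper name is hypothetical.

```bash
# Hypothetical helper: infer the -n value from a request.
infer_image_count() {
  local intent re='(^|[^0-9])([1-9])[[:space:]]+(images|pictures|photos|variations|options)'
  intent=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  if [[ $intent =~ $re ]]; then
    echo "${BASH_REMATCH[2]}"   # explicit count, e.g. "4 images"
    return 0
  fi
  case "$intent" in
    *variation*|*option*|*alternative*|*"a few"*|*several*) echo 3 ;;
    *) echo 1 ;;
  esac
}
```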
### Text-to-Image Examples
```bash
# Basic text-to-image
bash scripts/image/generate_image.sh \
--prompt "A cat sitting on a rooftop at sunset, cinematic lighting, warm tones, photorealistic" \
-o minimax-output/cat.png
# Landscape with inferred aspect ratio
bash scripts/image/generate_image.sh \
--prompt "Mountain landscape with misty valleys, photorealistic, golden hour" \
--aspect-ratio 16:9 \
-o minimax-output/landscape.png
# Phone wallpaper (portrait 9:16)
bash scripts/image/generate_image.sh \
--prompt "Aurora borealis over a snowy forest, vivid colors, magical atmosphere" \
--aspect-ratio 9:16 \
-o minimax-output/wallpaper.png
# Multiple variations
bash scripts/image/generate_image.sh \
--prompt "Abstract geometric art, vibrant colors" \
-n 3 \
-o minimax-output/art.png
# With prompt optimizer
bash scripts/image/generate_image.sh \
--prompt "A man standing on Venice Beach, 90s documentary style" \
--aspect-ratio 16:9 --prompt-optimizer \
-o minimax-output/beach.png
# Custom dimensions (must be multiple of 8)
bash scripts/image/generate_image.sh \
--prompt "Product photo of a luxury watch on marble surface" \
--width 1024 --height 768 \
-o minimax-output/watch.png
```
### Image-to-Image (Character Reference)
Use a reference photo to generate images with the same character in new scenes. Best results with a single front-facing portrait. Supported formats: JPG, JPEG, PNG (max 10MB).
```bash
# Character reference — place same person in a new scene
bash scripts/image/generate_image.sh \
--mode i2i \
--prompt "A girl looking into the distance from a library window, warm afternoon light" \
--ref-image face.jpg \
--aspect-ratio 16:9 \
-o minimax-output/girl_library.png
# Multiple character variations
bash scripts/image/generate_image.sh \
--mode i2i \
--prompt "A woman in a red dress at a gala event, elegant, cinematic" \
--ref-image face.jpg -n 3 \
-o minimax-output/gala.png
```
### Aspect Ratio Reference
| Ratio | Resolution | Best for |
|-------|------------|----------|
| `1:1` | 1024×1024 | Default, avatars, icons, social media |
| `16:9` | 1280×720 | Landscape, banner, desktop wallpaper |
| `4:3` | 1152×864 | Classic photo, presentations |
| `3:2` | 1248×832 | Photography, magazine layout |
| `2:3` | 832×1248 | Portrait photo, poster |
| `3:4` | 864×1152 | Book cover, tall poster |
| `9:16` | 720×1280 | Phone wallpaper, social story/reel |
| `21:9` | 1344×576 | Ultra-wide panoramic, cinematic |
### Key Options
| Option | Description |
|--------|-------------|
| `--prompt TEXT` | Image description, max 1500 chars (required) |
| `--aspect-ratio RATIO` | Aspect ratio (see table above). Infer from user context |
| `--width PX` / `--height PX` | Custom size, 512 to 2048, must be a multiple of 8; both required together. Overridden by `--aspect-ratio` if both set |
| `-n N` | Number of images to generate, 1 to 9 (default 1) |
| `--seed N` | Random seed for reproducibility. Same seed + same params → similar results |
| `--prompt-optimizer` | Enable automatic prompt optimization by the API |
| `--ref-image FILE` | Character reference image for i2i mode (local file or URL, JPG/JPEG/PNG, max 10MB) |
| `--no-download` | Print image URLs instead of downloading files |
| `--aigc-watermark` | Add AIGC watermark to generated images |
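Before calling the script with `--width`/`--height`, the documented constraints can be checked locally. A minimal bash sketch; `validate_custom_size` is hypothetical, and the script may already perform equivalent validation.

```bash
# Hypothetical helper: both-or-neither, each in [512, 2048], multiple of 8.
validate_custom_size() {
  local w="${1:-}" h="${2:-}" v
  if [ -z "$w" ] && [ -z "$h" ]; then return 0; fi   # custom size not used
  if [ -z "$w" ] || [ -z "$h" ]; then
    echo "--width and --height must be set together" >&2
    return 1
  fi
  for v in "$w" "$h"; do
    case "$v" in ''|*[!0-9]*) echo "not an integer: $v" >&2; return 1 ;; esac
    if [ "$v" -lt 512 ] || [ "$v" -gt 2048 ] || [ $((v % 8)) -ne 0 ]; then
      echo "invalid size $v: must be 512 to 2048 and a multiple of 8" >&2
      return 1
    fi
  done
  return 0
}
```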
## Video Generation
### IMPORTANT: Single vs Multi-Segment — Choose the right script
| User intent | Script to use |
|-------------|---------------|
| Default / no special request | `scripts/video/generate_video.sh` (single segment, **10s, 768P**) |
| User explicitly asks for "long video", "multi-scene", "story", or duration > 10s | `scripts/video/generate_long_video.sh` (multi-segment) |
**Default behavior:** Always use single-segment `generate_video.sh` with **duration 10s and resolution 768P** unless the user explicitly asks for a long video, multi-scene video, or specifies a total duration exceeding 10 seconds. Do NOT automatically split into multiple segments — a single 10s video is the standard output. Only use `generate_long_video.sh` when the user clearly needs multi-scene or longer content.
Entry point (single video): `scripts/video/generate_video.sh`
Entry point (long/multi-scene): `scripts/video/generate_long_video.sh`
### Video Model Constraints (MUST follow)
**Duration limits by model and resolution:**
| Model | 720P | 768P | 1080P |
|-------|------|------|-------|
| MiniMax-Hailuo-2.3 | - | 6s or **10s** | 6s only |
| MiniMax-Hailuo-2.3-Fast | - | 6s or **10s** | 6s only |
| MiniMax-Hailuo-02 | - | 6s or **10s** | 6s only |
| T2V-01 / T2V-01-Director | 6s only | - | - |
| I2V-01 / I2V-01-Director / I2V-01-live | 6s only | - | - |
| S2V-01 (ref) | 6s only | - | - |
**Resolution options by model and duration:**
| Model | 6s | 10s |
|-------|-----|-----|
| MiniMax-Hailuo-2.3 | 768P (default), 1080P | 768P only |
| MiniMax-Hailuo-2.3-Fast | 768P (default), 1080P | 768P only |
| MiniMax-Hailuo-02 | 512P, 768P (default), 1080P | 512P, 768P (default) |
| Other models | 720P (default) | Not supported |
**Key rules:**
- **Default: 10s + 768P** (best balance of length and quality for MiniMax-Hailuo-2.3)
- 1080P only supports 6s duration — if user requests 1080P, set `--duration 6`
- 10s duration only works with 768P (or 512P on Hailuo-02) — never combine 10s + 1080P
- Older models (T2V-01, I2V-01, S2V-01) only support 6s at 720P
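The two tables above can be encoded as a lookup that rejects invalid combinations before any API call. This bash sketch mirrors only what the tables state; `valid_video_combo` is a hypothetical name.

```bash
# Hypothetical helper: return 0 if model + duration (s) + resolution is
# allowed by the tables above, 1 otherwise.
valid_video_combo() {
  local model="$1" duration="$2" res="$3"
  case "$model" in
    MiniMax-Hailuo-2.3|MiniMax-Hailuo-2.3-Fast)
      case "$duration:$res" in
        6:768P|6:1080P|10:768P) return 0 ;;
      esac ;;
    MiniMax-Hailuo-02)
      case "$duration:$res" in
        6:512P|6:768P|6:1080P|10:512P|10:768P) return 0 ;;
      esac ;;
    T2V-01|T2V-01-Director|I2V-01|I2V-01-Director|I2V-01-live|S2V-01)
      [ "$duration:$res" = "6:720P" ] && return 0 ;;
  esac
  return 1
}
```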
### IMPORTANT: Prompt Optimization (MUST follow before generating any video)
Before calling any video generation script, you MUST optimize the user's prompt by reading and applying `references/video-prompt-guide.md`. Never pass the user's raw description directly as `--prompt`.
**Optimization steps:**
1. **Apply the Professional Formula**: `Main subject + Scene + Movement + Camera motion + Aesthetic atmosphere`
   - BAD: `"A puppy in a park"`
   - GOOD: `"A golden retriever puppy runs toward the camera on a sun-dappled grass path in a park, [跟随] smooth tracking shot, warm golden hour lighting, shallow depth of field, joyful atmosphere"`
2. **Add camera instructions** using bracketed commands such as `[推进]` (push in), `[拉远]` (pull out), `[跟随]` (tracking shot), `[固定]` (fixed/static shot), `[左摇]` (pan left), etc.
3. **Include aesthetic details**: lighting (golden hour, dramatic side lighting), color grading (warm tones, cinematic), texture (dust particles, rain droplets), atmosphere (intimate, epic, peaceful)
4. **Keep to 1-2 key actions** for 6-10 second videos — do not overcrowd with events
5. **For i2v mode** (image-to-video): Focus prompt on **movement and change only**, since the image already establishes the visual. Do NOT re-describe what's in the image.
   - BAD: `"A lake with mountains"` (just repeating the image)
   - GOOD: `"Gentle ripples spread across the water surface, a breeze rustles the distant trees, [固定] fixed camera, soft morning light, peaceful and serene"`
6. **For multi-segment long videos**: Each segment's prompt must be self-contained and optimized individually. The i2v segments (segment 2+) should describe motion/change relative to the previous segment's ending frame.
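As a mechanical aid for step 1, the five formula parts can be joined into one prompt string. This hypothetical bash helper only concatenates non-empty parts with `", "`; it does not replace the judgment the guide asks for.

```bash
# Hypothetical helper: join subject, scene, movement, camera motion, and
# aesthetic atmosphere (in that order) with ", ", skipping empty parts.
build_video_prompt() {
  local out="" part
  for part in "$@"; do
    [ -n "$part" ] || continue
    if [ -z "$out" ]; then out="$part"; else out="$out, $part"; fi
  done
  printf '%s\n' "$out"
}
```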
```bash
# Text-to-video (default: 10s, 768P)
bash scripts/video/generate_video.sh \
--mode t2v \
--prompt "A golden retriever puppy bounds toward the camera on a sunlit grass path, [跟随] tracking shot, warm golden hour, shallow depth of field, joyful" \
--output minimax-output/puppy.mp4
# Text-to-video with 1080P (must use --duration 6)
bash scripts/video/generate_video.sh \
--mode t2v \
--prompt "A golden retriever puppy bounds toward the camera" \
--duration 6 --resolution 1080P \
--output minimax-output/puppy_hd.mp4
# Image-to-video (prompt focuses on MOTION, not image content)
bash scripts/video/generate_video.sh \
--mode i2v \
--prompt "The petals begin to sway gently in the breeze, soft light shifts across the surface, [固定] fixed framing, dreamy pastel tones" \
--first-frame photo.jpg \
--output minimax-output/animated.mp4
# Start-end frame interpolation (sef mode uses MiniMax-Hailuo-02)
bash scripts/video/generate_video.sh \
--mode sef \
--first-frame start.jpg --last-frame end.jpg \
--output minimax-output/transition.mp4
# Subject reference (face consistency, ref mode uses S2V-01, 6s only)
bash scripts/video/generate_video.sh \
--mode ref \
--prompt "A young woman in a white dress walks slowly through a sunlit garden, [跟随] smooth tracking, warm natural lighting, cinematic depth of field" \
--subject-image face.jpg \
--duration 6 \
--output minimax-output/person.mp4
```
### Long-form Video (Multi-scene)
Multi-scene long videos chain segments together: the first segment generates via text-to-video (t2v), then each subsequent segment uses the last frame of the previous segment as its first frame (i2v). Segments are joined with crossfade transitions for smooth continuity. Default is 10 seconds per segment.
**Workflow:**
1. Segment 1: t2v — generated purely from the optimized text prompt
2. Segment 2+: i2v — the previous segment's last frame becomes `first_frame_image`, prompt describes **motion and change from that ending state**
3. All segments are concatenated with 0.5s crossfade transitions to eliminate jump cuts
4. Optional: AI-generated background music is overlaid
**Prompt rules for each segment:**
- Each segment prompt MUST be independently optimized using the Professional Formula
- Segment 1 (t2v): Full scene description with subject, scene, camera, atmosphere
- Segment 2+ (i2v): Focus on **what changes and moves** from the previous ending frame. Do NOT repeat the visual description — the first frame already provides it
- Maintain visual consistency: keep lighting, color grading, and style keywords consistent across segments
- Each segment covers only 10 seconds of action — keep it focused
```bash
# Example: 3-segment story with optimized per-segment prompts (default: 10s/segment, 768P)
bash scripts/video/generate_long_video.sh \
--scenes \
"A lone astronaut stands on a red desert planet surface, wind blowing dust particles, [推进] slow push in toward the visor, dramatic rim lighting, cinematic sci-fi atmosphere" \
"The astronaut turns and begins walking toward a distant glowing structure on the horizon, dust swirling around boots, [跟随] tracking from behind, vast desolate landscape, golden light from the structure" \
"The astronaut reaches the structure entrance, a massive doorway pulses with blue energy, [推进] slow push in toward the doorway, light reflects off the visor, awe-inspiring epic scale" \
--music-prompt "cinematic orchestral ambient, slow build, sci-fi atmosphere" \
--output minimax-output/long_video.mp4
# With custom settings
bash scripts/video/generate_long_video.sh \
--scenes "Scene 1 prompt" "Scene 2 prompt" \
--segment-duration 10 \
--resolution 768P \
--crossfade 0.5 \
--music-prompt "calm ambient background music" \
--output minimax-output/long_video.mp4
```
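Because each crossfade overlaps the end of one segment with the start of the next, the final length is shorter than the sum of the segments: n × segment_duration − (n − 1) × crossfade. This assumes the script overlaps segments the way ffmpeg's `xfade` filter does; the helper below is a hypothetical bash sketch of that arithmetic.

```bash
# Hypothetical helper: expected length (seconds) of a crossfaded multi-scene
# video, assuming each crossfade overlaps adjacent segments.
long_video_duration() {
  local n="$1" seg="$2" xfade="$3"
  awk -v n="$n" -v s="$seg" -v x="$xfade" \
    'BEGIN { printf "%.1f\n", n * s - (n - 1) * x }'
}
```

With the defaults (3 scenes × 10 s, 0.5 s crossfade) this gives 29.0 s rather than 30 s.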
### Add Background Music
```bash
bash scripts/video/add_bgm.sh \
--video input.mp4 \
--generate-bgm --instrumental \
--music-prompt "soft piano background" \
--bgm-volume 0.3 \
--output minimax-output/output_with_bgm.mp4
```
### Template Video
```bash
bash scripts/video/generate_template_video.sh \
--template-id 392753057216684038 \
--media photo.jpg \
--output minimax-output/template_output.mp4
```
### Video Models
| Mode | Default Model | Default Duration | Default Resolution | Notes |
|------|--------------|-----------------|-------------------|-------|
| t2v | MiniMax-Hailuo-2.3 | 10s | 768P | Latest text-to-video |
| i2v | MiniMax-Hailuo-2.3 | 10s | 768P | Latest image-to-video |
| sef | MiniMax-Hailuo-02 | 6s | 768P | Start-end frame |
| ref | S2V-01 | 6s | 720P | Subject reference, 6s only |
## Media Tools (Audio/Video Processing)
Entry point: `scripts/media_tools.sh`
Standalone FFmpeg-based utilities for format conversion, concatenation, extraction, trimming, and audio overlay. Use these when the user needs to process existing media files without generating new content via MiniMax API.
### Video Format Conversion
```bash
# Convert between formats (mp4, mov, webm, mkv, avi, ts, flv)
bash scripts/media_tools.sh convert-video input.webm -o minimax-output/output.mp4
bash scripts/media_tools.sh convert-video input.mp4 -o minimax-output/output.mov
# With quality / resolution / fps options
bash scripts/media_tools.sh convert-video input.mp4 -o minimax-output/output.mp4 \
--crf 18 --preset medium --resolution 1920x1080 --fps 30
```
### Audio Format Conversion
```bash
# Convert between formats (mp3, wav, flac, ogg, aac, m4a, opus, wma)
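bash scripts/media_tools.sh convert-audio input.wav -o minimax-output/output.mp3
# With bitrate / sample rate / channels (bitrate matters for lossy targets such as mp3; FLAC is lossless)
bash scripts/media_tools.sh convert-audio input.wav -o minimax-output/output_320k.mp3 \
--bitrate 320k --sample-rate 48000 --channels 2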
```
### Video Concatenation
```bash
# Concatenate with crossfade transition (default 0.5s)
bash scripts/media_tools.sh concat-video seg1.mp4 seg2.mp4 seg3.mp4 -o minimax-output/merged.mp4
# Hard cut (no crossfade)
bash scripts/media_tools.sh concat-video seg1.mp4 seg2.mp4 -o minimax-output/merged.mp4 --crossfade 0
```
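One common way to crossfade two clips with FFmpeg is the `xfade` filter, whose transition starts `offset` seconds into the first clip, so offset = first clip duration − crossfade. This is an assumption about technique, not a description of `media_tools.sh` internals; `xfade_filter` is a hypothetical helper.

```bash
# Hypothetical helper: build an ffmpeg filter_complex string that crossfades
# the video streams of two inputs.
xfade_filter() {
  local first_duration="$1" crossfade="$2" offset
  offset=$(awk -v d="$first_duration" -v x="$crossfade" 'BEGIN { printf "%.3f", d - x }')
  printf '[0:v][1:v]xfade=transition=fade:duration=%s:offset=%s[v]\n' "$crossfade" "$offset"
}
```

Usage, assuming both clips share resolution and frame rate (audio is not handled here): `ffmpeg -i seg1.mp4 -i seg2.mp4 -filter_complex "$(xfade_filter 10 0.5)" -map "[v]" minimax-output/merged.mp4`. The first clip's duration can be read with `ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 seg1.mp4`.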
### Audio Concatenation
```bash
# Simple concatenation
bash scripts/media_tools.sh concat-audio part1.mp3 part2.mp3 -o minimax-output/combined.mp3
# With crossfade
bash scripts/media_tools.sh concat-audio part1.mp3 part2.mp3 -o minimax-output/combined.mp3 --crossfade 1
```
### Extract Audio from Video
```bash
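# Extract as mp3
bash scripts/media_tools.sh extract-audio video.mp4 -o minimax-output/audio.mp3
# Extract as mp3 at a higher bitrate
bash scripts/media_tools.sh extract-audio video.mp4 -o minimax-output/audio_hq.mp3 --bitrate 320k
# Extract as wav (uncompressed PCM, so a bitrate setting does not apply)
bash scripts/media_tools.sh extract-audio video.mp4 -o minimax-output/audio.wav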
```
### Video Trimming
```bash
# Trim by start/end time (seconds)
bash scripts/media_tools.sh trim-video input.mp4 -o minimax-output/clip.mp4 --start 5 --end 15
# Trim by start + duration
bash scripts/media_tools.sh trim-video input.mp4 -o minimax-output/clip.mp4 --start 10 --duration 8
```
### Add Audio to Video (Overlay / Replace)
```bash
# Mix audio with existing video audio
bash scripts/media_tools.sh add-audio --video video.mp4 --audio bgm.mp3 -o minimax-output/output.mp4 \
--volume 0.3 --fade-in 2 --fade-out 3
# Replace original audio entirely
bash scripts/media_tools.sh add-audio --video video.mp4 --audio narration.mp3 -o minimax-output/output.mp4 \
--replace
```
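The mix mode's `--volume`, `--fade-in`, and `--fade-out` options correspond naturally to FFmpeg's `volume`, `afade`, and `amix` filters. The sketch below builds such a filter graph; it is an assumed equivalent, not taken from `media_tools.sh`, and the fade-out start needs the video's duration.

```bash
# Hypothetical helper: lower and fade the added track, then mix it under the
# video's original audio for the video's length.
bgm_mix_filter() {
  local volume="$1" fade_in="$2" fade_out="$3" video_duration="$4" fade_out_start
  fade_out_start=$(awk -v d="$video_duration" -v f="$fade_out" \
    'BEGIN { s = d - f; if (s < 0) s = 0; printf "%g", s }')
  printf '[1:a]volume=%s,afade=t=in:st=0:d=%s,afade=t=out:st=%s:d=%s[bgm];' \
    "$volume" "$fade_in" "$fade_out_start" "$fade_out"
  printf '[0:a][bgm]amix=inputs=2:duration=first[a]\n'
}
```

Usage, for a 10 s video that already has an audio stream: `ffmpeg -i video.mp4 -i bgm.mp3 -filter_complex "$(bgm_mix_filter 0.3 2 3 10)" -map 0:v -map "[a]" -c:v copy minimax-output/output.mp4`.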
### Media File Info
```bash
bash scripts/media_tools.sh probe input.mp4
```
## Script Architecture
```
scripts/
├── check_environment.sh            # Env verification (curl, ffmpeg, jq, xxd, API key)
├── media_tools.sh                  # Audio/video conversion, concat, trim, extract
├── tts/
│   └── generate_voice.sh           # Unified TTS CLI (tts, clone, design, list-voices, generate, merge, convert)
├── music/
│   └── generate_music.sh           # Music generation CLI
├── image/
│   └── generate_image.sh           # Image generation CLI (2 modes: t2i, i2i)
└── video/
    ├── generate_video.sh           # Video generation CLI (4 modes: t2v, i2v, sef, ref)
    ├── generate_long_video.sh      # Multi-scene long video
    ├── generate_template_video.sh  # Template-based video
    └── add_bgm.sh                  # Background music overlay
```
## References
Read these for detailed API parameters, voice catalogs, and prompt engineering:
- [tts-guide.md](references/tts-guide.md) — TTS setup, voice management, audio processing, segment format, troubleshooting
- [tts-voice-catalog.md](references/tts-voice-catalog.md) — Full voice catalog with IDs, descriptions, and parameter reference
- [music-api.md](references/music-api.md) — Music generation API: endpoints, parameters, response format
- [image-api.md](references/image-api.md) — Image generation API: text-to-image, image-to-image, parameters
- [video-api.md](references/video-api.md) — Video API: endpoints, models, parameters, camera instructions, templates
- [video-prompt-guide.md](references/video-prompt-guide.md) — Video prompt engineering: formulas, styles, image-to-video tips


@@ -0,0 +1,115 @@
# MiniMax Image Generation API (image-01)
Source: https://platform.minimaxi.com/docs/api-reference/image-generation-t2i and https://platform.minimaxi.com/docs/api-reference/image-generation-i2i
## Endpoint
`POST https://api.minimaxi.com/v1/image_generation` (China Mainland host; Global accounts use `https://api.minimax.io/v1/image_generation`)
## Auth
`Authorization: Bearer <MINIMAX_API_KEY>`
## Request (JSON)
Required:
- `model`: string — `image-01`
- `prompt`: string (max 1500 chars) — text description of the desired image
Optional:
- `aspect_ratio`: string — image aspect ratio, default `1:1`. Options:
  - `1:1` (1024×1024)
  - `16:9` (1280×720)
  - `4:3` (1152×864)
  - `3:2` (1248×832)
  - `2:3` (832×1248)
  - `3:4` (864×1152)
  - `9:16` (720×1280)
  - `21:9` (1344×576)
- `width`: integer — custom width in pixels. Range [512, 2048], must be multiple of 8. Overridden by `aspect_ratio` if both set.
- `height`: integer — custom height in pixels. Same rules as `width`. Both `width` and `height` must be set together.
- `response_format`: string — `url` (default, valid 24h) or `base64`
- `n`: integer (1 to 9, default 1) — number of images to generate
- `seed`: integer — random seed for reproducibility
- `prompt_optimizer`: boolean (default `false`) — enable automatic prompt optimization
- `aigc_watermark`: boolean (default `false`) — add AIGC watermark
### Subject Reference (image-to-image)
- `subject_reference`: array — character reference for image-to-image generation
  - `type`: string — currently only `character` (portrait)
  - `image_file`: string — reference image as public URL or Base64 Data URL (`data:image/jpeg;base64,...`). For best results, use a front-facing photo of a single person. Formats: JPG, JPEG, PNG. Max size: 10MB.
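For a local reference photo, `image_file` can be given as a Base64 Data URL. A minimal bash sketch, assuming 10MB means 10 × 1024 × 1024 bytes; `image_to_data_url` is a hypothetical helper, written to work with the bash 3.2 that ships with macOS.

```bash
# Hypothetical helper: turn a local JPG/JPEG/PNG into a Data URL for
# subject_reference[].image_file, enforcing the 10MB limit.
image_to_data_url() {
  local file="$1" lower mime size
  lower=$(printf '%s' "$file" | tr '[:upper:]' '[:lower:]')
  case "$lower" in
    *.jpg|*.jpeg) mime="image/jpeg" ;;
    *.png)        mime="image/png" ;;
    *) echo "unsupported format: $file (use JPG, JPEG or PNG)" >&2; return 1 ;;
  esac
  [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
  size=$(( $(wc -c < "$file") ))          # arithmetic strips wc's padding
  if [ "$size" -gt $((10 * 1024 * 1024)) ]; then
    echo "file too large: $size bytes (limit 10MB)" >&2
    return 1
  fi
  # GNU base64 wraps at 76 columns; strip newlines for a single-line URL.
  printf 'data:%s;base64,%s\n' "$mime" "$(base64 < "$file" | tr -d '\n')"
}
```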
## Example — Text-to-Image
```json
{
"model": "image-01",
"prompt": "A man in a white t-shirt, full-body, standing front view, outdoors, with the Venice Beach sign in the background, Los Angeles. Fashion photography in 90s documentary style, film grain, photorealistic.",
"aspect_ratio": "16:9",
"response_format": "url",
"n": 3,
"prompt_optimizer": true
}
```
## Example — Image-to-Image (Character Reference)
```json
{
"model": "image-01",
"prompt": "A girl looking into the distance from a library window",
"aspect_ratio": "16:9",
"subject_reference": [
{
"type": "character",
"image_file": "https://example.com/face.jpg"
}
],
"n": 2
}
```
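For a local reference photo, `image_file` can carry a Base64 Data URL instead of a public URL. The following sketch builds that string with the standard library. The helper names are hypothetical, and reading "10MB" as 10 × 1024 × 1024 bytes is an assumption.

```python
import base64
from pathlib import Path

# MIME types for the accepted formats (JPG, JPEG, PNG).
_MIME = {".jpg": "image/jpeg", ".jpeg": "image/jpeg", ".png": "image/png"}
MAX_BYTES = 10 * 1024 * 1024  # assumption: "10MB" means 10 MiB


def to_data_url(data: bytes, suffix: str) -> str:
    """Encode raw image bytes as a Data URL for `image_file` (hypothetical helper)."""
    mime = _MIME.get(suffix.lower())
    if mime is None:
        raise ValueError(f"unsupported image type: {suffix}")
    if len(data) > MAX_BYTES:
        raise ValueError("reference image exceeds 10MB")
    return f"data:{mime};base64,{base64.b64encode(data).decode('ascii')}"


def file_to_data_url(path: str) -> str:
    """Read a local JPG/PNG and return its Data URL."""
    p = Path(path)
    return to_data_url(p.read_bytes(), p.suffix)
```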
## Response
```json
{
"id": "03ff3cd0820949eb8a410056b5f21d38",
"data": {
"image_urls": ["https://...", "https://...", "https://..."],
"image_base64": null
},
"metadata": {
"success_count": 3,
"failed_count": 0
},
"base_resp": {
"status_code": 0,
"status_msg": "success"
}
}
```
- `data.image_urls`: array of image URLs (when `response_format` is `url`, valid 24h)
- `data.image_base64`: array of Base64 strings (when `response_format` is `base64`)
- `metadata.success_count`: number of successfully generated images
- `metadata.failed_count`: number of images blocked by content safety
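A client should check `base_resp.status_code` before reading `data`, and then take whichever of `image_urls` or `image_base64` is populated. The sketch below follows the response shape shown above. `MiniMaxError` and `extract_images` are illustrative names only.

```python
import base64
from typing import List, Union


class MiniMaxError(RuntimeError):
    """Raised when base_resp.status_code is non-zero (hypothetical error type)."""

    def __init__(self, code: int, msg: str):
        super().__init__(f"MiniMax error {code}: {msg}")
        self.code = code


def extract_images(resp: dict) -> List[Union[str, bytes]]:
    """Return image URLs, or decoded image bytes when response_format was base64."""
    base = resp.get("base_resp", {})
    code = base.get("status_code", -1)
    if code != 0:
        raise MiniMaxError(code, base.get("status_msg", ""))
    data = resp.get("data") or {}
    if data.get("image_urls"):
        return list(data["image_urls"])  # URL links expire after 24 hours
    return [base64.b64decode(s) for s in data.get("image_base64") or []]
```

If `metadata.failed_count` is non-zero, fewer items than `n` come back. Callers should not assume the list length equals `n`.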
## Status Codes
| Code | Meaning |
|------|---------|
| 0 | Success |
| 1002 | Rate limited, retry later |
| 1004 | Auth failed, check API key |
| 1008 | Insufficient balance |
| 1026 | Prompt contains sensitive content |
| 2013 | Invalid parameters |
| 2049 | Invalid API key |
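In this table, only 1002 (rate limited) says to retry later. The other codes call for a fix such as a new key, more balance, or a different prompt. The sketch below wraps a caller-supplied `send` function with exponential backoff. `send`, `call_with_retry` and the 1 s base delay are assumptions, not documented behavior.

```python
import time
from typing import Callable, Dict

RETRYABLE = {1002}  # rate limited; the other documented codes need a fix, not a retry


def call_with_retry(send: Callable[[], Dict], attempts: int = 4,
                    base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> Dict:
    """Call `send` (hypothetical: one HTTP request returning parsed JSON) and
    retry with exponential backoff while base_resp.status_code is retryable."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        resp = send()
        code = resp.get("base_resp", {}).get("status_code")
        if code not in RETRYABLE or attempt == attempts - 1:
            return resp
        sleep(base_delay * (2 ** attempt))  # waits 1s, 2s, 4s, ... by default
    raise AssertionError("unreachable")
```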
## Notes
- The API is synchronous — images are returned directly in the response (no polling needed).
- URL format image links expire after 24 hours.
- For image-to-image: upload a single front-facing portrait for best character reference results.
- `width`/`height` are overridden by `aspect_ratio` if both provided.

@@ -0,0 +1,57 @@
# MiniMax Music Generation API (music-2.5)
Source: https://platform.minimaxi.com/docs/api-reference/music-generation
## Endpoint
`POST https://api.minimaxi.com/v1/music_generation`
## Auth
`Authorization: Bearer <MINIMAX_API_KEY>`
## Request (JSON)
Required:
- `model`: string — `music-2.5`
- `lyrics`: string, 1 to 3500 characters. Use `\n` for line breaks and structure tags such as `[Verse]`, `[Chorus]`, `[Bridge]`, `[Intro]`, `[Outro]`.
Optional:
- `prompt`: string, 0 to 2000 characters. Style description; optional but recommended.
- `lyrics_optimizer`: boolean. Auto-generates lyrics from `prompt` when `lyrics` is empty.
- `stream`: boolean (default `false`)
- `output_format`: `hex` (default) or `url`. URL valid for 24 hours.
- `aigc_watermark`: boolean — top-level field, non-streaming only.
- `audio_setting`:
- `sample_rate`: 16000, 24000, 32000, 44100
- `bitrate`: 32000, 64000, 128000, 256000
- `format`: mp3, wav, pcm
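The limits above can be enforced client-side with a small builder. In this minimal sketch, `build_music_request` is a hypothetical name. The sketch does not model the `lyrics_optimizer` case with empty lyrics. It does enforce the note that streaming supports only `hex` output.

```python
from typing import Any, Dict, Optional

SAMPLE_RATES = {16000, 24000, 32000, 44100}
BITRATES = {32000, 64000, 128000, 256000}
FORMATS = {"mp3", "wav", "pcm"}


def build_music_request(lyrics: str, prompt: Optional[str] = None,
                        sample_rate: int = 44100, bitrate: int = 256000,
                        fmt: str = "mp3", output_format: str = "hex",
                        stream: bool = False) -> Dict[str, Any]:
    """Assemble a music-2.5 request body, checking the documented limits (hypothetical helper)."""
    if not 1 <= len(lyrics) <= 3500:
        raise ValueError("lyrics must be 1 to 3500 characters")
    if prompt is not None and len(prompt) > 2000:
        raise ValueError("prompt must be at most 2000 characters")
    if sample_rate not in SAMPLE_RATES or bitrate not in BITRATES or fmt not in FORMATS:
        raise ValueError("unsupported audio_setting value")
    if output_format not in ("hex", "url"):
        raise ValueError("output_format must be 'hex' or 'url'")
    if stream and output_format != "hex":
        raise ValueError("stream=true only supports hex output")
    body: Dict[str, Any] = {
        "model": "music-2.5",
        "lyrics": lyrics,
        "stream": stream,
        "output_format": output_format,
        "audio_setting": {"sample_rate": sample_rate, "bitrate": bitrate, "format": fmt},
    }
    if prompt:
        body["prompt"] = prompt
    return body
```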
## Example
```json
{
"model": "music-2.5",
"prompt": "indie folk, melancholic, introspective",
"lyrics": "[verse]\n...\n[chorus]\n...",
"aigc_watermark": false,
"audio_setting": {
"sample_rate": 44100,
"bitrate": 256000,
"format": "mp3"
}
}
```
## Response
- `data.audio`: hex string or URL depending on `output_format`
- `data.status`: 1 (generating), 2 (complete)
- `extra_info`: duration, sample_rate, channels, bitrate, size
- `base_resp.status_code`: 0 on success
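With the default `hex` output, `data.audio` is a hex-encoded audio file that decodes directly to bytes. The sketch below is illustrative. It assumes that a finished non-streaming response carries `data.status == 2`, and it treats a value starting with `http` as a URL that should be downloaded instead. The function names are hypothetical.

```python
from pathlib import Path


def audio_bytes_from_response(resp: dict) -> bytes:
    """Return audio bytes from a non-streaming hex response (hypothetical helper)."""
    if resp.get("base_resp", {}).get("status_code") != 0:
        raise RuntimeError(f"music generation failed: {resp.get('base_resp')}")
    data = resp["data"]
    if data.get("status") != 2:
        raise RuntimeError("generation not complete (data.status != 2)")
    audio = data["audio"]
    if audio.startswith("http"):
        raise ValueError("output_format was 'url'; download the URL instead")
    return bytes.fromhex(audio)


def save_audio(resp: dict, path: str) -> int:
    """Write the decoded audio, e.g. to minimax-output/song.mp3; returns bytes written."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    return out.write_bytes(audio_bytes_from_response(resp))
```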
## Notes
- `music-2.5` does not support `is_instrumental`. For instrumental music, pass only structure tags as the lyrics (for example `[intro] [outro]`) and add `pure music, no lyrics` to the prompt.
- `prompt` is optional but recommended for better style control.
- `stream=true` only supports `hex` output.
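The instrumental workaround in the first note can be packaged as a tiny helper. This is a sketch only. `instrumental_overrides` is a hypothetical name, and the exact wording appended to the prompt is taken from the note above.

```python
def instrumental_overrides(style: str) -> dict:
    """music-2.5 has no is_instrumental flag, so send structure tags only as lyrics
    and ask for no vocals in the prompt (hypothetical helper)."""
    style = style.strip().rstrip(",")
    prompt = f"{style}, pure music, no lyrics" if style else "pure music, no lyrics"
    return {"model": "music-2.5", "lyrics": "[intro] [outro]", "prompt": prompt}
```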

@@ -0,0 +1,111 @@
# TTS Guide
## Setup
```bash
cd skills/MiniMaxStudio
pip install -r requirements.txt
brew install ffmpeg # macOS (or: sudo apt install ffmpeg)
export MINIMAX_API_KEY="your-api-key" # sk-api-xxx or sk-cp-xxx
python scripts/check_environment.py
```
## Quick Test
```bash
python scripts/tts/generate_voice.py tts "Hello, this is a test." -o test.mp3
```
## Voice Management
List available voices:
```bash
python scripts/tts/generate_voice.py list-voices
```
### Voice Cloning
Create a custom voice from an audio sample:
```bash
python scripts/tts/generate_voice.py clone audio.mp3 --voice-id my-custom-voice
# With preview
python scripts/tts/generate_voice.py clone audio.mp3 --voice-id my-voice --preview "Test text" --preview-output preview.mp3
```
Requirements: 10 s to 5 min long, 20MB or smaller, mp3/wav/m4a format.
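A sample can be checked against these requirements before upload. The sketch below reads the duration with `ffprobe`, which ships with FFmpeg, and collects every violation. The helper names are hypothetical, and reading "20MB" as 20 MiB is an assumption.

```python
import subprocess
from pathlib import Path
from typing import List

ALLOWED = {".mp3", ".wav", ".m4a"}
MAX_SIZE = 20 * 1024 * 1024  # assumption: "20MB" means 20 MiB


def probe_duration(path: str) -> float:
    """Read the duration in seconds with ffprobe (FFmpeg must be on PATH)."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True)
    return float(out.stdout.strip())


def clone_sample_problems(name: str, duration_s: float, size_bytes: int) -> List[str]:
    """Return the reasons a sample fails the cloning requirements (empty list = OK)."""
    problems = []
    if Path(name).suffix.lower() not in ALLOWED:
        problems.append("format must be mp3, wav, or m4a")
    if not 10 <= duration_s <= 300:
        problems.append("duration must be between 10 s and 5 min")
    if size_bytes > MAX_SIZE:
        problems.append("file must be 20MB or smaller")
    return problems
```

Typical use: `clone_sample_problems(p, probe_duration(p), Path(p).stat().st_size)`.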
### Voice Design
Design a voice from a text description:
```bash
python scripts/tts/generate_voice.py design "A warm, gentle female voice" --voice-id designed-voice
```
Custom voices expire after 7 days if not used with TTS.
## Audio Processing
### Merge
```bash
python scripts/tts/generate_voice.py merge file1.mp3 file2.mp3 -o combined.mp3
python scripts/tts/generate_voice.py merge a.mp3 b.mp3 -o merged.mp3 --crossfade 300
```
### Convert
```bash
python scripts/tts/generate_voice.py convert input.wav -o output.mp3
python scripts/tts/generate_voice.py convert input.wav -o output.mp3 --format mp3 --bitrate 192k --sample-rate 32000
```
FFmpeg required. Supported formats: mp3, wav, flac, ogg, m4a, aac, wma, opus, pcm.
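This guide does not show how `convert` calls FFmpeg internally. A comparable ffmpeg invocation can be sketched as follows. The helper name is hypothetical, and this is not the script's actual code. `-b:a` sets the audio bitrate and `-ar` sets the sample rate. Raw PCM has no container, so the sketch names the `s16le` sample format explicitly.

```python
from typing import List, Optional

# Output formats listed for the `convert` subcommand.
SUPPORTED = {"mp3", "wav", "flac", "ogg", "m4a", "aac", "wma", "opus", "pcm"}


def ffmpeg_convert_cmd(src: str, dst: str, bitrate: Optional[str] = None,
                       sample_rate: Optional[int] = None) -> List[str]:
    """Build an ffmpeg argv for an audio format conversion (hypothetical helper)."""
    ext = dst.rsplit(".", 1)[-1].lower()
    if ext not in SUPPORTED:
        raise ValueError(f"unsupported output format: {ext}")
    cmd = ["ffmpeg", "-y", "-i", src]       # -y: overwrite dst without prompting
    if bitrate:
        cmd += ["-b:a", bitrate]            # audio bitrate, e.g. "192k"
    if sample_rate:
        cmd += ["-ar", str(sample_rate)]    # output sample rate in Hz
    if ext == "pcm":
        # Raw PCM has no container, so state the sample format explicitly.
        cmd += ["-f", "s16le", "-acodec", "pcm_s16le"]
    cmd.append(dst)
    return cmd
```

Run it with `subprocess.run(ffmpeg_convert_cmd("input.wav", "output.mp3", "192k", 32000), check=True)`.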
## Segment-Based TTS
For multi-voice, multi-emotion workflows using a `segments.json` file:
```bash
# Validate
python scripts/tts/generate_voice.py validate segments.json --verbose
# Generate
python scripts/tts/generate_voice.py generate segments.json -o output.mp3 --crossfade 200
```
### segments.json Format
```json
[
{ "text": "Hello!", "voice_id": "female-shaonv", "emotion": "" },
{ "text": "How are you?", "voice_id": "male-qn-qingse", "emotion": "happy" }
]
```
- `text` (required): Text to synthesize
- `voice_id` (required): Voice ID
- `emotion` (optional): one of happy, sad, angry, fearful, disgusted, surprised, calm, fluent, whisper. With speech-2.8 models, leave it empty to let the model match the emotion automatically.
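The `validate` subcommand checks a segments file before generation. The sketch below applies the field rules listed above with the standard `json` module. It is an independent illustration, not the script's implementation, and `validate_segments` is a hypothetical name.

```python
import json
from typing import List

EMOTIONS = {"happy", "sad", "angry", "fearful", "disgusted", "surprised",
            "calm", "fluent", "whisper"}


def validate_segments(raw: str) -> List[str]:
    """Check a segments.json document against the field rules; return error strings."""
    try:
        segments = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(segments, list):
        return ["top level must be a JSON array"]
    errors: List[str] = []
    for i, seg in enumerate(segments):
        if not isinstance(seg, dict):
            errors.append(f"segment {i}: must be an object")
            continue
        # text and voice_id are required, non-empty strings.
        for key in ("text", "voice_id"):
            if not isinstance(seg.get(key), str) or not seg[key].strip():
                errors.append(f"segment {i}: '{key}' is required")
        # An empty emotion means auto-matching; anything else must be a known value.
        emotion = seg.get("emotion", "")
        if emotion and emotion not in EMOTIONS:
            errors.append(f"segment {i}: unknown emotion '{emotion}'")
    return errors
```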
## Troubleshooting
| Error | Solution |
|-------|----------|
| `MINIMAX_API_KEY is required` | `export MINIMAX_API_KEY="key"` |
| `FFmpeg not installed` | `brew install ffmpeg` (macOS) or `sudo apt install ffmpeg` (Debian/Ubuntu) |
| `Voice not found` | `python scripts/tts/generate_voice.py list-voices` |
| `401 Unauthorized` | Check API key validity |
| `429 Too Many Requests` | Add delays between requests |
## API Details
- **Endpoint**: `POST /v1/t2a_v2`
- **Base URL**: `https://api.minimaxi.com`
- **Auth**: `Authorization: Bearer {MINIMAX_API_KEY}`
- **Models**: speech-2.8-hd (recommended), speech-2.8-turbo, speech-2.6-hd, speech-2.6-turbo, speech-02-hd, speech-02-turbo, speech-01-hd, speech-01-turbo
- **Text limit**: 10,000 characters per request
- **Pause marker**: `<#x#>`, where x is the pause length in seconds (0.01 to 99.99)
- **Interjection tags** (speech-2.8 only): `(laughs)`, `(chuckle)`, `(coughs)`, `(sighs)`, `(breath)`, etc.
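Two of these limits shape how long scripts are prepared: the pause marker syntax and the 10,000-character cap per request. The sketch below formats a pause marker and splits text into request-sized chunks, preferring sentence boundaries. It is a hypothetical helper. It does not account for whether markers count toward the limit, which this guide does not state.

```python
from typing import List

MAX_CHARS = 10_000  # text limit per t2a_v2 request


def pause(seconds: float) -> str:
    """Return a pause marker such as <#1.5#>; valid range is 0.01 to 99.99 seconds."""
    if not 0.01 <= seconds <= 99.99:
        raise ValueError("pause must be between 0.01 and 99.99 seconds")
    return f"<#{round(seconds, 2):g}#>"


def chunk_text(text: str, limit: int = MAX_CHARS) -> List[str]:
    """Split text into pieces of at most `limit` characters, preferring sentence ends."""
    chunks: List[str] = []
    while len(text) > limit:
        # Last sentence-ending position that fits entirely inside the limit.
        cut = max(text.rfind(p, 0, limit) for p in (". ", "! ", "? ", "\n", "。"))
        cut = cut + 1 if cut > 0 else limit  # no boundary found: hard cut
        piece = text[:cut].strip()
        if piece:
            chunks.append(piece)
        text = text[cut:]
    if text.strip():
        chunks.append(text.strip())
    return chunks
```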

@@ -0,0 +1,543 @@
# TTS Voice Catalog
## Contents
- [Voice Selection Guide](#voice-selection-guide)
- [System Voices by Language](#system-voices-by-language)
- [Voice Parameters](#voice-parameters)
- [Custom Voices](#custom-voices)
---
## Voice Selection Guide
### Decision Flow
```
Content type?
├── Narration / Audiobook → audiobook_female_1, audiobook_male_1
├── News / Announcement → Chinese (Mandarin)_News_Anchor, Chinese (Mandarin)_Male_Announcer
├── Documentary → doc_commentary
└── Other → Select by: Gender → Age → Language → Personality
```
### Recommended Professional Voices
| Scenario | Recommended | Characteristics |
|----------|-------------|-----------------|
| Narration / Audiobook | `audiobook_female_1`, `audiobook_male_1` | Clear articulation, good pacing, sustained performance |
| News / Announcement | `Chinese (Mandarin)_News_Anchor`, `Chinese (Mandarin)_Male_Announcer` | Authoritative, professional pacing |
| Documentary | `doc_commentary` | Professional, clear, consistent |
### Selection Priority
1. **Gender** (mandatory match) — male voices for male characters, female for female
2. **Age** — Children / Youth / Adult / Elderly
3. **Language** (must match content language)
4. **Personality/tone** — choose best fit from matching candidates
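The decision flow and selection priority above can be expressed as a filter over catalog records. The sketch below uses a handful of rows copied from the tables in this file. The `Voice` record, the `zh`/`en` language labels and the function names are assumptions for illustration. A real tool would load the full catalog.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Voice:
    voice_id: str
    gender: str    # M / F / N
    age: str       # C / Y / A / E
    language: str  # illustrative language label
    tags: str      # description and best-for text, lower-cased for matching


# A few rows from the catalog; illustrative subset only.
CATALOG = [
    Voice("audiobook_male_1", "M", "A", "zh", "warm, engaging narrator audiobooks, stories"),
    Voice("audiobook_female_1", "F", "A", "zh", "gentle, expressive narrator audiobooks, stories"),
    Voice("English_Trustworthy_Man", "M", "A", "en", "reliable, sincere business, narration"),
    Voice("English_Graceful_Lady", "F", "A", "en", "elegant, refined formal, professional"),
    Voice("English_Sweet_Girl", "F", "C", "en", "sweet, innocent children's, friendly"),
]

# Professional picks from the decision flow.
RECOMMENDED = {
    "narration": ["audiobook_female_1", "audiobook_male_1"],
    "news": ["Chinese (Mandarin)_News_Anchor", "Chinese (Mandarin)_Male_Announcer"],
    "documentary": ["doc_commentary"],
}


def pick_voices(gender: str, language: str, age: Optional[str] = None,
                keyword: str = "") -> List[str]:
    """Filter in the documented priority order: gender, age, language, then tone."""
    hits = [v for v in CATALOG if v.gender == gender]       # 1. gender must match
    if age:
        hits = [v for v in hits if v.age == age]            # 2. age bracket
    hits = [v for v in hits if v.language == language]      # 3. language must match
    if keyword:                                             # 4. keyword matches rank first
        hits.sort(key=lambda v: keyword.lower() not in v.tags)
    return [v.voice_id for v in hits]


def recommend(content_type: str, **criteria) -> List[str]:
    """Use the fixed professional picks first, else fall back to the priority filter."""
    return RECOMMENDED.get(content_type) or pick_voices(**criteria)
```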
---
## System Voices by Language
Gender: M = Male, F = Female, N = Neutral/Character
Age: C = Child, Y = Youth, A = Adult, E = Elder
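The tables below share one column layout, so each row can be parsed into a record that expands the G and Age codes in this legend. The regex-based sketch is hypothetical. It returns `None` for header and separator rows, which contain no backquoted voice_id.

```python
import re
from typing import Dict, Optional

GENDERS = {"M": "male", "F": "female", "N": "neutral/character"}
AGES = {"C": "child", "Y": "youth", "A": "adult", "E": "elder"}
# First cell holds the backquoted voice_id; the remaining cells are captured as `rest`.
ROW = re.compile(r"^\|\s*`(?P<voice_id>[^`]+)`\s*\|(?P<rest>.*)\|\s*$")


def parse_voice_row(line: str) -> Optional[Dict[str, str]]:
    """Parse one catalog table row into a dict; return None for headers/separators."""
    m = ROW.match(line)
    if not m:
        return None
    cells = [c.strip() for c in m.group("rest").split("|")]
    if len(cells) != 5:
        return None
    name, g, age, desc, best = cells
    return {
        "voice_id": m.group("voice_id"),
        "name": name,
        "gender": GENDERS.get(g, g),
        "age": AGES.get(age, age),
        "description": desc,
        "best_for": best,
    }
```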
### Chinese Mandarin (普通话)
| voice_id | Name | G | Age | Description | Best For |
|----------|------|---|-----|-------------|----------|
| `male-qn-qingse` | 青涩青年 | M | Y | Youthful, inexperienced | Campus, coming-of-age |
| `male-qn-badao` | 霸道青年 | M | Y | Arrogant, dominant | Drama, romance |
| `male-qn-daxuesheng` | 青年大学生 | M | Y | University student | Campus, educational |
| `male-qn-jingying` | 精英青年 | M | A | Elite, ambitious | Business, professional |
| `female-shaonv` | 少女 | F | Y | Young maiden | Romance, youth |
| `female-yujie` | 御姐 | F | A | Mature, elegant | Romance, professional |
| `female-chengshu` | 成熟女性 | F | A | Mature, reliable | Sophisticated, news |
| `female-tianmei` | 甜美女性 | F | A | Sweet, pleasant | Soft, gentle |
| `clever_boy` | 聪明男童 | M | C | Smart, witty | Children's, educational |
| `cute_boy` | 可爱男童 | M | C | Adorable | Kids, animations |
| `lovely_girl` | 萌萌女童 | F | C | Cute, sweet | Children's stories |
| `cartoon_pig` | 卡通猪小琪 | N | C | Cartoon character | Animations, comedy |
| `bingjiao_didi` | 病娇弟弟 | M | Y | Yandere younger brother | Romance, character |
| `junlang_nanyou` | 俊朗男友 | M | Y | Handsome boyfriend | Romance, dating |
| `chunzhen_xuedi` | 纯真学弟 | M | Y | Innocent junior | Campus, youth |
| `lengdan_xiongzhang` | 冷淡学长 | M | Y | Cool senior | Campus, romance |
| `badao_shaoye` | 霸道少爷 | M | A | Arrogant young master | Drama, character |
| `tianxin_xiaoling` | 甜心小玲 | F | Y | Sweet Xiao Ling | Character, animations |
| `qiaopi_mengmei` | 俏皮萌妹 | F | Y | Playful cute girl | Comedy, light-hearted |
| `wumei_yujie` | 妩媚御姐 | F | A | Charming mature woman | Romance, mature |
| `diadia_xuemei` | 嗲嗲学妹 | F | Y | Flirty junior girl | Romance, dating |
| `danya_xuejie` | 淡雅学姐 | F | Y | Elegant senior girl | Campus, romance |
| `Arrogant_Miss` | 嚣张小姐 | F | A | Arrogant young lady | Drama, character |
| `Robot_Armor` | 机械战甲 | N | A | Robotic armor | Sci-fi, games |
| `audiobook_male_1` | 有声书男1 | M | A | Warm, engaging narrator | Audiobooks, stories |
| `audiobook_female_1` | 有声书女1 | F | A | Gentle, expressive narrator | Audiobooks, stories |
| `doc_commentary` | 纪录片解说 | M | A | Professional narrator | Documentary |
| `Chinese (Mandarin)_News_Anchor` | 新闻女声 | F | A | News anchor | News, broadcasts |
| `Chinese (Mandarin)_Male_Announcer` | 播报男声 | M | A | Male announcer | Announcements |
| `Chinese (Mandarin)_Radio_Host` | 电台男主播 | M | A | Radio host | Podcasts, radio |
| `Chinese (Mandarin)_Reliable_Executive` | 沉稳高管 | M | A | Reliable executive | Corporate, business |
| `Chinese (Mandarin)_Gentleman` | 温润男声 | M | A | Gentle, refined | Narration, storytelling |
| `Chinese (Mandarin)_Unrestrained_Young_Man` | 不羁青年 | M | Y | Unrestrained, casual | Entertainment |
| `Chinese (Mandarin)_Southern_Young_Man` | 南方小哥 | M | Y | Southern accent | Regional, casual |
| `Chinese (Mandarin)_Gentle_Youth` | 温润青年 | M | Y | Gentle young man | Narration, calm |
| `Chinese (Mandarin)_Sincere_Adult` | 真诚青年 | M | Y | Sincere, genuine | Honest, genuine |
| `Chinese (Mandarin)_Straightforward_Boy` | 率真弟弟 | M | Y | Frank, direct | Casual, direct |
| `Chinese (Mandarin)_Pure-hearted_Boy` | 清澈邻家弟弟 | M | Y | Pure-hearted neighbor | Innocent, wholesome |
| `Chinese (Mandarin)_Stubborn_Friend` | 嘴硬竹马 | M | Y | Stubborn childhood friend | Drama, character |
| `Chinese (Mandarin)_Lyrical_Voice` | 抒情男声 | M | A | Lyrical, singing | Music, singing |
| `Chinese (Mandarin)_Mature_Woman` | 傲娇御姐 | F | A | Tsundere mature woman | Romance, character |
| `Chinese (Mandarin)_Sweet_Lady` | 甜美女声 | F | A | Sweet lady | Soft, gentle |
| `Chinese (Mandarin)_Warm_Bestie` | 温暖闺蜜 | F | A | Warm bestie | Friendly, supportive |
| `Chinese (Mandarin)_Warm_Girl` | 温暖少女 | F | Y | Warm young girl | Friendly, supportive |
| `Chinese (Mandarin)_Soft_Girl` | 柔和少女 | F | Y | Soft, gentle | Calm, soothing |
| `Chinese (Mandarin)_Crisp_Girl` | 清脆少女 | F | Y | Crisp, clear | Bright, clear |
| `Chinese (Mandarin)_Gentle_Senior` | 温柔学姐 | F | Y | Gentle senior girl | Campus, supportive |
| `Chinese (Mandarin)_Wise_Women` | 阅历姐姐 | F | A | Experienced, wise | Advice, guidance |
| `Chinese (Mandarin)_HK_Flight_Attendant` | 港普空姐 | F | A | HK accent flight attendant | Regional, entertainment |
| `Chinese (Mandarin)_Cute_Spirit` | 憨憨萌兽 | N | C | Cute cartoon spirit | Animations, children's |
| `Chinese (Mandarin)_Humorous_Elder` | 搞笑大爷 | M | E | Humorous old man | Comedy, entertainment |
| `Chinese (Mandarin)_Kind-hearted_Elder` | 花甲奶奶 | F | E | Kind elderly lady | Stories, warm |
| `Chinese (Mandarin)_Kind-hearted_Antie` | 热心大婶 | F | E | Kind-hearted auntie | Warm, friendly |
### Chinese Cantonese (粤语)
| voice_id | Name | G | Age | Description | Best For |
|----------|------|---|-----|-------------|----------|
| `Cantonese_ProfessionalHostF)` | 专业女主持 | F | A | Professional host | Broadcasts, hosting |
| `Cantonese_GentleLady` | 温柔女声 | F | A | Gentle female | Soft, warm |
| `Cantonese_ProfessionalHostM)` | 专业男主持 | M | A | Professional host | Broadcasts, hosting |
| `Cantonese_PlayfulMan` | 活泼男声 | M | A | Playful male | Entertainment, casual |
| `Cantonese_CuteGirl` | 可爱女孩 | F | C | Cute girl | Children's, animations |
| `Cantonese_KindWoman` | 善良女声 | F | A | Kind female | Warm, friendly |
### English
| voice_id | Name | G | Age | Description | Best For |
|----------|------|---|-----|-------------|----------|
| `English_Trustworthy_Man` | Trustworthy Man | M | A | Reliable, sincere | Business, narration |
| `English_Graceful_Lady` | Graceful Lady | F | A | Elegant, refined | Formal, professional |
| `English_Aussie_Bloke` | Aussie Bloke | M | A | Casual Australian | Casual, entertainment |
| `English_Whispering_girl` | Whispering Girl | F | Y | Soft whisper | Romance, intimate |
| `English_Diligent_Man` | Diligent Man | M | A | Earnest, hardworking | Motivational, educational |
| `English_Gentle-voiced_man` | Gentle-voiced Man | M | E | Soft-spoken, kind | Calm, supportive |
| `English_Sweet_Girl` | Sweet Girl | F | C | Sweet, innocent | Children's, friendly |
| `Charming_Lady` | Charming Lady | F | A | Elegant, sophisticated | Professional, romance |
| `Attractive_Girl` | Attractive Girl | F | Y | Engaging female | Entertainment, marketing |
| `Serene_Woman` | Serene Woman | F | A | Calm, peaceful | Meditation, relaxation |
| `Santa_Claus` | Santa Claus | M | E | Festive, jolly | Holiday, children's |
| `Charming_Santa` | Charming Santa | M | E | Smooth, charismatic | Holiday, entertainment |
| `Grinch` | Grinch | M | A | Whiny, mischievous | Comedy, holiday |
| `Rudolph` | Rudolph | N | C | Cute, nasal reindeer | Children's, holiday |
| `Arnold` | Arnold | M | A | Deep, robotic | Sci-fi, action |
| `Cute_Elf` | Cute Elf | N | C | Playful, tiny elf | Fantasy, children's |
### Japanese (日本語)
| voice_id | Name | G | Age | Description | Best For |
|----------|------|---|-----|-------------|----------|
| `Japanese_IntellectualSenior` | Intellectual Senior | M | E | Wise, knowledgeable | Narration, educational |
| `Japanese_DecisivePrincess` | Decisive Princess | F | A | Confident, royal | Animation, games |
| `Japanese_LoyalKnight` | Loyal Knight | M | A | Brave, faithful | Fantasy, games |
| `Japanese_DominantMan` | Dominant Man | M | A | Powerful, commanding | Action, leadership |
| `Japanese_SeriousCommander` | Serious Commander | M | A | Stern, authoritative | Military, games |
| `Japanese_ColdQueen` | Cold Queen | F | A | Distant, majestic | Drama, fantasy |
| `Japanese_DependableWoman` | Dependable Woman | F | A | Reliable, supportive | Guidance |
| `Japanese_GentleButler` | Gentle Butler | M | A | Polite, refined | Comedy, animation |
| `Japanese_KindLady` | Kind Lady | F | A | Warm, gentle | Comforting |
| `Japanese_CalmLady` | Calm Lady | F | A | Composed, serene | Meditation, relaxation |
| `Japanese_OptimisticYouth` | Optimistic Youth | M | Y | Cheerful, positive | Youth, motivation |
| `Japanese_GenerousIzakayaOwner` | Generous Izakaya Owner | M | A | Friendly, welcoming | Casual, comedy |
| `Japanese_SportyStudent` | Sporty Student | M | Y | Energetic, athletic | Sports, youth |
| `Japanese_InnocentBoy` | Innocent Boy | M | C | Pure, naive | Children's |
| `Japanese_GracefulMaiden` | Graceful Maiden | F | Y | Elegant, gentle | Romance, drama |
### Korean (한국어)
| voice_id | Name | G | Age | Description | Best For |
|----------|------|---|-----|-------------|----------|
| `Korean_SweetGirl` | Sweet Girl | F | C | Sweet, adorable | Children's, romance |
| `Korean_CheerfulBoyfriend` | Cheerful Boyfriend | M | Y | Energetic, loving | Romance, dating |
| `Korean_EnchantingSister` | Enchanting Sister | F | A | Charming, captivating | Family, drama |
| `Korean_ShyGirl` | Shy Girl | F | Y | Timid, reserved | Comedy, romance |
| `Korean_ReliableSister` | Reliable Sister | F | A | Trustworthy, dependable | Guidance |
| `Korean_StrictBoss` | Strict Boss | M | A | Authoritative, demanding | Business, drama |
| `Korean_SassyGirl` | Sassy Girl | F | Y | Bold, witty | Comedy, entertainment |
| `Korean_ChildhoodFriendGirl` | Childhood Friend Girl | F | Y | Familiar, friendly | Romance, nostalgia |
| `Korean_PlayboyCharmer` | Playboy Charmer | M | A | Smooth, flirtatious | Romance, entertainment |
| `Korean_ElegantPrincess` | Elegant Princess | F | A | Graceful, royal | Animation, fantasy |
| `Korean_BraveFemaleWarrior` | Brave Female Warrior | F | A | Courageous | Action, fantasy |
| `Korean_BraveYouth` | Brave Youth | M | Y | Heroic | Action, youth |
| `Korean_CalmLady` | Calm Lady | F | A | Composed, serene | Meditation, relaxation |
| `Korean_EnthusiasticTeen` | Enthusiastic Teen | M | Y | Excited, energetic | Youth |
| `Korean_SoothingLady` | Soothing Lady | F | A | Calming, comforting | Relaxation |
| `Korean_IntellectualSenior` | Intellectual Senior | M | E | Wise, knowledgeable | Educational, narration |
| `Korean_LonelyWarrior` | Lonely Warrior | M | A | Solitary, melancholic | Drama, fantasy |
| `Korean_MatureLady` | Mature Lady | F | A | Sophisticated | Professional, drama |
| `Korean_InnocentBoy` | Innocent Boy | M | C | Pure, naive | Children's |
| `Korean_CharmingSister` | Charming Sister | F | A | Attractive, delightful | Family, romance |
| `Korean_AthleticStudent` | Athletic Student | M | Y | Sporty, energetic | Sports, youth |
| `Korean_BraveAdventurer` | Brave Adventurer | M | A | Courageous explorer | Adventure, fantasy |
| `Korean_CalmGentleman` | Calm Gentleman | M | A | Composed, refined | Formal, professional |
| `Korean_WiseElf` | Wise Elf | M | E | Ancient, mystical | Fantasy, narration |
| `Korean_CheerfulCoolJunior` | Cheerful Cool Junior | M | Y | Popular, friendly | Youth, entertainment |
| `Korean_DecisiveQueen` | Decisive Queen | F | A | Commanding | Drama, fantasy |
| `Korean_ColdYoungMan` | Cold Young Man | M | Y | Distant, aloof | Drama, romance |
| `Korean_MysteriousGirl` | Mysterious Girl | F | Y | Enigmatic, secretive | Mystery, drama |
| `Korean_QuirkyGirl` | Quirky Girl | F | Y | Eccentric, unique | Comedy |
| `Korean_ConsiderateSenior` | Considerate Senior | M | E | Thoughtful, caring | Warm, supportive |
| `Korean_CheerfulLittleSister` | Cheerful Little Sister | F | C | Playful, adorable | Family, comedy |
| `Korean_DominantMan` | Dominant Man | M | A | Powerful, commanding | Leadership, action |
| `Korean_AirheadedGirl` | Airheaded Girl | F | Y | Bubbly, spacey | Comedy |
| `Korean_ReliableYouth` | Reliable Youth | M | Y | Trustworthy, dependable | Supportive |
| `Korean_FriendlyBigSister` | Friendly Big Sister | F | A | Warm, protective | Family, support |
| `Korean_GentleBoss` | Gentle Boss | M | A | Kind, understanding | Business |
| `Korean_ColdGirl` | Cold Girl | F | Y | Aloof, distant | Drama, romance |
| `Korean_HaughtyLady` | Haughty Lady | F | A | Arrogant, proud | Drama, comedy |
| `Korean_CharmingElderSister` | Charming Elder Sister | F | A | Graceful | Romance, family |
| `Korean_IntellectualMan` | Intellectual Man | M | A | Smart, knowledgeable | Educational |
| `Korean_CaringWoman` | Caring Woman | F | A | Nurturing | Supportive, warm |
| `Korean_WiseTeacher` | Wise Teacher | M | E | Experienced | Educational |
| `Korean_ConfidentBoss` | Confident Boss | M | A | Self-assured, capable | Business, leadership |
| `Korean_AthleticGirl` | Athletic Girl | F | Y | Sporty, energetic | Sports, fitness |
| `Korean_PossessiveMan` | Possessive Man | M | A | Intense, protective | Romance, drama |
| `Korean_GentleWoman` | Gentle Woman | F | A | Soft-spoken, kind | Calm |
| `Korean_CockyGuy` | Cocky Guy | M | Y | Confident, arrogant | Comedy |
| `Korean_ThoughtfulWoman` | Thoughtful Woman | F | A | Reflective, caring | Drama |
| `Korean_OptimisticYouth` | Optimistic Youth | M | Y | Positive, hopeful | Motivation |
### Spanish (Español)
| voice_id | Name | G | Age | Description | Best For |
|----------|------|---|-----|-------------|----------|
| `Spanish_Narrator` | Narrator | M | A | Professional narrator | Documentaries |
| `Spanish_CaptivatingStoryteller` | Captivating Storyteller | M | A | Engaging narrator | Audiobooks |
| `Spanish_WiseScholar` | Wise Scholar | M | A | Knowledgeable | Educational |
| `Spanish_SereneWoman` | Serene Woman | F | A | Calm, peaceful | Relaxation |
| `Spanish_MaturePartner` | Mature Partner | M | A | Sophisticated | Romance, drama |
| `Spanish_ConfidentWoman` | Confident Woman | F | A | Self-assured | Professional |
| `Spanish_DeterminedManager` | Determined Manager | M | A | Ambitious, driven | Business |
| `Spanish_BossyLeader` | Bossy Leader | M | A | Commanding | Leadership |
| `Spanish_ReservedYoungMan` | Reserved Young Man | M | Y | Quiet, introverted | Drama |
| `Spanish_ThoughtfulMan` | Thoughtful Man | M | A | Reflective | Educational |
| `Spanish_RationalMan` | Rational Man | M | A | Logical, analytical | Business |
| `Spanish_Deep-tonedMan` | Deep-toned Man | M | A | Deep, resonant | Commanding |
| `Spanish_Jovialman` | Jovial Man | M | A | Cheerful, friendly | Entertainment |
| `Spanish_Steadymentor` | Steady Mentor | M | A | Reliable mentor | Guidance |
| `Spanish_ReliableMan` | Reliable Man | M | A | Trustworthy | Professional |
| `Spanish_RomanticHusband` | Romantic Husband | M | A | Loving, romantic | Romance |
| `Spanish_Comedian` | Comedian | M | A | Humorous | Comedy |
| `Spanish_Debator` | Debator | M | A | Persuasive | Debate |
| `Spanish_ToughBoss` | Tough Boss | M | A | Harsh, demanding | Business, drama |
| `Spanish_AngryMan` | Angry Man | M | A | Frustrated | Drama, comedy |
| `Spanish_PowerfulSoldier` | Powerful Soldier | M | A | Strong, brave | Action, military |
| `Spanish_PassionateWarrior` | Passionate Warrior | M | A | Fierce, dedicated | Action, fantasy |
| `Spanish_PowerfulVeteran` | Powerful Veteran | M | A | Experienced | Military |
| `Spanish_SensibleManager` | Sensible Manager | M | A | Practical | Business |
| `Spanish_Kind-heartedGirl` | Kind-hearted Girl | F | C | Warm, compassionate | Children's |
| `Spanish_SophisticatedLady` | Sophisticated Lady | F | A | Elegant, refined | Formal |
| `Spanish_FrankLady` | Frank Lady | F | A | Direct, honest | Comedy |
| `Spanish_Fussyhostess` | Fussy Hostess | F | A | Demanding | Comedy, drama |
| `Spanish_Wiselady` | Wise Lady | F | E | Experienced, wise | Guidance |
| `Spanish_ThoughtfulLady` | Thoughtful Lady | F | A | Considerate | Advice |
| `Spanish_AssertiveQueen` | Assertive Queen | F | A | Commanding | Drama, fantasy |
| `Spanish_CaringGirlfriend` | Caring Girlfriend | F | Y | Nurturing | Romance |
| `Spanish_ChattyGirl` | Chatty Girl | F | Y | Talkative, sociable | Comedy |
| `Spanish_CompellingGirl` | Compelling Girl | F | Y | Persuasive | Marketing |
| `Spanish_WhimsicalGirl` | Whimsical Girl | F | C | Playful, imaginative | Children's |
| `Spanish_Intonategirl` | Intonate Girl | F | Y | Musical, melodic | Singing |
| `Spanish_SincereTeen` | Sincere Teen | M | Y | Honest, genuine | Youth |
| `Spanish_Strong-WilledBoy` | Strong-willed Boy | M | Y | Determined | Youth, motivation |
| `Spanish_EnergeticBoy` | Energetic Boy | M | C | Active, lively | Youth, sports |
| `Spanish_StrictBoss` | Strict Boss | M | A | Strict | Business |
| `Spanish_HumorousElder` | Humorous Elder | M | E | Funny | Comedy |
| `Spanish_SereneElder` | Serene Elder | M | E | Calm, peaceful | Meditation |
| `Spanish_SantaClaus` | Santa Claus | M | E | Festive | Holiday |
| `Spanish_Rudolph` | Rudolph | N | C | Reindeer | Holiday |
| `Spanish_Arnold` | Arnold | M | A | Robotic | Sci-fi |
| `Spanish_Ghost` | Ghost | N | A | Spooky | Horror |
| `Spanish_AnimeCharacter` | Anime Character | N | Y | Anime-style | Animation |
### Portuguese (Português)
| voice_id | Name | G | Age | Description | Best For |
|----------|------|---|-----|-------------|----------|
| `Portuguese_Narrator` | Narrator | M | A | Professional narrator | Documentaries |
| `Portuguese_CaptivatingStoryteller` | Captivating Storyteller | M | A | Engaging narrator | Audiobooks |
| `Portuguese_WiseScholar` | Wise Scholar | M | A | Knowledgeable | Educational |
| `Portuguese_Deep-VoicedGentleman` | Deep-voiced Gentleman | M | A | Deep, rich | Commanding |
| `Portuguese_ReservedYoungMan` | Reserved Young Man | M | Y | Quiet, introverted | Drama |
| `Portuguese_ThoughtfulMan` | Thoughtful Man | M | A | Reflective | Educational |
| `Portuguese_RationalMan` | Rational Man | M | A | Logical | Business |
| `Portuguese_Jovialman` | Jovial Man | M | A | Cheerful | Entertainment |
| `Portuguese_Steadymentor` | Steady Mentor | M | A | Reliable mentor | Guidance |
| `Portuguese_ReliableMan` | Reliable Man | M | A | Trustworthy | Professional |
| `Portuguese_RomanticHusband` | Romantic Husband | M | A | Loving | Romance |
| `Portuguese_Comedian` | Comedian | M | A | Humorous | Comedy |
| `Portuguese_Debator` | Debator | M | A | Persuasive | Debate |
| `Portuguese_ToughBoss` | Tough Boss | M | A | Demanding | Business |
| `Portuguese_StrictBoss` | Strict Boss | M | A | Strict | Business |
| `Portuguese_AngryMan` | Angry Man | M | A | Frustrated | Drama |
| `Portuguese_Godfather` | Godfather | M | A | Authoritative | Drama |
| `Portuguese_PowerfulSoldier` | Powerful Soldier | M | A | Strong, brave | Action |
| `Portuguese_PowerfulVeteran` | Powerful Veteran | M | A | Experienced | Military |
| `Portuguese_SensibleManager` | Sensible Manager | M | A | Practical | Business |
| `Portuguese_DeterminedManager` | Determined Manager | M | A | Driven | Business |
| `Portuguese_BossyLeader` | Bossy Leader | M | A | Commanding | Leadership |
| `Portuguese_CalmLeader` | Calm Leader | M | A | Composed, steady | Leadership |
| `Portuguese_FascinatingBoy` | Fascinating Boy | M | Y | Charming | Romance |
| `Portuguese_Strong-WilledBoy` | Strong-willed Boy | M | Y | Determined | Youth |
| `Portuguese_EnergeticBoy` | Energetic Boy | M | C | Active, lively | Youth |
| `Portuguese_FragileBoy` | Fragile Boy | M | Y | Sensitive | Drama |
| `Portuguese_MaturePartner` | Mature Partner | M | A | Sophisticated | Romance |
| `Portuguese_HumorousElder` | Humorous Elder | M | E | Funny | Comedy |
| `Portuguese_SereneElder` | Serene Elder | M | E | Calm | Meditation |
| `Portuguese_ConfidentWoman` | Confident Woman | F | A | Self-assured | Professional |
| `Portuguese_SereneWoman` | Serene Woman | F | A | Calm, peaceful | Relaxation |
| `Portuguese_SentimentalLady` | Sentimental Lady | F | A | Emotional | Drama, romance |
| `Portuguese_Wiselady` | Wise Lady | F | E | Wise | Guidance |
| `Portuguese_GorgeousLady` | Gorgeous Lady | F | A | Beautiful | Romance |
| `Portuguese_LovelyLady` | Lovely Lady | F | A | Sweet, endearing | Warm |
| `Portuguese_Pompouslady` | Pompous Lady | F | A | Self-important | Comedy |
| `Portuguese_CharmingQueen` | Charming Queen | F | A | Elegant | Drama, fantasy |
| `Portuguese_AssertiveQueen` | Assertive Queen | F | A | Commanding | Drama, fantasy |
| `Portuguese_CharmingLady` | Charming Lady | F | A | Sophisticated | Professional |
| `Portuguese_InspiringLady` | Inspiring Lady | F | A | Motivating | Motivation |
| `Portuguese_StressedLady` | Stressed Lady | F | A | Anxious | Comedy |
| `Portuguese_FrankLady` | Frank Lady | F | A | Direct, honest | Comedy |
| `Portuguese_Fussyhostess` | Fussy Hostess | F | A | Demanding | Comedy |
| `Portuguese_ThoughtfulLady` | Thoughtful Lady | F | A | Considerate | Advice |
| `Portuguese_GentleTeacher` | Gentle Teacher | F | A | Kind, patient | Educational |
| `Portuguese_Kind-heartedGirl` | Kind-hearted Girl | F | C | Warm | Children's |
| `Portuguese_SweetGirl` | Sweet Girl | F | Y | Sweet, adorable | Romance |
| `Portuguese_AttractiveGirl` | Attractive Girl | F | Y | Charming | Entertainment |
| `Portuguese_PlayfulGirl` | Playful Girl | F | Y | Fun-loving | Comedy |
| `Portuguese_SmartYoungGirl` | Smart Young Girl | F | Y | Intelligent | Educational |
| `Portuguese_UpsetGirl` | Upset Girl | F | Y | Distressed | Drama |
| `Portuguese_ElegantGirl` | Elegant Girl | F | Y | Graceful | Formal |
| `Portuguese_CompellingGirl` | Compelling Girl | F | Y | Persuasive | Marketing |
| `Portuguese_WhimsicalGirl` | Whimsical Girl | F | C | Playful | Children's |
| `Portuguese_ChattyGirl` | Chatty Girl | F | Y | Talkative | Comedy |
| `Portuguese_NaughtySchoolgirl` | Naughty Schoolgirl | F | Y | Mischievous | Comedy |
| `Portuguese_SadTeen` | Sad Teen | F | Y | Melancholic | Drama |
| `Portuguese_CaringGirlfriend` | Caring Girlfriend | F | Y | Nurturing | Romance |
| `Portuguese_FriendlyNeighbor` | Friendly Neighbor | F | A | Warm, helpful | Community |
| `Portuguese_Dramatist` | Dramatist | M | A | Theatrical | Drama |
| `Portuguese_TheatricalActor` | Theatrical Actor | M | A | Dramatic | Entertainment |
| `Portuguese_Conscientiousinstructor` | Conscientious Instructor | M | A | Diligent | Training |
| `Portuguese_PlayfulSpirit` | Playful Spirit | N | C | Cheerful spirit | Fantasy |
| `Portuguese_SantaClaus` | Santa Claus | M | E | Festive | Holiday |
| `Portuguese_Rudolph` | Rudolph | N | C | Reindeer | Holiday |
| `Portuguese_Arnold` | Arnold | M | A | Robotic | Sci-fi |
| `Portuguese_CharmingSanta` | Charming Santa | M | E | Charismatic | Holiday |
| `Portuguese_Grinch` | Grinch | M | A | Mischievous | Comedy |
| `Portuguese_Ghost` | Ghost | N | A | Spooky | Horror |
| `Portuguese_GrimReaper` | Grim Reaper | N | A | Dark, ominous | Horror |
### French (Français)
| voice_id | Name | G | Age | Description | Best For |
|----------|------|---|-----|-------------|----------|
| `French_Male_Speech_New` | Level-Headed Man | M | A | Calm, reasonable | Professional |
| `French_Female_News Anchor` | Patient Female Presenter | F | A | Clear, patient | News |
| `French_CasualMan` | Casual Man | M | A | Relaxed, informal | Casual |
| `French_MovieLeadFemale` | Movie Lead Female | F | A | Dramatic, expressive | Drama |
| `French_FemaleAnchor` | Female Anchor | F | A | Professional anchor | News |
### Indonesian (Bahasa Indonesia)
| voice_id | Name | G | Age | Description | Best For |
|----------|------|---|-----|-------------|----------|
| `Indonesian_SweetGirl` | Sweet Girl | F | C | Sweet, adorable | Children's |
| `Indonesian_ReservedYoungMan` | Reserved Young Man | M | Y | Quiet, introverted | Drama |
| `Indonesian_CharmingGirl` | Charming Girl | F | Y | Attractive | Romance |
| `Indonesian_CalmWoman` | Calm Woman | F | A | Composed, peaceful | Relaxation |
| `Indonesian_ConfidentWoman` | Confident Woman | F | A | Self-assured | Professional |
| `Indonesian_CaringMan` | Caring Man | M | A | Nurturing | Family |
| `Indonesian_BossyLeader` | Bossy Leader | M | A | Commanding | Leadership |
| `Indonesian_DeterminedBoy` | Determined Boy | M | Y | Ambitious | Youth |
| `Indonesian_GentleGirl` | Gentle Girl | F | Y | Soft-spoken | Calm |
### German (Deutsch)
| voice_id | Name | G | Age | Description | Best For |
|----------|------|---|-----|-------------|----------|
| `German_FriendlyMan` | Friendly Man | M | A | Warm, approachable | Casual |
| `German_SweetLady` | Sweet Lady | F | A | Pleasant, kind | Warm |
| `German_PlayfulMan` | Playful Man | M | A | Fun-loving | Comedy |
### Russian (Русский)
| voice_id | Name | G | Age | Description | Best For |
|----------|------|---|-----|-------------|----------|
| `Russian_HandsomeChildhoodFriend` | Handsome Childhood Friend | M | Y | Charming | Romance |
| `Russian_BrightHeroine` | Bright Queen | F | A | Lively, strong | Drama |
| `Russian_AmbitiousWoman` | Ambitious Woman | F | A | Driven | Professional |
| `Russian_ReliableMan` | Reliable Man | M | A | Trustworthy | Professional |
| `Russian_CrazyQueen` | Crazy Girl | F | Y | Wild, unpredictable | Comedy |
| `Russian_PessimisticGirl` | Pessimistic Girl | F | Y | Gloomy | Comedy |
| `Russian_AttractiveGuy` | Attractive Guy | M | A | Charming | Romance |
| `Russian_Bad-temperedBoy` | Bad-tempered Boy | M | Y | Irritable, grumpy | Comedy |
### Italian (Italiano)
| voice_id | Name | G | Age | Description | Best For |
|----------|------|---|-----|-------------|----------|
| `Italian_BraveHeroine` | Brave Heroine | F | A | Courageous | Action |
| `Italian_Narrator` | Narrator | M | A | Professional narrator | Storytelling |
| `Italian_WanderingSorcerer` | Wandering Sorcerer | M | A | Mysterious | Fantasy |
| `Italian_DiligentLeader` | Diligent Leader | M | A | Hardworking | Leadership |
### Arabic (العربية)
| voice_id | Name | G | Age | Description | Best For |
|----------|------|---|-----|-------------|----------|
| `Arabic_CalmWoman` | Calm Woman | F | A | Composed | Relaxation |
| `Arabic_FriendlyGuy` | Friendly Guy | M | A | Warm | Casual |
### Turkish (Türkçe)
| voice_id | Name | G | Age | Description | Best For |
|----------|------|---|-----|-------------|----------|
| `Turkish_CalmWoman` | Calm Woman | F | A | Composed | Relaxation |
| `Turkish_Trustworthyman` | Trustworthy Man | M | A | Reliable | Professional |
### Ukrainian (Українська)
| voice_id | Name | G | Age | Description | Best For |
|----------|------|---|-----|-------------|----------|
| `Ukrainian_CalmWoman` | Calm Woman | F | A | Composed | Relaxation |
| `Ukrainian_WiseScholar` | Wise Scholar | M | A | Knowledgeable | Educational |
### Dutch (Nederlands)
| voice_id | Name | G | Age | Description | Best For |
|----------|------|---|-----|-------------|----------|
| `Dutch_kindhearted_girl` | Kind-hearted Girl | F | C | Warm | Children's |
| `Dutch_bossy_leader` | Bossy Leader | M | A | Commanding | Leadership |
### Vietnamese (Tiếng Việt)
| voice_id | Name | G | Age | Description | Best For |
|----------|------|---|-----|-------------|----------|
| `Vietnamese_kindhearted_girl` | Kind-hearted Girl | F | C | Warm | Children's |
### Thai (ภาษาไทย)
| voice_id | Name | G | Age | Description | Best For |
|----------|------|---|-----|-------------|----------|
| `Thai_male_1_sample8` | Serene Man | M | A | Calm, peaceful | Relaxation |
| `Thai_male_2_sample2` | Friendly Man | M | A | Warm | Casual |
| `Thai_female_1_sample1` | Confident Woman | F | A | Self-assured | Professional |
| `Thai_female_2_sample2` | Energetic Woman | F | A | Active, lively | Motivation |
### Polish (Polski)
| voice_id | Name | G | Age | Description | Best For |
|----------|------|---|-----|-------------|----------|
| `Polish_male_1_sample4` | Male Narrator | M | A | Professional | Narration |
| `Polish_male_2_sample3` | Male Anchor | M | A | Professional | News |
| `Polish_female_1_sample1` | Calm Woman | F | A | Composed | Relaxation |
| `Polish_female_2_sample3` | Casual Woman | F | A | Relaxed | Casual |
### Romanian (Română)
| voice_id | Name | G | Age | Description | Best For |
|----------|------|---|-----|-------------|----------|
| `Romanian_male_1_sample2` | Reliable Man | M | A | Trustworthy | Professional |
| `Romanian_male_2_sample1` | Energetic Youth | M | Y | Active, lively | Youth |
| `Romanian_female_1_sample4` | Optimistic Youth | F | Y | Positive | Motivation |
| `Romanian_female_2_sample1` | Gentle Woman | F | A | Soft-spoken | Calm |
### Greek (Ελληνικά)
| voice_id | Name | G | Age | Description | Best For |
|----------|------|---|-----|-------------|----------|
| `greek_male_1a_v1` | Thoughtful Mentor | M | A | Reflective, wise | Guidance |
| `Greek_female_1_sample1` | Gentle Lady | F | A | Soft-spoken | Calm |
| `Greek_female_2_sample3` | Girl Next Door | F | Y | Friendly | Casual |
### Czech (Čeština)
| voice_id | Name | G | Age | Description | Best For |
|----------|------|---|-----|-------------|----------|
| `czech_male_1_v1` | Assured Presenter | M | A | Confident | Presentations |
| `czech_female_5_v7` | Steadfast Narrator | F | A | Reliable | Storytelling |
| `czech_female_2_v2` | Elegant Lady | F | A | Graceful | Formal |
### Finnish (Suomi)
| voice_id | Name | G | Age | Description | Best For |
|----------|------|---|-----|-------------|----------|
| `finnish_male_3_v1` | Upbeat Man | M | A | Cheerful | Motivation |
| `finnish_male_1_v2` | Friendly Boy | M | Y | Warm | Children's |
| `finnish_female_4_v1` | Assertive Woman | F | A | Confident | Professional |
### Hindi (हिन्दी)
| voice_id | Name | G | Age | Description | Best For |
|----------|------|---|-----|-------------|----------|
| `hindi_male_1_v2` | Trustworthy Advisor | M | A | Reliable, wise | Guidance |
| `hindi_female_2_v1` | Tranquil Woman | F | A | Calm, peaceful | Meditation |
| `hindi_female_1_v2` | News Anchor | F | A | Professional | News |
---
## Voice Parameters
### VoiceSetting
```python
from scripts.tts.utils import VoiceSetting

voice = VoiceSetting(
    voice_id="male-qn-qingse",
    speed=1.0,    # 0.5 to 2.0 (default 1.0)
    volume=1.0,   # 0.1 to 10.0 (default 1.0)
    pitch=0,      # -12 to +12 (default 0)
    emotion="",   # leave empty for speech-2.8 auto-matching (recommended)
)
```
### Speed
| Value | Effect |
|-------|--------|
| 0.75 | Slower, deliberate (news, tutorials) |
| 1.0 | Normal pace |
| 1.25 | Slightly faster (energetic) |
| 1.5+ | Fast (time-sensitive) |
### Emotion
| Value | Description | Model Support |
|-------|-------------|---------------|
| *(empty)* | Auto-match from text | speech-2.8 (recommended) |
| `happy` | Cheerful, upbeat | All |
| `sad` | Melancholic, somber | All |
| `angry` | Intense, frustrated | All |
| `fearful` | Anxious, nervous | All |
| `disgusted` | Repulsed | All |
| `surprised` | Astonished | All |
| `calm` | Neutral tone | All |
| `fluent` | Natural, lively | speech-2.6 only |
| `whisper` | Soft, gentle | speech-2.6 only |
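The ranges and emotion support above can be checked before calling the API. Below is a minimal sketch that assumes the bounds are inclusive and that model IDs begin with `speech-2.8` or `speech-2.6`. `validate_voice_setting` is a hypothetical helper and is not part of `scripts.tts.utils`.

```python
# Hypothetical helper (not part of scripts.tts.utils): checks VoiceSetting values
# against the documented ranges, assuming the bounds are inclusive.
SPEED_RANGE = (0.5, 2.0)
VOLUME_RANGE = (0.1, 10.0)
PITCH_RANGE = (-12, 12)

# Emotions listed for all models vs. those listed for speech-2.6 only.
COMMON_EMOTIONS = {"happy", "sad", "angry", "fearful", "disgusted", "surprised", "calm"}
SPEECH_26_ONLY = {"fluent", "whisper"}


def validate_voice_setting(model: str, speed: float = 1.0, volume: float = 1.0,
                           pitch: int = 0, emotion: str = "") -> list:
    """Return a list of problems; an empty list means the values look valid."""
    problems = []
    for name, value, (low, high) in (("speed", speed, SPEED_RANGE),
                                     ("volume", volume, VOLUME_RANGE),
                                     ("pitch", pitch, PITCH_RANGE)):
        if not low <= value <= high:
            problems.append(f"{name}={value} is outside {low} to {high}")
    if emotion in SPEECH_26_ONLY and not model.startswith("speech-2.6"):
        problems.append(f"emotion '{emotion}' is listed for speech-2.6 only")
    elif emotion and emotion not in COMMON_EMOTIONS | SPEECH_26_ONLY:
        problems.append(f"unknown emotion '{emotion}'")
    # An empty emotion is accepted: speech-2.8 auto-matches it from the text.
    return problems


print(validate_voice_setting("speech-2.8", speed=1.25))  # []
print(validate_voice_setting("speech-2.8", emotion="whisper"))
```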
---
## Custom Voices
### Voice Cloning
Create custom voices from audio samples:
- Source: 10 s to 5 min, mp3/wav/m4a, ≤20 MB, one clearly audible speaker
- Best: 30-60 s of clean speech with varied intonation
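The following is a sketch of a pre-flight check for a cloning sample. `check_clone_sample` is hypothetical and treats "20 MB" as 20 MiB. It does not decode audio, so the caller must supply the duration (for example, from `ffprobe`).

```python
from pathlib import Path

# Limits taken from the voice-cloning requirements above.
ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a"}
MAX_BYTES = 20 * 1024 * 1024          # assumption: 20 MB means 20 MiB
MIN_SECONDS, MAX_SECONDS = 10, 5 * 60
IDEAL_SECONDS = (30, 60)


def check_clone_sample(filename: str, size_bytes: int, duration_s: float) -> tuple:
    """Return (errors, hints) for a candidate voice-cloning sample."""
    errors, hints = [], []
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        errors.append(f"unsupported format {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        errors.append("file larger than 20 MB")
    if not MIN_SECONDS <= duration_s <= MAX_SECONDS:
        errors.append("duration must be between 10 s and 5 min")
    elif not IDEAL_SECONDS[0] <= duration_s <= IDEAL_SECONDS[1]:
        hints.append("30-60 s of clean speech usually works best")
    return errors, hints


print(check_clone_sample("sample.wav", 3_000_000, 45.0))  # ([], [])
```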
### Voice Design
Generate voices from text descriptions:
- Include: gender, age, vocal characteristics, tone, use case
- Example: "A warm, grandmotherly voice with gentle pacing, perfect for bedtime stories"
Custom voices expire after 7 days unless they are used in a TTS request. To list all voices, run `python scripts/tts/generate_voice.py list-voices`.

@@ -0,0 +1,130 @@
# MiniMax Video Generation API Documentation
## API Endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/v1/video_generation` | POST | Create video generation task (all 4 modes) |
| `/v1/query/video_generation` | GET | Query task status |
| `/v1/files/retrieve` | GET | Get video download URL |
| `/v1/video_template_generation` | POST | Create template-based video task |
| `/v1/query/video_template_generation` | GET | Query template task status |
**Base URL:** `https://api.minimaxi.com` (China Mainland) or `https://api.minimax.io` (Global)
**Auth:** `Authorization: Bearer {MINIMAX_API_KEY}`
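The sketch below shows how the endpoint, base URL, and Bearer header fit together, using Python's standard `urllib` (the toolkit's own scripts use curl). The request is built but not sent. `build_create_request` and the sample payload are illustrative only.

```python
import json
import urllib.request

# Assumption: default to the China Mainland host; pass host="https://api.minimax.io" for Global.
DEFAULT_HOST = "https://api.minimaxi.com"


def build_create_request(api_key: str, payload: dict,
                         host: str = DEFAULT_HOST) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/video_generation request with Bearer auth."""
    return urllib.request.Request(
        url=f"{host}/v1/video_generation",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_create_request("sk-api-example",
                           {"model": "MiniMax-Hailuo-2.3", "prompt": "A puppy runs toward the camera"})
print(req.get_method(), req.full_url)  # POST https://api.minimaxi.com/v1/video_generation
# To actually send it: urllib.request.urlopen(req, timeout=30)
```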
---
## Video Generation Models
### Text-to-Video (T2V) Models
| Model | Resolution | Duration | Notes |
|-------|-----------|----------|-------|
| MiniMax-Hailuo-2.3 | 768P (default), 1080P | 6s (1080P), 6/10s (768P) | Recommended, latest |
| MiniMax-Hailuo-2.3-Fast | 768P (default), 1080P | 6s (1080P), 6/10s (768P) | Fast variant |
| MiniMax-Hailuo-02 | 512P, 768P (default), 1080P | 6s (1080P), 6/10s (512P/768P) | Previous gen |
| T2V-01-Director | 720P | 6s | Director control |
| T2V-01 | 720P | 6s | Base model |
### Image-to-Video (I2V) Models
| Model | Resolution | Duration | Notes |
|-------|-----------|----------|-------|
| MiniMax-Hailuo-2.3 | 768P, 1080P | 6s | Recommended |
| MiniMax-Hailuo-2.3-Fast | 768P, 1080P | 6s | Fast variant |
| MiniMax-Hailuo-02 | 512P, 768P, 1080P | 6/10s | Previous gen |
| I2V-01-Director | 720P | 6s | Director control |
| I2V-01-live | 720P | 6s | Live photo style |
| I2V-01 | 720P | 6s | Base model |
### Start-End Frame Model
| Model | Notes |
|-------|-------|
| MiniMax-Hailuo-02 | Only model supporting start-end frame |
### Subject Reference Model
| Model | Notes |
|-------|-------|
| S2V-01 | Face consistency across video |
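The T2V table can be encoded as a lookup, so that invalid resolution/duration combinations fail before a request is made. This sketch transcribes only the text-to-video rows. `check_t2v` is a hypothetical helper.

```python
from typing import Optional

# model -> {resolution: allowed durations in seconds}, transcribed from the T2V table
T2V_CAPS = {
    "MiniMax-Hailuo-2.3":      {"768P": (6, 10), "1080P": (6,)},
    "MiniMax-Hailuo-2.3-Fast": {"768P": (6, 10), "1080P": (6,)},
    "MiniMax-Hailuo-02":       {"512P": (6, 10), "768P": (6, 10), "1080P": (6,)},
    "T2V-01-Director":         {"720P": (6,)},
    "T2V-01":                  {"720P": (6,)},
}


def check_t2v(model: str, resolution: str, duration: int) -> Optional[str]:
    """Return an error message, or None if the combination is listed in the table."""
    caps = T2V_CAPS.get(model)
    if caps is None:
        return f"{model} is not listed as a text-to-video model"
    if resolution not in caps:
        return f"{model} does not list {resolution} (available: {', '.join(caps)})"
    if duration not in caps[resolution]:
        return f"{model} at {resolution} supports durations {caps[resolution]} s"
    return None


print(check_t2v("MiniMax-Hailuo-2.3", "768P", 10))  # None
print(check_t2v("MiniMax-Hailuo-2.3", "1080P", 10))
```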
---
## Request Parameters
### Common Parameters (All Modes)
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| model | string | Yes | - | Model name |
| prompt | string | Depends | - | Video description, max 2000 chars |
| duration | int | No | 6 | Video length in seconds |
| resolution | string | No | 768P (Hailuo models) / 720P (01-series models) | Video resolution |
| prompt_optimizer | bool | No | true | Auto-optimize prompt |
| fast_pretreatment | bool | No | false | Reduce the time spent on prompt optimization |
| callback_url | string | No | - | Webhook URL |
| aigc_watermark | bool | No | false | Add watermark |
### Image-to-Video Parameters
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| first_frame_image | string | Yes | Starting frame (URL or base64 data URL) |
**Image requirements:** JPG/JPEG/PNG/WebP, < 20 MB, short side > 300 px, aspect ratio between 2:5 and 5:2.
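The following sketch runs a local check against these image requirements. It assumes the limits apply exactly as written (strict `<` and `>`, inclusive aspect-ratio bounds) and that 20 MB means 20 MiB. `check_frame_image` is hypothetical, and the caller supplies the format and pixel dimensions.

```python
ALLOWED_FORMATS = {"jpg", "jpeg", "png", "webp"}
IMAGE_MAX_BYTES = 20 * 1024 * 1024  # assumption: "< 20MB" means under 20 MiB


def check_frame_image(fmt: str, size_bytes: int, width: int, height: int) -> list:
    """Return a list of problems for a first/last frame image; empty means OK."""
    problems = []
    if fmt.lower() not in ALLOWED_FORMATS:
        problems.append(f"format {fmt} not supported")
    if size_bytes >= IMAGE_MAX_BYTES:
        problems.append("file must be smaller than 20 MB")
    if min(width, height) <= 300:
        problems.append("short side must be greater than 300 px")
    # 2/5 <= width/height <= 5/2, compared with integers to avoid float rounding
    if not (5 * width >= 2 * height and 2 * width <= 5 * height):
        problems.append(f"aspect ratio {width}:{height} outside 2:5 to 5:2")
    return problems


print(check_frame_image("png", 1_000_000, 1280, 720))  # []
```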
### Start-End Frame Parameters
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| first_frame_image | string | Yes | Starting frame |
| last_frame_image | string | Yes | Ending frame |
### Subject Reference Parameters
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| subject_reference | array | Yes | Array of subject objects |
Each object has `type` and `image` (array of image URLs):
```json
[{ "type": "character", "image": ["<image_url>"] }]
```
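Below is one way to assemble the request body for each mode from the parameter tables above. The sketch infers the mode from which image arguments are present; this is a convention of the sketch, not of the API. `build_video_payload` is hypothetical.

```python
from typing import Optional

MAX_PROMPT_CHARS = 2000  # from the common-parameters table


def build_video_payload(model: str, prompt: str = "", *, duration: int = 6,
                        resolution: Optional[str] = None,
                        first_frame_image: Optional[str] = None,
                        last_frame_image: Optional[str] = None,
                        subject_images: Optional[list] = None) -> dict:
    """Assemble a /v1/video_generation body; images are URLs or base64 data URLs."""
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(f"prompt exceeds {MAX_PROMPT_CHARS} characters")
    if last_frame_image and not first_frame_image:
        raise ValueError("start-end mode needs first_frame_image as well as last_frame_image")
    payload = {"model": model, "duration": duration}
    if prompt:
        payload["prompt"] = prompt
    if resolution:
        payload["resolution"] = resolution
    if first_frame_image:   # I2V, or the start frame in start-end mode
        payload["first_frame_image"] = first_frame_image
    if last_frame_image:    # start-end mode (MiniMax-Hailuo-02 only)
        payload["last_frame_image"] = last_frame_image
    if subject_images:      # subject reference mode (S2V-01)
        payload["subject_reference"] = [{"type": "character", "image": list(subject_images)}]
    return payload


body = build_video_payload("S2V-01", "A man walks through a night market",
                           subject_images=["https://example.com/face.jpg"])
print(body)
```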
---
## Camera Instructions
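Camera commands are written in square brackets (`[指令]`, "[command]") inside the prompt and are supported by the Hailuo-2.3, Hailuo-02, and Director models:
| Category | Instructions |
|----------|-------------|
| Truck | `[左移]` (truck left), `[右移]` (truck right) |
| Pan | `[左摇]` (pan left), `[右摇]` (pan right) |
| Push/Pull | `[推进]` (push in), `[拉远]` (pull out) |
| Pedestal | `[上升]` (pedestal up), `[下降]` (pedestal down) |
| Tilt | `[上摇]` (tilt up), `[下摇]` (tilt down) |
| Zoom | `[变焦推近]` (zoom in), `[变焦拉远]` (zoom out) |
| Other | `[晃动]` (shake), `[跟随]` (tracking shot), `[固定]` (static shot) |
To perform moves simultaneously, combine up to 3 commands in one bracket, e.g. `[左摇,上升]`. For sequential moves, place separate commands at different points in the prompt: `...[推进], then ...[拉远]`.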
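The formatter sketched below enforces the bracket syntax and the 3-command limit. `camera_tag` is a hypothetical helper.

```python
# Literal command text accepted inside [...], from the camera table.
CAMERA_COMMANDS = {
    "左移", "右移", "左摇", "右摇", "推进", "拉远", "上升", "下降",
    "上摇", "下摇", "变焦推近", "变焦拉远", "晃动", "跟随", "固定",
}
MAX_SIMULTANEOUS = 3


def camera_tag(*commands: str) -> str:
    """Format one bracketed instruction; several commands in one tag move simultaneously."""
    if not 1 <= len(commands) <= MAX_SIMULTANEOUS:
        raise ValueError(f"use 1 to {MAX_SIMULTANEOUS} commands per bracket")
    unknown = [c for c in commands if c not in CAMERA_COMMANDS]
    if unknown:
        raise ValueError(f"unknown camera command(s): {unknown}")
    return "[" + ",".join(commands) + "]"


# Simultaneous: pan left while rising. Sequential: separate tags at different points.
print(camera_tag("左摇", "上升"))  # [左摇,上升]
print(f"The camera starts with {camera_tag('推进')} toward the face, "
      f"then {camera_tag('拉远')} to reveal the full scene")
```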
---
## Response
**Query status:** `Preparing`, `Queueing`, `Processing`, `Success`, `Fail`
**Error codes:** 0 (success), 1002 (rate limited), 1004 (auth failed), 1008 (insufficient balance), 1026 (sensitive content), 2013 (invalid params), 2049 (invalid API key)
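Video tasks are asynchronous. A client polls the query endpoint until the status is `Success` or `Fail`, then fetches the download URL from `/v1/files/retrieve`. Below is a minimal polling loop. It assumes the caller supplies a `query_status` function that performs the GET and returns the status string, which lets the loop be tested offline. The interval and timeout defaults are arbitrary, and `wait_for_task` and `describe_error` are hypothetical helpers.

```python
import time
from typing import Callable

TERMINAL = {"Success", "Fail"}
IN_PROGRESS = {"Preparing", "Queueing", "Processing"}

# Error codes listed above.
ERROR_CODES = {
    0: "success", 1002: "rate limited", 1004: "auth failed",
    1008: "insufficient balance", 1026: "sensitive content",
    2013: "invalid params", 2049: "invalid API key",
}


def wait_for_task(query_status: Callable[[], str], interval_s: float = 10.0,
                  timeout_s: float = 600.0,
                  sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the task reaches Success or Fail; raise on timeout or unknown status."""
    waited = 0.0
    while True:
        status = query_status()
        if status in TERMINAL:
            return status
        if status not in IN_PROGRESS:
            raise RuntimeError(f"unexpected status: {status!r}")
        if waited >= timeout_s:
            raise TimeoutError(f"task still {status} after {timeout_s} s")
        sleep(interval_s)
        waited += interval_s


def describe_error(code: int) -> str:
    return ERROR_CODES.get(code, f"unknown error code {code}")


statuses = iter(["Queueing", "Processing", "Success"])
print(wait_for_task(lambda: next(statuses), sleep=lambda s: None))  # Success
```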
---
## Video Templates
| Template | ID | Input | Description |
|----------|-----|-------|-------------|
| Diving | 392753057216684038 | Image | Diving motion |
| Rings | 393881433990066176 | Image | Gymnastics rings |
| Survival | 393769180141805569 | Image + Text | Outdoor survival |
| Labubu | 394246956137422856 | Image | Labubu character |
| McDonald's Delivery | 393879757702918151 | Image | Pet courier |
| Tibetan Portrait | 393766210733957121 | Image | Cultural portrait |
| Female Model Ads | 393866076583718914 | Image | Female fashion |
| Male Model Ads | 393876118804459526 | Image | Male fashion |
| Winter Romance | 393857704283172856 | Image | Snowy portrait |
| Four Seasons | 398574688191234048 | Image | Seasonal portrait |
| Helpless Moments | 394125185182695432 | Text only | Comedic animation |

@@ -0,0 +1,98 @@
# Video Prompt Writing Guide
## Prompt Structure
### Basic Formula
**Main subject + Scene/Space + Movement/Change**
Examples:
- "A puppy runs toward the camera in a sunny park"
- "A woman walks in the rain holding an umbrella on a city street"
- "A stream flows through a green valley with morning mist"
### Professional Formula
**Main subject + Scene + Movement + Camera motion + Aesthetic atmosphere**
Examples:
- "A couple sits on a park bench, warm golden hour lighting, [固定] framing, intimate and romantic atmosphere"
- "A young man in a suit eats noodles at a street stall, [拉远] revealing the busy night market, warm tones, cinematic"
- "A dancer performs contemporary dance in an empty studio, [跟随] smooth tracking, dramatic side lighting"
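The formulas above amount to joining parts in a fixed order. The sketch below reproduces two of the examples. `build_video_prompt` is hypothetical.

```python
def build_video_prompt(subject: str, movement: str, scene: str,
                       camera: str = "", aesthetic: str = "") -> str:
    """Basic formula: subject + movement + scene; camera and aesthetic make it 'professional'."""
    core = " ".join(p.strip() for p in (subject, movement, scene) if p.strip())
    extras = [p.strip() for p in (camera, aesthetic) if p.strip()]
    return ", ".join([core] + extras)


print(build_video_prompt("A puppy", "runs toward the camera", "in a sunny park"))
# A puppy runs toward the camera in a sunny park
print(build_video_prompt("A dancer", "performs contemporary dance", "in an empty studio",
                         camera="[跟随] smooth tracking", aesthetic="dramatic side lighting"))
# A dancer performs contemporary dance in an empty studio, [跟随] smooth tracking, dramatic side lighting
```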
---
## Key Principles
1. **More precise language → more accurate video**
2. **Richer description → better generation quality**
3. **Keep prompts focused on 5-6 seconds of action** — do not describe too many events
4. **Combine shot types with mood descriptors** for professional output
---
## Camera Instructions Usage
### Simultaneous Camera Movement
Place multiple instructions in one bracket:
- `[左摇,上升]` — pan left while rising
- `[推进,下摇]` — push in while tilting down
### Sequential Camera Movement
Place instructions at different points in the prompt:
- "The camera starts with [推进] toward the face, then [拉远] to reveal the full scene"
---
## Style-Specific Prompt Tips
### Realistic / Cinematic Style
- Mention lighting: "golden hour", "overcast sky", "dramatic side lighting"
- Color grading: "warm tones", "cool desaturated palette", "high contrast"
- Texture: "rain droplets on glass", "dust particles in sunlight"
- Cinematic terms: "shallow depth of field", "anamorphic lens flare"
### Animation Style
- Substyle: "2D anime", "3D Pixar-style", "watercolor animation", "stop-motion"
- Character design: "big expressive eyes", "chibi proportions"
- Effects: "sparkle particles", "speed lines", "dramatic wind effects"
### Product / Commercial Style
- Product details: "smooth surface", "premium materials", "elegant design"
- Studio lighting: "soft box lighting", "rim light", "gradient background"
- Motion: "slow rotation", "smooth reveal", "gentle float"
### Fantasy / Sci-Fi Style
- World elements: "floating islands", "neon cyberpunk city", "enchanted forest"
- VFX: "magic particles", "holographic displays", "energy beams"
- Scale: "vast landscape", "towering structures", "infinite horizon"
### Nature / Documentary Style
- Terminology: "macro shot", "time-lapse", "wildlife behavior"
- Phenomena: "morning dew", "sunset colors", "storm clouds"
- Precision: "slow motion at 240fps", "underwater perspective"
---
## Image-to-Video Prompt Tips
Focus on **movement and change** since the image establishes the visual:
- Image of still lake → "Gentle ripples spread across the water surface, a breeze rustles the trees, [固定] fixed camera, peaceful"
- Image of portrait → "The person slowly smiles and turns their head, natural blinking, [推进] subtle push in, warm lighting"
---
## Prompt Building Checklist
1. **Subject**: Appearance, clothing, color, expression, posture
2. **Action**: 1-2 key temporal actions ("first...then...")
3. **Scene**: Setting with foreground + background + atmosphere
4. **Camera**: `[运镜指令]` (camera-movement commands) for precise control
5. **Aesthetic**: Lighting, color, texture, cinematic quality
## Common Mistakes
1. Too many events for 6-second videos
2. Conflicting camera instructions
3. Vague descriptions
4. Static descriptions without motion
5. Missing aesthetic layer
6. Overlong prompts (keep under 200 words)
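Some of these mistakes can be caught mechanically. The rough linter sketched below makes two assumptions. First, it treats opposing commands in the same bracket, or `[固定]` combined with movement, as conflicting. Second, it approximates the word count by splitting on whitespace, which undercounts text without spaces, such as Chinese. `lint_video_prompt` is hypothetical.

```python
import re

MAX_WORDS = 200          # "keep under 200 words"
MAX_CHARS = 2000         # API limit on prompt length
MAX_SIMULTANEOUS = 3
# Opposite directions from the camera table; both in one bracket is treated as a conflict.
OPPOSING = [("左移", "右移"), ("左摇", "右摇"), ("推进", "拉远"),
            ("上升", "下降"), ("上摇", "下摇"), ("变焦推近", "变焦拉远")]
BRACKET = re.compile(r"\[([^\]]+)\]")


def lint_video_prompt(prompt: str) -> list:
    """Return warnings for mistakes that can be detected from the text alone."""
    warnings = []
    words = len(prompt.split())
    if words >= MAX_WORDS:
        warnings.append(f"{words} words; keep prompts under {MAX_WORDS}")
    if len(prompt) > MAX_CHARS:
        warnings.append(f"{len(prompt)} characters; the API limit is {MAX_CHARS}")
    for group in BRACKET.findall(prompt):
        commands = [c.strip() for c in group.split(",")]
        if len(commands) > MAX_SIMULTANEOUS:
            warnings.append(f"[{group}] combines more than {MAX_SIMULTANEOUS} commands")
        for a, b in OPPOSING:
            if a in commands and b in commands:
                warnings.append(f"[{group}] contains opposing commands {a} and {b}")
        if "固定" in commands and len(commands) > 1:
            warnings.append(f"[{group}] mixes 固定 (static shot) with camera movement")
    return warnings


print(lint_video_prompt("A car drives down a coastal road [推进,拉远]"))
```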

@@ -0,0 +1,156 @@
#!/usr/bin/env bash
# MiniMax Multi-Modal Toolkit — Environment Check
#
# Usage:
# bash scripts/check_environment.sh
# bash scripts/check_environment.sh --test-api
set -euo pipefail
PASSED=0
FAILED=0
TOTAL=0
check() {
TOTAL=$((TOTAL + 1))
if "$@"; then
PASSED=$((PASSED + 1))
else
FAILED=$((FAILED + 1))
fi
}
check_curl() {
if command -v curl &>/dev/null; then
echo "[OK] curl installed"
return 0
fi
echo "[FAIL] curl not installed"
return 1
}
check_ffmpeg() {
if command -v ffmpeg &>/dev/null; then
echo "[OK] FFmpeg installed"
return 0
fi
echo "[FAIL] FFmpeg not installed"
return 1
}
check_ffprobe() {
if command -v ffprobe &>/dev/null; then
echo "[OK] ffprobe installed"
return 0
fi
echo "[FAIL] ffprobe not installed"
return 1
}
check_jq() {
if command -v jq &>/dev/null; then
echo "[OK] jq installed"
return 0
fi
echo "[FAIL] jq not installed (brew install jq / apt install jq)"
return 1
}
check_xxd() {
if command -v xxd &>/dev/null; then
echo "[OK] xxd installed"
return 0
fi
echo "[FAIL] xxd not installed"
return 1
}
check_api_host() {
local api_host="${MINIMAX_API_HOST:-}"
if [[ -z "$api_host" ]]; then
echo "[FAIL] MINIMAX_API_HOST not set"
echo " China Mainland: export MINIMAX_API_HOST='https://api.minimaxi.com'"
echo " Global: export MINIMAX_API_HOST='https://api.minimax.io'"
return 1
fi
if [[ "$api_host" != "https://api.minimaxi.com" && "$api_host" != "https://api.minimax.io" ]]; then
echo "[WARN] MINIMAX_API_HOST has non-standard value: $api_host"
echo " Expected: https://api.minimaxi.com (China) or https://api.minimax.io (Global)"
return 0
fi
echo "[OK] MINIMAX_API_HOST set ($api_host)"
return 0
}
check_api_key() {
local api_key="${MINIMAX_API_KEY:-}"
if [[ -z "$api_key" ]]; then
echo "[FAIL] MINIMAX_API_KEY not set"
echo " export MINIMAX_API_KEY='your-key'"
return 1
fi
if [[ "$api_key" != sk-api* && "$api_key" != sk-cp* ]]; then
echo "[FAIL] Invalid API key format"
echo " Expected: sk-api-xxx... or sk-cp-xxx..."
echo " Got: ${api_key:0:20}..."
return 1
fi
echo "[OK] MINIMAX_API_KEY set (${#api_key} chars)"
return 0
}
check_api_connectivity() {
local api_host="${MINIMAX_API_HOST:-}"
local api_key="${MINIMAX_API_KEY:-}"
if [[ -z "$api_key" ]]; then
echo "[FAIL] API connectivity skipped (MINIMAX_API_KEY not set)"
return 1
fi
if [[ -z "$api_host" ]]; then
echo "[FAIL] API connectivity skipped (MINIMAX_API_HOST not set)"
return 1
fi
local http_code
http_code=$(curl -s -o /dev/null -w "%{http_code}" \
-H "Authorization: Bearer $api_key" \
--max-time 10 \
"$api_host" 2>/dev/null) || true
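# curl reports 000 when no HTTP response was received (DNS failure, timeout, refused connection)
if [[ "$http_code" =~ ^[1-5][0-9][0-9]$ && "$http_code" -lt 500 ]]; then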
echo "[OK] API host reachable (HTTP $http_code)"
return 0
fi
echo "[FAIL] API host unreachable ($api_host)"
return 1
}
# --- Main ---
TEST_API=false
for arg in "$@"; do
case "$arg" in
--test-api) TEST_API=true ;;
esac
done
echo "MiniMax Multi-Modal Toolkit — Environment Check"
echo "========================================"
check check_curl
check check_ffmpeg
check check_ffprobe
check check_jq
check check_xxd
check check_api_host
check check_api_key
if $TEST_API; then
check check_api_connectivity
fi
echo ""
echo "========================================"
if [[ $FAILED -eq 0 ]]; then
echo "All $TOTAL checks passed!"
exit 0
else
echo "$FAILED check(s) failed out of $TOTAL"
exit 1
fi

@@ -0,0 +1,277 @@
#!/usr/bin/env bash
# MiniMax Image Generation CLI (pure bash)
#
# Usage:
# bash scripts/image/generate_image.sh --prompt "A cat on a rooftop at sunset" -o minimax-output/cat.png
# bash scripts/image/generate_image.sh --mode i2i --prompt "A girl reading in a library" --ref-image face.jpg -o minimax-output/girl.png
# bash scripts/image/generate_image.sh --prompt "Mountain landscape" --aspect-ratio 16:9 -n 3 -o minimax-output/landscape.png
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
# ============================================================================
# Common functions
# ============================================================================
load_env() {
local env_file
for env_file in "$PROJECT_ROOT/.env" "$(pwd)/.env"; do
if [[ -f "$env_file" ]]; then
while IFS= read -r line || [[ -n "$line" ]]; do
line="${line%%#*}"; line="$(echo "$line" | xargs)"
[[ -z "$line" || "$line" != *=* ]] && continue
local key="${line%%=*}" val="${line#*=}"
key="$(echo "$key" | xargs)"; val="$(echo "$val" | xargs)"
if [[ ${#val} -ge 2 ]]; then
case "$val" in \"*\") val="${val:1:${#val}-2}" ;; \'*\') val="${val:1:${#val}-2}" ;; esac
fi
# Use if/fi so an already-set key does not leave a non-zero status that trips `set -e`
if [[ -z "${!key:-}" ]]; then export "$key=$val"; fi
done < "$env_file"
fi
done
}
check_api_key() {
if [[ -z "${MINIMAX_API_KEY:-}" ]]; then
echo "Error: MINIMAX_API_KEY environment variable is not set." >&2; exit 1
fi
}
image_to_data_url() {
local path="$1"
[[ -f "$path" ]] || { echo "Error: Image not found: $path" >&2; exit 1; }
local mime
mime="$(file -b --mime-type "$path" 2>/dev/null)" || mime="image/jpeg"
local b64
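# Strip newlines: GNU base64 wraps output at 76 columns, which would corrupt the data URL
b64="$(base64 < "$path" | tr -d '\n')"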
echo "data:${mime};base64,${b64}"
}
resolve_image() {
local input="$1"
[[ -z "$input" ]] && return
case "$input" in
http://*|https://*|data:*) echo "$input" ;;
*) image_to_data_url "$input" ;;
esac
}
# ============================================================================
# Main
# ============================================================================
main() {
load_env
check_api_key
local mode="t2i" prompt="" model="image-01"
local aspect_ratio="" width="" height=""
local response_format="url" n=1 seed=""
local prompt_optimizer=false aigc_watermark=false
local ref_image=""
local output="" download=true
while [[ $# -gt 0 ]]; do
case "$1" in
--mode) mode="$2"; shift 2 ;;
--prompt) prompt="$2"; shift 2 ;;
--aspect-ratio|--ratio) aspect_ratio="$2"; shift 2 ;;
--width) width="$2"; shift 2 ;;
--height) height="$2"; shift 2 ;;
--response-format) response_format="$2"; shift 2 ;;
-n|--count) n="$2"; shift 2 ;;
--seed) seed="$2"; shift 2 ;;
--prompt-optimizer) prompt_optimizer=true; shift ;;
--aigc-watermark) aigc_watermark=true; shift ;;
--ref-image) ref_image="$2"; shift 2 ;;
--no-download) download=false; shift ;;
-o|--output) output="$2"; shift 2 ;;
-h|--help)
cat <<'USAGE'
MiniMax Image Generation CLI (model: image-01)
Usage:
generate_image.sh [--mode MODE] [options] -o OUTPUT
Modes:
t2i Text-to-image (default) — generate image from text prompt
i2i Image-to-image — generate image using a character reference photo
Options:
--mode MODE Generation mode: t2i (default), i2i
--prompt TEXT Text description of the image (max 1500 chars, required)
--aspect-ratio RATIO Aspect ratio: 1:1, 16:9, 4:3, 3:2, 2:3, 3:4, 9:16, 21:9
--width PX Custom width in pixels (512-2048, multiple of 8)
--height PX Custom height in pixels (512-2048, multiple of 8)
-n, --count N Number of images to generate (1-9, default: 1)
--seed N Random seed for reproducibility
--prompt-optimizer Enable automatic prompt optimization
--aigc-watermark Add AIGC watermark to generated images
--ref-image FILE Character reference image (local file or URL, i2i mode)
--response-format FMT Response format: url (default), base64
--no-download Don't download, just print URL(s)
-o, --output FILE Output file path (required)
Examples:
# Text-to-image (default)
generate_image.sh --prompt "A cat on a rooftop at sunset, cinematic" -o cat.png
# Custom aspect ratio
generate_image.sh --prompt "Mountain landscape" --aspect-ratio 16:9 -o landscape.png
# Multiple images
generate_image.sh --prompt "Abstract art" -n 3 -o art.png
# Image-to-image with character reference
generate_image.sh --mode i2i --prompt "A girl reading in a library" --ref-image face.jpg -o girl.png
USAGE
exit 0
;;
*) echo "Unknown option: $1" >&2; exit 1 ;;
esac
done
if [[ -z "$prompt" ]]; then
echo "Error: --prompt is required" >&2; exit 1
fi
if [[ -z "$output" ]]; then
echo "Error: --output / -o is required" >&2; exit 1
fi
# Validate n range
if [[ "$n" -lt 1 || "$n" -gt 9 ]] 2>/dev/null; then
echo "Error: -n must be between 1 and 9" >&2; exit 1
fi
# Build payload
local payload
payload=$(jq -n \
--arg model "$model" \
--arg prompt "$prompt" \
--arg rf "$response_format" \
--argjson n "$n" \
--argjson po "$prompt_optimizer" \
--argjson aw "$aigc_watermark" \
'{model: $model, prompt: $prompt, response_format: $rf, n: $n, prompt_optimizer: $po, aigc_watermark: $aw}')
[[ -n "$aspect_ratio" ]] && payload=$(echo "$payload" | jq --arg ar "$aspect_ratio" '. + {aspect_ratio: $ar}')
[[ -n "$width" ]] && payload=$(echo "$payload" | jq --argjson w "$width" '. + {width: $w}')
[[ -n "$height" ]] && payload=$(echo "$payload" | jq --argjson h "$height" '. + {height: $h}')
[[ -n "$seed" ]] && payload=$(echo "$payload" | jq --argjson s "$seed" '. + {seed: $s}')
# Subject reference (i2i mode)
if [[ "$mode" == "i2i" ]]; then
if [[ -z "$ref_image" ]]; then
echo "Error: --ref-image is required for i2i mode" >&2; exit 1
fi
local img_url
img_url="$(resolve_image "$ref_image")"
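# Pass the data URL through a file descriptor: a large image can exceed the per-argument size limit on argv
payload=$(jq --rawfile img <(printf '%s' "$img_url") '. + {subject_reference: [{type: "character", image_file: $img}]}' <<<"$payload")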
fi
local api_host="${MINIMAX_API_HOST:-https://api.minimaxi.com}"
local api_url="${api_host}/v1/image_generation"
echo "Mode: $mode"
echo "Model: $model"
echo "Generating $n image(s)..."
local raw_output http_code response
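# Send the payload on stdin: a base64 reference image can exceed the per-argument size limit
raw_output="$(curl -s -w "\n%{http_code}" \
-X POST "$api_url" \
-H "Authorization: Bearer ${MINIMAX_API_KEY}" \
-H "Content-Type: application/json" \
--max-time 120 \
--data-binary @- <<<"$payload" 2>/dev/null)" || {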
echo "Error: curl request failed" >&2
exit 1
}
http_code="${raw_output##*$'\n'}"
response="${raw_output%$'\n'*}"
if [[ "$http_code" -ge 400 ]] 2>/dev/null; then
echo "Error: API returned HTTP $http_code" >&2
echo "$response" >&2
exit 1
fi
local status_code
status_code="$(echo "$response" | jq -r '.base_resp.status_code // 0')" 2>/dev/null || true
if [[ "$status_code" != "0" && -n "$status_code" ]]; then
local status_msg
status_msg="$(echo "$response" | jq -r '.base_resp.status_msg // "Unknown error"')"
echo "Error: API error (code $status_code): $status_msg" >&2
exit 1
fi
local success_count failed_count
success_count="$(echo "$response" | jq -r '.metadata.success_count // 0')" 2>/dev/null || true
failed_count="$(echo "$response" | jq -r '.metadata.failed_count // 0')" 2>/dev/null || true
echo "Success: $success_count, Failed: $failed_count"
mkdir -p "$(dirname "$output")"
if [[ "$response_format" == "base64" ]]; then
local count
count="$(echo "$response" | jq '.data.image_base64 | length')" 2>/dev/null || count=0
if [[ "$count" -eq 0 ]]; then
echo "Error: No image data in response" >&2; exit 1
fi
if [[ "$count" -eq 1 ]]; then
echo "$response" | jq -r '.data.image_base64[0]' | base64 -d > "$output"
echo "Image saved to: $output"
else
local ext="${output##*.}"
local base="${output%.*}"
for ((i=0; i<count; i++)); do
local out_file="${base}_$((i+1)).${ext}"
echo "$response" | jq -r ".data.image_base64[$i]" | base64 -d > "$out_file"
echo "Image saved to: $out_file"
done
fi
elif [[ "$response_format" == "url" ]]; then
local count
count="$(echo "$response" | jq '.data.image_urls | length')" 2>/dev/null || count=0
if [[ "$count" -eq 0 ]]; then
echo "Error: No image URLs in response" >&2
echo "$response" | jq . >&2
exit 1
fi
if $download; then
if [[ "$count" -eq 1 ]]; then
local img_url
img_url="$(echo "$response" | jq -r '.data.image_urls[0]')"
echo "URL: $img_url"
curl -s -o "$output" --max-time 120 "$img_url"
echo "Image downloaded to: $output"
else
local ext="${output##*.}"
local base="${output%.*}"
for ((i=0; i<count; i++)); do
local img_url out_file
img_url="$(echo "$response" | jq -r ".data.image_urls[$i]")"
out_file="${base}_$((i+1)).${ext}"
echo "URL $((i+1)): $img_url"
curl -s -o "$out_file" --max-time 120 "$img_url"
echo "Image downloaded to: $out_file"
done
fi
else
for ((i=0; i<count; i++)); do
local img_url
img_url="$(echo "$response" | jq -r ".data.image_urls[$i]")"
echo "Image URL $((i+1)): $img_url"
done
echo "Use without --no-download to save files automatically."
fi
fi
echo "Done!"
}
main "$@"

@@ -0,0 +1,543 @@
#!/usr/bin/env bash
# MiniMax Multi-Modal Toolkit Media Tools CLI (pure bash)
#
# FFmpeg-based utilities for audio/video format conversion, concatenation,
# extraction, and trimming.
#
# Usage:
# bash scripts/media_tools.sh convert-video input.webm -o output.mp4
# bash scripts/media_tools.sh convert-audio input.wav -o output.mp3
# bash scripts/media_tools.sh concat-video seg1.mp4 seg2.mp4 -o merged.mp4
# bash scripts/media_tools.sh concat-audio part1.mp3 part2.mp3 -o combined.mp3
# bash scripts/media_tools.sh extract-audio input.mp4 -o audio.mp3
# bash scripts/media_tools.sh trim-video input.mp4 --start 5 --end 15 -o clip.mp4
# bash scripts/media_tools.sh add-audio --video video.mp4 --audio bgm.mp3 -o output.mp4
# bash scripts/media_tools.sh probe input.mp4
set -euo pipefail
# ============================================================================
# Probe / info helpers
# ============================================================================
probe_media() {
ffprobe -v error -show_format -show_streams -of json "$1" 2>/dev/null
}
get_duration() {
probe_media "$1" | jq -r '.format.duration // "0"'
}
get_video_fps() {
local fps_str
fps_str="$(ffprobe -v error -select_streams v:0 -show_entries stream=r_frame_rate -of csv=p=0 "$1" 2>/dev/null)" || { echo 25; return; }
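local num="${fps_str%/*}" den="${fps_str#*/}"
# Fall back to 25 fps when ffprobe prints nothing or a zero denominator (e.g. "0/0")
if [[ "$num" =~ ^[0-9]+$ && "$den" =~ ^[0-9]+$ && "$den" -gt 0 ]]; then
echo $(( (num + den/2) / den ))
else
echo 25
fi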
}
has_audio_stream() {
local out
out="$(ffprobe -v error -select_streams a -show_entries stream=codec_type -of csv=p=0 "$1" 2>/dev/null)"
[[ "$out" == *audio* ]]
}
has_video_stream() {
local out
out="$(ffprobe -v error -select_streams v -show_entries stream=codec_type -of csv=p=0 "$1" 2>/dev/null)"
[[ "$out" == *video* ]]
}
# ============================================================================
# Video codec maps
# ============================================================================
video_codec_for() {
case "$1" in
mp4|mov|mkv|avi|ts|flv) echo "libx264" ;;
webm) echo "libvpx-vp9" ;;
*) echo "libx264" ;;
esac
}
audio_codec_for_container() {
case "$1" in
mp4|mov|mkv|ts|flv) echo "aac" ;;
webm) echo "libopus" ;;
avi) echo "mp3" ;;
*) echo "aac" ;;
esac
}
audio_codec_for_format() {
case "$1" in
mp3) echo "libmp3lame" ;;
wav) echo "pcm_s16le" ;;
flac) echo "flac" ;;
ogg) echo "libvorbis" ;;
aac|m4a) echo "aac" ;;
opus) echo "libopus" ;;
wma) echo "wmav2" ;;
*) echo "libmp3lame" ;;
esac
}
get_ext() {
local name="$1"
echo "${name##*.}" | tr '[:upper:]' '[:lower:]'
}
# ============================================================================
# Subcommand: convert-video
# ============================================================================
cmd_convert_video() {
local input="" output="" crf=18 preset="medium" resolution="" fps=""
if [[ $# -gt 0 && "$1" != -* ]]; then input="$1"; shift; fi
while [[ $# -gt 0 ]]; do
case "$1" in
-o|--output) output="$2"; shift 2 ;;
--crf) crf="$2"; shift 2 ;;
--preset) preset="$2"; shift 2 ;;
--resolution) resolution="$2"; shift 2 ;;
--fps) fps="$2"; shift 2 ;;
*) [[ -z "$input" ]] && input="$1"; shift ;;
esac
done
[[ -z "$input" || ! -f "$input" ]] && { echo "Error: Input file not found: ${input:-<none>}" >&2; exit 1; }
[[ -z "$output" ]] && { echo "Error: -o/--output required" >&2; exit 1; }
local ext; ext="$(get_ext "$output")"
local v_codec; v_codec="$(video_codec_for "$ext")"
local a_codec; a_codec="$(audio_codec_for_container "$ext")"
mkdir -p "$(dirname "$output")"
local cmd=(ffmpeg -y -i "$input")
# Video filters
if [[ -n "$resolution" ]]; then
local w="${resolution%%x*}" h="${resolution##*x}"
cmd+=(-vf "scale=${w}:${h}")
fi
cmd+=(-c:v "$v_codec")
case "$v_codec" in
libx264|libx265) cmd+=(-crf "$crf" -preset "$preset" -pix_fmt yuv420p) ;;
libvpx-vp9) cmd+=(-crf "$crf" -b:v 0) ;;
esac
[[ -n "$fps" ]] && cmd+=(-r "$fps")
if has_audio_stream "$input"; then
cmd+=(-c:a "$a_codec" -b:a 192k)
else
cmd+=(-an)
fi
cmd+=("$output")
echo "Converting: $input -> $output ($v_codec/$a_codec)"
"${cmd[@]}" 2>/dev/null
echo " Done: $output"
}
# ============================================================================
# Subcommand: convert-audio
# ============================================================================
cmd_convert_audio() {
local input="" output="" bitrate="192k" sample_rate="" channels=""
if [[ $# -gt 0 && "$1" != -* ]]; then input="$1"; shift; fi
while [[ $# -gt 0 ]]; do
case "$1" in
-o|--output) output="$2"; shift 2 ;;
--bitrate) bitrate="$2"; shift 2 ;;
--sample-rate) sample_rate="$2"; shift 2 ;;
--channels) channels="$2"; shift 2 ;;
*) [[ -z "$input" ]] && input="$1"; shift ;;
esac
done
[[ -z "$input" || ! -f "$input" ]] && { echo "Error: Input file not found: ${input:-<none>}" >&2; exit 1; }
[[ -z "$output" ]] && { echo "Error: -o/--output required" >&2; exit 1; }
local ext; ext="$(get_ext "$output")"
local codec; codec="$(audio_codec_for_format "$ext")"
mkdir -p "$(dirname "$output")"
local cmd=(ffmpeg -y -i "$input" -c:a "$codec" -b:a "$bitrate")
[[ -n "$sample_rate" ]] && cmd+=(-ar "$sample_rate")
[[ -n "$channels" ]] && cmd+=(-ac "$channels")
cmd+=("$output")
echo "Converting audio: $input -> $output ($codec)"
"${cmd[@]}" 2>/dev/null
echo " Done: $output"
}
# ============================================================================
# Subcommand: concat-video
# ============================================================================
cmd_concat_video() {
local output="" crossfade=0.5
local inputs=()
while [[ $# -gt 0 ]]; do
case "$1" in
-o|--output) output="$2"; shift 2 ;;
--crossfade) crossfade="$2"; shift 2 ;;
*) inputs+=("$1"); shift ;;
esac
done
[[ ${#inputs[@]} -lt 1 ]] && { echo "Error: At least 1 input file required" >&2; exit 1; }
[[ -z "$output" ]] && { echo "Error: -o/--output required" >&2; exit 1; }
mkdir -p "$(dirname "$output")"
# A single input has nothing to join: copy it through unchanged
if [[ ${#inputs[@]} -eq 1 ]]; then
cp "${inputs[0]}" "$output"
echo " Done: $output"
return 0
fi
local fps; fps="$(get_video_fps "${inputs[0]}")"
local has_audio=true
for vp in "${inputs[@]}"; do
has_audio_stream "$vp" || { has_audio=false; break; }
done
if [[ "$(echo "$crossfade > 0" | bc -l)" == "1" ]]; then
local durations=()
for vp in "${inputs[@]}"; do durations+=("$(get_duration "$vp")"); done
local ff_inputs=()
for vp in "${inputs[@]}"; do ff_inputs+=(-i "$(cd "$(dirname "$vp")" && pwd)/$(basename "$vp")"); done
local n=${#inputs[@]}
local offsets=() cumulative=0
for ((i=0; i<n-1; i++)); do
local offset; offset="$(echo "$cumulative + ${durations[$i]} - $crossfade" | bc -l)"
offsets+=("$offset"); cumulative="$offset"
done
local vf_parts=() af_parts=()
if [[ $n -eq 2 ]]; then
vf_parts+=("[0:v][1:v]xfade=transition=fade:duration=${crossfade}:offset=${offsets[0]}[vout]")
$has_audio && af_parts+=("[0:a][1:a]acrossfade=d=${crossfade}:c1=tri:c2=tri[aout]")
else
vf_parts+=("[0:v][1:v]xfade=transition=fade:duration=${crossfade}:offset=${offsets[0]}[xv1]")
$has_audio && af_parts+=("[0:a][1:a]acrossfade=d=${crossfade}:c1=tri:c2=tri[xa1]")
for ((i=2; i<n; i++)); do
local out_v="[xv${i}]" out_a="[xa${i}]"
[[ $i -eq $((n-1)) ]] && { out_v="[vout]"; out_a="[aout]"; }
vf_parts+=("[xv$((i-1))][${i}:v]xfade=transition=fade:duration=${crossfade}:offset=${offsets[$((i-1))]}${out_v}")
$has_audio && af_parts+=("[xa$((i-1))][${i}:a]acrossfade=d=${crossfade}:c1=tri:c2=tri${out_a}")
done
fi
local fc
fc="$(IFS=';'; echo "${vf_parts[*]}${af_parts[*]:+;${af_parts[*]}}")"
local cmd=(ffmpeg -y "${ff_inputs[@]}" -filter_complex "$fc" -map "[vout]")
$has_audio && cmd+=(-map "[aout]")
cmd+=(-c:v libx264 -preset medium -crf 18 -pix_fmt yuv420p -r "$fps")
$has_audio && cmd+=(-c:a aac -b:a 192k)
cmd+=("$output")
echo "Concatenating $n videos with ${crossfade}s crossfade..."
if "${cmd[@]}" 2>/dev/null; then
echo " Done: $output"
return 0
fi
echo " Crossfade failed, falling back to re-encode..."
fi
# Fallback
# Template ends in X's so both GNU and BSD/macOS mktemp randomize it
local concat_file; concat_file="$(mktemp "${TMPDIR:-/tmp}/concat_XXXXXX")"
for vp in "${inputs[@]}"; do
echo "file '$(cd "$(dirname "$vp")" && pwd)/$(basename "$vp")'" >> "$concat_file"
done
ffmpeg -y -f concat -safe 0 -i "$concat_file" \
-c:v libx264 -preset medium -crf 18 -pix_fmt yuv420p -r "$fps" \
-c:a aac -b:a 192k "$output" 2>/dev/null
rm -f "$concat_file"
echo " Done: $output"
}
# ============================================================================
# Subcommand: concat-audio
# ============================================================================
cmd_concat_audio() {
local output="" crossfade=0
local inputs=()
while [[ $# -gt 0 ]]; do
case "$1" in
-o|--output) output="$2"; shift 2 ;;
--crossfade) crossfade="$2"; shift 2 ;;
*) inputs+=("$1"); shift ;;
esac
done
[[ ${#inputs[@]} -lt 1 ]] && { echo "Error: At least 1 input file required" >&2; exit 1; }
[[ -z "$output" ]] && { echo "Error: -o/--output required" >&2; exit 1; }
mkdir -p "$(dirname "$output")"
if [[ ${#inputs[@]} -eq 1 ]]; then
cp "${inputs[0]}" "$output"
echo " Done: $output"
return 0
fi
local ext; ext="$(get_ext "$output")"
local codec; codec="$(audio_codec_for_format "$ext")"
local n=${#inputs[@]}
if [[ "$(echo "$crossfade > 0" | bc -l)" == "1" ]]; then
local ff_inputs=()
for ap in "${inputs[@]}"; do ff_inputs+=(-i "$(cd "$(dirname "$ap")" && pwd)/$(basename "$ap")"); done
local af_parts=()
if [[ $n -eq 2 ]]; then
af_parts+=("[0:a][1:a]acrossfade=d=${crossfade}:c1=tri:c2=tri[aout]")
else
af_parts+=("[0:a][1:a]acrossfade=d=${crossfade}:c1=tri:c2=tri[xa1]")
for ((i=2; i<n; i++)); do
local prev="[xa$((i-1))]" out="[xa${i}]"
[[ $i -eq $((n-1)) ]] && out="[aout]"
af_parts+=("${prev}[${i}:a]acrossfade=d=${crossfade}:c1=tri:c2=tri${out}")
done
fi
local fc; fc="$(IFS=';'; echo "${af_parts[*]}")"
echo "Concatenating $n audio files with ${crossfade}s crossfade..."
if ffmpeg -y "${ff_inputs[@]}" -filter_complex "$fc" -map "[aout]" \
-c:a "$codec" -b:a 192k "$output" 2>/dev/null; then
echo " Done: $output"
return 0
fi
echo " Crossfade failed, falling back..."
fi
# Fallback: concat demuxer
# Template ends in X's so both GNU and BSD/macOS mktemp randomize it
local concat_file; concat_file="$(mktemp "${TMPDIR:-/tmp}/concat_XXXXXX")"
for ap in "${inputs[@]}"; do
echo "file '$(cd "$(dirname "$ap")" && pwd)/$(basename "$ap")'" >> "$concat_file"
done
ffmpeg -y -f concat -safe 0 -i "$concat_file" -c:a "$codec" -b:a 192k "$output" 2>/dev/null
rm -f "$concat_file"
echo " Done: $output"
}
# ============================================================================
# Subcommand: extract-audio
# ============================================================================
cmd_extract_audio() {
local input="" output="" bitrate="192k"
if [[ $# -gt 0 && "$1" != -* ]]; then input="$1"; shift; fi
while [[ $# -gt 0 ]]; do
case "$1" in
-o|--output) output="$2"; shift 2 ;;
--bitrate) bitrate="$2"; shift 2 ;;
*) [[ -z "$input" ]] && input="$1"; shift ;;
esac
done
[[ -z "$input" || ! -f "$input" ]] && { echo "Error: Input not found: ${input:-<none>}" >&2; exit 1; }
[[ -z "$output" ]] && { echo "Error: -o/--output required" >&2; exit 1; }
has_audio_stream "$input" || { echo "Error: No audio stream in $input" >&2; exit 1; }
local ext; ext="$(get_ext "$output")"
local codec; codec="$(audio_codec_for_format "$ext")"
mkdir -p "$(dirname "$output")"
echo "Extracting audio: $input -> $output"
ffmpeg -y -i "$input" -vn -c:a "$codec" -b:a "$bitrate" "$output" 2>/dev/null
echo " Done: $output"
}
# ============================================================================
# Subcommand: trim-video
# ============================================================================
cmd_trim_video() {
local input="" output="" start="" end="" duration=""
if [[ $# -gt 0 && "$1" != -* ]]; then input="$1"; shift; fi
while [[ $# -gt 0 ]]; do
case "$1" in
-o|--output) output="$2"; shift 2 ;;
--start) start="$2"; shift 2 ;;
--end) end="$2"; shift 2 ;;
--duration) duration="$2"; shift 2 ;;
*) [[ -z "$input" ]] && input="$1"; shift ;;
esac
done
[[ -z "$input" || ! -f "$input" ]] && { echo "Error: Input not found: ${input:-<none>}" >&2; exit 1; }
[[ -z "$output" ]] && { echo "Error: -o/--output required" >&2; exit 1; }
mkdir -p "$(dirname "$output")"
local cmd=(ffmpeg -y)
[[ -n "$start" ]] && cmd+=(-ss "$start")
cmd+=(-i "$input")
if [[ -n "$duration" ]]; then
cmd+=(-t "$duration")
elif [[ -n "$end" ]]; then
local actual_start="${start:-0}"
local dur; dur="$(echo "$end - $actual_start" | bc -l)"
cmd+=(-t "$dur")
fi
cmd+=(-c:v libx264 -preset medium -crf 18 -pix_fmt yuv420p)
has_audio_stream "$input" && cmd+=(-c:a aac -b:a 192k)
cmd+=("$output")
local start_str="${start:-0}s"
local end_str="${end:+${end}s}"
[[ -z "$end_str" && -n "$duration" ]] && end_str="+${duration}s"
[[ -z "$end_str" ]] && end_str="end"
echo "Trimming: $input [$start_str - $end_str] -> $output"
"${cmd[@]}" 2>/dev/null
echo " Done: $output"
}
# ============================================================================
# Subcommand: add-audio
# ============================================================================
cmd_add_audio() {
local video="" audio="" output="" volume=1.0 fade_in=0 fade_out=0 replace=false
while [[ $# -gt 0 ]]; do
case "$1" in
--video) video="$2"; shift 2 ;;
--audio) audio="$2"; shift 2 ;;
-o|--output) output="$2"; shift 2 ;;
--volume) volume="$2"; shift 2 ;;
--fade-in) fade_in="$2"; shift 2 ;;
--fade-out) fade_out="$2"; shift 2 ;;
--replace) replace=true; shift ;;
*) echo "Unknown option: $1" >&2; exit 1 ;;
esac
done
[[ -z "$video" || ! -f "$video" ]] && { echo "Error: Video not found: ${video:-<none>}" >&2; exit 1; }
[[ -z "$audio" || ! -f "$audio" ]] && { echo "Error: Audio not found: ${audio:-<none>}" >&2; exit 1; }
[[ -z "$output" ]] && { echo "Error: -o/--output required" >&2; exit 1; }
mkdir -p "$(dirname "$output")"
local duration; duration="$(get_duration "$video")"
local video_audio=false
has_audio_stream "$video" && video_audio=true
local af="[1:a]volume=${volume}"
[[ "$(echo "$fade_in > 0" | bc -l)" == "1" ]] && af+=",afade=t=in:d=${fade_in}"
if [[ "$(echo "$fade_out > 0" | bc -l)" == "1" ]]; then
local fo_start; fo_start="$(echo "$duration - $fade_out" | bc -l)"
[[ "$(echo "$fo_start < 0" | bc -l)" == "1" ]] && fo_start=0
af+=",afade=t=out:st=${fo_start}:d=${fade_out}"
fi
if $video_audio && ! $replace; then
af+="[newaudio];[0:a][newaudio]amix=inputs=2:duration=first:dropout_transition=2[aout]"
local mode="mixing with"
else
af+="[aout]"
local mode="replacing"
fi
echo "Adding audio ($mode original): $output"
ffmpeg -y -i "$video" -i "$audio" \
-filter_complex "$af" \
-map 0:v -map "[aout]" \
-c:v copy -c:a aac -b:a 192k -shortest "$output" 2>/dev/null
echo " Done: $output"
}
# ============================================================================
# Subcommand: probe
# ============================================================================
cmd_probe() {
local input=""
if [[ $# -gt 0 ]]; then input="$1"; fi
[[ -z "$input" || ! -f "$input" ]] && { echo "Error: File not found: ${input:-<none>}" >&2; exit 1; }
local info; info="$(probe_media "$input")"
local fmt_name dur size br
fmt_name="$(echo "$info" | jq -r '.format.format_long_name // "unknown"')"
dur="$(echo "$info" | jq -r '.format.duration // "0"')"
size="$(echo "$info" | jq -r '.format.size // "0"')"
br="$(echo "$info" | jq -r '.format.bit_rate // "0"')"
echo "File: $input"
echo "Format: $fmt_name"
printf "Duration: %.2fs\n" "$dur"
printf "Size: %.2f MB\n" "$(echo "$size / 1048576" | bc -l)"
printf "Bitrate: %.0f kbps\n" "$(echo "$br / 1000" | bc -l)"
echo "$info" | jq -r '.streams[] | if .codec_type == "video" then "Video: \(.codec_name) \(.width)x\(.height) @ \(.r_frame_rate) fps" elif .codec_type == "audio" then "Audio: \(.codec_name) \(.sample_rate)Hz \(.channels)ch" else empty end'
}
# ============================================================================
# Main dispatcher
# ============================================================================
usage() {
cat <<'EOF'
MiniMax Multi-Modal Toolkit Media Tools
Usage:
media_tools.sh <command> [options]
Commands:
convert-video Convert video format
convert-audio Convert audio format
concat-video Concatenate videos with crossfade
concat-audio Concatenate audio files
extract-audio Extract audio from video
trim-video Trim video by time range
add-audio Add/overlay audio on video
probe Show media file info
Examples:
media_tools.sh convert-video input.webm -o output.mp4
media_tools.sh convert-audio input.wav -o output.mp3
media_tools.sh concat-video seg1.mp4 seg2.mp4 -o merged.mp4
media_tools.sh extract-audio video.mp4 -o audio.mp3
media_tools.sh trim-video input.mp4 --start 5 --end 15 -o clip.mp4
media_tools.sh add-audio --video video.mp4 --audio bgm.mp3 -o output.mp4
media_tools.sh probe input.mp4
EOF
}
main() {
if [[ $# -eq 0 ]]; then
usage; exit 0
fi
local command="$1"; shift
case "$command" in
convert-video) cmd_convert_video "$@" ;;
convert-audio) cmd_convert_audio "$@" ;;
concat-video) cmd_concat_video "$@" ;;
concat-audio) cmd_concat_audio "$@" ;;
extract-audio) cmd_extract_audio "$@" ;;
trim-video) cmd_trim_video "$@" ;;
add-audio) cmd_add_audio "$@" ;;
probe) cmd_probe "$@" ;;
-h|--help|help) usage ;;
*) echo "Unknown command: $command" >&2; usage >&2; exit 1 ;;
esac
}
main "$@"
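The crossfade branch of `concat-video` chains one `xfade` per transition, and each `offset` is measured on the output of the previous filter, so it is a running total rather than a per-clip value. A minimal sketch of that arithmetic, using a hypothetical `xfade_offsets` helper (not part of `media_tools.sh`) that takes the crossfade length followed by the clip durations in seconds; it uses `awk` for the fractional arithmetic where the script uses `bc`, and rounds to milliseconds as an assumption:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the xfade offset of each transition when chaining
# N clips. Usage: xfade_offsets CROSSFADE DUR1 DUR2 [DUR3 ...]
# Transition i begins at (d0 + ... + di) - (i + 1) * CROSSFADE, because every
# earlier transition has already shortened the running timeline by CROSSFADE.
xfade_offsets() {
  local crossfade="$1"; shift
  local durations=("$@")
  local cumulative=0 offset i
  for ((i = 0; i < ${#durations[@]} - 1; i++)); do
    # awk handles fractional seconds, which bash integer arithmetic cannot;
    # the result is printed with millisecond precision
    offset="$(awk -v c="$cumulative" -v d="${durations[$i]}" -v x="$crossfade" \
      'BEGIN { printf "%.3f\n", c + d - x }')"
    echo "$offset"
    cumulative="$offset"
  done
}
```

For clips of 10 s, 8 s and 6 s with a 1 s crossfade it prints `9.000` and `16.000`; the merged video then runs 24 s minus two 1 s overlaps, or 22 s. `xfade` also needs all inputs to share resolution and frame rate; otherwise the filter graph fails and the script falls back to the concat-demuxer re-encode.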

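The fallback branches of `concat-video` and `concat-audio` wrap each path in single quotes when writing the concat-demuxer list, which breaks for a file name that itself contains a single quote. ffmpeg's quoting rules handle that case by closing the quote, escaping the quote character, and reopening (`'\''`). A sketch of a list writer that applies this escaping, with `write_concat_list` as a hypothetical name and `sed` doing the substitution:

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of media_tools.sh): write an ffmpeg
# concat-demuxer list with every path single-quoted and each embedded single
# quote rewritten as '\'' (close quote, escaped quote, reopen quote).
# Usage: write_concat_list LIST_FILE INPUT1 [INPUT2 ...]
write_concat_list() {
  local list_file="$1"; shift
  local path abs escaped
  : > "$list_file"  # create or truncate the list
  for path in "$@"; do
    # Absolute path, built the same way the script builds it
    abs="$(cd "$(dirname "$path")" && pwd)/$(basename "$path")"
    # sed receives s/'/'\\''/g, i.e. replace ' with '\''
    escaped="$(printf '%s' "$abs" | sed "s/'/'\\\\''/g")"
    printf "file '%s'\n" "$escaped" >> "$list_file"
  done
}
```

The resulting file is consumed the same way as the script's temporary list, via `ffmpeg -f concat -safe 0 -i LIST ...`.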
@@ -0,0 +1,266 @@
#!/usr/bin/env bash
# MiniMax Music Generation CLI (pure bash)
#
# Usage:
# bash scripts/music/generate_music.sh --lyrics "[verse]\nHello world" --output output/song.mp3 --download
# bash scripts/music/generate_music.sh --instrumental --prompt "ambient electronic" -o output/ambient.mp3 --download
# bash scripts/music/generate_music.sh --lyrics "[verse]\nStars" --genre pop --mood happy -o output/happy.mp3 --download
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
# ============================================================================
# Common functions (duplicated in generate_voice.sh; keep both copies in sync)
# ============================================================================
load_env() {
local env_file
for env_file in "$PROJECT_ROOT/.env" "$(pwd)/.env"; do
if [[ -f "$env_file" ]]; then
while IFS= read -r line || [[ -n "$line" ]]; do
line="${line%%#*}"
line="$(echo "$line" | xargs)"
[[ -z "$line" || "$line" != *=* ]] && continue
local key="${line%%=*}" val="${line#*=}"
key="$(echo "$key" | xargs)"; val="$(echo "$val" | xargs)"
if [[ ${#val} -ge 2 ]]; then
case "$val" in
\"*\") val="${val:1:${#val}-2}" ;;
\'*\') val="${val:1:${#val}-2}" ;;
esac
fi
[[ -z "${!key:-}" ]] && export "$key=$val"
done < "$env_file"
return 0
fi
done
}
check_api_key() {
if [[ -z "${MINIMAX_API_KEY:-}" ]]; then
echo "Error: MINIMAX_API_KEY environment variable is not set." >&2
exit 1
fi
}
# ============================================================================
# Main
# ============================================================================
main() {
load_env
check_api_key
local lyrics="" prompt="" model="music-2.5" instrumental=false
local genre="" mood="" tempo="" bpm="" key="" instruments="" vocals=""
local use_case="" structure="" avoid="" references=""
local output="" output_format="url" stream=false download=false
local sample_rate="" bitrate="" format="" aigc_watermark=""
while [[ $# -gt 0 ]]; do
case "$1" in
--lyrics) lyrics="$2"; shift 2 ;;
--prompt) prompt="$2"; shift 2 ;;
--model) model="$2"; shift 2 ;;
--instrumental) instrumental=true; shift ;;
--genre) genre="$2"; shift 2 ;;
--mood) mood="$2"; shift 2 ;;
--tempo) tempo="$2"; shift 2 ;;
--bpm) bpm="$2"; shift 2 ;;
--key) key="$2"; shift 2 ;;
--instruments) instruments="$2"; shift 2 ;;
--vocals) vocals="$2"; shift 2 ;;
--use-case) use_case="$2"; shift 2 ;;
--structure) structure="$2"; shift 2 ;;
--avoid) avoid="$2"; shift 2 ;;
--references) references="$2"; shift 2 ;;
-o|--output) output="$2"; shift 2 ;;
--output-format) output_format="$2"; shift 2 ;;
--stream) stream=true; shift ;;
--download) download=true; shift ;;
--sample-rate) sample_rate="$2"; shift 2 ;;
--bitrate) bitrate="$2"; shift 2 ;;
--format) format="$2"; shift 2 ;;
--aigc-watermark) aigc_watermark="$2"; shift 2 ;;
-h|--help)
cat <<'USAGE'
MiniMax Music Generation CLI
Usage:
generate_music.sh [options]
Options:
--lyrics TEXT Song lyrics (with [verse]/[chorus] tags)
--prompt TEXT Music style/description prompt
--instrumental Generate instrumental (no vocals)
--model MODEL Model name (default: music-2.5)
--genre TEXT Genre (e.g. pop, rock, jazz)
--mood TEXT Mood (e.g. happy, melancholic)
--tempo TEXT Tempo description (e.g. fast, slow)
--bpm NUMBER Beats per minute
--key TEXT Musical key (e.g. C major, A minor)
--instruments TEXT Instruments to include
--vocals TEXT Vocal style description
--use-case TEXT Use case (e.g. background, theme song)
--structure TEXT Song structure
--avoid TEXT Elements to avoid
--references TEXT Reference tracks/artists
--output-format FMT Output format: url (default) or hex
--download Download audio file (for url format)
--sample-rate N Audio sample rate
--bitrate N Audio bitrate
--format FMT Audio format (mp3, wav, etc.)
--aigc-watermark BOOL Send aigc_watermark (true/false) to the API
-o, --output FILE Output file path (required)
Examples:
generate_music.sh --instrumental --prompt "ambient electronic" -o ambient.mp3 --download
generate_music.sh --lyrics "[verse]\nHello world" -o song.mp3 --download
generate_music.sh --lyrics "[verse]\nStars" --genre pop --mood happy -o happy.mp3 --download
USAGE
exit 0
;;
*) echo "Unknown option: $1" >&2; exit 1 ;;
esac
done
if [[ -z "$output" ]]; then
echo "Error: --output / -o is required" >&2
exit 1
fi
# Build prompt from structured fields
local field_parts=()
[[ -n "$genre" ]] && field_parts+=("Genre: $genre")
[[ -n "$mood" ]] && field_parts+=("Mood: $mood")
[[ -n "$tempo" ]] && field_parts+=("Tempo: $tempo")
[[ -n "$bpm" ]] && field_parts+=("BPM: $bpm")
[[ -n "$key" ]] && field_parts+=("Key: $key")
[[ -n "$instruments" ]] && field_parts+=("Instruments: $instruments")
[[ -n "$vocals" ]] && field_parts+=("Vocals: $vocals")
[[ -n "$use_case" ]] && field_parts+=("Use case: $use_case")
[[ -n "$structure" ]] && field_parts+=("Structure: $structure")
[[ -n "$avoid" ]] && field_parts+=("Avoid: $avoid")
[[ -n "$references" ]] && field_parts+=("References: $references")
local field_prompt=""
if [[ ${#field_parts[@]} -gt 0 ]]; then
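# Join with ". " ("${field_parts[*]}" would only use the first character of IFS)
field_prompt="$(printf '%s. ' "${field_parts[@]}")"
field_prompt="${field_prompt%. }"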
fi
if [[ -n "$field_prompt" ]]; then
if [[ -n "$prompt" ]]; then
prompt="$prompt. $field_prompt"
else
prompt="$field_prompt"
fi
fi
# Build payload
local payload
payload=$(jq -n \
--arg model "$model" \
--arg prompt "$prompt" \
--arg of "$output_format" \
--argjson stream "$stream" \
'{model: $model, prompt: $prompt, output_format: $of, stream: $stream}')
if $instrumental; then
# music-2.5 does not support is_instrumental — use lyrics workaround
payload=$(echo "$payload" | jq '. + {lyrics: "[intro] [outro]"}')
local current_prompt
current_prompt="$(echo "$payload" | jq -r '.prompt // ""')"
if [[ -n "$current_prompt" ]]; then
payload=$(echo "$payload" | jq --arg p "$current_prompt. pure music, no lyrics" '.prompt = $p')
else
payload=$(echo "$payload" | jq '.prompt = "pure music, no lyrics"')
fi
else
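# Turn literal \n sequences (as typed in the usage examples) into real newlines
local literal_nl='\n' newline=$'\n'
lyrics="${lyrics//"$literal_nl"/"$newline"}"
payload=$(echo "$payload" | jq --arg l "$lyrics" '. + {lyrics: $l}')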
fi
# Audio settings
local audio_setting="{}"
[[ -n "$sample_rate" ]] && audio_setting=$(echo "$audio_setting" | jq --argjson sr "$sample_rate" '. + {sample_rate: $sr}')
[[ -n "$bitrate" ]] && audio_setting=$(echo "$audio_setting" | jq --argjson br "$bitrate" '. + {bitrate: $br}')
[[ -n "$format" ]] && audio_setting=$(echo "$audio_setting" | jq --arg f "$format" '. + {format: $f}')
if [[ "$audio_setting" != "{}" ]]; then
payload=$(echo "$payload" | jq --argjson as "$audio_setting" '. + {audio_setting: $as}')
fi
[[ -n "$aigc_watermark" ]] && payload=$(echo "$payload" | jq --argjson aw "$aigc_watermark" '. + {aigc_watermark: $aw}')
local api_host="${MINIMAX_API_HOST:-https://api.minimaxi.com}"
local api_url="${api_host}/v1/music_generation"
echo "Generating music with model: $model"
echo "Output format: $output_format"
# Send request via curl
local raw_output http_code response
raw_output="$(curl -s -w "\n%{http_code}" \
-X POST "$api_url" \
-H "Authorization: Bearer ${MINIMAX_API_KEY}" \
-H "Content-Type: application/json" \
--max-time 300 \
-d "$payload" 2>/dev/null)" || {
echo "Error: curl request failed" >&2
exit 1
}
http_code="${raw_output##*$'\n'}"
response="${raw_output%$'\n'*}"
if [[ "$http_code" -ge 400 ]] 2>/dev/null; then
echo "Error: API returned HTTP $http_code" >&2
echo "$response" >&2
exit 1
fi
local status_code
status_code="$(echo "$response" | jq -r '.base_resp.status_code // 0')" 2>/dev/null || true
if [[ "$status_code" != "0" && -n "$status_code" ]]; then
echo "API error: $(echo "$response" | jq '.base_resp')" >&2
exit 1
fi
mkdir -p "$(dirname "$output")"
if [[ "$output_format" == "hex" ]]; then
local audio_hex
audio_hex="$(echo "$response" | jq -r '.data.audio // empty')"
if [[ -z "$audio_hex" ]]; then
echo "Error: No audio hex data in response." >&2
exit 1
fi
echo "$audio_hex" | xxd -r -p > "$output"
echo "Audio saved to: $output"
elif [[ "$output_format" == "url" ]]; then
local audio_url
audio_url="$(echo "$response" | jq -r '.data.audio_url // .data.audio // .data.audio_file.download_url // empty')"
if [[ -z "$audio_url" ]]; then
echo "Error: No audio URL in response." >&2
echo "$response" | jq . >&2
exit 1
fi
echo "Audio URL: $audio_url"
if $download; then
curl -fsSL -o "$output" --max-time 120 "$audio_url" || {
echo "Error: failed to download $audio_url" >&2
exit 1
}
echo "Audio downloaded to: $output"
else
echo "Use --download to save the file, or download manually from the URL above."
echo "$audio_url" > "$output"
echo "URL written to: $output"
fi
fi
# Print extra info if present
local extra
extra="$(echo "$response" | jq -r '.extra_info // .data.extra_info // empty')" 2>/dev/null || true
if [[ -n "$extra" && "$extra" != "null" ]]; then
echo "Extra info: $extra"
fi
}
main "$@"
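`generate_music.sh` turns the structured flags into one prompt by joining the `Genre: ...`, `Mood: ...` fragments with `". "`. Bash has no built-in multi-character join: `"${arr[*]}"` separates elements with only the first character of `IFS`, so `IFS='. '` yields `Genre: pop.Mood: happy`. A sketch of a small join helper that avoids the pitfall; `join_by` is a hypothetical name, not something the script defines:

```bash
#!/usr/bin/env bash
# Hypothetical helper: join arguments with a multi-character separator.
# "${arr[*]}" uses only the FIRST character of IFS, so it cannot do this.
# Usage: join_by SEPARATOR ITEM1 [ITEM2 ...]
join_by() {
  local sep="$1"; shift
  [[ $# -eq 0 ]] && return 0   # nothing to join
  local out="$1"; shift
  local item
  for item in "$@"; do
    out+="${sep}${item}"
  done
  printf '%s\n' "$out"
}
```

Inside the script this would read `field_prompt="$(join_by '. ' "${field_parts[@]}")"`.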

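Both scripts decode `output_format: "hex"` audio with `xxd -r -p`, and `xxd` ships with Vim, so it can be missing on minimal systems. A sketch of a fallback that assumes only bash and `sed`: it rewrites each hex pair as a `\xHH` escape and lets the `printf` builtin emit the bytes. Both function names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical fallback decoder: hex string argument -> raw bytes on stdout.
# Usage: hex_decode_printf HEX > out.mp3
hex_decode_printf() {
  local hex="${1//[[:space:]]/}"
  # Reject odd lengths and non-hex characters (this also keeps '%' out of
  # the printf format string built below).
  [[ "$hex" =~ ^([0-9A-Fa-f]{2})*$ ]] || { echo "Error: invalid hex input" >&2; return 1; }
  local esc
  esc="$(printf '%s' "$hex" | sed 's/../\\x&/g')"   # "ab12" -> "\xab\x12"
  # The escapes go in the FORMAT string so that \x00 becomes a real NUL byte.
  printf "$esc"
}

# Prefer xxd when it is installed, otherwise use the printf decoder.
hex_to_file_portable() {
  local hex="$1" output="$2"
  mkdir -p "$(dirname "$output")"
  if command -v xxd >/dev/null 2>&1; then
    printf '%s' "$hex" | xxd -r -p > "$output"
  else
    hex_decode_printf "$hex" > "$output"
  fi
}
```

`hex_to_file_portable "$audio_hex" "$output"` could stand in for either script's `xxd` pipeline. The fallback builds the whole escape string in memory, so for multi-megabyte responses the `xxd` path is likely the faster one, which is why the sketch prefers it when available.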
@@ -0,0 +1,934 @@
#!/usr/bin/env bash
# MiniMax Voice CLI — Unified TTS command-line interface (pure bash)
#
# Usage:
# bash scripts/tts/generate_voice.sh tts "Hello world" -o hello.mp3
# bash scripts/tts/generate_voice.sh clone my_voice.mp3 --voice-id my-custom-voice
# bash scripts/tts/generate_voice.sh design "A gentle female voice" --voice-id designed-voice-1
# bash scripts/tts/generate_voice.sh list-voices
# bash scripts/tts/generate_voice.sh validate segments.json
# bash scripts/tts/generate_voice.sh generate segments.json -o output.mp3
# bash scripts/tts/generate_voice.sh merge file1.mp3 file2.mp3 -o combined.mp3
# bash scripts/tts/generate_voice.sh convert input.wav -o output.mp3
# bash scripts/tts/generate_voice.sh check-env
set -euo pipefail
# ============================================================================
# Configuration
# ============================================================================
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
# ============================================================================
# Common functions
# ============================================================================
load_env() {
local env_file
for env_file in "$PROJECT_ROOT/.env" "$(pwd)/.env"; do
if [[ -f "$env_file" ]]; then
while IFS= read -r line || [[ -n "$line" ]]; do
line="${line%%#*}" # strip comments
line="$(echo "$line" | xargs)" # trim whitespace
[[ -z "$line" || "$line" != *=* ]] && continue
local key="${line%%=*}"
local val="${line#*=}"
key="$(echo "$key" | xargs)"
val="$(echo "$val" | xargs)"
# Remove surrounding quotes
if [[ ${#val} -ge 2 ]]; then
case "$val" in
\"*\") val="${val:1:${#val}-2}" ;;
\'*\') val="${val:1:${#val}-2}" ;;
esac
fi
# Only set if not already in environment
if [[ -z "${!key:-}" ]]; then
export "$key=$val"
fi
done < "$env_file"
return 0
fi
done
return 0
}
check_api_key() {
if [[ -z "${MINIMAX_API_KEY:-}" ]]; then
echo "Error: MINIMAX_API_KEY environment variable is not set" >&2
echo " export MINIMAX_API_KEY='your-key'" >&2
exit 1
fi
}
ensure_dir() {
local dir="$1"
[[ -n "$dir" ]] && mkdir -p "$dir"
}
API_BASE="${MINIMAX_API_HOST:-https://api.minimaxi.com}/v1"
api_request() {
# api_request METHOD ENDPOINT [JSON_BODY]
# Outputs raw JSON response to stdout.
local method="$1" endpoint="$2" body="${3:-}"
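# Read MINIMAX_API_HOST at call time so a host loaded from .env by load_env is honored
local url="${MINIMAX_API_HOST:-https://api.minimaxi.com}/v1/${endpoint#/}"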
local args=(
-s -w "\n%{http_code}"
-X "$method"
-H "Authorization: Bearer ${MINIMAX_API_KEY}"
-H "Accept-Encoding: gzip, deflate"
--compressed
--max-time 120
)
if [[ -n "$body" ]]; then
args+=(-H "Content-Type: application/json" -d "$body")
fi
args+=("$url")
local output http_code response
output="$(curl "${args[@]}" 2>/dev/null)" || {
echo "Error: curl request failed" >&2
exit 1
}
http_code="${output##*$'\n'}"
response="${output%$'\n'*}"
if [[ "$http_code" -ge 400 ]] 2>/dev/null; then
echo "Error: API returned HTTP $http_code" >&2
echo "$response" >&2
exit 1
fi
# Check API-level error
local status_code
status_code="$(echo "$response" | jq -r '.base_resp.status_code // 0')" 2>/dev/null || true
if [[ "$status_code" != "0" && -n "$status_code" ]]; then
local status_msg
status_msg="$(echo "$response" | jq -r '.base_resp.status_msg // "Unknown error"')"
echo "Error: API error [$status_code]: $status_msg" >&2
exit 1
fi
echo "$response"
}
api_upload() {
# api_upload ENDPOINT FILE_PATH PURPOSE
local endpoint="$1" file_path="$2" purpose="$3"
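# Read MINIMAX_API_HOST at call time so a host loaded from .env by load_env is honored
local url="${MINIMAX_API_HOST:-https://api.minimaxi.com}/v1/${endpoint#/}"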
local output http_code response
output="$(curl -s -w "\n%{http_code}" \
-X POST \
-H "Authorization: Bearer ${MINIMAX_API_KEY}" \
-H "Accept-Encoding: gzip, deflate" \
--compressed \
-F "file=@${file_path}" \
-F "purpose=${purpose}" \
--max-time 120 \
"$url" 2>/dev/null)" || {
echo "Error: curl upload failed" >&2
exit 1
}
http_code="${output##*$'\n'}"
response="${output%$'\n'*}"
if [[ "$http_code" -ge 400 ]] 2>/dev/null; then
echo "Error: API returned HTTP $http_code" >&2
echo "$response" >&2
exit 1
fi
local status_code
status_code="$(echo "$response" | jq -r '.base_resp.status_code // 0')" 2>/dev/null || true
if [[ "$status_code" != "0" && -n "$status_code" ]]; then
local status_msg
status_msg="$(echo "$response" | jq -r '.base_resp.status_msg // "Unknown error"')"
echo "Error: API error [$status_code]: $status_msg" >&2
exit 1
fi
echo "$response"
}
hex_to_file() {
# hex_to_file HEX_STRING OUTPUT_PATH
local hex="$1" output="$2"
ensure_dir "$(dirname "$output")"
echo "$hex" | xxd -r -p > "$output"
}
# ============================================================================
# Subcommand: tts
# ============================================================================
cmd_tts() {
local text="" voice_id="male-qn-qingse" output="" model="speech-2.8-hd"
local speed=1.0 volume=1.0 pitch=0 emotion="" audio_format="mp3"
local sample_rate=32000 language_boost=""
# First positional arg is text
if [[ $# -gt 0 && "$1" != -* ]]; then
text="$1"; shift
fi
while [[ $# -gt 0 ]]; do
case "$1" in
-v|--voice-id) voice_id="$2"; shift 2 ;;
-o|--output) output="$2"; shift 2 ;;
--model) model="$2"; shift 2 ;;
--speed) speed="$2"; shift 2 ;;
--volume) volume="$2"; shift 2 ;;
--pitch) pitch="$2"; shift 2 ;;
--emotion) emotion="$2"; shift 2 ;;
--format) audio_format="$2"; shift 2 ;;
--sample-rate) sample_rate="$2"; shift 2 ;;
--language-boost) language_boost="$2"; shift 2 ;;
-*) echo "Unknown option: $1" >&2; exit 1 ;;
*) text="$1"; shift ;;
esac
done
if [[ -z "$text" ]]; then
echo "Error: text is required" >&2
echo "Usage: $(basename "$0") tts \"Text to speak\" -o output.mp3" >&2
exit 1
fi
# Build voice_setting
local voice_setting
voice_setting=$(jq -n \
--arg vid "$voice_id" \
--argjson spd "$speed" \
--argjson vol "$volume" \
--argjson pit "$pitch" \
'{voice_id: $vid, speed: $spd, vol: $vol, pitch: $pit}')
if [[ -n "$emotion" ]]; then
voice_setting=$(echo "$voice_setting" | jq --arg e "$emotion" '. + {emotion: $e}')
fi
# Build payload
local payload
payload=$(jq -n \
--arg model "$model" \
--arg text "$text" \
--argjson vs "$voice_setting" \
--arg fmt "$audio_format" \
--argjson sr "$sample_rate" \
'{
model: $model,
text: $text,
voice_setting: $vs,
audio_setting: {sample_rate: $sr, bitrate: 128000, format: $fmt, channel: 1},
stream: false,
subtitle_enable: false,
output_format: "hex"
}')
if [[ -n "$language_boost" ]]; then
payload=$(echo "$payload" | jq --arg lb "$language_boost" '. + {language_boost: $lb}')
fi
echo "Synthesizing: ${text:0:50}..."
local response
response="$(api_request POST t2a_v2 "$payload")"
# Extract hex audio
local audio_hex
audio_hex="$(echo "$response" | jq -r '.data.audio // .extra_info.audio // empty')"
if [[ -z "$audio_hex" ]]; then
echo "Error: No audio data returned from API" >&2
exit 1
fi
if [[ -n "$output" ]]; then
hex_to_file "$audio_hex" "$output"
echo "Done: $output"
else
echo "Generated ${#audio_hex} hex chars of audio"
fi
}
# ============================================================================
# Subcommand: clone
# ============================================================================
cmd_clone() {
local audio_file="" voice_id="" preview_text="" preview_output=""
# First positional arg is audio file
if [[ $# -gt 0 && "$1" != -* ]]; then
audio_file="$1"; shift
fi
while [[ $# -gt 0 ]]; do
case "$1" in
--voice-id) voice_id="$2"; shift 2 ;;
--preview) preview_text="$2"; shift 2 ;;
--preview-output) preview_output="$2"; shift 2 ;;
*) [[ -z "$audio_file" ]] && audio_file="$1"; shift ;;
esac
done
if [[ -z "$audio_file" ]]; then
echo "Error: audio file is required" >&2
echo "Usage: $(basename "$0") clone audio.mp3 --voice-id my-voice" >&2
exit 1
fi
if [[ ! -f "$audio_file" ]]; then
echo "Error: Audio file not found: $audio_file" >&2
exit 1
fi
if [[ -z "$voice_id" ]]; then
echo "Error: --voice-id is required" >&2
exit 1
fi
echo "Cloning voice from: $audio_file"
echo "Voice ID: $voice_id"
# Step 1: Upload audio
local upload_response file_id
upload_response="$(api_upload files/upload "$audio_file" voice_clone)"
file_id="$(echo "$upload_response" | jq -r '.file.file_id // .file_id // empty')"
if [[ -z "$file_id" ]]; then
echo "Error: Upload succeeded but no file_id was returned" >&2
exit 1
fi
# Step 2: Clone voice
local clone_payload
clone_payload=$(jq -n \
--arg vid "$voice_id" \
--argjson fid "$file_id" \
'{voice_id: $vid, file_id: $fid}')
api_request POST voice_clone "$clone_payload" > /dev/null
echo "Voice cloned successfully: $voice_id"
# Step 3: Preview if requested
if [[ -n "$preview_text" ]]; then
echo "Generating preview..."
local pout="${preview_output:-${voice_id}_preview.mp3}"
cmd_tts "$preview_text" -v "$voice_id" -o "$pout"
echo "Preview saved to: $pout"
fi
}
# ============================================================================
# Subcommand: design
# ============================================================================
cmd_design() {
local description="" voice_id="" preview_text="" preview_output=""
if [[ $# -gt 0 && "$1" != -* ]]; then
description="$1"; shift
fi
while [[ $# -gt 0 ]]; do
case "$1" in
--voice-id) voice_id="$2"; shift 2 ;;
--preview) preview_text="$2"; shift 2 ;;
--preview-output) preview_output="$2"; shift 2 ;;
*) [[ -z "$description" ]] && description="$1"; shift ;;
esac
done
if [[ -z "$description" ]]; then
echo "Error: description is required" >&2
echo "Usage: $(basename "$0") design \"A warm female voice\" --voice-id narrator" >&2
exit 1
fi
local ptext="${preview_text:-This is a preview of the designed voice.}"
echo "Designing voice from: \"$description\""
[[ -n "$voice_id" ]] && echo "Voice ID: $voice_id"
local payload
payload=$(jq -n \
--arg prompt "$description" \
--arg pt "$ptext" \
'{prompt: $prompt, preview_text: $pt}')
if [[ -n "$voice_id" ]]; then
payload=$(echo "$payload" | jq --arg vid "$voice_id" '. + {voice_id: $vid}')
fi
local response
response="$(api_request POST voice_design "$payload")"
local actual_voice_id
actual_voice_id="${voice_id:-$(echo "$response" | jq -r '.voice_id // "unknown"')}"
echo "Voice designed: $actual_voice_id"
local trial_audio
trial_audio="$(echo "$response" | jq -r '.trial_audio // empty')"
if [[ -n "$trial_audio" ]]; then
local pout="${preview_output:-${actual_voice_id}_preview.mp3}"
hex_to_file "$trial_audio" "$pout"
echo "Preview saved to: $pout"
fi
}
# ============================================================================
# Subcommand: list-voices
# ============================================================================
cmd_list_voices() {
echo "=== System Voices ==="
local sys_response
sys_response="$(api_request POST voice/list '{"voice_type":"system"}' 2>/dev/null)" || true
if [[ -n "$sys_response" ]]; then
local count
count="$(echo "$sys_response" | jq '.voice_list | length')" 2>/dev/null || count=0
if [[ "$count" -gt 0 ]]; then
echo "$sys_response" | jq -r '.voice_list[:10][] | " \(.voice_id): \(.name // "N/A")"'
if [[ "$count" -gt 10 ]]; then
echo " ... and $((count - 10)) more"
fi
else
echo " (None found)"
fi
else
echo " (Could not fetch system voices)"
fi
echo ""
echo "=== Custom Voices ==="
local clone_response design_response
clone_response="$(api_request POST voice/list '{"voice_type":"voice_cloning"}' 2>/dev/null)" || true
design_response="$(api_request POST voice/list '{"voice_type":"voice_generation"}' 2>/dev/null)" || true
local has_custom=false
if [[ -n "$clone_response" ]]; then
local cc
cc="$(echo "$clone_response" | jq '.voice_list | length')" 2>/dev/null || cc=0
if [[ "$cc" -gt 0 ]]; then
has_custom=true
echo "Cloned ($cc):"
echo "$clone_response" | jq -r '.voice_list[] | " \(.voice_id)"'
fi
fi
if [[ -n "$design_response" ]]; then
local dc
dc="$(echo "$design_response" | jq '.voice_list | length')" 2>/dev/null || dc=0
if [[ "$dc" -gt 0 ]]; then
has_custom=true
echo "Designed ($dc):"
echo "$design_response" | jq -r '.voice_list[] | " \(.voice_id)"'
fi
fi
if ! $has_custom; then
echo " (None found)"
fi
}
# ============================================================================
# Subcommand: validate
# ============================================================================
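# Illustrative segments.json accepted by `validate` and `generate`. The voice IDs are
# examples borrowed from the usage text below; substitute voices from `list-voices`.
# The file may be a bare array or an object with a "segments" array:
#   [
#     {"text": "Welcome to the show.", "voice_id": "narrator-1", "emotion": "calm"},
#     {"text": "Thanks for having me!", "voice_id": "female-shaonv", "emotion": "happy", "speed": 1.1}
#   ]
# "text" and "voice_id" are required; "emotion" must be one of valid_emotions below;
# `generate` defaults "speed", "volume" and "pitch" to 1.0, 1.0 and 0.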
cmd_validate() {
local segments_file="" model="speech-2.8-hd" strict=false verbose=false
if [[ $# -gt 0 && "$1" != -* ]]; then
segments_file="$1"; shift
fi
while [[ $# -gt 0 ]]; do
case "$1" in
--model) model="$2"; shift 2 ;;
--strict) strict=true; shift ;;
-v|--verbose) verbose=true; shift ;;
--validate-voices) shift ;; # Not implemented in bash version
*) [[ -z "$segments_file" ]] && segments_file="$1"; shift ;;
esac
done
if [[ -z "$segments_file" || ! -f "$segments_file" ]]; then
echo "Error: Segments file not found: ${segments_file:-<none>}" >&2
exit 1
fi
echo "Validating: $segments_file"
echo "Model: $model"
local valid_emotions="happy sad angry fearful disgusted surprised calm fluent whisper"
echo "Valid emotions: $valid_emotions"
echo ""
# Parse JSON
local segments count
segments="$(jq -r 'if type == "array" then . elif type == "object" and has("segments") then .segments else empty end' "$segments_file" 2>/dev/null)" || {
echo "Error: Invalid JSON in $segments_file" >&2
exit 1
}
if [[ -z "$segments" || "$segments" == "null" ]]; then
echo "Error: No segments found in file" >&2
exit 1
fi
count="$(echo "$segments" | jq 'length')"
local errors=0
for ((i=0; i<count; i++)); do
local text voice_id emotion
text="$(echo "$segments" | jq -r ".[$i].text // \"\"")"
voice_id="$(echo "$segments" | jq -r ".[$i].voice_id // \"\"")"
emotion="$(echo "$segments" | jq -r ".[$i].emotion // \"\"")"
if [[ -z "$text" ]]; then
echo " - Segment $i: 'text' is required and must not be empty"
errors=$((errors + 1))
fi
if [[ -z "$voice_id" ]]; then
echo " - Segment $i: 'voice_id' is required"
errors=$((errors + 1))
fi
if [[ -n "$emotion" ]]; then
if ! echo "$valid_emotions" | grep -qw "$emotion"; then
echo " - Segment $i: invalid emotion '$emotion'"
errors=$((errors + 1))
fi
fi
done
if [[ $errors -eq 0 ]]; then
echo "Validation passed: $count segments"
if $verbose; then
echo ""
echo "=== Segment Summary ==="
for ((i=0; i<count; i++)); do
local text voice_id emotion
text="$(echo "$segments" | jq -r ".[$i].text // \"\"")"
voice_id="$(echo "$segments" | jq -r ".[$i].voice_id // \"\"")"
emotion="$(echo "$segments" | jq -r ".[$i].emotion // \"\"")"
local elabel="${emotion:-AUTO}"
printf " %d: [%-10s] voice=%-20s \"%s\"\n" "$i" "${elabel^^}" "${voice_id:0:20}" "${text:0:40}"
done
fi
return 0
else
echo "Validation failed ($errors errors)"
return 1
fi
}
# ============================================================================
# Subcommand: generate (multi-segment pipeline)
# ============================================================================
cmd_generate() {
local segments_file="" output="" model="speech-2.8-hd" crossfade=200
local no_normalize=false temp_dir="" continue_on_error=false
if [[ $# -gt 0 && "$1" != -* ]]; then
segments_file="$1"; shift
fi
while [[ $# -gt 0 ]]; do
case "$1" in
-o|--output) output="$2"; shift 2 ;;
--model) model="$2"; shift 2 ;;
--crossfade) crossfade="$2"; shift 2 ;;
--no-normalize) no_normalize=true; shift ;;
--temp-dir) temp_dir="$2"; shift 2 ;;
--continue-on-error) continue_on_error=true; shift ;;
*) [[ -z "$segments_file" ]] && segments_file="$1"; shift ;;
esac
done
if [[ -z "$segments_file" || ! -f "$segments_file" ]]; then
echo "Error: Segments file not found: ${segments_file:-<none>}" >&2
exit 1
fi
if [[ -z "$output" ]]; then
echo "Error: -o/--output is required" >&2
exit 1
fi
# Validate first
echo "Validating segments file..."
local segments count
segments="$(jq -r 'if type == "array" then . elif type == "object" and has("segments") then .segments else empty end' "$segments_file")"
count="$(echo "$segments" | jq 'length')"
if [[ "$count" -eq 0 ]]; then
echo "Error: No segments found" >&2
exit 1
fi
echo "Found $count valid segments"
echo ""
# Setup temp dir
if [[ -z "$temp_dir" ]]; then
temp_dir="$(dirname "$output")/tmp"
fi
mkdir -p "$temp_dir"
echo "Temp directory: $temp_dir"
# Generate each segment
local succeeded=0 failed=0
local segment_files=()
for ((i=0; i<count; i++)); do
local text voice_id emotion speed vol pitch
text="$(echo "$segments" | jq -r ".[$i].text")"
voice_id="$(echo "$segments" | jq -r ".[$i].voice_id")"
emotion="$(echo "$segments" | jq -r ".[$i].emotion // \"\"")"
speed="$(echo "$segments" | jq -r ".[$i].speed // 1.0")"
vol="$(echo "$segments" | jq -r ".[$i].volume // 1.0")"
pitch="$(echo "$segments" | jq -r ".[$i].pitch // 0")"
printf " Generating segment %d/%d: %s...\n" "$((i+1))" "$count" "${text:0:40}"
local seg_output="$temp_dir/segment_$(printf '%04d' "$i").mp3"
# Build voice_setting
local voice_setting
voice_setting=$(jq -n \
--arg vid "$voice_id" \
--argjson spd "$speed" \
--argjson vol "$vol" \
--argjson pit "$pitch" \
'{voice_id: $vid, speed: $spd, vol: $vol, pitch: $pit}')
if [[ -n "$emotion" ]]; then
voice_setting=$(echo "$voice_setting" | jq --arg e "$emotion" '. + {emotion: $e}')
fi
local payload
payload=$(jq -n \
--arg model "$model" \
--arg text "$text" \
--argjson vs "$voice_setting" \
'{
model: $model,
text: $text,
voice_setting: $vs,
audio_setting: {sample_rate: 32000, bitrate: 128000, format: "mp3", channel: 1},
stream: false,
output_format: "hex"
}')
local response audio_hex
if response="$(api_request POST t2a_v2 "$payload" 2>&1)"; then
audio_hex="$(echo "$response" | jq -r '.data.audio // .extra_info.audio // empty')"
if [[ -n "$audio_hex" ]]; then
hex_to_file "$audio_hex" "$seg_output"
segment_files+=("$seg_output")
succeeded=$((succeeded + 1))
echo " ✓ Saved: $seg_output"
else
failed=$((failed + 1))
echo " ✗ Error: No audio data in response"
if ! $continue_on_error; then break; fi
fi
else
failed=$((failed + 1))
echo " ✗ Error: $response"
if ! $continue_on_error; then break; fi
fi
done
if [[ ${#segment_files[@]} -eq 0 ]]; then
echo "Error: No segments were generated successfully" >&2
exit 1
fi
# Merge segments
ensure_dir "$(dirname "$output")"
if [[ ${#segment_files[@]} -eq 1 ]]; then
cp "${segment_files[0]}" "$output"
else
_merge_audio_files "$output" "$crossfade" "$no_normalize" "${segment_files[@]}"
fi
echo ""
echo "Audio saved to: $output"
echo " Generated: $succeeded/$count segments"
echo ""
echo " Intermediate files in: $temp_dir"
echo " Delete with: rm -rf $temp_dir"
}
# ============================================================================
# Subcommand: merge
# ============================================================================
cmd_merge() {
local output="" format="mp3" crossfade=300 normalize=true
local input_files=()
while [[ $# -gt 0 ]]; do
case "$1" in
-o|--output) output="$2"; shift 2 ;;
--format) format="$2"; shift 2 ;; # accepted but currently unused by the merge path
--crossfade) crossfade="$2"; shift 2 ;;
--no-normalize) normalize=false; shift ;;
*) input_files+=("$1"); shift ;;
esac
done
if [[ ${#input_files[@]} -lt 2 ]]; then
echo "Error: At least 2 input files required" >&2
exit 1
fi
if [[ -z "$output" ]]; then
echo "Error: -o/--output is required" >&2
exit 1
fi
for f in "${input_files[@]}"; do
if [[ ! -f "$f" ]]; then
echo "Error: File not found: $f" >&2
exit 1
fi
done
echo "Merging ${#input_files[@]} files..."
local no_norm="false"
$normalize || no_norm="true"
_merge_audio_files "$output" "$crossfade" "$no_norm" "${input_files[@]}"
echo "Merged audio saved to: $output"
}
_merge_audio_files() {
# _merge_audio_files OUTPUT CROSSFADE_MS NO_NORMALIZE FILE1 FILE2 ...
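# Example (hypothetical inputs a.mp3 b.mp3 c.mp3, crossfade > 0, normalization on):
# the crossfade branch below joins these filter_parts with ';'
#   [0:a]aresample=32000,aformat=sample_fmts=fltp:channel_layouts=mono[s0]
#   [1:a]aresample=32000,aformat=sample_fmts=fltp:channel_layouts=mono[s1]
#   [2:a]aresample=32000,aformat=sample_fmts=fltp:channel_layouts=mono[s2]
#   [s0][s1]acrossfade=d=<secs>[m1]
#   [m1][s2]acrossfade=d=<secs>[merged]
#   [merged]aformat=sample_fmts=fltp,loudnorm=I=-16:TP=-1.5:LRA=11[final]
# Each acrossfade overlaps its two inputs, so the result is roughly
# (sum of input durations) - (n - 1) * crossfade long.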
local output="$1" crossfade_ms="$2" no_normalize="$3"
shift 3
local files=("$@")
local n=${#files[@]}
ensure_dir "$(dirname "$output")"
if [[ "$crossfade_ms" -gt 0 && $n -ge 2 ]]; then
# Use acrossfade filter for crossfade between segments
local crossfade_sec
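# awk keeps the leading zero (0.200); bc would print .200, which FFmpeg
# duration options may not accept.
crossfade_sec=$(awk -v ms="$crossfade_ms" 'BEGIN { printf "%.3f", ms / 1000 }')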
local inputs=()
local filter_parts=()
for ((i=0; i<n; i++)); do
inputs+=(-i "${files[$i]}")
filter_parts+=("[${i}:a]aresample=32000,aformat=sample_fmts=fltp:channel_layouts=mono[s${i}]")
done
# Build acrossfade chain
if [[ $n -eq 2 ]]; then
filter_parts+=("[s0][s1]acrossfade=d=${crossfade_sec}[merged]")
else
filter_parts+=("[s0][s1]acrossfade=d=${crossfade_sec}[m1]")
for ((i=2; i<n; i++)); do
local prev="[m$((i-1))]"
if [[ $i -eq $((n-1)) ]]; then
filter_parts+=("${prev}[s${i}]acrossfade=d=${crossfade_sec}[merged]")
else
filter_parts+=("${prev}[s${i}]acrossfade=d=${crossfade_sec}[m${i}]")
fi
done
fi
local final_filter="[merged]aformat=sample_fmts=fltp"
if [[ "$no_normalize" != "true" ]]; then
final_filter+=",loudnorm=I=-16:TP=-1.5:LRA=11"
fi
final_filter+="[final]"
filter_parts+=("$final_filter")
local filter_complex
filter_complex="$(IFS=';'; echo "${filter_parts[*]}")"
if ffmpeg -y "${inputs[@]}" \
-filter_complex "$filter_complex" \
-map "[final]" \
-ar 32000 -ac 1 -acodec libmp3lame \
"$output" 2>/dev/null; then
return 0
fi
echo " Crossfade merge failed, falling back to concat demuxer..." >&2
fi
# Fallback: concat demuxer (no crossfade)
local concat_file
concat_file="$(mktemp /tmp/concat_XXXXXX.txt)"
for f in "${files[@]}"; do
echo "file '$(cd "$(dirname "$f")" && pwd)/$(basename "$f")'" >> "$concat_file"
done
if [[ "$no_normalize" != "true" ]]; then
local tmp_concat
tmp_concat="$(mktemp /tmp/concat_out_XXXXXX.mp3)"
ffmpeg -y -f concat -safe 0 -i "$concat_file" -c copy "$tmp_concat" 2>/dev/null
ffmpeg -y -i "$tmp_concat" -af "loudnorm=I=-16:TP=-1.5:LRA=11" -acodec libmp3lame "$output" 2>/dev/null
rm -f "$tmp_concat"
else
ffmpeg -y -f concat -safe 0 -i "$concat_file" -c copy "$output" 2>/dev/null
fi
rm -f "$concat_file"
}
# ============================================================================
# Subcommand: convert
# ============================================================================
cmd_convert() {
local input_file="" output="" format="mp3" sample_rate="" bitrate="" channels=""
if [[ $# -gt 0 && "$1" != -* ]]; then
input_file="$1"; shift
fi
while [[ $# -gt 0 ]]; do
case "$1" in
-o|--output) output="$2"; shift 2 ;;
--format) format="$2"; shift 2 ;;
--sample-rate) sample_rate="$2"; shift 2 ;;
--bitrate) bitrate="$2"; shift 2 ;;
--channels) channels="$2"; shift 2 ;;
*) [[ -z "$input_file" ]] && input_file="$1"; shift ;;
esac
done
if [[ -z "$input_file" || ! -f "$input_file" ]]; then
echo "Error: Input file not found: ${input_file:-<none>}" >&2
exit 1
fi
if [[ -z "$output" ]]; then
echo "Error: -o/--output is required" >&2
exit 1
fi
ensure_dir "$(dirname "$output")"
# Determine codec
local codec="copy"
case "$format" in
mp3) codec="libmp3lame" ;;
wav) codec="pcm_s16le" ;;
flac) codec="flac" ;;
ogg) codec="libvorbis" ;;
aac) codec="aac" ;;
m4a) codec="aac" ;;
*) codec="copy" ;;
esac
local args=(-y -i "$input_file" -acodec "$codec")
[[ -n "$sample_rate" ]] && args+=(-ar "$sample_rate")
[[ -n "$channels" ]] && args+=(-ac "$channels")
[[ -n "$bitrate" ]] && args+=(-b:a "$bitrate")
args+=("$output")
echo "Converting $input_file to $format..."
ffmpeg "${args[@]}" 2>/dev/null
echo "Converted audio saved to: $output"
}
# ============================================================================
# Subcommand: check-env
# ============================================================================
cmd_check_env() {
local check_script="$SCRIPT_DIR/../check_environment.sh"
if [[ -f "$check_script" ]]; then
bash "$check_script" "$@"
else
echo "check_environment.sh not found" >&2
exit 1
fi
}
# ============================================================================
# Main dispatcher
# ============================================================================
usage() {
cat <<'EOF'
MiniMax Voice CLI — Unified TTS interface
Usage:
generate_voice.sh <command> [options]
Commands:
tts Basic text-to-speech
clone Clone voice from audio sample
design Design voice from description
list-voices List available voices
validate Validate segments.json file
generate Generate audio from segments.json
merge Merge multiple audio files
convert Convert audio format
check-env Check environment setup
Examples:
generate_voice.sh tts "Hello world" -o hello.mp3
generate_voice.sh tts "你好" -v female-shaonv -o hello_cn.mp3
generate_voice.sh clone my_voice.mp3 --voice-id my-custom-voice
generate_voice.sh design "A warm female voice" --voice-id narrator-1
generate_voice.sh list-voices
generate_voice.sh validate segments.json --verbose
generate_voice.sh generate segments.json -o output.mp3
generate_voice.sh merge part1.mp3 part2.mp3 -o combined.mp3
generate_voice.sh convert input.wav -o output.mp3
generate_voice.sh check-env --test-api
EOF
}
main() {
load_env
if [[ $# -eq 0 ]]; then
usage
exit 0
fi
local command="$1"; shift
case "$command" in
tts)
check_api_key
cmd_tts "$@"
;;
clone)
check_api_key
cmd_clone "$@"
;;
design)
check_api_key
cmd_design "$@"
;;
list-voices)
check_api_key
cmd_list_voices "$@"
;;
validate)
cmd_validate "$@"
;;
generate)
check_api_key
cmd_generate "$@"
;;
merge)
cmd_merge "$@"
;;
convert)
cmd_convert "$@"
;;
check-env)
cmd_check_env "$@"
;;
-h|--help|help)
usage
;;
*)
echo "Unknown command: $command" >&2
usage >&2
exit 1
;;
esac
}
main "$@"

@@ -0,0 +1,221 @@
#!/usr/bin/env bash
# Add background music to a video file (pure bash)
#
# Usage:
# bash scripts/video/add_bgm.sh --video input.mp4 --audio bgm.mp3 -o output.mp4
# bash scripts/video/add_bgm.sh --video input.mp4 --generate-bgm --music-prompt "upbeat pop" -o output.mp4
# bash scripts/video/add_bgm.sh --video input.mp4 --audio bgm.mp3 --replace-audio -o output.mp4
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
MUSIC_API_URL="${MINIMAX_API_HOST:-https://api.minimaxi.com}/v1/music_generation"
# ============================================================================
# Common functions
# ============================================================================
load_env() {
local env_file
for env_file in "$PROJECT_ROOT/.env" "$(pwd)/.env"; do
if [[ -f "$env_file" ]]; then
while IFS= read -r line || [[ -n "$line" ]]; do
line="${line%%#*}"; line="$(echo "$line" | xargs)"
[[ -z "$line" || "$line" != *=* ]] && continue
local key="${line%%=*}" val="${line#*=}"
key="$(echo "$key" | xargs)"; val="$(echo "$val" | xargs)"
if [[ ${#val} -ge 2 ]]; then
case "$val" in \"*\") val="${val:1:${#val}-2}" ;; \'*\') val="${val:1:${#val}-2}" ;; esac
fi
if [[ -z "${!key:-}" ]]; then export "$key=$val"; fi
done < "$env_file"
fi
done
}
get_video_duration() {
ffprobe -v error -show_entries format=duration -of json "$1" 2>/dev/null | jq -r '.format.duration'
}
video_has_audio() {
local out
out="$(ffprobe -v error -select_streams a -show_entries stream=codec_type -of csv=p=0 "$1" 2>/dev/null)"
[[ "$out" == *audio* ]]
}
generate_music() {
local prompt="$1" output_path="$2" instrumental="${3:-false}"
local payload
local effective_prompt="${prompt:-background music, cinematic, ambient}"
if [[ "$instrumental" == "true" ]]; then
payload=$(jq -n \
--arg p "$effective_prompt. pure music, no lyrics" \
'{model: "music-2.5", prompt: $p, lyrics: "[intro] [outro]", output_format: "url"}')
else
payload=$(jq -n \
--arg p "$effective_prompt" \
'{model: "music-2.5", prompt: $p, lyrics: "[Intro]\nla da da\nla la la", output_format: "url"}')
fi
if [[ "$instrumental" == "true" ]]; then
echo "Generating instrumental music..."
else
echo "Generating music..."
fi
echo " Prompt: $effective_prompt"
local raw http_code response
raw="$(curl -s -w "\n%{http_code}" -X POST "$MUSIC_API_URL" \
-H "Authorization: Bearer ${MINIMAX_API_KEY}" \
-H "Content-Type: application/json" \
--max-time 300 -d "$payload")"
http_code="${raw##*$'\n'}"; response="${raw%$'\n'*}"
[[ "$http_code" -ge 400 ]] 2>/dev/null && { echo "Error: Music API HTTP $http_code" >&2; return 1; }
local sc
sc="$(echo "$response" | jq -r '.base_resp.status_code // 0')" 2>/dev/null || true
[[ "$sc" != "0" && -n "$sc" ]] && { echo "Error: Music API error: $(echo "$response" | jq '.base_resp')" >&2; return 1; }
local audio_url
audio_url="$(echo "$response" | jq -r '.data.audio_url // .data.audio // .data.audio_file.download_url // empty')"
[[ -z "$audio_url" ]] && { echo "Error: No audio URL in music response" >&2; return 1; }
mkdir -p "$(dirname "$output_path")"
# Download with retry
local attempt
for attempt in 1 2 3; do
# -f treats HTTP 4xx/5xx as failures so they are retried instead of saved as audio
if curl -sfL -o "$output_path" --max-time 120 "$audio_url" 2>/dev/null; then
local size; size="$(wc -c < "$output_path" | tr -d ' ')"
echo " Downloaded: $output_path ($size bytes)"
return 0
fi
if [[ $attempt -lt 3 ]]; then
local wait=$((2 ** attempt))
echo " Download attempt $attempt failed. Retrying in ${wait}s..."
sleep "$wait"
fi
done
echo "Error: Download failed after 3 attempts" >&2
return 1
}
# ============================================================================
# Main
# ============================================================================
main() {
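load_env
# Re-derive the endpoint now that .env is loaded (MINIMAX_API_HOST may come from .env)
MUSIC_API_URL="${MINIMAX_API_HOST:-https://api.minimaxi.com}/v1/music_generation"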
local video="" audio="" output=""
local generate_bgm=false instrumental=false music_prompt=""
local bgm_volume=0.3 fade_in=0 fade_out=0 replace_audio=false
while [[ $# -gt 0 ]]; do
case "$1" in
--video) video="$2"; shift 2 ;;
--audio) audio="$2"; shift 2 ;;
--generate-bgm) generate_bgm=true; shift ;;
--instrumental) instrumental=true; shift ;;
--music-prompt) music_prompt="$2"; shift 2 ;;
--bgm-volume) bgm_volume="$2"; shift 2 ;;
--fade-in) fade_in="$2"; shift 2 ;;
--fade-out) fade_out="$2"; shift 2 ;;
--replace-audio) replace_audio=true; shift ;;
-o|--output) output="$2"; shift 2 ;;
-h|--help)
cat <<'USAGE'
Add Background Music to Video
Usage:
add_bgm.sh --video INPUT --audio BGM -o OUTPUT
add_bgm.sh --video INPUT --generate-bgm --music-prompt "style" -o OUTPUT
Options:
--video FILE Input video file (required)
--audio FILE Background music audio file
--generate-bgm Generate BGM via MiniMax API
--instrumental Make generated BGM instrumental
--music-prompt TEXT Prompt for BGM generation
--bgm-volume FLOAT BGM volume level (default: 0.3)
--fade-in SECS BGM fade-in duration
--fade-out SECS BGM fade-out duration
--replace-audio Replace original audio instead of mixing
-o, --output FILE Output video file (required)
Examples:
add_bgm.sh --video input.mp4 --audio bgm.mp3 -o output.mp4
add_bgm.sh --video input.mp4 --generate-bgm --music-prompt "upbeat pop" -o output.mp4
add_bgm.sh --video input.mp4 --audio bgm.mp3 --replace-audio -o output.mp4
USAGE
exit 0
;;
*) echo "Unknown option: $1" >&2; exit 1 ;;
esac
done
if [[ -z "$video" || ! -f "$video" ]]; then
echo "Error: Video file not found: ${video:-<none>}" >&2; exit 1
fi
if [[ -z "$audio" && "$generate_bgm" != "true" ]]; then
echo "Error: Provide --audio or --generate-bgm" >&2; exit 1
fi
if [[ -z "$output" ]]; then
echo "Error: --output / -o is required" >&2; exit 1
fi
local audio_path="$audio"
if $generate_bgm; then
if [[ -z "${MINIMAX_API_KEY:-}" ]]; then
echo "Error: MINIMAX_API_KEY not set." >&2; exit 1
fi
audio_path="${output%.*}_bgm.mp3"
generate_music "$music_prompt" "$audio_path" "$instrumental" || exit 1
fi
if [[ ! -f "$audio_path" ]]; then
echo "Error: Audio file not found: $audio_path" >&2; exit 1
fi
local duration
duration="$(get_video_duration "$video")"
echo "Video duration: $(printf '%.1f' "$duration")s"
mkdir -p "$(dirname "$output")"
local has_audio=false
video_has_audio "$video" && has_audio=true
local bgm_filter="[1:a]volume=${bgm_volume}"
[[ "$(echo "$fade_in > 0" | bc -l)" == "1" ]] && bgm_filter+=",afade=t=in:d=${fade_in}"
if [[ "$(echo "$fade_out > 0" | bc -l)" == "1" ]]; then
local fo_start
fo_start="$(echo "$duration - $fade_out" | bc -l)"
[[ "$(echo "$fo_start < 0" | bc -l)" == "1" ]] && fo_start=0
bgm_filter+=",afade=t=out:st=${fo_start}:d=${fade_out}"
fi
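# Example (illustrative values: --bgm-volume 0.3 --fade-in 2, no --fade-out) of the
# graph used when mixing with the original soundtrack:
#   [1:a]volume=0.3,afade=t=in:d=2[bgm];[0:a][bgm]amix=inputs=2:duration=first:dropout_transition=2[aout]
# duration=first keeps the output as long as the original audio track.
# With --fade-out F, ",afade=t=out:st=<duration - F>:d=F" is appended after the fade-in.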
if $has_audio && ! $replace_audio; then
bgm_filter+="[bgm];[0:a][bgm]amix=inputs=2:duration=first:dropout_transition=2[aout]"
echo "Merging video + audio (mixing with original, bgm_volume=${bgm_volume})..."
ffmpeg -y \
-i "$video" -i "$audio_path" \
-filter_complex "$bgm_filter" \
-map 0:v -map "[aout]" \
-c:v copy -c:a aac -shortest "$output" 2>/dev/null
else
# apad pads the music with silence so -shortest stops at the end of the video, not the music
bgm_filter+=",apad[bgm]"
if $replace_audio; then
echo "Merging video + audio (replacing original audio, bgm_volume=${bgm_volume})..."
else
echo "Merging video + audio (video has no original audio, bgm_volume=${bgm_volume})..."
fi
ffmpeg -y \
-i "$video" -i "$audio_path" \
-filter_complex "$bgm_filter" \
-map 0:v -map "[bgm]" \
-c:v copy -c:a aac -shortest "$output" 2>/dev/null
fi
echo "Output saved: $output"
echo "Done!"
}
main "$@"

@@ -0,0 +1,479 @@
#!/usr/bin/env bash
# MiniMax Long Video Generation CLI (pure bash)
#
# Generates multi-segment videos by chaining scenes together.
# Each segment's last frame becomes the next segment's first frame.
# Optionally adds AI-generated background music.
#
# Usage:
# bash scripts/video/generate_long_video.sh \
# --scenes "A sunrise" "Birds flying" "A calm lake" \
# --output output/long_video.mp4
#
# bash scripts/video/generate_long_video.sh \
# --scenes "A robot waking up" "The robot walks outside" \
# --music-prompt "cinematic orchestral" \
# --output output/robot_story.mp4
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
API_BASE="${MINIMAX_API_HOST:-https://api.minimaxi.com}/v1"
MUSIC_API_URL="${API_BASE}/music_generation"
POLL_INTERVAL=10
MAX_WAIT_TIME=600
REQUEST_TIMEOUT=60
MAX_CONSECUTIVE_FAILURES=5
# ============================================================================
# Common functions
# ============================================================================
load_env() {
local env_file
for env_file in "$PROJECT_ROOT/.env" "$(pwd)/.env"; do
if [[ -f "$env_file" ]]; then
while IFS= read -r line || [[ -n "$line" ]]; do
line="${line%%#*}"; line="$(echo "$line" | xargs)"
[[ -z "$line" || "$line" != *=* ]] && continue
local key="${line%%=*}" val="${line#*=}"
key="$(echo "$key" | xargs)"; val="$(echo "$val" | xargs)"
if [[ ${#val} -ge 2 ]]; then
case "$val" in \"*\") val="${val:1:${#val}-2}" ;; \'*\') val="${val:1:${#val}-2}" ;; esac
fi
if [[ -z "${!key:-}" ]]; then export "$key=$val"; fi
done < "$env_file"
fi
done
}
check_api_key() {
if [[ -z "${MINIMAX_API_KEY:-}" ]]; then
echo "Error: MINIMAX_API_KEY not set." >&2; exit 1
fi
}
image_to_data_url() {
local path="$1"
[[ -f "$path" ]] || { echo "Error: Image not found: $path" >&2; exit 1; }
local mime; mime="$(file -b --mime-type "$path" 2>/dev/null)" || mime="image/jpeg"
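# GNU base64 wraps output at 76 columns; strip newlines so the data URL stays on one line
local b64; b64="$(base64 < "$path" | tr -d '\n')"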
echo "data:${mime};base64,${b64}"
}
resolve_image() {
local input="$1"
[[ -z "$input" ]] && return
case "$input" in
http://*|https://*|data:*) echo "$input" ;;
*) image_to_data_url "$input" ;;
esac
}
# ============================================================================
# Video API helpers (duplicated from generate_video.sh for standalone use)
# ============================================================================
_create_task() {
local payload="$1"
local raw http_code response
raw="$(curl -s -w "\n%{http_code}" -X POST "${API_BASE}/video_generation" \
-H "Authorization: Bearer ${MINIMAX_API_KEY}" \
-H "Content-Type: application/json" \
--max-time "$REQUEST_TIMEOUT" -d "$payload")"
http_code="${raw##*$'\n'}"; response="${raw%$'\n'*}"
[[ "$http_code" -ge 400 ]] 2>/dev/null && { echo "Error: HTTP $http_code" >&2; echo "$response" >&2; exit 1; }
local sc; sc="$(echo "$response" | jq -r '.base_resp.status_code // 0')" 2>/dev/null || true
[[ "$sc" != "0" && -n "$sc" ]] && { echo "Error: $(echo "$response" | jq '.base_resp')" >&2; exit 1; }
echo "$response" | jq -r '.task_id // empty'
}
_poll_task() {
local task_id="$1" start_time cf=0
start_time="$(date +%s)"
while true; do
local now=$(($(date +%s) - start_time))
[[ $now -gt $MAX_WAIT_TIME ]] && { echo "Error: Timeout" >&2; exit 1; }
local raw http_code response
if raw="$(curl -s -w "\n%{http_code}" -G "${API_BASE}/query/video_generation" \
-d "task_id=$task_id" -H "Authorization: Bearer ${MINIMAX_API_KEY}" \
--max-time "$REQUEST_TIMEOUT" 2>/dev/null)"; then
http_code="${raw##*$'\n'}"; response="${raw%$'\n'*}"; cf=0
else
cf=$((cf+1)); [[ $cf -ge $MAX_CONSECUTIVE_FAILURES ]] && { echo "Error: Too many failures" >&2; exit 1; }
sleep "$POLL_INTERVAL"; continue
fi
local status; status="$(echo "$response" | jq -r '.status // "Unknown"')"
echo " [${now}s] Status: $status" >&2
[[ "$status" == "Success" ]] && { echo "$response" | jq -r '.file_id // empty'; return 0; }
[[ "$status" == "Fail" || "$status" == "Failed" || "$status" == "Error" ]] && { echo "Error: Task failed" >&2; exit 1; }
sleep "$POLL_INTERVAL"
done
}
_download_video() {
local file_id="$1" output_path="$2"
local raw; raw="$(curl -s -G "${API_BASE}/files/retrieve" -d "file_id=$file_id" \
-H "Authorization: Bearer ${MINIMAX_API_KEY}" --max-time "$REQUEST_TIMEOUT")"
local dl_url; dl_url="$(echo "$raw" | jq -r '.file.download_url // empty')"
[[ -z "$dl_url" ]] && { echo "Error: No download_url" >&2; exit 1; }
mkdir -p "$(dirname "$output_path")"
curl -s -o "$output_path" --max-time $((REQUEST_TIMEOUT * 3)) "$dl_url"
echo " Video saved: $output_path ($(wc -c < "$output_path" | tr -d ' ') bytes)" >&2
}
# ============================================================================
# FFmpeg helpers
# ============================================================================
get_video_duration() {
ffprobe -v error -show_entries format=duration -of json "$1" 2>/dev/null | jq -r '.format.duration'
}
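# get_video_fps rounds ffprobe's rational r_frame_rate to the nearest integer,
# e.g. "30000/1001" -> (30000 + 500) / 1001 = 30 and "24000/1001" -> 24,
# and falls back to 25 when ffprobe fails.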
get_video_fps() {
local fps_str num den
fps_str="$(ffprobe -v error -select_streams v:0 -show_entries stream=r_frame_rate -of csv=p=0 "$1" 2>/dev/null)" || fps_str=""
num="${fps_str%/*}"; den="${fps_str#*/}"
# Guard against empty output or "0/0" before doing integer arithmetic
if [[ "$num" =~ ^[0-9]+$ && "$den" =~ ^[0-9]+$ && "$num" -gt 0 && "$den" -gt 0 ]]; then
echo $(( (num + den / 2) / den ))
else
echo 25
fi
}
video_has_audio() {
local out
out="$(ffprobe -v error -select_streams a -show_entries stream=codec_type -of csv=p=0 "$1" 2>/dev/null)"
[[ "$out" == *audio* ]]
}
extract_last_frame() {
local video_path="$1" output_image="$2"
rm -f "$output_image"
# Decode only the final second and overwrite a single image per frame (-update 1),
# so the file left on disk is the last decoded frame at any frame rate.
if ! ffmpeg -y -sseof -1 -i "$video_path" -update 1 -q:v 2 "$output_image" 2>/dev/null; then
echo "Warning: Could not extract last frame" >&2
return 1
fi
[[ -s "$output_image" ]] || return 1
echo " Extracted last frame: $output_image" >&2
}
concatenate_videos() {
local output_path="$1" crossfade="$2"
shift 2
local video_paths=("$@")
local n=${#video_paths[@]}
if [[ $n -eq 1 ]]; then
cp "${video_paths[0]}" "$output_path"
return 0
fi
local fps
fps="$(get_video_fps "${video_paths[0]}")"
local has_audio=true
for vp in "${video_paths[@]}"; do
video_has_audio "$vp" || { has_audio=false; break; }
done
if [[ "$(echo "$crossfade > 0" | bc -l)" == "1" ]]; then
# Get durations
local durations=()
for vp in "${video_paths[@]}"; do
durations+=("$(get_video_duration "$vp")")
done
# Build inputs
local inputs=()
for vp in "${video_paths[@]}"; do
inputs+=(-i "$(cd "$(dirname "$vp")" && pwd)/$(basename "$vp")")
done
# Calculate offsets
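# xfade's offset is measured on the already-merged stream, so transition k starts at
#   offset_k = d_0 + ... + d_k - (k + 1) * crossfade
# Worked example (hypothetical): three 10 s clips with --crossfade 0.5 give offsets
# 9.5 and 19.0, and a 29.0 s result (30 s minus two 0.5 s overlaps).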
local offsets=() cumulative=0
for ((i=0; i<n-1; i++)); do
local offset
offset="$(echo "$cumulative + ${durations[$i]} - $crossfade" | bc -l)"
offsets+=("$offset")
cumulative="$offset"
done
# Build filter
local vf_parts=() af_parts=()
if [[ $n -eq 2 ]]; then
vf_parts+=("[0:v][1:v]xfade=transition=fade:duration=${crossfade}:offset=${offsets[0]}[vout]")
$has_audio && af_parts+=("[0:a][1:a]acrossfade=d=${crossfade}:c1=tri:c2=tri[aout]")
else
vf_parts+=("[0:v][1:v]xfade=transition=fade:duration=${crossfade}:offset=${offsets[0]}[xv1]")
$has_audio && af_parts+=("[0:a][1:a]acrossfade=d=${crossfade}:c1=tri:c2=tri[xa1]")
for ((i=2; i<n; i++)); do
local out_v="[xv${i}]" out_a="[xa${i}]"
[[ $i -eq $((n-1)) ]] && { out_v="[vout]"; out_a="[aout]"; }
vf_parts+=("[xv$((i-1))][${i}:v]xfade=transition=fade:duration=${crossfade}:offset=${offsets[$((i-1))]}${out_v}")
$has_audio && af_parts+=("[xa$((i-1))][${i}:a]acrossfade=d=${crossfade}:c1=tri:c2=tri${out_a}")
done
fi
local filter_complex
filter_complex="$(IFS=';'; echo "${vf_parts[*]}${af_parts[*]:+;${af_parts[*]}}")"
local cmd=(ffmpeg -y "${inputs[@]}" -filter_complex "$filter_complex" -map "[vout]")
$has_audio && cmd+=(-map "[aout]")
cmd+=(-c:v libx264 -preset medium -crf 18 -pix_fmt yuv420p -r "$fps")
$has_audio && cmd+=(-c:a aac -b:a 192k)
cmd+=("$output_path")
if "${cmd[@]}" 2>/dev/null; then
echo "Concatenated $n segments -> $output_path" >&2
return 0
fi
echo " Crossfade failed, falling back to re-encode concat..." >&2
fi
# Fallback: concat demuxer with re-encode
local concat_file
concat_file="$(mktemp /tmp/concat_XXXXXX.txt)"
for vp in "${video_paths[@]}"; do
echo "file '$(cd "$(dirname "$vp")" && pwd)/$(basename "$vp")'" >> "$concat_file"
done
ffmpeg -y -f concat -safe 0 -i "$concat_file" \
-c:v libx264 -preset medium -crf 18 -pix_fmt yuv420p -r "$fps" \
-c:a aac -b:a 192k "$output_path" 2>/dev/null
rm -f "$concat_file"
echo "Concatenated $n segments -> $output_path" >&2
}
merge_video_audio() {
local video_path="$1" audio_path="$2" output_path="$3"
local bgm_volume="${4:-0.3}" fade_in="${5:-0}" fade_out="${6:-0}"
local duration
duration="$(get_video_duration "$video_path")"
local af="[1:a]volume=${bgm_volume}"
[[ "$(echo "$fade_in > 0" | bc -l)" == "1" ]] && af+=",afade=t=in:d=${fade_in}"
if [[ "$(echo "$fade_out > 0" | bc -l)" == "1" ]]; then
local fo_start
fo_start="$(echo "$duration - $fade_out" | bc -l)"
[[ "$(echo "$fo_start < 0" | bc -l)" == "1" ]] && fo_start=0
af+=",afade=t=out:st=${fo_start}:d=${fade_out}"
fi
# apad pads the music with silence so -shortest stops at the end of the video, not the music
af+=",apad[bgm]"
mkdir -p "$(dirname "$output_path")"
ffmpeg -y -i "$video_path" -i "$audio_path" \
-filter_complex "$af" \
-map 0:v -map "[bgm]" \
-c:v copy -c:a aac -shortest "$output_path" 2>/dev/null
echo "Merged video+audio -> $output_path" >&2
}
generate_music_instrumental() {
local prompt="$1" output_path="$2"
local payload
payload=$(jq -n \
--arg p "${prompt:-cinematic background music, orchestral, ambient}. pure music, no lyrics" \
'{model: "music-2.5", prompt: $p, lyrics: "[intro] [outro]", output_format: "url"}')
echo "Generating instrumental music: $prompt" >&2
local raw http_code response
raw="$(curl -s -w "\n%{http_code}" -X POST "$MUSIC_API_URL" \
-H "Authorization: Bearer ${MINIMAX_API_KEY}" \
-H "Content-Type: application/json" \
--max-time 300 -d "$payload")"
http_code="${raw##*$'\n'}"; response="${raw%$'\n'*}"
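[[ "$http_code" -ge 400 ]] 2>/dev/null && { echo "Error: Music API HTTP $http_code" >&2; return 1; }
local sc
sc="$(echo "$response" | jq -r '.base_resp.status_code // 0')" 2>/dev/null || true
[[ "$sc" != "0" && -n "$sc" ]] && { echo "Error: Music API error: $(echo "$response" | jq '.base_resp')" >&2; return 1; }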
local audio_url
audio_url="$(echo "$response" | jq -r '.data.audio_url // .data.audio // .data.audio_file.download_url // empty')"
[[ -z "$audio_url" ]] && { echo "Error: No audio URL in music response" >&2; return 1; }
mkdir -p "$(dirname "$output_path")"
curl -sfL -o "$output_path" --max-time 120 "$audio_url" || { echo "Error: Music download failed" >&2; return 1; }
echo " Music saved: $output_path" >&2
}
# ============================================================================
# Main
# ============================================================================
main() {
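load_env
# Re-derive endpoints now that .env is loaded (MINIMAX_API_HOST may come from .env)
API_BASE="${MINIMAX_API_HOST:-https://api.minimaxi.com}/v1"
MUSIC_API_URL="${API_BASE}/music_generation"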
check_api_key
local scenes=() model="" segment_duration=10 resolution="768P"
local first_frame="" subject_reference="" crossfade=0.5
local music_prompt="" bgm_volume=0.3 fade_in=0 fade_out=0
local output=""
while [[ $# -gt 0 ]]; do
case "$1" in
--scenes)
shift
# Stop at the next option, including short ones such as -o
while [[ $# -gt 0 && "$1" != -* ]]; do
scenes+=("$1"); shift
done
;;
--model) model="$2"; shift 2 ;;
--segment-duration) segment_duration="$2"; shift 2 ;;
--resolution) resolution="$2"; shift 2 ;;
--first-frame) first_frame="$2"; shift 2 ;;
--subject-reference) subject_reference="$2"; shift 2 ;;
--crossfade) crossfade="$2"; shift 2 ;;
--music-prompt) music_prompt="$2"; shift 2 ;;
--bgm-volume) bgm_volume="$2"; shift 2 ;;
--fade-in) fade_in="$2"; shift 2 ;;
--fade-out) fade_out="$2"; shift 2 ;;
-o|--output) output="$2"; shift 2 ;;
-h|--help)
cat <<'USAGE'
MiniMax Long Video Generation CLI
Usage:
generate_long_video.sh --scenes "scene1" "scene2" ... -o OUTPUT
Options:
--scenes TEXT... Scene prompts, one per segment (at least one required)
--model MODEL Model name (default: chosen per segment; MiniMax-Hailuo-2.3 for text/image-to-video, S2V-01 for subject reference)
--segment-duration SECS Duration per segment (default: 10)
--resolution RES Resolution: 768P, 1080P (default: 768P)
--first-frame FILE First frame for scene 1 (local file or URL)
--subject-reference FILE Subject reference image
--crossfade SECS Crossfade duration between scenes (default: 0.5)
--music-prompt TEXT Generate BGM with this prompt
--bgm-volume FLOAT BGM volume level (default: 0.3)
--fade-in SECS BGM fade-in duration (default: 0)
--fade-out SECS BGM fade-out duration (default: 0)
-o, --output FILE Output video file (required)
Examples:
generate_long_video.sh --scenes "A sunrise" "Birds flying" "Sunset" -o long.mp4
generate_long_video.sh --scenes "Scene 1" "Scene 2" --crossfade 1 --music-prompt "cinematic" -o movie.mp4
USAGE
exit 0
;;
*) echo "Unknown option: $1" >&2; exit 1 ;;
esac
done
if [[ ${#scenes[@]} -eq 0 ]]; then
echo "Error: --scenes is required" >&2; exit 1
fi
if [[ -z "$output" ]]; then
echo "Error: --output / -o is required" >&2; exit 1
fi
local output_dir
output_dir="$(dirname "$output")"
mkdir -p "$output_dir"
local tmpdir="$output_dir/tmp"
mkdir -p "$tmpdir"
echo "Temp directory: $tmpdir"
local segment_paths=()
local current_first_frame="$first_frame"
echo "=== Generating ${#scenes[@]} video segments ==="
echo ""
for i in "${!scenes[@]}"; do
local scene="${scenes[$i]}"
echo "--- Segment $((i+1))/${#scenes[@]} ---"
echo " Prompt: $scene"
local seg_output="$tmpdir/segment_$(printf '%03d' "$i").mp4"
# Determine mode
local seg_mode="t2v"
[[ -n "$current_first_frame" ]] && seg_mode="i2v"
[[ -n "$subject_reference" && -z "$current_first_frame" ]] && seg_mode="ref"
# Determine model
local seg_model="$model"
if [[ -z "$seg_model" ]]; then
case "$seg_mode" in
t2v|i2v) seg_model="MiniMax-Hailuo-2.3" ;;
ref) seg_model="S2V-01" ;;
esac
fi
# Build payload
local payload
payload=$(jq -n \
--arg m "$seg_model" \
--arg p "$scene" \
--argjson d "$segment_duration" \
--arg r "$resolution" \
'{model: $m, prompt: $p, duration: $d, resolution: $r}')
if [[ "$seg_mode" == "i2v" ]]; then
local ff_url; ff_url="$(resolve_image "$current_first_frame")"
payload=$(echo "$payload" | jq --arg ff "$ff_url" '. + {first_frame_image: $ff, prompt_optimizer: false}')
elif [[ "$seg_mode" == "ref" ]]; then
local si_url; si_url="$(resolve_image "$subject_reference")"
payload=$(echo "$payload" | jq --arg si "$si_url" '. + {subject_reference: [{type: "character", image: [$si]}]}')
fi
# Generate segment
local task_id file_id
if task_id="$(_create_task "$payload")" && [[ -n "$task_id" ]]; then
echo " Task created: $task_id"
if file_id="$(_poll_task "$task_id")" && [[ -n "$file_id" ]]; then
_download_video "$file_id" "$seg_output"
segment_paths+=("$seg_output")
# Extract last frame for next segment
local last_frame_path="$tmpdir/last_frame_$(printf '%03d' "$i").jpg"
if extract_last_frame "$seg_output" "$last_frame_path"; then
current_first_frame="$last_frame_path"
else
current_first_frame=""
fi
else
echo " Error: Polling failed for segment $((i+1))" >&2
[[ ${#segment_paths[@]} -eq 0 ]] && exit 1
break
fi
else
echo " Error generating segment $((i+1))" >&2
[[ ${#segment_paths[@]} -eq 0 ]] && exit 1
break
fi
done
if [[ ${#segment_paths[@]} -eq 0 ]]; then
echo "Error: No segments were generated." >&2; exit 1
fi
# Concatenate
local final_video="$output"
[[ -n "$music_prompt" ]] && final_video="$tmpdir/concatenated.mp4"
if [[ ${#segment_paths[@]} -eq 1 ]]; then
cp "${segment_paths[0]}" "$final_video"
else
concatenate_videos "$final_video" "$crossfade" "${segment_paths[@]}"
fi
# Add BGM if requested
if [[ -n "$music_prompt" ]]; then
echo ""
echo "--- Generating background music ---"
local music_path="$tmpdir/bgm.mp3"
if generate_music_instrumental "$music_prompt" "$music_path"; then
merge_video_audio "$final_video" "$music_path" "$output" "$bgm_volume" "$fade_in" "$fade_out" || {
echo "Warning: Failed to add BGM, using video without music" >&2
[[ "$final_video" != "$output" ]] && cp "$final_video" "$output"
}
else
echo "Warning: Failed to generate BGM" >&2
[[ "$final_video" != "$output" ]] && cp "$final_video" "$output"
fi
fi
echo ""
echo "=== Done! Output: $output ==="
echo " Intermediate files in: $tmpdir"
echo " Delete with: rm -rf \"$tmpdir\""
}
main "$@"

@@ -0,0 +1,216 @@
#!/usr/bin/env bash
# MiniMax Template Video Generation CLI (pure bash)
#
# Usage:
# bash scripts/video/generate_template_video.sh \
# --template-id T00001 \
# --media image1.jpg image2.jpg \
# --text "Title" "Subtitle" \
# -o output/template_video.mp4
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
API_BASE="${MINIMAX_API_HOST:-https://api.minimaxi.com}/v1"
TEMPLATE_URL="${API_BASE}/video_template_generation"
QUERY_URL="${API_BASE}/query/video_template_generation"
POLL_INTERVAL=10
MAX_WAIT_TIME=600
REQUEST_TIMEOUT=60
MAX_CONSECUTIVE_FAILURES=5
# ============================================================================
# Common functions
# ============================================================================
load_env() {
local env_file
for env_file in "$PROJECT_ROOT/.env" "$(pwd)/.env"; do
if [[ -f "$env_file" ]]; then
while IFS= read -r line || [[ -n "$line" ]]; do
line="${line%%#*}"; line="$(echo "$line" | xargs)"
[[ -z "$line" || "$line" != *=* ]] && continue
local key="${line%%=*}" val="${line#*=}"
key="$(echo "$key" | xargs)"; val="$(echo "$val" | xargs)"
if [[ ${#val} -ge 2 ]]; then
case "$val" in \"*\") val="${val:1:${#val}-2}" ;; \'*\') val="${val:1:${#val}-2}" ;; esac
fi
[[ -z "${!key:-}" ]] && export "$key=$val"
done < "$env_file"
fi
done
}
check_api_key() {
if [[ -z "${MINIMAX_API_KEY:-}" ]]; then
echo "Error: MINIMAX_API_KEY not set." >&2; exit 1
fi
}
resolve_media_input() {
local value="$1"
case "$value" in
http://*|https://*|data:*) echo "$value"; return ;;
esac
[[ -f "$value" ]] || { echo "Error: Media file not found: $value" >&2; exit 1; }
local mime; mime="$(file -b --mime-type "$value" 2>/dev/null)" || mime="application/octet-stream"
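# Strip newlines: GNU base64 wraps output at 76 columns, which would corrupt the data URL
local b64; b64="$(base64 < "$value" | tr -d '\n')"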
echo "data:${mime};base64,${b64}"
}
# ============================================================================
# Main
# ============================================================================
main() {
load_env
check_api_key
local template_id="" output=""
local media_inputs=() text_inputs=()
while [[ $# -gt 0 ]]; do
case "$1" in
--template-id) template_id="$2"; shift 2 ;;
--media)
shift
while [[ $# -gt 0 && "$1" != --* ]]; do
media_inputs+=("$1"); shift
done
;;
--text)
shift
while [[ $# -gt 0 && "$1" != --* ]]; do
text_inputs+=("$1"); shift
done
;;
-o|--output) output="$2"; shift 2 ;;
-h|--help)
cat <<'USAGE'
MiniMax Template Video Generation CLI
Usage:
generate_template_video.sh --template-id ID [--media FILE...] [--text TEXT...] -o OUTPUT
Options:
--template-id ID Template ID (required)
--media FILE... Media inputs (local files or URLs)
--text TEXT... Text inputs for template slots
-o, --output FILE Output video file (required)
Examples:
generate_template_video.sh --template-id T00001 --media image1.jpg image2.jpg --text "Title" "Subtitle" -o video.mp4
USAGE
exit 0
;;
*) echo "Unknown option: $1" >&2; exit 1 ;;
esac
done
if [[ -z "$template_id" ]]; then
echo "Error: --template-id is required" >&2; exit 1
fi
if [[ -z "$output" ]]; then
echo "Error: --output / -o is required" >&2; exit 1
fi
# Build payload
local payload
payload=$(jq -n --arg tid "$template_id" '{template_id: $tid}')
# Add media inputs
if [[ ${#media_inputs[@]} -gt 0 ]]; then
local media_json="[]"
for i in "${!media_inputs[@]}"; do
local resolved
resolved="$(resolve_media_input "${media_inputs[$i]}")"
media_json=$(echo "$media_json" | jq --arg url "$resolved" '. + [{value: $url}]')
echo " Media [$i]: ${media_inputs[$i]}"
done
payload=$(echo "$payload" | jq --argjson mi "$media_json" '. + {media_inputs: $mi}')
fi
# Add text inputs
if [[ ${#text_inputs[@]} -gt 0 ]]; then
local text_json="[]"
for i in "${!text_inputs[@]}"; do
text_json=$(echo "$text_json" | jq --arg t "${text_inputs[$i]}" '. + [{value: $t}]')
echo " Text [$i]: ${text_inputs[$i]}"
done
payload=$(echo "$payload" | jq --argjson ti "$text_json" '. + {text_inputs: $ti}')
fi
# Create task
echo "Creating template video task (template: $template_id)..."
local raw http_code response
raw="$(curl -s -w "\n%{http_code}" -X POST "$TEMPLATE_URL" \
-H "Authorization: Bearer ${MINIMAX_API_KEY}" \
-H "Content-Type: application/json" \
--max-time "$REQUEST_TIMEOUT" -d "$payload")"
http_code="${raw##*$'\n'}"; response="${raw%$'\n'*}"
[[ "$http_code" -ge 400 ]] 2>/dev/null && { echo "Error: HTTP $http_code" >&2; echo "$response" >&2; exit 1; }
local sc
sc="$(echo "$response" | jq -r '.base_resp.status_code // 0')" 2>/dev/null || true
[[ "$sc" != "0" && -n "$sc" ]] && { echo "Error: $(echo "$response" | jq '.base_resp')" >&2; exit 1; }
local task_id
task_id="$(echo "$response" | jq -r '.task_id // empty')"
[[ -z "$task_id" ]] && { echo "Error: No task_id in response" >&2; exit 1; }
echo "Task created: $task_id"
# Poll task
echo "Polling task $task_id..."
local start_time cf=0
start_time="$(date +%s)"
local video_url=""
while true; do
local elapsed=$(( $(date +%s) - start_time ))
[[ $elapsed -gt $MAX_WAIT_TIME ]] && { echo "Error: Timeout" >&2; exit 1; }
local poll_raw="" poll_code="" poll_resp=""
if poll_raw="$(curl -s -w "\n%{http_code}" -G "$QUERY_URL" \
-d "task_id=$task_id" \
-H "Authorization: Bearer ${MINIMAX_API_KEY}" \
--max-time "$REQUEST_TIMEOUT" 2>/dev/null)"; then
poll_code="${poll_raw##*$'\n'}"; poll_resp="${poll_raw%$'\n'*}"
fi
# Treat network errors and HTTP 4xx/5xx responses alike as failed polls
if [[ ! "$poll_code" =~ ^[0-9]+$ || "$poll_code" -ge 400 ]]; then
cf=$((cf+1))
echo " Poll error ($cf/$MAX_CONSECUTIVE_FAILURES)${poll_code:+ (HTTP $poll_code)}"
[[ $cf -ge $MAX_CONSECUTIVE_FAILURES ]] && { echo "Error: Too many failures" >&2; exit 1; }
sleep "$POLL_INTERVAL"; continue
fi
cf=0
local status
status="$(echo "$poll_resp" | jq -r '.status // "Unknown"')"
echo " [${elapsed}s] Status: $status"
if [[ "$status" == "Success" ]]; then
video_url="$(echo "$poll_resp" | jq -r '.video_url // empty')"
[[ -z "$video_url" ]] && { echo "Error: No video_url in response" >&2; exit 1; }
break
fi
[[ "$status" == "Fail" || "$status" == "Failed" || "$status" == "Error" ]] && {
echo "Error: Task failed: $(echo "$poll_resp" | jq -r '.base_resp.status_msg // "Unknown"')" >&2
exit 1
}
sleep "$POLL_INTERVAL"
done
# Download video directly from video_url
echo "Downloading video..."
mkdir -p "$(dirname "$output")"
if ! curl -sSf -o "$output" --max-time $((REQUEST_TIMEOUT * 3)) "$video_url"; then
echo "Error: Failed to download video from $video_url" >&2; exit 1
fi
local size; size="$(wc -c < "$output" | tr -d ' ')"
echo "Video saved to: $output ($size bytes)"
echo "Done!"
}
main "$@"

@@ -0,0 +1,329 @@
#!/usr/bin/env bash
# MiniMax Video Generation CLI (pure bash)
#
# Usage:
# bash scripts/video/generate_video.sh --mode t2v --prompt "A cat playing piano" -o output/cat.mp4
# bash scripts/video/generate_video.sh --mode i2v --prompt "Gentle breeze" --first-frame image.jpg -o output/anim.mp4
# bash scripts/video/generate_video.sh --mode sef --first-frame start.jpg --last-frame end.jpg -o output/sef.mp4
# bash scripts/video/generate_video.sh --mode ref --prompt "Person dancing" --subject-image person.jpg -o output/ref.mp4
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
API_BASE="${MINIMAX_API_HOST:-https://api.minimaxi.com}/v1"
POLL_INTERVAL=10
MAX_WAIT_TIME=600
REQUEST_TIMEOUT=60
MAX_CONSECUTIVE_FAILURES=5
# ============================================================================
# Common functions
# ============================================================================
load_env() {
local env_file
for env_file in "$PROJECT_ROOT/.env" "$(pwd)/.env"; do
if [[ -f "$env_file" ]]; then
while IFS= read -r line || [[ -n "$line" ]]; do
line="${line%%#*}"; line="$(echo "$line" | xargs)"
[[ -z "$line" || "$line" != *=* ]] && continue
local key="${line%%=*}" val="${line#*=}"
key="$(echo "$key" | xargs)"; val="$(echo "$val" | xargs)"
if [[ ${#val} -ge 2 ]]; then
case "$val" in \"*\") val="${val:1:${#val}-2}" ;; \'*\') val="${val:1:${#val}-2}" ;; esac
fi
[[ -z "${!key:-}" ]] && export "$key=$val"
done < "$env_file"
fi
done
}
check_api_key() {
if [[ -z "${MINIMAX_API_KEY:-}" ]]; then
echo "Error: MINIMAX_API_KEY environment variable is not set." >&2; exit 1
fi
}
image_to_data_url() {
local path="$1"
[[ -f "$path" ]] || { echo "Error: Image not found: $path" >&2; exit 1; }
local mime
mime="$(file -b --mime-type "$path" 2>/dev/null)" || mime="image/jpeg"
local b64
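# Strip newlines: GNU base64 wraps output at 76 columns, which would corrupt the data URL
b64="$(base64 < "$path" | tr -d '\n')"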
echo "data:${mime};base64,${b64}"
}
resolve_image() {
local input="$1"
[[ -z "$input" ]] && return
case "$input" in
http://*|https://*|data:*) echo "$input" ;;
*) image_to_data_url "$input" ;;
esac
}
# ============================================================================
# Video generation functions
# ============================================================================
create_task() {
local payload="$1"
echo "Creating video generation task..." >&2
local raw_output http_code response
raw_output="$(curl -s -w "\n%{http_code}" \
-X POST "${API_BASE}/video_generation" \
-H "Authorization: Bearer ${MINIMAX_API_KEY}" \
-H "Content-Type: application/json" \
--max-time "$REQUEST_TIMEOUT" \
-d "$payload")"
http_code="${raw_output##*$'\n'}"
response="${raw_output%$'\n'*}"
if [[ "$http_code" -ge 400 ]] 2>/dev/null; then
echo "Error: API returned HTTP $http_code" >&2; echo "$response" >&2; exit 1
fi
local sc
sc="$(echo "$response" | jq -r '.base_resp.status_code // 0')" 2>/dev/null || true
if [[ "$sc" != "0" && -n "$sc" ]]; then
echo "Error: API error: $(echo "$response" | jq '.base_resp')" >&2; exit 1
fi
local task_id
task_id="$(echo "$response" | jq -r '.task_id // empty')"
if [[ -z "$task_id" ]]; then
echo "Error: No task_id in response" >&2; echo "$response" >&2; exit 1
fi
echo "Task created: $task_id" >&2
echo "$task_id"
}
poll_task() {
local task_id="$1"
echo "Polling task $task_id..." >&2
local start_time consecutive_failures=0
start_time="$(date +%s)"
while true; do
local now elapsed
now="$(date +%s)"
elapsed=$((now - start_time))
if [[ $elapsed -gt $MAX_WAIT_TIME ]]; then
echo "Error: Task $task_id timed out after ${MAX_WAIT_TIME}s" >&2; exit 1
fi
local raw_output="" http_code="" response=""
if raw_output="$(curl -s -w "\n%{http_code}" \
-G "${API_BASE}/query/video_generation" \
-d "task_id=$task_id" \
-H "Authorization: Bearer ${MINIMAX_API_KEY}" \
--max-time "$REQUEST_TIMEOUT" 2>/dev/null)"; then
http_code="${raw_output##*$'\n'}"
response="${raw_output%$'\n'*}"
fi
# Treat network errors and HTTP 4xx/5xx responses alike as failed polls
if [[ ! "$http_code" =~ ^[0-9]+$ || "$http_code" -ge 400 ]]; then
consecutive_failures=$((consecutive_failures + 1))
echo " Poll error ($consecutive_failures/$MAX_CONSECUTIVE_FAILURES)${http_code:+ (HTTP $http_code)}" >&2
if [[ $consecutive_failures -ge $MAX_CONSECUTIVE_FAILURES ]]; then
echo "Error: Too many consecutive poll failures" >&2; exit 1
fi
sleep "$POLL_INTERVAL"; continue
fi
consecutive_failures=0
local status
status="$(echo "$response" | jq -r '.status // "Unknown"')"
echo " [${elapsed}s] Status: $status" >&2
if [[ "$status" == "Success" ]]; then
local file_id
file_id="$(echo "$response" | jq -r '.file_id // empty')"
if [[ -z "$file_id" ]]; then
echo "Error: Task succeeded but no file_id" >&2; exit 1
fi
echo "$file_id"
return 0
fi
if [[ "$status" == "Fail" || "$status" == "Failed" || "$status" == "Error" ]]; then
local err_msg
err_msg="$(echo "$response" | jq -r '.base_resp.status_msg // "Unknown error"')"
echo "Error: Task failed: $err_msg" >&2; exit 1
fi
sleep "$POLL_INTERVAL"
done
}
download_video() {
local file_id="$1" output_path="$2"
echo "Retrieving file $file_id..." >&2
local raw_output http_code response
raw_output="$(curl -s -w "\n%{http_code}" \
-G "${API_BASE}/files/retrieve" \
-d "file_id=$file_id" \
-H "Authorization: Bearer ${MINIMAX_API_KEY}" \
--max-time "$REQUEST_TIMEOUT")"
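http_code="${raw_output##*$'\n'}"
response="${raw_output%$'\n'*}"
if [[ "$http_code" -ge 400 ]] 2>/dev/null; then
echo "Error: File retrieval returned HTTP $http_code" >&2; echo "$response" >&2; exit 1
fi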
local dl_url
dl_url="$(echo "$response" | jq -r '.file.download_url // empty')"
if [[ -z "$dl_url" ]]; then
echo "Error: No download_url in file response" >&2; exit 1
fi
echo "Downloading video..." >&2
mkdir -p "$(dirname "$output_path")"
if ! curl -sSf -o "$output_path" --max-time $((REQUEST_TIMEOUT * 3)) "$dl_url"; then
echo "Error: Failed to download video from $dl_url" >&2; exit 1
fi
local size
size="$(wc -c < "$output_path" | tr -d ' ')"
echo "Video saved to: $output_path ($size bytes)" >&2
}
# ============================================================================
# Main
# ============================================================================
main() {
load_env
check_api_key
local mode="" prompt="" model="" duration=10 resolution="768P"
local first_frame="" last_frame="" subject_image=""
local prompt_optimizer="" fast_pretreatment="" callback_url="" aigc_watermark=""
local output=""
while [[ $# -gt 0 ]]; do
case "$1" in
--mode) mode="$2"; shift 2 ;;
--prompt) prompt="$2"; shift 2 ;;
--model) model="$2"; shift 2 ;;
--duration) duration="$2"; shift 2 ;;
--resolution) resolution="$2"; shift 2 ;;
--first-frame) first_frame="$2"; shift 2 ;;
--last-frame) last_frame="$2"; shift 2 ;;
--subject-image) subject_image="$2"; shift 2 ;;
--prompt-optimizer) prompt_optimizer="$2"; shift 2 ;;
--fast-pretreatment) fast_pretreatment="$2"; shift 2 ;;
--callback-url) callback_url="$2"; shift 2 ;;
--aigc-watermark) aigc_watermark="$2"; shift 2 ;;
-o|--output) output="$2"; shift 2 ;;
-h|--help)
cat <<'USAGE'
MiniMax Video Generation CLI
Usage:
generate_video.sh --mode MODE [options] -o OUTPUT
Modes:
t2v Text-to-video
i2v Image-to-video (requires --first-frame)
sef Start-end frame (requires --first-frame and --last-frame)
ref Subject reference (requires --subject-image)
Options:
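--mode MODE Generation mode: t2v, i2v, sef, ref (required)
--prompt TEXT Text prompt describing the video
--model MODEL Model name (default: MiniMax-Hailuo-2.3 for t2v/i2v, MiniMax-Hailuo-02 for sef, S2V-01 for ref)
--duration SECS Video duration in seconds (default: 10)
--resolution RES Output resolution (default: 768P)
--first-frame FILE First frame image (local file or URL)
--last-frame FILE Last frame image (local file or URL)
--subject-image FILE Subject reference image (local file or URL)
--prompt-optimizer BOOL Set the prompt_optimizer request flag (true/false)
--fast-pretreatment BOOL Set the fast_pretreatment request flag (true/false; i2v only)
--callback-url URL Callback URL passed through in the request
--aigc-watermark BOOL Set the aigc_watermark request flag (true/false)
-o, --output FILE Output video file (required)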
Examples:
generate_video.sh --mode t2v --prompt "A cat playing piano" -o cat.mp4
generate_video.sh --mode i2v --prompt "Gentle breeze" --first-frame photo.jpg -o anim.mp4
generate_video.sh --mode sef --first-frame start.jpg --last-frame end.jpg -o sef.mp4
generate_video.sh --mode ref --prompt "Person dancing" --subject-image person.jpg -o ref.mp4
USAGE
exit 0
;;
*) echo "Unknown option: $1" >&2; exit 1 ;;
esac
done
if [[ -z "$mode" ]]; then
echo "Error: --mode is required (t2v, i2v, sef, ref)" >&2; exit 1
fi
if [[ -z "$output" ]]; then
echo "Error: --output / -o is required" >&2; exit 1
fi
# Default model per mode
if [[ -z "$model" ]]; then
case "$mode" in
t2v) model="MiniMax-Hailuo-2.3" ;;
i2v) model="MiniMax-Hailuo-2.3" ;;
sef) model="MiniMax-Hailuo-02" ;;
ref) model="S2V-01" ;;
esac
fi
# Build payload
local payload
payload=$(jq -n --arg m "$model" '{model: $m}')
[[ -n "$prompt" ]] && payload=$(echo "$payload" | jq --arg p "$prompt" '. + {prompt: $p}')
payload=$(echo "$payload" | jq --argjson d "$duration" '. + {duration: $d}')
payload=$(echo "$payload" | jq --arg r "$resolution" '. + {resolution: $r}')
[[ -n "$prompt_optimizer" ]] && payload=$(echo "$payload" | jq --argjson po "$(echo "$prompt_optimizer" | tr '[:upper:]' '[:lower:]' | jq -R 'test("true")')" '. + {prompt_optimizer: $po}')
[[ -n "$callback_url" ]] && payload=$(echo "$payload" | jq --arg cu "$callback_url" '. + {callback_url: $cu}')
[[ -n "$aigc_watermark" ]] && payload=$(echo "$payload" | jq --argjson aw "$(echo "$aigc_watermark" | tr '[:upper:]' '[:lower:]' | jq -R 'test("true")')" '. + {aigc_watermark: $aw}')
case "$mode" in
t2v) ;;
i2v)
if [[ -z "$first_frame" ]]; then
echo "Error: --first-frame is required for i2v mode" >&2; exit 1
fi
local ff_url
ff_url="$(resolve_image "$first_frame")"
payload=$(echo "$payload" | jq --arg ff "$ff_url" '. + {first_frame_image: $ff}')
[[ -n "$fast_pretreatment" ]] && payload=$(echo "$payload" | jq --argjson fp "$(echo "$fast_pretreatment" | tr '[:upper:]' '[:lower:]' | jq -R 'test("true")')" '. + {fast_pretreatment: $fp}')
;;
sef)
# Start-end frame mode needs both endpoints, matching the usage text
if [[ -z "$first_frame" || -z "$last_frame" ]]; then
echo "Error: --first-frame and --last-frame are required for sef mode" >&2; exit 1
fi
local ff_url lf_url
ff_url="$(resolve_image "$first_frame")"
lf_url="$(resolve_image "$last_frame")"
payload=$(echo "$payload" | jq --arg ff "$ff_url" --arg lf "$lf_url" '. + {first_frame_image: $ff, last_frame_image: $lf}')
;;
ref)
if [[ -z "$subject_image" ]]; then
echo "Error: --subject-image is required for ref mode" >&2; exit 1
fi
local si_url
si_url="$(resolve_image "$subject_image")"
payload=$(echo "$payload" | jq --arg si "$si_url" '. + {subject_reference: [{type: "character", image: [$si]}]}')
if [[ -n "$first_frame" ]]; then
local ff_url
ff_url="$(resolve_image "$first_frame")"
payload=$(echo "$payload" | jq --arg ff "$ff_url" '. + {first_frame_image: $ff}')
fi
;;
*)
echo "Error: Unknown mode: $mode" >&2; exit 1 ;;
esac
echo "Mode: $mode"
echo "Model: $model"
local task_id file_id
task_id="$(create_task "$payload")"
file_id="$(poll_task "$task_id")"
download_video "$file_id" "$output"
echo "Done!"
}
main "$@"