Runflow

alibaba/happy-horse/reference-to-video

Generate video at up to 1080p with synchronized native audio from a text prompt and 1-9 reference images. Aspect ratios: 16:9, 9:16, 1:1, 4:3, 3:4. Duration: 3–15s.

$0.14/second 2026-04-28

Pricing

| Criteria | Price | What $1 buys |
| --- | --- | --- |
| Per second of generated video (720p baseline) | $0.14 per second of video | 7s of video |
| 1080p | $0.28 per second of video | 3s of video |
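
The pricing arithmetic can be sketched as a small helper. This is only an illustration: the per-second rates and the 3-15s duration range come from this page, while the function names `estimate_cost` and `whole_seconds_for` are hypothetical and not part of any Runflow API.

```python
from decimal import Decimal

# Per-second rates as listed on this page.
RATE_PER_SECOND = {"720p": Decimal("0.14"), "1080p": Decimal("0.28")}


def estimate_cost(duration_s: int, resolution: str = "720p") -> Decimal:
    """Return the estimated price in USD for one clip (hypothetical helper)."""
    if resolution not in RATE_PER_SECOND:
        raise ValueError(f"unsupported resolution: {resolution!r}")
    if not 3 <= duration_s <= 15:
        raise ValueError("duration must be between 3 and 15 seconds")
    return RATE_PER_SECOND[resolution] * duration_s


def whole_seconds_for(budget_usd: str, resolution: str = "720p") -> int:
    """Return how many whole seconds of video a budget covers at the listed rate."""
    return int(Decimal(budget_usd) // RATE_PER_SECOND[resolution])


if __name__ == "__main__":
    print(estimate_cost(10, "720p"))        # 10s at 720p
    print(whole_seconds_for("1", "720p"))   # whole seconds per dollar at 720p
    print(whole_seconds_for("1", "1080p"))  # whole seconds per dollar at 1080p
```

`Decimal` is used instead of `float` so that currency amounts are not affected by binary rounding. The number of references is deliberately not an input, since the page states that references do not change the price.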

Overview

Happy Horse Reference-to-Video composes 1080p multi-character scenes from 1-9 reference images plus a text prompt. Reference images are addressable by index (`character1`, `character2`, …) so you can stage interactions, brand-consistent product placements, or multi-subject scenes with stable identity across the whole clip.

Key capabilities

  • Up to 9 references: each callable as `character1`-`character9` in your prompt — order matches your `image_urls` array
  • Identity preservation: subjects keep their reference appearance across motion and scene changes
  • Native audio + lip-sync: same audio engine as the rest of the family — multilingual and synchronized to motion
  • Five aspect ratios: 16:9, 9:16, 1:1, 4:3, 3:4
  • 3-15 second clips at 720p or 1080p
  • Strict reference quality: each image must be ≥400px on the short side, ≤10MB, JPEG/JPG/PNG/WEBP — 720p+ recommended for clean identity
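
Because `characterN` in the prompt maps to the N-th entry of `image_urls`, it can help to check that mapping before sending a request. The sketch below assumes a request body with `prompt` and `image_urls` fields; `image_urls` is named on this page, but the exact payload shape, the helper names, and the example URLs are assumptions rather than the documented API.

```python
import re


def referenced_characters(prompt: str) -> list[int]:
    """Return the sorted character indices mentioned in the prompt."""
    return sorted({int(n) for n in re.findall(r"\bcharacter(\d+)\b", prompt)})


def build_payload(prompt: str, image_urls: list[str]) -> dict:
    """Build a request body (hypothetical shape) after checking the index mapping."""
    if not 1 <= len(image_urls) <= 9:
        raise ValueError("between 1 and 9 reference images are required")
    # character1 is image_urls[0], character2 is image_urls[1], and so on.
    missing = [n for n in referenced_characters(prompt)
               if not 1 <= n <= len(image_urls)]
    if missing:
        raise ValueError(f"prompt references characters without an image: {missing}")
    return {"prompt": prompt, "image_urls": list(image_urls)}


if __name__ == "__main__":
    payload = build_payload(
        "character1 hands a coffee mug to character2 in a sunlit kitchen",
        ["https://example.com/barista.png", "https://example.com/customer.png"],
    )
    print(payload["image_urls"][0])  # the image used for character1
```

Reordering `image_urls` changes which subject each `characterN` refers to, so the array order should be treated as part of the prompt.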

Family

Part of the Happy Horse family. Pair it with the other variants when you need a different starting modality:

| Variant | Input | Use it for |
| --- | --- | --- |
| Text-to-Video | text prompt | one-shot clips from a brief |
| Image-to-Video | image + optional prompt | animating a still or hero shot |
| Video Edit | source video + edit prompt | transforming an existing clip (style, scene swap) |
| Reference-to-Video | text + 1-9 references | multi-character scenes, brand-consistent subjects |

Tech specs

  • Resolutions: 720p, 1080p
  • Duration: 3-15s
  • References: 1-9 images, ≥400px short side, ≤10MB each, callable as `character1`-`character9`
  • Audio: native, lip-synced, multilingual
  • Aspect ratios: 16:9, 9:16, 1:1, 4:3, 3:4
  • Latency: 90-240s typical for a 5s clip with multiple references
  • Pricing: $0.14/s at 720p, $0.28/s at 1080p — references don't change the price
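
The limits above can be checked client-side before submitting a job. This is a minimal sketch: the numeric limits and allowed values are taken from this page, but `ReferenceImage`, `check_request`, and the choice to read "10MB" as 10,000,000 bytes (the stricter interpretation) are assumptions. Image dimensions and file sizes are passed in as plain numbers, so no image library is needed.

```python
from dataclasses import dataclass

ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4"}
RESOLUTIONS = {"720p", "1080p"}
FORMATS = {"JPEG", "JPG", "PNG", "WEBP"}
MIN_SHORT_SIDE = 400    # pixels
MAX_BYTES = 10_000_000  # assumption: "10MB" read as decimal megabytes


@dataclass
class ReferenceImage:
    width: int
    height: int
    size_bytes: int
    fmt: str


def check_request(refs: list[ReferenceImage], duration_s: int,
                  resolution: str, aspect_ratio: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks pass."""
    problems = []
    if not 1 <= len(refs) <= 9:
        problems.append(f"expected 1-9 references, got {len(refs)}")
    if not 3 <= duration_s <= 15:
        problems.append(f"duration {duration_s}s is outside 3-15s")
    if resolution not in RESOLUTIONS:
        problems.append(f"unsupported resolution {resolution!r}")
    if aspect_ratio not in ASPECT_RATIOS:
        problems.append(f"unsupported aspect ratio {aspect_ratio!r}")
    for i, ref in enumerate(refs, start=1):
        if min(ref.width, ref.height) < MIN_SHORT_SIDE:
            problems.append(f"character{i}: short side is under {MIN_SHORT_SIDE}px")
        if ref.size_bytes > MAX_BYTES:
            problems.append(f"character{i}: file is larger than 10MB")
        if ref.fmt.upper() not in FORMATS:
            problems.append(f"character{i}: unsupported format {ref.fmt!r}")
    return problems


if __name__ == "__main__":
    ref = ReferenceImage(width=1280, height=720, size_bytes=500_000, fmt="png")
    print(check_request([ref], duration_s=5, resolution="1080p", aspect_ratio="16:9"))
```

Returning every problem at once, rather than raising on the first one, lets a caller fix all rejected references in a single pass.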
