Generate 1080p video with synchronized native audio from a text prompt and references. Aspect ratios: 16:9, 9:16, 1:1, 4:3, 3:4. Duration: 3–15s.
Pricing
| Criteria | Price | Video per $1 |
|---|---|---|
| 720p (baseline) | $0.14 per second of video | 7s |
| 1080p | $0.28 per second of video | 3s |
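
As a rough illustration of the pricing arithmetic, the sketch below multiplies the per-second rates quoted on this page ($0.14 at 720p, $0.28 at 1080p) by the clip duration. The function and constant names are hypothetical, and it assumes billing is strictly linear in whole seconds with no rounding or minimum charge, which the page does not confirm.

```python
from decimal import Decimal

# Per-second rates quoted on this page; the names here are illustrative only.
RATE_PER_SECOND = {"720p": Decimal("0.14"), "1080p": Decimal("0.28")}


def estimate_cost(duration_s: int, resolution: str = "720p") -> Decimal:
    """Estimate the USD price of one clip, assuming linear per-second billing."""
    if resolution not in RATE_PER_SECOND:
        raise ValueError(f"unsupported resolution: {resolution}")
    if not 3 <= duration_s <= 15:
        raise ValueError("duration must be between 3 and 15 seconds")
    return RATE_PER_SECOND[resolution] * duration_s


def whole_seconds_per_dollar(resolution: str) -> int:
    """Whole seconds of video that fit into a $1 budget."""
    return int(Decimal("1") // RATE_PER_SECOND[resolution])
```

Under these assumptions, `estimate_cost(5, "1080p")` evaluates to `Decimal("1.40")`, and `whole_seconds_per_dollar` reproduces the 7s and 3s figures shown for 720p and 1080p.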
Overview
Happy Horse Reference-to-Video composes 1080p multi-character scenes from 1-9 reference images plus a text prompt. Reference images are addressable by index (`character1`, `character2`, …) so you can stage interactions, brand-consistent product placements, or multi-subject scenes with stable identity across the whole clip.
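
To make the index addressing concrete, here is a minimal sketch of how reference order could be mapped to prompt tokens before a request is sent. Only `image_urls` and the `characterN` naming come from this page; the helper names and the example URLs are hypothetical, and nothing here calls the actual service.

```python
import re


def label_references(image_urls: list[str]) -> list[tuple[str, str]]:
    """Pair each reference URL with its prompt token, following array order."""
    if not 1 <= len(image_urls) <= 9:
        raise ValueError("expected between 1 and 9 reference images")
    return [(f"character{i}", url) for i, url in enumerate(image_urls, start=1)]


def undefined_tokens(prompt: str, image_urls: list[str]) -> list[str]:
    """Return characterN tokens used in the prompt that have no matching reference."""
    used = {int(n) for n in re.findall(r"\bcharacter(\d+)\b", prompt)}
    return [f"character{n}" for n in sorted(used) if not 1 <= n <= len(image_urls)]


# Placeholder URLs for illustration only.
urls = ["https://example.com/barista.png", "https://example.com/customer.png"]
prompt = "character1 hands a coffee cup to character2 across the counter"
```

With these inputs, `label_references(urls)` pairs `character1` with the first URL and `character2` with the second, and `undefined_tokens(prompt, urls)` returns an empty list because every token in the prompt has a reference behind it.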
Key capabilities
- Up to 9 references: each callable as `character1`-`character9` in your prompt; the order matches your `image_urls` array
- Identity preservation: subjects keep their reference appearance across motion and scene changes
- Native audio + lip-sync: same audio engine as the rest of the family, multilingual and synchronized to motion
- Five aspect ratios: 16:9, 9:16, 1:1, 4:3, 3:4
- 3-15 second clips at 720p or 1080p
- Strict reference quality: each image must be ≥400px on the short side, ≤10MB, and JPEG/JPG/PNG/WEBP; 720p or higher is recommended for clean identity
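
The reference-quality limits above can be checked locally before uploading. The sketch below is one possible pre-flight check, not the service's own validation: it works on metadata you already have (filename, pixel dimensions, byte size), the class and function names are hypothetical, and it assumes "10MB" means 10 × 1024 × 1024 bytes, which the page does not specify.

```python
from dataclasses import dataclass

MIN_SHORT_SIDE_PX = 400
MAX_BYTES = 10 * 1024 * 1024  # assumption: "10MB" read as 10 MiB
ALLOWED_EXTENSIONS = {"jpeg", "jpg", "png", "webp"}


@dataclass
class ReferenceImage:
    filename: str
    width: int
    height: int
    size_bytes: int


def reference_problems(img: ReferenceImage) -> list[str]:
    """Return a list of human-readable problems; an empty list means the image passes."""
    problems = []
    ext = img.filename.rsplit(".", 1)[-1].lower() if "." in img.filename else ""
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or 'none'}")
    if min(img.width, img.height) < MIN_SHORT_SIDE_PX:
        problems.append("short side is below 400px")
    if img.size_bytes > MAX_BYTES:
        problems.append("file is larger than 10MB")
    return problems
```

For example, a 1280×720 PNG of about 2 MB yields no problems, while a 300×900 GIF of 11 MiB fails all three checks.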
Family
Part of the Happy Horse family; use one of the other variants when you need a different starting modality:
| Variant | Input | Use it for |
|---|---|---|
| Text-to-Video | text prompt | one-shot clips from a brief |
| Image-to-Video | image + optional prompt | animating a still or hero shot |
| Video Edit | source video + edit prompt | transforming an existing clip (style, scene swap) |
| Reference-to-Video | text + 1-9 references | multi-character scenes, brand-consistent subjects |
Tech specs
- Resolutions: 720p, 1080p
- Duration: 3-15s
- References: 1-9 images, ≥400px short side, ≤10MB each, callable as `character1`-`character9`
- Audio: native, lip-synced, multilingual
- Aspect ratios: 16:9, 9:16, 1:1, 4:3, 3:4
- Latency: 90-240s typical for a 5s clip with multiple references
- Pricing: $0.14/s at 720p, $0.28/s at 1080p; references don't change the price
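
The limits in this list can be enforced client-side before submitting a job. The following is a sketch under stated assumptions: `image_urls` is named on this page, but the other field names (`prompt`, `duration`, `aspect_ratio`, `resolution`), the defaults, and the `build_request` helper are hypothetical and may not match the real request schema.

```python
ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4"}
RESOLUTIONS = {"720p", "1080p"}


def build_request(
    prompt: str,
    image_urls: list[str],
    duration: int = 5,
    aspect_ratio: str = "16:9",
    resolution: str = "720p",
) -> dict:
    """Validate inputs against the documented limits and return a request payload."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if not 1 <= len(image_urls) <= 9:
        raise ValueError("expected between 1 and 9 reference images")
    if not 3 <= duration <= 15:
        raise ValueError("duration must be between 3 and 15 seconds")
    if aspect_ratio not in ASPECT_RATIOS:
        raise ValueError(f"unsupported aspect ratio: {aspect_ratio}")
    if resolution not in RESOLUTIONS:
        raise ValueError(f"unsupported resolution: {resolution}")
    # Field names other than image_urls are assumptions, not the confirmed schema.
    return {
        "prompt": prompt,
        "image_urls": list(image_urls),
        "duration": duration,
        "aspect_ratio": aspect_ratio,
        "resolution": resolution,
    }
```

Failing fast on an out-of-range value avoids waiting through the 90-240s generation window only to have the request rejected.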
Related models
Happy Horse Text-to-Video
alibaba/happy-horse/text-to-video
Generate 1080p video with synchronized native audio from a text prompt. Aspect r...
Happy Horse Image-to-Video
alibaba/happy-horse/image-to-video
Alibaba's #1-ranked Happy Horse 1.0 — generate 1080p video with synchronized nat...
Happy Horse Video Edit
alibaba/happy-horse/video-edit
HappyHorse video editing supports advanced video editing through natural languag...
Wan 2.7 — Image to Video
alibaba/wan/v2.7/image-to-video
Wan 2.7 delivers enhanced motion smoothness, superior scene fidelity, and greate...