Happy Horse Reference-to-Video
Generate 1080p video with synchronized native audio from a text prompt and references. Aspect ratios: 16:9, 9:16, 1:1, 4:3, 3:4. Duration: 3–15s.
By Alibaba
Pricing: $0.14 per second at 720p, $0.28 per second at 1080p
Overview
Happy Horse Reference-to-Video composes 1080p multi-character scenes from **1-9 reference images plus a text prompt**. Reference images are addressable by index (`character1`, `character2`, …) so you can stage interactions, brand-consistent product placements, or multi-subject scenes with stable identity across the whole clip.
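The index addressing described above can be sketched in a few lines of Python. This is only an illustration of the ordering rule (`character1` is the first entry of `image_urls`, `character2` the second, and so on). The helper name `character_tokens` and the example URLs are hypothetical and not part of the Runflow API.

```python
import re


def character_tokens(image_urls):
    """Map each reference image URL to its prompt token, in array order."""
    if not 1 <= len(image_urls) <= 9:
        raise ValueError("expected 1-9 reference images")
    # character1 refers to image_urls[0], character2 to image_urls[1], ...
    return {f"character{i}": url for i, url in enumerate(image_urls, start=1)}


image_urls = [
    "https://example.com/dancer_a.png",  # placeholder URL
    "https://example.com/dancer_b.png",  # placeholder URL
]
tokens = character_tokens(image_urls)

prompt = (
    "A dance battle between character1 and character2, "
    "neon city background, cinematic camera"
)

# Find any characterN mentioned in the prompt that has no matching image.
used = set(re.findall(r"character\d+", prompt))
missing = used - set(tokens)
```

A check like `missing` is a cheap way to catch a prompt that mentions `character3` when only two images were supplied, before paying for a run.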
Key capabilities
- **Up to 9 references**: each callable as `character1`-`character9` in your prompt — order matches your `image_urls` array
- **Identity preservation**: subjects keep their reference appearance across motion and scene changes
- **Native audio + lip-sync**: same audio engine as the rest of the family — multilingual and synchronized to motion
- **Five aspect ratios**: 16:9, 9:16, 1:1, 4:3, 3:4
- **3-15 second clips** at 720p or 1080p
- **Strict reference quality**: each image must be ≥400px on the short side, ≤10MB, JPEG/JPG/PNG/WEBP — 720p+ recommended for clean identity
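The reference quality limits above can be checked locally before uploading. The sketch below works on image metadata you already have (width, height, file size, extension) and does not read files. The function name `check_reference` is hypothetical, and two details are assumptions: that 10MB means 10 × 1024 × 1024 bytes, and that "720p+" means a short side of at least 720px.

```python
ALLOWED_EXTENSIONS = {"jpeg", "jpg", "png", "webp"}
MIN_SHORT_SIDE_PX = 400
RECOMMENDED_SHORT_SIDE_PX = 720  # assumption: "720p+" read as short side >= 720px
MAX_BYTES = 10 * 1024 * 1024     # assumption: 10MB read as 10 MiB


def check_reference(width, height, size_bytes, extension):
    """Return (errors, warnings) for one reference image's metadata."""
    errors, warnings = [], []
    short_side = min(width, height)
    if extension.lower().lstrip(".") not in ALLOWED_EXTENSIONS:
        errors.append(f"unsupported format: {extension}")
    if size_bytes > MAX_BYTES:
        errors.append("file larger than 10MB")
    if short_side < MIN_SHORT_SIDE_PX:
        errors.append(f"short side {short_side}px is below 400px")
    elif short_side < RECOMMENDED_SHORT_SIDE_PX:
        warnings.append(f"short side {short_side}px is below the recommended 720px")
    return errors, warnings


ok = check_reference(1280, 720, 2_000_000, "png")
soft = check_reference(640, 480, 2_000_000, ".JPG")
bad = check_reference(300, 900, 11 * 1024 * 1024, "gif")
```

Errors correspond to the hard limits on this page; warnings only flag images that are accepted but may weaken identity preservation.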
Family
Part of the Happy Horse family. Switch to a sibling variant when you need a different starting modality:
| Variant | Input | Use it for |
|--------|-------|-----------|
| Text-to-Video | text prompt | one-shot clips from a brief |
| Image-to-Video | image + optional prompt | animating a still or hero shot |
| Video Edit | source video + edit prompt | transforming an existing clip (style, scene swap) |
| Reference-to-Video | text + 1-9 references | multi-character scenes, brand-consistent subjects |
Tech specs
- **Resolutions**: 720p, 1080p
- **Duration**: 3-15s
- **References**: 1-9 images, ≥400px short side, ≤10MB each, callable as `character1`-`character9`
- **Audio**: native, lip-synced, multilingual
- **Aspect ratios**: 16:9, 9:16, 1:1, 4:3, 3:4
- **Latency**: 90-240s typical for a 5s clip with multiple references
- **Pricing**: $0.14/s at 720p, $0.28/s at 1080p — references don't change the price
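As a worked example of the per-second pricing above, the sketch below estimates the cost of a clip from its duration and resolution. It assumes billing is simply rate × seconds with whole-second durations, which this page implies but does not state. The function name `clip_cost` is hypothetical.

```python
from decimal import Decimal

# Per-second rates listed in the tech specs.
RATE_PER_SECOND = {"720p": Decimal("0.14"), "1080p": Decimal("0.28")}


def clip_cost(duration_s, resolution="1080p"):
    """Estimated cost in USD for one clip; reference count does not matter."""
    if not 3 <= duration_s <= 15:
        raise ValueError("duration must be between 3 and 15 seconds")
    return RATE_PER_SECOND[resolution] * duration_s


print(clip_cost(5, "720p"))    # 0.70
print(clip_cost(15, "1080p"))  # 4.20
```

`Decimal` is used instead of floats so the totals stay exact to the cent.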
Examples
- Neon-lit Tokyo street
- Sunlit meadow
- Mountaintop sunrise
- Cafe by window
Frequently asked questions
- What is Happy Horse Reference-to-Video?
  Happy Horse Reference-to-Video composes multi-subject 1080p video scenes from 1-9 reference images plus a text prompt. Each reference is addressable as `character1`, `character2`, … in the prompt, so you can stage interactions, multi-product shots, or brand-consistent scenes with stable identity across the whole clip.
- How much does Happy Horse Reference-to-Video cost on Runflow?
  $0.14 per second of output at 720p, and $0.28 per second at 1080p. The number of reference images doesn't change the price: the same flat per-second rate applies from 1 to 9 references.
- How do I reference my images in the prompt?
  List your images in `image_urls` (1-9 entries). In the prompt, refer to them as `character1`, `character2`, …, in the same order. Example: with two images, prompt `A dance battle between character1 and character2, neon city background, cinematic camera`.
- What reference image quality is required?
  JPEG, JPG, PNG, or WEBP up to 10MB each. The shortest side must be at least 400 pixels; 720p or higher is recommended for clean identity preservation. Cluttered or low-resolution refs reduce subject consistency.
- Can I use Happy Horse output commercially?
  Yes. All output via Runflow is licensed for commercial use, including ads and products with branded subjects.
- Do I need to manage GPUs or infrastructure?
  No. Runflow handles GPU provisioning, queueing, multi-region routing, and scaling. One HTTP call returns a video URL.
Related models
- Happy Horse Video Edit: edit existing video through natural-language instructions, with local or global changes to video elements guided by up to 5 reference images.
- Happy Horse Image-to-Video: Alibaba's #1-ranked Happy Horse 1.0. Generate 1080p video with synchronized native audio and multilingual lip-sync from text prompts or images.
- Happy Horse Text-to-Video: generate 1080p video with synchronized native audio from a text prompt. Aspect ratios: 16:9, 9:16, 1:1, 4:3, 3:4. Duration: 3–15s.
Discoverable surfaces
- Dispatch endpoint: `POST https://api.runflow.io/v1/models/happy-horse/reference-to-video/runs`
- Per-model spec (markdown): https://app.runflow.io/models/happy-horse/reference-to-video/llms.txt
- Docs page: https://docs.runflow.io/models/happy-horse/reference-to-video
- Public OpenAPI spec: https://docs.runflow.io/api/openapi.public.json
- Agent skill (start here): https://www.runflow.io/.well-known/agent-skills/runflow/SKILL.md
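A call to the dispatch endpoint might be assembled as in the sketch below. Only the URL, the `POST` method, and the `image_urls` array come from this page. The other JSON field names (`prompt`, `duration`, `aspect_ratio`, `resolution`) and the bearer-token header are assumptions made for illustration, so check the per-model spec or the OpenAPI document for the real schema. The code builds the request but does not send it.

```python
import json
import urllib.request

RUNS_URL = "https://api.runflow.io/v1/models/happy-horse/reference-to-video/runs"


def build_run_request(api_key, prompt, image_urls, duration=5,
                      aspect_ratio="16:9", resolution="1080p"):
    """Build (but do not send) a POST request for the dispatch endpoint."""
    if not 1 <= len(image_urls) <= 9:
        raise ValueError("expected 1-9 reference images")
    payload = {
        "prompt": prompt,              # assumed field name
        "image_urls": image_urls,      # named on this page
        "duration": duration,          # assumed field name
        "aspect_ratio": aspect_ratio,  # assumed field name
        "resolution": resolution,      # assumed field name
    }
    return urllib.request.Request(
        RUNS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_run_request(
    "YOUR_API_KEY",  # placeholder
    "character1 hands a coffee to character2 in a cafe by the window",
    ["https://example.com/a.png", "https://example.com/b.png"],  # placeholder URLs
)
# To send it: urllib.request.urlopen(request), then parse the JSON response.
```

Keeping construction separate from sending makes it easy to inspect or log the payload before a billable run is dispatched.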