Happy Horse Text-to-Video
Generate 1080p video with synchronized native audio from a text prompt. Aspect ratios: 16:9, 9:16, 1:1, 4:3, 3:4. Duration: 3–15s.
By Alibaba
Pricing: $0.14 per second at 720p, $0.28 per second at 1080p
Overview
Happy Horse Text-to-Video is Alibaba's flagship 1080p video generator with **synchronized native audio** built in — no separate audio model, no lip-sync rig, no post-production. Send a single prompt; get a fully-scored clip back.
Key capabilities
- **Native audio**: ambient sound, music, voice, foley — generated in lock-step with the visuals so they actually match (no overlay tricks)
- **Multilingual**: prompts and any embedded dialogue work across major languages
- **Five aspect ratios**: 16:9, 9:16, 1:1, 4:3, 3:4 — covers landscape ads, vertical short-form, square social, and portrait formats from one endpoint
- **3-15 second clips** at 720p or 1080p
- **Cinematic motion**: responds well to prompts for complex camera moves (dolly, push-in, aerial), shallow depth of field, and golden-hour lighting
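The limits listed above can be checked client-side before a request is sent. The following is a minimal sketch, not Runflow's own validation: the `ClipRequest` class and its field names (`aspect_ratio`, `duration_s`, `resolution`) are hypothetical and chosen for illustration only; the allowed values are the ones this page states.

```python
from dataclasses import dataclass
from typing import List

# Values taken from this page; the field names below are hypothetical.
ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4"}
RESOLUTIONS = {"720p", "1080p"}
MIN_SECONDS, MAX_SECONDS = 3, 15


@dataclass(frozen=True)
class ClipRequest:
    prompt: str
    aspect_ratio: str = "16:9"
    duration_s: int = 5
    resolution: str = "1080p"

    def problems(self) -> List[str]:
        """Return a list of human-readable issues; empty means the request looks valid."""
        issues = []
        if not self.prompt.strip():
            issues.append("prompt is empty")
        if self.aspect_ratio not in ASPECT_RATIOS:
            issues.append(f"unsupported aspect ratio: {self.aspect_ratio}")
        if not MIN_SECONDS <= self.duration_s <= MAX_SECONDS:
            issues.append(f"duration must be {MIN_SECONDS}-{MAX_SECONDS}s, got {self.duration_s}")
        if self.resolution not in RESOLUTIONS:
            issues.append(f"unsupported resolution: {self.resolution}")
        return issues
```

Collecting every issue in one pass, rather than raising on the first, lets a caller report all problems with a brief before spending a billed generation on it.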
Family
Text-to-Video is part of the Happy Horse family. Switch to one of the other variants when you need a different starting modality:
| Variant | Input | Use it for |
|--------|-------|-----------|
| Text-to-Video | text prompt | one-shot clips from a brief |
| Image-to-Video | image + optional prompt | animating a still or hero shot |
| Video Edit | source video + edit prompt | transforming an existing clip (style, scene swap) |
| Reference-to-Video | text + 1-9 references | multi-character scenes, brand-consistent subjects |
Tech specs
- **Resolutions**: 720p, 1080p
- **Duration**: 3-15s
- **Audio**: native, in-sync, prompt-controlled
- **Latency**: 60-180s typical for a 5s clip; queue depth varies during peak hours
- **Pricing**: $0.14/s at 720p, $0.28/s at 1080p — simple per-second billing, no minimums
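Because billing is per second with no minimums, the cost of a clip is just the rate times the duration. A small sketch of that arithmetic, using the two rates listed above; the function name `clip_cost` is illustrative, and `Decimal` is used so currency values do not pick up floating-point noise.

```python
from decimal import Decimal

# Per-second rates as listed in the tech specs on this page.
PRICE_PER_SECOND = {"720p": Decimal("0.14"), "1080p": Decimal("0.28")}


def clip_cost(duration_s: int, resolution: str) -> Decimal:
    """Return the price in USD for one clip of the given length and resolution."""
    if resolution not in PRICE_PER_SECOND:
        raise ValueError(f"unsupported resolution: {resolution}")
    if not 3 <= duration_s <= 15:
        raise ValueError("duration must be between 3 and 15 seconds")
    return PRICE_PER_SECOND[resolution] * duration_s


print(clip_cost(5, "720p"))    # 0.70
print(clip_cost(15, "1080p"))  # 4.20
```

The longest, highest-resolution clip (15s at 1080p) therefore comes to $4.20 per generation.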
Examples
- Cinematic hummingbird
- Aerial Japanese village
- Macro espresso pour
- Studio dancer
Frequently asked questions
- What is Happy Horse Text-to-Video?
- Happy Horse Text-to-Video is Alibaba's flagship video generator that turns a single text prompt into a 1080p clip with synchronized native audio. Unlike most text-to-video models that generate silent clips, Happy Horse produces ambient sound, music, and dialogue locked to the visuals — no separate audio model or post-production required.
- How much does Happy Horse Text-to-Video cost on Runflow?
- $0.14 per second of generated video at 720p, and $0.28 per second at 1080p. Billing is simple per-second — no minimums, no setup fees, no surprises. A standard 5-second 720p clip costs $0.70.
- How long does a generation take?
- Typical latency is 60-180 seconds for a 5-second clip during off-peak hours. During peak demand the queue can be deeper. Runflow's multi-region routing helps absorb regional spikes; we automatically pick the fastest available endpoint.
- Can I use Happy Horse output commercially?
- Yes. Output generated through Runflow's API is licensed for commercial use, including ads, social content, product launches, and SaaS products. Standard commercial license — no separate model licensing required.
- Do I need to manage GPUs or infrastructure?
- No. Runflow handles GPU provisioning, queueing, multi-region failover, and capacity scaling. Send an HTTP request, get a video back. No DevOps, no Kubernetes, no AI engineering team required.
- How do I get started?
- Sign up at runflow.io, get an API key, and POST to `/v1/models/alibaba/happy-horse/text-to-video/runs` with a `prompt` field. The response includes a run ID; poll that run until it completes to get the final video URL. A full SDK and OpenAPI spec are available, and most teams have their first generation working in under 10 minutes.
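The submit-then-poll flow described in the last answer might look roughly like the sketch below. Only the `prompt` field and the dispatch URL come from this page. Everything else is an assumption to verify against the OpenAPI spec: the bearer-token header, the `status` and `video_url` response fields, and the status values. The polling URL is not documented here, so the caller supplies a `fetch_run` function rather than the sketch guessing one.

```python
import json
import time
import urllib.request

# Dispatch URL as listed under "Discoverable surfaces". The FAQ shows a path
# with an extra "alibaba/" segment, so confirm against the OpenAPI spec.
DISPATCH_URL = "https://api.runflow.io/v1/models/happy-horse/text-to-video/runs"


def build_run_request(api_key, prompt, url=DISPATCH_URL):
    """Build (but do not send) the POST that starts a generation."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # ASSUMPTION: bearer-token auth; the page only says "get an API key".
        "Authorization": f"Bearer {api_key}",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def poll_until_done(fetch_run, interval_s=5.0, timeout_s=600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch_run() until the run finishes, then return the video URL.

    ASSUMPTION: fetch_run() returns a dict with a "status" field
    ("succeeded", "failed", or anything else for still running) and, on
    success, a "video_url" field. These names are hypothetical.
    """
    deadline = clock() + timeout_s
    while True:
        run = fetch_run()
        status = run.get("status")
        if status == "succeeded":
            return run["video_url"]
        if status == "failed":
            raise RuntimeError(f"generation failed: {run}")
        if clock() >= deadline:
            raise TimeoutError(f"run not finished after {timeout_s}s")
        sleep(interval_s)
```

To actually send the request, pass the result of `build_run_request` to `urllib.request.urlopen` and read the run ID from the JSON response. The default 600-second timeout leaves headroom over the 60-180 second typical latency quoted above for a 5-second clip.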
Related models
- Happy Horse Reference-to-Video: generate 1080p video with synchronized native audio from a text prompt and references. Aspect ratios: 16:9, 9:16, 1:1, 4:3, 3:4. Duration: 3–15s.
- Happy Horse Video Edit: edit existing video through natural-language instructions. Supports local or global edits to video elements, using up to 5 reference images.
- Happy Horse Image-to-Video: Alibaba's #1-ranked Happy Horse 1.0. Generate 1080p video with synchronized native audio and multilingual lip-sync from text prompts or images.
Discoverable surfaces
- Dispatch endpoint: POST https://api.runflow.io/v1/models/happy-horse/text-to-video/runs
- Per-model spec (markdown): https://app.runflow.io/models/happy-horse/text-to-video/llms.txt
- Docs page: https://docs.runflow.io/models/happy-horse/text-to-video
- Public OpenAPI spec: https://docs.runflow.io/api/openapi.public.json
- Agent skill (start here): https://www.runflow.io/.well-known/agent-skills/runflow/SKILL.md
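The public OpenAPI spec is the most direct way to confirm the exact dispatch path and its request schema. A minimal sketch, assuming the file at the URL above is a standard OpenAPI JSON document with a top-level `paths` object; the helper names `load_spec` and `matching_paths` are illustrative.

```python
import json
import urllib.request

OPENAPI_URL = "https://docs.runflow.io/api/openapi.public.json"

# Keys of an OpenAPI path item that denote operations (others, such as
# "parameters" or "summary", are skipped).
HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options", "trace"}


def load_spec(url=OPENAPI_URL):
    """Download and parse the spec. Requires network access."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def matching_paths(spec, needle):
    """Return sorted (METHOD, path) pairs for every path containing `needle`."""
    hits = []
    for path, item in spec.get("paths", {}).items():
        if needle not in path:
            continue
        for key in item:
            if key.lower() in HTTP_METHODS:
                hits.append((key.upper(), path))
    return sorted(hits)
```

Calling `matching_paths(load_spec(), "happy-horse")` would list whichever Happy Horse routes the spec actually defines.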