ComfyUI WAN text to video: the 2026 setup and the API math
ComfyUI WAN text to video runs three templates in 2026: the 5B hybrid and two 14B dual-sampler graphs. The honest local setup, plus the cost math of an API.
Two models loaded for one clip. One trained for the first half of the denoise, one for the second. That is what ComfyUI WAN text to video asks of your machine the moment you move past the small model. WAN 2.2 shipped with three templates, and the good one loads two 14-billion-parameter models for every generation. We learned what that costs the expensive way: a render that should take a minute stretches to several, the queue backs up, and the GPU fan never stops.
This post walks the WAN 2.2 text to video setup in ComfyUI, the same one Koala Nation demoed in their guide. Then it does the part the tutorials skip: the cost and speed math of running this in an app, where one machine and one seed are not enough.
(A note on disclosure: Runflow is our product. The ComfyUI method below is provider-agnostic and works whether or not you touch Runflow. The API numbers are ours, and we flag where each path fits.)
Koala Nation runs clear, no-fluff ComfyUI walkthroughs. If you build video workflows, the channel is worth a subscribe. Credit for the local method in this post goes to them.
What WAN 2.2 actually ships in ComfyUI
WAN 2.2 gives ComfyUI three templates: a 5-billion-parameter hybrid that does both text to video and image to video, and two 14-billion-parameter graphs, one dedicated to text to video and one to image to video.
When you open ComfyUI after updating, the WAN 2.2 templates show up in the template browser. The first one, the 5B model, is the friendly entry point. It is a hybrid, so the same weights handle text to video and image to video. You change the prompt, you run, you get a clip. The template itself tells you which models to download and links straight to the WAN 2.2 Hugging Face repository.
The download is not one file. WAN separates the pieces into folders: the VAE, the text encoder (which ComfyUI loads through its CLIP loader node), and the diffusion models. Different workflows pull different VAEs, so read the template before you drag a file into the wrong directory. The official ComfyUI blog post for WAN 2.2 lays out the install steps and the model links, and it is worth keeping open in a second tab the first time through.
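To make the folder split concrete, here is a minimal sketch of a helper that says where each kind of file belongs. The subfolder names assume a standard ComfyUI install (`models/diffusion_models`, `models/text_encoders`, `models/vae`), and the filename in the usage note is a made-up placeholder, so treat the template's own download notes as the source of truth.

```python
from pathlib import Path

# Assumed layout: subfolders of the "models" directory in a standard
# ComfyUI install. Verify against your own checkout.
MODEL_DIRS = {
    "diffusion_model": "diffusion_models",
    "text_encoder": "text_encoders",
    "vae": "vae",
}


def target_path(comfy_root, kind, filename):
    """Return the path where a downloaded file of the given kind should live."""
    if kind not in MODEL_DIRS:
        raise ValueError(f"unknown model kind: {kind!r}")
    return Path(comfy_root) / "models" / MODEL_DIRS[kind] / filename
```

Calling `target_path("ComfyUI", "vae", "example_vae.safetensors")` gives the VAE folder under the ComfyUI root. The point of writing it down is to catch a diffusion model dropped into the VAE folder before ComfyUI fails to list it.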
One detail that bites people: updating ComfyUI is sometimes not enough to surface the templates. If they do not appear, update the template package manually with pip install. If you run ComfyUI in a virtual environment, activate it first, or the install lands somewhere your ComfyUI never reads.
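A quick way to see whether an install landed in the right place is to ask the interpreter that runs ComfyUI. The sketch below assumes the templates ship as a PyPI distribution named `comfyui-workflow-templates`. That name is our assumption, so confirm it against the `requirements.txt` in your ComfyUI checkout.

```python
import sys
from importlib import metadata

# Assumption: the template package's distribution name. Check it against
# the requirements.txt that ships with your ComfyUI version.
TEMPLATE_PACKAGE = "comfyui-workflow-templates"


def in_virtualenv():
    """True when this interpreter runs inside a venv or virtualenv.

    Conda environments are not detected by this check.
    """
    return sys.prefix != sys.base_prefix


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("virtual environment active:", in_virtualenv())
    print(TEMPLATE_PACKAGE, "->", installed_version(TEMPLATE_PACKAGE) or "not installed")
```

Run it with the same Python you use to launch ComfyUI. If it reports "not installed" right after a pip install, the install almost certainly went to a different interpreter.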
The 5B model: where to start, and where it stops
The 5B hybrid is the fastest WAN 2.2 path and the most seed-dependent, which makes it a great draft tool and a frustrating final one.
Start here. Load the 5B template, point it at the WAN 2.2 VAE, swap the prompt to describe your scene, and run. Add a video combine node if you want the frames saved as an actual file, though the native save nodes work too.
The output surprises you in both directions. Some seeds produce a clean, watchable clip. Others fall apart, and you have not changed anything except the random number. That is the honest reality of the small model: you get lucky or you do not, and you find out only after the render finishes.
For image to video on the same 5B template, you switch on the load image node (Ctrl+B, or the arrow on the node's popup bar), pick your reference frame, and write a prompt that describes the motion you want. One easy mistake here is orientation. A portrait reference needs the width and height swapped in the image-to-video latent node, or the model fights your aspect ratio the whole way.
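The orientation rule is simple enough to write down. This sketch picks the latent width and height from the reference image's dimensions. The 1280 and 704 defaults are our assumption about the template's landscape size, so substitute whatever your latent node shows.

```python
def latent_size_for(image_width, image_height, long_side=1280, short_side=704):
    """Return (width, height) for the latent, matching the reference's orientation.

    long_side and short_side are assumed defaults, not values read from the
    template. Square images are treated as landscape.
    """
    if image_height > image_width:
        return short_side, long_side  # portrait: swap so height is the long side
    return long_side, short_side      # landscape or square
```

For a 720 by 1280 portrait reference this returns `(704, 1280)`, which is the swap described above.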
The 5B model is the right place to iterate on prompts cheaply. It is the wrong place to ship from, because the quality ceiling is low and the seed roulette is real.
The 14B dual-sampler split, explained without the hand-waving
The 14B WAN 2.2 graph runs two models in sequence: a high-noise model that sets the layout, then a low-noise model that refines the detail. That two-pass design is why it looks better and why it costs more.
This is the part worth understanding, because every WAN 2.2 tutorial shows it and almost none explain it. The 14B text-to-video template loads two diffusion models, not one. There is a high-noise model and a low-noise model, and the FP8 versions are the ones you want for memory reasons.
The first sampler runs the high-noise model. It is trained for the early denoise steps, where the model decides the overall layout and motion of the scene. The second sampler takes the handoff and runs the low-noise model, which is trained for the late steps, where the fine video detail gets resolved. Set both to the default sampler type, run, and you watch the clip build in two passes.
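The handoff can be sketched as two step ranges over one shared schedule. The step count and the halfway handoff below are assumptions for illustration, not values confirmed by the source. Read the real ones off the two sampler nodes in the template, where ComfyUI's advanced KSampler exposes them as `start_at_step` and `end_at_step`.

```python
def split_steps(total_steps, handoff_fraction=0.5):
    """Return the (start, end) step range for each of the two passes.

    The high-noise model covers the early steps and the low-noise model
    picks up at the handoff step and runs to the end.
    """
    if total_steps < 2:
        raise ValueError("need at least 2 steps to split across two samplers")
    if not 0.0 < handoff_fraction < 1.0:
        raise ValueError("handoff_fraction must be strictly between 0 and 1")
    handoff = round(total_steps * handoff_fraction)
    # Guarantee each pass gets at least one step.
    handoff = min(max(handoff, 1), total_steps - 1)
    return {"high_noise": (0, handoff), "low_noise": (handoff, total_steps)}
```

With an assumed 20 steps, `split_steps(20)` gives the high-noise pass steps 0 to 10 and the low-noise pass steps 10 to 20. The second pass continues the first pass's partially denoised latent instead of starting from fresh noise.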
One thing the template hides: the 14B graph uses the WAN 2.1 VAE, not the new 2.2 one the 5B template wanted. Confirm the correct VAE is loaded, or the decode stage throws an error. (We have lost a render to a mismatched VAE more than once, so check it before you queue a long job.)
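The payoff of two models is real. You get more dynamism, more detail, and more controllable motion than the 5B model can manage. The cost is also real: two model loads, two sampling passes, and twice the weights to hold or swap in memory. In the stock template the denoise steps are split between the two samplers rather than doubled, so the penalty shows up mostly as memory pressure and model-swap time, on top of every step being far heavier than the 5B model's. There is no free version of "better."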
The settings that decide whether you wait one minute or ten
Resolution and frame count are the two dials that move WAN 2.2 render time the most, and 480p is the honest default for anything that is not a final hero shot.
Koala Nation's demo makes the tradeoff concrete. At 720p, the 14B render takes too long to be practical for iteration, so they drop it to 480p. That single change is the difference between a coffee-break render and a lunch-break one.
The other dial is frame count. The demo keeps 121 frames, which at the workflow's frame rate is a few seconds of motion. If you are drafting, cut the frame count. Fewer frames means a shorter render, and you can upscale and interpolate the winning clip later instead of paying for full resolution on every failed seed.
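The frame-count arithmetic is worth doing before you queue anything. This sketch converts frames to seconds and picks a shorter draft length. The frame rate is a parameter because the source does not state it, and rounding to a count of the form 4n + 1 is our assumption about how WAN's latent nodes step the length field.

```python
def clip_seconds(num_frames, fps):
    """Length of the clip in seconds at a given playback frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


def draft_frames(target_seconds, fps):
    """Largest frame count of the form 4n + 1 that fits in target_seconds.

    The 4n + 1 shape is an assumption; check the step size on the length
    field of your latent node. Never returns fewer than one frame.
    """
    if fps <= 0 or target_seconds <= 0:
        raise ValueError("fps and target_seconds must be positive")
    raw = int(target_seconds * fps)
    n = max((raw - 1) // 4, 0)
    return 4 * n + 1
```

At an assumed 24 frames per second, 121 frames is just over five seconds, and `draft_frames(2, 24)` suggests 45 frames for a two-second draft.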
A rough hierarchy of what to lower first when WAN is too slow:
| Dial | Effect on time | Effect on quality | When to lower it |
|---|---|---|---|
| Resolution (720p to 480p) | Large reduction | Recoverable via upscale | Always, while drafting |
| Frame count (121 to fewer) | Linear reduction | Shorter clip | Drafting, or known short scene |
| Sampler steps | Linear reduction | Softer detail | Last resort, hurts quality fast |
| Model size (14B to 5B) | Large reduction | Lower ceiling, seed roulette | Prompt iteration only |
The pattern that works: draft on the 5B model or low-res 14B, lock the prompt and seed, then run one final high-res pass on the 14B. Do not pay 720p prices on every experiment.
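To put rough numbers on the dials, a crude proxy is pixels times frames times steps. This is a back-of-envelope sketch, not a benchmark: it assumes cost scales linearly in each dial, which understates the real gap because attention grows faster than linearly with the number of tokens. The 1280x720 and 832x480 sizes and the 20-step and 49-frame values are assumptions for illustration.

```python
def relative_cost(width, height, frames, steps):
    """Unitless cost proxy: pixels per frame x frames x sampler steps."""
    return width * height * frames * steps


def speedup(final, draft):
    """How many times cheaper the draft settings are under the linear proxy."""
    return relative_cost(**final) / relative_cost(**draft)


# Assumed settings for illustration only.
final_pass = {"width": 1280, "height": 720, "frames": 121, "steps": 20}
draft_pass = {"width": 832, "height": 480, "frames": 49, "steps": 20}

if __name__ == "__main__":
    print(f"draft is roughly {speedup(final_pass, draft_pass):.1f}x cheaper")
```

Under these assumptions the resolution drop alone is worth about 2.3x, and combining it with the shorter clip gets to roughly 5.7x. That is why resolution and frame count sit at the top of the table.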
What changes when WAN text to video runs in an app
On one machine, WAN 2.2 text to video is a creative tool. The moment real users hit it, it becomes a queue, a GPU bill, and an uptime problem, and that is a different job.
Here is the wall. The 14B graph loads two models and runs two sampling passes per clip. On a single 4090 with nobody else in the queue, that is workable. Put it behind a product where ten people request a video in the same minute, and they are not in parallel. They are in line. The eleventh person waits for the first ten renders to finish, and a multi-minute render times ten is the kind of wait that makes people close the tab.
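The line is easy to model. This sketch assumes a first-in, first-out queue, identical GPUs, and a constant render time. The 180-second render in the usage note is a hypothetical stand-in for "multi-minute", not a measurement.

```python
def wait_before_start(position, render_seconds, workers=1):
    """Seconds a request waits before its render starts.

    position is 1-based: position 1 is the first request in the queue.
    Assumes FIFO order, identical workers, and a constant render time.
    """
    if position < 1 or workers < 1:
        raise ValueError("position and workers must be at least 1")
    rounds_ahead = (position - 1) // workers
    return rounds_ahead * render_seconds
```

With one GPU and a hypothetical 180-second render, `wait_before_start(11, 180)` is 1800 seconds, so the eleventh person waits half an hour before their render even begins. Four GPUs cut that to 360 seconds, which is the arithmetic that leads people to the naive fix.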
The naive fix is to buy more GPUs. (We have run that math, and it does not end well for a startup: a rack of 4090s sitting idle between traffic spikes is money on fire, and a rack that is always busy means you bought too few.) You are now running a GPU cluster, an autoscaler, a queue, and a retry policy. None of that is the product you set out to build.
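The idle-rack problem reduces to one division. This sketch spreads a GPU's hourly cost over the clips it actually renders. Every number in the usage note is a placeholder chosen for easy arithmetic, not a real price or a Runflow figure.

```python
def self_hosted_cost_per_clip(gpu_cost_per_hour, render_seconds, utilization):
    """Effective cost of one clip on a GPU you pay for around the clock.

    utilization is the fraction of paid hours spent rendering (0 < u <= 1).
    """
    if render_seconds <= 0:
        raise ValueError("render_seconds must be positive")
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    clips_per_busy_hour = 3600 / render_seconds
    return gpu_cost_per_hour / (clips_per_busy_hour * utilization)
```

With a placeholder $1.00 per hour and a 180-second render, a fully busy GPU costs $0.05 per clip. At 25% utilization the same clip costs $0.20. The hardware bill did not change; the idle hours just got charged to fewer clips.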
The other path is to keep the ComfyUI workflow you already validated and run it as an API. You build and test the graph locally, the same WAN 2.2 dual-sampler setup from this post, then deploy it so a single HTTP call dispatches a render on managed GPUs. We wrote the full version of this in the ComfyUI API developer guide, and the deploy side lives on our ComfyUI deploy page. The honest pitch: it is roughly 70% cheaper than standing up and babysitting your own GPU fleet, and there is no AI infra team to hire. Pricing is simple and fixed per call, so the cost of a clip is a number you can put in a spreadsheet before you ship.
If you would rather not deploy a custom graph at all, the hosted WAN video models in the Runflow model catalog are callable directly. Same model family, no ComfyUI graph to maintain.
Calling WAN as an API instead of a local queue
Once a WAN workflow is deployed, generating a video is a POST to start a run and a GET to poll for the result, which is the same two-step pattern across the whole catalog.
You start the run, get a run id back, then poll until the video is ready. No node graph at request time, no GPU to keep warm.
```bash
# Start a WAN text-to-video run.
# Replace {owner}/{slug} with the values from the model's page.
curl -X POST "https://api.runflow.io/v1/models/{owner}/{slug}/runs" \
  -H "Authorization: Bearer $RUNFLOW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "prompt": "a paper boat drifting down a rain-soaked city gutter, cinematic",
      "num_frames": 121,
      "resolution": "480p"
    }
  }'
# Response: { "id": "run_abc123", "status": "queued" }

# Poll until the run finishes
curl "https://api.runflow.io/v1/runs/run_abc123" \
  -H "Authorization: Bearer $RUNFLOW_API_KEY"
# -> { "status": "succeeded", "output": { "video_url": "https://..." } }
```

The exact {owner}/{slug} and input fields come from the model's own page, browsable through the Solutions and model API overview. The shape stays constant: one call to dispatch, one call to poll, and the result is a URL you hand to your frontend. The queueing, scaling, and retries that you would otherwise hand-roll around ComfyUI happen on the other side of that call.
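In application code the same two calls usually sit behind a small polling loop. The sketch below mirrors the endpoints and response fields from the curl example using only the Python standard library. The `failed` and `canceled` statuses, the polling interval, and the timeout are assumptions on our part, so check the model page for the real terminal states.

```python
import json
import time
import urllib.request

API_BASE = "https://api.runflow.io/v1"


def _request(method, url, api_key, payload=None):
    """Send one JSON request and return the decoded JSON response."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = urllib.request.Request(url, data=data, method=method)
    req.add_header("Authorization", f"Bearer {api_key}")
    if data is not None:
        req.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def start_run(owner, slug, inputs, api_key, request=_request):
    """Dispatch a run and return its id."""
    url = f"{API_BASE}/models/{owner}/{slug}/runs"
    return request("POST", url, api_key, {"input": inputs})["id"]


def wait_for_run(run_id, api_key, request=_request,
                 interval=5.0, timeout=900.0, sleep=time.sleep):
    """Poll a run until it succeeds, fails, or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        run = request("GET", f"{API_BASE}/runs/{run_id}", api_key)
        status = run.get("status")
        if status == "succeeded":
            return run["output"]["video_url"]
        if status in ("failed", "canceled"):  # assumed terminal states
            raise RuntimeError(f"run {run_id} ended with status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run {run_id} still {status!r} after {timeout}s")
        sleep(interval)
```

Typical use is `run_id = start_run(owner, slug, {"prompt": "...", "num_frames": 121, "resolution": "480p"}, api_key)` followed by `wait_for_run(run_id, api_key)`. The `request` and `sleep` parameters exist so you can swap in fakes and test the loop without touching the network.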
Frequently asked questions
What is the difference between WAN 2.2 text to video and image to video in ComfyUI?
Text to video generates a clip from a prompt alone. Image to video animates a reference image you supply, guided by a prompt that describes the motion. The 5B hybrid template does both; the 14B model splits them into two dedicated templates.
Which WAN 2.2 template should I start with?
Start with the 5B hybrid. It is the fastest to run and the easiest to install, which makes it ideal for testing prompts. Move to the 14B text-to-video template when you need higher quality and more controllable motion.
Why does the 14B WAN 2.2 workflow load two models?
It uses a high-noise model for the early denoise steps that set the scene layout, then a low-noise model for the late steps that refine detail. The two-pass design improves quality at the cost of a second 14B model to load and hold in memory.
Why is my WAN 2.2 video low quality on the 5B model?
The 5B model is heavily seed-dependent. Some seeds produce clean clips and others fall apart with no other change. Re-run with a different seed, or move to the 14B model for a higher and more consistent quality ceiling.
What resolution should I use for WAN 2.2 in ComfyUI?
Use 480p while drafting. The demo drops from 720p to 480p because 720p renders take too long to iterate on. Upscale the final winning clip rather than paying full resolution on every test seed.
The WAN 2.2 templates do not appear after updating ComfyUI. What now?
Updating ComfyUI is sometimes not enough. Install the templates manually with the pip install command from the ComfyUI WAN 2.2 docs. If you use a virtual environment, activate it first so the install lands where ComfyUI reads.
Which VAE does the 14B WAN 2.2 graph use?
The 14B graph uses the WAN 2.1 VAE, not the newer 2.2 VAE that the 5B hybrid template loads. Confirm the correct VAE is wired in, or the decode stage will fail.
Can I run WAN 2.2 text to video without a powerful local GPU?
Yes. You can deploy the ComfyUI workflow as an API and dispatch renders on managed GPUs, or call a hosted WAN video model directly. Both move the heavy compute off your machine and into a single HTTP call.
How many frames should I generate?
The demo keeps 121 frames for a few seconds of motion. Lower the count while drafting to shorten render time, then interpolate or extend the final clip. After resolution, cutting frames is the cheapest way to speed up a test pass.
Is running WAN as an API cheaper than buying GPUs?
For most products, yes. A self-managed GPU fleet is idle between traffic spikes and undersized during them. Managed dispatch with simple fixed per-call pricing is roughly 70% cheaper than running your own cluster, and there is no infra team required.
Where to go next
WAN 2.2 text to video is a genuinely good model wrapped in a workflow that punishes a single machine. Here is the honest path from demo to production.
- Update ComfyUI and load the 5B hybrid template first. Get one clip out before anything else.
- Iterate prompts on the 5B model or low-res 14B, where renders are cheap and fast.
- Move to the 14B text-to-video template and confirm the WAN 2.1 VAE plus both high-noise and low-noise FP8 models are loaded.
- Draft at 480p with a reduced frame count, then run one final high-res pass on the seed that won.
- When real users arrive, stop scaling GPUs by hand. Read the ComfyUI API developer guide and deploy the graph through ComfyUI deploy.
- If you want a hosted model with no graph to maintain, browse the model catalog and the API overview.
- Before you commit, compare your options honestly in our ComfyUI cloud providers breakdown.
So which is it for you: one more GPU in the closet, or one HTTP call? Start free at runflow.io.