Guides · May 12, 2026

Building an AI Image Generator API: 14 Things That Broke When We Built Our Own AI Studio

Honest field notes from building Runflow's AI Studio: every contract drift, every silent 422, every 'wait, why is the mask all zeros'. What broke, what got fixed, what you'd actually have to build yourself.

Ricardo Ghekiere

Co-Founder and CEO of Runflow

Honest field notes from building Runflow Studio. Every model contract drift, every silent 422, every "wait, why is the mask all zeros". Most of it is fixed now. Some of it isn't. This is the article we wished existed before we started.

Most articles on "how to build an AI image generator API" are written by people who haven't.

They walk you through pip install diffusers, throw in a FastAPI wrapper, slap "production-ready" on the title, and call it done.

The actual work starts the morning your second customer logs in.

This piece is different. We built our own AI image generation studio at Runflow. Then we built it again. Then we built it a third time after the model providers changed their API contracts overnight without a changelog. We took bumps coordinating with Miguel, our model person, and learned that "the model is broken" and "the model is fine, your payload is wrong" sound identical until you spend four hours debugging.

Below are the fourteen specific things that broke, what we tried, what worked, and what we still wouldn't recommend. By the end you'll have a clear-eyed picture of what an AI Image Generator API actually has to do to survive contact with users. You'll also have a fair shot at deciding whether you want to build one yourself, or just plug into something like Runflow + Sentinel and ship next week.

No affiliate pages dressed as guides. No "best practices" lists abstracted from anything real. Just what broke.

What an AI Image Generator API actually has to do

Before we get to the bumps, the boring foundation. An AI Image Generator API is not a model wrapper.

A model wrapper is one HTTP call, one model, one response. You can build a model wrapper in an afternoon and demo it on your laptop.

An AI Image Generator API has to do four hard things at the same time:

Route between many models because no single model wins at every job. Stable Diffusion 3 is great at one thing, Flux is great at another, nano-banana-pro is great at a third, and tomorrow there will be a fourth that's better at all of them. The routing layer is your moat or your bottleneck.
Manage inputs that the model didn't ship with. Brush masks, reference images, prompt steering, aspect ratios, output resolutions. The model wants its specific tensor shape; the user wants to drag a file onto a canvas. The studio is everything in between.
Evaluate the output. Generative models hallucinate. They put six fingers on hands. They draw the logo upside down. If your studio doesn't catch that before the user does, your studio is just an expensive way to ship broken images.
Stay up. Vercel cold starts, R2 expirations, presigned URL TTLs, cron double-fires, model providers going dark for "scheduled maintenance" they didn't schedule. The infrastructure under your studio will betray you in inventive ways.

Hold those four in your head. Most of what's about to break in this article is one of those four pretending to be something else.

The fourteen things that broke

We're going to walk through these in roughly the order we hit them. Some are model-level. Some are payload-level. Some are infrastructure. A few are pure UX traps that no architecture diagram tells you about.

All of them are now resolved or deliberately parked. Some were resolved by Miguel pushing a fix upstream. Some by us catching it on our side. A handful by deciding the feature wasn't worth shipping until the upstream model could actually deliver.

1. The zip payload that returned 422 overnight

The original API contract for our editing models took a zip file. You'd upload bundle.zip containing photo.png, mask.png, and optionally reference.png, then send the model a zip_url plus three filename keys (file_input, file_mask, file_reference). Clean on paper. One upload, three files inside.

Three days ago, every dispatch started returning 422.

Nothing on our side had changed. The fibbl demo, the runflow-studio demo, the retaillabs reference-inpaint flow, all dead at the same moment. The Runflow gateway's underlying models had silently shifted their input contract: drop the zip, send three direct URLs (image_url, mask_url, reference_url).

We ripped out JSZip entirely. The dispatcher now uploads each file separately to a presigned R2 endpoint and passes three URLs to the model. Three round-trips per generation instead of one. Slightly more network, far less complexity. The 422s went away.

Lesson: AI model APIs do not come with a changelog. The contract you wrote against last quarter may not be the contract today. Build a thin normalization layer between your client code and the model API, and assume you'll be rewriting that layer every six weeks.
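
A minimal sketch of what that normalization layer can look like. The client-side field names (imageUrl, maskUrl, referenceUrl) and the function name are hypothetical; the upstream keys are the three direct URLs described above. This is an illustration under those assumptions, not our actual dispatcher code.

```javascript
// Hypothetical normalization layer: client code speaks one stable shape,
// and only this function knows what the model API currently expects.
function toModelPayload({ imageUrl, maskUrl, referenceUrl, ...rest }) {
  if (!imageUrl) {
    throw new Error('toModelPayload: imageUrl is required');
  }
  const payload = { ...rest, image_url: imageUrl };
  // Optional inputs are omitted rather than sent as null, on the assumption
  // that the upstream validator may reject unexpected nulls.
  if (maskUrl) payload.mask_url = maskUrl;
  if (referenceUrl) payload.reference_url = referenceUrl;
  return payload;
}
```

When the contract shifts again, this is the only function that should need to change.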

2. The allowlist that silently rejected new models

Every model the demos can dispatch lives in a single file: lib/allowlist.mjs. The proxy reads it on every request and rejects any model not in the set.

This is the right design. It stops a compromised demo from using your Runflow account to mine other people's quota.

It is also the wrong design when you ship a new workflow at 4pm and forget to add the model to the allowlist.

The first time it happened, the UI showed "Workflow failed" with no detail. The server logged "model not allowed". The client console was empty. Twenty minutes of staring at network tabs.

Now the allowlist is also injected into the client-side studioHandle, and the dispatcher pre-validates before the network round-trip. If you forget to allowlist a new model, the error fires in DevTools the moment you click Apply, with the missing model ID in plain text.

Lesson: A silent 403 is worse than a loud 500. Move your validation as close to the user as possible, even if you also keep the server-side check.
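
A sketch of the client-side pre-check, assuming the allowlist reaches the client as a Set of model IDs. The function name and the example set are hypothetical; the real list lives in lib/allowlist.mjs and the server-side check stays in place.

```javascript
// Hypothetical client-side pre-check. The server keeps its own check;
// this one only exists to fail loudly before the network round-trip.
function assertModelAllowed(modelId, allowlist) {
  if (!allowlist.has(modelId)) {
    throw new Error(
      `Model "${modelId}" is not in the allowlist. Add it to lib/allowlist.mjs.`,
    );
  }
  return modelId;
}

// Example allowlist for illustration only.
const ALLOWED_MODELS = new Set(['google/nano-banana-pro']);
```

Calling assertModelAllowed at the top of the dispatcher puts the missing model ID in the console the moment the user clicks Apply.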

3. The output shape that drifted between models

Runflow's edit models return their output in a { output: { outputs: [{ url, width, height }] } } shape. Mostly.

A few older models return { output: { image_urls: [...] } }. A few generative models return { output: { image: { url } } }. We didn't know this until the studio showed a blank canvas after a successful 200 response.

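The dispatcher now checks all three shapes in order:

// Fall back to legacy shapes so a slight schema drift doesn't kill the panel.
// `out` is the `output` object from the model response.
const outputUrl =
  out.outputs?.[0]?.url ??   // current shape
  out.image_urls?.[0] ??     // older models
  out.image?.url ??          // some generative models
  null;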

It's a band-aid. It works. The comment above it says "fall back to legacy shapes so a slight schema drift doesn't kill the panel."

Lesson: When you're gluing twenty model providers together behind one API, the response shape will drift. Build three fallbacks for any field you care about. Cheap insurance.

4. Sentinel evaluations were timing out at the wrong threshold

Sentinel is our quality-evaluation layer. Every generated image gets shipped to a multi-judge AI eval pipeline that returns pass/fail per judge plus detailed reasoning. Green if zero fail. Amber if one. Red if two or more.

We set the poll timeout at 120 seconds because "evals are fast".

Evals are not fast. Multi-step reasoning takes 2 to 4 minutes. Most evals were finishing right after we'd already shown the user a red toast that said "Quality check error".

We bumped DEFAULT_TIMEOUT_MS to 300,000 (300 seconds). Five minutes feels like forever in a browser. It is also the actual SLA of the underlying judge pipeline.

Lesson: AI evaluation is genuinely slow. Don't guess the SLA, measure it. Then add 50%.
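
A sketch of a poll loop with an explicit budget. The 'pending' status value, the 3-second interval, and the fetchStatus callback are assumptions for illustration; only the 300-second budget comes from the numbers above.

```javascript
const DEFAULT_TIMEOUT_MS = 300_000; // 5 minutes, matching the measured pipeline
const POLL_INTERVAL_MS = 3_000;     // assumption: pick what your backend tolerates

// Polls `fetchStatus` until it reports a non-pending state or the budget runs out.
// `fetchStatus` is a hypothetical async function returning { status, ... }.
// `sleep` and `now` are injectable so the loop can be tested without waiting.
async function pollEval(fetchStatus, {
  timeoutMs = DEFAULT_TIMEOUT_MS,
  intervalMs = POLL_INTERVAL_MS,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  now = () => Date.now(),
} = {}) {
  const deadline = now() + timeoutMs;
  while (true) {
    const result = await fetchStatus();
    if (result.status !== 'pending') return result;
    if (now() + intervalMs > deadline) {
      // A timeout is not a verdict: report it separately from a failed eval.
      return { status: 'timeout' };
    }
    await sleep(intervalMs);
  }
}
```

Returning a distinct 'timeout' status lets the UI say "still checking" instead of showing a red "Quality check error" for an eval that is merely slow.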

5. The Sentinel content-type handshake

Sentinel evals use Vertex AI under the hood to fetch the generated image. Vertex's image fetcher rejects anything served with application/octet-stream, which is the default Content-Type our temp storage bucket returns.

So every Sentinel eval crashed with INVALID_CONTENT_TYPE before any judge ran.

The fix is one line of routing. Our Sentinel proxy now wraps any image URL that isn't on a known-good origin (public.runflow.io) through /demos/api/image?url=..., an endpoint that re-fetches the upstream and serves it back with the right image/webp or image/png Content-Type.

Lesson: When you glue services together, the middle service is the Content-Type fixer. Don't assume your downstream consumer handles the same MIME variations as a browser does.
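
The routing decision can be sketched in a few lines. The function name and the proxyBase parameter are hypothetical; the known-good origin and the /demos/api/image path are the ones mentioned above.

```javascript
// Origins assumed to already serve images with a correct image/* Content-Type.
const KNOWN_GOOD_ORIGINS = new Set(['public.runflow.io']);

// Returns the URL to hand to the evaluator: unchanged for known-good hosts,
// otherwise routed through the endpoint that re-serves the image with a
// proper Content-Type. `proxyBase` is a placeholder for your own origin.
function urlForEval(imageUrl, proxyBase = 'https://example.com') {
  const { hostname } = new URL(imageUrl);
  if (KNOWN_GOOD_ORIGINS.has(hostname)) return imageUrl;
  return `${proxyBase}/demos/api/image?url=${encodeURIComponent(imageUrl)}`;
}
```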

6. Presigned R2 URLs expired in the dashboard before the user noticed

We mirror generation outputs into R2 for the reliability dashboard's recent-batches gallery. Vercel functions sign the URLs at read time with a 10-minute TTL.

Open the dashboard, walk away for an hour, come back, every image is broken.

The "fix" is to not cache. The metrics endpoint regenerates fresh presigned URLs on every read, and the dashboard polls every 30 seconds, so the URLs stay in the 10-minute window. If you leave the page open for an hour without polling, you get broken thumbnails. We accepted that.

Lesson: Presigned URLs are a time bomb the moment you cache them. Own the signing lifecycle and regenerate on the hot path.
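
One way to keep signed URLs out of any cache is to store only object keys and sign at read time. In this sketch, signUrl is a hypothetical stand-in for whatever presigner your storage SDK provides, and the row shape is an assumption.

```javascript
const URL_TTL_SECONDS = 600; // the 10-minute TTL mentioned above

// Rows hold object keys only; every read signs fresh URLs.
// `signUrl(key, ttlSeconds)` is a hypothetical stand-in for your storage
// SDK's presigner and may be sync or async.
async function withFreshUrls(rows, signUrl) {
  return Promise.all(
    rows.map(async (row) => ({
      ...row,
      url: await signUrl(row.key, URL_TTL_SECONDS),
    })),
  );
}
```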

7. The Vercel cron that double-fired

Our nano-banana reliability benchmark runs every 5 minutes via Vercel cron, hitting six different providers in parallel. Vercel occasionally double-fires a cron in the same window. Without protection, we'd insert two duplicate runs, poisoning the reliability metrics.

We didn't catch this in code review. We caught it in the dashboard, three weeks in, when we noticed the success-rate-per-provider numbers were off by exactly half for a one-hour window.

The fix is one line of SQL: a unique constraint on (cron_window, provider) in the runs table. The second fire hits the constraint, the function catches the 23505 error, returns { skipped: true, reason: 'cron_window_taken' }, and moves on. The database is the idempotency gate.

Lesson: Idempotency must live in the database, not in the function. A unique constraint is cheaper than logic and survives restarts.
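
A sketch of the pattern, assuming PostgreSQL and a client that exposes query(text, params) and surfaces the SQLSTATE on err.code, as node-postgres does. The constraint name and the ok column are hypothetical.

```javascript
// One-time migration (SQL), assuming a `runs` table with these columns:
//   ALTER TABLE runs
//     ADD CONSTRAINT runs_cron_window_provider_key UNIQUE (cron_window, provider);

const UNIQUE_VIOLATION = '23505'; // PostgreSQL SQLSTATE for unique_violation

// `db` is any client exposing query(text, params), e.g. a node-postgres pool.
async function recordRun(db, { cronWindow, provider, ok }) {
  try {
    await db.query(
      'INSERT INTO runs (cron_window, provider, ok) VALUES ($1, $2, $3)',
      [cronWindow, provider, ok],
    );
    return { skipped: false };
  } catch (err) {
    if (err && err.code === UNIQUE_VIOLATION) {
      // The other fire of this cron window already wrote the row.
      return { skipped: true, reason: 'cron_window_taken' };
    }
    throw err; // anything else is a real failure
  }
}
```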

8. The in-memory rate limiter that reset on every cold start

The proxies enforce 20 runs/hour per IP and 500/day globally. The counters live in plain JavaScript Maps in the function process.

Vercel functions cold-start. Cold starts wipe the Maps. So after a deploy or cold start, every IP got 20 fresh runs for that hour, no matter how many it had already burned.

For a demo with low traffic this is fine. For anything close to production, it is not.

The current state is honest: the comment in the code says "acceptable for demo traffic". When the demos start getting real traffic, the counter moves to a row in Postgres with a Redis-backed lease pattern. Today they don't, so it doesn't.

Lesson: In-memory rate limits are good enough for demos, terrible for production. Know which one you're shipping.
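
For reference, this is roughly what the demo-grade version looks like. It is a sketch assuming a fixed one-hour window per IP; the names are hypothetical. The module-level Map is exactly the state a cold start throws away.

```javascript
const LIMIT_PER_HOUR = 20;
const HOUR_MS = 60 * 60 * 1000;

// Module-level state: lives only as long as this function instance does.
const hits = new Map(); // ip -> { windowStart, count }

// Returns true if the request is within the per-IP budget.
function allowRequest(ip, now = Date.now()) {
  const entry = hits.get(ip);
  if (!entry || now - entry.windowStart >= HOUR_MS) {
    // First request from this IP, or its window has expired: start a new one.
    hits.set(ip, { windowStart: now, count: 1 });
    return true;
  }
  if (entry.count >= LIMIT_PER_HOUR) return false;
  entry.count += 1;
  return true;
}
```

Calling hits.clear() is the moral equivalent of a cold start: every caller is back to zero.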

9. The tooltip that never fired on a disabled card

We have workflow cards that are "disabled by applicability". Upload a landscape with no people, the "Restyle with a reference" card grays out because the workflow needs a human subject.

The card had a tooltip that explained why: "Different photo would unlock this." The user could never see it.

A disabled <button> swallows mouse events in most browsers, so hover handlers on the button and its children never run. The tooltip's trigger never sees the pointer. The tooltip never fires.

The fix is to wrap the disabled card in a <span> with the tooltip on the wrapper. The button stays disabled (no click), but the wrapper is pointer-aware. Hover, see the explanation, swap the photo.

Lesson: Disabled buttons are a UX trap. If a disabled state needs to communicate something, decorate the disabled element and put the interaction on a wrapper.
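
A sketch of the wrapper pattern as a small markup helper. The class names are hypothetical, and the native title attribute stands in for a real tooltip component. Setting pointer-events: none on the button is an addition of this sketch, a common way to make sure hover lands on the wrapper.

```javascript
// Escapes text for safe use in HTML attribute values and element content.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// The tooltip sits on the wrapper, not on the disabled button.
// The button stays disabled (no click); the wrapper receives the hover.
function disabledCardHtml(label, reason) {
  return (
    `<span class="card-wrapper" title="${escapeHtml(reason)}">` +
    `<button class="card" disabled style="pointer-events: none">${escapeHtml(label)}</button>` +
    `</span>`
  );
}
```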

10. Model names leaking into customer copy

Early demos had descriptions like "Flux-fill rebuilds the background". Flux is the underlying model. The customer doesn't care; the customer's mental model is broken the moment a brand name they don't recognize shows up next to "AI" in your UI.

Then the model gets swapped six weeks later because the new one is cheaper and produces fewer six-fingered hands. Now your copy is wrong everywhere.

We swept every workflow description and renamed the model leaks. "Flux-fill rebuilds the background" became "Runflow rebuilds the background." The metadata still stores model names for ops; the UI doesn't.

Lesson: Model names are implementation detail. The moment you ship a demo with a model name in the UI, you own keeping that label fresh for the lifetime of the demo. Ship opaque labels.

11. The canvas resolution chip that thrashed

The canvas had a small chip showing the current image dimensions: "2048 x 1536". When the user changed the aspect ratio, the dimensions recalculated, the chip text changed length, the chip reflowed, and adjacent elements shifted by a few pixels.

It looked broken even though it was technically working.

Now the chip shows only the requested resolution bucket ("2K", "4K"), which is what the user controls. The derived dimensions live on hover for ops who need them.

Lesson: UI stability beats information density. Every dynamic number on screen is a reflow risk. Pin the visible state to what the user actually controls.

12. The model tier table that was the wrong abstraction

We had three resolution tiers: 1K, 2K, 4K. We assumed three models, one per tier, with different pricing and quality profiles.

The marketplace doesn't have tier-specific models yet. All three resolutions point to the same google/nano-banana-pro, which natively accepts a resolution enum.

The first version of the dispatcher hard-coded model IDs per tier. The second version used a TIERS lookup table:

const TIERS = {
  '1K': { model: 'google/nano-banana-pro', label: '1K (1024 x 1024)', cost: 1 },
  '2K': { model: 'google/nano-banana-pro', label: '2K (2048 x 2048)', cost: 4 },
  '4K': { model: 'google/nano-banana-pro', label: '4K (4096 x 4096)', cost: 16 },
};

When Miguel ships tier-specific models, one line changes per tier. No refactor. No UI churn.

Lesson: Separate the UI shape (user's choice) from the API shape (the model). A lookup table is cheap and turns a model swap into a one-line change.
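
A sketch of how a dispatcher might consume that table. The buildRequest name and the input.resolution field are assumptions about the request shape, not the documented schema; the TIERS table is repeated so the example is self-contained.

```javascript
const TIERS = {
  '1K': { model: 'google/nano-banana-pro', label: '1K (1024 x 1024)', cost: 1 },
  '2K': { model: 'google/nano-banana-pro', label: '2K (2048 x 2048)', cost: 4 },
  '4K': { model: 'google/nano-banana-pro', label: '4K (4096 x 4096)', cost: 16 },
};

// Builds the dispatch request from the user's tier choice.
// The UI only ever passes a tier key; the model ID comes from the table.
function buildRequest(tier, prompt) {
  // Own-property check so keys like "constructor" don't resolve to
  // something inherited from Object.prototype.
  if (!Object.prototype.hasOwnProperty.call(TIERS, tier)) {
    throw new Error(`Unknown tier "${tier}"`);
  }
  const { model, cost } = TIERS[tier];
  // `resolution` as the field name is an assumption about the model's schema.
  return { model, cost, input: { prompt, resolution: tier } };
}
```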

13. The model catalog we found four weeks late

We hardcoded google/nano-banana-2 as the text-to-image model because it was the only one we knew about. When nano-banana-pro shipped, the dispatcher was still wired to the old model.

We found docs.runflow.io/llms.txt four weeks too late. It is a flat-text catalog of every Runflow-routed model with input and output schemas. The dispatcher now reads from it, and the demos point at the canonical model IDs the catalog lists.

Lesson: Every AI gateway should publish a public model catalog with input and output schemas. If you can't find one, that's a tooling problem the gateway should fix.

14. The "Coming Soon" workflow that sat broken in production

The fibbl demo has an Object Removal workflow. The user paints a brush mask over the area they want erased; the model rebuilds the background. We shipped it. It returned all-zero masks at node 80 of the underlying ComfyUI workflow. Every run failed silently.

We assumed we could fix it on our side: rebuild the mask payload, try a different encoding, monkey-patch the request shape. We tried for two hours.

We could not have fixed it. The model itself was in a bad state upstream. Miguel knew about the issue but it wasn't on the priority list.

We marked the workflow as "Coming Soon" in the UI, kept the brush canvas built and the dispatcher wired, and surfaced a friendly disabled state: "Brush is built. The eraser model is taking a nap upstream, back when it wakes."

Lesson: Don't ship broken workflows to production. The operational cost of "something failed silently" is higher than "this is not available yet." Park it. Wait for the upstream fix. Ship when it's actually ready.

The architecture that came out the other side

After fourteen rounds of this, the studio settled into a shape that survives.

A picture, in words:

[ User ]
   |
   v
[ Studio UI ]  <-- mode switch (browse / configure), workflow cards,
   |              brush canvas, reference uploader, version history
   v
[ Dispatcher ]  <-- one function per workflow kind:
   |              simple, prompt, prompt-zip, pin, mask-only, mask-ref, package
   v
[ Upload + R2 ]  <-- per-file presigned URLs, 30-minute TTL
   |
   v
[ Runflow proxy ]  <-- model allowlist, IP rate limit, contract normalization
   |
   v
[ Runflow gateway ]  <-- routes to underlying models (nano-banana-pro,
   |                     reference-inpaint, background-removal, etc.)
   v
[ Output URL ]  <-- mirrored to our R2 for permanence
   |
   v
[ Sentinel proxy ]  <-- async eval, 5-min poll budget, content-type fixer
   |
   v
[ Sentinel verdict ]  <-- green / amber / red + judge reasoning
   |
   v
[ Studio UI badge ]  <-- "1 quality issue detected" + click for details

There is nothing magic about this shape. It is the boring, obvious architecture that everyone draws on a whiteboard before they actually start building. The fourteen lessons above are what's between the boxes when you do.

Sentinel: why output evaluation is not optional

Most "build an AI image generator API" articles stop at the model call. The model returns a URL, you return the URL, the user sees the image, done.

That is the article you write before your second customer logs in.

Generative models lie. They invent text on packaging. They put a sixth finger on a hand. They draw the logo at 30% opacity in the wrong color. They look great on the demo image and ship a defect on the actual customer asset.

Sentinel is our evaluation layer. Every output goes through a panel of AI judges that score the image against a task-specific rubric. The judges know whether the task was "remove the price tag", "change the background to white", or "restyle the model's outfit". They evaluate whether the change happened, whether the rest of the image was preserved, and whether the result is a usable product image.

The verdict comes back as green, amber, or red:

Green: all judges passed. Ship it.
Amber: one judge flagged something. Worth a human glance.
Red: two or more judges flagged. Auto-reject, surface the reasoning, regenerate.
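
The aggregation rule above fits in a few lines. This is a sketch, assuming each judge result arrives as { name, pass, reasoning }; those field names are hypothetical.

```javascript
// Maps per-judge results to a single verdict:
// zero failures is green, one is amber, two or more is red.
function aggregateVerdict(judges) {
  const failures = judges.filter((judge) => !judge.pass);
  let verdict = 'green';
  if (failures.length === 1) verdict = 'amber';
  if (failures.length >= 2) verdict = 'red';
  return {
    verdict,
    // Keep the failing judges' reasoning so the UI can show why.
    reasons: failures.map((judge) => `${judge.name}: ${judge.reasoning}`),
  };
}
```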

The reasoning is the killer feature. When a Sentinel run flags a result, the user sees not just "fail" but "the model added a watermark in the lower-right that wasn't in the source." They can fix the prompt, swap the workflow, or escalate. The black box becomes auditable.

If you build your own AI image studio and skip the eval layer, you are shipping every model failure straight to the customer. The customer will notice. The customer will leave.

Sentinel is the layer that turns a generative model into a production tool.

What you'd actually have to build yourself

If after all this you still want to build your own:

Routing & orchestration layer. Per-workflow dispatcher functions that handle each input shape (simple, prompt, mask, reference). A lookup table mapping resolution tiers to model IDs. A normalization layer between client code and the model API contract. Add 6 weeks for the first refactor, 6 more for the second.

Upload + storage. R2 or S3 with presigned URLs. A single-file upload endpoint that fixes Content-Types and rate-limits per IP. A mirroring step that copies model outputs into your bucket so they outlive the model's signed URL.

Allowlist + auth. A central allowlist of permitted model IDs, enforced both server-side (security) and client-side (UX). Some kind of caller verification on the proxy. Rate limits in a real database, not in-memory.

Eval layer. A multi-judge AI evaluation pipeline. Per-task rubrics. Async polling with a 5-minute budget. Verdict aggregation logic. A UX surface for the verdict (badge, detail drawer, re-run affordance).

Observability. Live reliability dashboard per provider. Latency histograms (P50, P95, P99). Failure categorization by error class. Cron-based continuous benchmarking so you notice provider degradation before your users do.

UX. A workflow grid that doesn't get busy when you have 14 workflows. A focused configure mode so the user attends to one thing at a time. Brush canvases. Reference uploaders. Version history. Custom recipe chains. A way to mark workflows as "coming soon" without breaking the layout.

That's a team of three engineers for six months, conservatively. We know because we did it. Twice.

Or just plug into Runflow + Sentinel

The honest pitch:

Runflow is the multi-model image generation gateway under everything we just described. It handles routing, contract normalization, the model catalog, and the proxy patterns. You write against one stable API and we deal with the model providers shifting their contracts overnight.

Sentinel is the evaluation layer. Same per-task rubrics, same multi-judge architecture, same green/amber/red output, same reasoning surface.

Together they're what you'd build yourself, minus the six months of bumps. You bring the workflow logic, your customer-specific UX, and your business model. We handle the model providers and the eval layer.

Where we lose: if you have a hyper-specific model architecture we don't route to yet. If you need an eval rubric outside our current judges. If you need generation latency under 800ms (we sit at 1.2s P50 today). If you want full on-prem with no external calls.

Where we win: every other case where you'd rather ship a studio in a sprint than build the routing-and-eval substrate yourself.

If you're curious what the routing actually looks like in production, we run a continuous reliability benchmark across the top six Nano Banana API providers, updated every 10 minutes. Real numbers, including where Runflow loses.

What this article doesn't tell you

This is what we know. It is not the same as what you need to know.

We have not stress-tested the studio at 10,000 concurrent users. Demo traffic is small. Real production traffic will surface different bumps; some of the in-memory rate limiters will fall over before the model itself does.

We have not built every workflow our customers have asked for. The "Coming Soon" workflow from lesson 14 is one of several we know we want to ship and haven't. Some of those depend on Miguel's roadmap upstream.

We have not solved cost at scale. The dispatcher batches well, but a high-tier 4K generation still runs $$$ per image at the underlying model. If your business model can't absorb that, the architecture changes.

We have not figured out how to evaluate hyperreal photography. Sentinel's judges are good at "is the price tag still in the image" and bad at "is this the most flattering crop". Aesthetic evaluation is an open problem and we are not pretending we've solved it.

If any of these are blockers for you, you should build it yourself. Or wait six months and ask us again.

FAQ

Q: Do I really need an evaluation layer? Can't I just ship the model output?
You can. Your customers will catch the failures for you. They will email about the six-finger hands and the upside-down logos. You will manually re-run. Sentinel exists because we got tired of being the manual re-run.

Q: How long did the studio take to build?
The first version was three weeks. The version we'd actually ship to a customer was three months. The version we now run in production is twelve months in and still iterating.

Q: Why not use ComfyUI as the studio backend?
ComfyUI is a great workflow tool. It is also a frontend, an execution engine, and an opinionated UX rolled together. The moment you want to expose it to customers, you are wrapping it. The moment you wrap it, you are building what we built. ComfyUI is fantastic for prototyping; it is not a customer-facing API.

Q: What's the single biggest lesson?
The model contract drifts and you will not get a changelog. Build for that. Everything else is a corollary.

Q: Where do I see Runflow Studio in action?
The internal demos hub at protos.runflow.io ships fibbl, plytix, retaillabs, runflow-studio, and a few more. Each one is a different vertical's take on the same architecture. If you want a guided tour, book time with us.

Where to go next

If you want to look at one provider's behavior in depth, the Nano Banana API reliability benchmark is a continuous live test across six providers, including ours. We win on some metrics and lose on others. The data updates every 10 minutes.

If you want to skip the build and start shipping, Runflow is the gateway. Sentinel is the eval layer. Both have free tiers. Both will catch the bumps in this article before you have to feel them.

If you want to build it yourself, we wish you well. Bookmark this page. You'll come back to it around lesson 7.
