Build the infrastructure that powers the next generation of AI media. GPU fleet, warm workers, queues, FastAPI services, in production tomorrow.
Runflow runs millions of AI image and video jobs a month across a multi-cloud GPU fleet. Every workflow our customers ship hits our infrastructure: containers spin up with the right models loaded, queues drain into warm workers on the right GPU class, jobs run to completion within a latency budget, and Sentinel scores them before they go out.
We're hiring a DevOps Engineer to help build that infrastructure. Workers, the GPU pool, queues, the database, the FastAPI services that orchestrate everything: all of it is moving fast, and it could move faster with the right person.
If you've run Kubernetes in anger, sized container fleets to demand without overpaying for idle GPUs, debugged CrashLoopBackOff at 3am, and have strong opinions on when to use a queue vs. a stream, this is for you.
Concrete, shippable, on the live roadmap. Not hypothetical. We expect every one of these to land.
RunPod is the primary provider today, with fallback providers across the A100 / H100 / 4090 classes. Tighten the auto-scaling against queue depth, kill the cold-start tax, and route each job to the cheapest GPU class that meets its latency budget.
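A rough sketch of both halves of that loop, queue-depth scaling plus budget-aware routing. The class names, prices, latencies, and drain target below are invented for illustration, not our real config:

```python
import math
from dataclasses import dataclass

@dataclass
class GpuClass:
    name: str
    usd_per_hour: float   # illustrative prices, not real quotes
    p95_latency_s: float  # estimated p95 job latency on this class

# Hypothetical tiers; the real fleet's classes, prices, and latencies differ.
GPU_CLASSES = [
    GpuClass("4090", 0.45, 38.0),
    GpuClass("A100", 1.60, 22.0),
    GpuClass("H100", 2.90, 12.0),
]

def route(latency_budget_s: float) -> GpuClass:
    """Cheapest GPU class whose estimated latency fits the job's budget."""
    fits = [g for g in GPU_CLASSES if g.p95_latency_s <= latency_budget_s]
    if not fits:
        # Nothing meets the budget: degrade to the fastest class we have.
        return min(GPU_CLASSES, key=lambda g: g.p95_latency_s)
    return min(fits, key=lambda g: g.usd_per_hour)

def desired_workers(queue_depth: int, jobs_per_worker_per_min: float,
                    drain_target_min: float = 5.0) -> int:
    """Size the pool so the current backlog drains inside the target window."""
    return max(1, math.ceil(queue_depth / (jobs_per_worker_per_min * drain_target_min)))
```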
ComfyUI workers stay alive between runs with models in GPU memory. Tune the worker pool, reclaim idle GPUs without losing the warm-cache win, ship dashboards so on-call doesn't have to grep logs.
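One way to phrase the reclamation trade-off in code: reap idle GPUs oldest-first, but keep a warm floor per model so the cache win survives. `WARM_FLOOR`, `IDLE_REAP_S`, and the worker record shape are assumptions for the sketch, not our scheduler:

```python
import time

WARM_FLOOR = 2      # assumed: warm workers to keep per model, even when idle
IDLE_REAP_S = 600   # assumed: reclaim a GPU after 10 idle minutes

def reap_idle(workers: list[dict], now: float | None = None) -> list[dict]:
    """Pick workers to shut down: idle past the threshold, oldest first,
    never dipping below the warm floor for any model."""
    now = now or time.time()
    by_model: dict[str, list[dict]] = {}
    for w in workers:
        by_model.setdefault(w["model"], []).append(w)
    to_reap: list[dict] = []
    for pool in by_model.values():
        idle = sorted((w for w in pool if now - w["last_job_ts"] > IDLE_REAP_S),
                      key=lambda w: w["last_job_ts"])
        surplus = len(pool) - WARM_FLOOR
        to_reap.extend(idle[:max(0, surplus)])  # reap only beyond the floor
    return to_reap
```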
Per-node GPU time, per-run cost, queue wait, top cost drivers. Customers + ops both need this. The hard part isn't the dashboard; it's accurate per-node attribution at fleet scale.
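The roll-up itself is the easy half; the hard half is emitting trustworthy per-node GPU-seconds from the fleet. A sketch of the roll-up, with an assumed event shape:

```python
from collections import defaultdict

def attribute_costs(events, usd_per_gpu_hour):
    """Fold per-node GPU time into per-node and per-run USD.

    `events` yields (run_id, node_id, gpu_class, gpu_seconds) tuples;
    this shape is an assumption, not our real telemetry schema.
    """
    per_node = defaultdict(float)  # (run_id, node_id) -> USD
    per_run = defaultdict(float)   # run_id -> USD
    for run_id, node_id, gpu_class, gpu_seconds in events:
        usd = gpu_seconds / 3600.0 * usd_per_gpu_hour[gpu_class]
        per_node[(run_id, node_id)] += usd
        per_run[run_id] += usd
    return per_node, per_run
```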
You'll own the full lifecycle, from prototype to production at scale.
Operate the multi-cloud GPU fleet (RunPod primary, fallback providers, A100 / H100 / 4090)
Run + scale Kubernetes workloads, kill CrashLoopBackOff before it pages anyone
Tune queue + worker architecture so jobs land on warm GPUs at the right tier
Manage the CUDA / driver / model-cache layer so containers spin up fast
Ship + maintain Python FastAPI services that orchestrate the fleet (a minimal sketch follows this list)
Own the production Postgres + Redis (Neon, schema migrations, connection pooling, hot paths)
Maintain the CI/CD pipeline + merge queue (GitHub Actions today, room to rebuild)
Build observability so you can answer 'what just changed' in 30 seconds, not 30 minutes
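For a feel of the orchestration shape referenced above, a minimal FastAPI + Redis enqueue endpoint. The route, queue name, and payload fields are illustrative, not our real API (assumes redis-py's asyncio client and Pydantic v2):

```python
import json
import uuid

import redis.asyncio as aioredis
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
r = aioredis.Redis()  # assumes local Redis; production would pool connections

class Job(BaseModel):
    workflow_id: str
    latency_budget_s: float = 60.0

@app.post("/jobs")
async def submit(job: Job) -> dict:
    job_id = str(uuid.uuid4())
    # LPUSH onto a pending queue; warm workers BRPOP from the other end.
    await r.lpush("jobs:pending", json.dumps({"id": job_id, **job.model_dump()}))
    return {"job_id": job_id, "status": "queued"}
```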
No specific number of years. No specific degree. No specific stack. We hire on whether you can ship end-to-end and whether you have taste. Everything else is noise.
We move fast because senior candidates with multiple offers reward the team that respects their time. The whole loop fits in two weeks.
We read every word of every application. No silent rejections, ever.
A small, real Runflow problem. We score the prompts and decisions you made as much as the output itself.
Quick conversation about your take-home, the team, and how you work day-to-day.
A bigger problem we pay you to work through. We care about outcomes, not process. AI tools more than welcome.
Casual chat with the two founders.
We value proof over promises. When you apply, include examples of work that stands out.
The application takes five minutes. We read every word. Yes / no / not-now within 5 business days, always.
We hire builders, not roles. If no role fits exactly, the open application is at the bottom.