DevOps Engineer.
Build the infrastructure that powers the next generation of AI media. GPU fleet, warm workers, queues, FastAPI services, in production tomorrow.
Runflow runs millions of AI image and video jobs a month across a multi-cloud GPU fleet. Every workflow our customers ship hits our infrastructure: containers spin up with the right models loaded, queues drain into warm workers on the right GPU class, jobs run to completion within a latency budget, and Sentinel scores them before they go out.
We're hiring a DevOps Engineer to help build that infrastructure. Workers, the GPU pool, queues, the database, the FastAPI services that orchestrate everything: all of it is moving fast and could move faster with the right person.
If you've run Kubernetes in anger, sized container fleets to demand without over-paying for idle GPUs, debugged CrashLoopBackOff at 3am, and have strong opinions on when to use a queue vs. a stream, this is for you.
What you'd ship in your first 6 months.
Concrete, shippable, on the live roadmap. Not hypothetical. We expect every one of these to land.
Push the multi-cloud GPU fleet to the next level
RunPod is the primary provider today, with fallback providers across A100 / H100 / 4090 classes. Tighten auto-scaling against queue depth, kill the cold-start tax, and route each job to the cheapest GPU class that meets the latency budget.
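To make "cheapest GPU class that meets the latency budget" concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how Runflow's router works: the `GpuClass` fields, the prices, and the runtime and queue-wait estimates are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class GpuClass:
    name: str
    usd_per_hour: float      # hypothetical price, not a real quote
    est_runtime_s: float     # assumed runtime of this job type on this class
    est_queue_wait_s: float  # assumed current wait before a worker picks the job up


def cheapest_within_budget(
    classes: Sequence[GpuClass], latency_budget_s: float
) -> Optional[GpuClass]:
    """Pick the class with the lowest per-job cost among those that fit the budget."""
    eligible = [
        c for c in classes
        if c.est_queue_wait_s + c.est_runtime_s <= latency_budget_s
    ]
    if not eligible:
        return None  # nothing meets the budget; the caller decides what to do
    # Per-job cost is hourly price times runtime, so a pricier but faster
    # class can still win.
    return min(eligible, key=lambda c: c.usd_per_hour * c.est_runtime_s / 3600)


fleet = [
    GpuClass("4090", usd_per_hour=0.5, est_runtime_s=40.0, est_queue_wait_s=5.0),
    GpuClass("A100", usd_per_hour=1.5, est_runtime_s=20.0, est_queue_wait_s=2.0),
    GpuClass("H100", usd_per_hour=3.5, est_runtime_s=10.0, est_queue_wait_s=1.0),
]
choice = cheapest_within_budget(fleet, latency_budget_s=30.0)
print(choice.name if choice else "no class fits")
```

With these made-up numbers, a 30-second budget rules out the 4090 and leaves the A100 as the cheapest fit. The interesting work is in keeping the runtime and queue-wait estimates honest.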
Make warm-worker fleets boring to operate
ComfyUI workers stay alive between runs with models in GPU memory. Tune the worker pool, reclaim idle GPUs without losing the warm-cache win, ship dashboards so on-call doesn't have to grep logs.
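One way to think about reclaiming idle GPUs without losing the warm-cache win is an idle timeout combined with a warm floor. The Python sketch below is illustrative only: `workers_to_reclaim`, `idle_timeout_s`, and `min_warm` are hypothetical names, and a real policy would likely also weigh which models each worker has loaded.

```python
from typing import Dict, List


def workers_to_reclaim(
    idle_seconds_by_worker: Dict[str, float],
    idle_timeout_s: float,
    min_warm: int,
) -> List[str]:
    """Return ids of workers to shut down, longest-idle first.

    A worker qualifies once it has been idle for at least idle_timeout_s,
    but the pool is never shrunk below min_warm workers.
    """
    max_reclaim = max(0, len(idle_seconds_by_worker) - min_warm)
    expired = sorted(
        (w for w, idle in idle_seconds_by_worker.items() if idle >= idle_timeout_s),
        key=lambda w: idle_seconds_by_worker[w],
        reverse=True,
    )
    return expired[:max_reclaim]


pool = {"w1": 900.0, "w2": 30.0, "w3": 1200.0, "w4": 700.0}
print(workers_to_reclaim(pool, idle_timeout_s=600.0, min_warm=2))
```

Here three workers are past the timeout, but only the two longest-idle ones are reclaimed so that two stay warm.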
Build the per-run cost analytics surface
Per-node GPU time, per-run cost, queue wait, top cost drivers. Customers + ops both need this. The hard part isn't the dashboard; it's accurate per-node attribution at fleet scale.
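As a rough picture of what per-node attribution means, the Python sketch below rolls per-node GPU seconds up into a per-run cost and a ranked list of cost drivers. It assumes "node" means a workflow node and that each timing record already carries the GPU class it ran on; the function name, record shape, and prices are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (node_id, gpu_class, gpu_seconds): an assumed record shape, not a real schema
NodeTiming = Tuple[str, str, float]


def attribute_run_cost(
    node_timings: Iterable[NodeTiming],
    usd_per_gpu_hour: Dict[str, float],
) -> dict:
    """Sum GPU cost per node, then derive the run total and top cost drivers."""
    per_node: Dict[str, float] = defaultdict(float)
    for node_id, gpu_class, gpu_seconds in node_timings:
        per_node[node_id] += gpu_seconds * usd_per_gpu_hour[gpu_class] / 3600
    ranked = sorted(per_node.items(), key=lambda kv: kv[1], reverse=True)
    return {
        "per_node": dict(per_node),
        "total": sum(per_node.values()),
        "top_drivers": ranked,
    }


timings = [
    ("load_model", "A100", 2.0),
    ("sampler", "A100", 10.0),
    ("upscale", "A100", 3.0),
]
report = attribute_run_cost(timings, {"A100": 3.6})  # made-up price
print(report["top_drivers"][0][0], round(report["total"], 4))
```

The arithmetic is the easy part. Getting trustworthy `gpu_seconds` per node across shared, warm workers is where the real difficulty sits.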
What you'll be doing.
You'll own the full lifecycle, from prototype to production at scale.
Operate the multi-cloud GPU fleet (RunPod primary, fallback providers, A100 / H100 / 4090)
Run + scale Kubernetes workloads, kill CrashLoopBackOff before it pages anyone
Tune queue + worker architecture so jobs land on warm GPUs at the right tier
Manage the CUDA / driver / model-cache layer so containers spin up fast
Ship + maintain Python FastAPI services that orchestrate the fleet
Own the production Postgres + Redis (Neon, schema migrations, connection pooling, hot paths)
Maintain the CI/CD pipeline + merge queue (GitHub Actions today, room to rebuild)
Build observability so you can answer 'what just changed' in 30 seconds, not 30 minutes
About you.
You may be a good fit if…
- You've operated Kubernetes in production, not just used managed clusters as a black box
- You've debugged a real GPU-fleet incident: autoscaling gone wrong, container OOM, driver mismatch, the works
- You're comfortable in Python (FastAPI services, scripts, glue) and shell, even if neither is your home language
- You think in queue depth, p95 latency, cost per request, not just CPU + memory graphs
- You can read a Postgres query plan and know when an index is the answer (and when it isn't)
- You have strong opinions on observability — what to log, what to trace, what to alert on, what to ignore
Strong candidates also have…
- Run a multi-cloud GPU fleet (RunPod, Modal, AWS, Lambda Labs, etc.) at meaningful scale
- Built or maintained CI/CD systems with merge queues, parallel test sharding, etc.
- Rebuilt a system that was 'on-call hell' into one that runs itself
- Shipped infrastructure-as-code (Pulumi, Terraform, k8s manifests) that survived 6+ months without a rewrite
- Contributed to OSS in the infra space (k8s controllers, GPU schedulers, runtime tools)
What we're not looking for
A specific number of years. A specific degree. A specific stack. We hire on whether you can ship end-to-end and whether you have taste. Everything else is noise.
What we work with.
Backend
Infra
GPU layer
Observability
Five steps. Decision speed is part of the offer.
We move fast because senior candidates with multiple offers reward the team that respects their time. Whole loop fits in two weeks.
Application submitted
Form, ~5 min
We read every word of every application. No silent rejections, ever.
Take-home challenge
2 hours max, your time
A small, real Runflow problem. We score the prompts and decisions you made as much as the output itself.
30 minutes with the CTO
30 min, live
Quick conversation about your take-home, the team, and how you work day-to-day.
Paid challenge, 2 days
2 days, we pay for your time
A bigger problem we pay you to work through. We care about outcomes, not process. AI tools more than welcome.
Closing interview
90 min, live
Casual chat with the two founders.
Show, don't tell.
We value proof over promises. When you apply, include examples. Things that stand out:
Ready to ship?
Five minutes. We read every word. Yes / no / not-now within 5 business days, always.