We're hiring · Infrastructure

DevOps Engineer.

Build the infrastructure that powers the next generation of AI media. GPU fleet, warm workers, queues, FastAPI services, all of it in production tomorrow.

Remote · global
Full-time / contract
Senior IC + equity
The role

Runflow runs millions of AI image and video jobs a month across a multi-cloud GPU fleet. Every workflow our customers ship hits our infrastructure: containers spin up with the right models loaded, queues drain into warm workers on the right GPU class, jobs run to completion within a latency budget, and Sentinel scores them before they go out.

We're hiring a DevOps Engineer to help build that infrastructure. Workers, GPU pool, queues, the database, the FastAPI services that orchestrate everything: all of it is moving fast and could move faster with the right person.

If you've run Kubernetes in anger, sized container fleets to demand without overpaying for idle GPUs, debugged CrashLoopBackOff at 3am, and have strong opinions on when to use a queue vs. a stream, this is for you.

Quick facts

Team: Infrastructure
Reports to: CTO
Location: Remote · global
Type: Full-time or contract
Compensation: Senior IC + equity
Start: ASAP
Representative projects

What you'd ship in your first 6 months.

Concrete, shippable, on the live roadmap. Not hypothetical. We expect every one of these to land.

Project 01

Push the multi-cloud GPU fleet to the next level

RunPod is the primary provider today, with fallback providers across A100 / H100 / 4090 classes. Tighten the auto-scaling against queue depth, kill the cold-start tax, and route each job to the cheapest GPU class that meets its latency budget.

Project 02

Make warm-worker fleets boring to operate

ComfyUI workers stay alive between runs with models in GPU memory. Tune the worker pool, reclaim idle GPUs without losing the warm-cache win, ship dashboards so on-call doesn't have to grep logs.
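A minimal sketch of the idle-reclaim tradeoff, assuming the pool tracks each worker's last-used timestamp. The TTL and warm-floor values are placeholders, not the real config:

```python
def reclaim_idle(workers: dict[str, float], now: float,
                 idle_ttl_sec: float = 300.0, warm_floor: int = 2) -> list[str]:
    """Return worker ids safe to scale down: idle past the TTL,
    longest-idle first, never dipping below a warm floor of workers
    that keep models resident in GPU memory."""
    idle = sorted((wid for wid, last in workers.items()
                   if now - last > idle_ttl_sec),
                  key=lambda wid: workers[wid])
    # If killing every idle worker would drop us under the floor,
    # spare the most recently used ones to preserve the warm cache.
    keep = max(0, warm_floor - (len(workers) - len(idle)))
    return idle[:len(idle) - keep] if keep else idle
```

The point of the floor is exactly the warm-cache win in the project description: reclaiming to zero saves GPU dollars but reintroduces the cold-start tax on the next burst.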

Project 03

Build the per-run cost analytics surface

Per-node GPU time, per-run cost, queue wait, top cost drivers. Customers + ops both need this. The hard part isn't the dashboard; it's accurate per-node attribution at fleet scale.
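The attribution core can be sketched as below, under the assumption that workers already record GPU seconds per workflow node. Node names and the flat per-GPU-second rate are illustrative:

```python
def attribute_run_cost(node_gpu_secs: dict[str, float],
                       usd_per_gpu_sec: float) -> dict:
    """Break one run's cost down per workflow node from measured GPU
    seconds, and rank the nodes driving the bill. Assumes a flat rate;
    a real version would price by GPU class and provider."""
    per_node = {n: s * usd_per_gpu_sec for n, s in node_gpu_secs.items()}
    drivers = sorted(per_node, key=per_node.get, reverse=True)
    return {
        "total_usd": sum(per_node.values()),
        "per_node_usd": per_node,
        "top_drivers": drivers,
    }
```

The sketch is the easy half; the fleet-scale half is collecting `node_gpu_secs` accurately when nodes overlap, retry, or share a GPU.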

Responsibilities

What you'll be doing.

You'll own the full lifecycle, from prototype to production at scale.

Operate the multi-cloud GPU fleet (RunPod primary, fallback providers, A100 / H100 / 4090)

Run + scale Kubernetes workloads, kill CrashLoopBackOff before it pages anyone

Tune queue + worker architecture so jobs land on warm GPUs at the right tier

Manage the CUDA / driver / model-cache layer so containers spin up fast

Ship + maintain Python FastAPI services that orchestrate the fleet

Own the production Postgres + Redis (Neon, schema migrations, connection pooling, hot paths)

Maintain the CI/CD pipeline + merge queue (GitHub Actions today, room to rebuild)

Build observability so you can answer 'what just changed' in 30 seconds, not 30 minutes

Requirements

About you.

You may be a good fit if…

  • You've operated Kubernetes in production, not just used managed clusters as a black box
  • You've debugged a real GPU-fleet incident, autoscaling gone wrong, container OOM, driver mismatch, the works
  • You're comfortable in Python (FastAPI services, scripts, glue) and shell, even if neither is your home language
  • You think in queue depth, p95 latency, cost per request, not just CPU + memory graphs
  • You can read a Postgres query plan and know when an index is the answer (and when it isn't)
  • You have strong opinions on observability — what to log, what to trace, what to alert on, what to ignore

Strong candidates also…

  • Have run a multi-cloud GPU fleet (RunPod, Modal, AWS, Lambda Labs, etc.) at meaningful scale
  • Have built or maintained CI/CD systems with merge queues, parallel test sharding, etc.
  • Have rebuilt a system that was 'on-call hell' into one that runs itself
  • Have shipped infrastructure-as-code (Pulumi, Terraform, k8s manifests) that survived 6+ months without rewriting
  • Have contributed to OSS in the infra space (k8s controllers, GPU schedulers, runtime tools)

What we're not looking for

A specific number of years. A specific degree. A specific stack. We hire on whether you can ship end-to-end and whether you have taste. Everything else is noise.

Tech stack

What we work with.

Backend

Python · FastAPI · Node.js · PostgreSQL · Redis

Infra

Kubernetes · Docker · RunPod · Modal · GitHub Actions

GPU layer

CUDA · A100 / H100 · 4090 fleet · ComfyUI workers

Observability

Grafana · OpenTelemetry · Loki · Sentry
How we hire

Five steps. Decision speed is part of the offer.

We move fast because senior candidates with multiple offers reward the team that respects their time. The whole loop fits in two weeks.

01/5

Application submitted

Form, ~5 min

We read every word of every application. No silent rejections, ever.

Triaged within 5 business days
02/5

Take-home challenge

2 hours max, your time

A small, real Runflow problem. We score the prompts and decisions you made as much as the output itself.

Reviewed within 3 business days
03/5

30 minutes with the CTO

30 min, live

Quick conversation about your take-home, the team, and how you work day-to-day.

Decision in 48 hours
04/5

Paid challenge, 2 days

2 days, we pay for your time

A bigger problem we pay you to work through. We care about outcomes, not process. AI tools more than welcome.

Decision in 48 hours
05/5

Closing interview

90 min, live

Casual chat with the two founders.

Decision in 24 hours

Show, don't tell.

We value proof over promises. When you apply, include examples. Things that stand out:

  • A k8s setup you built or rescued: repo, write-up, anything we can read
  • GPU-fleet work: autoscaling logic, cost / latency tuning, even an internal post-mortem
  • A Postgres query you tuned that mattered, with before / after numbers
  • A CI/CD pipeline you rebuilt that made the team faster
Apply

Ready to ship?

Five minutes. We read every word. Yes / no / not-now within 5 business days, always.

You're applying for DevOps Engineer · Infrastructure


Apply now