We're hiring · Infrastructure

DevOps Engineer.

Build the infrastructure that powers the next generation of AI media. GPU fleet, warm workers, queues, FastAPI services, in production tomorrow.

Remote · global

Full-time / contract

Senior IC + equity

The role

Runflow runs millions of AI image and video jobs a month across a multi-cloud GPU fleet. Every workflow our customers ship hits our infrastructure: containers spin up with the right models loaded, queues drain into warm workers on the right GPU class, jobs run to completion within a latency budget, and Sentinel scores them before they go out.

We're hiring a DevOps Engineer to help build that infrastructure. Workers, GPU pool, queues, the database, the FastAPI services that orchestrate everything: all of it is moving fast and could move faster with the right person.

If you've run Kubernetes in anger, sized container fleets to demand without over-paying for idle GPUs, debugged CrashLoopBackOff at 3am, and have strong opinions on when to use a queue vs. a stream, this is for you.

Quick facts

Team: Infrastructure

Reports to: CTO

Location: Remote · global

Type: Full-time or contract

Compensation: Senior IC + equity

Start: ASAP

Representative projects

What you'd ship in your first 6 months.

Concrete, shippable, on the live roadmap. Not hypothetical. We expect every one of these to land.

Project 01

Push the multi-cloud GPU fleet to the next level

RunPod is the primary provider today, with fallback providers across A100 / H100 / 4090 classes. Tighten the auto-scaling against queue depth, kill the cold-start tax, and route each job to the cheapest GPU class that meets the latency budget.
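To make "cheapest GPU class that meets the latency budget" concrete, here is a minimal sketch of that routing decision. It is an illustration only, not Runflow's scheduler: the `GpuClass` fields, the `pick_gpu_class` helper, and every price and timing below are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class GpuClass:
    name: str
    usd_per_second: float        # hypothetical price per GPU-second
    expected_run_seconds: float  # hypothetical run-time estimate for this job
    queue_wait_seconds: float    # hypothetical current wait for a warm worker


def pick_gpu_class(
    classes: Sequence[GpuClass], latency_budget_seconds: float
) -> Optional[GpuClass]:
    """Return the cheapest class whose wait + run time fits the budget."""
    eligible = [
        c
        for c in classes
        if c.queue_wait_seconds + c.expected_run_seconds <= latency_budget_seconds
    ]
    if not eligible:
        return None  # nothing fits; the caller decides whether to queue or reject
    # Compare total job cost, not the hourly rate: a pricier GPU can finish sooner.
    return min(eligible, key=lambda c: c.usd_per_second * c.expected_run_seconds)


# Made-up numbers, for illustration only.
fleet = [
    GpuClass("4090", usd_per_second=0.0002, expected_run_seconds=40.0, queue_wait_seconds=5.0),
    GpuClass("A100", usd_per_second=0.0005, expected_run_seconds=20.0, queue_wait_seconds=2.0),
    GpuClass("H100", usd_per_second=0.0010, expected_run_seconds=12.0, queue_wait_seconds=1.0),
]

choice = pick_gpu_class(fleet, latency_budget_seconds=30.0)
print(choice.name if choice else "no class fits the budget")
```

A real version would pull wait and run-time estimates from live queue metrics, which is where most of the work sits.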

Project 02

Make warm-worker fleets boring to operate

ComfyUI workers stay alive between runs with models in GPU memory. Tune the worker pool, reclaim idle GPUs without losing the warm-cache win, ship dashboards so on-call doesn't have to grep logs.

Project 03

Build the per-run cost analytics surface

Per-node GPU time, per-run cost, queue wait, top cost drivers. Customers + ops both need this. The hard part isn't the dashboard, it's accurate per-node attribution at fleet scale.
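As a rough sketch of what per-node attribution rolls up into, the snippet below aggregates per-node GPU time into per-run cost and a ranked list of cost drivers. The `NodeTiming` record, the node names, and the prices are all hypothetical; collecting accurate timings at fleet scale is the part this sketch does not cover.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class NodeTiming:
    run_id: str
    node_name: str             # hypothetical workflow node label
    gpu_seconds: float         # GPU time attributed to this node
    usd_per_gpu_second: float  # hypothetical price of the GPU it ran on


def cost_by_run(timings: Iterable[NodeTiming]) -> Dict[str, float]:
    """Sum node-level cost into a total per run."""
    totals: Dict[str, float] = defaultdict(float)
    for t in timings:
        totals[t.run_id] += t.gpu_seconds * t.usd_per_gpu_second
    return dict(totals)


def top_cost_drivers(
    timings: Iterable[NodeTiming], limit: int = 3
) -> List[Tuple[str, float]]:
    """Rank node types by total cost across all runs, most expensive first."""
    totals: Dict[str, float] = defaultdict(float)
    for t in timings:
        totals[t.node_name] += t.gpu_seconds * t.usd_per_gpu_second
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:limit]


# Made-up sample data, for illustration only.
timings = [
    NodeTiming("run-1", "sampler", gpu_seconds=10.0, usd_per_gpu_second=0.001),
    NodeTiming("run-1", "upscale", gpu_seconds=4.0, usd_per_gpu_second=0.001),
    NodeTiming("run-2", "sampler", gpu_seconds=6.0, usd_per_gpu_second=0.0005),
]

print(cost_by_run(timings))
print(top_cost_drivers(timings, limit=2))
```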

Responsibilities

What you'll be doing.

You'll own the full lifecycle, from prototype to production at scale.

Operate the multi-cloud GPU fleet (RunPod primary, fallback providers, A100 / H100 / 4090)

Run + scale Kubernetes workloads, kill CrashLoopBackOff before it pages anyone

Tune queue + worker architecture so jobs land on warm GPUs at the right tier

Manage the CUDA / driver / model-cache layer so containers spin up fast

Ship + maintain Python FastAPI services that orchestrate the fleet

Own the production Postgres + Redis (Neon, schema migrations, connection pooling, hot paths)

Maintain the CI/CD pipeline + merge queue (GitHub Actions today, room to rebuild)

Build observability so you can answer 'what just changed' in 30 seconds, not 30 minutes

Requirements

About you.

You may be a good fit if…

You've operated Kubernetes in production, not just used managed clusters as a black box
You've debugged a real GPU-fleet incident: autoscaling gone wrong, container OOM, driver mismatch, the works
You're comfortable in Python (FastAPI services, scripts, glue) and shell, even if neither is your home language
You think in queue depth, p95 latency, cost per request, not just CPU + memory graphs
You can read a Postgres query plan and know when an index is the answer (and when it isn't)
You have strong opinions on observability: what to log, what to trace, what to alert on, what to ignore

Strong candidates also have…

Run a multi-cloud GPU fleet (RunPod, Modal, AWS, Lambda Labs, etc.) at meaningful scale
Built or maintained CI/CD systems with merge queues, parallel test sharding, etc.
Rebuilt a system that was 'on-call hell' into one that runs itself
Shipped infrastructure-as-code (Pulumi, Terraform, k8s manifests) that survived 6+ months without a rewrite
Contributed to OSS in the infra space (k8s controllers, GPU schedulers, runtime tools)

What we're not looking for

A specific number of years. A specific degree. A specific stack. We hire on whether you can ship end-to-end and whether you have taste. Everything else is noise.

Tech stack

What we work with.

Backend

Python, FastAPI, Node.js, PostgreSQL, Redis

Infra

Kubernetes, Docker, RunPod, Modal, GitHub Actions

GPU layer

CUDA, A100 / H100, 4090 fleet, ComfyUI workers

Observability

Grafana, OpenTelemetry, Loki, Sentry

How we hire

Five steps. Decision speed is part of the offer.

We move fast because senior candidates with multiple offers reward the team that respects their time. The whole loop fits in two weeks.

01/5

Application submitted

Form, ~5 min

We read every word of every application. No silent rejections, ever.

Triaged within 5 business days

02/5

Take-home challenge

2 hours max, your time

A small, real Runflow problem. We score the prompts and decisions you made as much as the output itself.

Reviewed within 3 business days

03/5

30 minutes with the CTO

30 min, live

Quick conversation about your take-home, the team, and how you work day-to-day.

Decision in 48 hours

04/5

Paid challenge, 2 days

2 days, we pay for your time

A bigger problem we pay you to work through. We care about outcomes, not process. AI tools more than welcome.

Decision in 48 hours

05/5

Closing interview

90 min, live

Casual chat with the two founders.

Decision in 24 hours

Show, don't tell.

We value proof over promises. When you apply, include examples. Things that stand out:

A k8s setup you built or rescued: repo, write-up, anything we can read

GPU-fleet work: autoscaling logic, cost / latency tuning, even an internal post-mortem

A Postgres query you tuned that mattered, with before / after numbers

A CI/CD pipeline you rebuilt that made the team faster

Apply

Ready to ship?

Five minutes. We read every word. Yes / no / not-now within 5 business days, always.

You're applying for DevOps Engineer · Infrastructure

Name

Location

Optional. City, country, time zone.

Links

At least one: LinkedIn, GitHub, portfolio, Loom, anything.

Show us 1–3 things you've built that prove you can ship end-to-end.

Repos, posts, demos, anything. Specifics beat resume bullets.

What's a problem you solved with AI as the execution layer, not just chat?

We care that you reach for AI as a default tool, not occasionally.

Show us a product you find beautiful. Why?

Taste is non-negotiable. We want your real opinions.

Which AI tools are in your daily workflow?

Pick all you actively use, not just tried.

How much is AI helping you these days?

A sentence or two. Skeptical, force multiplier, or somewhere in between: where it shines, where it doesn't.

Share a Claude Code session that shipped a feature end-to-end.

Optional but high-signal. We want to see how you actually work with AI: the prompts, the back-and-forth, the corrections. Whole session, snippet, Loom, gist, anything.

When have you most successfully hacked a non-computer system to your advantage?

Optional. Bureaucracy, supply chain, a conference badge, social dynamics, anything. We collect signal here.

Most recent role + company

Just the title + company name. We don't need a CV.

When could you start?

Salary expectation

Confidential. A range is fine.

Show off the production system you've built that you're most proud of.

How many users / requests / GBs / images it served. When it broke, how you found out, what you actually did to fix it. Bonus if you've still got the post-mortem or the dashboard screenshot from the day. We want a real story.

And the gnarliest production issue you've personally debugged.

What was the symptom, what tools did you reach for, how did you find the actual cause, what changed afterwards. The kind of incident that taught you something you still use.

Anything else?

Optional. A short note, a question, a relevant detail we missed.

We commit to a yes / no / not-now within 5 business days. Always. No silent rejections.

Other open roles

Maybe one of these instead.

We hire builders, not roles. If none fit exactly, the open application is at the bottom.

Workflows · Remote · global

ComfyUI Engineer

Own the workflows that power millions of AI images every month. You think in nodes, you debug in graphs, you care obsessively about output quality.

Sentinel · Remote · global

Machine Learning Engineer

Help build the AI quality scoring layer Runflow runs in front of every image and video we ship. Hands-on ML, in production tomorrow, at scale.

DevRel · Remote · global

DevRel Engineer

Ship cool things on top of Runflow, bring creators in, drive adoption of the next generation of AI image + video infrastructure.

Open · You define it

Open application

Don't see your role? Tell us what you'd build at Runflow and how it moves the mission forward.
