We're hiring · Sentinel

Machine Learning Engineer.

Help build the AI quality scoring layer Runflow runs in front of every image and video we ship. Hands-on ML, in production tomorrow, at scale.

Remote · global
Full-time / contract
Senior IC + equity
01 · router
prompt: professional headshot, navy suit, neutral background
decision: use_case = headshot · analyzers = 3 · criteria = 3

02 · analyzers
Subject · Layout · Surface

03 · criteria → score
Identity preserved: 96
Composition: 92
Surface quality: 88
Sentinel score: 94 → pass · ship
The role

Sentinel is the AI quality scoring layer Runflow runs in front of every image and video we ship. It decides what's good enough to deliver, what needs to be fixed, and what to throw away. It's how we keep output quality predictable at the volumes our customers run.
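
The deliver / fix / discard decision can be sketched as a weighted roll-up of per-criterion scores against ship and fix thresholds. The criterion names, weights, and cut-offs below are illustrative only, not Sentinel's real configuration:

```python
# Illustrative sketch of a deliver / fix / discard decision.
# All criteria names, weights, and thresholds are made up for this example;
# the real pipeline derives them per use case.

def sentinel_score(criteria: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-criterion scores (0-100)."""
    total = sum(weights.values())
    return sum(criteria[name] * w for name, w in weights.items()) / total

def decide(score: float, pass_at: float = 90, fix_at: float = 70) -> str:
    """Map an aggregate score onto the three outcomes described above."""
    if score >= pass_at:
        return "ship"
    if score >= fix_at:
        return "fix"
    return "discard"

criteria = {"identity": 96, "composition": 92, "surface": 88}
weights = {"identity": 2.0, "composition": 1.0, "surface": 1.0}  # identity weighted up for headshots
score = sentinel_score(criteria, weights)
print(round(score), decide(score))  # 93 ship
```

Weighting per use case is the interesting part: for a headshot, identity preservation dominates; for a product shot, surface quality might.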

We're hiring a Machine Learning Engineer to push that scoring layer forward and extend it across new content types. The work is hands-on ML applied to a real product. The model you train today is in production tomorrow, scoring real customer outputs.

If you've trained, fine-tuned, and quantised small specialised models, built eval pipelines that someone actually relied on, and have strong opinions on when an LLM is the right tool vs. when a sub-100ms specialised model is, this is for you.

Quick facts

Team: Sentinel
Reports to: Head of AI
Location: Remote · global
Type: Full-time or contract
Compensation: Senior IC + equity
Start: ASAP
Representative projects

What you'd ship in your first 6 months.

Concrete, shippable, on the live roadmap. Not hypothetical. We expect every one of these to land.

Project 01

Build the eval harness that catches regressions before they ship

Labelled samples per use case (good + bad), CI integration, regression alerts on every prompt change or model swap. Make Sentinel itself testable.
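
The shape of that harness can be sketched in a few lines: labelled samples with recorded baselines, re-scored on every change, failing CI when a known-good sample regresses. `score_output` and the sample format are stand-ins, not the real pipeline:

```python
# Minimal sketch of the regression harness: labelled samples per use case,
# scored by the current pipeline, with an alert when a sample drops more
# than `tolerance` below its recorded baseline. `score_output` is a
# placeholder for the real Sentinel call.

def score_output(output: str) -> float:
    # Placeholder scorer; the real harness would run the full pipeline.
    return 95.0 if "good" in output else 40.0

LABELLED = [
    {"use_case": "headshot", "output": "good_sample_1", "baseline": 90.0},
    {"use_case": "headshot", "output": "bad_sample_1", "baseline": 40.0},
]

def regressions(samples, tolerance: float = 5.0):
    """Return samples whose score fell more than `tolerance` below baseline."""
    failed = []
    for s in samples:
        score = score_output(s["output"])
        if score < s["baseline"] - tolerance:
            failed.append((s["use_case"], s["output"], score))
    return failed

# In CI: fail the build on any regression after a prompt change or model swap.
assert regressions(LABELLED) == []
```

Note that bad samples carry baselines too: the harness must catch Sentinel becoming *more* lenient as well as less.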

Project 02

Find Sentinel's blind spots, then add the analyzers that close them

Audit where Sentinel passes outputs customers later flag, and where it rejects outputs that were actually fine. Pick the highest-impact gap, design the new analyzer (for example, researching the best embedding model for human-identity comparisons), then train, quantise, and ship it into the pipeline.
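
The audit half of this project is essentially a confusion-matrix pass over Sentinel's verdicts versus downstream feedback. The record fields below are illustrative; in practice the "actually fine" label for rejected outputs would come from manual review rather than customer flags:

```python
# Sketch of the blind-spot audit: compare Sentinel's verdict on each output
# against ground truth to find false passes (shipped, later flagged bad)
# and false rejects (blocked, actually fine). Field names are made up.

def audit(records):
    false_pass = [r for r in records if r["sentinel"] == "pass" and r["flagged_bad"]]
    false_reject = [r for r in records if r["sentinel"] == "reject" and not r["flagged_bad"]]
    return false_pass, false_reject

records = [
    {"id": 1, "sentinel": "pass", "flagged_bad": False},   # true pass
    {"id": 2, "sentinel": "pass", "flagged_bad": True},    # blind spot
    {"id": 3, "sentinel": "reject", "flagged_bad": False}, # over-strict
]
fp, fr = audit(records)
print(len(fp), len(fr))  # 1 1
```

The highest-impact gap is whichever bucket costs most: false passes hurt customers directly, false rejects burn compute on needless regeneration.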

Project 03

Extend Sentinel from images to video, scoring temporal consistency on Seedance and Kling

Define the criteria that matter for video (identity drift, motion artifacts, scene continuity, lip sync where it applies). Build the analyzers. Ship it. Sentinel scores video by quarter-end.
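
One of those criteria, identity drift, can be sketched as adjacent-frame similarity over face embeddings. Real embeddings would come from a face model such as ArcFace; the toy 3-d vectors and the 0.9 threshold here are illustrative:

```python
# Sketch of one video criterion: identity drift measured as the minimum
# cosine similarity between consecutive frames' face embeddings.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def identity_drift(frame_embeddings, threshold: float = 0.9):
    """Return the worst adjacent-frame similarity and whether the clip passes."""
    sims = [cosine(a, b) for a, b in zip(frame_embeddings, frame_embeddings[1:])]
    worst = min(sims)
    return worst, worst >= threshold

# Three frames whose embeddings stay close together: no drift.
frames = [(1.0, 0.0, 0.0), (0.99, 0.1, 0.0), (0.98, 0.15, 0.0)]
worst, ok = identity_drift(frames)
print(ok)  # True
```

Motion artifacts and scene continuity would get analogous per-pair scores, each feeding the rubric as its own criterion.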

Responsibilities

What you'll be doing.

You'll own the full lifecycle, from prototype to production at scale.

Architect the evaluation pipeline (router, preprocessors, judges) for image and video

Train, fine-tune, and quantise specialised neural networks (face, pose, segmentation, OCR, embeddings, depth)

Design dynamic rubrics that adapt to the use case: headshot vs. product vs. fashion vs. video

Spawn LLM judges per criterion; decide model, prompt, and complexity tier

Extend Sentinel from images to video: temporal consistency, motion, identity drift, lip sync

Build the eval harness: labelled samples per use case, CI regression alerts

Push the LLM router toward the cost / quality / latency frontier (Lite / Flash / Pro tiers)

Co-design the auto-fix loop with workflow engineers: generate → evaluate → fix → re-evaluate
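
The last item, the auto-fix loop, is a small control loop at heart. The sketch below uses stand-in `generate`, `evaluate`, and `fix` callables and an invented retry cap; the real loop would plug in the workflow engine and Sentinel itself:

```python
# Sketch of the auto-fix loop: generate, evaluate, fix, re-evaluate, with a
# retry cap so unfixable outputs are discarded rather than looped forever.

def auto_fix_loop(generate, evaluate, fix, pass_at: float = 90, max_fixes: int = 2):
    output = generate()
    for _ in range(max_fixes + 1):
        score = evaluate(output)
        if score >= pass_at:
            return output, score, "ship"
        output = fix(output)
    return output, evaluate(output), "discard"

# Toy run: each fix pass improves the output until it clears the bar.
scores = {"draft": 70, "draft+fix": 80, "draft+fix+fix": 92}
result = auto_fix_loop(
    generate=lambda: "draft",
    evaluate=lambda o: scores[o],
    fix=lambda o: o + "+fix",
)
print(result)  # ('draft+fix+fix', 92, 'ship')
```

The retry cap is where cost engineering lives: every extra fix pass is another generation plus another full evaluation.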

Requirements

About you.

You may be a good fit if…

  • You've trained or fine-tuned a small neural network end-to-end: datasets, loss curves, the works
  • You've shipped quantised inference (INT4/INT8) and know which trade-offs are acceptable in production
  • You've designed an evaluation pipeline for any generative task and made it run reliably
  • You have strong opinions on when to use an LLM vs. a specialised model, and can defend them
  • You're comfortable shipping in TypeScript even if Python is your home; the system spans both
  • You read recent ML papers (LLM-as-judge, eval design, video gen) and can summarise them in 3 sentences

Strong candidates also have…

  • Published an evaluation framework or LLM-as-judge benchmark
  • Worked on AI video generation or video evaluation specifically
  • Hands-on experience with Qwen3-VL, SAM 3.1, OpenPose, Gemini Embedding 2, ArcFace, or similar specialised CV / multimodal models
  • Built a model router or orchestrator for multi-model production systems
  • Shipped open-source ML infrastructure or contributed to one of the libraries above

What we're not looking for

A specific number of years. A specific degree. A specific stack. We hire on whether you can ship end-to-end and whether you have taste. Everything else is noise.

Tech stack

What we work with.

Backend

Node.js · Python · PostgreSQL · Redis

ML

ComfyUI · PyTorch · diffusers · transformers · ONNX · TensorRT

Models

Qwen3-VL · SAM 3.1 · OpenPose · Gemini Embedding 2 · ArcFace

LLMs

Gemini 3.1 · Opus 4.7 Thinking · Muse Spark · Qwen 3.5 · Kimi k2.6

How we hire

Five steps. Decision speed is part of the offer.

We move fast because senior candidates with multiple offers reward the team that respects their time. The whole loop fits in two weeks.

01/5

Application submitted

Form, ~5 min

We read every word of every application. No silent rejections, ever.

Triaged within 5 business days
Apply now
02/5

Take-home challenge

2 hours max, your time

A small, real Runflow problem. We score the prompts and decisions you made as much as the output itself.

Reviewed within 3 business days
03/5

30 minutes with the CTO

30 min, live

Quick conversation about your take-home, the team, and how you work day-to-day.

Decision in 48 hours
04/5

Paid challenge, 2 days

2 days, we pay for your time

A bigger problem we pay you to work through. We care about outcomes, not process. AI tools more than welcome.

Decision in 48 hours
05/5

Closing interview

90 min, live

Casual chat with the two founders.

Decision in 24 hours

Show, don't tell.

We value proof over promises. When you apply, include examples. Things that stand out:

  • A small model you trained, fine-tuned, or quantised: repo + the eval that proves it works
  • An evaluation rubric you've designed for a real problem, with the failure modes you caught
  • Any work on video generation, video eval, or temporal modelling
  • A multi-model system or router you've built: when each model fires, and why
Apply

Ready to ship?

Five minutes. We read every word. Yes / no / not-now within 5 business days, always.

You're applying for Machine Learning Engineer · Sentinel


Apply now