Help build the AI quality scoring layer Runflow runs in front of every image and video we ship. Hands-on ML, in production tomorrow, at scale.
Sentinel is the AI quality scoring layer Runflow runs in front of every image and video we ship. It decides what's good enough to deliver, what needs to be fixed, and what to throw away. It's how we keep output quality predictable at the volumes our customers run.
We're hiring a Machine Learning Engineer to push that scoring layer forward and extend it across new content types. The work is hands-on ML applied to a real product. The model you train today is in production tomorrow, scoring real customer outputs.
If you've trained, fine-tuned, and quantised small specialised models, built eval pipelines that someone actually relied on, and have strong opinions on when an LLM is the right tool vs. when a sub-100ms specialised model is, this is for you.
Concrete, shippable, on the live roadmap. Not hypothetical. We expect every one of these to land.
Make Sentinel itself testable: labelled samples per use case (good and bad), CI integration, regression alerts on every prompt change or model swap. The first sketch after this list shows the shape we have in mind.
Audit where Sentinel passes outputs that customers later flag, and where it rejects outputs that were actually fine. Pick the highest-impact gap, design the new analyzer (researching the best embeddings model for human-identity comparisons, for example), train, quantise, and ship it into the pipeline.
Define the criteria that matter for video (identity drift, motion artifacts, scene continuity, lip sync where it applies). Build the analyzers. Ship them. Sentinel scores video by quarter-end. The second sketch below shows one way an identity-drift check could look.
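To make the testability project concrete: a minimal sketch of the kind of regression gate we mean, written as a pytest check over labelled samples. Every name in it is illustrative (the `score` stand-in, the sample paths, the 0.5 threshold), not Sentinel's actual API.

```python
# Illustrative only: `score` is a stand-in for the real Sentinel call, and
# the sample paths, use cases, and 0.5 threshold are made up for the sketch.
import pytest

# Each labelled sample: (path to output, use case, human verdict).
LABELLED_SAMPLES = [
    ("samples/headshot_001.png", "headshot", "good"),
    ("samples/headshot_014.png", "headshot", "bad"),
    ("samples/product_007.png", "product", "good"),
]

def score(path: str, use_case: str) -> float:
    """Stand-in for the real scoring pipeline; returns a score in [0, 1]."""
    raise NotImplementedError("wire this to the real scoring call in CI")

@pytest.mark.parametrize("path,use_case,verdict", LABELLED_SAMPLES)
def test_score_agrees_with_human_label(path, use_case, verdict):
    s = score(path, use_case)
    if verdict == "good":
        assert s >= 0.5, f"{path}: labelled good, scored {s:.2f}"
    else:
        assert s < 0.5, f"{path}: labelled bad, scored {s:.2f}"
```

Run on every prompt change or model swap, a failing case is the regression alert.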
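And for the video project, a hypothetical sketch of one criterion, identity drift, assuming some `embed_face` function from whichever embeddings model the audit lands on. A sketch under those assumptions, not a shipped analyzer.

```python
# Hypothetical sketch of one video criterion: identity drift. `embed_face`
# stands in for whatever face-embedding model is chosen.
import numpy as np

def identity_drift(frames: list[np.ndarray], embed_face) -> float:
    """Worst cosine distance between each frame's face embedding and frame 0's.

    Near 0.0 means the subject stays the same person across the clip;
    values approaching 1.0 mean the face has drifted.
    """
    ref = embed_face(frames[0])
    ref = ref / np.linalg.norm(ref)
    worst = 0.0
    for frame in frames[1:]:
        emb = embed_face(frame)
        emb = emb / np.linalg.norm(emb)
        worst = max(worst, 1.0 - float(ref @ emb))
    return worst
```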
You'll own the full lifecycle, from prototype to production at scale.
Architect the evaluation pipeline (router, preprocessors, judges) for image and video
Train, fine-tune, and quantise specialised neural networks (face, pose, segmentation, OCR, embeddings, depth)
Design dynamic rubrics that adapt to the use case: headshot vs. product vs. fashion vs. video
Spawn LLM judges per criterion; decide the model, the prompt, the complexity tier
Extend Sentinel from images to video: temporal consistency, motion, identity drift, lip sync
Build the eval harness: labelled samples per use case, CI regression alerts
Push the LLM router toward the cost / quality / latency frontier (Lite / Flash / Pro tiers)
Co-design the auto-fix loop with workflow engineers: generate → evaluate → fix → re-evaluate (a sketch follows this list)
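The auto-fix loop in that last item is simple to state and subtle to tune. Here's a minimal control-flow sketch, assuming stand-in `generate`, `evaluate`, and `fix` calls; the threshold and fix budget are invented numbers, not production config.

```python
# Control-flow sketch only: `generate`, `evaluate`, and `fix` are stand-ins
# for the real workflow calls; 0.8 and 2 are invented, not production config.
def run_with_autofix(request, generate, evaluate, fix,
                     threshold=0.8, max_fixes=2):
    """generate -> evaluate -> fix -> re-evaluate, with a bounded fix budget."""
    output = generate(request)
    score, report = evaluate(output)    # per-criterion scoring
    for _ in range(max_fixes):
        if score >= threshold:
            return output               # good enough to deliver
        output = fix(output, report)    # targeted fix guided by the report
        score, report = evaluate(output)
    if score >= threshold:
        return output
    return None                         # still failing: reject, don't ship
```

The interesting work lives in `fix`: using the failed criteria to steer a targeted regeneration rather than a blind retry.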
We don't screen for a specific number of years, a specific degree, or a specific stack. We hire on whether you can ship end-to-end and whether you have taste. Everything else is noise.
We move fast because senior candidates with multiple offers reward the team that respects their time. The whole loop fits in two weeks.
We read every word of every application. No silent rejections, ever.
A small, real Runflow problem. We score the prompts and decisions you made as much as the output itself.
Quick conversation about your take-home, the team, and how you work day-to-day.
A bigger problem we pay you to work through. We care about outcomes, not process. AI tools more than welcome.
Casual chat with the two founders.
We value proof over promises. When you apply, include examples of things that stand out.
The application takes five minutes. We read every word. Yes / no / not-now within 5 business days, always.
We hire builders, not roles. If no listed role fits exactly, the open application is at the bottom of the page.