Machine Learning Engineer.
Help build the AI quality scoring layer Runflow runs in front of every image and video we ship. Hands-on ML, in production tomorrow, at scale.
Sentinel is the AI quality scoring layer Runflow runs in front of every image and video we ship. It decides what's good enough to deliver, what needs to be fixed, and what to throw away. It's how we keep output quality predictable at the volumes our customers run.
We're hiring a Machine Learning Engineer to push that scoring layer forward and extend it across new content types. The work is hands-on ML applied to a real product. The model you train today is in production tomorrow, scoring real customer outputs.
If you've trained, fine-tuned, and quantised small specialised models, built eval pipelines that someone actually relied on, and have strong opinions on when an LLM is the right tool vs. when a sub-100ms specialised model is, this is for you.
Quick facts
What you'd ship in your first 6 months.
Concrete, shippable, on the live roadmap. Not hypothetical. We expect every one of these to land.
Build the eval harness that catches regressions before they ship
Labelled samples per use case (good + bad), CI integration, regression alerts on every prompt change or model swap. Make Sentinel itself testable.
Find Sentinel's blind spots, then add the analyzers that close them
Audit where Sentinel passes outputs that customers later flag, and where it rejects outputs that were actually fine. Pick the highest-impact gap, design the new analyzer (researching the best embeddings model for human-identity comparisons, for example), then train it, quantise it, and ship it into the pipeline.
Extend Sentinel from images to video, scoring temporal consistency on Seedance and Kling
Define the criteria that matter for video (identity drift, motion artifacts, scene continuity, lip sync where it applies). Build the analyzers. Ship it. Sentinel scores video by quarter-end.
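As a rough illustration of the eval harness described above (labelled good and bad samples per use case, regression alerts on a prompt change or model swap), here is a minimal Python sketch. Every name in it (LabelledSample, accuracy_by_use_case, find_regressions, the "sharpness" feature, the 0.02 tolerance) is hypothetical and chosen for the example; it is not Sentinel's actual code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class LabelledSample:
    use_case: str   # hypothetical use-case name, e.g. "headshot" or "product"
    features: dict  # stand-in for the real image or video
    is_good: bool   # human label: good enough to deliver?


def accuracy_by_use_case(
    samples: List[LabelledSample],
    scorer: Callable[[dict], float],
    threshold: float = 0.5,
) -> Dict[str, float]:
    """Fraction of samples per use case where the scorer agrees with the label."""
    hits: Dict[str, int] = {}
    totals: Dict[str, int] = {}
    for s in samples:
        predicted_good = scorer(s.features) >= threshold
        totals[s.use_case] = totals.get(s.use_case, 0) + 1
        hits[s.use_case] = hits.get(s.use_case, 0) + int(predicted_good == s.is_good)
    return {uc: hits[uc] / totals[uc] for uc in totals}


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """Use cases where the candidate dropped more than `tolerance` below baseline."""
    return sorted(
        uc for uc, base in baseline.items()
        if candidate.get(uc, 0.0) < base - tolerance
    )


# Toy data: two use cases, one good and one bad sample each.
samples = [
    LabelledSample("headshot", {"sharpness": 0.9}, True),
    LabelledSample("headshot", {"sharpness": 0.2}, False),
    LabelledSample("product", {"sharpness": 0.8}, True),
    LabelledSample("product", {"sharpness": 0.3}, False),
]

# The "old" scorer separates good from bad; the "new" one passes everything.
baseline = accuracy_by_use_case(samples, lambda f: f["sharpness"])
candidate = accuracy_by_use_case(samples, lambda f: 0.9)
regressions = find_regressions(baseline, candidate)
```

In a CI job, a non-empty regressions list would fail the build or raise an alert. Tracking agreement per use case, rather than one global number, keeps a regression on one content type from being hidden by stable results on the others.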
What you'll be doing.
You'll own the full lifecycle, from prototype to production at scale.
Architect the evaluation pipeline (router, preprocessors, judges) for image and video
Train, fine-tune, and quantise specialised neural networks (face, pose, segmentation, OCR, embeddings, depth)
Design dynamic rubrics that adapt to the use case: headshot vs. product vs. fashion vs. video
Spawn LLM judges per criterion. Decide model, prompt, complexity tier
Extend Sentinel from images to video: temporal consistency, motion, identity drift, lip sync
Build the eval harness: labelled samples per use case, CI regression alerts
Push the LLM router toward the cost / quality / latency frontier (Lite / Flash / Pro tiers)
Co-design the auto-fix loop with workflow engineers: generate → evaluate → fix → re-evaluate
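The generate → evaluate → fix → re-evaluate loop in the last item can be sketched as a bounded retry. This is an illustrative Python sketch only: auto_fix_loop, the 0-100 quality scale, the pass threshold of 80, and the cap on fix attempts are all assumptions made for the example, not a description of how Runflow implements it.

```python
from typing import Callable, Optional, Tuple


def auto_fix_loop(
    generate: Callable[[], dict],
    evaluate: Callable[[dict], float],
    fix: Callable[[dict, float], dict],
    pass_threshold: float = 80,
    max_fixes: int = 2,
) -> Tuple[Optional[dict], float, int]:
    """Generate once, then fix and re-evaluate until the score passes
    or the fix budget runs out. Returns (delivered_output_or_None,
    final_score, number_of_fix_attempts)."""
    output = generate()
    score = evaluate(output)
    attempts = 0
    while score < pass_threshold and attempts < max_fixes:
        output = fix(output, score)  # attempt a repair
        score = evaluate(output)     # re-score the repaired output
        attempts += 1
    delivered = output if score >= pass_threshold else None
    return delivered, score, attempts


# Toy stand-ins: an "output" is a dict with a quality number,
# and each fix raises that number by 30.
def toy_generate() -> dict:
    return {"quality": 30}


def toy_evaluate(output: dict) -> float:
    return output["quality"]


def toy_fix(output: dict, score: float) -> dict:
    return {"quality": output["quality"] + 30}


delivered, final_score, attempts = auto_fix_loop(toy_generate, toy_evaluate, toy_fix)
rejected, rejected_score, rejected_attempts = auto_fix_loop(
    toy_generate, toy_evaluate, toy_fix, max_fixes=1
)
```

Capping the number of fix attempts keeps cost and latency bounded; an output that still fails after the budget is spent is returned as None so the caller can discard it or escalate.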
About you.
You may be a good fit if…
- You've trained or fine-tuned a small neural network end-to-end: datasets, loss curves, the works
- You've shipped quantised inference (INT4/INT8) and know what trade-offs are acceptable in production
- You've designed an evaluation pipeline for any generative task and made it run reliably
- You have strong opinions on when to use an LLM vs. a specialised model, and can defend them
- You're comfortable shipping in TypeScript even if Python is your home; the system spans both
- You read recent ML papers (LLM-as-judge, eval design, video gen) and can summarise them in 3 sentences
Strong candidates also have…
- Published an evaluation framework or LLM-as-judge benchmark
- Worked on AI video generation or video evaluation specifically
- Hands-on experience with Qwen3-VL, SAM 3.1, OpenPose, Gemini Embedding 2, ArcFace, or similar specialised CV / multimodal models
- Built a model router or orchestrator for multi-model production systems
- Shipped open-source ML infrastructure or contributed to one of the libraries above
What we're not looking for
A specific number of years. A specific degree. A specific stack. We hire on whether you can ship end-to-end and whether you have taste. Everything else is noise.
What we work with.
Backend
ML
Models
LLMs
Five steps. Decision speed is part of the offer.
We move fast because senior candidates with multiple offers reward the team that respects their time. The whole loop fits in two weeks.
Application submitted
Form, ~5 min
We read every word of every application. No silent rejections, ever.
Take-home challenge
2 hours max, your time
A small, real Runflow problem. We score the prompts and decisions you made as much as the output itself.
30 minutes with the CTO
30 min, live
Quick conversation about your take-home, the team, and how you work day-to-day.
Paid challenge, 2 days
2 days, we pay for your time
A bigger problem we pay you to work through. We care about outcomes, not process. AI tools more than welcome.
Closing interview
90 min, live
Casual chat with the two founders.
Show, don't tell.
We value proof over promises. When you apply, include examples. Things that stand out:
Ready to ship?
Five minutes. We read every word. Yes / no / not-now within 5 business days, always.