I Built a Visual Fallback for Failed Conforms

Conform is the moment in post-production where an edit decision list reaches the colour pipeline. The colourist’s session gets populated with the right source clips, in the right order, at the right in and out points. When it works, it’s invisible. When the metadata is corrupted, stripped, or just wrong, it stops working, and the colourist has to find the right source by hand.

A teammate had been working on a full visual replacement for metadata-based conform, using DINOv2 embeddings and a FAISS index. Strong system. Another colleague suggested a different shape: rather than replace the whole process, use visual matching as a fallback that kicks in only when metadata conform fails on a shot. That framing turned out to suit the way conform actually breaks. Most shots match fine. A handful don’t, and those are the ones worth catching automatically.

I built that fallback.

The technique: BLAST, but for frames

The problem is shaped like a search. A small piece of footage (the offline cut) is the needle. A much larger corpus (every camera-original source clip in the project) is the haystack. For a single frame, the search is hopeless — too many visually similar frames across takes. For two consecutive frames, the search becomes specific, because the exact visual change between them — a head turn, a light shift, a flag of motion — is unlikely to repeat anywhere else.

That’s the same insight bioinformatics uses to locate a query sequence inside a genome. BLAST scans for short exact-match substrings (k-mers) as “seeds,” then extends each seed by looking at the surrounding context. Different domain, same mechanic.
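A minimal sketch of the seed stage, under assumptions the post doesn't spell out: a pair embedding here is two consecutive frame embeddings concatenated and L2-normalised, an exact top-k search stands in for the approximate-nearest-neighbour index, and the names pair_embeddings and seed_vote are hypothetical. Each hit votes for a (clip, offset) bucket, where offset is the source frame index minus the offline frame index, so seeds from a correctly matched shot pile into one bucket.

```python
from collections import Counter

import numpy as np


def pair_embeddings(frames: np.ndarray) -> np.ndarray:
    """Join each frame embedding with its successor, then L2-normalise the pair."""
    pairs = np.concatenate([frames[:-1], frames[1:]], axis=1)
    return pairs / np.linalg.norm(pairs, axis=1, keepdims=True)


def seed_vote(query_pairs, corpus_pairs, corpus_clip, corpus_frame, top_k=5):
    """Return ((clip, offset), hits) for the bucket with the most seed hits."""
    sims = query_pairs @ corpus_pairs.T  # cosine similarity: rows are unit length
    votes = Counter()
    for q_idx, row in enumerate(sims):
        for hit in np.argsort(row)[-top_k:]:  # exact top-k stands in for ANN
            votes[(corpus_clip[hit], int(corpus_frame[hit]) - q_idx)] += 1
    return votes.most_common(1)[0]


# Toy corpus: two clips of random "frame embeddings" (stand-ins for real ones).
rng = np.random.default_rng(0)
clips = {name: rng.normal(size=(200, 32)) for name in ("A", "B")}
corpus_pairs = np.vstack([pair_embeddings(f) for f in clips.values()])
corpus_clip = [name for name, f in clips.items() for _ in range(len(f) - 1)]
corpus_frame = np.concatenate([np.arange(len(f) - 1) for f in clips.values()])

# "Offline" shot: frames 50..70 of clip B with a little noise added.
offline = clips["B"][50:71] + rng.normal(scale=0.05, size=(21, 32))
bucket, hits = seed_vote(pair_embeddings(offline), corpus_pairs,
                         corpus_clip, corpus_frame)
print(bucket, hits)
```

Pairs are built per clip so no pair straddles a clip boundary. In the toy run the winning bucket is clip B at offset 50, which is how the shot was constructed.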

The pipeline runs in three stages:

1. BLAST seed. Approximate-nearest-neighbour lookup on consecutive-pair frame embeddings. The candidate (clip, offset) bucket with the most seed hits wins.
2. Smith-Waterman extend. Align the offline shot against the winning source clip’s frame range to lock the in and out points.
3. Offset refine. A final ±N-frame sweep around the alignment, picking the offset that maximises mean per-frame cosine similarity.
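The extend stage can be sketched as textbook Smith-Waterman local alignment run over a frame-similarity matrix instead of a substitution table. This is an illustration, not the post's implementation: the match_bias and gap values are assumptions picked for the toy data, and smith_waterman is a hypothetical name.

```python
import numpy as np


def smith_waterman(sim, match_bias=0.5, gap=0.5):
    """Local alignment over sim[i, j] = cosine(offline frame i, source frame j).

    A step along the diagonal scores sim - match_bias, so only frames that
    are clearly similar add to the alignment. Returns the score plus the
    inclusive (start, end) frame spans in the offline shot and the source.
    """
    n, m = sim.shape
    H = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i, j] = max(0.0,
                          H[i - 1, j - 1] + sim[i - 1, j - 1] - match_bias,
                          H[i - 1, j] - gap,
                          H[i, j - 1] - gap)
    i, j = (int(v) for v in np.unravel_index(np.argmax(H), H.shape))
    score, end_i, end_j = float(H[i, j]), i, j
    # Trace back until the running score drops to zero.
    while i > 0 and j > 0 and H[i, j] > 0:
        diag = H[i - 1, j - 1] + sim[i - 1, j - 1] - match_bias
        if np.isclose(H[i, j], diag):
            i, j = i - 1, j - 1
        elif np.isclose(H[i, j], H[i - 1, j] - gap):
            i -= 1
        else:
            j -= 1
    return score, (i, end_i - 1), (j, end_j - 1)


def unit(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)


rng = np.random.default_rng(1)
source = unit(rng.normal(size=(100, 64)))
# Offline shot: source frames 40..51 with slight noise.
offline = unit(source[40:52] + rng.normal(scale=0.01, size=(12, 64)))
score, offline_span, source_span = smith_waterman(offline @ source.T)
print(offline_span, source_span)
```

The gap moves are what would let an alignment survive a dropped or duplicated frame; on clean toy data the path is a straight diagonal.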

Each stage tightens the result. Seed gets the right clip. Extend gets the right region. Refine gets the right frame.
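The refine step is simple enough to sketch directly. Assuming unit-normalised frame embeddings and a starting offset from the alignment, sweep a few frames either side and keep the offset with the highest mean per-frame cosine similarity. The function name and the radius value are illustrative.

```python
import numpy as np


def refine_offset(offline, source, offset, radius=3):
    """Sweep offset ± radius and return (best_offset, mean cosine similarity).

    Rows of `offline` and `source` are assumed to be unit-normalised, so a
    row-wise dot product is the cosine similarity.
    """
    n = len(offline)
    best_offset, best_score = offset, -np.inf
    for cand in range(offset - radius, offset + radius + 1):
        if cand < 0 or cand + n > len(source):
            continue  # candidate window would run off the clip
        score = float(np.mean(np.sum(offline * source[cand:cand + n], axis=1)))
        if score > best_score:
            best_offset, best_score = cand, score
    return best_offset, best_score


def unit(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)


# Toy source clip: a random walk, so neighbouring frames look alike.
rng = np.random.default_rng(2)
walk = rng.normal(size=(1, 64)) + np.cumsum(
    rng.normal(scale=0.1, size=(300, 64)), axis=0)
source = unit(walk)
offline = unit(walk[120:150] + rng.normal(scale=0.02, size=(30, 64)))

# Start two frames off, as if the alignment had landed slightly early.
best, score = refine_offset(offline, source, offset=118, radius=3)
print(best, round(score, 4))
```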

Why SSCD, not DINOv2

The teammate’s primary line uses DINOv2 — a strong general-purpose vision foundation model. For this fallback I switched to SSCD, Meta’s copy-detection embeddings. SSCD was trained for content moderation: matching images even after heavy crops, resolution changes, colour shifts, overlays. That happens to be exactly what conform fallback needs to handle. The offline cut has been re-encoded, transcoded, colour-graded, sometimes letterboxed. The source is the camera-original. Anything robust to that range of transforms wins.

I ran a head-to-head comparing the two on a clip pushed through seven degradation profiles (letterbox, crop, double LUT, heavy compression, low-res upscale, burn-in, display transform). DINOv2 had a larger discrimination gap on the cleaner cases. SSCD held up better on the messier ones, and SSCD is smaller, faster, and MIT licensed. For the fallback role, SSCD’s profile fit.
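The post doesn't define the discrimination gap, so the following is one plausible reading rather than the measure actually used: for each degraded frame, take its similarity to the true source frame minus its best similarity to any other frame, then average. Random vectors with added noise stand in for real embeddings of degraded footage.

```python
import numpy as np


def discrimination_gap(degraded, originals):
    """Mean of (similarity to true source) - (best similarity to a wrong frame).

    Row i of `degraded` is assumed to be a degraded copy of row i of
    `originals`; all rows are unit-normalised.
    """
    sims = degraded @ originals.T
    true_sim = np.diag(sims).copy()
    np.fill_diagonal(sims, -np.inf)  # mask the true match
    hardest_negative = sims.max(axis=1)
    return float(np.mean(true_sim - hardest_negative))


def unit(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)


rng = np.random.default_rng(3)
originals = unit(rng.normal(size=(50, 64)))
mild = unit(originals + rng.normal(scale=0.01, size=originals.shape))
heavy = unit(originals + rng.normal(scale=0.10, size=originals.shape))

gap_mild = discrimination_gap(mild, originals)
gap_heavy = discrimination_gap(heavy, originals)
print(round(gap_mild, 3), round(gap_heavy, 3))
```

A larger gap means more headroom before a lookalike frame outranks the true one, which is why the gap shrinking under heavy degradation matters more than the absolute similarity.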

Stress testing: ground truth by construction

The hardest part of building something like this isn’t writing the matcher. It’s knowing whether the matcher actually works. Real EDLs come with retimes, transcodes, and human error baked in. Ground truth is muddy.

So I built a synthetic test framework. Pick a random source clip and frame range, push it through one of the seven degradation profiles, then try to match the result back. Ground truth is provable by construction — the source clip and frame range are what generated the test. No EDL trust issues, no metadata drift.

Run 100 trials per profile, 700 total. Score by whether the matcher found the right clip and how close the offset was.
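A sketch of the harness skeleton, assuming the matcher is exposed as a callable that takes a trial and returns a (clip, start frame) guess. The degrade-and-match step is deliberately left inside that callable; the clip names, shot length, and function names are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

PROFILES = ["letterbox", "crop", "double_lut", "heavy_compression",
            "lowres_upscale", "burn_in", "display_transform"]


@dataclass(frozen=True)
class Trial:
    profile: str
    clip: str    # ground-truth source clip
    start: int   # ground-truth first frame
    length: int


def make_trials(clip_lengths: Dict[str, int], per_profile: int = 100,
                shot_len: int = 48, seed: int = 0) -> List[Trial]:
    """Draw random (clip, frame range) pairs; the draw itself is the ground truth."""
    rng = random.Random(seed)
    eligible = sorted(c for c, n in clip_lengths.items() if n >= shot_len)
    trials = []
    for profile in PROFILES:
        for _ in range(per_profile):
            clip = rng.choice(eligible)
            start = rng.randrange(clip_lengths[clip] - shot_len + 1)
            trials.append(Trial(profile, clip, start, shot_len))
    return trials


def run(trials: List[Trial],
        match: Callable[[Trial], Tuple[str, int]]) -> Dict[str, float]:
    """Score a matcher: right clip, and right clip at exactly the right frame."""
    right_clip = exact = 0
    for t in trials:
        clip, start = match(t)
        right_clip += clip == t.clip
        exact += (clip, start) == (t.clip, t.start)
    return {"correct_clip": right_clip / len(trials),
            "frame_accurate": exact / len(trials)}


clip_lengths = {"A001C001.mov": 900, "A001C002.mov": 1500, "A001C003.mov": 30}
trials = make_trials(clip_lengths)
oracle = run(trials, lambda t: (t.clip, t.start))
off_by_one = run(trials, lambda t: (t.clip, t.start + 1))
print(len(trials), oracle, off_by_one)
```

The two stand-in matchers are sanity checks on the scoring: a perfect oracle, and one that is always a single frame late.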

Results

On a real EDL — 207 events, 354 source clips, around eleven hours of dailies at a 30:1 shoot ratio — the matcher landed on the correct clip 100% of the time. Frame-accurate on 88% of events. Within one frame on 92%. The four retimes in the EDL were out of scope for this version.

On the synthetic stress test — 700 trials across seven degradation profiles — 99% correct clip, 91% frame-accurate.

The 8% that land more than one frame off on the real EDL are nearly all locked-off shots: a static landscape, leaves moving in the wind, a framed photograph on a wall. Consecutive frames in these shots barely differ. The temporal fingerprint falls apart, and the matcher can land 50 to 190 frames off the true position. The algorithm scores these matches with low confidence, so they’re easy to flag from the output report and review manually.
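The post doesn't say how the confidence score is computed. As a hypothetical proxy for the failure mode it describes, one could measure how much consecutive frame embeddings actually differ within a shot and flag the ones that barely move. The threshold below is an assumption tuned to the toy data, not a recommended value.

```python
import numpy as np


def temporal_variation(frames):
    """Mean cosine distance between consecutive frame embeddings (unit rows)."""
    return float(np.mean(1.0 - np.sum(frames[:-1] * frames[1:], axis=1)))


def needs_review(frames, threshold=1e-3):
    """Flag shots whose frames barely change, where pair fingerprints are weak."""
    return temporal_variation(frames) < threshold


def unit(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)


rng = np.random.default_rng(4)
base = unit(rng.normal(size=(1, 64)))

# Locked-off shot: the same frame plus a trace of sensor-like noise.
static_shot = unit(base + rng.normal(scale=0.001, size=(50, 64)))
# Moving shot: the embedding drifts from frame to frame.
moving_shot = unit(base + np.cumsum(rng.normal(scale=0.02, size=(50, 64)), axis=0))

print(needs_review(static_shot), needs_review(moving_shot))
```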

Production architecture

The prototype turned into two pieces of software, both meant to be lightweight to deploy.

fl-enrich, a small terminal app for vectorising source and offline footage. It calls SSCD through a Swift binary that runs CoreML inference in parallel across the Neural Engine and GPU on an Apple Silicon Mac. Six worker processes feed an FFmpeg fallback for codecs that AVAssetReader can’t decode. End-to-end throughput on an M3 Ultra: around 420 frames per second cumulative across 25 parallel workers, with decode pipelined alongside inference.

Embeddings land in a LanceDB directory. Per-clip keys are basenames. The vector store is portable — a directory on disk, no service to run, no OS-level dependency, just pip install. The whole DB can ship to wherever the colourist is grading.
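Keying by basename is what lets the same store resolve on machines with different mount points. A small sketch of that idea, with hypothetical paths and helper names, plus the one check it makes necessary: two source files in different folders that share a name would collide.

```python
from pathlib import PureWindowsPath
from typing import Dict, Iterable, List


def clip_key(path: str) -> str:
    """Basename of a clip path. PureWindowsPath splits on both / and \\."""
    return PureWindowsPath(path).name


def find_collisions(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Map each basename that appears under more than one path to those paths."""
    by_key: Dict[str, List[str]] = {}
    for p in paths:
        bucket = by_key.setdefault(clip_key(p), [])
        if p not in bucket:
            bucket.append(p)
    return {k: v for k, v in by_key.items() if len(v) > 1}


# Hypothetical paths: the same clip seen from two machines, plus a clash.
ingest_path = "/Volumes/raid/day01/A001C003.mov"
grading_path = r"D:\projects\show\day01\A001C003.mov"
same_key = clip_key(ingest_path) == clip_key(grading_path)

collisions = find_collisions([
    "/Volumes/raid/day01/A001C003.mov",
    "/Volumes/raid/day02/A001C003.mov",
    "/Volumes/raid/day02/A002C001.mov",
])
print(same_key, collisions)
```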

A FLAPI package inside Baselight runs the fallback. The colourist runs metadata conform the usual way and aims for a clean 100% match. Anything that fails gets marked. Scene → Visual Conform Fix → Match walks each missing shot, runs the three-stage matcher against the LanceDB, and inserts the best-matching source clip below the missing position. Each insertion is colour-coded by verdict: green for a frame-accurate match, yellow for partial overlap, orange for the right clip in the wrong region, red for a wrong clip.
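The four verdicts map onto a small classifier when a reference range is known, as it is in the stress test. This sketch assumes spans are (clip, in frame, out frame) with the out frame exclusive; the function name and span layout are illustrative, and it doesn't touch FLAPI.

```python
from typing import Tuple

Span = Tuple[str, int, int]  # (clip, in_frame, out_frame), out_frame exclusive


def verdict(pred: Span, truth: Span) -> Tuple[str, float]:
    """Return (colour, fraction of the true frame range covered by the match)."""
    if pred[0] != truth[0]:
        return "red", 0.0  # wrong clip
    overlap = max(0, min(pred[2], truth[2]) - max(pred[1], truth[1]))
    fraction = overlap / (truth[2] - truth[1])
    if pred[1:] == truth[1:]:
        return "green", 1.0  # frame-accurate
    if overlap > 0:
        return "yellow", fraction  # right clip, partial overlap
    return "orange", 0.0  # right clip, wrong region


truth = ("A001C003.mov", 100, 148)
results = [
    verdict(("A001C003.mov", 100, 148), truth),
    verdict(("A001C003.mov", 112, 160), truth),
    verdict(("A001C003.mov", 400, 448), truth),
    verdict(("A002C001.mov", 100, 148), truth),
]
print(results)
```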

A separate Visualize Match step opens an HTML report in the browser. Cards show the inserted strip’s centre frame, a verdict pill, the percent overlap with ground truth, and the seed-hit positions as ticks. Faster to eyeball than stepping through marks on the timeline.

Figure: Three consecutive offline frames anchored by seed matches to a single highlighted segment of a long source-corpus reel.

What I learned

A handful of things landed harder than the rest.

Single-frame matching is hopeless. Consecutive-frame matching is specific. The whole BLAST analogy hinges on this. One frame, on its own, has too many visual lookalikes across takes for the search to converge. Two or three frames as a unit are a fingerprint.

Vectorising upfront pays for itself many times over. Once the source corpus is vectorised, matching is sub-second per missing shot, even against roughly a million pair embeddings on a 30:1 shoot ratio. The expensive work is done once, asynchronously, off the colourist’s session.
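The "roughly a million" figure can be checked against the numbers given earlier in the post. The frame rate is not stated, so 24 fps is an assumption here, and "around eleven hours" is taken as exactly eleven.

```python
HOURS_OF_DAILIES = 11   # "around eleven hours" from the real-EDL test
SOURCE_CLIPS = 354      # from the real-EDL test
THROUGHPUT_FPS = 420    # reported fl-enrich throughput on an M3 Ultra
FPS = 24                # assumption: the post does not state the frame rate

frames = HOURS_OF_DAILIES * 3600 * FPS
# Each clip of n frames yields n - 1 consecutive pairs.
pairs = frames - SOURCE_CLIPS
vectorise_minutes = frames / THROUGHPUT_FPS / 60

print(frames, pairs, round(vectorise_minutes, 1))
```

Under those assumptions the corpus comes to about 950,000 pair embeddings and a little under 40 minutes of one-off vectorising.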

The right embedding for the job isn’t always the strongest embedding. SSCD beat DINOv2 in this role specifically because the deformations it was trained on (compression, recolour, overlay) match the deformations the offline-versus-source comparison demands.

Synthetic stress tests beat hand-curated test sets when you can build them. Ground truth is provable by construction. Every degradation profile becomes a fresh test set on demand, without any of the “is the label correct?” doubt that comes with hand-curated data.

The prototype is currently a demo on a private machine. Not production code, not even close. But the matcher itself, end to end, works.