A computer-vision paper, told as a journey

The Persistence
of Vision

How two flat photographs become depth — and how to solve the NP‑hard problem hiding inside, ten times faster.

Igor Gridchyn & Vladimir Kolmogorov  ·  IST Austria  ·  ICCV 2013
scroll

You have two eyes, about six centimetres apart. Each sends your brain a slightly different flat picture, and from the tiny disagreements between them your visual system reconstructs depth — effortlessly, continuously, before you are even aware of it.

Teaching a computer to do the same is one of the cleaner ways to walk straight into an NP-hard problem. This is the story of a paper I wrote with Vladimir Kolmogorov at IST Austria about doing it anyway — and doing it roughly ten times faster than the state of the art at the time. No background assumed: we will build every idea up from a picture you can poke at.

What this demonstrates

Six short chapters, each with something to play with:

  1. 01 Two eyes, one question Stereo vision as a labeling problem — every pixel picks a depth.
  2. 02 The tug of war Data versus smoothness: the Potts energy, and why it is NP-hard.
  3. 03 A cut is an answer Graph cuts: the one thing we can solve exactly, and fast.
  4. 04 Answers for free Proving part of the optimum without solving the whole problem.
  5. 05 From k cuts to log k The contribution: divide-and-conquer over the label tree.
  6. 06 The payoff Ten times faster — and the cheap answer is often the better one.
The paper

Igor Gridchyn & Vladimir Kolmogorov, “Potts model, parametric maxflow and k-submodular functions” (IST Austria, ICCV 2013). The full PDF is available here. The result images later in this piece are the paper’s own; the interactive toys are illustrative — built to convey each idea rather than reproduce the exact published numbers.

§
Chapter 01

Two eyes, one question

Hold a finger up at arm’s length and look at it while you blink one eye, then the other. The finger jumps sideways against the background. Bring it closer and the jump gets bigger. That jump — the disagreement between your two eyes about where something is — is the entire raw material of depth perception.

A computer sees the same way, given two cameras a few centimetres apart. The cameras are lined up so that any real-world point lands on the same horizontal row in both images; it only ever shifts left or right. That horizontal shift is called the disparity. Near things shift a lot, far things barely move, and the relationship is simple: depth is proportional to one over disparity. Recover the disparity at every pixel and you have recovered the shape of the scene.

So the whole task collapses to a single, repeated question: for this pixel in the left image, how far do I have to slide to find its twin in the right image? The candidate shifts — 0, 1, 2, … up to some maximum — are the labels. Choosing a label for every pixel is the labeling problem at the heart of this paper.

How does a pixel recognise its twin?

By looking at its neighbourhood. We take a small patch around the left pixel and slide it across the candidate positions in the right image, and at each shift we measure how different the two patches are — sum up the squared differences in brightness. The shift where the patches match best (lowest cost) is our answer. That per-pixel, per-label number is the data cost: how much it “hurts” to assign a given disparity to a given pixel.

Interactive · drag the marker on the left image
Left view — pick a pixel
Right view — the match slides along the row
Matching cost vs. disparity — the lowest point wins
Chosen disparity
Implied depth
Cost curve

Drag around and you’ll notice two regimes. On the textured surfaces the cost curve has a sharp, unambiguous minimum — the patch fits in exactly one place. But park the marker on the smooth, near-uniform region at the top and the curve goes flat: dozens of shifts match almost equally well. The data has no opinion. Pick the lowest point anyway and you’re basically guessing.

This is the crack that runs through the whole problem. Choosing every pixel’s lowest-cost label independently gives a crisp answer where the image is rich and pure noise where it isn’t. To do better, the pixels have to stop deciding alone — which is exactly where the next chapter starts.

For the curious

Lining the cameras up so matches stay on one row is called rectification; it follows from epipolar geometry and turns a 2-D search into a cheap 1-D one. The patch-difference score here is the sum of squared differences (SSD); the paper uses SSD aggregated over a 9×9 window, which sharpens those flat curves considerably. Writing fi(a) for the data cost of giving pixel i the disparity label a, everything below is built on top of these numbers. The standard benchmark for this task is the Middlebury stereo dataset (Scharstein & Szeliski, 2002–2007), which the paper’s experiments use.

§
Chapter 02

The tug of war

Letting each pixel pick its own favourite label gives you a depth map that looks like television static — right on average, wrong everywhere it matters. The fix comes from a fact about the world: real surfaces are mostly smooth. Your desk does not change depth every millimetre. Neighbouring pixels almost always belong together.

So we add a second cost that pulls in the opposite direction. Every time two neighbouring pixels are given different labels, we charge a fixed penalty — call it λ. It doesn’t matter whether the labels differ by one or by twenty; any disagreement costs the same flat amount. (That “all disagreements are equal” rule is what makes this the Potts model, borrowed from statistical physics.) The total cost of a labeling is then a sum of two competing forces:

E(x) = Σi fi(xi)  +  λ Σ(i,j) [ xi ≠ xj ] data cost (trust the pixels)  +  smoothness cost (trust the neighbours)

Now no pixel decides alone. Turning λ up makes the labeling pay to disagree, so it cleans up the static by letting confident neighbours overrule noisy ones. Turn it up too far and it starts erasing real detail, flattening small objects into their surroundings because keeping them costs too many border penalties. Somewhere in between is the sweet spot.

Interactive · drag the smoothness slider
Labeling at this λ — crimson = a smoothness penalty paid
Ground truth
Error vs. λ
trust the datatrust smoothness
Error vs. truth
Disagreeing edges
Data cost
Smoothness cost

Watch the crimson lines — they mark every neighbour pair that disagrees, which is precisely where the smoothness cost is being paid. At λ = 0 the map is shot through with them and the error is high. Nudge λ up and the static dissolves, the crimson thins, and the error drops to its lowest. Push further and watch the error climb back up as the small bright region gets swallowed whole. The error-vs-λ curve has a clear valley — a preview of a plot we’ll meet again at the end, drawn from the real experiments.

Here is the catch that makes this a research problem rather than a homework exercise. Finding the labeling that genuinely minimises this energy — the true bottom of the landscape — is NP-hard the moment you have three or more labels. The toy above only fakes it with a quick local heuristic. To do it properly and fast, we need a sharper tool.

For the curious

Formally the energy is f(x) = Σi fi(xi) + Σ{i,j}∈E λij[xi≠xj], where [·] is the Iverson bracket (1 if true, 0 if false). Minimising it is equivalent to a multiway-cut problem and is NP-hard for ≥ 3 labels (Boykov, Veksler & Zabih, 2001), whose α-expansion algorithm is the standard approximation. The interactive uses Iterated Conditional Modes — a simple greedy optimiser that is easy to run live but provably weak; the algorithms in the next three chapters are the real machinery.

§
Chapter 03

A cut is an answer

We left the last chapter with a wall: minimising the energy is NP-hard once there are three or more labels. But hidden inside that hard problem is a much easier one — and it turns out to be the single most useful tool in the whole field. Drop down to just two labels, and the impossible becomes not only solvable but fast.

Picture the simplest version of the task: every pixel is either object or background. Each pixel has a data cost for each choice (dark pixels would rather be background, bright ones object), and neighbours still pay the Potts penalty when they disagree. We want the lowest-cost split. There are 2(number of pixels) possible splits — astronomically many — yet we can find the very best one in a blink. Here is the trick that makes it possible.

Turn the picture into a plumbing network

Build a graph. Add two special nodes: a source standing for “background” and a sink standing for “object.” Wire every pixel to both, with pipe widths equal to its two data costs. Then connect neighbouring pixels to each other with pipes of width λ. Now ask a question that sounds unrelated: what is the cheapest set of pipes you could cut to completely sever the source from the sink?

That minimum cut is the answer. Every pixel ends up on the source side or the sink side — that partition is a labeling. The cost of the cut is exactly the energy of that labeling: severed data pipes are the data costs you pay, and every neighbour-pipe you cut is one disagreeing border. So the cheapest cut is the lowest-energy labeling, and a century-old result (max-flow equals min-cut) lets us find it in polynomial time.

Interactive · drag the smoothness slider
Noisy input — brighter = more “object”
Min-cut segmentation — crimson = the cut
The graph behind it — a cut is a labeling
trust the datatrust smoothness
Total cost (min cut)
Boundary length
Guarantee
global optimum ✓

Slide λ from zero. At zero, the cut just follows each pixel’s own preference and the boundary is ragged, chasing every speck of noise. Raise it and the boundary pays for length, so it tightens around the real object and the noise is overruled — the same tug of war as before, but now solved exactly. The number it reports isn’t a heuristic’s best guess; it is provably the global minimum. There is no better split.

So two labels are a solved problem. The catch, of course, is that stereo has sixty. The cut only knows how to draw one border between two regions, and real depth maps have many. The obvious move — and the one the field took — is to use this two-label superpower as a repeated move: again and again, ask “should any of these pixels switch to label α?” and let a min-cut decide. That is α-expansion. The next two chapters take a different, sharper route with the very same tool.

For the curious

The construction works whenever the pairwise term is submodular — for two labels the Potts penalty always is, so a single min-cut returns the exact global optimum (Greig, Porteous & Seheult, 1989; Kolmogorov & Zabih, 2004). The interactive runs Dinic’s max-flow on the real graph; the cost it prints is the max-flow value, which equals the min cut by the max-flow/min-cut theorem. For more than two labels the problem returns to NP-hard, and α-expansion (Boykov, Veksler & Zabih, 2001) applies this binary cut iteratively — each iteration optimal, the whole only approximate.

§
Chapter 04

Answers for free

Here is a thought that sounds too good to be true. The full problem is NP-hard — but what if, before paying for the expensive solver, you could prove the correct label for most of the pixels using nothing but the cheap two-label cuts from the last chapter? Not guess. Prove. And what if “most” meant eighty or ninety percent of the image?

That is exactly what persistency — also called partial optimality — delivers. The idea, due to Ivan Kovtun, is disarmingly simple. Pick one label, say “disparity 7,” and ask a single binary question of the whole image: label 7, or anything-but-7? That is a two-label problem, so one min-cut answers it exactly. The pixels the cut hands to “7” come with a guarantee: there exists a global optimum of the full problem in which those pixels really are 7.

They are persistent. You can lock them in and never look at them again. Run that binary question once per label, and you peel off a certified piece of the optimal solution each time — for the price of one max-flow apiece.

Interactive · run the auxiliary cuts
Certified so far — grey = not yet proven
Full solution — free + solved
0%
certified for free
Max-flows run
0 / 6
Left for the hard solver
100%
Guarantee
provably optimal ✓
In the real paper — fraction certified for free on Middlebury stereo

Step through it. Each click runs one auxiliary cut and freezes the pixels it can prove, painting them in. After all six, the grey has nearly vanished: the great majority of the image is solved and certified optimal, while only a stubborn residue — the genuinely ambiguous pixels, mostly along object edges — is handed to the expensive solver. That is the whole game: shrink the NP-hard problem before you ever pay for it.

And this isn’t a toy-scale effect. On the standard Middlebury stereo benchmark, the fraction the method certifies for free ranged from about 50% to 93% of the image (the bars above are the paper’s actual numbers). The expensive second phase then only has to solve what little is left — which is why the whole pipeline ends up roughly ten times faster than the previous state of the art. We’ll see those timings in the final chapter.

But there is still something wasteful here. We ran one cut per label — sixty cuts for a sixty-disparity stereo problem. The central contribution of the paper is realising you don’t need sixty. You need about six. That is the next chapter.

For the curious

Persistency for the “one-against-all” auxiliary problem is Kovtun’s result (Kovtun, 2003); the auxiliary energy is constructed so that its min-cut solution is guaranteed to extend to a global optimum of the original, which is what makes the labeled pixels safe to fix. The interactive runs the real auxiliary max-flows; the “full solution” panel is the two-phase pipeline — persistent labels fixed, then α-expansion on the non-persistent remainder, exactly as in the paper — so the certified pixels are always a subset of it. The paper also studies ways to increase the persistent fraction (the “MP” procedure), with the biggest gains at λ = 0.

§
Chapter 05

From k cuts to log k

The previous chapter spent one min-cut per label. That is the wasteful part, and fixing it is the heart of the paper. The fix is an idea you already use every day without naming it: to find a number between 1 and 60, you don’t check all sixty — you ask “higher or lower?” and halve the range each time. Binary search. The trick is doing that search for every pixel at once, with a single min-cut deciding the whole image at each step.

Lay the candidate labels out as the leaves of a balanced binary tree. The root holds the whole range; its two children split it in half; their children split again, and so on down to single labels. Now walk down the tree one level at a time. At each level a single binary question is posed to the entire image — is your label in the lower half of your current range, or the upper? — and, because that is a two-label problem, one min-cut answers it exactly for every pixel simultaneously. Each answer halves every pixel’s range.

Halve sixteen labels four times and you are down to one. The depth map resolves from a flat blank into coarse bands, then finer ones, then full detail — in four cuts rather than sixteen.

Interactive · run the cuts, watch it resolve
Label tree — each cut descends one level (16 labels → depth 4)
Depth map resolving — candidate range halves each cut
SPLIT (this paper)
0
Naive Kovtun
16
max-flows to label the image
Cuts run
0 / 4
Candidates / pixel
16
For a 60-disparity stereo problem the saving is starker still: 7 cuts instead of 60 — that is the ⌈1 + log₂ k⌉ bound.

The accounting is the punchline. Kovtun’s original method needed k max-flows; this needs the depth of the tree, which is about log₂ k. The cuts at any one level act on disjoint sets of pixels — each pixel is in exactly one sub-problem — so a whole level costs no more than a single max-flow on the original image. Sixteen labels: four cuts. Sixty labels: seven. The bigger the label set, the bigger the win, which is exactly why it matters for stereo, where k is large.

That is the contribution in one sentence: the same certified answer Kovtun gives in k max-flows, obtained in ⌈1 + log₂ k⌉ of them. Combine it with the persistency of the last chapter — label most of the image for free, in logarithmically few cuts, then hand the small remainder to a fast solver — and you have the whole pipeline. The final chapter shows what that buys in wall-clock time, and a result that genuinely surprised us.

For the curious

The construction reduces minimising the Potts energy to minimising a function with a tree metric on the label set — a star graph rooted at an auxiliary label — whose unary terms are T-convex. The paper generalises the Tree-Metrics algorithm of Felzenszwalb et al. to these more general unaries; balancing the star into a binary tree (an edge-insertion step) is what makes each split even, giving the ⌈1 + log₂ k⌉ depth. By the coarea formula, the binary cuts across all levels are nested and consistent, so they can equivalently be obtained from a single parametric max-flow (Gallo, Grigoriadis & Tarjan, 1989). The paper also frames the whole thing through k-submodular functions, generalising bisubmodularity and roof duality (QPBO) to more than two labels.

§
Chapter 06

The payoff

Theory is only worth so much; the question a practitioner asks is “how fast, and how good?” So we ran it on the standard testbed for this problem — the Middlebury stereo benchmark — against the fastest published method for the same task. Here is what came back.

Wall-clock runtime on the Middlebury stereo benchmark (milliseconds; shorter is better). Phase 1 of the new method — k-sub Kovtun — is roughly ten times faster than the previous state of the art (“Reduce,” Alahari et al. 2010); the full pipeline with FastPD stays well ahead. Numbers are Table 1 of the paper.

The first phase — labeling most of the image for free, in logarithmically few cuts — runs about ten times faster than the previous state of the art (“Reduce”). Even the complete pipeline, which then cleans up the non-persistent remainder with a fast solver, stays comfortably ahead. On the easy images, where almost everything is certified, the whole thing finishes in a couple of hundred milliseconds.

But speed was expected. The result that genuinely surprised us was about quality. Look at the actual disparity maps — flip between the cheap certified labeling and the full optimisation:

Interactive · flip between the cheap and the full result
Cones disparity map
Cones
Teddy disparity map
Teddy
Tsukuba disparity map
Tsukuba
Venus disparity map
Venus
Showing: Kovtun’s certified labeling

They are nearly indistinguishable. And when we measured error against ground truth, the cheap Kovtun labeling was, in the majority of cases, more accurate than the full α-expansion solution — even though α-expansion reaches a lower energy. That sounds paradoxical until you remember what energy is: a model of the world, not the world itself. Driving the energy to its absolute minimum optimises the model harder, and the model’s small biases harder along with it. The certified pixels are the ones the data was sure about in the first place — so the part we can prove is also the part that is most often right.

The part of the answer we can prove for free turns out to be the part most likely to be correct.

The practical reading is liberating: for a time-critical system — a robot, a camera, anything with a deadline — you can stop after the cheap phase and ship a labeling that is both fast and, on these benchmarks, often better. The expensive global optimum is optional.

Why a stereo paper sits in an AI portfolio

This work is older than the deep-learning era, and stereo depth is largely a learned problem now. I keep it here for what it shows about how I think, which has not dated. Three habits run through it and through everything I have built since. First, find the structure that makes a hard problem easy — the whole paper turns on noticing that a k-way labeling hides a binary search. Second, earn guarantees rather than hope for them: persistency is a proof, not a heuristic, and knowing which of your outputs are certified is exactly the kind of calibrated confidence modern ML systems are still scrambling to provide. Third, respect the gap between the objective and the goal — the surprise in this paper is a small, friendly version of the reward-hacking and over-optimisation problems that define a lot of frontier AI work today.

Faster algorithms, provable partial answers, and a healthy suspicion of the objective you are minimising: that is a toolkit, and it transfers. If your team is working on problems where those instincts would help, I would love to talk.

Get in touch → See the full CV Read the PhD story
For the curious

The full pipeline is “k-sub Kovtun” for phase 1 (the log k persistency) followed by FastPD (Komodakis et al.) on the non-persistent pixels for phase 2; initialising FastPD from Kovtun’s labeling also sped phase 2 up by ~14% over the standard initialisation. The error-rate comparison is measured against the Middlebury ground truth (Scharstein & Szeliski); error over the persistent subset is lower still than over the whole image. Images and timings are from the paper — Gridchyn & Kolmogorov, ICCV 2013.

§