CSE5519 Advances in Computer Vision (Lecture 3)
Reminders
First Example notebook due Sep 18
Project proposal due Sep 23
Continued: A brief history (time) of computer vision
Theme changes
1980s
- “Definitive” detectors
- Edges: Canny (1986); corners: Harris & Stephens (1988)
- Multiscale image representations
- Witkin (1983), Burt & Adelson (1984), Koenderink (1984, 1987), etc.
- Markov Random Field models: Geman & Geman (1984)
- Segmentation by energy minimization
- Kass, Witkin & Terzopoulos (1987), Mumford & Shah (1989)
Conferences, journals, books
- Conferences: ICPR (1973), CVPR (1983), ICCV (1987), ECCV (1990)
- Journals: TPAMI (1979), IJCV (1987)
- Books: Duda & Hart (1972), Marr (1982), Ballard & Brown (1982), Horn (1986)
1980s: The dead ends
- Alignment-based recognition
- Faugeras & Hebert (1983), Grimson & Lozano-Perez (1984), Lowe (1985), Huttenlocher & Ullman (1987), etc.
- Aspect graphs
- Koenderink & Van Doorn (1979), Plantinga & Dyer (1986), Hebert & Kanade (1985), Ikeuchi & Kanade (1988), Gigus & Malik (1990)
- Invariants: Mundy & Zisserman (1992)
1980s: Meanwhile…
- Neocognitron: Fukushima (1980)
- Back-propagation: Rumelhart, Hinton & Williams (1986)
- Origins in control theory and optimization: Kelley (1960), Dreyfus (1962), Bryson & Ho (1969), Linnainmaa (1970)
- Application to neural networks: Werbos (1974)
- Interesting blog post: Backpropagating through time, or: how come backprop wasn't invented earlier?
- Parallel Distributed Processing: Rumelhart et al. (1987)
- Neural networks for digit recognition: LeCun et al. (1989)
1990s
Multi-view geometry, statistical and appearance-based models for recognition, first approaches for (class-specific) object detection
Geometry (mostly) solved
- Fundamental matrix: Faugeras (1992)
- Normalized 8-point algorithm: Hartley (1997)
- RANSAC for robust fundamental matrix estimation: Torr & Murray (1997)
- Bundle adjustment: Triggs et al. (1999)
- Hartley & Zisserman book (2000)
- Projective structure from motion: Faugeras & Luong (2001)
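As an aside (not part of the lecture), these building blocks are now single library calls. Below is a minimal sketch of robust fundamental matrix estimation in the spirit of Torr & Murray (1997), using OpenCV on synthetic correspondences; the camera setup, noise level, and RANSAC threshold are arbitrary assumptions for illustration.

```python
import numpy as np
import cv2

# Synthetic stand-in for matched keypoints from two views
# (in practice these would come from a keypoint detector + matcher).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, (200, 3)) + np.array([0.0, 0.0, 5.0])  # 3D points in front of the cameras
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])                                    # assumed shared intrinsics
R, _ = cv2.Rodrigues(np.array([0.0, 0.2, 0.0]))                    # small rotation for the second camera
t = np.array([1.0, 0.0, 0.0])                                      # baseline along x

def project(points, R, t):
    """Project 3D points with intrinsics K and pose (R, t); returns Nx2 pixel coordinates."""
    cam = R @ points.T + t[:, None]
    pix = K @ cam
    return (pix[:2] / pix[2]).T

pts1 = project(X, np.eye(3), np.zeros(3)).astype(np.float32)
pts2 = project(X, R, t).astype(np.float32)
pts2 += rng.normal(0.0, 0.5, pts2.shape).astype(np.float32)        # pixel noise
pts2[:20] += 50.0                                                  # a few gross outliers: the reason RANSAC is needed

# Robust fundamental matrix estimation (RANSAC over the 8-point algorithm inside OpenCV).
F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
print("F =\n", F)
print("inliers:", int(inlier_mask.sum()), "of", len(pts1))
```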
Data enters the scene
- Appearance-based models
- PCA for face recognition: Turk & Pentland (1991)
- Image manifolds: Murase & Nayar (1995)
Keypoint-based image indexing
- Schmid & Mohr (1996), Lowe (1999)
Constellation models for object categories
- Burl, Weber & Perona (1998), Weber, Welling & Perona (2000)
First sustained use of classifiers and negative data
- Face detectors: Rowley, Baluja & Kanade (1996), Osuna, Freund & Girosi (1997), Schneiderman & Kanade (1998), Viola & Jones (2001)
- Convolutional nets: LeCun et al. (1998)
Graph cut image inference
- Boykov, Veksler & Zabih (1998)
Segmentation
- Normalized cuts: Shi & Malik (2000)
- Berkeley segmentation dataset: Martin et al. (2001)
Video processing
- Layered motion models: Adelson & Wang (1993)
- Robust optical flow: Black & Anandan (1993)
- Probabilistic curve tracking: Isard & Blake (1998)
2000s: Keypoints and reconstruction
Keypoints craze
- Kadir & Brady (2001), Mikolajczyk & Schmid (2002), Matas et al. (2004), Lowe (2004), Bay et al. (2006), etc.
3D reconstruction “in the wild”
- SfM in the wild
- Multi-view stereo, stereo on GPUs
Generic object recognition
- Constellation models
- Bags of features
- Datasets: Caltech-101 -> ImageNet
Generic object detection
- PASCAL dataset
- HOG, Deformable part models
Action and activity recognition: “misc. early efforts”
1990s-2000s: Dead ends (?)
- Probabilistic graphical models
- Perceptual organization
2010s: Deep learning, big data
Compared to earlier approaches, deep learning methods:
- can be more accurate (often much more accurate)
- are faster (often much faster)
- are adaptable to new problems
Deep Convolutional Neural Networks
- Many layers, some of which are convolutional (usually near the input)
- Early layers “extract features”
- Trained using stochastic gradient descent on very large datasets
- Many possible loss functions (depending on task)
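For concreteness (an illustration, not something from the lecture), here is a minimal sketch of such a network in PyTorch, assuming a 10-class classification task on 28x28 grayscale images; the layer sizes, learning rate, and dummy data are arbitrary choices.

```python
import torch
import torch.nn as nn

# Minimal deep convolutional network: convolutional layers near the input
# "extract features"; fully connected layers near the output classify them.
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                              # 28x28 -> 14x14
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                              # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 128), nn.ReLU(),
    nn.Dropout(0.5),                              # dropout regularization
    nn.Linear(128, 10),                           # 10 class scores
)

loss_fn = nn.CrossEntropyLoss()                   # one of many possible task-dependent losses
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# One stochastic gradient descent step on a (dummy) mini-batch.
images = torch.randn(64, 1, 28, 28)               # stand-in for a batch of real images
labels = torch.randint(0, 10, (64,))
loss = loss_fn(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```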
Additional benefits:
- High-quality software frameworks
- “New” network layers
- Dropout (can be viewed as simultaneously training many models with shared weights)
- ReLU activation (enables faster training because gradients don't saturate toward zero the way sigmoid/tanh gradients do; see the sketch after this list)
- Bigger datasets
- reduce overfitting
- improve robustness
- enable larger, deeper networks
- Deeper networks eliminate the need for hand-engineered features
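A tiny illustration (not from the lecture) of the ReLU point: at a large pre-activation value the sigmoid gradient is nearly zero while the ReLU gradient is 1, so ReLU units keep learning where saturating units stall. The value 5.0 is an arbitrary choice.

```python
import torch

# Gradients of sigmoid vs. ReLU at a large pre-activation value.
x = torch.tensor([5.0, 5.0], requires_grad=True)
y = torch.sigmoid(x[0]) + torch.relu(x[1])
y.backward()

print("sigmoid'(5) =", x.grad[0].item())  # ~0.0066: nearly zero, learning stalls
print("relu'(5)    =", x.grad[1].item())  # 1.0: gradient passes through unchanged
```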
Where did we go wrong?
In retrospect, computer vision has had several periods of “spinning its wheels”
- We’ve always prioritized methods that could already do interesting things over potentially more promising methods that could not yet deliver
- We’ve undervalued simple methods, data, and learning
- When nothing worked, we distracted ourselves with fancy math
- On a few occasions, we unaccountably ignored methods that later proved to be “game changers” (RANSAC, SIFT)
- We’ve had some problems with bandwagon jumping and intellectual snobbery
But it’s not clear whether any of it mattered in the end.