CSE5519 Advances in Computer Vision (Lecture 3)
Reminders
First Example notebook due Sep 18
Project proposal due Sep 23
Continued: A brief history (time) of computer vision
Theme changes
1980s
- “Definitive” detectors
- Edges: Canny (1986); corners: Harris & Stephens (1988)
- Multiscale image representations
- Witkin (1983), Burt & Adelson (1984), Koenderink (1984, 1987), etc.
- Markov Random Field models: Geman & Geman (1984)
- Segmentation by energy minimization
- Kass, Witkin & Terzopoulos (1987), Mumford & Shah (1989)
Conferences, journals, books
- Conferences: ICPR (1973), CVPR (1983), ICCV (1987), ECCV (1990)
- Journals: TPAMI (1979), IJCV (1987)
- Books: Duda & Hart (1972), Marr (1982), Ballard & Brown (1982), Horn (1986)
1980s: The dead ends
- Alignment-based recognition
- Faugeras & Hebert (1983), Grimson & Lozano-Perez (1984), Lowe (1985), Huttenlocher & Ullman (1987), etc.
- Aspect graphs
- Koenderink & Van Doorn (1979), Plantinga & Dyer (1986), Hebert & Kanade (1985), Ikeuchi & Kanade (1988), Gigus & Malik (1990)
- Invariants: Mundy & Zisserman (1992)
1980s: Meanwhile…
- Neocognitron: Fukushima (1980)
- Back-propagation: Rumelhart, Hinton & Williams (1986)
- Origins in control theory and optimization: Kelley (1960), Dreyfus (1962), Bryson & Ho (1969), Linnainmaa (1970)
- Application to neural networks: Werbos (1974)
- Interesting blog post: Backpropagating through time, or: how come backprop wasn't invented earlier?
- Parallel Distributed Processing: Rumelhart et al. (1987)
- Neural networks for digit recognition: LeCun et al. (1989)
1990s
Multi-view geometry, statistical and appearance-based models for recognition, first approaches for (class-specific) object detection
Geometry (mostly) solved
- Fundamental matrix: Faugeras (1992)
- Normalized 8-point algorithm: Hartley (1997)
- RANSAC for robust fundamental matrix estimation: Torr & Murray (1997)
- Bundle adjustment: Triggs et al. (1999)
- Hartley & Zisserman book (2000)
- Projective structure from motion: Faugeras & Luong (2001)
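As an aside (not part of the lecture), these building blocks are now single library calls. Below is a minimal sketch of robust fundamental matrix estimation in the spirit of Torr & Murray (1997), using OpenCV on synthetic correspondences; the camera setup, noise level, and RANSAC threshold are arbitrary assumptions for illustration.

```python
import numpy as np
import cv2

# Synthetic stand-in for matched keypoints from two views
# (in practice these would come from a keypoint detector + matcher).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, (200, 3)) + np.array([0.0, 0.0, 5.0])  # 3D points in front of the cameras
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])                                    # assumed shared intrinsics
R, _ = cv2.Rodrigues(np.array([0.0, 0.2, 0.0]))                    # small rotation for the second camera
t = np.array([1.0, 0.0, 0.0])                                      # baseline along x

def project(points, R, t):
    """Project 3D points with intrinsics K and pose (R, t); returns Nx2 pixel coordinates."""
    cam = R @ points.T + t[:, None]
    pix = K @ cam
    return (pix[:2] / pix[2]).T

pts1 = project(X, np.eye(3), np.zeros(3)).astype(np.float32)
pts2 = project(X, R, t).astype(np.float32)
pts2 += rng.normal(0.0, 0.5, pts2.shape).astype(np.float32)        # pixel noise
pts2[:20] += 50.0                                                  # a few gross outliers: the reason RANSAC is needed

# Robust fundamental matrix estimation (RANSAC over the 8-point algorithm inside OpenCV).
F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
print("F =\n", F)
print("inliers:", int(inlier_mask.sum()), "of", len(pts1))
```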
Data enters the scene
- Appearance-based models
- PCA for face recognition: Turk & Pentland (1991)
- Image manifolds: Murase & Nayar (1995)
Keypoint-based image indexing
- Schmid & Mohr (1996), Lowe (1999)
Constellation models for object categories
- Burl, Weber & Perona (1998), Weber, Welling & Perona (2000)
First sustained use of classifiers and negative data
- Face detectors: Rowley, Baluja & Kanade (1996), Osuna, Freund & Girosi (1997), Schneiderman & Kanade (1998), Viola & Jones (2001)
- Convolutional nets: LeCun et al. (1998)
Graph cut image inference
- Boykov, Veksler & Zabih (1998)
Segmentation
- Normalized cuts: Shi & Malik (2000)
- Berkeley segmentation dataset: Martin et al. (2001)
Video processing
- Layered motion models: Adelson & Wang (1993)
- Robust optical flow: Black & Anandan (1993)
- Probabilistic curve tracking: Isard & Blake (1998)
2000s: Keypoints and reconstruction
Keypoints craze
- Kadir & Brady (2001), Mikolajczyk & Schmid (2002), Matas et al. (2004), Lowe (2004), Bay et al. (2006), etc.
3D reconstruction “in the wild”
- SfM in the wild
- Multi-view stereo, stereo on GPUs
Generic object recognition
- Constellation models
- Bags of features
- Datasets: Caltech-101 -> ImageNet
Generic object detection
- PASCAL dataset
- HOG, Deformable part models
Action and activity recognition: “misc. early efforts”
1990s-2000s: Dead ends (?)
- Probabilistic graphical models
- Perceptual organization
2010s: Deep learning, big data
Compared to earlier approaches, deep learning methods:
- can be more accurate (often much more accurate)
- are faster (often much faster)
- are adaptable to new problems
Deep Convolutional Neural Networks
- Many layers, some of which are convolutional (usually near the input)
- Early layers “extract features”
- Trained using stochastic gradient descent on very large datasets
- Many possible loss functions (depending on task)
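For concreteness (an illustration, not something from the lecture), here is a minimal sketch of such a network in PyTorch, assuming a 10-class classification task on 28x28 grayscale images; the layer sizes, learning rate, and dummy data are arbitrary choices.

```python
import torch
import torch.nn as nn

# Minimal deep convolutional network: convolutional layers near the input
# "extract features"; fully connected layers near the output classify them.
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                              # 28x28 -> 14x14
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                              # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 128), nn.ReLU(),
    nn.Dropout(0.5),                              # dropout regularization
    nn.Linear(128, 10),                           # 10 class scores
)

loss_fn = nn.CrossEntropyLoss()                   # one of many possible task-dependent losses
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# One stochastic gradient descent step on a (dummy) mini-batch.
images = torch.randn(64, 1, 28, 28)               # stand-in for a batch of real images
labels = torch.randint(0, 10, (64,))
loss = loss_fn(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```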
Additional benefits:
- High-quality software frameworks
- “New” network layers
- Dropout (can be viewed as simultaneously training many models with shared weights)
- ReLU activation (enables faster training because gradients don't saturate toward zero the way sigmoid/tanh gradients do; see the sketch after this list)
- Bigger datasets
- reduce overfitting
- improve robustness
- enable larger, deeper networks
- Deeper networks eliminate the need for hand-engineered features
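A tiny illustration (not from the lecture) of the ReLU point: at a large pre-activation value the sigmoid gradient is nearly zero while the ReLU gradient is 1, so ReLU units keep learning where saturating units stall. The value 5.0 is an arbitrary choice.

```python
import torch

# Gradients of sigmoid vs. ReLU at a large pre-activation value.
x = torch.tensor([5.0, 5.0], requires_grad=True)
y = torch.sigmoid(x[0]) + torch.relu(x[1])
y.backward()

print("sigmoid'(5) =", x.grad[0].item())  # ~0.0066: nearly zero, learning stalls
print("relu'(5)    =", x.grad[1].item())  # 1.0: gradient passes through unchanged
```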
Where did we go wrong?
In retrospect, computer vision has had several periods of “spinning its wheels”
- We’ve always prioritized methods that could already do interesting things over potentially more promising methods that could not yet deliver
- We’ve undervalued simple methods, data, and learning
- When nothing worked, we distracted ourselves with fancy math
- On a few occasions, we unaccountably ignored methods that later proved to be “game changers” (RANSAC, SIFT)
- We’ve had some problems with bandwagon jumping and intellectual snobbery
But it’s not clear whether any of it mattered in the end.