
CSE559A Lecture 11

Continuing with CNN architectures

AlexNet (2012-2013)

Successor of LeNet-5, but with a few significant changes

  • Max pooling, ReLU nonlinearity
  • Dropout regularization
  • More data and bigger model (7 hidden layers, 650K units, 60M params)
  • GPU implementation (50x speedup over CPU)
    • Trained on two GPUs for a week

Architecture for AlexNet

  • Input: 224x224x3
  • 11x11 conv, stride 4, 96 filters
  • 3x3 max pooling, stride 2
  • 5x5 conv, 256 filters, padding 2
  • 3x3 max pooling, stride 2
  • 3x3 conv, 384 filters, padding 1
  • 3x3 conv, 384 filters, padding 1
  • 3x3 conv, 256 filters, padding 1
  • 3x3 max pooling, stride 2
  • 4096-unit FC, ReLU
  • 4096-unit FC, ReLU
  • 1000-unit FC, softmax
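
For concreteness, here is a minimal PyTorch sketch of the layer sequence above (my own rendering, not the original two-GPU implementation; the first conv is padded by 2, as torchvision does, so a 224x224 input produces the usual 6x6x256 feature map before the classifier):

```python
import torch
import torch.nn as nn

# Layer sequence from the list above. Padding the first conv by 2 roughly
# matches the 227x227 crops used in practice by the original implementation.
alexnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(4096, 1000),  # logits; softmax is folded into the loss
)

print(alexnet(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```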

Key points for AlexNet

  • Most floating point operations occur in the convolutional layers.
  • Most of the memory usage is in the early convolutional layers.
  • Nearly all parameters are in the fully-connected layers.
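
A rough back-of-the-envelope check of these three claims, using the first conv layer and the first FC layer as representatives (counts are approximate and ignore biases):

```python
# conv1: 11x11x3 filters, 96 of them, producing a 55x55x96 output.
conv1_params = 11 * 11 * 3 * 96        # ~35K parameters
conv1_flops  = conv1_params * 55 * 55  # ~105M multiply-adds
conv1_memory = 55 * 55 * 96            # ~290K output activations

# fc6: flattened 6x6x256 input -> 4096 units.
fc6_params = 6 * 6 * 256 * 4096        # ~38M parameters
fc6_flops  = fc6_params                # one multiply-add per weight
fc6_memory = 4096                      # output activations

print(f"conv1: {conv1_params:,} params, {conv1_flops:,} MACs, {conv1_memory:,} activations")
print(f"fc6:   {fc6_params:,} params, {fc6_flops:,} MACs, {fc6_memory:,} activations")
```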

Further refinement (ZFNet, 2013)

Won the ILSVRC 2013 classification challenge.

Introduced deconvolution-based visualizations of the intermediate feature maps.

VGGNet (2014)

All the conv layers use 3x3 filters with stride 1 and padding 1; spatial dimensionality is reduced only by the pooling layers.

Architecture for VGGNet

  • Input: 224x224x3
  • 3x3 conv, 64 filters, padding 1
  • 3x3 conv, 64 filters, padding 1
  • 2x2 max pooling, stride 2
  • 3x3 conv, 128 filters, padding 1
  • 3x3 conv, 128 filters, padding 1
  • 2x2 max pooling, stride 2
  • 3x3 conv, 256 filters, padding 1
  • 3x3 conv, 256 filters, padding 1
  • 3x3 conv, 256 filters, padding 1
  • 2x2 max pooling, stride 2
  • 3x3 conv, 512 filters, padding 1
  • 3x3 conv, 512 filters, padding 1
  • 3x3 conv, 512 filters, padding 1
  • 2x2 max pooling, stride 2
  • 3x3 conv, 512 filters, padding 1
  • 3x3 conv, 512 filters, padding 1
  • 3x3 conv, 512 filters, padding 1
  • 2x2 max pooling, stride 2
  • 4096-unit FC, ReLU
  • 4096-unit FC, ReLU
  • 1000-unit FC, softmax
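
Because the design is so regular, the conv stack can be generated from a short configuration list; a minimal PyTorch sketch of VGG-16 (the config shorthand and helper function are my own, with 'M' marking a max pool):

```python
import torch
import torch.nn as nn

# Output channels of 3x3/stride-1/pad-1 convs; 'M' is a 2x2/stride-2 max pool.
VGG16_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
             512, 512, 512, 'M', 512, 512, 512, 'M']

def make_vgg_features(cfg, in_channels=3):
    layers = []
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers += [nn.Conv2d(in_channels, v, kernel_size=3, padding=1), nn.ReLU()]
            in_channels = v
    return nn.Sequential(*layers)

vgg16 = nn.Sequential(
    make_vgg_features(VGG16_CFG),
    nn.Flatten(),                       # 512 x 7 x 7 after five pooling stages
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(4096, 1000),
)

print(vgg16(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```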

Key points for VGGNet

  • Sequence of deeper networks trained progressively
  • Large receptive fields replaced by successive layers of 3x3 convs with ReLU in between
    • A single 7x7 conv takes $49K^2$ parameters; three stacked 3x3 convs cover the same receptive field with only $27K^2$ parameters (for $K$ input and $K$ output channels)
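
A quick numerical check of that comparison for an example channel count:

```python
K = 256  # example number of input and output channels

# One 7x7 conv covering a 7x7 receptive field:
params_7x7 = 7 * 7 * K * K          # 49 K^2

# Three stacked 3x3 convs (same 7x7 receptive field, two extra ReLUs):
params_3x3 = 3 * (3 * 3 * K * K)    # 27 K^2

print(params_7x7, params_3x3)       # 3211264 1769472
```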

Pretrained models

  • Use the pretrained network as a feature extractor: remove the last layer and train a new linear layer on top (transfer learning)
    • Add RNN layers to generate captions
  • Fine-tune the model for the new task (finetuning)
    • Keep the earlier layers fixed and only train the new prediction layer
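
A minimal sketch of both options using torchvision (assuming a recent torchvision with the `weights` API; ResNet-18 and `num_classes = 10` are just placeholders for whatever backbone and task are actually used):

```python
import torch.nn as nn
from torchvision import models

num_classes = 10  # hypothetical new task

# Feature extraction: freeze the pretrained backbone, train only a new head.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for p in model.parameters():
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new layer, trained from scratch

# Fine-tuning: start from the same pretrained weights but leave all layers
# trainable (often with a smaller learning rate for the pretrained layers).
model_ft = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model_ft.fc = nn.Linear(model_ft.fc.in_features, num_classes)
```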

GoogLeNet (2014)

A stem network at the start aggressively downsamples the input.

Key points for GoogLeNet

  • Parallel paths with different receptive field sizes and operations are meant to capture sparse patterns of correlations in the stack of feature maps
  • Use 1x1 convs to reduce dimensionality
  • Use Global Average Pooling (GAP) to replace the fully connected layer
  • Auxiliary classifiers to improve training
    • Training using only the loss at the end of the network didn’t work well: the network is too deep, and the gradients don’t provide useful model updates
    • As a hack, attach “auxiliary classifiers” at several intermediate points in the network that also try to classify the image and receive a loss
    • GoogLeNet predates batch normalization; with batch normalization, the auxiliary classifiers were no longer needed and were removed
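
A minimal sketch of one Inception module, showing the parallel branches, the 1x1 reductions, and the GAP head (channel counts are illustrative; the real GoogLeNet stacks several of these modules):

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 / pool branches, concatenated along channels.
    1x1 convs reduce the channel count before the expensive 3x3 and 5x5 convs."""
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU())
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(),
                                nn.Conv2d(c3_red, c3, 3, padding=1), nn.ReLU())
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(),
                                nn.Conv2d(c5_red, c5, 5, padding=2), nn.ReLU())
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU())

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

m = InceptionModule(192, 64, 96, 128, 16, 32, 32)
print(m(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])

# Global average pooling plus a single linear layer replaces the large FC
# layers of AlexNet/VGG (GoogLeNet's final feature maps have 1024 channels).
gap_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(1024, 1000))
```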

ResNet (2015)

Up to 152 layers (ResNet-152) for the ImageNet models.

ResNet paper: “Deep Residual Learning for Image Recognition” (He et al., 2015)

Key points for ResNet

  • The residual module
    • Introduce skip or shortcut connections to avoid the degradation problem
    • Make it easy for network layers to represent the identity mapping
  • Directly performing 3×3 convolutions with 256 feature maps at input and output:
    • $256 \times 256 \times 3 \times 3 \approx 600K$ operations
    • Using 1×1 convolutions to reduce 256 to 64 feature maps, followed by 3×3 convolutions, followed by 1×1 convolutions to expand back to 256 maps:
      • $256 \times 64 \times 1 \times 1 \approx 16K$
      • $64 \times 64 \times 3 \times 3 \approx 36K$
      • $64 \times 256 \times 1 \times 1 \approx 16K$
      • Total $\approx 70K$
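
A minimal sketch of this bottleneck residual block (batch norm placement and names are mine, not the exact torchvision implementation):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 expand, with an identity shortcut."""
    def __init__(self, channels=256, bottleneck=64):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, bottleneck, 1, bias=False), nn.BatchNorm2d(bottleneck), nn.ReLU(),
            nn.Conv2d(bottleneck, bottleneck, 3, padding=1, bias=False), nn.BatchNorm2d(bottleneck), nn.ReLU(),
            nn.Conv2d(bottleneck, channels, 1, bias=False), nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        # Skip connection: the layers only have to learn the residual F(x),
        # so representing the identity mapping is easy (F(x) = 0).
        return self.relu(self.block(x) + x)

print(Bottleneck()(torch.randn(1, 256, 56, 56)).shape)  # torch.Size([1, 256, 56, 56])
```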

Possibly the first model with a top-5 error rate better than human performance.

Beyond ResNet (2016 and onward): Wide ResNet, ResNeXt, DenseNet

Wide ResNet

Reduce number of residual blocks, but increase number of feature maps in each block

  • More parallelizable, better feature reuse
  • 16-layer WRN outperforms 1000-layer ResNets, though with much larger # of parameters

ResNeXt

  • Propose “cardinality” as a new factor in network design, apart from depth and width
  • Claim that increasing cardinality is a better way to increase capacity than increasing depth or width
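
In practice, cardinality is implemented with grouped convolutions; a small illustration of the parameter savings (channel counts are illustrative):

```python
import torch.nn as nn

# Cardinality 32: one grouped 3x3 conv is equivalent to 32 parallel paths,
# each seeing 128/32 = 4 input channels and producing 4 output channels.
dense_3x3   = nn.Conv2d(128, 128, kernel_size=3, padding=1)
grouped_3x3 = nn.Conv2d(128, 128, kernel_size=3, padding=1, groups=32)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(dense_3x3), count(grouped_3x3))  # 147584 vs 4736 (weights + biases)
```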

DenseNet

  • Dense blocks: within a block, each layer takes as input the concatenation of all preceding layers’ feature maps (see the sketch below)
  • Fewer parameters than ResNet
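
A minimal sketch of the dense connectivity pattern (growth rate and layer count are illustrative; the real DenseNet-BC adds batch norm, 1x1 bottlenecks, and transition layers between blocks):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer receives the concatenation of all previous feature maps."""
    def __init__(self, in_ch, growth_rate=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch + i * growth_rate, growth_rate, 3, padding=1),
                nn.ReLU(),
            )
            for i in range(num_layers)
        ])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)  # in_ch + num_layers * growth_rate channels

print(DenseBlock(64)(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 192, 32, 32])
```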

Next class:

Transformer architectures
