CSE559A Lecture 15
Continuing with object detection.
Two strategies for object detection
R-CNN: Region proposals + CNN features

Fast R-CNN: CNN features + RoI pooling

Region of interest pooling
Use bilinear interpolation to sample the conv features of each proposal at a fixed grid of points.

Bilinear sampling is differentiable, so gradients can be backpropagated through the pooled features and the network trains end to end.
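
Concretely, each sampled value is a weighted blend of the four nearest feature-map entries; because the blend is a smooth function of the sampling coordinates, gradients flow through it. A minimal NumPy sketch (hypothetical helper names, no bounds checking):

import numpy as np

def bilinear_sample(feat, y, x):
    """Sample feature map feat (H, W, C) at fractional location (y, x)
    by blending the four surrounding grid points (sketch, no bounds checks)."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = y0 + 1, x0 + 1
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feat[y0, x0] +
            (1 - wy) * wx       * feat[y0, x1] +
            wy       * (1 - wx) * feat[y1, x0] +
            wy       * wx       * feat[y1, x1])

def roi_pool(feat, y_min, x_min, y_max, x_max, out=7):
    # pool an RoI to a fixed out x out grid by sampling at evenly spaced points
    ys = np.linspace(y_min, y_max, out)
    xs = np.linspace(x_min, x_max, out)
    return np.stack([[bilinear_sample(feat, y, x) for x in xs] for y in ys])
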
New material
Faster R-CNN
Use one network to generate region proposals and another to classify them; in practice the two share the same backbone conv features.
Region proposal network
Idea: put an “anchor box” of fixed size over each position in the feature map and try to predict whether this box is likely to contain an object.
Introduce anchor boxes at multiple scales and aspect ratios to handle a wider range of object sizes and shapes.
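
As a concrete sketch, the anchors for a stride-16 feature map can be enumerated as follows (NumPy; the scales and aspect ratios follow the Faster R-CNN defaults, but other choices work too):

import numpy as np

def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Enumerate 9 anchors (3 scales x 3 aspect ratios) centered at each
    feature-map position, in image coordinates (y1, x1, y2, x2)."""
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            cy, cx = (i + 0.5) * stride, (j + 0.5) * stride
            for s in scales:
                for r in ratios:
                    h, w = s * np.sqrt(r), s / np.sqrt(r)  # keeps area close to s**2
                    anchors.append([cy - h / 2, cx - w / 2, cy + h / 2, cx + w / 2])
    return np.array(anchors)  # shape (feat_h * feat_w * 9, 4)

anchors = make_anchors(38, 50)  # e.g., stride-16 features of a 600x800 image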

Single-stage and multi-resolution detection
YOLO
You only look once (YOLO) is a state-of-the-art, real-time object detection system.
- Take conv feature maps at 7x7 resolution
- Add two FC layers to predict, at each location, a score for each class and 2 bboxes with confidences
For PASCAL VOC, the output is 7×7×30 (30 = 20 classes + 2 boxes × (4 coordinates + 1 confidence))

YOLO Network Head
import tensorflow as tf
from tensorflow.keras import Sequential, Input
from tensorflow.keras.layers import Conv2D, Flatten, Dense, Dropout, Reshape
from tensorflow.keras.regularizers import l2

model = Sequential()
model.add(Input(shape=(7, 7, 1024)))  # backbone conv feature map (assumed shape)
# 'same' padding keeps the 7x7 spatial size; tf.nn.leaky_relu replaces the invalid 'lrelu' string
model.add(Conv2D(1024, (3, 3), padding='same', activation=tf.nn.leaky_relu, kernel_regularizer=l2(0.0005)))
model.add(Conv2D(1024, (3, 3), padding='same', activation=tf.nn.leaky_relu, kernel_regularizer=l2(0.0005)))
model.add(Flatten())  # flatten so the FC layers can reason over the whole image
model.add(Dense(512))
model.add(Dense(1024))
model.add(Dropout(0.5))
model.add(Dense(7 * 7 * 30, activation='sigmoid'))
model.add(Reshape((7, 7, 30)))  # plain Keras Reshape in place of the custom YOLO_Reshape layer
model.summary()
YOLO results
- Each grid cell predicts only two boxes and can only have one class – this limits the number of nearby objects that can be predicted
- Localization accuracy suffers compared to Fast(er) R-CNN due to coarser features, errors on small boxes
- 7x speedup over Faster R-CNN (45-155 FPS vs. 7-18 FPS)
YOLOv2
- Remove the FC layers and do convolutional prediction with anchor boxes instead (see the sketch after this list)
- Increase resolution of input images and conv feature maps
- Improve accuracy using batch normalization and other tricks
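
What "convolutional prediction with anchor boxes" looks like in code: a single 1×1 conv emits, at every feature-map position, box offsets and an objectness score for each of A anchors plus class scores. A Keras sketch (A = 5 and 20 classes, as YOLOv2 uses on PASCAL):

from tensorflow.keras import layers

A, num_classes = 5, 20  # anchors per cell and class count (YOLOv2's PASCAL setting)
# applied to the final conv feature map, e.g. (batch, 13, 13, 1024):
head = layers.Conv2D(A * (4 + 1 + num_classes), (1, 1))
# output (batch, 13, 13, 125): per anchor, 4 box offsets, 1 objectness, 20 class scores
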
SSD
SSD (Single Shot MultiBox Detector) is a single-stage, multi-resolution object detector.

- Predict boxes of different size from different conv maps
- Each level of resolution has its own predictor (see the sketch after this list)
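
Schematically, each selected conv map gets its own small convolutional predictor. A Keras sketch (shapes, channel widths, and anchor counts are illustrative, not SSD's exact configuration):

import tensorflow as tf
from tensorflow.keras import layers

num_classes, A = 20, 4  # class count and anchors per location (illustrative)

def make_predictor():
    # 3x3 conv emitting, per anchor, class scores and 4 box offsets at every position
    return layers.Conv2D(A * (num_classes + 4), (3, 3), padding='same')

maps = [tf.random.normal((1, s, s, 256)) for s in (38, 19, 10, 5)]  # stand-in conv maps
preds = [make_predictor()(m) for m in maps]  # an independent predictor per resolution
for p in preds:
    print(p.shape)  # (1, 38, 38, 96), (1, 19, 19, 96), (1, 10, 10, 96), (1, 5, 5, 96)
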
Feature Pyramid Network
- Improve predictive power of lower-level feature maps by adding contextual information from higher-level feature maps
- Predict different sizes of bounding boxes from different levels of the pyramid, but share the parameters of the predictors (see the sketch after this list)
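
A sketch of the top-down pathway with lateral connections (Keras; spatial sizes and channel counts are assumptions):

import tensorflow as tf
from tensorflow.keras import layers

# backbone feature maps, fine to coarse (random stand-ins)
c3 = tf.random.normal((1, 52, 52, 512))
c4 = tf.random.normal((1, 26, 26, 1024))
c5 = tf.random.normal((1, 13, 13, 2048))

# lateral 1x1 convs project every level to a common channel width
lat3, lat4, lat5 = (layers.Conv2D(256, 1) for _ in range(3))

p5 = lat5(c5)
p4 = lat4(c4) + layers.UpSampling2D(2)(p5)  # add upsampled coarser-level context
p3 = lat3(c3) + layers.UpSampling2D(2)(p4)

# one predictor applied to every pyramid level (shared parameters)
shared_head = layers.Conv2D(9 * (4 + 1), 3, padding='same')
outputs = [shared_head(p) for p in (p3, p4, p5)]
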
RetinaNet
RetinaNet combines a feature pyramid network with the focal loss, which down-weights the standard cross-entropy loss for well-classified examples.

Cross-entropy loss: CE(p_t) = −log(p_t), where p_t is the predicted probability of the true class.
The focal loss is defined as: FL(p_t) = −(1 − p_t)^γ log(p_t).
We can increase γ to reduce the loss for well-classified examples (p_t close to 1).
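
A minimal sketch of the focal loss (NumPy; binary case with γ = 2 as in the RetinaNet paper; the class-balancing weight α is omitted):

import numpy as np

def focal_loss(p, y, gamma=2.0):
    """p: predicted probability of the positive class, y: 0/1 label.
    The (1 - p_t)**gamma factor down-weights already well-classified examples."""
    p_t = np.where(y == 1, p, 1 - p)  # probability of the true class
    return -((1 - p_t) ** gamma) * np.log(p_t)

# a well-classified example contributes far less than under plain cross-entropy:
print(focal_loss(np.array(0.9), np.array(1)))  # ~0.001, vs CE -log(0.9) ~ 0.105
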
YOLOv3
Minor refinements
Alternative approaches
CornerNet
Represent each bounding box by a pair of corners (top-left and bottom-right).
Use an hourglass network to predict heatmaps for the corner locations, then group matching corners into boxes.
CenterNet
Represent each object by its center point and regress the box size from the features at that point.
Detection Transformer
Use a transformer architecture to detect objects.

DETR uses a conventional CNN backbone to learn a 2D representation of an input image. The model flattens it and supplements it with a positional encoding before passing it into a transformer encoder. A transformer decoder then takes as input a small fixed number of learned positional embeddings, which we call object queries, and additionally attends to the encoder output. We pass each output embedding of the decoder to a shared feed forward network (FFN) that predicts either a detection (class and bounding box) or a “no object” class.
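
The paragraph above maps fairly directly onto code. A minimal single-layer sketch in Keras (not the official implementation; layer norms and the transformer FFN sublayers are omitted, and all sizes are assumptions):

import tensorflow as tf
from tensorflow.keras import layers

d_model, num_queries, num_classes = 256, 100, 20

feats = tf.random.normal((1, 25, 25, 2048))   # CNN backbone output (stand-in)

proj = layers.Conv2D(d_model, 1)(feats)       # 1x1 conv reduces channels: (1, 25, 25, 256)
tokens = tf.reshape(proj, (1, -1, d_model))   # flatten to a sequence: (1, 625, 256)

pos = tf.Variable(tf.random.normal((1, 625, d_model)))  # learned positional encoding
memory = tokens + pos

enc_attn = layers.MultiHeadAttention(num_heads=8, key_dim=d_model // 8)
memory = memory + enc_attn(memory, memory)    # encoder self-attention over image tokens

queries = tf.Variable(tf.random.normal((1, num_queries, d_model)))  # object queries

dec_self = layers.MultiHeadAttention(num_heads=8, key_dim=d_model // 8)
dec_cross = layers.MultiHeadAttention(num_heads=8, key_dim=d_model // 8)
q = queries + dec_self(queries, queries)      # decoder self-attention among queries
q = q + dec_cross(q, memory)                  # cross-attention to the encoder output

class_logits = layers.Dense(num_classes + 1)(q)   # per-query class scores incl. "no object"
boxes = layers.Dense(4, activation='sigmoid')(q)  # per-query normalized (cx, cy, w, h)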