
CSE559A Lecture 8

Paper review sharing.

Recap: Three ways to think about linear classifiers

Geometric view: Hyperplanes in the feature space

Algebraic view: Linear functions of the features

Visual view: One template per class
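
A minimal NumPy sketch of how the three views line up for a score function $s = Wx + b$; the class count and feature dimension below are illustrative, not from the lecture.

```python
import numpy as np

# Illustrative shapes: 10 classes, 3072-dimensional flattened inputs.
num_classes, dim = 10, 3072
W = np.random.randn(num_classes, dim) * 0.01   # one row per class
b = np.zeros(num_classes)
x = np.random.randn(dim)                       # a flattened input image

# Algebraic view: scores are a linear function of the features.
scores = W @ x + b

# Visual view: each row of W acts as a template; the score for class c
# is the inner product (similarity) between x and that template.
score_class_0 = W[0] @ x + b[0]

# Geometric view: the decision boundary between classes i and j is the
# hyperplane where (W[i] - W[j]) @ x + (b[i] - b[j]) == 0.
```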

Continuing with linear classification models

Two-layer networks as combinations of templates.

Interpretability is lost as the depth increases.

A two-layer network is a universal approximator: it can approximate any continuous function to arbitrary accuracy. But the hidden layer may need to be huge.
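
A minimal sketch of such a two-layer network, assuming a ReLU non-linearity and illustrative layer sizes; each row of the first weight matrix plays the role of a template whose rectified responses the second layer combines into class scores.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# Illustrative sizes; the universal-approximation result says a wide
# enough hidden layer can approximate any continuous function.
dim, hidden, num_classes = 3072, 512, 10
W1 = np.random.randn(hidden, dim) * 0.01        # hidden "templates"
b1 = np.zeros(hidden)
W2 = np.random.randn(num_classes, hidden) * 0.01
b2 = np.zeros(num_classes)

def forward(x):
    h = relu(W1 @ x + b1)      # template responses
    return W2 @ h + b2         # class scores as combinations of responses
```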

Multi-layer networks demo 

Supervised learning outline

  1. Collect training data
  2. Specify model (select hyper-parameters)
  3. Train model

Hyper-parameter selection

  • Number of layers, number of units per layer, learning rate, etc.
  • Type of non-linearity, regularization, etc.
  • Type of loss function, etc.
  • SGD settings: batch size, number of epochs, etc.

Hyper-parameter search

Use a validation set to evaluate the performance of the model.

Never peek at the test set.

Use the training set to do K-fold cross-validation.
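
A minimal sketch of K-fold cross-validation over the training set; `train_fn` and `eval_fn` are hypothetical stand-ins for whatever training and evaluation code the model uses.

```python
import numpy as np

def kfold_cross_validation(X, y, train_fn, eval_fn, k=5, seed=0):
    """Average validation score over k folds of the training set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train_idx], y[train_idx])
        scores.append(eval_fn(model, X[val_idx], y[val_idx]))
    return float(np.mean(scores))

# Hyper-parameter search: pick the setting with the best average fold
# score, then touch the held-out test set only once, at the very end.
```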

Backpropagation

Computation graphs

SGD update for each parameter

$w_k \gets w_k - \eta \dfrac{\partial e}{\partial w_k}$

$e$ is the error function.
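
A minimal sketch of this update, assuming the gradients $\partial e/\partial w_k$ have already been computed (e.g. by backpropagation) and stored per parameter.

```python
# SGD step: w_k <- w_k - eta * de/dw_k for every parameter k.
# `params` and `grads` are dicts keyed by parameter name (an assumption).
def sgd_step(params, grads, eta=1e-2):
    for k in params:
        params[k] = params[k] - eta * grads[k]
    return params
```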

Using the chain rule

Suppose $k = 1$ and $e = l(f_1(x, w_1), y)$.

Example: $e = (f_1(x, w_1) - y)^2$

So $h_1 = f_1(x, w_1) = w_1^\top x$ and $e = l(h_1, y) = (y - h_1)^2$.

$\dfrac{\partial e}{\partial w_1} = \dfrac{\partial e}{\partial h_1}\dfrac{\partial h_1}{\partial w_1}$, where $\dfrac{\partial e}{\partial h_1} = 2(h_1 - y)$ and $\dfrac{\partial h_1}{\partial w_1} = x$, so $\dfrac{\partial e}{\partial w_1} = 2(h_1 - y)\,x$.
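
A short numerical check of this worked example; the vector size and the finite-difference step below are illustrative.

```python
import numpy as np

# e = (w1.T @ x - y)^2, so de/dw1 = 2 * (h1 - y) * x.
rng = np.random.default_rng(0)
x = rng.standard_normal(5)
w1 = rng.standard_normal(5)
y = 1.0

h1 = w1 @ x
grad_analytic = 2 * (h1 - y) * x

# Finite-difference check of the first coordinate.
eps = 1e-6
w1_eps = w1.copy()
w1_eps[0] += eps
grad_numeric_0 = ((w1_eps @ x - y) ** 2 - (w1 @ x - y) ** 2) / eps

assert np.isclose(grad_analytic[0], grad_numeric_0, atol=1e-4)
```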

General backpropagation algorithm
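
In general, backpropagation does one forward pass over the computation graph and one reverse pass that applies the chain rule at every node. A minimal sketch, assuming a hypothetical `Node` interface with `parents`, `forward`, and `backward` (local gradient) methods:

```python
# Reverse-mode sketch over a topologically ordered computation graph.
# Each node's backward(grad) returns the gradient of the error w.r.t.
# each of its parents, given the upstream gradient (hypothetical API).
def backprop(ordered_nodes):
    # Forward pass: compute every node's value in topological order.
    for node in ordered_nodes:
        node.value = node.forward(*[p.value for p in node.parents])

    # Backward pass: accumulate gradients in reverse topological order,
    # starting from de/de = 1 at the final (error) node.
    ordered_nodes[-1].grad = 1.0
    for node in reversed(ordered_nodes):
        for parent, local_grad in zip(node.parents, node.backward(node.grad)):
            parent.grad = getattr(parent, "grad", 0.0) + local_grad
```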
