
CSE559A Lecture 7

Computer Vision: Artificial Neural Networks for Image Understanding

An early example of image understanding using neural networks: [Backpropagation Applied to Handwritten Zip Code Recognition] (LeCun et al., 1989).

Central idea: a change of representation at each layer of features.

Plan for next few weeks:

  1. How do we train such models?
  2. What are those building blocks?
  3. How should we combine those building blocks?

How do we train such models?

CV is finally useful…

  1. Image classification
  2. Image segmentation
  3. Object detection

ImageNet Large Scale Visual Recognition Challenge (ILSVRC)

  • 1000 classes
  • 1.2 million images
  • 10000 test images

Deep Learning (Just neural networks)

Bigger datasets, larger models, faster computers, lots of incremental improvements.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim  # needed for optim.Adam below


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # Two conv layers followed by three fully connected layers (LeNet-style).
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features


# Create a PyTorch dataset and dataloader (random 32x32 images, so the
# 16 * 5 * 5 flattened size expected by fc1 matches).
dataset = torch.utils.data.TensorDataset(
    torch.randn(1000, 1, 32, 32), torch.randint(10, (1000,))
)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=4, shuffle=True, num_workers=2)

# Training process
net = Net()
optimizer = optim.Adam(net.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# Loop over the dataset multiple times
for epoch in range(2):
    for i, data in enumerate(dataloader, 0):
        inputs, labels = data
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

print("Finished Training")
```

The code above is generated example code: a small LeNet-style CNN trained for a couple of epochs on random stand-in data.

Supervised Learning

Training: given a dataset, learn a mapping from input to output.

Testing: given a new input, predict the output.

Example: Linear classification models

Find a linear function that separates the data.

$f(x) = w^\top x + b$

[Figure: a simple representation of a linear classifier.]
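As a quick sketch, this is what the linear form looks like in code (the input dimension and random parameters are purely illustrative):

```python
import torch

# A linear classifier f(x) = w^T x + b for d-dimensional inputs.
d = 4                      # illustrative input dimension
w = torch.randn(d)         # weight vector
b = torch.randn(1)         # bias

def f(x):
    """Score of a single input x; the sign of f(x) gives the predicted class."""
    return w @ x + b

x = torch.randn(d)
print(f(x))                # positive -> one class, negative -> the other
```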

Empirical loss minimization framework

Given a training set, find a model that minimizes the loss function.

We assume the samples are i.i.d.

Examples of loss functions:

$\ell_1$ loss:

$\ell(f(x; w), y) = |f(x; w) - y|$

$\ell_2$ loss:

$\ell(f(x; w), y) = (f(x; w) - y)^2$

Linear classification models

$\hat{L}(w) = \frac{1}{n} \sum_{i=1}^n \ell(f(x_i; w), y_i)$

In general, it is hard to find the global minimum of this objective.
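A minimal sketch of the empirical loss on hypothetical toy data, using the $\ell_1$ and $\ell_2$ losses defined above (the model, data, and weight value are made up for illustration):

```python
import torch

# Toy 1-D linear model f(x; w) = w * x and a few (x, y) pairs -- illustrative only.
xs = torch.tensor([1.0, 2.0, 3.0])
ys = torch.tensor([1.5, 3.1, 4.4])
w = 1.4

preds = w * xs
l1_hat = (preds - ys).abs().mean()     # empirical l1 loss
l2_hat = ((preds - ys) ** 2).mean()    # empirical l2 loss
print(l1_hat.item(), l2_hat.item())
```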

Linear regression

However, if we use the $\ell_2$ loss, we can find the global minimum in closed form.

$\hat{L}(w) = \frac{1}{n} \sum_{i=1}^n (f(x_i; w) - y_i)^2$

This is a convex function, so we can find the global minimum.

The gradient is:

$\nabla_w \|Xw - Y\|^2 = 2X^\top(Xw - Y)$

Setting the gradient to 0, we get:

$w = (X^\top X)^{-1} X^\top Y$

From the maximum likelihood perspective (assuming Gaussian noise on the targets), we can derive the same result.
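A small sketch of the closed-form solution on synthetic data (the true weights and noise level are made up for illustration); it solves the normal equations directly rather than forming the inverse explicitly:

```python
import torch

# Random regression data (illustrative); rows of X are inputs, Y the targets.
n, d = 100, 3
X = torch.randn(n, d)
w_true = torch.tensor([2.0, -1.0, 0.5])
Y = X @ w_true + 0.01 * torch.randn(n)

# Normal equations: w = (X^T X)^{-1} X^T Y, solved as a linear system.
w_hat = torch.linalg.solve(X.T @ X, X.T @ Y)
print(w_hat)   # should be close to w_true
```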

Logistic regression

Sigmoid function:

$\sigma(x) = \frac{1}{1 + e^{-x}}$

Unlike linear regression, the logistic regression loss has no closed-form minimizer, so we cannot solve for the weights with normal equations; instead, we minimize it iteratively, e.g. with gradient descent.
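A minimal gradient-descent sketch for logistic regression on toy data (the labels, learning rate, and step count here are illustrative, not from the lecture):

```python
import torch
import torch.nn.functional as F

# Toy binary classification data (illustrative).
n, d = 200, 2
X = torch.randn(n, d)
y = (X[:, 0] + X[:, 1] > 0).float()        # labels in {0, 1}

w = torch.zeros(d, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr = 0.1

for step in range(500):
    logits = X @ w + b                     # w^T x + b; sigmoid(logits) is the predicted probability
    loss = F.binary_cross_entropy_with_logits(logits, y)  # negative log-likelihood
    loss.backward()
    with torch.no_grad():                  # plain gradient descent update
        w -= lr * w.grad
        b -= lr * b.grad
        w.grad.zero_()
        b.grad.zero_()

print(loss.item())
```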

Gradient Descent

Full batch gradient descent:

$w \leftarrow w - \eta \, \nabla_w \hat{L}(w)$

Stochastic gradient descent (a single randomly chosen sample $i$ per step):

$w \leftarrow w - \eta \, \nabla_w \ell(f(x_i; w), y_i)$

Mini-batch gradient descent:

$w \leftarrow w - \eta \, \frac{1}{|B|} \sum_{i \in B} \nabla_w \ell(f(x_i; w), y_i)$

At each step, we update the weights using the average gradient over a mini-batch $B$, selected randomly from the training set.
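A minimal mini-batch gradient descent sketch for the $\ell_2$ linear regression loss (the synthetic data, batch size, learning rate, and epoch count are illustrative):

```python
import torch

# Toy linear regression data (illustrative).
n, d = 1000, 3
X = torch.randn(n, d)
w_true = torch.tensor([2.0, -1.0, 0.5])
Y = X @ w_true + 0.1 * torch.randn(n)

w = torch.zeros(d)
lr, batch_size = 0.1, 32

for epoch in range(5):
    perm = torch.randperm(n)               # random mini-batches each epoch
    for start in range(0, n, batch_size):
        idx = perm[start:start + batch_size]
        Xb, Yb = X[idx], Y[idx]
        # Average gradient of the squared loss over the mini-batch:
        # grad = (2 / |B|) * Xb^T (Xb w - Yb)
        grad = (2.0 / len(idx)) * (Xb.T @ (Xb @ w - Yb))
        w -= lr * grad

print(w)   # close to w_true
```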

Multi-class classification

Use the softmax function to convert the output scores into a probability distribution over the classes.
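A short sketch of the softmax on a hypothetical score vector; `F.softmax` and `F.cross_entropy` are the numerically stable built-ins one would use in practice:

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([2.0, 0.5, -1.0])   # hypothetical class scores (logits)

# softmax_k(s) = exp(s_k) / sum_j exp(s_j): non-negative and sums to 1.
probs = torch.exp(scores) / torch.exp(scores).sum()
print(probs, probs.sum())

# Same result via the built-in, numerically stable version.
print(F.softmax(scores, dim=0))

# For training, cross-entropy combines softmax and negative log-likelihood.
label = torch.tensor(0)
print(F.cross_entropy(scores.unsqueeze(0), label.unsqueeze(0)))
```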

Neural Networks

From linear to non-linear.

  • Shallow approach:
    • Use a feature transformation to make the data linearly separable.
  • Deep approach:
    • Stack multiple layers of linear models, with non-linearities in between (see the sketch after the list of activation functions below).

Common non-linear functions:

  • ReLU:
    • $\text{ReLU}(x) = \max(0, x)$
  • Sigmoid:
    • $\text{Sigmoid}(x) = \frac{1}{1 + e^{-x}}$
  • Tanh:
    • $\text{Tanh}(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
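A minimal two-layer network sketch showing the deep approach: linear layers stacked with a ReLU in between (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

# Two-layer MLP: linear -> ReLU -> linear.
# Without the non-linearity, the composition of two linear layers
# would collapse back into a single linear map.
mlp = nn.Sequential(
    nn.Linear(784, 128),   # illustrative sizes (e.g. a flattened 28x28 input)
    nn.ReLU(),
    nn.Linear(128, 10),    # 10 class scores
)

x = torch.randn(4, 784)    # a batch of 4 inputs
print(mlp(x).shape)        # torch.Size([4, 10])
```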

Backpropagation
