Object Detection Rough Intuition

I don’t yet understand how this works in practice, but I wanted to get some thoughts down about the theory behind it.

Overview

So for a given image, our model’s prediction will have the following scheme:

  • Generate a box around an object, defined by
    • bx, by: coordinates of the center of the box
    • bw, bh: width and height of the box
  • Note: these values are expressed as fractions of the image’s width and height, with (0, 0) at the top-left corner and (1, 1) at the bottom-right, as we assume the image has unit length on each side

from IPython.display import Image

Image('images/bounding_box.png')

Then (in this example, a 3-class detection problem) we provide the following vector for our prediction:

  • Probability that there’s any object (p_c)
  • The 4 box parameters listed above (bx, by, bw, bh)
  • Probability that we’re looking at each of the 3 classes (c1, c2, c3)

Image('images/bounding_box_labels.png')
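
To make the layout of that vector concrete, here is a minimal sketch of how a ground-truth label could be built for the 3-class case. The function name `encode_label` and the ordering `[p_c, bx, by, bw, bh, c1, c2, c3]` are my own assumptions for illustration, not something the course prescribes in code.

```python
def encode_label(box=None, class_index=None, num_classes=3):
    """Build a target vector [p_c, bx, by, bw, bh, c_1, ..., c_n].

    `box` is (bx, by, bw, bh) in image-relative units, or None for "no object".
    The name and layout are hypothetical, chosen for illustration only.
    """
    if box is None:
        # No object: p_c = 0 and the remaining entries are placeholders.
        return [0.0] * (5 + num_classes)
    bx, by, bw, bh = box
    # One-hot encoding of the true class.
    one_hot = [1.0 if i == class_index else 0.0 for i in range(num_classes)]
    return [1.0, bx, by, bw, bh] + one_hot


y = encode_label(box=(0.5, 0.7, 0.3, 0.4), class_index=1)
print(y)  # [1.0, 0.5, 0.7, 0.3, 0.4, 0.0, 1.0, 0.0]
```

With three classes the vector always has 8 entries, which is where the 8 in the loss terms further down comes from.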

Loss Function

Our loss function depends on whether or not an object is actually there, i.e. on the true value of the first component of the vector above, $y_1$.

(Intuitively, though, I figured we’d implement some sort of log-loss/cross-entropy function here…)

If $y_1 = 1$ (there is an object)

$(\hat{y}_1 - y_1)^2 + \dots + (\hat{y}_8 - y_8)^2$

If $y_1 = 0$ (there is no object)

$(\hat{y}_1 - y_1)^2$
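
The two cases can be sketched in a few lines of plain Python. This is only my reading of the squared-error version written out above; `detection_loss` is a hypothetical name, and the vectors are assumed to be ordered with the object probability first.

```python
def detection_loss(y_hat, y):
    """Squared-error loss for one [p_c, bx, by, bw, bh, c1, c2, c3] vector.

    Assumes index 0 holds the "is there an object" entry.
    """
    if y[0] == 1:
        # Object present: every component contributes to the loss.
        return sum((p - t) ** 2 for p, t in zip(y_hat, y))
    # No object: only the object-probability term counts,
    # the box and class entries are ignored.
    return (y_hat[0] - y[0]) ** 2


y_true = [1, 0.5, 0.7, 0.3, 0.4, 0, 1, 0]
y_pred = [0.9, 0.5, 0.6, 0.3, 0.4, 0.1, 0.8, 0.1]
print(detection_loss(y_pred, y_true))
```

When the true label says there is no object, whatever the model outputs for the box and the classes has no effect on the loss.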

Generating Boxes

So assuming that we’ve got labelled data (class, location, and bounding box size, which the course didn’t cover how to generate), we know how to score our prediction. But how do we generate our boxes?

Via Convolution

We could slide a fixed-size window over our image and get a bunch of candidate boxes, implemented via convolution so that overlapping windows share computation.

Image('images/sliding_windows.png')
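
As a rough sketch of what "sliding a window" produces, the snippet below just enumerates the candidate boxes. It does not do the convolutional trick; each box would still have to be cropped and passed to a classifier. The function name, window size, and stride are all assumptions made for illustration.

```python
def sliding_windows(image_w, image_h, window, stride):
    """Yield (bx, by, bw, bh) for each square window, scaled to a unit image.

    Naive enumeration only; parameter names and values are hypothetical.
    """
    for top in range(0, image_h - window + 1, stride):
        for left in range(0, image_w - window + 1, stride):
            yield (
                (left + window / 2) / image_w,  # bx: center x
                (top + window / 2) / image_h,   # by: center y
                window / image_w,               # bw
                window / image_h,               # bh
            )


boxes = list(sliding_windows(100, 100, window=50, stride=25))
print(len(boxes))  # 9
print(boxes[0])    # (0.25, 0.25, 0.5, 0.5)
```

Smaller strides or several window sizes make the number of candidates grow quickly, which is part of why a single-pass approach is attractive.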

YOLO

This is the first image the course uses when it dives into the YOLO algorithm, without any suggestion of how those red boxes are generated. Idk.

They also suggest using something closer to a 19x19 grid, but leave it as 3x3 for illustration.

Image('images/yolo_basic.png')

For a 3-class problem, apply the same 1x8 vector as above to each grid cell, giving you a final output of shape 3x3x8. So if your image starts off 100x100x3, make sure whatever convolution/pooling steps you use along the way get you to an output of this size.

Image('images/yolo_bounding_boxes.png')
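
Here is a minimal sketch of how a 3x3x8 training target might be assembled, assuming each object is assigned to the grid cell that contains its center. `build_target` is a hypothetical helper, and for simplicity it keeps bx, by in image-relative units, whereas YOLO itself expresses them relative to the cell.

```python
def build_target(objects, grid=3, num_classes=3):
    """Return a grid x grid x (5 + num_classes) nested list of target vectors.

    objects: list of (bx, by, bw, bh, class_index) in image-relative units.
    Simplification (assumption): coordinates stay image-relative.
    """
    depth = 5 + num_classes
    target = [[[0.0] * depth for _ in range(grid)] for _ in range(grid)]
    for bx, by, bw, bh, class_index in objects:
        # Find the cell containing the box center (clamped for bx or by == 1).
        col = min(int(bx * grid), grid - 1)
        row = min(int(by * grid), grid - 1)
        cell = target[row][col]
        cell[0] = 1.0                   # p_c: an object lives in this cell
        cell[1:5] = [bx, by, bw, bh]    # box parameters
        cell[5 + class_index] = 1.0     # one-hot class
    return target


target = build_target([(0.5, 0.5, 0.2, 0.3, 2)])
print(len(target), len(target[0]), len(target[0][0]))  # 3 3 8
print(target[1][1])  # [1.0, 0.5, 0.5, 0.2, 0.3, 0.0, 0.0, 1.0]
```

Every other cell stays all zeros, i.e. "no object here".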

Then, once you’ve got all of your boxes, you want to trim down the redundant ones.

Non-Maximum Suppression

So when you’ve got two boxes that overlap, like the ones below, you can define a metric called Intersection over Union (IoU): the area of their intersection divided by the area of their union.

Image('images/iou1.png')

Here we represent intersection with the yellow, union with the green.

Image('images/iou2.png')
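
In code, IoU for two axis-aligned boxes comes down to a few min/max operations. This is a minimal sketch assuming boxes are given as (bx, by, bw, bh) with the center-based convention from the overview; the function name `iou` is my own.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (bx, by, bw, bh) boxes."""
    def corners(box):
        # Convert center/size to (x_min, y_min, x_max, y_max).
        bx, by, bw, bh = box
        return bx - bw / 2, by - bh / 2, bx + bw / 2, by + bh / 2

    ax1, ay1, ax2, ay2 = corners(box_a)
    bx1, by1, bx2, by2 = corners(box_b)

    # Overlap along each axis, clamped at zero when the boxes don't touch.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    # Union = sum of both areas minus the part counted twice.
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - intersection
    return intersection / union if union > 0 else 0.0


print(iou((0.25, 0.5, 0.5, 1.0), (0.5, 0.5, 0.5, 1.0)))
```

Identical boxes give an IoU of 1, boxes that don't overlap give 0, and everything else lands in between.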

We can apply it to images that returned more boxes

Image('images/non_max_1.png')

To simplify our representation of where we think the objects are.

Image('images/non_max_2.png')

More generally, the idea is that we use the following algorithm:

For each individual class

    Discard all boxes with p_c < 0.6

    While there are remaining boxes:
        Pick the box with the largest p_c and keep it as a prediction

        Discard any remaining box with IoU >= 0.5 with this box
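
The steps above translate fairly directly into Python. This is a sketch under my own assumptions: boxes for a single class arrive as (p_c, bx, by, bw, bh) tuples, the thresholds default to the 0.6 and 0.5 used above, and the function names are hypothetical. You would call it once per class.

```python
def iou(a, b):
    """Intersection over Union of two (bx, by, bw, bh) boxes."""
    inter_w = max(0.0, min(a[0] + a[2] / 2, b[0] + b[2] / 2)
                  - max(a[0] - a[2] / 2, b[0] - b[2] / 2))
    inter_h = max(0.0, min(a[1] + a[3] / 2, b[1] + b[3] / 2)
                  - max(a[1] - a[3] / 2, b[1] - b[3] / 2))
    intersection = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(boxes, score_threshold=0.6, iou_threshold=0.5):
    """Filter (p_c, bx, by, bw, bh) boxes belonging to ONE class."""
    # Step 1: discard low-confidence boxes.
    remaining = [b for b in boxes if b[0] >= score_threshold]
    # Sort so the most confident box is always first.
    remaining.sort(key=lambda b: b[0], reverse=True)

    kept = []
    while remaining:
        # Step 2: keep the most confident remaining box...
        best = remaining.pop(0)
        kept.append(best)
        # Step 3: ...and drop everything that overlaps it too much.
        remaining = [b for b in remaining
                     if iou(best[1:], b[1:]) < iou_threshold]
    return kept


candidates = [
    (0.9, 0.50, 0.5, 0.4, 0.4),  # strong detection
    (0.8, 0.52, 0.5, 0.4, 0.4),  # near-duplicate of the first
    (0.7, 0.10, 0.1, 0.1, 0.1),  # separate object
    (0.3, 0.80, 0.8, 0.1, 0.1),  # too uncertain
]
print(non_max_suppression(candidates))
```

In this toy input the near-duplicate and the low-confidence box are dropped, leaving one box per object.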

To go from busy pictures like this

Image('images/yolo_multiclass_1.png')

to incrementally-less busy pictures

Image('images/yolo_multiclass_2.png')

Until we have a box per object

Image('images/yolo_multiclass_3.png')

(The blue, yellow, and green reductions were omitted, for some reason.)