In the previous section, we presented basic edge detection and two successful feature extractors built on it. Both approaches rely on carefully chosen, manually tuned parameters, and both improved steadily until around 2010. By the end of 2010, they achieved only marginal improvements, and it looked as if computer vision had started to reach its limits (figure 9).

Figure 9: Progress in object detection on PASCAL VOC. After plateauing, results received a significant boost from the introduction of learned features. Figure adapted from [18].

In 2012, the results of the ImageNet competition shocked the computer vision community. AlexNet [17], a deep convolutional neural network (ConvNet), won the 2012 ImageNet classification challenge, beating the runner-up by a large margin. Instead of being tuned manually, the kernel weights of its filters were learned from the training data. AlexNet also entered the object detection challenge successfully, showing that feature extractors can be learned. Together, these results demonstrated that ConvNets can be trained in a supervised manner for both classification and detection, and they set off the rapid increase in object detection accuracy over the following years (see the PASCAL VOC example in figure 9).
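To make the contrast between hand-tuned and learned filters concrete, the following minimal sketch fits a single 3x3 kernel by gradient descent so that its output matches that of a hand-designed Sobel filter. It is a toy illustration only and not AlexNet's training procedure: the random image, the Sobel target, the learning rate, the step count, and all names (such as correlate2d_valid) are assumptions chosen for this example.

```python
import numpy as np


def correlate2d_valid(image, kernel):
    """Valid-mode 2-D cross-correlation, the operation a ConvNet layer applies."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * image[i:i + oh, j:j + ow]
    return out


# Hand-designed filter: the horizontal Sobel kernel.
sobel_x = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])

# Toy "training data": a random image and the Sobel response as the target.
rng = np.random.default_rng(0)
image = rng.random((32, 32))
target = correlate2d_valid(image, sobel_x)

# Learned filter: start from zeros and minimise the mean squared error.
kernel = np.zeros((3, 3))
learning_rate = 0.2  # assumed value, small enough for this toy problem
for step in range(1000):
    error = correlate2d_valid(image, kernel) - target
    oh, ow = error.shape
    grad = np.empty_like(kernel)
    for i in range(3):
        for j in range(3):
            # d(MSE)/d(kernel[i, j]) = 2 * mean(error * shifted image window)
            grad[i, j] = 2.0 * np.mean(error * image[i:i + oh, j:j + ow])
    kernel -= learning_rate * grad

final_mse = np.mean((correlate2d_valid(image, kernel) - target) ** 2)
print(np.round(kernel, 3))
print(final_mse)
```

In this setup the learned kernel should end up close to sobel_x, although nothing in the update rule encodes the Sobel weights. A real ConvNet learns many such kernels at once, stacked in several layers, from labelled images rather than from a filter response.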

In the following section, we first explain the basics of convolutional neural networks. Afterwards, we present two popular object detection architectures in detail.


[17]   Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90.

[18]   Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 [cs].