For object detection, a variety of model architectures has been tried. Convolutional neural networks (CNNs) were applied to recognition tasks as early as the 1980s [12], [13] before falling out of favour in the mid-1990s, when support vector machines (SVMs) [14] rose to prominence.

Two main reasons led to the ascent of SVMs. First, SVMs are rooted in statistical learning theory (VC theory), and training them amounts to solving a convex optimization problem. Convex problems have a single global minimum, which well-tested methods can find reliably. Second, via the ‘kernel trick’ they handle non-linearly separable data well [14]. Linearly separable data can be divided by a line in two-dimensional space (a hyperplane in n-dimensional space), whereas non-linearly separable data, for example two classes arranged in concentric circles, cannot. In such cases an appropriate function, called a kernel, implicitly maps the data into a higher-dimensional space in which it becomes linearly separable. These properties made SVMs widespread in classification tasks.
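The effect of the kernel trick can be sketched with scikit-learn (an illustrative example, not part of the original text): two classes arranged as concentric circles defeat a linear SVM, while an RBF kernel separates them easily.

```python
# Illustrative sketch, assuming scikit-learn is available.
# Concentric circles are not linearly separable in the input space,
# but an RBF kernel implicitly maps them into a space where they are.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)  # hyperplane in the input space
rbf_svm = SVC(kernel="rbf").fit(X, y)        # implicit higher-dimensional space

print(f"linear kernel accuracy: {linear_svm.score(X, y):.2f}")  # near chance
print(f"RBF kernel accuracy:    {rbf_svm.score(X, y):.2f}")     # near perfect
```

The linear SVM scores close to 50% on this data, while the RBF-kernel SVM separates the circles almost perfectly, without the feature map ever being computed explicitly.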

For object detection, pipelines of hand-crafted feature extractors followed by SVMs were often used, so that only relevant features reached the SVM classifier. This, however, made the architecture rather complex. A second downside of such multi-stage architectures is that they cannot be trained end-to-end: the extractor is fixed while only the classifier is learned. Starting in the mid-2010s, object detection competitions were won by deep neural network architectures, which have since largely replaced SVMs in object detection.
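The two-stage pattern can be sketched as follows. This is a minimal, hypothetical illustration in scikit-learn: PCA stands in for a hand-crafted extractor such as HOG, and only the SVM stage is discriminatively trained, so the pipeline is not end-to-end.

```python
# Sketch of a classical two-stage pipeline (assumes scikit-learn).
# Stage 1 (feature extraction) is fitted without using the labels at all;
# stage 2 (the SVM) is the only discriminatively trained component.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

extractor = PCA(n_components=32).fit(X_tr)            # stage 1: fixed features
clf = SVC(kernel="rbf").fit(extractor.transform(X_tr), y_tr)  # stage 2: SVM

print(f"test accuracy: {clf.score(extractor.transform(X_te), y_te):.2f}")
```

Because the extractor is frozen once fitted, no gradient from the classification loss can improve it, which is precisely the end-to-end training limitation noted above.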

In the following sections, we will explain this evolution with a focus on the feature extractors.


[12]   Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36, 193–202.

[13]   LeCun, Y., Boser, B. E., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W. E., & Jackel, L. D. (1990). Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems 2 (pp. 396–404).

[14]   Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.