When designing a dataset, there are several questions to consider. For example, how many object classes or categories should the dataset cover? How many instances per category should it have? Figure 2 shows how selected datasets compare. While ImageNet has a large number of classes, COCO has more instances per class.

Figure 2: Instances per category over number of categories for selected datasets. While ImageNet has a lot of classes, COCO has more instances per class. Figure adapted from [7].

Does this mean ImageNet is better? The answer depends on the problem that you want to solve. Generally, ImageNet is a popular choice for pretraining classifiers, while COCO is popular for general detection and segmentation models.
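The two axes of Figure 2 can be computed for any labelled dataset. The following is a minimal sketch, assuming the annotations are available as a flat list of category labels. The function name `dataset_shape` and the toy labels are hypothetical and do not reflect real dataset statistics.

```python
from collections import Counter


def dataset_shape(labels):
    """Return (number of categories, mean instances per category)."""
    counts = Counter(labels)
    num_categories = len(counts)
    if num_categories == 0:
        return 0, 0.0
    mean_instances = sum(counts.values()) / num_categories
    return num_categories, mean_instances


# Toy annotation lists (made-up values, for illustration only).
many_classes = ["cat", "dog", "car", "bus", "cup", "bed"]
few_classes = ["cat"] * 4 + ["dog"] * 4

print(dataset_shape(many_classes))  # (6, 1.0)
print(dataset_shape(few_classes))   # (2, 4.0)
```

The first toy list resembles a dataset with many classes and few examples each, the second a dataset with few classes and more examples each.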

Further considerations for datasets include: Are the pictures taken on a smartphone or by a professional photographer? Does an image contain only one instance of an object? Are objects shown from different angles? Is there clutter in the image? What about the sizes of objects in images?
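Some of these questions can be checked directly on the annotations. The sketch below assumes each object is annotated with a bounding box given as pixel coordinates. The `Box` class and the `image_stats` helper are hypothetical names introduced for illustration, not part of any dataset toolkit.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box in pixels (assumed annotation format)."""
    x: float
    y: float
    width: float
    height: float


def image_stats(image_width, image_height, boxes):
    """Count the instances in an image and compute each box's share of the image area."""
    image_area = image_width * image_height
    relative_areas = [b.width * b.height / image_area for b in boxes]
    return {"num_instances": len(boxes), "relative_areas": relative_areas}


# One large and one small object in a 640x480 image (made-up example).
stats = image_stats(640, 480, [Box(0, 0, 320, 240), Box(100, 100, 64, 48)])
print(stats["num_instances"])   # 2
print(stats["relative_areas"])  # [0.25, 0.01]
```

Aggregating these values over a whole dataset gives a rough picture of how many objects appear per image and how large they tend to be.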

Frequently used datasets take these aspects into account. Below, we briefly introduce three popular datasets. For details such as the annotation pipelines, see the respective papers.

PASCAL VOC

To facilitate progress in object detection, the Pattern Analysis, Statistical Modelling and Computational Learning Visual Object Classes (PASCAL VOC) dataset and challenge were launched in 2005. They served as a benchmark for training and testing models by providing challenging images with high-quality annotations and a standard evaluation methodology. Progress was measured in an annual competition that ran from 2005 to 2012 [5].

PASCAL VOC focuses on objects in natural scenes to enable object detectors for the real world. To that end, the images were collected from the image hosting website Flickr. Each image typically contains one or more objects.

Object categories were carefully chosen to represent typical scenes. There are 20 classes in total, which can be divided into four main groups:

  • vehicles
  • household
  • animals
  • person

To improve the training and validation data, the dataset was updated annually. For more details, such as the annotation process or the number of examples per class, see the original paper [5].

During the seven years in which competitions were held, PASCAL VOC enabled important contributions to the field of computer vision. Model performance increased steadily, helped in part by transferring models between tasks such as detection and segmentation. Finally, it helped to seed other datasets and competitions, such as the COCO dataset [6].


ImageNet

Knowing that even the best object detection algorithm would be useless without a large and accurate dataset, Fei-Fei Li set out to map the world of objects. The original dataset comprised over \(10\) million photographs, collected from Flickr and Google searches and spanning \(10,000\) different categories ranging from red beech to odometer to geyser [7], making it one of the largest publicly available labelled image datasets to date. ImageNet also hosted an annual object recognition competition from 2010 to 2017, using a small fraction of the dataset.

In the seven years that it hosted the competition, the winning accuracy in object classification rose from around \(71.8\%\) to \(97.3\%\), the latter surpassing human performance. The dataset became a benchmark for measuring the success of the newest object classification algorithms. In its later years, the competition introduced tasks for object localization, object segmentation, and even object localization in video.

Though the competitions are over, the impact of ImageNet can still be felt today: many object recognition algorithms are still trained and tested on the dataset, and many of the competition winners have moved on to senior roles at companies such as Google, Baidu, and Facebook.


COCO

One issue with the previous datasets is that most images of a given object show it from a similar angle. Training models on these iconic views can lead to poor generalization when objects are shown from other angles [8].

The Common Objects in Context (COCO) dataset addresses this by focusing on non-iconic images. Additionally, it aims to enable more complex scene understanding by showing several objects in typical scenes. Lastly, by including more instances per category than, for example, PASCAL VOC (see Figure 2), it enables more precise 2D localization [9].

In addition to covering more categories than PASCAL VOC, the dataset also includes smaller objects. This makes its competitions more difficult than those of PASCAL VOC.
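To get a feeling for how many small objects a dataset contains, object areas can be grouped into size buckets. The sketch below assumes the thresholds commonly used in COCO-style evaluation (small below \(32^2\) pixels, medium below \(96^2\) pixels, large otherwise). These thresholds are not stated in this text, so treat them as an assumption; the helper names are hypothetical.

```python
# Assumed area thresholds in pixels, following the COCO-style evaluation convention.
SMALL_MAX_AREA = 32 ** 2
MEDIUM_MAX_AREA = 96 ** 2


def size_bucket(area):
    """Assign an object area in pixels to a size bucket."""
    if area < SMALL_MAX_AREA:
        return "small"
    if area < MEDIUM_MAX_AREA:
        return "medium"
    return "large"


def bucket_counts(areas):
    """Count how many objects fall into each size bucket."""
    counts = {"small": 0, "medium": 0, "large": 0}
    for area in areas:
        counts[size_bucket(area)] += 1
    return counts


# Made-up object areas (width * height in pixels).
example_areas = [20 * 20, 50 * 50, 200 * 100, 10 * 10]
print(bucket_counts(example_areas))  # {'small': 2, 'medium': 1, 'large': 1}
```

A dataset with a large share of objects in the small bucket is usually harder for detectors, which matches the added difficulty described above.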

All of this makes COCO a popular choice for training general object detection and segmentation models.


References

[5]   Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision, 88(2), 303–338. https://doi.org/10.1007/s11263-009-0275-4

[6]   Everingham, M., Eslami, S. M. A., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2015). The Pascal Visual Object Classes Challenge: A Retrospective. International Journal of Computer Vision, 111(1), 98–136. https://doi.org/10.1007/s11263-014-0733-5

[7]   Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., & Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3), 211–252. https://doi.org/10.1007/s11263-015-0816-y

[8]   Torralba, A., & Efros, A. A. (2011). Unbiased look at dataset bias. CVPR 2011, 1521–1528. https://doi.org/10.1109/CVPR.2011.5995347

[9]   Lin, T.-Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C. L., & Dollár, P. (2015). Microsoft COCO: Common Objects in Context. arXiv:1405.0312 [cs]. http://arxiv.org/abs/1405.0312