Abstract
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the 5 years of the challenge, and propose future directions and improvements.
Notes
In this paper, we will be using the term object recognition broadly to encompass both image classification (a task requiring an algorithm to determine what object classes are present in the image) as well as object detection (a task requiring an algorithm to localize all objects present in the image).
In 2010, the test annotations were later released publicly; since then the test annotations have been kept hidden.
In addition, ILSVRC in 2012 also included a taster fine-grained classification task, where algorithms classified dog photographs into one of 120 dog breeds (Khosla et al. 2011). Fine-grained classification has since evolved into its own fine-grained classification challenge in 2013 (Berg et al. 2013), which is outside the scope of this paper.
Some datasets such as PASCAL VOC (Everingham et al. 2010) and LabelMe (Russell et al. 2007) are able to provide more detailed annotations: for example, marking individual object instances as being truncated. We chose not to provide this level of detail in favor of annotating more images and more object instances.
Some of the training objects are actually annotated with more detailed classes: for example, one of the 200 object classes is the category “dog,” and some training instances are annotated with the specific dog breed.
The validation/test split is consistent with ILSVRC2012: validation images of ILSVRC2012 remained in the validation set of ILSVRC2013, and ILSVRC2012 test images remained in ILSVRC2013 test set.
In this paper we focus on the mean average precision across all categories as the measure of a team’s performance. This is done for simplicity and is justified since the ordering of teams by mean average precision was always the same as the ordering by the number of object categories won.
Table 8 omits 4 teams which submitted results but chose not to officially participate in the challenge.
Personal communication with members of the UvA team.
For rigid versus deformable objects, the average scale in each bin is 34.1–34.2 % for classification and localization, and 13.5–13.7 % for detection. For texture, the average scale in each of the four bins is 31.1–31.3 % for classification and localization, and 12.7–12.8 % for detection.
Natural object detection classes are removed from this analysis because there are only 3 and 13 natural untextured and low-textured classes respectively, and none remain after scale normalization. All other bins contain at least 9 object classes after scale normalization.
References
Ahonen, T., Hadid, A., & Pietikäinen, M. (2006). Face description with local binary patterns: Application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12), 2037–2041.
Alexe, B., Deselaers, T., & Ferrari, V. (2012). Measuring the objectness of image windows. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11), 2189–2202.
Arandjelovic, R., & Zisserman, A. (2012). Three things everyone should know to improve object retrieval. In CVPR.
Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., & Malik, J. (2014). Multiscale combinatorial grouping. In Computer vision and pattern recognition.
Arbelaez, P., Maire, M., Fowlkes, C., & Malik, J. (2011). Contour detection and hierarchical image segmentation. IEEE Transaction on Pattern Analysis and Machine Intelligence, 33, 898–916.
Batra, D., Agrawal, H., Banik, P., Chavali, N., Mathialagan, C. S., & Alfadda, A. (2013). Cloudcv: Large-scale distributed computer vision as a cloud service.
Bell, S., Upchurch, P., Snavely, N., & Bala, K. (2013). OpenSurfaces: A richly annotated catalog of surface appearance. In ACM transactions on graphics (SIGGRAPH).
Berg, A., Farrell, R., Khosla, A., Krause, J., Fei-Fei, L., Li, J., & Maji, S. (2013). Fine-grained competition. https://sites.google.com/site/fgcomp2013/.
Chatfield, K., Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). Return of the devil in the details: Delving deep into convolutional nets. CoRR, abs/1405.3531.
Chen, Q., Song, Z., Huang, Z., Hua, Y., & Yan, S. (2014). Contextualizing object detection and classification. In CVPR.
Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., & Singer, Y. (2006). Online passive-aggressive algorithms. Journal of Machine Learning Research, 7, 551–585.
Criminisi, A. (2004). Microsoft Research Cambridge (MSRC) object recognition image database (version 2.0). http://research.microsoft.com/vision/cambridge/recognition.
Dean, T., Ruzon, M., Segal, M., Shlens, J., Vijayanarasimhan, S., & Yagnik, J. (2013). Fast, accurate detection of 100,000 object classes on a single machine. In CVPR.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In CVPR.
Deng, J., Russakovsky, O., Krause, J., Bernstein, M., Berg, A. C., & Fei-Fei, L. (2014). Scalable multi-label annotation. In CHI.
Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., & Darrell, T. (2013). Decaf: A deep convolutional activation feature for generic visual recognition. CoRR, abs/1310.1531.
Dubout, C., & Fleuret, F. (2012). Exact acceleration of linear object detectors. In Proceedings of the European conference on computer vision (ECCV).
Everingham, M., Gool, L. V., Williams, C., Winn, J., & Zisserman, A. (2005–2012). PASCAL Visual Object Classes Challenge (VOC). http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The Pascal Visual Object Classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.
Everingham, M., Eslami, S. M. A., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2014). The Pascal Visual Object Classes (VOC) challenge—A retrospective. International Journal of Computer Vision, 111, 98–136.
Fei-Fei, L., & Perona, P. (2005). A Bayesian hierarchical model for learning natural scene categories. In CVPR.
Fei-Fei, L., Fergus, R., & Perona, P. (2004). Learning generative visual models from few examples: An incremental bayesian approach tested on 101 object categories. In CVPR.
Felzenszwalb, P., Girshick, R., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.
Frome, A., Corrado, G., Shlens, J., Bengio, S., Dean, J., Ranzato, M., & Mikolov, T. (2013). Devise: A deep visual-semantic embedding model. In Advances in neural information processing systems, NIPS.
Geiger, A., Lenz, P., Stiller, C., & Urtasun, R. (2013). Vision meets robotics: The kitti dataset. International Journal of Robotics Research, 32, 1231–1237.
Girshick, R. B., Donahue, J., Darrell, T., & Malik, J. (2013). Rich feature hierarchies for accurate object detection and semantic segmentation (v4). CoRR.
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR.
Gould, S., Fulton, R., & Koller, D. (2009). Decomposing a scene into geometric and semantically consistent regions. In ICCV.
Graham, B. (2013). Sparse arrays of signatures for online character recognition. CoRR.
Griffin, G., Holub, A., & Perona, P. (2007). Caltech-256 object category dataset. Technical report 7694, Caltech.
Harada, T., & Kuniyoshi, Y. (2012). Graphical Gaussian vector for image categorization. In NIPS.
Harel, J., Koch, C., & Perona, P. (2007). Graph-based visual saliency. In NIPS.
He, K., Zhang, X., Ren, S., & Sun, J. (2014). Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV.
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580.
Hoiem, D., Chodpathumwan, Y., & Dai, Q. (2012). Diagnosing error in object detectors. In ECCV.
Howard, A. (2014). Some improvements on deep convolutional neural network based image classification. In ICLR.
Huang, G. B., Ramesh, M., Berg, T., & Learned-Miller, E. (2007). Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical report 07–49, University of Massachusetts, Amherst.
Iandola, F. N., Moskewicz, M. W., Karayev, S., Girshick, R. B., Darrell, T., & Keutzer, K. (2014). Densenet: Implementing efficient convnet descriptor pyramids. CoRR.
Jia, Y. (2013). Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/.
Jojic, N., Frey, B. J., & Kannan, A. (2003). Epitomic analysis of appearance and shape. In ICCV.
Kanezaki, A., Inaba, S., Ushiku, Y., Yamashita, Y., Muraoka, H., Kuniyoshi, Y., & Harada, T. (2014). Hard negative classes for multiple object detection. In ICRA.
Khosla, A., Jayadevaprakash, N., Yao, B., & Fei-Fei, L. (2011). Novel dataset for fine-grained image categorization. In First workshop on fine-grained visual categorization, CVPR.
Krizhevsky, A., Sutskever, I., & Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. In NIPS.
Kuettel, D., Guillaumin, M., & Ferrari, V. (2012). Segmentation propagation in ImageNet. In ECCV.
Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR.
Lin, M., Chen, Q., & Yan, S. (2014a). Network in network. In ICLR.
Lin, Y., Lv, F., Cao, L., Zhu, S., Yang, M., Cour, T., Yu, K., & Huang, T. (2011). Large-scale image classification: Fast feature extraction and SVM training. In CVPR.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014b). Microsoft COCO: Common objects in context. In ECCV.
Liu, C., Yuen, J., & Torralba, A. (2011). Nonparametric scene parsing via label transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32, 2368–2382.
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.
Maji, S., & Malik, J. (2009). Object detection using a max-margin hough transform. In CVPR.
Manen, S., Guillaumin, M., & Van Gool, L. (2013). Prime object proposals with randomized Prim’s algorithm. In ICCV.
Mensink, T., Verbeek, J., Perronnin, F., & Csurka, G. (2012). Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. In ECCV.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. In ICLR.
Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11), 39–41.
Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3), 145–175.
Ordonez, V., Deng, J., Choi, Y., Berg, A. C., & Berg, T. L. (2013). From large scale image categorization to entry-level categories. In IEEE international conference on computer vision (ICCV).
Ouyang, W., & Wang, X. (2013). Joint deep learning for pedestrian detection. In ICCV.
Ouyang, W., Luo, P., Zeng, X., Qiu, S., Tian, Y., Li, H., Yang, S., Wang, Z., Xiong, Y., Qian, C., Zhu, Z., Wang, R., Loy, C. C., Wang, X., & Tang, X. (2014). Deepid-net: multi-stage and deformable deep convolutional neural networks for object detection. CoRR, abs/1409.3505.
Papandreou, G. (2014). Deep epitomic convolutional neural networks. CoRR.
Papandreou, G., Chen, L.-C., & Yuille, A. L. (2014). Modeling image patches with a generic dictionary of mini-epitomes. In CVPR.
Perronnin, F., & Dance, C. R. (2007). Fisher kernels on visual vocabularies for image categorization. In CVPR.
Perronnin, F., Akata, Z., Harchaoui, Z., & Schmid, C. (2012). Towards good practice in large-scale learning for image classification. In CVPR.
Perronnin, F., Sánchez, J., & Mensink, T. (2010). Improving the fisher kernel for large-scale image classification. In ECCV (4).
Russakovsky, O., Deng, J., Huang, Z., Berg, A., & Fei-Fei, L. (2013). Detecting avocados to zucchinis: What have we done, and where are we going? In ICCV.
Russell, B., Torralba, A., Murphy, K., & Freeman, W. T. (2007). LabelMe: A database and web-based tool for image annotation. International Journal of Computer Vision.
Sanchez, J., & Perronnin, F. (2011). High-dimensional signature compression for large-scale image classification. In CVPR.
Sanchez, J., Perronnin, F., & de Campos, T. (2012). Modeling spatial layout of images beyond spatial pyramids. Pattern Recognition Letters.
Scheirer, W., Kumar, N., Belhumeur, P. N., & Boult, T. E. (2012). Multi-attribute spaces: Calibration for attribute fusion and similarity search. In CVPR.
Cireşan, D., Meier, U., & Schmidhuber, J. (2012). Multi-column deep neural networks for image classification. In CVPR.
Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., & LeCun, Y. (2013). Overfeat: Integrated recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229.
Sheng, V. S., Provost, F., & Ipeirotis, P. G. (2008). Get another label? Improving data quality and data mining using multiple, noisy labelers. In SIGKDD.
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556.
Simonyan, K., Vedaldi, A., & Zisserman, A. (2013). Deep fisher networks for large-scale image classification. In NIPS.
Sorokin, A., & Forsyth, D. (2008). Utility data annotation with Amazon Mechanical Turk. In InterNet08.
Su, H., Deng, J., & Fei-Fei, L. (2012). Crowdsourcing annotations for visual object detection. In AAAI human computation workshop.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., & Rabinovich, A. (2014). Going deeper with convolutions. Technical report.
Tang, Y. (2013). Deep learning using support vector machines. CoRR, abs/1306.0239.
Thorpe, S., Fize, D., Marlot, C., et al. (1996). Speed of processing in the human visual system. Nature, 381(6582), 520–522.
Torralba, A., & Efros, A. A. (2011). Unbiased look at dataset bias. In CVPR’11.
Torralba, A., Fergus, R., & Freeman, W. (2008). 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30, 1958–1970.
Uijlings, J., van de Sande, K., Gevers, T., & Smeulders, A. (2013). Selective search for object recognition. International Journal of Computer Vision, 104, 154–171.
Urtasun, R., Fergus, R., Hoiem, D., Torralba, A., Geiger, A., Lenz, P., Silberman, N., Xiao, J., & Fidler, S. (2013–2014). Reconstruction meets recognition challenge. http://ttic.uchicago.edu/rurtasun/rmrc/.
van de Sande, K. E. A., Snoek, C. G. M., & Smeulders, A. W. M. (2014). Fisher and VLAD with FLAIR. In Proceedings of the IEEE conference on computer vision and pattern recognition.
van de Sande, K. E. A., Uijlings, J. R. R., Gevers, T., & Smeulders, A. W. M. (2011b). Segmentation as selective search for object recognition. In ICCV.
van de Sande, K. E. A., Gevers, T., & Snoek, C. G. M. (2010). Evaluating color descriptors for object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1582–1596.
van de Sande, K. E. A., Gevers, T., & Snoek, C. G. M. (2011a). Empowering visual categorization with the GPU. IEEE Transactions on Multimedia, 13(1), 60–70.
Vittayakorn, S., & Hays, J. (2011). Quality assessment for crowdsourced object annotations. In BMVC.
von Ahn, L., & Dabbish, L. (2005). Esp: Labeling images with a computer game. In AAAI spring symposium: Knowledge collection from volunteer contributors.
Vondrick, C., Patterson, D., & Ramanan, D. (2012). Efficiently scaling up crowdsourced video annotation. International Journal of Computer Vision, 101, 184–204.
Wan, L., Zeiler, M., Zhang, S., LeCun, Y., & Fergus, R. (2013). Regularization of neural networks using dropconnect. In Proceedings of the international conference on machine learning (ICML’13).
Wang, M., Xiao, T., Li, J., Hong, C., Zhang, J., & Zhang, Z. (2014). Minerva: A scalable and highly efficient training platform for deep learning. In APSys.
Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., & Gong, Y. (2010). Locality-constrained linear coding for image classification. In CVPR.
Wang, X., Yang, M., Zhu, S., & Lin, Y. (2013). Regionlets for generic object detection. In ICCV.
Welinder, P., Branson, S., Belongie, S., & Perona, P. (2010). The multidimensional wisdom of crowds. In NIPS.
Xiao, J., Hays, J., Ehinger, K., Oliva, A., & Torralba, A. (2010). SUN database: Large-scale scene recognition from Abbey to Zoo. In CVPR.
Yang, J., Yu, K., Gong, Y., & Huang, T. (2009). Linear spatial pyramid matching using sparse coding for image classification. In CVPR.
Yao, B., Yang, X., & Zhu, S.-C. (2007). Introduction to a large scale general purpose ground truth dataset: methodology, annotation tool, and benchmarks. Berlin: Springer.
Zeiler, M. D., & Fergus, R. (2013). Visualizing and understanding convolutional networks. CoRR, abs/1311.2901.
Zeiler, M. D., Taylor, G. W., & Fergus, R. (2011). Adaptive deconvolutional networks for mid and high level feature learning. In ICCV.
Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., & Oliva, A. (2014). Learning deep features for scene recognition using places database. In NIPS.
Zhou, X., Yu, K., Zhang, T., & Huang, T. (2010). Image classification using super-vector coding of local image descriptors. In ECCV.
Acknowledgments
We thank Stanford University, UNC Chapel Hill, Google and Facebook for sponsoring the challenges, and NVIDIA for providing computational resources to participants of ILSVRC2014. We thank our advisors over the years: Lubomir Bourdev, Alexei Efros, Derek Hoiem, Jitendra Malik, Chuck Rosenberg and Andrew Zisserman. We thank the PASCAL VOC organizers for partnering with us in running ILSVRC2010-2012. We thank all members of the Stanford vision lab for supporting the challenges and putting up with us along the way. Finally, and most importantly, we thank all researchers that have made the ILSVRC effort a success by competing in the challenges and by using the datasets to advance computer vision.
Additional information
Communicated by M. Hebert.
Olga Russakovsky and Jia Deng contributed equally to this work.
Appendices
Appendix 1: ILSVRC2012-2014 Image Classification and Single-Object Localization Object Categories
Appendix 2: Additional Single-Object Localization Dataset Statistics
We consider two additional metrics of object localization difficulty: chance performance of localization and the level of clutter. We use these metrics to compare the ILSVRC2012-2014 single-object localization dataset to the PASCAL VOC 2012 object detection benchmark. The measures of localization difficulty are computed on the validation sets of both datasets. According to both of these measures of difficulty, there is a subset of ILSVRC which is as challenging as PASCAL but more than an order of magnitude greater in size. Figure 16 shows the distributions of different properties (object scale, chance performance of localization, and level of clutter) across the different classes in the two datasets.
Chance Performance of Localization (CPL) Chance performance on a dataset is a common metric to consider. We define the CPL measure as the expected accuracy of a detector which first randomly samples an object instance of that class and then uses its bounding box directly as the proposed localization window on all other images (after rescaling the images to the same size). Concretely, let \(B_1,B_2,\dots ,B_N\) be all the bounding boxes of the object instances within a class; then

\(\text{CPL} = \frac{1}{N(N-1)} \sum _{i \ne j} \mathbb {1}\big [\text{iou}(B_i,B_j) \ge 0.5\big ].\)
Some of the most difficult ILSVRC categories to localize according to this metric are basketball, swimming trunks, ping pong ball and rubber eraser, all with less than \(0.2\,\%\) CPL. This measure correlates strongly (\(\rho = 0.9\)) with the average scale of the object (fraction of image occupied by object). The average CPL across the \(1000\) ILSVRC categories is \(20.8\,\%\). The 20 PASCAL categories have an average CPL of \(8.7\,\%\), which is the same as the CPL of the \(562\) most difficult categories of ILSVRC.
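For concreteness, the following is a minimal sketch (our own Python code, not the code used to produce the numbers above) of how CPL could be computed for a single category from its list of bounding boxes; the box representation as (x1, y1, x2, y2) tuples in the coordinates of the rescaled images is an assumption.

```python
# Sketch (not the authors' code): chance performance of localization (CPL)
# for one category, given bounding boxes B_1..B_N after rescaling all images
# to a common size. Boxes are assumed to be (x1, y1, x2, y2) tuples.

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def chance_performance_of_localization(boxes, thresh=0.5):
    """CPL: expected accuracy of proposing one instance's box on all others."""
    n = len(boxes)
    hits = sum(iou(boxes[i], boxes[j]) >= thresh
               for i in range(n) for j in range(n) if i != j)
    return hits / float(n * (n - 1))
```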
Clutter Intuitively, even small objects are easy to localize on a plain background. To quantify clutter we employ the objectness measure of Alexe et al. (2012), which is a class-generic object detector evaluating how likely a window in the image is to contain a coherent object (of any class) as opposed to background (sky, water, grass). For every image \(m\) containing target object instances at positions \(B_1^m,B_2^m,\dots \), we use the publicly available objectness software to sample 1000 windows \(W_1^m,W_2^m,\dots W_{1000}^m\), in order of decreasing probability of the window containing any generic object. Let \(\text{obj}(m)\) be the number of generic object-looking windows sampled before localizing an instance of the target category, i.e., \(\text{obj}(m) = \min \{k: \max _i \text{iou}(W_k^m,B_i^m) \ge 0.5\}\). For a category containing \(M\) images, we compute the average number of such windows per image and define

\(\text{Clutter} = \log _2 \Big (\frac{1}{M} \sum _{m=1}^{M} \text{obj}(m)\Big ).\)
The higher the clutter score of a category, the harder its objects are to localize using generic cues. If an object cannot be localized within the first 1000 windows (as is the case for \(1\,\%\) of images on average per category in ILSVRC and \(5\,\%\) in PASCAL), we set \(\text{obj}(m)=1001\). The fact that more than \(95\,\%\) of objects can be localized with these windows implies that the objectness cue is already quite strong, so objects that require many windows on average will be extremely difficult to detect: e.g., ping pong ball (clutter of 9.57, or 758 windows on average), basketball (clutter of 9.21), and puck (clutter of 9.17) in ILSVRC. The most difficult object in PASCAL is bottle, with a clutter score of \(8.47\). On average, ILSVRC has a clutter score of \(3.59\). The most difficult subset of ILSVRC, with 250 object categories, has an order of magnitude more categories than the PASCAL dataset but the same average amount of clutter (\(5.90\)).
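As a sketch (again our reconstruction, not the released objectness code), the clutter score for one category could be computed as follows, reusing the iou helper from the CPL sketch above; the per-image window and ground-truth data structures are assumptions.

```python
import math

# Sketch: clutter score for one category. `windows[m]` holds up to 1000
# objectness windows for image m, already sorted by decreasing probability
# of containing a generic object; `gt[m]` holds the ground-truth boxes of
# the target category in that image.

def obj_count(windows_m, gt_m, thresh=0.5, cap=1001):
    """Number of objectness windows sampled before hitting a target instance."""
    for k, w in enumerate(windows_m, start=1):
        if any(iou(w, b) >= thresh for b in gt_m):
            return k
    return cap  # no instance localized within the first 1000 windows

def clutter(windows, gt):
    """log2 of the average number of windows needed per image of the category."""
    counts = [obj_count(w, g) for w, g in zip(windows, gt)]
    return math.log(sum(counts) / float(len(counts)), 2)
```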
Appendix 3: Manually Curated Queries for Obtaining Object Detection Scene Images
In Sect. 3.3.2 we discussed three types of queries used for collecting the object detection images: (1) a single object category name or a synonym; (2) a pair of object category names; (3) a manual query, typically targeting one or more object categories with insufficient data. Here we provide a list of the 129 manually curated queries:
Appendix 4: Hierarchy of Questions for Full Image Annotation
The following is a hierarchy of questions manually constructed for crowdsourcing full annotation of images with the presence or absence of 200 object detection categories in ILSVRC2013 and ILSVRC2014. All questions are of the form “is there a ... in the image?” Questions marked with \(\bullet \) are asked on every image. If the answer to a question is determined to be “no” then the answer to all descendant questions is assumed to be “no”. The 200 numbered leaf nodes correspond to the 200 object detection categories.
The goal in the hierarchy construction is to save cost (by asking as few questions as possible on every image) while avoiding any ambiguity in questions which would lead to false negatives during annotation. This hierarchy is not tree-structured; some questions have multiple parents.
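A small sketch of how this "no propagates to descendants" logic could be implemented; the question strings and the children mapping shown are illustrative placeholders, not the actual hierarchy or annotation pipeline.

```python
# Sketch (illustrative question strings, not the real hierarchy): decide which
# presence/absence questions still need to be asked on an image, given the
# answers collected so far. A "no" on any ancestor implies "no" for all of
# its descendants, so those questions are never posed to workers.

children = {
    "is there an animal in the image?": ["is there a mammal in the image?"],
    "is there a mammal in the image?": ["is there a dog in the image?",
                                        "is there a cat in the image?"],
    # ... the remaining questions of the hierarchy
}
questions = set(children) | {q for kids in children.values() for q in kids}

def questions_to_ask(questions, children, answers):
    """Return the questions to pose next; `answers` maps question -> bool."""
    parents = {q: [] for q in questions}
    for p, kids in children.items():
        for k in kids:
            parents[k].append(p)

    def implied(q):
        # explicitly answered, implied "no" by an ancestor, or unknown (None)
        if q in answers:
            return answers[q]
        if any(implied(p) is False for p in parents[q]):
            return False
        return None

    return [q for q in questions
            if implied(q) is None                             # not yet determined
            and all(implied(p) is True for p in parents[q])]  # parents confirmed
```

Root questions (those with no parents) come out of this selection on every image, matching the questions marked with \(\bullet \) above.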
Appendix 5: Modification to Bounding Box System for Object Detection
The bounding box annotation system described in Sect. 3.2.1 is used for annotating images for both the single-object localization dataset and the object detection dataset. However, two additional manual post-processing steps were needed to ensure accuracy in the object detection scenario:
Ambiguous Objects The first common source of error was that workers were not able to accurately differentiate some object classes during annotation. Some commonly confused labels were seal and sea otter, backpack and purse, banjo and guitar, violin and cello, brass instruments (trumpet, trombone, french horn and brass), flute and oboe, ladle and spatula. Despite our best efforts (providing positive and negative example images in the annotation task, adding text explanations to alert the user to the distinction between these categories) these errors persisted.
In the single-object localization setting, this problem was not as prominent for two reasons. First, the way the data was collected imposed a strong prior on the object class which was present. Second, since only one object category needed to be annotated per image, ambiguous images could be discarded: for example, if workers couldn’t agree on whether or not a trumpet was in fact present, this image could simply be removed. In contrast, for the object detection setting consensus had to be reached for all target categories on all images.
To fix this problem, once bounding box annotations were collected we manually looked through all cases where the bounding boxes for two different object classes had significant overlap with each other (about \(3\,\%\) of the collected boxes). About a quarter of these boxes were found to correspond to incorrect objects and were removed. Crowdsourcing this post-processing step (with very stringent accuracy constraints) would be possible but it occurred in few enough cases that it was faster (and more accurate) to do this in-house.
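As a sketch of this check (our own code, with an assumed 0.5 IOU threshold standing in for "significant overlap", and reusing the iou helper from the CPL sketch in Appendix 2), box pairs from different classes on the same image could be flagged for in-house review as follows.

```python
# Sketch (not the original tooling): flag pairs of boxes from different
# object classes whose overlap is large enough to warrant manual review.
# The 0.5 IOU threshold is an assumption; the paper only says "significant".

def flag_cross_class_overlaps(boxes_by_class, thresh=0.5):
    """boxes_by_class: dict mapping class name -> list of (x1, y1, x2, y2)."""
    flagged = []
    classes = sorted(boxes_by_class)
    for i, c1 in enumerate(classes):
        for c2 in classes[i + 1:]:
            for b1 in boxes_by_class[c1]:
                for b2 in boxes_by_class[c2]:
                    if iou(b1, b2) >= thresh:
                        flagged.append((c1, b1, c2, b2))
    return flagged  # pairs sent for in-house verification
```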
Duplicate Annotations The second common source of error was duplicate bounding boxes drawn on the same object instance. Despite instructions not to draw more than one bounding box around the same object instance, and constraints in the annotation UI enforcing at least a 5 pixel difference between different bounding boxes, these errors persisted. One reason was that sometimes the initial bounding box was not perfect and subsequent labelers drew a slightly improved alternative.
This type of error was also present in the single-object localization scenario but was not a major cause for concern. A duplicate bounding box is a slightly perturbed but still correct positive example, and single-object localization is only concerned with correctly localizing one object instance. For the detection task algorithms are evaluated on the ability to localize every object instance, and penalized for duplicate detections, so it is imperative that these labeling errors are corrected (even if they only appear in about \(0.6\,\%\) of cases).
Approximately \(1\,\%\) of bounding boxes were found to have significant overlap of more than \(50\,\%\) with another bounding box of the same object class. We again manually verified all of these cases in-house. In approximately \(40\,\%\) of the cases the two bounding boxes correctly corresponded to different people in a crowd, to stacked plates, or to musical instruments nearby in an orchestra. In the other \(60\,\%\) of cases one of the boxes was randomly removed.
These verification steps complete the annotation procedure of bounding boxes around every instance of every object class in validation, test and a subset of training images for the detection task.
Training Set Annotation With the optimized algorithm of Sect. 3.3.3 we fully annotated the validation and test sets. However, annotating all training images with all target object classes still posed a budget challenge. Positive training images taken from the single-object localization dataset already had bounding box annotations of all instances of one object class on each image. We extended the existing annotations to the detection dataset by making two modifications. First, we corrected any bounding box omissions resulting from merging fine-grained categories: i.e., if an image belonged to the “dalmatian” category and all instances of “dalmatian” were annotated with bounding boxes for single-object localization, we ensured that all remaining “dog” instances were also annotated for the object detection task. Second, we collected significantly more training data for the person class because the existing annotation set was not diverse enough to be representative (the only people categories in the single-object localization task are scuba diver, groom, and ballplayer). To compensate, we additionally annotated people in a large fraction of the existing training set images.
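The first modification amounts to propagating fine-grained labels up to their detection-level parent class before checking for missing instances. A minimal sketch, with a hypothetical mapping and function names:

```python
# Sketch (hypothetical mapping and names): existing single-object
# localization boxes with fine-grained labels are first relabeled with their
# detection-level parent class (e.g. "dalmatian" -> "dog"); any remaining
# unannotated parent-class instances then still need new manual boxes.

FINE_TO_DETECTION = {
    "dalmatian": "dog",
    "siberian husky": "dog",
    # ... remaining fine-grained categories mapped to the 200 detection classes
}

def to_detection_labels(annotations, mapping=FINE_TO_DETECTION):
    """annotations: list of (class_name, box) pairs from the localization set."""
    return [(mapping.get(cls, cls), box) for cls, box in annotations]
```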
Appendix 6: Competition Protocol
Competition Format At the beginning of the competition period each year we release the new training/validation/test images, training/validation annotations, and competition specification for the year. We then specify a deadline for submission, usually approximately 4 months after the release of data. Teams are asked to upload a text file of their predicted annotations on test images by this deadline to a provided server. We then evaluate all submissions and release the results.
For every task we released code that takes a text file of automatically generated image annotations and compares it with the ground truth annotations to return a quantitative measure of algorithm accuracy. Teams can use this code to evaluate their performance on the validation data.
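For illustration, here is a minimal sketch of what such an evaluation could look like for the flat top-5 classification error; the submission file format shown (five predicted labels per line, one line per test image in a fixed order) is an assumption, not the official devkit format.

```python
# Sketch (assumed submission format, not the official evaluation code):
# compute the flat top-5 classification error from a text file with one line
# per test image, each line containing five predicted labels.

def top5_error(submission_path, ground_truth_labels):
    """ground_truth_labels: list of correct labels in the same image order."""
    errors = 0
    with open(submission_path) as f:
        for gt, line in zip(ground_truth_labels, f):
            if gt not in line.split()[:5]:
                errors += 1
    return errors / float(len(ground_truth_labels))
```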
As described in Everingham et al. (2014), there are three options for measuring performance on test data: (i) Release test images and annotations, and allow participants to assess performance themselves; (ii) Release test images but not test annotations—participants submit results and organizers assess performance; (iii) Neither test images nor annotations are released—participants submit software and organizers run it on new data and assess performance. In line with the PASCAL VOC choice, we opted for option (ii). Option (i) allows too much leeway in overfitting to the test data; option (iii) is infeasible, especially given the scale of our test set (40K–100K images).
We released ILSVRC2010 test annotations for the image classification task, but all other test annotations have remained hidden to discourage fine-tuning results on the test data.
Evaluation Protocol After the Challenge After the challenge period we set up an automatic evaluation server that researchers can use throughout the year to continue evaluating their algorithms against the ground truth test annotations. We limit teams to 2 submissions per week to discourage parameter tuning on the test data, and in practice we have never had a problem with researchers abusing the system.