Thesis, Year: 2012

Learning Hierarchical Feature Extractors For Image Recognition

DI-ENS - Département d'informatique - ENS-PSL (École normale supérieure, 45 rue d'Ulm, F-75230 Paris Cedex 05, France)
- ENS-PSL - École normale supérieure - Paris (45 rue d'Ulm, 75230 Paris Cedex 05, France)
  - PSL - Université Paris Sciences et Lettres (60 rue Mazarine, 75006 Paris, France)
- Inria - Institut National de Recherche en Informatique et en Automatique (Domaine de Voluceau, Rocquencourt, BP 105, 78153 Le Chesnay Cedex, France)
- CNRS - Centre National de la Recherche Scientifique: UMR8548 (France)
Inria Paris-Rocquencourt (Domaine de Voluceau, Rocquencourt, BP 105, 78153 Le Chesnay Cedex, France)

WILLOW - Models of visual object recognition and scene understanding

Y-Lan Boureau

Role: Author

Abstract

Telling cow from sheep is effortless for most animals, but requires much engineering for computers. In this thesis, we seek to tease out basic principles that underlie many recent advances in image recognition. First, we recast many methods into a common unsupervised feature extraction framework based on an alternation of coding steps, which encode the input by comparing it with a collection of reference patterns, and pooling steps, which compute an aggregation statistic summarizing the codes within some region of interest of the image. Within that framework, we conduct extensive comparative evaluations of many coding or pooling operators proposed in the literature. Our results demonstrate a robust superiority of sparse coding (which decomposes an input as a linear combination of a few visual words) and max pooling (which summarizes a set of inputs by their maximum value). We also propose macrofeatures, which import into the popular spatial pyramid framework the joint encoding of nearby features commonly practiced in neural networks, and obtain significantly improved image recognition performance. Next, we analyze the statistical properties of max pooling that underlie its better performance, through a simple theoretical model of feature activation. We then present results of experiments that confirm many predictions of the model. Beyond the pooling operator itself, an important parameter is the set of pools over which the summary statistic is computed. We propose locality in feature configuration space as a natural criterion for devising better pools. Finally, we propose ways to make coding faster and more powerful through fast convolutional feedforward architectures, and examine how to incorporate supervision into feature extraction schemes. Overall, our experiments offer insights into what makes current systems work so well, and state-of-the-art results on several image recognition benchmarks.
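As a rough illustration of the coding and pooling alternation described in the abstract, the sketch below encodes a handful of local descriptors against a dictionary of "visual words" and then summarizes the codes over a single pool. It is not the thesis's implementation. The ISTA solver, the random dictionary, the descriptor dimensions, and the names `sparse_code_ista` and `pool` are all assumptions chosen for brevity.

```python
import numpy as np

def sparse_code_ista(x, D, lam=0.1, n_iter=100):
    """Approximately minimize 0.5*||x - D z||^2 + lam*||z||_1 with ISTA.

    x: (d,) descriptor; D: (d, k) dictionary whose columns are visual words.
    ISTA is an assumed solver here, not necessarily the one used in the thesis.
    """
    step = 1.0 / (np.linalg.norm(D, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        u = z - step * (D.T @ (D @ z - x))     # gradient step on the quadratic term
        z = np.sign(u) * np.maximum(np.abs(u) - lam * step, 0.0)  # soft threshold
    return z

def pool(codes, mode="max"):
    """Summarize codes of shape (n, k) within one pool into a (k,) vector."""
    if mode == "max":
        return codes.max(axis=0)
    if mode == "avg":
        return codes.mean(axis=0)
    raise ValueError(f"unknown pooling mode: {mode}")

# Hypothetical toy data: 10 descriptors of dimension 16, 32 visual words.
rng = np.random.default_rng(0)
D = rng.standard_normal((16, 32))
D /= np.linalg.norm(D, axis=0, keepdims=True)  # unit-norm dictionary columns
X = rng.standard_normal((10, 16))

codes = np.stack([sparse_code_ista(x, D) for x in X])  # coding step
max_feat = pool(codes, "max")                          # pooling step (max)
avg_feat = pool(codes, "avg")                          # pooling step (average)
```

In a spatial pyramid setting, `pool` would be applied once per spatial cell and the resulting vectors concatenated; here a single pool covering all descriptors is used to keep the example short.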