Universitat Politècnica de Catalunya
Fashion Discovery: A Computer Vision Approach
Ph.D. Thesis
Directors: Dr. Francesc Moreno Noguer, Dr. Edgar Simo Serra
Company advisor: LongLong Yu
Barcelona, 2020
Abstract
Performing semantic interpretation of fashion images is undeniably one of the most challeng-
ing domains for computer vision. Subtle variations in color and shape might confer different
meanings or interpretations to an image. Not only is it a domain tightly coupled with human
understanding, but also with scene interpretation and context. Being able to extract fashion-
specific information from images and interpret that information in a proper manner can be useful
in many situations and help to understand the underlying information in an image.
Fashion is also one of the most important businesses around the world, with an estimated
value of 3 trillion dollars [2] and a constantly growing online market, which increases the utility
of image-based algorithms to search, classify or recommend garments.
This doctoral thesis aims to solve specific problems related with the treatment of fashion
e-commerce data, from low-level pure pixel information to high-level abstract conclusions of the
garments appearing in an image, taking advantage of the multi-modality of the available data
for developing some of the solutions.
The contributions include:
• A new superpixel extraction method focused on improving the annotation process for cloth-
ing images.
• A multi-modal embedding space in which images and their textual metadata can be jointly
represented and compared.
• The application of this embedding space to the task of retrieving the main product in an
image showing a complete outfit.
In summary, fashion is a complex computer vision and machine learning problem at many
levels, and developing specific algorithms that are able to capture essential information from
pictures and text is not trivial. In order to solve some of the challenges it proposes, and taking
into account that this is an Industrial Ph.D., we contribute with a variety of solutions that can
boost the performance of many tasks useful for the fashion e-commerce industry.
Acknowledgements

I’d like to express my gratitude to all the people who helped me during these years.
First, I must thank my three advisors. Francesc Moreno, the best supervisor one can think
of since day one, and who has paved my professional career with great support and magnificent
opportunities. I wouldn’t be in such a good place today if I hadn’t started working with him
six years ago during my Master’s degree. Edgar, for his brilliant ideas, the countless videocalls
across time zones, the abura soba and the months in Tokyo. And Long, for the time spent coding
side by side and for teaching me how to be a better programmer and a better gardener.
To my parents, always supportive in spite of the distance, putting up with my not-so-frequent
phone calls and my summers in other continents. To my little brother, who is now taller than
me but still gets mad when he loses at FIFA.
To everyone in Wide Eyes for giving me the opportunity of working in a startup environment.
Time spent there had its ups and downs, but allowed me to meet extraordinary people. Special
thanks to Arnau for all the conversations, advice and invaluable help, and Oğuz for the support,
the funny moments and the short but intense ping pong tournament.
To the Driblinhos team, with whom I’ve shared some of the most amazing experiences in
my life and many kilometers in planes, cars, vans and boats. To all my flatmates in Barcelona,
especially vegetarian Argentinian drummers and Catalan smoke sellers and DoPs.
Of course, if you’re reading this thesis, it is undoubtedly in great part because of the great
people at IRI. Downstairs, in office 6 and finally in office 19, it has been a pleasure to share space,
knowledge and jokes with all of you. Special thanks to all the subgroups with regular meetings
I proudly belong to: the tupper crew, meeting daily, because eating reheated leftovers outdoors
without even having a table never felt so good; IRI football, meeting weekly, for making me enjoy
sport again after I broke my leg for a second time; and the Agustins Academy, meeting yearly, for
investing a huge amount of time and effort in something with the only objective of laughing for
half an hour. Major shout out to Rick (office mate for a few months, the best footballer in IRI
for years), Juanan (half brilliant / half competitive devil), Carlogarro (pure light) and Fherrero
(who always makes the best jokes, which allows him to make the worst ones sometimes) for
surprisingly revitalizing my IRI days when I thought they were almost over.
Last, but not least, exceptional regards to Est for the countless laughs, for being kind
enough to present her Ph.D. before me to show me how it’s done, and for being a part of my life
since eπ in the 005. Thank you for all the years of music.
Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Thesis overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Overview 7
2.1 Computer Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Computer Vision and Machine Learning for Fashion . . . . . . . . . . . . . . . . 15
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3 Superpixel Segmentation 23
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.1 Graph-based algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.2 Seed-growing methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.3 Coarse-to-fine methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Boundary-aware Superpixel Segmentation . . . . . . . . . . . . . . . . . . . . . . 29
3.3.1 Boundary detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.2 Seeds initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.3 Energy function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3.4 Optimization process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4 Experimental evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4.1 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.2 Comparison against State of the Art . . . . . . . . . . . . . . . . . . . . . 36
3.4.3 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
7 Conclusions 81
7.1 Superpixel Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.1.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.2 Fashion Multi-modal Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.2.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.3 Main Product Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.3.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
List of Figures
4.1 Example of a text and nearest images from the test set . . . . . . . . . . . . . . . 44
4.2 Examples of text preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3 CBIR - Content-Based Image Retrieval . . . . . . . . . . . . . . . . . . . . . . . 47
4.4 Architecture of the proposed neural network . . . . . . . . . . . . . . . . . . . . . 50
4.5 Example of training samples with a context window size of 2 words . . . . . . . . 51
4.6 Scheme of the word2vec method . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.7 Word2vec’s arithmetical properties . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.8 Contrastive loss behaviour with positive and negative pairs . . . . . . . . . . . . 54
4.9 Examples of products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
And I knew exactly what to do. But in a much more real sense,
I had no idea what to do.
Michael Scott
Introduction
1.1 Motivation
There is a huge difference between hearing music and listening to music. While the first action
tends to be unsubstantial and in some cases even annoying, the second one can be rewarding at
so many levels. The same discrepancy in meaning can be observed when having a conversation:
listening to your interlocutors implies paying attention to what they say, while hearing them is
basically acknowledging that some sound is stimulating your auditory system while you think
about finishing your thesis dissertation or what to have for dinner.
A direct analogy can be drawn for images. It is completely different to see a picture and
to look at a picture. The former means just receiving light impulses through your eyes, while
the latter includes discerning complex relationships between what you see, like grouping regions
into objects, spatial reconstruction of the scene or higher level tasks such as understanding a
scene: what or who is there? What is the purpose of the things or people you see in a particular
place or disposition? Is there some contextual information you can extract from just looking at
a picture?
This complicated understanding process is done almost automatically by our brains, but
achieving the same level and speed is far from trivial for machines. The difference between
being able to extract basic features from what a computer “sees” and being able to decipher the
true meaning of an image is known as the semantic gap [158], and is one of the key problems in
computer vision.
When talking about pictures with people on them, clothing is probably one of the most
important sources of information. It can provide lots of context about the event where the
picture was taken, like a sports venue or a wedding, or about the people appearing in it, like
their occupation or whether they belong to some sort of social tribe.
Not only is fashion a very interesting field in terms of computer vision, it is also a buoyant
market whose e-commerce revenues are expected to exceed $638 billion only in the U.S. by
2022 [1]. Apart from having had a skyrocketing growth in the last decade, the majority of the
expected new consumers are within the 16 to 34 age group. The interest of younger generations in
fashion might be partly driven by the rise of social networks. As an example, in 2019, it is
estimated that Instagram users uploaded 95 million pictures and 500 million stories every day —
short publications deleted after 24 hours. Many of these publications are human-centric pictures
where the user wants to look as good as possible, and clothing style is key. Social media is
the new word-of-mouth, and all these pictures (some of them tagged with the clothing product
information) generate a large percentage of the income for fashion e-commerce sites. Hence
the importance of having a useful and efficient website, where products are easy to find and
recommendations are meaningful. In both aspects, and in many more, computer vision plays
a critical role. This confluence of factors makes the last decade a perfect breeding ground for
companies applying computer vision and machine learning techniques to fashion, like Wide Eyes
Technologies, where part of this research has been carried out.
This is an industrial Ph.D., i.e., the research work has been split between the university and
a company. Concretely, between the Universitat Politècnica de Catalunya (UPC) and Wide Eyes
Technologies, a startup company devoted to computer vision for the fashion industry. Wide Eyes
(WE) is a business-to-business (B2B) technology company providing computer vision solutions
to companies. WE focuses on image similarity, offering services to retrieve clothes using pictures
as queries and to find similar products in a customer’s catalog. They also work on attribute
Figure 1.1: Quick overview of the thesis. From the low level task of superpixel segmentation
to a high level text-based product detection on images, this thesis aims to contribute to the
fashion industry from a computer vision point of view.
detection, which allows them to offer an auto-tagging system to their clients.
Wide Eyes’ business model is based on the machine learning models created and trained by
the research team, in which I was working during the three years of development of this thesis.
Therefore, the research work presented in this dissertation is strongly influenced by the needs of
Wide Eyes, and has been developed using their data and infrastructure along with the university
resources (both from UPC and Waseda University in Tokyo, where a three-month research stay
was carried out).
1.2 Goals
This dissertation aims to extract and relate the semantic information contained in fashion e-
commerce data, with a special focus on developing systems that solve some of the current problems
at Wide Eyes. More precisely, it starts by facilitating the task of fashion image annotation by
developing a new oversegmentation algorithm that drastically reduces the number of segments.
As will be explained in Chapter 3, this algorithm can be used in other domains, since it reduces
the complexity of the oversegmentation results without losing information. Furthermore, in
Chapter 4 we improve the garment retrieval system by leveraging joint information coming from
images and textual descriptions in Wide Eyes datasets. Finally, Chapter 5 is devoted to the
usage of a similar embedding space to detect the product being sold in a fashion e-commerce
image that contains other products based on its textual description. Please see Fig. 1.1 for a
quick overview of the main topics of the thesis.
1.3 Contributions
During the development of this thesis, we started from a low-level viewpoint (grouping the
pixels of an image based on their color and distribution), later tackled mid-level tasks like the
creation of a multi-modal embedding and its usage in image retrieval, and ended with a higher-level
application that lies in between object detection and phrase localization. The main contributions
produced along this path can be divided into four main parts:
1. BASS Superpixel algorithm: we present a new algorithm for the task of image over-
segmentation. Our algorithm, called Boundary-Aware Superpixel Segmentation (BASS),
produces larger superpixels in regions of the image with fewer changes in texture,
and smaller superpixels where they are needed, in regions with more details. This allows us
to generate more compact representations of images, avoiding the addition of extra super-
pixels that don’t provide new information. Therefore, these superpixels can be used to
drastically reduce the number of superpixels to annotate in the task of manual annotation
of images for semantic segmentation. The algorithm is further explained in Chapter 3.
2. Multi-modal embedding: the majority of this thesis has been devoted to the construc-
tion and utilization of a fashion-specific multi-modal embedding space. In this space,
images and textual metadata representing fashion items can be projected, and distances
between both types of information can be easily computed. This space, generated using
a neural network, can be applied to tasks such as image and text retrieval and text-based
object detection, as explained in the next two contributions.
3. Product retrieval: the first application for which we use the multi-modal embedding is
product retrieval. We extract the embedded descriptor of a query (that can be an image or a
text) and find the closest images and texts in a large dataset (half a million products) created
from fashion e-commerce data. Chapter 4 explains the construction of the embedding and
its application to image retrieval.
4. Main product detection: finally, in Chapter 5 we apply the idea of the multi-modal
embedding to a very specific task that we call main product detection. Framed as a specific
form of phrase localization, our method aims to find, given its textual metadata, the main
product in an image depicting a model wearing several garments, i.e., the garment the text
refers to. This contribution obtained the best paper award at the 2017 ICCV Workshop
on Computer Vision for Fashion.
In addition to these contributions, many different technological tasks (less research-
oriented) were carried out during these years working at Wide Eyes. The most im-
portant ones are briefly described in Chapter 6.
1.4 Publications
The following papers have been published during the development of this thesis:
1.5 Thesis overview

• Chapter 2 reviews the current state-of-the-art in computer vision, machine learning and
fashion, focusing on the common areas between the three fields and relating them with our
work.
• Chapter 3 describes our approach to image oversegmentation and compares its results
with different state-of-the-art techniques on multiple and varied datasets.
• Chapter 4 focuses on the creation of a multi-modal embedding space for images and
texts, and its utility as a retrieval system, using classification as a tool and a byproduct.
The retrieval and classification performances are evaluated in a dataset with nearly half a
million products and compared against a baseline method.
• Chapter 7 summarizes the work explained in the previous chapters and proposes future
research lines to take it further.
Overview
The two principal ingredients of this thesis are computer vision and fashion. When talking about
computer vision in this decade, it is inevitable to mention deep learning as well. Therefore, we
organize this overview chapter in three main sections: computer vision, deep learning and fashion,
trying to emphasize how the three are not only separate trending fields, but interconnected
domains full of challenging problems for research. Further specific related work will be provided
in each of the subsequent chapters.
2.2 Deep Learning
Figure 2.1: Artificial Neural Network. Graphical depiction of an Artificial Neural Network
(ANN) with two fully connected layers.
the monopoly on many of the computer vision solutions proposed to solve problems based on
feature extraction from images, like classification, object detection or retrieval. Next, we provide
a background on deep learning and its contributions to some recent advances in computer vision.
i.e. the sum of weights connecting every neuron k in layer l − 1 with neuron j in layer l (w_jk^l)
Figure 2.2: Activation functions. Graphics for different activation functions (ReLU, sigmoid,
hyperbolic tangent and Parametric ReLU).
the weights connecting the neurons in layer l − 1 with the neurons in l, and the vector containing
the bias of the neurons in the layer l.
These parameters are learned using the well-known backpropagation algorithm, which fine-
tunes the weights of the network based on an error value obtained with a loss function in the
previous iteration. This technique was introduced in the 1970s, but wasn’t fully appreciated
until Rumelhart et al. [139] proved that it worked faster than previous approaches for learning.
Backpropagation consists of computing the partial derivatives of an error or loss function L with
respect to the network weights. These derivatives are computed layer by layer applying the chain
rule of calculus.
When training a network, an iterative process is followed with the following steps until some
stopping condition is met:
1. Forward pass: each sample is fed into the network and propagated through all the layers
using Eq. (2.2).
2. Loss computation: the value of L is obtained (using the corresponding ground truth
information of the sample in the case of supervised training).
3. Backward pass: propagation of the error backwards through the network from the last
layer to the first.
4. Weights update: the variation in the values of the weights after iteration i, ∆W^i, is
computed as the derivative of the loss multiplied by a learning rate factor η, so weight
values are updated as in Eq. (2.3), and the same for bias values (Eq. (2.4)).
W^{i+1} = W^i − ∆W^i = W^i − η ∂L/∂W^i                    (2.3)

b^{i+1} = b^i − ∆b^i = b^i − η ∂L/∂b^i                    (2.4)
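To make these update rules concrete, the following is a minimal NumPy sketch (illustrative only, not the code used in this thesis) of one training iteration for a single fully connected layer with a sigmoid activation and a mean squared error loss; all variable names are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 8 samples with 4 features each, and scalar targets.
X = rng.normal(size=(8, 4))
y = rng.normal(size=(8, 1))

# Parameters of a single fully connected layer (4 inputs -> 1 output).
W = rng.normal(scale=0.1, size=(4, 1))
b = np.zeros((1, 1))
eta = 0.1  # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 1. Forward pass.
z = X @ W + b                    # pre-activation
a = sigmoid(z)                   # activation (network output)

# 2. Loss computation (mean squared error against the ground truth).
loss = np.mean((a - y) ** 2)

# 3. Backward pass: chain rule from the loss down to the parameters.
dL_da = 2.0 * (a - y) / len(X)
da_dz = a * (1.0 - a)            # derivative of the sigmoid
delta = dL_da * da_dz            # dL/dz
dL_dW = X.T @ delta              # dL/dW
dL_db = delta.sum(axis=0, keepdims=True)

# 4. Weight update, as in Eqs. (2.3) and (2.4).
W = W - eta * dL_dW
b = b - eta * dL_db

print(f"loss after the forward pass: {loss:.4f}")
```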
Figure 2.3: Convolution operation. The filter is multiplied element-wise with the correspond-
ing region of the source image, and the results are summed, obtaining a single value that goes
into the next layer. The convolution of the filter with the image in all the possible locations
produces all the values of the next layer.
In computer vision, where the inputs to the network are images, it is quite usual to use
convolutional layers (instead of the fully connected layers just explained). Here, the input is not
a vector, but a multi-channel image. A convolutional layer normally comprises a set of filters
that are independently convolved with the image, each one of them providing a feature map. The
convolution operation of one filter is the result of taking the dot product between the filter and
a small part of the image, sliding over all possible locations for the filter. Therefore, it produces a
scalar value for each location. In this way, the size of the resulting feature map is (if no padding is
added) smaller than the image size. See Fig. 2.3 for a graphical explanation. Using convolutional
layers, the 2D spatial information is preserved, and the number of parameters decreases at the
cost of increasing the number of operations. The behaviour of these layers is similar to fully
connected ones, but the weights are shared for all the neurons in the layer. The filters become
our learnable parameters. Figure 2.4 shows how, depending on the depth of the network, filters
learn to activate when certain edges, patterns or colors are detected on the image.
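As a concrete illustration of the operation in Fig. 2.3, below is a minimal single-channel convolution sketch (no padding, stride 1); the filter values are only an example.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the filter over every valid location, multiply it element-wise
    with the underlying patch and sum the result into a single value."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
vertical_edge_filter = np.array([[1.0, 0.0, -1.0],
                                 [1.0, 0.0, -1.0],
                                 [1.0, 0.0, -1.0]])
feature_map = conv2d_valid(image, vertical_edge_filter)
print(feature_map.shape)  # (4, 4): smaller than the input because no padding is used
```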
One limitation of these layers is that they learn the precise position of features in the input
image, so small variations in the image will produce different feature maps. To overcome this
problem, pooling layers are normally added in between convolutional layers. In short, these
layers reduce the feature map size (what is called downsampling) by
taking for instance the maximum (max pooling) or the average (average pooling) over a set of
contiguous pixels.
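A minimal sketch of 2 × 2 max pooling with stride 2 (any remainder is cropped), just to illustrate the downsampling described above:

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    """Downsample by keeping the maximum over non-overlapping size x size blocks."""
    H, W = feature_map.shape
    fm = feature_map[:H - H % size, :W - W % size]   # crop to a multiple of `size`
    blocks = fm.reshape(fm.shape[0] // size, size, fm.shape[1] // size, size)
    return blocks.max(axis=(1, 3))

fm = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(fm))   # each output value is the maximum of a 2x2 block
```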
An architecture with tens of millions of parameters is prone to overfitting if the training is
long enough. In order to avoid this problem and get networks with a certain ability to generalize
their results, apart from using a dataset as big as possible, some regularization techniques are
beneficial. Some of the most common are L2 or L1 regularization (adding a term to the loss
Figure 2.4: Filters learned after training. Note how the ones in the first layers are more
abstract (vertical or horizontal patterns, or even simpler edges with different orientations),
while those in deeper layers will have a high response to very specific shapes. Figure extracted
from [208].
penalizing big values of the weights [86]), dropout [160] (randomly removing some nodes and
their connections for one iteration, so each iteration has a different set of nodes) and data
augmentation (applying some transformation to the input data, like rotating, flipping, shifting,
scaling, cropping, changing brightness or color, etc. in the case of images to artificially increase
the size of the training dataset).
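As a rough sketch of two of these regularizers (an illustration, not the training code used in this thesis), the L2 penalty is simply an extra loss term on the weights, and inverted dropout randomly removes activations during training:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_penalty(weights, lam=1e-4):
    """Extra loss term lam * sum(w^2) that penalizes large weight values."""
    return lam * sum(np.sum(w ** 2) for w in weights)

def dropout(activations, p=0.5, training=True):
    """Inverted dropout: zero out units with probability p during training and
    rescale the survivors so the expected activation stays unchanged."""
    if not training:
        return activations
    mask = (rng.random(activations.shape) >= p) / (1.0 - p)
    return activations * mask

a = rng.normal(size=(4, 8))       # a batch of hidden activations
print(l2_penalty([a]))            # value to be added to the loss
print(dropout(a).round(2))        # roughly half of the units are set to zero
```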
Although ANNs have been around for quite a long time [189], and despite the accumulation
of many minor contributions improving their performance, it wasn’t until computational power
was sufficient to train systems with such high numbers of parameters that they exploded
as one of the most used techniques in computer vision. Now, they provide the state-of-the-art
performance in many tasks, some of which are briefly reviewed below.
Image Classification
Image classification has undergone a revolution since Krizhevsky et al. [85] reduced the error rate
of the ImageNet Large Scale Visual Recognition Challenge [140] by more than 10 points thanks to the
utilization of GPUs (graphics processing units), which enabled them to train their expensive model (Fig. 2.5
shows some examples of the images in the challenge). The network they defined, AlexNet, was
formed by five convolutional layers with some max pooling layers in between, followed by three
fully connected layers. They used the non-saturating Rectified Linear Unit (ReLU) activation
function (f (x) = max(0, x)), that significantly increased the training speed with respect to the
Figure 2.5: ImageNet dataset examples. Some images extracted from the ImageNet dataset
and their corresponding category labels. Figure extracted from [34].
usual hyperbolic tangent (f (x) = tanh(x)) or sigmoid (f (x) = (1 + e−x )−1 ). See Fig. 2.2 for a
graphic explanation.
This achievement cleared the path for all the deep learning techniques used for this and other
tasks ever since. For instance, Simonyan and Zisserman [156] and Szegedy et al. [161] studied
how the depth of a Convolutional Neural Network (CNN) affected its classification accuracy.
He et al. [62] claimed to achieve superhuman performance using Parametric ReLU units
(see Fig. 2.2), and so did Ioffe and Szegedy [73] using Batch Normalization (BatchNorm). Batch
Normalization is a technique that normalizes the output of the previous activation layer using
the current batch’s mean and standard deviation during training. By doing this, the internal
covariate shift (the amount by which the distribution of hidden unit values changes) is reduced,
and the network will perform better when receiving a test set with a distribution different from
the training set. It also increases the stability of the neural network, reducing overfitting and
allowing the use of higher learning rates.
Later, He et al. [63] aimed to ease the training of very deep networks by presenting a residual
framework that inserts shortcut connections in the networks.
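A minimal sketch of the Batch Normalization forward pass described above, at training time, with the learnable scale gamma and shift beta of [73] (illustrative, per-feature statistics only):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature with the statistics of the current batch,
    then scale and shift with the learnable parameters gamma and beta."""
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations
    return gamma * x_hat + beta

x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(16, 4))
out = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(6))  # ~0 per feature
print(out.std(axis=0).round(3))   # ~1 per feature
```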
Object Detection
A more challenging problem in computer vision is object detection, i.e. detecting and classifying
a specific region of an image, for which PASCAL Visual Object Classes (PASCAL VOC [41])
is one of the most used datasets. Some of the most successful works in object detection are
YOLO [134] (Fig. 2.6), which treats detection as a regression problem, Fast-RCNN [49] (described in
more detail in Section 5.3.4, where one of its ideas is applied to our method) and its evolution
Faster-RCNN [135], which introduces a Region Proposal Network (RPN) for both predicting object
bounds and objectness scores, enabling almost cost-free region proposals.
Image Segmentation
Deep learning has also been applied to the image segmentation problem, already explained in
Section 2.1. The goal of this task is assigning a label to each pixel in the image. For this, the
output of the network has to be a feature map with the same height and width of the original
Figure 2.6: YOLO object detection results. Examples of object detection using YOLO
algorithm. Figure extracted from [134].
input image. Although Chapter 3 of this thesis is devoted to oversegmentation, classical pre-deep
learning techniques are used, and this is just mentioned to draw attention to the fact that in the
future these techniques can be considered as an alternative to the proposed method.
Recapping the overview so far and relating it with the present work, in this thesis we start
focusing on low level problems tackled with more classical techniques (Chapter 3), then evolve to
use deep learning techniques relating image and textual information (Chapter 4) and finish with
a sort of object detection method mixed with phrase localization (Chapter 5). All these methods
are designed with special attention to their application in the fashion domain, and that is why
the next Section presents a brief summary of computer vision and machine learning techniques
applied to fashion.
2.3 Computer Vision and Machine Learning for Fashion
Figure 2.7: Examples of retrieved items for the street-to-shop task. Query on the left,
results on the right, ordered by increasing distance to the query. Ground truth is marked in
green among the results. Figure extracted from [58].
Retrieval
The retrieval task consists of finding relevant items in a dataset that match a given query. More
precisely, talking about fashion product retrieval, the task consists of finding a product inside
a catalog based on —usually— a short textual description or an image. Some retrieval systems
rely upon an attribute-based search, like [35]. Others focus on the so called street-to-shop task,
consisting of matching a street product (a product appearing in a real-world picture, taken by
a non-expert user in an uncontrolled environment) with a shop product (a product appearing
in a studio picture, normally isolated and with a plain background). For this problem, Liu
et al. [95] describe an approach based on features for detected human parts, while Kiapour et
al. [58] propose to solve an exact street-to-shop problem, not only giving as a result a similar
item, but the identical one. They treat the problem as a similarity learning task and formulate
it as a binary classification where example pairs are classified as positive or negative. Some
results are shown in Fig. 2.7. In [179], the authors deal with a slightly different kind of domain
adaptation problem, aiming to find similarities between outfits from fashion runways and street
outfits (see Fig. 2.8). In this thesis, we tackle the problem of multi-domain retrieval, being able
to search for images or text descriptions using either images or texts as queries interchangeably via
the creation of a common embedding space. For a more detailed overview of retrieval methods
relevant to our research, see Section 4.2.
Clothing Categorization
Another common task is garment classification or clothing categorization. That is, given an
image depicting a garment, assigning it a category label (e.g., shirt or shoes). To solve this
problem, Bossard et al. [22] use noisy data to train a multi-class system based on Random
Forests. Chen et al. [26] extract low-level features using human pose information and then learn
attribute classifiers. In our case, classification just serves as an additional help to give semantic
Figure 2.8: Example of runway to realway solution. The query comes from a fashion
runway session, while the results are found among street pictures. Figure extracted from [179].
meaning to the embedding spaces designed in Chapters 4 and 5, and good classification results
are a byproduct of the quality of our embeddings.
Clothing Parsing
Adapting semantic segmentation (one of the classical computer vision problems) to the fashion
domain receives the name of clothing parsing, i.e. assigning a garment label to each pixel in
the image. Solving this task has proven to be a fruitful source of new algorithms, many of
them trained using the Fashionista dataset released by Yamaguchi et al. in [196], consisting of
more than 150,000 fashion photos with associated text annotations. This is one of the datasets
used for evaluation in Chapter 3. Also in [196], they use Conditional Random Fields (CRF) with
knowledge about pose estimation to solve the clothing parsing problem. Other work that uses
pose-aware CRFs applied to this problem is [149], which overcomes the lack of class annotations
at test time by taking into account that the domain of the task is very specific: clothing
a person. Yamaguchi et al. in [195] transfer mask predictions from the retrieved clothing items
to the unannotated query image. In [201], Yang et al. propose a new framework for clothing
parsing by exploiting relationships between contiguous parts of an image as well as parts of
different images with similar appearance.
Some other datasets, like Paperdoll [195] or Fashion144k [150] are useful for semisupervised
algorithms, since their labels are weak and noisy. More recently, DeepFashion dataset [97] has
been released, containing over 800,000 diverse fashion images from different domains, annotated
with multiple categories and attributes per image, and also with bounding boxes of the gar-
ments and clothing landmarks. Although we do not specifically treat this problem along this
dissertation, some of these datasets were used to train or evaluate different algorithms during
the development of the thesis, and the superpixels generated in Chapter 3 are developed with
the specific aim of efficiently annotating clothing parsing data.
Attribute Recognition
Given the large amount of subtle details that can make one garment different from another one,
many research works focus on attribute recognition. For this task, the assumption is that every
Figure 2.9: Annotated samples from Fashionista dataset. Figure extracted from [149].
fashion item in a picture has some lower-level properties that can boost recognition or retrieval
tasks. These properties, called attributes, can be generic (like color or fabric) or garment-specific
(like neck type for upper-body objects, legs length for pants or heels height for shoes). Normally,
each e-commerce site tags its products with a series of attributes that are unique and (probably)
slightly different from the attributes of its competitors.
In [197], the authors aim to recognize fashion attributes (and, at the same time, to detect
clothing items) establishing some inter-object and inter-attribute compatibility information using
CRFs. For instance, dresses are incompatible with pants, skirts or shorts. This compatibility
information comes from a real-world dataset, and can be seen in Fig. 2.10. Berg et al. [19] use
images and text mined from the Internet to automatically discover attributes of different types
(global, local, color, texture or shape). This work, like Chapter 5 of this thesis, aims to relate
noisy text annotations with regions of images. While they focus on discovering attributes of a
particular object, we focus on locating an entire object in images with more than one item.
Next, there is a brief overview of some trends in computer vision for fashion research that were
explored at some point along the development of this thesis, but without reaching a satisfactory
result in the form of a publication or a final product for the company.
Style Understanding
Moving to higher level tasks, we find techniques that seek to extract the underlying style in-
formation given by the combination of different garments that form an outfit. Here, we find
works like [151], where the authors proposed compact 128-dimensional features that encapsulate
the properties of an outfit; [69], that uses unsupervised probabilistic polylingual topic models;
or [150], that proposes a fashionability score to evaluate outfit compatibility. [194] focuses on
Figure 2.10: Pearson correlation between clothing items. Notice exclusive blocks (for
instance, boots and shoes are incompatible, and dress is not very compatible with shirt, top,
pants, skirt or shorts). Figure extracted from [197].
detecting potential popularity of a picture in different scenarios (online and offline). More social-
oriented papers also exploit the relationships between the different garments on images, like [78],
that classifies people in different social tribes according to what they wear, or [159], predicting
people’s occupation based on their clothing. Some authors try to accumulate style information
over time, analyzing trend evolutions, like [64, 10], or even clustering the fashion styles according
to world regions based on millions of Instagram photos to analyze spatio-temporal trends [105].
Recommendation
When the information relating several garments is used not only to predict a compatibility score
or extract style information, but to actively recommend the inclusion of some other garment in
the outfit (or in a specific user wardrobe), we talk about recommendation.
Recommendation techniques are widely used in online shopping, music and video streaming
services or social networks, where one of the main problems is the data sparsity. One very popu-
lar approach for recommendation is based on user ratings, but this requires active participation
by the users, who need to provide some feedback. Recent works have focused on overcoming this
problem. Some of them use the attributes for calculating similarity between outfits to recom-
mend [45], others combine visual appearance and metadata, like [91] or [61], where they treat
outfits as sequences of garments to exploit the benefits of Long Short-Term Memory networks
(LSTMs). Hsiao and Grauman [70] create what they call capsule wardrobes, sets of garments
that can be combined to form several different stylish outfits. Given the subjective nature of the
Figure 2.11: Artificially generated fashion images. Figure extracted from [129].
domain, where small changes determine if two or more garments are fashionable or not, tasks re-
lated to style and recommendation are difficult to evaluate and are often based on co-occurrence
or manual assessment, both of which are normally biased.
One of the biggest trends in computer vision in the past years is Generative Adversarial
Networks (GANs), and fashion researchers have of course taken notice. Some very recent works
in the area are suitable to be used as data augmentation in many fashion-related tasks, since
they synthetically generate images of people wearing clothes in different positions [129, 100] or
garments with different attributes [204], while some other researchers focus on the virtual fitting
room problem, where the goal is generating an image of the user wearing different clothes [217].
There has been a wide range of literature on representing the geometry of clothing. Early ap-
proaches described non-rigid surfaces using models inspired by physics, such as thin-plates [108]
and elastic models [82]. More complex deformations can be captured by Shape-from-Template
approaches [114, 116], which aim at recovering the surface geometry given a reference configura-
tion in which the template shape is known, and a set of 3D-to-2D correspondences between this
shape and the input image. On top of this, additional constraints enforcing isometry [145] and
photometric consistency [114, 116] are considered. Temporal information is another typically
exploited prior. Non-rigid Shape-from-Motion techniques recover deformable shape and camera
motion from a sequence of 2D tracks, exploiting physical properties [7, 8, 6, 115, 146]. In all
these works, it is paramount to achieve good keypoint descriptors robust to non-linear image
deformations [154]. Recent works have shown that such representations can be well captured
by deep architectures, even without requiring local descriptors [128]. Depth from RGB-D cam-
eras has also been used to model clothing, especially for robotics applications related to cloth
manipulation [131, 132].
Table 2.1: Fashion datasets. List of different datasets with fashion images, with their image
and text type, the number of garment categories and attributes, and whether they include any
localization data or annotated pairs of images representing the same item.
Dataset # images Image type Text type # cats. # attrs. Localization Pairs
Clothing Attrs. [26] 1856 Shop / street Tags 7 26 - -
Fashionista [196] 685 Social network Tags, comments 56 - 14 body parts -
Paper Doll [195] 339,797 Social network Tags, comments 56 - Weak pose estim. -
ACWS [22] 145,718 Mixed Tags 15 77 keys Bounding boxes -
Colorful-Fashion [94] 2,682 Social network Tags 23 13 colours Pixel-level color -
Daily-Photos [39] 2,500 Social network - 18 - Pixel-level -
Cl. Co-Parsing [201] 2,098 Social network Tags 59 - Pixel-level -
Fashion-136K [74] 135,893 Social network Tags, description - - Bounding boxes -
Fashion-350K [74] 350,000 Tops & blouses Tags, description - - Bounding boxes -
Fashion-Q1K [74] 1,000 Skirts Tags, description - 8 patterns Bounding boxes -
Online-data [27] 341,021 Street Tags, description 15 67 ∼ 6K b. boxes -
Exact S2S [58] 425,040 Street / shop Tags 11 - Bounding boxes 39,479
Clothes-50K [192] ∼ 70K Shop Description 14 - - -
Online-offline [71] ∼ 500,000 Shop / street Tags 9 179 - 90,000
DeepFashion [97] >800,000 Shop / street Tags, description 50 1,000 Landmarks 300,000
2.4 Summary
In this chapter, we presented some of the existing challenges in the fashion domain that can be
tackled with the help of computer vision or deep learning. Problems like segmentation, classi-
fication or retrieval are perfect examples of the combination between scientific and commercial
interest that this topic offers. Moreover, the large quantity of available data (see Table 2.1) en-
courages the adoption of these techniques and makes it a very interesting subject for researchers.
Superpixel Segmentation
In this chapter we describe the first contribution of the thesis: a new method for generating
superpixel segmentations of images. This new method produces smaller superpixels in regions of
interest and large superpixels in more homogeneous regions with less information, and is moti-
vated by the flaws of other superpixel methods when facing the specific task of image annotation
for clothing parsing.
3.1 Introduction
We stated in the introduction that this thesis would start by solving low-level problems, and in
computer vision there is no lower level than pixels. This chapter is devoted to the development of
a new method to group pixels into perceptually meaningful atomic clusters (known as superpixels)
that share some properties (e.g. color or texture), normally belonging to the same object in the
image.
Assuming that a superpixel is formed only by pixels belonging to the same physical world
object, this oversegmentation can drastically reduce the computational cost of properties that
remain approximately constant for an object [174]. In addition, the information provided by
superpixels is much more discriminative than that of single pixels, because it includes color
histograms and shape, and therefore can be used for instance in applications that require spatial
information [68]. As expected, representing images as a non-overlapping set of superpixels is
a standard practice as a preprocessing step for many computer vision applications, including
depth estimation [93, 68], object detection [168], localization [9], tracking [186], appearance
descriptors [167, 169], gesture recognition [182], human pose estimation [79, 118, 117], place or
object recognition [120, 122] and semantic segmentation [149, 65, 102, 47].
By the time this thesis started, one of the main concerns in Wide Eyes was obtaining more
data. More precisely, images annotated with segmented garments. In WE’s annotation system,
similar to the one in [164], the user would click on a superpixel and assign it to a certain category.
The main problem of this system is a classic problem of superpixels: it was hard to find the
balance between the number of superpixels and their adherence to the boundaries present in the
image. A large number of superpixels would guarantee that even the smallest parts of the objects
in the image can be correctly segmented, but the cost of annotating that many superpixels would
be huge. On the other hand, having a lower number of superpixels might lead to the loss of some
details in the segmentation, and smaller garments like glasses, shoes or jewelry might remain
unsegmented. Please see Fig. 3.1 for examples of over- and undersegmentation.
Figure 3.2: Overview of the proposed method. From left to right: (a): input image, with
overlaid boundaries and initial seeds positions; geodesic distance with respect to a specific seed;
and result of our Boundary-Aware Superpixel Segmentation (BASS) with 26 superpixels. (b):
results of state-of-the-art superpixel segmentations SEEDS [173] (36 superpixels), SLIC [5] (36
superpixels), and Yao et al. [202] (48 superpixels). Even with a smaller number of superpixels,
our algorithm is able to achieve better results for the Variation of Information (VOI ) metric
while maintaining the Undersegmentation Error (UE) value when compared with state-of-the-art
methods.
In other words, superpixels are expected to reduce image complexity while respecting the
boundaries, and at the same time they should avoid loss of information due to under-segmentation.
The trade-off between these two requirements has been tackled via Normalized Cuts [148], mean
shift [28], local variation [42], geometric flows [185, 89] and watershed [177]. Another requirement
when computing the superpixels consists of homogeneously distributing them over the image and
keeping their sizes within limited bounds.
In contrast, and taking these considerations into account, we argue that in many situations,
the superpixels can be safely merged and their number highly reduced, simplifying thus subse-
quent tasks. Therefore, for the first part of this thesis we focused on generating a superpixel
segmentation algorithm able to reduce the number of superpixels without losing the details of
the image. The main idea behind the algorithm is adapting the size of the superpixels on each
region of the image depending on the density of boundaries in that region (as objects in images
can generally be described by their boundaries), so that large homogeneous regions are divided
into larger superpixels, while regions with more texture and details are divided into smaller su-
perpixels to maintain that information. For this purpose, we introduce two main ingredients:
1) we first propose a new approach that spreads the initial superpixels seeds non-uniformly, de-
pending on the image content, and 2) we leverage image intensity boundaries and a geodesic
distance metric to produce smaller superpixels where there is potentially more information in
the image (i.e., regions with more intensity boundaries), and bigger superpixels in regions with
less presence of boundaries. By doing this, we simultaneously prevent extreme oversegmentation
without information gain, and avoid undersegmentation in regions where more precise superpix-
els are needed, hence we are able to maintain the coherence of the image structure with fewer
superpixels than other approaches.
Even though the origin of the idea was to improve fashion product annotation, the algorithm
we develop is generic and agnostic to the type of images it receives. In fact, we first evaluated
Figure 3.3: Results of graph-based superpixel algorithms. Examples from [42] (a), [31]
(b) and [40] (c). The approach in (a) produces different-sized superpixels, but they do not adhere
closely to the semantic segmentation. Superpixels in (b) are bigger, more similar to a semantic segmen-
tation result than to a superpixel preprocessing step. In (c), manual constraints can be introduced
to improve the results.
it in a fashion specific dataset and then, to demonstrate generality, in other datasets. As shown
in Fig. 3.2 and expanded in the results section, our approach brings numerous advantages and
improves segmentation metrics compared to the most recent methods. Concretely, we show
the resulting algorithm to yield smaller Variation of Information values on these datasets while
maintaining Undersegmentation Error values similar to state-of-the-art methods.
In summary, the main contributions of this chapter are:
• Use of an energy function that takes into account color information and both Euclidean
and geodesic distance between pixels.
• Exhaustive evaluation of the resulting algorithm in seven different datasets (both multiclass
and foreground/background) with two different metrics.
• Better Variation of Information metric than state-of-the-art methods and similar values
for Undersegmentation Error and Boundary Recall for a smaller number of superpixels.
Figure 3.4: Results of seed-growing superpixel algorithms. Examples from [175] (a), [89]
(b), [5] (c) and [185] (d). In (a), a tree is formed linking pixels to the nearest neighbor instead of
shifting each point to a local mean. In (b) and (c), superpixels grow from evenly distributed pixels
called seeds, producing dense and regular grids of superpixels. In (d), seeds are also regularly
placed, but superpixels can be split according to geodesic distances.
minimizing a graph-based objective function. However, the computational cost of NC was quite
high, taking several minutes to segment a 480 × 320 pixel image. Based on the same
idea, subsequent works proposed alternatives to speed up the graph-based minimization process
by using agglomerative clustering of the nodes [42] (example result in Fig. 3.3a), by decomposing
the graph in multiple scales [31] (see Fig. 3.3b) or by adding grouping constraints [40] (Fig. 3.3c).
One of the most well known approaches is Graphcut [176], in which the constraints for the label of
a pixel come from a dense set of overlapped patches, enforcing the regularity of the superpixels.
Finally, [212] uses pseudo-boolean optimization to speed up the graph cut algorithm to 0.5
seconds per image. Although a lot of work has been devoted to the optimization of this kind of
algorithms, especially regarding memory load, they still present some disadvantages with respect
to our proposed approach, such as an excessive uniformity between the resulting superpixels
caused by their tendency to produce small contours.
Figure 3.5: Results of coarse-to-fine superpixel algorithms. Examples from [173] (a) and
[202] (b). As depicted in the images, the algorithms iteratively divide the image in blocks, that
are assigned to superpixels based on the minimization of an energy function.
between the pixels and the seeds. All the methods within this category are more efficient than
graph-based algorithms, with SLIC being the fastest among them. Nonetheless, their performance is
not always better. Our method follows this line of work, but we primarily favor reducing the
number of superpixels while trying to maintain the quality of the segmentation.
Figure 3.6: Summary of the main steps of the method. First, the boundary image is
obtained. Seeds are regularly distributed over the image, and based on the density of edges,
some of them are deleted and some intermediate seeds are added. After that, more seeds are
placed in the center of large empty spaces. Once the seeds positions are determined, the method
iterates computing the energy function for each seed, and assigning labels to pixels trying to
minimize the total energy. Once the termination condition is reached, the connectivity of the
labeled pixels is enforced, achieving the final superpixel segmentation.
differences and similarities with previous methods from the state of the art.
Commonly, superpixel algorithms group pixels based on L2 distance computed in a 5-dimensional
space of color and pixel coordinates2 . In this way, if two pixels are close and have a similar color,
they tend to be grouped into the same superpixel.
While this is a standard practice, it ignores the information along the path joining pairs of
pixels, which can produce undesirable effects such as undersegmentations. Furthermore, many
state-of-the-art algorithms force superpixels to be regular-sized and homogeneously distributed
over the image. Again, this seems to be a reasonable heuristic to apply, however, it is prone to
produce excessive over-segmentations in regions where small superpixels are unnecessary, such
as backgrounds or large regions with homogeneous color.
These methods produce satisfactory results when the desired number of superpixels is prop-
erly set, i.e., with a value that balances the trade-off between preserving the image details and
producing an excessively large number of superpixels. Nonetheless, in many cases an extreme
over-segmentation is needed in order for superpixels to adapt to the ground-truth boundaries.
This fact implies a higher cost in the computation of the segmentation. Furthermore, since super-
pixels are mainly used as a compressed representation for images in higher-level tasks, increasing
the number of superpixels also increases the complexity of these applications.
For the first part of the thesis, we address the problem with the goal of producing more
“useful” superpixels, preventing extreme over-segmentation while still producing an accurate
representation of the image for subsequent tasks. In order to do that, we compute the boundaries
2 Three dimensions for color space (e.g., RGB or CIELAB), and two for pixel coordinates (horizontal and vertical).
Figure 3.7: Examples of images and their extracted boundaries. Observe how the method
for boundary detection that we use [38] extracts high level boundaries at object level, discarding
internal boundaries not useful for the next steps.
of the image and increase the concentration of superpixels in regions with more edges, where more
detail is necessary. Consequently, superpixels in these regions are smaller than those located in
more homogeneous ones (with few edges). Moreover, drawing inspiration from [185], we modify
the energy function to be minimized by adding a new term that takes into account the geodesic
distance between two points, which helps to retain the structure. Note, however, that [185] still
produces quite homogeneous superpixels, not content-aware sized superpixels as ours (see
Fig. 3.4d).
We next describe the steps of the proposed algorithm. Refer to Fig. 3.6 for a visual explana-
tion.
Figure 3.8: Examples of seeds locations after initialization. Observe how seeds are gen-
erally centered around regions of interest (with many boundaries) while in more homogeneous
spaces the number of seeds is drastically reduced.
pixels found inside a square region sized S × S around each seed, we decide whether or not to
add or delete any seed by comparison against a certain threshold T_ad = (Σ e_i)/N, being e_i a pixel
in the boundary image (with value 0 or 1), and N the total number of pixels in the image. More
formally, the seed addition/deletion operation can be written as:

    Add,     if (Σ_S e_i)/N > 3 · T_ad
    Delete,  if (Σ_S e_i)/N < T_ad                    (3.1)

where Σ_S e_i represents the sum of all the pixels in the mentioned square region centered in a
seed. If the condition for adding seeds is satisfied, four new seeds are created in the corners of
such region. The integral image of the boundaries is used to obtain these values in order to speed
up the computation.
Note that the condition for adding is stricter than that for deleting, as our objective is min-
imizing the final number of superpixels while maintaining a good quality in the segmentation.
Finally, we place a seed in the centroid of empty regions with areas larger than S × S pixels (top
right image in Fig. 3.6). Some examples of seeds placement are shown in Fig. 3.8.
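The rule of Eq. (3.1) can be sketched as follows (a simplified illustration, not the thesis implementation; names such as update_seeds are hypothetical). Note one assumption of the sketch: each window sum is normalized by the window area so it can be compared against the global edge density T_ad; an integral image provides every window sum in constant time.

```python
import numpy as np

def box_sum(integral, r0, c0, r1, c1):
    """Sum of boundary pixels in the window [r0:r1, c0:c1], via the padded integral image."""
    return (integral[r1, c1] - integral[r0, c1]
            - integral[r1, c0] + integral[r0, c0])

def update_seeds(boundaries, seeds, S):
    """boundaries: binary edge map (H x W); seeds: list of (row, col); S: window size."""
    H, W = boundaries.shape
    T_ad = boundaries.sum() / (H * W)                  # global edge density (threshold)
    integral = np.pad(boundaries, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    kept, added = [], []
    for (r, c) in seeds:
        r0, c0 = max(r - S // 2, 0), max(c - S // 2, 0)
        r1, c1 = min(r + S // 2, H), min(c + S // 2, W)
        area = (r1 - r0) * (c1 - c0)
        density = box_sum(integral, r0, c0, r1, c1) / area   # local edge density (assumed normalization)
        if density < T_ad:                             # delete: too few boundaries nearby
            continue
        kept.append((r, c))
        if density > 3 * T_ad:                         # add: four new seeds at the corners
            added += [(r0, c0), (r0, c1 - 1), (r1 - 1, c0), (r1 - 1, c1 - 1)]
    return kept + added

edges = (np.random.default_rng(0).random((120, 120)) > 0.95).astype(float)
grid = [(r, c) for r in range(10, 120, 20) for c in range(10, 120, 20)]
print(len(update_seeds(edges, grid, S=20)))
```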
Figure 3.9: Compactness effect. Input image (a) and results of BASS varying only the
compactness term C. Observe how the superpixels tend to be more regular as the value of C
increases. In images (b) and (c) superpixels are sharp and tend to have irregular frontiers. In
(e), the oversegmentation starts losing some meaningful parts of the image. (f) shows the effect
of using an extreme value.
S_k = [l_k, a_k, b_k, x_k, y_k]^T                    (3.2)
where (x_k, y_k) are the pixel coordinates of seed S_k on the image and (l_k, a_k, b_k) are its color
values in CIELAB color space. This color space represents the color as three real values: L for
lightness from 0 (black) to 100 (white), a from green to red and b from blue to yellow (values
of a and b are implementation-specific, but are normally in the range of -100 to +100 or -128
to +127). It was designed to approximate human perception, which makes it well suited for the
task of superpixel segmentation, since we want our superpixels to be visually meaningful.
Figure 3.10: Geodesic distance. Two examples of geodesic distance in a region around a
specific seed (marked with a red dot). For each case, from left to right: region of the original
image, region of the edges image and geodesic distance, where black is lower and white is higher.
Observe how edges act as a sort of barrier, increasing the distance of pixels on the other side.
The first two energy terms in Eq. (3.3), corresponding to color and Euclidean distance, are
computed as in [5]:

E_{lab} = \sqrt{(l_k - l_i)^2 + (a_k - a_i)^2 + (b_k - b_i)^2} \qquad (3.4)

E_{xy} = \sqrt{(x_k - x_i)^2 + (y_k - y_i)^2} \qquad (3.5)
The last energy term depends on the gray-weighted geodesic distance computed over the binary
boundary image. This distance is defined as the smallest weighted sum of gray levels along
the discrete path between two given pixels. Concretely, we implement the Distance Transform
on Curved Space from [165]. This operation yields an image where every pixel i has a value
corresponding to the distance of that pixel to the nearest seed Sk . The region in which we
compute this energy for each seed has a 2S ×2S size. Examples of this distance for different seeds
in a given image are shown in Fig. 3.10, whereas Fig. 3.11 shows examples of color, Euclidean
and geodesic energy values per pixel.
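A minimal sketch of such a gray-weighted geodesic distance is given below, using a simple Dijkstra expansion over the boundary image; the exact weighting of the Distance Transform on Curved Space in [165] may differ, so the step-cost term here is an assumption.

```python
import heapq
import numpy as np

def geodesic_distance(boundary_img, seed, region=None):
    """Gray-weighted geodesic distance from a seed pixel.

    boundary_img: 2-D array (e.g., binary edges); crossing edge pixels is
    expensive, so distances grow quickly on the other side of a boundary.
    region: optional (y0, y1, x0, x1) window (e.g., 2S x 2S around the seed).
    """
    if region is None:
        region = (0, boundary_img.shape[0], 0, boundary_img.shape[1])
    y0, y1, x0, x1 = region
    img = boundary_img[y0:y1, x0:x1].astype(float)
    h, w = img.shape
    sy, sx = seed[0] - y0, seed[1] - x0

    dist = np.full((h, w), np.inf)
    dist[sy, sx] = 0.0
    heap = [(0.0, sy, sx)]
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1), (-1, -1), (-1, 1), (1, -1), (1, 1)]
    while heap:
        d, y, x = heapq.heappop(heap)
        if d > dist[y, x]:
            continue
        for dy, dx in steps:
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                # Euclidean step length, inflated when stepping onto an edge pixel
                nd = d + np.hypot(dy, dx) * (1.0 + img[ny, nx])
                if nd < dist[ny, nx]:
                    dist[ny, nx] = nd
                    heapq.heappush(heap, (nd, ny, nx))
    return dist
```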
We initialize the energy of all pixels to E0 . A reasonable choice would be to set E0 = ∞, but
that would force all pixels to get a label in the first iteration, even when they are not especially
close to any seed. For that reason, we set E0 as a finite value that we linearly increase with the
number of deleted seeds. Thus, if the energy of a pixel is not lower than E0 it will have label
l = 0, and all pixels with such label will form a superpixel (as seen in Fig. 3.9f). We then iterate
until the maximum allowed number of iterations (itmax ) is reached. Finally, those superpixels
whose area is extremely small are removed by merging them with adjacent bigger superpixels.
All these steps are described in Algorithm 1.
Figure 3.11: Different energy values. Input image (a) and examples of per-pixel energy values
(color (b), Euclidean (c) and geodesic (d) distance to different seeds marked with a red circle),
where blue means lower distance and red means higher distance.
has very different images, but most are both smooth and simple. On the other hand, images
from CSSD and ECSSD present more natural situations.
We compare our approach, dubbed BASS, against three state-of-the-art algorithms: SEEDS [173],
SLIC [5], and Yao et al. [202]. All algorithms were evaluated with the code from the authors’
websites. For BASS, the maximum number of iterations was experimentally set to 10, which
produces fast segmentations without excessively affecting their quality.
A brief description of the metrics used to evaluate the segmentations is given below, followed
by a discussion of the results obtained.
" !#
rij rij
V OI(X; Y ) = − rij · log + log (3.6)
X
i,j
pi qj
where pi = |Xi |/n, qj = |Yj |/n and rij = |Xi |∩|Yj |/n. Lower values correspond to smaller distances
and hence to more similar segmentations.
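A direct NumPy sketch of Eq. (3.6), computed from two label maps, could look as follows (variable names are illustrative):

```python
import numpy as np

def variation_of_information(seg_x, seg_y):
    """Variation of Information between two segmentations (Eq. 3.6)."""
    x = np.ravel(seg_x)
    y = np.ravel(seg_y)
    n = float(x.size)
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    # joint distribution r_ij and marginals p_i, q_j
    r = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(r, (xi, yi), 1.0)
    r /= n
    p = r.sum(axis=1, keepdims=True)
    q = r.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = r * (np.log(r / p) + np.log(r / q))
    return -np.nansum(terms)          # entries with r_ij = 0 contribute nothing
```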
Undersegmentation Error (UE). It is computed as

UE = \frac{1}{N}\sum_{S \in GT} \; \sum_{P \,:\, P \cap S \neq \emptyset} \min\left(|P_{in}|, |P_{out}|\right) \qquad (3.7)

where GT is the set of ground truth segments, P are the superpixel segments, S the ground
truth segments, and |P_{in}| and |P_{out}| represent the area of P inside and outside S, respectively.
A low value is desirable.
Boundary Recall (BR). Represents the fraction of ground truth boundaries that are covered by the
over-segmentation boundaries. Considering a binary image of boundaries (1: boundary, 0: not
boundary) with a dilation of 3 pixels, we compute the Boundary Recall over all pixels as the
number of true positives (TP) divided by the sum of TP and the number of false negatives
(FN). Higher values indicate more accurate segmentations:

BR = \frac{TP}{TP + FN} \qquad (3.8)
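The following sketch computes this metric from two binary boundary maps; the text leaves implicit which set of boundaries is dilated for the tolerance, so dilating the over-segmentation boundaries here is an assumption.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def boundary_recall(gt_boundaries, sp_boundaries, tolerance=3):
    """Boundary Recall (Eq. 3.8) with a 3-pixel tolerance."""
    sp = binary_dilation(sp_boundaries.astype(bool), iterations=tolerance)
    gt = gt_boundaries.astype(bool)
    tp = np.count_nonzero(gt & sp)       # GT boundary pixels matched
    fn = np.count_nonzero(gt & ~sp)      # GT boundary pixels missed
    return tp / float(tp + fn)
```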
Figure 3.12: Evaluation metrics. Values obtained for different numbers of superpixels. The
results show that our approach outperforms state-of-the-art methods in the Variation of Information
metric. It obtains the second best result in Undersegmentation Error and performs on par with
state-of-the-art methods like SEEDS and SLIC in Boundary Recall. In (a) and (b), lower values
correspond to better segmentations. For (c), higher is better.
We use a single set of parameters for all the experiments to emphasize the generalization of the
method, even though specific parameter sets per dataset would give better individual results.
These results show how our algorithm consistently decreases the VOI for all numbers of superpixels
and, at the same time, maintains UE and BR values similar to state-of-the-art methods. Indeed,
we argue that a lower VOI is much more representative of our primary goal of retaining the image
information with a minimal number of superpixels. This is clearly illustrated in Figure 3.13.
3.5 Summary
This chapter has presented the first contribution of the thesis: an over-segmentation algorithm
to compute superpixels that are aware of the boundary information of the input image in order
to simplify the final result. The problem has been formulated as an iterative clustering problem
using color, Euclidean distance and geodesic distance over an edge image. Our method has been
evaluated against the state-of-the-art using seven different datasets.
Our algorithm outperforms state-of-the-art methods in the most significant metric according
to our goal while maintaining the quality of the segmentation.
Figure 3.13: UE vs. VOI. For two different images, two segmentations with similar U E are
presented. The segmentation with BASS has a lower V OI value in both cases, hence it is more
similar to the ground truth, containing more useful information.
The algorithm is implemented in C++ and runs on CPU in about 0.5 seconds per image.
The code is publicly available at https://github.com/arubior/bass-superpixels.
Figure 3.14: Qualitative results (I). Some results of our superpixel segmentation algorithm
compared to state-of-the-art methods.
Figure 3.15: Qualitative results (II). Some examples of our superpixel segmentation algorithm
compared to state-of-the-art methods.
Fashion Multi-modal Embedding
In this chapter, we leverage the textual metadata that normally accompanies the fashion images
in our datasets and propose a joint multi-modal embedding that maps both the text and images
into a common latent space. Distances in this space correspond to similarity between products,
allowing us to perform retrieval tasks that are both efficient and accurate.
4.1 Introduction
Finding a product in the fashion world can be a daunting task. Every day, e-commerce sites
are updated with thousands of images and their associated metadata (textual information),
deepening the problem, which is akin to finding a needle in a haystack. Not only the size of the
databases but also the level of traffic of modern e-commerce is growing fast. U.S. retail e-commerce,
for instance, was expected to grow 16.6% during the 2016 Christmas holidays (after a 15.3% increase
in 2014), with 92% of the holiday shoppers going online to search for or buy gifts [3]. In order to adapt
to this trend, modern retail sellers have to provide an easy-to-use experience to their customers,
where products are easy to find and well classified.
In a scenario where huge amounts of new products are presumably arriving on a daily basis
and must be searchable and classified, Machine Learning techniques stand out as a good choice
due to their good results classifying and clustering vast amounts of multidimensional data. More
and more sellers are including Machine Learning technologies in their online sites, especially for
advertising, recommendation and search. Nevertheless, most current searches are text-only and
do not account for products that are incompletely tagged or described. Images without a rich
text description are virtually unfindable by classical, text-based search. The ability to compute
distances between text and image in the same space makes it possible to retrieve images that are
similar to a text, and not only images whose associated texts are similar.
Fashion e-commerce products usually consist of pictures and associated metadata, generally
in the form of textual information such as brief descriptions, titles, series of tags, colors or sizes.
Existing approaches for retrieval focus only on images and require hard-to-obtain datasets for
training [58]. Instead, we opt to leverage easily obtained metadata for training our model,
learning a mapping from text and images (like the ones in Fig. 4.1) to a common latent space
in which distances correspond to similarity. We train the system using large-scale real-world e-
commerce data by both minimizing the distance between related products and using auxiliary
classification networks that encourage the embedding to have semantic meaning. Results are
compared against existing approaches and show significant improvements in retrieval tasks on
a large-scale e-commerce dataset. We also provide an analysis of the influence of the different
metadata.
Prior to a detailed analysis of the model itself, it is interesting to note why such an approach
is needed. In our case, observations come from different input channels with different options
for representations: images are normally represented as dense real-valued vectors that take into
account structural relationships between pixel intensities, whereas texts are usually processed
as descriptors based on sparse word count vectors. For this reason, without the intervention
of a specific system devoted to narrowing the gap between the two types of descriptors, it is
notably harder to find the highly non-linear relationships across modalities than within the same
modality.
More concretely, our approach consists of exploiting a Convolutional Neural Network (CNN)
for processing images, as well as word2vec-based embedding with a Neural Network for processing
the textual information. Both networks are trained such that the distance between the output
of related image-text pairs is minimized, while the distance between unrelated image-text pairs
is maximized. Additionally, two auxiliary classification networks are used in combination with
classification losses to retain semantic information in the common embedding.
We evaluate the retrieval task, where our proposed approach outperforms KCCA [16] and
Bag-of-Words features on a large e-commerce dataset, and also classification, where the embedded
descriptors obtain accuracies over 90% for both text and image. Additionally, an analysis of the
different textual metadata is provided.
Figure 4.1: Example of a text and nearest images from the test set. Example of a typical
text description found in many of the products in our dataset, and the nearest images from the
test set found using our embedding, which yields low distances between texts and images referring
to similar objects.
Summarizing: in this chapter, we learn a representation space (embedding) f (x) for features
coming from visual and textual data. The learned features are smooth (small changes in the
input data lead to small changes in the embedding space, i.e. x ≈ y → f (x) ≈ f (y)), come
from multiple explanatory features (since they are computed using distributed representations,
not sparse), and are valid for retrieval and classification tasks. Our embedding provides textual
representations with the continuity of the visual space, overcoming the problem of artificially
partitioning it into disjoint parts [190].
4.2.1 Retrieval
As detailed in Chapter 1, the interest of computer vision researchers in fashion has increased in the
past years. Many works focus on clothing parsing, or on higher-level tasks such as inferring a person's
occupation or social tribe, or on recommendation. Nevertheless, some of the most practical tasks
are still clothing retrieval and classification, which we tackle in this chapter.
The retrieval task consists of finding similar items in a dataset given a query. The usual pipeline
for image retrieval is formed by three steps: extracting local image descriptors (such as Fisher
Vectors [124, 125, 126]), reducing the dimensionality (with techniques like principal component
analysis – PCA [123] – or linear discriminant analysis – LDA) and indexing. For text retrieval,
classical approaches [32, 67] looked for repetitions of the query words in a document, while newer
latent semantic models [21, 110] use more powerful distributed text representations capable of
learning the context of words and the meaning of documents. Recently, a great deal of effort has
focused on word embeddings and their applications [36, 48, 56, 111].
According to [37], current retrieval techniques for large datasets with images and metadata
can be divided into text-based, content-based, composite and interactive approaches.
words, and some others including grammatically richer descriptions) and is very domain specific.
For this reason, we do not need a heavy preprocessing step and we focus on noise removal
and homogenization, switching everything to lowercase and removing punctuation symbols and
numbers that don’t help in our embedding task.
Of course, these purely text-based methods are reliable only when images have good annotations.
Otherwise, they are unable to find similar items. And even with good annotations, the
descriptions of images tend to be highly subjective and incapable of capturing all the relevant
details, so the description of one image can differ greatly from one person to another.
Neighbor (ANN) techniques. Without getting into details, many techniques that also use binary
hashing are based on the Hamming distance, like [210, 52, 166, 144, 92]. In [213], the results of
Bag of Words were improved by encoding more spatial information through what they call geometry-
preserving visual phrases. This approach captures local and long-range spatial distribution of the
words, not only co-occurrences. A popular approach to deal with large scale image databases is
to index the items with inverted files [157], that store mapping from content (e.g., visual words)
to documents.
Still, bridging the so-called "semantic gap" between low-level pixel features readable by
machines and high-level semantic information understood by humans remains one of the main
problems of CBIR.
Of course, when talking about CBIR in the last decade, one of the biggest breakthroughs is the
generalization of the use of Convolutional Neural Networks (CNNs), as detailed in Chapter 1.
Has deep learning really pushed research in the right direction towards reducing the semantic
gap? In [181], the authors conduct an exhaustive study with promising results for feature extraction
and retrieval, concluding that properly designed deep learning systems have the potential to outperform
conventional hand-crafted feature extraction methods, and that training with similarity or
classification losses can also improve the retrieval performance of classic systems.
There are also works trying to fuse the retrieval results coming from different methods,
like [209].
since both inputs are RGB images. Basura et al. [43] tackle the unsupervised domain adaptation
problem (adapting different source and target domains) making use of subspaces described by
eigenvectors induced by a PCA. This work performs what they call subspace alignment using
a transformation matrix to map one domain to the other. Their data is formed by images
from different domains. [54] tackles the same problem, but viewing the subspaces as points in
Grassmann manifolds, while [50] integrates an infinite number of subspaces describing changes
in properties into a geodesic flow kernel that models the shift between domains. In contrast to
all these previous works, our multi-domain framework, instead of images coming from different
domains, uses image and text data. Nevertheless, most ideas coming from these techniques are
still interesting independently of the origin of the data, since for all of them the final goal is
having some sort of common descriptor valid for items coming from both domains.
Most of the approaches for multi-domain classification train with one source domain and then
fine-tune their classifiers to work with the target domain [20, 142]. In our case, we simultaneously
train with data from both domains, producing a common space specifically learned for the
retrieval task that also offers good performance in the classification task.
In some cases, retrieval systems can learn from the users' feedback indicating the relevance of
the results obtained. A complete review of interactive retrieval techniques is out of the scope
of this chapter. Many relevance-feedback techniques are described in [24], which concludes that this
feedback is not always effective, since users may mark irrelevant search results, which has a
negative impact on the system's performance.
4.2.2 Classification
Since most of the e-commerce metadata consists of textual features, product classification is
normally addressed as text-only classification [57]. However, recent works have started trying to
boost the classification performance using multi-modal architectures. For instance, the authors
of [46] combine the image network from [85] with a skip-gram language model, without obtaining
a significant improvement (probably due to poor text labels). In [77], they improve text-based
classification with a multi-modal architecture, but in a small dataset. Some other works created
good embeddings without focusing on classification [99, 81, 53]. A joint space for image and text
is created in [198] using a deep learning version of the classical canonical correlation analysis
method (DCCA).
Text classification
Text classification is typically treated as a two-step process, where in the first place some features
are extracted (e.g. Bag-of-Words, n-grams, etc.) and then they are used for classification. These
domain-specific priors can be replaced by generic ones in the case of deep learning [18] without
degrading the quality of the results. More specifically, Convolutional and Recurrent Neural
Networks are used to capture the sequentiality of the text [207]. Both types of networks are
suitable to be applied to distributed embeddings [80, 87, 57] or characters [211, 29, 193].
Image classification
Regarding image classification, it is nowadays practically monopolized by Convolutional Neu-
ral Networks (CNNs). These architectures represent the state-of-the-art performance on the
ImageNet Large-Scale Visual Recognition Challenge [85, 156, 141, 63]. This has already been
addressed in Section 4.2.1 and Chapter 2, so we will not enter into more detail.
4.3 Method
An overview of our method can be seen in Fig. 4.4. The main goal of the work detailed in
this chapter is the creation of an embedding such that, given an image or a text as a query,
similar images and/or texts can be retrieved from a dataset. Our joint multi-modal embedding
approach consists of a neural network with two branches: one for image and one for text. The
image branch is based on a Convolutional Neural Network (CNN) which converts a 227 × 227
pixel image into a fixed-size 128-dimensional vector. The text branch is based on a multi-layer
neural network and uses as inputs features extracted by a pre-trained word2vec network which
are converted into a fixed-size 128-dimensional vector. Both branches, whose specifics will be
given later in the chapter, are trained jointly such that the 128-dimensional output space becomes
a joint embedding by minimizing the distance between related image-text pairs and maximizing
the distance between unrelated image-text pairs. Two auxiliary classification networks are also
used during training that encourage the joint embedding to also encode semantic concepts.
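The sketch below illustrates such a two-branch network in PyTorch. The sizes stated in the text (227 × 227 input, 100-dimensional word2vec features, 1024-dimensional hidden layers, 128-dimensional embedding, 32 categories) are kept, but the exact backbone and layer arrangement are assumptions, not the implementation used in the thesis.

```python
import torch
import torch.nn as nn
from torchvision import models

class ImageBranch(nn.Module):
    def __init__(self, n_classes=32):
        super().__init__()
        self.features = models.alexnet().features        # CNN feature extractor
        self.pool = nn.AdaptiveAvgPool2d((6, 6))
        self.fc = nn.Sequential(nn.Linear(256 * 6 * 6, 1024),
                                nn.BatchNorm1d(1024), nn.ReLU())
        self.embed = nn.Sequential(nn.Linear(1024, 128),
                                   nn.BatchNorm1d(128), nn.ReLU())
        self.classify = nn.Sequential(nn.Linear(1024, 128), nn.ReLU(),
                                      nn.Linear(128, n_classes))

    def forward(self, x):                                 # x: (B, 3, 227, 227)
        h = self.fc(torch.flatten(self.pool(self.features(x)), 1))
        return self.embed(h), self.classify(h)            # 128-d embedding, class scores

class TextBranch(nn.Module):
    def __init__(self, in_dim=100, n_classes=32):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.BatchNorm1d(1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.BatchNorm1d(1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.BatchNorm1d(1024), nn.ReLU())
        self.embed = nn.Sequential(nn.Linear(1024, 128),
                                   nn.BatchNorm1d(128), nn.ReLU())
        self.classify = nn.Sequential(nn.Linear(1024, 128), nn.ReLU(),
                                      nn.Linear(128, n_classes))

    def forward(self, x):                                 # x: (B, 100) averaged word2vec
        h = self.shared(x)
        return self.embed(h), self.classify(h)
```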
Figure 4.4: Architecture of the proposed neural network. When sizes of two dimensions
are equal, some of them are omitted for clarity. Fully connected layers are uni-dimensional. Text
descriptor and Image descriptor are the embedded vectors describing the input text and image,
respectively.
After the shared layers (up to FC8), the image network splits into two branches: one for classification
and one for the embedding. The classification branch has two fully connected layers (FC9 and FC10)
and outputs the SoftMax scores of the different classes. The embedding branch has a single layer
which outputs the 128-dimensional feature vector for the embedding (FC11). All fully connected
layers FC8-FC11 consist of the fully connected layer itself, followed by a Batch Normalization
(BatchNorm) layer and a Rectified Linear Unit (ReLU) layer.
See Fig. 4.4 for a visual representation of the network. Images are preprocessed before entering
the network: they are resized to a specific size (227 × 227) and normalized using precomputed
values for the mean and standard deviation.
Figure 4.5: Example of training samples with a context window size of 2 words. Figure
extracted from [107].
Figure 4.6: Scheme of the word2vec method. Input and output have the same size (the
vocabulary size, K), and each element represents a word. The hidden layer has the size of the
desired output descriptor, which will be extracted from the matrix transform between the hidden
and last layers, with size M × K.
The inputs to the text branch of our network are M-dimensional descriptors computed by averaging
the previously learned word2vec distributed representations of all the words in each text. Averaging
these descriptors has proven successful in [180].
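As a small illustration, the averaging step can be written as follows (text_descriptor and word_vectors are illustrative names; word_vectors stands for any word-to-vector mapping, and the zero-vector fallback is an assumption):

```python
import numpy as np

def text_descriptor(tokens, word_vectors, dim=100):
    """Average the word2vec vectors of the words in a product text."""
    vectors = [word_vectors[w] for w in tokens if w in word_vectors]
    if not vectors:                      # no known word: fall back to zeros
        return np.zeros(dim, dtype=np.float32)
    return np.mean(vectors, axis=0)
```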
Our text network consists of three shared fully connected layers that output 1024-dimensional
features (FC12-FC14). Afterwards, as in Section 4.3.1, the network splits into two branches:
the classification branch and the embedding branch. The classification branch consists once
again of two additional layers (F C15 and F C16) and the output is the score of the different
classes. The embedding branch outputs 128-dimensional vectors for the joint embedding. All
fully connected layers in the text network are formed by the fully connected layer itself followed
by a BatchNorm layer and a ReLU layer.
4.3.3 Training
For training we use the previously collected large dataset of corresponding text-image pairs with
class labels. Each sample consists of an image, its associated textual metadata and a category
label. The category labels are used for the classification losses and for randomly sampling
negative examples for training the embedding, so that the picked negative example belongs not
only to a different product, but to a different category.
Cross entropy losses are used for classification. This loss function (see Eq. (4.1)) is used in
classification tasks where the outputs are probability values between 0 and 1 for each class. In
Figure 4.7: Word2vec's arithmetic properties. This is a classical example of the arithmetic
properties of the space generated by word2vec. Operating with the descriptors of the words gives
results like king − man + woman = queen.
our case, these are 32-dimensional vectors with the probabilities of belonging to each of the clothing
categories listed in Section 4.4. Given a sample as input (an image or a text), its cross entropy loss
is computed as follows:

L_X(C, L) = -\sum_{i=1}^{N} L_i \log(C_i) \qquad (4.1)

where N is the number of categories (32 in our case, see Section 4.4 for details), L_i is the label
for category i and C_i is the output of our classification algorithm for category i.
The text and image networks are trained simultaneously and jointly by encouraging similar
text-image pairs to have a small distance between their embedded vectors, while dissimilar text-
image pairs have a large distance. Images and their associated text are used as positive pairs,
while unrelated image-text pairs are obtained by randomly sampling images and texts from
unrelated categories. This is done by using the contrastive loss described by Hadsell et al. [59]:

L_C(v_I, v_T, y) = \frac{1}{2}(1 - y)\,\|v_I - v_T\|_2^2 + \frac{1}{2}\,y\,\{\max(0, m - \|v_I - v_T\|_2)\}^2 \qquad (4.2)

where v_I and v_T are the two embedded vectors corresponding to the image and the text respectively,
y is a label that indicates whether the two vectors are compatible (y = 0) or dissimilar (y = 1),
and m is a margin for the negatives. After trying different values for m ({0.1, 1, 10, 50, 100}),
we select the one that performs best, m = 1. For a visual explanation of the contrastive loss, see
Fig. 4.8.
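A minimal PyTorch version of Eq. (4.2) is sketched below; the batch-mean reduction is an assumption.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(v_img, v_txt, y, margin=1.0):
    """Contrastive loss of Eq. (4.2); y = 0 for related pairs, y = 1 otherwise."""
    d = F.pairwise_distance(v_img, v_txt)                 # ||v_I - v_T||_2 per pair
    positive = (1 - y) * d.pow(2)
    negative = y * torch.clamp(margin - d, min=0).pow(2)
    return 0.5 * (positive + negative).mean()
```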
The full training loss consists of both the contrastive loss and the weighted sum of the cross
entropy classification losses:
L_C(v_I, v_T, y) + \alpha L_X(C_I(v_I), L_I) + \beta L_X(C_T(v_T), L_T) \qquad (4.3)
Figure 4.8: Contrastive loss behaviour with positive and negative pairs. The mini-
mization of this function reduces the distance between two vectors if they are labeled as similar
(positive), and increases it if they are considered dissimilar. The margin value defines a radius
outside which negative samples do not contribute to the loss value (such as y2 ).
where LX is the cross entropy loss, CI (vI ) is the output of the image classification network, LI
is the image label, CT (vT ) is the output of the text classification network, LT is the text label,
and α and β are two weighting hyperparameters.
We train the network for 100,000 iterations with batches of 64 samples (forming in each iteration
64 correlated image-text pairs and 64 uncorrelated pairs), with α = β = 1. Training is done
using stochastic gradient descent with backpropagation. We use an initial learning rate of 10^{-3}
and decrease it by 5 · 10^{-4} every 10,000 iterations, with momentum 0.95.
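The combination of the three losses in Eq. (4.3) and the optimizer setup can be sketched as follows; image_net and text_net stand for the two branches sketched above, contrastive_loss for the function sketched after Eq. (4.2), and the learning-rate schedule is omitted.

```python
import torch.nn.functional as F
from torch.optim import SGD

def training_step(image_net, text_net, optimizer, images, texts,
                  pair_labels, image_classes, text_classes, alpha=1.0, beta=1.0):
    emb_i, logits_i = image_net(images)
    emb_t, logits_t = text_net(texts)
    loss = (contrastive_loss(emb_i, emb_t, pair_labels, margin=1.0)
            + alpha * F.cross_entropy(logits_i, image_classes)
            + beta * F.cross_entropy(logits_t, text_classes))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = SGD(list(image_net.parameters()) + list(text_net.parameters()),
#                 lr=1e-3, momentum=0.95)
```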
4.4 Dataset
The dataset we use consists of 431,841 images of fashion products with associated texts. We
collected this data at WE (Wide Eyes) from dozens of different fashion websites through web scraping.
The images depict individual fashion items over a constant background (generally white), as well
as models wearing several garments at the same time. Regarding textual metadata, each website
uses its own tags, but we normalize all of them so that the textual information for each product is
separated into the following fields: title, description, gender, type, color and category. See Fig. 4.9
for examples of products in the dataset.
Figure 4.9: Examples of products. Images and their associated textual metadata. Note how
for some of the products, the textual metadata is not extremely descriptive, limited to a bunch
of words selected for search engine optimization.
Table 4.1: Retrieval results. Median rank and recall at x (f@x%) of our method (using
word2vec) compared against KCCA and against our method using Bag of Words for text repre-
sentation.
We use 60% of the dataset for training, 30% for test and 10% for validation, and train the
model using different combinations of textual information associated to the images to check the
influence of the different types of text, as will be explained in Section 4.5.
4.5 Results
Next, we describe the results obtained by applying our method to a fashion e-commerce dataset,
in which we train a common embedding where distances between text and images referring to
products of the same category are considerably smaller than distances between those of different
categories. We compare against existing approaches, analyze the different text features, and
look at classification results with the auxiliary networks. We compare the results using a simple
Bag-of-Words approach and the more complex distributed representation given by word2vec.
We check how the amount and type of text influences the clustering in the embedding and the
classification accuracy. Finally, we use kernel canonical correlation analysis (KCCA) [16] to
obtain another common embedding for comparison.
We use the following metrics to check the results:
• Median rank: median position of the first correct result in the ranked list of results (in
percentage of the total number of samples evaluated).
Table 4.2: Results using the information in different text fields. We see how Title
and Category are extremely discriminative and saturate the text classification accuracy when they
appear. We also compare against a model trained without the classification losses, observing how the
difference between positive and negative distances increases at the expense of losing more than
10% classification accuracy. The Diff. column corresponds to the difference between the mean distance
of positive pairs and negative pairs (higher is better).
• Recall at x: percentage of test queries for which the recall at x% is positive. Recall at
x% (f@x%) is positive for a query if the correct result can be found in the first x% of the
results ranked by distance to the query.
• Classification accuracy: percentage of items in the test set that are correctly classified
by the network.
4.5.1 Retrieval
In order to evaluate our method, we compute the 128-dimensional descriptors of all images and
texts in the testing set. Then, we use the texts as queries to obtain the most related images,
and vice versa. Looking at the position at which the exact match is obtained, we compute the
median rank for each case. The resulting values are below 2%, meaning that the exact match is
usually closer than 98% of the dataset, beating the result obtained by KCCA (which was trained
with only 10,000 descriptors due to memory errors when using the whole training set) and by our
same architecture substituting word2vec with a classical Bag of Words.
These results, the recall@K (which shows that around 80% of the time the exact match is
among the top 5% of nearest items) and the classification accuracy can be seen in Table 4.1.
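The median rank and recall@x% reported here can be computed from the embedded descriptors roughly as follows (row i of both matrices is assumed to describe the same product):

```python
import numpy as np

def retrieval_metrics(text_desc, image_desc, recall_pct=5.0):
    """Median rank (as % of the gallery) and recall@x% for text-to-image queries."""
    n = text_desc.shape[0]
    # pairwise Euclidean distances: queries (texts) x gallery (images)
    dists = np.linalg.norm(text_desc[:, None, :] - image_desc[None, :, :], axis=-1)
    order = np.argsort(dists, axis=1)
    # 1-based rank of the exact match for every query
    ranks = np.argmax(order == np.arange(n)[:, None], axis=1) + 1
    median_rank = 100.0 * np.median(ranks) / n
    recall_at_x = 100.0 * np.mean(ranks <= n * recall_pct / 100.0)
    return median_rank, recall_at_x
```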
We also tested the performance of our model with respect to the different data fields available
in the dataset, concluding that, even if the Description field by itself gives good results, using
highly discriminating fields such as Title or Category slightly improves the metrics (see Table 4.2).
4.5.2 Classification
In parallel to the ranking task, we also train a classification task. This task, intended to help
cluster the products of the same category in the common embedding, maintains high accuracy
values (above 95% in some cases, as seen in Table 4.2) for the 32 clothing categories defined in
the dataset. In Table 4.3, we show the global classification accuracy results for two versions of
our method, as well as the difference between the mean distances of negative and positive pairs
in our embedding. A higher value for this last quantity implies that the descriptors in our
embedding are closer for items of the same category and further apart for items of different
categories.

Table 4.3: Classification results. Classification accuracy of our approach using word2vec and
a simplified version using Bag of Words for text representation. The Diff. column corresponds to
the difference between the mean distance of positive pairs and negative pairs (higher is better).
Best result shown in bold.

Model        | Classif. Acc. (Text) | Classif. Acc. (Image) | Diff.
Bag of Words | 99.78%               | 71.73%                | 0.327576
word2vec     | 99.97%               | 90.06%                | 0.44
4.6 Summary
Creating a common embedding where texts and images can be easily compared represents an
opportunity for e-commerce sites to classify and tag images without associated text, and to fill in
missing information by looking for similar texts. We presented results on a challenging real-world
dataset, obtaining low median rank values while at the same time showing very good classification
accuracy for both image and text classification.
In this chapter, we have presented an approach for joint multi-modal embedding with neural
networks, with a focus on the fashion domain. Our approach scales easily to large existing
e-commerce datasets by exploiting readily available images and their associated metadata. By
training the embedding such that distances correspond to similarities, our approach can be
easily used for retrieval tasks. Furthermore, our auxiliary classification networks encourage the
embedding to have semantic meaning, making it suitable as features for classification tasks.
Main Product Detection
In this last chapter, we present an approach to detect the main product in fashion images by
exploiting the textual metadata associated with each image. Our approach is based on a joint
embedding of object proposals and textual metadata with an architecture based on the one in
Chapter 4 to predict the main product in the image. We additionally use several complementary
classification and overlap losses in order to improve training stability and performance. Tests on a
large-scale dataset taken from eight e-commerce sites show that our approach outperforms strong
baselines and is able to accurately detect the main product in a wide diversity of challenging
fashion images.
5.1 Introduction
Most of current commercial transactions occur online. Every modern shop with growing expec-
tations presents to their potential customers the option of buying some or part of the products in
their online catalogs. For instance, 92% of the U.S. Christmas shoppers went online on holidays
in 2016, a 16.6% increase over the same period in 2015 [3].
The way the products are presented to the customer is a key factor to increase online sales.
In the case of fashion e-commerce, a specific item being sold is normally depicted worn by a
model and tastefully combined with other garments to make it look more attractive. Existing
approaches for recommendation or retrieval focus on images only, and normally require hard-to-
obtain datasets for training [58], omitting the metadata associated with the e-commerce products
such as titles, colors, semantic tags or descriptions that can be used to improve the information
obtained from the images.
In this work, we propose to leverage this metadata information to select the most relevant
region in an image, or more specifically, to detect the main product in a fashion image that might
contain several garments. This allows us to subsequently train specific product classifiers, which
do not need to be fed with the whole image. Additionally, this process can also be used as a first
step in tasks like visual question answering or, together with customer behavior data, to extract
useful information relating the types of images used in an e-commerce site to its sales.
Our approach consists of a first step that extracts descriptors of object proposals, which are then
used to train a joint textual and image embedding. The distances between descriptors in this
common latent space are then used to retrieve the main product of each specific image as the
closest object proposal to the textual information, as seen in Fig. 5.1.
We train our method with images of individual garments and evaluate its performance on
images coming from a different e-commerce site, which shows models wearing the clothes. We
show that it is able to detect a region with a demanding 70% overlap with the ground truth in more
than 80% of the cases among the top-3 bounding box proposals.
Figure 5.1: Overview of our proposed method. From a fashion e-commerce image and its
associated textual metadata, we extract several bounding box proposals and select the one that
represents the main product being described in the text.
image region, while we find the image region closest to a rich textual description. Furthermore,
we base our approach on a much faster algorithm to generate the proposals, extracted from [218].
Other works try to acquire a deeper understanding of the available textual information. The
embedding in [72] is created to explicitly enforce class-analogy preservation. In [183], the task of
image-sentence retrieval is carried out using entire images, with application to phrase localization
on the Flickr30K Entities dataset. See Fig. 5.3 for an example of these results.
The basic idea of two network branches (one for images, one for texts) connected with a
margin loss is similar to ours, but our approach incorporates the classification information into the
gradients of the network, while [183] only enforces the ranking task with combined hinge loss
functions. [90] proposes a two-step process where a network is trained in the first place with
multi-labeled images and then used to mine top candidate image regions for the labels. The work
of [184] is devoted to the structured matching problem, studying semantic relationships between
phrases and relating them to regions of images.
Some other works focus on Visual Question Answering [215], taking the notion of image-region
importance according to text one step further and using it to generate a proper answer to a question, as
shown in Fig. 5.4. While those works focus on many-to-many correspondences, i.e. relating parts
of sentences to regions of images, our work tries to associate all the available textual metadata
to only one region of the image, simulating the problem we are dealing with: receiving
images and text from fashion e-commerce sites and detecting the product being sold among all the
products in the images.
Figure 5.2: Multiple-Instance visual-semantic Embedding. The aim of this work is tagging
different parts of the image with different labels. Figure extracted from [137].
5.3 Method
Our goal is to detect the main product corresponding to the product being sold in a fashion
image. We consider the case where the image contains several other garments and has additional
metadata associated with it. We solve this problem by creating a common embedding for images
and texts, and then finding the bounding box whose embedded representation is closest to the
representation of the text. In order to do so, we explore different architectures and combinations
of artificial neural networks. Next, we describe the different approaches that we incrementally
propose, stating the pros and cons of each one of them.
Contrastive loss: as in Chapter 4, related and unrelated image-text pairs are compared with the
contrastive loss described by Hadsell et al. [59] (cf. Eq. (4.2)):

L_C(v_I, v_T, y) = \frac{1}{2}(1 - y)\,\|v_I - v_T\|_2^2 + \frac{1}{2}\,y\,\{\max(0, m - \|v_I - v_T\|_2)\}^2 \qquad (5.1)
Figure 5.3: Parts of an image related with parts of a text. Figure extracted from [183].
where y is the label indicating whether the two vectors vI and vT , corresponding to image and
text descriptors respectively, are similar (y = 0) or dissimilar (y = 1). The value m is the margin
value for negative samples. Therefore, both positive and negative image-text pairs must be used
in order for the network to learn a good embedding.
Classification loss: the classification loss (for both the text and image branches) is a cross entropy
loss (see Eq. (4.1)) comparing the predicted vector (composed of 19 category probabilities)
with the ground-truth category label (a binary vector of the same size with only one activation,
corresponding to one of the 19 categories).
The full training loss combines the contrastive loss with the weighted classification losses, as in
Eq. (4.3): L_C(v_I, v_T, y) + \alpha L_X(C_I(v_I), L_I) + \beta L_X(C_T(v_T), L_T),
where C_I(v_I) is the output of the image classification branch, L_I is the image label, C_T(v_T) is
the output of the text classification branch, L_T is the text label, and \alpha and \beta are two weighting
hyperparameters.
Text network: the textual metadata is used in the same way throughout all the architectures
in the chapter. We first concatenate all the available string fields (depending on the source of
the data, these can be title, description, category, subcategory, gender, etc.), then we remove
numbers and punctuation signs, and compute 100-dimensional word2vec descriptors [110] for
each word appearing more than 5 times in the training dataset. We compute these descriptors
using bi-grams and a context window of 3 words, as in Chapter 4 (please, refer to Fig. 4.5).
Figure 5.4: Relevance of image regions for answering different questions. Importance
of each pixel of the image when answering specific questions. Note how in the upper case, pixels
belonging to the cat have higher scores (red values), while in the lower case the most important
pixels belong to the shelf. Figure extracted from [215].
Finally, we average the descriptors in order to obtain a single vector representing the metadata of
the product. Averaging these distributed representations gave good results as a text descriptor
in [180]. The training corpus for the word2vec distributed representation consists of over 400,000
fashion-only textual metadata entries.
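As an illustration, a word2vec model with bi-grams, a context window of 3 and a minimum count of 5 could be trained with gensim as sketched below; the use of gensim and the remaining parameters are assumptions, not necessarily the tooling used in the thesis.

```python
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases

def train_word2vec(corpus):
    """corpus: list of token lists (concatenated metadata fields, lowercased,
    with numbers and punctuation removed)."""
    bigrams = Phrases(corpus, min_count=5)            # detect frequent bi-grams
    corpus_bi = [bigrams[tokens] for tokens in corpus]
    model = Word2Vec(corpus_bi, vector_size=100, window=3, min_count=5, workers=4)
    return model.wv                                   # word -> 100-d vector mapping
```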
These descriptors are then fed into a 3-layer neural network formed by fully connected (FC)
layers with Batch Normalization (BatchNorm) [73] and Rectified Linear Units (ReLU) that finally
produce a 1024-dimensional vector, later split into two branches as shown in branch D of Fig. 5.5:
• a FC + BatchNorm + ReLU block that reduces the dimension of the vectors to 128 followed
by a final FC layer and a SoftMax layer that reduce it to 19 elements corresponding to
category probabilities for classification.
• a FC + BatchNorm + ReLU block followed by a FC layer, both with 128-dimensional
outputs. The output of the last layer is the descriptor of the text in the common embedding.
Figure 5.5: The three different network architectures used. Gray layers remain constant
for all the architectures (i.e., the text branch (D) and a few layers before each loss function).
Blue parts correspond to architectures using images as input (both full image (A) and cropped
bounding boxes (B) flow through the same layers), and green parts correspond to the architecture
using bounding box descriptors as input. These descriptors are the output of the frozen first
layers of the AlexNet architecture, so in case (C) the image branch of the network is only trained
from the first green layer.
Figure 5.6: RoI pooling explained. Given an image as input, a feature map is computed using
a convolutional neural network, and N regions of interest are proposed. Using both, the
RoI pooling layer scales a region of the feature map corresponding to each one of the proposals
to a fixed pre-defined size, obtaining a list of N feature maps.
Bounding boxes with overlap under 30% with GT are not used as negative pairs for the
network because they would not be discriminative enough (the problem of differentiating between
positive and negative would focus on extreme cases, which are much more numerous and easy to
detect than the negative pairs we use).
For this approach, the network is the same as before (see branch B of Fig. 5.5), but the input
pairs are the resized proposal bounding boxes with their corresponding positive or negative texts.
The quality of the results is considerably increased (as seen in Table 5.1), but since the number
of pairs that we can construct per product is now much higher, it takes more time to reach a
good minimum when training.
Figure 5.7: Effects of bounding box resizing. Effects of resizing a bounding box of an item
with (b) or without (a) adding context information to avoid deformation. We see how without
the context, the resizing step extremely deforms the image. The feature map of this deformed
image will not be comparable to any of the feature maps extracted from regions of the image in
the testing phase. In our method, we resize regions of the feature map, not the image itself, but
this figure serves as an illustration of the problem.
The benefit of using RoI pooling is huge in terms of processing speed. Instead of computing
the (very expensive) convolutions of the CNN for each one of the proposals, we perform just one
pass of the original image through that part of the network, and then crop the desired parts of
the feature map for the proposals. The speed-up in both training and testing is considerable
(10× and 20×, respectively).
The inputs to the image part of our network are the 6×6×256 Region of Interest (RoI) pooling
regions of the corresponding proposal bounding boxes, extracted from the last convolutional layer
of AlexNet as in Fast-RCNN.
Now the training of the visual part of the network consists only of a first convolutional layer
(coupled with a ReLU) that reduces the third dimension of the data from 256 to 128 elements,
followed by two FC + BatchNorm + ReLU blocks that progressively transform these descriptors
into 512-dimensional vectors, that are then split into the already mentioned 128-dimensional
descriptors for classification and for the embedding. Layers previous to this first convolutional
layer are frozen and only used to extract the RoI pooling features (see branch C of Fig. 5.5).
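A minimal sketch of this descriptor extraction with torchvision is shown below; the spatial scale is approximated from the feature-map size, and the dummy image and boxes are illustrative only.

```python
import torch
from torchvision import models
from torchvision.ops import roi_pool

alexnet = models.alexnet()                      # in practice, pre-trained and frozen
conv = alexnet.features.eval()
for p in conv.parameters():
    p.requires_grad = False

image = torch.rand(1, 3, 500, 375)              # dummy e-commerce image
# proposals as (batch_index, x1, y1, x2, y2)
boxes = torch.tensor([[0., 40., 60., 300., 360.],
                      [0., 10., 20., 180., 250.]])

with torch.no_grad():
    fmap = conv(image)                          # (1, 256, H', W')
    rois = roi_pool(fmap, boxes, output_size=(6, 6),
                    spatial_scale=fmap.shape[-1] / image.shape[-1])
print(rois.shape)                               # torch.Size([2, 256, 6, 6])
```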
(a) Category: Shirts & (b) Category: T-Shirts. (c) Category: Coats & Jack-
Blouses. Description: Men Description: Women >T- ets Description: Women >Coats
Color: Casual shirts. Le Shirts. Twik - Boyfriend Canada Goose - Trillium parka. De-
31 - All-over check shirt. tee Twik. Exclusively from signed to withstand extreme condi-
Regular fit Le 31. Exclu- Twik. An ultra practical tions, the Trillium parka keeps you
sively from Le 31 for men. must-have neutral basic. Ul- snug and warm, even in the depths
Trendy all-over checks re- tra comfortable 100% cot- of winter. It features a sleek fit,
designed in a palette per- ton weave. Sewn rolled slightly cinched waist, and slimmer
fect for the upcoming season. sleeves. The model is wear- lines throughout. White duck down
Regular fit Button-down col- ing size small. Title: Twik fill. Dual-adjusting removable hood
lar. Contrasting underside. - Boyfriend tee (Women, with removable coyote fur ruff In-
Ultra comfortable 100% cot- Green, X-SMALL). Color: terior shoulder straps to carry the
ton poplin. The model is Mossy Green. parka like a backpack Heavy-duty
wearing size medium. Title: locking zipper with snap-button
Le 31 - All-over check shirt storm flap. Upper fleece-lined pock-
Regular fit (Men, Red, XX- ets, lower flap pockets with snaps.
LARGE). Color: Khaki. Thermal experience index: 4. Made
in Canada. The model is wearing
size small. Title: Canada Goose -
Trillium parka (Women, Pink, XX-
SMALL). Color: Khaki.
Figure 5.8: Some results of our method. Ground truth is shown in green, and the proposal
closest to the text in blue. On top of each figure there is its category and the overlap percentage
between the result and the GT. Caption of each figure is its textual metadata.
where L_O is the L1 regression loss for the overlap, i.e. |ôv − ov|, with ôv the predicted overlap of the
bounding box with the corresponding ground-truth bounding box and ov the actual overlap with the
ground truth, computed as their intersection over union (ov(A, B) = (A ∩ B)/(A ∪ B)). This case is
omitted from Fig. 5.5 for clarity.
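For reference, the intersection over union used throughout this chapter can be computed as in the short sketch below:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```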
All the design choices for the network layers were taken after meticulous ablation studies.
(a) Category: T-Shirts. Title: Twik - Solid high-neck tank (Women, Red, X-SMALL).
(b) Category: Coats & Jackets. Title: Vero Moda - Long baseball jacket (Women, Black, X-SMALL).
(c) Category: Coats & Jackets. Title: Pierre Balmain - Military coat (Men, Black, 42).
Figure 5.9: More results of our method. Ground truth is shown in green, and the proposal
closest to the text in blue. On top of each figure there is its category and the overlap percentage
between the result and the GT. Caption of each figure is its textual metadata.
Standard data augmentation was applied to the images (random horizontal flips, small rotations, etc.).
For the bounding boxes, we added random noise to their size and position of up to 5% of the bounding
box dimensions. Also, instead of directly resizing every bounding box to the size required by the
network (227×227×3), they were padded to be as square as the original image dimensions allow prior
to the resizing step, thus taking into account image context and avoiding heavy deformations (see Fig. 5.7).
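The square padding step can be sketched as follows; clamping to the image borders (which may leave the crop slightly non-square) is how we read the text, but the exact behaviour at the borders is an assumption.

```python
def pad_box_to_square(box, img_w, img_h):
    """Expand a (x1, y1, x2, y2) box towards a square before resizing,
    so that the crop is not heavily deformed by the 227 x 227 resize."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    if w < h:                       # widen the box
        pad = (h - w) / 2.0
        x1, x2 = x1 - pad, x2 + pad
    else:                           # heighten the box
        pad = (w - h) / 2.0
        y1, y2 = y1 - pad, y2 + pad
    # stay inside the image, as the original image dimensions allow
    return (max(0.0, x1), max(0.0, y1), min(float(img_w), x2), min(float(img_h), y2))
```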
5.4 Dataset
We use two different datasets in this chapter. The first one consists of 458,700 products from
eight different e-commerce sites. The second one, used for testing, is formed by 3,000 products
from a different e-commerce site. Each product from the datasets is formed by an image with the
annotated GT bounding box and its associated metadata. Some examples of the images and
Table 5.1: Results of the architectures detailed in Section 5.3. Includes precision@top-K
and classification accuracies. Best result shown in bold.
their associated textual information can be seen in Figs. 5.8 and 5.9.
5.5 Results
Our method was evaluated using a dataset with images coming from a different source than the
training dataset in order to test its ability to generalize.
All the results shown in this section are obtained according to the following evaluation pro-
cedure:
1. Extract the bounding box proposals for the image and compute the embedded descriptors of
every proposal and of the associated text.
2. Compute the distance between the image and text descriptors, and select the bounding
box with the smallest distance to the text.
3. Check the overlap between this bounding box and the ground truth bounding box of the
correct product. If the overlap is greater than 70%, the result is considered as a positive
main product prediction for this image. Otherwise, as negative. Overlap between bounding
boxes A and B is computed as (A ∩ B) / (A ∪ B), like in Section 5.3.5.
The numerical results we provide in this section are the percentage of test images with positive
predictions (overlap with ground truth greater than 70%) from the test set. Evaluations were
carried out for different positive overlap percentages, but we consider 70% as a good value. The
tendencies for the rest of overlap percentages evaluated were similar.
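Putting the procedure together, a single test image is scored roughly as follows (iou refers to the helper sketched in Section 5.3; array shapes and names are assumptions):

```python
import numpy as np

def main_product_hit(text_desc, proposal_descs, proposal_boxes, gt_box, thr=0.7):
    """True if the proposal closest to the text overlaps the GT box by more than thr."""
    dists = np.linalg.norm(proposal_descs - text_desc[None, :], axis=1)
    best = int(np.argmin(dists))                 # proposal closest to the text
    return iou(proposal_boxes[best], gt_box) > thr
```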
The first dataset described in Section 5.4 is homogeneously split into training and validation
pairs of images and metadata (70% for training, 30% for validation). For each network architecture,
we use for testing the weights of the iteration with the best performance on the validation
subset. The test dataset (from which we present results), as stated in Section 5.4, comes from a
different image source to prove the generalization ability of the method. This dataset also comes
from an actual e-commerce website.
Figure 5.10: Two-dimensional t-SNE visualization. Computed using the projections of the
training set images in our embedding.
performance of the method. Also, in general, the approach using RoI pooling descriptors yields
better results than the approach that feeds cropped bounding boxes through the whole-image
architecture. For the architecture predicting the overlap percentage between each proposal and
the GT bounding box, the average error in the overlap prediction is 5.81%. Even though our main
purpose is not classification, we use it to help our embedded descriptors better separate the clothes
of different categories. The classification accuracies of the different architectures are shown in
Table 5.1.
Note how these pictures differ from the training set: they normally show the products worn by
models, who also wear other clothes that might partially or completely appear in the images.
Sometimes the background is textured (Fig. 5.8c).
5.6 Summary
We have presented a method that uses textual metadata to detect the product of interest in
fashion e-commerce images. Text is parameterized using a distributed representation and for our
best approach we use compact representations of bounding boxes extracted from frozen layers
of a pre-trained network. We compare several network architectures combining different loss
types (contrastive, cross entropy and L1 regression). In our test dataset, with images and texts
coming from a different e-commerce than those used for training, our method is able to rank
the main product bounding box in the top-3 most probable candidate bounding boxes among
300 candidates in 80% of the cases. At the same time, the network learns to classify these
products into the corresponding clothing category with high accuracy.
It is worth mentioning that this main product detection task continues to be very relevant
for Wide Eyes. It was integrated into the company framework to be used as an intermediate tool
when crawling data, and an evolution of this work using Graph Convolutional Networks [33] was
recently published by members of the company [172].
Contributions to Wide Eyes
This thesis is framed in the Industrial Doctorate plan, and therefore other tasks were con-
ducted apart from the research described in the previous chapters, as mentioned in Section 1.3.
Some of the most relevant jobs carried out during these years are listed below.
Some of the specific tasks carried out during this period related to turning raw online data
into useful datasets are:
• Implementation of several web crawling systems to automatically collect data from online
e-commerce sites.
• Defining the best standard way of storing the information.
• Mapping categories from each different e-commerce to Wide Eyes categories (each e-
commerce has its own way of naming different categories and its own category hierarchy).
• Usage and creation of MySQL databases.
• Completing missing data inferring categories with our models.
• Merging different datasets to unify all of our data.
• Getting rid of unwanted images (noise) using small convolutional neural networks.
• Cleaning the texts (normalizing, getting rid of unwanted words, sometimes translating to
English from a different language).
• Balancing the different categories in the dataset (some categories had many more samples
than others).
In order to carry out some of these tasks, we got ideas from [178] or [203]. At some point we
also took advantage of pre-existing public datasets, like Pascal [41], Flickr8K [66], Flickr30K [205],
the Amazon dataset from [106] or DeepFashion [97].
6.5 Prototyping
Apart from all the mentioned tasks, that ended up being extremely useful and important for the
computer vision team, many other techniques and algorithms were explored to check ideas or
create basic prototypes, sometimes without achieving a final outcome. Some of them are worth
mentioning:
• K-nearest neighbors algorithm [11] was used for several experimental tasks.
• Using boosting techniques such as AnyBoost [104] or MILBoost [15] to combine weak
classifiers and try to find visual attributes in fashion images.
• Incorporating our BASS superpixels to a CRF architecture like the one in [149].
• Studying Gaussian Mixture Models and Bag of Words features for texts.
• Training lda2vec [112] for automatic attribute discovery, similar to what they do in [19].
• Studying doc2vec [88] and Fisher Vectors [124] for interpreting natural language and replacing
word2vec in Chapters 4 and 5.
• For Chapter 4, some Canonical Correlation Analysis techniques were tested (such as
DCCA [13] and KCCA [16]).
6.6 Demos
Apart from several video demonstrations showing the performance of the algorithms, a website
demo of the multi-modal retrieval system was created, where the user could select an image from
our dataset or introduce a text, and the most similar results were shown.
6.7 Recommendation System
Figure 6.1: Bidirectional LSTMs for fashion. Overview of Learning Fashion Compatibility
with Bidirectional LSTMs. Figure extracted from [61].
instead of thousands of very similar black shirts, so the binary matrix would have been much less
sparse). After that, collaborative filtering would have been computed using the smaller matrix.
We implemented and tested JULE [200] for clustering. Results were not very promising with our
data, so we conducted new research on the topic, carefully studying the work of [10], [60] and [61].
Other ideas that came up during this period involved creating a rating-based system using
DCCA [13].
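A minimal sketch of the intended pipeline is shown below: products are first grouped into clusters (standing in for the output of a JULE-like clustering) and item-based collaborative filtering is then computed on the resulting user-by-cluster binary matrix. The interaction matrix and cluster assignments are toy placeholders rather than Wide Eyes data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_products, n_clusters = 200, 1000, 50

interactions = (rng.random((n_users, n_products)) < 0.01).astype(float)  # sparse binary matrix
cluster_of = rng.integers(0, n_clusters, size=n_products)                # placeholder clustering

# Collapse columns belonging to the same cluster: users x clusters binary matrix.
user_cluster = np.zeros((n_users, n_clusters))
for p in range(n_products):
    user_cluster[:, cluster_of[p]] += interactions[:, p]
user_cluster = (user_cluster > 0).astype(float)

# Item-based collaborative filtering: cosine similarity between cluster columns.
norms = np.linalg.norm(user_cluster, axis=0) + 1e-8
sim = (user_cluster.T @ user_cluster) / np.outer(norms, norms)

# Score unseen clusters for one user as a similarity-weighted sum of their interactions.
user = user_cluster[0]
scores = sim @ user
scores[user > 0] = -np.inf  # do not recommend clusters already interacted with
print("Top recommended clusters:", np.argsort(scores)[::-1][:5])
```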
In the end, we decided to follow the same embedding approach developed in the previous chapters
for the recommendation task. Our plan was based on the method in [61]. In that paper, the
authors develop a bidirectional LSTM (Long Short-Term Memory) [55] system able to predict
compatibility between different garments and to complete or generate outfit recommendations.
In order to do that, they claim to generate individual garment descriptors that take into
account the rest of the garments in the outfit, using images of the different garments (see Fig. 6.1).
Our idea was to create an embedding where one could map this type of descriptor (which includes
information about the whole outfit) to descriptors of entire images of a person wearing the
full outfit. Doing this, the system would have been able to train a recommendation system
like the one in [61] and then associate the recommendation information with the entire image,
taking advantage of the trained bi-LSTM system for the recommendation and of the embedding
for mapping the results from product sequences to single look images.
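To make the discussed architecture more concrete, the snippet below sketches a bidirectional LSTM running over a sequence of per-garment descriptors, in the spirit of [61]; the dimensions and the class name are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class OutfitBiLSTM(nn.Module):
    """Toy bidirectional LSTM over a sequence of per-garment descriptors."""

    def __init__(self, descriptor_dim=512, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(descriptor_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, garments):
        # garments: (batch, num_garments, descriptor_dim)
        # outputs: (batch, num_garments, 2 * hidden_dim), one context-aware
        # descriptor per garment, combining the forward and backward passes.
        outputs, _ = self.lstm(garments)
        return outputs

outfits = torch.randn(4, 6, 512)   # 4 outfits, 6 garments each, 512-D descriptors
contextual = OutfitBiLSTM()(outfits)
print(contextual.shape)            # torch.Size([4, 6, 512])
```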
After implementing the paper and evaluating many different results while trying to create the
clustering embedding, we came to the conclusion that the descriptors used for recommendation in
the paper were not as dependent on the rest of the garments in the outfit as we initially thought,
and were therefore very difficult to associate with the entire look image.
Conclusions
I am glad that you are here with me. Here at the end of all
things.
J.R.R. Tolkien
Throughout this thesis, we pushed in the direction of reducing the semantic gap mentioned in
Section 1.1, focusing on the fashion domain. This gap remains open and, fortunately for
researchers in the field, will remain open for years to come. Everything we presented in this
dissertation is just a small sample of the possibilities that a trending field like fashion offers to
computer vision and machine learning researchers.
In order to do so, and always aligning our efforts with the commercial interests of Wide Eyes,
our work ranges from a new low-level algorithm for superpixel segmentation to a more abstract
interpretation of images tied to their textual metadata, first using entire images and then
moving to a region-specific interpretation. Along the way, we moved from the classical computer
vision approach used in Chapter 3 to the nowadays ubiquitous convolutional neural networks in
the subsequent chapters.
Apart from exploring other fashion-related problems like the ones mentioned in Section 2.3,
many possibilities arise from the work we presented. An overview of some of them is provided
below, along with the main conclusions of each chapter.
data augmentation techniques to regularize the results. Another advantage would be that we
might get rid of all the parameters that have to be set for BASS.
the ground truth bounding box was among the top-3 ranked boxes according to their distance
to the text.
Bibliography
[1] Ecommerce 101 + the history of online shopping: What the past says about tomor-
row’s retail challenges. https://www.bigcommerce.com/blog/ecommerce/#ecommerce-
timeline. Accessed: 2019-10-25.
[3] The ultimate list of e-commerce stats for holiday 2016. http://blog.marketingadept.
com/the-ultimate-list-of-e-commerce-marketing-stats-for-holiday-2016/. Ac-
cessed: 2017-01-23.
[5] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk. SLIC superpixels
compared to state-of-the-art superpixel methods. Pattern Analysis and Machine Intelli-
gence, 34(11):2274–2282, 2012.
[6] A. Agudo and F. Moreno-Noguer. Learning shape, motion and elastic models in force
space. In IEEE International Conference on Computer Vision, 2015.
[7] A. Agudo and F. Moreno-Noguer. Simultaneous pose and non-rigid shape with particle
dynamics. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[8] A. Agudo and F. Moreno-Noguer. DUST: Dual union of spatio-temporal subspaces for
monocular multiple object 3D reconstruction. In IEEE Conference on Computer Vision
and Pattern Recognition, 2017.
[10] Z. Al-Halah, R. Stiefelhagen, and K. Grauman. Fashion forward: Forecasting visual style
in fashion. In IEEE International Conference on Computer Vision, pages 388–397, 2017.
[12] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neigh-
bor in high dimensions. In 47th annual IEEE Symposium on Foundations of Computer
Science (FOCS’06), pages 459–468, 2006.
[13] G. Andrew, R. Arora, J. Bilmes, and K. Livescu. Deep canonical correlation analysis. In
International Conference on Machine Learning, pages 1247–1255, 2013.
[14] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image
segmentation. Pattern Analysis and Machine Intelligence, 33(5):898–916, 2011.
[15] B. Babenko, P. Dollár, Z. Tu, and S. Belongie. Simultaneous learning and alignment: Multi-
instance and multi-pose learning. In Workshop on Faces in ’Real-Life’ Images: Detection,
Alignment, and Recognition, 2008.
[16] F. R. Bach and M. I. Jordan. Kernel Independent Component Analysis. Journal of Machine
Learning Research, 3(Jul):1–48, 2002.
[17] S. Bell and K. Bala. Learning visual similarity for product design with convolutional neural
networks. ACM Transactions on Graphics (SIGGRAPH), 34(4):98, 2015.
[18] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new
perspectives. Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
[19] T. L. Berg, A. C. Berg, and J. Shih. Automatic attribute discovery and characteriza-
tion from noisy web data. In European Conference on Computer Vision, pages 663–676.
Springer, 2010.
[20] A. Bergamo and L. Torresani. Exploiting weakly-labeled web images to improve object
classification: A domain adaptation approach. In Conference on Neural Information Pro-
cessing Systems, 2010.
[21] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine
Learning Research, 3(Jan):993–1022, 2003.
[22] L. Bossard, M. Dantone, C. Leistner, C. Wengert, T. Quack, and L. Van Gool. Apparel
classification with style. In Asian Conference on Computer Vision, 2012.
[23] R. Catherine and W. Cohen. Transnets: Learning to transform for recommendation. In
Proceedings of the eleventh ACM Conference on Recommender Systems, pages 288–296,
2017.
[24] J. Y. Chai, C. Zhang, and R. Jin. An empirical investigation of user term feedback in
text-based targeted image search. ACM Transactions on Information Systems (TOIS),
25(1):3, 2007.
[25] G. Chechik, V. Sharma, U. Shalit, and S. Bengio. Large scale online learning of image
similarity through ranking. Journal of Machine Learning Research, 11(Mar):1109–1135,
2010.
[26] H. Chen, A. Gallagher, and B. Girod. Describing clothing by semantic attributes. In
European Conference on Computer Vision, pages 609–623. Springer, 2012.
[27] Q. Chen, J. Huang, R. Feris, L. M. Brown, J. Dong, and S. Yan. Deep domain adaptation
for describing people based on fine-grained clothing attributes. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 5315–5324, 2015.
[28] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis.
Pattern Analysis and Machine Intelligence, 24(5):603–619, 2002.
[29] A. Conneau, H. Schwenk, L. Barrault, and Y. Lecun. Very deep convolutional networks
for natural language processing. arXiv preprint arXiv:1606.01781, 2016.
[31] T. Cour, F. Benezit, and J. Shi. Spectral segmentation with multiscale graph decomposi-
tion. In IEEE Conference on Computer Vision and Pattern Recognition, 2005.
[34] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierar-
chical image database. In IEEE Conference on Computer Vision and Pattern Recognition,
pages 248–255, 2009.
[35] W. Di, C. Wah, A. Bhardwaj, R. Piramuthu, and N. Sundaresan. Style finder: Fine-
grained clothing style detection and retrieval. In IEEE Conference on Computer Vision
and Pattern Recognition Workshops, 2013.
[36] F. Diaz, B. Mitra, and N. Craswell. Query expansion with locally-trained word embeddings.
arXiv preprint arXiv:1605.07891, 2016.
[37] B. Dinakaran, J. Annapurna, and C. A. Kumar. Interactive image retrieval using text and
image content. Cybern Inf Tech, 10:20–30, 2010.
[38] P. Dollár and C. L. Zitnick. Structured forests for fast edge detection. In IEEE International
Conference on Computer Vision, 2013.
[39] J. Dong, Q. Chen, W. Xia, Z. Huang, and S. Yan. A deformable mixture parsing model
with parselets. In IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[40] A. Eriksson, C. Olsson, and F. Kahl. Normalized cuts revisited: A reformulation for seg-
mentation with linear grouping constraints. Journal of Mathematical Imaging and Vision,
39(1):45–61, 2011.
[41] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal
visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–
338, June 2010.
[44] V. Fomin, J. Anmol, S. Desroziers, Y. Kumar, J. Kriss, A. Tejani, and E. Rippeth. High-
level library to help with training neural networks in pytorch. https://github.com/
pytorch/ignite, 2020.
[45] D. Frejlichowski, P. Czapiewski, and R. Hofman. Finding similar clothes based on seman-
tic description for the purpose of fashion recommender system. In Asian Conference on
Intelligent Information and Database Systems, pages 13–22. Springer, 2016.
[47] B. Fulkerson, A. Vedaldi, S. Soatto, et al. Class segmentation and object localization
with superpixel neighborhoods. In IEEE International Conference on Computer Vision,
volume 9, pages 670–677. Citeseer, 2009.
[48] D. Ganguly, D. Roy, M. Mitra, and G. J. Jones. Word embedding based generalized lan-
guage model for information retrieval. In Special Interest Group on Information Retrieval
Conference, 2015.
[49] R. Girshick. Fast R-CNN. In IEEE International Conference on Computer Vision, pages
1440–1448, 2015.
[50] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain
adaptation. In IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[51] Y. Gong, Y. Jia, T. Leung, A. Toshev, and S. Ioffe. Deep convolutional ranking for
multilabel image annotation. arXiv preprint arXiv:1312.4894, 2013.
[54] R. Gopalan, R. Li, and R. Chellappa. Domain adaptation for object recognition: An
unsupervised approach. In IEEE Conference on Computer Vision and Pattern Recognition,
2011.
[55] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional lstm
and other neural network architectures. Neural Networks, 18(5-6):602–610, 2005.
[57] J.-W. Ha, H. Pyo, and J. Kim. Large-scale item categorization in e-commerce using multiple
recurrent neural networks. In Special Interest Group on Knowledge Discovery and Data,
pages 107–115. ACM, 2016.
[58] M. Hadi Kiapour, X. Han, S. Lazebnik, A. C. Berg, and T. L. Berg. Where to buy it:
Matching street clothing photos in online shops. In IEEE Conference on Computer Vision
and Pattern Recognition, 2015.
[60] X. Han, Z. Wu, P. X. Huang, X. Zhang, M. Zhu, Y. Li, Y. Zhao, and L. S. Davis. Automatic
spatially-aware fashion concept discovery. In IEEE International Conference on Computer
Vision, pages 1463–1471, 2017.
[61] X. Han, Z. Wu, Y.-G. Jiang, and L. S. Davis. Learning fashion compatibility with bidirec-
tional LSTMs. In ACM International Conference on Multimedia, pages 1078–1086, 2017.
[62] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level
performance on imagenet classification. In IEEE International Conference on Computer
Vision, pages 1026–1034, 2015.
[63] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In
IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[64] R. He and J. McAuley. Ups and downs: Modeling the visual evolution of fashion trends
with one-class collaborative filtering. In International Conference on World Wide Web,
pages 507–517. International World Wide Web Conferences Steering Committee, 2016.
[65] X. He, R. S. Zemel, and D. Ray. Learning and incorporating top-down cues in image
segmentation. In European conference on computer vision, pages 338–351. Springer, 2006.
[66] M. Hodosh, P. Young, and J. Hockenmaier. Framing image description as a ranking task:
Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47:853–
899, 2013.
[67] T. Hofmann. Probabilistic latent semantic indexing. In Special Interest Group on Infor-
mation Retrieval Conference, 1999.
[68] D. Hoiem, A. A. Efros, and M. Hebert. Geometric context from a single image. In IEEE
International Conference on Computer Vision, volume 1, pages 654–661. IEEE, 2005.
[69] W.-L. Hsiao and K. Grauman. Learning the latent “look”: Unsupervised discovery of
a style-coherent embedding from fashion images. In IEEE International Conference on
Computer Vision, pages 4213–4222. IEEE, 2017.
[70] W.-L. Hsiao and K. Grauman. Creating capsule wardrobes from fashion images. In IEEE
Conference on Computer Vision and Pattern Recognition, pages 7161–7170, 2018.
[71] J. Huang, R. S. Feris, Q. Chen, and S. Yan. Cross-domain image retrieval with a dual
attribute-aware ranking network. In Proceedings of the IEEE international conference on
computer vision, pages 1062–1070, 2015.
[72] S. J. Hwang, K. Grauman, and F. Sha. Analogy-preserving semantic embedding for visual
object categorization. In International Conference on Machine Learning, pages 639–647,
2013.
[73] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by
reducing internal covariate shift. In International Conference on Machine Learning, 2015.
[74] V. Jagadeesh, R. Piramuthu, A. Bhardwaj, W. Di, and N. Sundaresan. Large scale visual
recommendations from street fashion images. In Proceedings of the 20th ACM SIGKDD
international conference on Knowledge discovery and data mining, pages 1925–1934, 2014.
[75] V. Jampani, D. Sun, M.-Y. Liu, M.-H. Yang, and J. Kautz. Superpixel sampling networks.
In IEEE European Conference on Computer Vision, pages 352–368, 2018.
[79] H. Kim, S. Lee, D. Lee, S. Choi, J. Ju, and H. Myung. Real-time human pose estimation
and gesture recognition from depth images using superpixels and svm classifier. Sensors,
15(6):12410–12427, 2015.
[80] Y. Kim. Convolutional neural networks for sentence classification. arXiv preprint
arXiv:1408.5882, 2014.
[82] Y. Kita. Elastic-model driven analysis of several views of a deformable cylindrical object.
volume 18, pages 1150–1162. IEEE, 1996.
[85] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolu-
tional neural networks. In Conference on Neural Information Processing Systems, 2012.
[86] A. Krogh and J. A. Hertz. A simple weight decay can improve generalization. In Advances
in Neural Information Processing Systems, pages 950–957, 1992.
[87] S. Lai, L. Xu, K. Liu, and J. Zhao. Recurrent convolutional neural networks for text
classification. In Association for the Advancement of Artificial Intelligence, volume 333,
pages 2267–2273, 2015.
[90] D. Li, H.-Y. Lee, J.-B. Huang, S. Wang, and M.-H. Yang. Learning structured semantic
embeddings for visual recognition. arXiv preprint arXiv:1706.01237, 2017.
[91] Y. Li, L. Cao, J. Zhu, and J. Luo. Mining fashion outfit composition using an end-to-end
deep learning approach on set data. IEEE Transactions on Multimedia, 19(8):1946–1955,
2017.
[92] R.-S. Lin, D. A. Ross, and J. Yagnik. Spec hashing: Similarity preserving algorithm for
entropy-based coding. In 2010 IEEE Computer Society Conference on Computer Vision
and Pattern Recognition, pages 848–854. IEEE, 2010.
[93] F. Liu, C. Shen, G. Lin, and I. Reid. Deep convolutional neural fields for depth estimation
from a single image. In IEEE Conference on Computer Vision and Pattern Recognition,
2015.
[94] S. Liu, J. Feng, C. Domokos, H. Xu, J. Huang, Z. Hu, and S. Yan. Fashion parsing with
weak color-category labels. IEEE Transactions on Multimedia, 16(1):253–265, 2013.
[95] S. Liu, Z. Song, G. Liu, C. Xu, H. Lu, and S. Yan. Street-to-shop: Cross-scenario clothing
retrieval via parts alignment and auxiliary set. In 2012 IEEE Conference on Computer
Vision and Pattern Recognition, pages 3330–3337. IEEE, 2012.
[96] T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang, and H.-Y. Shum. Learning to detect
a salient object. Pattern Analysis and Machine Intelligence, 33(2):353–367, 2011.
[97] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang. DeepFashion: Powering robust clothes
recognition and retrieval with rich annotations. In IEEE Conference on Computer Vision
and Pattern Recognition, June 2016.
[98] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Jour-
nal of Computer Vision, 60(2):91–110, 2004.
[99] C. Lynch, K. Aryafar, and J. Attenberg. Images don’t lie: Transferring deep visual semantic
features to large-scale multimodal learning to rank. arXiv preprint arXiv:1511.06746, 2015.
[100] L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool. Pose guided person
image generation. In Advances in Neural Information Processing Systems, pages 406–416,
2017.
[101] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning
Research, 9(Nov):2579–2605, 2008.
[102] T. Malisiewicz and A. A. Efros. Improving spatial support for objects via multiple seg-
mentations. 2007.
[103] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural
images and its application to evaluating segmentation algorithms and measuring ecological
statistics. In IEEE International Conference on Computer Vision, 2001.
[105] K. Matzen, K. Bala, and N. Snavely. Streetstyle: Exploring world-wide clothing styles
from millions of photos. arXiv preprint arXiv:1706.01869, 2017.
[106] J. McAuley, C. Targett, Q. Shi, and A. Van Den Hengel. Image-based recommendations
on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference
on Research and Development in Information Retrieval, pages 43–52, 2015.
[108] T. McInerney and D. Terzopoulos. A finite element model for 3D shape reconstruction and
nonrigid motion tracking. In IEEE International Conference on Computer Vision, 1993.
[109] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations
in vector space. arXiv preprint arXiv:1301.3781, 2013.
[111] B. Mitra, E. Nalisnick, N. Craswell, and R. Caruana. A dual embedding space model for
document ranking. arXiv preprint arXiv:1602.01137, 2016.
[112] C. E. Moody. Mixing dirichlet topic models and word embeddings to make lda2vec. arXiv
preprint arXiv:1605.02019, 2016.
[113] F. Moreno-Noguer. 3d human pose estimation from a single image via distance matrix
regression. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2823–
2832, 2017.
[114] F. Moreno-Noguer and P. Fua. Stochastic exploration of ambiguities for nonrigid shape
recovery. volume 35, pages 463–475. IEEE, 2013.
[115] F. Moreno-Noguer and J. Porta. Probabilistic simultaneous pose and non-rigid shape. In
IEEE Conference on Computer Vision and Pattern Recognition, pages 1289–1296, 2011.
[116] F. Moreno-Noguer, M. Salzmann, V. Lepetit, and P. Fua. Capturing 3D stretchable surfaces
from single images in closed form. In IEEE Conference on Computer Vision and Pattern
Recognition, 2009.
[117] G. Mori. Guiding model search using segmentation. In IEEE International Conference on
Computer Vision, 2005.
[118] G. Mori, X. Ren, A. A. Efros, and J. Malik. Recovering human body configurations:
Combining segmentation and recognition. In IEEE Conference on Computer Vision and
Pattern Recognition, volume 2, pages II–II. Citeseer, 2004.
[119] M. Muja and D. G. Lowe. Fast approximate nearest neighbors with automatic algorithm
configuration. VISAPP (1), 2(331-340):2, 2009.
[120] P. Neubert, N. Sünderhauf, and P. Protzel. Superpixel-based appearance change prediction
for long-term navigation across seasons. Robotics and Autonomous Systems, 69:15–27, 2015.
[121] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and
J. Dean. Zero-shot learning by convex combination of semantic embeddings. arXiv preprint
arXiv:1312.5650, 2013.
[122] C. Pantofaru, C. Schmid, and M. Hebert. Object recognition by integrating multiple
image segmentations. In IEEE European Conference on Computer Vision, pages 481–494.
Springer, 2008.
[123] K. Pearson. LIII. On lines and planes of closest fit to systems of points in space. The
London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–
572, 1901.
[124] F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization.
In IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[125] F. Perronnin, Y. Liu, J. Sánchez, and H. Poirier. Large-scale image retrieval with com-
pressed fisher vectors. In IEEE Conference on Computer Vision and Pattern Recognition,
2010.
[126] F. Perronnin, J. Sánchez, and T. Mensink. Improving the fisher kernel for large-scale image
classification. In IEEE European Conference on Computer Vision, 2010.
[127] M. Polato and F. Aiolli. Exploiting sparsity to build efficient kernel based collaborative
filtering for top-n item recommendation. Neurocomputing, 268:17–26, 2017.
[128] A. Pumarola, A. Agudo, L. Porzi, A. Sanfeliu, V. Lepetit, and F. Moreno-Noguer.
Geometry-aware network for non-rigid shape prediction from a single view. In Proceedings
of the Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[134] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time
object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages
779–788, 2016.
[135] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection
with region proposal networks. In Advances in Neural Information Processing Systems,
pages 91–99, 2015.
[136] X. Ren and J. Malik. Learning a classification model for segmentation. In Pattern Analysis
and Machine Intelligence, pages 10–17. IEEE, 2003.
[137] Z. Ren, H. Jin, Z. Lin, C. Fang, and A. Yuille. Multi-instance visual-semantic embedding.
arXiv preprint arXiv:1512.06963, 2015.
[138] O. Rippel, M. Paluri, P. Dollar, and L. Bourdev. Metric learning with adaptive density
discrimination. arXiv preprint arXiv:1511.05939, 2015.
[142] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new
domains. In IEEE European Conference on Computer Vision, 2010.
[143] S. Saito, T. Simon, J. Saragih, and H. Joo. Pifuhd: Multi-level pixel-aligned implicit
function for high-resolution 3d human digitization. In IEEE Conference on Computer
Vision and Pattern Recognition, June 2020.
[147] S. Schmit and C. Riquelme. Human interaction with recommendation systems: On bias
and exploration. ArXiv e-prints, 1050:1, 2017.
[148] J. Shi and J. Malik. Normalized cuts and image segmentation. Pattern Analysis and
Machine Intelligence, 22(8):888–905, 2000.
[151] E. Simo-Serra and H. Ishikawa. Fashion style in 128 floats: Joint ranking and classification
using weak data for feature extraction. In IEEE Conference on Computer Vision and
Pattern Recognition, 2016.
[152] E. Simo-Serra, A. Quattoni, C. Torras, and F. Moreno-Noguer. A joint model for 2d and
3d pose estimation from a single image. In IEEE Conference on Computer Vision and
Pattern Recognition, pages 3634–3641, 2013.
[154] E. Simo-Serra, C. Torras, and F. Moreno-Noguer. DaLI: Deformation and light invariant
descriptor. In International Journal of Computer Vision (IJCV), volume 115, pages 135–
154, 2015.
[155] E. Simo-Serra, C. Torras, and F. Moreno-Noguer. 3d human pose tracking priors using
geodesic mixture models. International Journal of Computer Vision, 122(2):388–408, 2017.
[156] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556, 2014.
[157] J. Sivic and A. Zisserman. Video google: A text retrieval approach to object matching in
videos. In IEEE International Conference on Computer Vision, page 1470. IEEE, 2003.
[158] J. R. Smith. The real problem of bridging the “semantic gap”. In International Workshop
on Multimedia Content Analysis and Mining, pages 16–17. Springer, 2007.
[159] Z. Song, M. Wang, X.-s. Hua, and S. Yan. Predicting occupation via human clothing and
contexts. In IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[162] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception
architecture for computer vision. In IEEE Conference on Computer Vision and Pattern
Recognition, pages 2818–2826, 2016.
[163] J. K. V. Tan, I. Budvytis, and R. Cipolla. Indirect deep structured learning for 3d human
body shape and pose prediction. 2017.
[164] P. Tangseng, Z. Wu, and K. Yamaguchi. Looking at outfit to parse clothing. arXiv preprint
arXiv:1703.01386v1, Mar 2017.
[165] P. J. Toivanen. New geodesic distance transforms for gray-scale images. Pattern Recognition
Letters, 17(5):437–450, 1996.
[166] A. Torralba, R. Fergus, Y. Weiss, et al. Small codes and large image databases for recogni-
tion. In IEEE Conference on Computer Vision and Pattern Recognition, volume 1, page 2.
Citeseer, 2008.
[170] W.-C. Tu, M.-Y. Liu, V. Jampani, D. Sun, S.-Y. Chien, M.-H. Yang, and J. Kautz. Learning
superpixels with segmentation-aware affinity loss. In IEEE Conference on Computer Vision
and Pattern Recognition, pages 568–576, 2018.
[171] H.-Y. Tung, H.-W. Tung, E. Yumer, and K. Fragkiadaki. Self-supervised learning of motion
capture. In Advances in Neural Information Processing Systems, pages 5236–5246, 2017.
[172] O. Y. Vacit, L. Yu, and J. van de Weijer. Main product detection with graph networks in
fashion. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[173] M. Van den Bergh, X. Boix, G. Roig, and L. Van Gool. Seeds: Superpixels extracted via
energy-driven sampling. International Journal of Computer Vision, 111(3):298–314, 2015.
[174] A. Van Den Hengel, A. Dick, T. Thormählen, B. Ward, and P. H. Torr. Videotrace:
rapid interactive scene modelling from video. In ACM Transactions on Graphics (ToG),
volume 26, page 86. ACM, 2007.
[175] A. Vedaldi and S. Soatto. Quick shift and kernel methods for mode seeking. In IEEE
European Conference on Computer Vision. 2008.
[176] O. Veksler, Y. Boykov, and P. Mehrani. Superpixels and supervoxels in an energy opti-
mization framework. In IEEE European Conference on Computer Vision. 2010.
[177] L. Vincent and P. Soille. Watersheds in digital spaces: an efficient algorithm based on
immersion simulations. Pattern Analysis and Machine Intelligence, (6):583–598, 1991.
[181] J. Wan, D. Wang, S. C. H. Hoi, P. Wu, J. Zhu, Y. Zhang, and J. Li. Deep learning for
content-based image retrieval: A comprehensive study. In ACM International Conference
on Multimedia, pages 157–166, 2014.
[182] C. Wang, Z. Liu, and S.-C. Chan. Superpixel-based hand gesture recognition with kinect
depth camera. Multimedia, IEEE Transactions on, 17(1):29–39, 2015.
[183] L. Wang, Y. Li, and S. Lazebnik. Learning deep structure-preserving image-text embed-
dings. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5005–5013,
2016.
[184] M. Wang, M. Azab, N. Kojima, R. Mihalcea, and J. Deng. Structured matching for phrase
localization. In European Conference on Computer Vision, pages 696–711. Springer, 2016.
[185] P. Wang, G. Zeng, R. Gan, J. Wang, and H. Zha. Structure-sensitive superpixels via
geodesic distance. International Journal of Computer Vision, 103(1):1–21, 2013.
[186] S. Wang, H. Lu, F. Yang, and M.-H. Yang. Superpixel tracking. In IEEE International
Conference on Computer Vision, 2011.
[187] Z. Wang and Y. Zhang. Opinion recommendation using neural memory model. arXiv
preprint arXiv:1702.01517, 2017.
[188] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In Advances in Neural Information
Processing Systems, pages 1753–1760, 2009.
[189] P. Werbos. Beyond regression: New tools for prediction and analysis in the behavioral
sciences. Ph. D. dissertation, Harvard University, 1974.
[190] J. Weston, S. Bengio, and N. Usunier. Large scale image annotation: learning to rank with
joint word-image embeddings. Machine learning, 81(1):21–35, 2010.
[191] J. Weston, S. Bengio, and N. Usunier. Wsabie: Scaling up to large vocabulary image
annotation. In Twenty-Second International Joint Conference on Artificial Intelligence,
2011.
[192] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang. Learning from massive noisy labeled
data for image classification. In Proceedings of the IEEE conference on computer vision
and pattern recognition, pages 2691–2699, 2015.
[193] Y. Xiao and K. Cho. Efficient character-level document classification by combining convo-
lution and recurrent layers. arXiv preprint arXiv:1602.00367, 2016.
[194] K. Yamaguchi, T. L. Berg, and L. E. Ortiz. Chic or social: Visual popularity analysis in
online fashion networks. In ACMMM, 2014.
[195] K. Yamaguchi, M. Hadi Kiapour, and T. L. Berg. Paper doll parsing: Retrieving similar
styles to parse clothing items. In IEEE Conference on Computer Vision and Pattern
Recognition, 2013.
[197] K. Yamaguchi, T. Okatani, K. Sudo, K. Murasaki, and Y. Taniguchi. Mix and match:
Joint model for clothing and attribute recognition. In British Machine Vision Conference,
volume 1, page 4, 2015.
[198] F. Yan and K. Mikolajczyk. Deep correlation for matching images and text. In IEEE
Conference on Computer Vision and Pattern Recognition, 2015.
[199] Q. Yan, J. Shi, L. Xu, and J. Jia. Hierarchical saliency detection on extended cssd. arXiv
preprint arXiv:1408.5418, 2014.
[200] J. Yang, D. Parikh, and D. Batra. Joint unsupervised learning of deep representations and
image clusters. In IEEE Conference on Computer Vision and Pattern Recognition, pages
5147–5156, 2016.
[201] W. Yang, P. Luo, and L. Lin. Clothing co-parsing by joint image segmentation and labeling.
In IEEE Conference on Computer Vision and Pattern Recognition, pages 3182–3189, 2014.
[202] J. Yao, M. Boben, S. Fidler, and R. Urtasun. Real-time coarse-to-fine topologically pre-
serving segmentation. In IEEE Conference on Computer Vision and Pattern Recognition,
2015.
[205] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual deno-
tations: New similarity metrics for semantic inference over event descriptions. Transactions
of the Association for Computational Linguistics, 2:67–78, 2014.
[206] J. Yuan, W. Shalaby, M. Korayem, D. Lin, K. AlJadda, and J. Luo. Solving cold-start
problem in large-scale recommendation engines: A deep learning approach. In 2016 IEEE
International Conference on Big Data (Big Data), pages 1901–1910. IEEE, 2016.
[209] S. Zhang, M. Yang, T. Cour, K. Yu, and D. N. Metaxas. Query specific fusion for image
retrieval. In European Conference on Computer Vision, pages 660–673. Springer, 2012.
[210] X. Zhang, L. Liang, and H.-Y. Shum. Spectral error correcting output codes for efficient
multiclass recognition. In 2009 IEEE 12th International Conference on Computer Vision,
pages 1111–1118. IEEE, 2009.
[211] X. Zhang, J. Zhao, and Y. LeCun. Character-level convolutional networks for text classi-
fication. In Conference on Neural Information Processing Systems, 2015.
[212] Y. Zhang, R. Hartley, J. Mashford, and S. Burn. Superpixels via pseudo-boolean optimiza-
tion. In IEEE International Conference on Computer Vision, 2011.
[213] Y. Zhang, Z. Jia, and T. Chen. Image retrieval with geometry-preserving visual phrases.
In IEEE Conference on Computer Vision and Pattern Recognition, pages 809–816. IEEE,
2011.
[214] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu. Deep mutual learning. In IEEE
Conference on Computer Vision and Pattern Recognition, pages 4320–4328, 2018.
[215] B. Zhou, Y. Tian, S. Sukhbaatar, A. Szlam, and R. Fergus. Simple baseline for visual
question answering. arXiv preprint arXiv:1512.02167, 2015.
[216] W. Zhou, H. Li, and Q. Tian. Recent advance in content-based image retrieval: A literature
survey. arXiv preprint arXiv:1706.06064, 2017.
[217] S. Zhu, R. Urtasun, S. Fidler, D. Lin, and C. Change Loy. Be your own prada: Fash-
ion synthesis with structural coherence. In IEEE International Conference on Computer
Vision, pages 1680–1688, 2017.
[218] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In IEEE
European Conference on Computer Vision, pages 391–405, 2014.