Universitat Politècnica de Catalunya
Fashion Discovery: A Computer Vision Approach
Ph.D. Thesis
Directors: Dr. Francesc Moreno Noguer, Dr. Edgar Simo Serra
Company advisor: LongLong Yu
Barcelona, 2020
Abstract
Performing semantic interpretation of fashion images is undeniably one of the most challeng-
ing domains for computer vision. Subtle variations in color and shape might confer different
meanings or interpretations to an image. Not only is it a domain tightly coupled with human
understanding, but also with scene interpretation and context. Being able to extract fashion-
specific information from images and interpret that information in a proper manner can be useful
in many situations and help to understand the underlying information in an image.
Fashion is also one of the most important businesses around the world, with an estimated
value of 3 trillion dollars [2] and a constantly growing online market, which increases the utility
of image-based algorithms to search, classify or recommend garments.
This doctoral thesis aims to solve specific problems related with the treatment of fashion
e-commerce data, from low-level pure pixel information to high-level abstract conclusions of the
garments appearing in an image, taking advantage of the multi-modality of the available data
for developing some of the solutions.
The contributions include:
• A new superpixel extraction method focused on improving the annotation process for cloth-
ing images.
• A multi-modal embedding space in which images and their textual metadata can be jointly
represented and compared.
• The application of this embedding space to the task of retrieving the main product in an
image showing a complete outfit.
In summary, fashion is a complex computer vision and machine learning problem at many
levels, and developing specific algorithms that are able to capture essential information from
pictures and text is not trivial. In order to solve some of the challenges it proposes, and taking
into account that this is an Industrial Ph.D., we contribute with a variety of solutions that can
boost the performance of many tasks useful for the fashion e-commerce industry.
Acknowledgements

I’d like to express my gratitude to all the people who helped me during these years.
First, I must thank my three advisors. Francesc Moreno, the best supervisor one can think
of since day one, and who has paved my professional career with great support and magnificent
opportunities. I wouldn’t be in such a good place today if I hadn’t started working with him
six years ago during my Master’s degree. Edgar, for his brilliant ideas, the countless videocalls
across time zones, the abura soba and the months in Tokyo. And Long, for the time spent coding
side by side and for teaching me how to be a better programmer and a better gardener.
To my parents, always supportive in spite of the distance, putting up with my not-so-frequent
phone calls and my summers in other continents. To my little brother, who is now taller than
me but still gets mad when he loses at FIFA.
To everyone in Wide Eyes for giving me the opportunity of working in a startup environment.
Time spent there had its ups and downs, but allowed me to meet extraordinary people. Special
thanks to Arnau for all the conversations, advice and invaluable help, and Oğuz for the support,
the funny moments and the short but intense ping pong tournament.
To the Driblinhos team, with whom I’ve shared some of the most amazing experiences in
my life and many kilometers in planes, cars, vans and boats. To all my flatmates in Barcelona,
especially vegetarian Argentinian drummers and Catalan smoke sellers and DoPs.
Of course, if you’re reading this thesis, it is undoubtedly in great part because of the great
people at IRI. Downstairs, in office 6 and finally in office 19, it has been a pleasure to share space,
knowledge and jokes with all of you. Special thanks to all the subgroups with regular meetings
I proudly belong to: the tupper crew, meeting daily, because eating reheated leftovers outdoors
without even having a table never felt so good; IRI football, meeting weekly, for making me enjoy
sport again after I broke my leg for a second time; and the Agustins Academy, meeting yearly, for
investing a huge amount of time and effort in something with the only objective of laughing for
half an hour. Major shout out to Rick (office mate for a few months, the best footballer in IRI
for years), Juanan (half brilliant / half competitive devil), Carlogarro (pure light) and Fherrero
(who always makes the best jokes, which allows him to make the worst ones sometimes) for
surprisingly revitalizing my IRI days when I thought they were almost over.
Last, but not least, exceptional regards to Est for the countless laughs, for being kind
enough to present her Ph.D. before me to show me how it’s done, and for being a part of my life
since eπ in the 005. Thank you for all the years of music.
Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Thesis overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Overview 7
2.1 Computer Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Computer Vision and Machine Learning for Fashion . . . . . . . . . . . . . . . . 15
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3 Superpixel Segmentation 23
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.1 Graph-based algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.2 Seed-growing methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.3 Coarse-to-fine methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Boundary-aware Superpixel Segmentation . . . . . . . . . . . . . . . . . . . . . . 29
3.3.1 Boundary detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.2 Seeds initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.3 Energy function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3.4 Optimization process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4 Experimental evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4.1 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.2 Comparison against State of the Art . . . . . . . . . . . . . . . . . . . . . 36
3.4.3 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
7 Conclusions 81
7.1 Superpixel Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.1.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.2 Fashion Multi-modal Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.2.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.3 Main Product Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.3.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
List of Figures
4.1 Example of a text and nearest images from the test set . . . . . . . . . . . . . . . 44
4.2 Examples of text preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3 CBIR - Content-Based Image Retrieval . . . . . . . . . . . . . . . . . . . . . . . 47
4.4 Architecture of the proposed neural network . . . . . . . . . . . . . . . . . . . . . 50
4.5 Example of training samples with a context window size of 2 words . . . . . . . . 51
4.6 Scheme of the word2vec method . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.7 Word2vec’s arithmetical properties . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.8 Contrastive loss behaviour with positive and negative pairs . . . . . . . . . . . . 54
4.9 Examples of products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
And I knew exactly what to do. But in a much more real sense,
I had no idea what to do.
Michael Scott
Introduction
1.1 Motivation
There is a huge difference between hearing music and listening to music. While the first action
tends to be unsubstantial and in some cases even annoying, the second one can be rewarding at
so many levels. The same discrepancy in meaning can be observed when having a conversation:
listening to your interlocutors implies paying attention to what they say, while hearing them is
basically acknowledging that some sound is stimulating your auditory system while you think
about finishing your thesis dissertation or what to have for dinner.
A direct analogy can be drawn for images. It is completely different to see a picture and
to look at a picture. The former means just receiving light impulses through your eyes, while
the latter includes discerning complex relationships between what you see, like grouping regions
into objects, spatial reconstruction of the scene or higher level tasks such as understanding a
scene: what or who is there? What is the purpose of the things or people you see in a particular
place or disposition? Is there some contextual information you can extract from just looking at
a picture?
This complicated understanding process is done almost automatically by our brains, but
achieving the same level and speed is far from trivial for machines. The difference between
being able to extract basic features from what a computer “sees” and being able to decipher the
true meaning of an image is known as the semantic gap [158], and is one of the key problems in
computer vision.
When talking about pictures with people on them, clothing is probably one of the most
important sources of information. It can provide lots of context about the event where the
picture was taken, like a sports venue or a wedding, or about the people appearing in it, like
their occupation or whether they belong to some sort of social tribe.
Not only is fashion a very interesting field in terms of computer vision, it is also a buoyant
market whose e-commerce revenues are expected to exceed $638 billion only in the U.S. by
2022 [1]. Apart from having had a skyrocketing growth in the last decade, the majority of the
expected new consumers are within the 16 to 34 age group. The interest of younger generations in
fashion might be partly driven by the rise of social networks. As an example, in 2019, it is
estimated that Instagram users uploaded 95 million pictures and 500 million stories every day —
short publications deleted after 24 hours. Many of these publications are human-centric pictures
where the user wants to look as good as possible, and clothing style is key. Social media is
the new word-of-mouth, and all these pictures (some of them tagged with the clothing product
information) generate a large percentage of the income for fashion e-commerce sites. Hence
the importance of having a useful and efficient website, where products are easy to find and
recommendations are meaningful. In both aspects, and in many more, computer vision plays
a critical role. This confluence of factors makes the last decade a perfect breeding ground for
companies applying computer vision and machine learning techniques to fashion, like Wide Eyes
Technologies, where part of this research has been carried out.
This is an industrial Ph.D., i.e., the research work has been split between the university and
a company. Concretely, between the Universitat Politècnica de Catalunya (UPC) and Wide Eyes
Technologies, a startup company devoted to computer vision for the fashion industry. Wide Eyes
(WE) is a business-to-business (B2B) technology company providing computer vision solutions
to companies. WE focuses on image similarity, offering services to retrieve clothes using pictures
as queries and to find similar products in a customer’s catalog. They also work on attribute
Figure 1.1: Quick overview of the thesis. From the low level task of superpixel segmentation
to a high level text-based product detection on images, this thesis aims to contribute to the
fashion industry from a computer vision point of view.
detection, which allows them to offer an auto-tagging system to their clients.
Wide Eyes’ business model is based on the machine learning models created and trained by
the research team, in which I was working during the three years of development of this thesis.
Therefore, the research work presented in this dissertation is strongly influenced by the needs of
Wide Eyes, and has been developed using their data and infrastructure along with the university
resources (both from UPC and Waseda University in Tokyo, where a three-month research stay
was carried out).
1.2 Goals
This dissertation aims to extract and relate the semantic information contained in fashion e-
commerce data, with a special focus on developing systems that solve some of the current problems
at Wide Eyes. More precisely, it starts by facilitating the task of fashion image annotation by
developing a new oversegmentation algorithm that drastically reduces the number of segments.
As will be explained in Chapter 3, this algorithm can be used in other domains, since it reduces
the complexity of the oversegmentation results without losing information. Furthermore, in
Chapter 4 we improve the garment retrieval system by leveraging joint information coming from
images and textual descriptions in Wide Eyes datasets. Finally, Chapter 5 is devoted to the
usage of a similar embedding space to detect the product being sold in a fashion e-commerce
image that contains other products based on its textual description. Please see Fig. 1.1 for a
quick overview of the main topics of the thesis.
1.3 Contributions
During the development of this thesis, we started from a low-level viewpoint (grouping the
pixels of an image based on their color and distribution), later tackled mid-level tasks like the
creation of a multi-modal embedding and its usage in image retrieval, and ended with a higher-level
application that lies in between object detection and phrase localization. The main contributions
produced along this path can be divided into four main parts:
1. BASS Superpixel algorithm: we present a new algorithm for the task of image over-
segmentation. Our algorithm, called Boundary-Aware Superpixel Segmentation (BASS),
produces larger superpixels in regions of the image with fewer changes in texture,
and smaller superpixels where they are needed, in regions with more details. This allows us
to generate more compact representations of images, avoiding the addition of extra super-
pixels that don’t provide new information. Therefore, these superpixels can be used to
drastically reduce the number of superpixels to annotate in the task of manual annotation
of images for semantic segmentation. The algorithm is further explained in Chapter 3.
2. Multi-modal embedding: the majority of this thesis has been devoted to the construc-
tion and utilization of a fashion-specific multi-modal embedding space. In this space,
images and textual metadata representing fashion items can be projected, and distances
between both types of information can be easily computed. This space, generated using
a neural network, can be applied to tasks such as image and text retrieval and text-based
object detection, as explained in the next two contributions.
3. Product retrieval: the first application for which we use the multi-modal embedding is
product retrieval. We extract the embedded descriptor of a query (that can be an image or a
text) and find the closest images and texts in a large dataset (half a million products) created
from fashion e-commerce data. Chapter 4 explains the construction of the embedding and
its application to image retrieval.
4. Main product detection: finally, in Chapter 5 we apply the idea of the multi-modal
embedding to a very specific task that we call main product detection. Framed as a specific
form of phrase localization, our method aims to find, given its textual metadata, the main
product in an image depicting a model wearing several garments, i.e., the garment the text
refers to. This contribution obtained the best paper award at the 2017 ICCV Workshop
on Computer Vision for Fashion.
In addition to these contributions, many different technological tasks (less research-
oriented) were carried out during these years working at Wide Eyes. The most im-
portant ones are briefly described in Chapter 6.
1.4 Publications
The following papers have been published during the development of this thesis:
1.5 Thesis overview

• Chapter 2 reviews the current state-of-the-art in computer vision, machine learning and
fashion, focusing on the common areas between the three fields and relating them with our
work.
• Chapter 3 describes our approach to image oversegmentation and compares its results
with different state-of-the-art techniques on multiple and varied datasets.
• Chapter 4 focuses on the creation of a multi-modal embedding space for images and
texts, and its utility as a retrieval system, using classification as a tool and a byproduct.
The retrieval and classification performances are evaluated in a dataset with nearly half a
million products and compared against a baseline method.
• Chapter 7 summarizes the work explained in the previous chapters and proposes future
research lines to take it further.
Overview
The two principal ingredients of this thesis are computer vision and fashion. When talking about
computer vision in this decade, it is inevitable to mention deep learning as well. Therefore, we
organize this overview chapter in three main sections: computer vision, deep learning and fashion,
trying to emphasize how the three are not only separate trending fields, but interconnected
domains full of challenging problems for research. Further specific related work will be provided
in each of the subsequent chapters.
2.2 Deep Learning
Figure 2.1: Artificial Neural Network. Graphical depiction of an Artificial Neural Network
(ANN) with two fully connected layers.
the monopoly on many of the computer vision solutions proposed to solve problems based on
feature extraction from images, like classification, object detection or retrieval. Next, we provide
a background on deep learning and its contributions to some recent advances in computer vision.
i.e. the sum of weights connecting every neuron k in layer l − 1 with neuron j in layer l (w_jk^l)
Figure 2.2: Activation functions. Graphics for different activation functions (ReLU, sigmoid,
hyperbolic tangent and Parametric ReLU).
the weights connecting the neurons in layer l − 1 with the neurons in l, and the vector containing
the bias of the neurons in the layer l.
These parameters are learned using the well-known backpropagation algorithm, which fine-
tunes the weights of the network based on an error value obtained with a loss function in the
previous iteration. This technique was introduced in the 1970s, but wasn’t fully appreciated
until Rumelhart et al. [139] proved that it worked faster than previous approaches for learning.
Backpropagation consists of computing the partial derivatives of an error or loss function L with
respect to the network weights. These derivatives are computed layer by layer applying the chain
rule of calculus.
When training a network, an iterative process is followed with the following steps until some
stopping condition is met:
1. Forward pass: each sample is fed into the network and propagated through all the layers
using Eq. (2.2).
2. Loss computation: the value of L is obtained (using the corresponding ground truth
information of the sample in the case of supervised training).
3. Backward pass: propagation of the error backwards through the network from the last
layer to the first.
4. Weights update: the variation in the values of the weights after iteration i, ∆W^i, is
computed as the derivative of the loss multiplied by a learning rate factor η, so weight
values are updated as in Eq. (2.3), and the same for bias values (Eq. (2.4)).
W^{i+1} = W^i − ∆W^i = W^i − η ∂L/∂W^i                    (2.3)

b^{i+1} = b^i − ∆b^i = b^i − η ∂L/∂b^i                    (2.4)
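To make these update rules concrete, the following is a minimal NumPy sketch (illustrative only, not the code used in this thesis) of one training iteration for a single fully connected layer with a sigmoid activation and a mean squared error loss; all variable names are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 8 samples with 4 features each, and scalar targets.
X = rng.normal(size=(8, 4))
y = rng.normal(size=(8, 1))

# Parameters of a single fully connected layer (4 inputs -> 1 output).
W = rng.normal(scale=0.1, size=(4, 1))
b = np.zeros((1, 1))
eta = 0.1  # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 1. Forward pass.
z = X @ W + b                    # pre-activation
a = sigmoid(z)                   # activation (network output)

# 2. Loss computation (mean squared error against the ground truth).
loss = np.mean((a - y) ** 2)

# 3. Backward pass: chain rule from the loss down to the parameters.
dL_da = 2.0 * (a - y) / len(X)
da_dz = a * (1.0 - a)            # derivative of the sigmoid
delta = dL_da * da_dz            # dL/dz
dL_dW = X.T @ delta              # dL/dW
dL_db = delta.sum(axis=0, keepdims=True)

# 4. Weight update, as in Eqs. (2.3) and (2.4).
W = W - eta * dL_dW
b = b - eta * dL_db

print(f"loss after the forward pass: {loss:.4f}")
```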
Figure 2.3: Convolution operation. The filter is multiplied element-wise with the correspond-
ing region of the source image, and the results are summed, obtaining a single value that goes
into the next layer. The convolution of the filter with the image in all the possible locations
produces all the values of the next layer.
In computer vision, where the inputs to the network are images, it is quite usual to use
convolutional layers (instead of the fully connected layers just explained). Here, the input is not
a vector, but a multi-channel image. A convolutional layer normally comprises a set of filters
that are independently convolved with the image, each one of them providing a feature map. The
convolution operation of one filter is the result of taking the dot product between the filter and
a small part of the image, sliding over all possible locations for the filter. Therefore, it produces a
scalar value for each location. In this way, the size of the resulting feature map is (if no padding is
added) smaller than the image size. See Fig. 2.3 for a graphical explanation. Using convolutional
layers, the 2D spatial information is preserved, and the number of parameters decreases at the
cost of increasing the number of operations. The behaviour of these layers is similar to fully
connected ones, but the weights are shared for all the neurons in the layer. The filters become
our learnable parameters. Figure 2.4 shows how, depending on the depth of the network, filters
learn to activate when certain edges, patterns or colors are detected on the image.
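As a concrete illustration of the operation in Fig. 2.3, below is a minimal single-channel convolution sketch (no padding, stride 1); the filter values are only an example.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the filter over every valid location, multiply it element-wise
    with the underlying patch and sum the result into a single value."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
vertical_edge_filter = np.array([[1.0, 0.0, -1.0],
                                 [1.0, 0.0, -1.0],
                                 [1.0, 0.0, -1.0]])
feature_map = conv2d_valid(image, vertical_edge_filter)
print(feature_map.shape)  # (4, 4): smaller than the input because no padding is used
```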
One limitation of these layers is that they learn the precise position of features in the input
image, so small variations in the image will produce different feature maps. To overcome this
problem, pooling layers are normally added in between convolutional layers. In short, these
layers reduce the feature map size (what is called downsampling) by
taking for instance the maximum (max pooling) or the average (average pooling) over a set of
contiguous pixels.
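A minimal sketch of 2 × 2 max pooling with stride 2 (any remainder is cropped), just to illustrate the downsampling described above:

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    """Downsample by keeping the maximum over non-overlapping size x size blocks."""
    H, W = feature_map.shape
    fm = feature_map[:H - H % size, :W - W % size]   # crop to a multiple of `size`
    blocks = fm.reshape(fm.shape[0] // size, size, fm.shape[1] // size, size)
    return blocks.max(axis=(1, 3))

fm = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(fm))   # each output value is the maximum of a 2x2 block
```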
An architecture with tens of millions of parameters is prone to overfitting if the training is
long enough. In order to avoid this problem and get networks with a certain ability to generalize
their results, apart from using a dataset as big as possible, some regularization techniques are
beneficial. Some of the most common are L2 or L1 regularization (adding a term to the loss
Figure 2.4: Filters learned after training. Note how the ones in the first layers are more
abstract (vertical or horizontal patterns, or even simpler edges with different orientations),
while those in deeper layers will have a high response to very specific shapes. Figure extracted
from [208].
penalizing big values of the weights [86]), dropout [160] (randomly removing some nodes and
their connections for one iteration, so each iteration has a different set of nodes) and data
augmentation (applying some transformation to the input data, like rotating, flipping, shifting,
scaling, cropping, changing brightness or color, etc. in the case of images to artificially increase
the size of the training dataset).
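As a rough sketch of two of these regularizers (an illustration, not the training code used in this thesis), the L2 penalty is simply an extra loss term on the weights, and inverted dropout randomly removes activations during training:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_penalty(weights, lam=1e-4):
    """Extra loss term lam * sum(w^2) that penalizes large weight values."""
    return lam * sum(np.sum(w ** 2) for w in weights)

def dropout(activations, p=0.5, training=True):
    """Inverted dropout: zero out units with probability p during training and
    rescale the survivors so the expected activation stays unchanged."""
    if not training:
        return activations
    mask = (rng.random(activations.shape) >= p) / (1.0 - p)
    return activations * mask

a = rng.normal(size=(4, 8))       # a batch of hidden activations
print(l2_penalty([a]))            # value to be added to the loss
print(dropout(a).round(2))        # roughly half of the units are set to zero
```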
Although ANNs have been around for quite a long time [189], and despite the accumulation
of many minor contributions improving their performance, it wasn’t until computational power
was sufficient to train systems with such high numbers of parameters that they exploded
as one of the most used techniques in computer vision. Now, they provide the state-of-the-art
performance in many tasks, some of which are briefly reviewed below.
Image Classification
Image classification has undergone a revolution since Krizhevsky et al. [85] reduced the error rate
of the ImageNet Large Scale Visual Recognition Challenge [140] by more than 10 points thanks to the
utilization of GPUs (graphics processing units), which enabled them to train their expensive model (Fig. 2.5
shows some examples of the images in the challenge). The network they defined, AlexNet, was
formed by five convolutional layers with some max pooling layers in between, followed by three
fully connected layers. They used the non-saturating Rectified Linear Unit (ReLU) activation
function (f (x) = max(0, x)), that significantly increased the training speed with respect to the
Figure 2.5: ImageNet dataset examples. Some images extracted from the ImageNet dataset
and their corresponding category labels. Figure extracted from [34].
usual hyperbolic tangent (f (x) = tanh(x)) or sigmoid (f (x) = (1 + e−x )−1 ). See Fig. 2.2 for a
graphic explanation.
This achievement cleared the path for all the deep learning techniques used for this and other
tasks ever since. For instance, Simonyan and Zisserman [156] and Szegedy et al. [161] studied
how the depth of a Convolutional Neural Network (CNN) affected its classification accuracy.
He et al. [62] claimed to achieve superhuman performance using Parametric ReLU units
(see Fig. 2.2), and so did Ioffe and Szegedy [73] using Batch Normalization (BatchNorm). Batch
Normalization is a technique that normalizes the output of the previous activation layer using
the current batch’s mean and standard deviation during training. By doing this, the internal
covariate shift (the amount by which the distribution of hidden unit values changes) is reduced,
and the network will perform better when receiving a test set with a distribution different from
the training set. It also increases the stability of the neural network, reducing overfitting and
allowing the use of higher learning rates.
Later, He et al. [63] aimed to ease the training of very deep networks by presenting a residual
framework that inserts shortcut connections in the networks.
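A minimal sketch of the Batch Normalization forward pass described above, at training time, with the learnable scale gamma and shift beta of [73] (illustrative, per-feature statistics only):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature with the statistics of the current batch,
    then scale and shift with the learnable parameters gamma and beta."""
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations
    return gamma * x_hat + beta

x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(16, 4))
out = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(6))  # ~0 per feature
print(out.std(axis=0).round(3))   # ~1 per feature
```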
Object Detection
A more challenging problem in computer vision is object detection, i.e. detecting and classifying
a specific region of an image, for which PASCAL Visual Object Classes (PASCAL VOC [41])
is one of the most used datasets. Some of the most successful works in object detection are
YOLO [134] (Fig. 2.6), which treats detection as a regression problem, Fast-RCNN [49] (described in
more detail in Section 5.3.4, where one of its ideas is applied to our method) and its evolution
Faster-RCNN [135], which introduces a Region Proposal Network (RPN) for both predicting object
bounds and objectness scores, enabling almost cost-free region proposals.
Image Segmentation
Deep learning has also been applied to the image segmentation problem, already explained in
Section 2.1. The goal of this task is assigning a label to each pixel in the image. For this, the
output of the network has to be a feature map with the same height and width of the original
Figure 2.6: YOLO object detection results. Examples of object detection using YOLO
algorithm. Figure extracted from [134].
input image. Although Chapter 3 of this thesis is devoted to oversegmentation, classical pre-deep
learning techniques are used, and this is just mentioned to draw attention to the fact that in the
future these techniques can be considered as an alternative to the proposed method.
Recapping the overview so far and relating it with the present work, in this thesis we start
focusing on low level problems tackled with more classical techniques (Chapter 3), then evolve to
use deep learning techniques relating image and textual information (Chapter 4) and finish with
a sort of object detection method mixed with phrase localization (Chapter 5). All these methods
are designed with special attention to their application in the fashion domain, and that is why
the next Section presents a brief summary of computer vision and machine learning techniques
applied to fashion.
2.3 Computer Vision and Machine Learning for Fashion
Figure 2.7: Examples of retrieved items for the street-to-shop task. Query on the left,
results on the right, ordered by increasing distance to the query. Ground truth is marked in
green among the results. Figure extracted from [58].
Retrieval
The retrieval task consists of finding relevant items in a dataset that match a given query. More
precisely, talking about fashion product retrieval, the task consists of finding a product inside
a catalog based on —usually— a short textual description or an image. Some retrieval systems
rely upon an attribute-based search, like [35]. Others focus on the so called street-to-shop task,
consisting of matching a street product (a product appearing in a real-world picture, taken by
a non-expert user in an uncontrolled environment) with a shop product (a product appearing
in a studio picture, normally isolated and with a plain background). For this problem, Liu
et al. [95] describe an approach based on features for detected human parts, while Kiapour et
al. [58] propose to solve an exact street-to-shop problem, not only giving as a result a similar
item, but the identical one. They treat the problem as a similarity learning task and formulate
it as a binary classification where example pairs are classified as positive or negative. Some
results are shown in Fig. 2.7. In [179], the authors deal with a slightly different kind of domain
adaptation problem, aiming to find similarities between outfits from fashion runways and street
outfits (see Fig. 2.8). In this thesis, we tackle the problem of multi-domain retrieval, being able
to search for images or text descriptions using either images or texts as queries interchangeably via
the creation of a common embedding space. For a more detailed overview of retrieval methods
relevant to our research, see Section 4.2.
Clothing Categorization
Another common task is garment classification or clothing categorization. That is, given an
image depicting a garment, assigning it a category label (e.g., shirt or shoes). To solve this
problem, Bossard et al. [22] use noisy data to train a multi-class system based on Random
Forests. Chen et al. [26] extract low-level features using human pose information and then learn
attribute classifiers. In our case, classification just serves as an additional help to give semantic
Figure 2.8: Example of runway to realway solution. The query comes from a fashion
runway session, while the results are found among street pictures. Figure extracted from [179].
meaning to the embedding spaces designed in Chapters 4 and 5, and good classification results
are a byproduct of the quality of our embeddings.
Clothing Parsing
Adapting semantic segmentation (one of the classical computer vision problems) to the fashion
domain receives the name of clothing parsing, i.e. assigning a garment label to each pixel in
the image. Solving this task has proven to be a fruitful source of new algorithms, many of
them trained using the Fashionista dataset released by Yamaguchi et al. in [196], consisting of
more than 150,000 fashion photos with associated text annotations. This is one of the datasets
used for evaluation in Chapter 3. Also in [196], they use Conditional Random Fields (CRF) with
knowledge about pose estimation to solve the clothing parsing problem. Other work that uses
pose-aware CRFs applied to this problem is [149], which overcomes the lack of class annotations
at test time by taking into account that the domain of the task is very specific: clothing
a person. Yamaguchi et al. in [195] transfer mask predictions from the retrieved clothing items
to the unannotated query image. In [201], Yang et al. propose a new framework for clothing
parsing by exploiting relationships between contiguous parts of an image as well as parts of
different images with similar appearance.
Some other datasets, like Paperdoll [195] or Fashion144k [150] are useful for semisupervised
algorithms, since their labels are weak and noisy. More recently, DeepFashion dataset [97] has
been released, containing over 800,000 diverse fashion images from different domains, annotated
with multiple categories and attributes per image, and also with bounding boxes of the gar-
ments and clothing landmarks. Although we do not specifically treat this problem along this
dissertation, some of these datasets were used to train or evaluate different algorithms during
the development of the thesis, and the superpixels generated in Chapter 3 are developed with
the specific aim of efficiently annotating clothing parsing data.
Attribute Recognition
Given the large amount of subtle details that can make one garment different from another one,
many research works focus on attribute recognition. For this task, the assumption is that every
Figure 2.9: Annotated samples from Fashionista dataset. Figure extracted from [149].
fashion item in a picture has some lower-level properties that can boost recognition or retrieval
tasks. These properties, called attributes, can be generic (like color or fabric) or garment-specific
(like neck type for upper-body objects, legs length for pants or heels height for shoes). Normally,
each e-commerce site tags its products with a series of attributes that are unique and (probably)
slightly different from the attributes of its competitors.
In [197], the authors aim to recognize fashion attributes (and, at the same time, to detect
clothing items) establishing some inter-object and inter-attribute compatibility information using
CRFs. For instance, dresses are incompatible with pants, skirts or shorts. This compatibility
information comes from a real-world dataset, and can be seen in Fig. 2.10. Berg et al. [19] use
images and text mined from the Internet to automatically discover attributes of different types
(global, local, color, texture or shape). This work, like Chapter 5 of this thesis, aims to relate
noisy text annotations with regions of images. While they focus on discovering attributes of a
particular object, we focus on locating an entire object in images with more than one item.
Next, there is a brief overview of some trends in computer vision for fashion research that were
explored at some point along the development of this thesis, but without reaching a satisfactory
result in the form of a publication or a final product for the company.
Style Understanding
Moving to higher level tasks, we find techniques that seek to extract the underlying style in-
formation given by the combination of different garments that form an outfit. Here, we find
works like [151], where the authors proposed compact 128-dimensional features that encapsulate
the properties of an outfit; [69], that uses unsupervised probabilistic polylingual topic models;
or [150], that proposes a fashionability score to evaluate outfit compatibility. [194] focuses on
Figure 2.10: Pearson correlation between clothing items. Notice exclusive blocks (for
instance, boots and shoes are incompatible, and dress is not very compatible with shirt, top,
pants, skirt or shorts). Figure extracted from [197].
detecting potential popularity of a picture in different scenarios (online and offline). More social-
oriented papers also exploit the relationships between the different garments on images, like [78],
that classifies people in different social tribes according to what they wear, or [159], predicting
people’s occupation based on their clothing. Some authors try to accumulate style information
over time, analyzing trend evolutions, like [64, 10], or even clustering the fashion styles according
to world regions based on millions of Instagram photos to analyze spatio-temporal trends [105].
Recommendation
When the information relating several garments is used not only to predict a compatibility score
or extract style information, but to actively recommend the inclusion of some other garment in
the outfit (or in a specific user wardrobe), we talk about recommendation.
Recommendation techniques are widely used in online shopping, music and video streaming
services or social networks, where one of the main problems is the data sparsity. One very popu-
lar approach for recommendation is based on user ratings, but this requires active participation
by the users, who need to provide some feedback. Recent works have focused on overcoming this
problem. Some of them use the attributes for calculating similarity between outfits to recom-
mend [45], others combine visual appearance and metadata, like [91] or [61], where they treat
outfits as sequences of garments to exploit the benefits of Long Short-Term Memory networks
(LSTMs). Hsiao and Grauman [70] create what they call capsule wardrobes, sets of garments
that can be combined to form several different stylish outfits. Given the subjective nature of the
Figure 2.11: Artificially generated fashion images. Figure extracted from [129].
domain, where small changes determine if two or more garments are fashionable or not, tasks re-
lated to style and recommendation are difficult to evaluate and are often based on co-occurrence
or manual assessment, both of which are normally biased.
One of the biggest trends in computer vision in the past years is Generative Adversarial
Networks (GANs), and fashion researchers have of course taken notice. Some very recent works
in the area are suitable to be used as data augmentation in many fashion-related tasks, since
they synthetically generate images of people wearing clothes in different positions [129, 100] or
garments with different attributes [204], while some other researchers focus on the virtual fitting
room problem, where the goal is generating an image of the user wearing different clothes [217].
There has been a wide range of literature on representing the geometry of clothing. Early ap-
proaches described non-rigid surfaces using models inspired by physics, such as thin-plates [108]
and elastic models [82]. More complex deformations can be captured by Shape-from-Template
approaches [114, 116], which aim at recovering the surface geometry given a reference configura-
tion in which the template shape is known, and a set of 3D-to-2D correspondences between this
shape and the input image. On top of this, additional constraints enforcing isometry [145] and
photometric consistency [114, 116] are considered. Temporal information is another typically
exploited prior. Non-rigid Shape-from-Motion techniques recover deformable shape and camera
motion from a sequence of 2D tracks, exploiting physical properties [7, 8, 6, 115, 146]. In all
these works, it is paramount to achieve good keypoint descriptors robust to non-linear image
deformations [154]. Recent works have shown that such representations can be well captured
by deep architectures, even without requiring local descriptors [128]. Depth from RGB-D cam-
eras has also been used to model clothing, especially for robotics applications related to cloth
manipulation [131, 132].
Table 2.1: Fashion datasets. List of different datasets with fashion images, with their image
and text type, the number of garment categories and attributes, and whether they include any
localization data or annotated pairs of images representing the same item.
Dataset # images Image type Text type # cats. # attrs. Localization Pairs
Clothing Attrs. [26] 1856 Shop / street Tags 7 26 - -
Fashionista [196] 685 Social network Tags, comments 56 - 14 body parts -
Paper Doll [195] 339,797 Social network Tags, comments 56 - Weak pose estim. -
ACWS [22] 145,718 Mixed Tags 15 77 keys Bounding boxes -
Colorful-Fashion [94] 2,682 Social network Tags 23 13 colours Pixel-level color -
Daily-Photos [39] 2,500 Social network - 18 - Pixel-level -
Cl. Co-Parsing [201] 2,098 Social network Tags 59 - Pixel-level -
Fashion-136K [74] 135,893 Social network Tags, description - - Bounding boxes -
Fashion-350K [74] 350,000 Tops & blouses Tags, description - - Bounding boxes -
Fashion-Q1K [74] 1,000 Skirts Tags, description - 8 patterns Bounding boxes -
Online-data [27] 341,021 Street Tags, description 15 67 ∼ 6K b. boxes -
Exact S2S [58] 425,040 Street / shop Tags 11 - Bounding boxes 39,479
Clothes-50K [192] ∼ 70K Shop Description 14 - - -
Online-offline [71] ∼ 500,000 Shop / street Tags 9 179 - 90,000
DeepFashion [97] >800,000 Shop / street Tags, description 50 1,000 Landmarks 300,000
2.4 Summary
In this chapter, we presented some of the existing challenges in the fashion domain that can be
tackled with the help of computer vision or deep learning. Problems like segmentation, classi-
fication or retrieval are perfect examples of the combination between scientific and commercial
interest that this topic offers. Moreover, the large quantity of available data (see Table 2.1) en-
courages the adoption of these techniques and makes it a very interesting subject for researchers.
Superpixel Segmentation
In this chapter we describe the first contribution of the thesis: a new method for generating
superpixel segmentations of images. This new method produces smaller superpixels in regions of
interest and large superpixels in more homogeneous regions with less information, and is moti-
vated by the flaws of other superpixel methods when facing the specific task of image annotation
for clothing parsing.
3.1 Introduction
We stated in the introduction that this thesis would start by solving low-level problems, and in
computer vision there is no lower level than pixels. This chapter is devoted to the development of
a new method to group pixels into perceptually meaningful atomic clusters (known as superpixels)
that share some properties (e.g. color or texture), normally belonging to the same object in the
image.
Assuming that a superpixel is formed only by pixels belonging to the same physical world
object, this oversegmentation can drastically reduce the computational cost of properties that
remain approximately constant for an object [174]. In addition, the information provided by
superpixels is much more discriminative than that of single pixels, because it includes color
histograms and shape, and therefore can be used for instance in applications that require spatial
information [68]. As expected, representing images as a non-overlapping set of superpixels is
a standard practice as a preprocessing step for many computer vision applications, including
depth estimation [93, 68], object detection [168], localization [9], tracking [186], appearance
descriptors [167, 169], gesture recognition [182], human pose estimation [79, 118, 117], place or
object recognition [120, 122] and semantic segmentation [149, 65, 102, 47].
By the time this thesis started, one of the main concerns in Wide Eyes was obtaining more
data. More precisely, images annotated with segmented garments. In WE’s annotation system,
similar to the one in [164], the user would click on a superpixel and assign it to a certain category.
The main problem of this system is a classic problem of superpixels: it was hard to find the
balance between the number of superpixels and their adherence to the boundaries present in the
image. A large number of superpixels would guarantee that even the smallest parts of the objects
in the image can be correctly segmented, but the cost of annotating that many superpixels would
be huge. On the other hand, having a lower number of superpixels might lead to the loss of some
details in the segmentation, and smaller garments like glasses, shoes or jewelry might remain
unsegmented. Please see Fig. 3.1 for examples of over- and undersegmentation.
Figure 3.2: Overview of the proposed method. From left to right: (a): input image, with
overlaid boundaries and initial seeds positions; geodesic distance with respect to a specific seed;
and result of our Boundary-Aware Superpixel Segmentation (BASS) with 26 superpixels. (b):
results of state-of-the-art superpixel segmentations SEEDS [173] (36 superpixels), SLIC [5] (36
superpixels), and Yao et al. [202] (48 superpixels). Even with a smaller number of superpixels,
our algorithm is able to achieve better results for the Variation of Information (VOI ) metric
while maintaining the Undersegmentation Error (UE) value when compared with state-of-the-art
methods.
In other words, superpixels are expected to reduce image complexity while respecting the
boundaries, and at the same time they should avoid loss of information due to under-segmentation.
The trade-off between these two requirements has been tackled via Normalized Cuts [148], mean
shift [28], local variation [42], geometric flows [185, 89] and watershed [177]. Another requirement
when computing the superpixels consists of homogeneously distributing them over the image and
keeping their sizes within limited bounds.
In contrast, and taking these considerations into account, we argue that in many situations,
the superpixels can be safely merged and their number highly reduced, simplifying thus subse-
quent tasks. Therefore, for the first part of this thesis we focused on generating a superpixel
segmentation algorithm able to reduce the number of superpixels without losing the details of
the image. The main idea behind the algorithm is adapting the size of the superpixels on each
region of the image depending on the density of boundaries in that region (as objects in images
can generally be described by their boundaries), so that large homogeneous regions are divided
into larger superpixels, while regions with more texture and details are divided into smaller su-
perpixels to maintain that information. For this purpose, we introduce two main ingredients:
1) we first propose a new approach that spreads the initial superpixels seeds non-uniformly, de-
pending on the image content, and 2) we leverage image intensity boundaries and a geodesic
distance metric to produce smaller superpixels where there is potentially more information in
the image (i.e., regions with more intensity boundaries), and bigger superpixels in regions with
less presence of boundaries. By doing this, we simultaneously prevent extreme oversegmentation
without information gain, and avoid undersegmentation in regions where more precise superpix-
els are needed, hence we are able to maintain the coherence of the image structure with fewer
superpixels than other approaches.
Even though the origin of the idea was to improve fashion product annotation, the algorithm
we develop is generic and agnostic to the type of images it receives. In fact, we first evaluated
Figure 3.3: Results of graph-based superpixel algorithms. Examples from [42] (a), [31]
(b) and [40] (c). The approach in (a) produces different-sized superpixels, but they do not adhere
closely to the semantic segmentation. Superpixels in (b) are bigger, more similar to a semantic segmen-
tation result than to a superpixel preprocessing step. In (c), manual constraints can be introduced
to improve the results.
it in a fashion specific dataset and then, to demonstrate generality, in other datasets. As shown
in Fig. 3.2 and expanded in the results section, our approach brings numerous advantages and
improves segmentation metrics compared to the most recent methods. Concretely, we show
the resulting algorithm to yield smaller Variation of Information values on these datasets while
maintaining Undersegmentation Error values similar to state-of-the-art methods.
In summary, the main contributions of this chapter are:
• Use of an energy function that takes into account color information and both Euclidean
and geodesic distance between pixels.
• Exhaustive evaluation of the resulting algorithm in seven different datasets (both multiclass
and foreground/background) with two different metrics.
• Better Variation of Information metric than state-of-the-art methods and similar values
for Undersegmentation Error and Boundary Recall for a smaller number of superpixels.
Figure 3.4: Results of seed-growing superpixel algorithms. Examples from [175] (a), [89]
(b), [5] (c) and [185] (d). In (a), a tree is formed linking pixels to the nearest neighbor instead of
shifting each point to a local mean. In (b) and (c), superpixels grow from evenly distributed pixels
called seeds, producing dense and regular grids of superpixels. In (d), seeds are also regularly
placed, but superpixels can be split according to geodesic distances.
minimizing a graph-based objective function. However, the computational cost of NC was quite
high, taking several minutes to segment a 480 × 320 pixel image. Based on the same
idea, subsequent works proposed alternatives to speed up the graph-based minimization process
by using agglomerative clustering of the nodes [42] (example result in Fig. 3.3a), by decomposing
the graph in multiple scales [31] (see Fig. 3.3b) or by adding grouping constraints [40] (Fig. 3.3c).
One of the most well known approaches is Graphcut [176], in which the constraints for the label of
a pixel come from a dense set of overlapped patches, enforcing the regularity of the superpixels.
Finally, [212] uses pseudo-boolean optimization to speed up the graph cut algorithm to 0.5
seconds per image. Although a lot of work has been devoted to the optimization of this kind of
algorithms, especially regarding memory load, they still present some disadvantages with respect
to our proposed approach, such as an excessive uniformity between the resulting superpixels
caused by their tendency to produce small contours.
Figure 3.5: Results of coarse-to-fine superpixel algorithms. Examples from [173] (a) and
[202] (b). As depicted in the images, the algorithms iteratively divide the image in blocks, that
are assigned to superpixels based on the minimization of an energy function.
between the pixels and the seeds. All the methods within this category are more efficient than
graph-based algorithms, with SLIC being the fastest among them. Nonetheless, their performance is
not always better. Our method follows this line of work, but we primarily favor reducing the
number of superpixels while trying to maintain the quality of the segmentation.
Figure 3.6: Summary of the main steps of the method. First, the boundary image is
obtained. Seeds are regularly distributed over the image, and based on the density of edges,
some of them are deleted and some intermediate seeds are added. After that, more seeds are
placed in the center of large empty spaces. Once the seeds positions are determined, the method
iterates computing the energy function for each seed, and assigning labels to pixels trying to
minimize the total energy. Once the termination condition is reached, the connectivity of the
labeled pixels is enforced, achieving the final superpixel segmentation.
differences and similarities with previous methods from the state of the art.
Commonly, superpixel algorithms group pixels based on L2 distance computed in a 5-dimensional
space of color and pixel coordinates2 . In this way, if two pixels are close and have a similar color,
they tend to be grouped into the same superpixel.
While this is a standard practice, it ignores the information along the path joining pairs of
pixels, which can produce undesirable effects such as undersegmentations. Furthermore, many
state-of-the-art algorithms force superpixels to be regular-sized and homogeneously distributed
over the image. Again, this seems to be a reasonable heuristic to apply, however, it is prone to
produce excessive over-segmentations in regions where small superpixels are unnecessary, such
as backgrounds or large regions with homogeneous color.
These methods produce satisfactory results when the desired number of superpixels is prop-
erly set, i.e., with a value that balances the trade-off between preserving the image details and
producing an excessively large number of superpixels. Nonetheless, in many cases an extreme
over-segmentation is needed in order for superpixels to adapt to the ground-truth boundaries.
This fact implies a higher cost in the computation of the segmentation. Furthermore, since super-
pixels are mainly used as a compressed representation for images in higher-level tasks, increasing
the number of superpixels also increases the complexity of these applications.
For the first part of the thesis, we address the problem with the goal of producing more
“useful” superpixels, preventing extreme over-segmentation while still producing an accurate
representation of the image for subsequent tasks. In order to do that, we compute the boundaries
2 Three dimensions for color space (e.g., RGB or CIELAB), and two for pixel coordinates (horizontal and vertical).
Figure 3.7: Examples of images and their extracted boundaries. Observe how the method
for boundary detection that we use [38] extracts high level boundaries at object level, discarding
internal boundaries not useful for the next steps.
of the image and increase the concentration of superpixels in regions with more edges, where more
detail is necessary. Consequently, superpixels in these regions are smaller than those located in
more homogeneous ones (with few edges). Moreover, drawing inspiration from [185], we modify
the energy function to be minimized by adding a new term that takes into account the geodesic
distance between two points, which helps to retain the structure. Note, however, that [185] still
produces quite homogeneous superpixels, not content-aware sized superpixels as ours (see
Fig. 3.4d).
We next describe the steps of the proposed algorithm. Refer to Fig. 3.6 for a visual explana-
tion.
Figure 3.8: Examples of seeds locations after initialization. Observe how seeds are gen-
erally centered around regions of interest (with many boundaries) while in more homogeneous
spaces the number of seeds is drastically reduced.
pixels found inside a square region sized S × S around each seed, we decide whether or not to
add or delete any seed by comparison against a certain threshold T_ad = (Σ e_i)/N, being e_i a pixel
in the boundary image (with value 0 or 1), and N the total number of pixels in the image. More
formally, the seed addition/deletion operation can be written as:

    Add,     if (Σ_S e_i)/N > 3 · T_ad
    Delete,  if (Σ_S e_i)/N < T_ad                    (3.1)

where Σ_S e_i represents the sum of all the pixels in the mentioned square region centered in a
seed. If the condition for adding seeds is satisfied, four new seeds are created in the corners of
such region. The integral image of the boundaries is used to obtain these values in order to speed
up the computation.
Note that the condition for adding is stricter than that for deleting, as our objective is min-
imizing the final number of superpixels while maintaining a good quality in the segmentation.
Finally, we place a seed in the centroid of empty regions with areas larger than S × S pixels (top
right image in Fig. 3.6). Some examples of seeds placement are shown in Fig. 3.8.
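The rule of Eq. (3.1) can be sketched as follows (a simplified illustration, not the thesis implementation; names such as update_seeds are hypothetical). Note one assumption of the sketch: each window sum is normalized by the window area so it can be compared against the global edge density T_ad; an integral image provides every window sum in constant time.

```python
import numpy as np

def box_sum(integral, r0, c0, r1, c1):
    """Sum of boundary pixels in the window [r0:r1, c0:c1], via the padded integral image."""
    return (integral[r1, c1] - integral[r0, c1]
            - integral[r1, c0] + integral[r0, c0])

def update_seeds(boundaries, seeds, S):
    """boundaries: binary edge map (H x W); seeds: list of (row, col); S: window size."""
    H, W = boundaries.shape
    T_ad = boundaries.sum() / (H * W)                  # global edge density (threshold)
    integral = np.pad(boundaries, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    kept, added = [], []
    for (r, c) in seeds:
        r0, c0 = max(r - S // 2, 0), max(c - S // 2, 0)
        r1, c1 = min(r + S // 2, H), min(c + S // 2, W)
        area = (r1 - r0) * (c1 - c0)
        density = box_sum(integral, r0, c0, r1, c1) / area   # local edge density (assumed normalization)
        if density < T_ad:                             # delete: too few boundaries nearby
            continue
        kept.append((r, c))
        if density > 3 * T_ad:                         # add: four new seeds at the corners
            added += [(r0, c0), (r0, c1 - 1), (r1 - 1, c0), (r1 - 1, c1 - 1)]
    return kept + added

edges = (np.random.default_rng(0).random((120, 120)) > 0.95).astype(float)
grid = [(r, c) for r in range(10, 120, 20) for c in range(10, 120, 20)]
print(len(update_seeds(edges, grid, S=20)))
```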
Figure 3.9: Compactness effect. Input image (a) and results of BASS varying only the
compactness term C. Observe how the superpixels tend to be more regular as the value of C
increases. In images (b) and (c) superpixels are sharp and tend to have irregular frontiers. In
(e), the oversegmentation starts losing some meaningful parts of the image. (f) shows the effect
of using an extreme value.
S_k = [l_k, a_k, b_k, x_k, y_k]^T                    (3.2)
where (x_k, y_k) are the pixel coordinates of seed S_k on the image and (l_k, a_k, b_k) are its color
values in CIELAB color space. This color space represents the color as three real values: L for
lightness from 0 (black) to 100 (white), a from green to red and b from blue to yellow (values
of a and b are implementation-specific, but are normally in the range of -100 to +100 or -128
to +127). It was designed to approximate human perception, which makes it well suited for the
task of superpixel segmentation, since we want our superpixels to be visually meaningful.
Figure 3.10: Geodesic distance. Two examples of geodesic distance in a region around a
specific seed (marked with a red dot). For each case, from left to right: region of the original
image, region of the edges image and geodesic distance, where black is lower and white is higher.
Observe how edges act as a sort of barrier, increasing the distance of pixels on the other side.
The first two energy terms in Eq. (3.3), corresponding to color and Euclidean distance, are
computed as in [5]:

E_{lab} = \sqrt{(l_k - l_i)^2 + (a_k - a_i)^2 + (b_k - b_i)^2} \qquad (3.4)

E_{xy} = \sqrt{(x_k - x_i)^2 + (y_k - y_i)^2} \qquad (3.5)
The last energy term depends on the gray-weighted geodesic distance computed over the binary
boundary image. This distance is defined as the smallest weighted sum of gray levels along
the discrete path between two given pixels. Concretely, we implement the Distance Transform
on Curved Space from [165]. This operation yields an image where every pixel i has a value
corresponding to the distance of that pixel to the nearest seed Sk . The region in which we
compute this energy for each seed has a 2S ×2S size. Examples of this distance for different seeds
in a given image are shown in Fig. 3.10, whereas Fig. 3.11 shows examples of color, Euclidean
and geodesic energy values per pixel.
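A minimal sketch of such a gray-weighted geodesic distance is given below, using a simple Dijkstra expansion over the boundary image; the exact weighting of the Distance Transform on Curved Space in [165] may differ, so the step-cost term here is an assumption.

```python
import heapq
import numpy as np

def geodesic_distance(boundary_img, seed, region=None):
    """Gray-weighted geodesic distance from a seed pixel.

    boundary_img: 2-D array (e.g., binary edges); crossing edge pixels is
    expensive, so distances grow quickly on the other side of a boundary.
    region: optional (y0, y1, x0, x1) window (e.g., 2S x 2S around the seed).
    """
    if region is None:
        region = (0, boundary_img.shape[0], 0, boundary_img.shape[1])
    y0, y1, x0, x1 = region
    img = boundary_img[y0:y1, x0:x1].astype(float)
    h, w = img.shape
    sy, sx = seed[0] - y0, seed[1] - x0

    dist = np.full((h, w), np.inf)
    dist[sy, sx] = 0.0
    heap = [(0.0, sy, sx)]
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1), (-1, -1), (-1, 1), (1, -1), (1, 1)]
    while heap:
        d, y, x = heapq.heappop(heap)
        if d > dist[y, x]:
            continue
        for dy, dx in steps:
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                # Euclidean step length, inflated when stepping onto an edge pixel
                nd = d + np.hypot(dy, dx) * (1.0 + img[ny, nx])
                if nd < dist[ny, nx]:
                    dist[ny, nx] = nd
                    heapq.heappush(heap, (nd, ny, nx))
    return dist
```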
We initialize the energy of all pixels to E0 . A reasonable choice would be to set E0 = ∞, but
that would force all pixels to get a label in the first iteration, even when they are not especially
close to any seed. For that reason, we set E0 as a finite value that we linearly increase with the
number of deleted seeds. Thus, if the energy of a pixel is not lower than E0 it will have label
l = 0, and all pixels with such label will form a superpixel (as seen in Fig. 3.9f). We then iterate
until the maximum allowed number of iterations (itmax ) is reached. Finally, those superpixels
whose area is extremely small are removed by merging them with adjacent bigger superpixels.
All these steps are described in Algorithm 1.
Figure 3.11: Different energy values. Input image (a) and examples of per-pixel energy values
(color (b), Euclidean (c) and geodesic (d) distance to different seeds marked with a red circle),
where blue means lower distance and red means higher distance.
has very different images, but most are both smooth and simple. On the other hand, images
from CSSD and ECSSD present more natural situations.
We compare our approach, dubbed BASS, against three state-of-the-art algorithms: SEEDS [173],
SLIC [5], and Yao et al. [202]. All algorithms were evaluated with the code from the authors’
websites. For BASS, the maximum number of iterations was experimentally set to 10, which
produces fast segmentations without excessively affecting their quality.
A brief description of the metrics used to evaluate the segmentations is given below, followed
by a discussion of the results obtained.
" !#
rij rij
V OI(X; Y ) = − rij · log + log (3.6)
X
i,j
pi qj
where pi = |Xi |/n, qj = |Yj |/n and rij = |Xi |∩|Yj |/n. Lower values correspond to smaller distances
and hence to more similar segmentations.
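A direct NumPy sketch of Eq. (3.6), computed from two label maps, could look as follows (variable names are illustrative):

```python
import numpy as np

def variation_of_information(seg_x, seg_y):
    """Variation of Information between two segmentations (Eq. 3.6)."""
    x = np.ravel(seg_x)
    y = np.ravel(seg_y)
    n = float(x.size)
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    # joint distribution r_ij and marginals p_i, q_j
    r = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(r, (xi, yi), 1.0)
    r /= n
    p = r.sum(axis=1, keepdims=True)
    q = r.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = r * (np.log(r / p) + np.log(r / q))
    return -np.nansum(terms)          # entries with r_ij = 0 contribute nothing
```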
Undersegmentation Error (UE). It is computed as

UE = \frac{1}{N}\sum_{S \in GT} \; \sum_{P \,:\, P \cap S \neq \emptyset} \min\left(|P_{in}|, |P_{out}|\right) \qquad (3.7)

where GT is the set of ground truth segments, P are the superpixel segments, S the ground
truth segments, and |P_{in}| and |P_{out}| represent the area of P inside and outside S, respectively.
A low value is desirable.
Boundary Recall (BR). Represents the fraction of ground truth boundaries that are covered by the
over-segmentation boundaries. Considering a binary image of boundaries (1: boundary, 0: not
boundary) with a dilation of 3 pixels, we compute the Boundary Recall over all pixels as the
number of true positives (TP) divided by the sum of TP and the number of false negatives
(FN). Higher values indicate more accurate segmentations:

BR = \frac{TP}{TP + FN} \qquad (3.8)
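The following sketch computes this metric from two binary boundary maps; the text leaves implicit which set of boundaries is dilated for the tolerance, so dilating the over-segmentation boundaries here is an assumption.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def boundary_recall(gt_boundaries, sp_boundaries, tolerance=3):
    """Boundary Recall (Eq. 3.8) with a 3-pixel tolerance."""
    sp = binary_dilation(sp_boundaries.astype(bool), iterations=tolerance)
    gt = gt_boundaries.astype(bool)
    tp = np.count_nonzero(gt & sp)       # GT boundary pixels matched
    fn = np.count_nonzero(gt & ~sp)      # GT boundary pixels missed
    return tp / float(tp + fn)
```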
Figure 3.12: Evaluation metrics. Values obtained for different numbers of superpixels. The
results show that our approach outperforms state-of-the-art methods in the Variation of Information
metric. It obtains the second best result in Undersegmentation Error and performs on par with
state-of-the-art methods like SEEDS and SLIC in Boundary Recall. In (a) and (b), lower values
correspond to better segmentations. For (c), higher is better.
We use a single set of parameters for all the experiments to emphasize the generalization of the
method, even though specific parameter sets per dataset would give better individual results.
These results show how our algorithm consistently decreases the VOI for all numbers of superpixels
and, at the same time, maintains UE and BR values similar to state-of-the-art methods. Indeed,
we argue that a lower VOI is much more representative of our primary goal of retaining the image
information with a minimal number of superpixels. This is clearly illustrated in Figure 3.13.
3.5 Summary
This chapter has presented the first contribution of the thesis: an over-segmentation algorithm
to compute superpixels that are aware of the boundary information of the input image in order
to simplify the final result. The problem has been formulated as an iterative clustering problem
using color, Euclidean distance and geodesic distance over an edge image. Our method has been
evaluated against the state-of-the-art using seven different datasets.
Our algorithm outperforms state-of-the-art methods in the most significant metric according
to our goal while maintaining the quality of the segmentation.
Figure 3.13: UE vs. VOI. For two different images, two segmentations with similar U E are
presented. The segmentation with BASS has a lower V OI value in both cases, hence it is more
similar to the ground truth, containing more useful information.
The algorithm is implemented in C++ and runs on CPU in about 0.5 seconds per image.
The code is publicly available at https://github.com/arubior/bass-superpixels.
Figure 3.14: Qualitative results (I). Some results of our superpixel segmentation algorithm
compared to state-of-the-art methods.
Figure 3.15: Qualitative results (II). Some examples of our superpixel segmentation algorithm
compared to state-of-the-art methods.
Fashion Multi-modal Embedding
In this chapter, we leverage the textual metadata that normally accompanies the fashion images
in our datasets and propose a joint multi-modal embedding that maps both the text and images
into a common latent space. Distances in this space correspond to similarity between products,
allowing us to perform retrieval tasks that are both efficient and accurate.
4.1 Introduction
Finding a product in the fashion world can be a daunting task. Every day, e-commerce sites
are updated with thousands of images and their associated metadata (textual information),
deepening the problem, which is akin to finding a needle in a haystack. Not only the size of the
databases but also the level of traffic of modern e-commerce is growing fast. U.S. retail e-commerce,
for instance, was expected to grow 16.6% during the 2016 Christmas holidays (after a 15.3% increase
in 2014), with 92% of the holiday shoppers going online to search for or buy gifts [3]. In order to adapt
to this trend, modern retail sellers have to provide an easy-to-use experience to their customers,
where products are easy to find and well classified.
In a scenario where huge amounts of new products are presumably arriving on a daily basis
and must be searchable and classified, Machine Learning techniques stand out as a good choice
due to their good results classifying and clustering vast amounts of multidimensional data. More
and more sellers are including Machine Learning technologies in their online sites, especially for
advertising, recommendation and search. Nevertheless, most current searches are text-only and
do not account for products that are incompletely tagged or described. Images without a rich
text description are virtually unfindable by classical, text-based search. The ability to compute
distances between text and image in the same space makes it possible to retrieve images that are
similar to a text, and not only images whose associated texts are similar.
Fashion e-commerce products usually consist of pictures and associated metadata, generally
in the form of textual information such as brief descriptions, titles, series of tags, colors or sizes.
Existing approaches for retrieval focus only on images and require hard-to-obtain datasets for
training [58]. Instead, we opt to leverage easily obtained metadata for training our model,
learning a mapping from text and images (like the ones in Fig. 4.1) to a common latent space
in which distances correspond to similarity. We train the system using large-scale real-world e-
commerce data by both minimizing the distance between related products and using auxiliary
classification networks that encourage the embedding to have semantic meaning. Results are
compared against existing approaches and show significant improvements in retrieval tasks on
a large-scale e-commerce dataset. We also provide an analysis of the influence of the different
metadata.
Prior to a detailed analysis of the model itself, it is interesting to note why such an approach
is needed. In our case, observations come from different input channels with different options
for representations: images are normally represented as dense real-valued vectors that take into
account structural relationships between pixel intensities, whereas texts are usually processed
as descriptors based on sparse word count vectors. For this reason, without the intervention
of a specific system devoted to narrowing the gap between the two types of descriptors, it is
notably harder to find the highly non-linear relationships across modalities than within the same
modality.
More concretely, our approach consists of exploiting a Convolutional Neural Network (CNN)
for processing images, as well as word2vec-based embedding with a Neural Network for processing
the textual information. Both networks are trained such that the distance between the output
of related image-text pairs is minimized, while the distance between unrelated image-text pairs
is maximized. Additionally, two auxiliary classification networks are used in combination with
classification losses to retain semantic information in the common embedding.
We evaluate the retrieval task, where our proposed approach outperforms KCCA [16] and
Bag-of-Words features on a large e-commerce dataset, and also classification, where the embedded
descriptors obtain accuracies over 90% for both text and image. Additionally, an analysis of the
different textual metadata is provided.
Figure 4.1: Example of a text and nearest images from the test set. Example of a typical
text description found in many of the products in our dataset, and the nearest images from the
test set found using our embedding, which yields low distances between texts and images referring
to similar objects.
Summarizing: in this chapter, we learn a representation space (embedding) f (x) for features
coming from visual and textual data. The learned features are smooth (small changes in the
input data lead to small changes in the embedding space, i.e. x ≈ y → f (x) ≈ f (y)), come
from multiple explanatory features (since they are computed using distributed representations,
not sparse), and are valid for retrieval and classification tasks. Our embedding provides textual
representations with the continuity of the visual space, overcoming the problem of artificially
partitioning it into disjoint parts [190].
4.2.1 Retrieval
As detailed in Chapter 1, the interest of computer vision researchers in fashion has increased in the
past years. Many works focus on clothing parsing, or on higher-level tasks such as inferring a person's
occupation or social tribe, or on recommendation. Nevertheless, some of the most practical tasks
are still clothing retrieval and classification, which we tackle in this chapter.
The retrieval task consists of finding similar items in a dataset given a query. The usual pipeline
for image retrieval is formed by three steps: extracting local image descriptors (such as Fisher
Vectors [124, 125, 126]), reducing the dimensionality (with techniques like principal component
analysis – PCA [123] – or linear discriminant analysis – LDA) and indexing. For text retrieval,
classical approaches [32, 67] looked for repetitions of the query words in a document, while newer
latent semantic models [21, 110] use more powerful distributed text representations capable of
learning the context of words and the meaning of documents. Recently, a great deal of effort has
focused on word embeddings and their applications [36, 48, 56, 111].
According to [37], current retrieval techniques for large datasets with images and metadata
can be divided into text-based, content-based, composite and interactive approaches.
words, and some others including grammatically richer descriptions) and is very domain specific.
For this reason, we do not need a heavy preprocessing step and we focus on noise removal
and homogenization, switching everything to lowercase and removing punctuation symbols and
numbers that don’t help in our embedding task.
Of course, these purely text-based methods are reliable only when images have good annotations.
Otherwise, they are unable to find similar items. And even with good annotations, the
descriptions of images tend to be highly subjective and incapable of capturing all the relevant
details, so the description of one image can differ greatly from one person to another.
Neighbor (ANN) techniques. Without getting into details, many techniques that also use binary
hashing are based on the Hamming distance, like [210, 52, 166, 144, 92]. In [213], the results of
Bag of Words were improved by encoding more spatial information through what they call geometry-
preserving visual phrases. This approach captures local and long-range spatial distribution of the
words, not only co-occurrences. A popular approach to deal with large scale image databases is
to index the items with inverted files [157], that store mapping from content (e.g., visual words)
to documents.
Still, bridging the so-called "semantic gap" between low-level pixel features readable by
machines and high-level semantic information understood by humans remains one of the main
problems of CBIR.
Of course, when talking about CBIR in the last decade, one of the biggest breakthroughs is the
generalization of the use of Convolutional Neural Networks (CNNs), as detailed in Chapter 1.
Has deep learning really pushed research in the right direction towards reducing the semantic
gap? In [181], the authors conduct an exhaustive study with promising results for feature extraction
and retrieval, concluding that properly designed deep learning systems have the potential to outperform
conventional hand-crafted feature extraction methods, and that training with similarity or
classification losses can also improve the retrieval performance of classic systems.
There are also works trying to fuse the retrieval results coming from different methods,
like [209].
since both inputs are RGB images. Basura et al. [43] tackle the unsupervised domain adaptation
problem (adapting different source and target domains) making use of subspaces described by
eigenvectors induced by a PCA. This work performs what they call subspace alignment using
a transformation matrix to map one domain to the other. Their data is formed by images
from different domains. [54] tackles the same problem, but viewing the subspaces as points in
Grassmann manifolds, while [50] integrates an infinite number of subspaces describing changes
in properties into a geodesic flow kernel that models the shift between domains. In contrast to
all these previous works, our multi-domain framework, instead of images coming from different
domains, uses image and text data. Nevertheless, most ideas coming from these techniques are
still interesting independently of the origin of the data, since for all of them the final goal is
having some sort of common descriptor valid for items coming from both domains.
Most of the approaches for multi-domain classification train with one source domain and then
fine-tune their classifiers to work with the target domain [20, 142]. In our case, we simultaneously
train with data from both domains, producing a common space specifically learned for the
retrieval task that also offers good performance in the classification task.
In some cases, retrieval systems can learn from the users' feedback indicating the relevance of
the results obtained. A complete review of interactive retrieval techniques is out of the scope
of this chapter. Many relevance-feedback techniques are described in [24], which concludes that this
feedback is not always effective, since users may mark irrelevant search results, which has a
negative impact on the system's performance.
4.2.2 Classification
Since most of the e-commerce metadata consists of textual features, product classification is
normally addressed as text-only classification [57]. However, recent works have started trying to
boost the classification performance using multi-modal architectures. For instance, the authors
of [46] combine the image network from [85] with a skip-gram language model, without obtaining
a significant improvement (probably due to poor text labels). In [77], they improve text-based
classification with a multi-modal architecture, but in a small dataset. Some other works created
good embeddings without focusing on classification [99, 81, 53]. A joint space for image and text
is created in [198] using a deep learning version of the classical canonical correlation analysis
method (DCCA).
Text classification
Text classification is typically treated as a two-step process, where in the first place some features
are extracted (e.g. Bag-of-Words, n-grams, etc.) and then they are used for classification. These
domain-specific priors can be replaced by generic ones in the case of deep learning [18] without
degrading the quality of the results. More specifically, Convolutional and Recurrent Neural
Networks are used to capture the sequentiality of the text [207]. Both types of networks are
suitable to be applied to distributed embeddings [80, 87, 57] or characters [211, 29, 193].
Image classification
Regarding image classification, it is nowadays practically monopolized by Convolutional Neu-
ral Networks (CNNs). These architectures represent the state-of-the-art performance on the
ImageNet Large-Scale Visual Recognition Challenge [85, 156, 141, 63]. This has already been
addressed in Section 4.2.1 and Chapter 2, so we will not enter into more detail.
4.3 Method
An overview of our method can be seen in Fig. 4.4. The main goal of the work detailed in
this chapter is the creation of an embedding such that, given an image or a text as a query,
similar images and/or texts can be retrieved from a dataset. Our joint multi-modal embedding
approach consists of a neural network with two branches: one for image and one for text. The
image branch is based on a Convolutional Neural Network (CNN) which converts a 227 × 227
pixel image into a fixed-size 128-dimensional vector. The text branch is based on a multi-layer
neural network and uses as inputs features extracted by a pre-trained word2vec network which
are converted into a fixed-size 128-dimensional vector. Both branches, whose specifics will be
given later in the chapter, are trained jointly such that the 128-dimensional output space becomes
a joint embedding by minimizing the distance between related image-text pairs and maximizing
the distance between unrelated image-text pairs. Two auxiliary classification networks are also
used during training that encourage the joint embedding to also encode semantic concepts.
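The sketch below illustrates such a two-branch network in PyTorch. The sizes stated in the text (227 × 227 input, 100-dimensional word2vec features, 1024-dimensional hidden layers, 128-dimensional embedding, 32 categories) are kept, but the exact backbone and layer arrangement are assumptions, not the implementation used in the thesis.

```python
import torch
import torch.nn as nn
from torchvision import models

class ImageBranch(nn.Module):
    def __init__(self, n_classes=32):
        super().__init__()
        self.features = models.alexnet().features        # CNN feature extractor
        self.pool = nn.AdaptiveAvgPool2d((6, 6))
        self.fc = nn.Sequential(nn.Linear(256 * 6 * 6, 1024),
                                nn.BatchNorm1d(1024), nn.ReLU())
        self.embed = nn.Sequential(nn.Linear(1024, 128),
                                   nn.BatchNorm1d(128), nn.ReLU())
        self.classify = nn.Sequential(nn.Linear(1024, 128), nn.ReLU(),
                                      nn.Linear(128, n_classes))

    def forward(self, x):                                 # x: (B, 3, 227, 227)
        h = self.fc(torch.flatten(self.pool(self.features(x)), 1))
        return self.embed(h), self.classify(h)            # 128-d embedding, class scores

class TextBranch(nn.Module):
    def __init__(self, in_dim=100, n_classes=32):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.BatchNorm1d(1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.BatchNorm1d(1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.BatchNorm1d(1024), nn.ReLU())
        self.embed = nn.Sequential(nn.Linear(1024, 128),
                                   nn.BatchNorm1d(128), nn.ReLU())
        self.classify = nn.Sequential(nn.Linear(1024, 128), nn.ReLU(),
                                      nn.Linear(128, n_classes))

    def forward(self, x):                                 # x: (B, 100) averaged word2vec
        h = self.shared(x)
        return self.embed(h), self.classify(h)
```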
Figure 4.4: Architecture of the proposed neural network. When sizes of two dimensions
are equal, some of them are omitted for clarity. Fully connected layers are uni-dimensional. Text
descriptor and Image descriptor are the embedded vectors describing the input text and image,
respectively.
After the shared layers (up to FC8), the image network splits into two branches: one for classification
and one for the embedding. The classification branch has two fully connected layers (FC9 and FC10)
and outputs the SoftMax scores of the different classes. The embedding branch has a single layer
which outputs the 128-dimensional feature vector for the embedding (FC11). All fully connected
layers FC8-FC11 consist of the fully connected layer itself, followed by a Batch Normalization
(BatchNorm) layer and a Rectified Linear Unit (ReLU) layer.
See Fig. 4.4 for a visual representation of the network. Images are preprocessed before entering
the network: they are resized to a specific size (227 × 227) and normalized using precomputed
values for the mean and standard deviation.
Figure 4.5: Example of training samples with a context window size of 2 words. Figure
extracted from [107].
Figure 4.6: Scheme of the word2vec method. Input and output have the same size (the
vocabulary size, K), and each element represents a word. The hidden layer has the size of the
desired output descriptor, which will be extracted from the matrix transform between the hidden
and last layers, with size M × K.
The inputs to the text branch of our network are M-dimensional descriptors computed by averaging
the previously learned word2vec distributed representations of all the words in each text. Averaging
these descriptors has proven successful in [180].
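As a small illustration, the averaging step can be written as follows (text_descriptor and word_vectors are illustrative names; word_vectors stands for any word-to-vector mapping, and the zero-vector fallback is an assumption):

```python
import numpy as np

def text_descriptor(tokens, word_vectors, dim=100):
    """Average the word2vec vectors of the words in a product text."""
    vectors = [word_vectors[w] for w in tokens if w in word_vectors]
    if not vectors:                      # no known word: fall back to zeros
        return np.zeros(dim, dtype=np.float32)
    return np.mean(vectors, axis=0)
```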
Our text network consists of three shared fully connected layers that output 1024-dimensional
features (FC12-FC14). Afterwards, as in Section 4.3.1, the network splits into two branches:
the classification branch and the embedding branch. The classification branch consists once
again of two additional layers (F C15 and F C16) and the output is the score of the different
classes. The embedding branch outputs 128-dimensional vectors for the joint embedding. All
fully connected layers in the text network are formed by the fully connected layer itself followed
by a BatchNorm layer and a ReLU layer.
4.3.3 Training
For training we use the previously collected large dataset of corresponding text-image pairs with
class labels. Each sample consists of an image, its associated textual metadata and a category
label. The category labels are used for the classification losses and for randomly sampling
negative examples for training the embedding, so that the picked negative example belongs not
only to a different product, but to a different category.
Cross entropy losses are used for classification. This loss function (see Eq. (4.1)) is used in
classification tasks where the outputs are probability values between 0 and 1 for each class. In
Figure 4.7: Word2vec's arithmetic properties. This is a classical example of the arithmetic
properties of the space generated by word2vec. Operating with the descriptors of the words gives
results like king − man + woman = queen.
our case, these are 32-dimensional vectors with the probabilities of belonging to each of the clothing
categories listed in Section 4.4. Given a sample as input (an image or a text), its cross entropy loss
is computed as follows:

L_X(C, L) = -\sum_{i=1}^{N} L_i \log(C_i) \qquad (4.1)

where N is the number of categories (32 in our case, see Section 4.4 for details), L_i is the label
for category i and C_i is the output of our classification algorithm for category i.
The text and image networks are trained simultaneously and jointly by encouraging similar
text-image pairs to have a small distance between their embedded vectors, while dissimilar text-
image pairs have a large distance. Images and their associated text are used as positive pairs,
while unrelated image-text pairs are obtained by randomly sampling images and texts from
unrelated categories. This is done by using the contrastive loss described by Hadsell et al. [59]:

L_C(v_I, v_T, y) = \frac{1}{2}(1 - y)\,\|v_I - v_T\|_2^2 + \frac{1}{2}\,y\,\{\max(0, m - \|v_I - v_T\|_2)\}^2 \qquad (4.2)

where v_I and v_T are the two embedded vectors corresponding to the image and the text respectively,
y is a label that indicates whether the two vectors are compatible (y = 0) or dissimilar (y = 1),
and m is a margin for the negatives. After trying different values for m ({0.1, 1, 10, 50, 100}),
we select the one that performs best, m = 1. For a visual explanation of the contrastive loss, see
Fig. 4.8.
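A minimal PyTorch version of Eq. (4.2) is sketched below; the batch-mean reduction is an assumption.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(v_img, v_txt, y, margin=1.0):
    """Contrastive loss of Eq. (4.2); y = 0 for related pairs, y = 1 otherwise."""
    d = F.pairwise_distance(v_img, v_txt)                 # ||v_I - v_T||_2 per pair
    positive = (1 - y) * d.pow(2)
    negative = y * torch.clamp(margin - d, min=0).pow(2)
    return 0.5 * (positive + negative).mean()
```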
The full training loss consists of both the contrastive loss and the weighted sum of the cross
entropy classification losses:
L_C(v_I, v_T, y) + \alpha L_X(C_I(v_I), L_I) + \beta L_X(C_T(v_T), L_T) \qquad (4.3)
Figure 4.8: Contrastive loss behaviour with positive and negative pairs. The mini-
mization of this function reduces the distance between two vectors if they are labeled as similar
(positive), and increases it if they are considered dissimilar. The margin value defines a radius
outside which negative samples do not contribute to the loss value (such as y2 ).
where LX is the cross entropy loss, CI (vI ) is the output of the image classification network, LI
is the image label, CT (vT ) is the output of the text classification network, LT is the text label,
and α and β are two weighting hyperparameters.
We train the network for 100,000 iterations with batches of 64 samples (forming in each iteration
64 correlated image-text pairs and 64 uncorrelated pairs), with α = β = 1. Training is done
using stochastic gradient descent with backpropagation. We use an initial learning rate of 10^{-3}
and decrease it by 5 · 10^{-4} every 10,000 iterations, with momentum 0.95.
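The combination of the three losses in Eq. (4.3) and the optimizer setup can be sketched as follows; image_net and text_net stand for the two branches sketched above, contrastive_loss for the function sketched after Eq. (4.2), and the learning-rate schedule is omitted.

```python
import torch.nn.functional as F
from torch.optim import SGD

def training_step(image_net, text_net, optimizer, images, texts,
                  pair_labels, image_classes, text_classes, alpha=1.0, beta=1.0):
    emb_i, logits_i = image_net(images)
    emb_t, logits_t = text_net(texts)
    loss = (contrastive_loss(emb_i, emb_t, pair_labels, margin=1.0)
            + alpha * F.cross_entropy(logits_i, image_classes)
            + beta * F.cross_entropy(logits_t, text_classes))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = SGD(list(image_net.parameters()) + list(text_net.parameters()),
#                 lr=1e-3, momentum=0.95)
```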
4.4 Dataset
The dataset we use consists of 431,841 images of fashion products with associated texts. We
collected this data at WE (Wide Eyes) from dozens of different fashion websites through web scraping.
The images depict individual fashion items over a constant background (generally white), as well
as models wearing several garments at the same time. Regarding textual metadata, each website
uses its own tags, but we normalize all of them so that the textual information for each product is
separated into the following fields: title, description, gender, type, color and category. See Fig. 4.9
for examples of products in the dataset.
Figure 4.9: Examples of products. Images and their associated textual metadata. Note how
for some of the products, the textual metadata is not extremely descriptive, limited to a bunch
of words selected for search engine optimization.
Table 4.1: Retrieval results. Median rank and recall at x (f@x%) of our method (using
word2vec) compared against KCCA and against our method using Bag of Words for text repre-
sentation.
We use 60% of the dataset for training, 30% for test and 10% for validation, and train the
model using different combinations of textual information associated to the images to check the
influence of the different types of text, as will be explained in Section 4.5.
4.5 Results
Next, we describe the results obtained by applying our method to a fashion e-commerce dataset,
in which we train a common embedding where distances between text and images referring to
products of the same category are considerably smaller than distances between those of different
categories. We compare against existing approaches, analyze the different text features, and
look at classification results with the auxiliary networks. We compare the results using a simple
Bag-of-Words approach and the more complex distributed representation given by word2vec.
We check how the amount and type of text influences the clustering in the embedding and the
classification accuracy. Finally, we use kernel canonical correlation analysis (KCCA) [16] to
obtain another common embedding for comparison.
We use the following metrics to check the results:
• Median rank: median position of the first correct result in the ranked list of results (in
percentage of the total number of samples evaluated).
Table 4.2: Results using the information in different text fields. We see how Title
and Category are extremely discriminative and saturate the text classification accuracy when they
appear. We also compare against a model trained without the classification losses, observing how the
difference between positive and negative distances increases at the expense of losing more than
10% classification accuracy. The Diff. column corresponds to the difference between the mean distance
of positive pairs and negative pairs (higher is better).
• Recall at x: percentage of test queries for which the recall at x% is positive. Recall at
x% (f@x%) is positive for a query if the correct result can be found in the first x% of the
results ranked by distance to the query.
• Classification accuracy: percentage of items in the test set that are correctly classified
by the network.
4.5.1 Retrieval
In order to evaluate our method, we compute the 128-dimensional descriptors of all images and
texts in the testing set. Then, we use the texts as queries to obtain the most related images,
and vice versa. Looking at the position at which the exact match is obtained, we compute the
median rank for each case. The resulting values are below 2%, meaning that the exact match is
usually closer than 98% of the dataset, beating the result obtained by KCCA (which was trained
with only 10,000 descriptors due to memory errors when using the whole training set) and by our
same architecture substituting word2vec with a classical Bag of Words.
These results, the recall@K (which shows that around 80% of the time the exact match is
among the top 5% of nearest items) and the classification accuracy can be seen in Table 4.1.
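The median rank and recall@x% reported here can be computed from the embedded descriptors roughly as follows (row i of both matrices is assumed to describe the same product):

```python
import numpy as np

def retrieval_metrics(text_desc, image_desc, recall_pct=5.0):
    """Median rank (as % of the gallery) and recall@x% for text-to-image queries."""
    n = text_desc.shape[0]
    # pairwise Euclidean distances: queries (texts) x gallery (images)
    dists = np.linalg.norm(text_desc[:, None, :] - image_desc[None, :, :], axis=-1)
    order = np.argsort(dists, axis=1)
    # 1-based rank of the exact match for every query
    ranks = np.argmax(order == np.arange(n)[:, None], axis=1) + 1
    median_rank = 100.0 * np.median(ranks) / n
    recall_at_x = 100.0 * np.mean(ranks <= n * recall_pct / 100.0)
    return median_rank, recall_at_x
```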
We also tested the performance of our model with respect to the different data fields available
in the dataset, concluding that, even if the Description field by itself gives good results, using
highly discriminating fields such as Title or Category slightly improves the metrics (see Table 4.2).
4.5.2 Classification
In parallel to the ranking task, we also train a classification task. This task, intended to help
cluster the products of the same category in the common embedding, maintains high accuracy
values (above 95% in some cases, as seen in Table 4.2) for the 32 clothing categories defined in
the dataset. In Table 4.3, we show the global classification accuracy results for two versions of
our method, as well as the difference between the mean distances of negative and positive pairs
in our embedding. A higher value for this last quantity implies that the descriptors in our
embedding are closer for items of the same category and further apart for items of different
categories.

Table 4.3: Classification results. Classification accuracy of our approach using word2vec and
a simplified version using Bag of Words for text representation. The Diff. column corresponds to
the difference between the mean distance of positive pairs and negative pairs (higher is better).
Best result shown in bold.

Model        | Classif. Acc. (Text) | Classif. Acc. (Image) | Diff.
Bag of Words | 99.78%               | 71.73%                | 0.327576
word2vec     | 99.97%               | 90.06%                | 0.44
4.6 Summary
Creating a common embedding where texts and images can be easily compared represents an
opportunity for e-commerce sites to classify and tag images without associated text, and to fill in
missing information by looking for similar texts. We presented results on a challenging real-world
dataset, obtaining low median rank values while at the same time showing very good classification
accuracy for both image and text classification.
In this chapter, we have presented an approach for joint multi-modal embedding with neural
networks, with a focus on the fashion domain. Our approach scales easily to large existing
e-commerce datasets by exploiting readily available images and their associated metadata. By
training the embedding such that distances correspond to similarities, our approach can be
easily used for retrieval tasks. Furthermore, our auxiliary classification networks encourage the
embedding to have semantic meaning, making it suitable as features for classification tasks.
Main Product Detection
In this last chapter, we present an approach to detect the main product in fashion images by
exploiting the textual metadata associated with each image. Our approach is based on a joint
embedding of object proposals and textual metadata with an architecture based on the one in
Chapter 4 to predict the main product in the image. We additionally use several complementary
classification and overlap losses in order to improve training stability and performance. Tests on a
large-scale dataset taken from eight e-commerce sites show that our approach outperforms strong
baselines and is able to accurately detect the main product in a wide diversity of challenging
fashion images.
5.1 Introduction
Most of current commercial transactions occur online. Every modern shop with growing expec-
tations presents to their potential customers the option of buying some or part of the products in
their online catalogs. For instance, 92% of the U.S. Christmas shoppers went online on holidays
in 2016, a 16.6% increase over the same period in 2015 [3].
The way the products are presented to the customer is a key factor to increase online sales.
In the case of fashion e-commerce, a specific item being sold is normally depicted worn by a
model and tastefully combined with other garments to make it look more attractive. Existing
approaches for recommendation or retrieval focus on images only, and normally require hard-to-
obtain datasets for training [58], omitting the metadata associated with the e-commerce products
such as titles, colors, semantic tags or descriptions that can be used to improve the information
obtained from the images.
In this work, we propose to leverage this metadata information to select the most relevant
region in an image, or more specifically, to detect the main product in a fashion image that might
contain several garments. This allows us to subsequently train specific product classifiers, which
do not need to be fed with the whole image. Additionally, this process can also be used as a first
step in tasks like visual question answering or, together with customer behavior data, to extract
useful information relating the types of images used in an e-commerce site to its sales.
Our approach consists of a first step that extracts descriptors of object proposals, which are then
used to train a joint textual and image embedding. The distances between descriptors in this
common latent space are then used to retrieve the main product of each specific image as the
closest object proposal to the textual information, as seen in Fig. 5.1.
We train our method with images of individual garments and evaluate its performance on
images coming from a different e-commerce site, which shows models wearing the clothes. We
show that it is able to detect a region with a demanding 70% overlap with the ground truth in more
than 80% of the cases among the top-3 bounding box proposals.
Figure 5.1: Overview of our proposed method. From a fashion e-commerce image and its
associated textual metadata, we extract several bounding box proposals and select the one that
represents the main product being described in the text.
image region, while we find the image region closest to a rich textual description. Furthermore,
we base our approach on a much faster algorithm to generate the proposals, extracted from [218].
Other works try to acquire a deeper understanding of the available textual information. The
embedding in [72] is created to explicitly enforce class-analogy preservation. In [183], the task of
image-sentence retrieval is carried out using entire images, with application to phrase localization
on the Flickr30K Entities dataset. See Fig. 5.3 for an example of these results.
The basic idea of two network branches (one for images, one for texts) connected with a
margin loss is similar to ours, but our approach incorporates the classification information into the
gradients of the network, while [183] only enforces the ranking task with combined hinge loss
functions. [90] proposes a two-step process where a network is trained in the first place with
multi-labeled images and then used to mine top candidate image regions for the labels. The work
of [184] is devoted to the structured matching problem, studying semantic relationships between
phrases and relating them to regions of images.
Some other works focus on Visual Question Answering [215], taking the notion of image-region
importance according to text one step further and using it to generate a proper answer to a question, as
shown in Fig. 5.4. While those works focus on many-to-many correspondences, i.e. relating parts
of sentences to regions of images, our work tries to associate all the available textual metadata
to only one region of the image, simulating the problem we are dealing with: receiving
images and text from fashion e-commerce sites and detecting the product being sold among all the
products in the images.
Figure 5.2: Multiple-Instance visual-semantic Embedding. The aim of this work is tagging
different parts of the image with different labels. Figure extracted from [137].
5.3 Method
Our goal is to detect the main product corresponding to the product being sold in a fashion
image. We consider the case where the image contains several other garments and has additional
metadata associated with it. We solve this problem by creating a common embedding for images
and texts, and then finding the bounding box whose embedded representation is closest to the
representation of the text. In order to do so, we explore different architectures and combinations
of artificial neural networks. Next, we describe the different approaches that we incrementally
propose, stating the pros and cons of each one of them.
Contrastive loss: as in Chapter 4, related and unrelated image-text pairs are compared with the
contrastive loss described by Hadsell et al. [59] (cf. Eq. (4.2)):

L_C(v_I, v_T, y) = \frac{1}{2}(1 - y)\,\|v_I - v_T\|_2^2 + \frac{1}{2}\,y\,\{\max(0, m - \|v_I - v_T\|_2)\}^2 \qquad (5.1)
Figure 5.3: Parts of an image related with parts of a text. Figure extracted from [183].
where y is the label indicating whether the two vectors vI and vT , corresponding to image and
text descriptors respectively, are similar (y = 0) or dissimilar (y = 1). The value m is the margin
value for negative samples. Therefore, both positive and negative image-text pairs must be used
in order for the network to learn a good embedding.
Classification loss: the classification loss (for both the text and image branches) is a cross entropy
loss (see Eq. (4.1)) comparing the predicted vector (composed of 19 category probabilities)
with the ground-truth category label (a binary vector of the same size with only one activation,
corresponding to one of the 19 categories).
The full training loss combines the contrastive loss with the weighted classification losses, as in
Eq. (4.3): L_C(v_I, v_T, y) + \alpha L_X(C_I(v_I), L_I) + \beta L_X(C_T(v_T), L_T),
where C_I(v_I) is the output of the image classification branch, L_I is the image label, C_T(v_T) is
the output of the text classification branch, L_T is the text label, and \alpha and \beta are two weighting
hyperparameters.
Text network: the textual metadata is used in the same way throughout all the architectures
in the chapter. We first concatenate all the available string fields (depending on the source of
the data, these can be title, description, category, subcategory, gender, etc.), then we remove
numbers and punctuation signs, and compute 100-dimensional word2vec descriptors [110] for
each word appearing more than 5 times in the training dataset. We compute these descriptors
using bi-grams and a context window of 3 words, as in Chapter 4 (please, refer to Fig. 4.5).
Figure 5.4: Relevance of image regions for answering different questions. Importance
of each pixel of the image when answering specific questions. Note how in the upper case, pixels
belonging to the cat have higher scores (red values), while in the lower case the most important
pixels belong to the shelf. Figure extracted from [215].
Finally, we average the descriptors in order to obtain a single vector representing the metadata of
the product. Averaging these distributed representations gave good results as a text descriptor
in [180]. The training corpus for the word2vec distributed representation consists of over 400,000
fashion-only textual metadata entries.
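As an illustration, a word2vec model with bi-grams, a context window of 3 and a minimum count of 5 could be trained with gensim as sketched below; the use of gensim and the remaining parameters are assumptions, not necessarily the tooling used in the thesis.

```python
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases

def train_word2vec(corpus):
    """corpus: list of token lists (concatenated metadata fields, lowercased,
    with numbers and punctuation removed)."""
    bigrams = Phrases(corpus, min_count=5)            # detect frequent bi-grams
    corpus_bi = [bigrams[tokens] for tokens in corpus]
    model = Word2Vec(corpus_bi, vector_size=100, window=3, min_count=5, workers=4)
    return model.wv                                   # word -> 100-d vector mapping
```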
These descriptors are then fed into a 3-layer neural network formed by fully connected (FC)
layers with Batch Normalization (BatchNorm) [73] and Rectified Linear Units (ReLU) that finally
produce a 1024-dimensional vector, later split into two branches as shown in branch D of Fig. 5.5:
• a FC + BatchNorm + ReLU block that reduces the dimension of the vectors to 128 followed
by a final FC layer and a SoftMax layer that reduce it to 19 elements corresponding to
category probabilities for classification.
• a FC + BatchNorm + ReLU block followed by a FC layer, both with 128-dimensional
outputs. The output of the last layer is the descriptor of the text in the common embedding.
Figure 5.5: The three different network architectures used. Gray layers remain constant
for all the architectures (i.e., the text branch (D) and a few layers before each loss function).
Blue parts correspond to architectures using images as input (both full image (A) and cropped
bounding boxes (B) flow through the same layers), and green parts correspond to the architecture
using bounding box descriptors as input. These descriptors are the output of the frozen first
layers of the AlexNet architecture, so in case (C) the image branch of the network is only trained
from the first green layer.
Figure 5.6: RoI pooling explained. Given an image as input, a feature map is computed using
a convolutional neural network, and N regions of interest are proposed. Using both, the
RoI pooling layer scales a region of the feature map corresponding to each one of the proposals
to a fixed pre-defined size, obtaining a list of N feature maps.
Bounding boxes with overlap under 30% with GT are not used as negative pairs for the
network because they would not be discriminative enough (the problem of differentiating between
positive and negative would focus on extreme cases, which are much more numerous and easy to
detect than the negative pairs we use).
For this approach, the network is the same as before (see branch B of Fig. 5.5), but the input
pairs are the resized proposal bounding boxes with their corresponding positive or negative texts.
The quality of the results is considerably increased (as seen in Table 5.1), but since the number
of pairs that we can construct per product is now much higher, it takes more time to reach a
good minimum when training.
Figure 5.7: Effects of bounding box resizing. Effects of resizing a bounding box of an item
with (b) or without (a) adding context information to avoid deformation. We see how without
the context, the resizing step extremely deforms the image. The feature map of this deformed
image will not be comparable to any of the feature maps extracted from regions of the image in
the testing phase. In our method, we resize regions of the feature map, not the image itself, but
this figure serves as an illustration of the problem.
The benefit of using RoI pooling is huge in terms of processing speed. Instead of computing
the (very expensive) convolutions of the CNN for each one of the proposals, we perform just one
pass of the original image through that part of the network, and then crop the desired parts of
the feature map for the proposals. The speed-up in both training and testing is considerable
(10× and 20×, respectively).
The inputs to the image part of our network are the 6×6×256 Region of Interest (RoI) pooling
regions of the corresponding proposal bounding boxes, extracted from the last convolutional layer
of AlexNet as in Fast-RCNN.
Now the training of the visual part of the network consists only of a first convolutional layer
(coupled with a ReLU) that reduces the third dimension of the data from 256 to 128 elements,
followed by two FC + BatchNorm + ReLU blocks that progressively transform these descriptors
into 512-dimensional vectors, that are then split into the already mentioned 128-dimensional
descriptors for classification and for the embedding. Layers previous to this first convolutional
layer are frozen and only used to extract the RoI pooling features (see branch C of Fig. 5.5).
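A minimal sketch of this descriptor extraction with torchvision is shown below; the spatial scale is approximated from the feature-map size, and the dummy image and boxes are illustrative only.

```python
import torch
from torchvision import models
from torchvision.ops import roi_pool

alexnet = models.alexnet()                      # in practice, pre-trained and frozen
conv = alexnet.features.eval()
for p in conv.parameters():
    p.requires_grad = False

image = torch.rand(1, 3, 500, 375)              # dummy e-commerce image
# proposals as (batch_index, x1, y1, x2, y2)
boxes = torch.tensor([[0., 40., 60., 300., 360.],
                      [0., 10., 20., 180., 250.]])

with torch.no_grad():
    fmap = conv(image)                          # (1, 256, H', W')
    rois = roi_pool(fmap, boxes, output_size=(6, 6),
                    spatial_scale=fmap.shape[-1] / image.shape[-1])
print(rois.shape)                               # torch.Size([2, 256, 6, 6])
```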
(a) Category: Shirts & (b) Category: T-Shirts. (c) Category: Coats & Jack-
Blouses. Description: Men Description: Women >T- ets Description: Women >Coats
Color: Casual shirts. Le Shirts. Twik - Boyfriend Canada Goose - Trillium parka. De-
31 - All-over check shirt. tee Twik. Exclusively from signed to withstand extreme condi-
Regular fit Le 31. Exclu- Twik. An ultra practical tions, the Trillium parka keeps you
sively from Le 31 for men. must-have neutral basic. Ul- snug and warm, even in the depths
Trendy all-over checks re- tra comfortable 100% cot- of winter. It features a sleek fit,
designed in a palette per- ton weave. Sewn rolled slightly cinched waist, and slimmer
fect for the upcoming season. sleeves. The model is wear- lines throughout. White duck down
Regular fit Button-down col- ing size small. Title: Twik fill. Dual-adjusting removable hood
lar. Contrasting underside. - Boyfriend tee (Women, with removable coyote fur ruff In-
Ultra comfortable 100% cot- Green, X-SMALL). Color: terior shoulder straps to carry the
ton poplin. The model is Mossy Green. parka like a backpack Heavy-duty
wearing size medium. Title: locking zipper with snap-button
Le 31 - All-over check shirt storm flap. Upper fleece-lined pock-
Regular fit (Men, Red, XX- ets, lower flap pockets with snaps.
LARGE). Color: Khaki. Thermal experience index: 4. Made
in Canada. The model is wearing
size small. Title: Canada Goose -
Trillium parka (Women, Pink, XX-
SMALL). Color: Khaki.
Figure 5.8: Some results of our method. Ground truth is shown in green, and the proposal
closest to the text in blue. On top of each figure there is its category and the overlap percentage
between the result and the GT. Caption of each figure is its textual metadata.
where L_O is the L1 regression loss for the overlap, i.e. |ôv − ov|, with ôv the predicted overlap of the
bounding box with the corresponding ground-truth bounding box and ov the actual overlap with the
ground truth, computed as their intersection over union (ov(A, B) = (A ∩ B)/(A ∪ B)). This case is
omitted from Fig. 5.5 for clarity.
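For reference, the intersection over union used throughout this chapter can be computed as in the short sketch below:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```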
All the design choices for the network layers were taken after meticulous ablation studies.
(a) Category: T-Shirts. Title: Twik - Solid high-neck tank (Women, Red, X-SMALL).
(b) Category: Coats & Jackets. Title: Vero Moda - Long baseball jacket (Women, Black, X-SMALL).
(c) Category: Coats & Jackets. Title: Pierre Balmain - Military coat (Men, Black, 42).
Figure 5.9: More results of our method. Ground truth is shown in green, and the proposal
closest to the text in blue. On top of each figure there is its category and the overlap percentage
between the result and the GT. Caption of each figure is its textual metadata.
Standard data augmentation was applied to the images (random horizontal flips, small rotations, etc.).
For the bounding boxes, we added random noise to their size and position of up to 5% of the bounding
box dimensions. Also, instead of directly resizing every bounding box to the size required by the
network (227×227×3), they were padded to be as square as the original image dimensions allow prior
to the resizing step, thus taking into account image context and avoiding heavy deformations (see Fig. 5.7).
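The square padding step can be sketched as follows; clamping to the image borders (which may leave the crop slightly non-square) is how we read the text, but the exact behaviour at the borders is an assumption.

```python
def pad_box_to_square(box, img_w, img_h):
    """Expand a (x1, y1, x2, y2) box towards a square before resizing,
    so that the crop is not heavily deformed by the 227 x 227 resize."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    if w < h:                       # widen the box
        pad = (h - w) / 2.0
        x1, x2 = x1 - pad, x2 + pad
    else:                           # heighten the box
        pad = (w - h) / 2.0
        y1, y2 = y1 - pad, y2 + pad
    # stay inside the image, as the original image dimensions allow
    return (max(0.0, x1), max(0.0, y1), min(float(img_w), x2), min(float(img_h), y2))
```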
5.4 Dataset
We use two different datasets in this chapter. The first one consists of 458,700 products from
eight different e-commerce sites. The second one, used for testing, is formed by 3,000 products
from a different e-commerce site. Each product from the datasets is formed by an image with the
annotated GT bounding box and its associated metadata. Some examples of the images and
Table 5.1: Results of the architectures detailed in Section 5.3. Includes precision@top-K
and classification accuracies. Best result shown in bold.
their associated textual information can be seen in Figs. 5.8 and 5.9.
5.5 Results
Our method was evaluated using a dataset with images coming from a different source than the
training dataset in order to test its ability to generalize.
All the results shown in this section are obtained according to the following evaluation pro-
cedure:
1. Extract the bounding box proposals for the image and compute the embedded descriptors of
every proposal and of the associated text.
2. Compute the distance between the image and text descriptors, and select the bounding
box with the smallest distance to the text.
3. Check the overlap between this bounding box and the ground truth bounding box of the
correct product. If the overlap is greater than 70%, the result is considered as a positive
main product prediction for this image. Otherwise, as negative. Overlap between bounding
boxes A and B is computed as (A ∩ B) / (A ∪ B), like in Section 5.3.5.
The numerical results we provide in this section are the percentage of test images with positive
predictions (overlap with ground truth greater than 70%) from the test set. Evaluations were
carried out for different positive overlap percentages, but we consider 70% as a good value. The
tendencies for the rest of overlap percentages evaluated were similar.
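Putting the procedure together, a single test image is scored roughly as follows (iou refers to the helper sketched in Section 5.3; array shapes and names are assumptions):

```python
import numpy as np

def main_product_hit(text_desc, proposal_descs, proposal_boxes, gt_box, thr=0.7):
    """True if the proposal closest to the text overlaps the GT box by more than thr."""
    dists = np.linalg.norm(proposal_descs - text_desc[None, :], axis=1)
    best = int(np.argmin(dists))                 # proposal closest to the text
    return iou(proposal_boxes[best], gt_box) > thr
```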
The first dataset described in Section 5.4 is homogeneously split into training and validation
pairs of images and metadata (70% for training, 30% for validation). For each network architecture,
we use for testing the weights of the iteration with the best performance on the validation
subset. The test dataset (from which we present results), as stated in Section 5.4, comes from a
different image source to prove the generalization ability of the method. This dataset also comes
from an actual e-commerce website.
Figure 5.10: Two-dimensional t-SNE visualization. Computed using the projections of the
training set images in our embedding.
performance of the method. Also, in general, the approach using RoI pooling descriptors yields
better results than the approach that feeds cropped bounding boxes through the whole-image
architecture. For the architecture predicting the overlap percentage between each proposal and
the GT bounding box, the average error in the overlap prediction is 5.81%. Even though our main
purpose is not classification, we use it to help our embedded descriptors better separate the clothes
of different categories. The classification accuracies of the different architectures are shown in
Table 5.1.
Note how these pictures differ from the training set: they normally show the products worn by
models, who also wear other clothes that might partially or completely appear in the images.
Sometimes the background is textured (Fig. 5.8c).
5.6 Summary
We have presented a method that uses textual metadata to detect the product of interest in
fashion e-commerce images. Text is parameterized using a distributed representation and for our
best approach we use compact representations of bounding boxes extracted from frozen layers
of a pre-trained network. We compare several network architectures combining different loss
types (contrastive, cross entropy and L1 regression). In our test dataset, with images and texts
coming from a different e-commerce than those used for training, our method is able to rank
the main product bounding box in the top-3 most probable candidate bounding boxes among
300 candidates in 80% of the cases. At the same time, the network learns to classify these
products into the corresponding clothing category with high accuracy.
It is worth mentioning that this main product detection task continues to be very relevant
for Wide Eyes. It was integrated into the company framework to be used as an intermediate tool
when crawling data, and an evolution of this work using Graph Convolutional Networks [33] was
recently published by members of the company [172].
Contributions to Wide Eyes
This thesis is framed in the Industrial Doctorate plan, and therefore other tasks were con-
ducted apart from the research described in the previous chapters, as mentioned in Section 1.3.
Some of the most relevant jobs carried out during these years are listed below.
Some of the specific tasks carried out during this period related to turning raw online data
into useful datasets are:
• Implementation of several web crawling systems to automatically collect data from online
e-commerce sites.
• Defining the best standard way of storing the information.
• Mapping categories from each different e-commerce to Wide Eyes categories (each e-
commerce has its own way of naming different categories and its own category hierarchy).
• Usage and creation of MySQL databases.
• Completing missing data inferring categories with our models.
• Merging different datasets to unify all of our data.
• Getting rid of unwanted images (noise) using small convolutional neural networks.
• Cleaning the texts (normalizing, getting rid of unwanted words, sometimes translating to
English from a different language).
• Balancing the different categories in the dataset (some categories had many more samples
than others).
In order to carry out some of these tasks, we got ideas from [178] or [203]. At some point we
also took advantage of pre-existing public datasets, like Pascal [41], Flickr8K [66], Flickr30K [205],
the Amazon dataset from [106] or DeepFashion [97].
6.5 Prototyping
Apart from all the mentioned tasks, that ended up being extremely useful and important for the
computer vision team, many other techniques and algorithms were explored to check ideas or
create basic prototypes, sometimes without achieving a final outcome. Some of them are worth
mentioning:
• K-nearest neighbors algorithm [11] was used for several experimental tasks.
• Using boosting techniques such as AnyBoost [104] or MILBoost [15] to combine weak
classifiers and try to find visual attributes in fashion images.
• Incorporating our BASS superpixels to a CRF architecture like the one in [149].
• Studying Gaussian Mixture Models and Bag of Words features for texts.
• Training lda2vec [112] for automatic attribute discovery, similar to what they do in [19].
• Studying doc2vec [88] and Fisher Vectors [124] for interpreting natural language and replacing
word2vec in Chapters 4 and 5.
• For Chapter 4, some Canonical Correlation Analysis techniques were tested (such as
DCCA [13] and KCCA [16]).
6.6 Demos
Apart from several video demonstrations showing the performance of the algorithms, a website
demo of the multi-modal retrieval system was created, where the user could select an image from
our dataset or introduce a text, and the most similar results were shown.
6.7 Recommendation System
Figure 6.1: Bidirectional LSTMs for fashion. Overview of Learning Fashion Compatibility
with Bidirectional LSTMs. Figure extracted from [61].
instead of thousands of very similar black shirts, so the binary matrix would have been much less
sparse). After that, collaborative filtering would have been computed using the smaller matrix.
We implemented and tested JULE [200] for clustering. Results were not very promising with our
data, so we conducted new research on the topic, carefully studying the work of [10], [60] and [61].
Other ideas that came up during this period involved creating a rating-based system using
DCCA [13].
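A minimal sketch of the intended pipeline is shown below: products are first grouped into clusters (standing in for the output of a JULE-like clustering) and item-based collaborative filtering is then computed on the resulting user-by-cluster binary matrix. The interaction matrix and cluster assignments are toy placeholders rather than Wide Eyes data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_products, n_clusters = 200, 1000, 50

interactions = (rng.random((n_users, n_products)) < 0.01).astype(float)  # sparse binary matrix
cluster_of = rng.integers(0, n_clusters, size=n_products)                # placeholder clustering

# Collapse columns belonging to the same cluster: users x clusters binary matrix.
user_cluster = np.zeros((n_users, n_clusters))
for p in range(n_products):
    user_cluster[:, cluster_of[p]] += interactions[:, p]
user_cluster = (user_cluster > 0).astype(float)

# Item-based collaborative filtering: cosine similarity between cluster columns.
norms = np.linalg.norm(user_cluster, axis=0) + 1e-8
sim = (user_cluster.T @ user_cluster) / np.outer(norms, norms)

# Score unseen clusters for one user as a similarity-weighted sum of their interactions.
user = user_cluster[0]
scores = sim @ user
scores[user > 0] = -np.inf  # do not recommend clusters already interacted with
print("Top recommended clusters:", np.argsort(scores)[::-1][:5])
```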
In the end, we decided to follow the same embedding approach developed in the previous chapters
for the recommendation task. Our plan was based on the method in [61]. In that paper, the
authors develop a bidirectional LSTM (Long Short-Term Memory) [55] system able to predict
compatibility between different garments and to complete or generate outfit recommendations.
In order to do that, they claim to generate individual garment descriptors that take into
account the rest of the garments in the outfit, using images of the different garments (see Fig. 6.1).
Our idea was to create an embedding where one could map this type of descriptor (which includes
information about the whole outfit) to descriptors of entire images of a person wearing the
full outfit. Doing this, the system would have been able to train a recommendation system
like the one in [61] and then associate the recommendation information with the entire image,
taking advantage of the trained bi-LSTM system for the recommendation and of the embedding
for mapping the results from product sequences to single look images.
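To make the discussed architecture more concrete, the snippet below sketches a bidirectional LSTM running over a sequence of per-garment descriptors, in the spirit of [61]; the dimensions and the class name are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class OutfitBiLSTM(nn.Module):
    """Toy bidirectional LSTM over a sequence of per-garment descriptors."""

    def __init__(self, descriptor_dim=512, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(descriptor_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, garments):
        # garments: (batch, num_garments, descriptor_dim)
        # outputs: (batch, num_garments, 2 * hidden_dim), one context-aware
        # descriptor per garment, combining the forward and backward passes.
        outputs, _ = self.lstm(garments)
        return outputs

outfits = torch.randn(4, 6, 512)   # 4 outfits, 6 garments each, 512-D descriptors
contextual = OutfitBiLSTM()(outfits)
print(contextual.shape)            # torch.Size([4, 6, 512])
```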
After implementing the paper and evaluating many different results while trying to create the
clustering embedding, we came to the conclusion that the descriptors used for recommendation in
the paper were not as dependent on the rest of the garments in the outfit as we initially thought,
and were therefore very difficult to associate with the entire look image.
Conclusions
I am glad that you are here with me. Here at the end of all
things.
J.R.R. Tolkien
Throughout this thesis, we pushed in the direction of reducing the semantic gap mentioned in
Section 1.1, focusing on the fashion domain. This gap remains open and, fortunately for
researchers in the field, will remain open for years to come. Everything we presented in this
dissertation is just a small sample of the possibilities that a trending field like fashion offers to
computer vision and machine learning researchers.
In order to do so, and always aligning our efforts with the commercial interests of Wide Eyes,
our work ranges from a new low-level algorithm for superpixel segmentation to a more abstract
interpretation of images tied to their textual metadata, first using entire images and then
moving to a region-specific interpretation. Along the way, we moved from the classical computer
vision approach used in Chapter 3 to the nowadays ubiquitous convolutional neural networks in
the subsequent chapters.
Apart from exploring other fashion-related problems like the ones mentioned in Section 2.3,
many possibilities arise from the work we presented. An overview of some of them is provided
below, along with the main conclusions of each chapter.
data augmentation techniques to regularize the results. Another advantage would be that we
might get rid of all the parameters that have to be set for BASS.
the ground truth bounding box was among the top-3 ranked boxes according to their distance
to the text.
Bibliography
[1] Ecommerce 101 + the history of online shopping: What the past says about tomor-
row’s retail challenges. https://www.bigcommerce.com/blog/ecommerce/#ecommerce-
timeline. Accessed: 2019-10-25.
[3] The ultimate list of e-commerce stats for holiday 2016. http://blog.marketingadept.
com/the-ultimate-list-of-e-commerce-marketing-stats-for-holiday-2016/. Ac-
cessed: 2017-01-23.
[5] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk. SLIC superpixels
compared to state-of-the-art superpixel methods. Pattern Analysis and Machine Intelli-
gence, 34(11):2274–2282, 2012.
[6] A. Agudo and F. Moreno-Noguer. Learning shape, motion and elastic models in force
space. In IEEE International Conference on Computer Vision, 2015.
[7] A. Agudo and F. Moreno-Noguer. Simultaneous pose and non-rigid shape with particle
dynamics. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[8] A. Agudo and F. Moreno-Noguer. DUST: Dual union of spatio-temporal subspaces for
monocular multiple object 3D reconstruction. In IEEE Conference on Computer Vision
and Pattern Recognition, 2017.
[10] Z. Al-Halah, R. Stiefelhagen, and K. Grauman. Fashion forward: Forecasting visual style
in fashion. In IEEE International Conference on Computer Vision, pages 388–397, 2017.
[12] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neigh-
bor in high dimensions. In 47th annual IEEE Symposium on Foundations of Computer
Science (FOCS’06), pages 459–468, 2006.
[13] G. Andrew, R. Arora, J. Bilmes, and K. Livescu. Deep canonical correlation analysis. In
International Conference on Machine Learning, pages 1247–1255, 2013.
[14] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image
segmentation. Pattern Analysis and Machine Intelligence, 33(5):898–916, 2011.
[15] B. Babenko, P. Dollár, Z. Tu, and S. Belongie. Simultaneous learning and alignment: Multi-
instance and multi-pose learning. In Workshop on Faces in ’Real-Life’ Images: Detection,
Alignment, and Recognition, 2008.
[16] F. R. Bach and M. I. Jordan. Kernel Independent Component Analysis. Journal of Machine
Learning Research, 3(Jul):1–48, 2002.
[17] S. Bell and K. Bala. Learning visual similarity for product design with convolutional neural
networks. ACM Transactions on Graphics (SIGGRAPH), 34(4):98, 2015.
[18] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new
perspectives. Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
[19] T. L. Berg, A. C. Berg, and J. Shih. Automatic attribute discovery and characteriza-
tion from noisy web data. In European Conference on Computer Vision, pages 663–676.
Springer, 2010.
[20] A. Bergamo and L. Torresani. Exploiting weakly-labeled web images to improve object
classification: A domain adaptation approach. In Conference on Neural Information Pro-
cessing Systems, 2010.
[21] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine
Learning Research, 3(Jan):993–1022, 2003.
[22] L. Bossard, M. Dantone, C. Leistner, C. Wengert, T. Quack, and L. Van Gool. Apparel
classification with style. In Asian Conference on Computer Vision, 2012.
[23] R. Catherine and W. Cohen. Transnets: Learning to transform for recommendation. In
Proceedings of the eleventh ACM Conference on Recommender Systems, pages 288–296,
2017.
[24] J. Y. Chai, C. Zhang, and R. Jin. An empirical investigation of user term feedback in
text-based targeted image search. ACM Transactions on Information Systems (TOIS),
25(1):3, 2007.
[25] G. Chechik, V. Sharma, U. Shalit, and S. Bengio. Large scale online learning of image
similarity through ranking. Journal of Machine Learning Research, 11(Mar):1109–1135,
2010.
[26] H. Chen, A. Gallagher, and B. Girod. Describing clothing by semantic attributes. In
European Conference on Computer Vision, pages 609–623. Springer, 2012.
[27] Q. Chen, J. Huang, R. Feris, L. M. Brown, J. Dong, and S. Yan. Deep domain adaptation
for describing people based on fine-grained clothing attributes. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 5315–5324, 2015.
[28] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis.
Pattern Analysis and Machine Intelligence, 24(5):603–619, 2002.
[29] A. Conneau, H. Schwenk, L. Barrault, and Y. Lecun. Very deep convolutional networks
for natural language processing. arXiv preprint arXiv:1606.01781, 2016.
[31] T. Cour, F. Benezit, and J. Shi. Spectral segmentation with multiscale graph decomposi-
tion. In IEEE Conference on Computer Vision and Pattern Recognition, 2005.
[34] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierar-
chical image database. In IEEE Conference on Computer Vision and Pattern Recognition,
pages 248–255, 2009.
[35] W. Di, C. Wah, A. Bhardwaj, R. Piramuthu, and N. Sundaresan. Style finder: Fine-
grained clothing style detection and retrieval. In IEEE Conference on Computer Vision
and Pattern Recognition Workshops, 2013.
[36] F. Diaz, B. Mitra, and N. Craswell. Query expansion with locally-trained word embeddings.
arXiv preprint arXiv:1605.07891, 2016.
[37] B. Dinakaran, J. Annapurna, and C. A. Kumar. Interactive image retrieval using text and
image content. Cybern Inf Tech, 10:20–30, 2010.
[38] P. Dollár and C. L. Zitnick. Structured forests for fast edge detection. In IEEE International
Conference on Computer Vision, 2013.
[39] J. Dong, Q. Chen, W. Xia, Z. Huang, and S. Yan. A deformable mixture parsing model
with parselets. In IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[40] A. Eriksson, C. Olsson, and F. Kahl. Normalized cuts revisited: A reformulation for seg-
mentation with linear grouping constraints. Journal of Mathematical Imaging and Vision,
39(1):45–61, 2011.
[41] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal
visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–
338, June 2010.
[44] V. Fomin, J. Anmol, S. Desroziers, Y. Kumar, J. Kriss, A. Tejani, and E. Rippeth. High-
level library to help with training neural networks in pytorch. https://github.com/
pytorch/ignite, 2020.
[45] D. Frejlichowski, P. Czapiewski, and R. Hofman. Finding similar clothes based on seman-
tic description for the purpose of fashion recommender system. In Asian Conference on
Intelligent Information and Database Systems, pages 13–22. Springer, 2016.
[47] B. Fulkerson, A. Vedaldi, S. Soatto, et al. Class segmentation and object localization
with superpixel neighborhoods. In IEEE International Conference on Computer Vision,
volume 9, pages 670–677. Citeseer, 2009.
[48] D. Ganguly, D. Roy, M. Mitra, and G. J. Jones. Word embedding based generalized lan-
guage model for information retrieval. In Special Interest Group on Information Retrieval
Conference, 2015.
[49] R. Girshick. Fast R-CNN. In IEEE International Conference on Computer Vision, pages
1440–1448, 2015.
[50] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain
adaptation. In IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[51] Y. Gong, Y. Jia, T. Leung, A. Toshev, and S. Ioffe. Deep convolutional ranking for
multilabel image annotation. arXiv preprint arXiv:1312.4894, 2013.
[54] R. Gopalan, R. Li, and R. Chellappa. Domain adaptation for object recognition: An
unsupervised approach. In IEEE Conference on Computer Vision and Pattern Recognition,
2011.
[55] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional lstm
and other neural network architectures. Neural Networks, 18(5-6):602–610, 2005.
[57] J.-W. Ha, H. Pyo, and J. Kim. Large-scale item categorization in e-commerce using multiple
recurrent neural networks. In Special Interest Group on Knowledge Discovery and Data,
pages 107–115. ACM, 2016.
[58] M. Hadi Kiapour, X. Han, S. Lazebnik, A. C. Berg, and T. L. Berg. Where to buy it:
Matching street clothing photos in online shops. In IEEE Conference on Computer Vision
and Pattern Recognition, 2015.
[60] X. Han, Z. Wu, P. X. Huang, X. Zhang, M. Zhu, Y. Li, Y. Zhao, and L. S. Davis. Automatic
spatially-aware fashion concept discovery. In IEEE International Conference on Computer
Vision, pages 1463–1471, 2017.
[61] X. Han, Z. Wu, Y.-G. Jiang, and L. S. Davis. Learning fashion compatibility with bidirec-
tional LSTMs. In ACM International Conference on Multimedia, pages 1078–1086, 2017.
[62] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level
performance on imagenet classification. In IEEE International Conference on Computer
Vision, pages 1026–1034, 2015.
[63] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In
IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[64] R. He and J. McAuley. Ups and downs: Modeling the visual evolution of fashion trends
with one-class collaborative filtering. In International Conference on World Wide Web,
pages 507–517. International World Wide Web Conferences Steering Committee, 2016.
[65] X. He, R. S. Zemel, and D. Ray. Learning and incorporating top-down cues in image
segmentation. In European conference on computer vision, pages 338–351. Springer, 2006.
[66] M. Hodosh, P. Young, and J. Hockenmaier. Framing image description as a ranking task:
Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47:853–
899, 2013.
[67] T. Hofmann. Probabilistic latent semantic indexing. In Special Interest Group on Infor-
mation Retrieval Conference, 1999.
[68] D. Hoiem, A. A. Efros, and M. Hebert. Geometric context from a single image. In IEEE
International Conference on Computer Vision, volume 1, pages 654–661. IEEE, 2005.
[69] W.-L. Hsiao and K. Grauman. Learning the latent “look”: Unsupervised discovery of
a style-coherent embedding from fashion images. In IEEE International Conference on
Computer Vision, pages 4213–4222. IEEE, 2017.
[70] W.-L. Hsiao and K. Grauman. Creating capsule wardrobes from fashion images. In IEEE
Conference on Computer Vision and Pattern Recognition, pages 7161–7170, 2018.
[71] J. Huang, R. S. Feris, Q. Chen, and S. Yan. Cross-domain image retrieval with a dual
attribute-aware ranking network. In Proceedings of the IEEE international conference on
computer vision, pages 1062–1070, 2015.
[72] S. J. Hwang, K. Grauman, and F. Sha. Analogy-preserving semantic embedding for visual
object categorization. In International Conference on Machine Learning, pages 639–647,
2013.
[73] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by
reducing internal covariate shift. In International Conference on Machine Learning, 2015.
[74] V. Jagadeesh, R. Piramuthu, A. Bhardwaj, W. Di, and N. Sundaresan. Large scale visual
recommendations from street fashion images. In Proceedings of the 20th ACM SIGKDD
international conference on Knowledge discovery and data mining, pages 1925–1934, 2014.
[75] V. Jampani, D. Sun, M.-Y. Liu, M.-H. Yang, and J. Kautz. Superpixel sampling networks.
In IEEE European Conference on Computer Vision, pages 352–368, 2018.
[79] H. Kim, S. Lee, D. Lee, S. Choi, J. Ju, and H. Myung. Real-time human pose estimation
and gesture recognition from depth images using superpixels and svm classifier. Sensors,
15(6):12410–12427, 2015.
[80] Y. Kim. Convolutional neural networks for sentence classification. arXiv preprint
arXiv:1408.5882, 2014.
[82] Y. Kita. Elastic-model driven analysis of several views of a deformable cylindrical object.
volume 18, pages 1150–1162. IEEE, 1996.
[85] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolu-
tional neural networks. In Conference on Neural Information Processing Systems, 2012.
[86] A. Krogh and J. A. Hertz. A simple weight decay can improve generalization. In Advances
in Neural Information Processing Systems, pages 950–957, 1992.
[87] S. Lai, L. Xu, K. Liu, and J. Zhao. Recurrent convolutional neural networks for text
classification. In Association for the Advancement of Artificial Intelligence, volume 333,
pages 2267–2273, 2015.
[90] D. Li, H.-Y. Lee, J.-B. Huang, S. Wang, and M.-H. Yang. Learning structured semantic
embeddings for visual recognition. arXiv preprint arXiv:1706.01237, 2017.
[91] Y. Li, L. Cao, J. Zhu, and J. Luo. Mining fashion outfit composition using an end-to-end
deep learning approach on set data. IEEE Transactions on Multimedia, 19(8):1946–1955,
2017.
[92] R.-S. Lin, D. A. Ross, and J. Yagnik. Spec hashing: Similarity preserving algorithm for
entropy-based coding. In 2010 IEEE Computer Society Conference on Computer Vision
and Pattern Recognition, pages 848–854. IEEE, 2010.
[93] F. Liu, C. Shen, G. Lin, and I. Reid. Deep convolutional neural fields for depth estimation
from a single image. In IEEE Conference on Computer Vision and Pattern Recognition,
2015.
[94] S. Liu, J. Feng, C. Domokos, H. Xu, J. Huang, Z. Hu, and S. Yan. Fashion parsing with
weak color-category labels. IEEE Transactions on Multimedia, 16(1):253–265, 2013.
[95] S. Liu, Z. Song, G. Liu, C. Xu, H. Lu, and S. Yan. Street-to-shop: Cross-scenario clothing
retrieval via parts alignment and auxiliary set. In 2012 IEEE Conference on Computer
Vision and Pattern Recognition, pages 3330–3337. IEEE, 2012.
[96] T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang, and H.-Y. Shum. Learning to detect
a salient object. Pattern Analysis and Machine Intelligence, 33(2):353–367, 2011.
[97] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang. DeepFashion: Powering robust clothes
recognition and retrieval with rich annotations. In IEEE Conference on Computer Vision
and Pattern Recognition, June 2016.
[98] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Jour-
nal of Computer Vision, 60(2):91–110, 2004.
[99] C. Lynch, K. Aryafar, and J. Attenberg. Images don’t lie: Transferring deep visual semantic
features to large-scale multimodal learning to rank. arXiv preprint arXiv:1511.06746, 2015.
[100] L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool. Pose guided person
image generation. In Advances in Neural Information Processing Systems, pages 406–416,
2017.
[101] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning
Research, 9(Nov):2579–2605, 2008.
[102] T. Malisiewicz and A. A. Efros. Improving spatial support for objects via multiple seg-
mentations. 2007.
[103] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural
images and its application to evaluating segmentation algorithms and measuring ecological
statistics. In IEEE International Conference on Computer Vision, 2001.
[105] K. Matzen, K. Bala, and N. Snavely. Streetstyle: Exploring world-wide clothing styles
from millions of photos. arXiv preprint arXiv:1706.01869, 2017.
[106] J. McAuley, C. Targett, Q. Shi, and A. Van Den Hengel. Image-based recommendations
on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference
on Research and Development in Information Retrieval, pages 43–52, 2015.
[108] T. McInerney and D. Terzopoulos. A finite element model for 3D shape reconstruction and
nonrigid motion tracking. In IEEE International Conference on Computer Vision, 1993.
[109] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations
in vector space. arXiv preprint arXiv:1301.3781, 2013.
[111] B. Mitra, E. Nalisnick, N. Craswell, and R. Caruana. A dual embedding space model for
document ranking. arXiv preprint arXiv:1602.01137, 2016.
[112] C. E. Moody. Mixing dirichlet topic models and word embeddings to make lda2vec. arXiv
preprint arXiv:1605.02019, 2016.
[113] F. Moreno-Noguer. 3d human pose estimation from a single image via distance matrix
regression. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2823–
2832, 2017.
[114] F. Moreno-Noguer and P. Fua. Stochastic exploration of ambiguities for nonrigid shape
recovery. volume 35, pages 463–475. IEEE, 2013.
[115] F. Moreno-Noguer and J. Porta. Probabilistic simultaneous pose and non-rigid shape. In
IEEE Conference on Computer Vision and Pattern Recognition, pages 1289–1296, 2011.
[116] F. Moreno-Noguer, M. Salzmann, V. Lepetit, and P. Fua. Capturing 3D stretchable surfaces
from single images in closed form. In IEEE Conference on Computer Vision and Pattern
Recognition, 2009.
[117] G. Mori. Guiding model search using segmentation. In IEEE International Conference on
Computer Vision, 2005.
[118] G. Mori, X. Ren, A. A. Efros, and J. Malik. Recovering human body configurations:
Combining segmentation and recognition. In IEEE Conference on Computer Vision and
Pattern Recognition, volume 2, pages II–II. Citeseer, 2004.
[119] M. Muja and D. G. Lowe. Fast approximate nearest neighbors with automatic algorithm
configuration. VISAPP (1), 2(331-340):2, 2009.
[120] P. Neubert, N. Sünderhauf, and P. Protzel. Superpixel-based appearance change prediction
for long-term navigation across seasons. Robotics and Autonomous Systems, 69:15–27, 2015.
[121] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and
J. Dean. Zero-shot learning by convex combination of semantic embeddings. arXiv preprint
arXiv:1312.5650, 2013.
[122] C. Pantofaru, C. Schmid, and M. Hebert. Object recognition by integrating multiple
image segmentations. In IEEE European Conference on Computer Vision, pages 481–494.
Springer, 2008.
[123] K. Pearson. LIII. On lines and planes of closest fit to systems of points in space. The
London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–
572, 1901.
[124] F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization.
In IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[125] F. Perronnin, Y. Liu, J. Sánchez, and H. Poirier. Large-scale image retrieval with com-
pressed fisher vectors. In IEEE Conference on Computer Vision and Pattern Recognition,
2010.
[126] F. Perronnin, J. Sánchez, and T. Mensink. Improving the fisher kernel for large-scale image
classification. In IEEE European Conference on Computer Vision, 2010.
[127] M. Polato and F. Aiolli. Exploiting sparsity to build efficient kernel based collaborative
filtering for top-n item recommendation. Neurocomputing, 268:17–26, 2017.
[128] A. Pumarola, A. Agudo, L. Porzi, A. Sanfeliu, V. Lepetit, and F. Moreno-Noguer.
Geometry-aware network for non-rigid shape prediction from a single view. In Proceedings
of the Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[134] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time
object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages
779–788, 2016.
[135] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection
with region proposal networks. In Advances in Neural Information Processing Systems,
pages 91–99, 2015.
[136] X. Ren and J. Malik. Learning a classification model for segmentation. In Pattern Analysis
and Machine Intelligence, pages 10–17. IEEE, 2003.
[137] Z. Ren, H. Jin, Z. Lin, C. Fang, and A. Yuille. Multi-instance visual-semantic embedding.
arXiv preprint arXiv:1512.06963, 2015.
[138] O. Rippel, M. Paluri, P. Dollar, and L. Bourdev. Metric learning with adaptive density
discrimination. arXiv preprint arXiv:1511.05939, 2015.
[142] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new
domains. In IEEE European Conference on Computer Vision, 2010.
[143] S. Saito, T. Simon, J. Saragih, and H. Joo. Pifuhd: Multi-level pixel-aligned implicit
function for high-resolution 3d human digitization. In IEEE Conference on Computer
Vision and Pattern Recognition, June 2020.
[147] S. Schmit and C. Riquelme. Human interaction with recommendation systems: On bias
and exploration. ArXiv e-prints, 1050:1, 2017.
[148] J. Shi and J. Malik. Normalized cuts and image segmentation. Pattern Analysis and
Machine Intelligence, 22(8):888–905, 2000.
[151] E. Simo-Serra and H. Ishikawa. Fashion style in 128 floats: Joint ranking and classification
using weak data for feature extraction. In IEEE Conference on Computer Vision and
Pattern Recognition, 2016.
[152] E. Simo-Serra, A. Quattoni, C. Torras, and F. Moreno-Noguer. A joint model for 2d and
3d pose estimation from a single image. In IEEE Conference on Computer Vision and
Pattern Recognition, pages 3634–3641, 2013.
[154] E. Simo-Serra, C. Torras, and F. Moreno-Noguer. DaLI: Deformation and light invariant
descriptor. In International Journal of Computer Vision (IJCV), volume 115, pages 135–
154, 2015.
[155] E. Simo-Serra, C. Torras, and F. Moreno-Noguer. 3d human pose tracking priors using
geodesic mixture models. International Journal of Computer Vision, 122(2):388–408, 2017.
[156] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556, 2014.
[157] J. Sivic and A. Zisserman. Video google: A text retrieval approach to object matching in
videos. In IEEE International Conference on Computer Vision, page 1470. IEEE, 2003.
[158] J. R. Smith. The real problem of bridging the “semantic gap”. In International Workshop
on Multimedia Content Analysis and Mining, pages 16–17. Springer, 2007.
[159] Z. Song, M. Wang, X.-s. Hua, and S. Yan. Predicting occupation via human clothing and
contexts. In IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[162] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception
architecture for computer vision. In IEEE Conference on Computer Vision and Pattern
Recognition, pages 2818–2826, 2016.
[163] J. K. V. Tan, I. Budvytis, and R. Cipolla. Indirect deep structured learning for 3d human
body shape and pose prediction. 2017.
[164] P. Tangseng, Z. Wu, and K. Yamaguchi. Looking at outfit to parse clothing. arXiv preprint
arXiv:1703.01386v1, Mar 2017.
[165] P. J. Toivanen. New geodesic distance transforms for gray-scale images. Pattern Recognition
Letters, 17(5):437–450, 1996.
[166] A. Torralba, R. Fergus, Y. Weiss, et al. Small codes and large image databases for recogni-
tion. In IEEE Conference on Computer Vision and Pattern Recognition, volume 1, page 2.
Citeseer, 2008.
[170] W.-C. Tu, M.-Y. Liu, V. Jampani, D. Sun, S.-Y. Chien, M.-H. Yang, and J. Kautz. Learning
superpixels with segmentation-aware affinity loss. In IEEE Conference on Computer Vision
and Pattern Recognition, pages 568–576, 2018.
[171] H.-Y. Tung, H.-W. Tung, E. Yumer, and K. Fragkiadaki. Self-supervised learning of motion
capture. In Advances in Neural Information Processing Systems, pages 5236–5246, 2017.
[172] O. Y. Vacit, L. Yu, and J. van de Weijer. Main product detection with graph networks in
fashion. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[173] M. Van den Bergh, X. Boix, G. Roig, and L. Van Gool. Seeds: Superpixels extracted via
energy-driven sampling. International Journal of Computer Vision, 111(3):298–314, 2015.
[174] A. Van Den Hengel, A. Dick, T. Thormählen, B. Ward, and P. H. Torr. Videotrace:
rapid interactive scene modelling from video. In ACM Transactions on Graphics (ToG),
volume 26, page 86. ACM, 2007.
[175] A. Vedaldi and S. Soatto. Quick shift and kernel methods for mode seeking. In IEEE
European Conference on Computer Vision. 2008.
[176] O. Veksler, Y. Boykov, and P. Mehrani. Superpixels and supervoxels in an energy opti-
mization framework. In IEEE European Conference on Computer Vision. 2010.
[177] L. Vincent and P. Soille. Watersheds in digital spaces: an efficient algorithm based on
immersion simulations. Pattern Analysis and Machine Intelligence, (6):583–598, 1991.
[181] J. Wan, D. Wang, S. C. H. Hoi, P. Wu, J. Zhu, Y. Zhang, and J. Li. Deep learning for
content-based image retrieval: A comprehensive study. In ACM International Conference
on Multimedia, pages 157–166, 2014.
[182] C. Wang, Z. Liu, and S.-C. Chan. Superpixel-based hand gesture recognition with kinect
depth camera. Multimedia, IEEE Transactions on, 17(1):29–39, 2015.
[183] L. Wang, Y. Li, and S. Lazebnik. Learning deep structure-preserving image-text embed-
dings. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5005–5013,
2016.
[184] M. Wang, M. Azab, N. Kojima, R. Mihalcea, and J. Deng. Structured matching for phrase
localization. In European Conference on Computer Vision, pages 696–711. Springer, 2016.
[185] P. Wang, G. Zeng, R. Gan, J. Wang, and H. Zha. Structure-sensitive superpixels via
geodesic distance. International Journal of Computer Vision, 103(1):1–21, 2013.
[186] S. Wang, H. Lu, F. Yang, and M.-H. Yang. Superpixel tracking. In IEEE International
Conference on Computer Vision, 2011.
[187] Z. Wang and Y. Zhang. Opinion recommendation using neural memory model. arXiv
preprint arXiv:1702.01517, 2017.
[188] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In Advances in Neural Information
Processing Systems, pages 1753–1760, 2009.
[189] P. Werbos. Beyond regression: New tools for prediction and analysis in the behavioral
sciences. Ph. D. dissertation, Harvard University, 1974.
[190] J. Weston, S. Bengio, and N. Usunier. Large scale image annotation: learning to rank with
joint word-image embeddings. Machine learning, 81(1):21–35, 2010.
[191] J. Weston, S. Bengio, and N. Usunier. Wsabie: Scaling up to large vocabulary image
annotation. In Twenty-Second International Joint Conference on Artificial Intelligence,
2011.
[192] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang. Learning from massive noisy labeled
data for image classification. In Proceedings of the IEEE conference on computer vision
and pattern recognition, pages 2691–2699, 2015.
[193] Y. Xiao and K. Cho. Efficient character-level document classification by combining convo-
lution and recurrent layers. arXiv preprint arXiv:1602.00367, 2016.
[194] K. Yamaguchi, T. L. Berg, and L. E. Ortiz. Chic or social: Visual popularity analysis in
online fashion networks. In ACMMM, 2014.
[195] K. Yamaguchi, M. Hadi Kiapour, and T. L. Berg. Paper doll parsing: Retrieving similar
styles to parse clothing items. In IEEE Conference on Computer Vision and Pattern
Recognition, 2013.
[197] K. Yamaguchi, T. Okatani, K. Sudo, K. Murasaki, and Y. Taniguchi. Mix and match:
Joint model for clothing and attribute recognition. In British Machine Vision Conference,
volume 1, page 4, 2015.
[198] F. Yan and K. Mikolajczyk. Deep correlation for matching images and text. In IEEE
Conference on Computer Vision and Pattern Recognition, 2015.
[199] Q. Yan, J. Shi, L. Xu, and J. Jia. Hierarchical saliency detection on extended cssd. arXiv
preprint arXiv:1408.5418, 2014.
[200] J. Yang, D. Parikh, and D. Batra. Joint unsupervised learning of deep representations and
image clusters. In IEEE Conference on Computer Vision and Pattern Recognition, pages
5147–5156, 2016.
[201] W. Yang, P. Luo, and L. Lin. Clothing co-parsing by joint image segmentation and labeling.
In IEEE Conference on Computer Vision and Pattern Recognition, pages 3182–3189, 2014.
[202] J. Yao, M. Boben, S. Fidler, and R. Urtasun. Real-time coarse-to-fine topologically pre-
serving segmentation. In IEEE Conference on Computer Vision and Pattern Recognition,
2015.
[205] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual deno-
tations: New similarity metrics for semantic inference over event descriptions. Transactions
of the Association for Computational Linguistics, 2:67–78, 2014.
[206] J. Yuan, W. Shalaby, M. Korayem, D. Lin, K. AlJadda, and J. Luo. Solving cold-start
problem in large-scale recommendation engines: A deep learning approach. In 2016 IEEE
International Conference on Big Data (Big Data), pages 1901–1910. IEEE, 2016.
[209] S. Zhang, M. Yang, T. Cour, K. Yu, and D. N. Metaxas. Query specific fusion for image
retrieval. In European Conference on Computer Vision, pages 660–673. Springer, 2012.
[210] X. Zhang, L. Liang, and H.-Y. Shum. Spectral error correcting output codes for efficient
multiclass recognition. In 2009 IEEE 12th International Conference on Computer Vision,
pages 1111–1118. IEEE, 2009.
[211] X. Zhang, J. Zhao, and Y. LeCun. Character-level convolutional networks for text classi-
fication. In Conference on Neural Information Processing Systems, 2015.
[212] Y. Zhang, R. Hartley, J. Mashford, and S. Burn. Superpixels via pseudo-boolean optimiza-
tion. In IEEE International Conference on Computer Vision, 2011.
[213] Y. Zhang, Z. Jia, and T. Chen. Image retrieval with geometry-preserving visual phrases.
In IEEE Conference on Computer Vision and Pattern Recognition, pages 809–816. IEEE,
2011.
[214] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu. Deep mutual learning. In IEEE
Conference on Computer Vision and Pattern Recognition, pages 4320–4328, 2018.
[215] B. Zhou, Y. Tian, S. Sukhbaatar, A. Szlam, and R. Fergus. Simple baseline for visual
question answering. arXiv preprint arXiv:1512.02167, 2015.
[216] W. Zhou, H. Li, and Q. Tian. Recent advance in content-based image retrieval: A literature
survey. arXiv preprint arXiv:1706.06064, 2017.
[217] S. Zhu, R. Urtasun, S. Fidler, D. Lin, and C. Change Loy. Be your own prada: Fash-
ion synthesis with structural coherence. In IEEE International Conference on Computer
Vision, pages 1680–1688, 2017.
[218] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In IEEE
European Conference on Computer Vision, pages 391–405, 2014.