Signal Processing: Image Communication 29 (2014) 434–447
Bayesian salient object detection based on saliency driven clustering
Lei Zhou a, Keren Fu a, Yijun Li a, Yu Qiao a, XiangJian He b, Jie Yang a,*
a Shanghai Jiao Tong University, Shanghai, China
b University of Technology, Sydney, Australia
Article info
Article history: Received 5 August 2013; received in revised form 3 January 2014; accepted 3 January 2014; available online 30 January 2014
Abstract
Salient object detection is essential for applications such as image classification, object
recognition and image retrieval. In this paper, we design a new approach to detect salient
objects in an image by describing what salient objects and backgrounds look like
using the statistics of the image. First, we introduce a saliency driven clustering method to
reveal distinct visual patterns of images by generating image clusters. A Gaussian
Mixture Model (GMM) is applied to represent the statistics of each cluster, which are used to
compute the color spatial distribution. Second, three kinds of regional saliency measures,
i.e., regional color contrast saliency, regional boundary prior saliency and regional color
spatial distribution, are computed and combined. Then, a region selection strategy
integrating the color contrast prior, the boundary prior and the visual pattern information of images
is presented. The pixels of an image are adaptively divided into either a potential salient region or
a background region based on the combined regional saliency measures. Finally,
a Bayesian framework is employed to compute the saliency value of each pixel, taking the
regional saliency values as priors. Our approach has been extensively evaluated on two
popular image databases. Experimental results show that our approach achieves a
considerable performance improvement in terms of the performance
measures commonly adopted in salient object detection.
© 2014 Elsevier B.V. All rights reserved.
Keywords:
Saliency object detection
Saliency driven clustering
Regional saliency computation
Bayesian model
1. Introduction
The human visual system (HVS) often pays more attention
to some parts of an image than to others. It is visual attention that allows
people to select the information which is most relevant to
ongoing behavior. Visual attention has been studied by
researchers in physiology, psychology, neural systems, and
computer vision for a long time. Extracting objects from an
image is an active research topic with wide applications in
computer vision and computer graphics, such as content-based
image retrieval [1], image/video compression and coding [2],
object recognition and scene understanding [3–6] and image
segmentation [7,8]. Under the mechanism of visual attention,
the HVS picks out relevant parts of a scene as attention regions,
which correspond to salient regions in images. In natural scenes,
salient regions generally stand out relative to their surroundings.
This mechanism can be explained by a center-surround difference
model [9], which is implemented in the feature spaces of
luminance, color and orientation.
Corresponding author: Jie Yang. E-mail address: jieyang@sjtu.edu.cn (J. Yang).
http://dx.doi.org/10.1016/j.image.2014.01.001
In recent years, salient object detection has attracted
researchers' interest, and the related work can be divided
into two categories: bottom-up approaches and top-down
approaches. In bottom-up visual saliency, previous research [10,11]
revealed that contrast is the most influential factor in
low-level visual saliency. By computing the contrast over
a pixel domain or region domain, many visual saliency
models have been proposed over the past years [12–17].
The existing saliency models based on color contrast can
simultaneously compute the global contrast of an image and
the spatial coherence between regions, and have displayed
impressive results. The contrast-based models tell “what
the objects look like” by highlighting the pixels with a large
center-surround difference. However, the quality of
saliency maps that rely only on color contrast
degrades when the images contain confusing patterns or
complex scenes.
Fig. 1. The flowchart of our proposed model.
Different from the contrast prior, the background prior
tackles the salient object detection problem by asking the
question “what should the background look like”. In [18],
two priors, the boundary prior and the connectivity prior, were used as
priors about common backgrounds in natural images. The
boundary prior was derived from the observations that
“the image boundary is mostly background” and “the salient
objects seldom touch the image boundary”. Although the
boundary prior works for most images, it may fail when
objects significantly touch the image boundary or the images
have a complex background.
Intuitively, if information about both “what the objects
look like” and “what the backgrounds look like” is available, it
will be easier to tackle the ill-posed salient object problem. To
describe the appearance of salient objects or backgrounds,
image statistics are a powerful feature. From the perspective
of statistical theory, many saliency models have been proposed
[19–21]. However, the extracted initial salient or background
regions are crucial for constructing reliable appearance models. In [20], a coarse salient region was first obtained via a
convex hull of interest points and a Laplacian sparse subspace
clustering algorithm was presented to obtain the prior map
related to the salient object. The algorithm is effective but the
clustering method has high computational complexity. In
[19], each image was divided into an initial attention region and
an initial background region using an adaptive threshold. This
method may fail when the salient object has low
contrast with the background. In [21], a polygonal potential
Region-Of-Interest was extracted by analyzing the edge
distribution in an image.
In this paper, we address the salient object detection
problem by collecting reliable information about “what the
salient objects look like and what the backgrounds look like”,
taking advantage of the contrast prior, the boundary prior
and the visual patterns of images. To tell “what the salient
objects look like”, we propose a method for selecting the initial
salient and background regions based on
saliency driven clustering, regional saliency computation and
adaptive thresholding. To capture the structural information of an image, the image is first separated into several
non-overlapping regions by saliency driven clustering. Then,
we use three regional saliency measures, i.e., regional
color contrast saliency, regional boundary prior saliency
and color spatial distribution, to compute the saliency level
of each region. The three measures are combined non-linearly to
obtain the pixel level saliency values. Then, we classify the
pixels of an image into either a potential salient region or
a background region using an adaptive threshold. The color
information within the potential salient region and the background region is represented using GMMs. Finally, the
saliency model is computed by applying a general Bayesian
framework. The main contributions of the paper are summarized as follows, and the flow chart of our approach is
shown in Fig. 1.
• We propose an effective method for implementing the
boundary prior saliency. A saliency driven clustering
approach is proposed based on the combination of
color contrast saliency and boundary prior saliency.
The number of clusters is determined automatically by
histogram analysis.
• Color spatial distribution is calculated by analyzing the
color statistics of each cluster. The regional saliency values
are computed by combining three regional saliency
measures, i.e., regional color contrast saliency, color spatial
distribution and regional boundary prior saliency.
• An adaptive initial region selection method is presented. A Bayesian framework is applied to generate
the final saliency maps.
The organization of this paper is as follows. The related
work is presented in Section 2. We introduce the approach
of saliency driven clustering in Section 3. The strategy of
region selection based on regional saliency computation is
proposed in Section 4, with the three measures introduced
in Section 4.2. Our statistical saliency model is presented in
Section 5. Experimental results are given in Section 6 and
we conclude in Section 7.
2. Related work
In the bottom-up category, as described in [15],
contrast has often been defined over various
classes of image features, including the color variation
of pixels or image patches, spatial frequencies, the structure and
statistical distribution of image patches, histograms, and
combinations of the above. In [16], the contrast of a pixel or region
was computed over all other pixels or regions. Then the
compactness in the spatial domain and the cluster contrast
evaluated by the difference between models of different
clusters were combined to generate a saliency map. Fu et al.
[17] took advantage of color contrast and color distribution to carry out saliency detection. Then a saliency
measure was obtained by computing two measures of
contrast that rated the uniqueness and the spatial distribution of image patches. Furthermore, postprocessing steps
were applied to refine the result. In [22], a detection
algorithm based on four principles observed in
the psychological literature was presented. The rule of
“distinctive colors or patterns” was considered for computing the saliency. In [15], high dimensional Gaussian filters
were formulated to generate a saliency map
efficiently. Besides, many visual saliency models
measure visual saliency in the frequency domain. In [23],
Hou et al. proposed a visual saliency model based on
natural image statistics. In [24], the phase spectrum of the
quaternion Fourier transform was exploited to evaluate the
saliency at block level.
Many works applying information theory have also been proposed [25–28]. Bruce et al. proposed a model [25]
in which visual saliency was represented by a local likelihood of each image patch decomposed by filters
learned from natural images. In [27], Dominik et al.
proposed a saliency algorithm in which the contrast between
the center and surround distributions of features was
computed based on the Kullback–Leibler divergence
for salient object detection. In [28], information divergence
was used to express the non-uniform distribution of the
visual information in an image, and the
Bayesian surprise model was improved to compute information divergence across an image. A visual saliency map was finally
obtained from the information divergence.
From the perspective of statistical theory, several saliency models have been proposed. The global information
of an image was applied to generate a saliency map of high
quality, as shown in [29,30], where the distinctness of the different
statistical models representing different clusters was shown to be
of great importance in measuring the saliency of a region.
Zhang et al. [31] posited a saliency detection problem by
representing visual information of a specific class of object
using Bayesian framework and the information of a known
target class was modeled by a likelihood function. There
are also many works that compute contrast based on
image regions' natural statistics. In [32], Bayesian theory
was applied to describe the interaction between top-down
and bottom-up information. It evaluated and selected
visual features before saliency estimation. In [19], the
attention Gaussian mixture model (GMM) for salient
object and background GMM were constructed based on
the image clustering result, and pixels were classified
under the Bayesian framework to obtain the salient object.
In [29], the framework of mixture of Gaussian in H–S space
was used to compute the distance between clusters and
color spatial distribution. Then, color and orientation
distributions in images are fully utilized to selectively
generate a saliency map. In [30], a kernel density estimation (KDE) based nonparametric model was constructed
for each segmented region, and color and spatial saliency
measures of KDE models were evaluated and exploited to
measure saliency of pixels. In [16], the histograms of
regions were exploited to generate the saliency map at
pixel-level and region-level respectively. In [20], a Bayesian framework was proposed to combine the low level
cues (coarse saliency region obtained via a convex hull of
interest points) and mid level cues (saliency information
provided by superpixels) to generate a saliency map.
The significant difference between the proposed model
and previous statistical theory based models such as [19–21]
is that the initial salient and background regions are extracted
by a region selection strategy which integrates the color contrast
prior, the boundary prior and the visual pattern information of
images. Then, the regional saliency values are taken as the
prior probability of the Bayesian model, and the likelihood
is computed by analyzing the statistics of the adaptively selected
initial regions.
3. Saliency driven clustering
Intuitively, image clusters can reveal distinct visual patterns of an image. In this section, we present a saliency driven
clustering method. First, the basic notations of the color
contrast saliency and boundary prior saliency are introduced.
Then, the clustering technique which is based on the analysis
of combined saliency map's histogram is introduced.
3.1. Color contrast saliency
Color contrast is inspired by the observation that the color
components of salient objects may have a strong contrast
to their surroundings. Assume that an image of size $N$ is
divided into $M$ regions (or superpixels) and the regions are
represented as $R_i$, $i \in \{1, 2, 3, \dots, M\}$. Then, the color
contrast saliency $S_i^{Rcon}$ of region $R_i$ is computed according to the
definition in [16,17]:
\[ S_i^{Rcon} = \sum_{j \neq i} D_c(R_i, R_j)\, D_s(R_i, R_j), \tag{1} \]
where $D_c(R_i, R_j) = \| c_i - c_j \|$ is the color distance between
the two regions. $c_i$ represents the average color information in region $i$, i.e., $c_i = \sum_{I_k \in R_i} I_k^C / |R_i|$, for $i \in \{1, 2, \dots, M\}$,
where $I_k^C$ (or $I_k$) is the color feature vector at pixel $k$ and $|R_i|$ is
the size of region $R_i$. $D_s(R_i, R_j)$ in Eq. (1) stands for the
spatial distance between regions $R_i$ and $R_j$, which is
defined as
\[ D_s(R_i, R_j) = e^{-\alpha \| p_i - p_j \|^2}, \tag{2} \]
where $\alpha$ is a parameter that controls the contrast's sensitivity
to spatial distance and $p_i$ describes the average position of
region $R_i$, i.e., $p_i = \sum_{I_k \in R_i} I_k^P / |R_i|$, where $I_k^P$ is the coordinate
vector at pixel $k$. In the experiment, 300 superpixels are
generated using SLIC [33] and we set $\alpha = 1 \times 10^{-3}$. Finally,
the pixel level color contrast saliency is given as $S_i^{con} = S_j^{Rcon}$, $i \in R_j$.
We normalize $S^{con}$ to the range [0,1] through
$S^{con} = (S^{con} - \min(S^{con})) / (\max(S^{con}) - \min(S^{con}))$.
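As an illustration of Eqs. (1) and (2), the following minimal sketch computes the regional color contrast saliency from per-region mean colors and mean positions and applies the min-max normalization. It is not the authors' implementation: the function name, the toy colors and the toy positions are hypothetical, and the superpixel step (SLIC in the paper) is assumed to have been run beforehand.

```python
import math

def color_contrast_saliency(mean_colors, mean_positions, alpha=1e-3):
    """Eq. (1): S_i = sum over j != i of D_c(R_i, R_j) * D_s(R_i, R_j),
    followed by min-max normalization to [0, 1]."""
    m = len(mean_colors)
    scores = []
    for i in range(m):
        total = 0.0
        for j in range(m):
            if j == i:
                continue
            # D_c: Euclidean distance between the mean colors c_i and c_j
            d_c = math.dist(mean_colors[i], mean_colors[j])
            # D_s, Eq. (2): exp(-alpha * ||p_i - p_j||^2)
            d_s = math.exp(-alpha * math.dist(mean_positions[i], mean_positions[j]) ** 2)
            total += d_c * d_s
        scores.append(total)
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * m
    return [(s - lo) / (hi - lo) for s in scores]

# Hypothetical toy input: two gray regions and one red region.
colors = [(0.5, 0.5, 0.5), (0.52, 0.5, 0.5), (0.9, 0.1, 0.1)]
positions = [(10.0, 10.0), (30.0, 10.0), (20.0, 20.0)]
saliency = color_contrast_saliency(colors, positions)
```

In this toy input the red region differs most from the other two, so it receives the highest normalized value.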
3.2. Boundary prior saliency
As stated in [34], the boundary prior is an important
and helpful measure for salient object detection. For an
image, the pixel-level undirected weighted graph is represented as $G = \{V, \varepsilon\}$. In the graph, the boundary nodes ($\Omega_B$)
are selected using a strategy similar to [34]. Then, the
geodesic distance of a pixel $m$ to the boundary nodes is
computed as the shortest distance to all the background
nodes [35]
\[ g(m) = \min_{s \in \Omega_B} d_g(s, m), \tag{3} \]
where $d_g(s, m)$ is the geodesic distance between two nodes
$s$ and $m$, which is computed based on the length of a
discrete path [35]:
\[ d_g(s, m) = \min_{\Gamma \in P_{s,m}} L(\Gamma), \tag{4} \]
where $P_{s,m}$ stands for the set of paths between nodes $s$ and
$m$, and the length $L$ of a discrete path $\Gamma$ is defined as [35]
\[ L(\Gamma) = \sum_{i=1}^{n-1} \sqrt{(1 - \gamma_g)\, d(\Gamma_i, \Gamma_{i+1})^2 + \gamma_g \| \nabla(\Delta_i) \|^2}, \tag{5} \]
where $\Gamma$ is an arbitrary discrete path composed of $n$
pixels $\{\Gamma_1, \dots, \Gamma_n\}$. The term $d(\Gamma_i, \Gamma_{i+1})$ is the Euclidean distance between $\Gamma_i$ and $\Gamma_{i+1}$, and $\| \nabla(\Delta_i) \|$ is a finite
difference approximation of the image gradient between
$\Gamma_i$ and $\Gamma_{i+1}$. We use the parameter $\gamma_g$ to weight the two kinds
of distances: the Euclidean distance and the geodesic
distance. The role of $\gamma_g$ has been studied in [35] and we
set $\gamma_g = 0.2$ in the experiments. In our implementation, the
paths in Eq. (5) are computed using the fast marching algorithm [36]. Then, the boundary prior saliency is defined as
$S_i^{Boundary} = g(i)$, $i \in [1, \dots, N]$, and it is normalized to the
range [0,1] as well.
3.3. Saliency driven clustering
To integrate the complementary strengths of the two kinds
of saliency maps, the color contrast saliency and the boundary
prior saliency are combined nonlinearly,
\[ S_i^{cb} = S_i^{con} \cdot S_i^{Boundary}, \quad i \in [1, \dots, N]. \tag{6} \]
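The boundary geodesic distance of Eqs. (3)–(5) and the combination of Eq. (6) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses Dijkstra's algorithm on a 4-connected grid in place of the fast marching algorithm, it assumes every image-border pixel is a boundary node, and the function names and the toy image are hypothetical.

```python
import heapq
import math

def boundary_geodesic(image, gamma=0.2):
    """Shortest path cost from every pixel to the image border (Eq. (3)),
    with the step cost of Eq. (5) on a 4-connected grid."""
    h, w = len(image), len(image[0])
    dist = [[math.inf] * w for _ in range(h)]
    heap = []
    for y in range(h):
        for x in range(w):
            if y in (0, h - 1) or x in (0, w - 1):  # assumed boundary nodes
                dist[y][x] = 0.0
                heap.append((0.0, y, x))
    heapq.heapify(heap)
    while heap:
        d, y, x = heapq.heappop(heap)
        if d > dist[y][x]:
            continue  # stale queue entry
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                grad = image[ny][nx] - image[y][x]  # finite difference
                # Euclidean step length is 1 between 4-neighbours
                step = math.sqrt((1.0 - gamma) * 1.0 + gamma * grad * grad)
                if d + step < dist[ny][nx]:
                    dist[ny][nx] = d + step
                    heapq.heappush(heap, (d + step, ny, nx))
    return dist

def normalize(values):
    """Min-max normalization of a 2-D map to [0, 1]."""
    flat = [v for row in values for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0
    return [[(v - lo) / span for v in row] for row in values]

def combine(s_con, s_boundary):
    """Eq. (6): pixel-wise product of the two saliency maps."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(s_con, s_boundary)]

# Hypothetical 5x5 gray image with one bright pixel in the centre.
image = [[0.0] * 5 for _ in range(5)]
image[2][2] = 10.0
geo = boundary_geodesic(image)
s_boundary = normalize(geo)
s_con = [[0.5] * 5 for _ in range(5)]  # hypothetical contrast map
s_cb = combine(s_con, s_boundary)
```

Because the product in Eq. (6) is zero wherever either map is zero, border pixels are suppressed in the combined map regardless of their contrast value.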
Fig. 2. An example illustrates the procedure of peak selection by histogram analysis. Nine peaks [1, 47, 66, 101, 144, 160, 194, 235, 255] are located by the
Hill climbing algorithm. (a) The source image; (b) Scon; (c) Sboundary; (d) Scb; (e) histogram ranges from 0 to 255; and (f) the selected Peak 47. The rectangle is
the search window; (g) the selected Peak 101; and (h) the selected Peak 235.
Then, the image space is separated into several non-overlapping regions by K-means clustering [37] based on
the saliency values $[S_1^{cb}, \dots, S_N^{cb}]$. For K-means clustering, the
number of clusters $K$ and the initial positions of the centroids
are determined automatically. We use the Hill Climbing
algorithm [38] to analyze the histogram of $S^{cb}$. The
computed peaks of the histogram are selected as the starting
points for clustering and $K$ is set to the number of peaks.
The procedure for saliency driven clustering is described
below.
1. We first build the histogram, ranging over [0,255], of the combined saliency map $S^{cb}$.
2. We construct a search window of size 30 and move the center
of the window from 0 to 255 to search for
the peaks of the histogram. The bin that lies in the center of
the search window is the current bin. The pixel count of the current bin
is compared with the pixel counts of the
neighboring bins, and the current bin
is selected as a peak if its count is the largest in
the search window. For the bins in the range
[0,15], the size of the left half of the search window is less
than 15, and for the bins in the range [240,255], the
size of the right half of the search window is less than 15.
We ensure that at least one half of the search
window has size 15 throughout the search. The
number of computed peaks is $K$ and the set of peak bins
is $P_b$.
3. We set the number of clusters to $K$ and take $P_b$ as the
starting centroids for K-means clustering. Then $K$ saliency driven clusters $RS_1, \dots, RS_K$ are generated.
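The three steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy data are hypothetical, ties inside a window are broken in favour of the leftmost bin (the text does not specify tie handling), and a plain one-dimensional Lloyd iteration stands in for the K-means implementation of [37].

```python
def histogram_peaks(values, half_window=15, bins=256):
    """Step 2: a bin is a peak if its count is the largest inside the
    (clipped) search window centred on it."""
    hist = [0] * bins
    for v in values:
        hist[v] += 1
    peaks = []
    for b in range(bins):
        if hist[b] == 0:
            continue
        lo, hi = max(0, b - half_window), min(bins - 1, b + half_window)
        # strict on the left, non-strict on the right: leftmost bin wins ties
        left_ok = all(hist[b] > hist[j] for j in range(lo, b))
        right_ok = all(hist[b] >= hist[j] for j in range(b + 1, hi + 1))
        if left_ok and right_ok:
            peaks.append(b)
    return peaks

def kmeans_1d(values, centroids, iterations=20):
    """Step 3: Lloyd iterations on scalar saliency values, started at the peaks."""
    centroids = [float(c) for c in centroids]
    labels = [0] * len(values)
    for _ in range(iterations):
        labels = [min(range(len(centroids)), key=lambda k: abs(v - centroids[k]))
                  for v in values]
        for k in range(len(centroids)):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centroids[k] = sum(members) / len(members)
    return labels, centroids

# Hypothetical combined saliency values already quantized to [0, 255].
values = [10] * 50 + [12] * 20 + [100] * 40 + [200] * 30 + [205] * 30
peaks = histogram_peaks(values)
labels, centroids = kmeans_1d(values, peaks)
```

Here the number of clusters is simply `len(peaks)`, so no value of K has to be supplied by hand.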
Fig. 2 displays the process of histogram analysis by moving the
search window. In Fig. 3, a clustering example is displayed.
The cluster number $K = 9$ is first determined by analyzing
the histogram of the combined saliency map. Then, the image
displayed is separated into nine non-overlapping regions
by K-means. It is observed from the clustering map (Fig. 3(l))
that pixels in clusters (Fig. 3(i)–(k)) only belong to the
salient object. The cluster (Fig. 3(c)) contains only
background pixels. The clusters (Fig. 3(d)–(h)) contain
both object and background pixels. It is clear that the
visual patterns reflected by the clusters are distinct.
In the next section, we will introduce the method for
region selection by analyzing the property of each cluster,
so as to obtain more reliable information about “what the
objects look like and what the backgrounds look like”.
4. Regional saliency computation for region selection
For the generated $K$ clusters, we compute three kinds of
regional saliency values, i.e., regional color contrast saliency, regional boundary prior saliency and color spatial
distribution. Different from the definition of distribution in
[39], we model the color distribution based on the statistics
of the generated clusters. The statistics of each region are
represented using a GMM, and the color distribution is
modeled as how widely the colors contained in a cluster are
spread over the whole image.
4.1. Representation of color statistics using GMM
To represent the color statistics in all the regions, we choose
RGB colors as the local features to describe the color information
of each region $l \in [RS_1, \dots, RS_K]$, and they are modeled using a
Gaussian mixture model (GMM). Let the color models be
represented by the GMM $\{\alpha_c, \mu_c, \Sigma_c\}_{c=1}^{C}$ in the color space,
where $\{\alpha_c, \mu_c, \Sigma_c\}$ contains the weight, the mean color and
the covariance matrix of the $c$-th component. For the pixels in
each region, a set of GMM parameters is learned. The
Gaussian mixture distribution can be written as
\[ V(I_x \mid l) = \sum_{c} \alpha_{cl}\, \mathcal{N}(I_x \mid \mu_{cl}, \Sigma_{cl}), \quad l \in [RS_1, \dots, RS_K], \tag{7} \]
where $\alpha_{cl}$, $\mu_{cl}$ and $\Sigma_{cl}$ represent the weight, the mean color and
the covariance matrix of the $c$-th component learned from the
pixels in region $l$ respectively. The parameters of a GMM can
be obtained by maximizing the log likelihood function of the
GMM using techniques such as gradient-based optimization or the
expectation–maximization (EM) algorithm. In our experiments,
GMMs with five components are used to represent the color
statistics of each cluster and the EM algorithm [40] is applied
to estimate the parameters of the GMMs.
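To make Eq. (7) concrete, the sketch below evaluates a mixture density for one RGB pixel. It rests on simplifying assumptions that are not in the paper: each covariance matrix is taken to be diagonal so that no matrix inversion is needed, the parameters are written by hand rather than fitted by EM, and only two components are used instead of five. All names and numbers are hypothetical.

```python
import math

def gaussian_diag(x, mean, var):
    """Density of a Gaussian with diagonal covariance; `var` holds the
    per-channel variances (an assumption, the paper uses full covariances)."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)

def gmm_density(x, components):
    """Eq. (7): V(I_x | l) = sum over c of alpha_c * N(I_x | mu_c, Sigma_c).
    `components` is a list of (weight, mean, variances) tuples."""
    return sum(a * gaussian_diag(x, mu, var) for a, mu, var in components)

# Hypothetical two-component model of a reddish cluster (weights sum to 1).
reddish = [
    (0.6, (0.9, 0.1, 0.1), (0.01, 0.01, 0.01)),
    (0.4, (0.8, 0.2, 0.2), (0.02, 0.02, 0.02)),
]
v_red = gmm_density((0.9, 0.1, 0.1), reddish)
v_green = gmm_density((0.1, 0.9, 0.1), reddish)
```

A red pixel scores far higher under this model than a green one, which is the property the later likelihood computation relies on.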
Fig. 3. Illustration of the process of clustering using saliency driven K-means. (a) Original image; (b) the combined saliency map; (c)–(k) are the separated
nine regions (marked in blue); (l) clustering map, different clusters are labeled different colors. (For interpretation of the references to color in this figure
caption, the reader is referred to the web version of this article.)
Fig. 4. Illustration of the process for region selection. Brighter pixels represent higher measurement values. (a) Original image; (b) color contrast saliency;
(c) boundary prior saliency; (d) Scb; (e) the boundary of regions; (f) to display regions with different colors; (j) the pixels in region PSR are labeled blue; and
(h) the pixels in region BK are labeled red. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of
this article.)
4.2. Definition of regional saliency values
Regional color contrast saliency: The regional color
contrast saliency is defined as
\[ \mathrm{color}(i) = \frac{\sum_{x \in RS_i} S_x^{con}}{|RS_i|}, \quad i \in \{1, \dots, K\}. \tag{8} \]
Regions with higher average saliency values are more
likely to be contained in a salient object.
Regional boundary prior saliency: The regional boundary prior saliency of region $RS_i$ is defined as
\[ \mathrm{bound}(i) = \frac{\sum_{x \in RS_i} S_x^{Boundary}}{|RS_i|}, \quad i \in \{1, \dots, K\}. \tag{9} \]
Color spatial distribution: The color spatial distribution
$\mathrm{csd}(i)$ of region $RS_i$ describes the spatial distribution of the
color information contained in region $RS_i$. It is computed from
the spatial variance of $RS_i$'s color distribution:
\[ V_h(i) = \frac{1}{|X|_i} \sum_{x=1}^{N} V(I_x \mid l = RS_i)\, |x_h - S_h(i)|^2, \]
\[ S_h(i) = \frac{1}{|X|_i} \sum_{x=1}^{N} V(I_x \mid l = RS_i)\, x_h, \tag{10} \]
where $x_h$ is the x-coordinate of pixel $x$ and $|X|_i = \sum_{x=1}^{N} V(I_x \mid l = RS_i)$.
The saliency value is interpreted as how
widely the pixels are distributed. $V_h(i)$ is the horizontal
variance of the spatial position of pixels in the image. The
vertical variance $V_v(i)$ is similarly defined. $VP(i) = V_h(i) + V_v(i)$,
and $VP$ is normalized to [0,1] over all regions. Then,
the color spatial distribution of region $RS_i$ is defined as
\[ \mathrm{csd}(i) = 1 - VP(i). \tag{11} \]
The pixel level saliency values are defined according to
the regional saliency values color, bound and csd:
\[ M_{color}(x) = \mathrm{color}(j), \quad x \in RS_j, \; x \in [1, 2, \dots, N], \]
\[ M_{bound}(x) = \mathrm{bound}(j), \quad x \in RS_j, \; x \in [1, 2, \dots, N], \]
\[ M_{csd}(x) = \mathrm{csd}(j), \quad x \in RS_j, \; x \in [1, 2, \dots, N], \tag{12} \]
where $N$ is the size of the image. Then a pixel level weight map
$W_{map} = M_{color} \cdot M_{csd} \cdot M_{bound}$ is constructed by combining the
three pixel level saliency maps nonlinearly, and $W_{map}$ is
normalized into [0,1] through $W_{map} = (W_{map} - \min(W_{map})) / (\max(W_{map}) - \min(W_{map}))$.
We take $W_{map}$ as a prior probability
inferred from the contrast prior, the color distribution prior and the
boundary prior. We define the set of potential salient
pixels as PSR and the set of background pixels as BK. $P_{PSR}$
represents the prior probability with respect to potential
salient regions, with $P_{PSR}(I_p) = W_{map}(p)$. $P_{BK}$ is defined as the
prior probability with respect to the background, with
$P_{BK}(I_p) = 1 - P_{PSR}(I_p)$.
For the image in Fig. 3, which is separated into nine regions,
the three measures computed for the nine regions are
\[ \mathrm{color} = [0.1597, 0.6226, 0.6662, 0.6673, 0.7313, 0.7376, 0.7566, 0.8759, 0.9456], \]
\[ \mathrm{bound} = [0.0947, 0.2579, 0.4273, 0.5568, 0.6299, 0.7549, 0.8618, 0.8660, 0.9412], \]
\[ \mathrm{csd} = [0.8080, 0.8007, 0.8385, 0.9062, 0.9154, 0.8930, 0.9324, 0.9518, 0.9540]. \tag{13} \]
4.3. Region selection
The performance of the statistical information extraction
depends on how correctly the GMMs model the
foreground and background objects. We therefore propose a
method for selecting foreground and background regions.
The image space is divided into a background region and a
potential salient region using adaptive threshold selection.
The adaptive threshold $\lambda$ is determined using OTSU [41]
and controls the process of region selection.
We define SR and SB as two sets of pixel indices:
$SB = \{ i \mid W_{map}(i) < \lambda \}$ consists of the indices of background
pixels and $SR = \{ i \mid W_{map}(i) \geq \lambda \}$ consists of the indices of
potential salient pixels. We compute a convex hull $C$ to
enclose all the potential salient pixels in SR. The initial set
of salient pixels is computed as $PSR = \{ i \mid i \in C \}$ and the
related set of background pixels is $BK = \{ i \mid i \notin C \}$. The procedure of region selection is displayed in Fig. 4.
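The weight map of Section 4.2 and the threshold split of this section can be sketched as follows, using the nine regional values reported in Eq. (13). This is a minimal sketch rather than the authors' implementation: the region sizes are hypothetical (the paper does not list them), the Otsu routine is a plain histogram version written for this example, and the convex hull step that turns SR into PSR is omitted.

```python
def min_max(values):
    """Min-max normalization to [0, 1]."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in values]

def otsu_threshold(values, bins=256):
    """Threshold in (0, 1) maximizing the between-class variance of a
    histogram of values in [0, 1]; values >= threshold form the upper class."""
    hist = [0] * bins
    for v in values:
        hist[min(bins - 1, int(v * bins))] += 1
    total = len(values)
    sum_all = sum(b * c for b, c in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(bins - 1):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return (best_t + 1) / bins

# Regional measures reported in Eq. (13) for the nine clusters of Fig. 3.
color = [0.1597, 0.6226, 0.6662, 0.6673, 0.7313, 0.7376, 0.7566, 0.8759, 0.9456]
bound = [0.0947, 0.2579, 0.4273, 0.5568, 0.6299, 0.7549, 0.8618, 0.8660, 0.9412]
csd = [0.8080, 0.8007, 0.8385, 0.9062, 0.9154, 0.8930, 0.9324, 0.9518, 0.9540]

# W_map = M_color * M_csd * M_bound, normalized; one value per region here.
weights = min_max([c * d * b for c, d, b in zip(color, csd, bound)])

# Hypothetical region sizes (pixels per cluster), needed for a pixel-level histogram.
sizes = [400, 120, 90, 80, 70, 60, 60, 50, 70]
pixels = [w for w, n in zip(weights, sizes) for _ in range(n)]

lam = otsu_threshold(pixels)
salient_regions = [i for i, w in enumerate(weights) if w >= lam]        # SR
background_regions = [i for i, w in enumerate(weights) if w < lam]      # SB
```

Since every pixel of a cluster shares the same weight, thresholding the pixel-level map is equivalent to thresholding the per-region weights, which is what the last two lines do.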
5. Statistic saliency generation
In the process of our algorithm, two GMM models of the form shown
in Eq. (7) are trained over the pixels in the sets PSR and BK
respectively. The pixels in PSR tend to be contained in a
salient object while the pixels in BK are more likely to be part of
the background. A pixel's similarity with the salient region PSR
is defined as $P_{gmm,s}(I_x \mid PSR) = V(I_x \mid l = PSR)$. Similarly, the similarity with the background region is $P_{gmm,s}(I_x \mid BK) = V(I_x \mid l = BK)$.
For pixel $p$, the likelihood is determined by its similarity to the
pixels in the salient region and its difference from the pixels in the
background region. The normalized likelihood probability,
which expresses how probable it is that the observed data at
pixel $p$ is salient, is
\[ P_{likli}(I_p \mid PSR) = \frac{P_{gmm,s}(I_p \mid PSR)}{P_{gmm,s}(I_p \mid PSR) + P_{gmm,s}(I_p \mid BK)}. \tag{14} \]
Similarly, $P_{likli}(I_p \mid BK) = 1 - P_{likli}(I_p \mid PSR)$. From the Bayesian
framework, the posterior probability at pixel $p$ is
\[ P_{posterior}(PSR \mid I_p) = \frac{P_{likli}(I_p \mid PSR) \cdot P_{PSR}(I_p)}{Z}, \tag{15} \]
where $Z$ is the normalization constant, which ensures that the
posterior distribution on the left-hand side is a valid probability density and integrates to one:
$Z = P_{likli}(I_p \mid PSR) \cdot P_{PSR}(I_p) + P_{likli}(I_p \mid BK) \cdot P_{BK}(I_p)$. Finally, the statistic saliency (also
called GMM saliency) is defined as
\[ S_{gmm}(I_p) = P_{posterior}(PSR \mid I_p). \tag{16} \]
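Eqs. (14)–(16) reduce to a few arithmetic steps per pixel, sketched below. The function name and the guards against zero denominators are additions for this example and are not described in the paper; the two density arguments are assumed to come from the GMMs trained on PSR and BK.

```python
def gmm_saliency(p_fg, p_bg, prior_fg):
    """Posterior saliency of one pixel.

    p_fg, p_bg : GMM densities V(I_p | l = PSR) and V(I_p | l = BK)
    prior_fg   : prior P_PSR(I_p) = W_map(p), a value in [0, 1]
    """
    total = p_fg + p_bg
    # Eq. (14): normalized likelihood (0.5 if both densities vanish)
    lik_fg = p_fg / total if total > 0 else 0.5
    lik_bg = 1.0 - lik_fg
    # Normalization constant Z of Eq. (15)
    z = lik_fg * prior_fg + lik_bg * (1.0 - prior_fg)
    if z == 0:
        return 0.0
    # Eqs. (15) and (16): posterior probability of the salient class
    return lik_fg * prior_fg / z

s_strong = gmm_saliency(3.0, 1.0, 0.5)   # likelihood favours the object
s_neutral = gmm_saliency(1.0, 1.0, 0.8)  # uninformative likelihood
```

With an uninformative likelihood the posterior falls back to the prior, and with an uninformative prior of 0.5 it equals the normalized likelihood.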
Fig. 5. Experimental results on MSRA-1000 database. (a) Average precision–recall curves using different approaches and (b) average precision,
recall and F-measure using different approaches with adaptive thresholding.
6. Experiments
The empirical analysis is carried out on two popular
saliency databases: MSRA-1000 [42] and the Berkley-300
database [43]. For quantitative comparison, the precision
and recall rates of various models are evaluated. For a
given threshold $T$, the precision and recall rates of a certain
saliency detection method are defined as
\[ \mathrm{Precision}(T) = \frac{1}{INUM} \sum_{i=1}^{INUM} \frac{|M_i(T) \cap G_i|}{|M_i(T)|}, \qquad \mathrm{Recall}(T) = \frac{1}{INUM} \sum_{i=1}^{INUM} \frac{|M_i(T) \cap G_i|}{|G_i|}. \tag{17} \]
In Eq. (17), $M_i(T)$ is the binary mask obtained by directly
thresholding the saliency map using threshold $T$ on the $i$-th image. $G_i$ is the ground truth. $|\cdot|$ denotes the mask's
sum area. $INUM$ is the number of images in a database. In
addition to precision–recall (PR) curves, for each image, we
follow [42,15] to segment a saliency map by the adaptive
threshold
\[ T_s = \min\left( \frac{2 \sum_{i=1}^{N} V_i}{N},\; T_{max} \right), \tag{18} \]
where $N$ denotes the number of pixels in the saliency map
and $i$ is the pixel index. $V_i$ is the saliency value at pixel $i$.
$T_{max}$ is the upper bound for the saliency value. In the
experiment, the saliency values are projected into the
range [0,255] and we set $T_{max} = 255$.
Then the precision, recall and F-Measurement values are
computed over the ground truth maps, where the F-Measurement is defined as
\[ F_\beta(T) = \frac{(1 + \beta^2)\, \mathrm{Precision}(T)\, \mathrm{Recall}(T)}{\beta^2\, \mathrm{Precision}(T) + \mathrm{Recall}(T)}, \tag{19} \]
where $\beta = 0.3$. Besides, we evaluate the strategy of saliency
driven clustering (introduced in Section 3) and the regional saliency measures (introduced in Section 4).
6.1. Evaluation on MSRA-1000 dataset
First, we generate the saliency maps for all 1000 testing
images using the proposed saliency model. The saliency
detection performance of the proposed saliency model
(abbreviated GMMS) is compared with ten state-of-the-art saliency models: IT (Itti's model) [9], LC
(Luminance Contrast) [44], SR (Spectral Residual) [23], FT
(Frequency-tuned) [42], HC (Histogram Contrast) [16], RC
(Region Contrast) [16], SF (Saliency Filters) [15], CA (Context-aware model) [22], SC (Superpixel-based Contrast)
[17] and SCMS (spatial-color constraint and multi-scale
segmentation) [45].
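The measures behind this comparison, Eqs. (17)–(19), can be sketched for a single image as follows; averaging the per-image precision and recall over all INUM images gives the database-level values of Eq. (17). The function names and the toy saliency values are hypothetical, and thresholding with "greater than or equal" is an assumption, since the text does not state how ties at the threshold are handled.

```python
def adaptive_threshold(saliency, t_max=255):
    """Eq. (18): twice the mean saliency value, capped at t_max."""
    return min(2.0 * sum(saliency) / len(saliency), t_max)

def precision_recall(mask, truth):
    """Per-image terms of Eq. (17) for a binary mask and a binary ground truth."""
    inter = sum(1 for m, g in zip(mask, truth) if m and g)
    n_mask = sum(1 for m in mask if m)
    n_truth = sum(1 for g in truth if g)
    precision = inter / n_mask if n_mask else 0.0
    recall = inter / n_truth if n_truth else 0.0
    return precision, recall

def f_measure(precision, recall, beta=0.3):
    """Eq. (19), with beta = 0.3 as stated in the text."""
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom else 0.0

# Hypothetical flattened saliency map in [0, 255] and its ground truth.
saliency_map = [200, 180, 40, 10, 0, 0, 220, 30]
truth = [1, 1, 0, 0, 0, 0, 1, 1]

t_s = adaptive_threshold(saliency_map)
mask = [1 if v >= t_s else 0 for v in saliency_map]  # assumed ">=" rule
precision, recall = precision_recall(mask, truth)
f_beta = f_measure(precision, recall)
```

Sweeping a fixed threshold T from 0 to 255 instead of using `adaptive_threshold` would give the points of a precision–recall curve.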
The saliency maps of the state-of-the-art works excluding SC, SF and SCMS are provided in [16]
(http://cg.cs.tsinghua.edu.cn/people/ cmm/Saliency/Index.htm). The SF [15]
saliency maps are obtained from the author's webpage
(http://www.fedeperazzi.com/saliency_filters/).
The SCMS [45] saliency maps are downloaded from the
author's webpage (http://ivipc.uestc.edu.cn/lfxu/). The results of SC are generated using
the code provided by the author. In Fig. 5(a), we compare our
precision–recall curve with those of the other methods. Compared
with the other approaches, the proposed GMMS demonstrates
the highest precision at all recall
rates ranging from 0 to 1. Fig. 5(b) displays the average
precision, recall and F-measure values in the adaptive
threshold experiment. Among all the approaches, GMMS
achieves the highest precision, recall and F-measure values.
Besides the quantitative
evaluation, our method is visually compared with six
methods, i.e., SCMS, HC, RC, SR, SF and SC, and the results
for some testing images are displayed in Fig. 6. Brighter
pixels indicate higher saliency probabilities. Visually, it can
be seen that our GMMS obtains relatively higher quality
saliency maps than the compared state-of-the-art methods. The RC model sometimes highlights only
some parts of a salient object.
The methods like HC and FT are sensitive to background
noise and they often fail to identify background patches
correctly. Compared with those methods, GMMS achieves
better performance. For images with clear contrast
between the salient object and the background (see
Fig. 6), GMM saliency can generate more stable and
reliable saliency information in cluttered background.
However, some parts of the salient object may be assigned
relatively low saliency probabilities and hence the salient
objects are incompletely extracted (see the 2nd example,
where the saliency values in the black regions of the cell phone
are computed incorrectly because black is not the
dominant color in the statistic model). The results can be
further improved by refining the GMM saliency maps, so
as to obtain smoother saliency maps.
Fig. 6. Subjective comparison of our saliency map with six state-of-the-art methods on the MSRA database. (a) Original image; saliency maps of (b) SCMS;
(c) FT; (d) HC; (e) RC; (f) SF; (g) SC; (h) GMMS and (l) ground truth. (For interpretation of the references to color in this figure caption, the reader is referred
to the web version of this article.)
6.2. Evaluation on Berkley-300 database
The Berkley-300 database is a more challenging database which contained 300 images with more complex
background or multiple objects of different sizes and
positions. The foreground masks are provided by [18] as
the ground truth. We compare the curve of our approach
with LC [44], SR [23], FT [42], HC [16], RC [16], CA [22] and
SC [17]. The PR curves are shown in Fig. 7(a) and the
average precision, recall and F-measure values are displayed in Fig. 7(b). GMMS achieves the best performance
both in the terms of PR curve and the adaptive segmentation experiments. It is observed from the visual comparison in Fig. 8 that GMMS performs better in highlighting
salient objects and suppressing background clutter under
various condition, such as for images with texture (the 6th
and 7th examples), images with weak boundary (the 5th
and 10th examples), image with small objects (the 9th
example) or color salient objects which are similar to part
of image's background (the 3rd and 4th examples) (Fig. 9).
6.3. Evaluation of saliency driven clusters
Fig. 7. Experimental results on Berkley-300 database. (a) Average precision–recall curves using different approaches and (b) average precision,
recall and F-measure using different approaches with adaptive thresholding.
examples in the 3rd and 8th rows in Fig. 6), the salient
objects can be located well in most of the saliency maps.
For images with complex background scenes containing
structure or texture pattern or images with relatively
low contrast boundary between salient objects and the
surrounding background regions (see the 1st, 6th 7th and
9th example in Fig. 6), the quality of color contrast based
model (HC, RC, SCMS and SC) is obviously degraded. In
contrast, GMMS can locate and highlight the object region
more correctly, and suppress background noise correctly
by utilizing the Bayesian framework integrating the region
based saliency values as priority. For example, in the 6th
picture listed in Fig. 6, our model can label the boundary
correctly although both of the background and object have
blue color. Because we integrate the boundary priority into
process of clustering and the structural information (the
generated clusters) may provide more reliable cues about
“what the object looks like”, even though the pixels near
the boundary have similar color with the object. The
methods (such as HC, RC) that use only color features (like
color contrast) are sensitive to background noise in a
complex scene (see the 1st and 3th example images in
From the aspect of saliency computation, good clusters should generate a satisfactory separation of salient
objects and background. To quantitatively evaluate the clustering results, we report our method's scores on
two criteria: (1) Variation of Information (VoI) [46], which computes the information in one result that is
not contained in the other; and (2) Global Consistency Error (GCE) [47], which measures the extent to which
one segmentation is a refinement of the other. The results are obtained on the Berkeley Segmentation
Database [47]. We compare the average scores of our clustering strategy (abbreviated SDC) with Ncut [48],
Mean-shift [49], Normalized Tree Partitioning (NTP) [50] and JSEG. The scores are listed in Table 1. SDC
obtains the lowest VoI with a value of 1.8360 and the second lowest GCE with a value of 0.1955 across the
test dataset, which means that SDC can separate pixels into different clusters according to their saliency
level with high precision. Moreover, it takes less than 1 s for SDC to process a typical 400 × 300 image.
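The two criteria above can be illustrated with a short sketch that follows their standard definitions in [46] and [47]. It is not the benchmark code used for Table 1: segmentations are assumed to be given as flat lists of integer labels (one per pixel), the natural logarithm is an assumed choice of base for VoI, and the function names are hypothetical.

```python
import math
from collections import Counter


def variation_of_information(seg_a, seg_b):
    """VoI = H(A|B) + H(B|A) for two labelings of the same pixels."""
    n = len(seg_a)
    count_a, count_b = Counter(seg_a), Counter(seg_b)
    count_ab = Counter(zip(seg_a, seg_b))
    voi = 0.0
    for (label_a, label_b), n_ab in count_ab.items():
        # Each overlapping region pair contributes to both conditional entropies.
        voi -= (n_ab / n) * (math.log(n_ab / count_a[label_a])
                             + math.log(n_ab / count_b[label_b]))
    return voi


def global_consistency_error(seg_a, seg_b):
    """GCE: the smaller of the two one-directional refinement errors, averaged over pixels."""
    n = len(seg_a)
    count_a, count_b = Counter(seg_a), Counter(seg_b)
    count_ab = Counter(zip(seg_a, seg_b))
    # For a pixel in regions (a, b) the local error is |R_a \ R_b| / |R_a|;
    # all n_ab pixels that share the pair (a, b) have the same local error.
    err_ab = sum(n_ab * (count_a[a] - n_ab) / count_a[a]
                 for (a, b), n_ab in count_ab.items())
    err_ba = sum(n_ab * (count_b[b] - n_ab) / count_b[b]
                 for (a, b), n_ab in count_ab.items())
    return min(err_ab, err_ba) / n


coarse = [0, 0, 0, 0]      # one region
fine = [0, 0, 1, 1]        # the same pixels split into two regions
crossed = [0, 1, 0, 1]     # a split that cuts across `fine`

voi_refinement = variation_of_information(coarse, fine)
gce_refinement = global_consistency_error(coarse, fine)
gce_crossed = global_consistency_error(fine, crossed)
```

Because `fine` is a pure refinement of `coarse`, its GCE is zero even though the VoI is not; this is why the two scores are reported together. Lower values are better for both.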
Fig. 8. Subjective comparison of our saliency map with six state-of-the-art methods on the Berkeley-300 database. (a) Original images; (b) FT; (c) HC; (d)
RC; (e) SR; (f) SC; (g) CA; (h) GMMS; and (i) the ground truth.
Fig. 9. Illustration of the clusters generated. The images with cluster boundaries are listed in the first row. In the second row, clusters are labeled with
different colors. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)
6.4. The role of measures for regional saliency computation
Three regional saliency measures based on the generated clusters, namely the regional color contrast
saliency (Eq. (8)), the regional boundary prior saliency (Eq. (9)) and the color spatial distribution
(Eq. (10)), are computed. We have explored the effectiveness of each individual measure and of the
combination strategies on the MSRA-1000 dataset. The PR
curves for comparison are displayed in Fig. 10, and some meaningful conclusions can be drawn. On one hand,
the regional computation over clusters leads to performance improvement for color contrast saliency and
boundary prior saliency (Fig. 10(a)). The improvement mainly lies in the integration of the visual patterns
of images. On the other hand, the non-linear combinations of M_col, M_bound and M_csd are evaluated as well.
The performance of four kinds of combinations, M_col * M_bound, M_col * M_csd, M_bound * M_csd and
M_col * M_bound * M_csd, is evaluated on the MSRA-1000 dataset and the results are shown in Fig. 10(b). The
combination M_bound * M_csd achieves significant improvement over its individual components. Even though
the combinations M_col * M_bound, M_col * M_csd and M_col * M_bound * M_csd only improve the performance
slightly over that of M_col according to the ROC curves, the complementary strengths of each saliency map
contribute to better performance. Fig. 11 shows some combination results visually. The combination can
improve the performance over any of the individual measures. In summary, the improvement mainly lies in
that the saliency measures can either suppress background clutter or highlight salient objects, which can
be explained by the examples listed in Fig. 11. First, the color spatial distribution and boundary prior
contribute to suppressing widely distributed background clutter. As shown in images 2, 5, 6 and 8 of
Fig. 11, widely distributed background noise tends to have a lower csd value (Eq. (11)). Moreover,
background clutter is most likely to be near the boundary, and the boundary prior we use can contribute to
suppressing the background noise. The results of image 2 and image 6 show that background clutter can be
suppressed by combining the boundary prior. In image 5, the boundary prior saliency fails to locate the
object; however, the nonlinear combination of the three measures can still generate a satisfactory result.
Second, even if the computed color spatial distribution is not robust for images with complex background
scenes, it still has useful characteristics such as highlighting the salient objects or enhancing the
saliency difference between salient objects and backgrounds (image 1, image 7 and image 8 in Fig. 11).
Table 1
The VoI and GCE scores for evaluating the generated clusters.
Method         VoI      GCE
Ncut           2.9061   0.2232
Mean-shift     1.9725   0.1888
NTP            2.4954   0.2373
JSEG           2.3217   0.1989
Proposed SDC   1.8360   0.1955
Fig. 10. Comparison of the individual components on the MSRA dataset. (a) Comparison of three measurements for regional saliency computation. (b)
Comparison of the combination strategy. (c) Enlarged image of subfigure (b) in low recall region. (d) Enlarged image of subfigure (b) in high recall region.
Fig. 11. Illustration of three measures for regional saliency computation. From left to right: original image; ground truth; color contrast saliency; regional
color saliency; boundary prior saliency; regional boundary saliency; color spatial distribution; combinations M_color * M_csd, M_bound * M_color and M_bound * M_csd; and
the combination of all three measures. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)
The contributions of the extracted visual patterns to saliency performance improvement can be summarized in
two aspects. On one hand, the regional computation of color contrast saliency and boundary prior saliency
achieves considerable performance improvement over the individual measures (Fig. 10(a)). On the other hand,
the color spatial distribution, which is calculated over the statistics of the clusters, provides meaningful
complementary cues for salient object detection and background clutter suppression (visual examples listed
in Fig. 11).
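To make the complementary behaviour of the three measures concrete, the sketch below combines per-cluster scores multiplicatively. It is only an illustration under stated assumptions: the exact combination rule is defined by the equations earlier in the paper and is not reproduced in this section, so the min-max normalization, the plain element-wise product, the helper names and the toy values are all hypothetical.

```python
def normalize(values):
    """Min-max normalize a list of scores to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def combine(*measures):
    """Assumed non-linear combination: element-wise product of normalized measures."""
    combined = [1.0] * len(measures[0])
    for measure in measures:
        combined = [c * v for c, v in zip(combined, normalize(measure))]
    return normalize(combined)


# Toy scores for three clusters (cluster 0 is the salient object).
m_col = [0.9, 0.5, 0.2]     # regional color contrast saliency
m_bound = [0.8, 0.1, 0.3]   # regional boundary prior saliency
m_csd = [0.7, 0.6, 0.1]     # color spatial distribution

combined_scores = combine(m_col, m_bound, m_csd)
```

In this toy case cluster 1 has a moderate color contrast score but the lowest boundary prior score, so the product removes it; cluster 2 is removed by its low contrast and csd scores. A product only keeps clusters that all measures agree on, which is one way to read the background suppression effect described above.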
6.5. Computation cost
It takes about 9.3 s for our method to process a typical 400 × 300 image on our 2.7 GHz Pentium Dual-Core
machine with 2 GB RAM. The computation of color contrast saliency costs 2.2 s, of which 1 s is spent on SLIC
segmentation and 1.2 s on the superpixel-based saliency computation (Eq. (1)). Computation of the boundary
prior saliency and the combination takes about 0.8 s. The saliency driven clustering takes about 1.3 s. It
costs 4.5 s to compute the regional color contrast saliency, color spatial distribution and regional
boundary prior saliency. The most time-consuming step is the computation of the GMM parameters (Eq. (7))
for the color spatial distribution: it costs about 0.4 s to estimate the parameters for one cluster, so it
generally takes about 4 s for an image separated into 10 clusters. The final step of GMM saliency
computation (Eq. (16)) only costs 0.5 s.
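As a quick consistency check, the per-stage timings reported above can be added up and compared with the quoted total. The dictionary keys are just descriptive labels chosen for this sketch; the numbers are the ones given in the text.

```python
# Per-stage running times in seconds, as reported in the text.
stage_seconds = {
    "color contrast saliency (SLIC + superpixel saliency)": 1.0 + 1.2,
    "boundary prior saliency and combination": 0.8,
    "saliency driven clustering": 1.3,
    "regional measures (includes GMM parameter estimation)": 4.5,
    "final GMM saliency computation": 0.5,
}

total_seconds = sum(stage_seconds.values())

# GMM parameter estimation: about 0.4 s per cluster, 10 clusters per image.
gmm_fit_seconds = 0.4 * 10

# Share of the total spent on the regional measures stage.
regional_share = stage_seconds["regional measures (includes GMM parameter estimation)"] / total_seconds
```

The stages sum to the reported 9.3 s, and the roughly 4 s of GMM fitting accounts for most of the 4.5 s regional measures stage, which is itself close to half of the total.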
7. Conclusion
We have presented a saliency detection framework that models “what the salient objects look like” and “what
the background should look like” using image statistics. To exhibit the diverse and meaningful visual
pattern information of natural images, we propose a saliency driven clustering method based on the
combination of contrast prior saliency and boundary prior saliency. To incorporate the visual pattern
information of images into the saliency model, the clusters are used for computing the color spatial
distribution, region based color contrast saliency values and region based boundary prior saliency values.
Then, the salient region GMM and background GMM are constructed from the separated salient and background
regions using an adaptive threshold, which is computed over the combined regional saliency values. Finally,
a Bayesian model is applied to generate high quality, full resolution saliency maps. Experimental results
on the most popular datasets indicate the advantages of our method over other state-of-the-art approaches
in highlighting salient objects and suppressing cluttered backgrounds. The comparison experiments also
indicate the advantages of building a saliency model based on the visual pattern information of images,
such as clusters. Since only a simple clustering method based on the color contrast and boundary prior
saliency maps is used for saliency detection, we will explore a more effective clustering strategy for
generating semantic regions in our future work.
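The final Bayesian step summarized above can be sketched with Bayes' rule applied to two mixture models. This is a simplified illustration rather than the paper's Eq. (16), which is not reproduced in this section: a one-dimensional feature stands in for the pixel color, the mixture parameters are made-up values instead of fitted ones, and all function names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mixture_likelihood(x, components):
    """Likelihood under a mixture given as (weight, mean, variance) triples."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in components)


def bayesian_saliency(x, prior, salient_gmm, background_gmm):
    """Posterior probability that a pixel with feature x is salient.

    prior: regional saliency value of the pixel's region, used as P(salient).
    """
    p_salient = prior * mixture_likelihood(x, salient_gmm)
    p_background = (1.0 - prior) * mixture_likelihood(x, background_gmm)
    evidence = p_salient + p_background
    # Fall back to the prior when neither model explains the pixel.
    return p_salient / evidence if evidence > 0.0 else prior


# Made-up models: a bright salient object on a darker, two-tone background.
salient_model = [(1.0, 0.8, 0.01)]
background_model = [(0.5, 0.2, 0.01), (0.5, 0.4, 0.01)]

bright_pixel = bayesian_saliency(0.8, 0.5, salient_model, background_model)
dark_pixel = bayesian_saliency(0.2, 0.5, salient_model, background_model)
```

With an uninformative prior of 0.5, the bright pixel receives a posterior close to 1 and the dark pixel a posterior close to 0. A pixel whose appearance is explained equally well by both models keeps its regional prior unchanged, which is where the region based saliency values matter most.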
Acknowledgments
This research is partly supported by NSFC, China (No. 61273258, 61375048) and the Ph.D. Programs Foundation
of the Ministry of Education of China (No. 20120073110018).
References
[1] H. Fu, Z. Chi, D. Feng, Attention-driven image interpretation with
application to image retrieval, Pattern Recognit. 39 (9) (2006)
1604–1621.
[2] L. Itti, Automatic foveation for video compression using a neurobiological model of visual attention, IEEE Trans. Image Process. 13 (10)
(2004) 1304–1318.
[3] D. Walther, C. Koch, Modeling attention to salient proto-objects,
Neural Netw. 19 (9) (2006) 1395–1407.
[4] A. Oliva, A. Torralba, et al., The role of context in object recognition, Trends Cogn. Sci. 11 (12) (2007) 520–527.
[5] P.L. Rosin, A simple method for detecting salient regions, Pattern
Recognit. 42 (11) (2009) 2363–2371.
[6] J. Qin, N.H. Yung, Scene categorization via contextual visual words,
Pattern Recognit. 43 (5) (2010) 1874–1888.
[7] Y.T. Wu, F.Y. Shih, J. Shi, Y.T. Wu, A top-down region dividing
approach for image segmentation, Pattern Recognit. 41 (6) (2008)
1948–1960.
[8] J. Xue, L. Wang, N. Zheng, G. Hua, Automatic salient object
extraction with contextual cue and its applications to recognition
and alpha matting. http://www.sciencedirect.com/science/article/pii/S0031320313001581, 2013.
[9] L. Itti, C. Koch, E. Niebur, A model of saliency-based visual attention
for rapid scene analysis, IEEE Trans. Pattern Anal. Mach. Intell. 20
(11) (1998) 1254–1259.
[10] W. Einhäuser, P. König, Does luminance-contrast contribute to a
saliency map for overt visual attention? Eur. J. Neurosci. 17 (5)
(2003) 1089–1097.
[11] D. Parkhurst, K. Law, E. Niebur, et al., Modeling the role of salience in
the allocation of overt visual attention, Vis. Res. 42 (1) (2002)
107–124.
[12] L. Itti, N. Dhavale, F. Pighin, Realistic avatar eye and head animation
using a neurobiological model of visual attention, in: SPIE's 48th
Annual Meeting, International Society for Optics and Photonics,
Optical Science and Technology, 2004, pp. 64–78.
[13] O. Le Meur, P. Le Callet, D. Barba, D. Thoreau, A coherent computational approach to model bottom-up visual attention, IEEE Trans.
Pattern Anal. Mach. Intell. 28 (5) (2006) 802–817.
[14] J. Harel, C. Koch, P. Perona, Graph-based visual saliency, in: 2006
Conference on Neural Information Processing Systems (NIPS), 2006.
[15] F. Perazzi, P. Krahenbuhl, Y. Pritch, A. Hornung, Saliency filters:
contrast based filtering for salient region detection, in: 2012 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR),
IEEE, 2012, pp. 733–740.
[16] M.M. Cheng, G.X. Zhang, N.J. Mitra, X. Huang, S.M. Hu, Global
contrast based salient region detection, in: 2011 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), IEEE, 2011,
pp. 409–416.
[17] K. Fu, C. Gong, J. Yang, Y. Zhou, Salient object detection via color
contrast and color distribution, in: 2012 IEEE 11th Asian Conference
on Computer Vision (ACCV), IEEE, 2012.
[18] Y. Wei, F. Wen, W. Zhu, J. Sun, Geodesic saliency using background
priors, in: Computer Vision–ECCV 2012, Springer, 2012, pp. 29–42.
[19] W. Zhang, Q. Wu, G. Wang, H. Yin, An adaptive computational model
for salient object detection, IEEE Trans. Multimed. 12 (4) (2010)
300–316.
[20] Y. Xie, H. Lu, M. Yang, Bayesian saliency via low and mid level cues,
IEEE Trans. Image Process. 22 (5) (2012) 1689–1698.
[21] Z. Liang, Z. Chi, H. Fu, D. Feng, Salient object detection using content-sensitive hypergraph representation and partitioning, Pattern
Recognit. 45 (11) (2012) 3886–3901.
[22] S. Goferman, L. Zelnik-Manor, A. Tal, Context-aware saliency detection,
IEEE Trans. Pattern Anal. Mach. Intell. 34 (10) (2012) 1915–1926.
[23] X. Hou, L. Zhang, Saliency detection: a spectral residual approach, in:
2007 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), IEEE, 2007, pp. 1–8.
[24] C. Guo, L. Zhang, A novel multiresolution spatiotemporal saliency
detection model and its applications in image and video compression, IEEE Trans. Image Process. 19 (1) (2010) 185–198.
[25] N.D. Bruce, Features that draw visual attention: an information
theoretic perspective, Neurocomputing 65–66 (2005) 125–133.
[26] L. Itti, P. Baldi, Bayesian surprise attracts human attention, Vis. Res.
49 (10) (2009) 1295–1306.
[27] D.A. Klein, S. Frintrop, Center-surround divergence of feature statistics for salient object detection, in: 2011 IEEE International Conference on Computer Vision (ICCV), IEEE, 2011, pp. 2214–2219.
[28] W. Hou, X. Gao, D. Tao, X. Li, Visual saliency detection using
information divergence, Pattern Recognit. 46 (10) (2013) 2658–2669.
[29] V. Gopalakrishnan, Y. Hu, D. Rajan, Salient region detection by
modeling distributions of color and orientation, IEEE Trans. Multimed. 11 (5) (2009) 892–905.
[30] Z. Liu, R. Shi, L. Shen, Y. Xue, K.N. Ngan, Z. Zhang, Unsupervised
salient object segmentation based on kernel density estimation and
two-phase graph cut, IEEE Trans. Multimed. 14 (4) (2012) 1275–1289.
[31] L. Zhang, M.H. Tong, T.K. Marks, H. Shan, G.W. Cottrell, Sun: a
Bayesian framework for saliency using natural statistics, J. Vis. 8 (7)
(2008) 1–7.
[32] X.P. Hu, L. Dempere-Marco, E.R. Davies, Bayesian feature evaluation
for visual saliency estimation, Pattern Recognit. 41 (11) (2008)
3302–3312.
[33] W.F. Noh, P. Woodward, Slic (simple line interface calculation), in:
Proceedings of the Fifth International Conference on Numerical
Methods in Fluid Dynamics, June 28–July 2, Twente University,
Enschede, Springer, 1976, pp. 330–340.
[34] Y. Wei, F. Wen, W. Zhu, J. Sun, Geodesic saliency using background
priors, in: 2012 Europe Conference on Computer Vision, Springer,
2012, pp. 29–42.
[35] V. Gulshan, C. Rother, A. Criminisi, A. Blake, A. Zisserman, Geodesic
star convexity for interactive image segmentation, in: 2010 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR),
IEEE, 2010, pp. 3129–3136.
[36] L. Yatziv, A. Bartesaghi, G. Sapiro, O(n) implementation of the fast
marching algorithm, J. Comput. Phys. 212 (2) (2006) 393–399.
[37] J.A. Hartigan, M.A. Wong, Algorithm as 136: a k-means clustering
algorithm, J. R. Stat. Soc. Ser. C (Appl. Stat.) 28 (1) (1979) 100–108.
[38] T. Ohashi, Z. Aghbari, A. Makinouchi, Hill-climbing algorithm for
efficient color-based image segmentation, in: IASTED International
Conference on Signal Processing, Pattern Recognition, and Applications, 2003, pp. 17–22.
[39] T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang, H.Y. Shum, Learning
to detect a salient object, IEEE Trans. Pattern Anal. Mach. Intell. 33
(2) (2011) 353–367.
[40] T.K. Moon, The expectation-maximization algorithm, IEEE Signal
Process. Mag. 13 (6) (1996) 47–60.
[41] N. Otsu, A threshold selection method from gray-level histograms,
Automatica 11 (285–296) (1975) 23–27.
[42] R. Achanta, S. Hemami, F. Estrada, S. Susstrunk, Frequency-tuned
salient region detection, in: 2009 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), IEEE, 2009, pp. 1597–1604.
[43] V. Movahedi, J.H. Elder, Design and perceptual validation of performance measures for salient object segmentation, in: 2010 IEEE
Computer Society Conference on Computer Vision and Pattern
Recognition Workshops (CVPRW), IEEE, 2010, pp. 49–56.
[44] Y. Zhai, M. Shah, Visual attention detection in video sequences using
spatiotemporal cues, in: Proceedings of the 14th Annual ACM
International Conference on Multimedia, ACM, 2006, pp. 815–824.
[45] L. Xu, H. Li, L. Zeng, K.N. Ngan, Saliency detection using joint spatial-color constraint and multi-scale segmentation, J. Vis. Commun.
Image Represent 24 (4) (2014) 465–476.
[46] M. Meila, Comparing clusterings: an axiomatic view, in: Proceedings
of the 22nd International Conference on Machine Learning, ACM,
2005, pp. 577–584.
[47] D. Martin, C. Fowlkes, D. Tal, J. Malik, A database of human
segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics, in: Proceedings of Eighth IEEE International Conference on Computer
Vision, ICCV 2001, vol. 2, IEEE, 2001, pp. 416–423.
[48] J. Shi, J. Malik, Normalized cuts and image segmentation, IEEE Trans.
Pattern Anal. Mach. Intell. 22 (8) (2000) 888–905.
[49] D. Comaniciu, P. Meer, Mean shift: a robust approach toward feature
space analysis, IEEE Trans. Pattern Anal. Mach. Intell. 24 (5) (2002)
603–619.
[50] J. Wang, Y. Jia, X.S. Hua, C. Zhang, L. Quan, Normalized tree
partitioning for image segmentation, in: IEEE Conference on Computer Vision and Pattern Recognition, 2008, CVPR 2008, IEEE, 2008,
pp. 1–8.