This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
IEEE SYSTEMS JOURNAL
1
Stereoscopic Image Quality Assessment Based on
Depth and Texture Information
Xingang Liu, Member, IEEE, Kai Kang, and Yinbo Liu
Abstract—With the rapid growth of multimedia applications over networks in recent years, users have put forward much higher requirements for multimedia signals, particularly for the quality of experience (QoE) of image/video signals, than before. Objective image quality assessment (IQA) plays an important role in the development of compression standards and various multimedia applications. However, the quality assessment of stereoscopic (3-D) images faces new challenges, such as depth perception, virtual view synthesis, and asymmetric stereo compression. In this paper, we propose a new full-reference (FR) 3-D IQA method to measure the quality of distorted images. The properties of the depth, structure, and gradient components are taken into account to establish the proposed metric. The experimental results show that the proposed metric is highly consistent with subjective test scores compared with the existing related metrics. In addition, the main significance of the proposed metric is that it not only effectively evaluates the quality of 3-D images but is also effective for measuring the quality of 2-D images.
Index Terms—Depth map, full-reference (FR), image quality
assessment (IQA), quality of experience (QoE), three-dimensional
(3-D) image.
I. INTRODUCTION
In recent years, with the high-speed development of communication systems, multimedia applications over networks have grown rapidly in many fields, such as live broadcast, video on demand, real-time conferencing, and so on [1], [2]. People can access the network on any device, anywhere, and at any time. Consequently, users have put forward much higher requirements for the quality of multimedia signals than before; in particular, the video/image quality at the terminal is becoming the representative measure among the abundant emerging aspects of multimedia quality of experience (QoE). Currently, 3-D media services are widely used
Manuscript received November 13, 2014; revised February 23, 2015;
accepted March 26, 2015. The work was supported by the Natural Science
Foundation of China under Grant G0501020161301268, and also supported
by the Fundamental Research Funds for the Central Universities under Grant
ZYGX2015Z009.
X. Liu is with the School of Electronic Engineering, University of Electronic
Science and Technology of China, Chengdu 610051, China (e-mail: hanksliu@
uestc.edu.cn).
K. Kang was with the School of Electronic Engineering, University of
Electronic Science and Technology of China, Chengdu 610051, China. He is
now with the Huawei Research Center, Chengdu 611731, China.
Y. Liu was with the School of Electronic Engineering, University of Electronic Science and Technology of China, Chengdu 610051, China. He is now
with the Audio and Video Technology Platform Department, ZTE Corporation,
Shenzhen 518057, China.
Digital Object Identifier 10.1109/JSYST.2015.2478119
in areas such as homes, workplaces, and public spaces. Considered the most popular 3-D media services, 3DTV and free-viewpoint TV are expected to succeed high-definition TV (HDTV), bringing visually lifelike graphical effects and an extended visual sensation to end users [3], [4]. The delivery
chain of the whole 3DTV transmission system includes five stages: content production, encoding, transmission, decoding at the receiver side, and display. However, any stage may reduce the quality of the 3-D image signal [5]. Moreover, an error generated at a certain stage may propagate through the subsequent stages of the delivery chain [6]. In addition, the compression technologies for 3-D images are imperfect at present. Therefore, 3-D image quality assessment (IQA) is an extremely important step in the design and optimization of 3-D image processing systems over wireless networks. IQA, an evaluation technology for 2-D/3-D images that pass through a transmission or processing system, helps maintain the quality of media services by monitoring the quality of 2-D/3-D image signals [7].
Over the past decades, IQA for digital image signals has been one of the basic and challenging problems in the field of image and video processing. Subjective IQA (SIQA) is known as the most accurate method to evaluate the quality of 2-D/3-D images, since it is psychologically based, using structured experimental designs and human participants to rate image quality [8], [9]. However, the realization of SIQA is complex and impractical for real systems. Therefore, a great deal of attention has focused on objective IQA (OIQA), a technology that establishes an objective function considering both physical aspects and psychological issues to evaluate the quality of digital image signals [10]. Compared with SIQA, OIQA is simple and convenient, and it can automatically predict visual quality in real time [11]–[13]. Therefore, many efforts have been made to evaluate 2-D video/image quality over the last few decades, and a great number of corresponding metrics have been proposed; some of them have been commercialized. However, OIQA for 3-D images is still at a preliminary stage [14].
Among the multiple 3-D image formats, such as stereoscopic image, multiview image, depth-image-based rendering (DIBR), and so on, the stereoscopic image is the simplest one. It consists of two views (commonly named the left and right views) captured by two closely located cameras, where the slight difference between the two images is regarded as the disparity. By making use of the disparity information, human
beings can perceive the sensation of depth in a stereoscopic image [15]–[17]. Nowadays, it is the most commercialized 3-D image format and is widely used in 3-D movie theaters, glass-type 3DTVs, and so on [18]–[20]. For this reason, our research is carried out on the basis of the stereoscopic image.
According to the availability of the reference image, OIQA metrics can be classified into full-reference (FR), no-reference (NR), and reduced-reference (RR) methods [21], [22]. FR methods compute the quality difference by comparing the pixel information between the original and distorted images [3], [23], [24]. RR methods use only some features of the image, and NR methods predict the test image quality without any reference information from the original image. In particular, FR OIQA techniques are based on the direct comparison of the test image with its distorted version. As the whole original information is available, FR metrics often return the most accurate and robust quality measurements. This paper focuses on FR IQA metrics.
According to the existing methods, we can conclude that block artifacts and other factors related to the human visual system (HVS), such as contrast, structure, blur, and so on, are the most important factors in designing and optimizing the performance of an OIQA metric [25], [26]. In this paper, a novel FR OIQA metric based on depth and texture information is proposed. First, the contrast, structure, luminance, and depth properties are taken into consideration; then, the distortion of the stereoscopic image caused by blocking artifacts is measured by using the original and gradient maps. Combining the two aspects, the final 3-D OIQA metric for stereoscopic images is established. Extensive simulations on varied test databases are carried out in our experimental part to confirm the correctness and effectiveness of our proposal.
The remainder of this paper is organized as follows:
Section II discusses the recent related research works of the
OIQA. The proposed metric is described in Section III, and
Section IV presents the simulation results and performance
analysis. Finally, the conclusions and future work are given in
Section V.
II. RELATED WORKS
In the past few years, various 2-D OIQA models for different evaluation purposes have emerged rapidly and achieved good performance, owing to their capability of automatically predicting image quality in real time. These IQA methods can be grouped into three major categories [1], described as follows.
A. Pixel-Based Fidelity Method
This kind of method mainly measures the distortion or impairment in video/images by physically comparing the pixel values between the distorted and original image signals, ignoring human visual perception. Mean squared error (MSE) and peak signal-to-noise ratio (PSNR) are the representative quality assessments in this category, since they are mathematically easy to understand and simple to calculate in a pixel-based manner. However, without considering the perceptual information of human vision, these two quality assessments have some limitations in practical applications.
B. Psychophysical-Based Method
The properties and mechanisms of the HVS, such as luminance, contrast sensitivity, luminance masking, and color perception, are mainly used in this kind of method. These properties and mechanisms are obtained from research in physiology and psychology. Using them, each part of the mathematical model is established separately. The video quality metric (VQM) [27] and the perceptive VQM (PVQM) [3] are popular metrics of this kind.
C. Engineering-Based Method
The main idea of the engineering method is to extract and analyze certain features or artifacts in the image signal. Compared with the psychophysical method, this method builds the model at a holistic level to obtain the final IQA algorithm. In addition, it is much simpler and more effective than the psychophysical method. Examples of this kind of IQA method include structural similarity (SSIM) [13], the content-based metric (CBM) [1], and VQM [3].
It is obvious that the most reliable IQA metric should be built to mimic the HVS, because human beings are the ultimate receivers of visual perception in practical applications. Therefore, compared with the pixel-based fidelity method, the psychophysical and engineering methods are considered better for evaluating the quality of 2-D/3-D images. Combining the strengths of both methods, a new FR quality assessment metric is proposed in this paper. In the following paragraphs, we describe some well-known IQA metrics in detail.
The distinguished quality metric PSNR, which is the most commonly used objective metric in applications worldwide, is presented first. It is worth noting that PSNR is derived from the MSE, which is calculated from the summed differences of the luminance values between the original and distorted images. The working functions of PSNR are presented in (1) and (2), where X(i, j) and Y(i, j) are the corresponding pixel values in the original and distorted frames, respectively; M and N are the total numbers of pixels in the horizontal and vertical directions of a frame, respectively; and k is the number of bits used to describe one pixel. Thus
MSE = (1 / (M × N)) Σ_{i=1}^{M} Σ_{j=1}^{N} (X(i, j) − Y(i, j))²    (1)

PSNR = 10 log10 ((2^k − 1)² / MSE).    (2)
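For concreteness, (1) and (2) translate directly into a few lines of NumPy. This is our own sketch, not code from the paper; the function names and the default bit depth k = 8 are our assumptions.

```python
import numpy as np

def mse(x, y):
    """Mean squared error between two equally sized frames, per (1)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    return float(np.mean((x - y) ** 2))

def psnr(x, y, k=8):
    """PSNR in decibels, per (2); k is the number of bits per pixel."""
    m = mse(x, y)
    if m == 0:
        return float("inf")  # identical frames
    peak = (2 ** k - 1) ** 2
    return 10.0 * np.log10(peak / m)
```

For 8-bit images the peak value is 255, so two frames differing everywhere by the full dynamic range yield a PSNR of 0 dB.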
Unfortunately, PSNR merely describes the difference in pixel values at the same position between the original and distorted frames. As a consequence, the results obtained by the PSNR method deviate from the results of subjective video quality assessment (SVQA), as verified in many works [8], [11]. In addition, without taking human visual perception into consideration, PSNR has a weak correlation with subjective quality assessment (SQA) scores for evaluating the quality of 2-D/3-D images. Thus, objective quality
assessment (OQA) that is highly correlated with the SQA has
been widely studied.
A notable quality assessment method based on human visual perception, called SSIM, was proposed by Wang et al. [12]; it has become one of the key methods for image/video quality assessment owing to its accurate results and wide application fields. It is noteworthy that the HVS is highly specialized in extracting structural information from the viewing field rather than simple pixel errors. It is well known that the statistical properties of the natural visual environment have a vitally important effect on the cognition process of the HVS, which includes evolution, development, and adaptation. One property of natural visual signals is that they are highly structured: the pixels exhibit strong interdependencies, particularly when they are spatially close, and these dependencies carry vitally important information about the structure of the objects in the visual scene. SSIM is formed on the basis of this observation. In this algorithm, the structural information is independent of the luminance and contrast of an image. Hence, the image/video quality assessment can be separated into measurements of luminance distortion, contrast distortion, and structural distortion.
As shown in (3), X and Y are the original and distorted image signals, represented as X = {x_i | i = 1, 2, . . . , M} and Y = {y_i | i = 1, 2, . . . , M}, respectively, where i is the pixel index and M is the number of pixels in the image. Thus
SSIM(x, y) = [(2µ_x µ_y + c1)(2σ_xy + c2)] / [(µ_x² + µ_y² + c1)(σ_x² + σ_y² + c2)].    (3)
Equation (3) gives the function of SSIM, where µ_x and µ_y refer to the means of the signals X and Y, respectively; σ_x² and σ_y² denote the variances of X and Y, respectively; and σ_xy denotes the covariance between X and Y. c1 and c2 are small constants that take effect only when (µ_x² + µ_y²) or (σ_x² + σ_y²) is small. SSIM satisfies the following conditions:
• SSIM(x, y) = SSIM(y, x);
• SSIM(x, y) ≤ 1;
• SSIM(x, y) = 1 if and only if the distorted image pixel values are identical to the original ones.
Afterward, an improved algorithm named mean structural similarity (MSSIM) was published by Wang et al. in [12]. It should be noted that MSSIM is a block-based metric, as shown in (4), where x_j and y_j are the block contents at the jth block position of the original and distorted frames, respectively, and M is the number of blocks in the frame. Using this modified algorithm, MSSIM obtains experimental results much better correlated with subjective video quality scores than the previous metric in [9]. That is

MSSIM = (1 / M) Σ_{j=1}^{M} SSIM(x_j, y_j).    (4)
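To make (3) and (4) concrete, the following NumPy sketch computes SSIM per block and averages it over non-overlapping 8 × 8 blocks. The sketch is ours; the constants c1 and c2 follow the common choices (0.01 · 255)² and (0.03 · 255)², which the text does not specify.

```python
import numpy as np

def ssim_block(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """SSIM between two blocks, per (3)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def mssim(X, Y, bs=8):
    """Mean SSIM over non-overlapping bs x bs blocks, per (4)."""
    H, W = X.shape
    scores = [ssim_block(X[i:i + bs, j:j + bs], Y[i:i + bs, j:j + bs])
              for i in range(0, H - bs + 1, bs)
              for j in range(0, W - bs + 1, bs)]
    return float(np.mean(scores))
```

Note that the symmetry and upper-bound properties listed above hold by construction: every factor in (3) is symmetric in x and y, and numerator equals denominator exactly when the two blocks coincide.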
SSIM and MSSIM perform very well in quality evaluation, particularly for still images or slightly changing video frames. There are two important reasons for this performance. First, they consider not only the physical differences of the pixels but also the contrast and structure distortions. Second, the luminance, contrast, and structure distortions are mutually independent, avoiding interactions with each other. Hence, the metric has had a very important impact on the improvement of IQA metrics. However, its application to general image signals has obvious limitations. First of all, the basic coding unit of general video coding standards is the 8 × 8 subblock, so blocking effects always arise at the corresponding boundaries due to the lossy quantization operation; this blocking artifact is not detected well by SSIM, since its basic operation unit is also 8 × 8. Second, a fundamental problem of SSIM is that it is mainly designed for image signals, and the characteristics of video signals, such as motion vectors, the correlation of neighboring frames, etc., are not fully considered. Hence, it cannot give good enough quality evaluations, particularly for fast-changing image signals [28]. For these reasons, it is not suitable for the quality assessment of the more complex stereoscopic image.
With the increasing requirements for video/images in the last decade, quality assessment metrics for 3-D video/images have constantly emerged. PQM is a representative one for evaluating the quality of 3-D video [4]. It is a perceptual quality metric that can be applied to measure the distortion of 2-D/3-D video/images. It quantifies the distortions of luminance and contrast, which are calculated from the luminance values, means, and variances of the original and distorted views and the covariance between them. These two distortions are weighted by the mean of each block of the original view to obtain the final distortion of the entire frame. In [14], the authors proposed a perceptual FR quality assessment metric in which the binocular vision phenomenon that fuses and perceives the images from the two eyes is treated as a single perception for stereoscopic images. By analyzing the relationship between the MSE and the SSIM, an MSE-SSIM IQA metric was proposed by Tan et al. in [15], making full use of the variance of the original image and the MSE between the original and distorted images. The distortions of the contrast coefficient at the block level in the frame are taken into account. As shown in (5), the distortion is calculated as follows:
MSE-SSIM(O_i, D_i) = (2σ_{Oi}² + k1) / (2σ_{Oi}² + MSE(O_i, D_i) + k1)    (5)
where σ_{Oi}² is the block variance of the original frame, and MSE(O_i, D_i) is calculated from the summed differences of the luminance values between the original and distorted blocks. To avoid a mathematical logic error (a vanishing denominator), a constant k1 is inserted in (5). In fact, the variance factor is roughly considered a judge of the contrast of the frame signal, so the metric is effectively applied to evaluate the distortion levels of the contrast of a frame. Notice that the MSE-SSIM algorithm is content dependent in the sense that it relies on the statistics of the original frame and on the change of the luminance component between the original and distorted frames.
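A minimal per-block sketch of (5) follows. The sketch is ours; the value of k1 is an assumption, as the text only requires it to be a small positive constant.

```python
import numpy as np

def mse_ssim_block(o, d, k1=1e-4):
    """MSE-SSIM for one block, per (5): blocks with high variance
    (strong contrast) tolerate a larger MSE before the score drops."""
    o = np.asarray(o, dtype=np.float64)
    d = np.asarray(d, dtype=np.float64)
    num = 2.0 * o.var() + k1
    den = 2.0 * o.var() + np.mean((o - d) ** 2) + k1
    return num / den
```

The score is 1 for identical blocks and decreases toward 0 as the block MSE grows relative to the original block's variance.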
Fig. 1. Schematic overview of the proposed metric.
MSE-SSIM is one of the representative algorithms for 2-D IQA and shows good quality evaluation results. A characteristic of this metric is that it is sensitive to slight changes in image degradation, and the error quantification extends from the pixel level right up to the sequence level.
III. DESCRIPTION OF THE PROPOSED QUALITY METRIC
Generally, a stereoscopic image consists of two views, i.e., the left and right views. The quality of each view plays an extremely important role in stereoscopic IQA, since users perceive the stereoscopic image based on each view. We treat binocular vision as the fusion of the images perceived by the two eyes into a single perception. The proposed IQA mainly quantifies the following two distortions: the distortion of the block content and the distortion of the block boundary information. The distortion of block content covers the luminance and contrast features, which are very important to the perception of image quality in the HVS. As for the distortion of block boundary information, the processes of motion estimation, transformation, and quantization in units of blocks deteriorate block boundaries, and the HVS is very sensitive to the boundary information of image signals. To obtain the degree of the distortions in the frame, we adopt the 8 × 8 block as the basic operation unit in this paper. The algorithm can be effectively applied to measure the distortion of stereoscopic images; moreover, it can also be used to evaluate the quality of 2-D images. The schematic overview of the proposed metric is illustrated in Fig. 1. The detailed procedure of the quality assessment is described in the following parts.
A. Image Segmentation
Recent studies have shown that, although the depth map contains less information about the image signal than the image view, it can effectively represent the key image features, such as boundaries, edges, and so on [17], [29]. Therefore, it is used in this paper as important edge information to enhance and optimize the proposed metric for 3-D image signals. The HVS has different sensations for different depth distances. In addition, the HVS is highly efficient at extracting structural information when we perceive depth in a 3-D image signal. Under these considerations, the image views are segmented into multiple depth planes.
Image segmentation plays a crucial role in many applications, such as image analysis and comprehension, computer vision, image coding, pattern recognition, and medical image analysis. Image segmentation methods are generally divided into two sorts: threshold methods based on the gray-level histogram and region-growing methods. Threshold methods have become the most popular for their low computation, simplicity, and stability. In classical threshold image segmentation, an image is usually segmented and simply sorted into object and background by setting a threshold. In this paper, we segment the image into different regions with the Otsu algorithm [17]. The steps of Otsu's method are described as follows.
First, the probability distribution of the image is obtained from its histogram as

p_i = x_i / X,  p_i ≥ 0,  Σ_{i=1}^{L} p_i = 1    (6)

where i represents the gray level, x_i is the number of pixels whose gray value equals i, X is the total number of pixels in the image, and p_i is the probability distribution in grayscale.
Then, the pixels are divided into two classes, object (O) and background (B); Otsu's method finds the threshold that minimizes the intraclass variance (equivalently, maximizes the interclass variance).
The class probabilities and class means are calculated by

ω_O = P(O) = Σ_{i=1}^{t} p_i = ω(t)    (7)

ω_B = P(B) = Σ_{i=t+1}^{L} p_i = 1 − ω(t)    (8)

µ_O = Σ_{i=1}^{t} i P(i|O) = Σ_{i=1}^{t} i p_i / ω_O = µ(t) / ω(t)    (9)

µ_B = Σ_{i=t+1}^{L} i P(i|B) = Σ_{i=t+1}^{L} i p_i / ω_B = (µ(L) − µ(t)) / (1 − ω(t))    (10)

where µ(t) = Σ_{i=1}^{t} i p_i and µ(L) = Σ_{i=1}^{L} i p_i; ω_O and ω_B represent the class probabilities of the object and background, respectively; L is the maximum gray level of the image, which equals 255 in optical imagery; and t is the threshold separating the background and object.
µ(L) is the total mean level of the original image, and µ(t) is the first-order cumulative moment of the histogram up to the tth level. It can easily be concluded that, for any choice of t, the following formulas hold:

ω_O µ_O + ω_B µ_B = µ_T    (11)

ω_O + ω_B = 1.    (12)
The variances of classes O and B are defined as σ_O² and σ_B². That is

σ_O² = Σ_{i=1}^{t} (i − µ_O)² P(i|O) = Σ_{i=1}^{t} (i − µ_O)² p_i / ω_O    (13)

σ_B² = Σ_{i=t+1}^{L} (i − µ_B)² P(i|B) = Σ_{i=t+1}^{L} (i − µ_B)² p_i / ω_B.    (14)

Fig. 2. (a) Original image "im17_l." (b) Identified depth planes of "im17_l."
The intraclass variance σ_1², the interclass variance σ_2², and the total variance of levels are calculated by

σ_1² = ω_O σ_O² + ω_B σ_B²    (15)

σ_2² = ω_O (µ_O − µ_T)² + ω_B (µ_B − µ_T)²    (16)

σ_T² = σ_1² + σ_2².    (17)
The parameter η(t) is introduced and defined as

η(t) = σ_2² / σ_T².    (18)

The optimization task is then transferred to

T = arg max_t η(t).    (19)
With the threshold t, the pixels in the range [1, . . . , t] constitute the object region, and the rest belong to the background.
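The whole procedure (6)–(19) reduces to a small histogram computation. The sketch below is our own NumPy rendering; it uses the standard closed form of the interclass variance, σ_2²(t) = (µ_T ω(t) − µ(t))² / (ω(t)(1 − ω(t))), which is algebraically equivalent to (16).

```python
import numpy as np

def otsu_threshold(image, levels=256):
    """Otsu's method: the threshold maximizing the interclass
    variance, per (6)-(19)."""
    hist = np.bincount(np.asarray(image, dtype=np.intp).ravel(),
                       minlength=levels)
    p = hist / hist.sum()                  # probability distribution, (6)
    omega = np.cumsum(p)                   # class probability omega(t), (7)
    mu = np.cumsum(np.arange(levels) * p)  # cumulative moment mu(t)
    mu_T = mu[-1]                          # total mean mu(L)
    denom = omega * (1.0 - omega)
    denom[denom == 0] = np.inf             # skip degenerate thresholds
    sigma2 = (mu_T * omega - mu) ** 2 / denom  # interclass variance, cf. (16)
    return int(np.argmax(sigma2))          # optimal threshold, (19)
```

On a bimodal image the returned threshold falls between the two modes, which is exactly the separation of object and background described above.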
Then, we segment the object and background area images into multiple depth planes. There are several ways to statistically analyze the image planes of the depth map; the histogram is utilized in this paper because it is very effective for analyzing and processing digital image signals. The range of the pixel values in the depth map is between 0 and 255. The generation of image planes based on the depth map is illustrated in Fig. 3.
Fig. 3. Perceptive vision of different depth planes of "im17_l." (a)–(c) Different planes of the original image. (d) Overall visual depth image of "im17_l."
According to Fig. 2(b), it is clear that the peak frequencies occur around several depth distances, such as 25, 160, and so on. The depth map can be segmented at these peaks. In this case, the image planes are defined as shown in Fig. 3(a)–(c), in increasing order of the depth values. The highlighted blocks in Fig. 3(d) are closer to the observer than the dark blocks. In addition, our statistical analysis shows that the
image plane closer to the viewer has a larger effect on the perception of the HVS.
Based on the characteristics of the histogram, the image planes are established. The number of image planes, obtained from the number of histogram peaks, varies with the image view. In general, the number of image planes ranges from 2 to 5 in our experiments.
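As an illustrative sketch of the plane construction (ours; the text does not give an explicit peak-picking rule, so the separation and relative-height parameters below are assumptions), each pixel of the depth map can be assigned to the nearest dominant histogram peak:

```python
import numpy as np

def segment_depth_planes(depth, min_sep=20, rel_height=0.2):
    """Split a depth map (values 0-255) into planes around its dominant
    histogram peaks; min_sep and rel_height are illustrative parameters."""
    hist = np.bincount(np.asarray(depth, dtype=np.intp).ravel(),
                       minlength=256)
    order = np.argsort(hist)[::-1]  # depth values by decreasing frequency
    peaks = []
    for g in order:
        if hist[g] < rel_height * hist[order[0]]:
            break                   # remaining depth values are too rare
        if all(abs(int(g) - p) >= min_sep for p in peaks):
            peaks.append(int(g))    # keep well-separated peaks only
    peaks.sort()
    # label each pixel with the index of its nearest peak
    d = np.asarray(depth, dtype=np.float64)
    dist = np.abs(d[..., None] - np.asarray(peaks, dtype=np.float64))
    return np.argmin(dist, axis=-1), peaks
```

For a histogram like that of Fig. 2(b), peaks near depth values 25 and 160 would yield two planes, with every pixel labeled by its nearest peak.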
B. Detection of the Image Gradient Magnitude
SSIM uses structural information instead of an error-sensitivity-based measurement for quality assessment, since the main function of the HVS is to extract structural information from the viewing field, and SSIM is highly adapted for this purpose. SSIM is widely used because of its accurate quality assessment ability and general applicability. However, it only considers the luminance, contrast, and structure distortions of the block content, which is unsuitable for assessing the distortion of block boundaries.
Generally, to reduce the data load on network services, the video signal is compressed by video coding standards before transmission. Almost all current video coding standards are lossy codecs, particularly because of the quantization operation [18]. Therefore, blocking artifacts always appear and can affect the judgment of the HVS on image quality. Although the blocking effect generally happens along the boundaries of the blocks, it extends into the inner block, since the pixel values have also been averaged by the lossy compression operation. In addition, when a sudden scene change happens, the distortions between the original and distorted blocks also have a significant effect on evaluating the quality of stereoscopic/2-D images. The blocking artifact occurs not only in 2-D video/images but also in stereoscopic video/images, because a stereoscopic video/image generally consists of left and right views, i.e., each view is a 2-D video/image. The blocking artifact is caused by the distortion of the boundary information. Therefore, it is very important to consider the boundary information in IQA.
Image gradients can be used to extract edge information from images. Gradient images are created from the original image (generally by convolving with a filter, one of the simplest being the Sobel filter) for this purpose. Each pixel of a gradient image represents the change in intensity at the corresponding point in the original image. After the gradient image is computed, pixels with large gradient values are considered edge pixels because of their dramatic changes in the local region. Image gradient computation is a traditional topic in image processing. Gradient operators expressed as convolution masks are efficient for representing edge information. The three most commonly used gradient operators are the Sobel, Prewitt, and Scharr operators. In our proposed IQA metric, the Scharr operator, chosen by experiment, is used to calculate the gradient in this paper. The partial derivatives G_x(X) and G_y(X) of the image f(X) along the horizontal and vertical directions are computed using the Scharr operator listed in Table I. The gradient magnitude of f(X) is then defined as

G = sqrt(G_x² + G_y²).    (20)
TABLE I
PARTIAL DERIVATIVES OF f(X) USING SCHARR GRADIENT OPERATORS
According to the gradient map, we take a measure similar to MSE-SSIM to calculate the block content distortion MSE(G_Oi, G_Di), as shown in

MSE(G_Oi, G_Di) = (1 / (8 × 8)) Σ_{a=1}^{8} Σ_{b=1}^{8} (G_Oi(a, b) − G_Di(a, b))²    (21)

where G_Oi(a, b) and G_Di(a, b) are the gradient maps on the ith block of the original and distorted images, respectively.
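The Scharr masks and the gradient block MSE of (20) and (21) can be sketched as follows. This is our own rendering; a zero-padded 3 × 3 correlation stands in for the filtering routine of an image-processing library.

```python
import numpy as np

SCHARR_X = np.array([[ 3, 0,  -3],
                     [10, 0, -10],
                     [ 3, 0,  -3]], dtype=np.float64)
SCHARR_Y = SCHARR_X.T  # vertical-derivative mask

def _conv2_same(img, kernel):
    """3x3 'same'-size correlation with zero padding."""
    img = np.asarray(img, dtype=np.float64)
    pad = np.pad(img, 1)
    out = np.zeros_like(img)
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * pad[di:di + img.shape[0],
                                        dj:dj + img.shape[1]]
    return out

def gradient_magnitude(img):
    """Per-pixel gradient magnitude G = sqrt(Gx^2 + Gy^2), per (20)."""
    gx = _conv2_same(img, SCHARR_X)
    gy = _conv2_same(img, SCHARR_Y)
    return np.sqrt(gx ** 2 + gy ** 2)

def block_gradient_mse(g_o, g_d):
    """MSE between two 8x8 gradient blocks, per (21)."""
    g_o = np.asarray(g_o, dtype=np.float64)
    g_d = np.asarray(g_d, dtype=np.float64)
    return float(np.mean((g_o - g_d) ** 2))
```

A flat region produces zero gradient magnitude in its interior, while an intensity step produces a strong response along the step, which is exactly the boundary sensitivity the metric relies on.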
C. Distortion of the Entire Image
The distortion of the contrast and structure coefficients in each block, GM(O_i, D_i), is taken into account. As shown in (22), the two distortions are jointly calculated. Thus

GM(O_i, D_i) = (2σ_{Oi}² + 2σ_{G_Oi}² + k1) / (2σ_{Oi}² + 2σ_{G_Oi}² + MSE(O_i, D_i) + MSE(G_Oi, G_Di) + k1)    (22)

where σ_{Oi}² and σ_{G_Oi}² are the variances on the ith block of the original image and of the gradient map, respectively. To avoid a mathematical logic error, a constant k1 is inserted to guarantee that the denominator is meaningful.
According to the analysis in the previous section, the quality score for the jth 3-D image plane (AGM) is calculated as follows:

AGM_j = Σ_{i∈B_j} GM(O_i, D_i) / N_j    (23)

where B_j is the set of blocks belonging to the jth image plane, and N_j is the number of blocks in B_j.
After the AGM quality scores of all regions have been obtained, the quality of each image plane is calculated. We use linear weighting to combine the quality scores of the different image planes into the quality of the 3-D image (IAGM), as shown in

IAGM = Σ_{j=1}^{p} w_j × AGM_j    (24)

where w_j is the weight assigned to the jth image plane, with the constraint Σ_{j=1}^{p} w_j = 1, quantifying the importance of each image plane in binocular vision.
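Equations (22)–(24) combine into a short scoring pipeline. The sketch below is ours, with an assumed k1; it takes precomputed per-block image and gradient data and the plane weights as inputs.

```python
import numpy as np

def gm_block(o, d, g_o, g_d, k1=1e-4):
    """Block score GM(Oi, Di), per (22), combining the intensity-domain
    and gradient-domain variances and MSEs."""
    o, d = np.asarray(o, np.float64), np.asarray(d, np.float64)
    g_o, g_d = np.asarray(g_o, np.float64), np.asarray(g_d, np.float64)
    num = 2 * o.var() + 2 * g_o.var() + k1
    den = (2 * o.var() + 2 * g_o.var()
           + np.mean((o - d) ** 2) + np.mean((g_o - g_d) ** 2) + k1)
    return num / den

def agm(plane_blocks):
    """Average GM over one image plane, per (23).
    plane_blocks: iterable of (Oi, Di, GOi, GDi) tuples."""
    return float(np.mean([gm_block(*b) for b in plane_blocks]))

def iagm(agm_scores, weights):
    """Weighted combination over the p image planes, per (24)."""
    w = np.asarray(weights, dtype=np.float64)
    assert np.isclose(w.sum(), 1.0)  # weights must sum to 1
    return float(np.dot(w, np.asarray(agm_scores, dtype=np.float64)))
```

An undistorted image yields GM = AGM = IAGM = 1, and every distortion term in (22) pushes the score below 1.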
Fig. 4. Plots of DMOS versus IAGM with different distortion types. (a)–(e) Plot of JP2K, BLUR, JPEG, WN, and FF distortion, respectively.
IV. EXPERIMENTAL RESULTS AND ANALYSIS
To evaluate the performance of the proposed algorithm, the
IAGM and other state-of-the-art IQA metrics are implemented
in the experimental phase.
A. Subjective Quality Assessment for 3-D Image Signals
The main target of research on IQA is to find a superior OIQA metric that is as consistent with subjective assessment results as possible. Therefore, subjective experiments are executed first to provide comparable data for the design of OIQA functions. SIQA offers the most precise measures of perceptual quality, since it is generated by the HVS directly. In this kind of experiment, people are involved to evaluate the image quality under controlled test environments. In our experiments, we use the LIVE database presented in [24], developed at The University of Texas at Austin. It consists of 20 reference images and a total of 365 distorted images covering five types of distortion at different distortion levels. To evaluate the performance, our proposed metric is compared with various well-known metrics on the five distortion types, namely, JPEG compression, JPEG2000 compression, additive white Gaussian noise, Gaussian blur, and a fast-fading model based on the Rayleigh fading channel, abbreviated as JPEG, JP2K, WN, Blur, and FF, respectively.
B. OQA Metrics Used in Our Paper

With the development of wireless networking, many relevant intelligent devices have been produced and are widely used. The IQA metric is regarded as an efficient method for measuring the finally received image quality. Moreover, OIQA can become part of a real-time feedback mechanism and is often applied to optimize the transmission system. Four evaluation indexes, namely the Pearson linear correlation coefficient (PLCC), Spearman's rank order correlation coefficient (SROCC), root mean square error (RMSE), and coefficient of determination (R-square), are used to compare the performance of the mentioned metrics. It should be noted that SROCC and R-square are employed to assess prediction monotonicity, whereas PLCC and RMSE are used to evaluate prediction accuracy.
In the ideal situation where the objective scores after the nonlinear regression [21] of (25) and the subjective scores are a perfect match, PLCC = SROCC = R-square = 1 and RMSE = 0. The regression is

DMOS = \alpha_1 \left( \frac{1}{2} - \frac{1}{1 + \exp(\alpha_2 (x - \alpha_3))} \right) + \alpha_4 x + \alpha_5        (25)

where DMOS denotes the subjective score, x is the objective score, and \alpha_1, \alpha_2, \alpha_3, \alpha_4, and \alpha_5 are determined by fitting the objective scores to the subjective scores.
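The fitting of (25) and the computation of the four indexes can be sketched as follows; the initial guesses in `p0` are illustrative assumptions, not values from this paper:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr, spearmanr

def logistic5(x, a1, a2, a3, a4, a5):
    # Five-parameter logistic mapping of (25): objective score x -> DMOS.
    return a1 * (0.5 - 1.0 / (1.0 + np.exp(a2 * (x - a3)))) + a4 * x + a5

def evaluate(objective, dmos):
    """Fit (25), then report PLCC, SROCC, RMSE, and R-square."""
    objective = np.asarray(objective, dtype=float)
    dmos = np.asarray(dmos, dtype=float)
    p0 = [np.ptp(dmos), 1.0, np.mean(objective), 1.0, np.mean(dmos)]
    params, _ = curve_fit(logistic5, objective, dmos, p0=p0, maxfev=10000)
    pred = logistic5(objective, *params)
    plcc = pearsonr(pred, dmos)[0]         # accuracy, after regression
    srocc = spearmanr(objective, dmos)[0]  # monotonicity, rank based
    rmse = float(np.sqrt(np.mean((pred - dmos) ** 2)))
    r_square = 1.0 - np.sum((dmos - pred) ** 2) / np.sum((dmos - dmos.mean()) ** 2)
    return plcc, srocc, rmse, r_square
```

Note that SROCC is computed on the raw objective scores, since rank correlation is invariant to the monotonic mapping of (25).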
To validate the proposed IAGM algorithm, performance is examined separately for the different distortion types, as shown in Fig. 4, where each point in the plots represents one test image, and the vertical and horizontal axes represent DMOS and IAGM, respectively. If the prediction were perfect, every point would lie on the diagonal line. In this paper, some 3-D IQA metrics are chosen based on their reported
Fig. 5. Ensemble performance tested on the LIVE 3-D-image database. (a)–(f) Scatter plots of PSNR, MSSIM, PQM, M-SVD, MSE-SSIM, and the proposed IAGM versus DMOS, in sequence.
TABLE II
PLCC Compared With Other Methods

TABLE III
SROCC Compared With Other Methods
performance and availability, to verify the universality of our proposed algorithm. In addition, in order to measure how well popular 2-D quality assessment metrics evaluate 3-D visual quality, four other metrics for 2-D images are employed in our experiments. It should be noted that the 2-D IQA metrics are applied to the left and right views separately; a single quality score for the 3-D image is then produced by averaging the 2-D IQA scores of the two views. The performance comparison between our proposed metric and the six other latest OQA algorithms is demonstrated by the scatter plots in Fig. 5. The vertical axis shows the subjective scores, represented as DMOS, and the horizontal axis denotes the objective quality scores after the nonlinear mapping. From Fig. 5, we can conclude that the proposed IAGM outperforms the other algorithms.
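The per-view averaging scheme described above can be sketched as follows; plain PSNR is used only as a stand-in 2-D FR metric, and any of the 2-D baselines could be substituted for it:

```python
import numpy as np

def psnr(ref, dis, peak=255.0):
    """Plain PSNR, used here only as an example 2-D FR metric."""
    mse = np.mean((np.asarray(ref, float) - np.asarray(dis, float)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def stereo_score_from_2d(metric, left_ref, left_dis, right_ref, right_dis):
    """Apply a 2-D FR metric to the left and right views separately and
    average the two scores, as done for the 2-D baselines in this paper."""
    return 0.5 * (metric(left_ref, left_dis) + metric(right_ref, right_dis))
```

This adaptation ignores binocular interaction between the views, which is precisely the limitation the proposed IAGM metric is designed to address.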
The values of the four indexes for each distortion type in the database are listed in Tables II–V. Across the tests, the proposed metric outperforms the other 2-D and 3-D IQA algorithms. Since the PSNR, MSSIM, M-SVD, and MSE-SSIM metrics are directly extended from the 2-D case and do not take binocular visual characteristics into account, their overall performance is generally worse than that of our proposed metric, even though they may be effective for some specific distortion types. For example, it is well accepted that MSSIM measures image distortion well. However, this metric has a shortcoming that can be effectively remedied by taking the properties of human visual perception into account, and our experimental results confirm this hypothesis. According to Tables II–V, the IAGM algorithm performs better than the other metrics for all distortions except WN; for WN distortion, MSSIM performs best.
TABLE IV
RMSE Compared With Other Methods

TABLE V
R-Square Compared With Other Methods

Both IAGM and MSE-SSIM have higher ranking fractions for JP2K and BLUR than for the other distortions. The results also show that the proposed algorithm is robust to small shifts. As shown in Table II, the metric achieves good performance for certain distortion types of 3-D images; however, it is still lower than some 2-D metrics. The reason is that the quality of stereo video depends, to some degree, on the stereo matching algorithm, so that the disparity measurement does not coincide with the human perception of disparity. As future work, we will focus on developing a new algorithm that uses the disparity map to compute the quality of images/videos at different resolutions.

V. CONCLUSION

The OQA of stereo images plays a key role in the development of 3-D image compression standards and in the success of various 3-D visual applications. Recently, intelligent Internet-of-Things technologies have developed rapidly; in particular, for 3-D images, many immersive scenes can now be shown to people, making the perceptual experience more vivid. In this paper, an efficient IQA metric for 3-D image signals has been proposed by considering local image properties and the psychological properties of the HVS, and by treating the depth map as an important factor for 3-D IQA. The advantages of our proposed metric are as follows: 1) the calculation of the local pixel-based distortion, contrast distortion, and structural distortion can properly describe the image features of the image view; 2) the local stimuli content can be automatically scaled by using the histogram derived from the depth map; and 3) the gradient magnitude similarity is also used as a weighting function to enhance the model in assessing image quality. Compared with state-of-the-art IQA metrics, the proposed IAGM metric performs better in terms of both accuracy and efficiency, making IAGM an ideal choice for high-performance IQA applications.

REFERENCES
[1] H. Wei, H. Y. Zhao, C. Lin, and Y. Yang, “Effective load balancing for
cloud-based multimedia system,” in Proc. Int. Conf. Electron. Mech. Eng.
Inf. Technol., 2011, pp. 165–168.
[2] W.-M. Chen, C.-J. Lai, H.-C. Wang, H.-C. Chao, and C.-H. Lo, “H.264
video watermarking with secret image sharing,” IET Image Process.,
vol. 5, no. 4, pp. 349–354, Jun. 2011.
[3] W. Lin and C.-C. Jay Kuo, “Perceptual visual quality metrics: A survey,”
J. Vis. Commun. Image Representation, vol. 22, no. 4, pp. 297–312,
May 2011.
[4] P. Joveluro, H. Malekmohamadi, W. A. C. Fernando, and A. M. Kondoz,
“Perceptual video quality metric for 3D video quality assessment,” in
Proc. 3DTV Conf., Jul. 2010, pp. 1–4.
[5] T.-Y. Wu, C.-Y. Chen, L.-S. Kuo, W.-T. Lee, and H.-C. Chao, “Cloud-based image processing system with priority-based data distribution mechanism,” Comput. Commun., vol. 35, no. 15, pp. 1809–1818, Sep. 2012.
[6] Z. M. P. Sazzad, S. Yamanaka, Y. Kawayoke, and Y. Horita, “Stereoscopic
image quality prediction,” in Proc. IEEE QoMEX, Jul. 2009, pp. 180–185.
[7] F. Qi, T. Jiang, S. Ma, and D. Zhao, “Quality of experience assessment for stereoscopic images,” in Proc. IEEE Int. Conf. Circuits Syst., May 2012, pp. 1712–1715.
[8] K. Ha and M. Kim, “A perceptual quality assessment metric using temporal complexity and disparity information for stereoscopic video,” in Proc.
IEEE Int. Conf. Image Process., Sep. 2011, pp. 2525–2528.
[9] H.-W. Chang, H. Yang, Y. Gan, and M.-H. Wang, “Sparse feature fidelity
for perceptual image quality assessment,” IEEE Trans. Image Process.,
vol. 22, no. 10, pp. 4007–4018, Oct. 2013.
[10] M. Carnec, P. L. Callet, and D. Barba, “Objective quality assessment of color images based on a generic perceptual reduced reference,” J. Signal Process.: Image Commun., vol. 23, no. 4, pp. 239–256, Apr. 2008.
[11] T. M. Kusuma and H.-J. Zepernick, “A reduced-reference perceptual
quality metric for in-service image quality assessment,” in Proc. IEEE 1st
Workshop Mobile Future Symp. Trends Commun., Oct. 2003, pp. 71–74.
[12] Z. Wang, L. Lu, and A. C. Bovik, “Video quality assessment based on structural distortion measurement,” Signal Process.: Image Commun., vol. 19, no. 2, pp. 121–132, Feb. 2004.
[13] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality
assessment: From error visibility to structural similarity,” IEEE Trans.
Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.
[14] F. Shao, W. Lin, S. Gu, G. Jiang, and T. Srikanthan, “Perceptual
full-reference quality assessment of stereoscopic images by considering
binocular visual characteristics,” IEEE Trans. Image Process., vol. 22,
no. 5, pp. 1940–1953, May 2013.
[15] H. Tan, Z. Li, Y. H. Tan, S. Rahardja, and C. Yeo, “A perceptually relevant
MSE-based image quality metric,” IEEE Trans. Image Process., vol. 22,
no. 11, pp. 4447–4459, Nov. 2013.
[16] Y.-H. Lin and J.-L. Wu, “Quality assessment of stereoscopic 3D image
compression by binocular integration behaviors,” IEEE Trans. Image
Process., vol. 23, no. 4, pp. 1527–1542, Apr. 2014.
[17] S. Zhou, P. Yang, and W. Xie, “Infrared image segmentation based on Otsu
and genetic algorithm,” in Proc. Int. Conf. Multimedia Technol., 2011,
pp. 5421–5424.
[18] S. L. P. Yasakethu, S. T. Worrall, D. V. S. X. De Silva, W. A. C. Fernando,
and A. M. Kondoz, “A compound depth and image quality metric for
measuring the effects of packet loss on 3D video,” in Proc. Int. Conf.
Digital Signal Process., Corfu, Greece, Jul. 2011, pp. 1–7.
[19] A. K. Moorthy, C.-C. Su, A. Mittal, and A. C. Bovik, “Subjective evaluation of stereoscopic image quality,” Signal Process.: Image Commun.,
vol. 28, no. 8, pp. 870–883, Sep. 2013.
[20] K. Seshadrinathan, R. Soundararajan, A. C. Bovik, and L. K. Cormack,
“Study of the subjective and objective quality assessment of video,” IEEE
Trans. Image Process., vol. 19, no. 6, pp. 1427–1441, Jun. 2010.
[21] P. G. Gottschalk and J. R. Dunn, “The five-parameter logistic: A characterization and comparison with the four-parameter logistic,” Anal. Biochem.,
vol. 343, no. 1, pp. 54–65, Aug. 2005.
[22] L. Jin, A. Boev, A. Gotchev, and K. Egiazarian, “3D-DCT based perceptual quality assessment of stereo video,” in Proc. IEEE Int. Conf. Image
Process., Brussels, Belgium, Sep. 2011, pp. 2521–2524.
[23] X. Liu, M. Chen, W. Tang, and C. Yu, “Hybrid no-reference video quality
assessment focusing on codec effects,” KSII Trans. Internet Inf. Syst.,
vol. 3, no. 2, pp. 592–606, Jan. 2011.
[24] L. Zhang, L. Zhang, X. Mou, and D. Zhang, “FSIM: A feature similarity index for image quality assessment,” IEEE Trans. Image Process., vol. 20, no. 8, pp. 2378–2386, Aug. 2011.
[25] S.-W. Jung, “A modified model of the just noticeable depth difference
and its application to depth sensation enhancement,” IEEE Trans. Image
Process., vol. 22, no. 10, pp. 3892–3903, Oct. 2013.
[26] A. M. Demirtas, A. R. Reibman, and H. Jafarkhani, “Full-reference quality estimation for images with different spatial resolutions,” IEEE Trans.
Image Process., vol. 23, no. 5, pp. 2069–2080, May 2014.
[27] M. H. Pinson and S. Wolf, “A new standardized method for objectively measuring video quality,” IEEE Trans. Broadcast., vol. 50, no. 3,
pp. 312–322, Sep. 2004.
[28] X. Liu, L. T. Yang, and K. Sohn, “High-speed inter-view frame mode decision procedure for multi-view video coding,” J. Future Gener. Comput.
Syst. (Elsevier), vol. 28, no. 6, pp. 947–956, Jun. 2011.
[29] A. Liu, W. Lin, and M. Narwaria, “Image quality assessment based
on gradient similarity,” IEEE Trans. Image Process., vol. 21, no. 4,
pp. 1500–1512, Apr. 2012.
Xingang Liu (M’10) received the B.Eng. degree from the School of Electronic Engineering (EE), University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2000, and the M.Eng. and Ph.D. degrees from Yeungnam University, Gyeongsan, Korea, in 2005 and 2009, respectively.
He is a Full Professor with the School of EE, UESTC, Chengdu, China, where he was a faculty member in EE during 2000–2003. He was a BK21 Research Fellow with the School of Electrical and Electronic Engineering, Yonsei University, Seoul, Korea, from 2010 to 2011. He was also an Adjunct Professor with Dongguk University, Seoul, Korea. His research interests include multimedia signal communication-related topics, such as heterogeneous/homogeneous video transcoding, video quality measurement, video signal error concealment, mode decision algorithms, 2-D/3-D video codecs, and so on.
Dr. Liu is a member of the Korea Information and Communications Society (KICS) and the Korea Society for Internet Information (KSII).
Kai Kang received the M.S. degree in electronic engineering from the University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2015.
He is currently an Engineer with the Huawei Research Center, Chengdu, China. His research interests include 2-D/3-D video/image quality assessment, cloud computing systems, and so on.
Yinbo Liu received the M.S. degree in electronic engineering from the University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2015.
He is currently a Video Algorithm Engineer with the Audio and Video Technology Platform Department, ZTE Corporation, Shenzhen, China. His research interests include video compression, mode decision, and so on.