
1 Introduction

Video quality evaluation (VQE) is a promising research area due to its wide range of applications in the development of various video coding algorithms [1]. The technical coding areas involved with free viewpoint video (FVV) are characterized by view generation using multiview video coding (MVC) and view synthesis. This process first goes through image warping and then a hole-filling technique, e.g. inverse mapping or spatial/temporal correlation as simple post-processing filtering [2, 3]. Since the synthesized view is generated at a virtual position between the left and right views, there is no reference frame available for quality estimation of FVV [4]. Quality estimation is usually performed in two ways, objective and subjective, of which the former is more widely used due to its simplicity, ease of use, and suitability for real-time applications. Thus, a good number of citable studies have been conducted on objective image quality estimation [5,6,7]. Quality estimation can be further categorized into full-reference (i.e. original videos as reference), reduced-reference (i.e. partial signals available as reference), and no-reference schemes. Among them, full-reference metrics such as SSIM or PSNR are restricted to reference-based situations and lose their suitability for estimating the quality of FVV, where the reference frame is not available. To address the limitations of full-reference metrics, a number of no-reference works have recently come to light for quality evaluation [8,9,10]. The introduced statistical metrics may not be suitable for some high quality ranges, since quality perception in these ranges is driven mostly by perceptual human visual system (HVS) features rather than by the statistics of the image [11]. However, different features of the HVS are not actively studied in the existing schemes. The authors in [12] performed human cognition based quality assessment using eye-tracking and evolved a more realistic ground-truth visual saliency model to improve their algorithm. In fact, eye-tracking has become a non-intrusive, affordable, and easy-to-use tool in human behaviour research today. With very few exceptions, anything with a visual component can be eye tracked by simply employing a software based eye-tracking simulator [13]. Unlike objective quality evaluation, subjective studies can yield valuable data to evaluate the performance of objective methods towards the ultimate goal of matching human perception [14]. Thus, a number of quality assessment algorithms have been proposed that are closely related to studies of human visual attention and cognition. The study in [15] introduced a no-reference framework using blur and blockiness metrics to improve the performance of objective metrics using eye-tracker data. The authors in [16] introduced a model to judge video quality on the basis of physiological measures including pupil dilation and electroencephalogram signals. Exploiting eye-gaze data, Albanesi and Amadeo [17] generated a voting algorithm to develop a no-reference method. Using the scan path of eye movements, Tsai et al. [18] subjectively assessed the perceived image and its colour quality.
Conversely, the widely used subjective testing scheme, the mean opinion score (MOS) [19, 20], is often biased by a number of factors such as the viewer's mood, domain knowledge, and testing environment, which may actively influence the effectiveness of the quality assessment process. Podder et al. [21] first introduced the subjective metric QMET; however, their initial work is based on single-view video where the viewing angle is fixed for users. Moreover, their approach depends heavily on threshold selection for each feature and lacks a proper correlation setting among features. Most importantly, their metric does not perform well across different video contents and resolutions. The proposed method is a significantly extended version of their work, where the major amendments include the employment of FVV (i.e. the no-reference scenario), an increased number of features, better correlation analysis of the features, content and resolution invariant operations on the features, synthesizing them by an adaptive weighted function, comparing the new metric with PSNR, SSIM, and MOS, and eventually employing two widely used estimators, the Pearson Linear Correlation Coefficient (PLCC) and the Spearman Rank-Order Correlation Coefficient (SRCC), to justify the effectiveness of the proposed QMET for a range of FVV sequences.

Fig. 1. A more concentrated eye-traversing pattern is perceived for relatively better quality contents (e.g. the Newspaper sequence image in (b)). The opposite is noticed in (e), for which the pupil-size sharply increases in (c), while the gaze event duration notably decreases in (f).

Let us first concentrate on Fig. 1, in which (a) and (d) represent a multiview video sequence, namely Newspaper, encoded at good and poor quality respectively, while (b) and (e) demonstrate the eye-traversing pattern of a viewer for the good and poor quality contents respectively. The tracked gaze plots indicate more concentrated eye-traversal for relatively better quality contents. If we determine the Length (L) and Angle (A) features of the gaze plots, they can explicitly describe the viewer's browsing pattern (i.e. smooth or random, as depicted in Fig. 1(b) and (e)). We further observe that quality variation affects both Pupil-size (P) and Gaze-duration (T), as presented in Fig. 1(c) and (f); thus, we calculate four cardinal features, L, A, P, and T, for each potential gaze plot (PGP) in the gaze trajectory. The PGPs in this test are defined by fixations (i.e. visual gaze on a single location) and saccades (i.e. quick movements of the eyes between two or more phases of fixation). Content and resolution invariant operations are then performed on the features, which are adaptively synthesized using a weighted function to develop the proposed QMET. A higher QMET score promises good quality video, as the viewers could better capture its content information with smooth global browsing. Experimental results reveal that the quality evaluation carried out by the QMET could perform better than the objective metric SSIM and the subjective estimator MOS. Since eye-tracker data can easily be captured today by directly employing a software based eye-tracking simulator (i.e. the device itself is no longer required), the utility of the QMET could be made even more flexible using such simple simulator-generated data sets.

2 Proposed Method

First, by employing the HEVC [22] reference software HM15.0 [23], different video quality segments were generated and then watched by a group of participants. The processed eye-tracker data were analyzed to extract four quality-correlated features, i.e. L, A, P, and T. Content and resolution invariant operations were carried out on the features, which were then synthesized by an adaptive weighted function to develop the new metric QMET. A diagram of the entire process is presented in Fig. 2, while the key steps are described in the succeeding sections.

Fig. 2. Process diagram of the proposed QMET development.

2.1 Data Capture and Pre-processing

The participants (both male and female), who were recruited from the University, had normal or corrected-to-normal vision and did not suffer from any medical condition that could adversely influence our project [ethical approval no. 2015/124]. They fall within the 20–45 age band and are undergraduate/postgraduate students, PhD students, and lecturers of the University. The multiview sequences used in this test have resolutions of \(1920\times 1088\) and \(1024\times 768\) (details can be found in [24]). To avoid bias, we initially use the grey-scale components only and randomly vary the display order of the quality segments presented to the participants. We generate three different quality versions of each video: Excellent (using quantization parameter \(QP=5\)), Fair (\(QP=25\)), and Very-poor (\(QP=50\)). Calibration and a trial run were performed so that the participants felt comfortable with the whole process. Upon their satisfaction, the Tobii eye tracker [25] was employed to record their eye movements. As the device recorded data at 60 Hz and the allocated frame rate was 30 fps, each frame could accommodate two gaze points, and a single whole video covered 9000 gaze plots, 1800 for each quality segment.
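As a simple illustration of this sampling arrangement, a minimal sketch (in Python) of assigning 60 Hz gaze samples to 30 fps frame indices is given below; the function and field names are illustrative assumptions rather than the authors' processing code.

SAMPLE_RATE_HZ = 60
FRAME_RATE_FPS = 30
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ // FRAME_RATE_FPS  # = 2 gaze points per frame

def frame_index(timestamp_ms, recording_start_ms):
    # Map an eye-tracker sample timestamp (in ms) to the video frame it falls in.
    return int((timestamp_ms - recording_start_ms) * FRAME_RATE_FPS / 1000.0)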

Fig. 3. The Length (L), Angle (A), and Pupil-size (P) features have a proportionate correlation, while the Gaze-duration (T) feature has an inversely proportionate correlation with quality degradation.

2.2 Features Correlation Analysis with Quality

The Length (L, in pixels) of the ith potential gaze plot is calculated as the Euclidean distance to the (i+1)th gaze plot, while the Angle (A, in degrees) of the ith plot is calculated using both its (i−1)th and (i+1)th neighbours as reference (where i = 1, 2, ..., n, and the values of L and A are not calculated for the 1st and nth plots). The Pupil-size (P, in mm) and Gaze-duration (T, in ms) of each ith plot are determined by averaging the left and right pupil sizes and from the eye-tracker recorded timestamp data respectively, using MATLAB R2012a (MathWorks Inc., Massachusetts, USA). The overall results indicate that the L, A, and P features have a proportionate correlation, while the feature T has an inversely proportionate correlation, with video quality degradation, as demonstrated in Fig. 3.
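A compact sketch of this per-plot feature extraction is given below. It assumes the Angle is the turning angle at plot i between the incoming and outgoing segments, and that the Gaze-duration is obtained from consecutive timestamp differences; both are plausible readings of the text rather than the authors' exact implementation, and the function name is illustrative.

import math

def gaze_features(xs, ys, pupil_l, pupil_r, t_ms):
    # Per-plot Length (pixels), Angle (degrees), Pupil-size (mm), Gaze-duration (ms).
    n = len(xs)
    L = [None] * n
    A = [None] * n
    P = [(pl + pr) / 2.0 for pl, pr in zip(pupil_l, pupil_r)]  # mean of left/right pupils
    T = [t_ms[i + 1] - t_ms[i] if i + 1 < n else None for i in range(n)]
    for i in range(1, n - 1):  # L and A are not defined for the 1st and nth plots
        L[i] = math.hypot(xs[i + 1] - xs[i], ys[i + 1] - ys[i])  # Euclidean distance
        v1 = (xs[i] - xs[i - 1], ys[i] - ys[i - 1])  # incoming segment
        v2 = (xs[i + 1] - xs[i], ys[i + 1] - ys[i])  # outgoing segment
        dot = v1[0] * v2[0] + v1[1] * v2[1]
        norm = math.hypot(*v1) * math.hypot(*v2)
        A[i] = math.degrees(math.acos(max(-1.0, min(1.0, dot / norm)))) if norm else 0.0
    return L, A, P, T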

Next, the contribution of each feature is estimated in the context of segregating different quality contents. It is observed that no single feature could solely be the best representative in distinguishing different qualities. The individual \(Q\)-score (i.e. the calculated pseudo score of the proposed QMET) of each feature is determined using Eqs. (1)–(4), where \(Q_1\), \(Q_2\), \(Q_3\), and \(Q_4\) indicate the \(Q\)-score of the individual features L, A, P, and T respectively.

$$\begin{aligned} Q_1=L^{\delta L} \end{aligned}$$
(1)
$$\begin{aligned} Q_2=A^{\varphi A} \end{aligned}$$
(2)
$$\begin{aligned} Q_3=(P/2)^{\gamma P} \end{aligned}$$
(3)
$$\begin{aligned} Q_4= \sqrt{2T}^{(\eta /\sqrt{2T})} \end{aligned}$$
(4)

Here, \(\delta \), \(\varphi \), \(\gamma \), and \(\eta \) are the weighting factors of the L, A, P, and T features respectively. Let us briefly discuss the formation of these equations, which produce the different \(Q\)-scores using the power law, whereby a relative change in one quantity results in a proportional change in another, i.e. one quantity varies as a power of another [26]. In our case, the relative value change of the features is unknown, and the corresponding reproduced \(Q\)-score is unknown as well; however, whether they have a proportionate or inversely proportionate relation is known. For example, a lower L indicates higher quality and a correspondingly higher \(Q\)-score, but we still do not know by how much. Since the value change of L between quality segments is not large (e.g. 0.08 for Excellent and 0.12 for Fair, and the maximum average does not exceed 0.50), it is best represented by its power form, as a smaller power with a smaller base produces a higher score. This eventually produces a clear score difference between quality segments. The features L, A, and P work with power-weight multiplication, whereas power-weight division is used for T, which works similarly since T has an inversely proportionate relation with the \(Q\)-score. These relationships are presented in Eqs. (1)–(4), and a small numerical sketch follows below. The rationale for using the \(Q\)-score is to give a clearer picture of how the QMET changes for various changes of L, A, P, and T within a bounded range from 0 to 1. Since the L, A, P, and T features can jointly indicate how far, how much, how large, and how long respectively in the spatio-temporal domain, the features are synthesized by an adaptive weighted function, \(Q=L^{\delta L}\times A^{\varphi A}\times (P/2)^{\gamma P}\times \sqrt{2T}^{(\eta /\sqrt{2T})}\). The purpose of this multiplication is to keep a persistent relation between the L, A, P, and T features and the reproduced \(Q\)-score. As the normalized values of the features vary within the range 0 to 1 in Eqs. (1)–(4), their multiplication reproduces the ultimate score within the same predefined limit. Note that the weights \(\delta \), \(\varphi \), \(\gamma \), and \(\eta \) in Eqs. (1)–(4) are fixed at 0.5 in this test. This is because we further calculate the slope at each quality-change point (i.e. Excellent, Fair, and so on) and determine its average for a number of candidate weights. Since the average calculated with weight 0.5 outperforms the other weight combinations in distinguishing different quality segments, as demonstrated in Fig. 4, we fix it for the entire experiment. Other weight combinations might work better; however, the tested results demonstrate a good correlation of QMET with the other metrics.
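The following short sketch evaluates Eqs. (1)–(4) with all weights fixed at 0.5; the example L values 0.08 and 0.12 are taken from the text above, while the function and argument names are illustrative.

import math

def q_scores(L, A, P, T, delta=0.5, phi=0.5, gamma=0.5, eta=0.5):
    # Per-feature pseudo scores following Eqs. (1)-(4); L, A, P, and T are assumed
    # to be normalized to [0, 1], with P and T never exactly zero.
    q1 = L ** (delta * L)                                   # Eq. (1)
    q2 = A ** (phi * A)                                     # Eq. (2)
    q3 = (P / 2.0) ** (gamma * P)                           # Eq. (3)
    q4 = math.sqrt(2.0 * T) ** (eta / math.sqrt(2.0 * T))   # Eq. (4), inverse relation
    return q1, q2, q3, q4

# Example from the text: L = 0.08 (Excellent) gives 0.08**0.04, about 0.90, whereas
# L = 0.12 (Fair) gives 0.12**0.06, about 0.88, so even small feature differences
# map to distinguishable scores.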

Fig. 4. The synthesizing operation using the Length, Angle, Pupil-size, and Gaze-duration features could better distinguish different quality segments.

2.3 Content and Resolution Invariant Operation on Features

Let us first consider the content-based (left of Fig. 5) and resolution-based (right of Fig. 5) unprocessed L of two example sequences, Poznan_Street and Newspaper, presented in Fig. 5. The calculated variations between the highest and lowest values are 41.72% and 28.63% with respect to content and resolution respectively. The content invariant operation follows a number of steps, sketched in code after this paragraph. First, we compute the L of the PGPs as mentioned in Sect. 2.2; second, we calculate the average of the potential gaze plot x and y coordinates and denote it the centre coordinate C(x, y); third, with respect to C(x, y), we calculate the Euclidean distance of all PGPs and sort the distances from lowest to highest. The rationale for this ordering is to prioritize the foveal concentration on central pixels while partially discarding the long surrounding parafoveal or perifoveal fixations [27] that may occur even during attentive eye browsing. Fourth, to determine the object motion area, we take the average of the first \(\mu \) sorted values (\(\mu = 75\%\) in this test, since it helps the QMET obtain the highest score), which is the foreseen radius of the captured affective region; fifth, this radius is then employed as a divisor of the lengths calculated for each potential gaze plot in the first step.
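A minimal sketch of these five steps follows, assuming the gaze-plot coordinates and the per-plot lengths from the first step are already available; the helper name and the handling of undefined boundary plots are assumptions of the sketch.

import math

def content_invariant_lengths(xs, ys, lengths, mu=0.75):
    # Step 2: centre coordinate C(x, y) as the mean of the potential gaze plots.
    cx = sum(xs) / len(xs)
    cy = sum(ys) / len(ys)
    # Step 3: Euclidean distance of every PGP from C, sorted lowest to highest.
    dists = sorted(math.hypot(x - cx, y - cy) for x, y in zip(xs, ys))
    # Step 4: radius of the captured affective region = mean of the first mu portion.
    k = max(1, int(mu * len(dists)))
    radius = sum(dists[:k]) / k
    # Step 5: divide each per-plot length from Step 1 by this radius.
    return [l / radius if l is not None else None for l in lengths]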

Fig. 5. The video content based (left) and resolution based (right) unprocessed Length.

Similar to the content-based lengths, we also observe a substantial variation of 28.63% among the different resolution-based lengths in Fig. 5 (right). We therefore exploit a number of multiplication factors, which passively act as compensators, to neutralize the impact of the various video resolutions displayed on the screen. For example, taking a \(1024\times 768\) resolution sequence as the reference, the unprocessed lengths of its higher and lower resolution counterparts are multiplied by 0.75 and 1.25 respectively; a brief sketch is given below. Since the eye-tracker recorded data demonstrate a good correlation from the highest to the lowest resolution videos for almost all sequences, these multipliers perform well in the resolution invariant operation. The outcomes are then normalized to values ranging within 0 to 1. The resultant effect of the content plus resolution invariant operation on L, which is used for the final QMET scoring, is revealed in the top-left of Fig. 6. Once similar operations are performed on the features A, P, and T, the variation effects are significantly minimized, as illustrated in the top-right, bottom-left, and bottom-right of Fig. 6 respectively.
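The sketch below applies these compensation factors and the final 0-to-1 normalization; the max-based normalization and the function name are assumptions of the sketch rather than the authors' stated procedure.

def resolution_compensated_lengths(lengths, width, height):
    # Reference resolution is 1024x768; higher-resolution lengths are multiplied
    # by 0.75 and lower-resolution ones by 1.25 (factors taken from the text).
    ref_pixels = 1024 * 768
    pixels = width * height
    factor = 0.75 if pixels > ref_pixels else 1.25 if pixels < ref_pixels else 1.0
    scaled = [l * factor if l is not None else None for l in lengths]
    # Normalize the outcome to the range 0..1 (here simply by the maximum value).
    peak = max(v for v in scaled if v is not None)
    return [v / peak if v is not None else None for v in scaled]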

Fig. 6. The obtained values of L, A, P, and T (normalized) after performing the content and resolution invariant operations.

2.4 The Development of QMET

If relatively lower values of L, A, and P and a higher value of T belong to a potential gaze plot, the QMET should produce a relatively higher score. Thus, the QMET score is calculated for all PGPs of each Excellent, Fair, and Very-poor quality segment of the sequences by adaptively synthesizing the features as follows:

$$\begin{aligned} Q_{MET}=L^{\delta L}\times A^{\varphi A}\times (P/2)^{\gamma P}\times \sqrt{2T}^{(\eta /\sqrt{2T})} \end{aligned}$$
(5)

where the weights \(\delta \), \(\varphi \), \(\gamma \), and \(\eta \) are fixed at 0.5 in this experiment, as stated earlier. In an unusual case, if the normalized values of L and A remain 0 for 30 consecutive frames (the frame rate being 30 fps in this test), a mimicking operation is performed; a sketch follows below. The rationale for this operation is to handle the consecutive zeros that may arise from a participant's intentional eye fixation on a certain PGP. Thus, the user data that have become stuck over these frames are forcefully penalized by arbitrarily setting L = 0.1 and A = 0.1. This operation applies only to the features L and A, since P and T remain non-zero in that case. Note that during this test we did not experience any such situation and carried out no such operation.
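The sketch below combines Eq. (5) with this penalty rule; since the paper states the operation was never actually triggered, the windowing details and the function names are assumptions of the sketch.

import math

def qmet_score(L, A, P, T, delta=0.5, phi=0.5, gamma=0.5, eta=0.5):
    # Eq. (5) for one potential gaze plot; the features are assumed to be the
    # normalized, content- and resolution-invariant values.
    return (L ** (delta * L)) * (A ** (phi * A)) * ((P / 2.0) ** (gamma * P)) \
           * (math.sqrt(2.0 * T) ** (eta / math.sqrt(2.0 * T)))

def apply_fixation_penalty(L_seq, A_seq, window=30, penalty=0.1):
    # Mimicking operation: if L and A stay at 0 for 30 consecutive frames (one
    # second at 30 fps), both are reset to 0.1 so that an intentional fixation
    # does not inflate the score; P and T are left untouched.
    L_out, A_out = list(L_seq), list(A_seq)
    run = 0
    for i in range(len(L_seq)):
        run = run + 1 if (L_seq[i] == 0 and A_seq[i] == 0) else 0
        if run >= window:
            for j in range(i - window + 1, i + 1):
                L_out[j], A_out[j] = penalty, penalty
    return L_out, A_out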

3 Experimental Outcomes

The maximum and minimum QMET scores for each quality segment of two example sequences are presented in Fig. 7(a). For both sequences, the obtained score is highest for the Excellent quality segment, gradually decreases with quality degradation, and reaches its lowest for the Very-poor segment. Compared to Newspaper, the QMET score decreases more sharply for the Poznan_Street sequence. This is because, compared to its Excellent quality segment, the recorded gaze data supporting the Very-poor quality incur recurrent unsuitable feature values and produce a lower QMET score. Once we calculate the average of the maximum and minimum scores for each individual quality segment, we notice that the average recognized variation between the best and worst quality becomes 72.35%, which indicates a clear quality distinguishing capability of the QMET.

Fig. 7. Different scoring orientations of QMET for a wide range of qualities (on both a participant and a video basis).

Figure 7(b) reveals the participant-wise and video-wise average QMET scores for the three quality segments. The QMET obtains its highest scores, 0.78 and 0.71, for the Excellent quality segment on the video and participant bases respectively, as the participants could better capture information from the best quality contents with smooth global browsing. Conversely, for the lowest scores, 0.25 and 0.21 for the Very-poor segment, participants in most cases do not succeed in capturing content information due to its unpleasant quality and immediately move to the next location, which is often still erroneous. As the number of such hit-and-miss browsing events sharply increases with time, the quality score decreases, since plenty of inappropriate feature values enter the scoring process. Therefore, for a sequence of genuinely Poor to Very-poor quality, it is very unlikely to acquire a high quality score using the proposed QMET. Next, to better justify the performance of QMET against the PSNR, SSIM, and MOS on FVV, two different quality segments (Excellent and Very-poor) are taken into account. The calculated average scores of the four metrics for these segments are reported in Fig. 8(a)–(d). The obtained percentages of variation between the highest score (Excellent segment) and the lowest score (Very-poor segment) using PSNR, SSIM, QMET, and MOS are 57.39, 32.49, 78.51, and 69.71, as represented in Fig. 8(e). These outcomes indicate that the QMET-estimated average quality segregation score outperforms the rest of the metrics. This is because viewers could better capture good quality synthesized video content with smooth global browsing. Conversely, the poorly reconstructed synthesized views suffer from localized edge reconstruction and crack-like artifacts. Thus, the recorded gaze data of poor contents indicate the participants' haphazard browsing (being affected by unsuccessful attempts due to the unpleasant quality), which does not meet the balanced feature correlation criteria and generates a lower QMET score. Figure 8(f) indicates the maximum achievable difference (e.g. the difference between the highest score of the Excellent quality segment and the lowest score of the Very-poor quality segment) picked out by the four metrics, where the MOS outperforms the other metrics. The Very-poor quality segment of some synthesized videos (e.g. Newspaper) incurs an arbitrarily nominated low score such as 0.05 (out of 1.0), which leads to such stunning variations. The calculated results for free viewpoint videos in Fig. 8 indicate that the subjective assessment MOS performs better than the objective metrics PSNR and SSIM. This is mostly because PSNR and SSIM do not have an available reference image with which to calculate the score in this case. However, according to Fig. 8(e), the human visual perception based QMET demonstrates relatively improved performance compared to the MOS in terms of segregating different aspects of coded video quality.

Fig. 8. (a–d) reveal the average quality variation identification carried out by the PSNR, SSIM, QMET, and MOS for the Excellent and Very-poor quality segments of free viewpoint videos, which is more explicitly presented in (e), while (f) indicates the maximum achievable difference (e.g. the difference between the highest score of the Excellent quality segment and the lowest score of the Very-poor quality segment) obtained by the four metrics.

Fig. 9. Performance comparison of the PSNR, SSIM, QMET, and MOS metrics on the Excellent, Fair, and Very-poor quality segments using FVV. The lower the calculated variation for a segment, the better the metric performance is presumed to be.

Now, two remarkable observations. First, if different videos are coded at the same quality (e.g. QP = 5 for Excellent), the reproduced scores should show no stunning variations. Surprisingly, the PSNR discards this trend and, for almost all quality segments, its variation reaches the highest, as illustrated in Fig. 9. Thus, it might lose its suitability for a wide range of free viewpoint video sequences. On the other hand, for the Very-poor quality segment, the participants perhaps give somewhat arbitrary scores, for which the MOS variation reaches its apex and its proficiency drops in this regard. This example also motivates the development of a subjective metric other than MOS for relatively fairer scoring. Although the QMET performs better than PSNR and MOS here, the SSIM appears the most stable across all segments. This is because the SSIM is a perception-based model that considers degradation in an image mainly by recognizing changes in structural information. The second observation, i.e. that even when the same sequence is coded at a range of qualities the recognition of quality variation should be prominent, has been verified by employing two ranges of variation (Excellent to Fair and Fair to Very-poor), reported in Fig. 10. For the first range of segments, although all the metrics perform in a similar manner on free viewpoint video, the QMET appears the most responsive in differentiating the range of qualities, while the SSIM tends to be the least responsive in this regard. For the second range of segments (i.e. Fair to Very-poor), the QMET and the MOS reach their apex, indicating their best performance in the context of quality segregation. Interestingly, for both ranges of segments, the subjective estimators perform relatively better than the objective ones.

Fig. 10. The percentage of quality variation recognized by the PSNR, SSIM, QMET, and MOS metrics for a range of quality segment differences. The higher the calculated percentage of variation detected between segments [X–Y], the better the metric performance is presumed to be.

For further performance estimation of the four metrics, the calculated results for all videos used in this test are reported in Table 1, implementing both the PLCC and SRCC evaluation criteria. A good quality metric is expected to achieve high values in both PLCC and SRCC [8]. According to both the PLCC and SRCC judgements, the QMET reveals similar performance to the PSNR; however, it obtains relatively higher scores than the SSIM and MOS. In fact, the results obtained by the proposed metric are promising given that no information about the reference image is available to the QMET for evaluating quality. Since the scoring patterns of the four metrics are approximately similar in terms of distinguishing different quality contents, as illustrated in Figs. 9 and 10 and Table 1, the proposed QMET could be well represented as a new member of the quality metric family and successfully employed as an impressive alternative to the subjective estimator MOS. It could also be employed to evaluate the effectiveness of using the objective metrics PSNR and SSIM, since the QMET does not require any ground-truth reference for quality estimation.
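For reference, PLCC and SRCC can be computed with standard statistical routines, as in the brief sketch below; the variable names and the choice of comparison scores are illustrative, since the paper does not publish its evaluation script.

from scipy.stats import pearsonr, spearmanr

def correlation_scores(metric_scores, comparison_scores):
    # Pearson Linear Correlation Coefficient and Spearman Rank-Order Correlation
    # Coefficient between a metric's per-sequence scores and the comparison scores.
    plcc, _ = pearsonr(metric_scores, comparison_scores)
    srcc, _ = spearmanr(metric_scores, comparison_scores)
    return plcc, srcc

# Example: plcc, srcc = correlation_scores(qmet_per_sequence, mos_per_sequence)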

Table 1. Average performance of four metrics according to both PLCC and SRCC’s evaluation criteria.

A potential application of the QMET is the evaluation of synthesized views reproduced by different FVV generation algorithms. A good number of contributions in the literature claim image quality improvement mostly on the basis of the objective metrics PSNR and SSIM or the subjective estimator MOS. However, as presented earlier, the subjective estimator MOS performs better than the objective metrics in most cases when evaluating FVV quality. Since the proposed QMET is closely correlated with human cognition, its assessment process is presumed to be more neutral than the MOS. Moreover, since view synthesis algorithms go through post-processing phases such as inverse mapping or inpainting for crack filling, algorithms that successfully overcome crack-filling artifacts are highly anticipated to obtain higher quality evaluation scores with the QMET.

4 Conclusion

In this work, a no-reference video quality assessment metric has been developed for free viewpoint video. The newly developed metric QMET could be an impressive substitute for the popularly used subjective estimator MOS for quality evaluation and comparison. In the metric generation process, the human perceptual eye-traversing nature on videos is exploited to discover the patterns of the Length, Angle, Pupil-size, and Gaze-duration features from gaze trajectories recorded for varied video qualities. Content and resolution invariant operations are carried out on the features prior to synthesizing them using an adaptive weighted function to develop the QMET. The experimental analysis reveals that the quality evaluation carried out by the QMET is largely similar to that of the MOS and the reference-based PSNR and SSIM in terms of assessing different aspects of quality contents. Eventually, the outcomes of the four metrics have further been tested using the Pearson Linear Correlation Coefficient (PLCC) and Spearman Rank-Order Correlation Coefficient (SRCC) evaluation criteria, which indicate that the QMET performs relatively better than the MOS and the SSIM for a wide range of free viewpoint video contents. Since eye-tracker data can easily be captured nowadays by directly employing a software based eye-tracking simulator (i.e. the device itself is no longer required), the utility of the QMET could be made even more flexible using such simple simulator-generated data sets.