
1 Introduction

1.1 Background

The recognition of chemicals in the environment is an essential need for living organisms. Odours are detected through millions of olfactory receptors located at the top of the nasal cavity. The human olfactory system consists of three main components: (1) an array of olfactory receptors, (2) the olfactory bulb, which receives neural inputs about the odours detected by the receptors, and (3) the brain. The olfactory system collects a sample from its environment and transmits it to the brain, where it is recognized as a specific odour.

An olfactory system is able to detect a broad range of smells. However, the human olfactory system fails to respond to many air pollutants; people differ in their sensitivity to many air pollutants and can even become accustomed to toxic smells.

The contamination of air by harmful chemicals is referred to as air pollution and is one of the biggest concerns worldwide, mainly because air pollution directly affects environmental and human health. Auditing odourants is a crucial element in the assessment of indoor and outdoor air quality. There are various odour measurement techniques, such as dilution-to-threshold, olfactometers, and referencing techniques [25]. The dependence of these methods on human evaluation makes them less accurate and sometimes undesirable.

The concept of artificial olfaction was introduced in [29]. Early artificial olfaction systems relied on a gas multisensor array. The term electronic nose (e-nose) appeared for the first time in the early 1990s [12]. An e-nose is designed to recognize simple or complex odours in its environment and comprises two main elements: hardware and software. The hardware usually includes a set of gas sensors (such as metal oxide semiconductors, conducting polymers, etc.) with partial specificity, an air conditioner, a flow controller, electronics, and many more components. The software consists of statistical methods for pre-processing the data and pattern recognition methods for predicting the odour concentration.

The gas sensors of an e-nose should have certain features. Similar to the receptors of the human nose, they need to be highly sensitive to chemical compounds and less sensitive to temperature and humidity. In addition, the sensors should be able to respond to a wide variety of chemical compounds. Other desirable features include durability, selectivity, and easy calibration.

A gas sensor’s performance is affected by various factors. One of the most serious sources of deterioration is a phenomenon called drift, a low-frequency change in a sensor that causes offset measurements. Sensor drift therefore needs to be detected and compensated to guarantee accurate sensor measurements. Several methods have been introduced to overcome the drift phenomenon, including [2, 5, 28, 34].

The multivariate response of a gas sensor array undergoes several pre-processing procedures before prediction is performed using statistical tools such as regression, classification, or clustering. Methods for analyzing gas sensor array data are discussed in [3, 15, 23].

1.2 Motivation

The e-nose is capable of reproducing the human sense of smell using an array of gas sensors and pattern recognition methods. Pattern recognition methods use a set of labelled data to predict the odour concentration for each set of sensor measurements. The labelled data consist of a sub-sample of the sensors’ outputs whose odour concentration has been further analyzed by olfactometry.

One application of the e-nose is in environmental activities; e-noses provide industries with an odour management plan to minimize the effect of odour on the environment. To this end, e-noses are installed in outdoor fields such as compost sites, landfill sites, waste water plants, etc., where environmental conditions can fluctuate greatly. Consequently, the occurrence of unwanted variability is very typical.

During the sampling process, sensors in the e-nose device may report incorrect values, or some of the sensors may stop functioning for a short period of time. These anomalies ought to be diagnosed and reported in real time using a computationally efficient algorithm, which is the focus of this research.

We propose an online data validation algorithm which compares e-nose measurements with a set of reference samples and allocates them to different zones accordingly. The zones are distinguished from each other by distinct colors such as green, yellow, and red to represent the extent of the validity of a measurement. The main focus of this work is summarized in the flowchart of Fig. 1.

Fig. 1.
figure 1

A schematic flowchart of the proposed online task for an e-nose.

1.3 Data Preparation

The e-nose relies on a sensor array consisting of several gas sensors. The number of sensors depends on the purpose of the analysis. Each sensor represents an attribute; the more sensors there are, the better the e-nose discriminates among analytes. Nonetheless, the inclusion of too many sensors can lead to unnecessary data and a complex system.

The e-nose under study includes 11 sensors, each designed to be responsive to a specific chemical compound in the air. However, the sensors are not highly selective and react to almost all gases. As a result, some of the sensors are highly positively correlated with each other; see Figs. 2 and 3 (left panel). Consider the data matrix \(\mathbf {X}_{n\times p}\) whose rows are n independent realizations of the 11 sensor values, \(\mathbf {x}^{^\top }_{p\times 1}\), where \(\mathbf {a}^{^\top }\) denotes the transpose of the vector \(\mathbf {a}\).

Fig. 2.
figure 2

Sensors’ output during three days of sampling for 4 randomly selected sensors.

The covariance matrix of \(\mathbf {x}_{p\times 1}\), say \(\varvec{\varSigma }=[\sigma _{ij}]_{i,j=1,2,\ldots ,p}\), is defined as

$$\varvec{\varSigma }_{p \times p}=\mathrm {Cov}(\mathbf {x})= \mathrm {E}\{(\mathbf {x}-\varvec{\mu })(\mathbf {x}-\varvec{\mu })^{^\top }\},$$

where \(\varvec{\mu }\) represents the mean of \(\mathbf {x}\), and \(\mathrm {E}\) is the mathematical expectation. The covariance, \(\sigma _{ij}\), measures the degree to which two attributes are linearly associated. However, in order to get a better idea of the relationship between two attributes, one needs to eliminate the effect of the other attributes. The partial correlation is the correlation between two attributes while controlling for the effects of the other attributes. The inverse of the covariance matrix is commonly known as the precision or concentration matrix. The entries of \(\varvec{\varSigma }^{-1}\) have an interpretation in terms of partial correlation: a non-zero element of \(\varvec{\varSigma }^{-1}\) implies conditional dependence. Therefore, a sparse estimate of \(\varvec{\varSigma }^{-1}\), which sets some of its entries to zero, pinpoints the block structure of the attributes. Investigation of the inherent dependence between the sensor values is then performed by means of the partial correlation.
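To make the link between the precision matrix and the partial correlation concrete, the following minimal Python sketch computes the matrix of partial correlations from the inverse of a sample covariance matrix. The data here are a simulated stand-in for the sensor matrix \(\mathbf {X}\), not the e-nose measurements.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for the n x p matrix of sensor values X.
n, p = 500, 11
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)

# Sample covariance and its inverse (the precision matrix).
Sigma_hat = np.cov(X, rowvar=False)
Omega = np.linalg.inv(Sigma_hat)

# Partial correlation between attributes i and j, controlling for the rest:
# rho_ij = -omega_ij / sqrt(omega_ii * omega_jj).
d = np.sqrt(np.diag(Omega))
partial_corr = -Omega / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)

print(np.round(partial_corr[:4, :4], 2))
```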

Here, the graphical lasso [11] is considered for a better understanding of the existing relationships between the sensor values. [11] proposed estimating the covariance matrix such that its inverse, \(\varvec{\varSigma }^{-1}\), is sparse by applying a lasso penalty [33]. In Fig. 3 (right panel), the undirected graph connects two attributes that are conditionally correlated given all other attributes. Sensors 5 to 8 are correlated with each other conditionally on the effect of the others. This is also reflected in the heatmap of the correlation matrix in Fig. 3 (left panel). This dependence must be taken into account when modelling the data. Gaussianity of the data is another crucial assumption that should be verified. The validity of this assumption for the sensor values is tested using various methods, such as analyzing the distribution of individual sensor values, scatter plots of linear projections of the data using principal components, estimating the multivariate kurtosis and skewness, and the multivariate Mardia test; see Fig. 4.
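The sparse precision estimate can be reproduced in spirit with scikit-learn's GraphicalLasso, a standard implementation of the graphical lasso. The penalty value and the block-correlated simulated data below are arbitrary illustration choices, not the estimates behind Fig. 3.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(1)

# Simulate correlated "sensor" data with a block structure (stand-in for X).
n, p = 500, 11
A = np.eye(p)
A[4:8, 4:8] += 0.6          # a correlated block, mimicking sensors 5-8
X = rng.multivariate_normal(np.zeros(p), A @ A.T, size=n)

# Sparse estimation of the precision matrix with an l1 (lasso) penalty.
model = GraphicalLasso(alpha=0.2).fit(X)   # alpha chosen arbitrarily here
Omega = model.precision_

# Edges of the conditional-dependence graph: non-zero off-diagonal entries
# (reported with 1-based sensor labels).
edges = [(i + 1, j + 1) for i in range(p) for j in range(i + 1, p)
         if abs(Omega[i, j]) > 1e-6]
print(edges)
```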

Fig. 3.
figure 3

Left panel, heatmap of the correlation matrix of the sensor values (\(s_1\)–\(s_{11}\)). Right panel, the undirected graph of partial correlations obtained with the graphical lasso. The undirected graph of the right panel confirms the block structure of the heatmap in the left panel.

2 Data Analysis

We aim to verify the validity of e-nose measurements by considering some reference samples for the purpose of comparison. These reference samples are collected when the e-nose functions normally, and the conditions are fully under control. The e-nose measurements are compared with reference samples and are allocated to various zones accordingly. These zones are distinguished by various colors, like green, yellow, red, etc., to indicate the status of e-nose measurements [26].

Fig. 4.
figure 4

Left panel, the Q-Q plot of the squared Mahalanobis distances, which should follow a chi-squared distribution for Gaussian data. Right panel, the marginal densities for some randomly chosen sensor values. Both graphs confirm the non-Gaussianity of the data.

Two distinct reference sets, if applicable, are recommended for data validation. Reference 1 consists of data from a period of sampling defined by an expert after installation of the e-nose. The data from this period are called the proposed set. Reference 2, when available, consists of samples gathered manually from the field and brought to the laboratory for quantification of their odour concentration. These data are called the calibration set, to emphasize that they can be incorporated into data modelling using a supervised learning algorithm.

If a new datum does not follow the overall pattern of the previously observed data, it is marked as an outlier and assigned to the Red zone. This zone represents a dramatic change in the pattern of samples, and its observations are referred to as “risky”. If the new datum is not an outlier and lies within the data polytope of Reference 1 or Reference 2, it is allocated to the Green or Blue zone, respectively. These zones represent the “safe” observations. If the new datum is not an outlier but lies outside the Green and Blue zones, it is assigned to the Yellow zone. This zone contains the potentially “critical” observations.
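The zone rule just described can be written as a small decision function. The three boolean inputs are assumed to be produced by the outlier test and the convex-hull membership checks introduced later in the paper; the Orange zone that appears in some figures is not part of this basic rule.

```python
def assign_zone(is_outlier: bool, in_ref1_hull: bool, in_ref2_hull: bool) -> str:
    """Zone label for a single new measurement, following the rule above."""
    if is_outlier:
        return "Red"        # "risky": breaks the overall pattern of the data
    if in_ref1_hull:
        return "Green"      # "safe": inside the polytope of Reference 1
    if in_ref2_hull:
        return "Blue"       # "safe": inside the polytope of Reference 2
    return "Yellow"         # "critical": regular, but outside both reference polytopes


# Example: a non-outlying sample that falls only inside the Reference 2 polytope.
print(assign_zone(is_outlier=False, in_ref1_hull=False, in_ref2_hull=True))  # Blue
```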

Fig. 5.
figure 5

Validity assessment for about 700 samples based on 2 sensor values. Left panel, the contour map of the estimated density function for the 2 sensors. Right panel, the density function of the samples shown in 3D, with zones identified for each sample in the sensor 1 (\(s_1\)) versus sensor 2 (\(s_2\)) plane. Higher density is assigned to the Green, Blue, and Orange zones compared to the Yellow and Red zones. (Color figure online)

If a large proportion of samples belongs to the Yellow and Red zones, the reliability of the system should be questioned. Undesirable measurements can be the outcome of physical complications, such as sensor loss in the e-nose, or of sudden changes in the chemical pattern of the environment. Zone assignment therefore requires an outlier detection algorithm. For the Green and Blue zones, the new samples are projected onto a subspace of lower dimension. Dimension reduction methods such as principal component analysis (PCA) can serve this purpose [20]. PCA attempts to explain the data covariance matrix, \(\hat{\varvec{\varSigma }}\), by a set of components; these components are linear combinations of the primary attributes. PCA essentially converts a set of possibly correlated attributes into a set of linearly uncorrelated axes through orthogonal linear transformations. The first k (\(k<p\)) principal components are the eigenvectors of the covariance matrix \(\varvec{\varSigma }\) associated with the k largest eigenvalues. The classical estimate of the covariance matrix, \(\hat{\varvec{\varSigma }}\), is strongly influenced by outliers [30]. As outliers are typical of sensor data, robust covariance estimation must be applied to avoid misleading results.

Robust principal component analysis [17] is employed for dimension reduction throughout this article. This robust PCA computes the covariance matrix through projection pursuit [24] and minimum covariance determinant [7] methods. The robust PCA procedure can be summarized as follows (a simplified sketch in code is given after the list):

  1. 1.

The matrix of data is pre-processed such that the data lie in a subspace of dimension at most \(\min (n-1,p)\).

  2. 2.

In the spanned subspace, the most obvious outliers are diagnosed and removed from the data. The covariance matrix of the remaining data, \(\hat{\varvec{\varSigma }}_0\), is then calculated.

  3. 3.

\(\hat{\varvec{\varSigma }}_0\) is used to decide on the number of principal components to be retained in the analysis, say \(k_0\) (\(k_0<p\)).

  4. 4.

    The data are projected onto the subspace spanned by the first \(k_0\) eigenvectors of \(\hat{\varvec{\varSigma }}_0\).

  5. 5.

The covariance matrix of the projected points is estimated robustly using the minimum covariance determinant method, and its k leading eigenvalues are computed. The corresponding eigenvectors are the robust principal components.
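The reference implementation of this procedure is ROBPCA [17] (available, for instance, as PcaHubert in the R package rrcov). Since scikit-learn ships no equivalent, the sketch below only approximates the idea: a minimum covariance determinant estimate of the scatter followed by an eigendecomposition. It reproduces the spirit of steps 2–5 but skips the projection-pursuit stage, and the data are a simulated stand-in for the sensor matrix.

```python
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(2)

# Simulated stand-in for the sensor matrix, with a few gross outliers.
n, p, k0 = 400, 11, 3
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)
X[:20] += 8.0                       # contaminated rows

# Robust location/scatter via the (fast) minimum covariance determinant.
mcd = MinCovDet(random_state=0).fit(X)
center, Sigma_rob = mcd.location_, mcd.covariance_

# Eigendecomposition of the robust covariance; keep the k0 leading components.
eigval, eigvec = np.linalg.eigh(Sigma_rob)
order = np.argsort(eigval)[::-1]
loadings = eigvec[:, order[:k0]]    # p x k0 loading matrix (robust PCs)

# Robust principal component scores, used later for the convex hulls.
scores = (X - center) @ loadings
print(scores.shape)                 # (400, 3)
```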

The Red zone represents the outliers among the samples measured by the e-nose through time. One common approach for detecting outliers in multivariate data is to use the Mahalanobis distance

$$\begin{aligned} D_{m}(\mathbf {x}_i)=\sqrt{(\mathbf {x}_i-\hat{\mu })^{^\top }\hat{\varvec{\varSigma }}^{-1} (\mathbf {x}_i-\hat{\mu })}. \end{aligned}$$
(1)

A large value of \(D_{m}(\mathbf {x}_i)\), \(i=1, 2, \ldots , n\), indicates that the observation \(\mathbf {x}_i\) lies far from the centre of the data, \(\hat{\mu }\). As the estimation of \(\mu \) and \(\varvec{\varSigma }\) is itself affected by outliers, the use of Eq. (1) is inadvisable for outlier detection. Even robust plug-in estimates of \(\mu \) and \(\varvec{\varSigma }\) do not lead to any improvement as long as the associated outlier detection cut-offs are based on elliptical distributions. [4, 18] suggested an outlier detection method which does not assume an elliptical distribution for the data. Their method is based on a modified version of the Stahel-Donoho outlyingness measure [9, 32] and is called the adjusted outlyingness (\(\mathrm {AO}\)) criterion. For an observation \(\mathbf {x}_i\), the Stahel-Donoho measure is

$$\begin{aligned} \mathrm {SD}(\mathbf {x}_i)=\sup _{\mathbf {a} \in \mathbf {R}^{p}} \frac{|\mathbf {a}^{^\top }\mathbf {x}_i-\mathrm {median}(\mathbf {X}_n \mathbf {a})|}{\mathrm {mad}(\mathbf {X}_n \mathbf {a})}, \end{aligned}$$
(2)

where \(\mathrm {mad}(\mathbf {X}_n \mathbf {a})= 1.483\, \mathrm {median}_i|\mathbf {a}^{^\top }\mathbf {x}_i-\mathrm {median}(\mathbf {X}_n \mathbf {a})|\) is the median absolute deviation. The \(\mathrm {SD}\) measure essentially looks for outliers by projecting each observation onto many univariate directions. As it is not feasible to examine all possible directions, it is suggested that 250p directions, where p is the number of attributes, suffice and produce efficient results. Taking the effect of skewness into account in the \(\mathrm {SD}\) measure results in the following \(\mathrm {AO}\)

$$\begin{aligned} \mathrm {AO}_i=\sup _{\mathbf {a} \in \mathbf {R}^{p}} {\left\{ \begin{array}{ll} \frac{\mathbf {a}^{^\top }\mathbf {x}_i-\mathrm {median}(\mathbf {X}_n \mathbf {a})}{w_2-\mathrm {median}(\mathbf {X}_n \mathbf {a})} &{} \text {if } \mathbf {a}^{^\top }\mathbf {x}_i > \mathrm {median}(\mathbf {X}_n \mathbf {a}),\\ \frac{\mathrm {median}(\mathbf {X}_n \mathbf {a})- \mathbf {a}^{^\top }\mathbf {x}_i }{\mathrm {median}(\mathbf {X}_n \mathbf {a})- w_1} &{} \text {if } \mathbf {a}^{^\top }\mathbf {x}_i < \mathrm {median}(\mathbf {X}_n \mathbf {a}),\\ \end{array}\right. } \end{aligned}$$
(3)

where \(w_1\) and \(w_2\) are the lower and upper whiskers of the adjusted boxplot [19] of the projected data. If \(\mathrm {AO}_i\) exceeds the upper whisker of the adjusted boxplot of the \(\mathrm {AO}\) values, the observation \(\mathbf {x}_i\) is flagged as an outlier.
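A compact Python sketch of the \(\mathrm {AO}\) computation over random projections is given below; it uses the medcouple from statsmodels for the adjusted-boxplot whiskers and follows Eq. (3) together with the cut-off described above. It is an illustration only (the computations in this work rely on an R implementation such as robustbase::adjOutlyingness), and the test data are simulated.

```python
import numpy as np
from statsmodels.stats.stattools import medcouple

def adjusted_whiskers(v):
    """Whiskers of the adjusted boxplot of Hubert & Vandervieren (2008)."""
    q1, q3 = np.percentile(v, [25, 75])
    iqr, mc = q3 - q1, float(medcouple(v))
    if mc >= 0:
        return q1 - 1.5 * np.exp(-4 * mc) * iqr, q3 + 1.5 * np.exp(3 * mc) * iqr
    return q1 - 1.5 * np.exp(-3 * mc) * iqr, q3 + 1.5 * np.exp(4 * mc) * iqr

def adjusted_outlyingness(X, n_dir=None, seed=0):
    """AO of each row of X over 250*p random directions (Eq. (3))."""
    n, p = X.shape
    n_dir = 250 * p if n_dir is None else n_dir
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n_dir, p))
    A /= np.linalg.norm(A, axis=1, keepdims=True)   # unit directions
    ao = np.zeros(n)
    for a in A:
        z = X @ a
        med = np.median(z)
        w1, w2 = adjusted_whiskers(z)
        out = np.where(z > med, (z - med) / (w2 - med), (med - z) / (med - w1))
        ao = np.maximum(ao, out)                    # sup over directions
    return ao

rng = np.random.default_rng(3)
X = rng.standard_normal((300, 5))
X[:5] += 6.0                                        # a few planted outliers
ao = adjusted_outlyingness(X)
cutoff = adjusted_whiskers(ao)[1]                   # upper whisker of the AO values
print(np.where(ao > cutoff)[0])                     # indices flagged as outliers
```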

A sample rendered as an outlier by the \(\mathrm {AO}\) measure belongs to the Red zone. To specify the remaining zones, we need to define the polytopes of the samples in Reference 1 and Reference 2. These polytopes are built as the convex hulls of the robust principal component scores. More specifically, the boundary of the Green zone is defined by computing the convex hull of the robust principal component scores of Reference 1.

Before determining the color tag of each new datum, the samples are checked for missing values, which are imputed if needed using multivariate imputation methods such as [21]. The idea behind the validity assessment is visualized in Fig. 5. For simplicity, only 2 sensors are used for all computations in Fig. 5, and a 2D presentation of the zones is plotted in the sensors’ coordinates. Suppose that \(\mathbf {X}_{n\times 11}\) represents the matrix of sensor values for n samples, \(\mathbf {y}_n\) the vector of corresponding odour concentration values, and \(\mathbf {x}^{^\top }_l\) the lth row of \(\mathbf {X}_{n\times 11}\), \(l=1, 2, \ldots , n\). Furthermore, suppose that \(n_1\) is the number of samples in the proposed set and \(n_2\) the number of samples in the calibration set. The samples of the proposed set are always available, but the calibration set is not necessarily so. Two different scenarios occur based on the availability of the calibration set.

If the calibration set is accessible, then Scenario 1 applies; otherwise we deal with Scenario 2. Scenario 1 is the general case and is explained in more detail. The data undergo a pre-processing stage, including imputation and outlier detection, before any further analysis. After the pre-processing stage, the data are stored as Reference 1, \(\mathbf {X}_{n_{1}\times 11}\), and Reference 2, \(\mathbf {X}_{n_{2}\times 11}\). The first k, e.g. \(k=2,3\), robust principal components of \(\mathbf {X}_{n_{1}\times 11}\) are calculated and the corresponding loading matrix is denoted by \(\mathbf {L}_1\). The pseudo code of the two algorithms for Scenario 1 is provided below; a simplified code sketch of Scenario 2 follows the algorithm figures. Scenario 2 is a special case of Scenario 1 in which the Sub-Algorithm (Scenario 1) is used with \( \mathrm {ConvexHull}^{(2)}=\varnothing \), which eliminates the Blue and Orange zones. Consequently, there is no model for odour concentration prediction in the Main Algorithm.

figure a
figure b
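Since the pseudo code itself sits in the algorithm figures, the following Python sketch illustrates a stripped-down Scenario 2 end to end: robust loadings from Reference 1, the convex hull of its scores, and zone assignment for a stream of new measurements. To stay short it substitutes an MCD-based eigendecomposition for ROBPCA and a robust Mahalanobis cut-off for the \(\mathrm {AO}\) criterion; all data and thresholds below are arbitrary stand-ins rather than the choices made in this work.

```python
import numpy as np
from scipy.spatial import Delaunay
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(4)
p, k0 = 11, 2

# Reference 1 (proposed set) and a stream of new measurements (all simulated here).
X_ref1 = rng.multivariate_normal(np.zeros(p), np.eye(p), size=300)
X_new = np.vstack([
    rng.multivariate_normal(np.zeros(p), np.eye(p), size=50),     # regular samples
    rng.multivariate_normal(12 * np.ones(p), np.eye(p), size=5),  # gross anomalies
])

# Step Require: robust location/scatter of Reference 1 and the loading matrix L1
# (MCD covariance + eigendecomposition as a simplified stand-in for ROBPCA).
mcd = MinCovDet(random_state=0).fit(X_ref1)
eigval, eigvec = np.linalg.eigh(mcd.covariance_)
L1 = eigvec[:, np.argsort(eigval)[::-1][:k0]]

# Step 1: the polytope (convex hull) of the Reference 1 scores, stored as a
# Delaunay triangulation so that membership can be tested with find_simplex.
scores_ref1 = (X_ref1 - mcd.location_) @ L1
hull_ref1 = Delaunay(scores_ref1)

# Step 4: zone assignment for each new measurement. Outliers are flagged with a
# robust Mahalanobis distance and a chi-squared cut-off -- a simpler stand-in
# for the AO criterion used in the paper.
cutoff = chi2.ppf(0.975, df=p)
d2 = mcd.mahalanobis(X_new)
scores_new = (X_new - mcd.location_) @ L1
zones = np.where(d2 > cutoff, "Red",
                 np.where(hull_ref1.find_simplex(scores_new) >= 0, "Green", "Yellow"))
print(dict(zip(*np.unique(zones, return_counts=True))))
```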

In Sect. 3, a set of simulated data is used to verify the relevance of our proposed algorithm and the choice of statistical methods. The applicability of our algorithm is also tested on 8 months of e-nose sampling in Sect. 4.

3 Simulation

We examine the methodology on two sets of simulated data to highlight the importance of the assumptions considered in our methodology, such as allowing for non-elliptically contoured distributions and using robust estimation. In each example, we store the simulated data in the matrix \(\mathbf {X}_{n \times 2}\), where \(\mathbf {x}_l^{^\top }=(x_{l1}, x_{l2}); l=1, 2, \ldots , n\).

In the first example, the data are simulated from a mixture distribution with \(10\%\) contamination. The components of the mixture distribution are chosen arbitrarily as a Gaussian and a Student’s t-distribution.
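A minimal sketch of how such a contaminated sample can be generated is given below; the means, correlation, shift, and degrees of freedom are arbitrary illustration values, not those behind Fig. 6.

```python
import numpy as np

rng = np.random.default_rng(5)
n, eps = 1000, 0.10                      # sample size and contamination proportion

# f1: bivariate Gaussian component (about 90% of the observations).
n1 = rng.binomial(n, 1 - eps)
clean = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=n1)

# f2: heavy-tailed bivariate Student's t component (the contamination),
# generated as Gaussian / sqrt(chi2_nu / nu) with an arbitrary shift.
nu = 3
g = rng.multivariate_normal([0, 0], [[1, 0], [0, 1]], size=n - n1)
contam = [4, 4] + g / np.sqrt(rng.chisquare(nu, size=(n - n1, 1)) / nu)

X = np.vstack([clean, contam])           # the simulated data matrix X_{n x 2}
rng.shuffle(X)
print(X.shape)
```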

We simulated data from the bivariate skew t-distribution [14] in the second example in order to test the effect of skewness on our algorithm.
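SciPy does not ship a multivariate skew-t sampler, so the sketch below draws from a bivariate skew-t using the usual conditioning representation of the skew-normal divided by a chi-square scale factor (an Azzalini-type construction). The correlation, shape, and degrees-of-freedom parameters are arbitrary and need not match those used for Fig. 7.

```python
import numpy as np

def rmv_skew_t(n, omega_bar, alpha, nu, seed=0):
    """Sample from a multivariate skew-t via the skew-normal
    conditioning representation divided by sqrt(chi2_nu / nu)."""
    rng = np.random.default_rng(seed)
    alpha = np.asarray(alpha, dtype=float)
    p = len(alpha)
    delta = omega_bar @ alpha / np.sqrt(1 + alpha @ omega_bar @ alpha)
    # (p+1)-dimensional correlation matrix of (X0, X).
    omega_star = np.block([[np.ones((1, 1)), delta[None, :]],
                           [delta[:, None], omega_bar]])
    Y = rng.multivariate_normal(np.zeros(p + 1), omega_star, size=n)
    x0, x = Y[:, 0], Y[:, 1:]
    z = np.where(x0[:, None] > 0, x, -x)          # skew-normal sample
    w = rng.chisquare(nu, size=(n, 1)) / nu
    return z / np.sqrt(w)                         # skew-t sample

# Arbitrary illustration parameters (not those behind Fig. 7).
omega_bar = np.array([[1.0, 0.6], [0.6, 1.0]])
X = rmv_skew_t(n=1000, omega_bar=omega_bar, alpha=[4.0, 2.0], nu=4)
print(X.mean(axis=0), X.shape)
```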

Using classical approaches for outlier detection without considering the actual data distribution mistakenly renders many observations as outliers; see Figs. 6 and 7 (top right panel). The parameters of interest, the mean vector and the covariance matrix, need to be estimated robustly; otherwise the confidence region misrepresents the underlying distribution. In Figs. 6 and 7 (bottom left panel), the classical confidence region is pulled toward the outlying observations. By contrast, the robust confidence region recovers the distribution of the majority of observations because of the robust and efficient estimation of the mean and the covariance matrix. Consequently, the classical principal components are affected by the inefficient estimation of the covariance matrix. We propose using methods that deal with asymmetric data appropriately. The adjusted outlyingness (AO) measure identifies the outliers of the data correctly. Taking a sub-sample of the data as Reference 1 in each of the examples, the results of the Main Algorithm can be observed in the bottom right panels of Figs. 6 and 7.
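The contrast between the classical and robust fits can be reproduced in spirit with scikit-learn, comparing squared Mahalanobis distances from EmpiricalCovariance and MinCovDet against the 95th percentile of \(\chi ^2_2\) used in Figs. 6 and 7. The contaminated sample below is simulated and only illustrative.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import EmpiricalCovariance, MinCovDet

rng = np.random.default_rng(6)

# Contaminated bivariate sample: 90% Gaussian, 10% shifted heavy-tailed points.
clean = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=900)
contam = [5, 5] + rng.standard_t(df=3, size=(100, 2))
X = np.vstack([clean, contam])

cutoff = chi2.ppf(0.95, df=2)           # 95th percentile of chi^2_2, as in Figs. 6-7

# Classical estimates: mean and covariance are pulled toward the contamination.
d2_classical = EmpiricalCovariance().fit(X).mahalanobis(X)

# Robust estimates via the minimum covariance determinant.
d2_robust = MinCovDet(random_state=0).fit(X).mahalanobis(X)

print("flagged (classical):", int((d2_classical > cutoff).sum()))
print("flagged (robust):   ", int((d2_robust > cutoff).sum()))
```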

Fig. 6.
figure 6

Top left panel, the simulated data from the mixture distribution \(f(x)=(1-\varepsilon ) f_1(x)+ \varepsilon f_2(x)\) with contamination proportion \(\varepsilon =\frac{1}{10}\), and \(f_1\) and \(f_2\) being the Gaussian and Student’s t-distribution respectively. Top right panel, the outliers of the data are identified and highlighted in red using the classical Mahalanobis distance and the 95th percentile of the chi-squared distribution with two degrees of freedom. Bottom left panel, the \(95\%\) confidence region for the data computed using the classical estimates of the parameters (cyan) and the robust estimates (gold). Bottom right panel, the Main Algorithm is implemented and the zones are plotted using their associated color tag. (Color figure online)

Fig. 7.
figure 7

Top left panel, the simulated data from the bivariate skew t-distribution. Top right panel, the outliers of the data are identified and highlighted in red using the classical Mahalanobis distance and the 95th percentile of the chi-squared distribution with two degrees of freedom. Bottom left panel, the \(95\%\) confidence region for the data computed using the classical estimates of the parameters (cyan) and the robust estimates (gold). Bottom right panel, the Main Algorithm is implemented and the zones are plotted using their associated color tag. (Color figure online)

4 Experiment

In order to evaluate the performance of our data validation method, we implement the Main Algorithm on a collection of e-nose measurements. We keep the first 3 robust principal components of the data, PC1, PC2, and PC3, for simplification and easy visualization. These 3 principal components correspond to the 3 largest eigenvalues of the robust covariance matrix. Prior to the implementation of the Main Algorithm, the data undergo a pre-processing stage including the imputation of missing values.

The validity of the e-nose measurements is assessed using the Main Algorithm for the 8 months of sampling. In favor of more readable graphs, only a subset of 500 samples out of about 200 thousand observations is plotted. In Fig. 8, the sample points are drawn in gray and each zone is highlighted using its corresponding color. The circles in Fig. 8 are also shown on the PC1-PC2 plane for a better demonstration of the zones.

Fig. 8.
figure 8

A random sample of size \(n=500\) is plotted over the first three robust principal components coordinates. From top left panel to bottom right panel, the colored blobs represent green, blue, yellow, and red zones respectively. (Color figure online)

The interpretation of a zone depends heavily on its definition. For instance, the Green, Blue, and Orange zones represent samples that are very close to samples already observed in either Reference 1 or Reference 2. As the observations in the reference sets were entirely under control, the Green, Blue, and Orange zones affirm the validity of the samples. In addition, the accuracy of the gas concentration predicted for these zones is certified. On the other hand, the gas concentration prediction for samples in the Red zone is less accurate compared with that of the Green, Blue, and Orange zones.

The data that are significantly dissimilar to the already observed data deserve further attention. These data are outliers and are reported in the Red zone. In particular, the gas concentration predictions associated with samples in the Red zone can be very misleading. A remarkable percentage of samples falling in the Yellow and Red zones points to a possible failure of the e-nose equipment.

5 Computational Complexity

Here, we discuss the computational complexity of our proposed algorithm (Main Algorithm). First, a brief introduction to computational complexity is given to facilitate the discussion.

The computational complexity of an algorithm is studied asymptotically using the big O notation [1]. The big O notation describes how quickly the run-time of an algorithm grows relative to its input. For instance, the sum of n values requires \((n-1)\) operations. Consequently, the mean requires n operations, reserving one for the division of the sum by n. As both are bounded by a linear function, they have computational complexity of order O(n). In other words, the run-time of the sum and of the mean grows linearly, in direct proportion to the size of the input. Note that not all algorithms are computationally linear. The computational complexity of the covariance matrix, for instance, is \(O(np^2)\), where n is the sample size and p is the number of attributes, since each covariance calls for a sum of pairwise cross-products, each of complexity O(n). In total, there are \(\frac{p (p-1)}{2}\) off-diagonal cross-products and p square sums for the diagonal entries of the covariance matrix. This yields \(n\{p(p-1)+p\}\) operations. For a fixed number of attributes p, the computation is of order O(n). Likewise, for a fixed number of observations, it is of order \(O(p^2)\). Another nontrivial example of a non-linear algorithm is PCA, or the robust PCA. The computation of robust principal components involves various operations that were briefly discussed in Sect. 2; its computational complexity is discussed below. The computation of robust PCA comprises the following steps:

  1. 1.

Reducing the data space to an affine subspace spanned by the n observations using the singular value decomposition of \((\mathbf {X}-\mathbf {1}_n \hat{\varvec{\mu }})^{^\top }(\mathbf {X}-\mathbf {1}_n \hat{\varvec{\mu }})\), where \(\mathbf {1}_n\) is the n-dimensional column vector with all entries equal to 1. This step is of order \(O(p^3)\), see [13, 16].

  2. 2.

Finding the least outlying points using the Stahel-Donoho affine-invariant outlyingness [9, 32]. Adjusting this outlyingness measure with the minimum covariance determinant location and scale estimators is of order \(O( p n \log n)\), see [17, 18]. The covariance matrix of the non-outlying data, \(\hat{\varvec{\varSigma }}_0\), is then calculated, which is computationally less expensive.

  3. 3.

Performing the principal component analysis on \(\hat{\varvec{\varSigma }}_0\) and choosing the number of principal components to be retained, say \(k_0\) (\(k_0 < p\)). Computing \(\hat{\varvec{\varSigma }}_0\) needs \(n p^2\) operations; thus its complexity is \(O(n p^2)\). The spectral decomposition of the covariance matrix is obtained by applying a matrix-diagonalization method such as the singular value decomposition, which results in \(O(p^3)\) computational complexity. Determining the \(k_0\) largest eigenvalues and their corresponding eigenvectors has time complexity \(O(k_0 p^2)\) [10]. As a result, the time complexity of this step is \(O(n p^2)\).

  4. 4.

    Projecting the data onto the subspace spanned by the first \(k_0\) eigenvectors, i.e. \((\mathbf {X}-\mathbf {1}_n \hat{\varvec{\mu }}) \mathbf {P}_{p \times k_0}\) where \(\mathbf {P}_{p \times k_0}\) is the matrix of eigenvectors corresponding to the first \(k_0\) eigenvalues. This step has \(O(n p k_0)\) time complexity.

  5. 5.

Computing the covariance matrix of the projected points with the fast minimum covariance determinant method has a computational complexity that is linear in n for fixed p, i.e. O(n) [31]. The calculation of the spectral decomposition of the final covariance matrix is bounded by \(O(n k_0)\) time complexity.

Remark 1

The computational complexity of robust PCA is \(O(\max \{p n \log n , n p^2 \})\), or \(O(p^2 n \log n)\) considering the worst case complexity.

To ascertain the complexity of the Main Algorithm, one needs to analyze each step separately. Measurement validation in the e-nose broadly necessitates the calculation of certain steps of the Main Algorithm, namely Step Require, Step 1, Step 3, and Step 4. All these steps, excluding Step 4 of the Main Algorithm (Sub-Algorithm), need to be run only once. Step 4 is repeated upon the arrival of new observations.

First, we evaluate the complexity of Step Require, Step 1, and Step 3, which are run once. Afterwards, Step 4 is analyzed in a similar fashion. Note that for the e-nose data, the number of samples is generally much greater than the number of sensors p. In addition, as the number of sensors p is fixed for a given e-nose, the computational complexity is reported as a function of the number of samples only.

The Main Algorithm starts with the robust PCA of Reference 1. As a result, Step Require has \(O(n_1 \log n_1)\) complexity, assuming p to be fixed. Step 1 requires \(O( n_1 k_0)\) computing time for computing \(\mathbf {X}_{n_{1}\times 11}\mathbf {L}_1\), where \(k_0\) stands for the number of eigenvectors retained in the loading matrix \(\mathbf {L}_1\). Computing the convex hull of these projected values for \(k_0 \le 3\) is of order \(O(n_1 \log n_1)\). For \(k_0 > 3\), the computational complexity of the hull increases exponentially with \(k_0\), see [6, 27]. The same complexity holds for Step 3. Performing the pre-processing steps on the reference sets, including outlier detection using the AO measure, has \(O(n_1 \log n_1)\) complexity [18], assuming that \(n_1 > n_2\), which is common in practice. As a result, Step Require, Step 1, and Step 3, which are performed only once, take \(O(n_1 \log n_1)\) run-time.

Now, we analyze Step 4 in terms of its computational complexity. Step 4 mainly performs the following three tasks.

  1. (i)

Accumulating the new observations with the past history, \(\mathbf {X}^{^\top }_{1:t \times p}=[\mathbf {X}^{^\top }_{1:t-1 \times p}: \mathbf {x}_{t \times p}]\) where \(n_1 < t \le n\), and identifying outliers using the AO measure. This has computational complexity \(O(t \log t)\).

  2. (ii)

    Projecting the observations onto the space of Reference 1, \(\mathbf {x}^{^\top }_{l}\mathbf {L}_1\). This is a simple matrix product and has the computational complexity of \(O(k_0 p)\).

  3. (iii)

Verifying whether the projection of the datum, \(\mathbf {x}^{^\top }_{l}\mathbf {L}_1\), lies within the convex hull of either Reference 1 or Reference 2, which is equivalent to solving a linear optimization problem with linear constraints [8, 22]; a sketch of this feasibility formulation is given after the list. The algorithm used for this purpose has a computational complexity that varies quadratically with the number of vertices of the convex hull, and has \(O(n_1^2 k_0)\) complexity in the worst case. The R code used for solving this linear program resembles the MATLAB code (see Footnote 1) and is available upon request.
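The R code itself is not reproduced here; as an illustration of the feasibility formulation mentioned in item (iii), the following Python sketch tests whether a point can be written as a convex combination of the reference scores using scipy's linprog. For simplicity it uses all reference points rather than only the hull vertices, so it mirrors the cited approach only loosely.

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(point, vertices):
    """Is `point` inside the convex hull of the rows of `vertices`?

    Feasibility LP: the point is in the hull iff it can be written as a
    convex combination of the vertices (weights >= 0 summing to one).
    """
    m, k = vertices.shape
    A_eq = np.vstack([vertices.T, np.ones((1, m))])   # k+1 equality constraints
    b_eq = np.append(point, 1.0)
    res = linprog(c=np.zeros(m), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * m, method="highs")
    return res.status == 0                            # feasible <=> inside

rng = np.random.default_rng(7)
scores_ref1 = rng.standard_normal((200, 3))           # stand-in for the PC scores of Reference 1
print(in_convex_hull(np.zeros(3), scores_ref1))       # centre: expected True
print(in_convex_hull(np.full(3, 10.0), scores_ref1))  # far away: False
```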

Thus, the computational complexity of Step 4 is \(O(t \log t)\), since in practice the convex hull of Reference 1 is computed once in Step 1 and kept fixed prior to this step.

Remark 2

The computational complexity of Main Algorithm is \(O(t \log t)\).

The mean CPU times in seconds for Step Require, Step 1, and Step 3, which need to be run once, and for Step 4, which is repeated for each new sample, are reported in Fig. 9. This figure confirms that the run-times for the ensemble of Steps Require, 1, 3, and 4 agree with the computational complexity evaluated theoretically above. This implies that measurement validation can be achieved with \(O(t \log t)\) time complexity using our proposed method.

Fig. 9.
figure 9

The solid line shows the mean CPU time in seconds as a function of the input size, run on a 1.3 GHz i5 processor. The dashed lines depict the lower and upper bounds of the \(95\%\) confidence interval for the mean CPU time. Left panel, the run-time corresponding to Step Require, Step 1, and Step 3 as a function of the number of samples in Reference 1, \(n_1\). Right panel, the run-time associated with Step 4 as a function of the total number of samples up to the moment, t. In each iteration, 100 new observations are sampled.

6 Conclusion

An electronic nose device, which mainly consists of a multi-sensor array, attempts to mimic the human olfactory system. The sensor array is composed of various sensors selected to react to a wide range of chemicals so as to distinguish between mixtures of analytes. Employing pattern recognition methods, the sensors’ outputs are compared with reference samples to predict the odour concentration. Consequently, the accuracy of the predicted odour concentration depends heavily on the validity of the sensors’ outputs. An automatic procedure that assesses the samples’ validity in an online fashion has been a technical gap and is addressed in this work. A measurement validation process makes it possible to attach a margin of error to the predicted odour concentrations. Furthermore, it allows subsequent actions to be taken, such as re-sampling to re-calibrate the models or checking the e-nose device for possible sensor failures. The proposed measurement validation algorithm initiates a new development in automatic odour detection by minimizing manpower intervention.