Abstract
Selecting features from high-dimensional datasets is an important problem in machine learning. This paper shows that, in the context of filter methods for feature selection, the estimator of the criterion used to select features plays an important role; in particular, the estimators may suffer from a bias when comparing smooth and non-smooth features. This paper analyses the origin of this bias and investigates whether it influences the results of the feature selection process. Results show that non-smooth features tend to be penalised, especially in small datasets.
1 Introduction
High-dimensional datasets are now ubiquitous. Selecting a subset of the most relevant features is useful to ease the learning process, to alleviate the curse of dimensionality, to increase the interpretability of the features and to visualise data, among other benefits. Many works focus on methods to reduce the number of features in datasets [1,2,3,4,5,6,7]. These methods can be roughly categorised into filter methods, wrappers and embedded methods, which all have their respective advantages and drawbacks [1]. This paper focuses on filter methods, which have the advantage of being fast because they do not require training any model during the feature selection process, contrary to wrappers [6] and embedded methods [8].
Filters use a relevance criterion during the feature selection process. Three popular relevance criteria used to select features in regression tasks are the correlation, the mutual information and the noise variance. This paper focuses on mutual information and noise variance because both are able to detect features that have nonlinear relationships with the variable to predict. It shows that the statistical estimators of mutual information and of noise variance both suffer from a bias, mostly when small samples are considered, and that this bias may affect the selection of the features. The paper also shows that this bias disappears in large datasets, but faster when using noise variance than when using mutual information.
The remainder of the paper is organised as follows. Feature selection in regression with filters is detailed in Sect. 2. Section 3 analyses the behaviour of mutual information and the Delta Test, and discusses the potential bias for small-sample datasets. In order to confirm the bias and its consequences, simple experiments are described in Sect. 4 and their results are shown in Sect. 5. Finally, conclusions are given in Sect. 6.
2 Feature Selection with Filters
In the context of filter methods for feature selection, a relevance criterion is necessary to select the most relevant features among all the available ones. The relevance criterion aims at measuring the existing relationship between a feature, or a set of features, and the variable to predict. Several relevance criteria exist. Correlation is the simplest one, but it is only able to detect linear relationships between random variables, and it is restricted to the univariate case (features can only be evaluated individually, which prevents taking into account the possible relations between the features themselves). In this paper, we focus on nonlinear and multivariate relationships between a set of random input variables and one random output variable. For this type of relationship, mutual information (MI) and noise variance are both popular measures used as relevance criteria for filter methods. Both need to be estimated in practice on a finite set of data: traditional estimators are the Kraskov estimator for the former and the Delta Test for the latter. These criteria have been repeatedly used for feature selection in regression problems [9, 10].
This section reviews the mutual information (MI) and noise variance criteria, and their Kraskov and Delta Test estimators. Both estimators are based on k-nearest neighbours. The next sections show that these estimators implicitly take into account a measure of smoothness (Sect. 3), which could lead to a bias in the choice of features during the feature selection process (Sects. 4 and 5).
2.1 Feature Selection with Mutual Information
Mutual information (MI) is a popular criterion for filter methods [5, 11,12,13,14]. Based on entropy, it is a symmetric measure of the dependence between random variables, introduced by Shannon in 1948 [15]. MI measures the information contained in a feature, or in a group of features, with respect to another one. It has been shown to be a reliable criterion to select relevant features in classification [16] and regression [9, 10, 17, 18]. This paper focuses on regression problems.
Let X and Y be two random variables, where X represents the features and Y the target. MI measures the reduction in the uncertainty on Y when X is known:

$$ I(X;Y) = H(Y) - H(Y|X), \qquad (1) $$

where

$$ H(Y) = -\int p_Y(y) \log p_Y(y)\, dy \qquad (2) $$

is the entropy of Y and

$$ H(Y|X) = -\int\!\!\int p_{X,Y}(x,y) \log p_{Y|X}(y|x)\, dx\, dy \qquad (3) $$

is the conditional entropy of Y given X. The mutual information between X and Y is equal to zero if and only if they are independent. If Y can be perfectly predicted as a function of X, then I(X;Y) = H(Y).
In addition to the criterion, feature selection needs a search procedure to find the best feature subset among all possible ones. Given the exponential number of possible subsets, search procedures such as greedy search or genetic algorithms are used to find a good subset of features without having to compute the selection criterion between every subset of variables and the output. Among the evaluated subsets, the one maximising the MI with the output is selected.
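As an illustration of such a search procedure, a minimal sketch of a greedy forward search is given below; it is not the authors' implementation, and the function name greedy_forward_selection and the generic relevance callable (e.g. an MI estimator) are illustrative placeholders. For a criterion that must be minimised, such as the Delta Test of Sect. 2.2, the max would simply be replaced by a min.

    def greedy_forward_selection(X, y, relevance, n_selected):
        """Greedy forward search sketch: at each step, add the feature that
        maximises the relevance criterion of the current candidate subset.
        X is assumed to be a NumPy array of shape (N, d), y of shape (N,)."""
        selected, remaining = [], list(range(X.shape[1]))
        while remaining and len(selected) < n_selected:
            # score every candidate subset obtained by adding one remaining feature
            scores = {j: relevance(X[:, selected + [j]], y) for j in remaining}
            best = max(scores, key=scores.get)
            selected.append(best)
            remaining.remove(best)
        return selected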
In practice, MI cannot be directly computed because it is defined in terms of probability density functions. These probability density functions are unknown when only a finite sample of data is available. Therefore, MI has to be estimated from the dataset. The estimator introduced by Kraskov et al. [19] is based on a k-nearest neighbour method and results from the Kozachenko-Leonenko entropy estimator [20]

$$ \hat{H}(X) = -\psi(k) + \psi(N) + \log c_d + \frac{d}{N} \sum_{i=1}^{N} \log \epsilon_k(i), \qquad (4) $$

where k is the number of neighbours, N is the number of instances in the dataset, d is the dimensionality, $c_d = \frac{2\pi^{d/2}}{d\,\Gamma(d/2)}$ is the volume of the unitary ball of dimension d, $\epsilon_k(i)$ is twice the distance from the ith instance to its kth nearest neighbour and $\psi$ is the digamma function.
Based on (4), the Kraskov estimator (5) of the mutual information is then

$$ \hat{I}(X;Y) = \psi(k) + \psi(N) - \frac{1}{N} \sum_{i=1}^{N} \left[ \psi(\tau_x(i)+1) + \psi(\tau_y(i)+1) \right], \qquad (5) $$

where $\tau_x(i)$ is the number of points located no further than the distance $\epsilon_X(i,k)/2$ from the ith observation in the X space, $\tau_y(i)$ is the number of points located no further than the distance $\epsilon_Y(i,k)/2$ from the ith observation in the Y space, and $\epsilon_X(i,k)/2$ and $\epsilon_Y(i,k)/2$ are the projections into the X and Y subspaces of the distance between the ith observation and its kth neighbour.
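A minimal sketch of this estimator (assuming the maximum norm in the joint space, as in [19]) is shown below; the helper name kraskov_mi is illustrative and not part of any library. In practice, a library implementation such as scikit-learn's mutual_info_regression, which is also based on a kNN estimator, can be used instead.

    import numpy as np
    from scipy.spatial import cKDTree
    from scipy.special import digamma

    def kraskov_mi(x, y, k=6):
        """Sketch of the Kraskov MI estimator (5), using the max-norm."""
        x = np.asarray(x, dtype=float).reshape(len(x), -1)
        y = np.asarray(y, dtype=float).reshape(len(y), -1)
        n = len(x)
        joint = np.hstack([x, y])
        # distance from each point to its k-th neighbour in the joint space,
        # i.e. epsilon(i)/2 in the notation above
        half_eps = cKDTree(joint).query(joint, k=k + 1, p=np.inf)[0][:, -1]
        tree_x, tree_y = cKDTree(x), cKDTree(y)
        # tau_x(i), tau_y(i): points strictly closer than epsilon(i)/2 in each subspace
        tau_x = np.array([len(tree_x.query_ball_point(x[i], half_eps[i] - 1e-12, p=np.inf)) - 1
                          for i in range(n)])
        tau_y = np.array([len(tree_y.query_ball_point(y[i], half_eps[i] - 1e-12, p=np.inf)) - 1
                          for i in range(n)])
        return digamma(k) + digamma(n) - np.mean(digamma(tau_x + 1) + digamma(tau_y + 1))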
2.2 Feature Selection with Noise Variance
Noise variance is another filter criterion used for feature selection. Its definition is even more intuitive than that of mutual information. With this criterion, the noise represents the error in estimating the output variable by a function of the input variables, under the hypothesis that this function could be built (by a machine learning regression model). It is a filter criterion because it does not require building a regression model, but it is close to the idea of a wrapper method because the goal is to evaluate how good a model could be.
Let us consider a dataset with N instances, d features $X_j$, a target Y and N input-output pairs $(x_i, y_i)$. The relationship between these input-output pairs is

$$ y_i = f(x_i) + \epsilon_i, \qquad i = 1, \dots, N, \qquad (6) $$

where f is the unknown function between $x_i$ and $y_i$, and $\epsilon_i$ is the noise, or prediction error, when estimating f. The principle is to select the subsets of features which lead to the lowest prediction error, or lowest noise variance [17].
In practice the noise variance has to be estimated, e.g. with the Delta Test [18]. The Delta Test δ is defined as

$$ \delta = \frac{1}{2N} \sum_{i=1}^{N} \left( y_{NN(i)} - y_i \right)^2, \qquad (7) $$

where N is the size of the dataset and $y_{NN(i)}$ is the output associated to $x_{NN(i)}$, the nearest neighbour of the point $x_i$.
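A direct implementation of (7) only requires one nearest-neighbour query per point; a minimal sketch is given below (the function name delta_test is illustrative).

    import numpy as np
    from scipy.spatial import cKDTree

    def delta_test(X, y):
        """Delta Test sketch (7): half the mean squared difference between each
        output and the output of its nearest neighbour in the input space."""
        X = np.asarray(X, dtype=float).reshape(len(y), -1)
        y = np.asarray(y, dtype=float).ravel()
        # k=2 because the closest point returned by the query is the point itself
        nn_index = cKDTree(X).query(X, k=2)[1][:, 1]
        return 0.5 * np.mean((y[nn_index] - y) ** 2)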
Similarly to the use of mutual information for feature selection, when using the Delta Test the relationships between several subsets of features and Y are evaluated, again with a search procedure such as a greedy search. Among these subsets of features, the one minimising the value of δ is selected. The Delta Test has also been widely used for feature selection [21, 22].
3 Behaviour of kNN-Based Estimators of Relevance Criteria in Small Sample Scenarios
This section analyses the behaviour of the mutual information and noise variance estimators in small datasets.
3.1 Mutual Information Analysis
The Kraskov estimator (5) can be used to estimate MI in regression. However, as a kNN-based estimator of I(X;Y) = H(Y) − H(Y|X), it is affected by the degree of smoothness of the relationship between the target and the considered features. Indeed, the Kraskov estimator assumes that the conditional distribution p(Y|X) is stationary in the k-neighbourhood of x. However, if the neighbourhood of x is large, which is the case when the sample is small, this hypothesis does not hold anymore and the interval of observed values for Y widens. The Kraskov estimator consequently overestimates the conditional entropy H(Y|X) and underestimates I(X;Y). This underestimation is more severe for non-smooth functions, as the interval of Y values in the neighbourhood of x is larger in this case. Consequently, when two features are compared, the one with the smoother relation to Y will tend to be favoured in the feature selection.
3.2 Delta Test Analysis
To estimate the variance of the noise in regression problems, the Delta Test uses a 1-nearest-neighbour method: it looks for the nearest neighbour of each point of the dataset and computes the variation in target values between the point and its nearest neighbour.
The Delta Test, already defined in (7), can be rewritten with (6) as

$$ \delta = \frac{1}{2N} \sum_{i=1}^{N} \left( f(x_{NN(i)}) - f(x_i) + \epsilon_{NN(i)} - \epsilon_i \right)^2, \qquad (8) $$

where the noise $\epsilon_i$ is i.i.d. with variance $\sigma^2$. The average behaviour of the Delta Test can be characterised using a first-order approximation $f(x) \approx f(x_i) + \nabla f(x_i)^T (x - x_i)$, based on the assumption that the nearest neighbour is close enough to make this approximation sufficiently accurate. The expected value of the Delta Test is then approximated as

$$ E[\delta] \approx \sigma^2 + \frac{1}{2N} \sum_{i=1}^{N} E\!\left[ \left( \nabla f(x_i)^T (x_{NN(i)} - x_i) \right)^2 \right]. \qquad (9) $$
The first term of (9) is the noise variance; the second term is related to the smoothness of f and is independent of the noise variance: it measures how much f changes on average from an instance $x_i$ to its nearest neighbour $x_{NN(i)}$. This second term is affected by two factors. First, if the gradient is small (i.e. the function is smooth), the second term remains small. Second, if instances and their nearest neighbours are close (i.e. the dataset is quite large), the second term also remains small. Hence, for small datasets, the second term penalises non-smooth functions.
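For completeness, the step leading to (9) can be sketched as follows, under the assumption that the noise is zero-mean, independent of the inputs and independent across instances, so that the cross terms vanish in expectation:

$$ \begin{aligned}
E[\delta] &\approx \frac{1}{2N}\sum_{i=1}^{N} E\!\left[\left(\nabla f(x_i)^T (x_{NN(i)} - x_i) + \epsilon_{NN(i)} - \epsilon_i\right)^2\right] \\
&= \frac{1}{2N}\sum_{i=1}^{N} \left( E\!\left[\left(\nabla f(x_i)^T (x_{NN(i)} - x_i)\right)^2\right] + \underbrace{E\!\left[\left(\epsilon_{NN(i)} - \epsilon_i\right)^2\right]}_{=\,2\sigma^2} \right) \\
&= \sigma^2 + \frac{1}{2N}\sum_{i=1}^{N} E\!\left[\left(\nabla f(x_i)^T (x_{NN(i)} - x_i)\right)^2\right].
\end{aligned} $$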
3.3 Discussion
In small datasets, smooth relations between features and output will have, on average, a smaller Delta Test value or a higher MI value. Conversely, a non-smooth relation will have, on average, a larger Delta Test value or a smaller MI value, even with the same level of target noise. As discussed above, estimators based on k-nearest-neighbour methods seem to be biased by the smoothness of the functions. Both estimators make the assumption that the function does not vary too much in the proximity of the neighbours. However, in small samples and with non-smooth functions, this assumption is violated, which introduces a bias in the estimators. It is thus anticipated that smooth features will tend to be selected first when comparing two features that have the same level of information content to predict the output Y. However, this short analysis does not answer the question of whether this estimation bias has a real influence during the feature selection process, nor whether the problem is more severe with MI or with noise variance. The next section evaluates these questions experimentally.
4 Experimental Settings
In order to study to what extent smoothness can bias selection criteria such as the mutual information or the Delta Test in regression, the experiments performed in this paper consider several functions with various degrees of smoothness and several dataset sizes. These experiments are conducted to give some insights into the questions raised in the previous section, i.e. does the estimation bias have an influence when comparing features, and is the problem more severe with MI or with noise variance.
Six different periodic functions, f1 to f6, have been generated with different frequencies and different levels of noise. Figures 1(a), (b), (c), (d), (e) and (f) represent the six functions f1, f2, f3, f4, f5 and f6, respectively. In theory, the features associated with f1, f2 and f3 (resp. f4, f5 and f6) should be selected equally in a feature selection process, as their prediction errors (or levels of noise) are identical.
The experiments have been performed with various sample sizes, from extremely small to large. For each sample size, estimators of the two decision criteria, the mutual information and the noise variance, have been used to drive the selection process, in order to show the influence of the bias introduced by the smoothness of the function on both criteria. For the noise variance, the estimator used is the Delta Test, based on a 1-nearest-neighbour method and described in Sect. 2.2. For the mutual information, the estimator used is the one introduced by Kraskov et al. and described in Sect. 2.1, also based on a k-NN method (with k = 6 as suggested in [19]). All experiments have been repeated 10 times and averages are reported.
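A minimal sketch of this protocol is given below. The sine functions, frequencies, noise level and sample sizes are illustrative stand-ins (the exact functions f1 to f6 used in the paper are not reproduced here), and scikit-learn's mutual_info_regression is used as a readily available kNN-based MI estimator.

    import numpy as np
    from scipy.spatial import cKDTree
    from sklearn.feature_selection import mutual_info_regression

    rng = np.random.default_rng(0)

    def delta_test(X, y):
        # nearest neighbour of each point (k=2: the closest match is the point itself)
        nn = cKDTree(X).query(X, k=2)[1][:, 1]
        return 0.5 * np.mean((y[nn] - y) ** 2)

    frequencies = [1, 4, 16]          # increasing frequency = decreasing smoothness
    noise_std = 0.1                   # same noise level for all stand-in functions
    sample_sizes = [40, 70, 300, 3000]
    n_repetitions = 10

    for n in sample_sizes:
        for w in frequencies:
            mi, dt = [], []
            for _ in range(n_repetitions):
                x = rng.uniform(0, 1, size=(n, 1))
                y = np.sin(2 * np.pi * w * x[:, 0]) + rng.normal(0, noise_std, n)
                mi.append(mutual_info_regression(x, y, n_neighbors=6)[0])
                dt.append(delta_test(x, y))
            print(f"N={n:5d}  frequency={w:2d}  "
                  f"MI={np.mean(mi):.3f}  delta={np.mean(dt):.4f}")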
5 Experimental Results
Figure 2 shows the average value (over 10 repetitions) of the mutual information estimator (Fig. 2(a) and (c)) and of the Delta Test estimator (Fig. 2(b) and (d)) for increasing dataset sizes.
All figures show a clear effect of overestimating the noise variance and underestimating the mutual information in small datasets. The over- and underestimations are much more severe for the non-smooth functions (f3 and f6). It is also clear that when the size of the dataset increases, the biases tend to disappear. More interestingly, the asymptotic values of the Delta Test are reached in this experiment when the dataset includes a few hundred instances, while for the MI a few thousand instances are necessary in the same experiment (the horizontal logarithmic scales showing the number of instances differ between the left and right figures). This is an argument in favour of using the noise variance rather than the mutual information.
When comparing the upper and lower parts of Fig. 2 (both left, MI, and right, noise variance), it is also interesting to see that for small samples the order of selection between features can be inverted. For example, let us consider the Delta Test values in Figs. 2(b) and (d) for three cases. First, for 40 instances, the six functions are ranked in the following order: f1, f2, f4, f5, f3 and f6, given that the features with the lowest Delta Test values are selected first. Without the bias effect shown in this paper, it would have been expected that f1, f2 and f3 would be selected first, as their capacity to predict Y is higher (or their noise is lower) than that of f4, f5 and f6. Second, for 70 instances, the six functions are ranked in the order f1, f2, f3, f4, f5 and f6. In this case, the order of selection between features is no longer inverted but the bias still remains. Finally, for approximately 300 instances, the bias disappears: the three functions f1, f2 and f3 obtain the same Delta Test value and the three functions f4, f5 and f6 obtain another, higher, common value. These cases show that, for small samples, the bias influences the order of selection between features and that it disappears with larger datasets. A similar behaviour can be observed for MI in Figs. 2(a) and (c).
6 Conclusion
To the best of our knowledge, no work in the literature focuses on the bias explicitly associated with smoothness in a feature selection context. Wookey and Konidaris [23] use smoothness as prior knowledge during feature selection, but only for data regularisation.
This paper shows that an overestimation of the noise variance and an underestimation of the mutual information can occur in small datasets when the function to estimate is not smooth. Experiments have been conducted with both criteria on functions with various degrees of smoothness and levels of noise, for different dataset sizes. They confirm the theoretical discussion and show that the biases in the estimations are much more severe when using mutual information than when using noise variance; this is an argument in favour of using the latter rather than the former.
The experiments also confirm that in a feature selection process, where a decision to select a feature is taken by comparing values of the criteria between different candidate features or groups of features, the order of selection may be affected (a smooth feature with a low dependency on the output could be selected before a non-smooth feature with a high dependency). This is a serious shortcoming that should be taken into account when designing a feature selection algorithm. For example, the noise variance could be explicitly estimated and used to remove the bias, or the selection process could be improved so that non-smooth features are not unduly penalised.
References
Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)
François, D., Rossi, F., Wertz, V., Verleysen, M.: Resampling methods for parameter-free and robust feature selection with mutual information. Neurocomputing 70(7–9), 1276–1288 (2007)
Verleysen, M., Rossi, F., François, D.: Advances in feature selection with mutual information. In: Similarity-Based Clustering, pp. 52–69 (2009)
Frénay, B., van Heeswijk, M., Miche, Y., Verleysen, M., Lendasse, A.: Feature selection for nonlinear models with extreme learning machines. Neurocomputing 102, 111–124 (2013)
Gomez-Verdejo, V., Verleysen, M., Fleury, J.: Information-theoretic feature selection for functional data classification. Neurocomputing 72(16–18), 3580–3589 (2009)
Kohavi, R., John, G.H.: Wrappers for feature subset selection. Artif. Intell. 97, 273–324 (1997)
Paul, J., D’Ambrosio, R., Dupont, P.: Kernel methods for heterogeneous feature selection. Neurocomputing 169, 187–195 (2015)
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least angle regression. Ann. Stat. 32, 407–499 (2004)
Frénay, B., Doquire, G., Verleysen, M.: Is mutual information adequate for feature selection in regression? Neural Netw. 48, 1–7 (2013)
Doquire, G., Frénay, B., Verleysen, M.: Risk estimation and feature selection. In: Proceedings of the 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2013) (2013)
Degeest, A., Verleysen, M., Frénay, B.: Feature ranking in changing environments where new features are introduced. In: 2015 International Joint Conference on Neural Networks (IJCNN), pp. 1–8, July 2015
Brown, G., Pocock, A., Zhao, M., Lujan, M.: Conditional likelihood maximisation: a unifying framework for mutual information feature selection. J. Mach. Learn. Res. 13, 27–66 (2012)
Battiti, R.: Using mutual information for selecting features in supervised neural net learning. IEEE Trans. Neural Netw. 5, 537–550 (1994)
Vergara, J.R., Estévez, P.A.: A review of feature selection methods based on mutual information. Neural Comput. Appl. 24, 175–186 (2014)
Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27(379–423), 623–656 (1948)
Frénay, B., Doquire, G., Verleysen, M.: Theoretical and empirical study on the potential inadequacy of mutual information for feature selection in classification. Neurocomputing 112, 64–78 (2013)
Guillén, A., Sovilj, D., Mateo, F., Rojas, I., Lendasse, A.: New methodologies based on delta test for variable selection in regression problems. In: Workshop on Parallel Architectures and Bioinspired Algorithms, Toronto, Canada (2008)
Yu, Q., Séverin, E., Lendasse, A.: Variable selection for financial modeling. In: Proceedings of the CEF 2007, 13th International Conference on Computing in Economics and Finance, Montréal, Quebec, Canada, pp. 237–241 (2007)
Kraskov, A., Stögbauer, H., Grassberger, P.: Estimating mutual information. Phys. Rev. E 69, 066138 (2004)
Kozachenko, L.F., Leonenko, N.: Sample estimate of the entropy of a random vector. Probl. Inform. Transm. 23, 95–101 (1987)
Eirola, E., Liitiäinen, E., Lendasse, A., Corona, F., Verleysen, M.: Using the delta test for variable selection. In: Proceedings of ESANN 2008 (2008)
Eirola, E., Lendasse, A., Corona, F., Verleysen, M.: The delta test: the 1-NN estimator as a feature selection criterion. In: Proceedings of the 2014 International Joint Conference on Neural Networks (IJCNN), pp. 4214–4222, July 2014
Wookey, D.S., Konidaris, G.D.: Regularized feature selection in reinforcement learning. Mach. Learn. 100(2), 655–676 (2015)