GPmix is a clustering algorithm for functional data generated from Gaussian process mixtures. Although designed for Gaussian process mixtures, GPmix also performed well in our experimental study on functional data that are not from Gaussian process mixtures.
The main steps of the algorithm are:
- Smoothing: Apply smoothing methods on the raw data to get continuous functions.
- Projection: Project the functional data onto a few randomly generated functions.
- Learning GMMs: For each projection function, learn a univariate Gaussian mixture model from the projection coefficients.
- Ensemble: Extract a consensus clustering from the multiple GMMs.
If you use this package in your research, please cite:
```bibtex
@InProceedings{pmlr-v235-akeweje24a,
  title     = {Learning Mixtures of {G}aussian Processes through Random Projection},
  author    = {Akeweje, Emmanuel and Zhang, Mimi},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {720--739},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
}
```
This quick start guide will demonstrate how to use the package with the CBF dataset, one of the real-world datasets tested in our paper. Follow these steps to prepare the dataset for analysis:
```python
import numpy as np

# Combine the CBF train and test splits; the first column holds the class labels.
data = np.concatenate([np.loadtxt('CBF/CBF_TEST.txt'), np.loadtxt('CBF/CBF_TRAIN.txt')])
X, y = data[:, 1:], data[:, 0]
```
To use the GPmix algorithm in your project, start by importing the necessary modules. The following statements import all the main functionality from the GPmix package, together with the utility for estimating the number of clusters:
```python
from GPmix import *
from GPmix.misc import estimate_nclusters
```
Begin by initializing the `Smoother` object, specifying the type of basis for the smoothing process. You can customize the smoothing by passing additional configurations through the `basis_params` argument. If not specified, the package will automatically determine suitable configurations using methods such as random grid search and generalized cross-validation. After initialization, apply the `fit` method to the raw data to obtain the fitted functional data.

For this demonstration, we will use the Fourier basis.
```python
sm = Smoother(basis='fourier')
fd = sm.fit(X)
fd.plot(group=y)
```
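If you prefer to control the smoothing rather than rely on the automatic search, you can pass method-specific settings through `basis_params` (documented in the API reference below). A minimal sketch, with illustrative values only:

```python
# Example only: a cubic B-spline basis with 20 basis functions.
sm_bspline = Smoother(basis='bspline', basis_params={'order': 3, 'n_basis': 20})
fd_bspline = sm_bspline.fit(X)
```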
To project the fitted functional data onto specified projection functions, use the `Projector` object. Initialize it with the type of projection functions and the desired number of projections: the `basis_type` argument specifies the type of projection functions, the `n_proj` argument defines the number of projections, and the `basis_params` argument allows for further configuration of the projection functions.

For this demonstration, we will use wavelets as projection functions, specifying the wavelet family via `basis_params`. After initializing, apply the `fit` method to the functional data object to compute the projection coefficients. Here, we use 14 projection functions generated from the Haar wavelet family.
```python
proj = Projector(basis_type='wavelet', n_proj=14, basis_params={'wv_name': 'haar'})
coeffs = proj.fit(fd)
```
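To inspect the result of the projection step, you can use the `plot_basis` and `plot_projection_coeffs` methods documented in the API reference below; assuming the coefficients are returned as a NumPy array, their shape can be checked directly:

```python
print(coeffs.shape)            # expected: (n_proj, sample size), here (14, len(X))
proj.plot_basis()              # the 14 generated Haar wavelet projection functions
proj.plot_projection_coeffs()  # distributions of the projection coefficients
```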
The `UniGaussianMixtureEnsemble` object facilitates ensemble clustering by fitting a univariate Gaussian mixture model (GMM) to each set of projection coefficients. Follow these steps:

- Initialize the `UniGaussianMixtureEnsemble` object by specifying the number of clusters (`n_clusters`) you want to identify in your dataset.
- Use the `fit_gmms` method to obtain a collection of GMMs, one for each set of projection coefficients.
- Use the `get_clustering` method, which aggregates the results from the individual GMMs, to form a consensus clustering.

For this demonstration, we will identify 3 clusters in the functional data.
```python
model = UniGaussianMixtureEnsemble(n_clusters=3)
model.fit_gmms(coeffs)
pred_labels = model.get_clustering()
```
To visualize the clustering result, apply the `plot_clustering` method to the functional data object:

```python
model.plot_clustering(fd)
```
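You can also inspect the individual GMM fits underlying the consensus with the `plot_gmms` method (see the API reference below):

```python
model.plot_gmms(ncol=4)  # one GMM-fit panel per projection, arranged in 4 columns
```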
Furthermore, the `UniGaussianMixtureEnsemble` object supports the calculation of several clustering validation indices. For external validation (comparing the generated clusters against true labels), you can calculate the Adjusted Mutual Information, Adjusted Rand Index, and Correct Classification Accuracy by passing the true labels as an argument. For internal validation (assessing the internal structure of the clusters), you can calculate the Silhouette Score and Davies-Bouldin Score by passing the functional data object as an argument. These metrics help evaluate the effectiveness of the clustering.
For this demonstration, we calculate all the clustering validation metrics.
```python
model.adjusted_mutual_info_score(y)       # Adjusted Mutual Information
model.adjusted_rand_score(y)              # Adjusted Rand Index
model.correct_classification_accuracy(y)  # Correct Classification Accuracy
model.silhouette_score(fd)                # Silhouette Score
model.davies_bouldin_score(fd)            # Davies-Bouldin Score
```
To estimate the optimal number of clusters in the functional data, the package includes the `estimate_nclusters` function. This function employs a systematic search to identify the number of clusters that minimizes the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC), as detailed in our paper. Here is how you can apply it to your data:

```python
estimate_nclusters(fd)
```
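The estimate can then be fed into the ensemble step. A small sketch, assuming `estimate_nclusters` returns the selected number of clusters as an integer:

```python
n_hat = estimate_nclusters(fd)  # assumption: returns the selected number of clusters
model = UniGaussianMixtureEnsemble(n_clusters=n_hat)
model.fit_gmms(coeffs)
labels = model.get_clustering()
```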
The simulation scenarios investigated in our paper are available in simulations.py. To reproduce the results for a specific scenario, clone the repo and execute:

```
python GPmix_Clustering.py data_configs/scenario_<tag>_config.yml
```

Replace `<tag>` with the appropriate scenario identifier, which ranges from A to L. Each tag corresponds to a configuration file located in the data_configs directory. Executing the command with the relevant tag replicates the results for that scenario.
Apply smoothing methods on the raw data to get continuous functions.
```python
Smoother(basis = 'bspline', basis_params = {}, domain_range = None)
```
Parameters
- basis ({'bspline', 'fourier', 'wavelet', 'nadaraya_watson', 'knn'}) : a string specifying the smoothing method to use. The default value is `'bspline'`.
- basis_params (dict) : additional parameters for the selected smoothing method. The default value is `{}`. Below are examples of how to specify these parameters for different smoothing methods:

```python
Smoother(basis = 'bspline', basis_params = {'order': 3, 'n_basis': 20})
Smoother(basis = 'wavelet', basis_params = {'wavelet': 'db5', 'mode': 'soft'})
Smoother(basis = 'knn', basis_params = {'bandwidth': 1.0})
Smoother(basis = 'fourier', basis_params = {'n_basis': 20, 'period': 1})
```
For all smoothing methods except wavelet smoothing, if `basis_params` is not specified, the parameter values are determined by the Generalized Cross-Validation (GCV) technique.

For wavelet smoothing, the wavelet shrinkage denoising technique is implemented, requiring two parameters:

- `'wavelet'` : the wavelet family to use for smoothing. A list of all supported discrete wavelets can be obtained by running `print(pywt.wavelist(kind='discrete'))`.
- `'mode'` : the method and extent of denoising. The available modes are {'soft', 'hard', 'garrote', 'greater', 'less'}.

For wavelet smoothing, if `basis_params` is not specified, the default configuration `basis_params = {'wavelet': 'db5', 'mode': 'soft'}` will be used.

- domain_range (tuple) : the domain of the functions. The default value is `None`. If `domain_range` is set to `None`, the domain range will default to `[0, 1]` if the data is array-like; if the data is an `FDataGrid` object, the `domain_range` of that object is used.
Attributes
- fd_smooth (FDataGrid): functional data obtained via the smoothing technique.
Methods
`fit(X, return_data = True)` : apply a smoothing method on the raw data `X` to get continuous functions.

- X : raw data, array-like of shape (n_samples, n_features) or FDataGrid object.
- return_data (bool) : return the functional data if True. The default value is `True`.
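As an illustration of these parameters, here is a hedged sketch (the domain below is an arbitrary example):

```python
# Example only: smooth array-like data observed on the domain [0, 10].
sm = Smoother(basis='fourier', domain_range=(0, 10))
fd = sm.fit(X)          # return_data=True, so the functional data is returned
fd_attr = sm.fd_smooth  # the same functional data is also kept as an attribute
```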
Project the functional data onto a few randomly generated functions.
```python
Projector(basis_type, n_proj = 3, basis_params = {})
```
Parameters
- basis_type ({'fourier', 'fpc', 'wavelet', 'bspline', 'ou', 'rl-fpc'}) : a string specifying the type of projection functions. Supported `basis_type` options are: eigenfunctions from the fPC decomposition (`'fpc'`), random linear combinations of eigenfunctions (`'rl-fpc'`), B-splines, Fourier basis, wavelets, and Ornstein-Uhlenbeck (`'ou'`) random functions.
- n_proj (int) : number of projection functions to generate. The default value is `3`.
- basis_params (dict) : additional hyperparameters required by `'fourier'`, `'bspline'` and `'wavelet'`. The default value is `{}`. Below are examples of how to specify these hyperparameters:

```python
Projector(basis_type = 'fourier', basis_params = {'period': 2})
Projector(basis_type = 'bspline', basis_params = {'order': 1})
Projector(basis_type = 'wavelet', basis_params = {'wv_name': 'haar', 'resolution': 1})
```

For `fourier`, the default configuration sets the `'period'` equal to the length of the domain of the functional data. This approach works well for all datasets investigated in our paper; however, we have included the period as a hyperparameter in case users want to project onto Fourier functions with lower (or higher) oscillations. For `bspline`, if `basis_params` is not specified, the default configuration `basis_params = {'order': 3}` will be applied. Similarly, for `wavelet`, if `basis_params` is not specified, the default configuration `basis_params = {'wv_name': 'db5', 'resolution': 1}` will be applied.
Attributes
- n_features (int) : number of evaluation points for each sample curve and for the projection functions.
- basis (FDataGrid) : generated projection functions.
- coefficients (array-like of shape (n_proj, sample size)) : projection coefficients.
Methods
`fit(FDataGrid)` : compute the projection coefficients. Returns an array-like object of shape (n_proj, sample size).
`plot_basis()` : plot the projection functions.
`plot_projection_coeffs(**kwargs)` : plot the distribution of the projection coefficients. Takes `kwargs` from `seaborn.histplot`.
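A short, illustrative use of these methods (the parameter values are examples only):

```python
proj = Projector(basis_type='fourier', n_proj=5)
coeffs = proj.fit(fd)                 # array-like of shape (5, sample size)
proj.plot_basis()                     # the 5 random Fourier projection functions
proj.plot_projection_coeffs(bins=30)  # 'bins' is forwarded to seaborn.histplot
```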
For each projection function, learn a univariate Gaussian mixture model from the projection coefficients. Then extract a consensus clustering from the multiple GMMs.
```python
UniGaussianMixtureEnsemble(n_clusters, init_method = 'kmeans', n_init = 10, mom_epsilon = 5e-2)
```
Parameters
- n_clusters (int) : number of mixture components in the GMMs.
- init_method ({'kmeans', 'k-means++', 'random', 'random_from_data', 'mom'}) : method for initializing the EM algorithm (for estimating the GMM parameters). The default value is `'kmeans'`.
- n_init (int) : number of repeats of the EM algorithm, each with a different initialization. The algorithm returns the best GMM fit. The default value is `10`.
- mom_epsilon (float) : lower bound for the GMM weights, only applicable if `init_method = 'mom'`. The default value is `5e-2`.
Attributes
- n_projs (int) : number of base clusterings (or projections).
- data_size (int) : sample size.
- gmms (list) : a list of univariate GMMs, one for each projection function.
- clustering_weights_ (array-like of shape (n_projs,)) : weights for the base clusterings.
Methods
`fit_gmms(projs_coeffs, n_jobs = -1, **kwargs)` : fit a GMM to each set of projection coefficients.

- projs_coeffs (array-like of shape (n_proj, sample size)) : projection coefficients.
- n_jobs (int) : number of concurrently running jobs to parallelize fitting the GMMs. The default value is `-1`, which uses all available CPUs.
- kwargs : any keyword argument of `joblib.Parallel`.
`plot_gmms(ncol = 4, fontsize = 12, fig_kws = {}, **kwargs)` : visualize the GMM fits.

- ncol (int) : number of subplot columns. The default value is `4`.
- fontsize (int) : fontsize for the plot labels. The default value is `12`.
- fig_kws (dict) : keyword arguments for the figure (subplots). The default value is `{}`.
- kwargs : other keyword arguments for customizing `seaborn.histplot`.
`get_clustering(weighted_sum = True, precompute_gmms = None)` : obtain the consensus clustering. Returns an array-like object of shape (sample size,), the cluster labels for the sample curves.

- weighted_sum (bool) : whether the total misclassification probability, which measures the overlap among the GMM components, should be weighted by the mixing proportions. The default value is `True`.
- precompute_gmms (list) : a subset of the fitted GMMs. By default, the consensus clustering is extracted from all the fitted GMMs; this parameter allows selecting a subset of the fitted GMMs for constructing the consensus clustering.
`plot_clustering(FDataGrid)` : visualize the clustered functional data.
`adjusted_mutual_info_score(true_labels)` : compute the Adjusted Mutual Information.

- true_labels (array-like of shape (sample size,)) : true cluster labels.

`adjusted_rand_score(true_labels)` : compute the Adjusted Rand Index.

- true_labels (array-like of shape (sample size,)) : true cluster labels.

`correct_classification_accuracy(true_labels)` : compute the Correct Classification Accuracy.

- true_labels (array-like of shape (sample size,)) : true cluster labels.

`silhouette_score(FDataGrid)` : compute the Silhouette Score.
`davies_bouldin_score(FDataGrid)` : compute the Davies-Bouldin Score.
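As an example, the documented `gmms` attribute and the `precompute_gmms` parameter make it possible to rebuild the consensus from a subset of the base clusterings without refitting (the subset below is an arbitrary choice):

```python
model = UniGaussianMixtureEnsemble(n_clusters=3)
model.fit_gmms(coeffs, n_jobs=-1)
labels_all = model.get_clustering()                                # all fitted GMMs
labels_sub = model.get_clustering(precompute_gmms=model.gmms[:7])  # first 7 only
```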
The function `estimate_nclusters` employs a systematic search to identify the number of clusters that minimizes the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC).
```python
estimate_nclusters(fdata, ncluster_grid = None)
```
Parameters
- fdata (FDataGrid) : functional data object.
- ncluster_grid (array-like) : specifies the grid within which the number of clusters is searched. The default value is `[2, 3, ..., 14]`.
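A small usage sketch (the grid below is an arbitrary example):

```python
# Restrict the search to 2 through 7 clusters.
n_hat = estimate_nclusters(fd, ncluster_grid=[2, 3, 4, 5, 6, 7])
```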
This project is under active development. If you find a bug or anything that needs correction, please let us know.
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. Please make sure to update tests as appropriate.