Regularized adversarial learning for normalization of multi-batch untargeted metabolomics data

RALPS stands for Regularized Adversarial Learning Preserving Similarity. It is a method for eliminating batch effects in omics data, originally developed to harmonize multi-batch flow-injection mass spectrometry measurements acquired far apart in time.

Published in Bioinformatics.

Requirements

hdbscan==0.8.27  
matplotlib==3.4.1  
numpy==1.20.0  
pandas==1.2.4  
scikit-learn==0.24.2  
scipy==1.6.3  
seaborn==0.11.1  
torch==1.8.1    
umap-learn==0.5.1

RALPS has been tested on CPU and GPU under macOS and Windows.
Normalizing a dataset with ~3000 samples and ~150 metabolites took 5.82 minutes per run on average (30 epochs).

How to run normalization

Run the following command from the src directory to normalize data with RALPS:
python ralps.py -n path/to/config.csv

The config file should contain paths to the data and batch information files, along with the other parameters. All required fields and available parameters are described below.
Example input files are provided in the examples directory of the repository.

Config file structure

Parameter	Comment	Default value
data_path	path to a csv data file	-
info_path	path to a csv batch info file	-
out_path	path to a new folder to save results to	-
latent_dim	dimension of the bottleneck linear layer	-1 (automatically derived from PCA)
variance_ratio	percent of explained variance to derive latent_dim	0.9,0.95,0.99
n_replicates	mean number of replicates in the data	3
grid_size	size of the randomized grid search (# of runs)	1
d_lr	classifier learning rate	0.00005-0.005
g_lr	autoencoder learning rate	0.00005-0.005
d_lambda	classifier loss coef	0.-10.
g_lambda	autoencoder regularization term coef	0.-10.
v_lambda	variation loss coef	0.-10.
train_ratio	train-test split ratio	0.9
batch_size	data loader batch size	32,64,128
epochs	# of epochs to train	30
skip_epochs	# of epochs to skip for model selection	3
keep_checkpoints	save all model checkpoints after training	False (keep only best model)
device	device to train on (Torch)	cpu
plots_extension	save plots with this extension	png
min_relevant_intensity	missing values before normalization are replaced with this; values below this after normalization are masked with zeros	1000
allowed_vc_increase	fraction of sample's VC increase allowed (not contributing to the variation loss)	0.05
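
The table states that latent_dim can be derived automatically from PCA using variance_ratio, but it does not spell out the procedure. One common approach, sketched here as an assumption rather than a copy of the RALPS code, is to take the smallest number of principal components whose cumulative explained variance reaches the requested ratio. The function name latent_dim_from_pca and the toy data are hypothetical.

```python
import numpy as np


def latent_dim_from_pca(samples, variance_ratio=0.95):
    """Smallest number of principal components whose cumulative
    explained variance reaches variance_ratio.

    samples: array of shape (n_samples, n_features). A RALPS data file
    stores features as rows, so it would need transposing first.
    """
    centered = samples - samples.mean(axis=0)
    # Squared singular values of the centered matrix are proportional
    # to the variance explained by each principal component.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    explained = singular_values ** 2
    cumulative = np.cumsum(explained) / explained.sum()
    # Index of the first component at which the threshold is reached.
    k = int(np.searchsorted(cumulative, variance_ratio)) + 1
    # Guard against rounding error when variance_ratio is 1.0.
    return min(k, len(cumulative))


# Toy matrix (assumption): 200 samples, 10 features driven by
# 2 hidden factors plus a small amount of noise.
rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 10))
samples = factors @ loadings + 0.01 * rng.normal(size=(200, 10))

for ratio in (0.9, 0.95, 0.99):
    print(ratio, latent_dim_from_pca(samples, ratio))
```

A higher variance_ratio can only keep the latent dimension the same or increase it, which is why several ratios are reasonable candidates for the grid search.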

For most parameters, comma-separated values (e.g., 'batch_size') or dash-separated intervals (e.g., 'd_lr') can be provided. In the randomized grid search, values are then sampled uniformly from the listed options or from the given interval. Otherwise, the exact value provided is used.
Default parameter values can be used by setting '-1'.
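
The sampling rules above can be illustrated with a short sketch. It is not the RALPS parser: the function name sample_parameter is hypothetical, it handles numeric parameters only, and it assumes plain decimal notation (no scientific notation such as 1e-5, which would clash with the dash separator).

```python
import random


def sample_parameter(raw, rng):
    """Turn one numeric config value into a concrete setting.

    Returns None for '-1' so that the caller can fall back to the
    default value. Hypothetical helper, not part of RALPS.
    """
    raw = raw.strip()
    if raw == "-1":
        return None
    if "," in raw:
        # Comma-separated options: pick one uniformly at random.
        return float(rng.choice(raw.split(",")))
    if "-" in raw[1:]:
        # Dash-separated interval: sample uniformly between the bounds.
        # Searching from position 1 skips a leading minus sign.
        split_at = raw.index("-", 1)
        low, high = float(raw[:split_at]), float(raw[split_at + 1:])
        return rng.uniform(low, high)
    # A single value is used exactly as given.
    return float(raw)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
for name, raw in [("batch_size", "32,64,128"),
                  ("d_lr", "0.00005-0.005"),
                  ("g_lambda", "0.-10."),
                  ("epochs", "30"),
                  ("grid_size", "-1")]:
    print(name, sample_parameter(raw, rng))
```

The helper returns floats, so integer parameters such as batch_size or epochs would need casting with int() before use.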

Data file structure

	sample_id_1	sample_id_M
feature_1	count	count
...
feature_N	count	count
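
As an illustration of this layout, the following sketch builds a tiny data table with pandas, writes it as CSV, and reads it back. The intensity values and the three-feature, two-sample size are made up for the example; only the orientation (features as rows, samples as columns, ids in the first column and the header) comes from the structure above.

```python
import io

import pandas as pd

# Hypothetical toy values: one row per feature, one column per sample.
data = pd.DataFrame(
    {
        "sample_id_1": [15200.0, 0.0, 8300.0],
        "sample_id_2": [14900.0, 2100.0, 7900.0],
    },
    index=["feature_1", "feature_2", "feature_3"],
)

# Write to an in-memory buffer; use a file path instead for a real file.
buffer = io.StringIO()
data.to_csv(buffer)  # feature ids go into the first, unnamed column
buffer.seek(0)

# Read it back, treating the first column as the feature index.
loaded = pd.read_csv(buffer, index_col=0)
n_features, n_samples = loaded.shape
print(n_features, "features x", n_samples, "samples")
```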

Batch info file structure

	batch	group	benchmark
sample_id_1	1	reg_1	0
sample_id_2	1	reg_1	0
sample_id_3	2	0	0
...
sample_id_M-1	k	0	bench_M
sample_id_M	k	0	bench_M

Batch column indicates samples' batch labels.
Group column indicates groups of identical samples (replicates), used for regularization. If several samples have the same label (e.g., 'reg_1'), they are treated as replicates of the same material. While training, samples of the same group are encouraged to appear in the same cluster. Use '0' or '' to provide no information about similarity of samples.
Benchmark indicates groups of identical samples taken as benchmarks in model evaluation. They are not used for regularization while training, unless they appear in the group column as well.
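
To make the roles of the group and benchmark columns concrete, here is a small pandas sketch that builds a batch info table and lists which samples share a label. The sample ids, labels, and the helper name labelled_groups are hypothetical; the sketch only mirrors the rule that '0' or '' carries no information.

```python
import pandas as pd

# Hypothetical batch info: two replicates used for regularization in
# batch 1 and two benchmark samples in batch 2.
info = pd.DataFrame(
    {
        "batch": [1, 1, 2, 2],
        "group": ["reg_1", "reg_1", "0", "0"],
        "benchmark": ["0", "0", "bench_1", "bench_1"],
    },
    index=["sample_id_1", "sample_id_2", "sample_id_3", "sample_id_4"],
)


def labelled_groups(column):
    """Map each label to its sample ids, skipping '0' and ''."""
    # Empty CSV cells are read as NaN, so normalise them to '' first.
    labels = column.fillna("").astype(str)
    labelled = labels[~labels.isin(["0", ""])]
    groups = labelled.groupby(labelled).groups
    return {label: list(ids) for label, ids in groups.items()}


print("regularization groups:", labelled_groups(info["group"]))
print("benchmark groups:", labelled_groups(info["benchmark"]))
```

A sample that should serve both purposes would simply carry a label in both columns.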

How to evaluate checkpoints

If you choose to keep checkpoints in the config file, you will find the autoencoder model at each training epoch saved in the checkpoints directory. You can select a few checkpoints based on the training history to obtain alternative normalization solutions and the corresponding evaluation plots.
To do that, remove unnecessary checkpoints and run the following command from the src directory:
python ralps.py -e path/to/directory/with/checkpoints/

Important: This works only with default RALPS output (directories and filenames should not be changed).

How to remove outliers

If you wish to remove outliers from the normalized data, as proposed in the paper, run the following command from the src directory:
python ralps.py -r path/to/normalized/data.csv

Important: This works only with default RALPS output (directories and filenames should not be changed).

How to change default configuration

If you wish to reconfigure RALPS (e.g., to use a different clustering algorithm as default, or to change default parameter values), you can do so by editing src/constants.py.
