RALPS stands for Regularized Adversarial Learning Preserving Similarity. It is a novel method for eliminating batch effects in omics data, originally developed to harmonize multi-batch flow-injection mass-spectrometry measurements acquired far apart in time.
Published in Bioinformatics.
hdbscan==0.8.27
matplotlib==3.4.1
numpy==1.20.0
pandas==1.2.4
scikit-learn==0.24.2
scipy==1.6.3
seaborn==0.11.1
torch==1.8.1
umap-learn==0.5.1
RALPS has been tested on CPU and GPU under macOS and Windows.
Training time required to normalize a dataset with ~3000 samples and ~150 metabolites was 5.82 minutes per run on average (30 epochs).
Run the following command from the src directory to normalize data with RALPS:
python ralps.py -n path/to/config.csv
The config file should contain the paths to the data and batch information files, along with the training parameters.
All required fields and parameters are described below.
Find the input example files here.
| Parameter | Comment | Default value |
|---|---|---|
| data_path | path to a csv data file | - |
| info_path | path to a csv batch info file | - |
| out_path | path to a new folder to save results to | - |
| latent_dim | dimension of the bottleneck linear layer | -1 (automatically derived from PCA) |
| variance_ratio | fraction of explained variance used to derive latent_dim | 0.9,0.95,0.99 |
| n_replicates | mean number of replicates in the data | 3 |
| grid_size | size of the randomized grid search (# of runs) | 1 |
| d_lr | classifier learning rate | 0.00005-0.005 |
| g_lr | autoencoder learning rate | 0.00005-0.005 |
| d_lambda | classifier loss coef | 0.-10. |
| g_lambda | autoencoder regularization term coef | 0.-10. |
| v_lambda | variation loss coef | 0.-10. |
| train_ratio | train-test split ratio | 0.9 |
| batch_size | data loader batch size | 32,64,128 |
| epochs | # of epochs to train | 30 |
| skip_epochs | # of epochs to skip for model selection | 3 |
| keep_checkpoints | save all model checkpoints after training | False (keep only best model) |
| device | device to train on (Torch) | cpu |
| plots_extension | save plots with this extension | png |
| min_relevant_intensity | missing values before normalization are replaced with this; values below this after normalization are masked with zeros | 1000 |
| allowed_vc_increase | fraction of sample's VC increase allowed (not contributing to the variation loss) | 0.05 |
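
To illustrate how `latent_dim` and `variance_ratio` relate, here is a minimal sketch of one plausible rule: pick the smallest number of principal components whose cumulative explained variance reaches the requested ratio. This is an assumption about the derivation, not the code RALPS runs, and the function name `latent_dim_from_pca` is hypothetical.

```python
import numpy as np


def latent_dim_from_pca(data, variance_ratio):
    """Return the smallest number of principal components whose cumulative
    explained variance reaches `variance_ratio`.

    `data` is a (samples x features) array. Hypothetical helper: the rule
    used inside RALPS may differ.
    """
    centered = data - data.mean(axis=0)
    # Squared singular values of the centered data are proportional
    # to the variance explained by each principal component.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    explained = singular_values ** 2
    cumulative = np.cumsum(explained) / explained.sum()
    n_components = int(np.searchsorted(cumulative, variance_ratio)) + 1
    # Guard against rounding when variance_ratio is very close to 1.
    return min(n_components, len(cumulative))


# Toy data: 90% of the variance lies along the first axis, 10% along the second.
toy = np.array([[3.0, 0.0, 0.0],
                [-3.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, -1.0, 0.0]])

for ratio in (0.8, 0.95):
    print(ratio, latent_dim_from_pca(toy, ratio))
```

With several `variance_ratio` options in the config, each sampled value would give a different bottleneck size under this rule, which is why the default is a list rather than a single number.
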
For most parameters, comma-separated values (e.g., 'batch_size') or dash-separated intervals (e.g., 'd_lr') can be provided.
For those, values will be uniformly sampled in the randomized grid search using defined options or intervals.
Otherwise, the exact values provided will be used.
Default parameter values can be used by setting '-1'.
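
The sampling rule described above can be sketched as follows. This is an illustration only, not the parser RALPS uses: the function names `parse_parameter` and `sample_parameter` are hypothetical, and the sketch does not resolve '-1' to a default value.

```python
import random


def to_number(text):
    """Convert text to int when possible, otherwise to float."""
    try:
        return int(text)
    except ValueError:
        return float(text)


def parse_parameter(raw):
    """Classify a config value as a list of options, an interval,
    or an exact value. Non-numeric values (paths, 'cpu', 'png')
    are returned unchanged as exact strings.
    """
    raw = raw.strip()
    try:
        if "," in raw:
            return "options", [to_number(v) for v in raw.split(",")]
        # Start the search at index 1 so a leading minus ('-1') is not a separator.
        dash = raw.find("-", 1)
        if dash != -1:
            return "interval", (to_number(raw[:dash]), to_number(raw[dash + 1:]))
        return "exact", to_number(raw)
    except ValueError:
        return "exact", raw


def sample_parameter(raw, rng):
    """Draw one value for a single run of the randomized grid search."""
    kind, value = parse_parameter(raw)
    if kind == "options":
        return rng.choice(value)      # uniform over the listed options
    if kind == "interval":
        return rng.uniform(*value)    # uniform over [low, high]
    return value


rng = random.Random(0)
for name, raw in [("batch_size", "32,64,128"), ("d_lr", "0.00005-0.005"), ("epochs", "30")]:
    print(name, sample_parameter(raw, rng))
```

Note that this simple parser does not handle scientific notation such as '5e-05', since the minus sign would be read as an interval separator.
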
The data file has features in rows and samples in columns:

| | sample_id_1 | ... | sample_id_M |
|---|---|---|---|
| feature_1 | count | ... | count |
| ... | | | |
| feature_N | count | ... | count |

The batch info file has one row per sample:

| | batch | group | benchmark |
|---|---|---|---|
| sample_id_1 | 1 | reg_1 | 0 |
| sample_id_2 | 1 | reg_1 | 0 |
| sample_id_3 | 2 | 0 | 0 |
| ... | | | |
| sample_id_M-1 | k | 0 | bench_M |
| sample_id_M | k | 0 | bench_M |
- The batch column indicates the samples' batch labels.
- The group column indicates groups of identical samples (replicates), used for regularization. If several samples have the same label (e.g., 'reg_1'), they are treated as replicates of the same material. During training, samples of the same group are encouraged to appear in the same cluster. Use '0' or '' to provide no information about the similarity of samples.
- The benchmark column indicates groups of identical samples taken as benchmarks in model evaluation. They are not used for regularization during training unless they also appear in the group column.
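
As a quick sanity check before training, the replicate and benchmark groups in a batch info file can be listed with a few lines of Python. This is a minimal sketch under stated assumptions: the CSV layout (empty first header cell, then `batch`, `group`, `benchmark`) is inferred from the table above, the sample IDs are made up, `collect_groups` is a hypothetical helper, and '0' and '' are assumed to mean "no label" in the benchmark column just as in the group column.

```python
import csv
import io
from collections import defaultdict

# Made-up batch info content following the layout described above.
INFO_CSV = """,batch,group,benchmark
s1,1,reg_1,0
s2,1,reg_1,0
s3,2,0,0
s4,2,0,bench_1
s5,2,,bench_1
"""


def collect_groups(info_csv_text, column):
    """Map each label in `column` to the list of sample IDs carrying it.

    Labels '0' and '' are skipped, since they carry no similarity information.
    """
    reader = csv.reader(io.StringIO(info_csv_text))
    header = next(reader)
    col = header.index(column)
    groups = defaultdict(list)
    for row in reader:
        if not row:
            continue  # skip blank lines
        label = row[col].strip()
        if label not in ("0", ""):
            groups[label].append(row[0])
    return dict(groups)


print("replicate groups:", collect_groups(INFO_CSV, "group"))
print("benchmark groups:", collect_groups(INFO_CSV, "benchmark"))
```

A group with a single sample gives the regularization nothing to compare against, so a listing like this helps spot mislabeled replicates early.
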
If you choose to keep checkpoints in the config file, you will find the autoencoder model at each training epoch saved in the checkpoints directory.
You can select a few checkpoints based on the training history to obtain alternative normalization solutions and the corresponding evaluation plots.
To do that, remove unnecessary checkpoints and run the following command from the src directory:
python ralps.py -e path/to/directory/with/checkpoints/
Important: This works only with default RALPS output (directories and filenames should not be changed).
If you wish to remove outliers from the normalized data, as proposed in the paper, run the following command from the src directory:
python ralps.py -r path/to/normalized/data.csv
Important: This works only with default RALPS output (directories and filenames should not be changed).
If you wish to reconfigure RALPS (e.g., to use a different clustering algorithm as default, or to change default parameter values), you can do so by editing src/constants.py.
