ProteomicsPipeline

The pipeline is based on the drake package (William Michael Landau, (2018). The drake R package: a pipeline toolkit for reproducibility and high-performance computing. Journal of Open Source Software, 3(21), 550, https://doi.org/10.21105/joss.00550) and can be used for the analysis of proteomics data that is labeled for binary classification.

Following steps are performed:

Wilcoxon rank sum test for univariate analysis
Fold change calculation
Builing of a classification model based on Random Forest
Ranking and selection of proteins (based on mean Gini index of Random Forest model)

The selected proteins are analysed by:

Gene ontology term enrichment analysis
Pathway enrichment analysis
Disease ontology enrichment analysis

Data requirements

The pipeline requires two input files in csv-format.

The proteomics data set contains the protein measurements. The first column must contain a unique sample identifier. The second column has to contain the label used for classification (e.g. "healthy" and "affected"). Only two different labels are allowed and the class of effect (e.g. with treatment/mutation/symptoms) should be listed first if the column is converted into a factor, i.e. the class of effect should be first alphabetically, to allow for the correct baseline for the fold change calculations.
The protein information data set should contain two columns. The first column will contain a unique description of all proteins/targets measured. The second column will contain the associated gene name of the target. For every target used in the proteomics data set (column names), there should be an entry in the first column of the protein information data set. Duplicated gene names are allowed.

Are any of the requirments not met, it will lead to the termination of the pipeline before any analysis steps.

Set up

Download the files of the repository.
Move the two required csv files into the same directory where you saved the downloaded files.
Edit plan.R by replacing "path/to/csvfile" with the file name of the two csv files in data_in and protein_info_in respectively.
Set the directory containing the repository files as your working directory.
Run the following code

install.packages("drake")
library(drake)
new_cache()

A folder to save your cache has been created. Your directory should have the following structure now:

.drake
R
- functions.R
- packages.R
- plan.R
make.R
Report.Rmd
your_proteomics_dataset.csv
your_protein_information_dataset.csv

Execute make.R in R to run the pipeline analysis. The results will be in the form of an HTML report that will be created in the same directory.

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
R		R
LICENSE		LICENSE
README.md		README.md
Report.Rmd		Report.Rmd
make.R		make.R

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ProteomicsPipeline

Data requirements

Set up

About

Releases

Packages

Languages

License

kathiwaury/ProteomicsPipeline

Folders and files

Latest commit

History

Repository files navigation

ProteomicsPipeline

Data requirements

Set up

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages