Version of jmzTab library
<p>(<b>RF</b>) Random Forest without a previous feature selection step; (<b>X2-CM-RFE-RF</b>), random forest classification after the feature selection step using a univariate correlation filter with matrix correlation and recursive feature elimination; (<b>X2-PCA-RFE-RF</b>), random forest classification after the feature selection step using a univariate correlation filter with principal component analysis and recursive feature elimination. All methods include an internal 10-fold cross-validation step. All accuracy metrics were estimated following the approach previously reported by <i>Pochet et al</i>. [<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0189875#pone.0189875.ref031" target="_blank">31</a>], where 20-fold randomized test data were used to summarize the accuracy of the FS combination.</p>
For public discussion: includes service-info; filter by descriptor language; transition to OpenAPI 3.0 (which introduces some formatting changes) only, as opposed to being converted from Swagger by swagger2openapi behind the scenes; transition to the new doc system; add Galaxy.
ms-data-core-api: an open-source, metadata-oriented library for computational proteomics
Bioinformatics software development has become a cornerstone in modern biology research. Large-scale quantitative biology studies have created a demand for more complex workflows and data analysis pipelines. Challenges in reproducing bioinformatics analyses are compounded by the fact that the programs themselves are difficult to install on computers, because they rely on software libraries, compilers, other files, and environment variables, collectively called dependencies, that are assumed to be available and, thus, are often poorly documented. The Bioconda and BioContainers communities have created a complete ecosystem that allows bioinformatics software to be installed and executed in an isolated and controlled environment. It also provides infrastructure and basic guidelines to create and distribute bioinformatics containers with a special focus on omics technologies. These cross-platform containers can be integrated into more comprehensive bioinformatics pipelines and differe...
Cuban science and technology are known for important achievements, particularly in human healthcare and biotechnology. During the second half of the XX century, the country developed a system of scientific institutions to address and solve major economic, cultural, social and health problems. However, the economic crisis faced by the island during the last three decades has had a major impact on Cuban scientific research. In addition to decreased investment, the emigration of thousands of young as well as senior scientists to other countries has had a major impact on Cuban research output. To date, no systematic analysis regarding scientific publications, citations, or patents granted to Cuban authors during this period is available. Here, an analysis of Cuban scientific production since 1970, with a special focus on the last three decades (1990 - 2019), is provided. All national metrics are compared with other countries, emphasizing those from Latin America. Preliminary results ...
BioContainers is an open-source project that aims to create, store, and distribute bioinformatics software containers and packages. The BioContainers community has developed a set of guidelines to standardize software containers, including the metadata, versions, licenses, and/or software dependencies. BioContainers supports multiple packaging and container technologies such as Conda, Docker, and Singularity. Here, we introduce the BioContainers Registry and RESTful API to make containerized bioinformatics tools more findable, accessible, interoperable, and reusable (FAIR). The BioContainers Registry provides a fast and convenient way to find and retrieve bioinformatics tool packages and containers. By doing so, it will increase the use of bioinformatics packages and containers while promoting replicability and reproducibility in research.
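As a sketch of how such a registry API might be queried programmatically — the endpoint path and parameter names below are assumptions for illustration, not taken from the BioContainers documentation — a client could build a search URL and reshape the JSON response:

```python
import json
from urllib.parse import urlencode

# Hypothetical base URL and query parameters; check the BioContainers
# Registry documentation for the actual endpoint and parameter names.
BASE_URL = "https://api.biocontainers.pro/ga4gh/trs/v2"

def build_tool_search_url(name, limit=10):
    """Build a registry search URL for a tool by name."""
    query = urlencode({"name": name, "limit": limit})
    return f"{BASE_URL}/tools?{query}"

def summarize_tools(payload):
    """Extract (name, description) pairs from a JSON response body."""
    tools = json.loads(payload)
    return [(t["name"], t.get("description", "")) for t in tools]

# Example response body, illustrating the expected shape only.
sample = '[{"name": "samtools", "description": "Tools for SAM/BAM files"}]'
print(build_tool_search_url("samtools"))
print(summarize_tools(sample))
```

The response parsing is kept separate from the network call so the shape of the data can be tested without contacting the registry.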
Congenital Heart Disease (CHD) affects approximately 7-9 children per 1000 live births. Numerous genetic studies have established a role for rare genomic variants at the copy number variation (CNV) and single nucleotide variant level. In particular, the role of de novo mutations (DNM) has been highlighted in syndromic and non-syndromic CHD. To identify novel haploinsufficient CHD disease genes we performed an integrative analysis of CNVs and DNMs identified in probands with CHD including cases with sporadic thoracic aortic aneurysm (TAA). We assembled CNV data from 7,958 cases and 14,082 controls and performed a gene-wise analysis of the burden of rare genomic deletions in cases versus controls. In addition, we performed mutation rate testing for DNMs identified in 2,489 parent-offspring trios. Our combined analysis revealed 21 genes which were significantly affected by rare genomic deletions and/or constrained non-synonymous de novo mutations in probands. Fourteen of these genes ha...
Metaproteomics – the characterization of proteins expressed by microbiomes – presents a range of technical challenges, from sampling to data processing and interpretation. In the iPRG 2020 study, we investigated the status of metaproteomics data analysis workflows by posing questions to the metaproteomics community in a two-phase study. In the two phases, participants were asked to deduce the organisms or taxa in a metaproteomics sample ("What species are represented in the sample?") and what biological phenomena had taken place ("What interactions took place between the species in the mixture?"). The outputs from these studies will be presented at the RG session at ABRF 2021.
To the Editor: Your editorial “Credit where credit is overdue”1 aptly summarized the existing situation in the proteomics field, where full data disclosure remains very much a work in progress. Importantly, it also correctly pointed out that ‘the software provided by the public repositories for searching and analysing proteomics data is not as efficient and as user friendly as it could be’. We therefore here introduce PRIDE Inspector
Plasma analysis by mass spectrometry-based proteomics remains a challenge due to its large dynamic range of 10 orders of magnitude. We created a methodology for protein identification known as Wise MS Transfer (WiMT). Melanoma plasma samples from biobank archives were directly analyzed using simple sample preparation. WiMT is based on MS1 features shared between several MS runs, together with custom protein databases for ID generation. This entails a multi-level dynamic protein database combined with different immunodepletion strategies, applying single-shot proteomics. The highest number of melanoma plasma proteins from undepleted and unfractionated plasma was reported, mapping >1200 proteins from >10,000 protein sequences with confirmed significance scoring. Of these, more than 660 proteins were annotated by WiMT from the resulting ~5800 protein sequences. We could verify 4000 proteins by MS1t analysis from HeLa extracts. The WiMT platform provided an output in which 12 previously well-kno...
Summary: We have implemented the pypgatk package and the pgdb workflow to create proteogenomics databases based on ENSEMBL resources. The tools allow the generation of protein sequences from novel protein-coding transcripts by performing a three-frame translation of pseudogenes, lncRNAs, and other non-canonical transcripts, such as those produced by alternative splicing events. It also includes exonic out-of-frame translation from otherwise canonical protein-coding mRNAs. Moreover, the tool enables the generation of variant protein sequences from multiple sources of genomic variants including COSMIC, cBioportal, gnomAD, and mutations detected from sequencing of patient samples. pypgatk and pgdb provide multiple functionalities for database handling, notably optimized target/decoy generation by the algorithm DecoyPyrat. Finally, we perform a reanalysis of four public datasets in PRIDE by generating cell-type specific databases for 65 cell lines using the pypgatk and pgdb workflow, rev...
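The three-frame translation mentioned above can be sketched in a few lines. This is a minimal illustration with an abbreviated codon table, not the pypgatk implementation:

```python
# Minimal sketch of forward three-frame translation; the codon table is
# abbreviated for brevity (a real table covers all 64 codons).
CODON_TABLE = {
    "TTT": "F", "TTC": "F", "ATG": "M", "TGG": "W",
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "TAA": "*", "TAG": "*", "TGA": "*",
    # ... remaining codons omitted here
}

def translate_frame(seq, frame):
    """Translate one forward reading frame (0, 1 or 2) of a DNA sequence."""
    protein = []
    for i in range(frame, len(seq) - 2, 3):
        protein.append(CODON_TABLE.get(seq[i:i + 3], "X"))  # X = codon not in table
    return "".join(protein)

def three_frame_translation(seq):
    """Return the protein sequences of the three forward frames."""
    return [translate_frame(seq, f) for f in range(3)]

print(three_frame_translation("ATGGCTTGG"))  # frame 0 translates to 'MAW'
```

A full proteogenomics tool would also translate the three reverse-complement frames and split products at stop codons; both are omitted here to keep the sketch short.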
The experimental design metadata is a cornerstone of biomedical research, especially for data scientists, and it is paramount in the context of data repositories. For every proteomics dataset we should capture at least three levels of metadata: (i) dataset description and experimental protocols, (ii) data files, and (iii) the information relating samples to data files. While the dataset description and the data files are mandated for all ProteomeXchange datasets, the information relating samples to data files is mostly missing. Recently, members of the European Bioinformatics Community for Mass Spectrometry (EuBIC) created an open-source project called Proteomics Experimental Design format (https://github.com/bigbio/proteomics-metadata-standard/) to enable the standardization of sample metadata of public proteomics datasets. Here, the project is presented to the proteomics community and we call for contributors, including researchers, journals, and consortia, to provide feedback on metadata annotations of public proteomics data. We believe this work should improve reproducibility, facilitate the development of new tools dedicated to proteomics data analysis, and facilitate collaborations.
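A minimal sketch of the third metadata level — relating samples to data files — as a tab-separated table. The column names here are illustrative, loosely modeled on SDRF-style headers, and do not reproduce the official format specification:

```python
import csv
import io

# Illustrative sample-to-data-file table; headers are simplified examples
# inspired by SDRF-style columns, not the normative format.
tsv = """source name\tcharacteristics[organism]\tcomment[data file]
sample 1\tHomo sapiens\trun01.raw
sample 2\tHomo sapiens\trun02.raw
"""

def map_samples_to_files(text):
    """Return a {sample name: data file} mapping from a TSV metadata table."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["source name"]: row["comment[data file]"] for row in reader}

print(map_samples_to_files(tsv))
# {'sample 1': 'run01.raw', 'sample 2': 'run02.raw'}
```

The point of the tabular representation is exactly this kind of machine-readability: a reanalysis pipeline can recover which raw file belongs to which biological sample without reading the paper.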
The field of computational proteomics is approaching the big data age, driven both by a continuous growth in the number of samples analysed per experiment and by the growing amount of data obtained in each analytical run. In order to process these large amounts of data, it is increasingly necessary to use elastic compute resources such as Linux-based cluster environments and cloud infrastructures. Unfortunately, the vast majority of cross-platform proteomics tools are not able to operate directly on the proprietary formats generated by the diverse mass spectrometers. Here, we present ThermoRawFileParser, an open-source, cross-platform tool that converts Thermo RAW files into open file formats such as MGF and the HUPO-PSI standard file format mzML. To ensure the broadest possible availability, and to increase integration capabilities with popular workflow systems such as Galaxy or Nextflow, we have also built Conda and BioContainers containers around ThermoRawFileParser. ...
The recent improvements in mass spectrometry instruments and new analytical methods are increasing the intersection between proteomics and big data science. In addition, bioinformatics analysis is becoming an increasingly complex and convoluted process involving multiple algorithms and tools. A wide variety of methods and software tools have been developed for computational proteomics and metabolomics during recent years, and this trend is likely to continue. However, most computational proteomics and metabolomics tools are targeted at, and designed as, single desktop applications, limiting the scalability and reproducibility of the data analysis. In this paper we give an overview of the key steps of metabolomics and proteomics data processing, including the main tools and software used to perform the data analysis. We discuss the combination of software containers with workflow environments for large-scale metabolomics and proteomics analysis. Finally, we introduce to the proteomics and metabol...
Summary: Making reproducible, auditable and scalable data-processing analysis workflows is an important challenge in the field of bioinformatics. Recently, software containers and cloud computing introduced a novel solution to address these challenges. They simplify software installation, management and reproducibility by packaging tools and their dependencies. In this work we implemented a cloud provider agnostic and scalable container orchestration setup for the popular Galaxy workflow environment. This solution enables Galaxy to run on and offload jobs to most cloud providers (e.g. Amazon Web Services, Google Cloud or OpenStack, among others) through the Kubernetes container orchestrator. Availability: All code has been contributed to the Galaxy Project and is available (since Galaxy 17.05) at https://github.com/galaxyproject/ in the galaxy and galaxy-kubernetes repositories. https://public.phenomenal-h2020.eu/ is an example deployment.
The 2017 Dagstuhl Seminar on Computational Proteomics provided an opportunity for a broad discussion on the current state and future directions of the generation and use of peptide tandem mass spectrometry spectral libraries. Their use in proteomics is growing slowly, but there are multiple challenges in the field that must be addressed to further increase the adoption of spectral libraries and related techniques. The primary bottlenecks are the paucity of high quality and comprehensive libraries and the general difficulty of adopting spectral library searching into existing workflows. There are several existing spectral library formats, but none capture a satisfactory level of metadata; therefore a logical next improvement is to design a more advanced, Proteomics Standards Initiative-approved spectral library format that can encode all of the desired metadata. The group discussed a series of metadata requirements organized into three designations of completeness or quality, tentati...
Scientific research relies on computer software, yet software is not always developed following practices that ensure its quality and sustainability. This manuscript does not aim to propose new software development best practices, but rather to provide simple recommendations that encourage the adoption of existing best practices. Software development best practices promote better quality software, and better quality software improves the reproducibility and reusability of research. These recommendations are designed around Open Source values, and provide practical suggestions that contribute to making research software and its source code more discoverable, reusable and transparent. This manuscript is aimed at developers, but also at organisations, projects, journals and funders that can increase the quality and sustainability of research software by encouraging the adoption of these recommendations.
Software containers are changing the way scientists and researchers develop, deploy and exchange scientific software. They allow labs of all sizes to easily install bioinformatics software, maintain multiple versions of the same software and combine tools into powerful analysis pipelines. However, containers and software packages should be produced under certain rules and standards in order to be reusable, compatible and easy to integrate into pipelines and analysis workflows. Here, we present a set of recommendations developed by the BioContainers Community to produce standardized bioinformatics packages and containers. These recommendations provide practical guidelines to make bioinformatics software more discoverable, reusable and transparent. They aim to guide developers, organisations, journals and funders to increase the quality and sustainability of research software.
In the last decade, a revolution in liquid chromatography-mass spectrometry (LC-MS) based proteomics unfolded with the introduction of dozens of novel instruments that incorporate additional data dimensions through innovative acquisition methodologies, in turn inspiring specialized data analysis pipelines. Simultaneously, a growing number of proteomics datasets have been made publicly available through data repositories such as ProteomeXchange, Zenodo and Skyline Panorama. However, developing algorithms to mine these data and assessing their performance on different platforms is currently hampered by the lack of a single benchmark experimental design. Therefore, we acquired a hybrid proteome mixture on different instrument platforms and in all currently available families of data acquisition. Here, we present a comprehensive Data-Dependent and Data-Independent Acquisition (DDA/DIA) dataset acquired using several of the most commonly used current-day instrumental platforms. The datase...
Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Fundação do Câncer, Fundação de Amparo à Pesquisa do Estado do Rio de Janeiro (FAPERJ) for its BBP grant, and Programa de Apoio à Pesquisa Estratégica em Saúde da Fiocruz (PAPES VII)
Mass-spectrometry-based proteomics enables the high-throughput identification and quantification of proteins, including sequence variants and post-translational modifications (PTMs) in biological samples. However, most workflows require that such variations be included in the search space used to analyze the data, and doing so remains challenging with most analysis tools. In order to facilitate the search for known sequence variants and PTMs, the Proteomics Standards Initiative (PSI) has designed and implemented the PSI extended FASTA format (PEFF). PEFF is based on the very popular FASTA format but adds a uniform mechanism for encoding substantially more metadata about the sequence collection as well as individual entries, including support for encoding known sequence variants, PTMs, and proteoforms. The format is very nearly backward compatible, and as such, existing FASTA parsers will require little or no changes to be able to read PEFF files as FASTA files, although without supp...
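The key-value metadata that PEFF adds to description lines can be parsed with a simple pattern. The entry below is a made-up example for illustration; a real parser would follow the PSI PEFF specification for the full set of keys and value syntaxes:

```python
import re

# Hypothetical PEFF-style header used only to illustrate the parsing idea;
# PEFF encodes metadata as backslash-prefixed \Key=value fields.
header = r">nxp:NX_EXAMPLE-1 \PName=Example protein \Length=110 \VariantSimple=(23|A)"

def parse_peff_header(line):
    """Split a PEFF sequence header into (accession, {key: value})."""
    accession = line[1:].split(" ", 1)[0]
    # Each metadata field has the form \Key=value, running up to the next backslash.
    pairs = re.findall(r"\\(\w+)=([^\\]+)", line)
    return accession, {k: v.strip() for k, v in pairs}

acc, meta = parse_peff_header(header)
print(acc)            # nxp:NX_EXAMPLE-1
print(meta["PName"])  # Example protein
```

Because the fields ride inside ordinary FASTA description lines, a plain FASTA parser that ignores descriptions still reads the sequences unchanged, which is what makes the format nearly backward compatible.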
MaxDIA is a software platform for analyzing data-independent acquisition (DIA) proteomics data within the MaxQuant software environment. Using spectral libraries, MaxDIA achieves deep proteome coverage with substantially better coefficients of variation in protein quantification than other software. MaxDIA is equipped with accurate false discovery rate (FDR) estimates on both library-to-DIA match and protein levels, including when using whole-proteome predicted spectral libraries. This is the foundation of discovery DIA-hypothesis-free analysis of DIA samples without library and with reliable FDR control. MaxDIA performs three- or four-dimensional feature detection of fragment data, and scoring of matches is augmented by machine learning on the features of an identification. MaxDIA's bootstrap DIA workflow performs multiple rounds of matching with increasing quality of recalibration and stringency of matching to the library. Combining MaxDIA with two new technologies-BoxCar acqu...
Spectral similarity calculation is widely used in protein identification tools and mass spectra clustering algorithms when comparing theoretical or experimental spectra. The performance of the spectral similarity calculation plays an important role in these tools and algorithms, especially in the analysis of large-scale datasets. Recently, deep learning methods have been proposed to improve the performance of clustering algorithms and protein identification by training the algorithms with existing data and the use of multiple spectra and identified peptide features. While the efficiency of these algorithms is still under study in comparison with traditional approaches, their application in proteomics data analysis is becoming more common. Here, we propose the use of deep learning to improve spectral similarity comparison. We assessed the performance of deep learning for spectral similarity with GLEAMS and a newly trained embedder model (DLEAMSE), which uses high-quality spectra fro...
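As a point of reference for what such learned embeddings are compared against, the classical baseline for spectral similarity is a normalized dot product (cosine similarity) over binned peak intensities. A minimal sketch:

```python
import math

def cosine_similarity(spec_a, spec_b):
    """Normalized dot product between two spectra given as {m/z bin: intensity}."""
    shared = set(spec_a) & set(spec_b)
    dot = sum(spec_a[mz] * spec_b[mz] for mz in shared)
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Identical spectra score ~1.0; spectra with no shared peaks score 0.0.
a = {100: 5.0, 200: 3.0, 300: 1.0}
print(cosine_similarity(a, a))           # ~1.0
print(cosine_similarity(a, {400: 2.0}))  # 0.0
```

In practice, peaks are first binned or matched within an m/z tolerance, and intensities are often square-root transformed before this comparison; both preprocessing steps are omitted here.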
Here we present the Universal Spectrum Explorer (USE), a web-based tool based on IPSA for cross-resource (peptide) spectrum visualization and comparison (https://www.proteomicsdb.org/use/). Mass spectra under investigation can either be provided manually by the user (table format), or automatically retrieved from online repositories supporting access to spectral data via the universal spectrum identifier (USI), or requested from other resources and services implementing a newly designed REST interface. As a proof of principle, we implemented such an interface in ProteomicsDB thereby allowing the retrieval of spectra acquired within the ProteomeTools project or real-time prediction of tandem mass spectra from the deep learning framework Prosit. Annotated mirror spectrum plots can be exported from the USE as editable scalable high quality vector graphics. The USE was designed and implemented with minimal external dependencies allowing local usage and integration into other websites (h...
The amount of public proteomics data is rapidly increasing but there is no standardized format to describe the sample metadata and their relationship with the dataset files in a way that fully supports their understanding or reanalysis. Here we propose to develop the transcriptomics data format MAGE-TAB into a standard representation for proteomics sample metadata. We implement MAGE-TAB-Proteomics in a crowdsourcing project to manually curate over 200 public datasets. We also describe tools and libraries to validate and submit sample metadata-related information to the PRIDE repository. We expect that these developments will improve the reproducibility and facilitate the reanalysis and integration of public proteomics datasets.
Using 11 proteomics datasets, mostly available through the PRIDE database, we assembled a reference expression map for 191 cancer cell lines and 246 clinical tumour samples, across 13 lineages. We found unique peptides identified only in tumour samples despite a much higher coverage in cell lines. These were mainly mapped to proteins related to regulation of signalling receptor activity. Correlations between baseline expression in cell lines and tumours were calculated. We found these to be highly similar across all samples with most similarity found within a given sample type. Integration of proteomics and transcriptomics data showed median correlation across cell lines to be 0.58 (range between 0.43 and 0.66). Additionally, in agreement with previous studies, variation in mRNA levels was often a poor predictor of changes in protein abundance. To our knowledge, this work constitutes the first meta-analysis focusing on cancer-related public proteomics datasets. We therefore also hig...
Mass spectra provide the ultimate evidence for supporting the findings of mass spectrometry (MS) proteomics studies in publications, and it is therefore crucial to be able to trace the conclusions back to the spectra. The Universal Spectrum Identifier (USI) provides a standardized mechanism for encoding a virtual path to any mass spectrum contained in datasets deposited to public proteomics repositories. USIs enable greater transparency for providing spectral evidence in support of key findings in publications, with more than 1 billion USI identifications from over 3 billion spectra already available through ProteomeXchange repositories.
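A USI is a colon-delimited string; the sketch below assembles one from its components in the common scan-based form, with an optional peptidoform/charge interpretation. This is illustrative only — consult the PSI USI specification for the full grammar:

```python
def build_usi(collection, run, index, interpretation=None, index_type="scan"):
    """Assemble a Universal Spectrum Identifier from its components.

    Layout (illustrative): mzspec:<collection>:<run>:<index type>:<index>[:<interpretation>]
    """
    parts = ["mzspec", collection, run, index_type, str(index)]
    if interpretation is not None:
        parts.append(interpretation)
    return ":".join(parts)

# Assembling the widely cited example USI from the specification.
usi = build_usi("PXD000561", "Adult_Frontalcortex_bRP_Elite_85_f09",
                17555, "VLHPLEGAVVIIFK/2")
print(usi)
# mzspec:PXD000561:Adult_Frontalcortex_bRP_Elite_85_f09:scan:17555:VLHPLEGAVVIIFK/2
```

Because every field is addressable, a repository can resolve such a string directly to the underlying spectrum, which is what makes spectral evidence in publications traceable.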
The Omics Discovery Index (OmicsDI) is an open-source platform that can be used to access, discover and disseminate omics datasets. OmicsDI integrates proteomics, genomics, metabolomics, models and transcriptomics datasets. Using an efficient indexing system, OmicsDI integrates different biological entities including genes, transcripts, proteins, metabolites and the corresponding publications from PubMed. In addition, it implements a group of pipelines to estimate the impact of each dataset by tracing the number of citations, reanalyses and biological entities reported by each dataset. Here, we present the OmicsDI REST interface to enable programmatic access to any dataset in OmicsDI or to all the datasets for a specific provider (database). Clients can perform queries on the API using different metadata information such as sample details (species, tissues, etc.), instrumentation (mass spectrometer, sequencer), keywords and other provided annotations. In addition, we present two different librari...
Motivation: Spectrum clustering has been shown to enhance proteomics data analysis: some originally unidentified spectra can potentially be identified, and individual peptides can also be evaluated to find potential mis-identifications by using clusters of identified spectra. The Phoenix Enhancer spectrum service/tool provides an infrastructure to perform data analysis on tandem mass spectra and the corresponding peptides against previously identified public data. Based on previously released PRIDE Cluster data and a newly developed pipeline, four functionalities are provided: i) evaluate the original peptide identifications in an individual dataset, to find low-confidence peptide spectrum matches (PSMs) which could correspond to mis-identifications; ii) provide confidence scores for all originally identified PSMs, to help users evaluate their quality (complementary to getting a global false discovery rate); iii) identify potentially new PSMs for originally unidentified spectra; ...
