Journal of the American Society for Mass Spectrometry, 2012
In this article, we present a computation- and memory-efficient method to calculate the probabilities of occurrence and exact center-masses of the aggregated isotopic distribution of a molecule. The method uses fundamental mathematical properties of polynomials given by the Newton-Girard theorem and Viete's formulae. The calculation is based on the atomic composition of the molecule and the natural abundances of the elemental isotopes in normal terrestrial matter. To evaluate the performance of the proposed method, which we named BRAIN, we compare it with the results obtained from five existing software packages (IsoPro, Mercury, Emass, NeutronCluster, and IsoDalton) for 10 biomolecules. Additionally, we compare the computed mass centers with the results obtained by calculating, and subsequently aggregating, the fine isotopic distribution for two of the exemplary biomolecules. The algorithm will be made available as a Bioconductor package in R, and is also available upon request.
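As a rough illustration of what an aggregated isotopic distribution is, the sketch below multiplies one small polynomial per atom, where the coefficient of each term is the probability of a given number of extra neutrons. This is the plain convolution approach, not the Newton-Girard recursion that BRAIN uses, and it does not compute center-masses. The function name `aggregated_distribution`, the `n_peaks` parameter, and the abundance table (approximate values, to be checked against an authoritative source) are assumptions made for this example.

```python
import numpy as np

# Approximate natural isotope abundances (assumed values for illustration),
# indexed by the number of extra neutrons: [+0, +1, +2].
ABUNDANCE = {
    "C": [0.9893, 0.0107],
    "H": [0.999885, 0.000115],
    "N": [0.99636, 0.00364],
    "O": [0.99757, 0.00038, 0.00205],
}


def aggregated_distribution(composition, n_peaks=5):
    """Return probabilities of the first n_peaks aggregated isotopic variants.

    composition maps element symbols to atom counts, e.g. {"C": 6, "H": 12}.
    Each atom contributes one polynomial factor; multiplying all factors
    (a convolution of coefficient vectors) gives the aggregated distribution.
    """
    dist = np.array([1.0])
    for element, count in composition.items():
        base = np.array(ABUNDANCE[element])
        for _ in range(count):
            # Truncating keeps the low-order coefficients exact, because they
            # depend only on low-order coefficients of the factors.
            dist = np.convolve(dist, base)[:n_peaks]
    return dist


if __name__ == "__main__":
    # Glucose, C6H12O6: monoisotopic peak followed by the +1, +2, +3 variants.
    print(aggregated_distribution({"C": 6, "H": 12, "O": 6}, n_peaks=4))
```

The loop performs one convolution per atom, which is fine for small molecules but scales poorly for large biomolecules; avoiding that cost is the kind of problem a method such as BRAIN addresses.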
Peripheral blood mononuclear cells (PBMCs) are main actors in inflammatory processes and are linked to many diseases, including rheumatoid arthritis, atherosclerosis, asthma, HIV and cancer. Moreover, they appear to be an interesting 'surrogate tissue' that can be used in biomarker discovery. Knowledge of the interindividual variation is essential for a good experimental design of quantitative expression studies. Therefore, PBMCs were isolated from 24 healthy volunteers (15 males, 9 females, ages 63-86) with no clinical signs of inflammation. The extracted proteins were separated using two-dimensional difference gel electrophoresis (2D-DIGE), and the gel images were processed with the DeCyder 2D software. Protein spots present in at least 22 out of 24 healthy volunteers were selected for further statistical analysis. Determination of the coefficient of variation (CV) of the normalized spot volume values of these proteins reveals that the tota...
Shotgun proteomics is a powerful technology to study the protein population of a biological system. This approach employs tandem mass spectrometry for amino acid sequencing. Fragment ion masses can be used in correlative database-searching algorithms, such as SEQUEST or Mascot, to identify peptides. The database-search method depends upon a score function that evaluates matches between the predicted ions and the ions observed in the tandem mass spectrum. In principle, peptide identification based on tandem MS and database-search algorithms does not take into account information about isotope distributions of the precursor ions. To determine the effectiveness of these search algorithms in terms of their ability to distinguish between correct and incorrect peptide assignments, we propose an additional metric that quantifies the similarity between the theoretical isotopic distribution for the precursor ions selected for tandem MS and the experimental mass spectra by using Pearson's χ² statistic. The observed association between Pearson's χ² statistic and the score function indicates that good scores can be obtained for molecules which exhibit atypical isotope profiles, while low scores can be obtained for fragment spectra which have a clear peptide-like isotope pattern. These results demonstrate that Pearson's χ² statistic can be used in conjunction with the score of database-search algorithms to increase the sensitivity and specificity of peptide identification. In this manuscript, we present a workflow that provides a new perspective on the quality of peptide-to-spectrum matches (PSM) employed in database-searching strategies for peptide identification. Additional views on a dataset can facilitate a more hypothesis-driven interpretation of the mass spectrometry signals.
The similarity metric complements the PSM score by taking the isotopic profile into account, and results in a measure that conveys the degree of biomolecular similarity observed for the precursor of the selected tandem MS spectra. A close agreement between the PSM score and the similarity metric will result in a higher confidence for the identification of the selected precursor ion.
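To make the similarity metric concrete, here is a minimal sketch of Pearson's χ² statistic computed between an observed precursor isotope pattern and a theoretical one. The abstract does not say how intensities are normalized, so scaling both patterns to unit sum is an assumption of this example, as is the function name `pearson_chi2`.

```python
def pearson_chi2(observed, theoretical):
    """Pearson's chi-square between two isotope patterns of equal length.

    Both patterns are rescaled to sum to 1 (an assumption of this sketch),
    so the result does not depend on the absolute intensity scale.
    Smaller values mean the observed pattern looks more like the theoretical one.
    """
    if len(observed) != len(theoretical):
        raise ValueError("patterns must have the same number of peaks")
    obs_total = sum(observed)
    theo_total = sum(theoretical)
    chi2 = 0.0
    for obs, theo in zip(observed, theoretical):
        expected = theo / theo_total
        if expected <= 0:
            raise ValueError("theoretical intensities must be positive")
        chi2 += (obs / obs_total - expected) ** 2 / expected
    return chi2


if __name__ == "__main__":
    theoretical = [0.55, 0.30, 0.11, 0.04]  # hypothetical peptide-like pattern
    peptide_like = [5400, 3100, 1100, 400]  # hypothetical observed intensities
    atypical = [3000, 500, 4000, 2500]      # hypothetical non-peptide pattern
    print(pearson_chi2(peptide_like, theoretical))
    print(pearson_chi2(atypical, theoretical))
```

In a workflow like the one described, this value would be inspected alongside the database-search score rather than replace it.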
The phosphorylation of proteins is one of the most important post-translational modifications in nature. Knowledge of the quantity or degree of protein phosphorylation in biological samples is extremely important. A combination of liquid chromatography (LC) and inductively coupled plasma mass spectrometry (ICP-MS) allows the absolute and relative quantification of the phosphorus signal. A comparison between dynamic reaction cell quadrupole ICP-MS (DRC-Q-ICP-MS) and high-resolution sector field ICP-MS (SF-ICP-MS) in detecting signals of phosphorus-containing species using identical capillary LC (reversed-phase technology) and nebulizer settings was performed. A method to diminish the reversed-phase gradient-related signal instability in phosphorus detection with LC/ICP-MS applications was developed. Bis(4-nitrophenyl)phosphate (BNPP) was used as a standard to compare signal-to-noise ratios and limits of detection (LODs) between the two instrumental setups. The LOD reaches a value of 0.8 µg L⁻¹ when applying the DRC technology in Q-ICP-MS and an LOD of 0.09 µg L⁻¹ was found with the SF-ICP-MS setup. This BNPP standard was further used to compare the absolute quantification possibilities of phosphopeptides in these two setups. This one-to-one comparison of two interference-reducing ICP-MS instruments demonstrates that absolute quantification of individual LC-separated phosphopeptides is possible. However, based on the LOD values, SF-ICP-MS has a higher sensitivity in detecting phosphorus signals and thus is preferred in phosphopeptide analysis.
Spectral library searching is a popular approach for MS/MS-based peptide identification. Because the size of spectral libraries continues to grow, the performance of searching algorithms is an important issue. This technical note introduces a strategy based on a minimum shared peak count between two spectra to reduce the set of admissible candidate spectra when issuing a query. A theoretical validation through time complexity analysis and an experimental validation based on an implementation of the candidate reduction strategy show that the approach can achieve a reduction of the set of candidate spectra by (at least) an order of magnitude, resulting in a significant improvement in the speed of the search. Meanwhile, more than 99% of the positive search results are retained. This efficient strategy to drastically improve the speed of spectral library searching with a negligible loss of sensitivity can be applied to any current spectral library search tool, irrespective of the employed similarity metric.
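One way to picture a minimum shared peak count filter is with an inverted index from m/z bins to library spectra, as sketched below. The abstract does not describe the actual implementation, so the fixed-width binning, the function names, and the `min_shared` and `bin_width` parameters are all assumptions of this example.

```python
from collections import Counter, defaultdict


def bin_peaks(mz_values, bin_width=1.0):
    """Map peak m/z values to integer bin numbers (assumed fixed-width bins)."""
    return {int(mz / bin_width) for mz in mz_values}


def build_index(library, bin_width=1.0):
    """Build an inverted index: bin number -> ids of library spectra with a peak there."""
    index = defaultdict(set)
    for spec_id, mz_values in library.items():
        for b in bin_peaks(mz_values, bin_width):
            index[b].add(spec_id)
    return index


def candidates(query_mz, index, min_shared=2, bin_width=1.0):
    """Return ids of library spectra sharing at least min_shared peak bins with the query."""
    counts = Counter()
    for b in bin_peaks(query_mz, bin_width):
        for spec_id in index.get(b, ()):
            counts[spec_id] += 1
    return {spec_id for spec_id, n in counts.items() if n >= min_shared}


if __name__ == "__main__":
    # Hypothetical library: spectrum id -> peak m/z values.
    library = {
        "a": [100.2, 200.5, 300.1],
        "b": [100.4, 450.0, 600.3],
        "c": [700.0, 800.0],
    }
    index = build_index(library)
    print(candidates([100.3, 200.6, 300.9], index, min_shared=2))
```

Only the spectra returned by `candidates` would then be passed to the more expensive similarity computation, whichever metric the search tool uses.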
Mass spectrometry-based proteomics experiments generate spectra that are rich in information. Often only a fraction of this information is used for peptide/protein identification, whereas a significant proportion of the peaks in a spectrum remain unexplained. In this paper we explore how a specific class of data mining techniques termed "frequent itemset mining" can be employed to discover patterns in the unassigned data, and how such patterns can help us interpret the origin of the unexpected/unexplained peaks. First, a model is proposed that describes the origin of the observed peaks in a mass spectrum. For this purpose we use the classical correlative database search algorithm. Peaks that support a positive identification of the spectrum are termed explained peaks. Next, frequent itemset mining techniques are introduced to infer which unexplained peaks are associated in a spectrum. The method is validated on two types of experimental proteomic data. First, peptide mass fingerprint data are analyzed to explain the unassigned peaks in a full scan mass spectrum. Interestingly, a large number of experimental spectra reveal several highly frequent unexplained masses, and pattern mining on these frequent masses demonstrates that subsets of these peaks frequently co-occur. Further evaluation shows that several of these co-occurring peaks indeed have a known common origin, and other patterns are promising hypothesis generators for further analysis. Second, the proposed methodology is validated on tandem mass spectrometry data using a public spectral library, where associations within the mass differences of unassigned peaks and peptide modifications are explored. The investigation of the found patterns illustrates that meaningful patterns can be discovered that can be explained by features of the employed technology and found modifications.
This simple approach offers opportunities to monitor accumulating unexplained mass spectrometry data for emerging new patterns, with possible applications for the development of mass exclusion lists, for the refinement of quality control strategies and for a further interpretation of unexplained spectral peaks in mass spectrometry and tandem mass spectrometry.
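The idea of mining co-occurring unexplained peaks can be sketched with the first two levels of an Apriori-style search: count how often each mass occurs, keep the frequent ones, then count how often pairs of frequent masses appear in the same spectrum. Treating each spectrum as a set of rounded masses, the function name `frequent_pairs`, and the `min_support` threshold are assumptions of this example, not details taken from the paper.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Find pairs of items that co-occur in at least min_support transactions.

    Each transaction is the collection of unexplained peak masses of one
    spectrum (assumed to be rounded or binned beforehand).
    """
    # Level 1: keep only items that are frequent on their own, since a pair
    # cannot be frequent unless both of its members are.
    item_counts = Counter(item for t in transactions for item in set(t))
    frequent_items = {i for i, c in item_counts.items() if c >= min_support}

    # Level 2: count co-occurrences among the surviving items.
    pair_counts = Counter()
    for t in transactions:
        kept = sorted(set(t) & frequent_items)
        pair_counts.update(combinations(kept, 2))
    return {pair: c for pair, c in pair_counts.items() if c >= min_support}


if __name__ == "__main__":
    # Hypothetical unexplained masses from four spectra.
    spectra = [
        [842.5, 1045.6, 2211.1],
        [842.5, 1045.6],
        [842.5, 2211.1, 1500.0],
        [1045.6, 842.5],
    ]
    print(frequent_pairs(spectra, min_support=3))
```

Pairs that pass the threshold are only candidates for a common origin; as the abstract notes, they serve as hypothesis generators that need further evaluation.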
Combining liquid chromatography-mass spectrometry (LC-MS)-based metabolomics experiments that were collected over a long period of time remains problematic due to systematic variability between LC-MS measurements. Until now, most normalization methods for LC-MS data have been model-driven, based on internal standards or intermediate quality control runs, where an external model is extrapolated to the dataset of interest. In the first part of this article, we evaluate several existing data-driven normalization approaches on LC-MS metabolomics experiments, which do not require the use of internal standards. According to variability measures, each normalization method performs relatively well, showing that the use of any normalization method will greatly improve the analysis of data originating from multiple experimental runs. In the second part, we apply cyclic-Loess normalization to a Leishmania sample. This normalization method allows the removal of systematic variability between two measurement blocks over time and maintains the differential metabolites. In conclusion, normalization allows for pooling datasets from different measurement blocks over time and increases the statistical power of the analysis, hence paving the way to increase the scale of LC-MS metabolomics experiments. From our investigation, we recommend data-driven normalization methods over model-driven normalization methods if only a few internal standards were used. Moreover, data-driven normalization methods are the best option to normalize datasets from untargeted LC-MS experiments.
Globally, colorectal cancer (CRC) is the third most common malignant neoplasm. However, highly sensitive, specific, noninvasive tests that allow CRC diagnosis at an early stage are still needed. As circulatory blood reflects the physiological status of an individual and/or the disease status for several disorders, efforts have been undertaken to identify candidate diagnostic CRC markers in plasma and serum. In this review, the challenges, bottlenecks and promising properties of mass spectrometry (MS)-based proteomics in blood are discussed. More specifically, important aspects in clinical design, sample retrieval, sample preparation, and MS analysis are presented. The recent developments in targeted MS approaches in plasma or serum are highlighted as well.
Quality control is increasingly recognized as a crucial aspect of mass spectrometry-based proteomics. Several recent papers discuss relevant parameters for quality control and present applications to extract these from the instrumental raw data. What has been missing, however, is a standard data exchange format for reporting these performance metrics. We therefore developed the qcML format, an XML-based standard that follows the design principles of the related mzML, mzIdentML, mzQuantML, and TraML standards from the HUPO-PSI (Proteomics Standards Initiative). In addition to the XML format, we also provide tools for the calculation of a wide range of quality metrics as well as a database format and interconversion tools, so that existing LIMS systems can easily add relational storage of the quality control data to their existing schema. We here describe the qcML specification, along with possible use cases and an illustrative example of the subsequent analysis possibilities. All information about qcML is available at http://code.google.com/p/qcml.
Papers by D. Valkenborg