RU2408011C2 - Method of increasing accuracy of determining sequence of amino acid residues of biopolymer based on mass-spectrometric analysis data, computer system

Info

Publication number: RU2408011C2
Application number: RU2009103057/15A
Authority: RU
Inventors: Александр Иванович Арчаков (RU); Виктор Гаврилович Згода (RU); Андрей Валерьевич Лисица (RU); Сергей Александрович Мошковский (RU); Алексей Леонидович Чернобровкин (RU)
Original assignee: Общество С Ограниченной Ответственностью "Интерлаб"
Priority date: 2009-01-30
Filing date: 2009-01-30
Publication date: 2010-12-27
Also published as: RU2009103057A; WO2010087740A1

Abstract

FIELD: chemistry. SUBSTANCE: an algorithm for comparing mass spectra with a genome database is applied repeatedly after the database has been updated with new records, after records have been deleted from it, or after it has been replaced with a database composed of new records. Additional records are generated by making changes in the sequences of identified biopolymers that correspond to substitutions, deletions, insertions and modifications of one or more amino acid residues. The invention also relates to a computer system whose operation is based on the disclosed method. EFFECT: high accuracy of identifying the sequence of amino acid residues of a biopolymer. 5 cl, 1 dwg, 1 ex

Description

FIELD OF THE INVENTION

The present invention relates to methods for the computational processing of mass-spectrometric data aimed at identifying the primary structure of biopolymers, including proteins and peptides.

BACKGROUND OF THE INVENTION

Computational methods for processing mass-spectrometric data, aimed at identifying the primary structure of biopolymers, are currently the principal research approach in proteomics.

In the context of the present invention, a biopolymer is understood as a genome-encoded sequence of amino acid residues that contains at least one peptide bond and may carry chemical modifications of its residues, including modifications by non-protein components such as lipids, hydrocarbons, and other organic and inorganic elements, for example metals. The sequence of amino acid residues is variable owing to the following molecular-biological processes: alternative splicing, and insertions, deletions and substitutions of single amino acid residues. The last three categories of microvariability in the structure of protein biopolymers are denoted by the abbreviation SAP (Single Aminoacid Polymorphism). The set of individual features of an organism's proteins forms its proteotype. Determining the proteotype (proteotyping) requires a method for identifying microheterogeneous differences in the primary structures of proteins.

The primary structure of biopolymers is identified on the basis of mass-spectrometric data. The term "mass-spectrometric data" denotes information on the mass or mass-to-charge characteristics of intact proteins, of the peptide fragments produced by their hydrolysis, or of the fragments produced by induced dissociation of biopolymer ions. While biopolymers are being prepared for mass-spectrometric identification, their primary structure may undergo modifications that are specific to certain amino acid residues, or non-specific modifications, that is, modifications that do not depend on the type of residue in the primary structure of the biopolymer.

Mass-spectrometric data are processed using bioinformatics algorithms. Most of them, for example the Mowse algorithm [1], compare experimentally obtained mass-spectrometric data with values calculated from genomic databases (GDB). "Genomic databases" are a collection of information resources containing records of the amino acid sequences of proteins, obtained by decoding genomic information and/or the expressed regions of the genome. A GDB record comprises a unique protein identifier and the corresponding sequence of amino acid residues in letter code. When mass-spectrometric data are compared with a genomic database, the identification algorithm calculates a statistical confidence score, which indicates the probability that the protein has been identified correctly given the mass-spectrometric data and the particular genomic database. A protein is considered identified if the statistical confidence score exceeds an arbitrarily set threshold value.
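The comparison and thresholding described above can be illustrated with a minimal sketch. It is not the Mowse algorithm: the score here is simply the number of observed masses that match a theoretical tryptic peptide mass, and the function names, the mass tolerance and the threshold are assumptions made for illustration only.

```python
import re

# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to the residue sum of a free peptide


def tryptic_peptides(sequence):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def toy_score(observed_masses, sequence, tol=0.01):
    """Hypothetical score: number of observed masses explained by the protein."""
    theoretical = [peptide_mass(p) for p in tryptic_peptides(sequence)]
    return sum(
        any(abs(m - t) <= tol for t in theoretical) for m in observed_masses
    )


def identify(observed_masses, gdb, threshold=1):
    """Return {identifier: score} for records whose score exceeds the threshold."""
    scores = {pid: toy_score(observed_masses, seq) for pid, seq in gdb.items()}
    return {pid: s for pid, s in scores.items() if s > threshold}


# A GDB record is an identifier plus a sequence in letter code.
gdb = {"P1": "ACDKGGHR", "P2": "MMMMK"}
observed = [peptide_mass("ACDK"), peptide_mass("GGHR")]
identified = identify(observed, gdb)
```

A real identification algorithm would replace `toy_score` with a statistically calibrated score; the thresholding step is the only part taken directly from the description.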

During mass-spectrometric identification of a biopolymer, situations arise in which part of the mass-spectrometric data does not match the GDB, because the GDB lacks information on alternative splicing (AS) and SAP. At the same time, adding information on all possible AS and SAP variants to the GDB substantially lowers the statistical confidence of identification, because the combinatorial space of primary-structure variants that match the acquired mass-spectrometric data grows exponentially [2].
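To give a sense of the scale involved, the following sketch counts candidate variants for a sequence of n residues over a 20-letter alphabet. The counts are upper bounds on distinct sequences (some edits produce identical strings), and the function names are illustrative assumptions rather than anything defined in the source.

```python
from math import comb

ALPHABET_SIZE = 20  # standard amino acids


def single_edit_upper_bound(n):
    """Upper bound on variants that differ by one substitution, deletion or insertion."""
    substitutions = n * (ALPHABET_SIZE - 1)  # each position, 19 other residues
    deletions = n                            # drop any one residue
    insertions = (n + 1) * ALPHABET_SIZE     # any residue in any of n + 1 gaps
    return substitutions + deletions + insertions


def k_substitution_variants(n, k):
    """Number of ways to substitute exactly k of the n positions."""
    return comb(n, k) * (ALPHABET_SIZE - 1) ** k


one_edit = single_edit_upper_bound(100)
two_substitutions = k_substitution_variants(100, 2)
```

For a 100-residue sequence this gives 4020 single-edit candidates but 1,786,950 double-substitution candidates, and the count keeps growing by a similar factor with each additional simultaneous substitution, which is why adding all possible variants for every protein in the database dilutes the statistics.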

Publication [3] describes a method for increasing the accuracy of determining the amino acid sequence of peptides (the products of protein proteolysis) from mass-spectrometric data, based on the use of an extended GDB. In a preliminary step, the GDB is extended by including the amino acid sequences of proteins that contain SAP and post-translational modifications (PTM) annotated in various sources. Information on SAP and PTM is retrieved for all proteins contained in the original database.

DISCLOSURE OF THE INVENTION

The solution to this problem proposed by the present invention consists in re-applying mass-spectrometric identification algorithms after new records have been added to the GDB, or after a GDB has been created from new records, where the new records reflect information on AS and SAP and take into account the results of protein identification from the mass-spectrometric data. Thus, the present invention relates to a method of increasing the accuracy of determining the sequence of amino acid residues from mass-spectrometric data, which involves the use of at least one biopolymer identification algorithm based on comparing mass-spectrometric data with a genomic database, the algorithm being applied sequentially at least twice.

In one embodiment, the present invention provides for primary identification of proteins by an algorithm "AI", addition to the GDB of primary-structure variants containing AS products and SAP for the identified proteins only, and then repeated identification against the enriched database, either by the same algorithm "AI" or by a different algorithm "AI′".

In another embodiment, the present invention provides for primary identification of proteins by an algorithm "AI", creation of a GDB containing the primary structures of AS products and SAP variants of only the previously identified proteins, and then repeated identification against the enriched database, either by the same algorithm "AI" or by a different algorithm "AI′".
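The two embodiments differ only in how the database for the second pass is assembled. A minimal sketch follows, assuming the GDB is held as a dictionary of identifier to sequence and that variants are supplied per parent identifier; the function name, the `mode` parameter and the data layout are hypothetical. In "replace" mode the sketch keeps the records of the identified proteins alongside their variants, which follows the wording of claim 4.

```python
def build_modified_gdb(gdb, identified_ids, variants, mode="extend"):
    """Assemble the database for the second identification pass.

    gdb            -- dict: protein identifier -> sequence (original GDB)
    identified_ids -- identifiers returned by the first pass
    variants       -- dict: parent identifier -> {variant identifier: sequence}
    mode           -- "extend": original GDB plus variants of identified proteins
                      "replace": identified proteins plus their variants only
    """
    if mode not in ("extend", "replace"):
        raise ValueError(f"unknown mode: {mode!r}")

    # Variants are collected only for proteins identified in the first pass.
    new_records = {}
    for pid in identified_ids:
        new_records.update(variants.get(pid, {}))

    if mode == "extend":
        return {**gdb, **new_records}
    base = {pid: gdb[pid] for pid in identified_ids if pid in gdb}
    return {**base, **new_records}


gdb = {"A": "AAAK", "B": "CCCK", "C": "DDDK"}
variants = {"A": {"A_v1": "AFAK"}, "C": {"C_v1": "DEDK"}}

extended = build_modified_gdb(gdb, ["A"], variants, mode="extend")
replaced = build_modified_gdb(gdb, ["A"], variants, mode="replace")
```

In both modes the variant of protein "C" is left out because "C" was not identified in the first pass, which is the point of restricting the enrichment to identified proteins.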

The present invention differs advantageously from similar methods that use a combination of bioinformatics algorithms to raise the statistical confidence of identification in that the identification algorithms are applied sequentially, and the preceding algorithm (AI) is coupled to the subsequent one (AI′) by making changes to the GDB. To implement the proposed method, a single mass-spectrometric identification algorithm is sufficient, rather than at least two as claimed, for example, in patent publication [4].

A further advantage of the present invention over publication [3] is that before each repeated application of the algorithm, changes are made to the GDB that take into account the results of the previous application(s) of the algorithm (AI). This substantially increases the efficiency of the search (because each subsequent identification refines the previous one or ones) and its reliability (because the probability of false-positive results is sharply reduced).

The present invention also relates to a computing system whose operation is based on the method disclosed above. Mass-spectrometric data (MSD) are supplied to the input of the system. These data are used by the algorithm AI to identify biopolymers against the genomic database GDB. The identification results (RI) are a list of identifiers of proteins for which the identification confidence score exceeds a user-defined threshold value. For the proteins in RI, primary-structure variants are generated from information on known or putative AS products and SAP variants contained in external information sources (VII). The VII may be specialized databases of genetic polymorphism (for example, HapMap), databases containing information on known modifications of protein structure (for example, UniProt), and personal genotyping results (for example, 23andme.com). The identification algorithm AI′ is then applied to identify proteins against the database GDB′ using the original mass-spectrometric data MSD. The results of the algorithm AI′, denoted RI′, are compared with the earlier RI to establish which variants of changes in the primary structure of the proteins have been identified.
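The data flow of the system (MSD and GDB into AI, RI into the database modification step, MSD and GDB′ into AI′, then comparison of RI′ with RI) can be sketched as below. The identification algorithm is passed in as a callable because the description does not fix it. `toy_identify` is a deliberately crude stand-in that matches peptide strings rather than masses, and all names and sample sequences are assumptions for illustration.

```python
def two_pass_identification(msd, gdb, lookup_variants, identify, identify_prime=None):
    """Run AI on GDB, build GDB' from variants of identified proteins, run AI' on GDB'.

    lookup_variants(pid) returns {variant identifier: sequence} taken from
    external information sources (VII) for one identified protein.
    """
    identify_prime = identify_prime or identify  # AI may be identical to AI'

    ri = set(identify(msd, gdb))                 # first pass: RI
    gdb_prime = dict(gdb)                        # database modification: GDB'
    for pid in ri:
        gdb_prime.update(lookup_variants(pid))
    ri_prime = set(identify_prime(msd, gdb_prime))  # second pass: RI'

    newly_identified = ri_prime - ri             # comparison of RI' with RI
    return ri, ri_prime, newly_identified


def toy_identify(msd, gdb):
    """Stand-in for AI: a protein counts as identified if any observed
    peptide string occurs in its sequence."""
    return {pid for pid, seq in gdb.items() if any(p in seq for p in msd)}


gdb = {"P1": "MKCLISGWK", "P2": "MTTTK"}
vii = {"P1": {"P1_C3F": "MKFLISGWK"}}  # hypothetical external variant records
msd = {"MK", "FLISGWK"}                # observed peptides, as strings

ri, ri_prime, newly_identified = two_pass_identification(
    msd, gdb, lambda pid: vii.get(pid, {}), toy_identify
)
```

In this toy run the peptide `FLISGWK` has no match in the original database, so the variant record becomes identifiable only in the second pass.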

BRIEF DESCRIPTION OF THE DRAWING

The drawing shows a diagram of the computing system according to the present invention. The following notation is used in the diagram:

MSD - the original mass-spectrometric data supplied to the input of the system;

GDB - the original genomic database;

AI and AI′ - mass-spectrometric identification algorithms, where AI may be identical to AI′;

RI - the results of primary identification, which are a list of protein identifiers;

RI′ - the results of repeated identification, containing additional protein variants;

MGDB - modification of the genomic database;

GDB′ - the modified genomic database, which includes the protein variants contained in external information sources (VII).

Example 1. Identification of a polymorphic variant of the Trypsin-1 [Precursor] protein (Uniprot P07477) by the method according to the present invention

Mass-spectrometric data from a study of a human stem cell sample were downloaded from the Pride system (http://www.ebi.ac.uk/pride/). Primary mass-spectrometric identification of proteins from the downloaded mass spectra was performed with the Mascot program against the NCBI-nr database. For one of the identified proteins, 13 polymorphic variants were retrieved from the Uniprot database. A new database was formed by adding the list of polymorphic variants of the Trypsin-1 protein to the NCBI-nr database. Mass-spectrometric identification of proteins was then repeated with the Mascot program against the new database. The second identification revealed a polymorphic variant of the Trypsin-1 [Precursor] protein that differs from the wild type by substitution of phenylalanine for the cysteine at position 139. In the fragment-ion spectrum, the following peptide was identified:

K.(139)FLISGWGNTASSGADYPDELQCLDAPVLSQAK(170).C, which contains the indicated single amino acid substitution.
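As a rough check on why this peptide cannot match the unmodified database record, the sketch below reverts the substitution at position 139 and compares monoisotopic masses. It assumes unmodified residues (any cysteine alkylation applied during sample preparation is ignored), and the helper names are illustrative assumptions.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

# Peptide reported in the example, numbered 139 to 170 in the protein.
VARIANT_PEPTIDE = "FLISGWGNTASSGADYPDELQCLDAPVLSQAK"
FIRST_POSITION = 139


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def set_residue(peptide, first_position, position, residue):
    """Return the peptide with the residue at a protein-level position replaced."""
    i = position - first_position
    return peptide[:i] + residue + peptide[i + 1:]


# Wild type carries cysteine at position 139; the variant carries phenylalanine.
wild_type_peptide = set_residue(VARIANT_PEPTIDE, FIRST_POSITION, 139, "C")
mass_shift = peptide_mass(VARIANT_PEPTIDE) - peptide_mass(wild_type_peptide)
```

Under these assumptions the substitution shifts the peptide mass by about 44.06 Da, far outside typical mass tolerances, so the spectrum can only be assigned once the variant sequence is present in the database.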

INFORMATION SOURCES

[1]. Pappin D.J., Hojrup P., Bleasby A.J., Rapid identification of proteins by peptide-mass fingerprinting. Curr Biol 1993, 3(6), 327-332.

[2]. Kim S., Gupta N., Pevzner P.A., Spectral probabilities and generating functions of tandem mass spectra: a strike against decoy databases. J Proteome Res 2008, 7, 3354-3363.

[3]. Alves G., Ogurtsov A., Yu Y., RAId_DbS: mass-spectrometry based peptide identification web server with knowledge integration. BMC Genomics 2008, 9, 505.

[4]. Method and system for elucidating the primary structure of biopolymers; Bluggel M., Chamrad D., PROTAGEN AG, Dortmund (DE); United States Patent Application Publication US 2006/0188887 A1, Pub. Date: 24.08.2006.

Claims

1. A method of improving the accuracy of determining the sequence of amino acid residues from mass-spectrometric analysis data, which involves the use of at least one biopolymer identification algorithm based on comparing mass-spectrometric data with a genomic database, the algorithm being applied sequentially at least twice, characterized in that before each repeated application of the algorithm, changes are made to the genomic database that take into account the results of the previous application(s) of the algorithm.

2. The method according to claim 1, in which, before re-applying the algorithm, additional entries are made to the genomic database.

3. The method according to claim 2, in which the additional entries are generated by making changes in the sequences of the identified biopolymers corresponding to substitutions, deletions, insertions and modifications of one or more amino acid residues.

4. The method according to claim 1, in which, before re-applying the algorithm, the genomic database is replaced with a database consisting of records corresponding to previously identified biopolymers, as well as records created by making changes in the sequences of the identified biopolymers corresponding to substitutions, deletions, insertions and modifications of one or more amino acid residues.

5. A computing system, the operation of which is based on the method according to any one of claims 1 to 4.