CN114242174B - Identification and annotation method for endogenous retroviruses - Google Patents
Identification and annotation method for endogenous retroviruses
- Publication number
- CN114242174B
- Authority
- CN
- China
- Prior art keywords
- sequences
- endogenous
- sequence
- paired
- ltr
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/10—Ontologies; Annotations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
Description
Technical Field
The invention relates to the technical field of gene detection, and in particular to a method for identifying and annotating endogenous retroviruses.
Background
Viral components in a host genome are called endogenous retroviruses (ERVs). When a retrovirus infects a host, it can integrate its own genome, or some of its elements, into the host genome during replication. Many endogenous retroviruses integrated into host genomes millions of years ago. As an ancient "fossil record" of the co-evolution of viruses and their hosts, they provide raw material for studying the origin and evolution of life. Over long-term co-evolution they have become part of the host genome, and some have even evolved into functional genes with important biological roles; for example, the syncytin gene, which is associated with embryonic development, is derived from the envelope protein of an endogenous retrovirus. Endogenous retroviruses and their elements integrated into the genome can also exert antiviral functions, directly or indirectly. However, most endogenous retroviruses or elements integrated into the host genome are no longer intact and, because of genetic variation accumulated over long evolutionary periods, often share low homology with modern exogenous viruses, so identifying and annotating them has always been difficult. In general, efficient methods for identifying and annotating endogenous retroviruses and their elements in host genomes are still lacking.
CN201710418223.3 discloses a method for identifying and breeding a new line of PERV-pol gene-deficient Wuzhishan miniature pigs. It uses PCR and/or RT-PCR to determine whether cells carry the structural genes of porcine endogenous retrovirus, and thereby whether a pig under test belongs to an inbred line of Wuzhishan miniature pigs deficient in the porcine endogenous retrovirus pol gene. In practice this method is inefficient and has shortcomings.
On this basis, the present invention proposes a method for identifying and annotating endogenous retroviruses that can quickly and efficiently mine, identify, and annotate endogenous retroviruses and their elements in a host genome.
Summary of the Invention
The purpose of the present invention is to provide a method for identifying and annotating endogenous retroviruses that combines homologous sequence search with hidden Markov model prediction to identify and annotate endogenous retroviruses rapidly and efficiently.
To achieve this technical purpose, the present invention adopts the following technical solution:
In one aspect, the present invention provides a method for identifying and annotating endogenous retroviruses, comprising the following steps:
Step 1): Select typical and relatively conserved viral proteins as probes and use a homologous sequence alignment method to identify hit fragments similar to the probes. Output one reasonable hit region for each set of consecutive hit fragments, extend flanking sequences on both sides of that hit region, and remove region fragments that are contained within other regions to obtain the endogenous retrovirus candidate sequences;
Step 2): Use LTR harvest to identify paired long terminal repeat (LTR) sequences in the endogenous retrovirus candidate sequences obtained in step 1), and then extract the potentially complete endogenous retrovirus sequences and the endogenous retrovirus candidate sequences that do not contain paired LTR sequences;
Step 3): Based on the typical protein domain sequences of retroviruses, use a hidden Markov model to identify viral protein domains in the potentially complete endogenous retrovirus sequences and in the candidate sequences without paired LTR sequences extracted in step 2). Based on the hidden Markov model predictions, remove the false positives produced by the homologous sequence alignment in step 1);
Step 4): Locate the viral protein domains identified in step 3) on the host genome, generate an annotation file on the host genome, extract the full-length endogenous retrovirus sequences and the viral domain sequences, and generate the corresponding location and annotation files.
Further, the typical and relatively conserved viral protein is at least one selected from a Pol protein sequence, an ENV protein sequence, and an RT protein sequence.
Further, the homologous sequence alignment method is the BLAST local alignment method.
Further, the hit region is output as follows. Frameshift events affecting the viral sequence in the host genome produce many consecutive hit fragments. Endogenous retrovirus fragments that may have undergone frameshift mutation are recognized by judging whether the distance between hit fragments is reasonable (the stated criterion is a length greater than 10 kb) and whether their insertion directions on the host genome are the same (hit fragments with the same direction are treated as consecutive hit fragments on the same strand of the genome; those with different directions are treated as consecutive hit fragments on different strands). One reasonable hit region is output for the consecutive hit fragments on each genome strand, and the location of that hit region on the genome is returned.
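The chaining of consecutive hit fragments described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the `Hit` record, the `merge_hits` function, and the `"+"`/`"-"` strand labels are hypothetical names, and the sketch assumes the 10 kb criterion means that a gap of more than 10 kb between hits on the same strand starts a new hit region.

```python
from dataclasses import dataclass
from itertools import groupby

MAX_GAP = 10_000  # assumed reading of the 10 kb criterion in the text


@dataclass(frozen=True)
class Hit:
    contig: str
    start: int   # 1-based, start <= end
    end: int
    strand: str  # "+" (same strand) or "-" (complementary strand)


def merge_hits(hits, max_gap=MAX_GAP):
    """Chain hits on the same contig and strand into hit regions."""
    regions = []
    ordered = sorted(hits, key=lambda h: (h.contig, h.strand, h.start))
    for (contig, strand), group in groupby(ordered, key=lambda h: (h.contig, h.strand)):
        current = None
        for hit in group:
            if current is not None and hit.start - current.end <= max_gap:
                # Close enough to the running region: extend it.
                current = Hit(contig, current.start, max(current.end, hit.end), strand)
            else:
                if current is not None:
                    regions.append(current)
                current = hit
        if current is not None:
            regions.append(current)
    return regions
```

Hits on opposite strands are never merged, which mirrors the per-strand grouping in the text.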
Further, the flanking sequence is at least 15 kb long; that is, each reasonable hit region is extended by at least 15 kb of base sequence on each side. After region fragments contained within other regions are removed, the remaining fragment sequences are extracted as the endogenous retrovirus candidate sequences.
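A minimal Python sketch of the flank extension and containment filter follows. The function name `extend_and_filter`, the tuple layout, and the `contig_lengths` mapping are assumptions made for illustration; the sketch also assumes that containment is only checked between regions on the same contig and strand, which the text does not state explicitly.

```python
def extend_and_filter(regions, contig_lengths, flank=15_000):
    """Extend each (contig, start, end, strand) region by `flank` bases on
    both sides, clip to the contig, and drop regions contained in another."""
    extended = sorted(
        {
            (contig, strand, max(1, start - flank), min(contig_lengths[contig], end + flank))
            for contig, start, end, strand in regions
        },
        # Sort by start ascending, end descending, so any containing region
        # is always seen before the regions it contains.
        key=lambda r: (r[0], r[1], r[2], -r[3]),
    )
    kept, group, max_end = [], None, 0
    for contig, strand, start, end in extended:
        if (contig, strand) != group:
            group, max_end = (contig, strand), 0
        if end > max_end:  # not contained in any earlier region
            kept.append((contig, start, end, strand))
            max_end = end
    return kept
```

Clipping to the contig boundaries keeps the extended coordinates valid for hits near the ends of a scaffold.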
Further, in step 2), long terminal repeats are identified as follows: an enhanced suffix array index is built for the endogenous retrovirus candidate sequences obtained in step 1), and a search for paired LTR sequences is then run to identify the paired long terminal repeats.
Further, in step 2), the search criteria for paired LTR sequences are as follows: only LTR sequences with a suitable distance between them (for example, 1 kb to 15 kb), a suitable length (for example, greater than 100 bp), and a similarity greater than 80% are considered candidate paired LTR sequences.
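The three criteria can be expressed as a single predicate. The sketch below is illustrative only: `is_candidate_ltr_pair` is a hypothetical helper, and measuring the distance between the two LTR start positions is an assumption, since the text does not say how the distance is defined.

```python
def is_candidate_ltr_pair(left, right, similarity,
                          min_len=100, min_dist=1_000, max_dist=15_000,
                          min_similarity=80.0):
    """Apply the length, distance, and similarity criteria to one LTR pair.

    `left` and `right` are (start, end) tuples, 1-based and inclusive;
    `similarity` is a percentage between 0 and 100.
    """
    left_len = left[1] - left[0] + 1
    right_len = right[1] - right[0] + 1
    distance = right[0] - left[0]  # assumed: distance between LTR start positions
    return (
        left_len > min_len
        and right_len > min_len
        and min_dist <= distance <= max_dist
        and similarity > min_similarity
    )
```

The thresholds are keyword arguments so the example values from the text can be changed without editing the function.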
Further, the potentially complete endogenous retrovirus sequences and the candidate sequences without paired LTR sequences are extracted as follows: each candidate LTR pair, together with the region between the two LTRs, is treated as a potentially complete endogenous retrovirus sequence and its location on the genome is returned; duplicate regions are removed according to location, and the potentially complete endogenous retrovirus sequences are then extracted. Endogenous retrovirus candidate sequences that do not contain paired LTR sequences are also extracted separately and used to identify incomplete endogenous retroviruses and their viral elements.
Further, the retroviral protein domain sequences are obtained as follows: typical retroviral protein domain sequences, including GAG, DUT, PR, RT, INT, RNaseH, and ENV, are downloaded from a database of retroviral protein structures (such as the Gypsy database). Each nucleotide sequence is translated in all six reading frames into protein sequences, which serve as the input sequences, and a hidden Markov model is used to identify the viral protein domains separately in the potentially complete endogenous retrovirus sequences containing paired LTRs and in the candidate sequences without paired LTRs.
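Six-frame translation, as used to prepare the protein input for the hidden Markov model search, can be sketched without external libraries. The function names are hypothetical, the standard genetic code is assumed, and codons containing ambiguous bases are rendered as `X`.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate a nucleotide string from its first base; '*' marks stops."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frame_translate(seq):
    """Return the six translations keyed '+1'..'+3' and '-1'..'-3'."""
    frames = {}
    for strand, template in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(template[offset:])
    return frames
```

Each candidate yields six protein sequences, so a set of N candidates produces 6N inputs for the domain search.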
Further, in step 4), for each input sequence identified in step 3) as containing at least one of the above viral protein domains: (1) if the input sequence is a potentially complete endogenous retrovirus sequence containing paired LTRs, the two flanking LTRs and the identified viral domains are first located on the host genome and an annotation file on the host genome is generated; the complete full-length endogenous retrovirus sequence, the viral domain sequences, and the two flanking LTR sequences are then extracted, and a location and annotation file is generated for the flanking LTRs and viral domains on the complete full-length endogenous retrovirus sequence; (2) if the input sequence is an endogenous retrovirus candidate sequence without paired LTRs, the identified domains are first located on the host genome and an annotation file on the host genome is generated; the potential full-length endogenous retrovirus sequence and the viral domain sequences are then extracted according to their location on the host genome, and a location and annotation file is generated for the viral domains on the potential full-length endogenous retrovirus sequence.
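Locating a domain on the host genome amounts to converting coordinates on the extracted sequence back to genome coordinates. The sketch below shows one way to do that and to write a tab-separated annotation line. The text does not name an annotation file format, so the GFF3-style layout, the function names, and the example identifiers in the usage note are assumptions.

```python
def local_to_genome(region_start, region_end, strand, local_start, local_end):
    """Map 1-based inclusive coordinates on an extracted region back to the genome.

    For a region taken from the complementary strand ("-"), local position 1
    corresponds to `region_end` on the genome.
    """
    if strand == "+":
        return region_start + local_start - 1, region_start + local_end - 1
    return region_end - local_end + 1, region_end - local_start + 1


def gff3_line(contig, source, feature, start, end, strand, attributes):
    """Format one GFF3-style record (score and phase left as '.')."""
    attr = ";".join(f"{key}={value}" for key, value in attributes.items())
    return "\t".join([contig, source, feature, str(start), str(end), ".", strand, ".", attr])
```

For example, `gff3_line("scaf1", "erv_pipeline", "protein_match", *local_to_genome(1000, 1999, "-", 1, 10), "-", {"Name": "RT"})` would produce a record spanning genome positions 1990 to 1999; the contig and source names here are placeholders.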
Compared with the prior art, the technical solution provided by the present invention has the following beneficial effects: the method can rapidly and efficiently mine, identify, and annotate endogenous retroviruses and their elements in a host genome, and the combined prediction by homologous sequence alignment and a hidden Markov model greatly reduces the false positive rate.
1) By efficiently identifying and annotating endogenous retroviruses and their elements in host genomes, the invention provides a large amount of raw material for studying the origin and genetic evolution of viruses and life.
2) The method helps to identify and screen endogenous retroviruses and elements that can exert antiviral functions directly or indirectly, and will promote the development of related disciplines and fields.
Brief Description of the Drawings
Figure 1 is a flow chart of the identification and annotation method for endogenous retroviruses.
Detailed Description
The following examples are intended to illustrate the invention and do not further limit its scope of protection.
Example 1
This example provides a method for identifying and annotating endogenous retroviruses, as follows:
1. Search for endogenous retrovirus candidate sequences
Endogenous retrovirus candidate sequences are first identified mainly by homologous sequence search. Typical, conserved viral proteins are selected as probes (for example, the Pol or ENV protein sequence of a retrovirus), and a homologous sequence alignment method (such as a BLAST local alignment search) is used to identify hit fragments on the genome that are similar to the probes. If several viral proteins are used as probes, the search results for all probes are merged. The hit fragments are then classified by strand: same strand (the hit has the same orientation as the input sequence) or complementary strand (the hit has the opposite orientation to the input sequence).
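The strand classification can be illustrated with a small parser. This sketch assumes the search was run with tblastn and saved in BLAST+ tabular format (`-outfmt 6`) with the default twelve columns, where a subject start greater than the subject end indicates a hit on the complementary strand. The function names and the dictionary keys are hypothetical; the "same" and "complement" labels follow the naming used for sequence names in this document.

```python
def parse_tblastn_line(line):
    """Parse one default `-outfmt 6` line into a dict with a strand label."""
    fields = line.rstrip("\n").split("\t")
    sstart, send = int(fields[8]), int(fields[9])
    return {
        "probe": fields[0],
        "contig": fields[1],
        "start": min(sstart, send),
        "end": max(sstart, send),
        # tblastn reports sstart > send for hits on the reverse strand.
        "strand": "same" if sstart <= send else "complement",
        "evalue": float(fields[10]),
    }


def split_by_strand(lines):
    """Group parsed hits from one or more probe searches by strand."""
    groups = {"same": [], "complement": []}
    for line in lines:
        if line.strip() and not line.startswith("#"):
            hit = parse_tblastn_line(line)
            groups[hit["strand"]].append(hit)
    return groups
```

Passing the concatenated output of several probe searches to `split_by_strand` merges their results, as the paragraph above describes.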
Frameshift events affecting the viral sequence in the host genome produce many consecutive hit fragments. Endogenous retrovirus fragments that may have undergone frameshift mutation are recognized by judging whether the distance between hit fragments is reasonable (the stated criterion is a length greater than 10 kb) and whether their insertion directions on the host genome are the same (hit fragments with the same direction are treated as consecutive hit fragments on the same strand of the genome; those with different directions are treated as consecutive hit fragments on different strands). One reasonable hit region is output for the consecutive hit fragments, and the location of that hit region on the genome is returned. Finally, each reasonable hit region is extended by a flanking sequence on each side (for example, 15 kb of base sequence), and region fragments contained within other regions are removed to obtain the endogenous retrovirus candidate sequences.
2. Identification of paired long terminal repeats
An important feature of a complete retrovirus is the presence of long terminal repeats (LTRs) at both ends of the viral sequence. LTR harvest is used to identify them. An enhanced suffix array index is first built for the endogenous retrovirus candidate sequences; building this index speeds up the subsequent search and saves considerable computation time. Paired LTR sequences are then searched for. Only LTR sequences with a suitable distance between them (for example, 1 kb to 15 kb), a suitable length (for example, greater than 100 bp), and a similarity greater than 80% are considered candidate paired LTR sequences. Each candidate LTR pair, together with the region between the two LTRs, is treated as a potentially complete endogenous retrovirus sequence and its location on the genome is returned; duplicate regions are removed according to location, and the sequences of these potentially complete endogenous retroviruses are then extracted.
Endogenous retrovirus candidate sequences that do not contain paired LTR sequences are also extracted separately and used to identify incomplete endogenous retroviruses and their viral elements.
3. Identification of viral protein domains
Typical retroviral protein domain sequences, including GAG, DUT, PR, RT, INT, RNaseH, and ENV, are downloaded from a database of retroviral protein structures (such as the Gypsy database). Each potentially complete endogenous retrovirus sequence containing paired LTRs, and each endogenous retrovirus candidate sequence without paired LTRs, is translated in all six reading frames into six protein sequences, which serve as the input sequences. A hidden Markov model is used to identify the viral domains in each set, and the hidden Markov model predictions are used to remove false positives produced earlier by the homologous sequence alignment.
4. Annotation of endogenous retroviruses
For each input sequence identified as containing at least one of the above viral protein domains: (1) if the input sequence is a potentially complete endogenous retrovirus sequence containing paired LTRs, the two flanking LTRs and the identified viral domains are first located on the host genome and an annotation file on the host genome is generated; the complete full-length endogenous retrovirus sequence, the viral domain sequences, and the two flanking LTR sequences are then extracted, and a location and annotation file is generated for the flanking LTRs and viral domains on the complete full-length endogenous retrovirus sequence; (2) if the input sequence is an endogenous retrovirus candidate sequence without paired LTRs, the identified domains are first located on the host genome and an annotation file on the host genome is generated; the potential full-length endogenous retrovirus sequence and the viral domain sequences are then extracted according to their location on the host genome, and a location and annotation file is generated for the viral domains on the potential full-length endogenous retrovirus sequence.
Example application:
1. Download of host genome data
The genome sequence file of the bat species Pteronotus parnellii (GCA_000465405.1) is downloaded from the NCBI database (https://www.ncbi.nlm.nih.gov/). The genome file is about 2 Gb in size.
2. Obtaining endogenous retrovirus candidate sequences
(1) The bat genome sequence is built into a local nucleotide database. The RT and ENV protein sequences are both used as probe sequences, and the tblastn program in BLAST is used to identify hit fragments on the bat genome in the database that are similar to the probe sequences.
(2) The BLAST search results for the two probe sequences are merged, and the hit regions are classified by strand: same strand (the hit region has the same orientation as the input sequence) or complementary strand (the hit region has the opposite orientation to the input sequence). By judging whether the distance between hit fragments is greater than 10 kb, together with the insertion direction of the hit fragments on the host genome, one reasonable hit region is output for the consecutive hit fragments.
(3) Each reasonable hit region is extended by 15 kb of flanking sequence on each side, and region fragments contained within other regions are removed to obtain the endogenous retrovirus candidate sequences. In the bat species Pteronotus parnellii, a total of 628 endogenous retrovirus candidate sequences were identified.
3. Identification of paired LTRs in the endogenous retrovirus candidate sequences
LTR harvest is used to build a search index for the 628 endogenous retrovirus candidate sequences and to determine whether each candidate contains paired LTR sequences. The criteria for paired LTR sequences are: each LTR sequence is longer than 100 bp, the distance between the two LTR sequences is between 1 kb and 15 kb, and the similarity between the two LTR sequences is greater than 80%. Each candidate LTR pair, together with the region between the two LTRs, is treated as a potentially complete endogenous retrovirus sequence and its location on the genome is returned; duplicate regions are removed according to location, and the sequences of these potentially complete endogenous retroviruses are then extracted. Among the 628 endogenous retrovirus candidate sequences, paired LTR sequences were identified in 38 sequences (see Table 1). A sequence name containing "same" indicates that the identified endogenous retrovirus is on the same strand as the input sequence; a sequence name containing "complement" indicates that it is on the complementary strand of the input sequence.
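If the tool referred to here is the GenomeTools `gt ltrharvest` program, the indexing and search steps might be invoked roughly as sketched below. This is an assumption rather than the command line used in the patent: the file names are placeholders, `-minlenltr 100` only approximates the "greater than 100 bp" criterion, and the option names should be checked against the installed GenomeTools version before use.

```python
def ltrharvest_commands(fasta, index="erv_candidates", gff3_out="erv_candidates.gff3"):
    """Build argument lists for index construction and the paired-LTR search.

    Returns two lists suitable for subprocess.run(cmd, check=True).
    """
    build_index = [
        "gt", "suffixerator",
        "-db", fasta,
        "-indexname", index,
        "-tis", "-suf", "-lcp", "-des", "-ssp", "-sds", "-dna",
    ]
    search = [
        "gt", "ltrharvest",
        "-index", index,
        "-minlenltr", "100",    # minimum LTR length in bp
        "-mindistltr", "1000",  # minimum distance between the two LTRs
        "-maxdistltr", "15000", # maximum distance between the two LTRs
        "-similar", "80",       # minimum LTR similarity in percent
        "-gff3", gff3_out,
    ]
    return build_index, search
```

The function only builds the argument lists, so it can be inspected or logged without GenomeTools being installed.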
Table 1
4. Identification of viral protein domains
Typical retroviral protein domain sequences, including GAG, DUT, PR, RT, INT, RNaseH, and ENV, are downloaded from the Gypsy database. The 628 endogenous retrovirus candidate sequences are translated in all six reading frames into protein sequences (3768 in total). The viral domains in these 3768 protein sequences are then identified with a hidden Markov model built on the typical retroviral protein domains above.
5. Annotation of viral protein domains
For the 38 endogenous retrovirus candidate sequences containing paired LTRs and the remaining candidate sequences without paired LTRs, a candidate is accepted as a final endogenous retrovirus sequence only if the hidden Markov model search finds at least one typical retroviral protein domain in it. This step removes some of the false positives previously identified by BLAST. In total, 25 endogenous retrovirus sequences containing paired LTRs and 580 endogenous retrovirus sequences or elements without paired LTRs were identified. Finally, the viral protein domains of these sequences are precisely located on the host genome and on the extracted endogenous retroviruses, and detailed annotation documents are generated. These endogenous retroviral protein domains support the identification and screening of endogenous retroviruses and elements that can exert antiviral functions directly or indirectly.
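The false-positive filter described above reduces to keeping candidates with at least one domain hit and splitting them by LTR status. The following Python sketch is illustrative only; `confirm_candidates` and its argument layout are hypothetical, and the domain hits are assumed to have already been parsed from the hidden Markov model search output into `(candidate_id, domain_name)` pairs.

```python
from collections import defaultdict


def confirm_candidates(candidate_ids, domain_hits, paired_ltr_ids):
    """Keep candidates with at least one retroviral domain hit.

    Returns (with_paired_ltr, without_paired_ltr), each in input order.
    """
    domains = defaultdict(set)
    for candidate_id, domain_name in domain_hits:
        domains[candidate_id].add(domain_name)

    with_ltr, without_ltr = [], []
    for candidate_id in candidate_ids:
        if not domains.get(candidate_id):
            continue  # no domain support: treated as a BLAST false positive
        if candidate_id in paired_ltr_ids:
            with_ltr.append(candidate_id)
        else:
            without_ltr.append(candidate_id)
    return with_ltr, without_ltr
```

Candidates dropped here are the ones that matched a probe in the alignment step but show no recognizable retroviral domain.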
Table 2 shows an example annotation document for endogenous retroviral sequences containing paired LTRs, with domain coordinates given relative to the extracted endogenous retroviruses. For each endogenous retrovirus, the annotation document records its position in the original host genome (the numbers in the sequence name), the strand on which it lies ("same" in the sequence name means the same strand as the input sequence; "complement" means the complementary strand), and the start and end positions of each protein domain on the extracted endogenous retrovirus.
Table 2
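The relationship between domain positions on an extracted endogenous retrovirus and positions in the host genome can be sketched as follows. This is a minimal illustration under stated assumptions: all coordinates are taken to be 1-based and inclusive, a "complement" element is taken to be the reverse complement of its host interval, and the function and parameter names are hypothetical. The source does not specify the exact sequence-name layout, so the host start, end, and strand are passed in directly rather than parsed from a name.

```python
def domain_to_host_coords(host_start, host_end, strand, dom_start, dom_end):
    """Map a domain interval on an extracted element to host coordinates.

    Assumes 1-based inclusive coordinates throughout. `strand` is "same"
    or "complement", following the labels used in the sequence names.
    """
    element_length = host_end - host_start + 1
    if not (1 <= dom_start <= dom_end <= element_length):
        raise ValueError("domain interval lies outside the extracted element")
    if strand == "same":
        # Element position p corresponds to host position host_start + p - 1.
        return host_start + dom_start - 1, host_start + dom_end - 1
    if strand == "complement":
        # Element position p corresponds to host position host_end - p + 1,
        # so the interval is mirrored and its endpoints swap.
        return host_end - dom_end + 1, host_end - dom_start + 1
    raise ValueError(f"unknown strand label: {strand!r}")


same_coords = domain_to_host_coords(1001, 2000, "same", 101, 400)
comp_coords = domain_to_host_coords(1001, 2000, "complement", 101, 400)
```

With an element spanning host positions 1001 to 2000, a domain at 101 to 400 on the element maps to 1101 to 1400 on the same strand and to 1601 to 1900 when the element lies on the complementary strand.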
Although the present invention has been described in detail above through a general description, specific embodiments, and experiments, it will be apparent to those skilled in the art that modifications or improvements can be made on the basis of the invention. Such modifications or improvements, made without departing from the spirit of the invention, therefore fall within the scope of protection claimed.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210019782.8A CN114242174B (en) | 2022-01-10 | 2022-01-10 | Identification and annotation method for endogenous retroviruses |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114242174A (en) | 2022-03-25 |
CN114242174B (en) | 2022-08-16 |
Family
ID=80746192
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210019782.8A Active CN114242174B (en) | 2022-01-10 | 2022-01-10 | Identification and annotation method for endogenous retroviruses |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114242174B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116072222B (en) * | 2023-02-16 | 2024-02-06 | 湖南大学 | Methods and applications of viral genome identification and splicing |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106033502A (en) * | 2015-03-20 | 2016-10-19 | 深圳华大基因股份有限公司 | Methods and devices for identifying viruses |
CN110600079A (en) * | 2019-08-12 | 2019-12-20 | 中国水稻研究所 | Transgene identification method and identification device |
CN112342270A (en) * | 2019-08-07 | 2021-02-09 | 中国医学科学院病原生物学研究所 | Human respiratory virus targeted enrichment capture probe set and its application |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1404842A2 (en) * | 2001-06-29 | 2004-04-07 | Novartis AG | Perv screening method and use thereof |
US20090105092A1 (en) * | 2006-11-28 | 2009-04-23 | The Trustees Of Columbia University In The City Of New York | Viral database methods |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||