CN114242174B

CN114242174B - Identification and annotation method for endogenous retroviruses

Info

Publication number: CN114242174B
Application number: CN202210019782.8A
Authority: CN
Inventors: 葛行义; 周秩建; 叶生宝; 邱烨
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2022-01-10
Filing date: 2022-01-10
Publication date: 2022-08-16
Anticipated expiration: 2042-01-10
Also published as: CN114242174A

Abstract

The invention provides a method for identifying and annotating endogenous retroviruses, comprising the following steps: selecting viral proteins as probes, identifying hit fragments similar to the probes, outputting hit regions, and extending flanking sequences on both sides of each hit region to obtain endogenous retrovirus candidate sequences; identifying paired LTR sequences in the candidate sequences using LTRharvest, and extracting potentially complete endogenous retrovirus sequences as well as candidate sequences without paired LTR sequences; identifying viral protein domains with a hidden Markov model based on typical retroviral protein domain sequences, and removing false positives; and annotating the endogenous retroviruses and extracting the protein domain sequences. The method enables rapid and efficient mining, identification and annotation of endogenous retroviruses and their elements in a host genome, and greatly reduces the false positive rate.

Description

A method for identification and annotation of endogenous retroviruses

Technical field

The invention relates to the technical field of gene detection, and in particular to a method for identifying and annotating endogenous retroviruses.

Background

Viral components in a host genome are called endogenous retroviruses (ERVs). When a retrovirus infects a host, steps in its replication cycle can integrate its genome, or some of its elements, into the host genome. Many endogenous retroviruses were integrated into host genomes millions of years ago. As an ancient "fossil record" of the co-evolution of viruses and their hosts, they provide raw material for studying the origin and evolution of life. Over this long co-evolution they have become part of the host genome, and some have even evolved into functional genes with important biological roles, such as the syncytin genes involved in embryonic development, which derive from endogenous retroviral envelope proteins. Endogenous retroviruses and their elements integrated into the genome can also exert antiviral functions, directly or indirectly. However, most endogenous retroviruses or elements integrated into host genomes are no longer intact, and genetic variation accumulated over long evolutionary periods means they often share little homology with modern exogenous viruses, so their identification and annotation have always been difficult. In general, efficient methods for identifying and annotating endogenous retroviruses and their elements in host genomes are still lacking.

CN201710418223.3 discloses a method for identifying and breeding a new PERV-pol gene-deficient line of Wuzhishan miniature pigs. It uses PCR and/or RT-PCR to determine whether cells carry structural genes of porcine endogenous retrovirus, and thus whether a tested pig belongs to an inbred Wuzhishan miniature pig line deficient in the porcine endogenous retrovirus pol gene. In practice this method is inefficient and has shortcomings.

On this basis, the present invention proposes a method for identifying and annotating endogenous retroviruses that can rapidly and efficiently mine, identify and annotate endogenous retroviruses and their elements in a host genome.

Summary of the invention

The purpose of the present invention is to provide a method for identifying and annotating endogenous retroviruses that combines homologous sequence searching with hidden Markov model prediction to identify and annotate endogenous retroviruses rapidly and efficiently.

To achieve this purpose, the present invention adopts the following technical solution:

In one aspect, the present invention provides a method for identifying and annotating endogenous retroviruses, comprising the following steps:

Step 1): Select typical and relatively conserved viral proteins as probes and use a homologous sequence alignment method to identify hit fragments similar to the probes. Output one reasonable hit region for each run of consecutive hit fragments, extend flanking sequences on both sides of that region, and remove region fragments that are contained in other regions to obtain the endogenous retrovirus candidate sequences;

Step 2): Use LTRharvest to identify paired long terminal repeat (LTR) sequences in the candidate sequences obtained in step 1), then extract the potentially complete endogenous retrovirus sequences and the endogenous retrovirus candidate sequences that contain no paired LTR sequences;

Step 3): Using a hidden Markov model based on typical retroviral protein domain sequences, identify the viral protein domains in the potentially complete endogenous retrovirus sequences and in the candidate sequences without paired LTR sequences extracted in step 2). Based on the hidden Markov model predictions, remove the false positives produced by the homologous sequence alignment in step 1);

Step 4): Based on the viral protein domains identified in step 3), locate them on the host genome and generate an annotation file on the host genome; extract the full-length endogenous retrovirus sequences and the viral domain sequences, and generate the corresponding location and annotation files.

Further, the typical and relatively conserved viral protein is at least one selected from a Pol protein sequence, an ENV protein sequence and an RT protein sequence.

Further, the homologous sequence alignment method is BLAST local alignment.

Further, hit regions are output as follows. Frameshift events affecting a viral sequence in the host genome produce many consecutive hit fragments. Endogenous retrovirus fragments that may have undergone frameshift mutations are recognized by judging whether the distance between hit fragments is reasonable (the stated criterion is a length greater than 10 kb) and whether their insertion directions on the host genome are the same (fragments with the same direction are treated as consecutive hit fragments on the same strand of the genome; fragments with different directions are treated as consecutive hit fragments on different strands). One reasonable hit region is output for the consecutive hit fragments on each genome strand, together with the location of that region on the genome.

Further, the flanking sequence is at least 15 kb long: each reasonable hit region is extended by at least 15 kb on both sides, region fragments contained in other regions are removed, and the remaining fragment sequences are extracted as endogenous retrovirus candidate sequences.

Further, in step 2), long terminal repeats are identified as follows: an enhanced suffix array index is built for the endogenous retrovirus candidate sequences obtained in step 1), and paired LTR sequences are then searched for, thereby identifying paired long terminal repeats.

Further, in step 2), the search criteria for paired LTR sequences are: only LTR sequences at a suitable distance (e.g. within 1 kb to 15 kb), of suitable length (e.g. greater than 100 bp) and with similarity greater than 80% are considered candidate paired LTR sequences.

Further, the potentially complete endogenous retrovirus sequences and the candidate sequences without paired LTR sequences are extracted as follows: each candidate LTR pair, together with the region between the two LTRs, is treated as a potentially complete endogenous retrovirus sequence and its location on the genome is returned; duplicate regions are removed according to these locations, and the potentially complete endogenous retrovirus sequences are then extracted. Candidate sequences that contain no paired LTR sequences are also extracted separately and used to identify incomplete endogenous retroviruses and their viral elements.

Further, the retroviral protein domain sequences are obtained as follows: typical retroviral protein domain sequences, including GAG, DUT, PR, RT, INT, RNaseH and ENV, are downloaded from a database of retroviral protein structures (such as the Gypsy database). Each potentially complete endogenous retrovirus sequence containing paired LTRs and each candidate sequence without paired LTRs is translated in all six reading frames, the resulting protein sequences are used as input sequences, and a hidden Markov model is used to identify their viral protein domains.

Further, in step 4), for each input sequence identified in step 3) as containing at least one of the above viral protein domains: (1) if the input sequence is a potentially complete endogenous retrovirus sequence containing paired LTRs, the flanking LTRs and the identified viral domains are first located on the host genome and an annotation file on the host genome is generated; the complete full-length endogenous retrovirus sequence, the viral domain sequences and the flanking LTR sequences are then extracted, and location and annotation files of the flanking LTRs and viral domains on the full-length sequence are generated; (2) if the input sequence is a candidate sequence without paired LTRs, the identified domains are first located on the host genome and an annotation file on the host genome is generated; the potential full-length endogenous retrovirus sequence and the viral domain sequences are then extracted according to their locations on the host genome, and location and annotation files of the viral domains on the potential full-length sequence are generated.

Compared with the prior art, the technical solution provided by the present invention has the following beneficial effects: the method can rapidly and efficiently mine, identify and annotate endogenous retroviruses and their elements in a host genome, and the combination of homologous sequence alignment with hidden Markov model prediction greatly reduces the false positive rate.

1) By efficiently identifying and annotating endogenous retroviruses and their elements in host genomes, the invention provides a large amount of raw material for studying the origin and genetic evolution of viruses and life.

2) The method helps to identify and screen endogenous retroviruses and elements that can exert antiviral functions directly or indirectly, and will promote the development of related disciplines and fields.

Description of drawings

Figure 1 is a flow chart of the method for identifying and annotating endogenous retroviruses.

Detailed description

The following examples are intended to illustrate the present invention, not to further limit its scope of protection.

Example 1

The present invention provides a method for identifying and annotating endogenous retroviruses, carried out as follows:

1. Search for endogenous retrovirus candidate sequences

Endogenous retrovirus candidate sequences are first identified mainly by homologous sequence searching. Typical, conserved viral proteins are selected as probes (for example the retroviral Pol or ENV protein sequence), and a homologous sequence alignment method (such as a BLAST local alignment search) is used to identify hit fragments on the genome that are similar to the probes. If several viral proteins are used as probe sequences, their search results are merged. The hit fragments are then classified as lying on the same strand (the gene orientation of the hit fragment is the same as that of the input sequence) or on the complementary strand (the gene orientation of the hit fragment is opposite to that of the input sequence).

Frameshift events affecting a viral sequence in the host genome produce many consecutive hit fragments. Endogenous retrovirus fragments that may have undergone frameshift mutations are recognized by judging whether the distance between hit fragments is reasonable (the stated criterion is a length greater than 10 kb) and whether their insertion directions on the host genome are the same (fragments with the same direction are treated as consecutive hit fragments on the same strand of the genome; fragments with different directions are treated as consecutive hit fragments on different strands). One reasonable hit region is output for the consecutive hit fragments, together with the location of that region on the genome. Finally, each reasonable hit region is extended by a flanking sequence on both sides (for example 15 kb of bases), and region fragments contained in other regions are removed to obtain the endogenous retrovirus candidate sequences.
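The chaining of hit fragments into hit regions can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the patent: the `Hit` tuple and the function `merge_hits` are hypothetical names, and the sketch reads the 10 kb criterion as the largest gap allowed between neighbouring hits on the same contig and strand, which is only one possible interpretation of the text.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    contig: str
    start: int   # 1-based, start <= end
    end: int
    strand: str  # "same" or "complement"


def merge_hits(hits, max_gap=10_000):
    """Chain neighbouring hits on the same contig and strand into hit regions.

    Assumption: two hits belong to one region when the gap between them
    is at most max_gap bases (10 kb here).
    """
    groups = defaultdict(list)
    for h in hits:
        groups[(h.contig, h.strand)].append(h)

    regions = []
    for (contig, strand), group in sorted(groups.items()):
        group.sort(key=lambda h: h.start)
        cur_start, cur_end = group[0].start, group[0].end
        for h in group[1:]:
            if h.start - cur_end <= max_gap:
                cur_end = max(cur_end, h.end)   # extend the current region
            else:
                regions.append(Hit(contig, cur_start, cur_end, strand))
                cur_start, cur_end = h.start, h.end
        regions.append(Hit(contig, cur_start, cur_end, strand))
    return regions


# Made-up hits for illustration only.
example_hits = [
    Hit("ctg1", 1_000, 1_600, "same"),
    Hit("ctg1", 2_100, 2_900, "same"),        # 500 bp after the first hit
    Hit("ctg1", 40_000, 40_800, "same"),      # far away: separate region
    Hit("ctg1", 2_000, 2_500, "complement"),  # other strand: separate region
]
regions = merge_hits(example_hits)
```

Grouping by contig and strand before sorting keeps hits with opposite insertion directions in separate regions, mirroring the same-strand and complementary-strand classification described above.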

2. Identification of paired long terminal repeats

An important feature of a complete retrovirus is the presence of long terminal repeats (LTR sequences) at both ends of the virus. LTRharvest is used to identify them. An enhanced suffix array index is first built for the endogenous retrovirus candidate sequences above; this index speeds up the subsequent search and saves a large amount of computing time. Paired LTR sequences are then searched for. Only LTR sequences at a suitable distance (e.g. within 1 kb to 15 kb), of suitable length (e.g. greater than 100 bp) and with similarity greater than 80% are considered candidate paired LTR sequences. Each candidate LTR pair, together with the region between the two LTRs, is treated as a potentially complete endogenous retrovirus sequence and its location on the genome is returned; duplicate regions are removed according to these locations, and the potentially complete endogenous retrovirus sequences are then extracted.
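One way to run the two stages (index construction, then the LTR search) is through the GenomeTools command-line programs `gt suffixerator` and `gt ltrharvest`. The Python sketch below only assembles the command lines. The file and index names are hypothetical, and the mapping of the thresholds in the text onto `-minlenltr`, `-mindistltr`, `-maxdistltr` and `-similar` is an assumption: the patent does not list the options it used, and how LTRharvest measures distance and treats boundary values should be checked against its documentation.

```python
import shlex


def ltrharvest_commands(fasta, index_name):
    """Return the index-building and LTR-search commands as argument lists."""
    build_index = [
        "gt", "suffixerator",
        "-db", fasta, "-indexname", index_name,
        "-tis", "-suf", "-lcp", "-des", "-ssp", "-sds", "-dna",
    ]
    find_ltrs = [
        "gt", "ltrharvest",
        "-index", index_name,
        "-minlenltr", "100",     # LTR length threshold (100 bp)
        "-mindistltr", "1000",   # lower distance bound (1 kb)
        "-maxdistltr", "15000",  # upper distance bound (15 kb)
        "-similar", "80",        # similarity threshold (80 %)
        "-gff3", index_name + ".ltr.gff3",
    ]
    return build_index, find_ltrs


# Hypothetical file names; pass each list to subprocess.run to execute it.
build_index, find_ltrs = ltrharvest_commands("erv_candidates.fa", "erv_candidates")
for cmd in (build_index, find_ltrs):
    print(shlex.join(cmd))
```

Returning argument lists rather than shell strings avoids quoting problems when the commands are later executed.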

Candidate sequences that contain no paired LTR sequences are also extracted separately and used to identify incomplete endogenous retroviruses and their viral elements.

3. Identification of viral protein domains

Typical retroviral protein domain sequences, including GAG, DUT, PR, RT, INT, RNaseH and ENV, are downloaded from a database of retroviral protein structures (such as the Gypsy database). Each potentially complete endogenous retrovirus sequence containing paired LTRs and each candidate sequence without paired LTRs is translated in all six reading frames into six protein sequences, which are used as input sequences. A hidden Markov model is then used to identify viral domains in each of them, and based on the hidden Markov model predictions the false positives previously produced by the homologous sequence alignment are removed.
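The six-frame translation step can be illustrated with a short, dependency-free Python sketch. It uses the standard genetic code and translates stop codons as `*`; whether the original pipeline split the translations at stop codons or handled ambiguous bases differently is not stated, so those details are assumptions. The function names are illustrative.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)


def six_frame_translation(dna):
    """Return the three forward and three reverse-complement translations."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
print(frames["+1"])  # MA*
print(frames["-1"])  # LGH
```

Each nucleotide sequence yields six protein sequences, so a set of N candidate sequences produces 6N inputs for the domain search.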

4. Annotation of endogenous retroviruses

For each input sequence identified as containing at least one of the above viral protein domains: (1) if the input sequence is a potentially complete endogenous retrovirus sequence containing paired LTRs, the flanking LTRs and the identified viral domains are first located on the host genome and an annotation file on the host genome is generated; the complete full-length endogenous retrovirus sequence, the viral domain sequences and the flanking LTR sequences are then extracted, and location and annotation files of the flanking LTRs and viral domains on the full-length sequence are generated; (2) if the input sequence is a candidate sequence without paired LTRs, the identified domains are first located on the host genome and an annotation file on the host genome is generated; the potential full-length endogenous retrovirus sequence and the viral domain sequences are then extracted according to their locations on the host genome, and location and annotation files of the viral domains on the potential full-length sequence are generated.

Application example:

1. Downloading the host genome data

The genome sequence file of the bat species Pteronotus parnellii (GCA_000465405.1) was downloaded from the NCBI database (https://www.ncbi.nlm.nih.gov/); the genome file is about 2 Gb in size.

2. Obtaining endogenous retrovirus candidate sequences

(1) The bat genome sequence was built into a local nucleotide database. RT and ENV protein sequences were both used as probe sequences, and the tblastn program of BLAST was used to identify hit fragments on the bat genome in the database that are similar to the probe sequences.
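As an illustration of how tblastn hits can be read and assigned to a strand, the sketch below parses BLAST tabular output. It assumes the search was run with the BLAST+ option `-outfmt 6` and its default twelve columns; the patent does not say which output format was used. In this format a hit on the reverse strand of the subject is reported with `sstart` greater than `send`. The function name and the sample lines are made up for the example.

```python
def parse_tblastn_line(line):
    """Parse one line of BLAST tabular output (-outfmt 6, default 12 columns)."""
    f = line.rstrip("\n").split("\t")
    sstart, send = int(f[8]), int(f[9])
    return {
        "probe": f[0],
        "contig": f[1],
        "identity": float(f[2]),
        "start": min(sstart, send),  # normalised so that start <= end
        "end": max(sstart, send),
        "strand": "same" if sstart <= send else "complement",
        "evalue": float(f[10]),
    }


# Made-up lines for illustration only; they are not results from the patent.
sample = [
    "RT_probe\tctg1\t45.2\t210\t100\t3\t5\t214\t1200\t1829\t1e-40\t160",
    "ENV_probe\tctg1\t38.0\t150\t85\t2\t10\t159\t9450\t9001\t2e-12\t70.1",
]
hits = [parse_tblastn_line(line) for line in sample]
```

Because both probes are parsed into the same record type, merging the results of several probes amounts to concatenating the parsed lists.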

(2) The BLAST search results of the two probe sequences were merged, and the hit regions were classified as same strand (the gene orientation of the hit region is the same as that of the input sequence) or complementary strand (the gene orientation of the hit region is opposite to that of the input sequence). By judging whether the distance between hit fragments is greater than 10 kb, together with the insertion direction of the hit fragments on the host genome, one reasonable hit region was output for each run of consecutive hit fragments.

(3) Each reasonable hit region above was extended by a 15 kb flanking sequence on both sides, and region fragments contained in other regions were removed to obtain the endogenous retrovirus candidate sequences. In the bat species Pteronotus parnellii, a total of 628 endogenous retrovirus candidate sequences were identified.
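Flank extension and the removal of contained regions can be sketched as interval operations. The sketch assumes 1-based inclusive coordinates, clips the extension at the contig ends, and drops an interval only when it lies entirely inside another one; these choices, the function names and the example coordinates are assumptions made for illustration.

```python
def extend_region(start, end, contig_length, flank=15_000):
    """Extend a region by `flank` bases on each side, clipped to the contig."""
    return max(1, start - flank), min(contig_length, end + flank)


def drop_contained(regions):
    """Remove every (start, end) interval that lies entirely inside another one."""
    kept = []
    # Sort by start, longest first, so a container precedes what it contains.
    for start, end in sorted(set(regions), key=lambda r: (r[0], -r[1])):
        if kept and end <= kept[-1][1]:
            continue  # contained in the most recently kept interval
        kept.append((start, end))
    return kept


# Made-up hit regions on a 100 kb contig.
hit_regions = [(20_000, 21_000), (20_500, 20_800), (3_000, 4_000)]
extended = [extend_region(s, e, contig_length=100_000) for s, e in hit_regions]
candidates = drop_contained(extended)
print(candidates)
```

Overlapping intervals that do not contain one another are kept as separate candidates in this sketch.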

3. Identifying paired LTRs in the endogenous retrovirus candidate sequences

LTRharvest was used to build a search index for the 628 endogenous retrovirus candidate sequences above and to determine whether each candidate sequence contains paired LTR sequences. The criteria for paired LTR sequences are: each LTR sequence is longer than 100 bp, the distance between the two LTR sequences is within 1 kb to 15 kb, and the similarity between the two LTR sequences is greater than 80%. Each candidate LTR pair, together with the region between the two LTRs, was treated as a potentially complete endogenous retrovirus sequence and its location on the genome was returned; duplicate regions were removed according to these locations, and the potentially complete endogenous retrovirus sequences were then extracted. Among the 628 endogenous retrovirus candidate sequences, paired LTR sequences were identified in 38 sequences (see Table 1). A sequence name containing "same" indicates that the identified endogenous retrovirus lies on the same strand as the input sequence; a name containing "complement" indicates that it lies on the complementary strand of the input sequence.

Table 1
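The three criteria for a candidate LTR pair can be written as a small predicate. In this sketch the distance is measured between the start positions of the two LTRs, which is an assumption, and `LtrPair` and `is_candidate_pair` are hypothetical names. The LTR coordinates in the first example are taken from the first row of Table 2; the similarity values are invented because the source does not report them.

```python
from typing import NamedTuple


class LtrPair(NamedTuple):
    left_start: int
    left_end: int
    right_start: int
    right_end: int
    similarity: float  # percent identity between the two LTRs


def is_candidate_pair(p, min_len=100, min_dist=1_000, max_dist=15_000, min_sim=80.0):
    """Apply the length, distance and similarity criteria to one LTR pair."""
    left_len = p.left_end - p.left_start + 1
    right_len = p.right_end - p.right_start + 1
    distance = p.right_start - p.left_start  # assumed definition of distance
    return (
        left_len > min_len
        and right_len > min_len
        and min_dist <= distance <= max_dist
        and p.similarity > min_sim
    )


# Coordinates from Table 2 (5'-LTR 1..497, 3'-LTR 5181..5702); similarity is hypothetical.
good = LtrPair(1, 497, 5181, 5702, similarity=90.0)
too_short = LtrPair(1, 80, 5181, 5260, similarity=95.0)
too_divergent = LtrPair(1, 497, 5181, 5702, similarity=75.0)
```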

4. Identifying viral protein domains

Typical retroviral protein domain sequences, including GAG, DUT, PR, RT, INT, RNaseH and ENV, were downloaded from the Gypsy database. The 628 endogenous retrovirus candidate sequences above were translated in all six reading frames into protein sequences (3768 in total). Viral domains were identified in these 3768 protein sequences with a hidden Markov model based on the typical retroviral protein domains above.

5. Annotating viral protein domains

For the 38 endogenous retrovirus candidate sequences containing paired LTRs and the remaining candidate sequences without paired LTRs, a sequence was confirmed as a final endogenous retrovirus candidate only if the hidden Markov model search found one or more typical retroviral protein domains in it. This step removes some of the false positives previously identified by the BLAST method. In total, 25 endogenous retrovirus sequences containing paired LTRs and 580 endogenous retrovirus sequences or elements without paired LTRs were finally identified. Finally, the viral protein domains of these candidate sequences were precisely located on the host genome and on the extracted endogenous retroviruses, and detailed annotation documents were generated. These endogenous retroviral protein domains serve the identification and screening of endogenous retroviruses and elements that can exert antiviral functions directly or indirectly.
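The filtering rule (keep a candidate only if at least one typical retroviral domain was found, then report it according to whether it has paired LTRs) can be sketched as follows. The data structures, the function name and the candidate names are placeholders, not taken from the patent.

```python
def split_final_candidates(domains_by_candidate, has_paired_ltr):
    """Keep candidates with at least one domain hit and split them by LTR status."""
    with_ltr, without_ltr = [], []
    for name, domains in domains_by_candidate.items():
        if not domains:
            continue  # no domain support: treated as a homology-search false positive
        (with_ltr if name in has_paired_ltr else without_ltr).append(name)
    return with_ltr, without_ltr


# Toy input; the names are placeholders, not sequences from the patent.
domains_by_candidate = {
    "cand_1": {"GAG", "RT"},
    "cand_2": set(),
    "cand_3": {"ENV"},
}
with_ltr, without_ltr = split_final_candidates(
    domains_by_candidate, has_paired_ltr={"cand_1", "cand_2"}
)
```

Here `cand_2` is discarded even though it has paired LTRs, because no domain was detected in it.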

Table 2 shows an example of the annotation document for endogenous retrovirus sequences containing paired LTRs, with coordinates given on the extracted endogenous retrovirus. For each endogenous retrovirus, the annotation document records its position in the original host genome (the numbers in the sequence name), the strand on which it lies ("same" in the sequence name means the same strand as the input sequence; "complement" means the complementary strand of the input sequence), and the start and end positions of each protein domain on the extracted endogenous retrovirus.

Table 2

paired_LTR ERVs                        5'-LTR  3'-LTR        GAG         DUT         AP          RT          RNaseH      INT         ENV
KE836813.1|5550|11251|complement       1..497  5181..5702    1923..2435  2834..2968  3188..3280  3634..4272  not found   not found   not found
AWGZ01387580.1|212|8442|complement     1..274  7979..8231    1756..2202  2776..3015  3103..3342  3754..4167  4950..5090  5127..5546  7350..7487
KE834585.1|28634|33361|complement      1..110  4621..4728    1948..2169  2536..2802  2887..3132  3451..4020  not found   not found   not found
KE910746.1|37935|44952|same            1..306  6692..7018    not found   not found   not found   2493..2795  3556..3726  3903..4289  not found
KE902113.1|5948|11935|complement       1..251  5724..5988    not found   not found   not found   2616..2912  not found   4185..4307  not found
KE839960.1|10792|18188|same            1..531  6942..7397    2897..3199  not found   3855..3917  4165..4533  not found   not found   6397..6669
KE870169.1|1525|10645|same             1..230  8873..9121    1877..2023  not found   2568..2654  2933..3631  4365..4634  4860..5192  not found
AWGZ01396048.1|10776|19052|complement  1..224  8058..8277    1511..1705  not found   2363..2545  2868..3473  4141..4521  5033..5302  7432..7557
KE827911.1|26386|33260|complement      1..216  6644..6875    not found   not found   not found   2315..2635  not found   4402..4527  not found
KE893677.1|2553|9228|same              1..235  6460..6676    1919..2047  not found   2663..2743  3015..3257  4306..4608  5103..5312  not found
KE842560.1|4542|9092|complement        1..146  4422..4551    not found   not found   not found   971..1207   not found   not found   not found
KE834830.1|23532|26760|same            1..126  3096..3229    not found   not found   not found   not found   not found   440..1342   not found
KE873916.1|24431|30941|same            1..163  6350..6511    not found   not found   not found   2514..2639  not found   4383..4541  5926..6054
AWGZ01271927.1|5500|7412|complement    1..200  1715..1913    not found   not found   not found   289..693    1687..1905  not found   not found
KE824373.1|19456|27134|complement      1..201  7449..7679    not found   not found   2504..2617  2895..3074  4211..4615  5194..5280  not found
AWGZ01389043.1|13908|22015|complement  1..225  7874..8108    2023..2181  not found   2561..2767  3078..3416  4503..4790  5020..5409  7072..7677
KE839612.1|16878|25796|same            1..134  8788..8919    2810..3007  not found   not found   4115..4549  5296..5520  6139..6318  8086..8421
KE919690.1|3176|9835|complement        1..194  6479..6660    not found   not found   3048..3242  3718..3942  4792..5097  5581..5991  not found
KE849596.1|25855|31240|complement      1..186  5194..5386    1897..1998  not found   2599..2667  2985..3185  4243..4431  not found   4740..4844
KE874998.1|3250|13590|complement       1..139  10210..10341  3800..4087  not found   not found   5251..5724  6498..6749  7129..7500  9172..9384
KE820627.1|23|7334|same                1..454  6842..7312    3312..3425  not found   not found   4132..4341  not found   not found   6454..6540
KE905742.1|4419|10795|complement       1..138  6249..6377    1133..1333  not found   not found   2733..2921  3641..3907  4302..4532  not found
AWGZ01282780.1|7302|14150|same         1..194  6661..6849    not found   not found   not found   2521..2841  not found   4606..4815  6126..6440
AWGZ01040320.1|700|1872|complement     1..113  1061..1173    not found   not found   not found   not found   not found   not found   785..922
KE887893.1|13652|18794|same            1..361  4785..5143    not found   not found   not found   not found   not found   not found   4417..4524

Although the present invention has been described in detail above by way of general description, specific embodiments, and tests, it will be obvious to those skilled in the art that modifications or improvements can be made on the basis of the present invention. Therefore, such modifications or improvements made without departing from the spirit of the present invention fall within the scope of protection claimed by the present invention.

Claims

1. A method for identifying and annotating endogenous retroviruses, characterized in that the method comprises the following steps:

Step 1): Select a typical and relatively conserved viral protein as a probe, use a homologous sequence alignment method to identify hit fragments similar to the probe, output a reasonable hit region for consecutive hit fragments, extend flanking sequences on both sides of the reasonable hit region, and, after removing region fragments that are contained within other regions, obtain the endogenous retrovirus candidate sequences;

Step 2): Use LTR harvest to identify paired long terminal repeats (LTRs) in the endogenous retrovirus candidate sequences obtained in step 1), and then extract the potential complete endogenous retrovirus sequences and the endogenous retrovirus candidate sequences that do not contain paired LTR sequences;

Step 3): Using a hidden Markov model based on the canonical protein domain sequences of retroviruses, identify the viral protein domains of the potential complete endogenous retrovirus sequences extracted in step 2) and of the endogenous retrovirus candidate sequences that do not contain paired LTR sequences, thereby further removing false-positive results of the homologous sequence alignment method in step 1);

Step 4): Based on the viral protein domains identified in step 3), locate them on the host genome, generate an annotation file on the host genome, extract the full-length sequence and the viral domain sequences of the endogenous retrovirus, and generate the corresponding location and annotation files on the endogenous retrovirus.

2. The method for identifying and annotating endogenous retroviruses according to claim 1, wherein the typical and relatively conserved viral protein is at least one selected from the Pol protein sequence, the ENV protein sequence, and the RT protein sequence.

3. The method for identifying and annotating endogenous retroviruses according to claim 1, wherein the homologous sequence alignment method is the BLAST local alignment method.

4. The method for identifying and annotating endogenous retroviruses according to claim 1, wherein the hit region is output as follows: frameshift events of the viral sequence in the host genome can give rise to multiple consecutive hit fragments; by evaluating the distance between the hit fragments and their insertion direction on the host genome, endogenous retrovirus fragments that may have undergone frameshift mutation are identified, a reasonable hit region is output for the consecutive hit fragments, and the location of the hit region on the genome is returned.
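
The grouping of consecutive hit fragments by distance and insertion direction can be illustrated with a minimal Python sketch. The `Hit` record, the `merge_hits` function, and the `max_gap` threshold of 500 bp are hypothetical; the claim does not state a distance cutoff or a data layout.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    contig: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-" (insertion direction on the host genome)


def merge_hits(hits, max_gap=500):
    """Merge hits on the same contig and strand whose gap is <= max_gap.

    max_gap is an illustrative value, not one taken from the claims.
    """
    regions = []
    for h in sorted(hits, key=lambda h: (h.contig, h.strand, h.start)):
        last = regions[-1] if regions else None
        if (last is not None and last.contig == h.contig
                and last.strand == h.strand
                and h.start - last.end <= max_gap):
            # Close enough and same direction: extend the current region.
            last.end = max(last.end, h.end)
        else:
            # Start a new region (copy, so the input hits are not mutated).
            regions.append(Hit(h.contig, h.start, h.end, h.strand))
    return regions
```

Each returned `Hit` carries the contig, coordinates, and strand of one merged hit region, which corresponds to returning the location of the hit region on the genome.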

5. The method for identifying and annotating endogenous retroviruses according to claim 1, wherein the length of the flanking sequence is at least 15kb.
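
Flank extension, together with the removal of contained regions described in step 1), might look like the following sketch. The function names and the tuple representation are hypothetical, and clamping to the contig boundaries is an assumption about how regions near contig ends are handled.

```python
def extend_region(start, end, contig_len, flank=15_000):
    """Extend a 1-based inclusive region by `flank` bp on each side,
    clamped to the contig boundaries (clamping is an assumption)."""
    return max(1, start - flank), min(contig_len, end + flank)


def drop_contained(regions):
    """Remove exact duplicates and regions lying wholly inside another.

    regions: iterable of (start, end) tuples on one contig.
    """
    kept = []
    # Sort by start ascending, then by end descending, so that a
    # containing region always precedes the regions it contains.
    for s, e in sorted(set(regions), key=lambda r: (r[0], -r[1])):
        if kept and e <= kept[-1][1]:
            continue  # contained in the last kept region
        kept.append((s, e))
    return kept
```

Overlapping but non-contained regions are kept as they are in this sketch; only fully contained regions are dropped.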

6. The method for identifying and annotating endogenous retroviruses according to claim 1, characterized in that, in step 2), long terminal repeats are identified as follows: an enhanced suffix index array is constructed from the endogenous retrovirus candidate sequences obtained in step 1), and a paired LTR sequence search is performed on it to identify paired long terminal repeats.

7. The method for identifying and annotating endogenous retroviruses according to claim 1, wherein in step 2), the search criteria for the paired LTR sequences are: LTR sequences separated by a distance of 1 kb to 15 kb, each with a sequence length greater than 100 bp and with more than 80% similarity, are taken as candidate paired LTR sequences.
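
The search criteria of claim 7 can be written as a simple predicate. This is a sketch only: the function name is hypothetical, and measuring the distance between the start positions of the two LTRs is an assumption, since the claim does not say between which points the 1 kb to 15 kb distance is taken.

```python
def is_candidate_ltr_pair(left, right, similarity,
                          min_dist=1_000, max_dist=15_000,
                          min_len=100, min_sim=80.0):
    """Check one LTR pair against the distance, length, and similarity criteria.

    left, right: (start, end) tuples, 1-based inclusive.
    similarity:  percent identity between the two LTRs.
    """
    left_len = left[1] - left[0] + 1
    right_len = right[1] - right[0] + 1
    # Assumption: distance is measured between the LTR start positions.
    distance = right[0] - left[0]
    return (min_dist <= distance <= max_dist
            and left_len > min_len
            and right_len > min_len
            and similarity > min_sim)
```

For example, a pair at `(1, 216)` and `(6644, 6875)` with a hypothetical similarity of 90% passes, while a pair whose start positions lie less than 1 kb apart is rejected.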

8. The method for identifying and annotating endogenous retroviruses according to claim 1, wherein the potential complete endogenous retrovirus sequences and the endogenous retrovirus candidate sequences that do not contain paired LTR sequences are extracted as follows: each candidate pair of LTR sequences, together with the region between them, is taken as a potential complete endogenous retrovirus sequence and its location on the genome is returned; duplicate matching regions are then removed according to location, and the potential complete endogenous retrovirus sequences are extracted; at the same time, the endogenous retrovirus candidate sequences that do not contain paired LTR sequences are extracted separately and used for the identification of incomplete endogenous retroviruses and their viral elements.

9. The method for identifying and annotating endogenous retroviruses according to claim 1, wherein the retrovirus protein domain sequences are obtained and used as follows: the canonical protein domain sequences of retroviruses are downloaded from a database related to retrovirus protein structure; translation into protein sequences is performed on the basis of the six coding translation frames, and the translated sequences are used as input sequences; and a hidden Markov model is used to identify the viral protein domains of the potential complete endogenous retrovirus sequences containing paired LTRs and of the endogenous retrovirus candidate sequences without paired LTRs, respectively.
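
Translation in the six reading frames can be sketched in plain Python as follows. The sketch assumes that it is the nucleotide candidate sequences that are translated before the hidden Markov model search, and it uses the standard genetic code; the function names are hypothetical, and the domain search itself is not shown.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translate(seq):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames
```

Stop codons are kept as `*` rather than used to split the translation, so that domains interrupted by mutations in degraded insertions can still be searched.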

10. The method for identifying and annotating endogenous retroviruses according to claim 1, wherein in step 4), based on the input sequences identified in step 3) as containing at least one of the above-mentioned viral protein domains: (1) if the input sequence is a potential complete endogenous retrovirus sequence containing paired LTRs, the LTRs on both sides and the identified viral domains are first located on the host genome and an annotation file on the host genome is generated; the full-length sequence of the complete endogenous retrovirus, the viral domain sequences, and the flanking LTR sequences are then extracted, and a location and annotation file of the flanking LTRs and viral domains on the full-length sequence of the complete endogenous retrovirus is generated; (2) if the input sequence is an endogenous retrovirus candidate sequence without paired LTRs, the identified domains are first located on the host genome to generate an annotation file on the host genome; the full-length sequence of the potential endogenous retrovirus and the viral domain sequences are then extracted according to their location on the host genome, and a location and annotation file of the viral domains on the full-length sequence of the potential endogenous retrovirus is generated.
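
Locating a feature on the host genome from its position on the extracted element can be sketched as below. The function name is hypothetical. The identifier layout `contig|start|end|strand` with `same` or `complement` follows the identifiers listed in the table above, but the conversion rule for the complement strand is an assumption, not something the claims specify.

```python
def to_genome_coords(element_id, rel_range):
    """Map an element-relative range such as '1..216' to host genome coordinates.

    element_id: 'contig|start|end|strand', strand being 'same' or 'complement'.
    Returns (contig, genome_start, genome_end, strand_symbol), 1-based inclusive.
    """
    contig, start, end, strand = element_id.split("|")
    start, end = int(start), int(end)
    rel_start, rel_end = (int(x) for x in rel_range.split(".."))
    if strand == "same":
        return contig, start + rel_start - 1, start + rel_end - 1, "+"
    # Assumption: on the complement strand, position 1 of the element
    # corresponds to the element's end coordinate on the genome.
    return contig, end - rel_end + 1, end - rel_start + 1, "-"
```

With the first-listed range of `KE893677.1|2553|9228|same`, `1..235`, this yields genome positions 2553 to 2787 on the forward strand; each such tuple could then be written out as one line of an annotation file in whatever format is chosen.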