CN112614544B - Optimization method of Kraken2 software output results and method of identifying species types in samples - Google Patents

Info

Publication number
CN112614544B
Authority
CN
China
Prior art keywords
species
reads
kmermax
kmersum
threshold
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011583243.4A
Other languages
Chinese (zh)
Other versions
CN112614544A (en
Inventor
王涛
肖姗姗
常壹昭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Xingyuan Future Technology Co ltd
Original Assignee
Hangzhou Repugene Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Repugene Technology Co ltd
Priority to CN202011583243.4A
Publication of CN112614544A
Application granted
Publication of CN112614544B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides a method for optimizing the output results of Kraken2 software, which comprises the following steps: matching the sub-reads of each read in the sequencing result against species sequences in a known database; obtaining, for each read, the number of kmers of the sub-reads matched to each species; selecting the maximum of these kmer numbers in each read and recording it as the kmermax number; and comparing the kmermax number with a first threshold, removing the corresponding read when its kmermax number is less than or equal to the first threshold, so that all reads are filtered. The method can optimize the output results of Kraken2 software, reduce false positives, and can be applied to identifying the species types in a sample.

Description

Optimization method of Kraken2 software output results and method of identifying species types in samples

Technical Field

The present invention relates to the field of biology. Specifically, it relates to a method for optimizing the output results of Kraken2 software and a method for identifying species types in a sample.

Background Art

A metagenome is the sum of all the genetic material of the organisms in a specific environment. Taking it as the object of study, sequencing analysis, functional gene screening and similar approaches can reveal the biological composition of a sample and the relationships among organisms and between organisms and their environment.

Metagenomic sequencing, abbreviated mNGS (metagenomics next generation sequencing), is a technology that sequences the genomes of all organisms in an environment together, without separating them.

Pathogenic microorganisms are microorganisms that can cause disease in humans or animals, including parasites, fungi, bacteria and viruses.

mNGS can identify all microorganisms in a sample, including new species, which conventional experimental methods such as smears, biochemical identification, culture identification and multiplex-PCR-based detection cannot achieve. Because it requires no culture, does not depend on probe sequences, and offers broad, unbiased coverage, mNGS is widely used for detecting clinical infectious pathogens, including suspected pathogens, rare pathogens, and pathogens in acute and critical illness.

mNGS detects the genomes of all organisms in a sample. Before testing, the microbial composition of the sample is in most cases unknown. During testing, the reads obtained by sequencing must be compared against known databases to confirm their expression pathways, species composition and so on.

BLAST, a well-known sequence alignment tool, produces good alignments for longer sequences but performs poorly on the short-read data produced by high-throughput sequencing. In addition, because species determination in mNGS usually requires matching against large databases, the alignment speed of BLAST limits its use in time-critical scenarios such as clinical testing.

Kraken2 is a commonly used mNGS species identification tool. When building its database it constructs the kmers unique to each species, and during classification it tallies the kmers found on each read. Kraken2 runs extremely fast, but because it does not filter its results by default, the false positive rate of the raw results is very high and can exceed 85%.

Summary of the Invention

The present invention aims to solve, at least to some extent, at least one of the technical problems in the prior art. To this end, the present invention proposes a method for optimizing the output results of Kraken2 software and a method for identifying species types in a sample. The method effectively optimizes the output results of Kraken2 software and accurately determines species types, so it can be applied to identifying the species types in a sample while reducing the false positive rate.

The present invention proposes a method for optimizing the output results of Kraken2 software. According to an embodiment of the present invention, the method includes: matching the sub-reads of each read in the sequencing result against species sequences in a known database; obtaining, for each read, the number of kmers of the sub-reads matched to each species; selecting the maximum of these kmer numbers in each read, recorded as the kmermax number; and comparing the kmermax number with a first threshold so as to filter all reads and remove those that do not meet the requirement.

In the optimization method according to an embodiment of the present invention, for each read in the file written by the output parameter of Kraken2, the maximum number of kmers matched to any one species is computed and compared with the first threshold. Reads that do not meet the requirement are screened out, so that all reads are filtered, the accuracy of the results improves, and false positives are avoided.
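The per-read statistic described above can be sketched in Python as follows. This is a minimal sketch, not the patent's implementation. It assumes the standard Kraken2 per-read output layout, in which each line is tab-separated and the fifth column is a space-separated list of `taxid:count` pairs. The function names `kmer_counts` and `kmermax` are illustrative. Skipping taxid `0` (k-mers absent from the database), `A` (k-mers containing an ambiguous base) and the paired-end separator `|:|` is an assumption about how non-species entries should be handled; note also that Kraken2 reports taxa at any rank, whereas the text speaks of species.

```python
from collections import Counter

def kmer_counts(line):
    """Sum k-mer hits per taxon for one line of Kraken2 per-read output."""
    fields = line.rstrip("\n").split("\t")
    counts = Counter()
    if len(fields) < 5:
        return counts
    for pair in fields[4].split():
        taxid, _, n = pair.rpartition(":")
        # Assumption: ignore unmatched k-mers (0), ambiguous k-mers (A)
        # and the separator between mates of a pair (|:|).
        if taxid in ("0", "A", "|") or not n.isdigit():
            continue
        counts[taxid] += int(n)
    return counts

def kmermax(line):
    """Return (taxon, kmermax) for a read, or (None, 0) if nothing matched."""
    counts = kmer_counts(line)
    if not counts:
        return None, 0
    return max(counts.items(), key=lambda kv: kv[1])

# Illustrative line, not taken from the patent's data.
example = "C\tread1\t562\t75\t562:13 561:4 A:31 906:1 0:7 562:3"
print(kmermax(example))
```

Summing repeated entries for the same taxon before taking the maximum mirrors the text's description of adding up the sub-read matches per species within a read.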

According to an embodiment of the present invention, the above method for optimizing the output results of Kraken2 software may also have the following additional technical features:

According to an embodiment of the present invention, the first threshold is 15 to 30. Reads whose kmermax number is less than or equal to the first threshold are removed, so that they do not interfere with detection and cause inaccurate results and false positives. Note that the first threshold is any one specific value in the range 15 to 30, for example 20.
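The read filter can be sketched as below, assuming the per-read k-mer counts are already available as a dictionary. The function name `filter_reads`, the data layout and all counts in the demo are hypothetical; only the rule (remove reads whose kmermax is less than or equal to the first threshold) and the example value 20 come from the text.

```python
def filter_reads(read_counts, first_threshold=20):
    """Keep reads whose kmermax is strictly greater than the first threshold.

    read_counts maps a read ID to a dict of {species: kmer count}.
    """
    kept = {}
    for read_id, counts in read_counts.items():
        if counts and max(counts.values()) > first_threshold:
            kept[read_id] = counts
    return kept

# Hypothetical counts for illustration only.
reads = {
    "readA": {"S1": 20, "S2": 20, "S3": 80},  # kmermax 80: kept
    "readB": {"S1": 20},                      # kmermax 20: removed (equal to threshold)
    "readC": {"S2": 8},                       # kmermax 8: removed
}
print(sorted(filter_reads(reads)))
```

The comparison is strict because the text removes reads at or below the threshold, so a read with kmermax exactly 20 does not survive a threshold of 20.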

According to an embodiment of the present invention, the method further includes: taking the reads remaining after the filtering as candidate reads, and the species corresponding to the kmermax number in each candidate read as a candidate species; for each candidate species, computing the sum of the kmer numbers of all reads matched to that species, recorded as the kmersum number; and comparing the kmersum number with a second threshold, removing the corresponding candidate species when its kmersum number is less than the second threshold. The candidate species are thereby filtered, species that do not meet the requirement are removed, and the remaining species are the target species.

As described above, the kmermax number of each read is computed and compared with the first threshold, and reads that do not meet the requirement are filtered out. The candidate species matched by the remaining reads are then examined: for each species, the sum of the kmer numbers of all reads matched to it (the kmersum number) is computed and compared with the second threshold, so that the candidate species are filtered further and species that do not meet the requirement are removed, yielding the target species. This improves the accuracy of the results and reduces false positives.
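A minimal sketch of the kmersum step follows. The text leaves some room for interpretation, so this sketch adopts one reading and labels it as an assumption: each candidate read is assigned to the species holding its kmermax, and the kmersum of a species is the sum of that species' k-mer counts over the reads assigned to it. The function names and the demo counts are hypothetical, and the demo uses a deliberately small threshold of 100 to keep the toy data short; the text's range for the second threshold is 900 to 1100.

```python
from collections import defaultdict

def kmersum_per_species(candidate_reads):
    """Sum, per candidate species, the kmermax of the reads assigned to it.

    Assumption: a read is assigned to the species with its largest k-mer
    count; ties go to the first species encountered.
    """
    sums = defaultdict(int)
    for counts in candidate_reads.values():
        species = max(counts, key=counts.get)
        sums[species] += counts[species]
    return dict(sums)

def apply_second_threshold(sums, second_threshold=1000):
    """Drop candidate species whose kmersum is below the second threshold."""
    return {s: v for s, v in sums.items() if v >= second_threshold}

# Hypothetical candidate reads (already past the first-threshold filter).
candidate_reads = {
    "r1": {"S1": 60, "S2": 10},
    "r2": {"S1": 50},
    "r3": {"S2": 40, "S1": 10},
}
sums = kmersum_per_species(candidate_reads)
print(sums)
print(apply_second_threshold(sums, second_threshold=100))
```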

According to an embodiment of the present invention, the second threshold is 900 to 1100. Candidate species whose kmersum number is less than the second threshold, together with the reads matched to them, are removed, so that they do not interfere with detection and cause inaccurate results and false positives. Note that the second threshold is any one specific value in the range 900 to 1100, for example 1000.

According to an embodiment of the present invention, the method further includes: arranging the kmersum numbers of the different candidate species in descending order; when a kmersum number divided by the next kmersum number is greater than or equal to a third threshold and the next kmersum number is less than the second threshold, the candidate species corresponding to the former kmersum number and to all kmersum numbers before it are the target species.

That is, the kmersum numbers of the candidate species are sorted in descending order, and at the first position where the value falls by a factor of at least the third threshold to a value below the second threshold, the candidate species at and above the former value are taken as the target species. This further safeguards the accuracy of the detection results and avoids false positives.
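The descending-order drop rule can be sketched as follows. The function name `select_targets` and the demo values are hypothetical. The text does not say what happens when no qualifying drop is found; this sketch assumes that all candidates are then kept, which is an assumption rather than something the patent states.

```python
def select_targets(kmersum, second_threshold=1000, third_threshold=3):
    """Return target species using the first qualifying drop in sorted kmersum.

    A drop qualifies when previous / next >= third_threshold and
    next < second_threshold; species up to and including 'previous' are kept.
    """
    ranked = sorted(kmersum.items(), key=lambda kv: kv[1], reverse=True)
    for i in range(len(ranked) - 1):
        prev_sum, next_sum = ranked[i][1], ranked[i + 1][1]
        # prev_sum >= third_threshold * next_sum is the ratio test without division.
        if next_sum < second_threshold and prev_sum >= third_threshold * next_sum:
            return [name for name, _ in ranked[: i + 1]]
    # Assumption: with no qualifying drop, every candidate is retained.
    return [name for name, _ in ranked]

# Hypothetical kmersum values for illustration only.
demo = {"A": 5000, "B": 1500, "C": 300, "D": 250}
print(select_targets(demo))
```

In the demo, the step from A to B exceeds a 3-fold drop but B is not below 1000, so it does not qualify; the step from B to C does, and A and B are returned.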

According to an embodiment of the present invention, the third threshold is 3 to 5. This further safeguards the accuracy of the detection results and avoids false positives.

According to an embodiment of the present invention, the number of sequencing results is 10K to 30M. Note that when the sequencing results contain host data, the host data must be removed beforehand; the range 10K to 30M refers to the number of sequencing results after host removal.

For ease of understanding, the optimization method is explained in detail with the following examples:

Example 1: Figure 1 shows a schematic diagram of the output produced by the Kraken2 output parameter.

Example 2: Assume the sequencing result contains 10 reads in total, denoted read1, read2, ..., read10.

read1 is divided into three sub-reads, denoted read1-1, read1-2 and read1-3. Each of them is compared with the species database information in Kraken2. read1-1 matches species S1 with 5 kmers; read1-2 matches species S1 with 15 kmers and species S2 with 20 kmers; read1-3 matches species S3 with 80 kmers. The kmer numbers matched in read1 are therefore 20 for species S1 (5 + 15), 20 for species S2 and 80 for species S3, and the largest of them (the kmermax number) is 80.
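The arithmetic for read1 can be reproduced with a few lines of Python. The variable names are illustrative; the species labels and counts are the ones given in the example.

```python
from collections import Counter

# (species, kmer count) for each sub-read match of read1, as listed in the example.
sub_read_hits = [("S1", 5), ("S1", 15), ("S2", 20), ("S3", 80)]

per_species = Counter()
for species, n in sub_read_hits:
    per_species[species] += n

best_species, kmermax_value = max(per_species.items(), key=lambda kv: kv[1])
print(dict(per_species))
print(best_species, kmermax_value)
```

The two S1 matches add up to 20, so S1 and S2 tie at 20 while S3, at 80, gives the kmermax number of the read.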

The kmermax numbers of read2, read3, ..., read10 and their matched species are obtained in the same way. [Table of kmermax numbers and matched species for read2 to read10 not reproduced.] Because the kmermax number of read3 is less than 10, that read is removed. The candidate species are S1, S2, S3, S4 and S5.

For species S1 to S5, the kmer numbers of the corresponding reads are summed per species and arranged in descending order, as in the table below. From S4 to S5 the kmersum number shows its first drop of at least 3-fold (600/100 = 6), and the later value, 100, is less than 1000. Therefore the data with a kmersum number of 100 and below, and the corresponding candidate species, are removed. The species finally determined in the sample are S3, S1 and S4.

Species    kmersum number
S3         2000
S1         1200
S4         600
S5         100
S2         80
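To check the drop rule against the kmersum table above, the fold change between consecutive rows can be listed explicitly. This is only an arithmetic aid; the helper name `drop_ratios` is illustrative, and the thresholds 3 and 1000 are the example values used in the text.

```python
def drop_ratios(rows):
    """Return (previous, next, previous/next ratio, next kmersum) for adjacent rows."""
    return [(a, b, x / y, y) for (a, x), (b, y) in zip(rows, rows[1:])]

# kmersum values from the table, already in descending order.
table = [("S3", 2000), ("S1", 1200), ("S4", 600), ("S5", 100), ("S2", 80)]

for prev, nxt, ratio, next_sum in drop_ratios(table):
    qualifies = ratio >= 3 and next_sum < 1000
    print(f"{prev}->{nxt}: ratio {ratio:.2f}, next kmersum {next_sum}, qualifies={qualifies}")
```

The only step that meets both conditions is S4 to S5 (ratio 6, next value 100), so S3, S1 and S4 are retained, which agrees with the species reported for the example.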

In another aspect, the present invention proposes a method for identifying species types in a sample. According to an embodiment of the present invention, the method includes: (1) performing metagenomic sequencing on the sample to obtain a sequencing result; and (2) analyzing the sequencing result with Kraken2 software and optimizing the output with the optimization method described above, so as to determine the species types in the sample. The method according to an embodiment of the present invention can thus accurately identify the species types in a sample with a low false positive rate.

Additional aspects and advantages of the present invention will be given in part in the following description, will in part become apparent from it, or will be learned through practice of the invention.

Brief Description of the Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the description of the embodiments in conjunction with the following drawings, in which:

Figure 1 is a schematic diagram of the output produced by the Kraken2 output parameter;

Figure 2 is a schematic diagram of the relationship between sensitivity and kmermax (maximum kmer number) at different data volumes in one embodiment;

Figure 3 is a schematic diagram of the relationship between relative specificity and kmermax (maximum kmer number) at different data volumes in one embodiment;

Figure 4 is a schematic diagram of the relationship between accuracy and kmermax (maximum kmer number) at different data volumes in one embodiment.

Detailed Description

The solution of the present invention is explained below with reference to embodiments. Those skilled in the art will understand that the following embodiments only illustrate the present invention and should not be regarded as limiting its scope. Where specific techniques or conditions are not stated in the embodiments, they follow the techniques or conditions described in the literature of the field or the product instructions. Reagents or instruments whose manufacturer is not stated are conventional, commercially available products.

Example 1: Analysis of simulated sequencing data for 40 species

First, the genomes of 40 representative species (Table 1) were randomly selected from the database, covering eukaryotes, bacteria and viruses and including some closely related species of the same genus. The sequencing data simulation tool ART was then used to generate paired-end 75 bp sequencing data, and Kraken2 was run to identify these species. Finally, the Kraken2 results were screened with the two parameters kmermax and kmersum, and the sensitivity, relative specificity and accuracy were calculated at different data volumes.

Sensitivity is the ratio of the number of true positive species detected, X, to the theoretical number of true positive species (40), that is, X/40.

Because the original data contain no true negatives, the raw Kraken2 results are used as the basis for evaluation: the false positive species in the raw results are treated as the theoretical true negative species, and their number is N. Relative specificity is calculated with these true negative species as the reference. For an experimental group S that yields D detected species, the number of true negatives TN is N-(D-X), and the relative specificity is TN/N.

Accuracy is the ratio of the number of true positives X to the number of detected species D, that is, X/D.
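The three evaluation metrics defined above can be written as one small function. This is a sketch under the definitions in the text; the function name `evaluate` and the numbers in the demo are hypothetical and are not results from the patent. It assumes D and N are both greater than zero.

```python
def evaluate(x, d, n, expected):
    """Compute (sensitivity, relative specificity, accuracy).

    x: true positive species detected
    d: total species detected after screening
    n: false positive species in the raw Kraken2 result (reference negatives)
    expected: theoretical number of true positive species (40 in Example 1)
    """
    sensitivity = x / expected
    true_negatives = n - (d - x)      # reference negatives not reported after screening
    relative_specificity = true_negatives / n
    accuracy = x / d
    return sensitivity, relative_specificity, accuracy

# Hypothetical counts for illustration only.
sens, spec, acc = evaluate(x=38, d=40, n=200, expected=40)
print(f"sensitivity={sens:.3f} relative_specificity={spec:.3f} accuracy={acc:.3f}")
```

Here D-X is the number of false positives that survive screening, so TN counts the raw false positives that the screening successfully removed.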

Table 1 Genomes of the 40 representative species used in the simulation test

The test data and results are shown in Tables 2 and 3. Method 3 improves the accuracy of detection and reduces the incidence of false positives. Figures 2 to 4 show the relationship between kmermax and detection sensitivity, relative specificity and accuracy at different data volumes; the number of reads corresponding to each data volume is given in Table 4. Different kmermax values significantly affect sensitivity, relative specificity and accuracy. Taking all three into account, kmermax less than or equal to 20 was chosen as the criterion for removing unsuitable data. Removing data with kmermax less than or equal to 20 effectively improves the accuracy of the results while keeping sensitivity, relative specificity and accuracy high.

Table 2 Species with kmermax>0 (no screening) and kmermax>20 (removing kmermax<=20) (data volume: 1)

Table 3 Screening results for the 40 species (data volume: 1)

Table 4 Number of reads corresponding to each data volume in Figures 2 to 4

Data volume    Number of reads
0.01           15560
0.02           31148
0.05           77854
0.1            155720
0.2            311408
0.5            778572
1              1557092
2              3114186

Example 2: Analysis of sequencing data from the ZymoBIOMICS™ Microbial Community Standard (Catalog No. D6300)

The ZymoBIOMICS™ Microbial Community Standard (Table 5) was added to triple-distilled water for library construction and sequencing. After quality control and removal of human-derived reads from the paired-end 75 bp sequencing data, Kraken2 was run for species identification and the results were screened with the two parameters kmermax and kmersum. Finally, the sensitivity, relative specificity and accuracy were calculated.

Table 5 ZymoBIOMICS™ Microbial Community Standard

Sensitivity is the ratio of the number of true positive species detected, X, to the theoretical number of true positive species (10), that is, X/10.

Because the original data contain no true negatives, the raw Kraken2 results are used as the basis for evaluation: the false positive species in the raw results are treated as the theoretical true negative species, and their number is N. Relative specificity is calculated with these true negative species as the reference. For an experimental group S that yields D detected species, the number of true negatives TN is N-(D-X), and the relative specificity is TN/N.

Accuracy is the ratio of the number of true positive species X to the number of detected species D, that is, X/D.

The test data and results are shown in Tables 6 and 7.

Table 6 Species with kmermax>0 (no screening) and kmermax>20 (removing kmermax<=20) (data volume: 197472 reads)

Table 7 Test results for the ZymoBIOMICS™ Microbial Community Standard (data volume: 197472 reads)

In the screening results of Table 7, the condition group "remove kmermax<=20 or kmersum<1000 and first drop of 3-fold or more" has a relative specificity and an accuracy of 100%. Its single false negative arises because Bacillus subtilis is extremely similar to other species of its genus, so few reads matched only this species during screening and its kmersum was correspondingly low. When the effect of genus-level species similarity is taken into account and the kmersum is recalculated, Bacillus subtilis is detected normally; sensitivity then reaches 100%, while relative specificity and accuracy are unaffected.

In this specification, descriptions that refer to the terms "one embodiment", "some embodiments", "example", "specific example" or "some examples" mean that the specific features, structures, materials or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the present invention. Such expressions do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, provided they do not contradict one another, those skilled in the art may combine the different embodiments or examples described in this specification and their features.

Although embodiments of the present invention have been shown and described above, it is to be understood that they are exemplary and are not to be construed as limiting the present invention. A person of ordinary skill in the art may change, modify, replace and vary the above embodiments within the scope of the present invention.

Claims (6)

1. A method for optimizing the output results of Kraken2 software, characterized by comprising:
matching the sub-reads of each read in the sequencing result against species sequences in a known database, obtaining, for each read, the number of kmers of the sub-reads matched to each species, and selecting the maximum of these kmer numbers in each read, recorded as the kmermax number;
comparing the kmermax number with a first threshold and, when the kmermax number is less than or equal to the first threshold, removing the read corresponding to that kmermax number, so as to filter all reads;
taking the reads remaining after the filtering as candidate reads, and the species corresponding to the kmermax number in each candidate read as a candidate species;
for each candidate species, computing the sum of the kmer numbers of all reads matched to that candidate species, recorded as the kmersum number;
arranging the kmersum numbers of the different candidate species in descending order, wherein, when a kmersum number divided by the next kmersum number is greater than or equal to a third threshold and the next kmersum number is less than a second threshold, the candidate species corresponding to the former kmersum number and to the kmersum numbers before it are the target species.

2. The method according to claim 1, characterized in that the first threshold is 15 to 30.

3. The method according to claim 1, characterized in that the second threshold is 900 to 1100.

4. The method according to claim 1, characterized in that the third threshold is 3 to 5.

5. The method according to claim 1, characterized in that the number of sequencing results is 10K to 30M.

6. A method for identifying species types in a sample, characterized by comprising:
(1) performing metagenomic sequencing on the sample to obtain a sequencing result;
(2) analyzing the sequencing result with Kraken2 software and optimizing the output with the method for optimizing the output results of Kraken2 software according to any one of claims 1 to 5, so as to determine the species types in the sample.
CN202011583243.4A 2020-12-28 2020-12-28 Optimization method of Kraken2 software output results and method of identifying species types in samples Active CN112614544B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011583243.4A CN112614544B (en) 2020-12-28 2020-12-28 Optimization method of Kraken2 software output results and method of identifying species types in samples

Publications (2)

Publication Number Publication Date
CN112614544A CN112614544A (en) 2021-04-06
CN112614544B true CN112614544B (en) 2024-05-17

Family

ID=75248491

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011583243.4A Active CN112614544B (en) 2020-12-28 2020-12-28 Optimization method of Kraken2 software output results and method of identifying species types in samples

Country Status (1)

Country Link
CN (1) CN112614544B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115044689B (en) * 2022-06-17 2025-06-06 北京大学第三医院(北京大学第三临床医学院) A method for processing samples of lower respiratory tract aspirates for ventilator-associated pneumonia

Citations (10)

Publication number Priority date Publication date Assignee Title
EP1735468A2 (en) * 2004-01-27 2006-12-27 Compugen USA, Inc. Novel nucleotide and amino acid sequences, and assays and methods of use thereof for diagnosis of prostate cancer
CN102750461A (en) * 2012-06-14 2012-10-24 东北大学 Biological sequence local comparison method capable of obtaining complete solution
CN103065067A (en) * 2012-12-26 2013-04-24 深圳先进技术研究院 Method and system for filtering sequence segments in short-sequence assembly
CN108334750A (en) * 2018-04-19 2018-07-27 江苏先声医学诊断有限公司 A kind of macro genomic data analysis method and system
CN108866171A (en) * 2017-05-10 2018-11-23 深圳华大基因研究院 A kind of species identification method based on new-generation sequencing
CN109082479A (en) * 2017-06-14 2018-12-25 深圳华大基因研究院 The method and apparatus of microbial species are identified from sample
CN109256178A (en) * 2018-07-26 2019-01-22 中山大学 The Leon-RC compression method of gene order-checking data
CN111462819A (en) * 2020-02-26 2020-07-28 康美华大基因技术有限公司 Method for analyzing intestinal microorganism detection data, automatic interpretation system and medium
CN111836905A (en) * 2018-02-13 2020-10-27 特温斯特兰德生物科学有限公司 Methods and reagents for detecting and assessing genotoxicity
CN111951895A (en) * 2020-07-09 2020-11-17 苏州协云基因科技有限公司 Pathogen analysis method, analysis device, apparatus and storage medium based on metagenomics

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US20130345095A1 (en) * 2011-03-02 2013-12-26 Bgi Tech Solutions Co., Ltd. Method and device for assembling genome sequence

Non-Patent Citations (2)

Title
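A preliminary study of the phylogenetic relationships between spiders of the genera Oecobius and Uroctea based on 16S rRNA gene sequences; Pan Hongchun; Wu Baoshan; Hao Jiasheng; Zhu Guoping; Acta Zootaxonomica Sinica; 2007-01-30 (No. 01); full text *
Screening of peptides targeting laryngeal cancer cells using phage display technology; Feng Jun; Li Li; Yang Hongbin; Liu Shixi; West China Medical Journal (No. 12); full text *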

Also Published As

Publication number Publication date
CN112614544A (en) 2021-04-06

Similar Documents

Publication Publication Date Title
van Boheemen et al. Retrospective validation of a metagenomic sequencing protocol for combined detection of RNA and DNA viruses using respiratory samples from pediatric patients
CN113160882B (en) Pathogenic microorganism metagenome detection method based on third generation sequencing
de Vries et al. Benchmark of thirteen bioinformatic pipelines for metagenomic virus diagnostics using datasets from clinical samples
CN111607639B (en) Method and device for quantitative detection of metagenomic pathogens based on internal reference
US20250182850A1 (en) Creation or use of anchor-based data structures for sample-derived characteristic determination
CN113167782A (en) Methods for sample quality assessment
Bubb et al. Considerations in the analysis of plant chromatin accessibility data
CN110875082B (en) Microorganism detection method and device based on targeted amplification sequencing
CN112614544B (en) Optimization method of Kraken2 software output results and method of identifying species types in samples
CN112331268B (en) Method for obtaining specific sequence of target species and method for detecting target species
KR20250005003A (en) Biomarker for diagnosing atopic dermatitis and use thereof
CN113571128B (en) Method for establishing reference threshold for detecting pathogen in metagenome
JP5403563B2 (en) Gene identification method and expression analysis method in comprehensive fragment analysis
CN114496089B (en) Pathogenic microorganism identification method
Jooste et al. In silico probe-based detection of citrus viruses in NGS data
WO2023021978A1 (en) Method for examining autoimmune disease
CN103093122A (en) Identification tool of high-throughput biological chip detection results
CN114334005A (en) Method and system for analyzing and identifying broad-spectrum pathogenic microorganisms
US20190057134A1 (en) System and method for automated microarray information citation analysis
CN120220810A (en) Method and device for quantifying enterovirome
van Boheemen et al. Metagenomic sequencing for combined detection of RNA and DNA viruses in respiratory samples from paediatric patients
Kao et al. Development of an oral swab based microbiome test for the detection of feline dental disease
CN118230820A (en) Metagene sequencing data-based drug-resistant gene species source identification method
Teo et al. A comparative study of metagenomics analysis pipelines at the species level
de Campos et al. Anellovirus abundance as an indicator for viral metagenomic classifier utility in plasma samples

Legal Events

Date Code Title Description
PB01 Publication
CB02 Change of applicant information

Address after: 7-10 / F, block D, beidameng workshop, 188 Lianchuang street, Wuchang Street, Yuhang District, Hangzhou, Zhejiang 311100

Applicant after: HANGZHOU REPUGENE TECHNOLOGY Co.,Ltd.

Address before: 7-10 / F, block D, beidameng workshop, 188 Lianchuang street, Wuchang Street, Hangzhou, Zhejiang 311100

Applicant before: HANGZHOU REPUGENE TECHNOLOGY Co.,Ltd.

SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 7th Floor, Building 3, No. 188, Lianchuang Street, Wuchang Street, Yuhang District, Hangzhou City, Zhejiang Province, 311100

Patentee after: Hangzhou Xingyuan Future Technology Co.,Ltd.

Country or region after: China

Address before: 7-10 / F, block D, beidameng workshop, 188 Lianchuang street, Wuchang Street, Yuhang District, Hangzhou, Zhejiang 311100

Patentee before: HANGZHOU REPUGENE TECHNOLOGY Co.,Ltd.

Country or region before: China
