CN112614544B - Optimization method of Kraken2 software output results and method of identifying species types in samples

Info

Publication number: CN112614544B
Application number: CN202011583243.4A
Authority: CN
Inventors: 王涛; 肖姗姗; 常壹昭
Original assignee: Hangzhou Repugene Technology Co ltd
Current assignee: Hangzhou Xingyuan Future Technology Co ltd
Priority date: 2020-12-28
Filing date: 2020-12-28
Publication date: 2024-05-17
Anticipated expiration: 2040-12-28
Also published as: CN112614544A

Abstract

The invention provides a method for optimizing Kraken2 software output results, comprising: matching the sub-reads of each read in the sequencing result against species sequences in a known database; obtaining, for each read, the number of kmers of the sub-reads matched to each species; selecting the maximum of these kmer numbers in each read and recording it as the kmermax number; and comparing the kmermax number with a first threshold and removing the read corresponding to the kmermax number when the kmermax number is less than or equal to the first threshold, so as to filter all reads. The method can accurately optimize Kraken2 output results, avoid false positives, and be applied to identifying species types in samples.

Description

Method for optimizing Kraken2 software output results and method for identifying species types in samples

Technical Field

The present invention relates to the field of biology. Specifically, it relates to a method for optimizing the output results of Kraken2 software and a method for identifying species types in a sample.

Background Art

A metagenome is the sum of all biological genetic material in a specific environment. Taking it as the object of study, sequencing analysis, functional gene screening and similar approaches can reveal the biological composition of a sample as well as the relationships among organisms and between organisms and their environment.

Metagenomic sequencing, abbreviated mNGS (metagenomics next generation sequencing), is a technique that sequences the genomes of all organisms in an environment together, without separating them.

Pathogenic microorganisms are microorganisms that can cause disease in humans or animals, including parasites, fungi, bacteria and viruses.

mNGS can identify all microorganisms in a sample, including new species, which conventional laboratory methods such as smears, biochemical identification, culture identification and multiplex PCR-based detection cannot achieve. Because it requires no culture, does not depend on probe sequences, and offers broad, unbiased coverage, mNGS is widely used in clinical infection diagnostics, including suspected pathogens, rare pathogens, and acute and critical illness.

mNGS detects the genomes of all organisms in a sample. Before testing, the microbial composition of the sample is in most cases unknown. During analysis, the reads obtained by sequencing must be compared against known databases to determine expression pathways, species composition and so on.

BLAST, a well-known sequence alignment tool, produces good alignments for longer sequences but performs poorly on the short sequences generated by high-throughput sequencing. In addition, because species determination in mNGS usually requires matching against large databases, the alignment speed of BLAST limits its use in time-critical settings such as clinical testing.

Kraken2 is a commonly used mNGS species identification tool. When building its database it constructs the kmers unique to each species, and during classification it tallies the kmer hits on each read. Kraken2 runs extremely fast, but because it does not filter its results by default, the false positive rate in the raw results is very high and can exceed 85%.

Summary of the Invention

The present invention aims to solve, at least to some extent, at least one of the technical problems in the prior art. To this end, the present invention proposes a method for optimizing the output results of Kraken2 software and a method for identifying species types in a sample. The method effectively optimizes Kraken2 output and determines species types accurately, so it can be applied to identifying the species in a sample while reducing the false positive rate.

The present invention proposes a method for optimizing the output results of Kraken2 software. According to an embodiment of the present invention, the method includes: matching the sub-reads of each read in the sequencing result against species sequences in a known database; obtaining, for each read, the number of kmers of the sub-reads matched to each species; selecting the maximum of these kmer numbers in each read, recorded as the kmermax number; and comparing the kmermax number with a first threshold so as to filter all reads and remove those that do not meet the requirement.

In the optimization method according to an embodiment of the present invention, for each read in the file written by the output parameter of Kraken2, the maximum number of kmers matched to any one species is counted and compared with the first threshold. Reads that do not meet the requirement are screened out, so that all reads are filtered, the accuracy of the results is improved, and false positives are avoided.
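
The per-read statistic described above can be sketched as follows. This is a minimal illustration, not the patented implementation. It assumes Kraken2's standard five-column, tab-separated output, in which the last column lists space-separated `taxid:count` pairs, with `0` marking unclassified kmers, `A` marking ambiguous kmers, and `|:|` separating the two mates of a paired read. The function names `kmer_counts` and `kmermax` and the sample line are hypothetical.

```python
from collections import Counter

def kmer_counts(kraken_line: str) -> Counter:
    """Sum kmer hits per taxon ID for one line of Kraken2 standard output."""
    fields = kraken_line.rstrip("\n").split("\t")
    counts = Counter()
    for token in fields[4].split():
        if token == "|:|":           # separator between the mates of a pair
            continue
        taxid, _, n = token.rpartition(":")
        if taxid in ("0", "A"):      # unclassified or ambiguous kmers: no taxon
            continue
        counts[taxid] += int(n)
    return counts

def kmermax(kraken_line: str):
    """Return (taxid, count) for the taxon with the most kmer hits, or (None, 0)."""
    counts = kmer_counts(kraken_line)
    if not counts:
        return None, 0
    return counts.most_common(1)[0]

# Hypothetical paired-end record, made up for illustration only.
line = "C\tread1\t562\t75|75\t562:13 561:4 A:31 0:1 562:3 |:| 0:20 561:21"
print(kmer_counts(line))
print(kmermax(line))
```

Note that Kraken2 reports taxon IDs at whatever rank the kmer maps to, so mapping these IDs to species is a further step that this sketch leaves out.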

According to embodiments of the present invention, the above method for optimizing Kraken2 software output results may also have the following additional technical features:

According to an embodiment of the present invention, the first threshold is 15 to 30. Reads whose kmermax number is less than or equal to the first threshold can thus be removed, so that they do not interfere with the analysis and cause inaccurate results and false positives. It should be noted that the first threshold described in the present invention is any single specific value in the range 15 to 30, for example 20.
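
The read-level filter can be sketched as below, assuming each read has already been reduced to its top species and kmermax number. The dictionary layout, the name `filter_reads` and the sample values are hypothetical. The threshold of 20 is the example value given in the text.

```python
FIRST_THRESHOLD = 20  # any single value in 15 to 30; 20 is the example in the text

def filter_reads(read_kmermax: dict, threshold: int = FIRST_THRESHOLD) -> dict:
    """Keep only reads whose kmermax number is strictly greater than the threshold."""
    return {
        read: (species, k)
        for read, (species, k) in read_kmermax.items()
        if k > threshold  # reads with kmermax <= threshold are removed
    }

# Hypothetical input: read name -> (top species, kmermax number)
reads = {"read1": ("S3", 80), "read2": ("S1", 20), "read3": ("S2", 8)}
candidates = filter_reads(reads)
print(candidates)
```

Because the comparison is "less than or equal to", a read whose kmermax number equals the threshold (here `read2`) is removed as well.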

According to an embodiment of the present invention, the method further includes: taking the reads remaining after the filtering as candidate reads, and taking the species corresponding to the kmermax number of each candidate read as a candidate species; for each candidate species, computing the sum of the kmer numbers of all reads matched to that candidate species, recorded as the kmersum number; and comparing the kmersum number with a second threshold and, when the kmersum number is less than the second threshold, removing the corresponding candidate species, so as to filter the candidate species and remove those that do not meet the requirement. The remaining species are the target species.

As described above, the kmermax number is counted for each read and compared with the first threshold, and reads that do not meet the requirement are filtered out. Further, the candidate species matched by the remaining reads are taken as the objects of study: for each species, the sum of the kmer numbers of all reads matched to it (the kmersum number) is computed and compared with the second threshold, so as to further filter the candidate species, remove those that do not meet the requirement, and obtain the target species. This improves the accuracy of the results and reduces false positives.

According to an embodiment of the present invention, the second threshold is 900 to 1100. Reads whose kmersum number is less than the second threshold, together with the candidate species they match, can thus be removed, so that they do not interfere with the analysis and cause inaccurate results and false positives. It should be noted that the second threshold described in the present invention is any single specific value in the range 900 to 1100, for example 1000.
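
The species-level filter can be sketched as below. The text does not spell out exactly which kmer numbers enter the sum, so this sketch assumes that the kmersum of a species is the sum of the kmermax numbers of the candidate reads whose top species it is. The function names and sample values are hypothetical.

```python
from collections import defaultdict

SECOND_THRESHOLD = 1000  # any single value in 900 to 1100; 1000 is the example in the text

def kmersum_per_species(candidate_reads: dict) -> dict:
    """Sum kmermax numbers over candidate reads, grouped by their top species.

    Assumption: only each read's kmermax contributes to its species' kmersum.
    """
    totals = defaultdict(int)
    for species, k in candidate_reads.values():
        totals[species] += k
    return dict(totals)

def filter_species(kmersum: dict, threshold: int = SECOND_THRESHOLD) -> dict:
    """Drop candidate species whose kmersum number is below the threshold."""
    return {s: total for s, total in kmersum.items() if total >= threshold}

# Hypothetical candidate reads: read name -> (top species, kmermax number)
candidate_reads = {
    "r1": ("S3", 600),
    "r2": ("S3", 500),
    "r3": ("S5", 100),
}
sums = kmersum_per_species(candidate_reads)
print(sums)
print(filter_species(sums))
```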

According to an embodiment of the present invention, the method further includes: arranging the kmersum numbers of the different candidate species in descending order and, when a kmersum number divided by the next kmersum number is greater than or equal to a third threshold and that next kmersum number is less than the second threshold, taking the candidate species corresponding to the former kmersum number and to all kmersum numbers before it as the target species.

With the kmersum numbers of the candidate species arranged in descending order, at the first position where the kmersum number falls by a factor of at least the third threshold and the lower kmersum number is below the second threshold, the candidate species corresponding to the higher kmersum number and to all kmersum numbers before it are the target species. This further ensures the accuracy of the detection results and avoids false positives.

According to an embodiment of the present invention, the third threshold is 3 to 5. This further ensures the accuracy of the detection results and avoids false positives.

According to an embodiment of the present invention, the number of sequencing results is 10K to 30M. It should be noted that when the sequencing results contain host data, the host data must be removed beforehand; the range 10K to 30M refers to the number of sequencing results after host removal.

For ease of understanding, the optimization method is explained in detail with the following examples:

Example 1: Figure 1 shows a schematic diagram of the output produced by the Kraken2 output parameter.

Example 2: Assume the sequencing result contains 10 reads in total, denoted read1, read2, ..., read10.

read1 is divided into three sub-reads, denoted read1-1, read1-2 and read1-3. Each sub-read is compared against the species database of Kraken2. read1-1 matches species S1 with 5 kmers; read1-2 matches species S1 with 15 kmers and species S2 with 20 kmers; read1-3 matches species S3 with 80 kmers. The kmer numbers matched in read1 are therefore 20 for species S1 (5 + 15), 20 for species S2 and 80 for species S3, and the largest kmer number (the kmermax number) is 80.
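
The arithmetic for read1 can be checked with a few lines of Python. This sketch assumes the S3 hit belongs to the third sub-read, read1-3; the tuple layout is only an illustrative representation.

```python
from collections import Counter

# (sub-read, species, matched kmers) taken from the read1 example in the text
sub_read_hits = [
    ("read1-1", "S1", 5),
    ("read1-2", "S1", 15),
    ("read1-2", "S2", 20),
    ("read1-3", "S3", 80),
]

# Add up the kmer numbers per species across all sub-reads of read1
per_species = Counter()
for _sub_read, species, n in sub_read_hits:
    per_species[species] += n

# The species with the largest total gives the kmermax number of read1
best_species, best_count = per_species.most_common(1)[0]
print(dict(per_species))
print(best_species, best_count)
```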

In the same way, the kmermax numbers of read2, read3, ..., read10 and their matching species are shown in the following table. Since the kmermax number of read3 is less than 10, that read is removed. The candidate species are S1, S2, S3, S4 and S5.

For the reads corresponding to species S1 to S5, the sum of the kmer numbers of each species is computed and the sums are arranged in descending order, as shown in the following table. From S4 to S5 the kmersum number falls by a factor of 3 or more for the first time (from 600 to 100), and the lower value, 100, is less than 1000. Therefore, the data with kmersum numbers of 100 and below and the corresponding candidate species are removed. Finally, the species determined to be in the sample are S3, S1 and S4.

Species    kmersum number
S3         2000
S1         1200
S4         600
S5         100
S2         80
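
The descending-order rule can be sketched with the kmersum values from the example table above. The function name `select_targets` is hypothetical. Keeping every candidate when no qualifying drop is found is an assumption, since the text does not cover that case.

```python
def select_targets(kmersum: dict, second_threshold: int = 1000,
                   third_threshold: float = 3) -> list:
    """Return target species using the first qualifying drop in sorted kmersum values."""
    ranked = sorted(kmersum.items(), key=lambda kv: kv[1], reverse=True)
    for i in range(len(ranked) - 1):
        prev, nxt = ranked[i][1], ranked[i + 1][1]
        # Qualifying drop: ratio >= third threshold and lower value < second threshold
        if nxt < second_threshold and (nxt == 0 or prev / nxt >= third_threshold):
            return [species for species, _ in ranked[: i + 1]]
    # Assumption: with no qualifying drop, all candidates are kept
    return [species for species, _ in ranked]

# kmersum numbers from the example table
example = {"S3": 2000, "S1": 1200, "S4": 600, "S5": 100, "S2": 80}
print(select_targets(example))
```

Here the pair S1 to S4 (1200 to 600) is only a 2-fold drop, so the first qualifying position is S4 to S5 (600 to 100), and S3, S1 and S4 are kept.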

In another aspect, the present invention proposes a method for identifying species types in a sample. According to an embodiment of the present invention, the method includes: (1) performing metagenomic sequencing on the sample to obtain sequencing results; and (2) analyzing the sequencing results with Kraken2 software and optimizing the output with the optimization method described above, so as to determine the species types in the sample. The method according to an embodiment of the present invention can thus accurately identify the species types in a sample with a low false positive rate.

Additional aspects and advantages of the present invention will be given in part in the following description, and in part will become apparent from the description or be learned through practice of the present invention.

Brief Description of the Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and easily understood from the description of the embodiments in conjunction with the following drawings, in which:

Figure 1 is a schematic diagram of the output produced by the Kraken2 output parameter;

Figure 2 is a schematic diagram of the relationship between sensitivity and kmermax (maximum kmer number) at different data volumes in one embodiment;

Figure 3 is a schematic diagram of the relationship between relative specificity and kmermax (maximum kmer number) at different data volumes in one embodiment;

Figure 4 is a schematic diagram of the relationship between accuracy and kmermax (maximum kmer number) at different data volumes in one embodiment.

Detailed Description

The scheme of the present invention is explained below with reference to examples. Those skilled in the art will understand that the following examples only illustrate the present invention and should not be regarded as limiting its scope. Where specific techniques or conditions are not indicated in the examples, the techniques or conditions described in the literature in this field or in the product instructions are followed. Reagents or instruments for which no manufacturer is indicated are conventional, commercially available products.

Example 1: Analysis of simulated sequencing data of 40 species

First, the genomes of 40 representative species (Table 1) were randomly selected from the database, covering eukaryotes, bacteria and viruses and including some closely related species of the same genus. The sequencing data simulation tool ART was then used to generate paired-end 75 bp sequencing data, and Kraken2 was run to identify the species. Finally, the Kraken2 results were screened with the two parameters kmermax and kmersum, and the sensitivity, relative specificity and accuracy were calculated at different data volumes.

Sensitivity is the ratio of the number of true positive species detected, X, to the theoretical number of true positive species (40), that is, X/40.

Since the original data contain no true negatives, the raw Kraken2 result was used as the basis for evaluation: the false positive species in that result are treated as the theoretical true negative species, and their number is N. Relative specificity is calculated with these true negative species as the reference. For an experimental group S with D detected species, the number of true negatives TN is N-(D-X), and the relative specificity is TN/N.

Accuracy is the ratio of the number of true positives X to the number of detected species D, that is, X/D.
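
The three metrics defined above can be written as a small helper. The function name `evaluate` and the numbers in the usage example are hypothetical and are not results from the patent. The sketch assumes D and N are both greater than zero.

```python
def evaluate(x: int, d: int, expected: int, n: int) -> dict:
    """Compute the three metrics defined in the text.

    x: true positive species detected (X)
    d: total species detected by the screened run (D)
    expected: theoretical number of true positive species (40 in Example 1)
    n: false positive species in the raw Kraken2 result, used as true negatives (N)
    """
    tn = n - (d - x)  # true negatives: N minus the false positives that remain
    return {
        "sensitivity": x / expected,
        "relative_specificity": tn / n,
        "accuracy": x / d,
    }

# Hypothetical numbers for illustration only
result = evaluate(x=38, d=40, expected=40, n=100)
print(result)
```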

Table 1 Genomes of the 40 representative species used in the simulation test

The test data and results are shown in Tables 2 and 3. It can be seen that Method 3 improves the accuracy of detection and reduces the incidence of false positives. Figures 2 to 4 show the relationship between kmermax and detection sensitivity, relative specificity and accuracy at different data volumes; the number of reads corresponding to each data volume is given in Table 4. Different kmermax values significantly affect sensitivity, relative specificity and accuracy. Taking all three into account, kmermax less than or equal to 20 was chosen as the criterion for removing unsuitable data. Removing data with kmermax less than or equal to 20 effectively improves the accuracy of the results while keeping sensitivity, relative specificity and accuracy high.

Table 2 Species detected with kmermax > 0 (no screening) and kmermax > 20 (kmermax <= 20 removed) (data volume: 1)

Table 3 Screening results for the 40 species (data volume: 1)

Table 4 Number of reads corresponding to each data volume in Figures 2 to 4

Data volume    Number of reads
0.01           15560
0.02           31148
0.05           77854
0.1            155720
0.2            311408
0.5            778572
1              1557092
2              3114186

Example 2: Analysis of sequencing data from the ZymoBIOMICS™ Microbial Community Standard (Catalog No. D6300)

The ZymoBIOMICS™ Microbial Community Standard (Table 5) was added to triple-distilled water for library construction and sequencing. After quality control and removal of human reads from the paired-end 75 bp sequencing data, Kraken2 was run for species identification, and the results were screened with the two parameters kmermax and kmersum. Finally, the sensitivity, relative specificity and accuracy were calculated at different data volumes.

Table 5 ZymoBIOMICS™ Microbial Community Standard

Sensitivity is the ratio of the number of true positive species detected, X, to the theoretical number of true positive species (10), that is, X/10.

Accuracy is the ratio of the number of true positive species X to the number of detected species D, that is, X/D.

The test data and results are shown in Tables 6 and 7.

Table 6 Species detected with kmermax > 0 (no screening) and kmermax > 20 (kmermax <= 20 removed) (data volume: 197472 reads)

Table 7 Test results for the ZymoBIOMICS™ Microbial Community Standard (data volume: 197472 reads)

In the screening results of Table 7, the relative specificity and accuracy of the condition group "remove kmermax <= 20, or kmersum < 1000 with a first drop of 3-fold or more" are both 100%. The single false negative in this group arises because Bacillus subtilis is highly similar to other species of its genus: few reads matched this species alone during screening, so its kmersum was low. When the effect of genus-level similarity is taken into account and the kmersum is recalculated, Bacillus subtilis is detected normally and the sensitivity reaches 100%, while the relative specificity and accuracy are unaffected.

In this specification, descriptions that refer to the terms "one embodiment", "some embodiments", "example", "specific example" or "some examples" mean that the specific features, structures, materials or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the present invention. Such expressions do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, provided they do not contradict one another, those skilled in the art may combine the different embodiments or examples described in this specification and the features of those embodiments or examples.

Although embodiments of the present invention have been shown and described above, it is to be understood that they are exemplary and are not to be construed as limiting the present invention. A person of ordinary skill in the art may change, modify, replace and vary the above embodiments within the scope of the present invention.

Claims

1. A method for optimizing the output results of Kraken2 software, characterized by comprising:

Matching the sub-reads of each read in the sequencing results with the species sequences in a known database, obtaining, for each read, the number of kmers of the sub-reads matched to each species, and selecting the maximum of these kmer numbers in each read, recorded as the kmermax number;

Comparing the kmermax number with a first threshold, and when the kmermax number is less than or equal to the first threshold, removing the reads corresponding to the kmermax number, so as to filter all the reads;

The remaining reads after the filtering are used as candidate reads, and the species corresponding to the kmermax number in each candidate read is used as a candidate species;

For each candidate species, the sum of the kmer numbers of all reads matching the candidate species is selected and recorded as the kmersum number;

The kmersum numbers corresponding to the different candidate species are arranged in descending order, and when the previous kmersum number divided by the next kmersum number is greater than or equal to the third threshold and the next kmersum number is less than the second threshold, the candidate species corresponding to the previous kmersum number and the kmersum number before it are the target species.

2. The method according to claim 1, characterized in that the first threshold is 15 to 30.

3. The method according to claim 1, characterized in that the second threshold is 900 to 1100.

4. The method according to claim 1, characterized in that the third threshold is 3 to 5.

5. The method according to claim 1, characterized in that the number of sequencing results is 10K to 30M.

6. A method for identifying species types in a sample, comprising:

(1) Perform metagenomic sequencing on the sample to obtain sequencing results;

(2) Analyzing the sequencing results using Kraken2 software, and optimizing the results using the optimization method of the Kraken2 software output according to any one of claims 1 to 5, so as to determine the species type in the sample.