CN113782097B

CN113782097B - Anchor point screening method and device based on bloom filter and computer equipment

Info

Publication number: CN113782097B
Application number: CN202111041904.5A
Authority: CN
Inventors: 张昂; 廖湘科; 崔英博; 杨灿群; 黄春; 唐滔; 彭林; 夏泽宇
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2021-09-07
Filing date: 2021-09-07
Publication date: 2022-06-24
Anticipated expiration: 2041-09-07
Also published as: CN113782097A

Abstract

The present application relates to a method, apparatus and computer equipment for anchor point screening based on Bloom filter. The method includes: selecting a query sequence and a fragment of the reference sequence between the two anchor points as the query sequence fragment and the reference sequence fragment according to the pre-located anchor points, and respectively selecting the reference sequence fragment and the query sequence fragment according to the preset length. Generate multiple consecutive overlapping sub-segments, build indexes through multiple preset hash functions, map the reference sequence sub-segments to the bit vector of the Bloom filter, and then query according to the index, when the query sequence sub-segment is in the reference sequence If it does not exist in the query sequence sub-segment, it is judged that the query sequence sub-segment has not passed the screening; all the query sequence sub-segments in the query sequence fragment are traversed, the total number of the query sequence sub-segments that have not passed the screening is counted, and when it is greater than the preset threshold, the left anchor point is removed; Iterates through all anchors until the anchor filtering is complete. The present invention can improve the accuracy and speed of DNA sequence alignment.

Description

An anchor point screening method, device and computer equipment based on Bloom filter

技术领域technical field

本申请涉及计算机技术领域，特别是涉及一种基于布隆过滤器的长读DNA序列比对锚点筛选方法、装置和计算机设备。The present application relates to the field of computer technology, in particular to a method, device and computer equipment for long-read DNA sequence alignment anchor point screening based on Bloom filter.

背景技术Background technique

第三代测序属于单分子检测技术，无需对模板进行扩增，避免了聚合酶链式反应带来的碱基偏好性。而且，三代测序的读长更长，能够发现二代测序无法发现的基因组重复片段和结构变异等信息，在基因组组装、结构变异检测和基因组重测序等领域都取得了新的突破。The third-generation sequencing is a single-molecule detection technology that does not need to amplify the template, avoiding the base preference caused by the polymerase chain reaction. Moreover, the read length of the third-generation sequencing is longer, and it can find information such as genome repeats and structural variation that cannot be found by the second-generation sequencing. New breakthroughs have been made in the fields of genome assembly, structural variation detection, and genome resequencing.

序列比对是测序数据分析中基础而重要的环节，比对的结果是其他步骤的前提。不同于面向二代短读序列的比对算法，实现三代长读序列的快速准确比对面临读段长度更长和测序错误率更高等方面的挑战。针对这一问题，三代长读序列比对大多采用启发式方法，即“种子-扩展”，其思想是先从读段和参考基因组中选取一些短片段作为种子；再通过种子的精确匹配进行锚点定位，将比对范围由整个基因组缩小至部分候选区域；最后利用动态规划方法，对候选区域进行碱基比对，细化比对结果，实现扩展验证。因此，序列比对算法主要包括种子生成、锚点定位和碱基比对三个步骤。Sequence alignment is a basic and important link in sequencing data analysis, and the result of the alignment is the premise of other steps. Different from the alignment algorithms for second-generation short-read sequences, achieving fast and accurate alignment of third-generation long-read sequences faces challenges in terms of longer read lengths and higher sequencing error rates. In response to this problem, most of the three generations of long-read sequence alignments use a heuristic method, namely "seed-expansion", the idea is to first select some short fragments from the reads and the reference genome as seeds; Point positioning, the comparison range is narrowed from the entire genome to some candidate regions; finally, the dynamic programming method is used to perform base comparisons on the candidate regions, refine the comparison results, and realize extended verification. Therefore, the sequence alignment algorithm mainly includes three steps: seed generation, anchor location and base alignment.

由于测序错误和基因组本身的局部同源性，现有比对工具在全局定位时会定位到一些错误的锚点，对这些错误锚点之间的片段进行比对，将会产生次优结果甚至错误结果。而且，这些错误锚点同样需要进行扩展验证，会带来大量无用的计算，在影响比对精度的同时也降低了速度。Due to sequencing errors and the local homology of the genome itself, the existing alignment tools will locate some wrong anchors during global positioning, and aligning fragments between these wrong anchors will produce suboptimal results or even wrong result. Moreover, these wrong anchor points also require extended verification, which will bring a lot of useless calculations, which will not only affect the comparison accuracy, but also reduce the speed.

发明内容SUMMARY OF THE INVENTION

基于此，有必要针对上述技术问题，提供一种能够剔除错误锚点的基于布隆过滤器的长读DNA序列比对锚点筛选方法、装置、计算机设备和存储介质。Based on this, it is necessary to provide a method, device, computer equipment and storage medium for long-read DNA sequence alignment anchor screening based on Bloom filter capable of eliminating false anchors.

一种基于布隆过滤器的锚点筛选方法，所述方法包括：An anchor point screening method based on Bloom filter, the method comprises:

获取待比对的查询序列、参考序列以及预先定位得到的多个锚点；所述查询序列为长读DNA序列；Obtain the query sequence to be compared, the reference sequence and multiple anchor points obtained by pre-positioning; the query sequence is a long-read DNA sequence;

选取所述查询序列在第一锚点和第二锚点之间的片段为查询序列片段，选取所述参考序列在所述第一锚点和所述第二锚点之间的片段为参考序列片段；Select the fragment of the query sequence between the first anchor point and the second anchor point as the query sequence fragment, and select the fragment of the reference sequence between the first anchor point and the second anchor point as the reference sequence fragment;

根据所述参考序列片段按照预设长度生成多个连续重叠的参考序列子片段，根据所述查询序列片段按照所述预设长度生成多个连续重叠的查询序列子片段；generating a plurality of consecutive overlapping reference sequence sub-segments according to the reference sequence fragment according to a preset length, and generating a plurality of consecutive overlapping query sequence sub-segments according to the query sequence fragment according to the preset length;

通过预设的多个哈希函数建立索引，将所述参考序列子片段映射到布隆过滤器的位向量中；Indexing is established by a plurality of preset hash functions, and the sub-segment of the reference sequence is mapped to the bit vector of the Bloom filter;

根据所述索引查询所述参考序列中是否存在所述查询序列子片段，当所述查询序列子片段在所述参考序列中不存在时，判断所述查询序列子片段未通过筛选；Query whether the sub-segment of the query sequence exists in the reference sequence according to the index, and when the sub-segment of the query sequence does not exist in the reference sequence, determine that the sub-segment of the query sequence has not passed the screening;

遍历所述查询序列片段中所有查询序列子片段，并统计未通过筛选的查询序列子片段的累计值，当所述累计值大于预设阈值时，剔除所述第一锚点；Traverse all the query sequence sub-segments in the query sequence fragment, and count the cumulative value of the query sequence sub-fragments that have not passed the screening, and when the cumulative value is greater than a preset threshold, remove the first anchor point;

遍历所有锚点，直到完成所述所有锚点的筛选。Iterates through all anchors until filtering of all said anchors is complete.

在其中一个实施例中，还包括：在根据所述参考序列片段按照预设长度生成多个连续重叠的参考序列子片段，根据所述查询序列片段按照所述预设长度生成多个连续重叠的查询序列子片段之前，删除所述查询序列片段和所述参考序列片段两端相同的部分，对所述查询序列片段和所述参考序列片段进行更新。In one of the embodiments, the method further includes: generating a plurality of consecutive overlapping reference sequence sub-segments according to the reference sequence segment according to a preset length, and generating a plurality of consecutive overlapping reference sequence sub-segments according to the query sequence segment according to the preset length. Before the sub-segment of the query sequence, the same parts at both ends of the query sequence segment and the reference sequence segment are deleted, and the query sequence segment and the reference sequence segment are updated.

在其中一个实施例中，还包括：在删除所述查询序列片段和所述参考序列片段两端相同的部分，对所述查询序列片段和所述参考序列片段进行更新之前，将所述查询序列片段和所述参考序列片段对齐；由序列两端向中间延伸，逐个碱基进行比对；得到所述查询序列片段和所述参考序列片段两端相同的部分。In one of the embodiments, the method further includes: before deleting the same parts at both ends of the query sequence segment and the reference sequence segment, and updating the query sequence segment and the reference sequence segment, updating the query sequence segment The fragment is aligned with the reference sequence fragment; extends from the two ends of the sequence to the middle, and aligns base by base; obtains the same part at both ends of the query sequence fragment and the reference sequence fragment.

在其中一个实施例中，还包括：若所述查询序列片段和所述参考序列片段正在比对的两个碱基相同，继续向下一个碱基延伸；若所述查询序列片段和所述参考序列片段正在比对的两个碱基不同，停止延伸。In one embodiment, it further includes: if the two bases that are being aligned between the query sequence fragment and the reference sequence fragment are the same, continue to extend to the next base; if the query sequence fragment and the reference sequence fragment are the same The two bases at which the sequence fragment is being aligned are different and the extension is stopped.

在其中一个实施例中，还包括：根据预设的多个哈希函数得到每个参考序列子片段的多个哈希值；根据所述参考序列子片段的哈希值将布隆过滤器位向量对应位置的值置1。In one of the embodiments, the method further includes: obtaining multiple hash values of each reference sequence sub-segment according to a plurality of preset hash functions; The value at the corresponding position of the vector is set to 1.

在其中一个实施例中，还包括：根据预设的多个哈希函数得到每个查询序列子片段的多个哈希值；根据所述查询序列子片段的哈希值在所述布隆过滤器位向量的对应位置进行查询；若所述查询序列子片段的多个哈希值对应位置的值全部为1，则判断所述查询序列子片段通过筛选；若不全部为1，则判断所述查询序列子片段未通过筛选。In one of the embodiments, the method further includes: obtaining multiple hash values of each sub-segment of the query sequence according to a plurality of preset hash functions; filtering the Bloom filter according to the hash values of the sub-segments of the query sequence If the corresponding positions of the multiple hash values of the sub-segment of the query sequence are all 1, it is judged that the sub-segment of the query sequence has passed the screening; if not all of them are 1, it is judged that the The subfragments of the query sequence did not pass the screening.

在其中一个实施例中，还包括：以所述第二锚点为第一锚点，以所述第二锚点右侧相邻的锚点为第二锚点，进行下一轮的锚点筛选。In one of the embodiments, the method further includes: using the second anchor point as the first anchor point, and using the anchor point adjacent to the right side of the second anchor point as the second anchor point, and performing the next round of anchor points filter.

一种基于布隆过滤器的锚点筛选装置，所述装置包括：An anchor point screening device based on a Bloom filter, the device comprising:

序列获取模块，用于获取待比对的查询序列、参考序列以及预先定位得到的多个锚点；所述查询序列为长读DNA序列；a sequence acquisition module, used to acquire a query sequence to be compared, a reference sequence and a plurality of anchor points obtained by pre-positioning; the query sequence is a long-read DNA sequence;

片段选取模块，用于选取所述查询序列在第一锚点和第二锚点之间的片段为查询序列片段，选取所述参考序列在所述第一锚点和所述第二锚点之间的片段为参考序列片段；A fragment selection module is used to select a fragment of the query sequence between the first anchor point and the second anchor point as a query sequence fragment, and select the reference sequence between the first anchor point and the second anchor point. The fragments between are reference sequence fragments;

子片段生成模块，用于根据所述参考序列片段按照预设长度生成多个连续重叠的参考序列子片段，根据所述查询序列片段按照所述预设长度生成多个连续重叠的查询序列子片段；a sub-segment generation module, configured to generate a plurality of consecutive overlapping reference sequence sub-segments according to the reference sequence segment according to a preset length, and generate a plurality of consecutive overlapping query sequence sub-segments according to the query sequence segment according to the preset length ;

索引建立模块，用于通过预设的多个哈希函数建立索引，将所述参考序列子片段映射到布隆过滤器的位向量中；an index establishment module, for establishing an index through a plurality of preset hash functions, and mapping the reference sequence sub-segment into the bit vector of the Bloom filter;

查询模块，用于根据所述索引查询所述参考序列中是否存在所述查询序列子片段，当所述查询序列子片段在所述参考序列中不存在时，判断所述查询序列子片段未通过筛选；A query module, configured to query whether the sub-segment of the query sequence exists in the reference sequence according to the index, and when the sub-segment of the query sequence does not exist in the reference sequence, determine that the sub-segment of the query sequence has failed filter;

锚点剔除模块，用于遍历所述查询序列片段中所有查询序列子片段，并统计未通过筛选的查询序列子片段的累计值，当所述累计值大于预设阈值时，剔除所述第一锚点；An anchor point elimination module is used to traverse all query sequence sub-segments in the query sequence segment, and count the cumulative value of the query sequence sub-segments that have not passed the screening, and when the cumulative value is greater than a preset threshold, remove the first anchor;

遍历模块，用于遍历所有锚点，直到完成所述所有锚点的筛选。The traversal module is used to traverse all the anchors until the screening of all the anchors is completed.

一种计算机设备，包括存储器和处理器，所述存储器存储有计算机程序，所述处理器执行所述计算机程序时实现以下步骤：A computer device includes a memory and a processor, the memory stores a computer program, and the processor implements the following steps when executing the computer program:

一种计算机可读存储介质，其上存储有计算机程序，所述计算机程序被处理器执行时实现以下步骤：A computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the following steps are implemented:

上述基于布隆过滤器的锚点筛选方法、装置、计算机设备和存储介质，根据预先定位到的锚点，选取查询序列在第一锚点和第二锚点之间的片段为查询序列片段，选取参考序列在第一锚点和第二锚点之间的片段为参考序列片段，分别对参考序列片段和查询序列片段按照预设长度生成多个连续重叠的子片段，通过预设的多个哈希函数建立索引，将参考序列子片段映射到布隆过滤器的位向量中，再根据索引查询参考序列中是否存在查询序列子片段，当查询序列子片段在参考序列中不存在时，判断查询序列子片段未通过筛选；遍历查询序列片段中所有查询序列子片段，并统计未通过筛选的查询序列子片段的累计值，当累计值大于预设阈值时，剔除第一锚点；遍历所有锚点，直到完成所有锚点的筛选。本发明通过布隆过滤器实现DNA序列比对锚点的筛选，剔除了错误锚点，可以提高DNA序列比对的精度和速度。The above-mentioned anchor point screening method, device, computer equipment and storage medium based on Bloom filter, according to the pre-positioned anchor point, select the fragment of the query sequence between the first anchor point and the second anchor point as the query sequence fragment, A segment of the reference sequence between the first anchor point and the second anchor point is selected as a reference sequence segment, and a plurality of consecutive overlapping sub-segments are generated for the reference sequence segment and the query sequence segment according to a preset length, and the preset multiple sub-segments are generated. The hash function builds an index, maps the sub-segment of the reference sequence to the bit vector of the Bloom filter, and then queries whether there is a sub-segment of the query sequence in the reference sequence according to the index. When the sub-segment of the query sequence does not exist in the reference sequence, judge The query sequence sub-segment fails the screening; traverses all the query sequence sub-segments in the query sequence fragment, and counts the cumulative value of the query sequence sub-segments that do not pass the screening. When the cumulative value is greater than the preset threshold, remove the first anchor point; traverse all the query sequence sub-segments Anchors until all anchors are filtered. The invention realizes the screening of DNA sequence alignment anchor points through Bloom filter, eliminates wrong anchor points, and can improve the accuracy and speed of DNA sequence alignment.

附图说明Description of drawings

图1为一个实施例中基于布隆过滤器的锚点筛选方法的流程示意图；1 is a schematic flowchart of an anchor point screening method based on a Bloom filter in one embodiment;

图2为一个实施例中从序列片段得到子片段的示意图；Figure 2 is a schematic diagram of obtaining subfragments from sequence fragments in one embodiment;

图3为另一个实施例中基于布隆过滤器的锚点筛选方法的流程示意图；3 is a schematic flowchart of an anchor point screening method based on a Bloom filter in another embodiment;

图4为一个实施例中基于布隆过滤器的锚点筛选装置的结构框图；4 is a structural block diagram of an apparatus for anchor point screening based on a Bloom filter in one embodiment;

图5为一个实施例中计算机设备的内部结构图。FIG. 5 is a diagram of the internal structure of a computer device in one embodiment.

具体实施方式Detailed ways

为了使本申请的目的、技术方案及优点更加清楚明白，以下结合附图及实施例，对本申请进行进一步详细说明。应当理解，此处描述的具体实施例仅仅用以解释本申请，并不用于限定本申请。In order to make the purpose, technical solutions and advantages of the present application more clearly understood, the present application will be described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application, but not to limit the present application.

本申请提供的基于布隆过滤器的锚点筛选方法，可以应用于如下应用环境中。其中，终端执行一种基于布隆过滤器的锚点筛选方法，根据预先定位到的锚点，选取查询序列在第一锚点和第二锚点之间的片段为查询序列片段，选取参考序列在第一锚点和第二锚点之间的片段为参考序列片段，分别对参考序列片段和查询序列片段按照预设长度生成多个连续重叠的子片段，通过预设的多个哈希函数建立索引，将参考序列子片段映射到布隆过滤器的位向量中，再根据索引查询参考序列中是否存在查询序列子片段，当查询序列子片段在参考序列中不存在时，判断查询序列子片段未通过筛选；遍历查询序列片段中所有查询序列子片段，并统计未通过筛选的查询序列子片段的累计值，当累计值大于预设阈值时，剔除第一锚点。其中，终端可以但不限于是各种个人计算机、笔记本电脑和平板电脑。The Bloom filter-based anchor point screening method provided in this application can be applied to the following application environments. The terminal executes an anchor point screening method based on a Bloom filter, selects a fragment of the query sequence between the first anchor point and the second anchor point as the query sequence fragment according to the pre-located anchor points, and selects the reference sequence The segment between the first anchor point and the second anchor point is the reference sequence segment, and the reference sequence segment and the query sequence segment are respectively generated according to preset lengths. Multiple consecutive overlapping sub-segments, through preset multiple hash functions Build an index, map the reference sequence sub-segment to the bit vector of the Bloom filter, and then query whether the query sequence sub-segment exists in the reference sequence according to the index. When the query sequence sub-segment does not exist in the reference sequence, determine the query sequence sub-segment. The fragment fails the screening; it traverses all the sub-fragments of the query sequence in the query sequence fragment, and counts the cumulative value of the sub-fragments of the query sequence that do not pass the screening. When the cumulative value is greater than the preset threshold, the first anchor point is eliminated. Wherein, the terminal may be, but not limited to, various personal computers, notebook computers and tablet computers.

在一个实施例中，如图1所示，提供了一种基于布隆过滤器的锚点筛选方法，包括以下步骤：In one embodiment, as shown in FIG. 1 , a Bloom filter-based anchor point screening method is provided, comprising the following steps:

步骤102，获取待比对的查询序列、参考序列以及预先定位得到的多个锚点。Step 102: Obtain the query sequence to be compared, the reference sequence, and multiple anchor points obtained by pre-positioning.

查询序列为长读DNA序列。三代长读序列比对大多采用启发式方法，即“种子-扩展”，长读DNA序列比对算法主要包括种子生成、锚点定位和碱基比对三个步骤。The query sequence is a long-read DNA sequence. The three-generation long-read sequence alignment mostly adopts the heuristic method, namely "seed-expansion". The long-read DNA sequence alignment algorithm mainly includes three steps: seed generation, anchor point positioning and base alignment.

步骤104，选取查询序列在第一锚点和第二锚点之间的片段为查询序列片段，选取参考序列在第一锚点和第二锚点之间的片段为参考序列片段。Step 104 , a segment of the query sequence between the first anchor point and the second anchor point is selected as a query sequence segment, and a segment of the reference sequence between the first anchor point and the second anchor point is selected as a reference sequence segment.

以锚点为间隔，对锚点之间的序列进行比对。本发明在进行锚点筛选时，也是以锚点为间隔，通过锚点之间的序列进行锚点筛选。Align the sequences between the anchors at the interval of the anchors. When performing anchor point screening in the present invention, anchor points are also used as intervals, and anchor point screening is performed by the sequence between the anchor points.

步骤106，根据参考序列片段按照预设长度生成多个连续重叠的参考序列子片段，根据查询序列片段按照预设长度生成多个连续重叠的查询序列子片段。Step 106: Generate a plurality of continuously overlapping reference sequence sub-segments according to the reference sequence segment according to a preset length, and generate a plurality of consecutively overlapping query sequence sub-segments according to the query sequence segment according to the preset length.

如图2所示，子片段连续重叠，覆盖整个序列片段。As shown in Figure 2, the sub-segments overlap continuously, covering the entire sequence segment.

步骤108，通过预设的多个哈希函数建立索引，将参考序列子片段映射到布隆过滤器的位向量中。In step 108, indexes are established by using a plurality of preset hash functions, and the reference sequence sub-segments are mapped to the bit vector of the Bloom filter.

布隆过滤器是一个位向量，要映射一个值到布隆过滤器中，需要使用多个不同的哈希函数生成多个哈希值，并对每个生成的哈希值指向的bit位置1。A bloom filter is a bit vector. To map a value to a bloom filter, multiple hash values need to be generated using multiple different hash functions, and the bit position pointed to by each generated hash value is 1 .

步骤110，根据索引查询参考序列中是否存在查询序列子片段，当查询序列子片段在参考序列中不存在时，判断查询序列子片段未通过筛选。Step 110 , according to the index to query whether the sub-segment of the query sequence exists in the reference sequence, and when the sub-segment of the query sequence does not exist in the reference sequence, it is determined that the sub-segment of the query sequence has not passed the screening.

当查询序列子片段在参考序列中不存在时，说明查询序列片段存在参考序列片段中不存在的序列片段，即，查询序列片段与参考序列片段存在差异。When the sub-segment of the query sequence does not exist in the reference sequence, it means that the query sequence segment has a sequence segment that does not exist in the reference sequence segment, that is, there is a difference between the query sequence segment and the reference sequence segment.

步骤112，遍历查询序列片段中所有查询序列子片段，并统计未通过筛选的查询序列子片段的累计值，当累计值大于预设阈值时，剔除第一锚点。Step 112 , traverse all the query sequence sub-segments in the query sequence segment, and count the accumulated values of the query sequence sub-segments that have not passed the screening. When the accumulated value is greater than a preset threshold, the first anchor point is eliminated.

当查询序列子片段中有相当多的子片段在参考序列片段中不存在时，说明查询序列片段与参考序列片段的差异较大，可以判断锚点是错误锚点。When a considerable number of sub-segments in the sub-segment of the query sequence do not exist in the reference sequence segment, it indicates that the query sequence segment is quite different from the reference sequence segment, and it can be judged that the anchor point is the wrong anchor point.

步骤114，遍历所有锚点，直到完成所有锚点的筛选。Step 114, traverse all anchor points until all anchor points are screened.

上述基于布隆过滤器的锚点筛选方法中，根据预先定位到的锚点，选取查询序列在第一锚点和第二锚点之间的片段为查询序列片段，选取参考序列在第一锚点和第二锚点之间的片段为参考序列片段，分别对参考序列片段和查询序列片段按照预设长度生成多个连续重叠的子片段，通过预设的多个哈希函数建立索引，将参考序列子片段映射到布隆过滤器的位向量中，再根据索引查询参考序列中是否存在查询序列子片段，当查询序列子片段在参考序列中不存在时，判断查询序列子片段未通过筛选；遍历查询序列片段中所有查询序列子片段，并统计未通过筛选的查询序列子片段的累计值，当累计值大于预设阈值时，剔除第一锚点；遍历所有锚点，直到完成所有锚点的筛选。本发明通过布隆过滤器实现DNA序列比对锚点的筛选，剔除了错误锚点，可以提高DNA序列比对的精度和速度。In the above-mentioned anchor point screening method based on Bloom filter, according to the pre-positioned anchor points, the fragment of the query sequence between the first anchor point and the second anchor point is selected as the query sequence fragment, and the reference sequence is selected at the first anchor point. The segment between the point and the second anchor point is the reference sequence segment, and the reference sequence segment and the query sequence segment are respectively generated according to preset lengths. Multiple consecutive overlapping sub-segments are indexed by multiple preset hash functions, and the The reference sequence sub-segment is mapped to the bit vector of the Bloom filter, and then the query sequence sub-segment is queried according to the index. If the query sequence sub-segment does not exist in the reference sequence, it is judged that the query sequence sub-segment has not passed the screening ; traverse all query sequence sub-segments in the query sequence segment, and count the cumulative value of the query sequence sub-segments that have not passed the screening, when the cumulative value is greater than the preset threshold, remove the first anchor point; traverse all anchor points until all anchor points are completed point filter. The invention realizes the screening of DNA sequence alignment anchor points through Bloom filter, eliminates wrong anchor points, and can improve the accuracy and speed of DNA sequence alignment.

在其中一个实施例中，还包括：在根据参考序列片段按照预设长度生成多个连续重叠的参考序列子片段，根据查询序列片段按照预设长度生成多个连续重叠的查询序列子片段之前，删除查询序列片段和参考序列片段两端相同的部分，对查询序列片段和参考序列片段进行更新。In one embodiment, the method further includes: before generating a plurality of consecutive overlapping reference sequence sub-segments according to the reference sequence segment according to a preset length, and before generating a plurality of consecutive overlapping query sequence sub-segments according to the query sequence segment according to the preset length, Delete the same parts at both ends of the query sequence fragment and the reference sequence fragment, and update the query sequence fragment and the reference sequence fragment.

删除查询序列片段和参考序列片段两端相同的部分，对查询序列片段和参考序列片段进行更新，可以减少冗余数据。Deleting the same parts at both ends of the query sequence fragment and the reference sequence fragment, and updating the query sequence fragment and the reference sequence fragment, can reduce redundant data.

在其中一个实施例中，还包括：在删除查询序列片段和参考序列片段两端相同的部分，对查询序列片段和参考序列片段进行更新之前，将查询序列片段和参考序列片段对齐；由序列两端向中间延伸，逐个碱基进行比对；得到查询序列片段和参考序列片段两端相同的部分。In one of the embodiments, the method further includes: aligning the query sequence segment and the reference sequence segment before deleting the same parts at both ends of the query sequence segment and the reference sequence segment and updating the query sequence segment and the reference sequence segment; The ends are extended to the middle, and the base-by-base alignment is carried out; the identical parts at both ends of the query sequence fragment and the reference sequence fragment are obtained.

在其中一个实施例中，还包括：若查询序列片段和参考序列片段正在比对的两个碱基相同，继续向下一个碱基延伸；若查询序列片段和参考序列片段正在比对的两个碱基不同，停止延伸。In one embodiment, it also includes: if the two bases that are being aligned between the query sequence fragment and the reference sequence fragment are the same, continue to extend to the next base; if the query sequence fragment and the reference sequence fragment are aligning two bases The bases are different and the extension is stopped.

在其中一个实施例中，还包括：根据预设的多个哈希函数得到每个参考序列子片段的多个哈希值；根据参考序列子片段的哈希值将布隆过滤器位向量对应位置的值置1。In one of the embodiments, the method further includes: obtaining multiple hash values of each reference sequence sub-segment according to a plurality of preset hash functions; corresponding to the Bloom filter bit vector according to the hash values of the reference sequence sub-segments The value of position is set to 1.

在其中一个实施例中，还包括：根据预设的多个哈希函数得到每个查询序列子片段的多个哈希值；根据查询序列子片段的哈希值在布隆过滤器位向量的对应位置进行查询；若查询序列子片段的多个哈希值对应位置的值全部为1，则判断查询序列子片段通过筛选；若不全部为1，则判断查询序列子片段未通过筛选。In one of the embodiments, the method further includes: obtaining multiple hash values of each sub-segment of the query sequence according to a plurality of preset hash functions; The corresponding position is queried; if the values of the corresponding positions of the multiple hash values of the query sequence sub-segment are all 1, the query sequence sub-segment is judged to pass the screening; if not all are 1, the query sequence sub-segment is judged to have not passed the screening.

在其中一个实施例中，还包括：以第二锚点为第一锚点，以第二锚点右侧相邻的锚点为第二锚点，进行下一轮的锚点筛选。In one of the embodiments, the method further includes: taking the second anchor point as the first anchor point, and taking the adjacent anchor point on the right side of the second anchor point as the second anchor point, and performing the next round of anchor point screening.

应该理解的是，虽然图1的流程图中的各个步骤按照箭头的指示依次显示，但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明，这些步骤的执行并没有严格的顺序限制，这些步骤可以以其它的顺序执行。而且，图1中的至少一部分步骤可以包括多个子步骤或者多个阶段，这些子步骤或者阶段并不必然是在同一时刻执行完成，而是可以在不同的时刻执行，这些子步骤或者阶段的执行顺序也不必然是依次进行，而是可以与其它步骤或者其它步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。It should be understood that although the various steps in the flowchart of FIG. 1 are shown in sequence according to the arrows, these steps are not necessarily executed in the sequence shown by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly limited to the order, and these steps may be performed in other orders. Moreover, at least a part of the steps in FIG. 1 may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily executed and completed at the same time, but may be executed at different times. The execution of these sub-steps or stages The sequence is also not necessarily sequential, but may be performed alternately or alternately with other steps or sub-steps of other steps or at least a portion of a phase.

在一个具体实施例中，如图3所示，提供了一种基于布隆过滤器的锚点筛选方法，包括：In a specific embodiment, as shown in FIG. 3 , a method for anchor point screening based on a Bloom filter is provided, including:

步骤1：将查询序列query和参考序列target按锚点对齐，其中锚点数量为N；Step 1: Align the query sequence query and the reference sequence target by anchor points, where the number of anchor points is N;

步骤2：选取左侧第一锚点P1和左侧第二锚点P2间的片段，其中查询序列的片段为qseq1，参考序列的片段为tseq1；Step 2: Select the fragment between the first anchor point P1 on the left and the second anchor point P2 on the left, where the fragment of the query sequence is qseq1, and the fragment of the reference sequence is tseq1;

步骤3：去除查询序列片段qseq1和参考序列片段tseq1两端相同的部分；Step 3: Remove the same parts at both ends of the query sequence fragment qseq1 and the reference sequence fragment tseq1;

步骤4：为参考序列片段tseq1剩余部分建立索引；Step 4: Create an index for the remaining part of the reference sequence fragment tseq1;

步骤5：在索引中查询qseq1剩余部分，判定是否保留左侧锚点P1；Step 5: Query the remaining part of qseq1 in the index to determine whether to retain the left anchor point P1;

步骤6：完成对锚点P1的筛选；Step 6: Complete the screening of anchor point P1;

步骤7：回到步骤2，处理锚点P2和锚点P3件的片段，直至N个锚点筛选完成。Step 7: Go back to Step 2, and process the fragments of the anchor point P2 and the anchor point P3 until the N anchor points are screened.

更具体的步骤为：More specific steps are:

步骤3.1：将qseq1和tseq1两端对齐后，由两端向中间延伸，逐个碱基进行比对；Step 3.1: After aligning the two ends of qseq1 and tseq1, extend from the two ends to the middle, and align base by base;

步骤3.1.1：若两个碱基相同，继续向下一个碱基延伸；Step 3.1.1: If the two bases are the same, continue to extend to the next base;

步骤3.1.2：若两个碱基不同，停止延伸；Step 3.1.2: If the two bases are different, stop the extension;

步骤3.2：去除两端相同的部分；Step 3.2: Remove the same parts at both ends;

步骤4.1：对参考序列片段tseq1剩余部分生成连续重叠的长度为q的片段q-mer；Step 4.1: Generate a continuous overlapping fragment q-mer of length q for the remaining part of the reference sequence fragment tseq1;

步骤4.2：使用哈希函数计算出q-mer的哈希值；Step 4.2: Use the hash function to calculate the hash value of q-mer;

步骤4.3：将计算出的q-mer的哈希值插入布隆过滤器的位向量中；Step 4.3: Insert the calculated hash value of q-mer into the bit vector of the Bloom filter;

重复步骤4.1～4.3，直到处理完所有q-mer；Repeat steps 4.1 to 4.3 until all q-mers are processed;

步骤5.1：对查询序列片段qseq1剩余部分生成连续重叠的长度为q的片段q-mer；Step 5.1: Generate a continuous overlapping fragment q-mer of length q for the remaining part of the query sequence fragment qseq1;

步骤5.2：使用哈希函数计算出q-mer的哈希值；Step 5.2: Use the hash function to calculate the hash value of q-mer;

步骤5.3：在布隆过滤器的位向量中查询步骤5.2计算出的哈希值；Step 5.3: Query the hash value calculated in step 5.2 in the bit vector of the Bloom filter;

步骤5.3.1：位向量中对应的哈希值全部为1，该q-mer通过筛选；Step 5.3.1: The corresponding hash values in the bit vector are all 1, and the q-mer passes the screening;

步骤5.3.2：位向量中对应的哈希值至少存在一位为0，该q-mer不通过筛选，未通过q-mer数量累计值Count加1；Step 5.3.2: At least one bit of the corresponding hash value in the bit vector is 0, the q-mer does not pass the screening, and the cumulative value Count of the number of failed q-mers is incremented by 1;

步骤5.3.3：判断Count值是否大于过滤阈值T；Step 5.3.3: Determine whether the Count value is greater than the filtering threshold T;

步骤5.3.3.1：Count大于过滤阈值T，说明锚点P1定位不正确，剔除锚点P1；Step 5.3.3.1: Count is greater than the filtering threshold T, indicating that the anchor point P1 is not positioned correctly, and the anchor point P1 is eliminated;

步骤5.3.3.2：Count不大于过滤阈值T，说明锚点P定位正确，保留锚点P1；Step 5.3.3.2: Count is not greater than the filtering threshold T, indicating that the anchor point P is positioned correctly, and the anchor point P1 is reserved;

在一个实施例中，如图4所示，提供了一种基于布隆过滤器的锚点筛选装置，包括：序列获取模块402、片段选取模块404、子片段生成模块406、索引建立模块408、查询模块410、锚点剔除模块412和遍历模块414，其中：In one embodiment, as shown in FIG. 4, an anchor point screening apparatus based on a Bloom filter is provided, including: a sequence acquisition module 402, a segment selection module 404, a sub-segment generation module 406, an index establishment module 408, The query module 410, the anchor point culling module 412 and the traversal module 414, wherein:

序列获取模块402，用于获取待比对的查询序列、参考序列以及预先定位得到的多个锚点；查询序列为长读DNA序列；The sequence obtaining module 402 is used to obtain the query sequence to be compared, the reference sequence and the multiple anchor points obtained by pre-positioning; the query sequence is a long-read DNA sequence;

片段选取模块404，用于选取查询序列在第一锚点和第二锚点之间的片段为查询序列片段，选取参考序列在第一锚点和第二锚点之间的片段为参考序列片段；A fragment selection module 404, configured to select a fragment of the query sequence between the first anchor point and the second anchor point as a query sequence fragment, and select a fragment of the reference sequence between the first anchor point and the second anchor point as a reference sequence fragment ;

子片段生成模块406，用于根据参考序列片段按照预设长度生成多个连续重叠的参考序列子片段，根据查询序列片段按照预设长度生成多个连续重叠的查询序列子片段；A sub-segment generation module 406, configured to generate a plurality of continuously overlapping reference sequence sub-segments according to a preset length according to the reference sequence segment, and generate a plurality of consecutively overlapping query sequence sub-segments according to the query sequence segment according to a preset length;

索引建立模块408，用于通过预设的多个哈希函数建立索引，将参考序列子片段映射到布隆过滤器的位向量中；The index establishment module 408 is used to establish an index through a plurality of preset hash functions, and map the reference sequence sub-segment into the bit vector of the Bloom filter;

查询模块410，用于根据索引查询参考序列中是否存在查询序列子片段，当查询序列子片段在参考序列中不存在时，判断查询序列子片段未通过筛选；The query module 410 is configured to query whether the sub-segment of the query sequence exists in the reference sequence according to the index, and when the sub-segment of the query sequence does not exist in the reference sequence, determine that the sub-segment of the query sequence has not passed the screening;

锚点剔除模块412，用于遍历查询序列片段中所有查询序列子片段，并统计未通过筛选的查询序列子片段的累计值，当累计值大于预设阈值时，剔除第一锚点；Anchor point elimination module 412, configured to traverse all query sequence sub-segments in the query sequence segment, and count the cumulative value of the query sequence sub-segments that have not passed the screening, and when the cumulative value is greater than a preset threshold, remove the first anchor point;

遍历模块414，用于遍历所有锚点，直到完成所有锚点的筛选。The traversal module 414 is used to traverse all the anchor points until the screening of all the anchor points is completed.

片段选取模块404还用于删除查询序列片段和参考序列片段两端相同的部分，对查询序列片段和参考序列片段进行更新。The segment selection module 404 is further configured to delete the same parts at both ends of the query sequence segment and the reference sequence segment, and update the query sequence segment and the reference sequence segment.

片段选取模块404还用于将查询序列片段和参考序列片段对齐；由序列两端向中间延伸，逐个碱基进行比对；得到查询序列片段和参考序列片段两端相同的部分。若查询序列片段和参考序列片段正在比对的两个碱基相同，继续向下一个碱基延伸；若查询序列片段和参考序列片段正在比对的两个碱基不同，停止延伸。The fragment selection module 404 is also used for aligning the query sequence fragment and the reference sequence fragment; extending from both ends of the sequence to the middle, and performing base-by-base alignment; obtaining the same part at both ends of the query sequence fragment and the reference sequence fragment. If the two bases of the query sequence fragment and the reference sequence fragment being aligned are the same, continue to extend to the next base; if the two bases of the query sequence fragment and the reference sequence fragment being aligned are different, stop the extension.

索引建立模块408还用于根据预设的多个哈希函数得到每个参考序列子片段的多个哈希值；根据参考序列子片段的哈希值将布隆过滤器位向量对应位置的值置1。The index establishment module 408 is further configured to obtain a plurality of hash values of each reference sequence sub-segment according to a plurality of preset hash functions; Set to 1.

查询模块410还用于根据预设的多个哈希函数得到每个查询序列子片段的多个哈希值；根据查询序列子片段的哈希值在布隆过滤器位向量的对应位置进行查询；若查询序列子片段的多个哈希值对应位置的值全部为1，则判断查询序列子片段通过筛选；若不全部为1，则判断查询序列子片段未通过筛选。The query module 410 is further configured to obtain a plurality of hash values of each sub-segment of the query sequence according to a plurality of preset hash functions; query the corresponding position of the Bloom filter bit vector according to the hash value of the sub-segment of the query sequence ; If the values of the corresponding positions of the multiple hash values of the query sequence sub-segment are all 1, the query sequence sub-segment is judged to pass the screening; if not all are 1, the query sequence sub-segment is judged to have not passed the screening.

查询模块410还用于以第二锚点为第一锚点，以第二锚点右侧相邻的锚点为第二锚点，进行下一轮的锚点筛选。The query module 410 is further configured to use the second anchor point as the first anchor point, and use the adjacent anchor point to the right of the second anchor point as the second anchor point to perform the next round of anchor point screening.

关于基于布隆过滤器的锚点筛选装置的具体限定可以参见上文中对于基于布隆过滤器的锚点筛选方法的限定，在此不再赘述。上述基于布隆过滤器的锚点筛选装置中的各个模块可全部或部分通过软件、硬件及其组合来实现。上述各模块可以硬件形式内嵌于或独立于计算机设备中的处理器中，也可以以软件形式存储于计算机设备中的存储器中，以便于处理器调用执行以上各个模块对应的操作。For the specific definition of the apparatus for anchor point screening based on a Bloom filter, reference may be made to the above definition of the method for anchor point screening based on a Bloom filter, which will not be repeated here. Each module in the above-mentioned Bloom filter-based anchor point screening apparatus can be implemented in whole or in part by software, hardware, and combinations thereof. The above modules can be embedded in or independent of the processor in the computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.

在一个实施例中，提供了一种计算机设备，该计算机设备可以是终端，其内部结构图可以如图5所示。该计算机设备包括通过系统总线连接的处理器、存储器、网络接口、显示屏和输入装置。其中，该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统和计算机程序。该内存储器为非易失性存储介质中的操作系统和计算机程序的运行提供环境。该计算机设备的网络接口用于与外部的终端通过网络连接通信。该计算机程序被处理器执行时以实现一种基于布隆过滤器的锚点筛选方法。该计算机设备的显示屏可以是液晶显示屏或者电子墨水显示屏，该计算机设备的输入装置可以是显示屏上覆盖的触摸层，也可以是计算机设备外壳上设置的按键、轨迹球或触控板，还可以是外接的键盘、触控板或鼠标等。In one embodiment, a computer device is provided, and the computer device may be a terminal, and its internal structure diagram may be as shown in FIG. 5 . The computer equipment includes a processor, memory, a network interface, a display screen, and an input device connected by a system bus. Among them, the processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium, an internal memory. The nonvolatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the execution of the operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by a processor, implements a Bloom filter-based anchor point screening method. The display screen of the computer equipment may be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment may be a touch layer covered on the display screen, or a button, a trackball or a touchpad set on the shell of the computer equipment , or an external keyboard, trackpad, or mouse.

本领域技术人员可以理解，图5中示出的结构，仅仅是与本申请方案相关的部分结构的框图，并不构成对本申请方案所应用于其上的计算机设备的限定，具体的计算机设备可以包括比图中所示更多或更少的部件，或者组合某些部件，或者具有不同的部件布置。Those skilled in the art can understand that the structure shown in FIG. 5 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer equipment to which the solution of the present application is applied. Include more or fewer components than shown in the figures, or combine certain components, or have a different arrangement of components.

在一个实施例中，提供了一种计算机设备，包括存储器和处理器，该存储器存储有计算机程序，该处理器执行计算机程序时实现上述方法实施例中的步骤。In one embodiment, a computer device is provided, including a memory and a processor, where the memory stores a computer program, and the processor implements the steps in the above method embodiments when the processor executes the computer program.

在一个实施例中，提供了一种计算机可读存储介质，其上存储有计算机程序，计算机程序被处理器执行时实现上述方法实施例中的步骤。In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, and when the computer program is executed by a processor, implements the steps in the above method embodiments.

本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程，是可以通过计算机程序来指令相关的硬件来完成，所述的计算机程序可存储于一非易失性计算机可读取存储介质中，该计算机程序在执行时，可包括如上述各方法的实施例的流程。其中，本申请所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用，均可包括非易失性和/或易失性存储器。非易失性存储器可包括只读存储器(ROM)、可编程ROM(PROM)、电可编程ROM(EPROM)、电可擦除可编程ROM(EEPROM)或闪存。易失性存储器可包括随机存取存储器(RAM)或者外部高速缓冲存储器。作为说明而非局限，RAM以多种形式可得，诸如静态RAM(SRAM)、动态RAM(DRAM)、同步DRAM(SDRAM)、双数据率SDRAM(DDRSDRAM)、增强型SDRAM(ESDRAM)、同步链路(Synchlink)DRAM(SLDRAM)、存储器总线(Rambus)直接RAM(RDRAM)、直接存储器总线动态RAM(DRDRAM)、以及存储器总线动态RAM(RDRAM)等。Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing relevant hardware through a computer program, and the computer program can be stored in a non-volatile computer-readable storage In the medium, when the computer program is executed, it may include the processes of the above-mentioned method embodiments. Wherein, any reference to memory, storage, database or other medium used in the various embodiments provided in this application may include non-volatile and/or volatile memory. Nonvolatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous chain Road (Synchlink) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), etc.

以上实施例的各技术特征可以进行任意的组合，为使描述简洁，未对上述实施例中的各个技术特征所有可能的组合都进行描述，然而，只要这些技术特征的组合不存在矛盾，都应当认为是本说明书记载的范围。The technical features of the above embodiments can be combined arbitrarily. In order to make the description simple, all possible combinations of the technical features in the above embodiments are not described. However, as long as there is no contradiction in the combination of these technical features It is considered to be the range described in this specification.

以上所述实施例仅表达了本申请的几种实施方式，其描述较为具体和详细，但并不能因此而理解为对发明专利范围的限制。应当指出的是，对于本领域的普通技术人员来说，在不脱离本申请构思的前提下，还可以做出若干变形和改进，这些都属于本申请的保护范围。因此，本申请专利的保护范围应以所附权利要求为准。The above-mentioned embodiments only represent several embodiments of the present application, and the descriptions thereof are specific and detailed, but should not be construed as a limitation on the scope of the invention patent. It should be pointed out that for those skilled in the art, without departing from the concept of the present application, several modifications and improvements can be made, which all belong to the protection scope of the present application. Therefore, the scope of protection of the patent of the present application shall be subject to the appended claims.

Claims

1. an anchor point screening method based on Bloom filter, is characterized in that, described method comprises:

Obtain the query sequence to be compared, the reference sequence and multiple anchor points obtained by pre-positioning; the query sequence is a long-read DNA sequence;

Select the fragment of the query sequence between the first anchor point and the second anchor point as the query sequence fragment, and select the fragment of the reference sequence between the first anchor point and the second anchor point as the reference sequence fragment;

generating a plurality of consecutive overlapping reference sequence sub-segments according to the reference sequence fragment according to a preset length, and generating a plurality of consecutive overlapping query sequence sub-segments according to the query sequence fragment according to the preset length;

Indexing is established by a plurality of preset hash functions, and the sub-segment of the reference sequence is mapped to the bit vector of the Bloom filter;

Query whether the sub-segment of the query sequence exists in the reference sequence according to the index, and when the sub-segment of the query sequence does not exist in the reference sequence, determine that the sub-segment of the query sequence has not passed the screening;

Traverse all the query sequence sub-segments in the query sequence fragment, and count the cumulative value of the query sequence sub-fragments that have not passed the screening, and when the cumulative value is greater than a preset threshold, remove the first anchor point;

Iterates through all anchors until filtering of all said anchors is complete.

2 . The method according to claim 1 , wherein a plurality of consecutive overlapping reference sequence sub-segments are generated according to the reference sequence segment according to a preset length, and the query sequence segment is generated according to the preset length according to the query sequence segment. 3 . Before multiple consecutive overlapping sub-segments of the query sequence, also include:

The same parts at both ends of the query sequence fragment and the reference sequence fragment are deleted, and the query sequence fragment and the reference sequence fragment are updated.

3. The method according to claim 2, wherein, before deleting the same part at both ends of the query sequence fragment and the reference sequence fragment, and before updating the query sequence fragment and the reference sequence fragment, Also includes:

aligning the query sequence fragment and the reference sequence fragment;

Extend from both ends of the sequence to the middle, and align base by base;

The same portion at both ends of the query sequence fragment and the reference sequence fragment is obtained.

4. The method according to claim 3, wherein the two ends of the sequence are extended to the middle, and the base-by-base alignment is performed, comprising:

If the two bases on which the query sequence fragment and the reference sequence fragment are being aligned are the same, continue to extend to the next base;

If the query sequence fragment and the reference sequence fragment are different by two bases being aligned, the extension is stopped.

5. The method according to claim 4, wherein indexing is established by multiple preset hash functions, and the reference sequence sub-segment is mapped to the bit vector of the Bloom filter, comprising:

Obtain multiple hash values of each reference sequence sub-segment according to the preset multiple hash functions;

The value of the corresponding position of the bloom filter bit vector is set to 1 according to the hash value of the reference sequence sub-segment.

6 . The method according to claim 5 , wherein the query sequence sub-segment is inquired about whether the query sequence sub-segment exists in the reference sequence according to the index, when the query sequence sub-segment does not exist in the reference sequence. 7 . , judging that the sub-fragments of the query sequence have not passed the screening, including:

Obtain multiple hash values of each query sequence sub-segment according to multiple preset hash functions;

Perform a query at the corresponding position of the Bloom filter bit vector according to the hash value of the sub-segment of the query sequence;

If the values of the corresponding positions of the plurality of hash values of the sub-segment of the query sequence are all 1, it is judged that the sub-segment of the query sequence passes the screening;

If not all are 1, it is judged that the sub-fragment of the query sequence has not passed the screening.

7. The method according to any one of claims 1 to 6, wherein after removing the first anchor point, the method comprises:

Taking the second anchor point as the first anchor point, and taking the adjacent anchor point on the right side of the second anchor point as the second anchor point, the next round of anchor point screening is performed.

8. An anchor point screening device based on a Bloom filter, wherein the device comprises:

a sequence acquisition module, used to acquire a query sequence to be compared, a reference sequence and a plurality of anchor points obtained by pre-positioning; the query sequence is a long-read DNA sequence;

A fragment selection module is used to select a fragment of the query sequence between the first anchor point and the second anchor point as a query sequence fragment, and select the reference sequence between the first anchor point and the second anchor point. The fragments between are reference sequence fragments;

a sub-segment generation module, configured to generate a plurality of consecutive overlapping reference sequence sub-segments according to the reference sequence segment according to a preset length, and generate a plurality of consecutive overlapping query sequence sub-segments according to the query sequence segment according to the preset length ;

an index establishment module, for establishing an index through a plurality of preset hash functions, and mapping the reference sequence sub-segment into the bit vector of the Bloom filter;

A query module, configured to query whether the sub-segment of the query sequence exists in the reference sequence according to the index, and when the sub-segment of the query sequence does not exist in the reference sequence, determine that the sub-segment of the query sequence has failed filter;

An anchor point elimination module is used to traverse all query sequence sub-segments in the query sequence segment, and count the cumulative value of the query sequence sub-segments that have not passed the screening, and when the cumulative value is greater than a preset threshold, remove the first anchor;

The traversal module is used to traverse all the anchors until the screening of all the anchors is completed.

9. The apparatus according to claim 8, wherein the segment selection module is further configured to delete the same part at both ends of the query sequence segment and the reference sequence segment, and the query sequence segment and the reference sequence segment are Reference sequence fragments are updated.

10. A computer device comprising a memory and a processor, wherein the memory stores a computer program, wherein the processor implements the steps of the method according to any one of claims 1 to 7 when the processor executes the computer program .