[go: up one dir, main page]

CN117373538B - A biological sequence comparison method and system based on multi-thread computing - Google Patents

A biological sequence comparison method and system based on multi-thread computing Download PDF

Info

Publication number
CN117373538B
CN117373538B CN202311676549.8A CN202311676549A CN117373538B CN 117373538 B CN117373538 B CN 117373538B CN 202311676549 A CN202311676549 A CN 202311676549A CN 117373538 B CN117373538 B CN 117373538B
Authority
CN
China
Prior art keywords
sequence
data
kmer
rabbitfx
biological
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311676549.8A
Other languages
Chinese (zh)
Other versions
CN117373538A (en
Inventor
殷泽坤
陈冉
闫立峰
刘卫国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202311676549.8A priority Critical patent/CN117373538B/en
Publication of CN117373538A publication Critical patent/CN117373538A/en
Application granted granted Critical
Publication of CN117373538B publication Critical patent/CN117373538B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00ICT programming tools or database systems specially adapted for bioinformatics
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention belongs to the technical field of biological sequence comparison, and provides a biological sequence comparison method and system based on multi-thread calculation, wherein the technical scheme is as follows: based on the original OrderMinHash algorithm, optimizing single-thread models such as improving the computing efficiency by optimizing compiling parameters, calling RabbitFx library to quickly analyze FASTA files, optimizing access to exchange algorithm circulation sequence, converting data types by using bit-based functions, optimizing priority queues and the like, fully utilizing an AVX512 instruction set to improve the computing efficiency and the like, designing a multi-thread computing model, allowing control of the number of threads, ensuring that the speed ratio is basically proportional to the number of threads in a load balancing and other modes, and completing the optimization of the algorithm on a general multi-core computing platform. Solves the problem of low sequence comparison efficiency of the traditional method.

Description

Biological sequence comparison method and system based on multithread calculation
Technical Field
The invention belongs to the technical field of biological sequence comparison, and particularly relates to a biological sequence comparison method and system based on multithread calculation.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Sequence alignment is used as a core operation in bioinformatics flow, and a first used sketch method has a core idea of a local sensitive hash strategy (LSH, location Sensitive Hash). The method is a dimension reduction method and is carried out in two steps. First, sequences (or partial sequences) are generalized into a much smaller sketch than the original sequence, with the important information retained, to estimate the degree of similarity of the two sequences. Second, pairs of sequences that may be similar are found by directly comparing the skchs (without referencing the original sequence) or using the skchs as keys of a hash table. A more thorough, computationally expensive alignment process can then be used on the candidate pairs to refine the actual alignment. In the LSH method, the distance between the slots is used as a first approximation of the distance between the sequences. That is, two sequences that are very similar must have similar skips, and conversely, dissimilar sequences have dissimilar skips.
The Sketch method innovatively uses LSH and it is possible to separate pairs of sequences that do not have high quality alignments from pairs of sequences that may have high quality alignments with high probability and relatively low computational effort. Thus, LSH reduces overall computational requirements while not introducing many false negatives. Sequence alignment sequences with low similarity are filtered by means of a search and the comparison of the original sequences is replaced by a search of the comparison sequences.
In practice, the sequence alignment program uses minHash LSH to represent Jaccard similarity, rather than LSH to represent edit distance or alignment score. Although both techniques have proven useful in practice, they have a major drawback: neither Jaccard similarity nor Hamming similarity directly corresponds to the edit distance, which is the number of operations (mismatch, insert, delete) required to convert one string to another.
The original MinHash algorithm is an improvement of the MinHash algorithm based on Jaccard similarity, and the consideration of the relative sequence of base arrangement in the sequence is added due to the lack of practical LSH applied to editing distance, guillaume Mar ç ais et al. OrderMinHash is based on a Kmer method, the weight and the occurrence frequency of each Kmer element are added, hash operation is carried out on each element in a Kmer set generated by a sequence by using m hash functions, and the previous L small hash values are taken for each hash operation, so that the sketch of the sequence is constructed. And (5) enhancing the screening of sequences with low similarity, and obtaining the sequence similarity by comparing sequences of the search.
However, the traditional original OrderMinHash algorithm is a practical LSH for comparing sequence editing distances, and although the space complexity of sequence comparison is reduced by using a sketch method, the time complexity of the algorithm still does not reach the requirement in the face of data of GB level or even TB level, and the sequence comparison efficiency is low. The following problems are presented in detail:
(1) Reading and resolving sequences is inefficient: the original algorithm uses a kseq single-thread analysis sequence, cannot perform thread expansion, and cannot fully utilize cpu resources.
(2) The temporal complexity of the body algorithm to construct the sketch is too high: since in the OrderMinHash algorithm, M hash functions are used for each element in the sequence kmer set, and L values are taken for each hash function to construct a sketch, three layers of loops are nested, and the algorithm time becomes unacceptable for GB level and even TB level data.
(3) Space utilization is too low: in the sequence reading and the search construction process, a large amount of useless data is reserved, and the space cannot be released in time, so that a large memory space is required to process GB and even TB level data.
Disclosure of Invention
In order to solve at least one technical problem in the background art, the invention provides a biological sequence comparison method and a biological sequence comparison system based on multithread calculation, which are based on the original OrderMinHash algorithm, and complete the optimization of the algorithm on a general multi-core computing platform by optimizing compiling parameters, calling RabbitFx library to quickly analyze FASTA files, optimizing access memory in a cyclic sequence of a switching algorithm, optimizing conversion data types by using bit-wise functions, priority queues and the like, fully utilizing an AVX512 instruction set to improve the computing efficiency and the like by using a SIMD vector processing unit, designing a multithread computing model, allowing the control of the number of threads, ensuring that the acceleration ratio is basically proportional to the number of threads in a load balancing mode and the like.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a first aspect of the present invention provides a biological sequence alignment method based on multi-threaded computing, comprising the steps of:
reading a FASTA file;
introducing a RabbitFX library, analyzing a FASTA file by utilizing a multithread I/O model framework taking a producer-consumer as a core in the RabbitFX library, acquiring the name of a sequence when the FASTA file is read and analyzed, adding a task of constructing the sequence, constructing the sketch of a plurality of sequences simultaneously through multithread, and forming the OMH structure of the sequence by the acquired name of the sequence and the sketch of the sequence together;
and reading an OMH structure of the sequence, and carrying out similarity measurement on the name of the obtained sequence and the search of the sequence to obtain a comparison result of the sequence.
Further, the step of introducing the RabbitFX library, and analyzing the FASTA file by using a multithreading I/O model framework taking a producer-consumer as a core in the RabbitFX library comprises the following steps:
the producer reads data from the input FASTA file, acquires space from the data pool, and formats the data into data blocks;
each data block is pushed into a data queue;
the user receives the data block from the data queue, processes the data block by using a user-defined function, and releases the space occupied by the data block for recycling after the processing is finished.
Further, the RabbitFX employs a linked list strategy, if one chunk cannot contain a complete sequence, the RabbitFX will link to the next chunk until it is guaranteed that the complete sequence is accessed in one list.
Further, the step of constructing the sequence specifically includes:
replacing branch sentences by using a bitwise function, and converting the data type of the sequence;
selecting the minimum L value of each hash function of the kmer set element by adopting a priority queue, optimizing the cycle by exchanging the cycle sequence when the minimum L value of each hash function of the kmer set element is obtained,
the outer layer loops through the kmer set elements and the inner layer loops through the hash function set.
Further, the converting the data type of the sequence by using a bitwise function instead of a branch statement includes:
the biological sequence consists of A, G, C and T, and adopts character strings, and A, G, C and T are converted into two-bit binary numbers by using a bitwise function, and the kmer element is also converted into a uint unsigned integer type from a string type.
Further, the selecting the minimum L value of each hash function of the kmer set element by using the priority queue specifically includes:
setting a priority queue with the size L for each hash function, calculating the hash value of the current kmer set element, immediately judging whether the current kmer set element can enter the priority queue, namely comparing the hash value of the element with the maximum value of the current queue, popping the maximum value if the hash value is smaller than the maximum value, entering the queue, and directly traversing the next kmer set element if the hash value is larger than the maximum value until all the kmer set elements are traversed.
Further, the data type of the sequence is converted using the AVX2 or AVX512 instruction set.
Further, when the minimum L value of each hash function of the kmer set element is selected, an AVX512 instruction set is used, a hash register array is set, the array size is set to be 8, the kmer elements are divided into a plurality of rows and 8 columns, the hash value is calculated by taking 8 as a group of input loops, the input loops are stored into the corresponding registers, whether the hash value enters a priority queue is judged, and meanwhile, direct expansion loops are selected to reduce jump instructions.
Further, the FASTA file contains one or more pieces of biological sequence information, each piece of biological sequence information is composed of two parts of a biological sequence specification and a base arrangement, wherein the biological sequence specification is taken as a single line, takes ">" as a starter, and simultaneously takes ">" as a starter of a single biological sequence.
In a second aspect, the present invention provides a gene sequence alignment system based on multithread calculation, comprising:
the file reading module is used for reading the FASTA file;
the file analysis module is used for introducing a RabbitFX library and analyzing the FASTA file by utilizing a multithread I/O model framework taking a producer-consumer as a core in the RabbitFX library;
the multithread construction module is used for acquiring the name of the sequence when the FASTA file is read and analyzed, adding a task of constructing the sequence, constructing the multiple sequences of the task at the same time through multithread, and forming the OMH structure body of the sequence by the acquired name of the sequence and the task of the sequence;
and a sequence comparison module for reading the OMH structure of the sequence, and performing similarity measurement on the obtained name of the sequence and the sketch of the sequence to obtain a sequence comparison result.
Compared with the prior art, the invention has the beneficial effects that:
1. in the invention, besides analyzing the file, the thread number is read to perform the thread expansibility of the parameter control test algorithm, and the task of constructing the sky of the sequence is added while the biological sequence is read and analyzed, thereby realizing the simultaneous construction of multiple sky of the sequence by multithreading and greatly reducing the time for constructing the sky of the whole file sequence.
2. The invention uses unit_64 type to replace string type to optimize storage and access by mapping and converting data type according to bit function, which not only can reduce memory storage, but also can unify elements, thereby facilitating subsequent vectorization access elements.
3. The invention can optimize the circulation by exchanging the circulation sequence, the outer circulation traverses the kmer set elements, the inner circulation traverses the hash function set, and the exchanged circulation only needs to visit 1 time for each kmer set element, thereby reducing the access time of the cache and causing no other additional expenditure.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a diagram of a RabbitFX library working model provided by an embodiment of the invention;
FIG. 2 is a graph of the Genome1 (MB level) algorithm optimization provided by an embodiment of the present invention;
FIG. 3 is a block diagram of the Genome1 (MB level) thread extensibility provided by an embodiment of the present invention;
FIG. 4 is a block diagram of the Genome2 (GB level) thread extensibility provided by an embodiment of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
The whole idea of the invention
Based on the original OrderMinHash algorithm, optimizing single-thread models such as improving the computing efficiency by optimizing compiling parameters, calling RabbitFx library to quickly analyze FASTA files, optimizing access to exchange algorithm circulation sequence, converting data types by using bit-based functions, optimizing priority queues and the like, fully utilizing an AVX512 instruction set to improve the computing efficiency and the like, designing a multi-thread computing model, allowing control of the number of threads, ensuring that the speed ratio is basically proportional to the number of threads in a load balancing and other modes, and completing the optimization of the algorithm on a general multi-core computing platform. Through algorithm hotspot function analysis, the optimization of the invention can be divided into two directions, namely optimization of sequence reading analysis and optimization of a sketch structure.
Example 1
The embodiment provides a sequence-based alignment optimization method, which comprises the following steps:
step 1: reading a FASTA file;
in this embodiment, the FASTA file contains one or more pieces of biological sequence information each consisting of two parts of a biological sequence specification and a base arrangement. Biological sequence descriptions are taken as a single row, the ">" is taken as a starter of a single biological sequence, the information such as the name of the sequence is followed, the next row is started to be arranged by bases, the number of bases in a row is not more than 80, and the number of bases in a row is not limited.
Step 2: analyzing the FASTA file, and comparing the gene sequences based on the analysis result;
the original OrderMinHash algorithm uses kseq to analyze FASTA files, so that the efficiency is low;
referring to fig. 1, in order to achieve fast reading, parsing of FASTA files and full use of memory space, the present rabitifx library is directly introduced, and a multithread I/O model framework with producer-consumer as a core in the rabitifx library is fully utilized: the producer reads data from the incoming FASTA file, takes space from the data pool, and formats the data into data blocks. Each data block is then pushed into a data queue. The user receives the data block from the data queue, processes the data block by using a user-defined function, and releases the space occupied by the data block for recycling after the processing is finished.
Also, because the number of base lines in the FASTA file is not limited, it may result in a sequence that exceeds a block of data, and therefore, the robbbitfx employs a linked list strategy, and if a block cannot contain a complete sequence, the robbbitfx will link to the next block until it is ensured that the complete sequence can be accessed in a list.
Besides analyzing the file, the thread number is read to perform the thread expansibility of the parameter control test algorithm, and the task of constructing the sky of the sequence is added while the biological sequence is read and analyzed, so that the sky of a plurality of sequences is constructed simultaneously by multithreading, and the time for constructing the sky of the whole file sequence is greatly reduced.
The method uses a multithread model to read and analyze the FASTA file, the sequence order in the original file is changed from the read sequence, although the distance between the sequences can be directly compared through the skip, it cannot be clearly determined which two sequences are compared, if the read sequence of the sequence is regulated, the read speed of the sequence can be reduced, contrary to the original purpose of fast reading and analyzing the file, therefore, the sequence of the skip is clearly defined by defining the OMH structure, namely, the name of the sequence is obtained while the sequence is read and analyzed, and the OMH structure of the sequence is formed together with the sequence skip.
And reading an OMH structure body of the sequence, obtaining the name of the sequence and the search for similarity comparison, and finally outputting a distance matrix of the sequence name.
Program execution command: relative location of FASTA file under current directory/executable file name-threads number,
input: FASFTA file, output: distance matrix of sequences in FASTA file.
In step 2, the process of optimizing the sketch structure includes:
step 201: converting the data type, and replacing branch sentences by using a bitwise function;
the original OrderMinHash algorithm method for solving the kmer set and the occurrence number of the biological sequence is to construct a disordered key value pair set, take kmer elements as keywords, take the occurrence number as key values and traverse the biological sequence through a window with the size of K. Because the biological sequence is composed of A, G, C, T and is a character string, the single kmer set element not only occupies more space in the memory, but also is not fixed, and is difficult to access in a vectorization way.
Therefore, the method can be represented by using two-bit binary numbers for numbering A, G, C, T, so that the memory storage is greatly reduced, and the most common method is to compare each read character 1 to 4 times by using a switch statement to determine the number, but the method is time-consuming, introduces a large number of branch statements, and reduces the efficiency of cpu to calculate data.
By observation, bits 2 and 3 of the binary representation of A, G, C, T are found to exactly meet the numbering requirements, so the direct use of the bitwise function in this embodiment converts A, G, C, T to a two-bit binary number, and the kmer element will also convert from string type to uint unsigned integer type.
The unit_64 type is used for replacing string type to optimize storage and access through the proposed mapping conversion data type according to the bit function, so that not only can the memory storage be reduced, but also elements can be unified, and the subsequent vectorization access elements are convenient.
Step 202: taking the minimum L value by using a priority queue;
when the original OrderMinHash algorithm selects the minimum L value of each hash function of a kmer set element, the hash value of each kmer set element is usually calculated and stored for each hash function, all hash values are sorted after all hash values are calculated, and then the first L hash values are selected.
Therefore, in this embodiment, selecting the priority queue to select the minimum L value of each hash function of the kmer set element specifically includes: setting a priority queue with the size L for each hash function, immediately judging whether the hash value of the current kmer set element can enter the priority queue or not, namely comparing the hash value of the element with the maximum value of the current queue, popping up the maximum value if the hash value is smaller than the maximum value, entering the queue, and directly traversing the next kmer set element if the hash value is larger than the maximum value until all the kmer set elements are traversed. Compared with the former method, the method only needs to maintain 500 priority queues, reduces the memory space occupied by the hash value, and the query speed of the priority queues is usually faster when the size L of the priority queues is 2-5.
Step 203: cycle optimization, exchanging a cycle sequence;
the original OrderMinHash algorithm obtains the first L minimum values of each hash function of the kmer set, namely, the first L minimum values are obtained by circularly traversing the hash function set on the outer layer, circularly traversing the kmer set elements on the inner layer, maintaining M priority queues, but for each element of the kmer set, M times of searching in a memory are needed, the cache access time is increased and wasted, so that the circulation can be optimized in a manner of exchanging circulation sequences, the outer layer circularly traversing the kmer set elements, the inner layer circularly traversing the hash function set, and the exchanged circulation only needs to access 1 time for each kmer set element, thereby reducing the cache access time and causing no other additional expenditure.
In addition, in this embodiment, the SIMD vector processing unit is used to improve the calculation efficiency, and specifically includes:
1. the compilation parameters were optimized using-O2 or-O3, using-march=native.
2. After converting string type data into unit_64 type data by using an instruction set, 4 or 8 data can be accessed by 1 vector unit if using an AVX2 or AVX512 instruction set because the hash function used is mapped and is also unit_64 type data, thereby greatly improving the calculation efficiency.
In this embodiment, vectorizing the step of selecting the previous L hash value of each hash function, using the AVX512 instruction set, setting a hash register array, setting the array size to 8, dividing kmer elements into 8 rows and 8 columns, calculating hash values with 8 input loops as a group, storing the calculated hash values in the corresponding registers, and determining whether the hash values enter the priority queue, and selecting a direct expansion loop to reduce the jump instruction because the number of loops is smaller than 8.
Through verification, in the traditional algorithm, due to improper method selection, unreasonable task allocation and the like in a sequence reading and analyzing module and a sequence sketch construction module, the algorithm efficiency is low and the thread expansibility is poor.
As shown in fig. 2-4, on MB-level data, the improved algorithm efficiency is about 50 times of the original algorithm efficiency, the thread expansion efficiency can be further improved to about 900 times, on GB-level-scale data, sequencing time can be controlled within several hours or even within one hour, rapid and efficient genome data processing is realized, and the practical scientific research problem is solved.
Example two
The embodiment provides a gene sequence comparison system based on multithread calculation, which comprises the following steps:
the file reading module is used for reading the FASTA file;
the file analysis module is used for introducing a RabbitFX library and analyzing the FASTA file by utilizing a multithread I/O model framework taking a producer-consumer as a core in the RabbitFX library;
the multithread construction module is used for acquiring the name of the sequence when the FASTA file is read and analyzed, adding a task of constructing the sequence, constructing the multiple sequences of the task at the same time through multithread, and forming the OMH structure body of the sequence by the acquired name of the sequence and the task of the sequence;
and a sequence comparison module for reading the OMH structure of the sequence, and performing similarity measurement on the obtained name of the sequence and the sketch of the sequence to obtain a sequence comparison result.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1.一种基于多线程计算的生物序列比对方法,其特征在于,包括如下步骤:1. A biological sequence alignment method based on multi-thread calculation, which is characterized by including the following steps: 读取FASTA文件;Read FASTA files; 引入RabbitFX库,利用RabbitFX库中以生产者-消费者为核心的多线程I/O模型框架,对FASTA文件进行解析,在读取、解析FASTA文件时获取序列的名称,同时添加构造该序列的sketch任务,通过多线程同时构造多条序列的sketch,将获取的序列的名称与序列的sketch共同构成该序列的OMH结构体;Introduce the RabbitFX library, use the multi-threaded I/O model framework with the producer-consumer as the core in the RabbitFX library to parse the FASTA file, obtain the name of the sequence when reading and parsing the FASTA file, and add the sequence to construct the sequence. The sketch task uses multi-threading to construct sketches of multiple sequences at the same time, and the name of the obtained sequence and the sketch of the sequence together form the OMH structure of the sequence; 所述引入RabbitFX库,利用RabbitFX库中以生产者-消费者为核心的多线程I/O模型框架,对FASTA文件进行解析,包括:The RabbitFX library is introduced and the multi-threaded I/O model framework with producer-consumer as the core in the RabbitFX library is used to parse the FASTA file, including: 生产者从输入FASTA文件中读取数据,从数据池中获取空间,并将数据格式化为数据块;The producer reads data from the input FASTA file, obtains space from the data pool, and formats the data into data blocks; 每个数据块被推送到数据队列中;Each data block is pushed to the data queue; 使用者从数据队列接收数据块,并使用用户定义的函数处理数据块,处理完后释放数据块所占有的空间进行循环利用;The user receives data blocks from the data queue and uses user-defined functions to process the data blocks. After processing, the space occupied by the data blocks is released for recycling; 构造序列的sketch时,具体包括:When constructing the sketch of the sequence, the details include: 采用按位函数代替分支语句,转换序列的数据类型;Use bitwise functions to replace branch statements and convert the data type of the sequence; 采用优先队列选取kmer集合元素每个哈希函数的最小L值,在求kmer集合元素每个哈希函数的最小L值时,通过交换循环顺序的方式对循环进行优化,The priority queue is used to select the minimum L value of each hash function of the kmer set element. When finding the minimum L value of each hash function of the kmer set element, the loop is optimized by exchanging the loop order. 外层循环遍历kmer集合元素,内层循环遍历哈希函数集;The outer loop traverses the kmer set elements, and the inner loop traverses the hash function set; 读取序列的OMH结构体,将得到的序列的名称和序列的sketch进行相似度度量,得到序列的比对结果。Read the OMH structure of the sequence, measure the similarity between the name of the sequence obtained and the sketch of the sequence, and obtain the sequence comparison result. 2.如权利要求1所述的一种基于多线程计算的生物序列比对方法,其特征在于,RabbitFX采用链表策略,如果一个块不能包含完整的序列,RabbitFX将链接到下一个块,直到以确保在一个列表中访问完整的序列。2. A biological sequence comparison method based on multi-thread calculation as claimed in claim 1, characterized in that RabbitFX adopts a linked list strategy. If a block cannot contain a complete sequence, RabbitFX will link to the next block until Ensures access to complete sequences in a list. 3.如权利要求1所述的一种基于多线程计算的生物序列比对方法,其特征在于,转换序列的数据类型时,使用AVX2或AVX512指令集转换。3. A biological sequence comparison method based on multi-thread calculation according to claim 1, characterized in that when converting the data type of the sequence, AVX2 or AVX512 instruction set conversion is used. 4.如权利要求1所述的一种基于多线程计算的生物序列比对方法,其特征在于,所述采用按位函数代替分支语句,转换序列的数据类型,包括:4. A biological sequence comparison method based on multi-thread calculation as claimed in claim 1, characterized in that the use of bitwise functions instead of branch statements to convert the data type of the sequence includes: 生物序列由A、G、C、T组成,采用字符类字符串,使用按位函数将A、G、C、T转化为两位二进制数,kmer元素也将由string类型转化为uint无符号整数类型。The biological sequence consists of A, G, C, and T, using character strings. Bitwise functions are used to convert A, G, C, and T into two-digit binary numbers. The kmer element will also be converted from string type to uint unsigned integer type. . 5.如权利要求1所述的一种基于多线程计算的生物序列比对方法,其特征在于,所述采用优先队列选取kmer集合元素每个哈希函数的最小L值,具体包括:5. A biological sequence comparison method based on multi-thread calculation as claimed in claim 1, characterized in that the priority queue is used to select the minimum L value of each hash function of the kmer set element, specifically including: 针对每一个哈希函数设置一个大小为L的优先队列,计算完当前kmer集合元素的哈希值,立刻判断能否进入该优先队列,即将元素哈希值与当前队列的最大值比较,若小于最大值则弹出最大值,该哈希值进入队列,若大于最大值则直接遍历下一个kmer集合元素,直至遍历完所有kmer集合元素。Set a priority queue of size L for each hash function. After calculating the hash value of the current kmer set element, immediately determine whether it can enter the priority queue. That is, compare the element hash value with the maximum value of the current queue. If it is less than The maximum value pops up, and the hash value enters the queue. If it is greater than the maximum value, the next kmer set element is traversed directly until all kmer set elements are traversed. 6.如权利要求1所述的一种基于多线程计算的生物序列比对方法,其特征在于,在选取kmer集合元素每个哈希函数的最小L值时,使用AVX512指令集,设置hash寄存器数组,设数组大小为8,同时将kmer元素划分成多行8列,以8个为一组投入循环计算哈希值,存入对应寄存器,并判断该哈希值是否进入优先队列,同时,选择直接展开循环来减少跳转指令。6. A biological sequence comparison method based on multi-thread calculation as claimed in claim 1, characterized in that when selecting the minimum L value of each hash function of the kmer set element, the AVX512 instruction set is used to set the hash register Array, assuming the array size is 8, divide the kmer elements into multiple rows and 8 columns, put 8 elements into a loop to calculate the hash value, store it in the corresponding register, and determine whether the hash value enters the priority queue. At the same time, Choose to directly unroll the loop to reduce jump instructions. 7.如权利要求1所述的一种基于多线程计算的生物序列比对方法,其特征在于,FASTA文件包含一条或多条生物序列信息,每一条生物序列信息由生物序列说明和碱基排列两部分组成,其中生物序列说明单独作为一行,以“>”作为起始符,同时“>”也是单条生物序列的起始符。7. A biological sequence comparison method based on multi-thread calculation as claimed in claim 1, characterized in that the FASTA file contains one or more pieces of biological sequence information, and each piece of biological sequence information consists of a biological sequence description and a base arrangement. It consists of two parts. The biological sequence description is a separate line, with ">" as the starting character. At the same time, ">" is also the starting character of a single biological sequence. 8.一种基于多线程计算的基因序列比对系统,其特征在于,包括:8. A gene sequence comparison system based on multi-threaded computing, characterized by including: 文件读取模块,用于读取FASTA文件;File reading module, used to read FASTA files; 文件解析模块,用于引入RabbitFX库,利用RabbitFX库中以生产者-消费者为核心的多线程I/O模型框架,对FASTA文件进行解析;The file parsing module is used to introduce the RabbitFX library and use the multi-threaded I/O model framework with producer-consumer as the core in the RabbitFX library to parse FASTA files; 所述引入RabbitFX库,利用RabbitFX库中以生产者-消费者为核心的多线程I/O模型框架,对FASTA文件进行解析,包括:The RabbitFX library is introduced and the multi-threaded I/O model framework with producer-consumer as the core in the RabbitFX library is used to parse the FASTA file, including: 生产者从输入FASTA文件中读取数据,从数据池中获取空间,并将数据格式化为数据块;The producer reads data from the input FASTA file, obtains space from the data pool, and formats the data into data blocks; 每个数据块被推送到数据队列中;Each data block is pushed to the data queue; 使用者从数据队列接收数据块,并使用用户定义的函数处理数据块,处理完后释放数据块所占有的空间进行循环利用;The user receives data blocks from the data queue and uses user-defined functions to process the data blocks. After processing, the space occupied by the data blocks is released for recycling; 多线程构造模块,用于在读取、解析FASTA文件时获取序列的名称,同时添加构造该序列的sketch任务,通过多线程同时构造多条序列的sketch,将获取的序列的名称与序列的sketch共同构成该序列的OMH结构体;The multi-thread construction module is used to obtain the name of the sequence when reading and parsing the FASTA file, and at the same time add the sketch task of constructing the sequence. It constructs the sketches of multiple sequences at the same time through multi-threads, and combines the obtained sequence names with the sequence sketches. Together they constitute the OMH structure of the sequence; 构造序列的sketch时,具体包括:When constructing the sketch of the sequence, the details include: 采用按位函数代替分支语句,转换序列的数据类型;Use bitwise functions to replace branch statements and convert the data type of the sequence; 采用优先队列选取kmer集合元素每个哈希函数的最小L值,在求kmer集合元素每个哈希函数的最小L值时,通过交换循环顺序的方式对循环进行优化,The priority queue is used to select the minimum L value of each hash function of the kmer set element. When finding the minimum L value of each hash function of the kmer set element, the loop is optimized by exchanging the loop order. 外层循环遍历kmer集合元素,内层循环遍历哈希函数集;The outer loop traverses the kmer set elements, and the inner loop traverses the hash function set; 序列比对模块,用于读取序列的OMH结构体,将得到的序列的名称和序列的sketch进行相似度度量,得到序列的比对结果。The sequence alignment module is used to read the OMH structure of the sequence, measure the similarity between the name of the sequence obtained and the sketch of the sequence, and obtain the sequence alignment result.
CN202311676549.8A 2023-12-08 2023-12-08 A biological sequence comparison method and system based on multi-thread computing Active CN117373538B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311676549.8A CN117373538B (en) 2023-12-08 2023-12-08 A biological sequence comparison method and system based on multi-thread computing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311676549.8A CN117373538B (en) 2023-12-08 2023-12-08 A biological sequence comparison method and system based on multi-thread computing

Publications (2)

Publication Number Publication Date
CN117373538A CN117373538A (en) 2024-01-09
CN117373538B true CN117373538B (en) 2024-03-19

Family

ID=89398856

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311676549.8A Active CN117373538B (en) 2023-12-08 2023-12-08 A biological sequence comparison method and system based on multi-thread computing

Country Status (1)

Country Link
CN (1) CN117373538B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119007838B (en) * 2024-10-24 2024-12-17 山东大学 Grouping-based protein sequence clustering method and system
CN119720231B (en) * 2024-11-05 2025-12-02 中国科学院信息工程研究所 Evidence storage methods and systems for large-scale risk data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016168753A1 (en) * 2015-04-17 2016-10-20 Battelle Memorial Institute Biosequence-based approach to analyzing binaries
CN106096332A (en) * 2016-06-28 2016-11-09 深圳大学 Parallel fast matching method and system thereof towards the DNA sequence stored
CN106778079A (en) * 2016-11-22 2017-05-31 重庆邮电大学 A kind of DNA sequence dna k mer frequency statistics methods based on MapReduce
WO2018000174A1 (en) * 2016-06-28 2018-01-04 深圳大学 Rapid and parallelstorage-oriented dna sequence matching method and system thereof
CN115641911A (en) * 2022-10-19 2023-01-24 哈尔滨工业大学 A Method for Overlap Detection Between Sequences

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9892237B2 (en) * 2014-02-06 2018-02-13 Reference Genomics, Inc. System and method for characterizing biological sequence data through a probabilistic data structure

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016168753A1 (en) * 2015-04-17 2016-10-20 Battelle Memorial Institute Biosequence-based approach to analyzing binaries
CN106096332A (en) * 2016-06-28 2016-11-09 深圳大学 Parallel fast matching method and system thereof towards the DNA sequence stored
WO2018000174A1 (en) * 2016-06-28 2018-01-04 深圳大学 Rapid and parallelstorage-oriented dna sequence matching method and system thereof
CN106778079A (en) * 2016-11-22 2017-05-31 重庆邮电大学 A kind of DNA sequence dna k mer frequency statistics methods based on MapReduce
CN115641911A (en) * 2022-10-19 2023-01-24 哈尔滨工业大学 A Method for Overlap Detection Between Sequences

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
局部序列比对算法及其并行加速研究进展;刘阳;王小磊;李江域;毛逸清;赵东升;;军事医学(第07期);全文 *
蛋白质序列比对算法在众核结构上的并行优化;叶笑春;林伟;范东睿;张浩;;软件学报(第12期);全文 *

Also Published As

Publication number Publication date
CN117373538A (en) 2024-01-09

Similar Documents

Publication Publication Date Title
CN117373538B (en) A biological sequence comparison method and system based on multi-thread computing
CN106133721B (en) Parallel decision tree processor architecture
US20110066806A1 (en) System and method for memory bandwidth friendly sorting on multi-core architectures
CN102667715A (en) Data processing device, data processing method, program conversion processing device, and program conversion processing method, program conversion processing device, data processing device, program conversion processing method, and data processing me
US7653619B1 (en) Integrated search engine devices having pipelined search and tree maintenance sub-engines therein that support variable tree height
CN114064551A (en) CPU + GPU heterogeneous high-concurrency sequence alignment calculation acceleration method
CN108920412B (en) Algorithm Automatic Tuning Method for Heterogeneous Computer Architecture
CN117573316A (en) Optimization methods, processing methods, systems and storage media for business calculation graphs
CN118132078A (en) Program efficiency optimization method and system based on multi-compilation fusion optimization
CN119806644B (en) Logic design system for accelerating controller hardware
CN110111837A (en) The searching method and system of protein similarity based on two stages structure alignment
CN118260765B (en) A code cloning detection method and device for power data security monitoring system
US11366646B2 (en) Method and apparatus for predicting and scheduling copy instruction for software pipelined loops
Wang et al. Accelerating the Smith-Waterman algorithm by GPU for high-throughput sequence alignment
CN112100446B (en) Search method, readable storage medium, and electronic device
CN116168765B (en) Gene sequence generation method and system based on improved stroboemer
US20040260708A1 (en) Array compression method
Qiu et al. Exploring scalable parallelization for edit Distance-based Motif Search
Dai et al. Synchronous parallel block coordinate descent method for nonsmooth convex function minimization
JP3639480B2 (en) Similar data retrieval method, similar data retrieval device, and similar data retrieval program recording medium
CN113849180B (en) Automatic compiling vectorization method based on rearrangement instruction fusion
CN118708192B (en) A high-performance sparse computing programming framework implementation method and system
CN120030239B (en) Recommendation system calculation method and system for data redundancy perception
Soman et al. Efficient Discrete Range Searching primitives on the GPU with applications
CN119783775A (en) Automatic tuning method, device, storage medium, and program product of code generator

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant