CN117373538B

CN117373538B - A biological sequence comparison method and system based on multi-thread computing

Info

Publication number: CN117373538B
Application number: CN202311676549.8A
Authority: CN
Inventors: 殷泽坤; 陈冉; 闫立峰; 刘卫国
Original assignee: Shandong University
Current assignee: Shandong University
Priority date: 2023-12-08
Filing date: 2023-12-08
Publication date: 2024-03-19
Anticipated expiration: 2043-12-08
Also published as: CN117373538A

Abstract

The invention belongs to the technical field of biological sequence comparison, and provides a biological sequence comparison method and system based on multi-thread calculation, wherein the technical scheme is as follows: based on the original OrderMinHash algorithm, optimizing single-thread models such as improving the computing efficiency by optimizing compiling parameters, calling RabbitFx library to quickly analyze FASTA files, optimizing access to exchange algorithm circulation sequence, converting data types by using bit-based functions, optimizing priority queues and the like, fully utilizing an AVX512 instruction set to improve the computing efficiency and the like, designing a multi-thread computing model, allowing control of the number of threads, ensuring that the speed ratio is basically proportional to the number of threads in a load balancing and other modes, and completing the optimization of the algorithm on a general multi-core computing platform. Solves the problem of low sequence comparison efficiency of the traditional method.

Description

Biological sequence comparison method and system based on multithread calculation

Technical Field

The invention belongs to the technical field of biological sequence comparison, and particularly relates to a biological sequence comparison method and system based on multithread calculation.

Background

The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.

Sequence alignment is used as a core operation in bioinformatics flow, and a first used sketch method has a core idea of a local sensitive hash strategy (LSH, location Sensitive Hash). The method is a dimension reduction method and is carried out in two steps. First, sequences (or partial sequences) are generalized into a much smaller sketch than the original sequence, with the important information retained, to estimate the degree of similarity of the two sequences. Second, pairs of sequences that may be similar are found by directly comparing the skchs (without referencing the original sequence) or using the skchs as keys of a hash table. A more thorough, computationally expensive alignment process can then be used on the candidate pairs to refine the actual alignment. In the LSH method, the distance between the slots is used as a first approximation of the distance between the sequences. That is, two sequences that are very similar must have similar skips, and conversely, dissimilar sequences have dissimilar skips.

The Sketch method innovatively uses LSH and it is possible to separate pairs of sequences that do not have high quality alignments from pairs of sequences that may have high quality alignments with high probability and relatively low computational effort. Thus, LSH reduces overall computational requirements while not introducing many false negatives. Sequence alignment sequences with low similarity are filtered by means of a search and the comparison of the original sequences is replaced by a search of the comparison sequences.

In practice, the sequence alignment program uses minHash LSH to represent Jaccard similarity, rather than LSH to represent edit distance or alignment score. Although both techniques have proven useful in practice, they have a major drawback: neither Jaccard similarity nor Hamming similarity directly corresponds to the edit distance, which is the number of operations (mismatch, insert, delete) required to convert one string to another.

The original MinHash algorithm is an improvement of the MinHash algorithm based on Jaccard similarity, and the consideration of the relative sequence of base arrangement in the sequence is added due to the lack of practical LSH applied to editing distance, guillaume Mar ç ais et al. OrderMinHash is based on a Kmer method, the weight and the occurrence frequency of each Kmer element are added, hash operation is carried out on each element in a Kmer set generated by a sequence by using m hash functions, and the previous L small hash values are taken for each hash operation, so that the sketch of the sequence is constructed. And (5) enhancing the screening of sequences with low similarity, and obtaining the sequence similarity by comparing sequences of the search.

However, the traditional original OrderMinHash algorithm is a practical LSH for comparing sequence editing distances, and although the space complexity of sequence comparison is reduced by using a sketch method, the time complexity of the algorithm still does not reach the requirement in the face of data of GB level or even TB level, and the sequence comparison efficiency is low. The following problems are presented in detail:

(1) Reading and resolving sequences is inefficient: the original algorithm uses a kseq single-thread analysis sequence, cannot perform thread expansion, and cannot fully utilize cpu resources.

(2) The temporal complexity of the body algorithm to construct the sketch is too high: since in the OrderMinHash algorithm, M hash functions are used for each element in the sequence kmer set, and L values are taken for each hash function to construct a sketch, three layers of loops are nested, and the algorithm time becomes unacceptable for GB level and even TB level data.

(3) Space utilization is too low: in the sequence reading and the search construction process, a large amount of useless data is reserved, and the space cannot be released in time, so that a large memory space is required to process GB and even TB level data.

Disclosure of Invention

In order to solve at least one technical problem in the background art, the invention provides a biological sequence comparison method and a biological sequence comparison system based on multithread calculation, which are based on the original OrderMinHash algorithm, and complete the optimization of the algorithm on a general multi-core computing platform by optimizing compiling parameters, calling RabbitFx library to quickly analyze FASTA files, optimizing access memory in a cyclic sequence of a switching algorithm, optimizing conversion data types by using bit-wise functions, priority queues and the like, fully utilizing an AVX512 instruction set to improve the computing efficiency and the like by using a SIMD vector processing unit, designing a multithread computing model, allowing the control of the number of threads, ensuring that the acceleration ratio is basically proportional to the number of threads in a load balancing mode and the like.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

a first aspect of the present invention provides a biological sequence alignment method based on multi-threaded computing, comprising the steps of:

reading a FASTA file;

introducing a RabbitFX library, analyzing a FASTA file by utilizing a multithread I/O model framework taking a producer-consumer as a core in the RabbitFX library, acquiring the name of a sequence when the FASTA file is read and analyzed, adding a task of constructing the sequence, constructing the sketch of a plurality of sequences simultaneously through multithread, and forming the OMH structure of the sequence by the acquired name of the sequence and the sketch of the sequence together;

and reading an OMH structure of the sequence, and carrying out similarity measurement on the name of the obtained sequence and the search of the sequence to obtain a comparison result of the sequence.

Further, the step of introducing the RabbitFX library, and analyzing the FASTA file by using a multithreading I/O model framework taking a producer-consumer as a core in the RabbitFX library comprises the following steps:

the producer reads data from the input FASTA file, acquires space from the data pool, and formats the data into data blocks;

each data block is pushed into a data queue;

the user receives the data block from the data queue, processes the data block by using a user-defined function, and releases the space occupied by the data block for recycling after the processing is finished.

Further, the RabbitFX employs a linked list strategy, if one chunk cannot contain a complete sequence, the RabbitFX will link to the next chunk until it is guaranteed that the complete sequence is accessed in one list.

Further, the step of constructing the sequence specifically includes:

replacing branch sentences by using a bitwise function, and converting the data type of the sequence;

selecting the minimum L value of each hash function of the kmer set element by adopting a priority queue, optimizing the cycle by exchanging the cycle sequence when the minimum L value of each hash function of the kmer set element is obtained,

the outer layer loops through the kmer set elements and the inner layer loops through the hash function set.

Further, the converting the data type of the sequence by using a bitwise function instead of a branch statement includes:

the biological sequence consists of A, G, C and T, and adopts character strings, and A, G, C and T are converted into two-bit binary numbers by using a bitwise function, and the kmer element is also converted into a uint unsigned integer type from a string type.

Further, the selecting the minimum L value of each hash function of the kmer set element by using the priority queue specifically includes:

setting a priority queue with the size L for each hash function, calculating the hash value of the current kmer set element, immediately judging whether the current kmer set element can enter the priority queue, namely comparing the hash value of the element with the maximum value of the current queue, popping the maximum value if the hash value is smaller than the maximum value, entering the queue, and directly traversing the next kmer set element if the hash value is larger than the maximum value until all the kmer set elements are traversed.

Further, the data type of the sequence is converted using the AVX2 or AVX512 instruction set.

Further, when the minimum L value of each hash function of the kmer set element is selected, an AVX512 instruction set is used, a hash register array is set, the array size is set to be 8, the kmer elements are divided into a plurality of rows and 8 columns, the hash value is calculated by taking 8 as a group of input loops, the input loops are stored into the corresponding registers, whether the hash value enters a priority queue is judged, and meanwhile, direct expansion loops are selected to reduce jump instructions.

Further, the FASTA file contains one or more pieces of biological sequence information, each piece of biological sequence information is composed of two parts of a biological sequence specification and a base arrangement, wherein the biological sequence specification is taken as a single line, takes ">" as a starter, and simultaneously takes ">" as a starter of a single biological sequence.

In a second aspect, the present invention provides a gene sequence alignment system based on multithread calculation, comprising:

the file reading module is used for reading the FASTA file;

the file analysis module is used for introducing a RabbitFX library and analyzing the FASTA file by utilizing a multithread I/O model framework taking a producer-consumer as a core in the RabbitFX library;

the multithread construction module is used for acquiring the name of the sequence when the FASTA file is read and analyzed, adding a task of constructing the sequence, constructing the multiple sequences of the task at the same time through multithread, and forming the OMH structure body of the sequence by the acquired name of the sequence and the task of the sequence;

and a sequence comparison module for reading the OMH structure of the sequence, and performing similarity measurement on the obtained name of the sequence and the sketch of the sequence to obtain a sequence comparison result.

Compared with the prior art, the invention has the beneficial effects that:

1. in the invention, besides analyzing the file, the thread number is read to perform the thread expansibility of the parameter control test algorithm, and the task of constructing the sky of the sequence is added while the biological sequence is read and analyzed, thereby realizing the simultaneous construction of multiple sky of the sequence by multithreading and greatly reducing the time for constructing the sky of the whole file sequence.

2. The invention uses unit_64 type to replace string type to optimize storage and access by mapping and converting data type according to bit function, which not only can reduce memory storage, but also can unify elements, thereby facilitating subsequent vectorization access elements.

3. The invention can optimize the circulation by exchanging the circulation sequence, the outer circulation traverses the kmer set elements, the inner circulation traverses the hash function set, and the exchanged circulation only needs to visit 1 time for each kmer set element, thereby reducing the access time of the cache and causing no other additional expenditure.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.

FIG. 1 is a diagram of a RabbitFX library working model provided by an embodiment of the invention;

FIG. 2 is a graph of the Genome1 (MB level) algorithm optimization provided by an embodiment of the present invention;

FIG. 3 is a block diagram of the Genome1 (MB level) thread extensibility provided by an embodiment of the present invention;

FIG. 4 is a block diagram of the Genome2 (GB level) thread extensibility provided by an embodiment of the present invention.

Detailed Description

The invention will be further described with reference to the drawings and examples.

It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.

The whole idea of the invention

Based on the original OrderMinHash algorithm, optimizing single-thread models such as improving the computing efficiency by optimizing compiling parameters, calling RabbitFx library to quickly analyze FASTA files, optimizing access to exchange algorithm circulation sequence, converting data types by using bit-based functions, optimizing priority queues and the like, fully utilizing an AVX512 instruction set to improve the computing efficiency and the like, designing a multi-thread computing model, allowing control of the number of threads, ensuring that the speed ratio is basically proportional to the number of threads in a load balancing and other modes, and completing the optimization of the algorithm on a general multi-core computing platform. Through algorithm hotspot function analysis, the optimization of the invention can be divided into two directions, namely optimization of sequence reading analysis and optimization of a sketch structure.

Example 1

The embodiment provides a sequence-based alignment optimization method, which comprises the following steps:

step 1: reading a FASTA file;

in this embodiment, the FASTA file contains one or more pieces of biological sequence information each consisting of two parts of a biological sequence specification and a base arrangement. Biological sequence descriptions are taken as a single row, the ">" is taken as a starter of a single biological sequence, the information such as the name of the sequence is followed, the next row is started to be arranged by bases, the number of bases in a row is not more than 80, and the number of bases in a row is not limited.

Step 2: analyzing the FASTA file, and comparing the gene sequences based on the analysis result;

the original OrderMinHash algorithm uses kseq to analyze FASTA files, so that the efficiency is low;

referring to fig. 1, in order to achieve fast reading, parsing of FASTA files and full use of memory space, the present rabitifx library is directly introduced, and a multithread I/O model framework with producer-consumer as a core in the rabitifx library is fully utilized: the producer reads data from the incoming FASTA file, takes space from the data pool, and formats the data into data blocks. Each data block is then pushed into a data queue. The user receives the data block from the data queue, processes the data block by using a user-defined function, and releases the space occupied by the data block for recycling after the processing is finished.

Also, because the number of base lines in the FASTA file is not limited, it may result in a sequence that exceeds a block of data, and therefore, the robbbitfx employs a linked list strategy, and if a block cannot contain a complete sequence, the robbbitfx will link to the next block until it is ensured that the complete sequence can be accessed in a list.

Besides analyzing the file, the thread number is read to perform the thread expansibility of the parameter control test algorithm, and the task of constructing the sky of the sequence is added while the biological sequence is read and analyzed, so that the sky of a plurality of sequences is constructed simultaneously by multithreading, and the time for constructing the sky of the whole file sequence is greatly reduced.

The method uses a multithread model to read and analyze the FASTA file, the sequence order in the original file is changed from the read sequence, although the distance between the sequences can be directly compared through the skip, it cannot be clearly determined which two sequences are compared, if the read sequence of the sequence is regulated, the read speed of the sequence can be reduced, contrary to the original purpose of fast reading and analyzing the file, therefore, the sequence of the skip is clearly defined by defining the OMH structure, namely, the name of the sequence is obtained while the sequence is read and analyzed, and the OMH structure of the sequence is formed together with the sequence skip.

And reading an OMH structure body of the sequence, obtaining the name of the sequence and the search for similarity comparison, and finally outputting a distance matrix of the sequence name.

Program execution command: relative location of FASTA file under current directory/executable file name-threads number,

input: FASFTA file, output: distance matrix of sequences in FASTA file.

In step 2, the process of optimizing the sketch structure includes:

step 201: converting the data type, and replacing branch sentences by using a bitwise function;

the original OrderMinHash algorithm method for solving the kmer set and the occurrence number of the biological sequence is to construct a disordered key value pair set, take kmer elements as keywords, take the occurrence number as key values and traverse the biological sequence through a window with the size of K. Because the biological sequence is composed of A, G, C, T and is a character string, the single kmer set element not only occupies more space in the memory, but also is not fixed, and is difficult to access in a vectorization way.

Therefore, the method can be represented by using two-bit binary numbers for numbering A, G, C, T, so that the memory storage is greatly reduced, and the most common method is to compare each read character 1 to 4 times by using a switch statement to determine the number, but the method is time-consuming, introduces a large number of branch statements, and reduces the efficiency of cpu to calculate data.

By observation, bits 2 and 3 of the binary representation of A, G, C, T are found to exactly meet the numbering requirements, so the direct use of the bitwise function in this embodiment converts A, G, C, T to a two-bit binary number, and the kmer element will also convert from string type to uint unsigned integer type.

The unit_64 type is used for replacing string type to optimize storage and access through the proposed mapping conversion data type according to the bit function, so that not only can the memory storage be reduced, but also elements can be unified, and the subsequent vectorization access elements are convenient.

Step 202: taking the minimum L value by using a priority queue;

when the original OrderMinHash algorithm selects the minimum L value of each hash function of a kmer set element, the hash value of each kmer set element is usually calculated and stored for each hash function, all hash values are sorted after all hash values are calculated, and then the first L hash values are selected.

Therefore, in this embodiment, selecting the priority queue to select the minimum L value of each hash function of the kmer set element specifically includes: setting a priority queue with the size L for each hash function, immediately judging whether the hash value of the current kmer set element can enter the priority queue or not, namely comparing the hash value of the element with the maximum value of the current queue, popping up the maximum value if the hash value is smaller than the maximum value, entering the queue, and directly traversing the next kmer set element if the hash value is larger than the maximum value until all the kmer set elements are traversed. Compared with the former method, the method only needs to maintain 500 priority queues, reduces the memory space occupied by the hash value, and the query speed of the priority queues is usually faster when the size L of the priority queues is 2-5.

Step 203: cycle optimization, exchanging a cycle sequence;

the original OrderMinHash algorithm obtains the first L minimum values of each hash function of the kmer set, namely, the first L minimum values are obtained by circularly traversing the hash function set on the outer layer, circularly traversing the kmer set elements on the inner layer, maintaining M priority queues, but for each element of the kmer set, M times of searching in a memory are needed, the cache access time is increased and wasted, so that the circulation can be optimized in a manner of exchanging circulation sequences, the outer layer circularly traversing the kmer set elements, the inner layer circularly traversing the hash function set, and the exchanged circulation only needs to access 1 time for each kmer set element, thereby reducing the cache access time and causing no other additional expenditure.

In addition, in this embodiment, the SIMD vector processing unit is used to improve the calculation efficiency, and specifically includes:

1. the compilation parameters were optimized using-O2 or-O3, using-march=native.

2. After converting string type data into unit_64 type data by using an instruction set, 4 or 8 data can be accessed by 1 vector unit if using an AVX2 or AVX512 instruction set because the hash function used is mapped and is also unit_64 type data, thereby greatly improving the calculation efficiency.

In this embodiment, vectorizing the step of selecting the previous L hash value of each hash function, using the AVX512 instruction set, setting a hash register array, setting the array size to 8, dividing kmer elements into 8 rows and 8 columns, calculating hash values with 8 input loops as a group, storing the calculated hash values in the corresponding registers, and determining whether the hash values enter the priority queue, and selecting a direct expansion loop to reduce the jump instruction because the number of loops is smaller than 8.

Through verification, in the traditional algorithm, due to improper method selection, unreasonable task allocation and the like in a sequence reading and analyzing module and a sequence sketch construction module, the algorithm efficiency is low and the thread expansibility is poor.

As shown in fig. 2-4, on MB-level data, the improved algorithm efficiency is about 50 times of the original algorithm efficiency, the thread expansion efficiency can be further improved to about 900 times, on GB-level-scale data, sequencing time can be controlled within several hours or even within one hour, rapid and efficient genome data processing is realized, and the practical scientific research problem is solved.

Example two

The embodiment provides a gene sequence comparison system based on multithread calculation, which comprises the following steps:

the file reading module is used for reading the FASTA file;

The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A biological sequence alignment method based on multi-thread calculation, which is characterized by including the following steps:

Read FASTA files;

Introduce the RabbitFX library, use the multi-threaded I/O model framework with the producer-consumer as the core in the RabbitFX library to parse the FASTA file, obtain the name of the sequence when reading and parsing the FASTA file, and add the sequence to construct the sequence. The sketch task uses multi-threading to construct sketches of multiple sequences at the same time, and the name of the obtained sequence and the sketch of the sequence together form the OMH structure of the sequence;

The RabbitFX library is introduced and the multi-threaded I/O model framework with producer-consumer as the core in the RabbitFX library is used to parse the FASTA file, including:

The producer reads data from the input FASTA file, obtains space from the data pool, and formats the data into data blocks;

Each data block is pushed to the data queue;

The user receives data blocks from the data queue and uses user-defined functions to process the data blocks. After processing, the space occupied by the data blocks is released for recycling;

When constructing the sketch of the sequence, the details include:

Use bitwise functions to replace branch statements and convert the data type of the sequence;

The priority queue is used to select the minimum L value of each hash function of the kmer set element. When finding the minimum L value of each hash function of the kmer set element, the loop is optimized by exchanging the loop order.

The outer loop traverses the kmer set elements, and the inner loop traverses the hash function set;

Read the OMH structure of the sequence, measure the similarity between the name of the sequence obtained and the sketch of the sequence, and obtain the sequence comparison result.

2. A biological sequence comparison method based on multi-thread calculation as claimed in claim 1, characterized in that RabbitFX adopts a linked list strategy. If a block cannot contain a complete sequence, RabbitFX will link to the next block until Ensures access to complete sequences in a list.

3. A biological sequence comparison method based on multi-thread calculation according to claim 1, characterized in that when converting the data type of the sequence, AVX2 or AVX512 instruction set conversion is used.

4. A biological sequence comparison method based on multi-thread calculation as claimed in claim 1, characterized in that the use of bitwise functions instead of branch statements to convert the data type of the sequence includes:

The biological sequence consists of A, G, C, and T, using character strings. Bitwise functions are used to convert A, G, C, and T into two-digit binary numbers. The kmer element will also be converted from string type to uint unsigned integer type. .

5. A biological sequence comparison method based on multi-thread calculation as claimed in claim 1, characterized in that the priority queue is used to select the minimum L value of each hash function of the kmer set element, specifically including:

Set a priority queue of size L for each hash function. After calculating the hash value of the current kmer set element, immediately determine whether it can enter the priority queue. That is, compare the element hash value with the maximum value of the current queue. If it is less than The maximum value pops up, and the hash value enters the queue. If it is greater than the maximum value, the next kmer set element is traversed directly until all kmer set elements are traversed.

6. A biological sequence comparison method based on multi-thread calculation as claimed in claim 1, characterized in that when selecting the minimum L value of each hash function of the kmer set element, the AVX512 instruction set is used to set the hash register Array, assuming the array size is 8, divide the kmer elements into multiple rows and 8 columns, put 8 elements into a loop to calculate the hash value, store it in the corresponding register, and determine whether the hash value enters the priority queue. At the same time, Choose to directly unroll the loop to reduce jump instructions.

7. A biological sequence comparison method based on multi-thread calculation as claimed in claim 1, characterized in that the FASTA file contains one or more pieces of biological sequence information, and each piece of biological sequence information consists of a biological sequence description and a base arrangement. It consists of two parts. The biological sequence description is a separate line, with ">" as the starting character. At the same time, ">" is also the starting character of a single biological sequence.

8. A gene sequence comparison system based on multi-threaded computing, characterized by including:

File reading module, used to read FASTA files;

The file parsing module is used to introduce the RabbitFX library and use the multi-threaded I/O model framework with producer-consumer as the core in the RabbitFX library to parse FASTA files;

Each data block is pushed to the data queue;

The multi-thread construction module is used to obtain the name of the sequence when reading and parsing the FASTA file, and at the same time add the sketch task of constructing the sequence. It constructs the sketches of multiple sequences at the same time through multi-threads, and combines the obtained sequence names with the sequence sketches. Together they constitute the OMH structure of the sequence;

When constructing the sketch of the sequence, the details include:

The sequence alignment module is used to read the OMH structure of the sequence, measure the similarity between the name of the sequence obtained and the sketch of the sequence, and obtain the sequence alignment result.