CN117373538B - A biological sequence comparison method and system based on multi-thread computing - Google Patents
A biological sequence comparison method and system based on multi-thread computing Download PDFInfo
- Publication number
- CN117373538B CN117373538B CN202311676549.8A CN202311676549A CN117373538B CN 117373538 B CN117373538 B CN 117373538B CN 202311676549 A CN202311676549 A CN 202311676549A CN 117373538 B CN117373538 B CN 117373538B
- Authority
- CN
- China
- Prior art keywords
- sequence
- data
- kmer
- rabbitfx
- biological
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Biophysics (AREA)
- Theoretical Computer Science (AREA)
- Medical Informatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Chemical & Material Sciences (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Analytical Chemistry (AREA)
- Bioethics (AREA)
- Databases & Information Systems (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention belongs to the technical field of biological sequence comparison, and provides a biological sequence comparison method and system based on multi-thread calculation, wherein the technical scheme is as follows: based on the original OrderMinHash algorithm, optimizing single-thread models such as improving the computing efficiency by optimizing compiling parameters, calling RabbitFx library to quickly analyze FASTA files, optimizing access to exchange algorithm circulation sequence, converting data types by using bit-based functions, optimizing priority queues and the like, fully utilizing an AVX512 instruction set to improve the computing efficiency and the like, designing a multi-thread computing model, allowing control of the number of threads, ensuring that the speed ratio is basically proportional to the number of threads in a load balancing and other modes, and completing the optimization of the algorithm on a general multi-core computing platform. Solves the problem of low sequence comparison efficiency of the traditional method.
Description
Technical Field
The invention belongs to the technical field of biological sequence comparison, and particularly relates to a biological sequence comparison method and system based on multithread calculation.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Sequence alignment is used as a core operation in bioinformatics flow, and a first used sketch method has a core idea of a local sensitive hash strategy (LSH, location Sensitive Hash). The method is a dimension reduction method and is carried out in two steps. First, sequences (or partial sequences) are generalized into a much smaller sketch than the original sequence, with the important information retained, to estimate the degree of similarity of the two sequences. Second, pairs of sequences that may be similar are found by directly comparing the skchs (without referencing the original sequence) or using the skchs as keys of a hash table. A more thorough, computationally expensive alignment process can then be used on the candidate pairs to refine the actual alignment. In the LSH method, the distance between the slots is used as a first approximation of the distance between the sequences. That is, two sequences that are very similar must have similar skips, and conversely, dissimilar sequences have dissimilar skips.
The Sketch method innovatively uses LSH and it is possible to separate pairs of sequences that do not have high quality alignments from pairs of sequences that may have high quality alignments with high probability and relatively low computational effort. Thus, LSH reduces overall computational requirements while not introducing many false negatives. Sequence alignment sequences with low similarity are filtered by means of a search and the comparison of the original sequences is replaced by a search of the comparison sequences.
In practice, the sequence alignment program uses minHash LSH to represent Jaccard similarity, rather than LSH to represent edit distance or alignment score. Although both techniques have proven useful in practice, they have a major drawback: neither Jaccard similarity nor Hamming similarity directly corresponds to the edit distance, which is the number of operations (mismatch, insert, delete) required to convert one string to another.
The original MinHash algorithm is an improvement of the MinHash algorithm based on Jaccard similarity, and the consideration of the relative sequence of base arrangement in the sequence is added due to the lack of practical LSH applied to editing distance, guillaume Mar ç ais et al. OrderMinHash is based on a Kmer method, the weight and the occurrence frequency of each Kmer element are added, hash operation is carried out on each element in a Kmer set generated by a sequence by using m hash functions, and the previous L small hash values are taken for each hash operation, so that the sketch of the sequence is constructed. And (5) enhancing the screening of sequences with low similarity, and obtaining the sequence similarity by comparing sequences of the search.
However, the traditional original OrderMinHash algorithm is a practical LSH for comparing sequence editing distances, and although the space complexity of sequence comparison is reduced by using a sketch method, the time complexity of the algorithm still does not reach the requirement in the face of data of GB level or even TB level, and the sequence comparison efficiency is low. The following problems are presented in detail:
(1) Reading and resolving sequences is inefficient: the original algorithm uses a kseq single-thread analysis sequence, cannot perform thread expansion, and cannot fully utilize cpu resources.
(2) The temporal complexity of the body algorithm to construct the sketch is too high: since in the OrderMinHash algorithm, M hash functions are used for each element in the sequence kmer set, and L values are taken for each hash function to construct a sketch, three layers of loops are nested, and the algorithm time becomes unacceptable for GB level and even TB level data.
(3) Space utilization is too low: in the sequence reading and the search construction process, a large amount of useless data is reserved, and the space cannot be released in time, so that a large memory space is required to process GB and even TB level data.
Disclosure of Invention
In order to solve at least one technical problem in the background art, the invention provides a biological sequence comparison method and a biological sequence comparison system based on multithread calculation, which are based on the original OrderMinHash algorithm, and complete the optimization of the algorithm on a general multi-core computing platform by optimizing compiling parameters, calling RabbitFx library to quickly analyze FASTA files, optimizing access memory in a cyclic sequence of a switching algorithm, optimizing conversion data types by using bit-wise functions, priority queues and the like, fully utilizing an AVX512 instruction set to improve the computing efficiency and the like by using a SIMD vector processing unit, designing a multithread computing model, allowing the control of the number of threads, ensuring that the acceleration ratio is basically proportional to the number of threads in a load balancing mode and the like.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a first aspect of the present invention provides a biological sequence alignment method based on multi-threaded computing, comprising the steps of:
reading a FASTA file;
introducing a RabbitFX library, analyzing a FASTA file by utilizing a multithread I/O model framework taking a producer-consumer as a core in the RabbitFX library, acquiring the name of a sequence when the FASTA file is read and analyzed, adding a task of constructing the sequence, constructing the sketch of a plurality of sequences simultaneously through multithread, and forming the OMH structure of the sequence by the acquired name of the sequence and the sketch of the sequence together;
and reading an OMH structure of the sequence, and carrying out similarity measurement on the name of the obtained sequence and the search of the sequence to obtain a comparison result of the sequence.
Further, the step of introducing the RabbitFX library, and analyzing the FASTA file by using a multithreading I/O model framework taking a producer-consumer as a core in the RabbitFX library comprises the following steps:
the producer reads data from the input FASTA file, acquires space from the data pool, and formats the data into data blocks;
each data block is pushed into a data queue;
the user receives the data block from the data queue, processes the data block by using a user-defined function, and releases the space occupied by the data block for recycling after the processing is finished.
Further, the RabbitFX employs a linked list strategy, if one chunk cannot contain a complete sequence, the RabbitFX will link to the next chunk until it is guaranteed that the complete sequence is accessed in one list.
Further, the step of constructing the sequence specifically includes:
replacing branch sentences by using a bitwise function, and converting the data type of the sequence;
selecting the minimum L value of each hash function of the kmer set element by adopting a priority queue, optimizing the cycle by exchanging the cycle sequence when the minimum L value of each hash function of the kmer set element is obtained,
the outer layer loops through the kmer set elements and the inner layer loops through the hash function set.
Further, the converting the data type of the sequence by using a bitwise function instead of a branch statement includes:
the biological sequence consists of A, G, C and T, and adopts character strings, and A, G, C and T are converted into two-bit binary numbers by using a bitwise function, and the kmer element is also converted into a uint unsigned integer type from a string type.
Further, the selecting the minimum L value of each hash function of the kmer set element by using the priority queue specifically includes:
setting a priority queue with the size L for each hash function, calculating the hash value of the current kmer set element, immediately judging whether the current kmer set element can enter the priority queue, namely comparing the hash value of the element with the maximum value of the current queue, popping the maximum value if the hash value is smaller than the maximum value, entering the queue, and directly traversing the next kmer set element if the hash value is larger than the maximum value until all the kmer set elements are traversed.
Further, the data type of the sequence is converted using the AVX2 or AVX512 instruction set.
Further, when the minimum L value of each hash function of the kmer set element is selected, an AVX512 instruction set is used, a hash register array is set, the array size is set to be 8, the kmer elements are divided into a plurality of rows and 8 columns, the hash value is calculated by taking 8 as a group of input loops, the input loops are stored into the corresponding registers, whether the hash value enters a priority queue is judged, and meanwhile, direct expansion loops are selected to reduce jump instructions.
Further, the FASTA file contains one or more pieces of biological sequence information, each piece of biological sequence information is composed of two parts of a biological sequence specification and a base arrangement, wherein the biological sequence specification is taken as a single line, takes ">" as a starter, and simultaneously takes ">" as a starter of a single biological sequence.
In a second aspect, the present invention provides a gene sequence alignment system based on multithread calculation, comprising:
the file reading module is used for reading the FASTA file;
the file analysis module is used for introducing a RabbitFX library and analyzing the FASTA file by utilizing a multithread I/O model framework taking a producer-consumer as a core in the RabbitFX library;
the multithread construction module is used for acquiring the name of the sequence when the FASTA file is read and analyzed, adding a task of constructing the sequence, constructing the multiple sequences of the task at the same time through multithread, and forming the OMH structure body of the sequence by the acquired name of the sequence and the task of the sequence;
and a sequence comparison module for reading the OMH structure of the sequence, and performing similarity measurement on the obtained name of the sequence and the sketch of the sequence to obtain a sequence comparison result.
Compared with the prior art, the invention has the beneficial effects that:
1. in the invention, besides analyzing the file, the thread number is read to perform the thread expansibility of the parameter control test algorithm, and the task of constructing the sky of the sequence is added while the biological sequence is read and analyzed, thereby realizing the simultaneous construction of multiple sky of the sequence by multithreading and greatly reducing the time for constructing the sky of the whole file sequence.
2. The invention uses unit_64 type to replace string type to optimize storage and access by mapping and converting data type according to bit function, which not only can reduce memory storage, but also can unify elements, thereby facilitating subsequent vectorization access elements.
3. The invention can optimize the circulation by exchanging the circulation sequence, the outer circulation traverses the kmer set elements, the inner circulation traverses the hash function set, and the exchanged circulation only needs to visit 1 time for each kmer set element, thereby reducing the access time of the cache and causing no other additional expenditure.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a diagram of a RabbitFX library working model provided by an embodiment of the invention;
FIG. 2 is a graph of the Genome1 (MB level) algorithm optimization provided by an embodiment of the present invention;
FIG. 3 is a block diagram of the Genome1 (MB level) thread extensibility provided by an embodiment of the present invention;
FIG. 4 is a block diagram of the Genome2 (GB level) thread extensibility provided by an embodiment of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
The whole idea of the invention
Based on the original OrderMinHash algorithm, optimizing single-thread models such as improving the computing efficiency by optimizing compiling parameters, calling RabbitFx library to quickly analyze FASTA files, optimizing access to exchange algorithm circulation sequence, converting data types by using bit-based functions, optimizing priority queues and the like, fully utilizing an AVX512 instruction set to improve the computing efficiency and the like, designing a multi-thread computing model, allowing control of the number of threads, ensuring that the speed ratio is basically proportional to the number of threads in a load balancing and other modes, and completing the optimization of the algorithm on a general multi-core computing platform. Through algorithm hotspot function analysis, the optimization of the invention can be divided into two directions, namely optimization of sequence reading analysis and optimization of a sketch structure.
Example 1
The embodiment provides a sequence-based alignment optimization method, which comprises the following steps:
step 1: reading a FASTA file;
in this embodiment, the FASTA file contains one or more pieces of biological sequence information each consisting of two parts of a biological sequence specification and a base arrangement. Biological sequence descriptions are taken as a single row, the ">" is taken as a starter of a single biological sequence, the information such as the name of the sequence is followed, the next row is started to be arranged by bases, the number of bases in a row is not more than 80, and the number of bases in a row is not limited.
Step 2: analyzing the FASTA file, and comparing the gene sequences based on the analysis result;
the original OrderMinHash algorithm uses kseq to analyze FASTA files, so that the efficiency is low;
referring to fig. 1, in order to achieve fast reading, parsing of FASTA files and full use of memory space, the present rabitifx library is directly introduced, and a multithread I/O model framework with producer-consumer as a core in the rabitifx library is fully utilized: the producer reads data from the incoming FASTA file, takes space from the data pool, and formats the data into data blocks. Each data block is then pushed into a data queue. The user receives the data block from the data queue, processes the data block by using a user-defined function, and releases the space occupied by the data block for recycling after the processing is finished.
Also, because the number of base lines in the FASTA file is not limited, it may result in a sequence that exceeds a block of data, and therefore, the robbbitfx employs a linked list strategy, and if a block cannot contain a complete sequence, the robbbitfx will link to the next block until it is ensured that the complete sequence can be accessed in a list.
Besides analyzing the file, the thread number is read to perform the thread expansibility of the parameter control test algorithm, and the task of constructing the sky of the sequence is added while the biological sequence is read and analyzed, so that the sky of a plurality of sequences is constructed simultaneously by multithreading, and the time for constructing the sky of the whole file sequence is greatly reduced.
The method uses a multithread model to read and analyze the FASTA file, the sequence order in the original file is changed from the read sequence, although the distance between the sequences can be directly compared through the skip, it cannot be clearly determined which two sequences are compared, if the read sequence of the sequence is regulated, the read speed of the sequence can be reduced, contrary to the original purpose of fast reading and analyzing the file, therefore, the sequence of the skip is clearly defined by defining the OMH structure, namely, the name of the sequence is obtained while the sequence is read and analyzed, and the OMH structure of the sequence is formed together with the sequence skip.
And reading an OMH structure body of the sequence, obtaining the name of the sequence and the search for similarity comparison, and finally outputting a distance matrix of the sequence name.
Program execution command: relative location of FASTA file under current directory/executable file name-threads number,
input: FASFTA file, output: distance matrix of sequences in FASTA file.
In step 2, the process of optimizing the sketch structure includes:
step 201: converting the data type, and replacing branch sentences by using a bitwise function;
the original OrderMinHash algorithm method for solving the kmer set and the occurrence number of the biological sequence is to construct a disordered key value pair set, take kmer elements as keywords, take the occurrence number as key values and traverse the biological sequence through a window with the size of K. Because the biological sequence is composed of A, G, C, T and is a character string, the single kmer set element not only occupies more space in the memory, but also is not fixed, and is difficult to access in a vectorization way.
Therefore, the method can be represented by using two-bit binary numbers for numbering A, G, C, T, so that the memory storage is greatly reduced, and the most common method is to compare each read character 1 to 4 times by using a switch statement to determine the number, but the method is time-consuming, introduces a large number of branch statements, and reduces the efficiency of cpu to calculate data.
By observation, bits 2 and 3 of the binary representation of A, G, C, T are found to exactly meet the numbering requirements, so the direct use of the bitwise function in this embodiment converts A, G, C, T to a two-bit binary number, and the kmer element will also convert from string type to uint unsigned integer type.
The unit_64 type is used for replacing string type to optimize storage and access through the proposed mapping conversion data type according to the bit function, so that not only can the memory storage be reduced, but also elements can be unified, and the subsequent vectorization access elements are convenient.
Step 202: taking the minimum L value by using a priority queue;
when the original OrderMinHash algorithm selects the minimum L value of each hash function of a kmer set element, the hash value of each kmer set element is usually calculated and stored for each hash function, all hash values are sorted after all hash values are calculated, and then the first L hash values are selected.
Therefore, in this embodiment, selecting the priority queue to select the minimum L value of each hash function of the kmer set element specifically includes: setting a priority queue with the size L for each hash function, immediately judging whether the hash value of the current kmer set element can enter the priority queue or not, namely comparing the hash value of the element with the maximum value of the current queue, popping up the maximum value if the hash value is smaller than the maximum value, entering the queue, and directly traversing the next kmer set element if the hash value is larger than the maximum value until all the kmer set elements are traversed. Compared with the former method, the method only needs to maintain 500 priority queues, reduces the memory space occupied by the hash value, and the query speed of the priority queues is usually faster when the size L of the priority queues is 2-5.
Step 203: cycle optimization, exchanging a cycle sequence;
the original OrderMinHash algorithm obtains the first L minimum values of each hash function of the kmer set, namely, the first L minimum values are obtained by circularly traversing the hash function set on the outer layer, circularly traversing the kmer set elements on the inner layer, maintaining M priority queues, but for each element of the kmer set, M times of searching in a memory are needed, the cache access time is increased and wasted, so that the circulation can be optimized in a manner of exchanging circulation sequences, the outer layer circularly traversing the kmer set elements, the inner layer circularly traversing the hash function set, and the exchanged circulation only needs to access 1 time for each kmer set element, thereby reducing the cache access time and causing no other additional expenditure.
In addition, in this embodiment, the SIMD vector processing unit is used to improve the calculation efficiency, and specifically includes:
1. the compilation parameters were optimized using-O2 or-O3, using-march=native.
2. After converting string type data into unit_64 type data by using an instruction set, 4 or 8 data can be accessed by 1 vector unit if using an AVX2 or AVX512 instruction set because the hash function used is mapped and is also unit_64 type data, thereby greatly improving the calculation efficiency.
In this embodiment, vectorizing the step of selecting the previous L hash value of each hash function, using the AVX512 instruction set, setting a hash register array, setting the array size to 8, dividing kmer elements into 8 rows and 8 columns, calculating hash values with 8 input loops as a group, storing the calculated hash values in the corresponding registers, and determining whether the hash values enter the priority queue, and selecting a direct expansion loop to reduce the jump instruction because the number of loops is smaller than 8.
Through verification, in the traditional algorithm, due to improper method selection, unreasonable task allocation and the like in a sequence reading and analyzing module and a sequence sketch construction module, the algorithm efficiency is low and the thread expansibility is poor.
As shown in fig. 2-4, on MB-level data, the improved algorithm efficiency is about 50 times of the original algorithm efficiency, the thread expansion efficiency can be further improved to about 900 times, on GB-level-scale data, sequencing time can be controlled within several hours or even within one hour, rapid and efficient genome data processing is realized, and the practical scientific research problem is solved.
Example two
The embodiment provides a gene sequence comparison system based on multithread calculation, which comprises the following steps:
the file reading module is used for reading the FASTA file;
the file analysis module is used for introducing a RabbitFX library and analyzing the FASTA file by utilizing a multithread I/O model framework taking a producer-consumer as a core in the RabbitFX library;
the multithread construction module is used for acquiring the name of the sequence when the FASTA file is read and analyzed, adding a task of constructing the sequence, constructing the multiple sequences of the task at the same time through multithread, and forming the OMH structure body of the sequence by the acquired name of the sequence and the task of the sequence;
and a sequence comparison module for reading the OMH structure of the sequence, and performing similarity measurement on the obtained name of the sequence and the sketch of the sequence to obtain a sequence comparison result.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (8)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311676549.8A CN117373538B (en) | 2023-12-08 | 2023-12-08 | A biological sequence comparison method and system based on multi-thread computing |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311676549.8A CN117373538B (en) | 2023-12-08 | 2023-12-08 | A biological sequence comparison method and system based on multi-thread computing |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN117373538A CN117373538A (en) | 2024-01-09 |
| CN117373538B true CN117373538B (en) | 2024-03-19 |
Family
ID=89398856
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202311676549.8A Active CN117373538B (en) | 2023-12-08 | 2023-12-08 | A biological sequence comparison method and system based on multi-thread computing |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN117373538B (en) |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119007838B (en) * | 2024-10-24 | 2024-12-17 | 山东大学 | Grouping-based protein sequence clustering method and system |
| CN119720231B (en) * | 2024-11-05 | 2025-12-02 | 中国科学院信息工程研究所 | Evidence storage methods and systems for large-scale risk data |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2016168753A1 (en) * | 2015-04-17 | 2016-10-20 | Battelle Memorial Institute | Biosequence-based approach to analyzing binaries |
| CN106096332A (en) * | 2016-06-28 | 2016-11-09 | 深圳大学 | Parallel fast matching method and system thereof towards the DNA sequence stored |
| CN106778079A (en) * | 2016-11-22 | 2017-05-31 | 重庆邮电大学 | A kind of DNA sequence dna k mer frequency statistics methods based on MapReduce |
| WO2018000174A1 (en) * | 2016-06-28 | 2018-01-04 | 深圳大学 | Rapid and parallelstorage-oriented dna sequence matching method and system thereof |
| CN115641911A (en) * | 2022-10-19 | 2023-01-24 | 哈尔滨工业大学 | A Method for Overlap Detection Between Sequences |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9892237B2 (en) * | 2014-02-06 | 2018-02-13 | Reference Genomics, Inc. | System and method for characterizing biological sequence data through a probabilistic data structure |
-
2023
- 2023-12-08 CN CN202311676549.8A patent/CN117373538B/en active Active
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2016168753A1 (en) * | 2015-04-17 | 2016-10-20 | Battelle Memorial Institute | Biosequence-based approach to analyzing binaries |
| CN106096332A (en) * | 2016-06-28 | 2016-11-09 | 深圳大学 | Parallel fast matching method and system thereof towards the DNA sequence stored |
| WO2018000174A1 (en) * | 2016-06-28 | 2018-01-04 | 深圳大学 | Rapid and parallelstorage-oriented dna sequence matching method and system thereof |
| CN106778079A (en) * | 2016-11-22 | 2017-05-31 | 重庆邮电大学 | A kind of DNA sequence dna k mer frequency statistics methods based on MapReduce |
| CN115641911A (en) * | 2022-10-19 | 2023-01-24 | 哈尔滨工业大学 | A Method for Overlap Detection Between Sequences |
Non-Patent Citations (2)
| Title |
|---|
| 局部序列比对算法及其并行加速研究进展;刘阳;王小磊;李江域;毛逸清;赵东升;;军事医学(第07期);全文 * |
| 蛋白质序列比对算法在众核结构上的并行优化;叶笑春;林伟;范东睿;张浩;;软件学报(第12期);全文 * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN117373538A (en) | 2024-01-09 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN117373538B (en) | A biological sequence comparison method and system based on multi-thread computing | |
| CN106133721B (en) | Parallel decision tree processor architecture | |
| US20110066806A1 (en) | System and method for memory bandwidth friendly sorting on multi-core architectures | |
| CN102667715A (en) | Data processing device, data processing method, program conversion processing device, and program conversion processing method, program conversion processing device, data processing device, program conversion processing method, and data processing me | |
| US7653619B1 (en) | Integrated search engine devices having pipelined search and tree maintenance sub-engines therein that support variable tree height | |
| CN114064551A (en) | CPU + GPU heterogeneous high-concurrency sequence alignment calculation acceleration method | |
| CN108920412B (en) | Algorithm Automatic Tuning Method for Heterogeneous Computer Architecture | |
| CN117573316A (en) | Optimization methods, processing methods, systems and storage media for business calculation graphs | |
| CN118132078A (en) | Program efficiency optimization method and system based on multi-compilation fusion optimization | |
| CN119806644B (en) | Logic design system for accelerating controller hardware | |
| CN110111837A (en) | The searching method and system of protein similarity based on two stages structure alignment | |
| CN118260765B (en) | A code cloning detection method and device for power data security monitoring system | |
| US11366646B2 (en) | Method and apparatus for predicting and scheduling copy instruction for software pipelined loops | |
| Wang et al. | Accelerating the Smith-Waterman algorithm by GPU for high-throughput sequence alignment | |
| CN112100446B (en) | Search method, readable storage medium, and electronic device | |
| CN116168765B (en) | Gene sequence generation method and system based on improved stroboemer | |
| US20040260708A1 (en) | Array compression method | |
| Qiu et al. | Exploring scalable parallelization for edit Distance-based Motif Search | |
| Dai et al. | Synchronous parallel block coordinate descent method for nonsmooth convex function minimization | |
| JP3639480B2 (en) | Similar data retrieval method, similar data retrieval device, and similar data retrieval program recording medium | |
| CN113849180B (en) | Automatic compiling vectorization method based on rearrangement instruction fusion | |
| CN118708192B (en) | A high-performance sparse computing programming framework implementation method and system | |
| CN120030239B (en) | Recommendation system calculation method and system for data redundancy perception | |
| Soman et al. | Efficient Discrete Range Searching primitives on the GPU with applications | |
| CN119783775A (en) | Automatic tuning method, device, storage medium, and program product of code generator |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |