CN111326216B - A fast partitioning method for big data gene sequencing files - Google Patents
- Publication number
- CN111326216B
- Authority
- CN
- China
- Prior art keywords
- file
- node
- processed
- files
- sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
Abstract
The present invention relates to the field of high-performance computing, and in particular to a method for quickly partitioning big-data gene sequencing files. In a multi-node gene analysis process, the sequencing file does not need to be physically split and no sub-files are generated; a flexible partitioning scheme is provided according to the subsequent analysis program, so that the load on the nodes is better balanced, hard disk reads and writes are reduced, and partitioning efficiency is improved.
Description
Technical field
The invention relates to the field of high-performance computing, and in particular to a method for quickly partitioning big-data gene sequencing files.
Background
With the rapid development of the health sector, gene analysis technology plays an increasingly important role. Gene sequencers produce massive numbers of sequencing files, most commonly in the fastq format. Each sequencing file ranges from a few gigabytes to tens or even hundreds of gigabytes. Processing this volume of data quickly has increasingly become the bottleneck of gene analysis.
Because sequencing files are so large, analyzing them on a single node takes a great deal of time, so multiple nodes are needed to compute in parallel and shorten the analysis. This requires partitioning the sequencing file: each node processes only part of the file, and the partial results are merged at the end, so that the complete gene analysis result is obtained in a shorter time.
When multiple nodes process a sequencing file, the common approach is to divide the file evenly by the number of nodes, generate multiple sub-files, and write them to the hard disk; each node then reads its own sub-file for processing. This approach is simple, but it increases the read and write load on the hard disk.
In addition, the common approach may affect the results of subsequent programs. Sequencing analysis usually relies on sequence alignment software such as bwa and bowtie. The bwa program, for example, reads files block by block and processes one block of the fastq file at a time. Because the common approach does not take this into account, it affects the output of bwa and can easily lead to inconsistent alignment results.
Summary of the invention
The present invention provides a method for quickly partitioning big-data gene sequencing files, comprising:
Step 101: set the file block size;
Step 102: analyze the fastq file according to the file block size set in step 101 and divide it into multiple file blocks; save the position information of each file block and the total number of file blocks to an information file;
Step 103: according to the number of nodes and the number of cores of each node, calculate the number of file blocks to be processed by each node, and determine the start position and end position in the fastq file of the part to be processed by each node;
Step 104: according to the start position and end position determined in step 103 for each node, generate a read command and supply its output to the subsequent program through a pipe.
Preferably, step 102 further comprises: if the end position of a file block falls in the middle of a sequence, extending the file block to the end of that sequence.
Preferably, the position information of a file block comprises its start position and end position.
Preferably, the file block size is between 1M and 100M.
Preferably, the number of file blocks to be processed by each node in step 103 is calculated according to the following formula:
$$B_i = \frac{c_i}{\sum_{j=1}^{n} c_j} \times B_t$$
where $B_i$ is the number of file blocks processed by the $i$-th node;
$c_i$ is the number of cores of the $i$-th node;
$B_t$ is the total number of file blocks;
$n$ is the total number of nodes;
$j$ is an integer ranging from 1 to $n$;
$c_j$ is the number of cores of the $j$-th node.
According to another aspect of the present invention, a method for quickly partitioning big-data gene sequencing files is provided, comprising:
Step 201: analyze the fastq file sequence by sequence to obtain the position information of each sequence and the total number of sequences;
Step 202: according to the number of nodes and the number of cores of each node, calculate the number of sequences to be processed by each node, and determine the start position and end position in the fastq file of the part to be processed by each node;
Step 203: according to the start position and end position determined in step 202 for each node, generate a read command and supply its output to the subsequent program through a pipe.
Preferably, the position information of a sequence comprises its start position and end position.
Preferably, the number of sequences to be processed by each node in step 202 is calculated according to the following formula:
$$S_i = \frac{c_i}{\sum_{j=1}^{n} c_j} \times S_t$$
where $S_i$ is the number of sequences processed by the $i$-th node;
$c_i$ is the number of cores of the $i$-th node;
$S_t$ is the total number of sequences;
$n$ is the total number of nodes;
$j$ is an integer ranging from 1 to $n$;
$c_j$ is the number of cores of the $j$-th node.
A computer-readable storage medium storing a computer program which, when executed by a processor, implements any one of the above methods.
A computer device comprising a memory and a processor, the memory storing a computer program capable of running on the processor, characterized in that the processor implements any one of the above methods when executing the program.
To address the deficiencies of the prior art, the present invention applies a lazy partitioning strategy to fastq files: no sub-files are generated, which avoids reading, writing, and storing them. Several partitioning modes are provided to suit the subsequent analysis software. The method reduces the number of hard disk reads and writes, speeds up file partitioning, and eliminates alignment errors.
Description of drawings
Embodiments of the present invention are further described below with reference to the accompanying drawings, in which:
Fig. 1 is a schematic flowchart of the block-based partitioning method according to an embodiment of the present invention.
Fig. 2 is a schematic flowchart of the sequence-based partitioning method according to an embodiment of the present invention.
Detailed description
Before the method is described in detail, the fastq format is briefly introduced. A fastq file is a text file in which every four lines form one sequence record: the first line is the sequence name, the second line is the base sequence, the third line is a description line, and the fourth line holds the quality scores of the sequence. Sequences are not all the same length. Sequencing files are either single-end or paired-end: a single-end sequencing file consists of one file, while a paired-end sequencing file consists of a pair of files whose sequences correspond one to one.
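The four-line record layout described above can be sketched in Python as follows. This is a minimal illustration, assuming an uncompressed file with strictly four lines per record (no wrapped sequence lines); the names `FastqRecord` and `read_fastq` are hypothetical and do not come from the patent.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str      # line 1: sequence name, starts with '@'
    bases: str     # line 2: base sequence
    comment: str   # line 3: description line, starts with '+'
    quality: str   # line 4: quality scores, one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per four lines (assumes no line wrapping)."""
    while True:
        lines = [handle.readline().rstrip("\n") for _ in range(4)]
        if not lines[0]:
            return  # end of file
        name, bases, comment, quality = lines
        if not name.startswith("@") or not comment.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {name!r}")
        if len(bases) != len(quality):
            raise ValueError(f"length mismatch in record {name!r}")
        yield FastqRecord(name, bases, comment, quality)


# Two records of different lengths, as the format allows.
sample = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGG\n+\nII\n")
records = list(read_fastq(sample))
```

Because records differ in length, byte offsets cannot be derived from a record count alone, which is why the method builds an index of positions first.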
According to an embodiment of the present invention, the block-based partitioning method is described with reference to Fig. 1. The method comprises the following steps.
Step 101: set the file block size, preferably between 1M and 100M.
The inventors found that the gene sequencing analysis tool bwa reads sequencing files block by block during alignment, and that when bwa is run on multiple nodes to align a fastq file in parallel, partitioning the fastq file by blocks helps balance the load. The file block size can be chosen according to the analysis software and the processing capability of the nodes, preferably between 1M and 100M. In one embodiment of the present invention, a file block size of 10M gives good processing speed and load balancing.
Step 102: analyze the fastq file according to the block size set in the previous step and divide it into multiple file blocks; save the analysis result to an information file.
Taking a block size of 10M as an example, the analysis moves 10M bytes forward from the start of the file. If that byte happens to be the end of a sequence, the start position of the first file block is set to 0 and its end position to 10M. If that byte falls in the middle of a sequence, the end position of the first file block is set to the end of that sequence. Once the first file block has been located, its start position and end position are stored in the information file. The first file block is therefore at least 10M in size and contains only complete sequences.
Next, the byte after the end position of the first file block becomes the start position of the second file block, and the analysis again moves 10M bytes forward. If the current byte is exactly the end of a sequence, its position becomes the end position of the second block; if the current byte is at the beginning or in the middle of a sequence, the end position of the second block is set to the end of that sequence. The analysis continues in this way to the end of the fastq file, locating the start and end positions of all file blocks and storing them in the information file. In one embodiment, the information file may store only the start positions, since each block's end position follows from the start position of the next block.
The last file block may be smaller than 10M; all other file blocks differ slightly in size, each at or just above 10M. Every file block contains multiple complete sequences, and each sequence belongs to exactly one file block. According to an embodiment of the present invention, the number of file blocks is also counted during the analysis and stored in the information file.
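The block analysis of step 102 can be sketched in Python as below. This is a hedged illustration rather than the patented implementation: it assumes strictly four-line records, it walks the file record by record instead of jumping ahead by the block size (an assumption made because a record boundary cannot be identified reliably from an arbitrary byte offset), and it uses half-open `(start, end)` byte ranges instead of the inclusive positions in the text. The function name `build_block_index` and the layout of the `info` dictionary are hypothetical.

```python
import os
import tempfile
from typing import List, Tuple


def build_block_index(path: str, block_size: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) byte ranges, one per file block.

    Each block is extended to the end of the record that crosses the
    block_size mark, so no record is shared between two blocks.
    """
    blocks = []
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        start = 0
        while start < file_size:
            f.seek(start)
            end = start
            # Consume whole records until the block reaches the target size.
            while end - start < block_size and end < file_size:
                for _ in range(4):
                    f.readline()
                end = f.tell()
            blocks.append((start, end))
            start = end
    return blocks


# Demo: five 16-byte records and a 30-byte target block size.
with tempfile.NamedTemporaryFile("wb", suffix=".fastq", delete=False) as tmp:
    for i in range(5):
        tmp.write(f"@r{i}\nACGT\n+\nIIII\n".encode())
demo_blocks = build_block_index(tmp.name, block_size=30)
os.remove(tmp.name)

# Contents that would be written to the information file.
info = {"block_count": len(demo_blocks), "blocks": demo_blocks}
```

In the demo, the first two blocks each hold two records (32 bytes, just over the 30-byte target) and the last block holds the remaining record, mirroring the observation that only the final block may be smaller than the target size.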
For paired-end sequencing files, if the two files are the same size, the block statistics are computed on one of them; if the sizes differ, each of the two files is instead partitioned using the sequence-based partitioning method of the present invention.
Step 103: according to the number of nodes and the number of cores of each node, calculate the number of file blocks to be processed by each node; then, using the statistics obtained in step 102, determine the start position and end position in the fastq file of the part to be processed by each node.
Specifically, in a multi-node gene analysis pipeline the compute nodes differ in core count and computing power. These differences must be taken into account when partitioning the sequencing file, and a processing range determined for each node, so that the load is better balanced.
According to an embodiment of the present invention, the number of file blocks processed by each node is calculated according to Formula 1:
$$B_i = \frac{c_i}{\sum_{j=1}^{n} c_j} \times B_t \qquad (1)$$
where $B_i$ is the number of file blocks processed by the $i$-th node;
$c_i$ is the number of cores of the $i$-th node;
$B_t$ is the total number of file blocks;
$n$ is the total number of nodes;
$j$ is an integer ranging from 1 to $n$;
$c_j$ is the number of cores of the $j$-th node.
Once the number of file blocks for each node has been calculated, the information file is used to determine the start position and end position in the original sequencing file of the part to be processed by each node.
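Step 103 can be sketched as follows, assuming Formula 1 is the core-proportional share $B_i = B_t \cdot c_i / \sum_j c_j$ implied by the symbol definitions. The patent text does not say how fractional shares are rounded; the largest-remainder rounding used here is an assumption, as are the function names `allocate_units` and `node_byte_ranges` and the half-open `(start, end)` block ranges.

```python
from typing import List, Optional, Tuple


def allocate_units(cores: List[int], total_units: int) -> List[int]:
    """Split total_units across nodes in proportion to their core counts.

    Rounding is an assumption: take the floor of each share, then give the
    leftover units to the nodes with the largest fractional remainders.
    """
    total_cores = sum(cores)
    shares = [c * total_units // total_cores for c in cores]
    remainders = [c * total_units % total_cores for c in cores]
    leftover = total_units - sum(shares)
    order = sorted(range(len(cores)), key=lambda i: remainders[i], reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return shares


def node_byte_ranges(
    blocks: List[Tuple[int, int]], counts: List[int]
) -> List[Optional[Tuple[int, int]]]:
    """Map per-node block counts to one contiguous byte range per node."""
    ranges: List[Optional[Tuple[int, int]]] = []
    first = 0
    for count in counts:
        if count == 0:
            ranges.append(None)  # this node receives no data
            continue
        chunk = blocks[first:first + count]
        ranges.append((chunk[0][0], chunk[-1][1]))
        first += count
    return ranges


# Demo: three nodes with 8, 16 and 8 cores sharing ten 10-byte blocks.
demo_blocks = [(i * 10, (i + 1) * 10) for i in range(10)]
counts = allocate_units([8, 16, 8], len(demo_blocks))
ranges = node_byte_ranges(demo_blocks, counts)
```

Assigning consecutive blocks to each node keeps every node's input a single contiguous byte range, so one read command per node is enough.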
Step 104: according to the start position and end position in the original sequencing file determined in step 103 for each node, generate a read command and supply its output to the subsequent program through a pipe.
Pipes are existing technology; Linux is used here as a brief example. A Linux pipeline connects multiple commands with the vertical bar "|", known as the pipe character. The syntax of a Linux pipeline is:
command1 | command2
In the present invention, command1 is the command that reads the specified range of the fastq file, and command2 is the command of a block-based analysis tool such as bwa.
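One possible way to generate such a `command1 | command2` line is sketched below. The patent does not name the read command; building it from `tail -c` and `head -c` is an assumption, as are the function name `build_read_pipeline`, the half-open byte range, and the example downstream command `bwa mem ref.fa -`, which presumes the installed bwa accepts `-` as standard input and that `ref.fa` is an indexed reference.

```python
import shlex


def build_read_pipeline(fastq_path: str, start: int, end: int, downstream: str) -> str:
    """Build 'command1 | command2' for the byte range [start, end).

    command1 emits only the requested bytes: tail skips to the start
    offset (tail counts from 1) and head stops after the range length.
    downstream is inserted unchanged and must read from standard input.
    """
    length = end - start
    if start < 0 or length <= 0:
        raise ValueError("range must be non-negative and non-empty")
    reader = f"tail -c +{start + 1} {shlex.quote(fastq_path)} | head -c {length}"
    return f"{reader} | {downstream}"


# Hypothetical example: second block of a file, piped into an aligner.
cmd = build_read_pipeline("reads.fastq", 32, 64, "bwa mem ref.fa -")
```

Because the range is read straight from the original file, no sub-file is ever written to disk, which is the point of the lazy partitioning strategy.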
According to another aspect of the present invention, the inventors also found that bowtie processes fastq files sequence by sequence. When the subsequent program is a sequence-based analysis tool such as bowtie, the fastq file therefore needs to be partitioned by sequence.
The sequence-based partitioning method according to an embodiment of the present invention is described below with reference to Fig. 2. The method comprises the following steps.
Step 201: analyze the fastq file sequence by sequence, saving the analysis result to an information file as the file is analyzed.
Specifically, the start and end positions of each sequence are recorded during the analysis, the sequences are counted, and both are saved to the information file.
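A minimal sketch of the per-sequence index built in step 201, assuming strictly four-line records and half-open `(start, end)` byte offsets. The function name `build_sequence_index` and the `info` dictionary layout are hypothetical, since the patent does not specify the format of the information file.

```python
import io
from typing import BinaryIO, List, Tuple


def build_sequence_index(handle: BinaryIO) -> List[Tuple[int, int]]:
    """Record the (start, end) byte offsets of every four-line record."""
    index = []
    while True:
        start = handle.tell()
        lines = [handle.readline() for _ in range(4)]
        if not lines[0]:
            break  # end of file
        index.append((start, handle.tell()))
    return index


# Demo on an in-memory file with two records of different lengths.
data = b"@r1\nACGT\n+\nIIII\n@r2\nGG\n+\nII\n"
seq_index = build_sequence_index(io.BytesIO(data))

# Contents that would be saved to the information file.
info = {"sequence_count": len(seq_index), "sequences": seq_index}
```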
For paired-end sequencing files, if the two files are the same size, the per-sequence statistics are computed on one of them; if the sizes differ, the per-sequence statistics are computed on each file separately.
Step 202: according to the number of nodes and the number of cores of each node, calculate the number of sequences to be processed by each node; then, using the statistics obtained in step 201, determine the start position and end position in the fastq file of the part to be processed by each node.
According to an embodiment of the present invention, in a multi-node gene analysis pipeline the compute nodes differ in core count and computing power. These differences must be taken into account when partitioning the sequencing file, and a processing range determined for each node, so that the load is better balanced.
The number of sequences processed by each node is calculated according to Formula 2:
$$S_i = \frac{c_i}{\sum_{j=1}^{n} c_j} \times S_t \qquad (2)$$
where $S_i$ is the number of sequences processed by the $i$-th node;
$c_i$ is the number of cores of the $i$-th node;
$S_t$ is the total number of sequences;
$n$ is the total number of nodes;
$j$ is an integer ranging from 1 to $n$;
$c_j$ is the number of cores of the $j$-th node.
Once the number of sequences for each node has been calculated, the information file is used to determine the start position and end position in the original sequencing file of the part to be processed by each node.
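As a worked illustration, assume Formula 2 is the core-proportional share $S_i = S_t \cdot c_i / \sum_j c_j$ implied by the symbol definitions, and take made-up values that do not come from the patent: three nodes with 8, 16, and 8 cores, and $S_t = 1{,}000{,}000$ sequences.

```latex
\[
\sum_{j=1}^{3} c_j = 8 + 16 + 8 = 32
\]
\[
S_1 = \frac{8}{32} \times 1{,}000{,}000 = 250{,}000, \qquad
S_2 = \frac{16}{32} \times 1{,}000{,}000 = 500{,}000, \qquad
S_3 = \frac{8}{32} \times 1{,}000{,}000 = 250{,}000
\]
```

Node 1 would then read from the start of sequence 1 to the end of sequence 250,000, node 2 from sequence 250,001 to sequence 750,000, and node 3 the remainder, with the byte offsets looked up in the information file. When the shares are not whole numbers, some rounding rule is needed; the text does not specify one.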
Step 203: according to the start position and end position in the original sequencing file determined in step 202 for each node, generate a read command and supply its output to the subsequent program through a pipe.
The pipeline command format is as follows:
command1 | command2
where command1 is the command that reads the specified range of the fastq file, and command2 is the command of a sequence-based analysis tool such as bowtie.
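The same pipe can also be driven from a program rather than a shell. The sketch below is one hedged way to do so in Python: it reads a byte range from the original file and writes it to the standard input of a child process. The function name `stream_range` is hypothetical, the range is half-open, and the child process in the demo is a small echo script standing in for a real analysis tool.

```python
import os
import subprocess
import sys
import tempfile
from typing import BinaryIO, List


def stream_range(path: str, start: int, end: int, argv: List[str],
                 out: BinaryIO, chunk_size: int = 1 << 20) -> int:
    """Pipe bytes [start, end) of path into argv and return its exit code.

    The child's stdout goes straight to `out` (a real file), so the parent
    never has to buffer results while it is still writing input.
    """
    proc = subprocess.Popen(argv, stdin=subprocess.PIPE, stdout=out)
    with open(path, "rb") as src:
        src.seek(start)
        remaining = end - start
        while remaining > 0:
            data = src.read(min(chunk_size, remaining))
            if not data:
                break  # file is shorter than the requested range
            proc.stdin.write(data)
            remaining -= len(data)
    proc.stdin.close()
    return proc.wait()


# Stand-in for the downstream tool: copies stdin to stdout unchanged.
echo = [sys.executable, "-c",
        "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())"]

with tempfile.TemporaryDirectory() as tmp_dir:
    src_path = os.path.join(tmp_dir, "reads.fastq")
    out_path = os.path.join(tmp_dir, "out.txt")
    with open(src_path, "wb") as f:
        f.write(b"@r1\nACGT\n+\nIIII\n@r2\nGG\n+\nII\n")
    with open(out_path, "wb") as out_file:
        exit_code = stream_range(src_path, 16, 28, echo, out_file)
    with open(out_path, "rb") as f:
        streamed = f.read()
```

Reading in fixed-size chunks keeps memory use bounded even when a node's range spans many gigabytes.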
The present invention proposes a method for quickly partitioning big-data gene sequencing files. In a multi-node gene analysis process, the sequencing file does not need to be physically split and no sub-files are generated; a flexible partitioning scheme is provided according to the subsequent analysis program, so that the load on the nodes is better balanced, hard disk reads and writes are reduced, and partitioning efficiency is improved.
It should be noted that not all of the steps described in the above embodiments are required; those skilled in the art may omit, replace, or modify steps as appropriate to actual needs.
Finally, it should be noted that the above embodiments serve only to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail above with reference to the embodiments, those of ordinary skill in the art should understand that modifications or equivalent replacements of the technical solutions of the present invention that do not depart from their spirit and scope shall all fall within the scope of the claims of the present invention.
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010122470.0A CN111326216B (en) | 2020-02-27 | 2020-02-27 | A fast partitioning method for big data gene sequencing files |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111326216A (en) | 2020-06-23 |
CN111326216B (en) | 2023-07-21 |
Family
ID=71168260
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010122470.0A Active CN111326216B (en) | 2020-02-27 | 2020-02-27 | A fast partitioning method for big data gene sequencing files |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111326216B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN111326216A (en) | 2020-06-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||