
CN111326216B - A fast partitioning method for big data gene sequencing files - Google Patents

A fast partitioning method for big data gene sequencing files

Info

Publication number
CN111326216B
Authority
CN
China
Prior art keywords
file
node
processed
files
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010122470.0A
Other languages
Chinese (zh)
Other versions
CN111326216A (en)
Inventor
张中海
谭光明
张春明
姚二林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Priority to CN202010122470.0A
Publication of CN111326216A
Application granted
Publication of CN111326216B
Legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Public Health (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioethics (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention relates to the field of high-performance computing, and in particular to a method for quickly partitioning large gene sequencing files. In multi-node gene analysis, the sequencing file is not physically split and no sub-files are generated; instead, a flexible partitioning scheme is provided to suit the downstream analysis program. This balances the load across nodes, reduces disk reads and writes, and improves partitioning efficiency.

Description

A fast partitioning method for big data gene sequencing files

Technical field

The invention relates to the field of high-performance computing, and in particular to a method for quickly partitioning large gene sequencing files.

Background

With the rapid development of the health sector, gene analysis technology plays an increasingly important role. Gene sequencers produce massive numbers of sequencing files, most commonly in the fastq format. Each sequencing file ranges from a few gigabytes to tens or even hundreds of gigabytes. Processing this data quickly has increasingly become the bottleneck of gene analysis.

Because sequencing files are so large, analyzing one on a single node takes a long time, so multiple nodes are used to compute in parallel and shorten the analysis. This requires partitioning the sequencing file: each node processes only part of the file, and the partial results are merged at the end, so that the complete analysis result is obtained in less time.

When multiple nodes process a sequencing file, the common approach is to split the file evenly by the number of nodes, generate multiple sub-files, and write them to disk; each node then reads its own sub-file. This approach is simple, but it increases the disk read and write load.

The common splitting approach may also affect the results of downstream programs. Sequencing analysis typically uses sequence alignment software such as bwa or bowtie. The bwa program, for example, reads the file block by block and processes one block of the fastq file at a time. Because the common splitting approach does not take this into account, it affects bwa's results and can easily lead to inconsistent alignment results.

Summary of the invention

The present invention provides a fast partitioning method for big data gene sequencing files, comprising:

Step 101: set the file block size;

Step 102: analyze the fastq file according to the block size set in step 101 and divide it into multiple file blocks; save the position information of each file block and the total number of file blocks to an information file;

Step 103: according to the number of nodes and the number of cores of each node, calculate the number of file blocks each node is to process, and determine the start position and end position in the fastq file of the part each node is to process;

Step 104: according to the start and end positions determined in step 103, generate a read instruction and provide it to the downstream program through a pipe.

Preferably, step 102 further comprises: if the end position of a file block falls in the middle of a sequence, extending the file block to the end of that sequence.

Preferably, the position information of a file block comprises the start position and the end position of the file block.

Preferably, the file block size may range from 1M to 100M.

Preferably, the number of file blocks to be processed by each node in step 103 is calculated according to the following formula:

$$B_i = B_t \cdot \frac{c_i}{\sum_{j=1}^{n} c_j}$$

where $B_i$ is the number of file blocks processed by the i-th node;

$c_i$ is the number of cores of the i-th node;

$B_t$ is the total number of file blocks;

$n$ is the total number of nodes;

$j$ is an integer ranging from 1 to $n$;

$c_j$ is the number of cores of the j-th node.

According to another aspect of the present invention, a fast partitioning method for big data gene sequencing files is provided, comprising:

Step 201: analyze the fastq file sequence by sequence to obtain the position information of each sequence and the total number of sequences;

Step 202: according to the number of nodes and the number of cores of each node, calculate the number of sequences each node is to process, and determine the start position and end position in the fastq file of the part each node is to process;

Step 203: according to the start and end positions obtained in step 202, generate a read instruction and provide it to the downstream program through a pipe.

Preferably, the position information of a sequence comprises the start position and the end position of the sequence.

Preferably, the number of sequences to be processed by each node in step 202 is calculated according to the following formula:

$$S_i = S_t \cdot \frac{c_i}{\sum_{j=1}^{n} c_j}$$

where $S_i$ is the number of sequences processed by the i-th node;

$c_i$ is the number of cores of the i-th node;

$S_t$ is the total number of sequences;

$n$ is the total number of nodes;

$j$ is an integer ranging from 1 to $n$;

$c_j$ is the number of cores of the j-th node.

A computer-readable storage medium storing a computer program, wherein the program, when executed by a processor, implements any of the above methods.

A computer device comprising a memory and a processor, the memory storing a computer program that can run on the processor, wherein the processor implements any of the above methods when executing the program.

To address the shortcomings of the prior art, the present invention adopts a lazy partitioning strategy for fastq files: no sub-files are generated, which avoids reading, writing, and storing them. Several partitioning modes are provided for use by downstream analysis software. The method reduces the number of disk reads and writes, speeds up file partitioning, and eliminates alignment errors.

Brief description of the drawings

Embodiments of the present invention are further described below with reference to the accompanying drawings, in which:

Fig. 1 is a schematic flowchart of the block-based partitioning method according to an embodiment of the present invention.

Fig. 2 is a schematic flowchart of the sequence-based partitioning method according to an embodiment of the present invention.

Detailed description

Before the method is described in detail, the fastq file format is briefly introduced. A fastq file is a text file in which every four lines form one sequence: the first line is the sequence name, the second line is the base sequence, the third line is a description line, and the fourth line holds the quality scores of the sequence. Sequences are not all the same length. Sequencing files are either single-end or paired-end: single-end sequencing produces a single file, while paired-end sequencing produces a pair of files in which the sequences correspond one to one.
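The four-line record layout described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the names `FastqRecord` and `read_fastq` are hypothetical and do not come from the patent, and the sketch assumes strict four-line records with a name line starting with "@" and a description line starting with "+".

```python
import io
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str      # line 1, without the leading '@'
    bases: str     # line 2, the base sequence
    comment: str   # line 3, the '+' description line
    quality: str   # line 4, one quality character per base


def read_fastq(handle) -> Iterator[FastqRecord]:
    """Yield records from a text handle, assuming strict four-line records."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        bases = handle.readline().rstrip("\n")
        comment = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not comment.startswith("+"):
            raise ValueError("not a four-line FASTQ record")
        if len(bases) != len(quality):
            raise ValueError("bases and quality lengths differ")
        yield FastqRecord(header[1:].rstrip("\n"), bases, comment, quality)


# Two records of different lengths, as the format allows.
sample = "@read1\nACGT\n+\nIIII\n@read2\nGGCCA\n+\nIIIII\n"
records = list(read_fastq(io.StringIO(sample)))
```

Because records vary in length, a byte offset chosen arbitrarily will usually fall inside a record, which is why the partitioning steps that follow always align boundaries to the end of a sequence.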

According to an embodiment of the present invention, the block-based partitioning method is described with reference to Fig. 1. The method includes the following steps.

Step 101: set the file block size, preferably in the range of 1M to 100M.

The inventors found that the gene sequencing analysis tool bwa reads the sequencing file block by block during alignment, and that when bwa is run on multiple nodes to align a fastq file in parallel, partitioning the fastq file by blocks helps load balancing. The block size can be chosen according to the analysis software and the processing capability of the nodes, preferably in the range of 1M to 100M. In one embodiment of the present invention, a block size of 10M gives good processing speed and load balancing.

Step 102: analyze the fastq file according to the block size set in the previous step and divide it into multiple file blocks; save the analysis results to the information file.

Taking a block size of 10M as an example, the analysis starts at the beginning of the file and moves forward by 10M bytes. If that byte is exactly the end of a sequence, the first file block starts at position 0 and ends at 10M. If that byte lies in the middle of a sequence, the end position of the first file block is set to the end of that sequence. The start and end positions of the first file block are then stored in the information file. The first file block is therefore at least 10M in size and contains only complete sequences.

Next, the position one byte after the end of the first file block becomes the start of the second file block, and the analysis again moves forward by 10M bytes. If the current byte is exactly the end of a sequence, its position becomes the end of the second block; if it lies at the beginning or in the middle of a sequence, the end of the second block is set to the end of that sequence. The process continues in the same way to the end of the fastq file, so that the start and end positions of all file blocks are found and stored in the information file. According to an embodiment of the present invention, the information file may also store only the start positions.

The last file block may be smaller than 10M. All other file blocks are at least 10M and vary slightly in size, each ending at the first sequence boundary at or after the 10M offset. Each file block contains multiple complete sequences, and each sequence belongs to exactly one file block. According to an embodiment of the present invention, the number of file blocks is also counted during the analysis and stored in the information file.
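The block analysis can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `scan_blocks` is hypothetical, the sketch scans the file sequentially rather than seeking, it assumes strict four-line records, and it records half-open byte ranges `[start, end)` instead of the inclusive end position used in the description.

```python
import io
from typing import BinaryIO, List, Tuple


def scan_blocks(handle: BinaryIO, block_size: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) byte ranges that each hold whole records."""
    blocks = []
    start = 0   # byte offset where the current block begins
    pos = 0     # byte offset just past the last record read
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0]:
            break  # end of file
        pos += sum(len(line) for line in record)
        # Close the block at the first record boundary at or past the target.
        if pos - start >= block_size:
            blocks.append((start, pos))
            start = pos
    if pos > start:
        blocks.append((start, pos))  # final block, possibly smaller
    return blocks


data = b"@r1\nACGT\n+\nIIII\n" * 5   # five 16-byte records, 80 bytes in total
blocks = scan_blocks(io.BytesIO(data), block_size=40)
# blocks is [(0, 48), (48, 80)]: the first block grows past 40 bytes to the
# end of the third record, and the last block is smaller than the target.
```

The list of ranges together with `len(blocks)` corresponds to the contents of the information file; its on-disk format is not specified in the description, so none is assumed here.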

For paired-end sequencing files, if the two files are the same size, the block analysis is performed on one of them; if their sizes differ, the two files are each partitioned with the sequence-based method of the present invention.

Step 103: according to the number of nodes and the number of cores of each node, calculate the number of file blocks each node is to process and, using the statistics from step 102, determine the start position and end position in the fastq file of the part each node is to process.

Specifically, in a multi-node gene analysis pipeline the compute nodes differ in core count and computing power. The partitioning therefore takes these differences into account and assigns each node a processing range that balances the load.

According to an embodiment of the present invention, the number of file blocks processed by each node is calculated according to Formula 1:

$$B_i = B_t \cdot \frac{c_i}{\sum_{j=1}^{n} c_j} \qquad (1)$$

where $B_i$ is the number of blocks processed by the i-th node;

$c_i$ is the number of cores of the i-th node;

$B_t$ is the total number of file blocks;

$n$ is the total number of nodes;

$j$ is an integer ranging from 1 to $n$;

$c_j$ is the number of cores of the j-th node.
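A minimal Python sketch of a core-proportional allocation follows, in which node i receives a share of the total block count proportional to its core count relative to the sum of all core counts. The description does not say how fractional results are rounded, so the largest-remainder rounding used here is an assumption, and the function name `allocate` is hypothetical.

```python
from typing import List


def allocate(total: int, cores: List[int]) -> List[int]:
    """Split `total` units across nodes in proportion to their core counts."""
    core_sum = sum(cores)
    # Integer part of total * c_i / sum(c_j) for every node.
    shares = [total * c // core_sum for c in cores]
    remainders = [total * c % core_sum for c in cores]
    # Assumption: leftover units go to the nodes with the largest remainders,
    # so that the shares always add up to `total`.
    leftover = total - sum(shares)
    order = sorted(range(len(cores)), key=lambda i: remainders[i], reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return shares


# 100 blocks over three nodes with 16, 32 and 16 cores.
blocks_per_node = allocate(100, [16, 32, 16])   # [25, 50, 25]
```

The same function applies unchanged when the unit being distributed is a sequence rather than a block.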

Once the number of file blocks for each node has been calculated, the information file is used to determine the start position and end position, in the original sequencing file, of the part each node is to process.
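As a sketch of this lookup, the Python function below maps per-node block counts onto contiguous byte ranges. It assumes the information file has been loaded as a list of half-open `(start, end)` byte ranges in file order; that in-memory form and the name `node_ranges` are illustrative choices, not details from the patent.

```python
from typing import List, Tuple


def node_ranges(blocks: List[Tuple[int, int]], counts: List[int]) -> List[Tuple[int, int]]:
    """Map per-node block counts onto contiguous byte ranges of the file."""
    ranges = []
    first = 0  # index of the first block owned by the current node
    for count in counts:
        if count == 0:
            ranges.append((0, 0))  # this node has nothing to read
            continue
        owned = blocks[first:first + count]
        # A node's range runs from the start of its first block
        # to the end of its last block.
        ranges.append((owned[0][0], owned[-1][1]))
        first += count
    return ranges


blocks = [(0, 48), (48, 96), (96, 144), (144, 160)]
ranges = node_ranges(blocks, [1, 2, 1])   # [(0, 48), (48, 144), (144, 160)]
```

Because each node owns consecutive blocks, a single start and end offset per node is enough, which keeps the later read instruction to one contiguous byte range.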

Step 104: generate a read instruction from the start and end positions obtained in step 103, and provide it to the downstream program through a pipe.

Pipes are existing technology and are introduced briefly here using Linux as an example. A Linux pipeline connects multiple commands with the vertical bar "|", known as the pipe character. The syntax is:

command1|command2

In the present invention, command1 is the instruction that reads a byte range of the fastq file, and command2 is the invocation of a block-based analysis tool such as bwa.
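One way such a read instruction could be generated is sketched below in Python. The patent does not name the command used to read a byte range; the combination of `tail -c` and `head -c` is an assumption, as are the function name `read_command`, the file name, and the downstream command string, which is passed through verbatim as a placeholder.

```python
import shlex


def read_command(path: str, start: int, end: int, downstream: str) -> str:
    """Build a shell pipeline that streams bytes [start, end) into `downstream`."""
    length = end - start
    # tail -c +N starts output at byte N, counting from 1,
    # and head -c keeps only the first `length` bytes of that stream.
    reader = f"tail -c +{start + 1} {shlex.quote(path)} | head -c {length}"
    return f"{reader} | {downstream}"


# Hypothetical file name and downstream command, for illustration only.
cmd = read_command("sample.fastq", 48, 144, "bwa mem ref.fa -")
# cmd is "tail -c +49 sample.fastq | head -c 96 | bwa mem ref.fa -"
```

Each node would run its own command string, so the original file is read in place and no sub-file is written to disk.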

According to another aspect of the present invention, the inventors also found that bowtie processes fastq files sequence by sequence. When the downstream program is a sequence-based tool such as bowtie, the fastq file therefore needs to be partitioned by sequence.

The sequence-based partitioning method according to an embodiment of the present invention is described below with reference to Fig. 2. The method includes the following steps.

Step 201: analyze the fastq file sequence by sequence and, during the analysis, save the results to the information file.

Specifically, the start and end positions of each sequence are recorded during the analysis, the number of sequences is counted, and both are saved to the information file.
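The per-sequence analysis can be sketched in Python as below. The name `index_sequences` is hypothetical, the sketch assumes strict four-line records, and it records half-open byte ranges `[start, end)` rather than a particular information-file format, which the description leaves open.

```python
import io
from typing import BinaryIO, List, Tuple


def index_sequences(handle: BinaryIO) -> List[Tuple[int, int]]:
    """Return the half-open (start, end) byte range of every four-line record."""
    index = []
    pos = 0  # byte offset of the record about to be read
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0]:
            break  # end of file
        size = sum(len(line) for line in record)
        index.append((pos, pos + size))
        pos += size
    return index


data = b"@r1\nACGT\n+\nIIII\n@r2\nGG\n+\nII\n"
index = index_sequences(io.BytesIO(data))   # [(0, 16), (16, 28)]
total_sequences = len(index)                # 2
```

The index grows with the number of sequences, so for very large files it is noticeably bigger than a block index; that is the cost of being able to cut at any sequence boundary.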

For paired-end sequencing files, if the two files are the same size, the sequence analysis is performed on one of them; if their sizes differ, each file is analyzed by sequence separately.

Step 202: according to the number of nodes and the number of cores of each node, calculate the number of sequences each node is to process and, using the statistics from step 201, determine the start position and end position in the fastq file of the part each node is to process.

According to an embodiment of the present invention, the compute nodes in a multi-node gene analysis pipeline differ in core count and computing power. The partitioning therefore takes these differences into account and assigns each node a processing range that balances the load.

The number of sequences processed by each node is calculated according to Formula 2:

$$S_i = S_t \cdot \frac{c_i}{\sum_{j=1}^{n} c_j} \qquad (2)$$

where $S_i$ is the number of sequences processed by the i-th node;

$c_i$ is the number of cores of the i-th node;

$S_t$ is the total number of sequences;

$n$ is the total number of nodes;

$j$ is an integer ranging from 1 to $n$;

$c_j$ is the number of cores of the j-th node.

Once the number of sequences for each node has been calculated, the information file is used to determine the start position and end position, in the original sequencing file, of the part each node is to process.
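A compact Python sketch of this step is given below. It assumes a slimmer index that holds only the start offset of each sequence plus the total file size, since each end position equals the next start; this storage choice and the name `ranges_from_starts` are assumptions made for illustration.

```python
from itertools import accumulate
from typing import List, Tuple


def ranges_from_starts(starts: List[int], file_size: int, counts: List[int]) -> List[Tuple[int, int]]:
    """Turn per-node sequence counts into half-open byte ranges."""
    bounds = starts + [file_size]           # bounds[k] is where sequence k begins
    cuts = [0] + list(accumulate(counts))   # sequence index where each node starts
    return [(bounds[a], bounds[b]) for a, b in zip(cuts, cuts[1:])]


# Five sequences in a 76-byte file, split 2 and 3 between two nodes.
starts = [0, 16, 28, 44, 60]
ranges = ranges_from_starts(starts, 76, [2, 3])   # [(0, 28), (28, 76)]
```

A node that is assigned zero sequences receives an empty range whose start equals its end.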

Step 203: generate a read instruction from the start and end positions obtained in step 202, and provide it to the downstream program through a pipe.

The pipe command format is:

command1|command2

Here, command1 is the instruction that reads a byte range of the fastq file, and command2 is the invocation of a sequence-based analysis tool such as bowtie.

The present invention proposes a fast partitioning method for big data gene sequencing files. In multi-node gene analysis, the sequencing file is not physically split and no sub-files are generated; a flexible partitioning scheme is provided to suit the downstream analysis program. This balances the load across nodes, reduces disk reads and writes, and improves partitioning efficiency.

It should be noted that not all of the steps described in the above embodiments are required; those skilled in the art may omit, replace, or modify steps as actual needs dictate.

Finally, the above embodiments only illustrate the technical solutions of the present invention and do not limit them. Although the present invention has been described in detail with reference to the embodiments, those of ordinary skill in the art should understand that modifications or equivalent replacements of the technical solutions that do not depart from their spirit and scope all fall within the scope of the claims of the present invention.

Claims (6)

1. A rapid partitioning method for big data gene sequencing files, comprising:
step 101, setting the size of a file block;
step 102, analyzing and counting a fastq file according to the file block size set in step 101, and dividing the fastq file into a plurality of file blocks; storing the position information of each file block and the total number of file blocks in an information file, wherein the position information of a file block comprises the start position and the end position of the file block;
step 103, calculating the number of file blocks to be processed by each node according to the number of nodes and the number of cores of each node, and determining the start position and the end position in the fastq file of the file part to be processed by each node;
step 104, generating a read instruction according to the start position and the end position in the fastq file of the file part to be processed by each node, as determined in step 103, and providing the read instruction to a subsequent program through a pipe,
wherein the number of file blocks to be processed by each node in step 103 is calculated according to the following formula:
$$B_i = B_t \cdot \frac{c_i}{\sum_{j=1}^{n} c_j}$$
wherein $B_i$ is the number of file blocks processed by the i-th node;
$c_i$ is the number of cores of the i-th node;
$B_t$ is the total number of file blocks;
$n$ is the total number of nodes;
$j$ is an integer ranging from 1 to $n$;
$c_j$ is the number of cores of the j-th node.
2. The rapid partitioning method for big data gene sequencing files of claim 1, wherein step 102 further comprises: if the end position of a file block is in the middle of a sequence, extending the file block to the end of that sequence.
3. The rapid partitioning method for big data gene sequencing files of claim 1, wherein the file block size ranges from 1M to 100M.
4. A rapid partitioning method for big data gene sequencing files, comprising:
step 201, analyzing and counting a fastq file sequence by sequence, obtaining the position information of each sequence and the total number of sequences, and storing the position information of each sequence in an information file, wherein the position information of a sequence comprises the start position and the end position of the sequence;
step 202, calculating the number of sequences to be processed by each node according to the number of nodes and the number of cores of each node, and determining the start position and the end position in the fastq file of the file part to be processed by each node;
step 203, generating a read instruction according to the start position and the end position in the fastq file of the file part to be processed by each node, as obtained in step 202, and providing the read instruction to a subsequent program through a pipe,
wherein the number of sequences to be processed by each node in step 202 is calculated according to the following formula:
$$S_i = S_t \cdot \frac{c_i}{\sum_{j=1}^{n} c_j}$$
wherein $S_i$ is the number of sequences processed by the i-th node;
$c_i$ is the number of cores of the i-th node;
$S_t$ is the total number of sequences;
$n$ is the total number of nodes;
$j$ is an integer ranging from 1 to $n$;
$c_j$ is the number of cores of the j-th node.
5. A computer readable storage medium having stored thereon a computer program, wherein the program when executed by a processor implements the method according to any of claims 1-4.
6. A computer device comprising a memory and a processor, on which memory a computer program is stored which can be run on the processor, characterized in that the processor implements the method according to any of claims 1-4 when executing the program.
CN202010122470.0A 2020-02-27 2020-02-27 A fast partitioning method for big data gene sequencing files Active CN111326216B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010122470.0A CN111326216B (en) 2020-02-27 2020-02-27 A fast partitioning method for big data gene sequencing files

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010122470.0A CN111326216B (en) 2020-02-27 2020-02-27 A fast partitioning method for big data gene sequencing files

Publications (2)

Publication Number Publication Date
CN111326216A CN111326216A (en) 2020-06-23
CN111326216B true CN111326216B (en) 2023-07-21

Family

ID=71168260

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010122470.0A Active CN111326216B (en) 2020-02-27 2020-02-27 A fast partitioning method for big data gene sequencing files

Country Status (1)

Country Link
CN (1) CN111326216B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005011430A (en) * 2003-06-19 2005-01-13 Hitachi Ltd File management method, recording apparatus, reproduction apparatus, and recording medium
CN101446976A (en) * 2008-12-26 2009-06-03 中兴通讯股份有限公司 File storage method in distributed file system
CN102930005A (en) * 2012-10-29 2013-02-13 北京奇虎科技有限公司 Method and device for binding file in host file
CN103186617A (en) * 2011-12-30 2013-07-03 北京新媒传信科技有限公司 Data storage method and device
CN103559020A (en) * 2013-11-07 2014-02-05 中国科学院软件研究所 Method for realizing parallel compression and parallel decompression on FASTQ file containing DNA (deoxyribonucleic acid) sequence read data
EP2759953A1 (en) * 2013-01-28 2014-07-30 Hasso-Plattner-Institut für Softwaresystemtechnik GmbH System and method for genomic data processing with an in-memory database system and real-time analysis
CN105095686A (en) * 2014-05-15 2015-11-25 中国科学院青岛生物能源与过程研究所 High-flux transcriptome sequencing data quality control method based on multi-core CPU (Central Processing Unit) hardware
CN106021538A (en) * 2016-05-27 2016-10-12 成都索贝数码科技股份有限公司 Word segmentation method and system based on storage of FICS objects
CN106446254A (en) * 2016-10-14 2017-02-22 北京百度网讯科技有限公司 File detection method and device

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050240583A1 (en) * 2004-01-21 2005-10-27 Li Peter W Literature pipeline
US7478376B2 (en) * 2004-12-02 2009-01-13 International Business Machines Corporation Computer program code size partitioning method for multiple memory multi-processing systems
US9081501B2 (en) * 2010-01-08 2015-07-14 International Business Machines Corporation Multi-petascale highly efficient parallel supercomputer
US20140067887A1 (en) * 2012-08-29 2014-03-06 Sas Institute Inc. Grid Computing System Alongside A Distributed File System Architecture
CN103049680B (en) * 2012-12-29 2016-09-07 深圳先进技术研究院 gene sequencing data reading method and system
CN104504257B (en) * 2014-12-12 2017-08-11 国家电网公司 A kind of online Prony analysis methods calculated based on Dual parallel
WO2018000174A1 (en) * 2016-06-28 2018-01-04 深圳大学 Rapid and parallel storage-oriented DNA sequence matching method and system thereof
CN110088839B (en) * 2016-10-11 2023-12-15 耶诺姆希斯股份公司 Efficient data structures for bioinformatic information representation
KR102421458B1 (en) * 2016-10-11 2022-07-14 게놈시스 에스에이 Method and apparatus for accessing structured bioinformatics data with an access unit
CN107145766A (en) * 2017-03-27 2017-09-08 中国科学院深圳先进技术研究院 Gene sequence reading method and reading system
CN107169313A (en) * 2017-03-29 2017-09-15 中国科学院深圳先进技术研究院 The read method and computer-readable recording medium of DNA data files
CN109698010A (en) * 2017-10-23 2019-04-30 北京哲源科技有限责任公司 A kind of processing method for gene data
CN110120247A (en) * 2018-01-14 2019-08-13 广州明领基因科技有限公司 A kind of distributed genetic big data storage platform
US12210904B2 (en) * 2018-06-29 2025-01-28 International Business Machines Corporation Hybridized storage optimization for genomic workloads
CN109616156B (en) * 2018-12-03 2021-07-06 郑州云海信息技术有限公司 A kind of gene sequencing data storage method and device
CN109785905B (en) * 2018-12-18 2021-07-23 中国科学院计算技术研究所 An Accelerator for Gene Alignment Algorithms
CN110427270B (en) * 2019-08-09 2022-11-01 华东师范大学 Dynamic load balancing method for distributed connection operator in RDMA (remote direct memory Access) network

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005011430A (en) * 2003-06-19 2005-01-13 Hitachi Ltd File management method, recording apparatus, reproduction apparatus, and recording medium
CN101446976A (en) * 2008-12-26 2009-06-03 中兴通讯股份有限公司 File storage method in distributed file system
CN103186617A (en) * 2011-12-30 2013-07-03 北京新媒传信科技有限公司 Data storage method and device
CN102930005A (en) * 2012-10-29 2013-02-13 北京奇虎科技有限公司 Method and device for binding file in host file
EP2759953A1 (en) * 2013-01-28 2014-07-30 Hasso-Plattner-Institut für Softwaresystemtechnik GmbH System and method for genomic data processing with an in-memory database system and real-time analysis
CN103559020A (en) * 2013-11-07 2014-02-05 中国科学院软件研究所 Method for realizing parallel compression and parallel decompression on FASTQ file containing DNA (deoxyribonucleic acid) sequence read data
CN105095686A (en) * 2014-05-15 2015-11-25 中国科学院青岛生物能源与过程研究所 High-flux transcriptome sequencing data quality control method based on multi-core CPU (Central Processing Unit) hardware
CN106021538A (en) * 2016-05-27 2016-10-12 成都索贝数码科技股份有限公司 Word segmentation method and system based on storage of FICS objects
CN106446254A (en) * 2016-10-14 2017-02-22 北京百度网讯科技有限公司 File detection method and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
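Research on Parallel Design and Optimization of the Gene Panel Pipeline; Wang Yuanrong et al.; Chinese Journal of Computers; Vol. 42 (No. 11); full text *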
PipeMEM: A Framework to Speed Up BWA-MEM in Spark with Low Overhead; Lingqi Zhang; Genes; full text *
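Research and Implementation of Parallelization of the Last Alignment Software Based on Hadoop Streaming; Dong Benzhi; Li Wenhao; Jing Weipeng; Computer Engineering and Applications (No. 02); full text *
Research on Sequence Alignment Algorithms Based on High-Throughput Transcriptome Sequencing; Zhang Yong et al.; China Masters' Theses Full-text Database (Information Science and Technology) (No. 3); full text *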

Also Published As

Publication number Publication date
CN111326216A (en) 2020-06-23

Similar Documents

Publication Publication Date Title
CN107609350B (en) Data processing method of second-generation sequencing data analysis platform
US11941534B2 (en) Genome sequence alignment system and method
CN110797085B (en) Method, system, equipment and storage medium for inquiring gene data
WO2019161645A1 (en) Shell-based data table extraction method, terminal, device, and storage medium
CN107480466B (en) Genome data storage method and electronic device
WO2024198934A1 (en) Data processing method, apparatus and system, and electronic device and storage medium
CN110021345B (en) Spark platform-based gene data analysis method
CN115469818B (en) Disk array writing processing method, device, equipment and medium
JP6201788B2 (en) Loop division detection program and loop division detection method
CN101770504B (en) Data storage method, data reading method, and data reading equipment
CN110264392B (en) A multi-GPU-based strongly connected graph detection method
CN110262289B (en) Method, device and storage medium for processing variables in A2L files
CN111326216B (en) A fast partitioning method for big data gene sequencing files
CN111370070B (en) Compression processing method for big data gene sequencing file
CN117112004B (en) Differential data determination method, differential restoration method, device, equipment and medium
CN115104092A (en) Data synchronization method and related device
CN104750846B (en) A kind of substring lookup method and device
CN107169313A (en) The read method and computer-readable recording medium of DNA data files
CN114420210B (en) Rapid trimming method and system for biological sequencing sequence
CN113495901B (en) Quick retrieval method for variable-length data blocks
CN102637204A (en) Method for querying texts based on mutual index structure
CN107403076B (en) DNA sequence processing method and equipment
CN114817327A (en) File version identification method, system, terminal equipment and storage medium
WO2019023978A1 (en) Alignment method, device and system
CN108984123A (en) A kind of data de-duplication method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant