CN114420210B - Rapid trimming method and system for biological sequencing sequence - Google Patents
Rapid trimming method and system for biological sequencing sequence Download PDFInfo
- Publication number
- CN114420210B CN114420210B CN202210308606.6A CN202210308606A CN114420210B CN 114420210 B CN114420210 B CN 114420210B CN 202210308606 A CN202210308606 A CN 202210308606A CN 114420210 B CN114420210 B CN 114420210B
- Authority
- CN
- China
- Prior art keywords
- read
- thread
- data
- biological sequencing
- sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000012163 sequencing technique Methods 0.000 title claims abstract description 85
- 238000000034 method Methods 0.000 title claims abstract description 60
- 238000009966 trimming Methods 0.000 title claims abstract description 38
- 230000008569 process Effects 0.000 claims abstract description 30
- 238000012545 processing Methods 0.000 claims description 25
- 238000013138 pruning Methods 0.000 claims description 21
- 238000007405 data analysis Methods 0.000 claims description 3
- 238000012360 testing method Methods 0.000 description 12
- 238000007481 next generation sequencing Methods 0.000 description 10
- 230000006872 improvement Effects 0.000 description 9
- 238000010586 diagram Methods 0.000 description 6
- 238000005516 engineering process Methods 0.000 description 5
- 230000006870 function Effects 0.000 description 5
- 238000007781 pre-processing Methods 0.000 description 5
- 101100194362 Schizosaccharomyces pombe (strain 972 / ATCC 24843) res1 gene Proteins 0.000 description 4
- 101100194363 Schizosaccharomyces pombe (strain 972 / ATCC 24843) res2 gene Proteins 0.000 description 4
- 238000004458 analytical method Methods 0.000 description 4
- 238000004422 calculation algorithm Methods 0.000 description 4
- 108091028043 Nucleic acid sequence Proteins 0.000 description 3
- 238000002474 experimental method Methods 0.000 description 3
- 150000007523 nucleic acids Chemical group 0.000 description 3
- 238000012546 transfer Methods 0.000 description 3
- 238000013459 approach Methods 0.000 description 2
- 230000006378 damage Effects 0.000 description 2
- 238000011161 development Methods 0.000 description 2
- 239000012634 fragment Substances 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 239000002773 nucleotide Substances 0.000 description 2
- 125000003729 nucleotide group Chemical group 0.000 description 2
- 108090000623 proteins and genes Proteins 0.000 description 2
- 238000004088 simulation Methods 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
- 238000011056 performance test Methods 0.000 description 1
- 238000003908 quality control method Methods 0.000 description 1
- 238000012795 verification Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/30—Detection of binding sites or motifs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/061—Improving I/O performance
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0638—Organizing or formatting or addressing of data
- G06F3/0643—Management of files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/5038—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/50—Indexing scheme relating to G06F9/50
- G06F2209/5011—Pool
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/50—Indexing scheme relating to G06F9/50
- G06F2209/5018—Thread allocation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Evolutionary Biology (AREA)
- Analytical Chemistry (AREA)
- Chemical & Material Sciences (AREA)
- Human Computer Interaction (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Genetics & Genomics (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
Description
技术领域technical field
本发明属于生物信息技术领域,尤其涉及一种生物测序序列快速修剪方法及系统。The invention belongs to the technical field of biological information, and in particular relates to a method and system for rapid trimming of biological sequencing sequences.
背景技术Background technique
本部分的陈述仅仅是提供了与本发明相关的背景技术信息,不必然构成在先技术。The statements in this section merely provide background information related to the present invention and do not necessarily constitute prior art.
在新一代测序中,要进行测序的核酸序列与接头序列(Adapter)连接,以便被测序仪识别,然而当核酸序列的长度短于测序平台运行的读取长度时,测序得到的基因序列片段(称之为Read)将同时包含需要进行测序的核酸序列以及全部或者部分接头序列。除此之外,在NGS(Next Generation Sequencing:下一代测序)测序中,测序结果的可信度在末尾循环(tail cycle)会变得较低,得到的是一些低质量的测序序列。被测序接头或者低质量的测序过程污染的测序序列经常导致不能满意的下游分析(如基因比对工作等)结果,因此修剪(Trim)测序中的接头以及低质量的数据成为下游分析任务之前不可缺少的环节。In next-generation sequencing, the nucleic acid sequence to be sequenced is connected with the adapter sequence (Adapter) so that it can be recognized by the sequencer. However, when the length of the nucleic acid sequence is shorter than the read length of the sequencing platform, the sequenced gene sequence fragment ( Called Read) will contain both the nucleic acid sequence to be sequenced and all or part of the linker sequence. In addition, in NGS (Next Generation Sequencing: Next Generation Sequencing) sequencing, the confidence of the sequencing results will become lower at the end of the cycle (tail cycle), and some low-quality sequencing sequences are obtained. Sequencing sequences contaminated by sequencing adapters or low-quality sequencing processes often lead to unsatisfactory downstream analysis (such as gene alignment work, etc.) missing link.
随着现代测序仪的进步,不断增长的吞吐量和序列长度,为修剪工作提出了新的挑战,当前的一些处理工具的性能(比如速度和准确率)已经难以满足如此规模的测序数据,使得预处理步骤成为了数据分析的瓶颈,一种用于NGS数据预处理的超快的、准确的测序数据接头质量修剪工具仍是迫切需求。With the advancement of modern sequencers, the ever-increasing throughput and sequence lengths pose new challenges for pruning work. The performance of some current processing tools (such as speed and accuracy) has been difficult to meet the sequencing data of this scale, making The preprocessing step has become the bottleneck of data analysis, and an ultrafast and accurate sequencing data adapter quality trimming tool for NGS data preprocessing is still in urgent need.
目前有许多的工具被用于新一代测序数据中的测序接头和低质量数据的修剪工作,比如:Trimmomatic、Fastp、Ktrim等,这些工具采用了不同的Trim算法,实现了NGS测序数据中的Adapter-Trim和Quality-Trim。At present, many tools are used for the trimming of sequencing adapters and low-quality data in next-generation sequencing data, such as: Trimmomatic, Fastp, Ktrim, etc. These tools use different Trim algorithms to realize Adapter in NGS sequencing data -Trim and Quality-Trim.
发明人发现,Ktrim相比其他几种工具,在NGS短读测序数据上有更好的表现,但是,随着固态硬盘技术的发展和普及,以及磁盘阵列技术的进步,Ktrim当前的处理速率距离硬盘的读写峰值差距较大,无法满足现在对生物数据预处理的性能要求,而且经过测试,Ktrim的线程扩展性并不好,使用四个以上的线程数量程序的处理速度难以继续提高。The inventor found that Ktrim has better performance on NGS short-read sequencing data than several other tools. However, with the development and popularization of solid-state disk technology and the advancement of disk array technology, Ktrim's current processing rate The difference between the read and write peaks of the hard disk is large, which cannot meet the current performance requirements for biological data preprocessing, and after testing, Ktrim's thread scalability is not good, and it is difficult to continue to improve the processing speed of programs using more than four threads.
发明内容SUMMARY OF THE INVENTION
本发明为了解决上述问题,提供了一种生物测序序列快速修剪方法及系统,所述方案通过采用轻量级的I/O框架,大幅度提升了I/O的速率;同时,在此基础上采用了向量化的方式优化了Adapter的寻找过程,有效提高了生物测序序列快速修剪的处理效率。In order to solve the above problems, the present invention provides a method and system for rapid pruning of biological sequencing sequences. The solution greatly improves the rate of I/O by using a lightweight I/O framework; at the same time, on this basis The vectorization method is used to optimize the Adapter search process, which effectively improves the processing efficiency of rapid pruning of biological sequencing sequences.
根据本发明实施例的第一个方面,提供了一种生物测序序列快速修剪方法,包括:According to a first aspect of the embodiments of the present invention, a rapid trimming method for biological sequencing sequences is provided, including:
获取待修剪的生物测序序列;Obtain the biological sequencing sequence to be trimmed;
对所述生物测序序列进行读操作、修剪操作以及写操作;其中,基于生产者—消费者模型对所述读操作、修剪操作以及写操作进行解耦,实现异步执行;且所述生物测序序列的格式化过程从读操作中转移到修剪操作中。A read operation, a trim operation, and a write operation are performed on the biological sequencing sequence; wherein, the read operation, the trimming operation, and the writing operation are decoupled based on a producer-consumer model to achieve asynchronous execution; and the biological sequencing sequence The formatting process is shifted from the read operation to the trim operation.
进一步的,所述读操作、修剪操作以及写操作分别采用独立的线程进行实现,其中,读线程和写线程均设置有一个,所述修剪线程设置有一个或多个。Further, the read operation, the pruning operation and the writing operation are respectively implemented by independent threads, wherein one read thread and one write thread are provided, and one or more pruning threads are provided.
进一步的,所述读操作用于通过读线程对所述生物测序序列按照块方式进行读取,并将读取的块对象存储入第一数据队列中。Further, the read operation is used to read the biological sequencing sequence in a block manner through a read thread, and store the read block objects in the first data queue.
进一步的,所述修剪操作用于通过修剪线程从所述第一数据队列中获取数据,对所述生物测序序列进行格式化,去除生物测序序列中低质量碱基序列和接头序列;同时将处理后的序列存储入第二数据队列中。Further, the trimming operation is used to obtain data from the first data queue through a trimming thread, format the biological sequencing sequence, and remove low-quality base sequences and linker sequences in the biological sequencing sequence; The latter sequence is stored in the second data queue.
进一步的,在所述修剪线程中获取接头序列包括:将所述生物测序序列中的每个碱基作为一个字符;基于向量寄存器,采用若干次位运算获得预设长度序列数据中接头序列的位置。Further, obtaining the linker sequence in the pruning thread includes: using each base in the biological sequencing sequence as a character; and using several bit operations to obtain the position of the linker sequence in the preset length sequence data based on the vector register. .
进一步的,所述写操作用于通过写线程从所述第二数据队列中获取处理后的生物测序序列,并进行存储。Further, the writing operation is used to obtain the processed biological sequencing sequence from the second data queue through a writing thread, and store it.
根据本发明实施例的第二个方面,提供了一种生物测序序列快速修剪系统,包括:According to a second aspect of the embodiments of the present invention, a rapid trimming system for biological sequencing sequences is provided, including:
数据获取单元,其用于获取待修剪的生物测序序列;a data acquisition unit, which is used to acquire the biological sequencing sequence to be trimmed;
数据处理单元,其用于对所述生物测序序列进行读操作、修剪操作以及写操作;其中,基于生产者—消费者模型对所述读操作、修剪操作以及写操作进行解耦,实现异步执行;且所述生物测序序列的格式化过程从读操作中转移到修剪操作中。A data processing unit, which is used to perform read operation, trim operation and write operation on the biological sequencing sequence; wherein, the read operation, trim operation and write operation are decoupled based on the producer-consumer model to realize asynchronous execution ; and the formatting process of the biological sequencing sequence is transferred from the read operation to the trimming operation.
与现有技术相比,本发明的有益效果是:Compared with the prior art, the beneficial effects of the present invention are:
(1)本发明提供了一种生物测序序列快速修剪方法及系统,所述方案通过采用轻量级的I/O框架,大幅度提升了I/O的速率;同时,在此基础上采用了向量化的方式优化了Adapter的寻找过程,有效提高了生物测序序列快速修剪的处理效率。(1) The present invention provides a method and system for rapid pruning of biological sequencing sequences. The solution greatly improves the rate of I/O by using a lightweight I/O framework; The vectorization method optimizes the Adapter search process and effectively improves the processing efficiency of rapid pruning of biological sequencing sequences.
(2)本发明解决了Ktrim的IO效率低的问题,将文件数据的输入输出速率改进达到或接近磁盘读写的性能峰值。所述方案通过采用生产者-消费者模型实现读线程、工作线程以及写线程之间解耦、异步和速度平衡;同时,将数据格式化的任务从读线程转移到工作线程,将写回数据准备的任务从写线程转移到工作线程来最大可能地保证读线程和写线程的速度。(2) The present invention solves the problem of low IO efficiency of Ktrim, and improves the input and output rate of file data to reach or approach the performance peak value of disk read and write. The scheme realizes decoupling, asynchrony and speed balance among reading threads, working threads and writing threads by adopting the producer-consumer model; at the same time, the task of data formatting is transferred from the reading thread to the working thread, and the data is written back. Prepared tasks are shifted from the writer thread to the worker thread to maximize the speed of the reader and writer threads.
(3)本发明所述方案使用数据池DataPool重复利用chunk对象,减少对象的创建与销毁,减少创建对象的开销;同时,在读线程、工作线程以及写线程之间传递数据时,尽可能使用数据的指针,而不进行数据复制,有效减少了内存拷贝的开销;(3) The scheme of the present invention uses the data pool DataPool to reuse chunk objects, reduces the creation and destruction of objects, and reduces the overhead of creating objects; at the same time, when transferring data between reading threads, working threads and writing threads, use data as much as possible. pointer, without data copying, which effectively reduces the overhead of memory copying;
(4)本发明所述方案在整个工作流程中的读线程、工作线程(即修剪线程)以及写线程均只创建一次,直到完成其所有的任务后才销毁,有效减少了线程创建的开销。(4) In the solution of the present invention, the reading thread, the working thread (ie, the trimming thread) and the writing thread in the whole workflow are created only once, and are not destroyed until all their tasks are completed, which effectively reduces the overhead of thread creation.
本发明附加方面的优点将在下面的描述中部分给出,部分将从下面的描述中变得明显,或通过本发明的实践了解到。Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will become apparent from the description which follows, or may be learned by practice of the invention.
附图说明Description of drawings
构成本发明的一部分的说明书附图用来提供对本发明的进一步理解,本发明的示意性实施例及其说明用于解释本发明,并不构成对本发明的不当限定。The accompanying drawings forming a part of the present invention are used to provide further understanding of the present invention, and the exemplary embodiments of the present invention and their descriptions are used to explain the present invention, and do not constitute an improper limitation of the present invention.
图1为本发明实施例中所述的读线程和工作线程(即Trim线程)之间chunk对象流转示意图;1 is a schematic diagram of the flow of chunk objects between a read thread and a worker thread (that is, a Trim thread) described in an embodiment of the present invention;
图2为本发明实施例中所述的工作线程(即Trim线程)和写线程之间的关系示意图;FIG. 2 is a schematic diagram of the relationship between a worker thread (that is, a Trim thread) and a writing thread described in an embodiment of the present invention;
图3为本发明实施例中所述的轻量级IO框架结构示意图;3 is a schematic structural diagram of the lightweight IO framework described in the embodiment of the present invention;
图4为本发明实施例中所述的整体框架示意图。FIG. 4 is a schematic diagram of the overall framework described in the embodiment of the present invention.
具体实施方式Detailed ways
下面结合附图与实施例对本发明做进一步说明。The present invention will be further described below with reference to the accompanying drawings and embodiments.
应该指出,以下详细说明都是示例性的,旨在对本发明提供进一步的说明。除非另有指明,本文使用的所有技术和科学术语具有与本发明所属技术领域的普通技术人员通常理解的相同含义。It should be noted that the following detailed description is exemplary and intended to provide further explanation of the invention. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
需要注意的是,这里所使用的术语仅是为了描述具体实施方式,而非意图限制根据本发明的示例性实施方式。如在这里所使用的,除非上下文另外明确指出,否则单数形式也意图包括复数形式,此外,还应当理解的是,当在本说明书中使用术语“包含”和/或“包括”时,其指明存在特征、步骤、操作、器件、组件和/或它们的组合。It should be noted that the terminology used herein is for the purpose of describing specific embodiments only, and is not intended to limit the exemplary embodiments according to the present invention. As used herein, unless the context clearly dictates otherwise, the singular is intended to include the plural as well, furthermore, it is to be understood that when the terms "comprising" and/or "including" are used in this specification, it indicates that There are features, steps, operations, devices, components and/or combinations thereof.
在不冲突的情况下,本发明中的实施例及实施例中的特征可以相互组合。Embodiments of the invention and features of the embodiments may be combined with each other without conflict.
术语解释:Terminology Explanation:
Adapter:接头序列,是指在生物序列测序过程中因测序技术或者测序平台原因测得的额外序列片段。Adapter: Adapter sequence, refers to the additional sequence fragments measured due to sequencing technology or sequencing platform during the biological sequence sequencing process.
Trim:测序数据质量控制的重要环节,主要是去除低质量碱基序列和Adapter序列。Trim: An important link in the quality control of sequencing data, mainly to remove low-quality base sequences and adapter sequences.
实施例一:Example 1:
本实施例的目的是提供一种生物测序序列快速修剪方法。The purpose of this embodiment is to provide a rapid trimming method for biological sequencing sequences.
首先,需要说明的是,本发明是以Ktrim工具为基础进行的改进,Ktrim相比现有的其他几种工具(如Trimmomatic、Fastp),在NGS短读测序数据上有更好的表现,通过在配备英特尔至强CPU和标准64位CentOS系统的计算服务器设备上进行测试,使用单端测试数据(一千万条FASTQ格式的测序数据,大小约2.1GB),在四个线程上该工具可以达到其最佳性能,处理时间是7.26秒,处理速率大约是300MB/秒;使用双端测试数据(两个单端测序数据文件,共两千万条FASTQ格式的数据,大小约4.2GB),在四个线程上达到最佳性能,处理时间是11.29秒,最佳性能处理速率大约是370MB/秒。Ktrim工具的作者在其论文(Sun K.Ktrim:an extra-fast and accurate adapter-and quality-trimmer for sequencing data)中提到Ktrim在保持较高的准确率的同时,处理速度是其它几种工具(Trim galore 、Trimmomatic、SeqPurgea)的~2-18倍。First of all, it should be noted that the present invention is an improvement based on the Ktrim tool. Compared with several other existing tools (such as Trimmomatic and Fastp), Ktrim has better performance on NGS short-read sequencing data. Tested on a computing server device equipped with an Intel Xeon CPU and a standard 64-bit CentOS system, using single-ended test data (10 million pieces of sequencing data in FASTQ format, about 2.1GB in size), on four threads the tool can To achieve its best performance, the processing time is 7.26 seconds, and the processing rate is about 300MB/sec; using paired-end test data (two single-end sequencing data files, a total of 20 million pieces of data in FASTQ format, about 4.2GB in size), The best performance is achieved on four threads, the processing time is 11.29 seconds, and the best performance processing rate is about 370MB/sec. The author of the Ktrim tool mentions in his paper (Sun K.Ktrim:an extra-fast and accurate adapter-and quality-trimmer for sequencing data) that Ktrim maintains a high accuracy while processing speed is the fastest among several other tools (Trim galore, Trimmomatic, SeqPurgea) ~2-18 times.
尽管,Ktrim相比其它工具有着更好的性能,但是随着固态硬盘技术的发展和普及,以及磁盘阵列技术的进步,Ktrim当前的处理速率距离硬盘的读写峰值差距较大,无法满足当前对生物数据预处理的性能要求,而且经过测试,Ktrim的线程扩展性并不好,使用四个以上的线程数量程序的处理速度难以继续提高,因此,Ktrim在计算性能上仍有较大的进步空间。Although Ktrim has better performance than other tools, with the development and popularization of solid-state disk technology and the advancement of disk array technology, the current processing rate of Ktrim is far from the peak read and write value of hard disks, which cannot meet the current demand for The performance requirements of biological data preprocessing, and after testing, the thread scalability of Ktrim is not good, and it is difficult to continue to improve the processing speed of programs using more than four threads. Therefore, Ktrim still has a lot of room for improvement in computing performance. .
通过分析和测试,我们发现Ktrim的性能瓶颈主要在FASTQ文件的解析上,低效率的IO方式导致程序的处理性能收到了限制,是典型的的IO密集型应用,同时,在Ktrim中寻找Adapter的过程也相对耗时,针对以上情况,我们提出了一种生物测序序列快速修剪方法(以下称之为RabbitTrim工具),该工具在Ktrim的基础上使用了轻量级的FASTQ数据IO框架(如图3所示,展示了该轻量级IO框架结构图),大幅度提升IO的速率,在此基础上又使用向量化的方式优化了Adapter(接头序列)的寻找过程,使得程序的处理效率得到了明显的提升,可接近我们测试平台的硬盘读写峰值(约2GB/秒)。Through analysis and testing, we found that the performance bottleneck of Ktrim is mainly in the parsing of FASTQ files. The inefficient IO method limits the processing performance of the program, which is a typical IO-intensive application. At the same time, look for Adapter in Ktrim The process is also relatively time-consuming. In view of the above situation, we propose a fast trimming method for biological sequencing sequences (hereinafter referred to as the RabbitTrim tool), which uses the lightweight FASTQ data IO framework on the basis of Ktrim (as shown in the figure). 3 shows the lightweight IO framework structure diagram), which greatly improves the IO rate. On this basis, the vectorization method is used to optimize the Adapter (connector sequence) search process, so that the processing efficiency of the program is improved. It is a significant improvement, which can be close to the peak hard disk read and write (about 2GB/sec) of our test platform.
具体的,Ktrim工具的性能瓶颈主要在数据文件的IO上,包括:Specifically, the performance bottleneck of the Ktrim tool is mainly in the IO of data files, including:
(1)读线程压力大。在Ktrim中使用一个读线程读取输入数据文件,读线程不仅需要读取字符串数据,还需要按照FASTQ格式进行数据解析,数据格式化的工作大大加重了读线程的负担,导致生产FASTQ格式的数据效率很低。(1) Read thread pressure is high. A read thread is used to read the input data file in Ktrim. The read thread not only needs to read the string data, but also needs to parse the data according to the FASTQ format. The work of data formatting greatly increases the burden of the read thread, resulting in the production of FASTQ format. Data efficiency is low.
(2)数据预取数量少。Ktrim支持多线程模式,但在多线程模式下也仅是会多预取一块数据(一个BATCH数据),当计算部分执行较快时输入数据供给依然存在问题。(2) The number of data prefetching is small. Ktrim supports multi-threaded mode, but in multi-threaded mode, it only prefetches one more piece of data (one BATCH data), and there is still a problem with the input data supply when the calculation part is executed faster.
(3)工作流程上各部分间存在依赖关系。读操作和Trim操作(即修剪操作)之间实现了部分并行,但当两个操作的执行速率不匹配的时候,会造成一方在等待另一方的问题,而Trim操作和写操作之间是串行的方式,处理完一块数据并输出完成后才会处理下一块,读操作、Trim操作、写操作之间存在较大的依赖。(3) There are dependencies among the various parts of the workflow. Partial parallelism is achieved between the read operation and the Trim operation (that is, the trim operation), but when the execution rates of the two operations do not match, it will cause one side to wait for the other side, and there is a string between the Trim operation and the write operation. In the line method, the next block is processed only after a block of data is processed and output is completed. There is a large dependency between read operations, trim operations, and write operations.
(4)寻找接头的过程较慢,Ktrim工具使用<string.h>库中的char * strstr(const char * haystack , const char * needle ) 函数来寻找接头的“种子”,这种方式相对较慢。(4) The process of finding the connector is slow. The Ktrim tool uses the char * strstr(const char * haystack , const char * needle ) function in the <string.h> library to find the "seed" of the connector, which is relatively slow .
为了解决上述问题,本发明提供了一种生物测序序列快速修剪方法,包括:In order to solve the above problems, the present invention provides a rapid trimming method for biological sequencing sequences, including:
获取待修剪的生物测序序列;Obtain the biological sequencing sequence to be trimmed;
对所述生物测序序列进行读操作、修剪操作以及写操作;其中,基于生产者—消费者模型对所述读操作、修剪操作以及写操作进行解耦,实现异步执行;且所述生物测序序列的格式化过程从读操作中转移到修剪操作中。A read operation, a trim operation, and a write operation are performed on the biological sequencing sequence; wherein, the read operation, the trimming operation, and the writing operation are decoupled based on a producer-consumer model to achieve asynchronous execution; and the biological sequencing sequence The formatting process is shifted from the read operation to the trim operation.
进一步的,所述读操作、修剪操作以及写操作分别采用独立的线程进行实现,其中,读线程和写线程均设置有一个,所述修剪线程设置有一个或多个。Further, the read operation, the pruning operation and the writing operation are respectively implemented by independent threads, wherein one read thread and one write thread are provided, and one or more pruning threads are provided.
进一步的,所述读操作用于通过读线程对所述生物测序序列按照块方式进行读取,并将读取的块对象存储入第一数据队列中。Further, the read operation is used to read the biological sequencing sequence in a block manner through a read thread, and store the read block objects in the first data queue.
进一步的,所述块对象的创建引入数据池思想,仅创建预设数量的块对象进行重复使用。Further, the creation of the block objects introduces the idea of a data pool, and only a preset number of block objects are created for repeated use.
进一步的,所述修剪操作用于通过修剪线程从所述第一数据队列中获取数据,对所述生物测序序列进行格式化,去除生物测序序列中低质量碱基序列和接头序列;同时将处理后的序列存储入第二数据队列中。Further, the trimming operation is used to obtain data from the first data queue through a trimming thread, format the biological sequencing sequence, and remove low-quality base sequences and linker sequences in the biological sequencing sequence; The latter sequence is stored in the second data queue.
进一步的,在所述修剪线程中获取接头序列包括:将所述生物测序序列中的每个碱基作为一个字符;基于向量寄存器,采用若干次位运算获得预设长度序列数据中接头序列的位置。Further, obtaining the linker sequence in the pruning thread includes: using each base in the biological sequencing sequence as a character; and using several bit operations to obtain the position of the linker sequence in the preset length sequence data based on the vector register. .
进一步的,所述写操作用于通过写线程从所述第二数据队列中获取处理后的生物测序序列,并进行存储。Further, the writing operation is used to obtain the processed biological sequencing sequence from the second data queue through a writing thread, and store it.
进一步的,所述读操作、修剪操作以及写操作所对应的线程仅创建一次,直到处理任务完成后进行销毁。Further, the threads corresponding to the read operation, the pruning operation and the write operation are created only once, and are destroyed after the processing task is completed.
进一步的,所述格式化具体包括按照FASTQ格式进行数据解析。Further, the formatting specifically includes performing data parsing according to the FASTQ format.
具体的,为了便于理解,以下结合附图对本发明所述方案进行详细说明:Specifically, in order to facilitate understanding, the scheme of the present invention is described in detail below in conjunction with the accompanying drawings:
本发明所述方案主要针对Ktrim工具中存在的问题进行改进,包括以下两个方面:输入输出过程的改进和Adapter查询过程的改进,具体为:The solution of the present invention mainly improves the problems existing in the Ktrim tool, including the following two aspects: the improvement of the input and output process and the improvement of the Adapter query process, specifically:
(一)输入输出过程的改进(1) Improvement of the input and output process
针对输入输出过程,本发明提出的RabbitTrim工具使用了轻量级的IO框架来完成数据的输入、解析和输出,其中,所述框架是基于生产者-消费者模型,将读操作、Trim操作、写操作三者进行解耦,实现异步执行;具体的,其处理过程如下:For the input and output process, the RabbitTrim tool proposed by the present invention uses a lightweight IO framework to complete data input, parsing and output, wherein the framework is based on the producer-consumer model, which combines read operations, Trim operations, The three write operations are decoupled to achieve asynchronous execution. Specifically, the processing process is as follows:
在读操作和Trim操作之间建立生产者-消费者模型,读线程作为生产者而Trim线程作为消费者,为了保证读取数据的有序性和正确性,生产者设置为一个,而消费者可以设置为一个或多个。在上文提到,Ktrim中读线程不仅负责读取数据也负责格式化数据,这给读线程带去了较大的压力。因为格式化数据时构造数据对象相比从文件中读取指定大小的字符串的速度是比较慢的,所以本发明将格式化的工作转移到了消费者(即Trim线程上),有效提高生产者(即读线程)的效率;同时,本发明可以通过增加Trim线程的数量来平衡消费者和消费者的速度。A producer-consumer model is established between the read operation and the Trim operation. The read thread acts as the producer and the Trim thread acts as the consumer. In order to ensure the order and correctness of the read data, the producer is set to one, and the consumer can Set to one or more. As mentioned above, the read thread in Ktrim is not only responsible for reading data but also for formatting data, which puts a lot of pressure on the read thread. Because the speed of constructing a data object when formatting data is slower than reading a string of a specified size from a file, the present invention transfers the formatting work to the consumer (that is, the Trim thread), effectively improving the productivity of the producer. (i.e. read thread) efficiency; at the same time, the present invention can balance the speed of consumers and consumers by increasing the number of Trim threads.
本发明所述方案按照块(chunk)的方式读取数据,读线程将数据读取到chunk对象中,然后将chunk对象存储进队列供Trim线程使用,为了减少创建反复chunk对象和为其申请内存带来的开销,同时采用了数据池的思想,只创建固定数据的chunk对象反复使用。在读线程、Trim线程之间chunk对象流转过程如图1所示。The solution of the present invention reads data in the form of chunks, the reading thread reads the data into the chunk object, and then stores the chunk object in the queue for use by the Trim thread, in order to reduce the creation of repeated chunk objects and the application of memory for them At the same time, the idea of data pool is adopted, and only chunk objects of fixed data are created for repeated use. Figure 1 shows the flow of chunk objects between the read thread and the Trim thread.
其次,在Trim线程和写线程之间,同样建立了生产者-消费者模型,Trim线程作为生产者,写线程是消费者,为了保证写回数据的正确性,程序中只使用了一个写线程,因为写磁盘的速度较慢,程序中将FASTQ对象数据转化回字符串数据的过程放在了Trim线程中,其关系如图2所示。Secondly, a producer-consumer model is also established between the Trim thread and the writer thread. The Trim thread is the producer and the writer thread is the consumer. In order to ensure the correctness of the data written back, only one writer thread is used in the program. , because the speed of writing to disk is slow, the process of converting FASTQ object data back to string data in the program is placed in the Trim thread, and the relationship is shown in Figure 2.
综上,针对输入输出过程,本发明所述方案相对于现有的Ktrim工具,具有如下改进:To sum up, for the input and output process, the solution of the present invention has the following improvements compared to the existing Ktrim tool:
(1)有效解决了Ktrim的IO效率低的问题,将文件数据的输入输出速率改进达到或接近磁盘读写的性能峰值:(1) Effectively solve the problem of low IO efficiency of Ktrim, and improve the input and output rate of file data to reach or approach the peak performance of disk read and write:
(2)使用生产者-消费者模型实现读线程、工作线程以及写线程之间解耦、异步和速度平衡;(2) Use the producer-consumer model to achieve decoupling, asynchrony, and speed balance among read threads, worker threads, and write threads;
(3)将数据格式化的任务从读线程转移到工作线程,将写回数据准备的任务从写线程转移到工作线程来最大可能地保证读线程和写线程的速度;(3) Transfer the task of data formatting from the reading thread to the working thread, and transfer the task of writing back data from the writing thread to the working thread to ensure the speed of the reading thread and the writing thread as much as possible;
(4)使用数据池DataPool重复利用chunk对象,减少了对象的创建与销毁,减少了创建对象的开销;(4) Use the data pool DataPool to reuse chunk objects, reduce the creation and destruction of objects, and reduce the overhead of creating objects;
(5)在读线程、工作线程(即Trim线程)以及写线程之间传递数据时,尽可能使用数据的指针,而不进行数据复制,有效减少内存拷贝的开销;(5) When transferring data between read threads, worker threads (ie Trim threads) and write threads, use data pointers as much as possible instead of data copying, which effectively reduces the overhead of memory copying;
(6)整个工作流程中读线程、每个工作线程以及写线程只会创建一次,直到完成其所有的任务后才销毁,减少了线程创建的开销。(6) In the entire workflow, the read thread, each worker thread and the write thread are created only once, and are not destroyed until all their tasks are completed, reducing the overhead of thread creation.
(二)Adapter查询过程的改进(2) Improvement of Adapter query process
在Ktrim工具中的核心部分为寻找可能的Adapter的位置,即Adapter的“seed”,它是一个长度为3的核苷酸序列,寻找“seed”的过程可以抽象为在一个长度为几十到几百的字符串中寻找一个长度为3的子串;现有的Ktrim使用了<string.h>库中的char * strstr(const char * haystack , const char * needle ) 函数,在优化了IO模块后,这种简单的方式效率低下,成为了程序运行时的热点。The core part of the Ktrim tool is to find the position of the possible Adapter, that is, the "seed" of the Adapter, which is a nucleotide sequence with a length of 3. The process of finding the "seed" can be abstracted as a sequence of tens to Find a substring of length 3 in hundreds of strings; the existing Ktrim uses the char * strstr(const char * haystack , const char * needle ) function in the <string.h> library to optimize the IO module Later, this simple method is inefficient and becomes a hot spot when the program is running.
本发明所述方案基于现代计算机中的向量化部件,使用向量化的方式来寻找所有“seed”的位置,具体的:将测序序列中的每个碱基可以看成一个字符,我们使用AVX512向量寄存器可以一次性处理64字节的数据,即64个碱基,通过AVX512向量寄存器和指令集,使用多次位运算(三次异或运算、两次左移运算以及两次或运算)我们就可以得到长度为64的序列数据中“seed”所在的位置,我们将长度为L的测序的序列数据分为N组,每组包含64个碱基,进行上述位运算就可以得到整条序列中所有“seed”的位置。其中,符号表示取上整。其具体过程如下述伪代码所示:The solution of the present invention is based on the vectorization components in modern computers, and uses the vectorization method to find the positions of all "seed", specifically: each base in the sequenced sequence can be regarded as a character, we use AVX512 vector The register can process 64 bytes of data at one time, that is, 64 bases. Through the AVX512 vector register and instruction set, using multiple bit operations (three XOR operations, two left shift operations, and two OR operations) we can To get the position of "seed" in the sequence data of length 64, we divide the sequence data of length L into N groups, each of which contains 64 bases. The location of the "seed". in ,symbol Indicates rounding up. The specific process is shown in the following pseudo code:
算法1 vectorized-find-seedAlgorithm 1 vectorized-find-seed
输入: seq测序序列,index1种子1,index2种子2,seed1种子1出现的所有位置,seed2种子2出现的所有位置,seedtable位置查询表Input: seq sequencing sequence, index1 seed 1, index2 seed 2, all positions where seed1 seed 1 occurs, all positions where seed2 seed 2 occurs, seedtable position lookup table
输出: seed1 and seed2Output: seed1 and seed2
1:function VECTORIZED-FIND-SEED(seq, index1, index2, seed1, seed2,seedtable)1: function VECTORIZED-FIND-SEED(seq, index1, index2, seed1, seed2, seedtable)
2: len ← length(seq) ▷序列长度2: len ← length(seq) ▷ sequence length
3: num ← ceil(len − 2/62) ▷需要迭代的次数3: num ← ceil(len − 2/62) ▷The number of iterations needed
4: i ← 04: i ← 0
5: v11 ← init(index1[0]) ▷将avx512寄存器以“种子”的某个碱基初始化5: v11 ← init(index1[0]) ▷Initialize the avx512 register with a certain base of the "seed"
6: v12 ← init(index1[1])6: v12 ← init(index1[1])
7: v13 ← init(index1[2])7: v13 ← init(index1[2])
8: v21 ← init(index2[0])8: v21 ← init(index2[0])
9: v22 ← init(index2[1])9: v22 ← init(index2[1])
10: v23 ← init(index2[2])10: v23 ← init(index2[2])
11: whilei< num do11: whilei< num do
12: v1 ← load(seq + 62 ∗ i)▷加载第i次迭代需要的序列数据到avx512寄存器v1中12: v1 ← load(seq + 62 ∗ i)▷Load the sequence data required for the i-th iteration into the avx512 register v1
13: res11 ← XOR(v1, v11) ▷异或运算13: res11 ← XOR(v1, v11) ▷ XOR operation
14: res21 ← XOR(v1, v21)14: res21 ← XOR(v1, v21)
15: v1 ← leftShif t(v1, 8) ▷v1寄存器数据左移8bit15: v1 ← leftShift(v1, 8) ▷v1 register data is shifted left by 8bit
16: res12 ← XOR(v1, v12)16: res12 ← XOR(v1, v12)
17: res22 ← XOR(v1, v22)17: res22 ← XOR(v1, v22)
18: res12 ← OR(res11, res12) ▷将两次异或的结果进行或运算18: res12 ← OR(res11, res12) ▷ OR the result of two XORs
19: res22 ← OR(res21, res22)19: res22 ← OR(res21, res22)
20: v1 ← leftShift(v1, 8)20: v1 ← leftShift(v1, 8)
21: res13 ← XOR(v1, v13)21: res13 ← XOR(v1, v13)
22: res23 ← XOR(v1, v23)22: res23 ← XOR(v1, v23)
23: res13 ← OR(res12, res13)23: res13 ← OR(res12, res13)
24: res23 ← OR(res22, res23)24: res23 ← OR(res22, res23)
25: res1 ← mask(res13) ▷使用mask将或运算后的结果转化为mask6425: res1 ← mask(res13) ▷Use mask to convert the result of OR operation to mask64
26: res2 ← mask(res23)26: res2 ← mask(res23)
27: j ← 027: j ← 0
28: while j < 8 do28: while j < 8 do
29: t1 ← res1 AND 255 ▷mask64的后8位表示的值29: t1 ← res1 AND 255 ▷The value represented by the last 8 bits of mask64
30: t2 ← res2 AND 25530: t2 ← res2 AND 255
31: res1 ← rightShif t(res1, 8) ▷mask64右移8bit31: res1 ← rightShift(res1, 8) ▷mask64 right shift 8bit
32: res2 ← rightShift(res2, 8)32: res2 ← rightShift(res2, 8)
33: putseedtable[i][j][t1] into seed1▷使用查表法去数组seed-table中查询“种子”位置33: putsseedtable[i][j][t1] into seed1▷Use the table lookup method to query the "seed" position in the array seed-table
34: putseedtable[i][j][t2] into seed234: putsseedtable[i][j][t2] into seed2
35: j + +35: j++
36: end while36: end while
37: i + +37: i++
38: end while38: end while
39: end function39: end function
其中,按照62而不是64划分是因为我们要寻找的是长度为3的碱基序列,当一个64个碱基的序列最后两个(第63、64个)碱基与“seed”的前两个碱基完全匹配,但是由于下一个碱基数据没法读进来,最终得到的结果则认为不是“seed”的开始的位置,但是实际上如果下一个碱基恰好与“seed”的第三个碱基匹配,则该位置就是“seed”的位置。所以我们在进行划分时,需要考虑到上一个64个碱基序列的最后两位,下一个64个碱基序列的前两个应该是它们,那两个碱基进行了冗余读取,相当于第一次真正读取了64个碱基,从第二次开始都是只读取了62个新的碱基,所以是按照62个碱基的大小进行划分,L-2 是第一次读取64个碱基,取上整是因为L-2并非恰好是62的倍数,如果最后剩余的碱基不足62个,也需要新的一轮来完成。Among them, the division according to 62 instead of 64 is because we are looking for a base sequence of length 3, when the last two (63rd, 64th) bases of a 64-base sequence are the same as the first two of "seed". The bases are completely matched, but since the next base data cannot be read in, the final result is considered not to be the starting position of "seed", but in fact, if the next base happens to be the third position of "seed" If the bases match, this position is the position of the "seed". Therefore, when we divide, we need to consider the last two digits of the previous 64-base sequence, and the first two of the next 64-base sequence should be them. Those two bases have been read redundantly, which is quite For the first time, 64 bases were actually read. From the second time, only 62 new bases were read, so it was divided according to the size of 62 bases. L-2 was the first time. Read 64 bases, round up because L-2 is not exactly a multiple of 62. If the last remaining bases are less than 62, a new round is needed to complete.
进一步的,上文中给出的采用AVX512向量寄存器是为了更好的进行解释说明,可以理解的,在具体实施过程中还可以采用其他存储容量的向量寄存器。Further, the use of the AVX512 vector register given above is for better explanation, and it can be understood that vector registers with other storage capacities may also be used in the specific implementation process.
进一步的,为了证明本发明所述方案(如图4所示为其整体框架结构示意图)的有效性,以下进行了实验测试:Further, in order to prove the effectiveness of the solution of the present invention (as shown in Fig. 4 is a schematic diagram of its overall frame structure), the following experimental tests were carried out:
其中,图4中的Low-Quality-Trim表示低质量碱基裁剪工作。根据FASTQ格式数据中的质量分数,使用一个固定大小的滑动窗口,从测序数据的3’端到5’端进行扫描,当滑动窗口内的碱基平均质量分数小于用户指定的阈值,将滑动窗口左移一个碱基并继续判断,直到滑动窗口内的碱基平均质量分数满足阈值,滑动窗口停止滑动,然后将滑动窗口所在位置及其后面(靠近3’端)的碱基序列裁剪掉。Among them, Low-Quality-Trim in Figure 4 represents low-quality base trimming work. According to the quality scores in the FASTQ format data, a fixed-size sliding window is used to scan from the 3' end to the 5' end of the sequencing data. When the average quality score of the bases in the sliding window is less than the user-specified threshold, the sliding window will be Move one base to the left and continue to judge until the average quality score of the bases in the sliding window meets the threshold, the sliding window stops sliding, and then the base sequence at the position of the sliding window and its back (near the 3' end) is clipped.
图4中的Adapter-Trim表示裁剪测序序列中的adapter即测序中使用的接头,adapter本质也是一段核苷酸序列。先根据用户指定的adapter序列的前三个碱基,将其作为“种子”,在测序序列中找到“种子”出现的所有位置,将这些位置作为候选位置。然后从左到右遍历序列中的候选位置,将候选位置处的序列与用户指定的adapter序列进行比较,计算海明编辑距离,距离满足用户指定阈值的第一个候选位置作为裁剪位置,将该候选位置及其后面(靠近3’端)的碱基序列裁剪掉。这个过程使用了向量化的方式,加快了寻找候选位置的速度,提高工作线程的处理效率,向量化过程采用上面的算法1。Adapter-Trim in Figure 4 represents the adapter in the trimming sequence, that is, the adapter used in the sequencing, and the adapter is essentially a nucleotide sequence. First, according to the first three bases of the adapter sequence specified by the user, use it as the "seed", find all the positions where the "seed" appears in the sequencing sequence, and use these positions as candidate positions. Then traverse the candidate positions in the sequence from left to right, compare the sequence at the candidate position with the adapter sequence specified by the user, calculate the Hamming edit distance, and use the first candidate position whose distance meets the user-specified threshold as the clipping position. The base sequence at the candidate position and its back (near the 3' end) is trimmed. This process uses a vectorization method, which speeds up the search for candidate positions and improves the processing efficiency of worker threads. The vectorization process uses the above algorithm 1.
为了评估RabbitTrim程序的正确性和性能,我们在不同的模拟数据和真实数据上进行了一系列的实验和分析,具体实验如下:In order to evaluate the correctness and performance of the RabbitTrim program, we conducted a series of experiments and analyses on different simulated data and real data. The specific experiments are as follows:
(1)正确性测试(1) Correctness test
本实施例中使用模拟数据进行正确性验证。我们使用模拟数据生成软件生成了在不同的测序错误率下的单端数据和双端数据,经过实验测试,我们的RabbitTrim在正确性上和原始的Ktrim保持一致。In this embodiment, simulation data is used for correctness verification. We used simulation data generation software to generate single-end data and paired-end data at different sequencing error rates. After experimental testing, our RabbitTrim is consistent with the original Ktrim in correctness.
(2)性能测试(2) Performance test
本实施例中使用了多个模拟数据和真实数据进行了性能测试的实验,记录了RabbitTrim、Ktirm以及Trimmomatic的运行时间和线程扩展性。下面以某个真实数据为例,数据来自NCBI(National Center for Biotechnology Information)数据库的 GEO(GSE81178),使用 GSM2144218 ~ GSM2144224并且将文件名后缀为 read1.fastq 的合并为 pooled.read1.fq,将文件后缀名为 read2.fastq 合并为 pooled.read2.fq,每个文件的大小为19GB。In this embodiment, a plurality of simulated data and real data are used to conduct performance testing experiments, and the running time and thread scalability of RabbitTrim, Ktirm, and Trimmomatic are recorded. Let's take a real data as an example. The data comes from GEO (GSE81178) in the NCBI (National Center for Biotechnology Information) database, using GSM2144218 ~ GSM2144224 and combining the file name suffix with read1.fastq into pooled.read1.fq, the file The suffix name read2.fastq is merged into pooled.read2.fq, and the size of each file is 19GB.
表1 单端数据运行时间(秒):Table 1 Single-ended data runtime (seconds):
表2 单端数据加速比:Table 2 Single-ended data speedup ratio:
表3 双端数据运行时间(秒):Table 3 Double-ended data runtime (seconds):
表4双端数据加速比:Table 4 Double-ended data speedup ratio:
通过表1至表4的实验结果我们可以看出,RabbitTrim在性能上相比Ktrim有了较大的提升,单线程执行速度提升,线程扩展性更好,最佳处理速度是Ktrim的3倍左右,约为2GB/秒(在相同平台下Ktrim约为730MB/秒),达到了测试平台的磁盘读写性能峰值(IOBound)。From the experimental results in Table 1 to Table 4, we can see that the performance of RabbitTrim has been greatly improved compared with Ktrim, the single-thread execution speed is improved, the thread scalability is better, and the optimal processing speed is about 3 times that of Ktrim. , about 2GB/sec (Ktrim is about 730MB/sec on the same platform), reaching the peak disk read/write performance (IOBound) of the test platform.
本发明所述方案通过对输入输出模块以及向量化模块的优化,解决了IO效率低的问题,提高了寻找Adapter的速率。经过实验测试,RabbitTrim在正确性上和Ktrim保持一致,但是性能得到了较大的提升,数据处理的速度可接近磁盘读写的性能峰值(~2GB/秒)。在相同的实验平台上,RabbitTrim处理数据的最佳性能峰值是原始Ktrim的3倍左右,较大幅度地提升了对新一代测序数据预处理的速度,对于加快下游分析任务有重要意义,为生物信息领域的工作者提供了一种更加快速的quality-adapter-trim工具。The solution of the present invention solves the problem of low IO efficiency by optimizing the input and output modules and the vectorization module, and improves the speed of finding an Adapter. After experimental tests, RabbitTrim is consistent with Ktrim in correctness, but the performance has been greatly improved, and the data processing speed can be close to the peak performance of disk read and write (~2GB/sec). On the same experimental platform, the peak performance of RabbitTrim for processing data is about 3 times that of the original Ktrim, which greatly improves the speed of next-generation sequencing data preprocessing, which is of great significance for speeding up downstream analysis tasks. Information workers provide a faster quality-adapter-trim tool.
实施例二:Embodiment 2:
本实施例的目的是提供一种生物测序序列快速修剪系统。The purpose of this embodiment is to provide a rapid trimming system for biological sequencing sequences.
一种生物测序序列快速修剪系统,包括:A rapid trimming system for biological sequencing sequences, including:
数据获取单元,其用于获取待修剪的生物测序序列;a data acquisition unit, which is used to acquire the biological sequencing sequence to be trimmed;
数据处理单元,其用于对所述生物测序序列进行读操作、修剪操作以及写操作;其中,基于生产者—消费者模型对所述读操作、修剪操作以及写操作进行解耦,实现异步执行;且所述生物测序序列的格式化过程从读操作中转移到修剪操作中。A data processing unit, which is used to perform read operation, trim operation and write operation on the biological sequencing sequence; wherein, the read operation, trim operation and write operation are decoupled based on the producer-consumer model to realize asynchronous execution ; and the formatting process of the biological sequencing sequence is transferred from the read operation to the trimming operation.
在更多实施例中,还提供:In further embodiments, there is also provided:
一种电子设备,包括存储器和处理器以及存储在存储器上并在处理器上运行的计算机指令,所述计算机指令被处理器运行时,完成实施例一中所述的方法。为了简洁,在此不再赘述。An electronic device includes a memory, a processor, and computer instructions stored on the memory and executed on the processor, and when the computer instructions are executed by the processor, the method described in the first embodiment is completed. For brevity, details are not repeated here.
应理解,本实施例中,处理器可以是中央处理单元CPU,处理器还可以是其他通用处理器、数字信号处理器DSP、专用集成电路ASIC,现成可编程门阵列FPGA或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。It should be understood that, in this embodiment, the processor may be a central processing unit CPU, and the processor may also be other general-purpose processors, digital signal processors DSP, application-specific integrated circuits ASIC, off-the-shelf programmable gate array FPGA or other programmable logic devices , discrete gate or transistor logic devices, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
存储器可以包括只读存储器和随机存取存储器,并向处理器提供指令和数据、存储器的一部分还可以包括非易失性随机存储器。例如,存储器还可以存储设备类型的信息。The memory may include read-only memory and random access memory and provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
一种计算机可读存储介质,用于存储计算机指令,所述计算机指令被处理器执行时,完成实施例一中所述的方法。A computer-readable storage medium is used to store computer instructions, and when the computer instructions are executed by a processor, the method described in the first embodiment is completed.
实施例一中的方法可以直接体现为硬件处理器执行完成,或者用处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器、闪存、只读存储器、可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器,处理器读取存储器中的信息,结合其硬件完成上述方法的步骤。为避免重复,这里不再详细描述。The method in the first embodiment can be directly embodied as being executed by a hardware processor, or executed by a combination of hardware and software modules in the processor. The software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art. The storage medium is located in the memory, and the processor reads the information in the memory, and completes the steps of the above method in combination with its hardware. To avoid repetition, detailed description is omitted here.
本领域普通技术人员可以意识到,结合本实施例描述的各示例的单元即算法步骤,能够以电子硬件或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本发明的范围。Those of ordinary skill in the art can realize that the unit, that is, the algorithm step of each example described in conjunction with this embodiment, can be implemented by electronic hardware or a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of the present invention.
上述实施例提供的一种生物测序序列快速修剪方法及系统可以实现,具有广阔的应用前景。The method and system for rapid trimming of biological sequencing sequences provided by the above embodiments can be implemented and have broad application prospects.
以上所述仅为本发明的优选实施例而已,并不用于限制本发明,对于本领域的技术人员来说,本发明可以有各种更改和变化。凡在本发明的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本发明的保护范围之内。The above descriptions are only preferred embodiments of the present invention, and are not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention shall be included within the protection scope of the present invention.
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210308606.6A CN114420210B (en) | 2022-03-28 | 2022-03-28 | Rapid trimming method and system for biological sequencing sequence |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210308606.6A CN114420210B (en) | 2022-03-28 | 2022-03-28 | Rapid trimming method and system for biological sequencing sequence |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114420210A CN114420210A (en) | 2022-04-29 |
CN114420210B true CN114420210B (en) | 2022-09-20 |
Family
ID=81263884
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210308606.6A Active CN114420210B (en) | 2022-03-28 | 2022-03-28 | Rapid trimming method and system for biological sequencing sequence |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114420210B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116092587B (en) * | 2023-04-11 | 2023-08-18 | 山东大学 | A biological sequence analysis system and method based on producer-consumer model |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019133928A1 (en) * | 2017-12-30 | 2019-07-04 | Uda, Llc | Hierarchical, parallel models for extracting in real-time high-value information from data streams and system and method for creation of same |
CN111278995A (en) * | 2017-08-28 | 2020-06-12 | 普梭梅根公司 | Method and system for characterizing conditions related to the female reproductive system associated with a microbial organism |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB0517960D0 (en) * | 2005-09-03 | 2005-10-12 | Ibm | A pruning method |
WO2015013657A2 (en) * | 2013-07-25 | 2015-01-29 | Kbiobox Inc. | Method and system for rapid searching of genomic data and uses thereof |
US10127260B2 (en) * | 2014-11-25 | 2018-11-13 | Sap Se | In-memory database system providing lockless read and write operations for OLAP and OLTP transactions |
US12300358B2 (en) * | 2018-08-20 | 2025-05-13 | The Board Of Trustees Of The Leland Stanford Junior University | Systems and methods for compressing genetic sequencing data and uses thereof |
CN113192558A (en) * | 2021-05-26 | 2021-07-30 | 北京自由猫科技有限公司 | Reading and writing method for third-generation gene sequencing data and distributed file system |
-
2022
- 2022-03-28 CN CN202210308606.6A patent/CN114420210B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111278995A (en) * | 2017-08-28 | 2020-06-12 | 普梭梅根公司 | Method and system for characterizing conditions related to the female reproductive system associated with a microbial organism |
WO2019133928A1 (en) * | 2017-12-30 | 2019-07-04 | Uda, Llc | Hierarchical, parallel models for extracting in real-time high-value information from data streams and system and method for creation of same |
Non-Patent Citations (3)
Title |
---|
"Ktrim: an extra-fast and accurate adapter- and quality-trimmer for sequencing data";Kun Sun等;《Bioinformatics》;20200630;第36卷(第11期);全文 * |
"PRINSEQ++, a multi-threaded tool for fast and efficient quality control and preprocessing of sequencing datasets";Cantu VA等;《PeerJ Preprints》;20190227;全文 * |
"基于哈希的高通量生物基因测序数据处理算法优化";许凯;《中国优秀博硕士学位论文全文数据库(博士)·基础科学辑》;20210415;第2021年卷(第04期);全文 * |
Also Published As
Publication number | Publication date |
---|---|
CN114420210A (en) | 2022-04-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Huangfu et al. | RADAR: A 3D-ReRAM based DNA alignment accelerator architecture | |
CN112735528A (en) | Gene sequence comparison method and system | |
US11941534B2 (en) | Genome sequence alignment system and method | |
Lavenier et al. | DNA mapping using processor-in-memory architecture | |
TW201602813A (en) | Systems, apparatuses, and methods for feature searching | |
Bingöl et al. | GateKeeper-GPU: Fast and accurate pre-alignment filtering in short read mapping | |
Buhler et al. | Mercury BLASTN: Faster DNA sequence comparison using a streaming hardware architecture | |
CN114420210B (en) | Rapid trimming method and system for biological sequencing sequence | |
CN110111837B (en) | Method and system for searching protein similarity based on two-stage structure comparison | |
CN119007838B (en) | Grouping-based protein sequence clustering method and system | |
Koerkamp | A* PA2: Up to 19× faster exact global alignment | |
US20130041593A1 (en) | Method for fast and accurate alignment of sequences | |
JP2015141543A (en) | Loop division detection program and loop division detection method | |
US20240404642A1 (en) | Genome graph analysis method, device and medium based on in-memory computing | |
CN115484107B (en) | A side channel key analysis method and system based on parallel computing | |
Wirawan et al. | Parallel DNA sequence alignment on the cell broadband engine | |
Zhang et al. | Porting and Optimizing G-BLASTN to the ROCm-based Supercomputer | |
Wang et al. | Accelerating the Smith-Waterman algorithm by GPU for high-throughput sequence alignment | |
WO2024098885A1 (en) | Disk array write processing method and apparatus, device, and medium | |
Tran et al. | Bit-parallel multiple pattern matching | |
CN114420209A (en) | Sequencing data-based pathogenic microorganism detection method and system | |
Liao et al. | A High-performance Hardware Accelerator for Genome Alignment | |
CN109656543A (en) | A kind of real time workshop method based on document keyword replacement | |
WO2017215030A1 (en) | Storage device having search function, and search method thereof | |
CN111326216B (en) | A fast partitioning method for big data gene sequencing files |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |