
CN111767255B - Optimization method for separating sample read data from fastq file - Google Patents


Info

Publication number
CN111767255B
CN111767255B · Application CN202010442647.5A
Authority
CN
China
Prior art keywords
data
read
cache
block
read data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010442647.5A
Other languages
Chinese (zh)
Other versions
CN111767255A (en)
Inventor
黄俊松
文晋
邵艳军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Herui Exquisite Medical Laboratory Co ltd
Original Assignee
Beijing Herui Exquisite Medical Laboratory Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Herui Exquisite Medical Laboratory Co ltd filed Critical Beijing Herui Exquisite Medical Laboratory Co ltd
Priority to CN202010442647.5A
Publication of CN111767255A
Application granted
Publication of CN111767255B
Legal status: Active

Classifications

    • G06F16/17 Details of further file system functions
    • G06F16/172 Caching, prefetching or hoarding of files
    • G06F3/064 Management of blocks
    • G06F3/0643 Management of files
    • G06F3/0659 Command handling arrangements, e.g. command buffers, queues, command scheduling
    • G06F9/5016 Allocation of resources, the resource being the memory
    • G06F9/5022 Mechanisms to release resources
    • G06F2209/5011 Pool (indexing scheme relating to G06F9/50)
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

An embodiment of the invention provides an optimization method for separating sample read data from a fastq file, which comprises: concurrently loading a fastq file containing a plurality of samples, constructing read data and outputting the read data; parsing a barcode pair, identifying the sample to which the read data belongs, and inserting the read data into the read cache of that sample; and writing the read data in the read cache into the output fastq file of the corresponding sample. By working in parallel, multiple threads cooperate simultaneously, improving efficiency; the optimized read storage structure reduces the number of string copy and re-concatenation operations each read undergoes over its life cycle; and the read caching technique greatly reduces the number of system lock calls and of write operations against the thread-safe queue and the file system, lowering the operating-system load and achieving the goal of rapidly separating sample read data from the fastq file.

Description

Optimization method for separating sample read data from fastq file
Technical Field
Embodiments of the present invention relate generally to the field of gene sequencing and, more particularly, to an optimized method of separating sample read data from fastq files.
Background
In the field of gene sequencing, fastq is the most commonly used file format for storing gene base sequences together with their quality scores and related information. The output data of a sequencer can be processed and stored as fastq files. To maximize the utilization of sequencers and on-board kits, it is now common practice to mix multiple samples in one sequencing run and output a single fastq file containing the gene data of all samples. Such multi-sample fastq files are typically very large, from a few GB up to tens or hundreds of GB. For further gene sequence analysis, it is necessary to split such an original fastq file by sample, i.e., to separate the gene data of each sample into an individual fastq file (for double-ended sequencing, each sample yields two separate fastq files). The traditional method reads the original fastq file line by line with a scripting language such as Python, parses and constructs each read, identifies the sample to which the read belongs, and appends the read to that sample's fastq file. This serial mode of operation, combined with the poor performance of scripting languages, makes the process particularly slow: even when the sequencer output is only a few GB, this approach can take nearly an hour to complete the separation, and when the output reaches tens or hundreds of GB, more than ten hours are needed to complete the most basic data-splitting task.
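As an illustration of the serial approach described above, here is a minimal Python sketch. The function name, the 8-character barcode position at the start of the sequence line, and the in-memory collection of reads are simplifying assumptions; a real script would append each read to its sample's output file.

```python
from collections import defaultdict

def naive_demux(fastq_lines, barcode_to_sample):
    """Traditional serial approach: walk the fastq four lines at a time,
    look up the sample from the first 8 bases of the sequence line, and
    collect each read under its sample (file writes elided for brevity)."""
    out = defaultdict(list)
    it = iter(fastq_lines)
    for header in it:
        seq = next(it)    # sequence line
        plus = next(it)   # annotation line
        qual = next(it)   # quality line
        barcode = seq[:8]                 # assumed barcode position
        sample = barcode_to_sample.get(barcode, "undetermined")
        out[sample].append(header + seq + plus + qual)
    return out
```

Every read costs a dictionary lookup plus several string operations, which is exactly the per-read overhead that dominates at hundreds of millions of reads.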
If the fastq file is read and the read data parsed concurrently, and the per-sample fastq files are output concurrently through a thread pool, sample read separation becomes much faster than with a Python script, improving speed by more than 80 percent on average. Performance bottlenecks nevertheless remain, such as extremely high CPU utilization and very large memory overhead, which make the time needed to separate the gene data fluctuate and become unstable. The most direct cause is the sheer volume of read data, which can reach hundreds of millions of reads (e.g., roughly one billion reads for a 70 GB original fastq file); this implies a huge number of string indexing, splitting, and merging operations, a huge number of queue insertions and removals, and a huge number of lock acquire and release operations, among others.
Disclosure of Invention
According to an embodiment of the present invention, there is provided an optimization method for separating sample read data from fastq file, the method including:
concurrently loading fastq files containing a plurality of samples through two threads, constructing read data and outputting the read data;
analyzing a barcode pair from the read data, identifying a sample to which the read data belongs according to the corresponding relation between the barcode pair and a sample number, and inserting the read data into a read cache of the sample to which the read data belongs;
and writing the read data in the read cache into an output fastq file of the corresponding sample through asynchronous sample threads in an asynchronous sample thread pool.
Further, the concurrent loading of fastq files containing a plurality of samples by two threads, constructing read data and outputting, includes:
distributing a first thread, a second thread and a data block queue, and setting the maximum data item number limit of the data block queue and the size of the data block;
in the first thread, sending a memory allocation request to an object memory multiplexing pool, and waiting for allocation of a data block with the data block size;
in the first thread, reading the fastq file according to the size of the data block, putting the read data into the distributed data block, judging whether the data block queue reaches the maximum data item number limit, if so, entering a waiting state until the data item number of the data block queue is smaller than the maximum data item number limit, and inserting the data block into the tail of the data block queue; otherwise, inserting the data block into the tail of the data block queue;
continuously judging whether the fastq file has been read completely, and if so, setting the data block queue end mark and ending the first thread; if not, returning to wait for allocation of another data block and continuing to load the fastq file;
judging, in the second thread, whether the data block queue is empty and the end mark is set, and ending the second thread if the data block queue is empty and the end mark is set; if the data block queue is empty and the end mark is not set, entering a waiting state until the data block queue is not empty or the end mark is set; if the data block queue is not empty, taking out the data block from the head of the data block queue to obtain fastq block data;
sequentially carrying out line feed analysis on the fastq block data to obtain a plurality of read data, and sequentially outputting the read data one by one;
and after the data block is consumed, sending a memory release request to the object memory multiplexing pool.
Further, the read data is stored in a continuous memory and comprises a starting position, an ending position and three index values, wherein the three index values are arranged between the starting position and the ending position and are used for pointing to the positions of three line feeding symbols respectively; the three line changing symbols divide the data between the starting position and the ending position into four lines of data, wherein the first line of data is an information line, the second line of data is a sequence line, the third line of data is an annotation line and the fourth line of data is a quality line; the line feed is used for triggering data line feed operation.
Further, the parsing the barcode pair from the read data includes:
taking the first 8 characters of the second line of the read data as the barcode of the read data;
constructing a barcode pair from the barcode;
in the case of single-ended sequencing, there is one barcode, which is copied to obtain two identical barcodes serving as the barcode pair;
in the case of double-ended sequencing, there are two barcodes, which together serve as the barcode pair.
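The barcode-pair construction above can be sketched in Python as follows. The function name and the plain-string read representation are illustrative; the 8-character barcode taken from the second line follows the claim.

```python
def parse_barcode_pair(read1, read2=None):
    """Take the first 8 characters of each read's sequence line (line 2)
    as its barcode; for single-ended data the one barcode is copied so
    that both modes yield a (barcode, barcode) pair."""
    bc1 = read1.split("\n")[1][:8]
    if read2 is None:              # single-ended: duplicate the barcode
        return (bc1, bc1)
    bc2 = read2.split("\n")[1][:8]  # double-ended: one barcode per read
    return (bc1, bc2)
```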
Further, the identifying the sample to which the read data belongs according to the correspondence between the barcode pair and the sample number comprises:
grouping the barcodes in the read data to obtain a plurality of barcode groups, where each barcode group comprises a plurality of different barcodes and any two barcodes that form a barcode pair belong to the same group, so that each barcode pair corresponds uniquely to one barcode group;
defining a unique correspondence between barcode groups and sample numbers, thereby obtaining a unique correspondence between barcode pairs and sample numbers;
and identifying the sample to which the read data belongs from its barcode pair according to the unique correspondence between barcode pairs and sample numbers.
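A minimal sketch of the pair-to-sample lookup implied by this grouping, assuming the groups are given as a `{sample: [barcodes]}` table (the names and the dict-based lookup are illustrative, not the patent's data structure):

```python
def build_sample_lookup(groups):
    """groups: {sample_number: [barcode, ...]}. Because any two barcodes
    from the same group form a valid pair for that sample, enumerating
    all within-group pairs yields a unique pair -> sample mapping."""
    lookup = {}
    for sample, barcodes in groups.items():
        for b1 in barcodes:
            for b2 in barcodes:
                lookup[(b1, b2)] = sample   # pair maps uniquely to sample
    return lookup
```

Identifying a read's sample is then a single dictionary lookup on its barcode pair.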
Further, the inserting the read data into the read cache of the sample to which the read data belongs includes:
allocating a read cache for each sample; the read cache comprises a cache block and a cache block queue; setting the maximum storage capacity of the cache block; the cache block queue is used for storing the read data of the same sample in order;
memory is allocated for a first cache block in the read cache through an object memory multiplexing pool;
the first cache block receives read data and judges whether the read data is received completely, if yes, the read cache is set with an end mark and the cache is ended; otherwise, the read data are put into a first cache block, whether the first cache block reaches the preset maximum storage capacity of the cache block is judged, if the first cache block reaches the preset maximum storage capacity of the cache block, the first cache block is inserted into the tail end of a cache block queue, a memory allocation request is sent to an object memory multiplexing pool, and a second cache block is allocated through the object memory multiplexing pool; if the first cache block does not reach the preset maximum storage capacity of the cache block, continuing to receive read data;
when a second cache block is allocated to the read cache, judging whether the current read data has data which are not put into the cache block, if so, putting the data which are not put into the cache block in the current read data into the second cache block, and returning to the step of receiving the read data by the first cache block; if not, the step of directly returning to the first cache block to receive read data is performed.
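The cache-block insertion logic above can be sketched as follows. This is a simplified single-threaded model under stated assumptions: the object memory multiplexing pool and the blocking behaviour on a full queue are elided, and a Python list stands in for a cache block.

```python
from collections import deque

class ReadCache:
    """Per-sample read cache: reads are appended to the current cache
    block; when the block reaches max_block_size bytes it is pushed
    onto the cache block queue and a fresh block is started."""
    def __init__(self, max_block_size):
        self.max_block_size = max_block_size
        self.block = []          # current cache block (list of read strings)
        self.block_len = 0
        self.queue = deque()     # full blocks awaiting the writer thread
        self.finished = False    # end mark

    def insert(self, read):
        self.block.append(read)
        self.block_len += len(read)
        if self.block_len >= self.max_block_size:
            self.queue.append(self.block)   # hand the full block over
            self.block = []                 # start a fresh block
            self.block_len = 0

    def finish(self):
        if self.block:                      # flush the partial block
            self.queue.append(self.block)
        self.finished = True                # set the end mark
```

Batching reads into blocks is what cuts the per-read queue and lock traffic: the queue (and any lock protecting it) is touched once per block rather than once per read.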
Further, the method further comprises the following steps:
in the case of double-ended sequencing, there are two pieces of read data, read1 data and read2 data respectively; two cache blocks are used to store the corresponding read1 and read2 data respectively, and the two cache blocks are associated with each other;
and when the cache block storing read1 reaches the preset maximum storage capacity of the cache block, inserting that cache block into the tail of the cache block queue.
Further, the method further comprises the following steps:
setting the maximum buffer block number of the buffer block queue;
if the first cache block or the cache block stored with read1 is inserted into the tail of the cache block queue, judging whether the number of the cache blocks in the cache block queue reaches the preset maximum number of the cache blocks, if so, entering a waiting state until the number of the cache blocks in the cache block queue is smaller than the preset maximum number of the cache blocks; otherwise, the first cache block or the cache block stored with read1 is inserted into the tail of the cache block queue.
Further, the writing, by the asynchronous sample thread in the asynchronous sample thread pool, the read data in the read cache into the output fastq file of the corresponding sample includes:
judging, in the asynchronous sample thread, whether the cache block queue of the read cache is empty and the end mark is set, and ending the asynchronous sample thread if the cache block queue is empty and the end mark is set; if the cache block queue is empty and the end mark is not set, entering a waiting state until the cache block queue is not empty or the end mark is set; if the cache block queue is not empty, taking out a cache block from the head of the cache block queue; the asynchronous sample thread pool comprises a plurality of asynchronous sample threads, each asynchronous sample thread uniquely corresponds to one read cache, and different asynchronous sample threads are mutually independent;
performing an inverse association operation on the obtained cache block; if an inverse-association result is obtained, writing the obtained cache block and the inverse-association result into the output fastq files of the corresponding sample respectively; if no inverse-association result is obtained, writing the obtained cache block into the output fastq file of the corresponding sample;
and after the cache block is consumed, sending a memory release request to the object memory multiplexing pool.
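A hedged sketch of such an asynchronous per-sample writer: Python's `queue.Queue` and a `threading.Event` stand in for the patent's cache block queue and end mark, and the inverse-association step for double-ended data is elided.

```python
import queue
import threading

def sample_writer(block_queue, out_path, done):
    """Asynchronous per-sample writer: drain cache blocks from the
    queue and append them to the sample's output fastq file; exit
    once the end mark is set and the queue has been drained."""
    with open(out_path, "a") as out:
        while True:
            try:
                block = block_queue.get(timeout=0.1)  # wait for a block
            except queue.Empty:
                if done.is_set():   # end mark set and queue empty: stop
                    return
                continue            # not finished yet: keep waiting
            out.writelines(block)   # one write call per whole block
```

Because each thread owns exactly one read cache and one output file, the threads share no state and need no locking between each other.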
Further, the method further comprises the following steps:
defining an object memory multiplexing pool, configuring an allocation interface and a release interface for it, and presetting its maximum storage capacity; the object memory multiplexing pool is used for recovering released object memory and reusing the recovered object memory when object memory is to be allocated;
when the object memory multiplexing pool receives an allocation memory request, judging whether the object memory multiplexing pool is empty or not through an allocation interface of the object memory multiplexing pool, and if so, allocating the object memory corresponding to the allocation memory request from an operating system memory allocation interface; otherwise, moving out an object memory corresponding to the memory allocation request from the object memory multiplexing pool, and allocating according to the memory allocation request;
when the object memory multiplexing pool receives a memory release request, judging whether the object memory multiplexing pool reaches the maximum storage capacity of the object memory multiplexing pool or not through a release interface of the object memory multiplexing pool, and if so, releasing the object memory corresponding to the memory release request from an operating system memory release interface; otherwise, the object memory corresponding to the memory release request is stored in the object memory multiplexing pool.
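The allocate/release behaviour described above can be sketched as follows. This is a simplified model under stated assumptions: `bytearray` stands in for an object memory block, and the class and method names are illustrative.

```python
class ObjectMemoryPool:
    """Object memory reuse pool: release() returns buffers to the pool
    up to a maximum capacity; allocate() reuses a pooled buffer when one
    is available and falls back to the OS allocator only when the pool
    is empty."""
    def __init__(self, max_items, block_size):
        self.max_items = max_items
        self.block_size = block_size
        self.pool = []

    def allocate(self):
        if self.pool:
            return self.pool.pop()          # reuse a recycled buffer
        return bytearray(self.block_size)   # pool empty: ask the OS

    def release(self, buf):
        if len(self.pool) < self.max_items:
            self.pool.append(buf)           # recycle for a later request
        # else: drop the reference, letting the OS reclaim the memory
```

The point of the pool is that steady-state block churn never touches the operating system's allocator at all.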
According to the invention, a fastq file containing a plurality of samples is loaded concurrently by a plurality of threads, the read data of the plurality of samples are separated, and the read data are written to per-sample fastq files asynchronously by sample threads corresponding to the samples. The parallel working mode, in which multiple threads cooperate simultaneously, improves working efficiency and greatly reduces running time; the read caching technique greatly reduces the number of system lock calls, freeing substantial CPU time, greatly lowering the operating-system load, improving computer performance utilization, and greatly shortening the time consumed in separating sample read data from the fastq file, thereby achieving the goal of rapidly separating sample read data from the fastq file.
Drawings
The above and other features, advantages and aspects of embodiments of the present invention will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, wherein like or similar reference numerals denote like or similar elements, in which:
FIG. 1 is a flow chart of an optimization method for separating sample read data from fastq files according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a concurrent loading of fastq files and outputting read data according to an embodiment of the present invention;
FIG. 3 is a diagram of a read data structure according to an embodiment of the present invention;
FIG. 4 is a flowchart of a process of identifying the sample to which the read data belongs, according to an embodiment of the invention;
FIG. 5 is a schematic diagram of the correspondence between a pair of barcode and a sample number according to an embodiment of the present invention;
FIG. 6 is a diagram showing the difference between inserting read data into a read cache in the case of single-ended sequencing and double-ended sequencing according to an embodiment of the present invention;
FIG. 7 is a flow chart of inserting the read data into a read cache according to an embodiment of the invention;
FIG. 8 is a flow chart of a process for fetching read data from the read cache according to an embodiment of the present invention;
FIG. 9 is a flow chart of writing read data in the read cache to an output fastq file according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of memory multiplexing logic for allocating and releasing object memory based on an object memory multiplexing pool according to an embodiment of the invention;
FIG. 11 is a diagram illustrating an application of the read cache insert/drop data based on the object-based memory multiplex pool according to an embodiment of the present invention;
FIG. 12 is a schematic diagram of an application of an object memory multiplexed pool in reading and parsing read data according to an embodiment of the present invention;
Fig. 13 is a block diagram of an exemplary electronic device capable of implementing embodiments of the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
According to the invention, a fastq file containing a plurality of samples is loaded concurrently by a plurality of threads, the read data of the plurality of samples are separated, and the read data are written to per-sample fastq files asynchronously by sample threads corresponding to the samples. The parallel working mode, in which multiple threads cooperate simultaneously, improves working efficiency and greatly reduces running time; the read caching technique greatly reduces the number of system lock calls, freeing substantial CPU time, greatly lowering the operating-system load, improving computer performance utilization, and greatly shortening the time consumed in separating sample read data from the fastq file, thereby achieving the goal of rapidly separating sample read data from the fastq file.
FIG. 1 illustrates a flow chart of an optimization method for separating sample read data from fastq files, according to an embodiment of the invention.
The method S100 includes:
s110, concurrently loading fastq files containing a plurality of samples through two threads, constructing read data and outputting the read data.
A fastq file containing multiple samples is typically very large, from a few GB up to tens or hundreds of GB. For the subsequent gene sequence analysis, it is necessary to split such an original fastq file into per-sample units, i.e., to separate the gene data of each sample into an individual fastq file.
As an embodiment of the present invention, the present method contemplates two threads, a first thread and a second thread; the first thread is used for reading fastq file data in blocks and inserting the read data blocks into a data block queue; and the second thread is used for taking out the data block from the data block queue, analyzing the data in the data block, and obtaining read data for output. The process of loading fastq files through the two threads concurrently, constructing read data and outputting the read data, as shown in fig. 2, includes:
s111, distributing a first thread, a second thread and a data block queue, and setting the maximum data item number limit of the data block queue and the size of a data block; the data block queue comprises a plurality of data blocks and is arranged in sequence; when it is not empty, there is at least a head data block and a tail data block. The access logic defining the data block queue is fetched by tail store and head store. Setting a fixed size, for example 1MB, for the data block; for equalizing fastq block data size per load as a data access size criterion.
S112, in the first thread, sending a memory allocation request to the object memory multiplexing pool and waiting for a data block of the configured data block size to be allocated. The object memory multiplexing pool recovers released object memory and, when object memory is to be allocated, reuses the recovered memory for the allocation. When the pool receives a memory allocation request and memory is available in the pool, it allocates memory of the requested size and returns it to the requester; if no memory is available in the pool, the object memory corresponding to the request is allocated from the operating system's memory allocation interface.
S113, in the first thread, reading the fastq file in units of the data block size and placing the read data into the allocated data block, then judging whether the data block queue has reached the maximum data item count: if so, entering a waiting state until the number of data items in the queue falls below the maximum, then inserting the data block at the tail of the queue (the inserted data block becoming the new tail); otherwise, inserting the data block at the tail of the queue directly. Then continuously judging whether the fastq file has been read completely: if so, setting the data block queue end mark and ending the first thread; if not, returning to wait for allocation of another data block and continuing to load the fastq file.
S114, judging whether the data block queue is empty and an ending mark is set in the second thread, and ending the second thread if the data block queue is empty and the ending mark is set; if the data block queue is empty and the end mark is not set, entering a waiting state until the data block queue is not empty or the end mark is set; and if the data block queue is not empty, taking out the data block from the head of the data block queue to obtain fastq block data.
S115, sequentially carrying out line feed analysis on the fastq block data to obtain a plurality of read data, outputting the read data one by one in sequence, and sending a request for releasing the memory of the data block to an object memory multiplexing pool after the data block is consumed.
The first thread and the second thread run in parallel: the first thread successively loads data blocks of the fastq file into the data block queue, while the second thread extracts data blocks one by one from the head of the queue and parses them in memory, obtaining the read data from each data block. This parallel working mode, with multiple threads cooperating simultaneously, improves operating efficiency and greatly reduces running time.
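A minimal two-thread sketch of this producer/consumer scheme. It is simplified under stated assumptions: blocks are aligned to whole lines, the object memory multiplexing pool is elided, and a bounded `queue.Queue` plays the role of the data block queue with its maximum data item count.

```python
import queue
import threading

def load_fastq_blocks(path, block_size, out_reads, max_items=8):
    """First thread reads roughly block_size-byte, line-aligned blocks
    into a bounded queue; the second thread pops blocks, splits them on
    newlines, and emits one read per four lines."""
    blocks = queue.Queue(maxsize=max_items)   # bounded data block queue
    END = object()                            # end mark

    def reader():                             # first thread
        with open(path) as f:
            buf, size = [], 0
            for line in f:
                buf.append(line)
                size += len(line)
                if size >= block_size:
                    blocks.put("".join(buf))  # put() blocks if queue full
                    buf, size = [], 0
            if buf:
                blocks.put("".join(buf))      # flush the final partial block
        blocks.put(END)                       # set the end mark

    def parser():                             # second thread
        lines = []
        while True:
            block = blocks.get()
            if block is END:
                break
            lines.extend(block.splitlines(keepends=True))
            while len(lines) >= 4:            # four lines = one read
                out_reads.append("".join(lines[:4]))
                del lines[:4]

    t1 = threading.Thread(target=reader)
    t2 = threading.Thread(target=parser)
    t1.start(); t2.start()
    t1.join(); t2.join()
```

Note how the `lines` buffer carries leftover lines across block boundaries, so a read split across two blocks is still reassembled correctly.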
The read data is stored in one continuous block of memory and comprises a start position, an end position, and three index values located between them, each pointing to the position of one of three line-feed characters; the three line feeds divide the data between the start and end positions into four lines, where the first line is the information line, the second the sequence line, the third the annotation line, and the fourth the quality line. A line feed triggers the data line-break operation.
In one embodiment of the present invention, as shown in FIG. 3 (a), a read data item is stored using the optimally defined read storage structure. That is, one whole block of continuous memory stores the four complete lines of a read; this block is named READ, with start position 0 and end position end. The start position 0 and end position end are introduced only for convenience in describing access to READ; in actual operation, the READ that continuously stores the four lines of a read is of string type, from which the start and end positions can be obtained. Three index values, LF_pos1, LF_pos2, and LF_pos3, point to the positions of the line feeds terminating the first, second, and third lines of the read data, respectively. The three line feeds divide the data into 4 lines: the first line is the information line, denoted READ[0, LF_pos1), the left-closed right-open interval from 0 to the character at LF_pos1; the second line is the sequence line, denoted READ[LF_pos1+1, LF_pos2), the left-closed right-open interval from character LF_pos1+1 to LF_pos2; the third line is the annotation line, denoted READ[LF_pos2+1, LF_pos3), the left-closed right-open interval from character LF_pos2+1 to LF_pos3; the fourth line is the quality line, denoted READ[LF_pos3+1, end), the left-closed right-open interval from character LF_pos3+1 to the READ end position end. One whole block of continuous memory thus stores the four lines of a read completely, and each read data item is output in order.
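The optimized READ layout described above can be sketched as follows. The member and accessor names are illustrative, not from the patent; the accessors return views into the single buffer, so slicing a line copies nothing.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Sketch of the optimized read storage structure: one contiguous
// string holds the four lines of a read, and three indices record the
// positions of the line feeds terminating lines one, two, and three.
struct Read {
    std::string data;                        // READ[0, end)
    std::size_t lf_pos1 = 0, lf_pos2 = 0, lf_pos3 = 0;

    // Locate the three line feeds once, after the four raw lines have
    // been copied into `data` (no trailing newline assumed here).
    bool index_newlines() {
        lf_pos1 = data.find('\n');
        if (lf_pos1 == std::string::npos) return false;
        lf_pos2 = data.find('\n', lf_pos1 + 1);
        if (lf_pos2 == std::string::npos) return false;
        lf_pos3 = data.find('\n', lf_pos2 + 1);
        return lf_pos3 != std::string::npos;
    }

    std::string_view info() const {          // READ[0, LF_pos1)
        return std::string_view(data).substr(0, lf_pos1);
    }
    std::string_view seq() const {           // READ[LF_pos1+1, LF_pos2)
        return std::string_view(data).substr(lf_pos1 + 1, lf_pos2 - lf_pos1 - 1);
    }
    std::string_view note() const {          // READ[LF_pos2+1, LF_pos3)
        return std::string_view(data).substr(lf_pos2 + 1, lf_pos3 - lf_pos2 - 1);
    }
    std::string_view qual() const {          // READ[LF_pos3+1, end)
        return std::string_view(data).substr(lf_pos3 + 1);
    }
};
```

Writing such a read to storage is then a single write of `data`, rather than four writes of four separate buffers.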
In one embodiment of the present invention, as shown in FIG. 3 (b), a conventional read storage structure is used to store a read data item: four mutually discontinuous sections of memory respectively store the four lines of one read, where the first line holds the information-line data, the second the gene sequence data, the third the annotation data, and the fourth the sequence quality data; read data in this memory format is output one item at a time in order.
With the conventional read storage structure, if the step of reconnecting the four lines into one continuous memory block is omitted and each line is written to storage individually, each line of a read generates one write IO, i.e. 4 write IOs are required to write one read data item to storage. After the four lines are first reconnected into one continuous memory block, only one write IO is needed for the same work, so the former costs 3 more write IO operations than the latter. Each write IO passes from the operating system application layer to the kernel layer, to the device driver, and finally to the storage hardware: a long IO stack, and a lengthy process compared with the in-memory operation of reconnecting four lines into one continuous block. With 1 billion read data items there would be 1 billion × 3 = 3 billion such extra write IO operations, which would be extremely slow. Therefore, when the conventional read storage structure is used, the 4 lines of data must be reconnected into one continuous memory block before being written to storage.
The following table compares the operations of the conventional read storage structure with the optimized read storage structure of the present invention at various stages of the read lifetime:
as can be seen from the above table, within one read lifecycle the optimized read storage structure performs 7 fewer data copy operations than the conventional storage structure, and one fewer memory allocation-and-release operation. Scaled to 1 billion read data items, that is 7 billion fewer copies, 1 billion fewer memory allocations, and 1 billion fewer memory releases. In actual test results, simulating 1 billion read data items passing one by one through all the lifecycle stages described in the table, the conventional read storage structure took 11 minutes while the optimized read storage structure took 4 minutes.
In conclusion, optimizing the read storage structure greatly reduces the number of basic operations performed on read data. For example, assuming the original fastq file contains 1 billion reads, using the optimized read storage structure saves, over the whole gene data separation period, 1 billion × 4 = 4 billion four-line data splitting and copying operations, and 1 billion operations reassembling and concatenating four lines of data into one continuous memory block, thereby eliminating a large amount of unnecessary CPU and memory consumption.
S120, analyzing a barcode pair from the read data, identifying a sample to which the read data belongs according to the corresponding relation between the barcode pair and the sample number, and inserting the read data into a read cache of the sample to which the read data belongs.
Further, S121, the parsing the barcode pair from the read data includes:
S1211, taking the first 8 characters of the second line of the read data as the barcode of the read data.
As an embodiment of the present invention, as shown in fig. 3:
barcode data: READ[LF_pos1+1, LF_pos1+8];
note that this is the closed interval from LF_pos1+1 to LF_pos1+8, i.e. the string formed by the first 8 characters of the second line of data.
S1212, constructing a barcode pair according to the barcode.
In the single-ended sequencing case, there is one barcode, which is copied to obtain two identical barcodes serving as the barcode pair;
in the double-ended sequencing case, there are two barcodes, which are taken together as the barcode pair.
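Steps S1211 and S1212 can be sketched with two small helpers. These are hypothetical functions for illustration only; the names and the pointer-for-optional convention are assumptions, and the 8-character barcode length follows the text above.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical helper for S1211: the barcode is the first 8 characters
// of the sequence line (the second line of a read).
std::string extract_barcode(const std::string& sequence_line) {
    return sequence_line.substr(0, 8);       // assumes the line has >= 8 chars
}

// Hypothetical helper for S1212: single-ended reads duplicate their one
// barcode; double-ended reads contribute one barcode each.
std::array<std::string, 2> make_barcode_pair(const std::string& bc1,
                                             const std::string* bc2 = nullptr) {
    // Single-ended: bc2 is absent, so bc1 is copied into both slots.
    return {bc1, bc2 ? *bc2 : bc1};
}
```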
Further, S122, the identifying, according to the correspondence between the barcode pair and the sample number, the sample to which the read data belongs, as shown in fig. 4, includes:
S1221, grouping the barcodes in the read data, as shown in FIG. 5, to obtain a plurality of barcode groups. Each barcode group contains a plurality of barcodes, no barcode appears in more than one group, and the two barcodes of any barcode pair lie in the same group; barcode pairs therefore stand in a many-to-one relationship with their barcode groups, giving a unique correspondence from barcode pair to barcode group, i.e. either barcode of any barcode pair locates a unique barcode group.
S1222, defining a unique correspondence between each barcode group and a sample number, thereby obtaining a unique correspondence from barcode pair to sample number;
S1223, identifying the sample to which the read data belongs from its barcode pair, according to the unique barcode-pair-to-sample-number correspondence.
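The two-level lookup of S1221–S1223 can be sketched as a pair of hash maps. The class and method names are illustrative assumptions; the point is that either barcode of a pair resolves to a group, and the group resolves to a sample number.

```cpp
#include <string>
#include <unordered_map>

// Sketch of the barcode -> group -> sample lookup: each barcode maps
// to a group id, and each group id maps to a sample number, so either
// barcode of a barcode pair locates the unique owning sample.
class SampleIndex {
public:
    void add(const std::string& barcode, int group_id) {
        barcode_to_group_[barcode] = group_id;
    }
    void bind(int group_id, int sample_no) {
        group_to_sample_[group_id] = sample_no;
    }
    // Returns -1 when the barcode belongs to no known group.
    int sample_of(const std::string& barcode) const {
        auto g = barcode_to_group_.find(barcode);
        if (g == barcode_to_group_.end()) return -1;
        auto s = group_to_sample_.find(g->second);
        return s == group_to_sample_.end() ? -1 : s->second;
    }
private:
    std::unordered_map<std::string, int> barcode_to_group_;
    std::unordered_map<int, int> group_to_sample_;
};
```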
Further, S123, inserting the read data into a read cache of a sample to which the read data belongs, as shown in fig. 6, includes:
First, it is determined whether the sequencing is single-ended or double-ended; the single-ended case proceeds as follows:
allocating a read buffer for each sample; the read cache comprises a cache block and a cache block queue; setting the maximum storage capacity of the cache block; the buffer block queue is used for storing read data of the same sample in sequence;
Memory is allocated for a first cache block in the read cache through an object memory multiplexing pool;
the first cache block receives read data and judges whether the read data is received completely, if yes, the read cache is set with an end mark and the cache is ended; otherwise, the read data is put into a first cache block, whether the first cache block reaches the preset maximum storage capacity of the cache block is judged, if the first cache block reaches the preset maximum storage capacity of the cache block, the first cache block is inserted into the tail end of a cache block queue, then a memory allocation request is sent to an object memory multiplexing pool, and a second cache block is allocated through the object memory multiplexing pool.
When the second cache block is allocated to the read cache, judging whether the current read data has data which are not put into the cache block, if so, putting the data which are not put into the cache block in the current read data into the second cache block, and then returning to execute the step of receiving the read data; if not, the step of receiving read data is performed directly back.
When the cache block storing the read1 data reaches the preset maximum cache block storage capacity, the cache block is inserted at the tail of the cache block queue.
Further, in the case of double-ended sequencing, the read data is two, read1 and read2 data, respectively; the two cache blocks are used for respectively storing corresponding read1 and read2 data; and associating the two cache blocks.
Further, presetting the maximum buffer block number of the buffer block queue; if the number of the buffer blocks in the buffer block queue reaches the preset maximum buffer block number when the first buffer block is inserted into the queue tail of the buffer block queue, entering a waiting state until the number of the buffer blocks in the buffer block queue is smaller than the preset maximum buffer block number.
As an embodiment of the present invention, as shown in fig. 7, the insertion and extraction of the read data is performed by setting up a read cache, and the process of inserting the read data into the read cache includes the following steps:
S123-1: allocating an independent read cache for each output sample of the fastq file, the read cache comprising cache blocks and a cache block queue. A cache block is a memory block of fixed storage capacity, for example 2MB. The cache block queue stores cache blocks, and its maximum cache block number is set, for example MaxCacheBlockNum = 128.
S123-2: and using the object memory multiplexing pool to allocate the buffer block memory for the read buffer.
S123-3: receiving the read data of the corresponding sample in the read cache and judging whether the read data has been completely received; if so, proceeding to S123-8; otherwise appending the read data into a cache block and executing step S123-4. Whether all read data has been obtained is judged by recognizing the end-of-file mark provided by the underlying file system: when reading of the fastq file ends, the file system returns an EOF mark, EOF being shorthand for end of file, indicating that the file has been completely read.
S123-4: judging whether the cache block is filled, if so, executing step S123-5; otherwise, returning to the step S123-3;
S123-5: inserting the cache block at the tail of the cache block queue. At insertion time, it is judged whether the number of cache blocks in the queue has reached the preset maximum MaxCacheBlockNum; if so, a full-wait state is entered until space in the queue permits the insertion; if not, proceeding to S123-6. The full-wait state is a waiting state entered because the cache block queue is fully loaded and has no room for the block to be inserted. Space in the queue permits the insertion once existing cache blocks have been taken out and consumed, releasing queue memory into which the waiting cache block can then be inserted.
S123-6: distributing a new cache block through the object memory multiplexing pool;
s123-7: judging whether the current read data has data which is not put into a cache block, if so, putting the data which is not put into the cache block in the current read data into a newly allocated cache block, and returning to S123-3; if not, the process returns directly to S123-3.
S123-8: and setting an ending mark for the read cache and ending the cache.
As shown in fig. 8, the method for fetching read data from the read cache includes:
judging whether the cache block queue of the read cache is empty and the end mark is set; if the queue is empty and the end mark is set, the process ends. If the queue is empty and the end mark is not set, an empty-wait state is entered until the queue is not empty or the end mark is set. If the queue is not empty, a cache block is taken from the head of the cache block queue. The empty-wait state is a waiting state entered because the cache block queue is empty, i.e. holds no data that could be taken out, so the thread waits.
In a specific embodiment, suppose 1 billion read data items are to be written to storage. Without the read cache, writing them all requires 1 billion write IOs. With the read cache, assuming a cache block size of 2MB holding about 15,000 read data items, 1 billion reads use about 66,000 cache blocks in total. When writing to storage, one cache block is one write IO, i.e. 2MB of data written at a time; so after using the read cache, the 1 billion read data items need only about 66,000 write IOs. The IO count falls from 1 billion to about 66,000, a reduction of more than 4 orders of magnitude.
For the same reason, lock and unlock operations, originally invoked 1 billion times, now need only about 60,000 invocations; likewise, 1 billion enqueue and dequeue operations are reduced to about 60,000.
In sum, by designing and applying the read caching technology, the time consumption of code running is greatly reduced, a large amount of CPU resources are released, and the load of an operating system is greatly reduced.
S130, writing read data in the read cache into an output fastq file of a corresponding sample through an asynchronous sample thread in an asynchronous sample thread pool, as shown in FIG. 9, including:
Judging whether a buffer queue of the read buffer is empty and an ending mark is set in the asynchronous sample thread, and ending the operation of the asynchronous sample thread if the buffer block queue is empty and the ending mark is set; if the buffer block queue is empty and the ending mark is not set, entering a waiting state until the buffer block queue is not empty or the ending mark is set; and if the cache block queue is not empty, taking out a cache block from the head of the cache block queue.
The asynchronous sample thread pool comprises a plurality of asynchronous sample threads, each asynchronous sample thread uniquely corresponds to one read cache and is mutually independent among different asynchronous sample threads; i.e. different asynchronous sample threads can be processed in parallel.
Further, reverse association is attempted on the obtained cache block, that is, the cache block storing the read data paired with the read data in the obtained block is looked up through the obtained block. The attempt has two possible outcomes: a reverse-associated result is found, or it is not; the reverse-associated result is the cache block storing the read data paired with the obtained read data. If a result is found, double-ended sequencing is indicated, and the obtained cache block and the reverse-associated result are each written into the output fastq file of the corresponding sample; if no result is found, single-ended sequencing is indicated, and the obtained cache block alone is written into the output fastq file of the corresponding sample.
After the obtained read data has been written into the output fastq file of the corresponding sample, it is judged whether the cache block queue of the current read cache is empty. If the queue is empty and the end mark is set, the current asynchronous sample thread ends, indicating that all read data in the fastq file has been taken out. If the queue is empty but there is no end mark, the queue currently holds no data yet the fastq file still has unconsumed read data not yet written out; in that case, or when the queue is not empty, the process returns to S130 and continues writing read data from the read cache into the output fastq file of the corresponding sample.
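The write-out side of one asynchronous sample thread can be sketched as a drain loop. The function name, file handling, and in-memory block source are illustrative assumptions; the point is that each sealed cache block costs a single `fwrite`, so one ~2MB block is one write IO.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Minimal sketch of one asynchronous sample thread's write-out: every
// sealed cache block of this sample is flushed to the sample's output
// fastq file with a single fwrite call.
void drain_to_file(std::vector<std::string>& blocks, const std::string& path) {
    std::FILE* out = std::fopen(path.c_str(), "wb");
    if (!out) return;
    for (const std::string& blk : blocks)
        std::fwrite(blk.data(), 1, blk.size(), out);  // one write IO per block
    std::fclose(out);
    blocks.clear();                                   // blocks are consumed
}
```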
Further, as shown in fig. 10, the present invention optimizes all the locations where the object memory is allocated and released, and specifically defines the object memory multiplexing pool.
An allocation interface and a release interface are provided for the object memory multiplexing pool, and the maximum storage capacity of the pool is preset; the object memory multiplexing pool recovers released object memory and reuses the recovered memory for allocation when object memory is next to be allocated.
When the frequently used object memories in the process of separating read data are released, the object memory multiplexing pool recovers the memories into the object memory multiplexing pool, and the object memories are not returned to the operating system; when the processing logic needs the memories next time, the object memory multiplexing pool directly multiplexes the previously recovered memories and returns the memories to the processing logic instead of applying for the required memories to the operating system.
The object memory multiplex pool allocation interface and the object memory multiplex pool release interface are call functions, and the object memory is allocated and released by calling the functions.
As shown in fig. 11, the read-cache data-insertion logic reuses memory blocks from the object memory multiplexing pool, while the read-cache data-removal logic recycles previously allocated memory blocks back into the pool. The program therefore soon stops applying to the operating system to allocate or release cache block memory, completely reusing the fixed set of memory blocks already obtained.
When an allocation request is received, judging whether the object memory multiplexing pool is empty or not through the object memory multiplexing pool allocation interface, and if so, allocating the object memory corresponding to the allocation request from the operating system memory allocation interface; otherwise, removing an object memory corresponding to the allocation request from the object memory multiplexing pool, and allocating according to the allocation request.
When a release request is received, judging whether the object memory multiplexing pool reaches the maximum storage capacity of the object memory multiplexing pool or not through the object memory multiplexing pool release interface, and if so, releasing the object memory corresponding to the release request from the operating system memory release interface; otherwise, the object memory corresponding to the release request is placed in the object memory multiplexing pool.
As an embodiment of the present invention, in the allocation interface of the object memory multiplexing pool, whether the object memory multiplexing pool is empty is judged, if yes, the object memory is directly allocated from the allocation interface of the operating system memory; otherwise, an object is moved out of the object memory multiplexing pool and is used as the newly allocated object memory, and the corresponding allocation request is returned.
As an embodiment of the present invention, in the object memory multiplex pool release interface, whether the object memory multiplex pool is filled is judged, if so, the object memory is released directly by using the operating system memory release interface; otherwise, the object memory is directly put into the object memory multiplexing pool, and a corresponding release request is returned.
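The two interfaces described above, allocate-or-fall-back-to-OS and release-or-recycle, can be sketched as follows. The class name, block type, and capacities are illustrative assumptions; `std::vector<char>` stands in for a generic object memory block.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Sketch of the object memory multiplexing pool: released blocks are
// kept in the pool (up to its maximum storage capacity) and handed
// back on the next allocation instead of going through the operating
// system allocator each time.
class BlockPool {
public:
    BlockPool(std::size_t max_pooled, std::size_t block_size)
        : max_pooled_(max_pooled), block_size_(block_size) {}

    std::unique_ptr<std::vector<char>> allocate() {
        if (pool_.empty())                       // pool empty: fall back to OS
            return std::make_unique<std::vector<char>>(block_size_);
        auto blk = std::move(pool_.back());      // reuse a recycled block
        pool_.pop_back();
        return blk;
    }

    void release(std::unique_ptr<std::vector<char>> blk) {
        if (pool_.size() < max_pooled_)          // pool not full: recycle
            pool_.push_back(std::move(blk));
        // else: the unique_ptr is dropped here and the OS frees the memory
    }

    std::size_t pooled() const { return pool_.size(); }

private:
    std::size_t max_pooled_, block_size_;
    std::vector<std::unique_ptr<std::vector<char>>> pool_;
};
```

In steady state, every `release` feeds a later `allocate`, so the hot loop performs no operating-system allocation at all.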
As shown in fig. 12, in the read-fastq-and-parse-read logic, i.e. in step S110, the block-reading fastq thread reuses memory blocks recovered into the object memory multiplexing pool, and the read-parsing thread recycles previously allocated memory blocks back into the pool. The program thus reaches a virtuous cycle in which it soon no longer applies to the operating system to allocate or release memory.
The invention greatly reduces the times of the program for distributing and releasing the memory from the operating system by applying the memory multiplexing technology based on the object memory multiplexing pool, thereby reducing a great amount of system overhead, reducing the load of the operating system and enabling the use amount of the program memory and the CPU utilization rate to be in a stable and healthy state.
By applying the optimized read storage structure, the read caching technology and the memory multiplexing technology jointly to the sample read data separation process, the time consumption for separating the sample read data from the fastq file is greatly shortened. By utilizing a parallel working mode, a plurality of threads simultaneously cooperate, so that the working efficiency is improved; meanwhile, the optimally defined read storage structure is adopted, so that the number of character string copying and re-concatenation times of each read data in the life cycle of the read data is reduced; in addition, the read caching technology is applied, so that the calling times of the system lock and the operation times of writing data interfaces to the line security queue and the file system are greatly reduced; the application of the object memory multiplexing pool reduces the request times of applying for and releasing the memory from the operating system in the whole separation process, thereby reducing the load of the operating system, releasing more CPU resources to other operation logics, and achieving the purpose of rapidly separating sample read data from fastq files.
The following are detailed performance comparison data including performance data for tools implemented using python script (python splitting tool), c++ version high concurrency splitting tool (high concurrency splitting tool), and the optimized c++ version splitting tool of the present invention (optimized tool of the present invention):
As shown in the above table, the optimized tool of the present invention takes on average 94% less time than the python separation tool, and on average 60% less time than the high-concurrency separation tool; and the larger the original fastq file, the more obvious the performance advantage.
The invention provides distinct processing modes for single-ended and double-ended sequencing at every stage, so it is fully suited to both; whether sequencing is single-ended or double-ended, it supports rapidly and efficiently separating sample read data from fastq files, and it ensures the correctness of the output data in both modes.
In some optional implementations of this embodiment, in the case of single-ended sequencing, in step S122 there is one read data item; one barcode is obtained from the second line of the read data and copied, so that two identical barcodes are obtained as the barcode pair.
In step S123, a read buffer is allocated to each sample; the buffer block queue in the read buffer is used for storing read data of the same sample in sequence; obtaining a piece of read data of a fastq file, and inserting the piece of read data into a corresponding read cache according to the process of inserting the read data into the read cache.
In step S130, in the asynchronous sample thread, a read data item is obtained from the head of the cache block queue of the corresponding read cache. There are multiple asynchronous sample threads, each corresponding uniquely to one read cache and mutually independent, i.e. different asynchronous sample threads can be processed in parallel. Reverse association is then attempted on the acquired read data, that is, the read data paired with the acquired read data is looked up through it; since no result can be reverse-associated in this case, the acquired read data is written into the output fastq file of the corresponding sample.
In some alternative implementations of this embodiment, in the case of double-ended sequencing, there are two fastq files, r1 and r2, respectively; two reads are output, read1 and read2, respectively. Each pair of read1 and read2 of two fastq files has the same ID.
In step S122, two read data items are obtained; one barcode is obtained from the second line of each read data item, and the two barcodes are taken as the barcode pair.
In step S123, a read buffer is allocated to each sample; the buffer block queue in the read buffer is used for storing read data of the same sample in sequence; and acquiring read1 and read2 of the two fastq files of r1 and r2, correlating the read2 with the read1, and inserting the read1 into a corresponding read cache according to the process of inserting the read data into the read cache after correlating.
In step S130, in the asynchronous sample thread, a read data is obtained from the head of the buffer block queue in the corresponding read buffer. The number of the asynchronous sample threads is multiple, each asynchronous sample thread corresponds to one read cache only, and different asynchronous sample threads are mutually independent, namely different asynchronous sample threads can be processed in parallel. And performing inverse association on the acquired read data, wherein the inverse association is successful, the inverse association result is read2, and the read data and the inversely associated read2 data are written into an output fastq file of a corresponding sample.
An exemplary electronic device capable of implementing embodiments of the invention is shown in fig. 13.
The apparatus 1300 includes a Central Processing Unit (CPU) 1301, which can perform various suitable actions and processes according to computer program instructions stored in a Read Only Memory (ROM) 1302 or computer program instructions loaded from a storage unit 1308 into a Random Access Memory (RAM) 1303. In the RAM 1303, various programs and data required for the operation of the device 1300 can also be stored. The CPU 1301, ROM 1302, and RAM 1303 are connected to each other through a bus 1304. An input/output (I/O) interface 1305 is also connected to bus 1304.
Various components in device 1300 are connected to I/O interface 1305, including: an input unit 1306 such as a keyboard, a mouse, or the like; an output unit 1307 such as various types of displays, speakers, and the like; storage unit 1308, such as a magnetic disk, optical disk, etc.; and a communication unit 1309 such as a network card, a modem, a wireless communication transceiver, or the like. The communication unit 1309 allows the device 1300 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.

Claims (9)

1. An optimization method for separating sample read data from fastq files, comprising:
concurrently loading fastq files containing a plurality of samples through two threads, constructing read data and outputting the read data;
Analyzing a barcode pair from the read data, identifying a sample to which the read data belongs according to the corresponding relation between the barcode pair and a sample number, and inserting the read data into a read cache of the sample to which the read data belongs; wherein,
inserting the read data into a read cache of a sample to which it belongs, comprising:
allocating a read buffer for each sample; the read cache comprises a cache block and a cache block queue; setting the maximum storage capacity of the cache block; the buffer block queue is used for storing read data of the same sample in sequence;
memory is allocated for a first cache block in the read cache through an object memory multiplexing pool;
the first cache block receives read data and judges whether the read data is received completely, if yes, the read cache is set with an end mark and the cache is ended; otherwise, the read data are put into a first cache block, whether the first cache block reaches the preset maximum storage capacity of the cache block is judged, if the first cache block reaches the preset maximum storage capacity of the cache block, the first cache block is inserted into the tail end of a cache block queue, a memory allocation request is sent to an object memory multiplexing pool, and a second cache block is allocated through the object memory multiplexing pool; if the first cache block does not reach the preset maximum storage capacity of the cache block, continuing to receive read data;
When the second cache block is allocated to the read cache, judging whether the current read data has data which are not put into the cache block, if so, putting the data which are not put into the cache block in the current read data into the second cache block, and returning to the step of receiving the read data by the first cache block; if not, directly returning to the step of receiving read data by the first cache block;
and writing the read data in the read cache into an output fastq file of a corresponding sample through asynchronous sample threads in an asynchronous sample thread pool.
2. The method according to claim 1, wherein the concurrently loading fastq files containing a plurality of samples by two threads, constructing read data and outputting, comprises:
distributing a first thread, a second thread and a data block queue, and setting the maximum data item number limit of the data block queue and the size of the data block;
in the first thread, sending a memory allocation request to an object memory multiplexing pool, and waiting for allocation of a data block with the data block size;
in the first thread, reading the fastq file according to the size of the data block, putting the read data into the distributed data block, judging whether the data block queue reaches the maximum data item number limit, if so, entering a waiting state until the data item number of the data block queue is smaller than the maximum data item number limit, and inserting the data block into the tail of the data block queue; otherwise, inserting the data block into the tail of the data block queue;
then judging whether the fastq file has been read completely; if so, setting an end mark on the data block queue and ending the first thread; if not, returning to the step of waiting for allocation of a data block and continuing to load the fastq file;
in the second thread, judging whether the data block queue is empty and the end mark is set; if the data block queue is empty and the end mark is set, ending the second thread; if the data block queue is empty and the end mark is not set, entering a waiting state until the data block queue is not empty or the end mark is set; if the data block queue is not empty, taking a data block from the head of the data block queue to obtain fastq block data;
sequentially carrying out line feed analysis on the fastq block data to obtain a plurality of read data, and sequentially outputting the read data one by one;
and after the data block is consumed, sending a memory release request to the object memory multiplexing pool.
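The two-thread producer/consumer pipeline of claim 2 can be sketched with Python's bounded `queue.Queue` standing in for the data block queue; `loader`, `parser`, the tiny `BLOCK_SIZE`, and the `None` end mark are illustrative choices of this sketch, not the patent's implementation:

```python
import threading, queue, io

BLOCK_SIZE = 64          # data block size (tiny, for illustration)
MAX_ITEMS = 4            # maximum number of items in the data block queue
END = None               # end mark for the data block queue

def loader(fastq, q):
    """First thread: read fixed-size blocks and insert them at the queue tail.
    `q.put` blocks (the waiting state) when MAX_ITEMS is reached."""
    while True:
        block = fastq.read(BLOCK_SIZE)
        if not block:            # fastq file read completely
            q.put(END)           # set the queue end mark
            return
        q.put(block)

def parser(q, out):
    """Second thread: take blocks from the head and split them on line feeds."""
    tail = b""
    while True:
        block = q.get()
        if block is END:
            if tail:
                out.append(tail)
            return
        lines = (tail + block).split(b"\n")
        tail = lines.pop()       # incomplete last line carried into the next block
        out.extend(lines)
```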
3. The method of claim 2, wherein the read data is stored in a continuous memory, comprising a start position, an end position, and three index values disposed between the start position and the end position, pointing to the three line-feed positions respectively; the three line-feed characters divide the data between the start position and the end position into four lines, wherein the first line is the information line, the second line is the sequence line, the third line is the annotation line and the fourth line is the quality line; the line-feed characters are used for triggering the data line-feed operation.
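Claim 3's record layout (one contiguous buffer plus three line-feed indices) could be modelled as follows; the field and helper names (`Read`, `parse_read`, `nl1`…`nl3`) are assumptions of this sketch:

```python
from dataclasses import dataclass

@dataclass
class Read:
    """One fastq record in a single contiguous buffer.
    buf[start:end] holds all four lines; nl1, nl2, nl3 index the three
    line-feed characters that separate them."""
    buf: bytes
    start: int
    end: int
    nl1: int
    nl2: int
    nl3: int

    @property
    def info(self):      # first line: information line
        return self.buf[self.start:self.nl1]

    @property
    def sequence(self):  # second line: sequence line
        return self.buf[self.nl1 + 1:self.nl2]

    @property
    def comment(self):   # third line: annotation line
        return self.buf[self.nl2 + 1:self.nl3]

    @property
    def quality(self):   # fourth line: quality line
        return self.buf[self.nl3 + 1:self.end]

def parse_read(buf: bytes, start: int, end: int) -> Read:
    """Locate the three line feeds between start and end (illustrative helper)."""
    nl1 = buf.index(b"\n", start, end)
    nl2 = buf.index(b"\n", nl1 + 1, end)
    nl3 = buf.index(b"\n", nl2 + 1, end)
    return Read(buf, start, end, nl1, nl2, nl3)
```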
4. The method of claim 1, wherein parsing the pair of barcode from the read data comprises:
taking the first 8 characters of the second line of the read data as the barcode of the read data;
constructing a barcode pair from the barcode;
in the single-ended sequencing case, there is one barcode, which is copied to obtain two identical barcodes serving as the barcode pair;
in the double-ended sequencing case, there are two barcodes, which are taken as the barcode pair.
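A minimal sketch of claim 4's barcode-pair construction (the function name and the optional-argument convention for single-ended sequencing are assumptions of this sketch):

```python
def parse_barcode_pair(read1_seq: bytes, read2_seq: bytes = None):
    """Build the barcode pair from the first 8 characters of the sequence
    line (second line) of each read.  Single-ended: the one barcode is
    copied to form the pair; double-ended: both barcodes form the pair."""
    bc1 = read1_seq[:8]
    if read2_seq is None:          # single-ended sequencing
        return (bc1, bc1)
    return (bc1, read2_seq[:8])    # double-ended sequencing
```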
5. The method of claim 1, wherein the identifying the sample to which the read data belongs according to the correspondence of the barcode pair to sample numbers comprises:
grouping the barcodes in the read data to obtain a plurality of barcode groups; each barcode group comprises a plurality of different barcodes, and any barcode and the barcode paired with it belong to the same group, so that a unique correspondence between barcode pair and barcode group is obtained;
defining a unique correspondence between barcode group and sample number, thereby obtaining a unique correspondence between barcode pair and sample number;
and identifying the sample to which the read data belongs according to the unique correspondence between the barcode pair and the sample number.
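One hedged reading of claim 5's two-level correspondence (barcode → group → sample number) as a flat lookup table; the input shape of `sample_groups` and both function names are assumptions of this sketch:

```python
def build_sample_lookup(sample_groups):
    """Map each barcode to its sample number.  `sample_groups` is an assumed
    input: a dict of sample_number -> iterable of barcodes in that group."""
    barcode_to_sample = {}
    for sample, barcodes in sample_groups.items():
        for bc in barcodes:
            barcode_to_sample[bc] = sample
    return barcode_to_sample

def identify_sample(barcode_pair, barcode_to_sample):
    """Both barcodes of a pair lie in the same group, so they must resolve
    to the same sample; otherwise the read is unassigned (None)."""
    s1 = barcode_to_sample.get(barcode_pair[0])
    s2 = barcode_to_sample.get(barcode_pair[1])
    return s1 if s1 == s2 else None
```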
6. The method as recited in claim 1, further comprising:
in the double-ended sequencing case, there are two read data, namely read1 data and read2 data; two cache blocks are used to store the corresponding read1 data and read2 data respectively, and the two cache blocks are associated with each other;
and when the cache block storing read1 data reaches the preset maximum storage capacity of the cache block, inserting the cache block into the tail of the cache block queue.
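One way the association between the two cache blocks (and the reverse association later used by claim 8 when writing output) could be kept is a lookup table keyed by the read1 block; the `associate`/`reverse_associate` names and the `id()`-keyed table are inventions of this sketch, valid only while the blocks stay alive:

```python
# association table: read1 block id -> read2 block (illustrative)
association = {}

def associate(read1_block, read2_block):
    """Record that these two cache blocks belong to the same read pair."""
    association[id(read1_block)] = read2_block

def reverse_associate(read1_block):
    """Recover the read2 block queued together with a read1 block, if any."""
    return association.get(id(read1_block))
```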
7. The method as recited in claim 6, further comprising:
setting the maximum buffer block number of the buffer block queue;
if the first cache block or the cache block storing read1 data is to be inserted at the tail of the cache block queue, judging whether the number of cache blocks in the cache block queue has reached the preset maximum number of cache blocks; if so, entering a waiting state until the number of cache blocks in the cache block queue is smaller than the preset maximum number, and then inserting; otherwise, inserting the first cache block or the cache block storing read1 data at the tail of the cache block queue directly.
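The bounded cache block queue of claim 7 is a classic blocking queue; a minimal sketch with `threading.Condition` (class and method names are assumptions of this sketch):

```python
import threading
from collections import deque

class BoundedBlockQueue:
    """Cache block queue with a preset maximum number of blocks.  A producer
    inserting into a full queue waits until a consumer removes a block."""
    def __init__(self, max_blocks):
        self.max_blocks = max_blocks
        self.blocks = deque()
        self.cond = threading.Condition()

    def push_tail(self, block):
        with self.cond:
            while len(self.blocks) >= self.max_blocks:  # limit reached
                self.cond.wait()                        # waiting state
            self.blocks.append(block)                   # insert at the tail
            self.cond.notify_all()

    def pop_head(self):
        with self.cond:
            while not self.blocks:
                self.cond.wait()
            block = self.blocks.popleft()               # take from the head
            self.cond.notify_all()
            return block
```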
8. The method of claim 1, wherein writing read data in the read cache into the output fastq file of the corresponding sample by an asynchronous sample thread in an asynchronous sample thread pool comprises:
in the asynchronous sample thread, judging whether the cache block queue of the read cache is empty and the end mark is set; if the cache block queue is empty and the end mark is set, ending the asynchronous sample thread; if the cache block queue is empty and the end mark is not set, entering a waiting state until the cache block queue is not empty or the end mark is set; if the cache block queue is not empty, taking a cache block from the head of the cache block queue; the asynchronous sample thread pool comprises a plurality of asynchronous sample threads, each uniquely corresponding to one read cache and independent of the others;
performing a reverse-association operation on the obtained cache block; if a reverse-association result is obtained, writing the obtained cache block and the reverse-association result into the output fastq files of the corresponding sample respectively; if no reverse-association result is obtained, writing the obtained cache block into the output fastq file of the corresponding sample;
and after the cache block is consumed, sending a memory release request to the object memory multiplexing pool.
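The asynchronous-sample-thread loop of claim 8 (empty queue + end mark → stop; empty queue alone → wait; otherwise drain the head into the sample's output file) can be sketched as follows; `SampleCache` and `sample_writer` are illustrative names, and a `BytesIO` stands in for the output fastq file:

```python
import threading, io
from collections import deque

class SampleCache:
    """Minimal read cache owned by one asynchronous sample thread."""
    def __init__(self):
        self.queue = deque()
        self.ended = False
        self.cond = threading.Condition()

    def push(self, block):
        with self.cond:
            self.queue.append(block)
            self.cond.notify()

    def finish(self):
        with self.cond:
            self.ended = True   # set the end mark
            self.cond.notify()

def sample_writer(cache, out_file):
    """Asynchronous sample thread body: wait while the queue is empty and
    the end mark is unset; stop once it is empty with the mark set;
    otherwise take the head block and write it to the sample's file."""
    while True:
        with cache.cond:
            while not cache.queue and not cache.ended:
                cache.cond.wait()
            if not cache.queue:
                return
            block = cache.queue.popleft()
        out_file.write(block)
```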
9. The method of any one of claims 1, 2, 8, further comprising:
defining an object memory multiplexing pool, configuring an allocation interface and a release interface for it, and presetting its maximum storage capacity; the object memory multiplexing pool is used for recovering released object memories and reusing the recovered object memories when object memories are to be allocated;
when the object memory multiplexing pool receives a memory allocation request, judging through its allocation interface whether the pool is empty; if so, allocating the object memory corresponding to the request through the operating system memory allocation interface; otherwise, removing an object memory from the pool and allocating it according to the request;
when the object memory multiplexing pool receives a memory release request, judging whether the object memory multiplexing pool reaches the maximum storage capacity of the object memory multiplexing pool or not through a release interface of the object memory multiplexing pool, and if so, releasing the object memory corresponding to the memory release request from an operating system memory release interface; otherwise, the object memory corresponding to the memory release request is stored in the object memory multiplexing pool.
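A minimal single-threaded sketch of claim 9's pool behavior, with a plain `bytearray` standing in for the operating system allocation interface and garbage collection standing in for the release interface; class and parameter names are assumptions of this sketch:

```python
from collections import deque

class ObjectMemoryPool:
    """Object memory multiplexing pool: released buffers are kept up to a
    preset capacity and reused for later allocations; when the pool is
    empty, allocation falls back to the underlying allocator."""
    def __init__(self, max_objects, obj_size):
        self.max_objects = max_objects   # preset maximum storage capacity
        self.obj_size = obj_size
        self.pool = deque()

    def allocate(self):
        if self.pool:                    # reuse a recovered object memory
            return self.pool.pop()
        return bytearray(self.obj_size)  # "OS" allocation when pool is empty

    def release(self, buf):
        if len(self.pool) < self.max_objects:
            self.pool.append(buf)        # recover into the pool
        # else: drop the reference and let the runtime free it
```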
CN202010442647.5A 2020-05-22 2020-05-22 Optimization method for separating sample read data from fastq file Active CN111767255B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010442647.5A CN111767255B (en) 2020-05-22 2020-05-22 Optimization method for separating sample read data from fastq file


Publications (2)

Publication Number Publication Date
CN111767255A CN111767255A (en) 2020-10-13
CN111767255B (en) 2023-10-13

Family

ID=72719592

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010442647.5A Active CN111767255B (en) 2020-05-22 2020-05-22 Optimization method for separating sample read data from fastq file

Country Status (1)

Country Link
CN (1) CN111767255B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559020A (en) * 2013-11-07 2014-02-05 中国科学院软件研究所 Method for realizing parallel compression and parallel decompression on FASTQ file containing DNA (deoxyribonucleic acid) sequence read data
CN104657627A (en) * 2013-11-18 2015-05-27 广州中国科学院软件应用技术研究所 Searching and determining method and system started from FASTQ format read segment
CN105525000A (en) * 2016-01-20 2016-04-27 江西师范大学 QTL-seq-based method for discovering cold-tolerant gene of Dongxiang wild rice
CN107180166A (en) * 2017-04-21 2017-09-19 北京希望组生物科技有限公司 A kind of full-length genome structure variation analysis method and system being sequenced based on three generations
CN109416928A (en) * 2016-06-07 2019-03-01 伊路米纳有限公司 For carrying out the bioinformatics system, apparatus and method of second level and/or tertiary treatment
CN110797082A (en) * 2019-10-24 2020-02-14 福建和瑞基因科技有限公司 Method and system for storing and reading gene sequencing data
CN111061434A (en) * 2019-12-17 2020-04-24 人和未来生物科技(长沙)有限公司 Gene compression multi-stream data parallel writing and reading method, system and medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2994749B1 (en) * 2013-01-17 2025-03-05 Illumina, Inc. Bioinformatics systems, apparatuses, and methods executed on an integrated circuit processing platform
GB2530012A (en) * 2014-08-05 2016-03-16 Illumina Cambridge Ltd Methods and systems for data analysis and compression


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
BioPerl--SeqIO; 海骆驼; CSDN blog, https://blog.csdn.net/seallama/article/details/21460449?spm=1001.2014.3001.5501; pp. 1-6 *
High-quality full-length immunoglobulin profiling with unique molecular barcoding; Turchaninova, M. A. et al.; Nature Protocols; vol. 11, no. 9; pp. 1599-1616 *
Optimized Design and Implementation of a Cloud-Computing-Based Gene Data Preprocessing Pipeline; Liu Cheng; China Master's Theses Full-text Database, Information Science and Technology; no. 01; I139-371 *
Research on Methods for Identifying Fusion Genes Based on Tumor RNA-Seq Data; Chen Qi; China Master's Theses Full-text Database, Medicine and Health Sciences; no. 03; E072-21 *
A Preliminary Study on the Relationship between Aberrant Methylation of Multiple Genes and Breast Cancer of Different Molecular Subtypes; Li Shengyun; China Master's Theses Full-text Database, Medicine and Health Sciences; no. 03; E072-1011 *

Also Published As

Publication number Publication date
CN111767255A (en) 2020-10-13

Similar Documents

Publication Publication Date Title
US7571163B2 (en) Method for sorting a data structure
CN111767256B (en) Method for separating sample read data from fastq file
CN103593257B (en) A kind of data back up method and device
US20130061018A1 (en) Memory access method for parallel computing
JP6172649B2 (en) Information processing apparatus, program, and information processing method
CN108319543A (en) A kind of asynchronous processing method and its medium, system of computer log data
CN105608162B (en) Document handling method and device
CN106383666B (en) Data storage method and device
CN102667714B (en) Support the method and system that the function provided by the resource outside operating system environment is provided
CN111581155B (en) Method and device for entering data into database and computer equipment
CN112035522A (en) Database data acquisition method and device
CN111159497A (en) Regular expression generation method and data extraction method based on regular expression
CN111767255B (en) Optimization method for separating sample read data from fastq file
CN101770461B (en) Data processing method and data processing system
CN115129621B (en) Memory management method, device, medium and memory management module
CN112416539B (en) Multi-task parallel scheduling method for heterogeneous many-core processor
CN107168788A (en) The dispatching method and device of resource in distributed system
CN111782609B (en) Method for rapidly and uniformly slicing fastq file
CN116974994B (en) High-efficiency file collaboration system based on clusters
CN112416589A (en) Method for timing operation peak-shifting execution of operation and maintenance platform
CN112256441A (en) Memory allocation method and device for neural network inference
CN115242861B (en) RTE layer communication data mapping configuration file generation method and system, computer readable storage medium and electronic equipment
CN115422231A (en) Data page processing method and device, electronic equipment and medium
CN112256632B (en) Instruction distribution method and system in reconfigurable processor
CN115202667A (en) Data conversion method, system and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 102206 room 602, 6 / F, building 4, courtyard 4, shengshengyuan Road, Huilongguan town, Changping District, Beijing (Changping Demonstration Park)

Applicant after: Beijing Herui precision medical device technology Co.,Ltd.

Address before: 102206 room 602, 6 / F, building 4, courtyard 4, shengshengyuan Road, Huilongguan town, Changping District, Beijing (Changping Demonstration Park)

Applicant before: Beijing Herui precision medical laboratory Co.,Ltd.

TA01 Transfer of patent application right

Effective date of registration: 20230906

Address after: Room 102 and Room 103, 1st Floor, Building 5, No. 4 Life Park Road, Life Science Park, Changping District, Beijing, 102206

Applicant after: Beijing Herui exquisite medical laboratory Co.,Ltd.

Address before: 102206 room 602, 6 / F, building 4, courtyard 4, shengshengyuan Road, Huilongguan town, Changping District, Beijing (Changping Demonstration Park)

Applicant before: Beijing Herui precision medical device technology Co.,Ltd.

GR01 Patent grant