
CN113257356B - Gene sequencing data external ordering method and device based on different storage levels - Google Patents


Info

Publication number
CN113257356B
Authority
CN
China
Prior art keywords
data
sequencing
memory
gene
gene sequencing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110633578.0A
Other languages
Chinese (zh)
Other versions
CN113257356A (en
Inventor
谭光明
刘万奇
李叶文
康宁
孙凝晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Western Research Institute Of China Science And Technology Computing Technology
Original Assignee
Western Research Institute Of China Science And Technology Computing Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Western Research Institute Of China Science And Technology Computing Technology
Priority to CN202110633578.0A
Publication of CN113257356A
Application granted
Publication of CN113257356B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biophysics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a method and a device for external sorting of gene sequencing data based on different storage levels, belonging to the technical field of computer system architecture design and data sorting. The method comprises: reading the storage capacity required by the data to be sorted to determine the size of the data to be sorted; if the size exceeds a preset threshold, sorting the data hierarchically through a first memory, a second memory and a third memory; and, after the data has been sorted, writing the sorting result back to an external memory for storage. The technical scheme of the invention ensures the sorting performance for gene sequencing data and improves the sorting efficiency.

Description

Gene sequencing data external ordering method and device based on different storage levels
Technical Field
The invention relates to the technical field of computer system architecture design and data sorting, and in particular to a method and a device for external sorting of gene sequencing data based on different storage levels.
Background
With the rapid development of bioinformatics, gene analysis has become a widely used technique in scientific research and industry, with successful applications in species identification, disease diagnosis and other areas. Gene analysis is based on gene sequencing technology, and second-generation sequencing is currently the most commonly adopted.
The cost of second-generation sequencing keeps falling, so the volume of gene sequencing data is growing rapidly; this trend is becoming more pronounced, and the data volume will reach a striking magnitude in the future. To process massive amounts of gene sequencing data, a complete gene analysis pipeline must be run on a modern computing system, in which sorting is an important step after the gene sequencing data has been aligned to a reference sequence.
Because the gene data to be sorted may be large, even too large to be read into memory for computation, it must be sorted externally. The scheme widely used today is software external sorting: a processor serves as the control and computation unit for sorting, intermediate data is moved between memory and hard disk, and the final sorted result is obtained by merging. However, because this external sorting scheme uses the processor as the processing unit throughout the sort, it burdens the CPU and is inefficient.
Disclosure of Invention
The technical problem solved by the invention is to provide a method and a device for external sorting of gene sequencing data based on different storage levels, which ensure the sorting performance for gene sequencing data and improve the sorting efficiency.
The basic scheme provided by the invention is as follows:
The method for external sorting of gene sequencing data based on different storage levels comprises the following steps:
reading the storage capacity required by the data to be sorted to determine the size of the data to be sorted;
if the size of the data to be sorted exceeds a preset threshold, sorting the data to be sorted hierarchically through a first memory, a second memory and a third memory;
after the data to be sorted has been sorted, writing the sorting result back to an external memory for storage.
The basic scheme works as follows:
In this scheme, the external sorting method is built on an in-storage-computing sorting engine for gene sequencing data and sorts the records in order by the names or coordinates in the gene sequencing data. Specifically, the storage capacity of the data to be sorted is read to determine its size; when the size exceeds a preset threshold, the data is sorted in multiple hierarchical stages through the first memory, the second memory and the third memory, and the result of the external sort is stored in the external memory.
Hierarchical sorting through the first, second and third memories means that, when the data to be sorted is large and exceeds the capacity of the internal memory, the sort is completed by a hardware sorting network in which the internal memory and the external memory interact.
The basic scheme has the following beneficial effects:
(1) For larger data to be sorted, whose required storage capacity exceeds that of the internal memory, the first, second and third memories are used for multi-stage hierarchical sorting. Compared with sorting the large data directly, multi-stage sorting divides it into smaller pieces that are sorted separately; the stages can run in parallel and be time-division multiplexed, which improves the efficiency of external sorting. In addition, because the external memory has a large capacity, larger gene sequencing data can be stored, which also improves external sorting performance.
(2) The sorting of gene sequencing data is completed by a hardware network comprising the first, second and third memories, which avoids the processing burden incurred when software running on a processor performs the sort.
(3) Because the sort is completed by this hardware network, the large I/O overhead caused by moving the data to be sorted back and forth between the processor and the memory is avoided, which improves sorting performance for gene sequencing data.
Further, the storage capacity of the first memory is smaller than that of the second memory, and the storage capacity of the second memory is smaller than that of the third memory.
In this scheme, the data to be sorted is sorted hierarchically through multiple levels of memory with different storage capacities, which improves sorting performance for gene sequencing data.
Further, the first memory is a static random access memory, the second memory is a dynamic random access memory, and the third memory is an external memory.
In this scheme, the memories of different capacities are further limited to a static random access memory, a dynamic random access memory and an external memory, which facilitates hierarchical sorting of the data to be sorted.
Further, the step of sorting the data to be sorted hierarchically through the first memory, the second memory and the third memory if its size exceeds the preset threshold includes:
equally dividing the data to be sorted into a plurality of first gene sequencing data through the first memory;
performing lossless compression on each first gene sequencing data separately;
performing bitonic sorting on the losslessly compressed first gene sequencing data.
In this scheme, when the size of the data to be sorted exceeds the preset threshold, the data is equally divided into a plurality of first gene sequencing data through the first memory, that is, through the static random access memory. Each first gene sequencing data is losslessly compressed and then sorted by bitonic sorting. Bitonic sorting is well suited to hardware implementation, so the first gene sequencing data is sorted directly in the first memory. This speeds up sorting of the partitioned gene sequencing data, avoids the memory bandwidth limitation encountered when sorting the data, and improves the bandwidth utilization of the external memory.
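For illustration only, the compare-and-exchange pattern of bitonic sorting (the "double-tone" sorting named in this scheme) can be sketched in software as follows. This is a minimal sketch, not the hardware network of the invention: the function names are hypothetical, the input length is assumed to be a power of two, and the records are assumed to be directly comparable keys.

```python
def bitonic_sort(values, ascending=True):
    """Sort a list whose length is a power of two with a bitonic network."""
    n = len(values)
    if n <= 1:
        return list(values)
    if n & (n - 1):
        raise ValueError("length must be a power of two (assumption of this sketch)")
    half = n // 2
    # Build a bitonic sequence: first half ascending, second half descending.
    first = bitonic_sort(values[:half], True)
    second = bitonic_sort(values[half:], False)
    return _bitonic_merge(first + second, ascending)


def _bitonic_merge(seq, ascending):
    """Turn a bitonic sequence into a sorted one by compare-and-exchange steps."""
    n = len(seq)
    if n == 1:
        return list(seq)
    seq = list(seq)
    half = n // 2
    # Each comparison in this loop is independent of the others, which is
    # why the pattern maps well onto parallel hardware comparators.
    for i in range(half):
        if (seq[i] > seq[i + half]) == ascending:
            seq[i], seq[i + half] = seq[i + half], seq[i]
    return _bitonic_merge(seq[:half], ascending) + _bitonic_merge(seq[half:], ascending)
```

The sequence of comparisons depends only on the input length and not on the data values, which is the property that makes a fixed sorting network possible.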
Further, after the step of performing bitonic sorting on each losslessly compressed first gene sequencing data, the method further comprises:
merging the sorted first gene sequencing data into a plurality of second gene sequencing data according to a tree structure through the second memory;
performing bitonic sorting on the merged second gene sequencing data.
After the step of performing bitonic sorting on the merged second gene sequencing data, the method further comprises:
merging the sorted second gene sequencing data into a plurality of third gene sequencing data according to a tree structure through the third memory;
performing bitonic sorting on the merged third gene sequencing data.
After the step of performing bitonic sorting on the merged third gene sequencing data, the method further comprises:
merging the sorted third gene sequencing data into the final sorted gene sequencing data according to the tree structure and writing the final gene sequencing data back to the external memory.
In this scheme, the sorted first gene sequencing data is merged according to a tree structure through the dynamic random access memory, and the merged second gene sequencing data is bitonically sorted by the hardware network; the sorted second gene sequencing data is then merged according to a tree structure through the external memory, and the merged third gene sequencing data is bitonically sorted by the hardware network. Compared with sorting the partitioned gene sequencing data directly through the external memory, this requires less hardware, facilitates parallel operation and improves the sorting efficiency for gene sequencing data.
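As a rough software analogue of merging according to a tree structure, the sketch below merges already-sorted runs pairwise, level by level, until one run remains. It assumes a binary tree and a plain two-way merge in place of the hardware network described here; the names `merge_two` and `merge_tree` are hypothetical.

```python
def merge_two(left, right):
    """Merge two sorted lists into one sorted list."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def merge_tree(runs):
    """Merge sorted runs pairwise, one tree level at a time."""
    level = [list(run) for run in runs]
    if not level:
        return []
    while len(level) > 1:
        next_level = []
        for k in range(0, len(level), 2):
            if k + 1 < len(level):
                next_level.append(merge_two(level[k], level[k + 1]))
            else:
                # An unpaired run is carried up to the next level unchanged.
                next_level.append(level[k])
        level = next_level
    return level[0]
```

All merges within one tree level are independent of each other, which is what allows them to proceed in parallel.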
Further, equally dividing the data to be sorted into a plurality of first gene sequencing data is specifically:
the data to be sorted contains N read pairs; the N read pairs are equally divided into T parts, so that each first gene sequencing data contains N/T read pairs.
The data to be sorted consists of a plurality of read pairs and is split equally by read pair into a plurality of first gene sequencing data. The first gene sequencing data can then be sorted hierarchically, the sorting of large data can be completed on the storage side, and sorting performance for gene sequencing data is improved.
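The equal division can be sketched as follows. The function name `split_reads` is hypothetical, and the handling of the case where N is not divisible by T (a shorter last block) is an assumption of this sketch, since the text only describes the N/T case.

```python
def split_reads(reads, parts):
    """Split a list of N read pairs into blocks of ceil(N / parts) items."""
    if parts <= 0:
        raise ValueError("parts must be positive")
    n = len(reads)
    if n == 0:
        return []
    size = -(-n // parts)  # ceiling division; equals N/T when T divides N
    return [reads[i:i + size] for i in range(0, n, size)]


# Example: N = 12 read pairs divided into T = 4 parts of N/T = 3 each.
blocks = split_reads(list(range(12)), 4)
```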
Further, the lossless compression specifically encodes the repeated information in each first gene sequencing data according to a directed acyclic graph.
Encoding the repeated information in each first gene sequencing data based on a directed acyclic graph compresses that repeated information and thus reduces the size of the first gene sequencing data. The method can also sort files directly in the compressed format, which improves the bandwidth utilization of the external memory.
In addition, to achieve the above object, the invention also provides a device for external sorting of gene sequencing data based on different storage levels, the device comprising:
a memory, a processor, and a program for external sorting of gene sequencing data based on different storage levels that is stored in the memory and executable on the processor, wherein the program, when executed by the processor, implements the method for external sorting of gene sequencing data based on different storage levels.
Drawings
FIG. 1 is a flow chart of an embodiment of the method for external sorting of gene sequencing data based on different storage levels according to the invention;
FIG. 2 is a schematic diagram of gene alignment analysis in an embodiment of the method;
FIG. 3 is a schematic diagram of an embodiment of the method;
FIG. 4 is a schematic diagram of the hierarchical external sorting structure in an embodiment of the method;
FIG. 5 is a diagram of the block compression structure in an embodiment of the method;
FIG. 6 is a schematic diagram of an embodiment of the in-storage-computing sorting engine for gene sequencing data according to the invention.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The following is a further detailed description of the embodiments:
The device for external sorting of gene sequencing data based on different storage levels in this scheme may be a terminal device serving as the hardware operating environment, for example a PC (personal computer) or a portable computer.
The terminal device may include a processor, a communication bus, a user interface, a network interface and a memory. The communication bus provides the connection and communication among the processor, the user interface, the network interface and the memory. The user interface may include a display and an input unit such as a keyboard, a tablet or a stylus; optionally, the user interface may also include a standard wired interface and a wireless interface. The network interface may optionally include a standard wired interface (e.g., an RJ45 interface) and a wireless interface (e.g., a Wi-Fi interface).
In the sorting device of this scheme, the user interface is mainly used for data communication with terminals, the network interface is mainly used for connecting to a background server and exchanging data with it, and the processor may be used to call the program for external sorting of gene sequencing data based on different storage levels stored in the memory and to perform the following operations, as shown in FIG. 1:
Step S100: obtain the data to be sorted that is generated after Fastq files are aligned to a reference sequence;
That is, referring to FIG. 2, before the data is sorted, one Fastq file or a plurality of Fastq files must be aligned to the reference sequence to generate SAM intermediate data. The gene sequencing data in the SAM intermediate data is unordered and must be sorted in order by the names of the gene fragments or by the positions of the gene fragments on the reference sequence. It should be noted that a Fastq file stores many fragmented gene fragments; aligning Fastq to the reference sequence yields the positions of these fragments on the existing reference sequence, from which the complete sequenced gene sequence is obtained.
Further, in the gene analysis pipeline for paired-end sequencing, two Fastq files are aligned to the reference sequence, which usually generates larger data to be sorted (SAM files). The SAM files are unordered at this point and must be sorted by the names of the gene sequencing data in the SAM files or by their coordinates on the reference sequence before the gene sequencing data can be analyzed. The size of the SAM files to be sorted is positively correlated with the size of the input gene data files, whose storage capacity may be small, such as 2 GB, 8 GB or 16 GB, or large, such as 128 GB or 256 GB.
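The two sort orders mentioned here, by name or by coordinate on the reference sequence, can be illustrated with a small sketch over SAM text records. It relies on the tab-separated SAM layout in which the first field is the read name (QNAME), the third is the reference name (RNAME) and the fourth is the 1-based position (POS). The function names are hypothetical, and ordering reference names alphabetically is a simplifying assumption; practical tools typically follow the reference order given in the file header.

```python
def sam_sort_key(line, by="coordinate"):
    """Return the sort key of one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    qname, rname, pos = fields[0], fields[2], int(fields[3])
    if by == "name":
        return (qname,)
    # Assumption: reference names are compared as plain strings.
    return (rname, pos)


def sort_sam_records(lines, by="coordinate"):
    """Sort alignment lines, skipping header lines that start with '@'."""
    records = [ln for ln in lines if ln and not ln.startswith("@")]
    return sorted(records, key=lambda ln: sam_sort_key(ln, by))


unsorted_lines = [
    "@HD\tVN:1.6\tSO:unsorted",
    "r1\t0\tchr2\t50\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
by_coordinate = sort_sam_records(unsorted_lines, by="coordinate")
by_name = sort_sam_records(unsorted_lines, by="name")
```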
Step S200: read the storage capacity required by the data to be sorted to determine the size of the data to be sorted;
Step S300: if the size of the data to be sorted exceeds a preset threshold, sort the data to be sorted hierarchically through the first memory, the second memory and the third memory;
In this implementation, when the size of the data to be sorted does not exceed the preset threshold, a conventional processor (CPU) combined with internal memory (DRAM) sorts the gene sequencing data; if the size exceeds the preset threshold, the data is sorted hierarchically through the first, second and third memories.
When the gene sequencing data is small, the data to be sorted (SAM file) generated by aligning it to the reference sequence is also small. Internal memory is relatively small and fast, whereas external memory is relatively large and slow. Internal sorting is therefore used when the gene sequencing data is small, and external sorting, that is, hierarchical sorting through the first, second and third memories, is used when it is large. This adaptive approach exploits the strengths of each kind of memory while avoiding its weaknesses.
The circuit structure of this scheme is shown in FIG. 3. First, internal or external sorting is selected according to the size of the SAM file generated by alignment and the actual internal memory size of the computing system. For example, when the SAM file to be sorted is 2 GB and the internal memory of the computing system is 16 GB, internal sorting suffices: the SAM file is treated as an ordinary sorting task and is sorted by running a quicksort algorithm on a conventional processor and internal memory (CPU-DRAM) system. When the SAM files to be sorted total 200 GB and the internal memory is only 16 GB, an external sorting scheme is required. External sorting is an I/O-intensive task in which the external memory must participate: the SAM files are partitioned into blocks, the blocks are sorted with a bitonic sorting algorithm, and the complete sort is achieved layer by layer, from small blocks to large ones.
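The selection between internal and external sorting in this example can be sketched as a simple comparison. How the preset threshold is derived from the internal memory size is not specified in the text, so the `usable_fraction` parameter below, like the function name, is purely a hypothetical choice for illustration.

```python
GIB = 1024 ** 3


def choose_sort_mode(data_bytes, internal_memory_bytes, usable_fraction=0.5):
    """Return 'internal' or 'external' for a given data size.

    Assumption: the threshold is a fixed fraction of the internal memory.
    """
    threshold = int(internal_memory_bytes * usable_fraction)
    return "external" if data_bytes > threshold else "internal"


# The two cases from the example: a 16 GB internal memory with a
# 2 GB SAM file and with 200 GB of SAM files.
small_case = choose_sort_mode(2 * GIB, 16 * GIB)
large_case = choose_sort_mode(200 * GIB, 16 * GIB)
```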
Further, the storage capacity of the first memory is smaller than that of the second memory, and the storage capacity of the second memory is smaller than that of the third memory; specifically, the first memory is a static random access memory, the second memory is a dynamic random access memory, and the third memory is an external memory. Sorting hierarchically through memories of different storage capacities facilitates the hierarchical sorting of the data to be sorted and improves sorting performance for gene sequencing data.
Step S400: after the data to be sorted has been sorted, write the sorting result back to the external memory for storage.
Based on the above embodiment, and referring to FIG. 4 and FIG. 5, in order to fully exploit the performance of external sorting, this scheme sorts larger gene sequencing data externally using three storage levels: an SRAM level, a DRAM level and a Flash level. The first storage level is the on-chip cache (SRAM), the static random access memory of this scheme; the second storage level is the internal memory (DRAM), the dynamic random access memory of this scheme; and the third storage level is the flash memory (Flash), the external memory of this scheme. Proceeding layer by layer in this way yields the best possible external sorting performance.
Further, the data to be sorted is initially stored in Flash. After the data in Flash has been partitioned into blocks and compressed, the sorter at the SRAM level sorts the resulting data blocks and sends the sorted results to the mergers at the DRAM level. One merger can merge the data blocks sorted by several SRAM-level sorters, and the DRAM level contains multiple layers of mergers that iteratively merge the sorted blocks. The merged blocks are finally written to Flash, which completes the sorting of the unordered Flash data blocks. FIG. 5 shows the Flash-level sorting, in which the data blocks sorted at the DRAM level are merged step by step through a tree structure until the complete sorted gene sequencing data is obtained at the Flash level. It should be noted that lossless compression improves bandwidth utilization and also saves flash memory storage space.
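The flow through the three levels can be mimicked in software as below. This is a sketch under stated assumptions rather than the hardware design: Python's built-in `sorted` stands in for the SRAM-level sorter, `heapq.merge` stands in for the DRAM-level and Flash-level mergers, compression is omitted, and the parameters `sram_block` and `dram_fan_in` are hypothetical.

```python
import heapq


def hierarchical_sort(records, sram_block=4, dram_fan_in=4):
    """Sort records in three stages that mirror the SRAM, DRAM and Flash levels."""
    # SRAM level: sort small fixed-size blocks independently.
    runs = [
        sorted(records[i:i + sram_block])
        for i in range(0, len(records), sram_block)
    ]
    # DRAM level: each merger combines several sorted blocks into a longer run.
    dram_runs = [
        list(heapq.merge(*runs[i:i + dram_fan_in]))
        for i in range(0, len(runs), dram_fan_in)
    ]
    # Flash level: merge the DRAM-level runs into the final sorted output.
    return list(heapq.merge(*dram_runs))
```

Block size and fan-in trade off against each other: larger blocks mean fewer runs to merge, while a larger fan-in means fewer merge layers.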
In this implementation, the external sorting method for gene sequencing data based on different storage levels adapts between internal and external sorting: whether internal or external sorting is used is determined by the size of the data to be sorted. In this scheme, internal memory refers to the DRAM main memory of a conventional architecture, and external memory refers to the Flash storage in a hard disk.
Further, the method uses the in-storage-computing sorting engine for gene sequencing data to sort by the names or coordinates in the gene sequencing data, and comprises internal/external sorting selection, quicksort, external-sort data partitioning, lossless compression, bitonic sorting and external-sort merging. Because this scheme is preprocessing for second-generation sequencing (NGS) gene data, in which many fragmented gene fragments are stored in Fastq files, one Fastq file or two Fastq files must be aligned to the reference sequence before the internal/external sorting selection to generate the SAM data to be sorted. The gene sequencing data in the SAM data is unordered and must be sorted in order by the names of the gene fragments or by their positions on the reference sequence. The alignment process obtains the positions of the fragmented gene fragments on the existing reference sequence in order to obtain the complete sequenced gene sequence.
After the aligned data to be sorted is obtained, its storage capacity is read to determine its size. The internal/external sorting selection chooses internal or external sorting according to that size, using a preset threshold that is set dynamically: external sorting is used when the data exceeds the preset threshold, and internal sorting is used otherwise. Quicksort is used for internal sorting, which is completed by a conventional processor (CPU) and internal memory (DRAM). External-sort data partitioning means equally dividing the large data to be sorted so that each resulting first gene sequencing data can be sorted. Bitonic sorting is used for external sorting; because this algorithm is well suited to hardware implementation, the sort is completed by a hardware sorting network involving the internal memory, the external memory and so on. External-sort merging means merging the sorted first gene sequencing data into the final sorted gene sequencing data. Lossless compression means losslessly compressing the first gene sequencing data in the external memory during external sorting, which improves the bandwidth utilization of the external memory; the lossless compression algorithm also allows sorting to be performed directly on the compressed data.
In this embodiment, the large data to be sorted is equally divided as follows. The data to be sorted (SAM file/Fastq file) consists of read pairs (reads); if one data set to be sorted contains N read pairs and is divided into T parts, each first gene sequencing data contains N/T read pairs. Each first gene sequencing data is then sorted: a hardware sorting tree sorts within each first gene sequencing data, and a hardware merging tree merges across the first gene sequencing data.
It should be noted that Fastq is a text format that stores biological sequences (typically nucleic acid sequences) and their corresponding quality scores, all encoded in ASCII; it is close to a standard format for high-throughput gene sequencing. Internal sorting is performed in the internal memory, whereas external sorting combines the internal memory and the external memory, with interaction between the two. The dynamically set preset threshold may be set adaptively according to the size of the internal memory, so that the size of the data to be sorted determines whether internal or external sorting is used.
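To make the Fastq layout concrete, the sketch below parses the common four-line record form: a header line starting with `@`, the sequence, a separator line starting with `+`, and a quality string of the same length as the sequence. The function name is hypothetical, and decoding quality characters with an offset of 33 (Phred+33) is an assumption, since the text does not state which quality encoding is used.

```python
def parse_fastq(text):
    """Parse four-line Fastq records into dictionaries."""
    lines = [ln for ln in text.splitlines() if ln]
    if len(lines) % 4:
        raise ValueError("expected four lines per record")
    records = []
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed record at line %d" % (i + 1))
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        records.append({
            "name": header[1:],
            "sequence": sequence,
            # Assumption: Phred+33 encoding, so '!' decodes to score 0.
            "quality": [ord(ch) - 33 for ch in quality],
        })
    return records


example = parse_fastq("@read1\nACGT\n+\nII5!\n")
```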
In one embodiment, referring to FIG. 3, the device for external sorting of gene sequencing data based on different storage levels uses the in-storage-computing sorting engine for gene sequencing data to sort by the names or coordinates in the gene sequencing data. The sorting device may be an application-specific integrated circuit comprising an internal/external sorting selector, an external memory chip, a bitonic sorter, a data merger and a quicksort processor. The internal/external sorting selector, the external memory chip, the bitonic sorter and the data merger are connected in sequence, and the quicksort processor is connected to the internal/external sorting selector;
The input of the data partitioner is the input of the external memory chip, the output of the data partitioner is connected to the input of the lossless compressor, and the output of the lossless compressor is the output of the external memory chip;
The quicksort processor is connected to the internal memory, the bitonic sorter is connected to an on-chip cache and the internal memory, and the data merger is connected to the external memory.
In this embodiment, the external memory chip, that is, the flash memory chip in FIG. 3, contains the data partitioner and the lossless compressor. In the integrated circuit of this embodiment, the in-storage computing unit thus offloads the compression step to a programmable hardware logic unit, so that the gene sequencing data is compressed while it is being stored. This overlaps data input/output (I/O) with computation and reduces the time overhead of switching between processing steps for the gene sequencing data.
The method for external sorting of gene sequencing data based on different storage levels can run on a device for external sorting of gene sequencing data based on different storage levels. The device may comprise a memory, a processor, a communication bus, and a program for external sorting of gene sequencing data based on different storage levels stored in the memory:
The communication bus provides the connection and communication between the processor and the memory;
The processor executes the program for external sorting of gene sequencing data based on different storage levels so as to control the normal operation of the device.
In this embodiment, the application-specific integrated circuit is the carrier that implements the gene data sorting algorithm flow shown in FIG. 2. The internal/external sorting decision unit selects either the quicksort processor or the bitonic sorter according to a preset threshold and the size of the input SAM file to be sorted. The data blocker and the lossless compressor in the external memory chip divide the SAM file to be sorted into blocks and compress it within the external memory. The blocked and compressed gene sequencing data is sent to the bitonic sorter for preliminary sorting, and the data merger then merges the preliminary sorting results into larger sorted results. This is repeated iteratively, and finally the data merger sends the result to the external memory.
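The control flow of this paragraph can be sketched in software as follows. This is an illustration under stated assumptions, not the circuit itself: `coordinate_key` and `sort_records` are hypothetical names, the threshold is expressed as a record count rather than a file size, Python's built-in `sorted` stands in for both the quicksort processor and the bitonic sorter, and reference names are ordered lexicographically, whereas real SAM tools usually follow the order of the header.

```python
import heapq


def coordinate_key(sam_line):
    """Sort key (RNAME, POS) taken from columns 3 and 4 of a SAM record."""
    fields = sam_line.split("\t")
    return (fields[2], int(fields[3]))


def sort_records(lines, threshold, chunk_count, key=coordinate_key):
    """Choose internal or external sorting based on input size."""
    if len(lines) <= threshold:
        # Internal path: small inputs are sorted in one pass.
        return sorted(lines, key=key)

    # External path: split into chunks and sort each chunk into a run.
    size = -(-len(lines) // chunk_count)  # ceiling division
    runs = [sorted(lines[i:i + size], key=key)
            for i in range(0, len(lines), size)]

    # Merge runs pairwise, level by level, until one sorted run remains.
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + 2], key=key))
                for i in range(0, len(runs), 2)]
    return runs[0]
```

The pairwise merge loop mirrors the repeated merging of preliminary results into larger sorted results; in the device each level would be backed by a larger, slower memory.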
It should be noted that, referring to FIG. 6, the in-storage-computing sorting engine sorts gene sequencing data by the names or coordinates in the data. The engine is provided with a flash controller and a flash translation layer: the flash controller controls reads and writes of the external flash memory, and the flash translation layer handles the translation between logical and physical addresses and the scheduling of flash accesses. A configurator and a scheduler are connected to the flash translation layer. The configurator receives the size of the SAM file and writes the configuration information obtained by analysis into the integrated circuit. The scheduler receives the information on how the data blocker has equally divided the SAM file and, in cooperation with the flash translation layer, controls the operation of the in-storage-computing sorting engine. The integrated circuit for sorting gene sequencing data in storage performs the actual sorting task.
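To make the role of the flash translation layer concrete, the following toy model shows one common way logical addresses can be mapped to physical flash pages. It is only a sketch: the class and its fields are hypothetical, the out-of-place write policy is an assumption that the source does not specify, and access scheduling and garbage collection are not modelled.

```python
class ToyFlashTranslationLayer:
    """Page-level logical-to-physical address mapping (illustrative only)."""

    def __init__(self, num_pages):
        self.flash = [None] * num_pages       # contents of physical pages
        self.mapping = {}                     # logical page -> physical page
        self.free = list(range(num_pages))    # physical pages not yet written
        self.stale = set()                    # physical pages holding old data

    def write(self, logical, data):
        """Write data to a fresh physical page and update the mapping."""
        if not self.free:
            raise RuntimeError("no free pages; garbage collection is not modelled")
        physical = self.free.pop(0)
        previous = self.mapping.get(logical)
        if previous is not None:
            # Assumption: pages are not overwritten in place.
            self.stale.add(previous)
        self.flash[physical] = data
        self.mapping[logical] = physical

    def read(self, logical):
        """Translate the logical address and return the stored data."""
        return self.flash[self.mapping[logical]]
```

In this model a scheduler would issue `read` and `write` calls per block of the SAM file, while the mapping hides where each block physically resides.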
For the steps carried out when the gene sequencing data sorting program is executed on the processor, reference may be made to the embodiments of the method for external sorting of gene sequencing data based on different storage levels of the present invention; they are not repeated here.
The foregoing is merely an embodiment of the present application, and specific structures and features that are common knowledge in the art are not described in detail here. A person skilled in the art is aware of the common general knowledge and the prior art in the technical field to which the present application pertains and, with the teaching of the present application, is able to complete and practice it using their own skills. It should be noted that a person skilled in the art may make modifications and improvements without departing from the structure of the present application; these shall also fall within the scope of protection of the present application and do not affect the effect of implementing the present application or the utility of the patent. The scope of protection of the present application is defined by the claims, and the description of the specific embodiments in the specification may be used to interpret the claims.

Claims (10)

1. A method for external sorting of gene sequencing data based on different storage levels, characterized by comprising the following steps:
reading the storage capacity required by the data to be sorted to determine the size of the data to be sorted;
if the size of the data to be sorted exceeds a preset threshold, sorting the data to be sorted hierarchically through a first memory, a second memory, and a third memory, wherein the sorting of the gene sequencing data is completed by a hardware network comprising the first memory, the second memory, and the third memory;
after the data to be sorted has been sorted, writing the sorting result of the data to be sorted back to an external memory for storage, wherein the storage capacity of the first memory is smaller than that of the second memory, and the storage capacity of the second memory is smaller than that of the third memory;
the first memory is a static random access memory, the second memory is a dynamic random access memory, and the third memory is an external memory;
wherein the step of sorting the data to be sorted hierarchically through the first memory, the second memory, and the third memory if the size of the data to be sorted exceeds the preset threshold comprises:
equally dividing the data to be sorted into a plurality of first gene sequencing data through the first memory;
performing lossless compression on each first gene sequencing data of the data to be sorted;
performing bitonic sorting on each losslessly compressed first gene sequencing data;
wherein the step of performing bitonic sorting on each losslessly compressed first gene sequencing data comprises the following steps:
merging the sorted first gene sequencing data into a plurality of second gene sequencing data according to a tree structure through the second memory;
performing bitonic sorting on the merged second gene sequencing data, specifically, after merging, performing bitonic sorting on the plurality of second gene sequencing data based on the hardware network;
wherein the step of performing bitonic sorting on the merged second gene sequencing data further comprises the following steps:
merging the sorted second gene sequencing data into a plurality of third gene sequencing data according to a tree structure through the third memory;
performing bitonic sorting on the merged third gene sequencing data, specifically, after merging, performing bitonic sorting on the plurality of third gene sequencing data based on the hardware network.
2. The method for external sorting of gene sequencing data based on different storage levels according to claim 1, wherein the step of performing bitonic sorting on each merged third gene sequencing data comprises:
merging the sorted third gene sequencing data into sorted final gene sequencing data according to the tree structure and writing the final gene sequencing data back to the external memory.
3. The method for external sorting of gene sequencing data based on different storage levels according to claim 1, wherein equally dividing the data to be sorted into a plurality of first gene sequencing data is specifically as follows:
the data to be sorted contains N read pairs, the N read pairs are equally divided into T parts, and the number of read pairs in each first gene sequencing data is N/T.
4. The method for external sorting of gene sequencing data based on different storage levels according to claim 1, wherein the lossless compression encodes the repeated information of each first gene sequencing data according to a directed acyclic graph.
5. The method for external sorting of gene sequencing data based on different storage levels according to claim 1, wherein the external memory is a flash memory.
6. A device for external sorting of gene sequencing data based on different storage levels, characterized by comprising:
a memory, a processor, and an external sorting program based on different storage levels that is stored in the memory and executable on the processor, wherein the program, when executed by the processor, performs the steps of the method for external sorting of gene sequencing data based on different storage levels according to any one of claims 1 to 5.
7. The device for external sorting of gene sequencing data based on different storage levels according to claim 6, wherein the sorting engine that sorts the gene sequencing data by the names in the gene sequencing data is an application-specific integrated circuit comprising an internal/external sorting decision unit, an external memory chip, a bitonic sorter, a data merger, and a quicksort processor connected to the internal/external sorting decision unit, wherein the internal/external sorting decision unit, the external memory chip, the bitonic sorter, and the data merger are connected in sequence.
8. The device for external sorting of gene sequencing data based on different storage levels according to claim 7, wherein the external memory chip has a data blocker and a lossless compressor.
9. The device for external sorting of gene sequencing data based on different storage levels according to claim 6, wherein the in-storage-computing sorting engine for gene sequencing data is provided with a flash controller and a flash translation layer, the flash controller reads from and writes to the external flash memory, and the flash translation layer handles the translation between logical and physical addresses and the scheduling of flash accesses.
10. The device for external sorting of gene sequencing data based on different storage levels according to claim 9, wherein a configurator and a scheduler are connected to the flash translation layer; the configurator receives the size of the SAM file and writes the configuration information obtained by analysis into the integrated circuit; the scheduler receives the information on the equal division of the SAM file by the data blocker and, in cooperation with the flash translation layer, controls the operation of the in-storage-computing sorting engine for gene sequencing data; the flash memory chip is provided with hardware execution units for blocking and compression; and the integrated circuit for sorting the gene sequencing data performs the actual sorting task.
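For readers unfamiliar with the bitonic sorting recited in the claims, the following software sketch shows the compare-and-swap pattern of a bitonic sorting network. It is an illustration only, not the claimed hardware network: `bitonic_sort` is a hypothetical name, and padding the input to a power of two is an assumption made so that the network applies to inputs of any length.

```python
def bitonic_sort(values):
    """Sort a list in ascending order with a bitonic sorting network."""
    n = 1
    while n < len(values):
        n *= 2

    # Wrap real items as (0, value) and pads as (1, None); tuples compare
    # on the first field first, so pads behave like "plus infinity".
    items = [(0, v) for v in values] + [(1, None)] * (n - len(values))

    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j > 0:           # distance between compared elements
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # Swap when the pair violates the required direction.
                    if (items[i] > items[partner]) == ascending:
                        items[i], items[partner] = items[partner], items[i]
            j //= 2
        k *= 2

    return [v for pad, v in items if pad == 0]
```

Because the sequence of comparisons depends only on the input length and never on the data, every pass of the inner loop could run in parallel, which is the usual reason for choosing this network in hardware.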
CN202110633578.0A 2021-06-07 2021-06-07 Gene sequencing data external ordering method and device based on different storage levels Active CN113257356B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110633578.0A CN113257356B (en) 2021-06-07 2021-06-07 Gene sequencing data external ordering method and device based on different storage levels

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110633578.0A CN113257356B (en) 2021-06-07 2021-06-07 Gene sequencing data external ordering method and device based on different storage levels

Publications (2)

Publication Number Publication Date
CN113257356A CN113257356A (en) 2021-08-13
CN113257356B true CN113257356B (en) 2024-11-29

Family

ID=77186873

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110633578.0A Active CN113257356B (en) 2021-06-07 2021-06-07 Gene sequencing data external ordering method and device based on different storage levels

Country Status (1)

Country Link
CN (1) CN113257356B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108197433A (en) * 2017-12-29 2018-06-22 厦门极元科技有限公司 Datarams and hard disk the shunting storage method of rapid DNA sequencing data analysis platform
CN111261227A (en) * 2020-01-20 2020-06-09 苏州浪潮智能科技有限公司 Sequencing data storage method, apparatus, device, and computer-readable storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9274950B2 (en) * 2009-08-26 2016-03-01 Hewlett Packard Enterprise Development Lp Data restructuring in multi-level memory hierarchies
WO2020182175A1 (en) * 2019-03-14 2020-09-17 Huawei Technologies Co., Ltd. Method and system for merging alignment and sorting to optimize

Also Published As

Publication number Publication date
CN113257356A (en) 2021-08-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant