CN107480466A - Genomic data storage method and electronic equipment - Google Patents
Genomic data storage method and electronic equipment
- Publication number
- CN107480466A (CN201710546293.7A)
- Authority
- CN
- China
- Prior art keywords
- storage unit
- information
- statistical information
- gene sequence
- indicating
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/061—Improving I/O performance
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0638—Organizing or formatting or addressing of data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/0671—In-line storage system
- G06F3/0673—Single storage device
- G06F3/0674—Disk device
- G06F3/0676—Magnetic disk device
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B99/00—Subject matter not provided for in other groups of this subclass
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Human Computer Interaction (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Biotechnology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a genome data storage method, comprising: during genome alignment, obtaining gene sequence alignment information and creating gene sequence statistical information; storing the gene sequence alignment information on disk and, according to the alignment position of the gene sequence alignment information on the genome, storing a corresponding index in memory, the index being the storage location of the gene sequence alignment information on disk; classifying the genome statistical information to obtain first statistical information and second statistical information; storing the first statistical information in memory, the first statistical information being statistical information whose access frequency during variant detection is higher than a preset frequency; and storing the second statistical information on disk, the second statistical information being statistical information that cannot be stored in memory and/or statistical information whose access frequency during variant detection is lower than the preset frequency. The invention also discloses an electronic device that uses the genome data storage method.
Description
Technical field
The invention relates to the technical field of data processing, and in particular to a genome data storage method and an electronic device.
Background
The computational workflow for genomic variant detection can generally be divided into steps such as alignment, sorting, deduplication, realignment, variant detection, and filtering. The main steps each write a BAM file to the hard disk as output (SAM stands for sequence alignment map; a BAM file is the binary form of a SAM file, with B taken from "binary"). The next step then reads that file from the hard disk back into memory before continuing.
In the course of realizing the present invention, the inventors found the following problems in the prior art:
In human whole-genome data analysis, the raw data is typically about 100 GB, and each of the main intermediate analysis steps must read and write files of hundreds of GB. The whole computation consumes a large amount of I/O resources, and the programs run inefficiently.
The inventors found that the main causes of this problem are:
1. The intermediate files are too large to be placed directly in memory.
64 GB of memory is a typical machine configuration for bioinformatics analysis. For human whole-genome analysis data, the intermediate results are generally about 100 GB and cannot be held directly in memory. In addition, variant detection itself needs to load the reference sequence and index files into memory, which further reduces the space available for intermediate results.
2. The format of the intermediate files cannot be used directly for computation.
The common intermediate file format is SAM/BAM, a row-record format in which each row stores one record; even when loaded directly into memory, it cannot be used directly for computation. The data required for variant detection is mainly statistical information about the alignment at each site, including the distribution of the counts of each base type at each site, insertion/deletion (InDel) sequences and their frequencies, and the soft-clipping sequences in the alignments.
Summary of the invention
In view of this, the purpose of the present invention is to provide a genome data storage method and an electronic device that can solve the inefficiency caused by frequently reading and writing large binary files during genomic variant detection.
Based on the above purpose, the genome data storage method provided by the present invention includes:
during alignment, obtaining gene sequence alignment information and creating gene sequence statistical information;
storing the gene sequence alignment information on disk and, according to the alignment position of the gene sequence alignment information on the genome, storing a corresponding index in memory; the index is the storage location of the gene sequence alignment information on disk;
classifying the genome statistical information to obtain first statistical information and second statistical information;
storing the first statistical information in memory, the first statistical information being statistical information whose access frequency during variant detection is higher than a preset frequency;
storing the second statistical information on disk, the second statistical information being statistical information that cannot be stored in memory and/or statistical information whose access frequency during variant detection is lower than the preset frequency.
Optionally, the first statistical information includes base weighted quality value statistical information, positive/negative strand statistical information, indel statistical information, and soft-clipping statistical information.
Optionally, for a site where no indel or soft clipping occurs and at most 2 base types have appeared, the first statistical information of the site is stored in a first data structure;
the first data structure includes:
a first header for representing the base type;
a first quality value storage unit for representing the base weighted quality value;
a first positive strand count storage unit for representing the number of positive strands;
a first negative strand count storage unit for representing the number of negative strands.
Optionally, for a site where indels occur and 3-4 base types have appeared, the first statistical information of the site is stored in the first data structure and a second data structure;
the second data structure includes:
base weighted quality value statistical information and positive/negative strand statistical information for each of the 4 base types; the storage structure for each base type specifically includes: a second quality value storage unit for representing the base weighted quality value, a second positive strand count storage unit for representing the number of positive strands, and a second negative strand count storage unit for representing the number of negative strands;
first insertion statistical information, specifically including: a first insertion sequence storage unit for representing the insertion sequence, and a first low-quality insertion count storage unit for representing the number of low-quality insertions;
first deletion statistical information, specifically including: a first deletion length storage unit for representing the deletion length, a first high-quality deletion count storage unit for representing the number of high-quality deletions, and a first low-quality deletion count storage unit for representing the number of low-quality deletions;
the first data structure includes:
a second header filled with 11;
a first insertion information storage unit for indicating whether an insertion exists, specifically including: a first insertion information sub-storage unit for indicating whether an insertion exists, an insertion length sub-storage unit for representing the insertion length, and a low-quality insertion count sub-storage unit for representing the number of low-quality insertions;
a first deletion information storage unit for indicating whether a deletion exists, specifically including: a first deletion information sub-storage unit for indicating whether a deletion exists;
a pointer to the storage location of the corresponding second data structure.
Optionally, for a site with more than one indel and an insertion length greater than 12 bases, the first statistical information of the site is stored in the first data structure and a third data structure, and a memory pool is created in memory to store the first statistical information of such sites;
the third data structure includes:
base weighted quality value statistical information and positive/negative strand statistical information for each of the 4 base types; the storage structure for each base type specifically includes: a third quality value storage unit for representing the base weighted quality value, a third positive strand count storage unit for representing the number of positive strands, and a third negative strand count storage unit for representing the number of negative strands;
second insertion statistical information, specifically including: an insertion length storage unit for representing the insertion length, a second insertion sequence storage unit for representing the insertion sequence, a second low-quality insertion count storage unit for representing the number of low-quality insertions, and a high-quality insertion count storage unit for representing the number of high-quality insertions;
second deletion statistical information, specifically including: a second deletion length storage unit for representing the deletion length, a second high-quality deletion count storage unit for representing the number of high-quality deletions, and a second low-quality deletion count storage unit for representing the number of low-quality deletions;
the first data structure includes:
a third header filled with 11;
a second insertion information storage unit for indicating whether an insertion exists, specifically including: a second insertion information sub-storage unit for indicating whether an insertion exists, a first memory pool information sub-storage unit for indicating whether the memory pool is used, and a first occupied length sub-storage unit for representing the length occupied in the memory pool;
a second deletion information storage unit for indicating whether a deletion exists, specifically including: a second deletion information sub-storage unit for indicating whether a deletion exists, a second memory pool information sub-storage unit for indicating whether the memory pool is used, and a second occupied length sub-storage unit for representing the length occupied in the memory pool.
Optionally, the soft-clipping statistical information is recorded in a dynamic array, and each record includes:
a soft-clipping position storage unit for representing the position of the soft clipping on the genome;
a soft-clipping left count storage unit for representing the number of times soft clipping occurs to the left of the corresponding site;
a soft-clipping right count storage unit for representing the number of times soft clipping occurs to the right of the corresponding site.
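The soft-clipping record described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the class names `SoftClipRecord` and `SoftClipTable` and the `_slot` lookup dictionary are hypothetical helpers, and a Python list stands in for the dynamic array.

```python
from dataclasses import dataclass


@dataclass
class SoftClipRecord:
    position: int         # position of the soft clipping on the genome
    left_count: int = 0   # times soft clipping occurred left of the site
    right_count: int = 0  # times soft clipping occurred right of the site


class SoftClipTable:
    """Dynamic array of soft-clipping records (hypothetical helper class)."""

    def __init__(self):
        self.records = []  # the dynamic array; grows as new sites appear
        self._slot = {}    # assumption: position -> index, to find a record fast

    def add(self, position, side):
        """Count one soft clip at `position`; `side` is 'left' or 'right'."""
        if side not in ("left", "right"):
            raise ValueError("side must be 'left' or 'right'")
        idx = self._slot.get(position)
        if idx is None:
            idx = len(self.records)
            self.records.append(SoftClipRecord(position))
            self._slot[position] = idx
        record = self.records[idx]
        if side == "left":
            record.left_count += 1
        else:
            record.right_count += 1
        return record
```

Only sites where soft clipping actually occurs get a record, so the array stays small relative to the genome.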
Optionally, the index includes a paired-end alignment information index and a single-end alignment information index;
the paired-end alignment information index is stored in a paired-end alignment array structure, which includes:
a first ID storage unit for representing the ID of the gene sequence;
a first alignment position storage unit for representing the position on the genome to which the gene sequence is aligned;
an insert length storage unit for representing the insert length of the gene sequence;
a first alignment quality value storage unit for representing the alignment quality value of the gene sequence;
a first average quality value storage unit for representing the average quality value of the gene sequence;
the single-end alignment information index is stored in a single-end alignment array structure, which includes:
a second ID storage unit for representing the ID of the gene sequence;
a second alignment position storage unit for representing the position on the genome to which the gene sequence is aligned;
a second alignment quality value storage unit for representing the alignment quality value of the gene sequence;
a second average quality value storage unit for representing the average quality value of the gene sequence;
wherein the indexes of the aligned gene sequences are arranged in order according to the alignment position of each gene sequence on the genome.
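Keeping the index ordered by alignment position allows a region lookup by binary search. The sketch below illustrates that idea under stated assumptions: the tuple and class names, the field types, and the parallel `_positions` list are all hypothetical, and only the field lists and the position ordering come from the text.

```python
import bisect
from typing import NamedTuple


class PairedEndIndex(NamedTuple):
    read_id: int      # ID of the gene sequence
    position: int     # alignment position on the genome
    insert_size: int  # insert length
    mapq: int         # alignment quality value
    avg_quality: int  # average quality value


class SingleEndIndex(NamedTuple):
    read_id: int
    position: int
    mapq: int
    avg_quality: int


class AlignmentIndex:
    """In-memory index kept sorted by alignment position (hypothetical class)."""

    def __init__(self):
        self._positions = []  # sorted positions, parallel to _entries
        self._entries = []

    def add(self, entry):
        # Insert so that entries stay ordered by alignment position.
        i = bisect.bisect_right(self._positions, entry.position)
        self._positions.insert(i, entry.position)
        self._entries.insert(i, entry)

    def in_range(self, start, end):
        """Return entries whose position lies in [start, end]."""
        lo = bisect.bisect_left(self._positions, start)
        hi = bisect.bisect_right(self._positions, end)
        return self._entries[lo:hi]
```

A step such as realignment could use `in_range` to find the reads covering a region without scanning the whole index.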
Optionally, storing the gene sequence alignment information on disk specifically includes:
dividing all the gene sequence alignment information into 512 files and storing them on disk, each file storing the gene sequence alignment information of a certain genome interval; the storage data structure of each piece of gene sequence alignment information includes:
a sequence length storage unit for representing the sequence length of the gene sequence;
a sequence storage unit for representing the gene sequence itself;
a quality value storage unit for representing the quality values of the gene sequence;
a start position storage unit for representing the start position used by the alignment algorithm when the gene sequence was aligned;
a positive/negative strand storage unit for representing the strand information of the gene sequence in the alignment;
a region length storage unit for representing the length of the genome region selected when the gene sequence was aligned;
a left position storage unit for representing the left anchored position of the gene sequence in the alignment;
a right position storage unit for representing the right anchored position of the gene sequence in the alignment.
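One way to assign alignment records to the 512 files is to split the genome into equal-width intervals, as sketched below. The equal-width split, the 0-based coordinates, and the function names are assumptions for illustration; the text only states that each file holds the records of a certain genome interval.

```python
from collections import defaultdict

NUM_FILES = 512  # file count stated in the text


def file_index(position, genome_length, num_files=NUM_FILES):
    """Map a 0-based alignment position to the index of the file covering it."""
    if not 0 <= position < genome_length:
        raise ValueError("position outside the genome")
    interval = -(-genome_length // num_files)  # ceiling division: bases per file
    return position // interval


def bucket_by_file(records, genome_length):
    """Group (position, payload) records by their target file index."""
    buckets = defaultdict(list)
    for position, payload in records:
        buckets[file_index(position, genome_length)].append(payload)
    return dict(buckets)
```

With a genome of 3,000,000,000 sites, each file would cover 5,859,375 sites under this split.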
Optionally, the method further includes:
during deduplication, subtracting the interference caused by duplicate sequences from the genome statistical information;
and/or,
during realignment, extracting the gene sequences in the realignment region of the genome, realigning them, and then adjusting the genome statistical information of the gene sequences in the realignment region.
As can be seen from the above, the genome data storage method and electronic device provided by the present invention use compact data storage structures designed around the characteristics of the intermediate files produced throughout variant detection. Some of the main intermediate data is kept in memory and can be accessed there directly, so that the steps of the variant detection workflow no longer need large amounts of disk I/O, which significantly improves the efficiency of the whole variant detection and analysis workflow.
Brief description of the drawings
Figure 1 is a schematic flowchart of an embodiment of the genome data storage method provided by the present invention;
Figure 1a is a schematic diagram of the first data structure for a site where no indel or soft clipping occurs and at most 2 base types have appeared;
Figure 1b is a schematic diagram of the second data structure for a site where indels (InDel) occur and 3-4 base types have appeared;
Figure 1c is a schematic diagram of the first data structure for a site where indels (InDel) occur and 3-4 base types have appeared;
Figure 1d is a schematic diagram of the third data structure for a site with more than one indel and an insertion length greater than 12 bases;
Figure 1e is a schematic diagram of the first data structure for a site with more than one indel and an insertion length greater than 12 bases;
Figure 1f is a schematic diagram of the dynamic array used to record the soft-clipping statistical information;
Figure 1g is a schematic diagram of the index;
Figure 1h is a schematic diagram of the storage data structure of each piece of gene sequence alignment information;
Figure 2 is a schematic flowchart of an embodiment of the genome sequence alignment method provided by the present invention;
Figure 3 is a schematic structural diagram of an embodiment of the genome data storage apparatus provided by the present invention;
Figure 4 is a schematic structural diagram of an embodiment of the electronic device provided by the present invention.
Detailed description
To make the purpose, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
It should be noted that all uses of "first" and "second" in the embodiments of the present invention serve to distinguish two different entities or different parameters that share the same name. "First" and "second" are used only for convenience of expression and should not be understood as limiting the embodiments of the present invention; this is not repeated in the subsequent embodiments.
Based on the above purpose, a first aspect of the embodiments of the present invention proposes an embodiment of a genome data storage method that can solve the inefficiency caused by frequently reading and writing large binary files during genomic variant detection. Figure 1 is a schematic flowchart of an embodiment of the genome data storage method provided by the present invention.
The genome data storage method includes:
Step 101: during alignment, obtain gene sequence alignment information and create gene sequence statistical information; the gene sequence alignment information is the gene sequence alignment result produced during genome alignment, and the gene sequence statistical information can be extracted from that result;
Step 102: store the gene sequence alignment information on disk and, according to the alignment position of the gene sequence alignment information on the genome, store a corresponding index in memory; the index is the storage location of the gene sequence alignment information on disk;
Step 103: classify the gene sequence statistical information to obtain first statistical information and second statistical information;
Step 104: store the first statistical information in memory, the first statistical information being statistical information whose access frequency during variant detection is higher than a preset frequency;
Step 105: store the second statistical information on disk, the second statistical information being statistical information that cannot be stored in memory and/or statistical information whose access frequency during variant detection is lower than the preset frequency.
As can be seen from the above embodiment, the genome data storage method provided by the embodiment of the present invention uses compact data storage structures designed around the characteristics of the intermediate files produced throughout variant detection (including alignment, sorting, deduplication, realignment, variant detection, filtering, and other steps). Some of the main intermediate data is kept in memory and can be accessed there directly, so that the steps of the variant detection workflow no longer need large amounts of disk I/O, which significantly improves the efficiency of the whole variant detection and analysis workflow.
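The classification in steps 103 to 105 can be sketched as a simple tiering rule. The function below is an illustrative assumption, not the patent's procedure: the function name, the dictionary inputs, the explicit `memory_budget` parameter, and the choice to send statistics whose frequency equals the threshold to disk are all hypothetical.

```python
def classify_statistics(sizes, access_freq, threshold, memory_budget):
    """Split statistics into an in-memory tier and an on-disk tier.

    sizes:         dict mapping statistic name -> size in bytes
    access_freq:   dict mapping statistic name -> access frequency
    threshold:     the preset frequency
    memory_budget: bytes available for statistics (assumed parameter)
    """
    in_memory, on_disk = [], []
    used = 0
    # Visit the most frequently accessed statistics first.
    for name in sorted(sizes, key=lambda n: access_freq[n], reverse=True):
        size = sizes[name]
        if access_freq[name] > threshold and used + size <= memory_budget:
            in_memory.append(name)  # first statistical information
            used += size
        else:
            on_disk.append(name)    # second statistical information
    return in_memory, on_disk
```

A statistic lands on disk either because it is accessed too rarely or because it no longer fits in the remaining memory, which mirrors the "and/or" condition in step 105.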
In some optional embodiments, the first statistical information includes base weighted quality value statistical information, positive/negative strand statistical information, indel statistical information, and soft-clipping statistical information; specifically:
The base weighted quality value statistical information (Weighted Count):
Each base aligned to the reference gene sequence has a quality value between 0 and 40, and the weights assigned are as shown in the following table:
The weights of all identical bases aligned to the same position are added together to obtain the weighted quality value sum for that base type;
The positive/negative strand statistical information (Strand Count): the number of gene sequences aligned to the same position in the forward direction and in the reverse direction;
The indel statistical information and insertion sequence information (InDel Count): the inserted or deleted sequences found at a given position of the genome in the aligned gene sequences, and the cumulative number of times each occurs;
The soft-clipping statistical information (Soft Clip Count): the number of times soft clipping (soft clip) occurs at a given position of the genome in the aligned gene sequences.
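The Weighted Count and Strand Count at one site can be sketched as below. The quality-to-weight table is not reproduced in this text, so the weighting is passed in as a caller-supplied function; the function name, the `(base, quality, strand)` tuple layout, and the use of `collections.Counter` are assumptions made for illustration.

```python
from collections import Counter


def site_statistics(observations, weight):
    """Accumulate per-base statistics for one genome site.

    observations: iterable of (base, quality, strand) tuples, where strand
                  is '+' or '-' (assumed representation)
    weight:       function mapping a quality value (0-40) to a weight;
                  the real table is not given here, so the caller supplies it
    """
    weighted = Counter()  # base -> sum of weights (Weighted Count)
    plus = Counter()      # base -> reads on the positive strand
    minus = Counter()     # base -> reads on the negative strand
    for base, quality, strand in observations:
        weighted[base] += weight(quality)
        if strand == "+":
            plus[base] += 1
        else:
            minus[base] += 1
    return {"weighted": weighted, "plus": plus, "minus": minus}
```

For example, `site_statistics(obs, weight=lambda q: q // 10)` uses a purely hypothetical weighting of one unit per ten quality points.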
In some optional embodiments, consider the simplest case: for a site where no indel or soft clipping occurs and at most 2 base types have appeared, the first statistical information of the site is stored in the first data structure. Optionally, the first data structure is an 8-byte data structure, Counter (container), which stores the information for one site; the whole human genome contains about 3G sites, so about 24 GB of memory is required;
The first data structure, as shown in Figure 1a, stores the statistical information of the two bases (base1 information and base2 information) in the same 4-byte layout, which includes:
a first header for representing the base type; optionally, the first header (base) uses 2 bits to represent the base type, with bases A, C, G, and T represented by 00, 01, 10, and 11 respectively;
a first quality value storage unit for representing the base weighted quality value; optionally, the first quality value storage unit (weighted count) uses 14 bits to represent the weighted quality value sum, with a maximum value of 16383;
a first positive strand count storage unit for representing the number of positive strands; optionally, the first positive strand count storage unit (+ve strand count) uses 1 byte (8 bits) to represent the number of positive strands, with a maximum value of 255;
a first negative strand count storage unit for representing the number of negative strands; optionally, the first negative strand count storage unit (-ve strand count) uses 1 byte (8 bits) to represent the number of negative strands, with a maximum value of 255.
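The 4-byte per-base layout (2 + 14 + 8 + 8 = 32 bits) can be sketched with plain bit operations. The field widths and base codes come from the text; the bit order within the word, the function names, and raising an error on out-of-range values are assumptions made for this sketch.

```python
BASE_CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
CODE_BASE = {code: base for base, code in BASE_CODE.items()}


def pack_base_info(base, weighted, plus, minus):
    """Pack one base's statistics into a 32-bit integer.

    Assumed bit order (high to low): 2-bit base, 14-bit weighted count,
    8-bit positive strand count, 8-bit negative strand count.
    """
    if not 0 <= weighted <= 0x3FFF:  # 14 bits, maximum 16383
        raise ValueError("weighted count out of range")
    if not (0 <= plus <= 0xFF and 0 <= minus <= 0xFF):  # 8 bits, maximum 255
        raise ValueError("strand count out of range")
    return (BASE_CODE[base] << 30) | (weighted << 16) | (plus << 8) | minus


def unpack_base_info(word):
    """Inverse of pack_base_info."""
    return (CODE_BASE[(word >> 30) & 0b11], (word >> 16) & 0x3FFF,
            (word >> 8) & 0xFF, word & 0xFF)


def pack_counter(first, second):
    """Combine two 4-byte base records into one 8-byte Counter value."""
    return (pack_base_info(*first) << 32) | pack_base_info(*second)
```

At 8 bytes per site, 3G sites occupy 3 × 10^9 × 8 = 24 × 10^9 bytes, which matches the figure of about 24 GB given in the text.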
In some optional implementations, for a site at which an indel (InDel) occurs and 3 to 4 base types have been observed, the first statistical information of the site is stored using the first data structure together with a second data structure. Optionally, a 32-byte data structure, OverflowCounter, holds the information of one site: the statistics of bases A, C, G and T (base A information, base C information, base G information and base T information) take 6 bytes each, and the insertion information (Insertion Info.) and the deletion information (Deletion Info.) take 4 bytes each. Empirically, 30X whole-genome data contains about 200M such sites;
The second data structure, as shown in Figure 1b, comprises:
base-weighted quality value statistics and positive and negative strand statistics for each of the 4 base types; for each base type, the storage layout specifically comprises: a second quality value storage unit indicating the weighted base quality value (weighted count; optionally 2 bytes holding the sum of weighted quality values, maximum 65535), a second positive strand count storage unit indicating the number of positive strands (+ve strand count; optionally 2 bytes, maximum 65535), and a second negative strand count storage unit indicating the number of negative strands (-ve strand count; optionally 2 bytes, maximum 65535);
first insertion statistical information, specifically comprising: a first insertion sequence storage unit indicating the inserted sequence (Insertion Pattern; optionally 3 bytes, which can represent at most 12 bases), and a first low-quality insertion count storage unit indicating the number of low-quality insertions (LQ count; optionally 1 byte, maximum 255);
first deletion statistical information, specifically comprising: a first deletion length storage unit indicating the deletion length (Del. Len; optionally 1 byte, maximum 255), a first high-quality deletion count storage unit indicating the number of high-quality deletions (HQ count; optionally 1 byte, maximum 255), and a first low-quality deletion count storage unit indicating the number of low-quality deletions (LQ count; optionally 1 byte, maximum 255); optionally, 1 further byte is left unused;
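The 32-byte OverflowCounter layout listed above can be sketched with a fixed-size record. This is a minimal sketch under stated assumptions: the field order follows the order of the list, the little-endian byte order is arbitrary, the insertion pattern is left-aligned at 2 bits per base, and all names are hypothetical.

```python
import struct

BASES = "ACGT"
# 12 x uint16 (weighted count, +ve, -ve for each of A, C, G, T), a 3-byte
# insertion pattern, then 5 single bytes: insertion LQ count, deletion
# length, deletion HQ count, deletion LQ count, unused.
OVERFLOW = struct.Struct("<12H3s5B")


def pack_pattern(seq):
    """Pack up to 12 bases into 3 bytes at 2 bits per base."""
    if len(seq) > 12:
        raise ValueError("patterns longer than 12 bases do not fit in 3 bytes")
    value = 0
    for ch in seq:
        value = (value << 2) | BASES.index(ch)
    value <<= 2 * (12 - len(seq))  # left-align; the length is stored separately
    return value.to_bytes(3, "big")


def unpack_pattern(raw, length):
    """Recover the first `length` bases from a 3-byte packed pattern."""
    value = int.from_bytes(raw, "big")
    return "".join(BASES[(value >> (22 - 2 * i)) & 0b11] for i in range(length))


def pack_overflow(per_base, ins_pattern, ins_lq, del_len, del_hq, del_lq):
    """per_base maps each base to (weighted count, +ve count, -ve count)."""
    flat = [field for base in BASES for field in per_base[base]]
    return OVERFLOW.pack(*flat, pack_pattern(ins_pattern),
                         ins_lq, del_len, del_hq, del_lq, 0)


stats = {"A": (900, 12, 10), "C": (40, 1, 0), "G": (35, 0, 1), "T": (0, 0, 0)}
record = pack_overflow(stats, "ACGTTG", ins_lq=2, del_len=3, del_hq=4, del_lq=1)
print(len(record), unpack_pattern(pack_pattern("ACGTTG"), 6))
```

Four base entries of 6 bytes plus 4 bytes each for insertion and deletion information give the 32 bytes stated in the text; 3 bytes at 2 bits per base give the 12-base limit.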
When an OverflowCounter is used, the content stored in the corresponding first data structure changes; the first data structure, as shown in Figure 1c, then comprises:
a second header filled with 11; the two fields that originally stored the base types of base1 information and base2 information are both filled with "11" to indicate that an OverflowCounter is used;
a first insertion information storage unit (Insertion Information) indicating whether an insertion exists; optionally, 14 bits hold the insertion information, specifically: a first insertion information sub-storage unit (1 bit) indicating whether an insertion exists, an insertion length sub-storage unit (Ins. Len., 4 bits) indicating the insertion length, and a low-quality insertion count sub-storage unit (LQ count, 8 bits) indicating the number of low-quality insertions; optionally, the remaining 1 bit is set to 0;
a first deletion information storage unit (Deletion Information) indicating whether a deletion exists; optionally, 14 bits hold the deletion information: 1 bit indicates whether a deletion exists (first deletion information sub-storage unit), 1 bit is set to 0, and 12 bits are unused;
a pointer to the storage location of the corresponding second data structure (array index pointing to dynamic array of overflow counter); optionally, 4 bytes hold a pointer to the location of the OverflowCounter data.
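The overflow-mode Counter can be sketched as one 64-bit word: 2 + 14 + 2 + 14 bits followed by a 32-bit array index. The exact bit positions of Figure 1c are not reproduced here; the field order below simply follows the order of the list above, and the big-endian packing and function names are assumptions.

```python
import struct


def pack_overflow_mode(has_ins, ins_len, ins_lq, has_del, index):
    """Pack the overflow-mode Counter into 8 bytes."""
    # Insertion information, 14 bits: flag (1), length (4), LQ count (8), zero (1).
    ins = (int(has_ins) << 13) | ((ins_len & 0xF) << 9) | ((ins_lq & 0xFF) << 1)
    # Deletion information, 14 bits: flag (1), zero (1), unused (12).
    dele = int(has_del) << 13
    first = (0b11 << 14) | ins    # header "11" + insertion information
    second = (0b11 << 14) | dele  # header "11" + deletion information
    return struct.pack(">HHI", first, second, index)


def unpack_overflow_mode(buf):
    """Decode an overflow-mode Counter packed by pack_overflow_mode."""
    first, second, index = struct.unpack(">HHI", buf)
    if first >> 14 != 0b11 or second >> 14 != 0b11:
        raise ValueError("both headers must be 11 in overflow mode")
    ins, dele = first & 0x3FFF, second & 0x3FFF
    return {
        "has_insertion": bool(ins >> 13),
        "insertion_length": (ins >> 9) & 0xF,
        "lq_insertions": (ins >> 1) & 0xFF,
        "has_deletion": bool(dele >> 13),
        "overflow_index": index,
    }


word = pack_overflow_mode(True, 5, 7, False, 123456)
print(len(word), unpack_overflow_mode(word))
```

The 4-byte index is what lets the fixed-size per-site array stay at 8 bytes per site while the larger OverflowCounter records live in a separate dynamic array.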
In some optional implementations, for a site with more than one indel, or with an insertion longer than 12 bases, the first statistical information of the site is stored using the first data structure and a third data structure. For the first statistical information of such sites, a memory pool (a dedicated block of memory) is created in memory for storage; the statistics of bases A, C, G and T (base A information, base C information, base G information and base T information) take 6 bytes each, and pointers to the indel information are recorded in the OverflowCounter, as shown in Figure 1d;
The third data structure, as shown in Figure 1d, comprises:
base-weighted quality value statistics and positive and negative strand statistics for each of the 4 base types; for each base type, the storage layout specifically comprises: a third quality value storage unit indicating the weighted base quality value (weighted count; optionally 2 bytes holding the sum of weighted quality values, maximum 65535), a third positive strand count storage unit indicating the number of positive strands (+ve strand count; optionally 2 bytes, maximum 65535), and a third negative strand count storage unit indicating the number of negative strands (-ve strand count; optionally 2 bytes, maximum 65535);
second insertion statistical information (Insertion Ptr), optionally represented by 4 bytes, specifically comprising: an insertion length storage unit indicating the insertion length (Insertion length; optionally 1 byte), a second insertion sequence storage unit indicating the inserted sequence (Insertion pattern; variable length, 2 bits per base), a second low-quality insertion count storage unit indicating the number of low-quality insertions (LQ count; optionally 1 byte), and a high-quality insertion count storage unit indicating the number of high-quality insertions (HQ count; optionally 1 byte);
second deletion statistical information (Deletion Ptr), optionally represented by 4 bytes, specifically comprising: a second deletion length storage unit indicating the deletion length (Deletion length; optionally 1 byte), a second high-quality deletion count storage unit indicating the number of high-quality deletions (HQ count; optionally 1 byte), and a second low-quality deletion count storage unit indicating the number of low-quality deletions (LQ count; optionally 1 byte);
Meanwhile, the information recorded in the Counter changes as shown in Figure 1e; the first data structure then comprises:
a third header filled with 11; the two fields that originally stored the base types of base1 information and base2 information are both filled with "11" to indicate that an OverflowCounter is used;
a second insertion information storage unit (Insertion Information) indicating whether an insertion exists, optionally represented by 14 bits, specifically comprising: a second insertion information sub-storage unit (1 bit) indicating whether an insertion exists, a first memory pool information sub-storage unit (1 bit) indicating whether the memory pool is used, and a first occupied length sub-storage unit (12 bits) indicating the length occupied in the memory pool;
a second deletion information storage unit (Deletion Information) indicating whether a deletion exists, optionally represented by 14 bits, specifically comprising: a second deletion information sub-storage unit (1 bit) indicating whether a deletion exists, a second memory pool information sub-storage unit (1 bit) indicating whether the memory pool is used, and a second occupied length sub-storage unit (12 bits) indicating the length occupied in the memory pool.
Empirically, soft clipping occurs at only a small number of genomic positions, so there is no need to reserve separate storage for it at every site. Therefore, in some optional implementations, the soft-clipping statistics are recorded in a dynamic array, as shown in Figure 1f; each record has the format {position, left counts, right counts} and occupies 12 bytes, specifically comprising:
a soft-clipping position storage unit (position) indicating the position of the soft clipping on the genome, occupying 4 bytes;
a soft-clipping left count storage unit (left counts) indicating the number of times soft clipping occurred on the left side of the site, occupying 4 bytes;
a soft-clipping right count storage unit (right counts) indicating the number of times soft clipping occurred on the right side of the site, occupying 4 bytes.
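The sparse soft-clipping table can be sketched as follows. The class name, the dictionary used while counting, and the little-endian serialization are illustrative assumptions; only the 12-byte {position, left counts, right counts} record comes from the text.

```python
import struct

RECORD = struct.Struct("<III")  # position, left counts, right counts


class SoftClipTable:
    """Sparse soft-clipping counts: one record only for sites that need it."""

    def __init__(self):
        self._rows = {}  # position -> [left count, right count]

    def add(self, position, side):
        """Count one soft clip on the 'left' or 'right' of a site."""
        row = self._rows.setdefault(position, [0, 0])
        row[0 if side == "left" else 1] += 1

    def to_bytes(self):
        """Serialize as 12-byte records ordered by genomic position."""
        return b"".join(RECORD.pack(pos, left, right)
                        for pos, (left, right) in sorted(self._rows.items()))


table = SoftClipTable()
table.add(1000, "left")
table.add(1000, "left")
table.add(1000, "right")
table.add(5, "right")
data = table.to_bytes()
print(len(data), RECORD.unpack_from(data, 0), RECORD.unpack_from(data, 12))
```

Storing records only for clipped sites keeps this structure small compared with adding a field to every one of the roughly 3G per-site Counters.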
In some optional implementations, as shown in Figure 1g, the index includes a paired-end alignment (Pair End) information index and a single-end alignment (Single End) information index;
the paired-end alignment information index is stored in a paired-end alignment array structure (PairEndAlignmentInfo, occupying 12 bytes), comprising:
a first ID storage unit (ReadID) indicating the ID of the gene sequence, occupying 4 bytes;
a first alignment position storage unit (AlignedPosition) indicating the position on the genome to which the gene sequence is aligned, occupying 4 bytes;
an insert size storage unit (Insert Size) indicating the insert size of the gene sequence, occupying 2 bytes;
a first alignment quality value storage unit (MAPQ) indicating the alignment quality value of the gene sequence, occupying 1 byte;
a first average quality value storage unit (Average Base Quality) indicating the average quality value of the gene sequence, occupying 1 byte;
the single-end alignment information index is stored in a single-end alignment array structure (SingleEndAlignmentInfo, occupying 10 bytes), comprising:
a second ID storage unit (ReadID) indicating the ID of the gene sequence, occupying 4 bytes;
a second alignment position storage unit (AlignedPosition) indicating the position on the genome to which the gene sequence is aligned, occupying 4 bytes;
a second alignment quality value storage unit (MAPQ) indicating the alignment quality value of the gene sequence, occupying 1 byte;
a second average quality value storage unit (Average Base Quality) indicating the average quality value of the gene sequence, occupying 1 byte;
wherein the index entries of the aligned gene sequences are arranged in order of each sequence's alignment position on the genome.
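Because the index entries are kept in order of alignment position, the reads covering a genomic interval can be found by binary search. The sketch below is an illustration under assumptions: the struct formats mirror the byte sizes given above, while the little-endian order, the separate position list, and the function names are hypothetical.

```python
import bisect
import struct

# ReadID, AlignedPosition, Insert Size, MAPQ, Average Base Quality
PAIR_END = struct.Struct("<IIHBB")   # 12 bytes
# ReadID, AlignedPosition, MAPQ, Average Base Quality
SINGLE_END = struct.Struct("<IIBB")  # 10 bytes


def build_index(entries):
    """Sort paired-end entries by aligned position and pack them back to back."""
    ordered = sorted(entries, key=lambda entry: entry[1])
    blob = b"".join(PAIR_END.pack(*entry) for entry in ordered)
    positions = [entry[1] for entry in ordered]
    return blob, positions


def read_ids_in_interval(blob, positions, start, end):
    """Return the ReadIDs whose aligned position lies in [start, end)."""
    lo = bisect.bisect_left(positions, start)
    hi = bisect.bisect_left(positions, end)
    return [PAIR_END.unpack_from(blob, i * PAIR_END.size)[0] for i in range(lo, hi)]


entries = [(7, 900, 350, 60, 35), (3, 100, 410, 60, 33), (9, 500, 380, 20, 30)]
blob, positions = build_index(entries)
print(read_ids_in_interval(blob, positions, 100, 600))
```

In the scheme described in the text, the ReadID found this way would then be used to locate the full alignment record stored on disk.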
Because sequences are read in a highly random order during variant detection, in some optional implementations the gene sequence alignment information is stored on disk, specifically as follows:
the gene sequence alignment information is divided into 512 files (buckets) and stored on disk, each file holding the gene sequence alignment information for a certain genomic interval; as shown in Figure 1h, the storage data structure of each piece of gene sequence alignment information comprises:
a sequence length storage unit (Read Length) indicating the length of the gene sequence, occupying 2 bytes;
a sequence storage unit (Packed Read) holding the gene sequence itself, of variable length, using 2 bits per base;
a quality value storage unit (Base Qualities) holding the quality values of the gene sequence, of variable length;
a start position storage unit (DP Start Pos.) indicating the start position of the alignment algorithm when the gene sequence is aligned, occupying 4 bytes;
a strand storage unit (Strand) indicating whether the gene sequence aligned to the positive or negative strand, occupying 1 bit;
a region length storage unit (DP ref. length) indicating the length of the genomic region selected when the gene sequence is aligned, occupying 15 bits;
a left position storage unit (Left Anchor) indicating the position at which the gene sequence is anchored on the left during alignment, occupying 4 bytes;
a right position storage unit (Right Anchor) indicating the position at which the gene sequence is anchored on the right during alignment, occupying 4 bytes.
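The on-disk alignment record and the choice of bucket can be sketched as below. Several details are assumptions rather than statements from the text: one byte per base quality, equal-width genomic intervals per bucket, the little-endian order, the strand flag in the top bit of the 16-bit strand/length field, and all function names. Bases other than A, C, G and T are not handled.

```python
import struct

BASES = "ACGT"


def pack_read(seq):
    """Pack a read at 2 bits per base, 4 bases per byte, first base in the top bits."""
    out = bytearray((len(seq) + 3) // 4)
    for i, ch in enumerate(seq):
        out[i // 4] |= BASES.index(ch) << (6 - 2 * (i % 4))
    return bytes(out)


def pack_alignment(seq, quals, dp_start, reverse, dp_ref_len, left, right):
    """Serialize one alignment record in the field order listed above."""
    strand_and_len = (int(reverse) << 15) | (dp_ref_len & 0x7FFF)  # 1 + 15 bits
    return (struct.pack("<H", len(seq))      # Read Length, 2 bytes
            + pack_read(seq)                 # Packed Read, variable length
            + bytes(quals)                   # Base Qualities, assumed 1 byte each
            + struct.pack("<IHII", dp_start, strand_and_len, left, right))


def bucket_for(position, genome_length, buckets=512):
    """Map a genomic position to one of the bucket files (equal-width intervals)."""
    return min(position * buckets // genome_length, buckets - 1)


record = pack_alignment("ACGTTG", [30] * 6, dp_start=1000, reverse=True,
                        dp_ref_len=200, left=990, right=1010)
print(len(record), bucket_for(1_500_000_000, 3_000_000_000))
```

For a 6-base read the record takes 2 + 2 + 6 + 14 = 24 bytes under these assumptions.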
In addition to the steps of creating the statistics and the index during the alignment process described in the foregoing embodiments, in some optional implementations the method further comprises:
during de-duplication, removing from the genome statistical information the interference caused by duplicate sequences;
and/or,
during realignment, extracting the gene sequences of the realignment region of the genome, realigning them, and then adjusting the genome statistical information for the gene sequences of the realignment region;
during variant detection, using these statistics directly to calculate the probability of each genotype.
With the genomic data storage method provided by the embodiments of the present invention, the analysis process does not need to repeatedly output large binary files. With the overall algorithm optimization, the analysis of one whole genome can be completed within 4 hours, whereas a typical analysis pipeline takes tens of hours. This greatly reduces the I/O in the variant detection and analysis process and greatly improves the analysis efficiency of the program.
To aid understanding of the foregoing technical solution, an embodiment of a genome sequence alignment method is briefly introduced here to explain the genome alignment process of step 101 in the foregoing embodiments. Figure 2 is a schematic flowchart of an embodiment of the genome sequence alignment method provided by the present invention.
The genome sequence alignment method comprises the following steps:
Step 201: obtain the reference genome sequence and the genome sequence file to be aligned. The files may be obtained in any conventional way. The genome sequence file to be aligned may be in FASTQ format.
The genome sequence alignment method performs sequence alignment at 3 levels: each time, a portion of the sequences is read from the input genome sequence file to be aligned, and the level-1, level-2 and level-3 alignment algorithms are applied in turn; sequences not aligned at one level proceed to the alignment algorithm of the next level. Specifically, the method comprises the following steps.
Step 202: read a portion of the genome sequences from the genome sequence file to be aligned.
Step 203: align the portion of genome sequences against the reference genome sequence using the bidirectional BWT alignment algorithm (level 1; bidirectional BWT: Bi-directional Burrows-Wheeler Transform). The bidirectional BWT alignment algorithm handles the alignment of reads with at most 4 base errors. Reads are the sequences obtained in high-throughput sequencing; each read is a stretch of bases. In bioinformatics analysis, aligning each read to the reference genome reveals the differences between the sequenced sequences and the reference genome, and thereby the variants.
Optionally, the method of aligning genome sequences with the bidirectional BWT alignment algorithm may specifically comprise the following steps:
segmenting the reads according to the pigeonhole principle, with 0 to 2 base errors allowed per segment;
then searching and aligning with the bidirectional BWT alignment algorithm, including:
building the BWT of the reference genome sequence, the suffix array, and the BWT of the reversed reference genome sequence;
using backward search and forward search to look up the position on the reference genome sequence of each read, or of each segment of a read, from right to left and from left to right respectively.
Bidirectional BWT alignment is relatively inefficient when handling multiple base mismatches. When at most 4 base mismatches are allowed, the reads are segmented according to the pigeonhole principle and each segment is allowed 0 to 2 base mismatches, so that the bidirectional BWT only has to handle alignments with at most 2 base errors, which greatly increases efficiency.
The common alignment software BWA, after building the BWT of the reference sequence with the corresponding index and the SA (suffix array), uses backward search, that is, it searches for the genomic position of a read, or of each segment of a read, from right to left. The bidirectional BWT used in this patent builds, in addition to the conventional BWT index (denoted B), a BWT index of the reversed reference sequence (denoted B'). Using B, B' and the SA, the positions of reads or seeds on the genome are searched in both directions by backward and forward search, which significantly improves the efficiency of sequence alignment.
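The backward search and the pigeonhole bound discussed above can be sketched on a toy reference. This is not the synchronized-interval bidirectional BWT search of the patent: it is a plain exact-match FM-index backward search, with the forward direction merely illustrated by searching the reversed pattern in an index of the reversed reference (the B' of the text). The naive suffix sort and all function names are assumptions chosen for brevity.

```python
def build_fm_index(text):
    """Build a suffix array, C table and occurrence table for text + '$'."""
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix sort
    bwt = "".join(text[i - 1] for i in sa)                 # text[-1] is '$'
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    c_table, total = {}, 0
    for ch in sorted(counts):  # c_table[c]: number of characters smaller than c
        c_table[ch] = total
        total += counts[ch]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}  # occ[c][i]: c in bwt[:i]
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)
    return sa, c_table, occ


def backward_search(pattern, index):
    """Return sorted start positions of exact matches, scanning right to left."""
    sa, c_table, occ = index
    lo, hi = 0, len(sa)
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


def pigeonhole_bound(max_errors, segments):
    """With at most max_errors in a read, some segment has at most this many."""
    return max_errors // segments


reference = "ACGTACGTTACG"
index_b = build_fm_index(reference)            # conventional index, B
index_b_rev = build_fm_index(reference[::-1])  # index of the reversed text, B'
hits = backward_search("ACG", index_b)
rev_hits = backward_search("ACG"[::-1], index_b_rev)
forward_hits = sorted(len(reference) - p - 3 for p in rev_hits)
print(hits, forward_hits, pigeonhole_bound(4, 2))
```

Splitting a read with at most 4 mismatches into 2 segments guarantees one segment with at most 2 mismatches, which is the 0 to 2 range quoted in the text; more segments lower the bound further.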
Step 204: determine whether, in the portion of genome sequences, there is at least one pair of reads in which only one read has been aligned; if so, go to step 208; if not, go to step 205.
Step 205: using the single-end dynamic programming alignment algorithm (level 2), align again against the reference genome sequence each pair of reads in which only one read has been aligned. If, after the level-1 bidirectional BWT alignment algorithm, one read (A or A') of a pair (A, A') has been aligned to the reference genome sequence while the other (A' or A) has not, the level-2 alignment algorithm is used to continue the alignment.
Optionally, the method of aligning genome sequences with the single-end dynamic programming alignment algorithm may specifically comprise the following steps:
determining the specific position (pos) on the reference genome sequence to which one read (A or A') of a pair of reads (A, A') has been aligned; reads obtained by paired-end sequencing come in pairs, so if one read (A or A') of a pair is aligned at position pos on the reference genome sequence, the expected alignment position of the other read (A' or A) lies within a certain region around pos, namely the candidate region;
accordingly, selecting a specific range around the specific position (pos) according to a preset position range threshold; the preset position range threshold may be chosen as needed, for example with reference to the error tolerance. Specifically, in paired-end sequencing, when both reads of a pair are aligned to the genome, the distance between the two reads plus the sum of the two read lengths equals the length of the sequenced fragment, and the position of the candidate region is determined on this basis. For example, if the sequenced fragment is 500 bp and each read is 150 bp, then after alignment to the genome the expected distance between the two reads is 200 bp. Because fragment lengths vary, the expected distance is approximately 100 bp to 200 bp;
aligning, within the specific range, the other read of the pair (A' or A) that has not yet been aligned, using a dynamic programming algorithm.
Step 206: determine whether, in the portion of genome sequences, there is at least one pair of reads in which neither read has been aligned; if so, go to step 208; if not, go to step 207.
Step 207: using the paired-end dynamic programming alignment algorithm (level 3), align again against the reference genome sequence each pair of reads in which neither read has been aligned. If, after the level-1 bidirectional BWT alignment algorithm and the level-2 single-end dynamic programming alignment algorithm, neither A nor A' of a pair of reads (A, A') has been aligned to the reference genome sequence, the level-3 alignment algorithm is used to continue the alignment.
Optionally, the method of aligning genome sequences with the paired-end dynamic programming alignment algorithm may specifically comprise the following steps:
constructing seeds (substrings of a read) for each read of a pair (A and A') separately;
specifically, each read of a pair (A, A') is divided into many short segments to construct seeds; when a pair of reads is aligned to the genome, the distance between the two reads falls within a certain range, so the distance between the seeds of the two reads should also fall within a certain range;
aligning each seed to the reference genome sequence;
specifically, the regions where paired seeds (that is, two seeds whose distance meets the requirement) align are retrieved, and the candidate alignment regions for the pair of reads are determined; the reads are then aligned to the candidate regions using a dynamic programming algorithm;
if, in some region of the reference genome sequence, both reads (A and A') have seeds that align, that region is a candidate region for the final alignment position;
aligning each of the two reads (A and A') within the candidate region using a dynamic programming algorithm; when this alignment is complete, go to step 208.
Step 208: determine whether the entire genome sequence file to be aligned has been processed; if not, return to step 202; if so, go to step 209.
Step 209: output the alignment result. Optionally, the output of the genome sequence alignment is a BAM file; BAM is a storage format for alignment results that records the position of each sequence on the reference genome sequence and the details of its alignment.
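The three-level cascade of steps 202 to 209 can be sketched as a dispatch over read pairs. The control flow below follows the descriptions of steps 205 and 207 (pairs with exactly one aligned read go to level 2, pairs with none go to level 3). The three aligners are toy stand-ins, not the bidirectional BWT or dynamic programming algorithms of the patent, and every name here is hypothetical.

```python
REFERENCE = "ACGTACGTTACGGATCCATG"


def exact(read):
    """Level-1 stand-in: exact match only."""
    pos = REFERENCE.find(read)
    return pos if pos >= 0 else None


def near_mate(read, mate_pos, window=12, max_mismatch=1):
    """Level-2 stand-in: mismatch-tolerant scan in a window around the mate."""
    lo = max(mate_pos - window, 0)
    hi = min(mate_pos + window, len(REFERENCE) - len(read))
    for pos in range(lo, hi + 1):
        segment = REFERENCE[pos:pos + len(read)]
        if sum(x != y for x, y in zip(read, segment)) <= max_mismatch:
            return pos
    return None


def unaligned(read_a, read_b):
    """Level-3 stand-in: gives up on both reads."""
    return None, None


def align_batch(pairs, level1=exact, level2=near_mate, level3=unaligned):
    """Run each pair through the levels, escalating only when needed."""
    results = {}
    for pair_id, (a, b) in pairs.items():
        hit_a, hit_b = level1(a), level1(b)
        if hit_a is not None and hit_b is None:
            hit_b = level2(b, hit_a)            # only b is missing
        elif hit_a is None and hit_b is not None:
            hit_a = level2(a, hit_b)            # only a is missing
        elif hit_a is None and hit_b is None:
            hit_a, hit_b = level3(a, b)         # neither read aligned
        results[pair_id] = (hit_a, hit_b)
    return results


pairs = {
    "p1": ("ACGTAC", "GATCCA"),
    "p2": ("CGTTAC", "GGAACC"),
    "p3": ("TTTTTT", "AAAAAA"),
}
print(align_batch(pairs))
```

Most pairs are resolved by the cheap first level, so the more expensive levels run only on the remainder, which is the sense in which the algorithm's cost is matched to the difficulty of the data.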
As can be seen from the above embodiment, the genome sequence alignment method provided by the present invention uses multi-level alignment: after one level completes, the next level continues with the sequences that could not be aligned, so that the complexity of the algorithm matches the complexity of the data. Each level is also optimized individually, which optimizes the overall speed. With this method, using the same resources and maintaining alignment accuracy, the alignment time for one human whole genome can be shortened to about 4 hours, significantly shorter than with the prior art, which improves the efficiency of data analysis.
Based on the above purpose, a second aspect of the embodiments of the present invention provides an embodiment of a genomic data storage device, which can solve the low efficiency caused by frequently reading and writing large numbers of binary files during genomic variant detection. FIG. 3 is a schematic structural diagram of an embodiment of the genomic data storage device provided by the present invention.
The genomic data storage device includes:
a creation module 301, configured to obtain gene sequence alignment information and create gene sequence statistical information during genome alignment;
an alignment information storage module 302, configured to store the gene sequence alignment information on disk and to store, in memory, a corresponding index ordered by the alignment position of the gene sequence alignment information on the genome, the index being the storage location of the gene sequence alignment information on disk;
a statistical information classification module 303, configured to classify the genome statistical information into first statistical information and second statistical information;
a statistical information storage module 304, configured to store the first statistical information in memory, the first statistical information being statistical information whose access frequency during variant detection is higher than a preset frequency, and to store the second statistical information on disk, the second statistical information being statistical information that cannot be held in memory and/or whose access frequency during variant detection is lower than the preset frequency.
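The split performed by modules 303 and 304 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `StatGroup` record, the per-group access estimate, and the byte budget standing in for "cannot be held in memory" are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StatGroup:
    name: str
    access_frequency: float  # hypothetical estimate of accesses during variant detection
    size_bytes: int          # hypothetical memory footprint


def classify(groups: List[StatGroup],
             preset_frequency: float,
             memory_budget_bytes: int) -> Tuple[List[str], List[str]]:
    """Return (first, second): names kept in memory and names written to disk."""
    first: List[str] = []
    second: List[str] = []
    remaining = memory_budget_bytes
    # Consider the most frequently accessed groups first (an assumed policy).
    for g in sorted(groups, key=lambda g: g.access_frequency, reverse=True):
        if g.access_frequency > preset_frequency and g.size_bytes <= remaining:
            first.append(g.name)
            remaining -= g.size_bytes
        else:
            # Either accessed too rarely or too large to keep in memory.
            second.append(g.name)
    return first, second


groups = [
    StatGroup("base_quality", 10.0, 400),
    StatGroup("strand", 8.0, 300),
    StatGroup("rare", 0.1, 50),
    StatGroup("big", 9.0, 1000),
]
in_memory, on_disk = classify(groups, preset_frequency=1.0, memory_budget_bytes=800)
```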
In some optional embodiments, the first statistical information includes base-weighted quality value statistics, positive/negative strand statistics, insertion/deletion (indel) statistics, and soft-clipping statistics.
In some optional embodiments, for a site with no indels and no soft clipping, and at which at most two base types have been observed, the first statistical information of the site is stored in a first data structure.
The first data structure includes:
a first header indicating the base type;
a first quality value storage unit indicating the base-weighted quality value;
a first positive strand count storage unit indicating the number of positive strands;
a first negative strand count storage unit indicating the number of negative strands.
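One way to picture the first data structure is as a fixed-width packed word per observed base. The sketch below is an assumption throughout: this passage gives no bit widths, so the 2/14/8/8 split, the field order, and the function names are hypothetical, and header values reserved for other layouts are not modelled.

```python
BASES = "ACGT"

# Assumed field widths in bits; this passage does not state the real ones.
HEADER_BITS, QUALITY_BITS, PLUS_BITS, MINUS_BITS = 2, 14, 8, 8


def _check(value: int, bits: int, name: str) -> None:
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{name}={value} does not fit in {bits} bits")


def pack_site(base: str, weighted_quality: int,
              plus_count: int, minus_count: int) -> int:
    """Pack header, weighted quality, and strand counts into one 32-bit word."""
    _check(weighted_quality, QUALITY_BITS, "weighted_quality")
    _check(plus_count, PLUS_BITS, "plus_count")
    _check(minus_count, MINUS_BITS, "minus_count")
    word = BASES.index(base)                       # header: base type
    word = (word << QUALITY_BITS) | weighted_quality
    word = (word << PLUS_BITS) | plus_count
    word = (word << MINUS_BITS) | minus_count
    return word


def unpack_site(word: int):
    """Inverse of pack_site: returns (base, weighted_quality, plus, minus)."""
    minus_count = word & ((1 << MINUS_BITS) - 1)
    word >>= MINUS_BITS
    plus_count = word & ((1 << PLUS_BITS) - 1)
    word >>= PLUS_BITS
    weighted_quality = word & ((1 << QUALITY_BITS) - 1)
    word >>= QUALITY_BITS
    return BASES[word], weighted_quality, plus_count, minus_count
```

A compact fixed-width record of this kind keeps the common case (simple sites) small, which is presumably why the richer layouts are reserved for sites with indels.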
In some optional embodiments, for a site at which an indel occurs and 3 to 4 base types have been observed, the first statistical information of the site is stored using the first data structure together with a second data structure.
The second data structure includes:
base-weighted quality value statistics and positive/negative strand statistics for each of the 4 base types, where the storage structure for each base type specifically includes: a second quality value storage unit indicating the base-weighted quality value, a second positive strand count storage unit indicating the number of positive strands, and a second negative strand count storage unit indicating the number of negative strands;
first insertion statistics, specifically including: a first insertion sequence storage unit indicating the inserted sequence, and a first low-quality insertion count storage unit indicating the number of low-quality insertions;
first deletion statistics, specifically including: a first deletion length storage unit indicating the deletion length, a first high-quality deletion count storage unit indicating the number of high-quality deletions, and a first low-quality deletion count storage unit indicating the number of low-quality deletions.
The first data structure includes:
a second header filled with 11;
a first insertion information storage unit indicating whether an insertion exists, specifically including: a first insertion information sub-storage unit indicating whether an insertion exists, an insertion length sub-storage unit indicating the insertion length, and a low-quality insertion count sub-storage unit indicating the number of low-quality insertions;
a first deletion information storage unit indicating whether a deletion exists;
a pointer to the storage location of the corresponding second data structure.
In some optional embodiments, for a site with more than one indel and an insertion length greater than 12 bases, the first statistical information of the site is stored using the first data structure together with a third data structure, and a memory pool is created in memory to store the first statistical information of such sites.
The third data structure includes:
base-weighted quality value statistics and positive/negative strand statistics for each of the 4 base types, where the storage structure for each base type specifically includes: a third quality value storage unit indicating the base-weighted quality value, a third positive strand count storage unit indicating the number of positive strands, and a third negative strand count storage unit indicating the number of negative strands;
second insertion statistics, specifically including: an insertion length storage unit indicating the insertion length, a second insertion sequence storage unit indicating the inserted sequence, a second low-quality insertion count storage unit indicating the number of low-quality insertions, and a high-quality insertion count storage unit indicating the number of high-quality insertions;
second deletion statistics, specifically including: a second deletion length storage unit indicating the deletion length, a second high-quality deletion count storage unit indicating the number of high-quality deletions, and a second low-quality deletion count storage unit indicating the number of low-quality deletions.
The first data structure includes:
a third header filled with 11;
a second insertion information storage unit indicating whether an insertion exists, specifically including: a second insertion information sub-storage unit indicating whether an insertion exists, a first memory pool information sub-storage unit indicating whether the memory pool is used, and a first occupied length sub-storage unit indicating the length occupied in the memory pool;
a second deletion information storage unit indicating whether a deletion exists, specifically including: a second deletion information sub-storage unit indicating whether a deletion exists, a second memory pool information sub-storage unit indicating whether the memory pool is used, and a second occupied length sub-storage unit indicating the length occupied in the memory pool.
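The three cases above amount to choosing a storage layout per site. The sketch below shows one possible selection rule. The stated conditions do not cover every combination (for example, 3 base types without an indel), so this sketch assumes that a site escalates to the richer layout as soon as any single trigger is met; that policy, the enum, and the parameter names are hypothetical.

```python
from enum import Enum


class Layout(Enum):
    FIRST_ONLY = 1               # simple site: first data structure alone
    FIRST_PLUS_SECOND = 2        # first data structure pointing to a second one
    FIRST_PLUS_THIRD_POOLED = 3  # first plus third data structure, in a memory pool


MAX_INLINE_INSERTION = 12  # bases; the threshold named in the text


def choose_layout(base_types_seen: int, indel_count: int,
                  max_insertion_len: int, has_soft_clip: bool) -> Layout:
    """Pick a layout for one site (assumed 'escalate on any trigger' policy)."""
    if indel_count > 1 or max_insertion_len > MAX_INLINE_INSERTION:
        return Layout.FIRST_PLUS_THIRD_POOLED
    if indel_count > 0 or base_types_seen > 2 or has_soft_clip:
        return Layout.FIRST_PLUS_SECOND
    return Layout.FIRST_ONLY
```

Under this assumed policy, a site with two observed bases and no indels uses the compact layout, while a single 13-base insertion is enough to move a site into the pooled layout.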
In some optional embodiments, the soft-clipping statistics are recorded in a dynamic array, each record of which includes:
a soft-clip position storage unit indicating the position of the soft clip on the genome;
a soft-clip left count storage unit indicating the number of times soft clipping occurred to the left of the site;
a soft-clip right count storage unit indicating the number of times soft clipping occurred to the right of the site.
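The soft-clipping record can be sketched with a growable Python list of three-field records. Keeping the array sorted by position and using binary search for updates is an assumption of this sketch; the text only says that a dynamic array is used.

```python
import bisect
from dataclasses import dataclass
from typing import List


@dataclass
class SoftClipRecord:
    position: int         # position of the soft clip on the genome
    left_count: int = 0   # soft clips observed to the left of the site
    right_count: int = 0  # soft clips observed to the right of the site


class SoftClipTable:
    """Dynamic array of soft-clip records, kept sorted by position (assumed)."""

    def __init__(self) -> None:
        self._positions: List[int] = []
        self._records: List[SoftClipRecord] = []

    def add(self, position: int, side: str) -> None:
        if side not in ("left", "right"):
            raise ValueError("side must be 'left' or 'right'")
        i = bisect.bisect_left(self._positions, position)
        if i == len(self._positions) or self._positions[i] != position:
            # First soft clip seen at this site: grow the array.
            self._positions.insert(i, position)
            self._records.insert(i, SoftClipRecord(position))
        record = self._records[i]
        if side == "left":
            record.left_count += 1
        else:
            record.right_count += 1

    def records(self) -> List[SoftClipRecord]:
        return list(self._records)


table = SoftClipTable()
for pos, side in [(100, "left"), (50, "right"), (100, "left"), (100, "right")]:
    table.add(pos, side)
```

Because soft clips are sparse compared with ordinary sites, storing only the sites where they occur avoids reserving space for them in every per-site record.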
In some optional embodiments, the index includes a paired-end alignment information index and a single-end alignment information index.
The paired-end alignment information index is stored in a paired-end alignment array structure, which includes:
a first ID storage unit indicating the ID of the gene sequence;
a first alignment position storage unit indicating the position on the genome to which the gene sequence is aligned;
an insert size storage unit indicating the insert size of the gene sequence;
a first alignment quality value storage unit indicating the alignment quality value of the gene sequence;
a first average quality value storage unit indicating the average quality value of the gene sequence.
The single-end alignment information index is stored in a single-end alignment array structure, which includes:
a second ID storage unit indicating the ID of the gene sequence;
a second alignment position storage unit indicating the position on the genome to which the gene sequence is aligned;
a second alignment quality value storage unit indicating the alignment quality value of the gene sequence;
a second average quality value storage unit indicating the average quality value of the gene sequence.
The index entries of the aligned gene sequences are arranged in order of each sequence's alignment position on the genome.
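Because the index entries are ordered by alignment position, a region query can be answered with binary search. The sketch below models the two array structures as tuples and shows such a lookup. The field names and types are assumptions, and how an entry maps to an on-disk offset (for example, via the sequence ID) is not specified in this passage and is not modelled.

```python
import bisect
from typing import List, NamedTuple, Sequence


class PairedEndEntry(NamedTuple):
    read_id: int
    position: int       # alignment position on the genome
    insert_size: int
    mapq: int           # alignment quality value
    mean_quality: int   # average base quality of the sequence


class SingleEndEntry(NamedTuple):
    read_id: int
    position: int
    mapq: int
    mean_quality: int


def build_index(entries: Sequence) -> List:
    """Arrange entries by alignment position, as the text requires."""
    return sorted(entries, key=lambda e: e.position)


def reads_in_region(index: List, start: int, end: int) -> List:
    """Entries with start <= position < end, found by binary search."""
    # Recomputed on each call for brevity; a real index would cache this.
    positions = [e.position for e in index]
    lo = bisect.bisect_left(positions, start)
    hi = bisect.bisect_left(positions, end)
    return index[lo:hi]


paired = build_index([
    PairedEndEntry(1, 300, 350, 60, 35),
    PairedEndEntry(2, 100, 340, 60, 33),
    PairedEndEntry(3, 200, 360, 20, 30),
])
hits = reads_in_region(paired, 100, 250)
```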
In some optional embodiments, storing the gene sequence alignment information on disk specifically includes:
dividing the gene sequence alignment information into 512 files and storing them on disk, each file holding the gene sequence alignment information for a particular genome interval. The storage data structure of each piece of gene sequence alignment information includes:
a sequence length storage unit indicating the length of the gene sequence;
a sequence storage unit holding the gene sequence itself;
a quality value storage unit indicating the quality values of the gene sequence;
a start position storage unit indicating the position at which the alignment algorithm started when aligning the gene sequence;
a strand storage unit indicating the positive/negative strand information of the gene sequence in the alignment;
a region length storage unit indicating the length of the genome region selected when aligning the gene sequence;
a left position storage unit indicating the position at which the gene sequence is anchored on the left in the alignment;
a right position storage unit indicating the position at which the gene sequence is anchored on the right in the alignment.
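The on-disk layout can be sketched in two parts: mapping a genome position to one of the 512 files, and serializing one record in the field order listed above. Equal-width intervals, the integer widths, the little-endian encoding, and all function names are assumptions of this sketch rather than details taken from the patent.

```python
import struct
from typing import List, Tuple

NUM_FILES = 512  # number of files named in the text

_LENGTH = struct.Struct("<H")   # sequence length (assumed 16-bit)
_TAIL = struct.Struct("<IbIII")  # start, strand, region length, left anchor, right anchor


def file_index(position: int, genome_length: int) -> int:
    """Map a genome position to a file, assuming equal-width intervals."""
    interval = -(-genome_length // NUM_FILES)  # ceiling division
    return min(position // interval, NUM_FILES - 1)


def encode_record(seq: str, quals: List[int], start: int, strand: int,
                  region_len: int, left_anchor: int, right_anchor: int) -> bytes:
    """Serialize one alignment record in the field order given in the text."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    return (_LENGTH.pack(len(seq))
            + seq.encode("ascii")
            + bytes(quals)
            + _TAIL.pack(start, strand, region_len, left_anchor, right_anchor))


def decode_record(buf: bytes) -> Tuple:
    """Inverse of encode_record."""
    (seq_len,) = _LENGTH.unpack_from(buf, 0)
    off = _LENGTH.size
    seq = buf[off:off + seq_len].decode("ascii")
    off += seq_len
    quals = list(buf[off:off + seq_len])
    off += seq_len
    start, strand, region_len, left, right = _TAIL.unpack_from(buf, off)
    return seq, quals, start, strand, region_len, left, right


record = encode_record("ACGT", [30, 31, 32, 33], 1000, -1, 150, 990, 1160)
```

Storing the sequence length first lets a reader skip over the variable-length sequence and quality fields without scanning for a delimiter.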
Based on the above purpose, a third aspect of the embodiments of the present invention provides an embodiment of an apparatus for executing the genomic data storage method. FIG. 4 is a schematic diagram of the hardware structure of an embodiment of the apparatus for executing the genomic data storage method provided by the present invention.
As shown in FIG. 4, the apparatus includes:
one or more processors 401 and a memory 402; FIG. 4 takes one processor 401 as an example.
The apparatus for executing the genomic data storage method may further include an input device 403 and an output device 404.
The processor 401, the memory 402, the input device 403, and the output device 404 may be connected by a bus or in other ways; FIG. 4 takes a bus connection as an example.
The memory 402, as a non-volatile computer-readable storage medium, can store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the genomic data storage method in the embodiments of the present application (for example, the creation module 301, the alignment information storage module 302, the statistical information classification module 303, and the statistical information storage module 304 shown in FIG. 3). By running the non-volatile software programs, instructions, and modules stored in the memory 402, the processor 401 executes the various functional applications and data processing of the server, that is, implements the genomic data storage method of the above method embodiments.
The memory 402 may include a program storage area and a data storage area. The program storage area may store an operating system and the application programs required by at least one function; the data storage area may store data created through use of the genomic data storage device, and the like. In addition, the memory 402 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 402 optionally includes memories located remotely from the processor 401, and these remote memories may be connected to the genomic data storage device through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 403 can receive input numeric or character information and generate key signal inputs related to user settings and function control of the genomic data storage device. The output device 404 may include a display device such as a display screen.
The one or more modules are stored in the memory 402 and, when executed by the one or more processors 401, perform the genomic data storage method of any of the above method embodiments. The technical effect of the embodiment of the apparatus for executing the genomic data storage method is the same as or similar to that of any of the foregoing method embodiments.
An embodiment of the present application further provides a non-transitory computer storage medium storing computer-executable instructions that can execute the genomic data storage method of any of the above method embodiments. The technical effect of the embodiment of the non-transitory computer storage medium is the same as or similar to that of any of the foregoing method embodiments.
Finally, it should be noted that a person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be carried out by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like. The technical effect of the embodiment of the computer program is the same as or similar to that of any of the foregoing method embodiments.
A person of ordinary skill in the art should understand that the discussion of any of the above embodiments is exemplary only and is not intended to imply that the scope of the present disclosure (including the claims) is limited to these examples. Within the idea of the present invention, technical features of the above embodiments or of different embodiments may also be combined, steps may be carried out in any order, and many other variations of the different aspects of the invention described above exist, which are not provided in detail for the sake of brevity.
In addition, to simplify illustration and discussion, and so as not to obscure the invention, well-known power/ground connections to integrated circuit (IC) chips and other components may or may not be shown in the provided figures. Furthermore, devices may be shown in block diagram form to avoid obscuring the invention, which also takes into account the fact that details of the implementation of these block diagram devices depend heavily on the platform on which the invention is to be implemented (that is, these details should be well within the understanding of a person skilled in the art). Where specific details (for example, circuits) are set forth to describe exemplary embodiments of the invention, it will be apparent to a person skilled in the art that the invention can be practiced without these specific details or with variations of them. Accordingly, these descriptions should be regarded as illustrative rather than restrictive.
Although the invention has been described in conjunction with specific embodiments, many alternatives, modifications, and variations of these embodiments will be apparent to a person of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (for example, dynamic RAM (DRAM)) may use the discussed embodiments.
The embodiments of the present invention are intended to embrace all such alternatives, modifications, and variations that fall within the broad scope of the appended claims. Therefore, any omission, modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included within its scope of protection.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710546293.7A CN107480466B (en) | 2017-07-06 | 2017-07-06 | Genome data storage method and electronic device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710546293.7A CN107480466B (en) | 2017-07-06 | 2017-07-06 | Genome data storage method and electronic device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107480466A | 2017-12-15 |
CN107480466B | 2020-08-11 |
Family
ID=60595629
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710546293.7A Active CN107480466B (en) | 2017-07-06 | 2017-07-06 | Genome data storage method and electronic device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107480466B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103201744A (en) * | 2010-10-13 | 2013-07-10 | 考利达基因组股份有限公司 | Methods for estimating genome-wide copy number variations |
CN104361264A (en) * | 2014-12-11 | 2015-02-18 | 天津工业大学 | Quick counting method for quantity of nucleic acid fragments of genome |
CN106202991A (en) * | 2016-06-30 | 2016-12-07 | 厦门艾德生物医药科技股份有限公司 | The detection method of abrupt information in a kind of genome multiplex amplification order-checking product |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108197433A (en) * | 2017-12-29 | 2018-06-22 | 厦门极元科技有限公司 | Datarams and hard disk the shunting storage method of rapid DNA sequencing data analysis platform |
CN108920902A (en) * | 2018-06-29 | 2018-11-30 | 郑州云海信息技术有限公司 | A kind of gene order processing method and its relevant device |
CN110879782A (en) * | 2019-11-08 | 2020-03-13 | 浪潮电子信息产业股份有限公司 | Test method, device, equipment and medium for gene comparison software |
CN110879782B (en) * | 2019-11-08 | 2022-06-17 | 浪潮电子信息产业股份有限公司 | Test method, device, equipment and medium for gene comparison software |
CN111081314A (en) * | 2019-12-13 | 2020-04-28 | 北京市商汤科技开发有限公司 | Method and apparatus for identifying genetic variation, electronic device, and storage medium |
WO2022082878A1 (en) * | 2020-10-22 | 2022-04-28 | 深圳华大基因股份有限公司 | Shared memory-based gene analysis method and apparatus, and computer device |
CN115602246A (en) * | 2022-10-31 | 2023-01-13 | 哈尔滨工业大学(Cn) | A Sequence Alignment Method Based on Population Genome |
Also Published As
Publication number | Publication date |
---|---|
CN107480466B (en) | 2020-08-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107480466B (en) | Genome data storage method and electronic device | |
US20200334295A1 (en) | Merge tree garbage metrics | |
CN102891852B (en) | Message analysis-based protocol format automatic inferring method | |
TW201837720A (en) | Stream selection for multi-stream storage devices | |
TW201841123A (en) | Merge tree modifications for maintenance operations | |
WO2015184992A1 (en) | Method for recognizing duplicate image, and image search and deduplication method and device thereof | |
CN110457758B (en) | Method, device and system for predicting unstable phase of rock mass and storage medium | |
CN106713273B (en) | A Protocol Keyword Recognition Method Based on Dictionary Tree Pruning Search | |
CN106682393A (en) | Genomic sequence alignment method and genomic sequence alignment device | |
JP2009140161A5 (en) | ||
US10394763B2 (en) | Method and device for generating pileup file from compressed genomic data | |
CN104937599A (en) | Data analysis device and method thereof | |
JP2022533492A (en) | Flexible Seed Extension for Hashtable Genome Mapping | |
CN106021985B (en) | A kind of genomic data compression method | |
CN118782147A (en) | Probe design method, electronic device, and computer-readable storage medium | |
CN103186621B (en) | A kind of catalogue generates method and apparatus | |
CN108846033A (en) | The discovery and classifier training method and apparatus of specific area vocabulary | |
US20140012879A1 (en) | Database management system, apparatus, and method | |
US9715514B2 (en) | K-ary tree to binary tree conversion through complete height balanced technique | |
CN113535962B (en) | Data warehouse-in method, device, electronic device, program product and storage medium | |
CN102968515A (en) | Method and equipment for calculating verification coverage of integrated computer circuit model | |
CN113285720B (en) | Gene data lossless compression method, integrated circuit and lossless compression device | |
CN103294932A (en) | Reference sequence processing system and method for analyzing genome sequence | |
CN104376261B (en) | A kind of method of the automatic detection malicious process under evidence obtaining scene | |
CN109504751B (en) | A method for identification of deletion variants and clone counting of complex clonal structures of tumors |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CP03 | Change of name, title or address | ||
Address after: 1002-1, 10th floor, No.56, Beisihuan West Road, Haidian District, Beijing 100080 Patentee after: Ronglian Technology Group Co.,Ltd. Address before: 100080, Beijing, Haidian District, No. 56 West Fourth Ring Road, glorious Times Building, 10, 1002-1 Patentee before: UNITED ELECTRONICS Co.,Ltd. |
|
PE01 | Entry into force of the registration of the contract for pledge of patent right | ||
Denomination of invention: Genomic data storage methods and electronic devices Granted publication date: 20200811 Pledgee: Jining High-tech Holding Group Co.,Ltd. Pledgor: Ronglian Technology Group Co.,Ltd. Registration number: Y2025990000041 |