
CN108763871B - Hole filling method and device based on third-generation sequencing sequence - Google Patents



Publication number
CN108763871B
Authority
CN
China
Prior art keywords
sequence
gap
sub
sequencing
sequences
Prior art date
Legal status
Active
Application number
CN201810581026.8A
Other languages
Chinese (zh)
Other versions
CN108763871A (en
Inventor
周义其
李季
张锦波
蒋智
李瑞强
Current Assignee
Beijing Novogene Technology Co ltd
Original Assignee
Beijing Novogene Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Novogene Technology Co ltd
Priority to CN201810581026.8A
Publication of CN108763871A
Application granted
Publication of CN108763871B

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a hole filling method and device based on third-generation sequencing sequences. The method comprises: aligning at least one sub-sequencing sequence (read) contained in the third-generation sequencing sequences to obtain a first alignment result; extracting, from the first alignment result, the sub-sequencing sequences located within a certain range of at least one gap sequence to obtain at least one first extraction result; performing fine alignment on the first extraction result to obtain a second alignment result; extracting, from the second alignment result, the sub-sequencing sequences located within a certain range of at least one gap sequence to obtain at least one second extraction result; assembling the at least one second extraction result to obtain a consensus sequence; and replacing the original sequence in the genome draft with the consensus sequence, wherein the gap sequence is an unknown sequence. The invention solves the technical problem of high resource consumption caused by the slow alignment of sequencing sequences during genome hole filling with sequencing data.

Description

Hole filling method and device based on third-generation sequencing sequence
Technical Field
The invention relates to the field of bioinformatics, and in particular to a hole filling method and device based on third-generation sequencing sequences.
Background
De novo assembly has become one of the major applications of second-generation sequencing (NGS) technology. A number of software packages are currently available for assembling genome drafts, such as Velvet, ABySS and SOAPdenovo. Nevertheless, the assembled scaffolds contain many gap sequences, which are generally represented by runs of "N". In general, the sequences of low-coverage and repeat regions are too complex for the software to determine during assembly, so those positions are simply filled with "N" bases. Although the distance information of paired-end reads can link contigs into longer scaffolds, this does not solve the inherent problems of low-coverage and repeat regions, because no new sequence information is added to the genome draft. The gap sequences in the scaffolds negatively affect subsequent genomic analyses: for example, genes cannot be predicted within gap sequences, and structural variation cannot be detected there.
Since 2011, the third-generation single-molecule real-time sequencing technologies of PacBio and Oxford Nanopore have gradually entered the market. Third-generation sequencers are characterized by ultra-long read lengths, with the longest reads reaching even 1 Mb. Because of these ultra-long reads, using third-generation sequencing sequences to fill the holes in a genome draft is currently a good way to improve genome assembly metrics and accuracy.
At present, genome hole filling based on third-generation sequencing data is mainly performed with PBJelly, but the BLASR alignment software it uses is very slow. For example, for the human genome, the alignment step alone requires thousands of CPU hours and can generally only be completed on a high-performance cluster, which is time-consuming and expensive and makes it difficult to meet the requirements of practical applications.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiment of the invention provides a hole filling method and device based on a third-generation sequencing sequence, which at least solve the technical problem of high resource consumption caused by low speed of the process of comparing sequencing sequences in the process of carrying out genome hole filling on sequencing data.
According to an aspect of the embodiments of the present invention, there is provided a hole filling method based on third-generation sequencing sequences, including: aligning at least one sub-sequencing sequence (read) contained in the third-generation sequencing sequences to obtain a first alignment result; extracting, from the first alignment result, the sub-sequencing sequences located within a certain range of at least one gap sequence to obtain at least one first extraction result; performing fine alignment on the first extraction result to obtain a second alignment result; extracting, from the second alignment result, the sub-sequencing sequences located within a certain range of at least one gap sequence to obtain at least one second extraction result; assembling the at least one second extraction result to obtain a consensus sequence; and replacing the original sequence in the genome draft with the consensus sequence, wherein the gap sequence is an unknown sequence.
Optionally, before extracting the sub-sequencing sequences located within a certain range of at least one gap sequence from the first alignment result to obtain at least one first extraction result, the method further comprises: determining the gap sequence from the third-generation sequencing sequences, wherein the bases in the gap sequence are represented by N.
Optionally, the number of bases in the gap sequence is at least a predetermined number, wherein, when the number of bases in the gap sequence is less than the predetermined number, the sequences at the left and right ends of the gap sequence are changed into gap sequence until the number of bases in the gap sequence reaches the predetermined number.
Optionally, extracting the sub-sequencing sequences located within a certain range of at least one gap sequence from the first alignment result to obtain at least one first extraction result, comprising: aligning at least one sub-sequencing sequence contained in the third-generation sequencing sequence back to the scaffolds sequence of the genome draft by using first alignment software to obtain the alignment position of each sub-sequencing sequence (reads) in the scaffolds; using the second alignment software, the position of the gap sequence of the genome draft is compared with the position of each of the sub-sequencing sequences (reads) included in the third generation sequencing sequences, and sub-sequencing sequences (reads) whose aligned positions are within a certain range of the gap sequence are extracted.
Optionally, when comparing the position of the gap sequence of the genome draft with the position of each sub-sequencing sequence (read) contained in the third-generation sequencing sequences, the judgment condition is: the alignment position of the read overlaps by at least 1 bp with the region extending 2000 bp upstream and downstream of the gap sequence.
Optionally, the fine comparison of the first extraction result to obtain a second comparison result includes: a third alignment software is used to fine align the sub-sequencing sequences with positions within a certain range of the gap sequence.
Optionally, extracting the sub-sequencing sequences located within a certain range of at least one gap sequence from the second alignment result to obtain at least one second extraction result comprises: according to the alignment result of the third alignment software, extracting the sub-sequencing sequences (reads) that meet the following conditions: condition one: the read aligns to within a predetermined number of bases of the start of the corresponding gap sequence; and condition two: there are at least a first predetermined number of bases that are not aligned into the gap sequence.
Optionally, assembling at least one second extraction result to obtain a consensus sequence, comprising: and performing local assembly based on the extracted sub-sequencing sequences reads and a second predetermined number of genome sketch sequences on both sides of the gap sequence.
Optionally, replacing the native sequence in the genome sketch with a consensus sequence, comprising: the corresponding gap sequence was replaced with the consensus sequence.
According to another aspect of the embodiments of the present invention, there is also provided a hole filling device based on a third generation sequencing sequence, including: the first comparison module is used for comparing at least one sub-sequencing sequence contained in the third-generation sequencing sequence to obtain a first comparison result; the first extraction module is used for extracting a sub-sequencing sequence positioned in a certain range of at least one gap sequence from the first comparison result to obtain at least one first extraction result; the second comparison module is used for carrying out fine comparison on the first extraction result to obtain a second comparison result; the second extraction module is used for extracting the sub sequencing sequences positioned in a certain range of at least one gap sequence from the second comparison result to obtain at least one second extraction result; the assembling module is used for assembling at least one second extraction result to obtain a consistency sequence; a replacing module for replacing the original sequence in the genome sketch with the consensus sequence; wherein, the gap sequence is unknown sequence.
Optionally, the apparatus further comprises: a determining module for determining the gap sequence from the third-generation sequencing sequences, wherein the bases in the gap sequence are represented by N.
Optionally, the number of bases in the gap sequence is at least a predetermined number, wherein, when the number of bases in the gap sequence is less than the predetermined number, the sequences at the left and right ends of the gap sequence are changed into gap sequence until the number of bases in the gap sequence at least reaches the predetermined number.
Optionally, the first extraction module includes: the first sub-alignment module is used for aligning at least one sub-sequencing sequence contained in the third-generation sequencing sequence back to the scaffolds sequence of the genome draft by using first alignment software to obtain the alignment position of each sub-sequencing sequence (reads) in the scaffolds; a comparison module for comparing, using second alignment software, a position of a gap sequence of the genome sketch with a position of each of the subsequences (reads) contained in the third generation sequencing sequences; a first sub-extraction module for extracting sub-sequencing sequences (reads) whose alignment positions are within a certain range of the gap sequence.
Optionally, when comparing the position of the gap sequence of the genome draft with the position of each sub-sequencing sequence (read) contained in the third-generation sequencing sequences, the judgment condition is: the alignment position of the read overlaps by at least 1 bp with the region extending 2000 bp upstream and downstream of the gap sequence.
Optionally, the second alignment module includes: and the second sub-alignment module is used for performing fine alignment on the sub-sequencing sequences with the positions within a certain range of the gap sequence by using third alignment software.
Optionally, the second extraction module comprises: an extraction module for extracting, according to the alignment result of the third alignment software, the sub-sequencing sequences (reads) that meet the following conditions: condition one: the read aligns to within a predetermined number of bases of the start of the corresponding gap sequence; and condition two: there are at least a first predetermined number of bases that are not aligned into the gap sequence.
Optionally, the assembly module comprises: a sub-assembly module for performing local assembly based on the extracted sub-sequencing sequences (reads) and a second predetermined number of bases of genome draft sequence on each side of the gap sequence.
Optionally, the replacement module comprises: a sub-replacement module for replacing the corresponding gap sequence with the consensus sequence.
In the embodiment of the invention, Minimap2 is used to rapidly align the third-generation sequencing reads, the reads within a certain range of a gap are extracted according to the Minimap2 alignment result, and the extracted reads are then finely aligned again with BLASR. According to the BLASR alignment result, the third-generation sequencing reads within a certain range of each gap are extracted and locally assembled per gap, and the corresponding gap sequence is replaced with the assembled consensus sequence. This increases the speed of aligning the sequencing sequences and saves computing resources, thereby solving the technical problem of high resource consumption caused by slow alignment of sequencing sequences during genome hole filling with sequencing data.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention and do not constitute a limitation of the invention. In the drawings:
FIG. 1 is a schematic diagram of a third generation sequencing sequence based hole filling method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an alternative third generation sequencing sequence based hole filling method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a hole filling apparatus based on third generation sequencing sequence according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Currently, genome drafts assembled from second- or third-generation sequencing data generally consist of thousands to tens of thousands of scaffold sequences. Gap sequences ranging from several bp to several tens of kbp are distributed within these scaffolds; their sequences are unknown and are generally represented by runs of "N".
Genome hole filling based on third-generation sequencing data is mainly performed with PBJelly, but the BLASR alignment software it uses is very slow. For example, for the human genome, the alignment process requires thousands of CPU hours and can generally only be completed on a high-performance cluster, which is time-consuming and expensive and makes it difficult to meet the requirements of practical applications.
The related art uses only a single alignment tool. Currently, alignment software for long sequences is either fast but relatively imprecise, such as minimap2, or precise but slow, such as BLASR.
To solve these problems, the two different types of alignment software, minimap2 and BLASR, are combined. Before the fine alignment is carried out, minimap2 is used to filter out the reads irrelevant to hole filling, which reduces the number of reads to be aligned and therefore the alignment time and the consumption of computing resources. The remaining reads are then aligned with BLASR, and the subsequent analysis is completed using the BLASR alignment result, so that the accuracy of the result is not reduced. The method is explained in detail below.
In accordance with an embodiment of the present invention, there is provided an embodiment of a method for hole filling based on third generation sequencing sequences, it is noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system such as a set of computer executable instructions, and that while a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different than here.
FIG. 1 is a schematic diagram of a third generation sequencing sequence-based hole filling method according to an embodiment of the present invention, as shown in FIG. 1, the method includes the following steps:
step S102, comparing at least one sub-sequencing sequence (reads) contained in the third-generation sequencing sequence to obtain a first comparison result;
for example, the third-generation sequencing reads can be rapidly aligned using Minimap2.
Step S104, extracting sub sequencing sequences (reads) positioned in a certain range of at least one gap sequence from the first comparison result to obtain at least one first extraction result, wherein the gap sequence is an unknown sequence.
Optionally, before extracting the sub-sequencing sequences located within a certain range of at least one gap sequence, the gap sequence is determined from the third-generation sequencing sequences, and the bases of the gap sequence are denoted by N. The number of bases in the gap sequence is greater than or equal to a predetermined number; when the number of bases in a gap sequence is less than the predetermined number, the sequences at the left and right ends of the gap sequence are changed into gap sequence until its number of bases at least reaches the predetermined number. The predetermined number is at least 10, i.e., the gap sequence is at least 10 bp long.
The embodiment of the application provides an optional application scenario: gap sequences are located in the genome draft, and a run of 25 or more consecutive N is defined as a gap sequence. If a run contains fewer than 25 N, the sequences at its left and right ends are changed into N until the number of N reaches 25. That is, when the predetermined number is 25, every gap sequence contains at least 25 bases.
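As a non-limiting illustration, the gap determination described above may be sketched in Python as follows. The function name `normalize_gaps`, the even split of the widening between the left and right sides, and the handling of runs close to a scaffold end are assumptions made for this sketch; the text only states that the flanking sequences are changed into N until 25 N are reached.

```python
import re

MIN_GAP = 25  # example threshold taken from the scenario above


def normalize_gaps(seq, min_gap=MIN_GAP):
    """Return (new_seq, gaps); every run of N in new_seq has >= min_gap bases
    (unless the whole scaffold is shorter than min_gap).

    gaps is a list of 0-based, half-open (start, end) intervals.
    Assumption: short runs are widened roughly evenly on both sides.
    The returned sequence is upper-cased.
    """
    seq = seq.upper()
    chars = list(seq)
    n = len(chars)
    for m in re.finditer(r"N+", seq):
        start, end = m.span()
        missing = min_gap - (end - start)
        if missing <= 0:
            continue  # run is already long enough
        left = missing // 2
        # Widen to the left, then fix the window length, then shift the
        # window back if it was clipped at the right end of the scaffold.
        new_start = max(0, start - left)
        new_end = min(n, new_start + min_gap)
        new_start = max(0, new_end - min_gap)
        for i in range(new_start, new_end):
            chars[i] = "N"
    new_seq = "".join(chars)
    gaps = [m.span() for m in re.finditer(r"N+", new_seq)]
    return new_seq, gaps
```

Widened runs that come to touch each other are reported as a single gap, because the intervals are re-derived from the modified sequence.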
In an alternative embodiment, first alignment software is used to align at least one sub-sequencing sequence contained in the third-generation sequencing sequences back to the scaffold sequences of the genome draft, so as to obtain the alignment position of each sub-sequencing sequence (read) in the scaffolds. The first alignment software may be Minimap2, MECAT, etc. For example, Minimap2 is used to roughly align the third-generation sequencing reads back to the scaffold sequences of the genome draft, yielding all primary alignment positions of each read in the scaffolds.
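By way of example only, the primary alignment positions may be collected from the rough alignment output as sketched below. The sketch assumes that the first alignment software writes minimap2's tab-separated PAF format, in which the first twelve columns are fixed and primary alignments carry the optional tag `tp:A:P`; the `Hit` type and the function name `primary_hits` are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    read: str      # query (read) name, PAF column 1
    scaffold: str  # target (scaffold) name, PAF column 6
    start: int     # target start, PAF column 8 (0-based)
    end: int       # target end, PAF column 9


def primary_hits(paf_lines: Iterable[str]) -> List[Hit]:
    """Keep only the records tagged as primary alignments (tp:A:P)."""
    hits = []
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        if "tp:A:P" not in fields[12:]:
            continue  # secondary or untagged alignment
        hits.append(Hit(fields[0], fields[5], int(fields[7]), int(fields[8])))
    return hits
```

A read may contribute several primary records when it is split across regions, which is consistent with the text's "all primary alignment positions of each read".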
Using the second alignment software, the position of the gap sequence of the genome draft is compared with the position of each sub-sequencing sequence (read) contained in the third-generation sequencing sequences, and the reads whose alignment positions are within a certain range of the gap sequence are extracted. The judgment condition for this comparison is: the alignment position of the read overlaps by at least 1 bp with the region extending 2000 bp upstream and downstream of the gap sequence.
For example, the second alignment software may be bedtools: the gap sequence positions of the genome draft are compared with the alignment positions of the third-generation sequencing reads using bedtools, and the third-generation sequencing reads whose alignment positions are within a certain range of a gap sequence are found, under the same judgment condition that the alignment position of the read overlaps by at least 1 bp with the 2000 bp upstream and downstream of the gap sequence.
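The judgment condition above amounts to an interval overlap test, which may be illustrated as follows. The sketch is written in plain Python rather than with bedtools; the function names and the use of 0-based, half-open coordinates are assumptions of the sketch.

```python
def near_gap(aln_start, aln_end, gap_start, gap_end, flank=2000, min_overlap=1):
    """True if the alignment overlaps the gap, widened by `flank` bp on each
    side, by at least `min_overlap` bp. Intervals are 0-based, half-open."""
    win_start = max(0, gap_start - flank)
    win_end = gap_end + flank
    overlap = min(aln_end, win_end) - max(aln_start, win_start)
    return overlap >= min_overlap


def reads_near_gaps(hits, gaps, flank=2000):
    """hits: iterable of (read, scaffold, start, end) alignment positions.
    gaps: dict mapping scaffold name -> list of (gap_start, gap_end).
    Returns the set of read names with at least one alignment near a gap."""
    keep = set()
    for read, scaffold, start, end in hits:
        for gap_start, gap_end in gaps.get(scaffold, []):
            if near_gap(start, end, gap_start, gap_end, flank):
                keep.add(read)
                break
    return keep
```

Reads absent from the returned set belong to the category that is excluded from the fine alignment.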
The original reads are thereby divided into two categories. The first category consists of third-generation sequencing reads aligned within a certain range of a gap sequence; these reads are used for the subsequent analysis. The second category consists of third-generation sequencing reads aligned to non-gap regions; these generally account for more than 70% of the reads and need no subsequent analysis, so the overall running speed of the software can be greatly improved.
Performing fine alignment on the first extraction result to obtain a second alignment result comprises: using third alignment software to finely align the sub-sequencing sequences whose positions are within a certain range of the gap sequence. The third alignment software may be BLASR, which is used to finely align the first category of reads, i.e., the reads within a certain range of a gap.
Step S106, carrying out fine comparison on the first extraction result to obtain a second comparison result;
optionally, step S106 includes: according to the alignment result of the third alignment software, extracting the sub-sequencing sequences (reads) that meet the following conditions: condition one: the read aligns to within a predetermined number of bases of the start of the corresponding gap sequence; and condition two: there are at least a first predetermined number of bases that are not aligned into the gap sequence.
In an alternative embodiment, if the predetermined number is 25, the reads meeting the following conditions are extracted according to the BLASR alignment result: a) the read must align to within 25 bp of the start of the gap sequence; b) there must be at least 25 bases that are not aligned into the hole.
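One possible reading of conditions a) and b) is sketched below. It is an interpretation, not a statement of the method: it assumes that "aligned to within 25 bp of the start of the gap" refers to the edge of the alignment nearest the gap, and that the "bases not aligned into the hole" are the unaligned bases of the read that overhang the alignment on the gap side. The function name and parameters are hypothetical.

```python
def supports_gap(aln_start, aln_end, left_overhang, right_overhang,
                 gap_start, gap_end, max_dist=25, min_overhang=25):
    """Decide whether a finely aligned read is kept for a given gap.

    aln_start, aln_end: alignment interval on the scaffold (0-based, half-open).
    left_overhang / right_overhang: number of unaligned read bases beyond the
    alignment on the scaffold-left / scaffold-right side.
    Assumption: a read is kept if it approaches the gap from either flank.
    """
    # Read sits on the left flank and runs into the gap on its right side.
    from_left = (abs(gap_start - aln_end) <= max_dist
                 and right_overhang >= min_overhang)
    # Read sits on the right flank and runs into the gap on its left side.
    from_right = (abs(aln_start - gap_end) <= max_dist
                  and left_overhang >= min_overhang)
    return from_left or from_right
```

Under this reading, a read that stops well short of the gap, or that reaches the gap edge without any overhanging bases, carries no new sequence for the gap and is discarded.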
Step S108, extracting a sub-sequencing sequence positioned in a certain range of at least one gap sequence from the second comparison result to obtain at least one second extraction result;
step S110, assembling at least one second extraction result to obtain a consensus sequence;
optionally, step S110 includes: performing local assembly based on the extracted sub-sequencing sequences (reads) and a second predetermined number of bases of genome draft sequence on each side of the gap sequence.
Step S112, replacing the original sequence in the genome sketch by using the consensus sequence;
optionally, step S112 includes: replacing the corresponding gap sequence with the consensus sequence.
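The replacement of step S112 may be illustrated by the following sketch. It assumes that each consensus sequence has already been trimmed so that it covers only the gap interval it replaces (the handling of the flanking sequence included in the local assembly is not specified in the text), and that gap coordinates are 0-based and half-open; the function name `fill_gaps` is hypothetical.

```python
def fill_gaps(scaffold, fills):
    """Replace gap intervals in a scaffold with consensus sequences.

    fills: list of (gap_start, gap_end, consensus) tuples that refer to the
    coordinates of the ORIGINAL scaffold and do not overlap each other.
    """
    out = scaffold
    # Work from right to left so that earlier coordinates stay valid even
    # when a consensus is longer or shorter than the gap it replaces.
    for gap_start, gap_end, consensus in sorted(fills, key=lambda f: f[0],
                                                reverse=True):
        if set(out[gap_start:gap_end].upper()) != {"N"}:
            raise ValueError("interval %d-%d is not a gap" % (gap_start, gap_end))
        out = out[:gap_start] + consensus + out[gap_end:]
    return out
```

Gaps for which no consensus sequence could be assembled are simply omitted from `fills` and remain as N in the output.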
The embodiment of the present application provides an alternative implementation manner, as shown in fig. 2:
step S202: inputting a third generation sequencing sequence and a genome sketch;
step S204: rough comparison;
specifically, gap sequences are located in the genome draft, and a run of 25 or more consecutive N is defined as a gap sequence. If a run contains fewer than 25 N, the sequences at its left and right ends are changed into N until the number of N reaches 25. Minimap2 is then used to roughly align the third-generation sequencing reads back to the scaffold sequences of the genome draft, yielding all primary alignment positions of each read in the scaffolds.
Step S206: extracting sequences aligned within a certain range of gap sequences;
specifically, bedtools is used to compare the gap sequence positions of the genome draft with the alignment positions of the third-generation sequencing reads, so as to find the third-generation sequencing reads whose alignment positions are within a certain range of a gap sequence. The judgment condition is: the alignment position of the read overlaps by at least 1 bp with the 2000 bp upstream and downstream of the gap sequence. The original reads are thereby divided into two categories. The first category consists of third-generation sequencing reads aligned within a certain range of a gap sequence; these reads are used for the subsequent analysis. The second category consists of third-generation sequencing reads aligned to non-gap regions; these generally account for more than 70% of the reads and need no subsequent analysis, so the overall running speed of the software can be greatly improved.
Step S208: fine comparison;
specifically, the first category of reads, i.e., the reads aligned within a certain range of a gap, is finely aligned using BLASR. According to the BLASR alignment result, the reads meeting the following conditions are extracted: a) the read must align to within 25 bp of the start of the gap sequence; b) there must be at least 25 bases that are not aligned into the hole.
Step S210: assembling the aligned sequences; local assembly is performed using the reads extracted in step S208 and the 1000 bp of genome draft sequence flanking each side of the gap sequence (using the ALLORA assembly software).
Step S212: replacing the original sequence with the assembled sequence to complete hole filling; and replacing the original sequence in the genome sketch by the assembled sequence.
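As an illustration of how the input to the local assembly of step S210 might be prepared, the following sketch collects the 1000 bp flanking sequences of a gap together with the reads extracted in step S208 and writes them in FASTA form. The record names, the function name and the choice of FASTA are assumptions; the invocation of the assembler itself is not shown.

```python
def assembly_input(scaffold, gap_start, gap_end, reads, flank=1000):
    """Build FASTA text for the local assembly of one gap.

    scaffold: scaffold sequence containing the gap.
    gap_start, gap_end: 0-based, half-open gap interval on the scaffold.
    reads: list of (name, sequence) pairs selected for this gap.
    flank: number of draft bases taken on each side of the gap.
    """
    left = scaffold[max(0, gap_start - flank):gap_start]
    right = scaffold[gap_end:gap_end + flank]
    records = [("left_flank", left), ("right_flank", right)] + list(reads)
    # Empty flanks (gap at the very end of a scaffold) are skipped.
    return "".join(">%s\n%s\n" % (name, seq) for name, seq in records if seq)
```

Including the flanks gives the assembler anchors on both sides of the gap, so that the resulting consensus can be placed back into the scaffold.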
In the embodiments of the present application, partial test data from tests on real data are provided, as shown in Tables 1 and 2:
TABLE 1
[Table 1 is reproduced as image BDA0001685902870000081 in the original publication.]
TABLE 2
[Table 2 is reproduced as image BDA0001685902870000091 in the original publication.]
Tests on real data show that the hole filling effect of the method is slightly better than that of PBJelly, and the alignment speed is improved by a factor of 8 to 10.
In the related art, a single alignment tool is used. If only BLASR is used for fine alignment, the consumption of computing resources is severe and the computing cost is high. If only minimap2 is used for alignment, the speed is improved but the accuracy of the result is reduced. Through the above steps, the alignment speed is greatly improved and computing resources are saved without reducing accuracy.
According to an embodiment of the present invention, an embodiment of a device for hole filling based on a third generation sequencing sequence is provided, and fig. 3 is a schematic diagram of a device for hole filling based on a third generation sequencing sequence according to an embodiment of the present invention, as shown in fig. 3, the device includes:
the first comparison module 300 is configured to compare at least one sub-sequencing sequence included in the third-generation sequencing sequence to obtain a first comparison result;
a first extraction module 302, configured to extract a sub-sequencing sequence located within a certain range of at least one gap sequence from the first alignment result, so as to obtain at least one first extraction result; wherein, the gap sequence is unknown sequence.
A second comparison module 304, configured to perform a fine comparison on the first extraction result to obtain a second comparison result;
a second extraction module 306, configured to extract a sub-sequencing sequence located within a certain range of at least one gap sequence from the second alignment result, so as to obtain at least one second extraction result;
an assembling module 308, configured to assemble at least one second extraction result to obtain a consensus sequence;
a replacing module 310 for replacing the original sequence in the genome sketch with the consensus sequence;
Optionally, the first extraction module 302 includes: a first sub-alignment unit, configured to use first alignment software to align at least one sub-sequencing sequence contained in the third-generation sequencing sequence back to the scaffold sequence of the genome sketch, so as to obtain the alignment position of each sub-sequencing sequence (read) in the scaffold; a comparison unit, configured to use second alignment software to compare the position of the gap sequence of the genome sketch with the position of each sub-sequencing sequence (read) contained in the third-generation sequencing sequence; and a first sub-extraction unit, configured to extract the sub-sequencing sequences (reads) whose alignment positions are within a certain range of the gap sequence.
When the position of the gap sequence of the genome sketch is compared with the position of each sub-sequencing sequence (read) contained in the third-generation sequencing sequence, the judgment condition is: the alignment position of the read overlaps by at least 1 bp with the region extending 2000 bp upstream and downstream of the gap sequence.
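As a non-limiting illustration, the judgment condition above can be sketched in Python as a plain interval test. The names `Interval`, `overlaps_gap_window` and `select_reads`, the 0-based half-open coordinates, and the choice to treat the gap together with its 2000 bp flanks as one window are assumptions made for this sketch only; the embodiment performs the comparison with the second alignment software.

```python
from typing import Iterable, List, NamedTuple


class Interval(NamedTuple):
    """Half-open interval [start, end) on one scaffold, 0-based (assumed convention)."""
    scaffold: str
    start: int
    end: int


def overlaps_gap_window(read: Interval, gap: Interval, flank: int = 2000) -> bool:
    """Return True if the read alignment shares at least 1 bp with the gap
    extended by `flank` bp upstream and downstream."""
    if read.scaffold != gap.scaffold:
        return False
    window_start = max(0, gap.start - flank)  # clamp at the scaffold start
    window_end = gap.end + flank
    shared = min(read.end, window_end) - max(read.start, window_start)
    return shared >= 1


def select_reads(reads: Iterable[Interval], gap: Interval, flank: int = 2000) -> List[Interval]:
    """Keep only the read alignments that satisfy the overlap condition."""
    return [r for r in reads if overlaps_gap_window(r, gap, flank)]
```

For a gap at positions 10000 to 10500, a read alignment ending at position 8001 is kept because it shares exactly one base with the window, whereas one ending at position 8000 is not.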
Optionally, the second alignment module 304 includes: a second sub-alignment module, configured to use third alignment software to perform fine alignment on the sub-sequencing sequences whose alignment positions are within a certain range of the gap sequence.
Optionally, the second extraction module 306 includes: an extraction module, configured to extract, according to the alignment result of the third alignment software, the sub-sequencing sequences (reads) that meet the following conditions. Condition one: the read aligns to a position whose distance from the starting point of the corresponding gap sequence is within a preset number. Condition two: there are at least a first predetermined number of bases that are not aligned into the gap sequence.
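As a non-limiting illustration of the two screening conditions, the following Python sketch filters fine-alignment hits on the upstream side of a gap. The `Hit` fields, the 1-based coordinates, the forward-strand orientation, and the reading of condition two as unaligned read bases that overhang past the end of the alignment toward the gap are all assumptions made for this sketch; the embodiment does not fix the preset numbers, so `max_dist` and `min_overhang` are hypothetical parameters.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One fine-alignment hit of a read against the scaffold (hypothetical fields)."""
    read_id: str
    read_len: int     # total read length in bases
    query_end: int    # last aligned base on the read (1-based, inclusive)
    subject_end: int  # last aligned base on the scaffold (1-based, inclusive)


def passes_filter(hit: Hit, gap_start: int, max_dist: int, min_overhang: int) -> bool:
    """Apply both conditions to a hit that approaches the gap from upstream."""
    # Condition one: the alignment ends within max_dist bases of the gap start.
    near_gap = abs(gap_start - hit.subject_end) <= max_dist
    # Condition two (assumed reading): enough read bases remain unaligned
    # beyond the alignment end, so that they can extend into the gap.
    overhang = hit.read_len - hit.query_end
    return near_gap and overhang >= min_overhang
```

A symmetric test, using the alignment start and the gap end, would be needed for reads that approach the gap from the downstream side.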
Optionally, the assembly module 308 includes: a sub-assembly module, configured to perform local assembly based on the extracted sub-sequencing sequences (reads) and a second predetermined number of genome sketch sequences on both sides of the gap sequence.
Optionally, the replacement module 310 includes: a sub-replacement module, configured to replace the corresponding gap sequence with the consensus sequence.
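As a non-limiting illustration of the replacement step, the following Python sketch substitutes consensus sequences for gap intervals in a scaffold string. The function name `fill_gaps`, the 0-based half-open coordinates and the tuple layout are assumptions made for this sketch. Gaps are processed from right to left so that a length change introduced by one replacement does not shift the coordinates of gaps that have not yet been processed.

```python
from typing import List, Tuple


def fill_gaps(scaffold: str, fills: List[Tuple[int, int, str]]) -> str:
    """Replace each half-open interval [start, end) with its consensus sequence.

    `fills` holds (start, end, consensus) tuples in original scaffold coordinates.
    The intervals are assumed not to overlap.
    """
    # Work from the rightmost gap to the leftmost one so that earlier
    # coordinates stay valid after each replacement.
    for start, end, consensus in sorted(fills, reverse=True):
        if not (0 <= start <= end <= len(scaffold)):
            raise ValueError(f"interval {start}-{end} is outside the scaffold")
        scaffold = scaffold[:start] + consensus + scaffold[end:]
    return scaffold
```

The consensus need not have the same length as the run of N bases it replaces, so the scaffold length may change after filling.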
Optionally, the apparatus further comprises: a determining module, configured to determine the gap sequence from the third-generation sequencing sequence, wherein the bases in the gap sequence are represented by N. The number of bases of the gap sequence is a predetermined number; when the number of bases of the gap sequence is less than the predetermined number, the sequences at the left and right ends of the gap sequence are changed into gap sequence until the number of bases of the gap sequence at least reaches the predetermined number.
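As a non-limiting illustration, locating gap sequences as runs of N and widening short gaps can be sketched in Python as follows. The function names, the regular-expression approach, the alternating left and right growth, and the parameter `min_len` (standing in for the predetermined number, which the embodiment does not fix) are assumptions made for this sketch.

```python
import re
from typing import List, Tuple


def find_gaps(seq: str) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals of every run of N in the sequence."""
    return [(m.start(), m.end()) for m in re.finditer(r"[Nn]+", seq)]


def widen_short_gaps(seq: str, min_len: int) -> str:
    """Convert flanking bases to N until every gap holds at least min_len bases."""
    bases = list(seq)
    for start, end in find_gaps(seq):
        # Grow the gap one base at a time, left then right, until it is
        # long enough or both scaffold ends have been reached.
        while end - start < min_len and (start > 0 or end < len(bases)):
            if start > 0:
                start -= 1
                bases[start] = "N"
            if end - start < min_len and end < len(bases):
                bases[end] = "N"
                end += 1
    return "".join(bases)
```

For example, with `min_len` set to 4, the two-base gap in `ACGTNNACGT` is widened by one base on each side.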
It should be noted that, for a preferred implementation of the embodiment shown in Fig. 3, reference may be made to the description of Figs. 1 to 2; details are not repeated here.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis, and reference may be made to the related description of other embodiments for parts that are not described in detail in a certain embodiment.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, and other media capable of storing program code.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (18)

1. A hole filling method based on a third-generation sequencing sequence, characterized by comprising the following steps:
aligning at least one sub-sequencing sequence contained in the third-generation sequencing sequence by using Minimap2 software to obtain a first alignment result;
extracting a sub-sequencing sequence located within a certain range of at least one gap sequence from the first alignment result to obtain at least one first extraction result;
performing fine alignment on the first extraction result by using blast software to obtain a second alignment result;
extracting a sub-sequencing sequence located within a certain range of at least one gap sequence from the second alignment result to obtain at least one second extraction result;
assembling the at least one second extraction result to obtain a consensus sequence;
replacing an original sequence in a genome sketch with the consensus sequence;
wherein the gap sequence is an unknown sequence.
2. The method of claim 1, wherein prior to extracting the sub-sequencing sequences within a range of at least one gap sequence from the first alignment result to obtain at least one first extraction result, the method further comprises:
determining a gap sequence from the third-generation sequencing sequence, wherein the bases in the gap sequence are represented by N.
3. The method according to claim 2, wherein the number of bases of the gap sequence is a predetermined number, and wherein, in the case where the number of bases of the gap sequence is less than the predetermined number, sequences at the left and right ends of the gap sequence are changed to the gap sequence until the number of bases of the gap sequence reaches the predetermined number.
4. The method of claim 1, wherein extracting the sub-sequencing sequences located within a range of at least one gap sequence from the first alignment result to obtain at least one first extraction result comprises:
aligning at least one sub-sequencing sequence contained in the third-generation sequencing sequence back to the scaffold sequence of the genome sketch by using Minimap2 software to obtain the alignment position of each sub-sequencing sequence (read) in the scaffold;
comparing, by using bedtools, the position of the gap sequence of the genome sketch with the position of each sub-sequencing sequence (read) contained in the third-generation sequencing sequence, and extracting the sub-sequencing sequences (reads) whose alignment positions are within a certain range of the gap sequence.
5. The method according to claim 4, wherein, in comparing the position of the gap sequence of the genome sketch with the position of each sub-sequencing sequence (read) contained in the third-generation sequencing sequence, the judgment condition is: the alignment position of the sub-sequencing sequence (read) overlaps by at least 1 bp with the region extending 2000 bp upstream and downstream of the gap sequence.
6. The method of claim 4, wherein performing fine alignment on the first extraction result to obtain a second alignment result comprises:
performing the fine alignment, by using blast software, on the sub-sequencing sequences whose alignment positions are within a certain range of the gap sequence.
7. The method of claim 6, wherein extracting the sub-sequencing sequences within a range of at least one gap sequence from the second alignment result to obtain at least one second extraction result comprises:
extracting, according to the alignment result of the blast software, the sub-sequencing sequences (reads) that meet the following conditions:
condition one: the read aligns to a position whose distance from the starting point of the corresponding gap sequence is within a preset number;
condition two: there are at least a first predetermined number of bases that are not aligned into the gap sequence.
8. The method of claim 6, wherein assembling the at least one second extraction result to obtain a consensus sequence comprises:
performing local assembly based on the extracted sub-sequencing sequences (reads) and a second predetermined number of genome sketch sequences flanking the gap sequence.
9. The method of claim 1, wherein replacing the original sequence in the genome sketch with the consensus sequence comprises: replacing the corresponding gap sequence with the consensus sequence.
10. A hole filling device based on a third-generation sequencing sequence, characterized by comprising:
a first alignment module 300, configured to align at least one sub-sequencing sequence contained in the third-generation sequencing sequence by using Minimap2 software to obtain a first alignment result;
a first extraction module, configured to extract a sub-sequencing sequence located within a certain range of at least one gap sequence from the first alignment result, so as to obtain at least one first extraction result;
a second alignment module, configured to perform fine alignment on the first extraction result by using blast software to obtain a second alignment result;
a second extraction module, configured to extract a sub-sequencing sequence located within a certain range of at least one gap sequence from the second alignment result, so as to obtain at least one second extraction result;
an assembly module, configured to assemble the at least one second extraction result to obtain a consensus sequence;
a replacement module, configured to replace an original sequence in a genome sketch with the consensus sequence;
wherein the gap sequence is an unknown sequence.
11. The apparatus of claim 10, further comprising:
a determining module, configured to determine a gap sequence from the third-generation sequencing sequence, wherein the bases in the gap sequence are represented by N.
12. The apparatus according to claim 11, wherein the number of bases of the gap sequence is a predetermined number, and wherein in the case where the number of bases of the gap sequence is less than the predetermined number, sequences at left and right ends of the gap sequence are changed to the gap sequence until the number of bases of the gap sequence reaches at least the predetermined number.
13. The apparatus of claim 10, wherein the first extraction module comprises:
a first sub-alignment module, configured to use Minimap2 software to align at least one sub-sequencing sequence contained in the third-generation sequencing sequence back to the scaffold sequence of the genome sketch, so as to obtain the alignment position of each sub-sequencing sequence (read) in the scaffold;
a comparison module, configured to use bedtools to compare the position of the gap sequence of the genome sketch with the position of each sub-sequencing sequence (read) contained in the third-generation sequencing sequence;
a first sub-extraction module, configured to extract the sub-sequencing sequences (reads) whose alignment positions are within a certain range of the gap sequence.
14. The apparatus of claim 13, wherein, in comparing the position of the gap sequence of the genome sketch with the position of each sub-sequencing sequence (read) contained in the third-generation sequencing sequence, the judgment condition is: the alignment position of the sub-sequencing sequence (read) overlaps by at least 1 bp with the region extending 2000 bp upstream and downstream of the gap sequence.
15. The apparatus of claim 13, wherein the second alignment module comprises:
a second sub-alignment module, configured to use blast software to perform fine alignment on the sub-sequencing sequences whose alignment positions are within a certain range of the gap sequence.
16. The apparatus of claim 15, wherein the second extraction module comprises:
an extraction module, configured to extract, according to the alignment result of the blast software, the sub-sequencing sequences (reads) that meet the following conditions:
condition one: the read aligns to a position whose distance from the starting point of the corresponding gap sequence is within a preset number;
condition two: there are at least a first predetermined number of bases that are not aligned into the gap sequence.
17. The apparatus of claim 15, wherein the assembly module comprises:
a sub-assembly module, configured to perform local assembly based on the extracted sub-sequencing sequences (reads) and a second predetermined number of genome sketch sequences on both sides of the gap sequence.
18. The apparatus of claim 10, wherein the replacement module comprises: a sub-replacement module, configured to replace the corresponding gap sequence with the consensus sequence.
CN201810581026.8A 2018-06-05 2018-06-05 Hole filling method and device based on third-generation sequencing sequence Active CN108763871B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810581026.8A CN108763871B (en) 2018-06-05 2018-06-05 Hole filling method and device based on third-generation sequencing sequence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810581026.8A CN108763871B (en) 2018-06-05 2018-06-05 Hole filling method and device based on third-generation sequencing sequence

Publications (2)

Publication Number Publication Date
CN108763871A CN108763871A (en) 2018-11-06
CN108763871B true CN108763871B (en) 2022-05-31

Family

ID=64000481

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810581026.8A Active CN108763871B (en) 2018-06-05 2018-06-05 Hole filling method and device based on third-generation sequencing sequence

Country Status (1)

Country Link
CN (1) CN108763871B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113782099B (en) * 2021-10-27 2022-03-04 北京诺禾致源科技股份有限公司 Method and device for repairing genome sequence assembly gap

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4206227B2 (en) * 2001-06-12 2009-01-07 モビスフィア リミテッド Smart antenna array
US7505867B2 (en) * 2007-05-21 2009-03-17 General Electric Co. System and method for predicting medical condition
CN105989249A (en) * 2014-09-26 2016-10-05 叶承羲 Method, system and device for assembling genomic sequence
CN106022002A (en) * 2016-05-17 2016-10-12 杭州和壹基因科技有限公司 Three-generation PacBio sequencing data-based hole filling method
CN107784201A (en) * 2016-08-26 2018-03-09 深圳华大基因科技服务有限公司 A kind of real-time sequencing sequence joint filling-up hole method and system of two generation sequences and three generations's unimolecule
CN108573127A (en) * 2017-03-14 2018-09-25 深圳华大基因科技服务有限公司 A method for processing raw data of nucleic acid third-generation sequencing and its application

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3025317B1 (en) * 2014-08-26 2022-09-23 Imabiotech METHOD FOR CHARACTERIZING A SAMPLE BY MASS SPECTROMETRY IMAGING
CN104531848A (en) * 2014-12-11 2015-04-22 杭州和壹基因科技有限公司 Method and system for assembling genome sequence
CN107858408A (en) * 2016-09-19 2018-03-30 深圳华大基因科技服务有限公司 A kind of generation sequence assemble method of genome two and system
CN106682393B (en) * 2016-11-29 2019-05-17 北京荣之联科技股份有限公司 Genome sequence comparison method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Third generation sequencing: technology and its potential impact on evolutionary biodiversity research;Christoph Bleidorn et al.;《Systematics and Biodiversity》;20151221;1-8 *
全基因组测序在重要家畜上的研究进展;李晓凯 等,;《生物技术通报》;20171208;第34卷(第6期);11-21 *

Also Published As

Publication number Publication date
CN108763871A (en) 2018-11-06

Similar Documents

Publication Publication Date Title
Deorowicz et al. FAMSA: Fast and accurate multiple sequence alignment of huge protein families
CN110292775B (en) Method and device for acquiring difference data
Li et al. Fast and accurate long-read alignment with Burrows–Wheeler transform
CN108985008B (en) Method and system for rapidly comparing gene data
JP2004164036A (en) Method for evaluating commonality of document
WO2018218788A1 (en) Third-generation sequencing sequence alignment method based on global seed scoring optimization
CN103810229A (en) System, method, and computer program product for performing a string search
CN106021985B (en) A kind of genomic data compression method
CN111241389A (en) Sensitive word filtering method and device based on matrix, electronic equipment and storage medium
CN109901978A (en) A kind of Hadoop log lossless compression method and system
CN103714086A (en) Method and device used for generating non-relational data base module
WO2017092444A1 (en) Log data mining method and system based on hadoop
Zhang et al. FMAlign2: a novel fast multiple nucleotide sequence alignment method for ultralong datasets
CN108763871B (en) Hole filling method and device based on third-generation sequencing sequence
CN114861614B (en) Method and device for filling data, electronic device, and medium
CN106776704B (en) Statistical information collection method and device
Mun et al. Pangenomic genotyping with the marker array
US20230004707A1 (en) Method and device for sorting Chinese characters, searching Chinese characters and constructing dictionary
CN104991920A (en) Label generation method and apparatus
Alipanahi et al. Disentangled long-read de Bruijn graphs via optical maps
EP3663890B1 (en) Alignment method, device and system
CN106682107B (en) Method and device for determining incidence relation of database table
Chen et al. CGAP-align: a high performance DNA short read alignment tool
CN107403076B (en) DNA sequence processing method and equipment
CN108776749B (en) Sequencing data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220509

Address after: 100083 room B258, innovation building, 29 life Garden Road, Hui lung Guan, Changping District, Beijing.

Applicant after: BEIJING NOVOGENE TECHNOLOGY Co.,Ltd.

Address before: 210000 floor 10, building a, phase I, Yangzi science and innovation center, No. 211, pubin Road, industrial technology research and Innovation Park, Jiangbei new area, high tech Development Zone, Nanjing, Jiangsu Province

Applicant before: NANJING NOVOGENE BIOTECHNOLOGY Co.,Ltd.

GR01 Patent grant