CN118335196A

CN118335196A - Micro chromosome assembly identification device, method and application thereof

Info

Publication number: CN118335196A
Application number: CN202410761258.7A
Authority: CN
Inventors: 周勋; 李晓波; 闫琦; 李泽阳; 任雪; 王娟; 牛晓阳; 李志民
Original assignee: Annoroad Gene Technology Beijing Co ltd
Current assignee: Annoroad Gene Technology Beijing Co ltd
Priority date: 2024-06-13
Filing date: 2024-06-13
Publication date: 2024-07-12

Abstract

The invention discloses a microchromosome assembly identification device comprising: a data acquisition module configured to acquire gene sequencing data based on a sample; an assembly module configured to assemble the gene sequencing data acquired by the data acquisition module into chromosome-level sequences; a comparison module configured to compare the chromosome-level sequences assembled by the assembly module with the sequence data stored in a database module and to screen out candidate microchromosome sequences; the database module, which stores known microchromosome sequence data; and a determination module configured to determine the final microchromosomes among the input candidate microchromosome sequences based on microchromosome features. The device can assemble gene sequencing data at the chromosome level, compare the assembly with a constructed microchromosome library, and stably identify microchromosomes in the assembled genome according to the features of microchromosomes.

Description

Micro chromosome assembly identification device, method and application thereof

Technical Field

The invention relates to the field of biotechnology, in particular to the technical field of genome assembly. Specifically, the invention relates to a device and a method for microchromosome assembly identification, and applications thereof.

Background

Chromosomes, the basic units of genetic material located within the nucleus, have undergone extensive changes throughout eukaryotic evolution, and the number and size of chromosomes vary greatly among extant species. Microchromosomes, once considered non-essential genomic fragments in birds, are gene-rich elements with high GC content and few transposable elements. Their origin has been controversial for decades. In recent years, studies have found that in most birds and non-avian reptiles, chromosomes can be divided into large macrochromosomes (i.e., autosomes, sex chromosomes) and small microchromosomes. Studies have also shown that microchromosomes occur in some other vertebrates (e.g., fish) but are not typically found in amphibians or mammals. The karyotype of most birds is highly conserved, consisting of 9 to 10 pairs of macrochromosomes and about 30 pairs of microchromosomes.

In recent years, it has been found that microchromosomes (MicroChromosome) and dot chromosomes (Dot Chromosome), a class of microchromosomes found in birds and reptiles, are generally small, about a few Mb, yet relatively stable and conserved in sequence, and have remained substantially stable during the evolution of birds. Relative to the corresponding autosomes, dot chromosomes have higher GC content, repeat content, methylation level, and H3K9me3 histone modification level, lower H3K36me3 and H3K27me3 histone modification levels, and stronger spatial interactions among themselves. Because these chromosomes are small and structurally complex, they are generally difficult to assemble. Furthermore, the prior art provides no clear method for identifying microchromosomes during genome assembly for different species.

In view of the above, there is an urgent need for a microchromosome assembly identification device that can assemble gene sequencing data at the chromosome level, compare the assembly with a constructed microchromosome library, and stably identify microchromosomes in the assembled genome according to microchromosome features. Such a device helps to further improve the accuracy of chromosome-level genome assemblies and lays a foundation for studying the evolutionary origin of species.

Disclosure of Invention

The invention aims to provide a device and a method for microchromosome assembly identification, which can perform chromosome-level assembly from input sequencing data and identify the microchromosome sequences present in the assembled genome by comparison with a constructed microchromosome library combined with microchromosome features.

To achieve the above object, according to a first aspect of the present invention, there is provided a microchromosome assembly identification device comprising: a data acquisition module configured to acquire gene sequencing data based on a sample; an assembly module configured to assemble the gene sequencing data acquired by the data acquisition module into chromosome-level sequences; a comparison module configured to compare the chromosome-level sequences assembled by the assembly module with the sequence data stored in a database module and to screen out candidate microchromosome sequences; the database module, which stores known microchromosome sequence data; and a determination module configured to determine the final microchromosomes among the input candidate microchromosome sequences based on microchromosome features; wherein a candidate microchromosome sequence is a genome sequence whose aligned region with the sequence data in the database has a coverage of more than 30%, and the microchromosome features are the repeat sequence ratio and the gene density of the candidate microchromosome sequences.
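The screening criterion above (alignment coverage of more than 30%) can be illustrated with a minimal sketch. It assumes that coverage is measured as the fraction of the assembled sequence spanned by alignments to database sequences; the text does not state which sequence serves as the denominator, and the names `merged_length`, `alignment_coverage`, and `is_candidate` are hypothetical.

```python
from typing import Iterable, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on the assembled sequence


def merged_length(intervals: Iterable[Interval]) -> int:
    """Number of bases covered by a set of possibly overlapping intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Disjoint from the running block: close it and open a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the running block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def alignment_coverage(seq_length: int, intervals: Iterable[Interval]) -> float:
    """Fraction of the assembled sequence covered by database alignments.

    Assumption: the denominator is the assembled sequence length.
    """
    if seq_length <= 0:
        raise ValueError("seq_length must be positive")
    return merged_length(intervals) / seq_length


def is_candidate(seq_length: int, intervals: Iterable[Interval],
                 min_coverage: float = 0.30) -> bool:
    """Apply the 'more than 30%' threshold as a strict inequality."""
    return alignment_coverage(seq_length, intervals) > min_coverage
```

Overlapping alignments are merged before counting so that a region hit by several database sequences is not counted more than once.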

Further, applying the above microchromosome features means excluding, from the candidate microchromosome sequences, chromosomes whose repeat sequence ratio is more than 70% and whose gene density is less than 10.

Further, the above gene sequencing data includes any one or a combination of HiFi sequencing data, ONT ultra-long sequencing data, Hi-C sequencing data, ONT duplex sequencing data, Pore-C sequencing data, Strand-seq sequencing data, linked-read sequencing data, Nanopore sequencing data, and CLR sequencing data.

Further, the above gene sequencing data includes long-read sequencing data and long-range sequencing data; the long-read sequencing data includes any one or a combination of HiFi sequencing data, ONT ultra-long sequencing data, Hi-C sequencing data, ONT duplex sequencing data, linked-read sequencing data, Nanopore sequencing data, and CLR sequencing data; the long-range sequencing data includes Hi-C sequencing data, Pore-C sequencing data, Strand-seq sequencing data.

Further, the data acquisition module further includes a data processing unit that cleans the acquired gene sequencing data before assembly.

Further, the data processing unit includes: a data filtering element and/or a data error correction element; the data filtering element is configured to filter the gene sequencing data acquired by the data acquisition module according to certain characteristics; the data error correction element is configured to correct the input sequencing data and output the corrected data.

Further, the above assembly module includes a preliminary assembly unit and an auxiliary assembly unit; the preliminary assembly unit is configured to perform preliminary genome assembly on the sequencing data output by the data acquisition module to obtain a draft genome; the auxiliary assembly unit is located downstream of the preliminary assembly unit and is configured to perform auxiliary assembly on the draft genome from the upstream preliminary assembly, using the sequencing data output by the data acquisition module, to obtain chromosome-level long scaffold sequences and their orientations.

Further, the above assembly module further includes a sequence gap-filling unit located downstream of the auxiliary assembly unit; the sequence gap-filling unit is configured to use the sequencing data output by the data acquisition module to fill gaps in the adjusted chromosome-level genome obtained upstream, yielding the final chromosome-level genome sequence.

Further, the above auxiliary assembly unit further includes an adjustment element configured to correct errors in the scaffolded chromosome-level genome to obtain adjusted chromosome-level genome scaffold sequences.

Further, the comparison module further includes an annotation unit configured to annotate the candidate microchromosome sequences.

Further, the device also comprises an evaluation module configured to perform collinearity analysis between the assembled and identified microchromosomes and previously known genomic data.

The invention has the beneficial effects that:

By applying the technical scheme of the invention, chromosome-level genome assembly is carried out using gene sequencing data such as third-generation sequencing data and Hi-C sequencing data, and the assembly is compared with a constructed microchromosome library in combination with microchromosome features, so that microchromosomes can be identified. The microchromosomes identified and output from the genomic data processed and assembled in this way can show good collinearity with known, reported microchromosome genomes.

Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

The foregoing and/or additional aspects and advantages of the present application will be apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings. The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application, without limitation to the application, wherein:

FIG. 1 is a schematic diagram of a microchromosome assembly identification device according to one embodiment of the invention.

FIG. 2 is a schematic diagram of a microchromosome assembly identification device according to another embodiment of the invention.

FIG. 3 is a schematic diagram of a microchromosome assembly identification device according to another embodiment of the invention.

FIG. 4 is a schematic diagram of a microchromosome assembly identification device according to another embodiment of the invention.

FIG. 5 is a schematic diagram of a microchromosome assembly identification device according to another embodiment of the invention.

FIG. 6 is a schematic diagram of a microchromosome assembly identification device according to another embodiment of the invention.

FIG. 7 is a flow chart of a method for microchromosome assembly identification according to example 1 of the present invention.

FIG. 8 is a block diagram of the hardware structure for microchromosome assembly identification according to an embodiment of the present invention.

FIG. 9 shows the collinearity between the microchromosomes of the genome assembled according to example 2 of the present invention and the microchromosomes of the published genome of a chicken (Huxu breed).

FIG. 10 shows the collinearity between the dot chromosomes of the genome assembled according to example 2 of the present invention and the dot chromosomes of the published genome of a chicken (Huxu breed).

Detailed Description

The scheme of the present invention is explained below with reference to examples. Those skilled in the art will appreciate that the following examples are illustrative of the present invention and should not be construed as limiting its scope. Where specific techniques or conditions are not indicated in the examples, they follow the techniques or conditions described in the literature in the art (e.g., J. Sambrook et al., Molecular Cloning: A Laboratory Manual, third edition, translated by Huang Peitang et al., Science Press) or the product specifications. Reagents or instruments for which no manufacturer is indicated are conventional, commercially available products, such as those available from Illumina.

It should be noted that the terms "first" and "second" in the description and claims of the present invention and the above-described drawings are used only to distinguish similar objects, and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated or as describing a particular sequence or order. Thus, features defining "first", "second", or the like, may explicitly or implicitly include at least one such feature, and it is to be understood that the data so used may be interchanged as appropriate, such that embodiments of the invention described herein may be implemented in sequences other than those illustrated or described herein. In the description of the present invention, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Term interpretation:

Third-generation sequencing technology: refers to single-molecule real-time sequencing technology. Compared with the previous two generations of sequencing technology, its most distinctive feature is single-molecule sequencing: the sequencing process requires no PCR amplification, and each DNA molecule is sequenced individually. According to the sequencing principle, the current mainstream third-generation sequencing technologies can be divided into nanopore electrical-signal sequencing and single-molecule fluorescent-signal sequencing, namely the nanopore single-molecule sequencing (Single-molecule Nanopore DNA Sequencing) technology of Oxford Nanopore Technologies (ONT) and the single-molecule real-time (Single Molecule Real-Time, SMRT) sequencing technology of Pacific Biosciences (PacBio) of the United States.

Nanopore single-molecule sequencing (Single-molecule Nanopore DNA Sequencing): the core of nanopore sequencing is a nanopore with a molecular adapter covalently bound inside it. After the nanopore protein is fixed on an electrically resistant membrane, a motor protein pulls the nucleic acid through the nanopore. The charge changes as the nucleic acid passes through the nanopore, causing a change in the current across the resistant membrane. Because the nanopore diameter is very small, only a single nucleic acid polymer can pass through, and because the bases A, T, C, and G differ in their charge properties, each base disturbs the current differently as it passes through the protein nanopore. The base sequence can therefore be determined by monitoring and decoding the current signals in real time, thereby achieving sequencing.

Single-molecule real-time sequencing technology (SMRT): currently the more widely used third-generation sequencing technology, whose representative platforms are the products developed by PacBio. The sequencing principle and main workflow of the platform are: (1) the DNA template is ligated to hairpin adapter sequences to form a closed circular single-stranded DNA template, which serves as the sequencing unit for building the SMRT template library; (2) the assembled sequencing units are loaded onto an SMRT chip containing zero-mode waveguides (Zero-mode waveguide, ZMW) and bind to the DNA polymerase at the bottom of each ZMW for sequencing; (3) during the sequencing reaction, the polymerase travels around the SMRT template and extends the DNA strand using fluorescently labeled dNTPs; (4) during extension, the fluorophore of a dNTP that has base-paired with the template is excited by the laser emitted from the bottom of the ZMW and captured, and because paired bases and free bases reside for different lengths of time, the base information can be determined. PacBio sequencing products mainly comprise the RS II, Sequel, Sequel II, and Revio sequencing platforms, of which the Sequel II platform offers two sequencing modes, high accuracy (circular consensus sequencing, CCS) and long read length (continuous long reads, CLR). Parameters such as read length, accuracy, and data yield differ between products and modes, so the products suit different application scenarios. The Revio platform, which generates CCS data, is currently the newest of these platforms.
SMRT sequencing platforms maintain high accuracy in GC-rich regions, repetitive regions, and base-modified regions and can fully detect gene isoforms, novel genes, and the like. Their accuracy, however, is strongly affected by the effective working time of the polymerase; random errors in the sequencing process can be effectively reduced by increasing the number of sequencing passes.

HiFi sequencing data: PacBio's high-fidelity long-read sequencing data, i.e., sequencing reads that are both long and highly accurate, introduced with the Sequel II third-generation sequencing platform and generally produced in CCS mode. In this mode, the polymerase read length is typically greater than the insert length, so the polymerase performs rolling-circle sequencing around the template and the insert is sequenced multiple times. Random errors introduced in a single pass can be self-corrected algorithmically, finally yielding highly accurate HiFi reads.
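The idea that repeated passes over the same insert cancel out random errors can be shown with a toy majority vote. This is only a sketch under strong assumptions: the passes are taken to be already aligned, of equal length, and free of insertions or deletions, which real circular consensus software does not assume. The function name `consensus` is hypothetical.

```python
from collections import Counter
from typing import Sequence


def consensus(passes: Sequence[str]) -> str:
    """Column-wise majority vote over pre-aligned, equal-length passes."""
    if not passes:
        raise ValueError("at least one pass is required")
    length = len(passes[0])
    if any(len(p) != length for p in passes):
        raise ValueError("all passes must have the same length in this sketch")
    bases = []
    for column in zip(*passes):
        # The most frequent base in this column wins.
        base, _count = Counter(column).most_common(1)[0]
        bases.append(base)
    return "".join(bases)
```

With three passes that each carry a different random error, the vote recovers the underlying sequence, which is why more passes generally give higher accuracy.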

ONT ultra-long sequencing data: ultra-long (ONT UL) sequencing data from Oxford Nanopore, a type of ultra-long sequencing unique to the Oxford Nanopore platform. It produces ultra-long reads that readily span large repetitive regions in the genome, can markedly improve genome assembly for a species, and can be used to fill gaps in the genome.

Hi-C sequencing data: data obtained by high-throughput chromosome conformation capture, i.e., the Hi-C technique (High-throughput/resolution Chromosome Conformation Capture). Hi-C is derived from chromosome conformation capture (Chromosome Conformation Capture, 3C) technology; it takes the whole nucleus as the object of study and uses high-throughput sequencing combined with bioinformatics methods to study the interactions of chromosomal DNA across the whole genome. Hi-C-assisted assembly captures the interactions of chromosomal DNA and relies on two principles: the interaction frequency within a chromosome is markedly higher than that between chromosomes, and within the same chromosome the interaction frequency decreases as the distance increases. On this basis, scaffolds or contigs are clustered into groups, and the contigs/scaffolds within each group are then ordered and oriented, achieving genome scaffolding that approaches the chromosome level.
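The clustering principle, that contigs from the same chromosome share many more Hi-C contacts than contigs from different chromosomes, can be sketched as a simple grouping step. This is an illustration only: the raw link counts, the fixed threshold `min_links`, and the function name `cluster_contigs` are assumptions, and real scaffolding tools normalize contacts and also order and orient the contigs.

```python
from typing import Dict, List, Sequence, Tuple


def cluster_contigs(contigs: Sequence[str],
                    contacts: Dict[Tuple[str, str], int],
                    min_links: int) -> List[List[str]]:
    """Group contigs joined by at least `min_links` Hi-C contacts (union-find)."""
    parent = {c: c for c in contigs}

    def find(x: str) -> str:
        # Walk up to the group representative, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), n_links in contacts.items():
        if n_links >= min_links:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for c in contigs:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())
```

In the toy case of four contigs with strong contacts inside two pairs and only a weak contact between the pairs, the function returns two groups, one per putative chromosome.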

ONT duplex sequencing data: data from the double-strand sequencing technology developed by ONT, which sequences both strands of a DNA fragment. ONT duplex sequencing data approaches PacBio HiFi data in accuracy, and the read length can be longer.

Reads: short sequences generated by high-throughput sequencing platforms, or long sequences generated by PacBio single-molecule real-time sequencing (single molecule real time, SMRT, including CLR and HiFi data) or by ONT (Oxford Nanopore Technologies).

Sequence assembly (Sequence Assembly): sequencing breaks the genome into fragments (shotgun sequencing), so it is not known how the fragments are arranged (joined into strands and finally into chromosomes) or grouped (how different chromosomes are distinguished). Because existing sequencing technology cannot read a whole genome sequence in a single pass, the short sequences are assembled into a complete, ordered sequence with the aid of algorithms and computers. Sequence assembly commonly refers to de novo sequencing (De novo Sequencing) of a new species followed by assembly with suitable assembly software (i.e., De novo Assembly).

T2T genome: refers to a genome, obtained by combining ONT ultra-long data with N50 > 100 Kb (sequencing read length N50 greater than 1000000) with HiFi data and second-generation data (high-throughput sequencing, such as Illumina HiSeqTM/MiSeqTM), that is assembled to the level at which one or more chromosomes run from telomere to telomere (Telomere-to-Telomere, T2T). A complete T2T genome map is the final goal of genome assembly. N50 is an evaluation index for genome assembly: all assembled sequences are sorted from largest to smallest and their lengths are accumulated in that order; when the accumulated length exceeds half of the total length, the length of the sequence just added is the N50.
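The N50 definition given above translates directly into a few lines of code. The sketch below follows the common convention of stopping as soon as the running sum reaches half of the total length; the function name `n50` is illustrative.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The sequence that pushes the running sum to half the total is the N50.
        if running >= half_total:
            return length
    raise AssertionError("unreachable for non-empty input")
```

For lengths 2, 3, 4, 5, and 6 the total is 20; accumulating 6 and then 5 gives 11, which passes the halfway mark of 10, so the N50 is 5.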

Depth (Depth): the ratio of the total number of sequenced bases to the size of the genome under test, i.e., the average number of times each base in the genome is sequenced; in short, the amount of sequencing data relative to the size of the reference genome or transcriptome. It is generally expressed as 1×, 2×, 3×, etc.

Coverage (Coverage): the proportion of the whole genome covered by the sequences obtained from sequencing, i.e., the fraction of the genome that is detected at least once. It is typically expressed as a percentage.
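The difference between depth and coverage is easiest to see with a small calculation. The sketch below uses a per-base boolean array, which is only practical for toy genome sizes; the function names and the half-open interval convention are assumptions made for illustration.

```python
from typing import Iterable, Tuple


def sequencing_depth(read_lengths: Iterable[int], genome_size: int) -> float:
    """Total sequenced bases divided by genome size (e.g. 2.5 means 2.5x)."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(read_lengths) / genome_size


def breadth_of_coverage(read_intervals: Iterable[Tuple[int, int]],
                        genome_size: int) -> float:
    """Fraction of genome positions hit by at least one read.

    Intervals are half-open [start, end) positions on the genome.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    covered = [False] * genome_size
    for start, end in read_intervals:
        # Clamp to the genome so out-of-range intervals are ignored safely.
        for pos in range(max(0, start), min(end, genome_size)):
            covered[pos] = True
    return sum(covered) / genome_size
```

Two 40-base reads that overlap by 20 bases on a 100-base genome give a depth of 0.8× but a coverage of only 60%, because the overlapping bases are counted twice for depth and once for coverage.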

Contig: a contiguous sequence. It refers to a continuous long DNA fragment formed by joining a group of short fragments through the overlapping sequences at their ends. Sequencing initially yields fragments with overlapping regions, which can be spliced together to form contigs. Contigs link the many short sequence fragments obtained during genome sequencing into long contiguous fragments, making the genome structure easier to understand and analyze.

Scaffold: a portion of the genome sequence reconstructed from the short DNA fragments generated by genome sequencing. It consists of contigs and gaps (missing sequence). When the two end sequences of at least one DNA fragment lie on two different contigs, the relative positions of those contigs can be determined, although sequence may be missing between them.
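The relationship between contigs, gaps, and a scaffold can be sketched as follows. Representing missing sequence as runs of `N` is a widespread convention rather than something stated in the text, and the function names are hypothetical.

```python
import re
from typing import Sequence


def build_scaffold(contigs: Sequence[str], gap_sizes: Sequence[int]) -> str:
    """Join ordered contigs, writing each gap as a run of 'N' characters."""
    if not contigs:
        raise ValueError("at least one contig is required")
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between each pair of contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gap_sizes, contigs[1:]):
        parts.append("N" * gap)  # placeholder for the missing sequence
        parts.append(contig)
    return "".join(parts)


def count_gaps(scaffold: str) -> int:
    """Count the runs of 'N' that remain to be filled in a scaffold."""
    return len(re.findall(r"N+", scaffold))
```

Counting the remaining `N` runs gives a simple measure of how much work is left for a later gap-filling step.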

Third-generation sequencing is mainly represented by PacBio HiFi sequencing and Oxford Nanopore ultra-long (ONT UL) sequencing. HiFi sequencing data is highly accurate and allows complex regions to be assembled, while ONT UL sequencing data has longer reads and allows highly repetitive genomic sequences to be assembled. In addition, high-throughput chromosome conformation capture (Hi-C) is a molecular biology technique for studying the three-dimensional structure of chromosomes in a genome; combined with bioinformatic analysis, it reveals the spatial relationships of chromatin across the whole genome and thereby assists chromosome-level genome assembly.

As described in the background, microchromosome sequences are relatively conserved but small, only about a few Mb; compared with autosomes, dot chromosomes are structurally complex, with higher GC content, repeat content, methylation level, and H3K9me3 histone modification level, lower H3K36me3 and H3K27me3 histone modification levels, and stronger spatial interactions among themselves, so they are generally difficult to assemble and identify. Furthermore, the prior art provides no clear method for identifying microchromosomes during genome assembly for different species.

Thus, the present invention provides a microchromosome assembly identification device which, from input and/or acquired gene sequencing data, in particular third-generation sequencing data and Hi-C data, performs assembly, compares the assembly with a constructed microchromosome database, and combines the comparison with microchromosome features, so that microchromosomes can be stably identified and output.

Accordingly, as shown in FIGS. 1-6, in one aspect of the present invention, there is provided a microchromosome assembly identification device comprising: a data acquisition module 01 configured to acquire gene sequencing data based on a sample; an assembly module 02 configured to assemble the gene sequencing data acquired by the data acquisition module into chromosome-level sequences; a comparison module 03 configured to compare the chromosome-level sequences assembled by the assembly module with the sequence data stored in a database module and to screen out candidate microchromosome sequences; the database module 04, which stores known microchromosome sequence data; and a determination module 05 configured to determine the final microchromosomes among the input candidate microchromosome sequences based on microchromosome features.

The microchromosome features are the repeat sequence ratio and the gene density of the candidate microchromosome sequences. More specifically, applying these features means excluding, from the candidate microchromosome sequences, chromosomes having a repeat sequence ratio of more than 70% and a gene density of less than 10.
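The exclusion rule applied by the determination module can be sketched as a small filter. Two points are assumptions rather than statements of the text: the two conditions are read as a conjunction (a candidate is excluded only when both hold), and the unit of gene density is not specified, so the value is compared with 10 as given. The `Candidate` class and `select_final` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Candidate:
    name: str
    repeat_ratio: float   # fraction of the sequence annotated as repeats, 0..1
    gene_density: float   # genes per unit length; the unit is an assumption


def select_final(candidates: Iterable[Candidate],
                 max_repeat_ratio: float = 0.70,
                 min_gene_density: float = 10.0) -> List[Candidate]:
    """Drop candidates that are both repeat-rich and gene-poor."""
    kept = []
    for cand in candidates:
        excluded = (cand.repeat_ratio > max_repeat_ratio
                    and cand.gene_density < min_gene_density)
        if not excluded:
            kept.append(cand)
    return kept
```

Under this reading, a repeat-rich candidate that is still gene-dense is retained, which matches the description of microchromosomes as gene-rich elements.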

The data acquisition module acquires gene sequencing data based on a sample by means including, but not limited to, constructing a sequencing library from a nucleic acid sample and sequencing it to obtain the sequencing data of the sample. Sample-based gene sequencing data may also be obtained by computer simulation, network download, transfer from a storage medium, manual import, and the like.

In one embodiment, the above gene sequencing data comprises any one or a combination of HiFi sequencing data, ONT ultra-long sequencing data, Hi-C sequencing data, ONT duplex sequencing data, Pore-C sequencing data, Strand-seq sequencing data, linked-read sequencing data, Nanopore sequencing data, and CLR sequencing data.

In one embodiment, the above gene sequencing data comprises long-read sequencing data and long-range sequencing data.

In one embodiment, the long-read sequencing data comprises any one or a combination of HiFi sequencing data, ONT ultra-long sequencing data, Hi-C sequencing data, ONT duplex sequencing data, linked-read sequencing data, Nanopore sequencing data, and CLR sequencing data.

In one embodiment, the long-range sequencing data comprises Hi-C sequencing data, Pore-C sequencing data, Strand-seq sequencing data.

In one embodiment, the above gene sequencing data comprises any one or more of the following combinations: a combination of HiFi sequencing data, ONT ultra-long sequencing data, and Hi-C sequencing data; a combination of HiFi sequencing data, ONT ultra-long sequencing data, and Pore-C sequencing data; a combination of HiFi sequencing data, ONT ultra-long sequencing data, and Strand-seq sequencing data; a combination of ONT duplex sequencing data, ONT ultra-long sequencing data, and Hi-C sequencing data; a combination of ONT duplex sequencing data, ONT ultra-long sequencing data, and Pore-C sequencing data; a combination of ONT duplex sequencing data, ONT ultra-long sequencing data, and Strand-seq sequencing data; a combination of Nanopore sequencing data, ONT ultra-long sequencing data, and Hi-C sequencing data; a combination of Nanopore sequencing data, ONT ultra-long sequencing data, and Pore-C sequencing data; a combination of Nanopore sequencing data, ONT ultra-long sequencing data, and Strand-seq sequencing data.

In a specific embodiment, the above-mentioned gene sequencing data is a combination of HiFi sequencing data, ONT ultra-long sequencing data and Hi-C sequencing data.

The data acquisition module further includes a data processing unit 011 that cleans the acquired gene sequencing data before assembly.

In one embodiment, the data processing unit includes a data filtering element 0111 configured to filter the gene sequencing data acquired by the data acquisition module according to certain characteristics, for example sequence length, sequence redundancy, sequence data quality, mitochondrial sequences, chloroplast sequences, and the like. Specifically, for example, the ONT ultra-long sequencing data acquired by the data acquisition module is filtered by sequence length to obtain filtered ONT ultra-long sequencing data.
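Filtering by length, quality, and redundancy can be illustrated with a minimal sketch. It does not reproduce the behavior of any particular filtering tool; the `Read` class, the function `filter_reads`, and the threshold values used in the example are hypothetical, and redundancy is reduced here to exact duplicate sequences.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Read:
    name: str
    sequence: str
    mean_quality: float  # mean per-base quality score of the read


def filter_reads(reads: Iterable[Read],
                 min_length: int = 0,
                 min_mean_quality: float = 0.0) -> List[Read]:
    """Keep reads that are long enough, good enough, and not exact duplicates."""
    seen_sequences = set()
    kept = []
    for read in reads:
        if len(read.sequence) < min_length:
            continue  # too short
        if read.mean_quality < min_mean_quality:
            continue  # quality too low
        if read.sequence in seen_sequences:
            continue  # redundant copy of a read already kept
        seen_sequences.add(read.sequence)
        kept.append(read)
    return kept
```

For ultra-long data the length threshold would typically be the main criterion, in line with the example of filtering ONT ultra-long sequencing data by sequence length.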

The data processing unit further includes a data error correction element 0112 configured to error-correct the input sequencing data and output the corrected data. Specifically, the sequencing data obtained by filtering can be error-corrected and the corrected sequencing data output. More specifically, the filtered ONT ultra-long sequencing data may be error-corrected and the corrected ONT ultra-long sequencing data output.

In a preferred embodiment of the present invention, the data filtering element is a Filtlong element.

In one embodiment, the data error correction element includes a NECAT element and/or a NextDenovo element.

The above assembly module includes a preliminary assembly unit 021 configured to perform preliminary genome assembly on the sequencing data output by the data acquisition module to obtain a draft genome. Specifically, the sequencing data is long-read sequencing data. Further, the long-read sequencing data includes any one or a combination of HiFi sequencing data, filtered ONT ultra-long sequencing data, and ONT duplex sequencing data. Preferably, the long-read sequencing data includes either or both of HiFi sequencing data and ONT ultra-long sequencing data.

In one embodiment, the preliminary assembly unit is selected from any one or a combination of Hifiasm units, Verkko units, Canu units, Flye units, NECAT units, and NextDenovo units. Preferably, the preliminary assembly unit is selected from either or both of Hifiasm units and Verkko units. In a preferred embodiment of the present invention, the preliminary assembly unit is a Hifiasm unit.

The above assembly module further includes an auxiliary assembly unit 022 located downstream of the preliminary assembly unit and configured to perform auxiliary assembly on the draft genome from the upstream preliminary assembly, using the sequencing data output by the data acquisition module, to obtain chromosome-level long scaffold sequences and their orientations. Specifically, the sequencing data is long-range sequencing data. Further, the long-range sequencing data comprises any one or a combination of Hi-C sequencing data, Pore-C sequencing data, and Strand-seq sequencing data. In a preferred embodiment of the present invention, the long-range sequencing data is Hi-C sequencing data. More specifically, the auxiliary assembly scaffolds the preliminarily assembled draft genome using Hi-C data.

In one embodiment, the auxiliary assembly unit includes any one or a combination of YaHS units, LACHESIS units, 3D-DNA units, ALLHiC units, SALSA2 units, and HapHiC units. Further, the auxiliary assembly unit also includes any one or a combination of a Chromap unit, a BWA unit, and a HiC-Pro unit.

In a preferred embodiment of the present invention, the auxiliary assembly unit includes: chromap units and Yahs units. Wherein the Chromap unit is arranged to compute the inter-contig interaction signal by comparing the genome sketch with the Hi-C data, cluster-ordering and phasing the contigs. And Yahs, based on the relationship between the interaction intensity and the position of the contig, mounting the genome sketch by using Hi-C sequencing data to obtain a mounted chromosome scaffold horizontal genome sequence.

The above auxiliary assembly unit further includes an adjustment element 0221 configured to perform error correction on the mounted chromosome-level genome to obtain an adjusted chromosome-level genome scaffold sequence.

In one embodiment, the adjustment element is a Juicebox element, which is used to perform visual error correction on the mounted genome.

The above assembly module further includes a sequence filling unit 023 downstream of the auxiliary assembly unit, which is configured to use the sequencing data output by the data acquisition module to fill holes (gaps) in the adjusted chromosome-level genome obtained upstream, so as to obtain the final chromosome-level genome sequence. Specifically, the sequencing data is long-read sequencing data. Further, the long-read sequencing data includes any one or more of HiFi sequencing data, filtered ONT ultra-long sequencing data, and ONT duplex sequencing data. Preferably, the long-read sequencing data includes either or both of HiFi sequencing data and ONT ultra-long sequencing data.

In one embodiment, the sequence filling unit includes any one or a combination of an LR_Gapcloser unit, a TGS-GapCloser unit, and a quarTeT unit. In a preferred embodiment, the sequence filling unit is a quarTeT unit.

In one embodiment, the comparison module includes a Minimap2 unit and/or a DotPlotly unit.

In one embodiment, the candidate minichromosome sequence is a genomic sequence having an alignment coverage greater than 30%.

The above comparison module further comprises an annotation unit 031 configured to annotate the candidate minichromosome sequences.

In a specific embodiment, the annotation unit is configured to annotate the repetitive sequences of the candidate minichromosome sequences, calculate the proportion of repetitive sequence, and mask the candidate minichromosome sequences. (Masking refers to replacing the positions corresponding to repetitive sequences with N, which benefits downstream data analysis and identification.)
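The masking operation and the repeat proportion described above can be illustrated with a minimal sketch. It is not the behavior of any particular annotation tool; the function names and the 0-based, half-open interval convention are assumptions made for illustration only.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # assumed convention: 0-based, half-open [start, end)


def clip_and_merge(intervals: List[Interval], length: int) -> List[Interval]:
    """Clip intervals to the sequence and merge overlaps so no base is counted twice."""
    clipped = sorted((max(0, s), min(length, e)) for s, e in intervals)
    merged: List[Interval] = []
    for start, end in clipped:
        if start >= end:
            continue  # interval lies outside the sequence or is empty
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def hard_mask(sequence: str, repeats: List[Interval]) -> str:
    """Replace every base inside an annotated repeat with 'N'."""
    bases = list(sequence)
    for start, end in clip_and_merge(repeats, len(sequence)):
        bases[start:end] = "N" * (end - start)
    return "".join(bases)


def repeat_proportion(sequence_length: int, repeats: List[Interval]) -> float:
    """Fraction of the sequence covered by repeat annotations."""
    if sequence_length == 0:
        return 0.0
    covered = sum(e - s for s, e in clip_and_merge(repeats, sequence_length))
    return covered / sequence_length
```

Merging the intervals before counting keeps overlapping repeat annotations from inflating the proportion.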

In a preferred embodiment, the annotation unit comprises a RepeatMasker unit and a RepeatModeler unit configured to perform repeat annotation on the candidate minichromosome sequences and calculate the proportion of repetitive sequence.

In a preferred embodiment, the annotation unit further comprises a GETA unit configured to perform genome annotation on the masked candidate minichromosome sequences and calculate the gene density.

The database module can record known minichromosome sequences by means such as computer simulation, network download, storage-medium transfer, and manual import. In addition, minichromosome data identified by the minichromosome assembly identification device can be added to the database, so that the database is updated iteratively.

The device of the invention further comprises an evaluation module 06 configured to perform collinearity analysis between the assembled and identified minichromosomes and existing genomic data.

In one aspect of the invention, there is provided a method of minichromosome assembly identification comprising: performing preliminary assembly on the first sequencing data and/or the second sequencing data to obtain a preliminarily assembled genome sketch; mounting the genome sketch using third sequencing data to obtain a mounted genome sequence; performing error correction adjustment on the mounted genome sequence to obtain an adjusted chromosome-level genome scaffold sequence; filling holes in the adjusted chromosome-level genome scaffold sequence using the first sequencing data and/or the second sequencing data to obtain a final chromosome-level genome sequence; aligning the final chromosome-level genome sequence with sequences in the minichromosome database and screening candidate minichromosomes; annotating the candidate minichromosome sequences; and screening the final minichromosomes from among the candidate minichromosomes according to the characteristics of minichromosomes.

The first sequencing data and the second sequencing data are long-read sequencing data. Long-read sequencing data generally refers to sequencing data with a read length of 10 kb or more. In the prior art, such data are typically produced by sequencing technologies from Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT). Specifically, the long-read sequencing data includes one or more of Linked-reads sequencing data, nanopore sequencing data, CLR sequencing data, HiFi sequencing data, ONT ultra-long sequencing data, and ONT duplex sequencing data. HiFi sequencing data has a clear advantage over other sequencing data because of its high base accuracy: it consists of high-fidelity (HiFi) reads 10-20 kb in length with an error rate of less than 0.5%, and is currently the core data type for high-quality assembly. ONT ultra-long sequencing data has a read length of 100 kb or more, which helps resolve the remaining repetitive sequences that HiFi reads cannot assemble.

In a specific embodiment, the first sequencing data comprises any one or more of Linked-reads sequencing data, nanopore sequencing data, CLR sequencing data, HiFi sequencing data, and ONT duplex sequencing data. Preferably, the first sequencing data is any one or a combination of HiFi sequencing data and ONT duplex sequencing data. More preferably, the first sequencing data is HiFi sequencing data.

In one embodiment, the second sequencing data is ONT ultra-long sequencing data. Thus, in a preferred embodiment of the present invention, using HiFi sequencing data as the first sequencing data and ONT ultra-long sequencing data as the second sequencing data, a preliminary genomic sketch can be assembled accurately and over a wider range.

In addition, various kinds of software can perform the preliminary assembly using the above-described first sequencing data and second sequencing data to obtain a preliminarily assembled genome sketch, including Hifiasm, Verkko, Canu, Flye, NECAT, and NextDenovo. In a preferred embodiment of the invention, Hifiasm is used to perform preliminary assembly of the first sequencing data and the second sequencing data to obtain the preliminarily assembled genome sketch.

The third sequencing data is long-range sequencing data. As used herein, long-range sequencing data refers to sequencing data that provides chromosome-level long scaffolds and phasing information. In particular, the long-range sequencing data includes any one or a combination of Hi-C sequencing data, Pore-C sequencing data, and Strand-seq sequencing data.

In a specific embodiment, the third sequencing data comprises any one or more of Hi-C sequencing data, Pore-C sequencing data, and Strand-seq sequencing data. Preferably, the third sequencing data is any one or more of Hi-C sequencing data and Pore-C sequencing data. More preferably, the third sequencing data is Hi-C sequencing data.

In an embodiment, mounting the genome sketch further includes: aligning the third sequencing data to the genome sketch, calculating the interaction signals among contigs, and clustering, ordering, and phasing the contigs; and mounting the genome sketch based on the relationship between interaction intensity and contig position, so as to obtain a mounted genome. Many kinds of software can be used for the alignment and the mounting with the third sequencing data and the genome sketch. For example, the alignment software includes Chromap, BWA, and HiC-Pro, and the mounting software includes YaHS, LACHESIS, 3D-DNA, ALLHiC, SALSA2, and HapHiC.
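The idea of an inter-contig interaction signal can be sketched as follows. This is a deliberately simplified illustration and not the algorithm used by any of the tools listed above; it assumes each Hi-C read pair has already been reduced to the two contig names its ends align to, and the length normalization shown is a hypothetical choice.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]


def count_inter_contig_contacts(read_pairs: Iterable[Pair]) -> Counter:
    """Count Hi-C read pairs whose two ends fall on different contigs."""
    counts: Counter = Counter()
    for contig_a, contig_b in read_pairs:
        if contig_a != contig_b:
            # Store each pair under one canonical key regardless of read order.
            key = (contig_a, contig_b) if contig_a < contig_b else (contig_b, contig_a)
            counts[key] += 1
    return counts


def contact_density(counts: Counter, lengths_bp: Dict[str, int]) -> Dict[Pair, float]:
    """Normalize contact counts by contig lengths (contacts per Mb x Mb); an assumed scheme."""
    density: Dict[Pair, float] = {}
    for (a, b), n in counts.items():
        density[(a, b)] = n / ((lengths_bp[a] / 1e6) * (lengths_bp[b] / 1e6))
    return density
```

Contig pairs with a higher normalized signal are more likely to lie close together on the same chromosome, which is the relationship that clustering and ordering rely on.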

Chromap and YaHS are used in the preferred embodiment of the present invention.

Error correction adjustment is then performed on the mounted genome. In a preferred embodiment of the present invention, Juicebox is used to obtain the adjusted chromosome-level genome scaffold sequence. Juicebox is software for visual error correction of Hi-C mounted assemblies.

Hole filling (gap filling) means filling the gaps that remain in the sequence. It can be performed using a variety of software, including any one or a combination of quarTeT, LR_Gapcloser, and TGS-GapCloser. In the preferred embodiment of the present invention, quarTeT is used to fill holes in the adjusted chromosome-level genome scaffold sequence using the first sequencing data and the second sequencing data, so as to obtain the final chromosome-level genome sequence.

The software for aligning the final chromosome-level genome sequence with sequences in the minichromosome database includes Minimap2 and/or DotPlotly.

Candidate minichromosomes are obtained by aligning the final chromosome-level genome sequence against the minichromosome database and selecting the genome sequences whose aligned-region coverage is greater than 30%.
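The 30% coverage screen can be sketched as below, assuming the alignments are available in the PAF format that Minimap2 writes by default (column 1 query name, column 2 query length, columns 3 and 4 query start and end). Treating the assembled sequence as the query, and defining coverage as the fraction of its length covered by merged alignment blocks, are assumptions of this sketch; the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def coverage_from_paf(paf_lines: Iterable[str]) -> Dict[str, float]:
    """Fraction of each query sequence covered by at least one alignment block."""
    blocks: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    lengths: Dict[str, int] = {}
    for line in paf_lines:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        name, qlen = fields[0], int(fields[1])
        lengths[name] = qlen
        blocks[name].append((int(fields[2]), int(fields[3])))

    coverage: Dict[str, float] = {}
    for name, intervals in blocks.items():
        covered, covered_up_to = 0, 0
        for start, end in sorted(intervals):
            start = max(start, covered_up_to)  # skip bases already counted
            if end > start:
                covered += end - start
                covered_up_to = end
        coverage[name] = covered / lengths[name]
    return coverage


def select_candidates(coverage: Dict[str, float], threshold: float = 0.30) -> List[str]:
    """Keep sequences whose alignment coverage is strictly greater than the threshold."""
    return sorted(name for name, value in coverage.items() if value > threshold)
```

Overlapping alignment blocks are counted once, so several hits to the same region do not push a sequence over the threshold.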

Annotating the candidate minichromosome sequences further comprises: performing repeat annotation on the candidate minichromosome sequences and calculating the proportion of repetitive sequence; masking the candidate minichromosome sequences; and performing genome annotation on the masked candidate sequences and calculating the gene density.

The software for annotating candidate minichromosome sequences includes RepeatMasker, RepeatModeler, and GETA. In the preferred embodiment of the invention, RepeatMasker and RepeatModeler are used to perform repeat annotation on the candidate minichromosome sequences and calculate the proportion of repetitive sequence, and the candidate minichromosome sequences are then masked. GETA is used to perform genome annotation on the masked candidate sequences and to calculate the gene density. The minichromosome features described above include the repeat proportion and the gene density. Further, based on these features, chromosome sequences among the candidates that have a repeat proportion greater than 70% and a gene density (number of genes per Mb of sequence) of less than 10 are excluded.
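The exclusion rule can be written as a small decision function. This sketch reads the rule literally, excluding a candidate only when both conditions hold (repeat proportion above 70% and fewer than 10 genes per Mb); that reading, and the function and parameter names, are assumptions for illustration.

```python
def gene_density_per_mb(gene_count: int, sequence_length_bp: int) -> float:
    """Number of annotated genes per megabase of sequence."""
    if sequence_length_bp <= 0:
        raise ValueError("sequence length must be positive")
    return gene_count / (sequence_length_bp / 1_000_000)


def is_final_minichromosome(
    repeat_proportion: float,
    gene_density: float,
    max_repeat_proportion: float = 0.70,  # threshold stated in the text
    min_gene_density: float = 10.0,       # genes per Mb, threshold stated in the text
) -> bool:
    """Return False for candidates that are both repeat-rich and gene-poor."""
    excluded = (
        repeat_proportion > max_repeat_proportion
        and gene_density < min_gene_density
    )
    return not excluded
```

For example, a 5 Mb candidate with 50 annotated genes has a density of 10 genes per Mb and is retained whatever its repeat proportion.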

The method further includes the step of obtaining the first sequencing data, the second sequencing data, and the third sequencing data. Ways of obtaining these three types of sequencing data include, but are not limited to: constructing a sequencing library from sample nucleic acid and sequencing it on a sequencer to obtain sequencing data; or obtaining the gene sequencing data by computer simulation, network download, storage-medium transfer, manual import, and the like.

The method further comprises preprocessing the first sequencing data, the second sequencing data, and the third sequencing data. The preprocessing comprises: length filtering, redundancy removal, low-quality data removal, mitochondrial sequence removal, chloroplast sequence removal, sequencing data error correction, data quality control, and the like. Specifically, the preprocessing includes length filtering the second sequencing data. More specifically, the preprocessing further includes performing sequencing data error correction on the filtered second sequencing data to obtain error-corrected second sequencing data. The length filtering software includes Filtlong. The error correction of sequencing data may employ a variety of software, including NECAT and/or NextDenovo.

In a preferred embodiment of the present invention, Filtlong is used for the length filtering. Preferably, the filtering parameters are a minimum read length of 80 kb and a minimum mean Q value of 9.
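The effect of these two thresholds can be illustrated with a minimal stand-in filter. It is not the filtering tool's own implementation: the quality measure here is a plain arithmetic mean of Phred+33 scores, which is an assumption of this sketch and may differ from how the tool scores reads.

```python
from typing import Iterable, Iterator, Tuple

Read = Tuple[str, str, str]  # (name, sequence, quality string in Phred+33 encoding)


def mean_phred(quality: str) -> float:
    """Arithmetic mean of per-base Phred scores (assumed quality measure)."""
    return sum(ord(char) - 33 for char in quality) / len(quality)


def filter_reads(
    reads: Iterable[Read],
    min_length: int = 80_000,   # 80 kb, as in the preferred embodiment
    min_mean_q: float = 9.0,    # mean Q value of 9, as in the preferred embodiment
) -> Iterator[Read]:
    """Yield only reads that are long enough and of sufficient mean quality."""
    for name, sequence, quality in reads:
        if len(sequence) < min_length or not quality:
            continue
        if mean_phred(quality) >= min_mean_q:
            yield name, sequence, quality
```

Because the function is a generator, it can be applied to a stream of reads without holding the whole dataset in memory.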

In a preferred embodiment of the invention, NECAT is used for error correction of the sequencing data.

The minichromosome database contains minichromosome data reported in the prior art.

The above-mentioned minichromosome database can record known minichromosome sequences by means such as computer simulation, network download, storage-medium transfer, and manual import. In addition, minichromosome data identified by the minichromosome assembly identification device can be added to the database, so that the database is updated iteratively.

As noted in the background, the high conservation of minichromosomes in birds and similar animals makes it possible to construct a minichromosome database for identifying minichromosomes in genomes assembled from different samples and even different species. The scheme relies on third-generation sequencing data and adopts an advanced genome assembly strategy, which improves the completeness of chromosome-level genome assembly. By aligning against the constructed minichromosome database and introducing the repeat proportion and the gene density as decision features, the minichromosome sequences in a sample can be stably assembled and identified.

Accordingly, in one aspect of the present invention, there is provided a computer-readable storage medium including a stored program, wherein the program, when executed, controls the device in which the storage medium is located to execute the above-described minichromosome assembly identification method.

Accordingly, in one aspect of the present invention, a processor is provided for running a program, wherein the program, when run, performs the minichromosome assembly identification method described above.

Accordingly, in one aspect the present invention provides the use of the method and/or apparatus of the present invention in the field of genetic data analysis.

The method provided by the application can be executed in a terminal, a computer terminal or similar computing device. Taking the operation on a terminal as an example, fig. 8 is a hardware block diagram of a method for identifying a mini-chromosome assembly according to an embodiment of the present application. As shown in fig. 8, the terminal may include one or more (only one is shown in fig. 8) processors A1 (the processor A1 may include, but is not limited to, a microprocessor MCU, a programmable logic device FPGA, or the like) and a memory B1 for storing data, and optionally, a transmission device C1 for a communication function and an input-output device D1. It will be appreciated by those skilled in the art that the structure shown in fig. 8 is merely illustrative and is not intended to limit the structure of the terminal. For example, the terminal may also include more or fewer components than shown in fig. 8, or have a different configuration than shown in fig. 8.

The memory B1 may be used to store a computer program, for example, a program and a module of an application software, for example, a computer program corresponding to a method of filtering, correcting errors, preliminary assembling, mounting, hole filling, alignment, annotation, and the like in the embodiment of the present invention, and the processor A1 executes various functional applications and data processing by running the computer program stored in the memory B1, that is, implements the method described above. Memory B1 may include high-speed random access memory, but may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, memory B1 may further comprise memory located remotely from processor A1, which may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device C1 is used to receive or transmit data via a network. The specific example of the network described above may include a wireless network provided by a communication provider of the terminal. In one example, the transmission device C1 includes a network adapter (Network Interface Controller, simply referred to as NIC) that can connect to other network devices through a base station so as to communicate with the internet. In one example, the transmission device C1 may be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.

It will be apparent to those skilled in the art that some of the modules or steps of the application described above may be implemented in a general purpose computing device, they may be concentrated on a single computing device, or distributed across a network of computing devices, or they may alternatively be implemented in program code executable by a computing device, so that they may be stored in a memory device for execution by the computing device, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps of them may be fabricated into a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.

Example 1

The present invention will be described more specifically with reference to examples, but the present invention is not limited to these examples.

Whole-chromosome assembly (including minichromosomes) was performed on gene sequencing data from chicken (Huxu breed), and minichromosomes were identified (see fig. 7).

In this example, third-generation 52× HiFi sequencing data, 80× ONT ultra-long sequencing data, and Hi-C data (NCBI: PRJNA693184) downloaded from the National Center for Biotechnology Information (NCBI) were used directly as the raw sequencing and downloaded data and input to the minichromosome assembly identification device.

This example uses published minichromosome data of chicken, emu, Wenchang fish (amphioxus), and duck downloaded from the National Center for Biotechnology Information (NCBI), which are input into the database module 04 for storage to construct the minichromosome database.

Specifically:

1. The HiFi sequencing data, the ONT ultra-long sequencing data and the Hi-C data are imported into a data acquisition module 01 of the device.

2. The data acquisition module 01 further includes a data processing unit 011, in which a data filtering element 0111 filters the sequencing data input upstream. Specifically, the ONT ultra-long sequencing data input upstream was length filtered using Filtlong software (https://github.com/rrwick/Filtlong), with the filter parameters set to (--min_length 80000 --min_mean_q 9), i.e., a minimum length of 80 kb and a minimum mean Q value (quality) of 9.

3. In the data processing unit 011, the filtered sequencing data is error corrected using a data error correction element 0112 downstream of the data filtering element 0111. Specifically, NECAT software (https://github.com/xiaochuanle/NECAT) is used to correct the filtered ONT ultra-long sequencing data to obtain the corrected ONT ultra-long data.

4. Downstream of the data acquisition module 01 is an assembly module 02, in which a preliminary assembly unit 021 performs preliminary assembly of the input sequencing data. Specifically, the HiFi sequencing data input upstream and the error-corrected ONT ultra-long data are preliminarily assembled using Hifiasm software (https://github.com/chhylp123/hifiasm) to obtain a genome sketch. (Hifiasm may be replaced with Verkko, an innovative software tool developed and released by researchers at the National Institutes of Health for assembling truly complete (i.e., gapless) genome sequences from a variety of species.)

5. In the assembly module 02, downstream of the preliminary assembly unit 021 is an auxiliary assembly unit 022, which uses long-range sequencing data or other sequencing data to assist in assembling the genome preliminarily assembled upstream, so as to obtain chromosome-level long scaffolds and their orientation. Specifically, the genome sketch was mounted using Hi-C data. More specifically, Chromap software was used to align the Hi-C data to the genome sketch, the interaction signals between contigs were calculated, and the contigs were clustered, ordered, and phased; based on the relationship between interaction intensity and contig position, YaHS software was used to perform the mounting and obtain the mounted genome. (Mounting means using Hi-C interactions in three-dimensional space to guide the assembly of the linear genome.)

6. The mounted genome is error corrected using the adjustment element 0221 in the auxiliary assembly unit 022 to obtain an adjusted chromosome-level genome scaffold sequence. Specifically, the mounted genome is subjected to visual error correction using Juicebox software, and the adjusted chromosome-level genome sequence is obtained.

7. In the assembly module 02, downstream of the auxiliary assembly unit 022 is a sequence filling unit 023, which uses the HiFi data and the ONT ultra-long data output by module 01 to fill holes (i.e., fill gaps in the sequence) in the adjusted chromosome-level genome obtained upstream, so as to obtain the final chromosome-level genome. Specifically, the mounted and adjusted chromosome-level genome (scaffold) sequence was hole-filled with quarTeT software using the HiFi and ONT data to obtain the final chromosome-level genome sequence (including minichromosomes).

8. Downstream of the assembly module 02 is a comparison module 03, which compares the final chromosome-level genome sequence (including minichromosomes) obtained upstream with the minichromosome data stored in the database module 04 to screen candidate minichromosomes. Specifically, the final chromosome-level genome sequence (including minichromosomes) is aligned against the minichromosome database using Minimap2, and genome sequences with an aligned-region coverage of more than 30% are selected as candidate minichromosome sequences.

9. The comparison module 03 also includes an annotation unit 031, which annotates the candidate minichromosome sequences. Specifically, repeat annotation is performed on the candidate minichromosome sequences using RepeatMasker software and RepeatModeler software, the proportion of repetitive sequence is calculated, and the candidate minichromosome sequences are masked. Thereafter, genome annotation is performed on the masked candidate minichromosome sequences using GETA, and the gene density is calculated.

10. Downstream of the comparison module 03 is a determination module 05, which determines and outputs the finally identified minichromosomes based on the input minichromosome features. Specifically, according to the repeat proportion and gene density statistics, chromosomes among the candidate minichromosome sequences with a repeat annotation proportion greater than 70% and a gene density (number of genes per Mb of sequence) of less than 10 are discarded, and the remaining minichromosomes are the finally identified minichromosomes.
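The ten steps above form a linear dataflow, which can be summarized as a stage table with a simple ordering check. This is a sketch for orientation only: the data labels are hypothetical names, the visual correction tool is left generic, and the use of the corrected ONT data for hole filling is an assumption about what module 01 outputs.

```python
from typing import List, Set, Tuple

Stage = Tuple[str, str, List[str], str]  # (step, tool, inputs, output)

RAW_INPUTS: Set[str] = {"HiFi", "ONT ultra-long", "Hi-C", "minichromosome database"}

STAGES: List[Stage] = [
    ("length filtering", "Filtlong", ["ONT ultra-long"], "filtered ONT"),
    ("error correction", "NECAT", ["filtered ONT"], "corrected ONT"),
    ("preliminary assembly", "Hifiasm", ["HiFi", "corrected ONT"], "genome sketch"),
    ("alignment and mounting", "Chromap + YaHS", ["genome sketch", "Hi-C"], "mounted genome"),
    ("visual error correction", "visual Hi-C correction tool", ["mounted genome"], "adjusted scaffolds"),
    # Assumption: the ONT data used here is the corrected output of module 01.
    ("hole filling", "quarTeT", ["adjusted scaffolds", "HiFi", "corrected ONT"], "final genome"),
    ("database alignment", "Minimap2", ["final genome", "minichromosome database"], "candidates"),
    ("repeat annotation and masking", "RepeatMasker + RepeatModeler", ["candidates"], "masked candidates"),
    ("genome annotation", "GETA", ["masked candidates"], "annotated candidates"),
    ("determination", "repeat proportion and gene density rule", ["annotated candidates"], "final minichromosomes"),
]


def validate_order(stages: List[Stage], raw_inputs: Set[str]) -> List[str]:
    """Check that every stage only consumes data produced earlier; return the step names."""
    available = set(raw_inputs)
    for step, _tool, inputs, output in stages:
        missing = [name for name in inputs if name not in available]
        if missing:
            raise ValueError(f"step '{step}' is missing inputs: {missing}")
        available.add(output)
    return [step for step, _tool, _inputs, _output in stages]
```

Calling `validate_order(STAGES, RAW_INPUTS)` returns the step names in execution order, and raises `ValueError` if a stage is moved ahead of the stage that produces its input.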

Example 2

Collinearity analysis was performed between the minichromosomes identified in Example 1 and the published chicken genome data (NCBI: PRJNA693184), used as a positive sample; the results are shown in FIGS. 8-9. It can be seen that the minichromosomes assembled and identified by this scheme are collinear with the genome reported in the publication.

The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention.

Claims

1. A minichromosome assembly identification device, comprising: a data acquisition module configured to acquire gene sequencing data based on the sample; an assembling module configured to assemble the gene sequencing data acquired by the data acquisition module into a sequence at a chromosome level; the comparison module is used for comparing the sequence of the chromosome level assembled by the assembly module with the data of the sequences stored in the database module, and screening to obtain candidate micro chromosome sequences; a database module storing known minichromosome sequence data; a determination module configured to determine a final minichromosome in the input candidate minichromosome sequence based on the minichromosome feature; wherein the candidate micro chromosome sequence is a genome sequence with the coverage of the sequence data comparison area in the database being more than 30%; the minichromosome is characterized by a repeat ratio and a gene density of candidate minichromosome sequences.

2. The device of claim 1, wherein the minichromosome feature excludes chromosomes having a repeat sequence ratio of greater than 70% and a gene density of less than 10 in candidate minichromosome sequences.

3. The apparatus of claim 1, wherein the genetic sequencing data comprises: any one or a combination of a plurality of HiFi sequencing data, ONT ultra-long sequencing data, Hi-C sequencing data, ONT duplex sequencing data, Pore-C sequencing data, Strand-seq sequencing data, Linked-reads sequencing data, nanopore sequencing data, and CLR sequencing data.

4. The apparatus of claim 3, wherein the genetic sequencing data comprises: long-read sequencing data and long-range sequencing data; the long-read sequencing data comprises: any one or a combination of a plurality of HiFi sequencing data, ONT ultra-long sequencing data, Hi-C sequencing data, ONT duplex sequencing data, Linked-reads sequencing data, nanopore sequencing data and CLR sequencing data; the long-range sequencing data comprises: Hi-C sequencing data, Pore-C sequencing data, Strand-seq sequencing data.

5. The apparatus of claim 1, wherein the data acquisition module further comprises: a data processing unit for cleaning the acquired gene sequencing data before assembly.

6. The apparatus of claim 5, wherein the data processing unit comprises: a data filtering element and/or a data error correction element; the data filtering element is configured to filter the gene sequencing data acquired by the data acquisition module according to certain characteristics; the data error correction element is configured to error correct the input sequencing data and output the error corrected data.

7. The apparatus of claim 1, wherein the assembly module comprises: a preliminary assembly unit and an auxiliary assembly unit; the preliminary assembly unit is used for performing preliminary genome assembly on the sequencing data output by the data acquisition module to obtain a genome sketch; the auxiliary assembly unit is positioned downstream of the preliminary assembly unit and is configured to perform auxiliary assembly on the genome sketch preliminarily assembled upstream by using the sequencing data output by the data acquisition module, so as to obtain chromosome-level long scaffold sequences and their orientation.

8. The apparatus of claim 7, wherein the assembly module further comprises: a sequence filling unit; the sequence filling unit is positioned downstream of the auxiliary assembly unit; the sequence filling unit is configured to fill holes in the adjusted chromosome-level genome obtained upstream by using the sequencing data output by the data acquisition module, so as to obtain a final chromosome-level genome sequence.

9. The device according to claim 7 or 8, wherein the auxiliary assembly unit further comprises: an adjustment element; the adjustment element is configured to correct the mounted chromosome-level genome to obtain an adjusted chromosome-level genome scaffold sequence.

10. The apparatus of claim 1, wherein the comparison module further comprises: an annotation unit; the annotation unit is configured to annotate the candidate minichromosome sequence.