Brief Communication
Published: 09 December 2019

Fast and accurate long-read assembly with wtdbg2

Nature Methods volume 17, pages 155–158 (2020)

Abstract

Existing long-read assemblers require thousands of central processing unit hours to assemble a human genome and are being outpaced by sequencing technologies in terms of both throughput and cost. We developed a long-read assembler, wtdbg2 (https://github.com/ruanjue/wtdbg2), that is 2–17 times as fast as published tools while achieving comparable contiguity and accuracy. It paves the way for population-scale long-read assembly in the future.

**Fig. 1: Outline of the wtdbg2 algorithm. Wtdbg2 groups 256 bp into a bin, a small box in the figure.**
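
The binning mentioned in the caption can be illustrated with a minimal sketch. It only shows what grouping 256 bp into a bin means for a single read; it is not the wtdbg2 implementation. The function names and the decision to drop a trailing partial bin are assumptions made for illustration.

```python
from typing import List

BIN_SIZE = 256  # bin width in bp, as stated in the Fig. 1 caption


def split_into_bins(read: str, bin_size: int = BIN_SIZE) -> List[str]:
    """Cut a read into consecutive, non-overlapping bins of bin_size bp.

    Assumption for this sketch: a trailing partial bin is dropped.
    """
    n_full = len(read) // bin_size
    return [read[i * bin_size:(i + 1) * bin_size] for i in range(n_full)]


def bin_index(position: int, bin_size: int = BIN_SIZE) -> int:
    """Map a 0-based base position on a read to the index of its bin."""
    return position // bin_size


if __name__ == "__main__":
    read = "ACGT" * 250  # a toy 1,000 bp read
    bins = split_into_bins(read)
    print(len(bins), bin_index(255), bin_index(256))
```

Working in bins rather than single bases shortens every read by a factor of 256 in bin units, which is the kind of coarsening the figure depicts with small boxes.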

Data availability

C. elegans and A. thaliana Ler-0 reads are available at the PacBio public datasets portal: http://bit.ly/pbpubdat. We downloaded SRR5439404 for the D. melanogaster A4 strain, SRR6702603 for the D. melanogaster reference ISO1 strain, ERR2571284 through ERR2571302 for M. schizocarpa (banana; MinION reads only), PRJNA378970 for axolotl, SRR7615963 for HG00733, and ERR2631600 and ERR2631601 for NA19240. CHM1 reads were acquired from SRP044331 (http://bit.ly/chm1p6c4 for raw signals), NA12878 reads from http://bit.ly/na12878ont (release 5) and NA24385 reads from http://bit.ly/NA24385ccs. For the A. thaliana Col-0/Cvi-0 dataset, the FASTQ files at SRA (accession PRJNA314706) were not processed properly. J. Chin, the first author of the paper¹ describing the dataset, provided us with reprocessed raw reads, which are now hosted at a public file transfer protocol (FTP) site: ftp://ftp.dfci.harvard.edu/pub/hli/col0-cvi0/. The CHM1 CANU and FALCON assemblies and the axolotl assembly are available at NCBI (GCA_000983455.1, GCA_001297185.1 and GCA_002915635.1, respectively). All the evaluated assemblies that we generated can be obtained at ftp://ftp.dfci.harvard.edu/pub/hli/wtdbg/. The FTP site also provides the detailed command lines and the FALCON configuration files.

Code availability

The wtdbg2 source code is hosted by GitHub at: https://github.com/ruanjue/wtdbg2.

References

1. Chin, C. S. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods 13, 1050–1054 (2016).
2. Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).
3. Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722–736 (2017).
4. Li, H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32, 2103–2110 (2016).
5. Xiao, C. L. et al. MECAT: fast mapping, error correction, and de novo assembly for single-molecule sequencing reads. Nat. Methods 14, 1072–1074 (2017).
6. De Coster, W. et al. Structural variants identified by Oxford Nanopore PromethION sequencing of the human genome. Genome Res. 29, 1178–1187 (2019).
7. Myers, G. Efficient local alignment discovery amongst noisy long reads. in WABI vol. 8701 (eds. D. G. Brown & B. Morgenstern) 52–67, https://doi.org/10.1007/978-3-662-44753-6_5 (Springer, 2014).
8. Berlin, K. et al. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat. Biotechnol. 33, 623–630 (2015).
9. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
10. Chaisson, M. J., Wilson, R. K. & Eichler, E. E. Genetic variation and the de novo assembly of human genomes. Nat. Rev. Genet. 16, 627–640 (2015).
11. Smith, T. F. & Waterman, M. S. Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197 (1981).
12. Ye, C., Ma, Z. S., Cannon, C. H., Pop, M. & Yu, D. W. Exploiting sparseness in de novo genome assembly. BMC Bioinforma. 13(Suppl 6), S1 (2012).
13. Zerbino, D. R. & Birney, E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829 (2008).
14. Lee, C., Grasso, C. & Sharlow, M. F. Multiple sequence alignment using partial order graphs. Bioinformatics 18, 452–464 (2002).
15. Belser, C. et al. Chromosome-scale assemblies of plant genomes using nanopore long reads and optical maps. Nat. Plants 4, 879–887 (2018).
16. Chin, C. S. et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods 10, 563–569 (2013).
17. Watson, M. & Warr, A. Errors in long-read assemblies can critically affect protein prediction. Nat. Biotechnol. 37, 124–126 (2019).
18. Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36, 338–345 (2018).
19. Nowoshilow, S. et al. The axolotl genome and the evolution of key tissue formation regulators. Nature 554, 50–55 (2018).

Acknowledgements

We are grateful to J. Chin for providing the properly processed raw reads for the A. thaliana Col-0/Cvi-0 dataset. We thank C. Ye from the University of Maryland for frequent and fruitful discussions during the development of wtdbg, and A. Li and S. Wu from CAAS for their help in polishing assemblies. We also thank the reviewers, whose comments helped us to improve wtdbg2. This study was supported by the Natural Science Foundation of China (grant nos. 31571353 and 31822029 to J.R.) and by the US National Institutes of Health (grant no. R01-HG010040 to H.L.).

Author information

Authors and Affiliations

Agricultural Genomics Institute, Chinese Academy of Agricultural Sciences, Shenzhen, China
Jue Ruan
Peng Cheng Laboratory, Shenzhen, China
Jue Ruan
Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
Heng Li
Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
Heng Li
Broad Institute, Cambridge, MA, USA
Heng Li

Authors

Jue Ruan
Heng Li

Contributions

J.R. conceived the project, designed the algorithm and implemented wtdbg2. H.L. contributed to the development and drafted the manuscript. Both authors evaluated the results and revised the manuscript.

Corresponding authors

Correspondence to Jue Ruan or Heng Li.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nicole Rusk and Lin Tang were the primary editors on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Table 1

Evaluation of long-read assemblies: FALCON requires PacBio-style read names and does not work with ONT data or with the A4 strain of D. melanogaster that was downloaded from SRA. The A. thaliana assembly by FALCON was acquired from the PacBio website because our own assembly was fragmented. MECAT produces fragmented assemblies for the ONT dataset. Human assemblies were performed by the developers of each assembler. Base-level evaluations and NGA50 are reported only when the sequenced strain or individual is close to the reference genome. BUSCO scores are computed for genomes sequenced to 50-fold coverage or higher.
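
For readers unfamiliar with the NGA50 metric named above, the following minimal sketch shows the usual definition: the length of the aligned block at which blocks of that length or longer first cover at least half of the reference genome. It is not the evaluation pipeline used for Supplementary Table 1. The function name and the input lengths are hypothetical, and producing the aligned blocks by mapping contigs to the reference is assumed to have happened already.

```python
from typing import Iterable


def nga50(aligned_block_lengths: Iterable[int], reference_size: int) -> int:
    """Return the block length at which the running total of blocks,
    taken from longest to shortest, first reaches half of reference_size.

    Returns 0 if the blocks never cover half of the reference.
    """
    covered = 0
    for length in sorted(aligned_block_lengths, reverse=True):
        covered += length
        if 2 * covered >= reference_size:
            return length
    return 0


if __name__ == "__main__":
    # Hypothetical aligned block lengths (bp) against a 100 bp toy reference.
    print(nga50([40, 30, 20, 10], reference_size=100))
```

Because the threshold depends on the reference size and on blocks that align to the reference, the metric is only meaningful when the sequenced sample is close to that reference, which matches the restriction stated in the table note.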

Reporting Summary

Supplementary Data

The FALCON configuration file for assembling C. elegans.

About this article

Cite this article

Ruan, J., Li, H. Fast and accurate long-read assembly with wtdbg2. Nat Methods 17, 155–158 (2020). https://doi.org/10.1038/s41592-019-0669-3

Received: 25 January 2019
Accepted: 05 November 2019
Published: 09 December 2019
Issue Date: February 2020
DOI: https://doi.org/10.1038/s41592-019-0669-3
