Canu v1.8
These are release notes for Canu version 1.8, which was released on October 22nd, 2018. Canu is specialized for assembly of single-molecule high-noise sequences. Full documentation can be found at http://canu.readthedocs.org/.
This release provides a stable, tested, and documented version of the software. The binary distributions should work on any relatively recent version of the respective OS and are the recommended way to install Canu. The source code distribution contains everything you need to create a binary distribution for your own specific OS.
Citation
- Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Research. (2017).
- Koren S, Rhie A, Walenz BP, Dilthey AT, Bickhart DM, Kingan SB, Hiendleder S, Williams JL, Smith TPL, Phillippy AM. De novo assembly of haplotype-resolved genomes with trio binning. Nature Biotechnology. (2018).
Minimum Requirements
- 8GB minimum memory; 16GB strongly suggested
- Perl 5.12.0, or File::Path 2.08
- Java SE 8
- GCC 4.5 (for compilation only); GCC 6 recommended
- macOS 10.10 Yosemite (for macOS/Darwin binaries only)
- gnuplot 5.2 (optional, for generating diagnostic graphs)
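The requirements above can be checked before installing. The following is a minimal sketch, not part of Canu: the helper names `version_ge` and `check_tool` are made up for illustration, and it assumes a `sort` that supports `-V` (GNU coreutils and recent BSD/macOS). It only confirms that the tools are on the PATH and compares the Perl version against 5.12.0; the File::Path alternative and the other version minimums are not checked.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when version $1 is at least version $2.
# Assumes sort(1) supports -V (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: report whether a tool is on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
    fi
}

# gcc is needed only for compilation; gnuplot is optional.
for tool in perl java gcc gnuplot; do
    check_tool "$tool"
done

# Compare the installed Perl version against the documented minimum.
if command -v perl >/dev/null 2>&1; then
    perl_version=$(perl -e 'printf "%vd", $^V')
    if version_ge "$perl_version" 5.12.0; then
        echo "perl $perl_version: ok"
    else
        echo "perl $perl_version: older than 5.12.0"
    fi
fi
```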
Installation
Users can download Canu as source code or as pre-compiled binaries. The source code package needs to be compiled and installed before it can be used. The binary distributions need only be unpacked, but they are not available for all platforms.
To install from source code (the file can be named either canu-v1.8.tar.gz or just v1.8.tar.gz, depending on how it is downloaded):
gunzip -dc canu-v1.8.tar.gz | tar -xf -
cd canu-1.8/src
make -j 8
cd ..
To install from a binary distribution (recommended installation method):
tar -xJf canu-1.8.*.tar.xz
In both cases, canu is installed in the directory canu-1.8/<OS>-<architecture>, for example, canu-1.8/Linux-amd64. You can run the assembler with:
canu-1.8/*/bin/canu
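Because the platform directory name varies, a script may want to resolve the wildcard once and reuse the result. A minimal sketch, assuming the directory layout described above; the function name `find_canu` is hypothetical and not part of the distribution:

```shell
#!/bin/sh
# Hypothetical helper: print the first executable canu found under
# the install root given as $1, or return non-zero if there is none.
find_canu() {
    root=$1
    for candidate in "$root"/*/bin/canu; do
        if [ -x "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

if canu_bin=$(find_canu canu-1.8); then
    echo "canu found at $canu_bin"
else
    echo "canu not found under canu-1.8" >&2
fi
```

If more than one platform directory is present, this picks the first match in glob order; it does not check that the binary matches the current machine.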
Changes
This release adds support for trio-binning (Nature Biotechnology), a reimplementation of the meryl kmer counter and processor, and improved support for object storage.
Note, however, that while object storage is supported, there are no methods to run tasks on, e.g., Amazon Web Services or Azure.
Canu v1.8 IS NOT compatible with assemblies started with any previous version.
- The Canu executive now fully supports trio-binning. Specifying parental haplotypes with the -haplotype* options enables trio binning. After the reads are binned into haplotypes, each haplotype assembly is automagically launched.
- The 'meryl' kmer counter was reimplemented for improved performance when counting kmers in reads, and better utilization of grid architectures. The method for deciding which kmers to ignore when computing overlaps changed, resulting, generally, in more kmers being ignored and thus lower run times for computing overlaps.
- The overlap store was largely reimplemented to reduce file counts and sizes during construction, and to allow the data-parallel store construction method to run without a grid. It works with object stores now, too. The sequential construction method runs as its own job, not part of the Canu executive, letting it use more resources than before.
- Decrease the default maximum error rate allowed when finding overlaps in corrected Nanopore reads from 14.4% to 12.0%. Combined with the change to how over-occurring kmers are ignored (see the meryl item above), run times for finding overlaps in Nanopore reads should decrease by 5- to 10-fold.
- Options 'executiveMemory' and 'executiveThreads' can be used to increase the size of the executive task. If this job is large enough, tasks that would previously run as individual grid jobs will be run from within the executive task, avoiding a submit/execute/submit cycle on heavily loaded grids.
- Options 'readSamplingCoverage' and 'readSamplingBias' can be used to down sample read coverage before starting correction or assembly.
- Option 'stopOnReadQuality', which seemed to just annoy people, was disabled. Option 'stopOnLowCoverage' was added to stop an assembly if read coverage is too low; the default threshold is 10.
- Option 'gnuplotTested' was removed. Failure to find or run gnuplot is now handled automagically. Issues #1084 and #1129.
- Better file staging in seqStore and ovlStore when object storage is used.
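As an illustration of how the new options combine, the sketch below assembles a command line using 'readSamplingCoverage' and 'stopOnLowCoverage' and prints it rather than running it. The option names come from the list above; the `name=value` form and the `-p`, `-d`, `genomeSize`, and `-nanopore-raw` arguments follow the usual canu invocation. The helper name `build_canu_command` and every value shown (prefix, directory, genome size, coverage numbers, read file) are placeholders chosen for the example, not recommendations.

```shell
#!/bin/sh
# Hypothetical helper: print a canu command line without running it.
# Arguments: prefix, output directory, genome size,
#            sampling coverage, low-coverage threshold, read file.
build_canu_command() {
    echo "canu -p $1 -d $2 genomeSize=$3" \
         "readSamplingCoverage=$4" \
         "stopOnLowCoverage=$5" \
         "-nanopore-raw $6"
}

# All values below are illustrative placeholders.
build_canu_command asm asm-out 4.8m 60 10 reads.fastq.gz
```

Printing the command first makes it easy to review the options before submitting a long-running job.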
Bug Fixes
- Various tweaks to job sizes. overlapInCore overlap jobs are generally larger now.
- Fix truncation of consensus sequence in large contigs due to mis-aligned reads leaving consensus bases with no read coverage.
- Fix correction failures caused by non-ACGT bases in input reads.
Known Issues
See the issues page for up-to-date open issues, or to report a problem.
- Large memory usage and runtime for long reads (e.g., Nanopore) when using the overlapper=ovl algorithm, and during Overlap Error Adjustment. The -fast option enables a significantly faster algorithm, but may produce slightly less contiguous assemblies on genomes larger than 1 Gbp in size. It is recommended for Nanopore genomes smaller than 1 Gbp.
- Bubbles are not captured in the contig graph, but are included in the unitig graph. No attempt at marking bubbles is made.
See the FAQ for many suggestions, including suggestions for specific data types, e.g., Nanopore r9 reads.
Legal
Canu is derived from Celera Assembler and includes code from many other projects. Most, but not all, of the code is GPL licensed. See the README.licenses file and individual source code files for details.