An accurate and ultra-fast genome assembler
- SYNOPSIS
- Description
- Short-read assembly
- Long-read presets
- Wengan demo
- Wengan benchmark
- Wengan components
- Getting the latest source code
- Limitations
- About the name
- Citation
# Assembling Oxford Nanopore and Illumina reads with WenganM
wengan.pl -x ontraw -a M -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l ont.fastq.gz -p asm1 -t 20 -g 3000
# Assembling PacBio reads and Illumina reads with WenganA
wengan.pl -x pacraw -a A -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm2 -t 20 -g 3000
# Assembling ultra-long Nanopore reads and BGI reads with WenganM
wengan.pl -x ontlon -a M -s lib2.fwd.fastq.gz,lib2.rev.fastq.gz -l ont.fastq.gz -p asm3 -t 20 -g 3000
# Hybrid long-read only assembly of PacBio Circular Consensus Sequence and Nanopore data with WenganM
wengan.pl -x ccsont -a M -l ont.fastq.gz -b ccs.fastq.gz -p asm4 -t 20 -g 3000
# Assembling ultra-long Nanopore reads and Illumina reads with WenganD (needs a high-memory machine with ~600GB of RAM)
wengan.pl -x ontlon -a D -s lib2.fwd.fastq.gz,lib2.rev.fastq.gz -l ont.fastq.gz -p asm5 -t 20 -g 3000
# Assembling pacraw reads with pre-assembled short-read contigs from Minia3
wengan.pl -x pacraw -a M -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm6 -t 20 -g 3000 -c contigs.minia.fa
# Assembling pacraw reads with pre-assembled short-read contigs from Abyss
wengan.pl -x pacraw -a A -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm7 -t 20 -g 3000 -c contigs.abyss.fa
# Assembling pacraw reads with pre-assembled short-read contigs from DiscovarDenovo
wengan.pl -x pacraw -a D -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm8 -t 20 -g 3000 -c contigs.disco.fa
Wengan is a new genome assembler that, unlike most current long-read assemblers, entirely avoids the all-vs-all read comparison. The key idea behind Wengan is that long-read alignments can be inferred by building paths on a sequence graph. To achieve this, Wengan builds a new sequence graph called the Synthetic Scaffolding Graph (SSG). The SSG is built from a spectrum of synthetic mate-pair libraries extracted from raw long reads. Longer alignments are then built by performing a transitive reduction of the edges. Another distinctive feature of Wengan is that it performs self-validation by following the read information: it identifies misassemblies at different steps of the assembly process. For more information about the algorithmic ideas behind Wengan, please read the preprint available on bioRxiv.
Wengan uses a de Bruijn graph assembler to build the assembly backbone from short-read data. Currently, Wengan can use Minia3, Abyss2 or DiscovarDenovo. The recommended short-read coverage is 50-60X of 2 x 150bp or 2 x 250bp reads.
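To check whether a short-read library falls within the recommended 50-60X range, coverage can be estimated as the total number of sequenced bases divided by the genome size. The sketch below assumes gzipped FASTQ files with four lines per record; `estimate_coverage` is a hypothetical helper and not part of Wengan.

```shell
# estimate_coverage GENOME_SIZE_BP reads1.fastq.gz [reads2.fastq.gz ...]
# Hypothetical helper (not part of Wengan): prints total bases / genome size.
# Assumes 4-line FASTQ records, where line 2 of each record is the sequence.
estimate_coverage() (
    genome_size=$1
    shift
    gzip -dc "$@" | awk -v g="$genome_size" '
        NR % 4 == 2 { bases += length($0) }
        END { printf "%.1f\n", bases / g }'
)
```

For example, `estimate_coverage 3000000000 lib1.fwd.fastq.gz lib1.rev.fastq.gz` would print the estimated depth of a paired-end library for a 3Gb genome.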
WenganM uses the Minia3 short-read assembler. This is the fastest mode of Wengan and can assemble a complete human genome in less than 210 CPU hours (~50GB of RAM).
WenganA uses the Abyss2 short-read assembler. This is the lowest-memory mode of Wengan and can assemble a complete human genome with less than 40GB of RAM (~900 CPU hours). This assembly mode takes ~2 days when using 20 CPUs on a single machine.
WenganD uses the DiscovarDenovo short-read assembler. This is the most memory-hungry mode of Wengan: assembling a complete human genome needs about 600GB of RAM (~900 CPU hours). This assembly mode takes ~2 days when using 20 CPUs on a single machine.
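The CPU-hour figures above can be converted to an approximate wall-clock time by dividing by the number of threads. This is a minimal sketch that assumes ideal parallel scaling, so real runs will take somewhat longer; `wall_clock_hours` is a hypothetical helper.

```shell
# wall_clock_hours CPU_HOURS THREADS
# Hypothetical helper: lower bound on wall-clock hours, assuming perfect scaling.
wall_clock_hours() {
    awk -v cpu="$1" -v t="$2" 'BEGIN { printf "%.1f\n", cpu / t }'
}

wall_clock_hours 900 20   # Abyss2 and DiscovarDenovo modes: prints 45.0
wall_clock_hours 210 20   # Minia3 mode: prints 10.5
```

The 45 hours obtained for 900 CPU hours on 20 CPUs is consistent with the ~2 days quoted for those modes.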
The presets define several variables of the Wengan pipeline execution and depend on the long-read technology used to sequence the genome. The recommended long-read coverage is 30X.
ontlon: preset for raw ultra-long reads from Oxford Nanopore, typically with an N50 > 50kb.
ontraw: preset for raw Nanopore reads, typically with an N50 of ~15kb-40kb.
pacraw: preset for raw long reads from Pacific Biosciences (PacBio), typically with an N50 of ~8kb-60kb.
preset for Circular Consensus Sequences from Pacific Biosciences (PacBio), typically with an N50 of ~15kb. This type of data is not fully supported yet.
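Because the presets are described in terms of read-length N50, it can help to measure it before choosing a value for `-x`. The N50 is the length L such that reads of length L or longer contain at least half of all sequenced bases. The sketch below assumes a gzipped FASTQ file with four lines per record; `read_n50` is a hypothetical helper and not part of Wengan.

```shell
# read_n50 reads.fastq.gz
# Hypothetical helper: prints the read-length N50 (in bp).
# Assumes 4-line FASTQ records, where line 2 of each record is the sequence.
read_n50() (
    gzip -dc "$1" |
        awk 'NR % 4 == 2 { print length($0) }' |
        sort -rn |
        awk '{ len[NR] = $1; total += $1 }
             END {
                 # Walk reads from longest to shortest until half the bases are covered.
                 for (i = 1; i <= NR; i++) {
                     sum += len[i]
                     if (sum >= total / 2) { print len[i]; exit }
                 }
             }'
)
```

For example, `read_n50 ont.fastq.gz` prints a single number that can be compared with the N50 ranges listed above.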
The repository wengan_demo contains a small dataset and instructions to test Wengan v0.2.
#fetch the demo dataset
git clone https://github.com/adigenova/wengan_demo.git
Genome | Long reads | Short reads | Wengan Mode | NG50 (Mb) | CPU (h) | RAM (GB) | Fasta file |
---|---|---|---|---|---|---|---|
NA12878 | ONT 35X (rel5) | 2x150bp 50X (GIAB:rs1, rs2) | WenganA | 25.99 | 725 | 45 | asm |
NA12878 | ONT 35X (rel5) | 2x150bp 50X (GIAB:rs1, rs2) | WenganM | 17.23 | 203 | 53 | asm |
NA12878 | ONT 35X (rel5) | 2x250bp 60X (ENA:rs1, rs2) | WenganD | 35.31 | 589 | 622 | asm |
HG00073 | PAC 90X (ENA:rl1) | 2x250bp 63X (ENA:rs1, rs2) | WenganD | 32.35 | 936 | 644 | asm |
NA24385 | ONT 60X (GIAB:rl1) | 2x250bp 70X (GIAB:rs1) | WenganD | 50.59 | 963 | 651 | asm |
CHM13 | ONT 50X (T2T:rel3) | 2x250bp 66X (ENA:rs1, rs2) | WenganD | 69.72 | 1198 | 646 | asm |
The assemblies generated using Wengan (v0.2) can be downloaded from Zenodo. All the assemblies were run as described in the Wengan manuscript. NG50 was computed using a genome size of 3.08Gb.
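NG50 differs from N50 in that the halfway point is taken from a fixed genome size rather than from the total assembly length. It is the length of the contig at which the cumulative length, counted from the longest contig down, first reaches half the genome size. The sketch below shows one way to compute it from a FASTA file; `ng50` is a hypothetical helper, not the script used for the table above.

```shell
# ng50 assembly.fasta GENOME_SIZE_BP
# Hypothetical helper: prints the NG50 in bp, or NA if the assembly
# covers less than half of the given genome size.
ng50() (
    # Step 1: one length per contig (sequences may span several lines).
    awk '/^>/ { if (n) print n; n = 0; next }
         { n += length($0) }
         END { if (n) print n }' "$1" |
        sort -rn |
        # Step 2: accumulate from the longest contig until half the genome size.
        awk -v g="$2" '{ sum += $1
                         if (!found && sum >= g / 2) { print $1; found = 1 } }
                       END { if (!found) print "NA" }'
)
```

With the genome size used above, the call would be `ng50 asm.fasta 3080000000`; dividing the result by 1,000,000 gives the value in Mb.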
- A de Bruijn graph assembler (Minia, Abyss or DiscovarDenovo)
- FastMIN-SG
- IntervalMiss
- Liger
We recommend downloading the latest binary release (Linux) from: https://github.com/adigenova/wengan/releases
To facilitate the execution of Wengan, we provide Docker/Singularity containers. Wengan images are hosted on Docker Hub and can be downloaded with the command:
docker pull adigenova/wengan:v0.2
Alternatively, using singularity:
export TMPDIR=/tmp
singularity pull docker://adigenova/wengan:v0.2
#using singularity
CONTAINER=/path_to_container/wengan_v0.2.sif
#location of wengan in the container
WENGAN=/wengan/wengan-v0.2-bin-Linux/wengan.pl
#run WenganM with singularity exec
singularity exec $CONTAINER perl ${WENGAN} \
-x pacraw \
-a M \
-s short.R1.fastq.gz,short.R2.fastq.gz \
-l pacbio.clr.fastq.gz \
-p asm_wengan -t 20 -g 3000
To compile Wengan from source, first fetch the repository together with its components:
#fetch Wengan and its components
git clone --recursive https://github.com/adigenova/wengan.git wengan
Each Wengan component has its own build instructions. After compilation, copy the resulting binaries to wengan-dir/bin.
Requirements: a C++ compiler (compilation was tested with gcc version GCC/7.3.0-2.30 on Linux and clang-1000.11.45.5 on Mac OSX) and cmake 3.2+.
- abyss commit d4b4b5d
- discovarexp-51885 commit f827bab
- minia commit 017d23e
- fastmin-sg commit 861b061
- intervalmiss commit 11be8b42
- liger commit 63a044b0
- seqtk commit 2efd0c8
1. Genomes larger than 4Gb are not supported yet.
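A small guard in a wrapper script can catch this limit before a long run is launched. The sketch below assumes that the value passed to `-g` is the genome size in Mb (as the `-g 3000` used for human genomes in the synopsis suggests) and that 4Gb corresponds to 4000 Mb; `check_genome_size` is a hypothetical helper and not part of Wengan.

```shell
# check_genome_size SIZE_MB
# Hypothetical helper: fails if the genome size exceeds the 4Gb limit.
# Assumption: SIZE_MB is the same value that would be passed to -g.
check_genome_size() {
    if [ "$1" -gt 4000 ]; then
        echo "genome size of $1 Mb exceeds the 4Gb limit" >&2
        return 1
    fi
}
```

It can be chained in front of the assembly command, for example `check_genome_size 3000 && wengan.pl ...`.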
Wengan is a Mapudungun word. Mapudungun is the language of the Mapuche people, the largest indigenous group in south-central Chile. Wengan means "Making the path".
Di Genova, A., Buena-Atienza, E., Ossowski, S. and Sagot, M.-F. Efficient hybrid de novo assembly of human genomes with WENGAN. Nature Biotechnology (2020), link