Article
Published: 09 March 2015

HISAT: a fast spliced aligner with low memory requirements

Nature Methods volume 12, pages 357–360 (2015)Cite this article

62k Accesses
11k Citations
111 Altmetric
Metrics details

Subjects

Abstract

HISAT (hierarchical indexing for spliced alignment of transcripts) is a highly efficient system for aligning reads from RNA sequencing experiments. HISAT uses an indexing scheme based on the Burrows-Wheeler transform and the Ferragina-Manzini (FM) index, employing two types of indexes for alignment: a whole-genome FM index to anchor each alignment and numerous local FM indexes for very rapid extensions of these alignments. HISAT's hierarchical index for the human genome contains 48,000 local FM indexes, each representing a genomic region of ∼64,000 bp. Tests on real and simulated data sets showed that HISAT is the fastest system currently available, with equal or better accuracy than any other method. Despite its large number of indexes, HISAT requires only 4.3 gigabytes of memory. HISAT supports genomes of any size, including those larger than 4 billion bases.

Access through your institution

Buy or subscribe

This is a preview of subscription content, access via your institution

Access options

Access through your institution

Buy this article

Purchase on SpringerLink
Instant access to full article PDF

Buy now

Prices may be subject to local taxes which are calculated during checkout

**Figure 1: RNA-seq read types and their relative proportions from 20 million simulated 100-bp reads.**

**Figure 2: Alignment speed of spliced alignment software for 20 million simulated 100-bp reads.**

**Figure 3: Alignment accuracy of spliced alignment software for 20 million simulated 100-bp reads.**

**Figure 4: Alignment accuracy of spliced-alignment software for reads with small anchors from 20 million simulated reads.**

Indexing and searching petabase-scale nucleotide resources

Article 16 May 2024

Large scale sequence alignment via efficient inference in generative models

Article Open access 04 May 2023

Accelerating minimap2 for long-read sequencing applications on modern CPUs

Article 28 February 2022

Accession codes

Accessions

Gene Expression Omnibus

GSM981249

References

Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods 5, 621–628 (2008).
Article CAS Google Scholar
Trapnell, C. et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 7, 562–578 (2012).
Article CAS Google Scholar
Affymetrix/Cold Spring Harbor Laboratory ENCODE Transcriptome Project. Post-transcriptional processing generates a diversity of 5′-modified long and short RNAs. Nature 457, 1028–1032 (2009).
Cabili, M.N. et al. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev. 25, 1915–1927 (2011).
Article CAS Google Scholar
Kim, D. & Salzberg, S.L. TopHat-Fusion: an algorithm for discovery of novel fusion transcripts. Genome Biol. 12, R72 (2011).
Article CAS Google Scholar
Garber, M., Grabherr, M.G., Guttman, M. & Trapnell, C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat. Methods 8, 469–477 (2011).
Article CAS Google Scholar
Grant, G.R. et al. Comparative analysis of RNA-Seq alignment algorithms and the RNA-Seq unified mapper (RUM). Bioinformatics 27, 2518–2528 (2011).
Article CAS Google Scholar
Engström, P.G. et al. Systematic evaluation of spliced alignment programs for RNA-seq data. Nat. Methods 10, 1185–1191 (2013).
Article Google Scholar
Kim, D. et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 14, R36 (2013).
Article Google Scholar
Wu, T.D. & Nacu, S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 26, 873–881 (2010).
Article CAS Google Scholar
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
Article CAS Google Scholar
Burrows, M. & Wheeler, D.J.A. Block-sorting lossless data compression algorithm (Technical report 124). (Digital Equipment Corp., Palo Alto, 1994).
Ferragina, P. & Manzini, G. in Proc. 41st Annual Symp. Found. Comput. Sci. 390–398 (IEEE, 2000).
Langmead, B. & Salzberg, S.L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
Article CAS Google Scholar
Wu, J., Anczuków, O., Krainer, A.R., Zhang, M.Q. & Zhang, C. OLego: fast and sensitive mapping of spliced mRNA-Seq reads using small seeds. Nucleic Acids Res. 41, 5149–5163 (2013).
Article CAS Google Scholar
Griebel, T. et al. Modelling and simulating generic RNA-Seq experiments with the flux simulator. Nucleic Acids Res. 40, 10073–10083 (2012).
Article CAS Google Scholar
Chen, R. et al. Personal omics profiling reveals dynamic molecular and medical phenotypes. Cell 148, 1293–1307 (2012).
Article CAS Google Scholar

Download references

Acknowledgements

We thank G. Pertea and L. Song for their invaluable contributions to our discussions on HISAT. We also thank C. Trapnell for the use of his TuxSim simulation program. This work was supported in part by the National Human Genome Research Institute (US National Institutes of Health) under grants R01-HG006102 and R01-HG006677 to S.L.S.

Author information

Authors and Affiliations

Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
Daehwan Kim, Ben Langmead & Steven L Salzberg
Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland, USA
Daehwan Kim, Ben Langmead & Steven L Salzberg
Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, USA
Ben Langmead & Steven L Salzberg

Authors

Daehwan Kim
View author publications
You can also search for this author in PubMed Google Scholar
Ben Langmead
View author publications
You can also search for this author in PubMed Google Scholar
Steven L Salzberg
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

D.K., B.L. and S.L.S. performed the analysis and discussed the results of HISAT. D.K. implemented HISAT. D.K., B.L. and S.L.S. wrote the manuscript. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Daehwan Kim, Ben Langmead or Steven L Salzberg.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Alignment speed and sensitivity of spliced alignment software for 40 million error-free simulated paired-end reads (100 bp long, 20 million pairs).

This figure shows the alignment speed and sensitivity for each type of pair (all, M, 2M_gt_15, 2M_8_15, 2M_1_7, gt_2M, where ‘all’ includes all the pair types). Since a pair consists of left and right reads, the type of a pair is determined by the more difficult read type. The difficulties of read types are given in the following order from easiest to most difficult: M, 2M_gt_15, 2M_8_15, 2M_1_7, and gt_2M. The plot on the left shows the alignment speed of the programs in terms of the number of pairs processed per second. The right plot shows alignment sensitivity. Pairs are categorized as: (1) correctly and uniquely mapped, (2) correctly mapped (multi-mapped), (3) incorrectly mapped, and (4) unmapped. Case (2) covers instances where an aligner mapped a pair to multiple locations and one of the locations was correct. These four categories encompass all of the pairs. The numbers in the right plot represent the percentages of case (1). The numbers inside the parentheses represent the percentages of cases (1) and (2) combined.

Source data

Supplementary Figure 2 Alignment speed and sensitivity of spliced alignment software for 20 million simulated single-end reads with a mismatch rate of 0.5% (100 bp long).

This figure shows the alignment speed and sensitivity for each type of read (all, M, 2M_gt_15, 2M_8_15, 2M_1_7, gt_2M, where ‘all’ includes all the read types). The plot on the left shows the alignment speed of the programs in terms of the number of reads processed per second. The right plot shows alignment sensitivity. Reads are categorized as: (1) correctly and uniquely mapped, (2) correctly mapped (multi-mapped), (3) incorrectly mapped, and (4) unmapped. Case (2) covers instances where an aligner mapped a read to multiple locations and one of the locations was correct. These four categories encompass all of the reads. The numbers in the right plot represent the percentages of case (1). The numbers inside the parentheses represent the percentages of cases (1) and (2) combined. Note that by looking at both plots, it is easy to see tradeoffs between alignment speed and sensitivity.

Source data

Supplementary Figure 3 Alignment results for 109 million reads, each 101 bp long, from a human sample.

Shown are the cumulative numbers of alignments up to a given edit distance. Edit distance is defined here simply as the number of differences (‘edits’) between the read and the reference sequence. The leftmost panel shows reads that matched exactly (with an edit distance of 0). The next panel (labelled “1”) shows the number of reads that aligned with either 0 or 1 mismatches; similarly for the panels labelled 2 and 3. Note that GSNAP and STAR report soft-clipped alignments where bases on the ends of reads are left unaligned. To compute edit distances for these alignments, we re-aligned the soft-clipped bases to their corresponding locations in the reference genome and calculated the number of mismatches.

Source data

Supplementary Figure 4 Alignment results of spliced alignment software for 109 million real reads (101 bp long).

This figure shows the cumulative number of spliced alignments up to a given edit distance (0, 1, 2, and 3) whose splice sites are known in gene annotations.

Source data

Supplementary Figure 5 Alignment speed and sensitivity of spliced-alignment software for 40 million simulated paired-end reads with a mismatch rate of 0.5% (100 bp long, 20 million pairs).

Source data

Supplementary Figure 6 Alignment results of spliced alignment software for ~218 million real paired-end reads (~109 million pairs).

This figure shows two plots: (1) the cumulative number of alignments up to a given edit distance (0, 1, 2, and 3) and (2) the cumulative number of spliced alignments whose splice sites are known in gene annotations. Note these alignments are pair alignments with the combined edit distance from the left and the right alignments. Spliced alignments are those whose read alignment is a spliced alignment.

Source data

Supplementary Figure 7 Alignment results of spliced alignment software for ~126 million real paired-end reads (~63 million pairs).

Source data

Supplementary Figure 8 Three working examples demonstrating how HISAT applies its hierarchical indexing for fast and sensitive alignment.

The examples include alignment of one exonic read and two junction reads (one an intermediate-anchored read and the other a long-anchored read). Reads are error-free and 100-bp long.

Supplementary Figure 9 Two-step approach version of HISAT to allow alignment of junction reads with small anchors.

This figure shows how to align reads with short anchors (1-7 bp) by making use of splice sites found by reads with long anchors.

Supplementary Figure 10 Alignment of junction reads in the presence of processed pseudogenes.

This figure shows how to correctly align reads that would otherwise be mapped incorrectly to processed pseudogenes.

Supplementary Figure 11 Three more examples demonstrating how HISAT applies its hierarchical indexing for reads involving mismatches, indels and three exons.

The examples include alignment of one exonic read with one mismatch, one exonic read with an indel, and three exon spanning reads with two small anchors on both sides. Reads are 100-bp long.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–11, Supplementary Tables 1–7 and Supplementary Note (PDF 453 kb)

Rights and permissions

Reprints and permissions

About this article

Cite this article

Kim, D., Langmead, B. & Salzberg, S. HISAT: a fast spliced aligner with low memory requirements. Nat Methods 12, 357–360 (2015). https://doi.org/10.1038/nmeth.3317

Download citation

Received: 07 August 2014
Accepted: 16 January 2015
Published: 09 March 2015
Issue Date: April 2015
DOI: https://doi.org/10.1038/nmeth.3317

This article is cited by

The fusion of multi-omics profile and multimodal EEG data contributes to the personalized diagnostic strategy for neurocognitive disorders
- Yan Han
- Xinglin Zeng
- Yonghua Zhao
Microbiome (2024)
Genome-wide investigation of the TGF-β superfamily in scallops
- Qian Zhang
- Jianming Chen
- Jiabao Guo
BMC Genomics (2024)
Establishment of primary prostate epithelial and tumorigenic cell lines using a non-viral immortalization approach
- Simon Lange
- Anna Kuntze
- Christof Bernemann
Biological Research (2024)
Impacts of longitudinal water curtain cooling system on transcriptome-related immunity in ducks
- Qian Hu
- Tao Zhang
- Hehe Liu
BMC Genomics (2024)
Comparative transcriptome analysis reveals major genes, transcription factors and biosynthetic pathways associated with leaf senescence in rice under different nitrogen application
- Yafang Zhang
- Ning Wang
- Guoxiang Chen
BMC Plant Biology (2024)

Subjects

Abstract

Access options

Similar content being viewed by others

Accession codes

Accessions

Gene Expression Omnibus

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Competing interests

Integrated supplementary information

Supplementary information

Source data

Rights and permissions

About this article

Cite this article

Share this article

This article is cited by

Search

Quick links