PERSPECTIVE
Exome sequencing and the genetic basis of complex traits
npg
© 2012 Nature America, Inc. All rights reserved.
Adam Kiezun1,2,16, Kiran Garimella2,16, Ron Do2,3,16, Nathan O Stitziel2,4,16, Benjamin M Neale2,3,5,
Paul J McLaren1,2, Namrata Gupta2, Pamela Sklar6,7, Patrick F Sullivan8, Jennifer L Moran2,
Christina M Hultman9, Paul Lichtenstein9, Patrik Magnusson9, Thomas Lehner10, Yin Yao Shugart11,
Alkes L Price2,12,13,17, Paul I W de Bakker1,2,14,15,17, Shaun M Purcell5,17 & Shamil R Sunyaev1,2,17
Exome sequencing–based studies are emerging as a popular approach
to test for association of rare coding variants with complex phenotypes.
The promise of exome sequencing is grounded in theoretical population
genetics and in the empirical successes of candidate gene sequencing studies. We discuss here several aspects of exome sequencing studies that we
view as particularly important. We analyze exome sequencing data from
438 individuals and use this as a basis to review processing and quality
control of raw sequence data, evaluate the statistical properties of exome
sequencing studies, discuss rare variant burden tests to detect association
to phenotypes and show the importance of accounting for population
stratification in the analysis of rare variants. We conclude that enthusiasm
for exome sequencing studies to identify the genetic basis of complex traits
should be combined with caution stemming from the observation that
on the order of over 10,000 samples may be required to reach sufficient
statistical power.
Next-generation sequencing1–5 coupled with efficient DNA capture6–8 enables the use of exome sequencing as a new approach to study
the genetic basis of human phenotypes. A number of genes underlying
Mendelian diseases have been mapped using this approach6,9–15. Exome
sequencing has also been applied to tumors16–20, where sample purity,
read mapping and chromosomal rearrangements are critical and form a
very distinctive set of issues. Here, we restrict our focus to complex traits.
1Division of Genetics, Department of Medicine, Brigham and Women’s Hospital,
Harvard Medical School, Boston, Massachusetts, USA. 2The Broad Institute of MIT
and Harvard, Cambridge, Massachusetts, USA. 3The Center for Human Genetic
Research, Massachusetts General Hospital, Boston, Massachusetts, USA. 4Division
of Cardiovascular Medicine, Brigham and Women’s Hospital, Harvard Medical
School, Boston, Massachusetts, USA. 5Analytic and Translational Genetics Unit,
Massachusetts General Hospital, Boston, Massachusetts, USA. 6Department
of Psychiatry, Friedman Brain Institute, Mount Sinai School of Medicine, New
York, New York, USA. 7Institute for Genomics and Multi-scale Biology, Mount
Sinai School of Medicine, New York, New York, USA. 8Department of Genetics,
University of North Carolina School of Medicine, Chapel Hill, North Carolina, USA.
9Department of Medical Epidemiology and Biostatistics, Karolinska Institutet,
Stockholm, Sweden. 10Division of Neuroscience and Basic Behavioral Science,
National Institute of Mental Health, Bethesda, Maryland, USA. 11Division of
Intramural Research Program, National Institute of Mental Health, Bethesda,
Maryland, USA. 12Department of Epidemiology, Harvard School of Public Health,
Boston, Massachusetts, USA. 13Department of Biostatistics, Harvard School of
Public Health, Boston, Massachusetts, USA. 14Department of Medical Genetics,
University Medical Center Utrecht, Utrecht, The Netherlands. 15Julius Center for
Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht, The
Netherlands. 16These authors contributed equally to this work. 17These authors
jointly directed this work. Correspondence should be addressed to S.R.S.
(ssunyaev@rics.bwh.harvard.edu).
Published online 29 May 2012; doi:10.1038/ng.2303
NATURE GENETICS | VOLUME 44 | NUMBER 6 | JUNE 2012
In complex trait genetics, exome sequencing studies can be used to identify
rare coding variants that are not detected by microarray-based genomewide association studies (GWAS). The promise of exome sequencing
studies of complex traits has its basis in the success of candidate gene
studies21–26 and has firm roots in population genetic theory27–35.
Large-scale GWAS of complex traits have consistently shown, with few
exceptions, that common variants have modest effects, often requiring
over 10,000 samples for their detection. Exome sequencing provides a
complementary approach by comprehensively assessing the role of all
coding variation, both common and rare. With mutations continually
occurring in each protein-coding gene (at a rate of ~1 × 10-5 per gene per
generation for nonsynonymous variants)36–39 and fitness losses of less
than 1% for most novel nonsynonymous mutations29–31,34, almost every
gene is expected to harbor functionally important variants that can be
tested through sequencing, even if these variants are rare. Therefore, the
strong interest in exome sequencing stems from three factors: the potential to identify many genes underlying complex traits, straightforward
functional annotation of coding variation and a substantially lower cost
(approximately five times lower) than that of whole-genome sequencing.
We evaluate the extent of rare coding variation in empirical data, discuss data processing and quality control of raw sequence data and review
analytical methods for detecting genotype-phenotype associations,
their expected statistical power and the potential for confounding due
to population stratification. To illustrate our analyses, we used empirical
whole-exome sequence data from 184 individuals from the International
HIV Controllers Study (HIV)40 and 254 control individuals from the
Schizophrenia (SCZ) exome sequencing study (Supplementary Note).
Assessment of rare coding variation in empirical data
Exome sequencing data contain an abundance of rare coding variation
and indicate that a large fraction of this variation is functional. Not only
are there many more rare variants than common ones, but sequencing
additional samples continues to uncover additional rare variants. In fact, as
sample size increases, the number of observed variants grows much faster
than is predicted by the neutral model with constant population size41,42
(Fig. 1). This relative excess of rare variants can, in part, be attributed to
recent population expansion43–45 but is also likely to be due to purifying
selection. As a consequence, rare variation is enriched for evolutionarily
deleterious, and thus functional, variants. Additionally, the proportion
of nonsynonymous variants is higher among rare than among common
variants45. Finally, among rare variants, missense variants predicted46 to
be damaging are more prevalent than variants predicted to be benign
(Fig. 1). These findings are consistent with studies that showed that rare
623
PERSPECTIVE
Novel variants discovered by class
(Fold increase over within-class baseline)
50
40
Functional class
Nonsense
Probably damaging
Possibly damaging
Missense
Benign
Synonymous
Theoretical
30
20
Effect of
population
growth
10
© 2012 Nature America, Inc. All rights reserved.
0
npg
Effect of
purifying
selection
100
200
300
400
Cumulative number of samples
Figure 1 Discovery of novel variants for increasing numbers of samples. For
each functional class, the fold increase over the number of variants in one
sample for that class is plotted as a function of the number of samples in
a sequencing experiment. For example, the number of nonsense variants
discovered in 300 samples is 40 times greater than the average number
discovered in a single sample, whereas the number of synonymous variants is
only 10 times greater (although the absolute number of nonsense variants is a
relatively minor proportion of the total variation discovered); this effect is due
to purifying selection. All classes of variants are discovered at rates exceeding
what would be predicted under a neutral model of evolution in a population of
constant size, an effect of population growth. The crossing between curves for
synonymous variants and the theoretical prediction most likely is a signature
of the out-of-Africa bottleneck. Additional details are provided in Methods.
variants in protein-coding regions are under purifying selection35,47–51.
Because sequencing larger samples continuously uncovers functionally
relevant variants, exome sequencing studies enable the direct identification of causal variants (in contrast to GWAS that use linkage disequilibrium patterns between common markers).
Variant calling and quality control filtering
An exome sequencing study starts with exome capture and sequencing of
DNA samples followed by the identification of sequence variants. Exome
capture may be realized on many platforms (for example, Illumina HiSeq,
Roche 454 and ABI SOLiD) and through a variety of probe definitions (for
example, Agilent SureSelect and Nimblegen SeqCap EZ). Recent advances
have made it possible to sequence an entire exome or even several exomes
at deep coverage in a single run of the sequencing instrument. However,
exome capture technologies differ in what they target, how much they
capture and how consistently they do so8. Moreover, only 80–90% of the
targeted regions are covered by greater than 10×, which may leave 4–8 Mb
(or 1,000–2,000 genes) without sufficient coverage for variant detection.
Exome sequencing coverage shows tremendous regional variation8.
Some regions may be over-covered, representing true structural variation
(for example, segmental duplications for which only one copy of the region
exists in the reference genome) or technical artifacts (for example, greater
abundance of capture probes or overlapping probe definitions resulting
in ‘double capturing’). Similarly, some areas may be under-covered for
biological reasons (for example, segmental duplications where more
than one copy exists in the reference sequence, preventing the aligner
from placing the read uniquely) or for technical reasons (for example,
high GC content or density of variation, which impair hybridization of
probes). Furthermore, some near-target regions within 50 bp of the target
boundary can have sufficient coverage to warrant inclusion in variant
calling. Critically, whichever capture technology is used, either all samples
should be processed using the same technology or the variability should be
624
accounted for, for example, by stratifying the study by technology.
We generated whole-exome data (targeting 28 Mb) from 438 samples
(Methods) using a two-stage approach52,53. First, we applied a data processing and variant calling protocol described previously54. Second, we
applied post–SNP calling quality control filters.
For quality control of the resulting SNPs, we used population genetic
statistics and properties of human genetic variation. Using those statistics
helps to identify true variants, because the properties of the mutational
process37,55 are different from errors of the sequencing technology.
We compared statistics computed in the 438-sample data set and in 37
whole-genome data sets released by Complete Genomics, Inc. (CGI; see
URLs), focusing only on the same genomic regions as in the exome data.
The CGI whole-genome data set is good for comparison, because wholegenome sequencing is not dependent on exome capture technology. We
further stratified these per-sample statistics into classes that are biologically interesting (by functional class and CpG status) but may also show
different rates of technical artifacts. We show that filtering is critical for
achieving high-quality calls (Table 1). Before filtering, the metrics showed
significant deviation from their expected values, which may indicate a
high false positive rate. After filtering, the statistics converged to those
in the CGI data set. The effectiveness of the filters is evident also in comparison of the mutation pattern with human-chimpanzee divergence55.
The number of novel variant sites (defined here as those not present
in dbSNP 129) is another metric of SNP call quality (Supplementary
Table 1). Most novel variants have low frequency and are especially
enriched in the singletons and doubletons. Singletons and doubletons
are particularly important to distinguish from false positives, because
technical artifacts or errors in data processing can easily manifest as
novel variation.
Statistics such as the transition/transversion ratio (Ti/Tv) and the number of novel variants are useful as broad indicators of the quality of the
data set and enable comparison of two sets of calls from the same data.
However, precise expectations of these statistics are unknown, because
they depend on many factors, including uneven coverage, variability in
DNA quality and other sources of technical bias, such as machine error.
Therefore, interpreting small differences from expectation in these statistics is nontrivial. Genotyping validation provides an additional measure of call set quality, independent of the population genetics statistics.
Comparing genotyping data to sequencing data makes it possible to
directly measure call set quality by calculating non-reference sensitivity
(NRS—the rate at which non-reference sites in the genotyping data are
recovered in the sequencing data) and non-reference discrepancy rate
(NRD—the rate at which genotypes from sequencing and genotyping
data differ). A genotyping assay should include sites at various allele frequencies, especially at low frequencies (~1%). When available, data from
families, particularly from parent-child trios, can also be useful in assessing call set quality.
Comparing our call set with GWAS data in the same samples at overlapping sites suggested high sensitivity for common variants (98.6% NRS).
To assess the quality of low-frequency variant calls in a comparable
sequencing data set, we compared CGI data to that generated with the
HumanOmni2.5-8 (Omni2.5) BeadChip for the 1000 Genomes Project
(Supplementary Table 2). This comparison resulted in 95.65% NRS,
1.79% NRD and 1.12% NRD for novel variants.
Despite stringent quality control, genotyping and sequencing errors
are still present. Unfortunately, when stratifying variants on the basis of
putative functional consequences, the class of variation that is annotated
to be most deleterious is also more heavily enriched for errors56. This
underscores the importance of rigorous quality control.
It is critical that great care is taken to prevent technical biases and
confounding in sequencing to avoid distorting association results. For
VOLUME 44 | NUMBER 6 | JUNE 2012 | NATURE GENETICS
PERSPECTIVE
npg
© 2012 Nature America, Inc. All rights reserved.
instance, differences in how (rare and precious) case samples are handled
compared to control samples may lead to systematic false positives that
masquerade as interesting associations. Likewise, simultaneous multisample variant calling on only cases or only controls may lead to differential detection of variants across batches, negatively impacting the accuracy
of allele frequency estimates and association analyses. Many other, often
poorly understood or hidden, technical confounders (for example, in
DNA preparation, exome capture technology, machine type, read length,
depth of coverage, SNP calling algorithm or quality control filters) may
influence the properties of exome sequencing data. Therefore, although
the use of shared controls (for example, from the 1000 Genomes Project)
has been helpful in filtering approaches applied to Mendelian disorders6,9,
it is not likely to be applicable to association analysis of complex diseases.
There are numerous other statistical tests for rare variants in complex traits
(reviewed in refs. 62–65).
In simulation studies64, most tests behave similarly in many situations. However, the results may depend on assumptions used in simulated data. The relative power to detect association depends on factors
such as the number and proportion of causal variants, their population
frequencies and their effect sizes, as well as the directionality of effects,
the number of genes contributing to the trait and the fraction of causal
genetic variation located in the exome. Statistical tests were developed
with various combinations of these factors in mind and therefore are
likely to be sensitive to different disease architectures. For example,
the simulation framework used in the development of the WSS test
assumes effect size proportional to 1/x(1 - x) (where x is the population frequency of the causal allele), the sequence kernel association
test (SKAT)66 simulation framework uses effect size proportional to
-log(x), and the VT test simulations use a demographic history model
with a range of possible values of strength of selection leading to different relationships between effect size and x. These simulations were
designed to show the strengths of each method under different effect
size distributions: the WSS test is designed for effect sizes proportional
to 1/x(1 - x), SKAT is designed for effect sizes proportional to b density
b(x; a1, a2) for prespecified a1 and a2, the C-alpha test67 is designed for
Statistical methods for the analysis of rare variants
Analysis of rare variants requires statistical methods that are fundamentally different from association statistics used for testing common
variants. There are two reasons for this. First, rare variants have to be
combined in a gene (or pathway) for an association test to reach sufficient
power57. For example, a causal SNP at a frequency of 1 in 500 and genotype relative risk of 10 in a sample of 200 cases and 200 controls has 0.2%
power to be detected at a conventional significance threshold for GWAS
(P < 5 × 10-8). Second, functional and population genetics information can be added to the Table 1 SNP counts per Illumina-sequenced sample
testing approach, because exome sequencing
Counts
Number of heterozygotes Number of homozygotes
comprehensively captures variation that can be
Filter
(% filtered)
(% filtered)
(% filtered)
Ti/Tv
annotated with such information.
Total
Early candidate gene sequencing studies for
Unfiltered
18,626
11,761
3,007
2.92
complex traits were based on comparison of the
Filtered
16,776
10,242
2,785
3.21
numbers of nonsynonymous alleles exclusive to
(10%)
(13%)
(7%)
cases or controls (or samples at the extremes of
Comparison
16,914
10,464
2,492
3.31
the trait distribution)21,26. This approach has
By
functional
class
limited power, because it ignores common and
Unfiltered
9,536
5,933
1,601
4.80
low-frequency polymorphisms, as most such Silent
Filtered
8,845
5,372
1,514
5.10
variants would be present both in cases and
(7%)
(9%)
(5%)
controls. Recently, a number of statistical tests
Comparison
8,987
5,514
1,352
5.22
have been designed for rare variant analysis.
Missense Unfiltered
8,698
5,557
1,350
1.92
The Combined Multivariate and Collapsing
Filtered
7,644
4,685
1,220
2.11
(CMC) test58 jointly assesses the role of rare
(12%)
(16%)
(10%)
and common variation. For common variComparison
7,723
4,772
1,095
2.17
ants, traditional regression-based association
Nonsense Unfiltered
70
60
9
1.31
is applied. For rare variation, an individual’s
Filtered
48
39
8
1.65
predictor in a regression model is defined as 1
(31%)
(35%)
(11%)
if the individual possesses at least one rare variComparison
46
38
6
2.00
ant in the region (for example, a gene) and 0
By CpG status
otherwise. The weighted-sum statistic (WSS)
CpG
Unfiltered
2,213
1,539
422
4.82
test59 creates a composite genotype score for all
Filtered
2,030
1,390
397
5.12
individuals. This score is the sum of alternate
(8%)
(10%)
(6%)
alleles weighted by the inverse of the binomial
Comparison
2,098
1,448
350
5.44
variance. A rank-sum test is then performed on
16,415
10,218
2,585
2.75
the genotype scores between phenotypic groups. Non-CpG Unfiltered
Filtered
14,752
8,852
2,338
3.03
The kernel-based adaptive cluster (KBAC) test60
(10%)
(13%)
(10%)
also uses a weighting scheme that reflects apparComparison
14,822
9,901
2,145
3.11
ent effect sizes of individual variants. An alternative approach to combine rare variants into a SNP counts (computed as the median of the metrics value in each sample over 438 samples) were localized to the
single test selects an allele frequency threshold exome, stratified by functional and biological criteria and compared to SNP counts per the Complete Genomics–
sequenced sample (computed as the median of the metrics value in each sample over 37 samples). Before the
on the basis of the observed data. The develop- application of filters, counts significantly differed from the expectations derived from the independently obtained
61
ment of this variable threshold (VT) approach
comparison set, indicating the presence of many false positives. For example, the SNPs identified as nonsense
was motivated by population genetics simula- mutations initially appeared to be enriched by 1.5-fold for false positives. As nonsense variants are rare, an unfiltered
call set may contain many artifactual events that masquerade as nonsense variants. Quality control filters helped to
tions that showed that there is no single optimal align the metrics with the comparison set and the data from human-chimpanzee divergence. Number of homozygotes
weighting scheme or allele frequency threshold. includes only homozygotes of the derived allele.
NATURE GENETICS | VOLUME 44 | NUMBER 6 | JUNE 2012
625
PERSPECTIVE
for all possible missense changes in humans,
making these scores readily applicable.
An important consideration for exome
Triglycerides
ANGPTL4
Fisher’s exact
13
2
1,775
0.016b
26
b
sequencing
studies is selecting the significance
Triglycerides
ANGPTL5
Fisher’s exact
9
1
1,775
0.022
threshold that accounts for multiple testing. A
HDL
ABCA1
RVE
28
4
519
<0.0001b
21
simple way of doing this is to adopt a Bonferroni
APOA1
1
0
519
correction for 20,000 independent tests (1 test
LCAT
6
1
519
for each gene), which, for an experiment-wide
Blood pressure
SLC12A1,
Fisher’s exact
9
1
626
0.02
22
significance of 0.05 gives a P-value threshold of
SLC12A3, KCNJ1
2.5 × 10-6 per gene. However, such a threshold
Obesity
Obesityc
Fisher’s exact
73
97
757
0.061
25
may
be overly conservative, because it assumes
Type 1 diabetes
IFIH1
Fisher’s exact
21
39
960
0.025
24
that
each
tested gene has sufficient variation to
Triglycerides
APOA5
Fisher’s exact
1
5
765
0.25
23
achieve
the
asymptotic properties for the test
GCKR
Fisher’s exact
5
20
765
0.024
statistic. For example, if only two individu-5
LPL
Fisher’s exact
8
44
765
2.47 × 10
als carry nonsynonymous variants in a given
APOB
Fisher’s exact
39
85
765
0.008
gene, the difference between cases and controls
Gene burden test results are summarized from published candidate gene resequencing studies. Only one signal for the
never exceeds two total observations—thus, the
LPL gene is strongly associated (P = 2.47 × 10-5) but does not attain a genome-wide significance of P < 2.5 × 10-6
(P < 0.05 after applying a Bonferroni correction for 20,000 genes tested). This highlights the importance of
most significant P value that can be achieved is
sequencing large numbers of samples. Counts for refs. 21 and 26 are as reported in the published studies. Counts for
approximately 0.25, assuming that these two
ref. 22 are based on functional mutation carriers, as described in the published study. Counts for refs. 24 and 25 are
variants are independent. Therefore, unless the
based on SNPs with minor allele frequency (MAF) < 0.01. Counts for ref. 23 are based on SNPs with MAF < 0.01 in
controls only. All P values are from two-sided tests unless reported otherwise in the published study. RVE, rare variant
study is large, association P values will generally
exclusive test; HDL, high-density lipoprotein.
be
less significant than expected under the null
aAllele count for non-reference allele. bAs reported in the published study. cObesity refers to a combination of 21 candidate genes.
hypothesis of no association. This deficiency of
significant P values is apparent in the 438 whole
effects in opposite directions in the same region, and the VT test makes exomes (Fig. 2a). The PLINK/SEQ suite computes from data the i-stat,
which is an estimate of the minimal achievable P value for a gene. The
no assumptions about effect size distributions.
When combining rare variants, all functional variants may either be i-stat can be used by setting a threshold (for example, 1 × 10-3) and only
assumed to influence the trait in the same direction, or some may be correcting for the number of genes that have an i-stat value below the
allowed to have opposite directions of effect. A biochemical argument can threshold, in agreement with the idea that, for the genes with an i-stat value
be made that most nonsynonymous variants are loss-of-function hypo- above the threshold, there is no power to detect an association. Another
morphs, whereas gain-of-function variants are infrequent. However, some way to correct for multiple testing is to compute an experiment-wide siggenes (for example, PCSK9)68 have variants of both kinds. Several tests nificance threshold by permutations of phenotype labels, create the empirallow for rare variants to have opposite effects on the trait (for example, ical distribution of minimal P values for all genes across permutations
Step-up69, C-alpha, the replication-based test70 and SKAT). These tests are and compare the minimal P value from the real data to that distribution
based either on the analysis of over-dispersion or on explicit linear models (Fig. 2b). This approach efficiently controls type 1 error (the probability
that determine the contribution of a variant to a score on the basis of the of rejecting the null hypothesis when it is true) and is less conservative
than the Bonferroni correction. Notably, the P-value threshold computed
direction of effect observed in the data.
Rare variant tests can benefit from stratifying or weighting rare alleles by permutations is dependent on both the study and the statistical test.
by functional significance, as evidenced by simulations and sequencing However, the experiment-wide correction via permutation is not robust to
studies of candidate genes61,64,71–73. The power of rare variant tests is confounding, and it is essential to assess the quality of the distribution of
strongly influenced by the fraction of causal variants among all variants test statistics for those genes that have i-stat values less than the threshold
analyzed, and using functional information is an effective way to give to ensure appropriate calibration of the distribution. Nevertheless, with
greater weight to likely causal variants. For example, nonsense variants increasing sample sizes, the dimensionality of the tests will also increase,
should be prioritized above non-conserved missense variants. Similarly, and studies will be assessing close to 20,000 tests. Therefore, for large studmissense variants should be prioritized above synonymous variants. ies, we consider the Bonferroni threshold to be preferable.
Functional consequences of variants can be predicted by examining the
effects of amino-acid changes using comparative sequence and protein Statistical power of exome sequencing studies
structure analyses. Many computational prediction and conservation The power of an exome sequencing study is limited by the amount of
methods74,75 are available (reviewed in refs. 76–79). The accuracy of variation in a gene. Therefore, power is higher for genes with more varithose methods is approximately 80% (ref. 80), and it is likely highest ants, for example, larger genes or genes in regions with elevated mutation
for rare variants. (Truly functional variants are most likely deleterious rate. Additionally, genes in which most variants are causal are easier to
and are kept at low frequencies by purifying selection, and, so, com- identify than those in which few variants are causal. In individual candimon variants are most likely neutral and nonfunctional.) Therefore, date gene sequencing studies, estimates of this proportion ranged from
using prediction methods enriches for functional variants and thereby 30–70% (refs. 21,22,26). Consequently, the effect size is not only a property
boosts the power of association tests. Because such predictions are not of an individual variant but rather a reflection of the distribution of effects
perfect, however, they should be used quantitatively by weighing vari- coupled with how those effects are interpreted via the test. Some statistical
ants rather than qualitatively by filtering out variants. A number of tests explicitly account for differences in power when evaluating evidence
tests allow the inclusion of prediction scores in test statistics, including of association81.
the VT test, KBAC, SKAT, the rare variant weighted aggregate statistic
Given the sample sizes, the likely effect sizes and frequencies of causal
(RWAS)72 and the likelihood ratio test (LRT)73. The PLINK/SEQ suite variants and the proportion of causal variants in a gene, do current exome
includes previously computed PolyPhen-2 (ref. 46) prediction scores sequencing studies have sufficient power to detect genes underlying
Table 2 Summary of gene burden test results for rare variant studies
npg
© 2012 Nature America, Inc. All rights reserved.
Trait
626
Gene
Test
ACa low
ACa high
n
P
Ref.
VOLUME 44 | NUMBER 6 | JUNE 2012 | NATURE GENETICS
complex phenotypes? Enthusiasm for exome sequencing studies stems,
in part, from successful candidate gene sequencing studies, and, so, we
sought to test whether exome sequencing would be expected to have sufficient power to detect genes discovered by the candidate gene approach.
To date, no published candidate gene study reported P values that would
be significant in the context of the complete exome (Table 2). This is particularly notable, because some candidate gene studies used much larger
sample sizes (thousands of individuals) than ongoing exome sequencing
studies (hundreds of individuals). This shows that current exome sequencing studies are underpowered to detect genes with the allelic distribution and effect sizes similar to those in the published examples. Indeed,
extrapolation of effect sizes and frequencies from published studies shows
that thousands of individuals are required to reach acceptable statistical
power (Fig. 3). This analysis is consistent with an earlier study based on
population genetics simulations that concluded that as many as 10,000
individuals at phenotypic extremes would be needed to achieve satisfactory power30. The very first GWAS82–84 were also highly underpowered,
but falling costs and the ability to combine studies in meta-analyses have
enabled the rapid creation of well-powered studies and have resulted in
many discoveries. Similarly, with the falling cost of sequencing and targeted enrichment85, exome sequencing will soon be affordable to many
research groups, and we expect that consortia will form to facilitate the
pooling of exome sequencing data, thus enabling better-powered studies
and a new wave of discoveries.
Replication to confirm association
To discover robust associations, replication in exome sequencing studies
will be critical. Because small early studies will inevitably be underpowered, no gene may achieve exome-wide statistical significance. In such
cases, unless strict correction for multiple tests is performed, researchers
should resist the temptation to apply a battery of statistical tests, each
with various weighing schemes and variant selection. We strongly argue
that an association can only be considered real if it has been replicated. A
reasonable replication strategy is to select a few genes (for example, ten)
from the discovery stage on the basis of the strength of association86 and
biological plausibility. Sequencing and rare variant associations must then
be performed on new samples using a multiple test correction threshold
applied only to the (smaller) set of candidate genes.
southern European ancestry in European-Americans) may have different allele frequency spectra due to their different demographic histories. For example, in an exome sequencing study in African-Americans
in which disease cases have more African ancestry than controls, one
expects to see an excess of rare variants in cases, because African chromosomes carry more rare variants92.
We created a hypothetical case-control exome sequencing study
involving real sequencing data and simulated phenotype data using
438 individuals divided into two populations (Methods). To induce
population stratification, we assigned case or control status to each
sample randomly, with a bias whereby more cases were taken from one
population and more controls were taken from the other population.
Association tests indicated inflated rates of spurious statistically significant P values. We corrected for population stratification by modifying
the permutation scheme to account for subpopulations. This correction
was effective at controlling type 1 errors in all association tests.
Our simulations show that exome sequencing studies for complex
traits can be affected by population stratification, which may produce
spurious associations. We have shown that a simple permutation
scheme is sufficient to correct for population stratification when discrete clusters corresponding to genome-wide ancestry are known or can
be inferred by applying standard methods to GWAS chip data88,89,93.
The permutation scheme is appealing in that it generalizes most burden
of multiple rare variants tests; however, some tests may also be amenable
to the use of PCA covariates in instances in which population structure
is best described by continuous clines rather than discrete clusters89.
Discussion
Exome sequencing studies enable the unbiased discovery of coding variations for subsequent association testing for complex traits.
However, we expect that initial studies will be underpowered, and we
have highlighted a number of technical issues that could bias the interpretation and analysis of rare variant data, especially for novel variants.
We expect that over 10,000 exomes will be required to achieve sufficient statistical power to robustly detect associations of rare variation
with complex traits. Issues affecting the design of exome sequencing
studies that we have discussed here are also relevant to whole-genome
sequencing studies, in which the analysis of protein-coding variation
will remain the same as in exome sequencing studies.
Focusing exclusively on the exome sequence is a limitation in complex trait genetics, where noncoding genetic variation is believed to
Population stratification
Population stratification—systematic ancestry differences between cases
and controls—is a well-studied confounder in
genetic association studies87. In GWAS, comb 1.0
a
monly used approaches to correct for stratifica4
tion include stratifying by population cluster
0.8
(structured association), principal-components analysis (PCA) and mixed models87–90.
3
0.6
Genomic control may also be applied, but it is
generally more useful for assessing stratifica2
tion than correcting for it87,91.
0.4
An important question is whether population stratification can confound exome
0.2
sequencing studies, and, if so, how should
stratification be corrected for in this con0.0
text? Although excess of rare variant tests
3.0
3.5
4.0
4.5
5.0
5.5
6.0
0
1
2
3
4
are fundamentally different from single−log (best P−value)
Expected –log (P)
variant tests, the possibility of stratificaFigure 2 Association analysis. (a) Quantile-quantile plot of association P values under the null
tion still exists, because different ancestries
hypothesis. (b) Distributions of lowest P values under whole-exome permutations. Histograms show the
within a structured population sample (for distributions of the lowest P values across permutations for the T5 test. The red vertical line indicates
example, African and European ances- the exome-wide significance level of 0.05 for the most significant gene (the most significant gene
try in African-Americans or northern and reaches exome-wide significance if its P value is lower that the level indicated by the red line).
Density
Observed –log10(P)
npg
© 2012 Nature America, Inc. All rights reserved.
PERSPECTIVE
10
NATURE GENETICS | VOLUME 44 | NUMBER 6 | JUNE 2012
10
627
PERSPECTIVE
Data generation
Reads were aligned to the reference genome using the Burrows-Wheeler
Aligner (BWA)94, duplicate PCR reads were removed using Picard (see
URLs), base quality scores were recalibrated using the Genome Analysis
Toolkit (GATK), and alignments near putative indels were refined using
GATK. The resulting data were run through GATK to discover and genotype SNP candidates.
6
5
−log10 (P)
4
3
Gene
LPL
APOB
GCKR
IFIH1
Obesity
APOA5
2
1
0
0
2,000
4,000
6,000
8,000
10,000
npg
© 2012 Nature America, Inc. All rights reserved.
Sample size
Figure 3 Extrapolation of gene burden results. Horizontal solid red line
shows Bonferroni genome-wide significance threshold of P = 2.5 × 10-6.
Horizontal dashed line shows the threshold derived from whole-exome
permutations (Fig. 2b). For larger sample sizes, the permutation threshold
would be closer to the Bonferroni threshold, asymptotically approaching
it as the sample sizes increase. Obesity refers to a combination of 21
candidate genes.
have a larger role than in Mendelian genetics or in somatic cancer
genetics. However, there are several reasons to perform exome sequencing. First, statistical approaches combining multiple rare variants are
problematic in noncoding regions, because there is no easily identifiable
set of sites harboring variants with unidirectional phenotypic effects.
Second, variants in regulatory regions are likely to have smaller effect
sizes. In contrast, protein-coding genes provide a well-defined and
interpretable target for mutations in the locus. These mutations create
variants at this locus that, in a well-powered study, can be identified
as being associated with the trait. Thus, although the associated variants identified by focusing on the exome sequence alone are unlikely
to explain all of the heritability for a complex trait, this approach does
have the potential to highlight the genes involved.
Despite the challenges in the design of exome sequencing studies
of complex traits discussed here, the observation that a large number
of functionally significant coding variants exist in the human population brings hope that the exome sequencing approach will be useful in
identifying loci important for complex traits and diseases.
Methods
Simulating discovery of novel variants
To calculate the discovery rate of novel variants for increasing numbers of samples, first, all exome samples were arranged in a random
order. Then, samples were analyzed sequentially, starting with the first
sample, and the cumulative set of identified variants was computed.
For every subsequent sample, a variant site was considered novel if that
site had not been identified as a variant in the cumulative set of the preceding samples. The fold increase over baseline (where the baseline for
each class was the number of variants discovered in the first sample)
was plotted (Fig. 1). To avoid sampling bias, random resampling was
performed, and the overall mean was calculated. Nonsense, missense
and synonymous classes were defined on the basis of RefSeq annotations. The missense class was further divided into probably damaging,
possibly damaging and benign subclasses according to PolyPhen-2
predictions46. The theoretical line plotted the expected number of
segregating sites under a neutral model of evolution in a population
of constant size41.
628
Quality control filters
We used the following quality control filters: (i) a quality score versus
depth filter that excluded variants whose depth-normalized discovery confidence did not exceed 2.0; (ii) a homopolymer run filter that
excluded variants that had an alternate allele that matched the allele in an
immediately adjacent homopolymer run of greater than five nucleotides;
(iii) a strand bias filter that excluded variants whose alternate allele was
preferentially found on one of the two available read orientations at the
site, and (iv) an indel mask filter that excluded variants discovered at sites
that overlap with indels.
Association analysis
Case-control status was assigned randomly, and a T5 test for burden of rare
variants was executed on all genes (T5 is a variant of the CMC test58 that
considers only nonsynonymous variants with MAFs less than 5%, uses
the total count of alternative minor alleles in cases as the test statistic and
assigns significance by permuting phenotype labels). The overall deflation
in significant P values (fewer genes associated at any significance level than
expected by chance) was due to low counts of variants in genes. Results
were similar for the T1 version of CMC, as well as for WSS59 and the VT
test61. This pattern is expected in studies with small sample sizes (with
fewer than approximately 1,000 individuals). Whole-exome permutations
can be used to establish exome-wide significance in such cases.
Whole-exome permutations
Phenotype labels of full exomes were permuted 1 million times, that is,
permuted phenotype affected all genes in an individual. In each permutation, the lowest exome-wide P value was computed. It took fewer than
1,000 computing hours to run 8 statistical tests on the 1 million wholeexome permutations of 15,122 genes in 438 individuals. The computation
was very easy to parallelize and was thus quite affordable using cluster or
cloud computing.
Power calculations
Data were extrapolated from results from five candidate genes and one
obesity gene set from published studies (Table 2). Fisher ’s exact test was
used to calculate P values after sample size extrapolations.
Population stratification
We induced population stratification in a hypothetical exome sequencing
study involving real sequencing data and simulated phenotype data using
184 individuals from the HIV and 254 individuals from the SCZ exome
sequencing studies. We observed that there were exome-wide differences
in allele frequencies between the populations, which we quantified by estimating the FST between HIV and SCZ samples using exome sequencing
data95. FST was estimated using EIGENSOFT software. Using variants with
MAFs of at least 5%, we observed an FST value of 0.003, which is consistent
with the different European ancestries of the HIV (European-American)
and SCZ (Swedish) samples and with previous estimates of genetic distances between European populations96. We considered the possibility
that the observed differences between HIV and SCZ samples could be
due to differential bias resulting from differences in sample collection,
sequencing or data processing97, but we view this as unlikely because we
VOLUME 44 | NUMBER 6 | JUNE 2012 | NATURE GENETICS
npg
© 2012 Nature America, Inc. All rights reserved.
PERSPECTIVE
applied identical data processing and quality control procedures to both
sample sets and because quality control metrics revealed no systematic
differences between the sample sets.
To induce population stratification, we randomly assigned 80% of
the HIV samples and 20% of the SCZ samples as cases and assigned the
remaining samples as controls. We then used case-control labels to run
four association tests: two fixed-threshold approaches (T1 and T5 versions
of the CMC test58), WSS59 and the VT test61. We quantified the evidence
of population stratification by considering the most significant P value (of
15,122 genes) and the proportion of P values < 0.05 and < 0.01. As seen in
the null distribution (Fig. 2), it is expected that, due to low counts, P values
will have a deficiency of statistically significant signals. Before correction
for population stratification, however, our metrics indicated an excess of
statistically significant signals. For example, for T5, the most significant
P value was < 0.000001, and the proportions of P values were 0.0595 at
the 0.05 level and 0.0136 at the 0.01 level. Results were similar for the
other statistical tests and for other proportions of HIV samples assigned
as cases (we experimented with 90%, 80% and 70%, as well as 30%, 20%
and 10%). We note that, when the proportion of HIV individuals assigned
as cases was above 50%, the induced inflation was higher than when the
proportion was below 50%, which could be due to a population genetic
excess of rare variants in Swiss and European-American samples relative
to Swedish samples.
To correct for stratification, we modified the script that implements
association tests (see URLs) to employ a permutation scheme in which
case-control status was permuted within each population (HIV and SCZ),
assuming known population labels. This permutation scheme does not
change the computational cost of the study. The results show that the
permutation procedure adequately controlled for population stratification, removing the excess of significant signals. For example, for T5, the
most significant P value after correction was 0.0001, and the proportions
of P values were 0.0340 at the 0.05 level and 0.0060 at the 0.01 level. The
deficiency of statistically significant signals is due to low counts and is
consistent with the null distribution (Fig. 2). Results were similar for the
other statistical tests and for other proportions of HIV samples assigned
as cases. These results show that the permutation-based correction was
effective at controlling type 1 errors.
Data access
SCZ control data can be accessed via the database of Genotypes and
Phenotypes (dbGAP; phs000473.v1.p1). To access the HIV data, investigators can submit a brief concept sheet detailing their study design,
research questions and other needs. The concept sheet with detailed
instructions can be downloaded (see URLs). Please send completed forms
to P. Richtmyer (prichtmyer@partners.org). Requests will be reviewed on
the basis of scientific merit, feasibility and overlap with ongoing concept
sheets and investigations.
URLs. Complete Genomics data set, http://www.completegenomics.com/sequencedata/download-data/; R script used for all association analyses (contains T1, T5,
WSS and VT tests and optionally weighted PolyPhen-2 predictions), http://genetics.
bwh.harvard.edu/rare_variants/; EIGENSOFT, http://www.hsph.harvard.edu/faculty/alkes-price/software/; Picard utilities for manipulation of Sequence Alignment/
Map (SAM) files, http://picard.sourceforge.net/index.shtml; Burrows-Wheeler
Aligner (BWA), http://bio-bwa.sourceforge.net/; Genome Analysis Toolkit (GATK)
suite, http://www.broadinstitute.org/gsa/wiki/index.php/Home_Page; PLINK/SEQ
library for management, quality control and analysis of exome sequencing data,
including several statistical tests, http://atgu.mgh.harvard.edu/plinkseq/; sample
repository research concept sheet, http://cfar.globalhealth.harvard.edu/fs/docs/
icb.topic938249.files/Harvard%20CFAR%20Concept%20Sheet%20Template%20.
docx.
Note: Supplementary information is available in the online version of the paper.
NATURE GENETICS | VOLUME 44 | NUMBER 6 | JUNE 2012
ACKNOWLEDGMENTS
The authors are grateful to S. Pollack for assistance with EIGENSOFT. This work
was made possible, in part, by the US National Institutes of Health (NIH; grant 5R01
MH084676) and, in part, by the International HIV Controllers Study, supported by the
Collaboration for AIDS Vaccine Discovery of the Bill and Melinda Gates Foundation
(to P.I.W.d.B.), and the AIDS Clinical Trials Group, supported by the NIH (grants
AI069513, AI34835, AI069432, AI069423, AI069477, AI069501, AI069474, AI069428,
AI69467, AI069415, Al32782, AI27661, AI25859, AI28568, AI30914, AI069495,
AI069471, AI069532, AI069452, AI069450, AI069556, AI069484, AI069472, AI34853,
AI069465, AI069511, AI38844, AI069424, AI069434, AI46370, AI68634, AI069502,
AI069419, AI068636, RR024975 and AI077505). Sequencing of the SCZ control
individuals was funded by the NIH (grant RC2MH089905), the Herman Foundation
and the Stanley Medical Research Institute. N.O.S. was supported, in part, by an NIH
Training Grant (T32-HL07604-25; Division of Cardiovascular Medicine, Brigham and
Women’s Hospital). B.M.N. was supported by a National Institute of Mental Health
(NIMH) grant (1R01MH089208-01). R.D. is supported by a Canadian Institutes of
Health Research Banting Postdoctoral Fellowship. The views expressed in this paper
do not necessarily represent the views of the NIMH, NIH, Department of Health and
Human Services (HHS) or the US government.
AUTHOR CONTRIBUTIONS
A.K. developed the computational analysis pipeline and analyzed data, K.G.
performed upstream quality control and analysis of sequencing data, R.D. performed
the power analysis, N.O.S. performed the assessment of rare variants in empirical data,
B.M.N. contributed to statistical analyses, and P.J.M. assisted with data analysis. T.L.
participated in designing the study. P.S., P.F.S., J.L.M., C.M.H., P.L., P.M., P.I.W.d.B.,
N.G. and S.M.P. contributed data. A.L.P., P.I.W.d.B. and S.R.S. conceived and designed
the study. A.L.P., P.I.W.d.B., S.M.P. and S.R.S. supervised the work. A.K., K.G., R.D.,
N.O.S., B.M.N., Y.Y.S., A.L.P., P.I.W.d.B. and S.R.S. wrote the manuscript. All authors
approved the manuscript.
COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.
Published online at http://www.nature.com/doifinder/10.1038/ng.2303.
Reprints and permissions information is available online at http://www.nature.com/
reprints/index.html.
1. Fuller, C.W. et al. The challenges of sequencing by synthesis. Nat. Biotechnol. 27,
1013–1023 (2009).
2. Rusk, N. & Kiermer, V. Primer: Sequencing—the next generation. Nat. Methods 5, 15
(2008).
3. Metzker, M.L. Sequencing technologies the next generation. Nat. Rev. Genet. 11,
31–46 (2010).
4. Shendure, J. & Ji, H. Next-generation DNA sequencing. Nat. Biotechnol. 26, 1135–
1145 (2008).
5. Clarke, J. et al. Continuous base identification for single-molecule nanopore DNA
sequencing. Nat. Nanotechnol. 4, 265–270 (2009).
6. Ng, S.B. et al. Exome sequencing identifies MLL2 mutations as a cause of Kabuki
syndrome. Nat. Genet. 42, 790–793 (2010).
7. Teer, J.K. & Mullikin, J.C. Exome sequencing: the sweet spot before whole genomes.
Hum. Mol. Genet. 19, R145–R151 (2010).
8. Hedges, D.J. et al. Comparison of three targeted enrichment strategies on the SOLiD
sequencing platform. PLoS ONE 6, e18595 (2011).
9. Ng, S.B. et al. Targeted capture and massively parallel sequencing of 12 human
exomes. Nature 461, 272–276 (2009).
10. Pierce, S.B. et al. Am. Mutations in the DBP-deficiency protein HSD17B4 cause
ovarian dysgenesis, hearing loss, and ataxia of Perrault Syndrome. J. Hum. Genet.
87, 282–288 (2010).
11. Krawitz, P.M. et al. Identity-by-descent filtering of exome sequence data identifies
PIGV mutations in hyperphosphatasia mental retardation syndrome. Nat. Genet.
42, 827–829 (2010).
12. Wang, J.L. et al. TGM6 identified as a novel causative gene of spinocerebellar ataxias
using exome sequencing. Brain. 133, 3510–3518 (2010).
13. Ng, S.B., Nickerson, D.A., Bamshad, M.J. & Shendure, J. Massively parallel sequencing and rare disease. Hum. Mol. Genet. 19, R119–R124 (2010).
14. Musunuru, K. et al. Exome sequencing, ANGPTL3 mutations, and familial combined
hypolipidemia. N. Engl. J. Med. 363, 2220–2227 (2010).
15. Hoischen, A. et al. De novo mutations of SETBP1 cause Schinzel-Giedion syndrome.
Nat. Genet. 42, 483–485 (2010).
16.Zhao, Q. et al. Systematic detection of putative tumor suppressor genes through the
combined use of exome and transcriptome sequencing. Genome Biol. 11, R114
(2010).
17. Wei, X. et al. Exome sequencing identifies GRIN2A as frequently mutated in melanoma. Nat. Genet. 43, 442–446 (2011).
18. Varela, I. et al. Exome sequencing identifies frequent mutation of the SWI/SNF
complex gene PBRM1 in renal carcinoma. Nature 469, 539–542 (2011).
19. Agrawal, N. et al. Exome sequencing of head and neck squamous cell carcinoma
reveals inactivating mutations in NOTCH1. Science 333, 1154–1157 (2011).
20. Chang, H. et al. Exome sequencing reveals comprehensive genomic alterations across
eight cancer cell lines. PLoS ONE 6, e21097 (2011).
21. Cohen, J.C. et al. Multiple rare alleles contribute to low plasma levels of HDL cholesterol. Science 305, 869–872 (2004).
629
npg
© 2012 Nature America, Inc. All rights reserved.
PERSPECTIVE
22. Ji, W. et al. Rare independent mutations in renal salt handling genes contribute to
blood pressure variation. Nat. Genet. 40, 592–599 (2008).
23. Johansen, C.T. et al. Excess of rare variants in genes identified by genome-wide
association study of hypertriglyceridemia. Nat. Genet. 42, 684–687 (2010).
24. Nejentsev, S., Walker, N., Riches, D., Egholm, M. & Todd, J.A. Rare variants of IFIH1,
a gene implicated in antiviral responses, protect against type 1 diabetes. Science
324, 387–389 (2009).
25. Ahituv, N. et al. Medical sequencing at the extremes of human body mass. Am. J.
Hum. Genet. 80, 779–791 (2007).
26. Romeo, S. et al. Rare loss-of-function mutations in ANGPTL family members
contribute to plasma triglyceride levels in humans. J. Clin. Invest. 119, 70–79
(2009).
27. Pritchard, J.K. Are rare variants responsible for susceptibility to complex diseases?
Am. J. Hum. Genet. 69, 124–137 (2001).
28. Pritchard, J.K. & Cox, N. J. The allelic architecture of human disease genes: common
disease–common variant...or not? Hum. Mol. Genet. 11, 2417–2423 (2002).
29. Kryukov, G.V., Pennacchio, L.A. & Sunyaev, S.R. Most rare missense alleles are deleterious in humans: implications for complex disease and association studies. Am. J.
Hum. Genet. 80, 727–739 (2007).
30. Kryukov, G.V., Shpunt, A., Stamatoyannopoulos, J.A. & Sunyaev, S.R. Power of deep,
all-exon resequencing for discovery of human trait genes. Proc. Natl. Acad. Sci. USA
106, 3871–3876 (2009).
31. Boyko, A.R. et al. Assessing the evolutionary impact of amino acid mutations in the
human genome. PLoS Genet. 4, e1000083 (2008).
32. Williamson, S.H. et al. Simultaneous inference of selection and population growth
from patterns of variation in the human genome. Proc. Natl. Acad. Sci. USA 102,
7882–7887 (2005).
33. Eyre-Walker, A., Woolfit, M. & Phelps, T. The distribution of fitness effects of new
deleterious amino acid mutations in humans. Genetics 173, 891–900 (2006).
34. Yampolsky, L.Y., Kondrashov, F.A. & Kondrashov, A.S. Distribution of the strength of
selection against amino acid replacements in human proteins. Hum. Mol. Genet.
14, 3191–3201 (2005).
35. Fay, J.C., Wyckoff, G.J. & Wu, C.-I. Positive and negative selection on the human
genome. Genetics 158, 1227–1234 (2001).
36. Nachman, M.W. & Crowell, S.L. Estimate of the mutation rate per nucleotide in
humans. Genetics 156, 297–304 (2000).
37. Kondrashov, A.S. Direct estimates of human per nucleotide mutation rates at 20
loci causing Mendelian diseases. Hum. Mutat. 21, 12–27 (2003).
38. Roach, J.C. et al. Analysis of genetic inheritance in a family quartet by whole-genome
sequencing. Science 328, 636–639 (2010).
39. Xue, Y. et al. Human Y chromosome base-substitution mutation rate measured by
direct sequencing in a deep-rooting pedigree. Curr. Biol. 19, 1453–1457 (2009).
40. The HIV Controllers Study. The major genetic determinants of HIV-1 control affect HLA
class I peptide presentation. Science 330, 1551–1557 (2010).
41. Ewens, W.J. The sampling theory of selectively neutral alleles. Theor. Popul. Biol. 3,
87–112 (1972).
42. Kimura, M. Molecular evolutionary clock and the neutral theory. J. Mol. Evol. 26,
24–33 (1987).
43. Marth, G.T., Czabarka, E., Murvai, J. & Sherry, S.T. The allele frequency spectrum in
genome-wide human variation data reveals signals of differential demographic
history in three large world populations. Genetics 166, 351–372 (2004).
44. Coventry, A. et al. Deep resequencing reveals excess rare recent variants consistent
with explosive population growth. Nat. Commun. 1, 131 (2010).
45. Li, Y. et al. Resequencing of 200 human exomes identifies an excess of low-frequency non-synonymous coding variants. Nat. Genet. 42, 969–972 (2010).
46. Adzhubei, I.A. et al. A method and server for predicting damaging missense mutations. Nat. Methods 7, 248–249 (2010).
47. Halushka, M.K. et al. Patterns of single-nucleotide polymorphisms in candidate
genes for blood-pressure homeostasis. Nat. Genet. 22, 239–247 (1999).
48. Cargill, M. et al. Characterization of single-nucleotide polymorphisms in coding
regions of human genes. Nat. Genet. 22, 231–238 (1999).
49. Bustamante, C.D. et al. Natural selection on protein-coding genes in the human
genome. Nature 437, 1153–1157 (2005).
50. Sunyaev, S., Ramensky, V. & Bork, P. Towards a structural basis of human nonsynonymous single nucleotide polymorphisms. Trends Genet. 16, 198–200 (2000).
51. Sunyaev, S. et al. Prediction of deleterious human alleles. Hum. Mol. Genet. 10,
591–597 (2001).
52. McKenna, A. et al. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
53. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25,
2078–2079 (2009).
54. DePristo, M.A. et al. A framework for variation discovery and genotyping using nextgeneration DNA sequencing data. Nat. Genet. 43, 491–498 (2011).
55. Hellmann, I. et al. Selection on human genes as revealed by comparisons to chimpanzee cDNA. Genome Res. 13, 831–837 (2003).
56. MacArthur, D.G. & Tyler-Smith, C. Loss-of-function variants in the genomes of healthy
humans. Hum. Mol. Genet. 19, R125–R130 (2010).
57. Purcell, S., Cherny, S.S. & Sham, P.C. Genetic Power Calculator: design of linkage and
association genetic mapping studies of complex traits. Bioinformatics 19, 149–150
(2003).
58. Li, B. & Leal, S.M. Methods for detecting associations with rare variants for common
diseases: application to analysis of sequence data. Am. J. Hum. Genet. 83, 311–321
(2008).
59. Madsen, B.E. & Browning, S.R. A groupwise association test for rare mutations using
a weighted sum statistic. PLoS Genet. 5, e1000384 (2009).
630
60. Liu, D.J. & Leal, S.M. A novel adaptive method for the analysis of next-generation
sequencing data to detect complex trait associations with rare variants due to gene
main effects and interactions. PLoS Genet. 6, e1001156 (2010).
61. Price, A.L. et al. Pooled association tests for rare variants in exon-resequencing studies. Am. J. Hum. Genet. 86, 832–838 (2010).
62. Bansal, V., Libiger, O., Torkamani, A. & Schork, N.J. Statistical analysis strategies for
association studies involving rare variants. Nat. Rev. Genet. 11, 773–785 (2010).
63. Asimit, J. & Zeggini, E. Rare variant association analysis methods for complex traits.
Annu. Rev. Genet. 44, 293–308 (2010).
64. Basu, S. & Pan, W. Comparison of statistical tests for disease association with rare
variants. Genet. Epidemiol. 35, 606–619 (2011).
65. Stitziel, N.O., Kiezun, A. & Sunyaev, S.R. Computational and statistical approaches
to analyzing variants identified by exome sequencing. Genome Biol. 12, 227 (2011).
66. Wu, M.C. et al. Rare variant association testing for sequencing data using the sequence
kernel association test (SKAT). Am. J. Hum. Genet. 89, 82–93 (2011).
67. Neale, B.M. et al. Testing for an unusual distribution of rare variants. PLoS Genet. 7,
e1001322 (2011).
68. Kotowski, I.K. et al. A spectrum of PCSK9 alleles contributes to plasma levels of lowdensity lipoprotein cholesterol. Am. J. Hum. Genet. 78, 410–422 (2006).
69. Hoffmann, T.J., Marini, N.J. & Witte, J.S. Comprehensive approach to analyzing rare
genetic variants. PLoS ONE 5, e13584 (2010).
70. Ionita-Laza, I., Buxbaum, J.D., Laird, N.M. & Lange, C. A new testing strategy to identify
rare variants with either risk or protective effect on disease. PLoS Genet. 7, e1001289
(2011).
71. Tavtigian, S.V. et al. Rare, evolutionarily unlikely missense substitutions in ATM
confer increased risk of breast cancer. Am. J. Hum. Genet. 85, 427–446 (2009).
72. Sul, J.H., Han, B., He, D. & Eskin, E. An optimal weighted aggregated association
test for identification of rare variants involved in common diseases. Genetics 188,
181–188 (2011).
73. Sul, J.H., Han, B. & Eskin, E. Increasing power of groupwise association test with
likelihood ratio test. in Research in Computational Molecular Biology, Lecture Notes
in Computer Science Vol. 6577/2011 452–467 (Springer, Berlin/Heidelberg, 2011).
74. Cooper, G.M. et al. Distribution and intensity of constraint in mammalian genomic
sequence. Genome Res. 15, 901–913 (2005).
75. Cooper, G.M. et al. Single-nucleotide evolutionary constraint scores highlight
disease-causing mutations. Nat. Methods 7, 250–251 (2010).
76. Ng, P.C. & Henikoff, S. Predicting the effects of amino acid substitutions on protein
function. Annu. Rev. Genomics Hum. Genet. 7, 61–80 (2006).
77. Jordan, D.M., Ramensky, V.E. & Sunyaev, S.R. Human allelic variation: perspective from protein function, structure, and evolution. Curr. Opin. Struct. Biol. 20,
342–350 (2010).
78. Thusberg, J., Olatubosun, A. & Vihinen, M. Performance of mutation pathogenicity
prediction methods on missense variants. Hum. Mutat. 32, 358–368 (2011).
79. Cooper, G.M. & Shendure, J. Needles in stacks of needles: finding disease-causing
variants in a wealth of genomic data. Nat. Rev. Genet. 12, 628–640 (2011).
80. Hicks, S., Wheeler, D.A., Plon, S.E. & Kimmel, M. Prediction of missense mutation
functionality depends on both the algorithm and sequence alignment employed.
Hum. Mutat. 32, 661–668 (2011).
81. Stephens, M. & Balding, D.J. Bayesian statistical methods for genetic association
studies. Nat. Rev. Genet. 10, 681–690 (2009).
82. Sladek, R. et al. A genome-wide association study identifies novel risk loci for type
2 diabetes. Nature 445, 881–885 (2007).
83. Saxena, R. et al. Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science 316, 1331–1336 (2007).
84. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000
cases of seven common diseases and 3,000 shared controls. Nature 447, 661–678
(2007).
85. Drmanac, R. et al. Human genome sequencing using unchained base reads on
self-assembling DNA nanoarrays. Science 327, 78–81 (2010).
86. Lipman, P.J. et al. On the follow-up of genome-wide association studies: an overall
test for the most promising SNPs. Genet. Epidemiol. 35, 303–309 (2011).
87. Price, A.L., Zaitlen, N.A., Reich, D. & Patterson, N. New approaches to population
stratification in genome-wide association studies. Nat. Rev. Genet. 11, 459–463
(2010).
88. Pritchard, J.K., Stephens, M. & Donnelly, P. Inference of population structure using
multilocus genotype data. Genetics 155, 945–959 (2000).
89. Price, A.L. et al. Principal components analysis corrects for stratification in
genome-wide association studies. Nat. Genet. 38, 904–909 (2006).
90. Kang, H.M. et al. Variance component model to account for sample structure in
genome-wide association studies. Nat. Genet. 42, 348–354 (2010).
91. Devlin, B. & Roeder, K. Genomic control for association studies. Biometrics 55,
997–1004 (1999).
92. Keinan, A., Mullikin, J.C., Patterson, N. & Reich, D. Measurement of the human
allele frequency spectrum demonstrates greater genetic drift in East Asians than
in Europeans. Nat. Genet. 39, 1251–1255 (2007).
93. Alexander, D.H., Novembre, J. & Lange, K. Fast model-based estimation of ancestry
in unrelated individuals. Genome Res. 19, 1655–1664 (2009).
94. Li, H. & Durbin, R. ast and accurate short read alignment with Burrows Wheeler
transform. Bioinformatics 25, 1754–1760 (2009).
95. Holsinger, K.E. & Weir, B.S. Genetics in geographically structured populations:
defining, estimating and interpreting F . Nat. Rev. Genet. 10, 639–650 (2009).
ST
96. Novembre, J. et al. Genes mirror geography within Europe. Nature 456, 98–101
(2008).
97. Clayton, D.G. et al. Population structure, differential bias and genomic control in
a large-scale, case-control association study. Nat. Genet. 37, 1243–1246 (2005).
VOLUME 44 | NUMBER 6 | JUNE 2012 | NATURE GENETICS