-
Notifications
You must be signed in to change notification settings - Fork 60
/
Copy pathREADME
251 lines (201 loc) · 13.1 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
See ../README for high-level documentation of the entire EIGENSOFT package.
This file contains documentation of EIGENSTRAT programs:
smartpca.perl: run PCA on input genotype data (calls smartpca)
smarteigenstrat.perl: run EIGENSTRAT stratification correction. This program
supports all 5 file formats, and supports quantitative phenotypes.
gc.perl: apply Genomic Control (Devlin and Roeder, 1999) to the
association statistics computed by EIGENSTRAT.
We note that the programs eigenstrat and eigenstratQTL of EIGENSOFT version 2.0
have been replaced by smarteigenstrat.perl. However, we have retained the old
programs for backwards compatibility (see below).
See ./example.perl and ./exampleQTL.perl for toy examples using our programs.
Please note that answers to the following questions can be found at
http://www.hsph.harvard.edu/faculty/alkes-price/files/eigensoftfaq.htm
-- How do I compute principal components using only a subset of populations
and project other populations onto those principal components?
-- Should regions of long-range LD in the genome be removed prior to PCA?
-- How do I compute principal components using only a subset of SNPs but
then run EIGENSTRAT using the full set of SNPs?
------------------------------------------------------------------------
DOCUMENTATION of smartpca.perl program:
This program calls the smartpca program (see ../POPGEN/README).
For this to work, the bin directory containing smartpca MUST be in your path.
See ./example.perl for a toy example.
../bin/smartpca.perl
-i example.geno : genotype file in any format (see ../CONVERTF/README)
-a example.snp : snp file in any format (see ../CONVERTF/README)
-b example.ind : indiv file in any format (see ../CONVERTF/README)
-k k : (Default is 10) number of principal components to output
-o example.pca : output file of principal components. Individuals removed
as outliers will have all values set to 0.0 in this file.
-p example.plot : prefix of output plot files of top 2 principal components.
(labeling individuals according to labels in indiv file)
-e example.eval : output file of all eigenvalues
-l example.log : output logfile
-m maxiter : (Default is 5) maximum number of outlier removal iterations.
To turn off outlier removal, set -m 0.
-t topk : (Default is 10) number of principal components along which
to remove outliers during each outlier removal iteration.
-s sigma : (Default is 6.0) number of standard deviations which an
individual must exceed, along one of topk top principal
components, in order to be removed as an outlier.
OPTIONAL FLAGS:
-w poplist : compute eigenvectors using populations in poplist only,
where poplist is an ASCII file with one population per line
-y plotlist : output plot will include populations in plotlist only,
where plotlist is an ASCII file with one population per line
-z badsnpname : list of SNPs which should be excluded from the analysis
-q YES/NO : If set to YES, assume that there is a single population and
the population field contains real-valued phenotypes.
(Corresponds to qtmode parameter in smartpca program.)
The default value for this parameter is NO.
Estimated running time of the smartpca program is
2.5e-12 * nSNP * NSAMPLES^2 hours if not removing outliers.
2.5e-12 * nSNP * NSAMPLES^2 hours * (1+m) if m outlier removal iterations.
Thus, under the default of up to 5 outlier removal iterations, running time is
up to 1.5e-11 * nSNP * NSAMPLES^2 hours.
Recommendation: we advise after running pca to check for large correlations
between each principal component and all variables of interest. For example,
large correlations with % missing data (per sample) could imply assay issues
large correlations with plate membership could imply assay issues
large correlations with phenotype indicate highly mismatched cases vs. controls
which will lead to a loss of power upon applying eigenstrat correction.
If input indiv file contains Case and Control labels only, then
correlations between each principal component and Case/Control status will be
listed at end of output logfile (-l flag).
------------------------------------------------------------------------
DOCUMENTATION of smarteigenstrat.perl program: [run smartpca.perl program first]
This program is a PERL wrapper which calls the C program smarteigenstrat.
Note: the bin directory containing smarteigenstrat MUST be in your path.
See ./example.perl for a toy example.
We recommend smarteigenstrat.perl for users who prefer command-line flags.
However, users who prefer parameter files can run smarteigenstrat instead.
../bin/smarteigenstrat.perl
-i example.geno : genotype file in any format (see ../CONVERTF/README)
-a example.snp : snp file in any format (see ../CONVERTF/README)
-b example.ind : individual file in any format (see ../CONVERTF/README).
We note that phenotype information will be contained in example.ind,
either as Case/Control labels or quantitative phenotypes if -q set to YES.
-q YES/NO : If set to YES, use quantitative phenotypes in example.ind.
If -q is set to YES, the third column of the input individual file
in EIGENSTRAT format (or sixth column of input individual file in PED format)
should be real numbers. The value -100.0 signifies "missing data".
If -q is set to NO, these values should be "Case" or "Control".
The default value for the -q parameter is NO.
-p example.pca : input file of principal components (output of smartpca.perl)
-k 1 : (Default is 10) number of principal components along which to
correct for stratification. Note that l must be less than or equal to
the number of principal components reported in the file example.pca.
-o example.chisq : chisq association statistics. File contains log of
flags to eigenstrat program, followed by one line per SNP:
The first entry of each line is Armitage chisq statistic (Armitage, 1955)
defined as NSAMPLES x (correlation between genotype and phenotype)^2.
If the set of individuals with genotype and phenotype both valid
is monomorphic for either genotype or phenotype, t
C306
hen NA is reported.
The second entry of each line is the EIGENSTRAT chisq statistic, defined as
(NSAMPLES-l-1) x (corr between adjusted_genotype and adjusted_phenotype)^2.
If the set of individuals with genotype and phenotype both valid
is monomorphic for either genotype or phenotype, then NA is reported.
Note: even if l=0, there is a tiny difference between the two statistics
because Armitage uses NSAMPLES while we use NSAMPLES-1, which we
consider to be appropriate.
-l example.log : standard output file
The running time of smarteigenstrat.perl is very fast compared to
the running time of smartpca.perl.
------------------------------------------------------------------------
DOCUMENTATION of smarteigenstrat program:
Users who prefer parameter files to command-line flags can run the
C program smarteigenstrat instead of the PERL wrapper smarteigenstrat.perl.
The syntax of smarteigenstrat is "../bin/smarteigenstrat -p parfile"
DESCRIPTION OF EACH PARAMETER in parameter file for smarteigenstrat:
genotypename: input genotype file
snpname: input snp file
indivname: input indiv file
pcaname: input pca file
outputname: name of output file of chisq association statistics
numeigs: number of principal components to correct for
qtmode: YES for quantitative phenotype, NO (default) otherwise
For details, see documentation of smarteigenstrat.perl above.
OPTIONAL PARAMETERS:
hashcheck: If set to YES and the input genotype file is in PACKEDANCESTRYMAP
format, check the hash stored inside the file to make sure that individual
and SNP files have not changed since the file was made. If they have, then
exit in error. The default value for this parameter is YES.
------------------------------------------------------------------------
DOCUMENTATION of gc.perl: [run smartpca.perl & smarteigenstrat.perl first]
../bin/gc.perl infile outfile
infile is input file of chisq statistics produced by eigenstrat program.
It contains both uncorrected and EIGENSTRAT statistics for each SNP.
outfile is output file. It lists
lambda inflation values (for both uncorrected and EIGENSTRAT)
chisq statistics after scaling by lambda (uncorrected and EIGENSTRAT)
Computation of lambda is as described in Devlin and Roeder 1999.
A lambda above 1 indicates inflation in chisq statistics.
By definition, lambda is not allowed to be less than 1.
Running time of the gc.perl program is very fast.
------------------------------------------------------------------------
STATISTICAL SIGNFICANCE of each principal component: twstats program
See ../POPGEN/README for documentation of twstats program to compute
Tracy-Widom statistics (statistical significance of each principal component).
------------------------------------------------------------------------
BACKWARDS COMPATIBILITY with EIGENSTRAT version 2.0: eigenstrat program
For backwards compatibility with EIGENSTRAT version 2.0, we have also included
our old program eigenstrat (for Case/Control phenotypes), as well as
our old program eigenstratQTL (for quantitative phenotypes) from that release,
which have now been replaced by our new program smarteigenstratQTL.perl.
See ./example.oldstyle.perl for an example involving the eigenstrat program.
Most users will want to use our new program smarteigenstrat.perl, which has
added functionality. However, users wishing to understand or modify our
source code may find it advantageous to instead work with the simpler
eigenstrat programs.
------------------------------------------------------------------------
BACKWARDS COMPATIBILITY with 07/23/06 EIGENSTRAT release: pca program
For backwards compatibility with the 07/23/06 EIGENSTRAT release, we have also
included our old program pca used in that release, which has now been replaced
by our new program smartpca.perl. See ./example.oldstyle.perl for an example
involving the pca program.
Most users will want to use our new program smartpca.perl, which calls
the smartpca program and has added functionality. However, users wishing
to understand or modify our source code may find it advantageous to instead
work with the simpler pca program.
------------------------------------------------------------------------
TREATMENT OF X CHROMOSOME
We recommend excluding X chr genotypes when computing principal components,
as the different variance for males complicates proper theoretical treatment.
Excluding X chr genotypes is the smartpca default (see ../POPGEN/README).
We comment that including or excluding X chromosome genotypes has little effect
on top principal components in dense genome-scan data sets we have analyzed.
When computing association statistics using the eigenstrat program, the
default is to include X chr genotypes. Assuming the usual data convention of
reporting male X chr genotypes as homozygotes (instead of hemizygotes),
this corresponds to a model in which male hemizygotes have the same disease
risk as female homozygotes. Investigators should consider carefully whether
this is their desired X chr risk model. (When correcting for stratification,
we believe it is reasonable to use principal components inferred using
autosomal data, as autosomal ancestry and X chr ancestry are likely to be
very strongly correlated, even if percentages are different in absolute terms).
------------------------------------------------------------------------
RUNNING EIGENSTRAT WITH LESS THAN 100,000 MARKERS
The EIGENSTRAT method is primarily aimed at genome-scan data sets with
at least 100,000 markers. If running EIGENSTRAT on much smaller data sets,
the inclusion of a candidate marker in the set of markers used to infer
principal components may lead to a loss in power (Price et al. 2006 Supp Note).
Thus, if running on much smaller data sets, it is necessary to exclude a
candidate marker from the set of markers used to infer principal components
used to correct for stratification at the candidate marker. In the case of
a data set which uses ancestry-informative markers to infer ancestry, a
good way to do this is to run smartpca.perl to infer principal components
*only* using ancestry-informative markers, excluding the candidate markers
(and excluding any ancestry-informative marker in LD with a candidate marker),
and then run eigenstrat on candidate markers.
------------------------------------------------------------------------
Questions?
See http://www.hsph.harvard.edu/faculty/alkes-price/files/eigensoftfaq.htm
or email Samuela Pollack, spollack@hsph.harvard.edu
SOFTWARE COPYRIGHT NOTICE AGREEMENT
This software and its documentation are copyright (2010) by Harvard University
and The Broad Institute. All rights are reserved. This software is supplied
without any warranty or guaranteed support whatsoever. Neither Harvard
University nor The Broad Institute can be responsible for its use, misuse, or
functionality. The software may be freely copied for non-commercial purposes,
provided this copyright notice is retained.