EIG/EIGENSTRAT/README at master · DReichLab/EIG

History

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176

177

178

179

180

181

182

183

184

185

186

187

188

189

190

191

192

193

194

195

196

197

198

199

200

201

202

203

204

205

206

207

208

209

210

211

212

213

214

215

216

217

218

219

220

221

222

223

224

225

226

227

228

229

230

231

232

233

234

235

236

237

238

239

240

241

242

243

244

245

246

247

248

249

250

251

See ../README for high-level documentation of the entire EIGENSOFT package.

This file contains documentation of EIGENSTRAT programs:

smartpca.perl: run PCA on input genotype data (calls smartpca)

smarteigenstrat.perl: run EIGENSTRAT stratification correction. This program

supports all 5 file formats, and supports quantitative phenotypes.

gc.perl: apply Genomic Control (Devlin and Roeder, 1999) to the

association statistics computed by EIGENSTRAT.

We note that the programs eigenstrat and eigenstratQTL of EIGENSOFT version 2.0

have been replaced by smarteigenstrat.perl. However, we have retained the old

programs for backwards compatibility (see below).

See ./example.perl and ./exampleQTL.perl for toy examples using our programs.

Please note that answers to the following questions can be found at

http://www.hsph.harvard.edu/faculty/alkes-price/files/eigensoftfaq.htm

-- How do I compute principal components using only a subset of populations

and project other populations onto those principal components?

-- Should regions of long-range LD in the genome be removed prior to PCA?

-- How do I compute principal components using only a subset of SNPs but

then run EIGENSTRAT using the full set of SNPs?

------------------------------------------------------------------------

DOCUMENTATION of smartpca.perl program:

This program calls the smartpca program (see ../POPGEN/README).

For this to work, the bin directory containing smartpca MUST be in your path.

See ./example.perl for a toy example.

../bin/smartpca.perl

-i example.geno : genotype file in any format (see ../CONVERTF/README)

-a example.snp : snp file in any format (see ../CONVERTF/README)

-b example.ind : indiv file in any format (see ../CONVERTF/README)

-k k : (Default is 10) number of principal components to output

-o example.pca : output file of principal components. Individuals removed

as outliers will have all values set to 0.0 in this file.

-p example.plot : prefix of output plot files of top 2 principal components.

(labeling individuals according to labels in indiv file)

-e example.eval : output file of all eigenvalues

-l example.log : output logfile

-m maxiter : (Default is 5) maximum number of outlier removal iterations.

To turn off outlier removal, set -m 0.

-t topk : (Default is 10) number of principal components along which

to remove outliers during each outlier removal iteration.

-s sigma : (Default is 6.0) number of standard deviations which an

individual must exceed, along one of topk top principal

components, in order to be removed as an outlier.

OPTIONAL FLAGS:

-w poplist : compute eigenvectors using populations in poplist only,

where poplist is an ASCII file with one population per line

-y plotlist : output plot will include populations in plotlist only,

where plotlist is an ASCII file with one population per line

-z badsnpname : list of SNPs which should be excluded from the analysis

-q YES/NO : If set to YES, assume that there is a single population and

the population field contains real-valued phenotypes.

(Corresponds to qtmode parameter in smartpca program.)

The default value for this parameter is NO.

Estimated running time of the smartpca program is

2.5e-12 * nSNP * NSAMPLES^2 hours if not removing outliers.

2.5e-12 * nSNP * NSAMPLES^2 hours * (1+m) if m outlier removal iterations.

Thus, under the default of up to 5 outlier removal iterations, running time is

up to 1.5e-11 * nSNP * NSAMPLES^2 hours.

Recommendation: we advise after running pca to check for large correlations

between each principal component and all variables of interest. For example,

large correlations with % missing data (per sample) could imply assay issues

large correlations with plate membership could imply assay issues

large correlations with phenotype indicate highly mismatched cases vs. controls

which will lead to a loss of power upon applying eigenstrat correction.

If input indiv file contains Case and Control labels only, then

correlations between each principal component and Case/Control status will be

listed at end of output logfile (-l flag).

------------------------------------------------------------------------

DOCUMENTATION of smarteigenstrat.perl program: [run smartpca.perl program first]

This program is a PERL wrapper which calls the C program smarteigenstrat.

Note: the bin directory containing smarteigenstrat MUST be in your path.

See ./example.perl for a toy example.

We recommend smarteigenstrat.perl for users who prefer command-line flags.

However, users who prefer parameter files can run smarteigenstrat instead.

../bin/smarteigenstrat.perl

-i example.geno : genotype file in any format (see ../CONVERTF/README)

-a example.snp : snp file in any format (see ../CONVERTF/README)

-b example.ind : individual file in any format (see ../CONVERTF/README).

We note that phenotype information will be contained in example.ind,

either as Case/Control labels or quantitative phenotypes if -q set to YES.

-q YES/NO : If set to YES, use quantitative phenotypes in example.ind.

If -q is set to YES, the third column of the input individual file

in EIGENSTRAT format (or sixth column of input individual file in PED format)

should be real numbers. The value -100.0 signifies "missing data".

If -q is set to NO, these values should be "Case" or "Control".

The default value for the -q parameter is NO.

-p example.pca : input file of principal components (output of smartpca.perl)

-k 1 : (Default is 10) number of principal components along which to

correct for stratification. Note that l must be less than or equal to

the number of principal components reported in the file example.pca.

-o example.chisq : chisq association statistics. File contains log of

flags to eigenstrat program, followed by one line per SNP:

The first entry of each line is Armitage chisq statistic (Armitage, 1955)

defined as NSAMPLES x (correlation between genotype and phenotype)^2.

If the set of individuals with genotype and phenotype both valid

is monomorphic for either genotype or phenotype, t C306 hen NA is reported.

The second entry of each line is the EIGENSTRAT chisq statistic, defined as

(NSAMPLES-l-1) x (corr between adjusted_genotype and adjusted_phenotype)^2.

If the set of individuals with genotype and phenotype both valid

is monomorphic for either genotype or phenotype, then NA is reported.

Note: even if l=0, there is a tiny difference between the two statistics

because Armitage uses NSAMPLES while we use NSAMPLES-1, which we

consider to be appropriate.

-l example.log : standard output file

The running time of smarteigenstrat.perl is very fast compared to

the running time of smartpca.perl.

------------------------------------------------------------------------

DOCUMENTATION of smarteigenstrat program:

Users who prefer parameter files to command-line flags can run the

C program smarteigenstrat instead of the PERL wrapper smarteigenstrat.perl.

The syntax of smarteigenstrat is "../bin/smarteigenstrat -p parfile"

DESCRIPTION OF EACH PARAMETER in parameter file for smarteigenstrat:

genotypename: input genotype file

snpname: input snp file

indivname: input indiv file

pcaname: input pca file

outputname: name of output file of chisq association statistics

numeigs: number of principal components to correct for

qtmode: YES for quantitative phenotype, NO (default) otherwise

For details, see documentation of smarteigenstrat.perl above.

OPTIONAL PARAMETERS:

hashcheck: If set to YES and the input genotype file is in PACKEDANCESTRYMAP

format, check the hash stored inside the file to make sure that individual

and SNP files have not changed since the file was made. If they have, then

exit in error. The default value for this parameter is YES.

------------------------------------------------------------------------

DOCUMENTATION of gc.perl: [run smartpca.perl & smarteigenstrat.perl first]

../bin/gc.perl infile outfile

infile is input file of chisq statistics produced by eigenstrat program.

It contains both uncorrected and EIGENSTRAT statistics for each SNP.

outfile is output file. It lists

lambda inflation values (for both uncorrected and EIGENSTRAT)

chisq statistics after scaling by lambda (uncorrected and EIGENSTRAT)

Computation of lambda is as described in Devlin and Roeder 1999.

A lambda above 1 indicates inflation in chisq statistics.

By definition, lambda is not allowed to be less than 1.

Running time of the gc.perl program is very fast.

------------------------------------------------------------------------

STATISTICAL SIGNFICANCE of each principal component: twstats program

See ../POPGEN/README for documentation of twstats program to compute

Tracy-Widom statistics (statistical significance of each principal component).

------------------------------------------------------------------------

BACKWARDS COMPATIBILITY with EIGENSTRAT version 2.0: eigenstrat program

For backwards compatibility with EIGENSTRAT version 2.0, we have also included

our old program eigenstrat (for Case/Control phenotypes), as well as

our old program eigenstratQTL (for quantitative phenotypes) from that release,

which have now been replaced by our new program smarteigenstratQTL.perl.

See ./example.oldstyle.perl for an example involving the eigenstrat program.

Most users will want to use our new program smarteigenstrat.perl, which has

added functionality. However, users wishing to understand or modify our

source code may find it advantageous to instead work with the simpler

eigenstrat programs.

------------------------------------------------------------------------

BACKWARDS COMPATIBILITY with 07/23/06 EIGENSTRAT release: pca program

For backwards compatibility with the 07/23/06 EIGENSTRAT release, we have also

included our old program pca used in that release, which has now been replaced

by our new program smartpca.perl. See ./example.oldstyle.perl for an example

involving the pca program.

Most users will want to use our new program smartpca.perl, which calls

the smartpca program and has added functionality. However, users wishing

to understand or modify our source code may find it advantageous to instead

work with the simpler pca program.

------------------------------------------------------------------------

TREATMENT OF X CHROMOSOME

We recommend excluding X chr genotypes when computing principal components,

as the different variance for males complicates proper theoretical treatment.

Excluding X chr genotypes is the smartpca default (see ../POPGEN/README).

We comment that including or excluding X chromosome genotypes has little effect

on top principal components in dense genome-scan data sets we have analyzed.

When computing association statistics using the eigenstrat program, the

default is to include X chr genotypes. Assuming the usual data convention of

reporting male X chr genotypes as homozygotes (instead of hemizygotes),

this corresponds to a model in which male hemizygotes have the same disease

risk as female homozygotes. Investigators should consider carefully whether

this is their desired X chr risk model. (When correcting for stratification,

we believe it is reasonable to use principal components inferred using

autosomal data, as autosomal ancestry and X chr ancestry are likely to be

very strongly correlated, even if percentages are different in absolute terms).

------------------------------------------------------------------------

RUNNING EIGENSTRAT WITH LESS THAN 100,000 MARKERS

The EIGENSTRAT method is primarily aimed at genome-scan data sets with

at least 100,000 markers. If running EIGENSTRAT on much smaller data sets,

the inclusion of a candidate marker in the set of markers used to infer

principal components may lead to a loss in power (Price et al. 2006 Supp Note).

Thus, if running on much smaller data sets, it is necessary to exclude a

candidate marker from the set of markers used to infer principal components

used to correct for stratification at the candidate marker. In the case of

a data set which uses ancestry-informative markers to infer ancestry, a

good way to do this is to run smartpca.perl to infer principal components

*only* using ancestry-informative markers, excluding the candidate markers

(and excluding any ancestry-informative marker in LD with a candidate marker),

and then run eigenstrat on candidate markers.

------------------------------------------------------------------------

Questions?

See http://www.hsph.harvard.edu/faculty/alkes-price/files/eigensoftfaq.htm

or email Samuela Pollack, spollack@hsph.harvard.edu

SOFTWARE COPYRIGHT NOTICE AGREEMENT

and The Broad Institute. All rights are reserved. This software is supplied

without any warranty or guaranteed support whatsoever. Neither Harvard

University nor The Broad Institute can be responsible for its use, misuse, or

functionality. The software may be freely copied for non-commercial purposes,

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README

README

Files

README

Latest commit

History

README

File metadata and controls