Downloaded from genome.cshlp.org on June 6, 2020 - Published by Cold Spring Harbor Laboratory Press
Resource
novoSNP, a novel computational tool for sequence
variation discovery
Stefan Weckx, Jurgen Del-Favero,1 Rosa Rademakers, Lieve Claes, Marc Cruts,
Peter De Jonghe, Christine Van Broeckhoven, and Peter De Rijk
Department of Molecular Genetics, Flanders Interuniversity Institute for Biotechnology, University of Antwerp, Antwerpen, Belgium
Technological improvements shifted sequencing from low-throughput, work-intensive, gel-based systems to
high-throughput capillary systems. This resulted in a broad use of genomic resequencing to identify sequence
variations in genes and regulatory, as well as extended genomic regions. We describe a software package, novoSNP,
that conscientiously discovers single nucleotide polymorphisms (SNPs) and insertion–deletion polymorphisms
(INDELs) in sequence trace files in a fast, reliable, and user-friendly way. We compared the performance of novoSNP
with that of PolyPhred and PolyBayes on two data sets. The first data set comprised 1028 sequence trace files
obtained from diagnostic mutation analyses of SCN1A (neuronal voltage-gated sodium channel ␣-subunit type I gene).
The second data set comprised 9062 sequence trace files from a genomic resequencing project aiming at the
construction of a high-density SNP map of MAPT (microtubule-associated protein tau gene). Visual inspection of
these data sets had identified 38 sequence variations for SCN1A and 488 for MAPT. novoSNP automatically identified
all 38 SCN1A variations including five INDELs, while for MAPT only 15 of the 488 variations were not correctly
marked. PolyPhred detected far fewer SNPs as compared to novoSNP and missed nearly all INDELs. PolyBayes,
designed for the sequence analysis of cloned templates, detected only a limited number of the variations present in
the data set. Besides the significant improvement in the automated detection of sequence variations both in
diagnostic mutation analyses and in SNP discovery projects, novoSNP also offers a user-friendly interface for
inspecting possible genetic variations.
[novoSNP is freely available online at http://www.molgen.ua.ac.be/bioinfo/novosnp.]
With the human and numerous other eukaryotic genome sequences finished (The C. elegans Sequencing Consortium 1998;
Adams et al. 2000; Lander et al. 2001; Venter et al. 2001; Waterston et al. 2002), and other genome-wide sequencing efforts ongoing, researchers can now fully explore these genomes. Mapping genetic differences between individuals is one of the major
challenges in the post-genome era, which will provide valuable
information about the quality of life of human beings. Discovery
of these sequence variations by resequencing of a genomic region
in a set of individuals is considered the golden standard (Kwok et
al. 1994).
Those sequence variations are mostly mutations in coding
or regulatory regions of transcription units, potentially related to
a phenotypic trait or disease, or single nucleotide polymorphisms
(SNPs) that can be used as markers for genetic associations studies, for fine mapping candidate regions based on linkage disequilibrium, or in pharmacogenetics aiming at genetic profiling patients for drug response and/or side effects. Since SNPs occur on
average once every 1000 bp (Sachidanandam et al. 2001), they
offer a higher marker density compared to short tandem repeat
(STR) markers and allow high-throughput automated analysis.
Therefore, SNPs are rapidly becoming the genetic markers of
choice, especially in the search for genetic factors involved in
complex diseases or traits, and in pharmacogenetics. A highdensity SNP map of a gene or a chromosomal candidate region
can be constructed using data from public SNP databases (Sherry
1
Corresponding author.
E-mail jurgen.delfavero@ua.ac.be; fax 32 3 820 2541.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/
gr.2754005.
436
Genome Research
www.genome.org
et al. 2001; Fredman et al. 2004). However, since the number of
validated SNPs is often still limited and marker density is not
always sufficiently high, additional sequencing efforts are often
needed to saturate a gene or candidate chromosomal region with
SNPs.
As sequencing has evolved from low-throughput, workintensive, gel-based systems to high-throughput capillary systems, data analysis is becoming a major bottleneck in a resequencing approach. Several sequence-variation-finding programs like PolyPhred (Nickerson et al. 1997) and PolyBayes
(Marth et al. 1999) are available. However, these programs have
shown limitations regarding correct SNP and/or INDEL discovery. Therefore, we developed novoSNP, providing a fast, reliable,
and accurate strategy for the discovery of SNPs and INDELs from
sequence trace files obtained from large-scale genomic resequencing projects of genes or candidate chromosomal regions.
Results
novoSNP-based automated sequence variation discovery is a
straightforward process that can be divided into two major steps:
detection and validation. Both steps are supported from within
an intuitive graphical user interface.
Initiating a project
The first step in initiating a new novoSNP project is the creation
of a single file SQLite database and the addition of a reference
sequence in FASTA format. Next, one or more sets of sequence
trace files can be added. Each set consists of forward and/or reverse sequence trace files from a region enclosed in the reference
15:436–442 ©2005 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/05; www.genome.org
Downloaded from genome.cshlp.org on June 6, 2020 - Published by Cold Spring Harbor Laboratory Press
novoSNP, a tool for sequence variation discovery
sequence. Adding sequence trace files to the project automatically initiates SNP and INDEL detection according to the strategy
described in the Methods section. All resulting scores are stored
in the SQLite database.
The graphical user interface at a glance
The graphical user interface supports the validation step and consists of three frames and a toolbar with buttons for the most
frequently used functions (Fig. 1). The sequence traces are visualized in the main frame (Fig. 1A). The reference sequence and
the list of all sequence traces in the project are shown in a second
frame (Fig. 1B). All identified sequence variations can be retrieved
from the database and are presented in the variation display list
(Fig. 1C). By clicking on a sequence variation from this list, the
sequence trace views are shown in the main window with the
variation highlighted in the center of the frame (Fig. 1A). To
allow efficient validation of sequencing data, the software clusters similar sequence traces using a greedy algorithm, displaying
one sequence trace for each group (Fig. 1D). Other sequence
traces from within a group can be displayed by selecting the trace
name in the list next to the displayed read (Fig. 1D). Parameter
setting for sequence trace grouping can be adapted, and setting it
to zero will display all traces separately.
The variation display list can be filtered and sorted in several
ways: by variation type, by minimal and maximal overall score,
by status of validation, and by start and end base position (Fig.
1E). During visual inspection, sequence variations can be annotated as approved, rejected, or uncertain with instant updating of
the database (Fig. 1C).
To view the alignment of all sequences, an alignment window can be opened. The columns are color-coded according to
the variation score, ranging from a white background representing a low score to red for a high overall score. At any time during
the validation process, it is possible to exclude or include one or
more sequence trace files and to reanalyze the remaining traces.
Also, the structure of the file name can be adapted, allowing
novoSNP to determine which forward and reverse reads originated from the same DNA sample. The variation data can be
exported in text format, ordered by position or by sequence trace
filename.
Figure 1. The graphical user interface. (A) The main frame, displaying trace files centered on a T/C SNP. (B) Window displaying an overview of the
files in this project. (C) Window displaying the potential variations. The three checkboxes indicate whether a variation is approved (ok, +), rejected (not
ok, ⳮ), or uncertain (?). (D) Sequence file names of the clustered sequence trace files. (E) The multifunctional toolbar.
Genome Research
www.genome.org
437
Downloaded from genome.cshlp.org on June 6, 2020 - Published by Cold Spring Harbor Laboratory Press
Weckx et al.
Study of the performance of novoSNP, PolyPhred,
and PolyBayes
The performance of novoSNP, PolyPhred, and PolyBayes was
compared on two data sets, representing the two main resequencing approaches. A large data set, containing 9062 sequence
trace files covering a 140-kb genomic region—the MAPT data
set—illustrates a large-scale SNP discovery project with the typical tradeoff between throughput and the number of identified
variations. A smaller data set, containing 1028 sequence trace
files—the SCN1A data set—represents a typical gene mutation
analysis project, requiring an extensive optimization of the resequencing process in order to ensure detection of all variations
(Claes et al. 2003; Rademakers et al. 2004).
All three programs provide a quality score for each detected
variation. Depending on the quality score cutoff used, several
SNPs are detected for each program (Table 1). At the lowestquality cutoff score, novoSNP detected all 38 variations in the
SCN1A data set that were previously observed by visual inspection, including five INDELs, and missed only 10 out of 452
known SNPs (2.2%) and five out of 36 INDELs in the MAPT data
set (Table 1). PolyPhred found all but three of the SNPs in the
SCN1A data set at the lowest cutoff, but missed all five INDELs
(Table 1A) while listing more false-positive INDELs (23) than
novoSNP (nine). PolyPhred analysis of the MAPT data set showed
that a large number of SNPs (172, or 38.1%) were not detected
(Table 1B) and also that only two of 36 INDELs were correctly
identified, while the number of false-positive INDELs (101) was
again higher compared to novoSNP (63). PolyBayes was included
in this comparative analysis as it is often used for SNP discovery.
However, it was designed to handle sequencing data generated
from cloned DNA templates (Marth et al. 1999) and is therefore
unable to detect heterozygous bases and/or INDELs. Because of
these limitations, PolyBayes identified only a small percentage of
the SNPs in the SCN1A data set (54.5%) and the MAPT data set
(31%) (Table 1). An overall comparison of the true SNPs and false
positives (FP) detected by the three programs is represented as a
Venn diagram in Figure 2. Clearly, novoSNP detected most SNPs
for both data sets. However, PolyPhred detected three of the 10
SNPs missed by novoSNP. Somewhat surprisingly, most of the
false positives were not shared between the different programs
but were program-specific.
The use of low-quality cutoff values resulted in a large number of false positives for all three programs (Table 1). Using
higher-quality cutoffs, at the expense of detecting less true variations, diminished the number of false positives. Only a small
number of false positives remained when novoSNP was used with
a high-quality cutoff of 20, while PolyPhred returned a substantially larger number of false positives (ranging from a factor 10 to
100 compared to novoSNP) with the highest-quality cutoff of 99
Table 1. Output summary of the novoSNP, PolyPhred, and PolyBayes SNP analysis on the SCN1A
mutation and MAPT SNP data sets analyzed under different quality cutoff values
Quality
cutoff
A. SCN1A
novoSNP
PolyPhred
PolyBayes
B. MAPT
novoSNP
PolyPhred
PolyBayes
Total number
of SNPs
Correctly
identified
10
15
20
25
447
122
36
26
33
32
26
22
414
90
10
4
92.6%
73.8%
27.8%
15.4%
0
1
7
11
0.0%
3.0%
21.2%
33.3%
20
25
50
75
95
99
586
510
347
254
208
189
30
30
30
30
30
26
556
480
317
224
178
163
94.9%
94.1%
91.4%
88.2%
85.6%
86.2%
3
3
3
3
3
7
9.1%
9.1%
9.1%
9.1%
9.1%
21.2%
54
46
37
33
18
17
16
16
36
29
21
17
66.7%
63.0%
56.8%
51.5%
15
16
17
17
45.5%
48.5%
51.5%
51.5%
5
10
15
20
25
3424
1146
484
251
206
442
421
377
244
203
2982
725
107
7
3
87.1%
63.3%
22.1%
2.8%
1.5%
10
31
75
208
249
2.2%
6.9%
16.6%
46.0%
55.1%
20
25
50
75
95
99
2637
2510
2243
1892
1677
1572
280
280
271
252
207
175
2357
2230
1972
1640
1470
1397
89.4%
88.8%
87.9%
86.7%
87.7%
88.9%
172
172
181
200
245
277
38.1%
38.1%
40.0%
44.2%
54.2%
61.3%
991
830
672
567
140
136
126
115
851
694
546
452
85.9%
83.6%
81.2%
79.7%
312
316
326
337
69.0%
69.9%
72.1%
74.6%
0.1
0.25
0.5
0.75
0.1
0.25
0.5
0.75
False positives
False negatives
For the SCN1A data set, the lowest novoSNP shown cutoff is 10 since all SNPs were found at this cutoff value.
438
Genome Research
www.genome.org
Downloaded from genome.cshlp.org on June 6, 2020 - Published by Cold Spring Harbor Laboratory Press
novoSNP, a tool for sequence variation discovery
Figure 2. Venn diagrams representing the results of novoSNP, PolyPhred, and PolyBayes and their respective overlaps. The numbers represent the amount of true (TP) and false-positive (FP) SNPs detected by
each program, regardless of their quality scores. (A) The SCN1A data. (B)
The MAPT data.
(Table 1). Even at a quality cutoff of 15, novoSNP detected considerably more SNPs compared to the lowest-quality cutoff for
PolyPhred, with a lower false-positive rate than PolyPhred at the
highest possible quality (Table 1).
To reject the possibility that the high-quality false positives
were, in fact, false negatives from the visual inspection, all highscoring false positives of novoSNP (quality > 20) and a random
sample of 110 highest-scoring PolyPhred false positives (quality = 99) were manually checked, confirming all as true false positives.
Discussion
We developed a software package novoSNP that allows automated detection of sequence variations from sequence trace files
in a fast and reliable manner. novoSNP runs on computers with
Linux or Windows as operating system. In contrast to assemblyoriented display programs, novoSNP offers a variation-oriented
visualization. Important assets of novoSNP over existing variation detection software are: its high rate of correctly identified
INDEL polymorphisms, low number of false negatives, availability of an intuitive graphical user interface, and its flexibility in
use resulting from the backend database.
Applying novoSNP on a total of 10,090 sequence trace files
showed that 511 of the 526 variations identified by visual inspection were detected. This in-house-generated large data set was
composed of two independent data sets exemplifying two differ-
ent approaches: a mutation analysis and a SNP discovery data set.
The mutation data set was derived from the exon-based mutation
analysis of SCN1A in 15 patients within a DNA diagnostic context (Claes et al. 2003). The SNP discovery data set was obtained
from the genomic resequencing of 140 kb containing MAPT in 23
individuals aiming at generating a high-density SNP map for genetic association studies (Rademakers et al. 2004). Applying
novoSNP on the SCN1A mutation data set, all 38 variations observed by visual inspection were detected including five INDELs,
using a quality score cutoff of 10. Applying novoSNP, with a
quality cutoff of 5, on the MAPT data set showed that only 15
variations out of 488 (3.1%) were missed, including five INDELs.
Using the same cutoff value for both data sets, the percentage of
false negatives was always significantly lower for the mutation
analysis data set. The other two programs used in this study
showed a similar difference in false-negative rate between the
data sets (Table 1).
This difference in variation discovery success rate can be
explained by the initial scope of the two data sets. Since in DNA
diagnosis pathogenic mutations cannot be missed, a rigorous
screening is required using more extensive sequence coverage
and validated/optimized primer sets, resulting in high-quality
sequence trace files. Genomic resequencing, on the other hand,
does not necessarily require detection of all SNPs, and such
project does not always allow time for optimizing all primer sets
for high-quality sequencing and sequence coverage. Indeed, inspection of the unidentified novoSNP variations showed that
these were typically positioned near the ends of sequence traces
and/or in regions that were quality trimmed by novoSNP.
Performance of PolyPhred and PolyBayes on the same data
set showed that even at the lowest-quality cutoffs, both programs
missed a large number of SNPs: 175 and 327, respectively (Table 1).
The significantly better performance of novoSNP can be explained by the use of a cumulative scoring scheme that independently examined different variation characteristics. With this approach, variations that have a low score for one characteristic
could still be scored if they had a high score for the other characteristics. novoSNP also excelled in the detection of INDELs
missing only five out of 41 INDELs in the complete data set. Since
PolyBayes does not support INDEL detection, it obviously could
not find these. The INDEL detection feature in PolyPhred missed
all but two INDELs. Furthermore, novoSNP is not only able to
efficiently detect INDELs but also provides the user with the correct sequence of the INDEL.
A high false-positive rate was observed for all three programs
used in this study (Table 1; Fig. 2). This is not surprising because
the false-positive rate is directly correlated with the overall quality of the sequence traces and especially background noise, and
thus is inherent to the discovery methods underlying these software programs. One way to reduce the false-positive rate could be
the application of a more consistent selection of PCR and sequencing primers as we did in this study by using the highthroughput primer design program SNPbox (Weckx et al. 2004,
2005). Another way is by relying on the quality scores assigned
to the SNP. Indeed, the results presented here showed that the
quality score given by novoSNP is a reliable measure of the correctness of the SNP (Table 1). Using a relatively low cutoff score
of 10, 97.9% of the true SNPs were found in the combined data
sets, but 87.7% of the listed SNPs were false positives. Using
higher cutoff values, the number of true variations decreased
to 84.3% for a cutoff score of 15, and to 55.7% for a score of 20.
Genome Research
www.genome.org
439
Downloaded from genome.cshlp.org on June 6, 2020 - Published by Cold Spring Harbor Laboratory Press
Weckx et al.
However, the number of false positives decreased accordingly to
32.5% at a cutoff of 15, and only 5.9% for a cutoff of 20. This was
not the case for the other two programs, where lower-quality
cutoffs detected more true SNPs but the percentage of false positives remained relatively similar with 87.4% to 90.4% for PolyPhred and 78.2% to 84.9% for PolyBayes. This particular feature
of novoSNP’s quality scores makes it very useful in different scenarios. A high cutoff value can be used when building a SNP map
for genetic association studies, where it is important to get reliable SNPs in a prompt manner. Lower cutoff values are to be used
in DNA diagnostic mutation analyses resulting in a larger number of marked variations that need followup but at the same time
eliminating the risk of false negatives, increasing the probability
that pathogenic mutations would not be missed.
To conclude, we showed that novoSNP is an efficient and
reliable software package for sequence variation discovery with a
high discovery rate of both SNPs and INDELs. Also, the quality
score assigned by novoSNP to a marked sequence variant can be
used as a reliable criterion for selecting sequence variations for
either pathogenic mutation or SNP detection.
Methods
primer set, containing reads from forward and reverse primer
sequencing reactions. Once the reference sequence and trace files
are added to the single file SQLite database, the program handles
all following steps automatically. The trace files are base-called
and clipped, and the resulting sequences are aligned to the reference sequence using the BLAST algorithm.
Next, each position in the alignment is scored for the presence of a SNP using a cumulative scoring scheme. The final score
for each position is the sum of three subscores, independently
determined for forward and reverse reads, and an extra score
reflecting how well forward and reverse reads match. The peak
size of a color used for the calculation of these subscores is calculated by normalizing the area under this color at the given
trace position to the average size of the 16 neighboring trace
peaks in the same read (horizontal normalization). The metrics
used in calculating the three subscores are illustrated in Figure 3.
The first subscore represents the evidence for a variant in
one of the aligned traces (“feature” score). This feature score is
calculated by comparing the two largest peaks at the given position for each trace. When only one peak is present (no background), the base represented by this peak will be awarded the
maximum score. If two peaks of equal height are detected, both
bases will receive a maximum score. Based on a number of cutoffs, lower scores for both bases will be produced when one base
Language and data storage
novoSNP is written in the scripting language Tcl version 8.4, has
a graphical user interface written in Tk version 8.4 (http://
www.tcl.tk), and can be used on Linux as well as on Windows
systems. novoSNP stores all information about reads, alignment,
and variations in a single file relational database (http://
www.sqlite.org).
Tcl Libraries
novoSNP relies on the additional Tcl packages ClassyTcl,
ClassyTk, Extral, and dbi. ClassyTcl is an object system for Tcl
and ClassyTk is an extention for Tk, based on the ClassyTcl object system. Additional information can be found at http://
www.sourceforge.net/projects/classytcl/. Extral is a library providing additional commands to Tcl (http://www.sourceforge.net/
projects/extral/). Dbi is an interface providing a unified way to
access different SQL database management systems (http://
www.sourceforge.net/projects/tcl-dbi/).
External programs
The BLAST algorithm (Altschul et al. 1990) is used to align sequence reads to a reference sequence. Sequence trace files in ABI
format generated on capillary systems like the ABI PRISM 3700 or
3100 DNA Analyzer (Applied Biosystems) or in SCF format generated on other sequencing systems like the CEQ Genetic Analysis System (Beckman Coulter) or the MegaBACE DNA Analysis
systems (Amersham) are auto-detected, base-called, and clipped
using Phred (Ewing and Green 1998; Ewing et al. 1998). Phred is
not included in the distribution of novoSNP but should be obtained according to the information at http://www.phrap.org.
Trace files generated on an ABI PRISM 3730 DNA Analyzer require base-calling with the ABI KB BaseCaller. These will be quality-clipped by novoSNP using the same algorithm as used by
Phred.
novoSNP strategy
The novoSNP input data consist of a reference sequence in FASTA
format, covering the sequenced region(s) as well as the generated
sequence trace files. The trace files are preferably arranged per
440
Genome Research
www.genome.org
Figure 3. Illustration of the metrics used to calculate the novoSNP
quality scores. (A) An easy to detect, homozygous SNP that scores high
for all metrics. The features seen in the traces and their scores are listed on
the right; the second best score (circled) represents the feature score for
this position. (B) An example showing a relative small difference in peak
area resulting in a low distance metric. The feature metric assigns a low
score to the small secondary peak in the heterozygous sample. However,
the compensation in the differences in peak area usually seen in true SNPs
and measured by the peak shift metric is very clear: The drop in size in
one color is almost just as large as the rise of the other color. (C) The peak
shift is not always clearly present. In this example, the differences in peak
size are large, but their absolute values not similar. The SNP will still be
picked up by the other metrics.
Downloaded from genome.cshlp.org on June 6, 2020 - Published by Cold Spring Harbor Laboratory Press
novoSNP, a tool for sequence variation discovery
peak is at a fraction of the other. When scanning the position in
all reads, only the best score for each base is kept, finally resulting
in one score for each base at that position. If the position harbors
a SNP, two bases instead of one will have high scores. Considering the best scoring base as the wild type, the second best score
is the one representing the variant and will be added to the final
score as the feature score. Differences with the reference sequence
are captured in this score by assigning the reference base a maximum score.
The second parameter is the “difference” score, which provides the largest observed dissimilarity from all combinations of
two traces at the given position. The dissimilarity is calculated by
summing the peak size differences for each base color. If the
highest dissimilarity found exceeds a defined cutoff value, the
difference score is added to the final score.
The third parameter or “peak shift” score explicitly targets a
typical behavior observed when comparing sequence trace files
containing a different allele: A drop in size of a peak in one color
is compensated by an equal rise of a peak of another color. The
peak shift is calculated by multiplying the ratio between drop
and rise of the peaks at the given position with the value of the
smallest change at that position. The result is normalized for the
highest base peak observed at the aligned position (vertical normalization). If the peak shift between any two reads exceeds a
defined cutoff value, a peak shift score will be added to the final
score.
In case forward and reverse reads are available, an extra
“type” score is added. This extra score reflects how well predicted
variants between the forward and reverse reads match. Conflicting reads result in a small penalty. Maximal scores are assigned if
the different variants are present in both matching forward and
reverse reads.
Heterozygous INDELs show a typical pattern of frameshiftinduced stretches containing many double sequence peaks starting at the INDEL position. These regions are clipped before the
trace files are aligned to the reference sequence because the basecaller considers these stretches as low-quality bases. Therefore,
after SNP scoring of the trace files, each read is tested separately
for the presence of heterozygous INDELs by searching for the
presence of a double sequence pattern after the clipping point. In
case a pattern consistent with the presence of an INDEL is found,
the bases corresponding to the reference sequence are removed
from the double sequence pattern resulting in a sequence containing the potential INDEL. BLAST alignment of this sequence
to the reference sequence determines the presence of an actual
INDEL. Heterozygous INDELs are scored according to the length
and quality of the aligned part after the INDEL, and the consistency with the double sequence pattern. Finally, homozygous
INDELs gathered during the alignment process are evaluated.
Since such gaps are often produced by the base-caller missing a
call, the quality of homozygous INDELs is determined based on
the quality scores of the base-caller, and the uniformity of spacing between the bases in a 6-base region around the INDEL. The
uniformity of spacing is scored by comparing the distance between the highest and lowest distance between actual peak positions to the average distance. As with the scoring of SNPs,
matching results in forward and reverse reads will add to the final
score if available.
The parameters and cutoff values used in the parameter
scoring were optimized using large sequencing data sets from
resequencing projects in our department.
Sequence trace data sets
We compared the performance of novoSNP with that of PolyPhred (version 4.20) and PolyBayes (version 3.0) for automated
sequence variation detection on two data sets, with a total of
10,090 sequence trace files generated on an ABI Prism 3700 and
3730 DNA Analyzer (Applied Biosystems).
The first data set comprises 1028 sequence trace files generated in-house from a diagnostic mutation analysis of the neuronal voltage-gated sodium channel ␣-subunit type I gene (SCN1A),
located on Chromosome 2, in 15 patients with severe myoclonic
epilepsy of infancy (SMEI). Visual inspection of these sequence
trace files identified 38 sequence variations including five
INDELs (Claes et al. 2001, 2003; data not shown). Based on the
Chromosome 2 genomic sequence with GenBank accession
number NT_005403.14 from position 17,050,000 to 17,150,000,
we designed primers for this mutation analysis with SNPbox using the “exon” module (Weckx et al. 2005).
The second data set comprises 9062 sequence trace files obtained in-house from genomic resequencing of 140 kb on Chromosome 17q21 spanning the gene coding for the microtubuleassociated protein tau (MAPT) in 23 individuals to construct a
high-density SNP map for genetic association studies (Rademakers et al. 2004). Based on the genomic sequence with GenBank
accession number NC_000017.9 from position 41,323,600 to
41,462,800, 198 primer sets were designed with SNPbox using
the “saturation” module (Weckx et al. 2005). These primer sets
were amplified and sequenced in both directions in 23 individuals. Visual inspection of this data set yielded 488 variations including 36 INDELs. Of 70 variations tested for validation using
other technologies, 69 were successfully confirmed.
For novoSNP, a FASTA file of the genomic sequence was
used as a reference sequence. For PolyPhred and PolyBayes the
reference sequences were translated into a phd file using
the fasta2Phd Perl script, and the resulting files were added to the
appropriate data sets. PolyBayes was run under standard conditions. PolyPhred was used with default settings except that the
option to detect INDELs was enabled. We also ran PolyPhred
with a lower-quality clipping score of 15 and with the source
option enabled. These settings resulted in a lower number of true
variations with a slightly lower false-positives rate. Phrap was
used with the force level input variable set to 10 to allow the
lowest possible stringency.
Acknowledgments
We acknowledge the contribution of the VIB Genetic Service
Facility (http://www.vibgeneticservicefacility.be/) to the resequencing and SNP detection. This work was in part funded by the
Special Research Fund of the University of Antwerp, the Fund for
Scientific Research Flanders (FWO-V), the Interuniversity Attraction Poles program P5/19 of the Federal Science Policy Office, the
Medical Foundation Queen Elisabeth, and the International Alzheimer Research Foundation, Belgium; and an Integrated Project
APOPIS within the 6th framework program of the European
Commission. R.R. and M.C. are postdoctoral fellows of the
FWO-V.
References
Adams, M.D., Celniker, S.E., Holt, R.A., Evans, C.A., Gocayne, J.D.,
Amanatides, P.G., Scherer, S.E., Li, P.W., Hoskins, R.A., Galle, R.F., et
al. 2000. The genome sequence of Drosophila melanogaster. Science
287: 2185–2195.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990.
Basic local alignment search tool. J. Mol. Biol. 215: 403–410.
The C. elegans Sequencing Consortium. 1998. Genome sequence of the
nematode C. elegans: A platform for investigating biology. Science
282: 2012–2018.
Genome Research
www.genome.org
441
Downloaded from genome.cshlp.org on June 6, 2020 - Published by Cold Spring Harbor Laboratory Press
Weckx et al.
Claes, L., Del Favero, J., Ceulemans, B., Lagae, L., Van Broeckhoven, C.,
and De Jonghe, P. 2001. De novo mutations in the sodium-channel
gene SCN1A cause severe myoclonic epilepsy of infancy. Am. J. Hum.
Genet. 68: 1327–1332.
Claes, L., Ceulemans, B., Audenaert, D., Smets, K., Lofgren, A., Del
Favero, J., Ala-Mello, S., Basel-Vanagaite, L., Plecko, B., Raskin, S., et
al. 2003. De novo SCN1A mutations are a major cause of severe
myoclonic epilepsy of infancy. Hum. Mutat. 21: 615–621.
Ewing, B. and Green, P. 1998. Base-calling of automated sequencer
traces using phred. II. Error probabilities. Genome Res. 8: 186–194.
Ewing, B., Hillier, L., Wendl, M.C., and Green, P. 1998. Base-calling of
automated sequencer traces using phred. I. Accuracy assessment.
Genome Res. 8: 175–185.
Fredman, D., Munns, G., Rios, D., Sjoholm, F., Siegfried, M., Lenhard,
B., Lehvaslaiho, H., and Brookes, A.J. 2004. HGVbase: A curated
resource describing human DNA variation and phenotype
relationships. Nucleic Acids Res. 32: D516–D519.
Kwok, P.Y., Carlson, C., Yager, T.D., Ankener, W., and Nickerson, D.A.
1994. Comparative analysis of human DNA variations by
fluorescence-based sequencing of PCR products. Genomics 23:
138–144.
Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C.,
Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al.
2001. Initial sequencing and analysis of the human genome. Nature
409: 860–921.
Marth, G.T., Korf, I., Yandell, M.D., Yeh, R.T., Gu, Z., Zakeri, H., Stitziel,
N.O., Hillier, L., Kwok, P.Y., and Gish, W.R. 1999. A general
approach to single-nucleotide polymorphism discovery. Nat. Genet.
23: 452–456.
Nickerson, D.A., Tobe, V.O., and Taylor, S.L. 1997. PolyPhred:
Automating the detection and genotyping of single nucleotide
substitutions using fluorescence-based resequencing. Nucleic Acids
Res. 25: 2745–2751.
Rademakers, R., van der Zee, J., Bogaerts, V., Van den Bossche, D.,
Backhovens, H., De Pooter, T., Bel Kacem, S., van Duijn, C.,
Del-Favero, J., Van Broeckhoven, C., et al. 2004. P4-154 genomic
sequencing of MAPT provides an extended SNP map and identifies
>30 H1 subhaplotypes. Neurobiol. Aging 25 (Suppl 2): S519.
442
Genome Research
www.genome.org
Sachidanandam, R., Weissman, D., Schmidt, S.C., Kakol, J.M., Stein,
L.D., Marth, G., Sherry, S., Mullikin, J.C., Mortimore, B.J., Willey,
D.L., et al. 2001. A map of human genome sequence variation
containing 1.42 million single nucleotide polymorphisms. Nature
409: 928–933.
Sherry, S.T., Ward, M.H., Kholodov, M., Baker, J., Phan, L., Smigielski,
E.M., and Sirotkin, K. 2001. dbSNP: The NCBI database of genetic
variation. Nucleic Acids Res. 29: 308–311.
Venter, J., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., Sutton, G.G.,
Smith, H.O., Yandell, M., Evans, C.A., Holt, R.A., et al. 2001. The
sequence of the human genome. Science 291: 1304–1351.
Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F.,
Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., An, P.,
et al. 2002. Initial sequencing and comparative analysis of the
mouse genome. Nature 420: 520–562.
Weckx, S., De Rijk, P., Van Broeckhoven, C., and Del Favero, J. 2004.
SNPbox: Web-based high-throughput primer design from gene to
genome. Nucleic Acids Res. 32: W170–W172.
———. 2005. SNPbox, a modular software package for large scale primer
design. Bioinformatics 21: 385–387.
Web site references
http://www.phrap.org; Phred/Phrap.
http://www.sourceforge.net/projects/classytcl/; ClassyTcl.
http://www.sourceforge.net/projects/extral; Extral.
http://www.sourceforge.net/projects/tcl-dbi/; Dbi.
http://www.sqlite.org; SQLite.
http://www.tcl.tk; Tcl/Tk.
http://www.vibgeneticservicefacility.be/; VIB Genetic Service Facility.
http://www.molgen.ua.ac.be/bioinfo/novosnp/; novoSNP: a program to
find SNPs and small indels in resequencing projects.
Received May 5, 2004; accepted in revised form January 4, 2005.
Downloaded from genome.cshlp.org on June 6, 2020 - Published by Cold Spring Harbor Laboratory Press
novoSNP, a novel computational tool for sequence variation
discovery
Stefan Weckx, Jurgen Del-Favero, Rosa Rademakers, et al.
Genome Res. 2005 15: 436-442
Access the most recent version at doi:10.1101/gr.2754005
References
This article cites 19 articles, 4 of which can be accessed free at:
http://genome.cshlp.org/content/15/3/436.full.html#ref-list-1
License
Email Alerting
Service
Receive free email alerts when new articles cite this article - sign up in the box at the
top right corner of the article or click here.
To subscribe to Genome Research go to:
http://genome.cshlp.org/subscriptions
Cold Spring Harbor Laboratory Press