Syst. Biol. 52(3):283–295, 2003
Copyright © Society of Systematic Biologists
ISSN: 1063-5157 print / 1076-836X online
DOI: 10.1080/10635150390196948

18S Ribosomal RNA and Tetrapod Phylogeny

XUHUA XIA,1 ZHENG XIE,1,2 AND KARL M. KJER3

1 Department of Biology, University of Ottawa, 150 Louis, P.O. Box 450, Station A, Ottawa, Ontario K1N 6N5, Canada; E-mail: xxia@uottawa.ca (X.X.)
2 Institute of Environmental Protection, Hunan University, Changsha, China
3 Department of Entomology, Cook College, Rutgers University, Highland Park, New Jersey 08904, USA; E-mail: kjer@aesop.rutgers.edu

Abstract.— Previous phylogenetic analyses of tetrapod 18S ribosomal RNA (rRNA) sequences support the grouping of birds with mammals, whereas other molecular data and morphological and paleontological data favor the grouping of birds with crocodiles. The 18S rRNA gene has consequently been considered odd, serving as “definitive evidence of different genes providing significantly different estimates of phylogeny in higher organisms” (p. 156; Huelsenbeck et al., 1996, Trends Ecol. Evol. 11:152–158). Our research indicates that the previous discrepancy of phylogenetic results between the 18S rRNA gene and other genes is caused mainly by (1) the misalignment of the sequences, (2) the inappropriate use of the frequency parameters, and (3) poor sequence quality. When the sequences are aligned with the aid of the secondary structure of the 18S rRNA molecule and when the frequency parameters are estimated either from all sites or from the variable domains where substitutions have occurred, the 18S rRNA sequences no longer support the grouping of the avian species with the mammalian species. [alignment; 18S rRNA; RNA secondary structure; Indel; molecular phylogenetics; tetrapod phylogeny.]

One of the early controversies about the phylogenetic relationships among tetrapods is whether birds are more closely related to crocodilians (Romer, 1966; Carroll, 1988; Gauthier et al., 1988) or to mammals (Gardiner, 1982; Løvtrup, 1985). Hedges et al. (1990) collected a set of 18S ribosomal RNA (rRNA) sequences to evaluate the relationships among tetrapods and found that the bird–mammal grouping was much more strongly supported, with a bootstrap value of 88%, than the bird–crocodilian grouping. A subset of these 18S rRNA sequences was used subsequently in a statistical test, based on the minimum-evolution criterion, to evaluate relative support of these alternative phylogenetic hypotheses (Rzhetsky and Nei, 1992). The nine shortest trees, including the neighbor-joining tree, all grouped the avian and mammalian species together as a monophyletic taxon. The bird–mammal grouping contradicts both the traditional classification and the results derived from a large amount of other molecular data (Hedges, 1994; Seutin et al., 1994; Caspers et al., 1996; Janke and Arnason, 1997; Zardoya and Meyer, 1998; Ausio et al., 1999) and morphological and paleontological data (Eernisse and Kluge, 1993). For this reason, the proposal of the bird–mammal grouping based on the 18S rRNA gene has received critical examination from many different perspectives to determine what bias could have been introduced in analyzing the 18S rRNA sequences among tetrapod species.

Four kinds of potential bias involving the 18S rRNA gene have been proposed. First, the genomes of homeotherms such as birds and mammals tend to be more GC rich than those of poikilotherms (Bernardi, 1993). This shift in nucleotide frequencies, i.e., the problem of nonstationarity in the substitution process, is generally not accommodated in either parsimony or maximum-likelihood phylogenetic methods.
For this reason, the LogDet (Lockhart et al., 1994) distance, which is based on a substitution model that presumably should correct for the nucleotide frequency shift, was used to see whether the annoying bird–mammal grouping would disappear (Huelsenbeck and Bull, 1996; Huelsenbeck et al., 1996). It did not.

Second, the substitution pattern is biased, favoring U ↔ C transitions in rRNA genes (Marshall, 1992). This bias is expected and is mainly caused by the fact that the nucleotide G can pair with either U or C in maintaining the secondary structure of the rRNA molecule. For example, in pairing with a G, a C can be replaced by a U with little effect on the secondary structure. This ease of substitution is partly the reason for the model (Tamura and Nei, 1993) that uses one parameter for the T ↔ C transition and another for the A ↔ G transition. However, the substitution pattern is not unique to the 18S rRNA gene and therefore cannot explain the difference in phylogenetic outcome between the 18S rRNA and the other rRNA molecules. Even the most general distances, such as the LogDet, do not break up the bird–mammal grouping when applied to the 18S rRNA sequences (Huelsenbeck and Bull, 1996; Huelsenbeck et al., 1996).

The combination of a higher GC content in homeotherms and the biased substitution pattern favoring U ↔ C transitions may jointly increase the problems associated with long-branch attraction (Marshall, 1992; Huelsenbeck et al., 1996). If the U ↔ C transition is the predominant substitution type and if birds and mammals have experienced an increase in C, then convergent U → C transitions may occur independently in the lineages leading to birds and to mammals. However, Hedges and Maxson (1992) dismissed long-branch attraction because the 18S rRNA sequences do not seem to have experienced substitutional saturation. A weighted parsimony method (Williams and Fitch, 1990; Fitch et al., 1995) produced equivocal results (Marshall, 1992).
When the paleontological tree grouping birds with crocodilians was used as the starting tree, the method ended up supporting this starting tree. However, when the tree grouping birds with mammals was used as the starting tree, the method ended up supporting this new starting tree. The weighted parsimony method is known to depend on the starting tree, and its relevance to the 18S rRNA sequences is not obvious (Hedges, 1992).

Third, sequence misalignment was suspected to have resulted in a biased phylogenetic estimate from the 18S rRNA sequences (Eernisse and Kluge, 1993). However, a realignment of the sequences that was not based on secondary structure (Eernisse and Kluge, 1993) generated results that are exactly the same as those of earlier studies (Hedges et al., 1990; Rzhetsky and Nei, 1992). As previously argued (Hedges et al., 1990), the sequence divergence between the amphibians and the amniotes is only 4.4%; thus, sequence alignment should not be a problem.

Fourth, negligence in accommodating variable substitution rates between the conserved and the variable domains in the 18S rRNA sequences might cause a problem (Van de Peer et al., 1993). Both the indel and nucleotide substitution events have occurred predominantly in the eight variable domains (Van de Peer et al., 1993). If some sequences have experienced a number of deletions at their variable domains and if the genetic distance between the two sequences is calculated by using all homologous sites between the two sequences, then the genetic distance involving the shortest sequence, i.e., the one with the shortest variable region, will be relatively underestimated (Van de Peer et al., 1993).
Although this observation is insightful, it cannot explain the bird–mammal grouping in previous studies (Hedges et al., 1990; Rzhetsky and Nei, 1992) because in these studies all indel-containing sites were deleted before the phylogenetic analysis was performed so that the number of the homologous sites between any pair of sequences would be constant, i.e., all sequences have the same number of conserved and variable sites in these studies.

Huelsenbeck et al. (1996) made perhaps the most extensive and critical examination of the 18S rRNA sequences in relation to the tetrapod phylogeny. They tried almost all existing phylogenetic methods, such as distance methods with the LogDet distance and the maximum-likelihood method with several substitution models. However, the 18S rRNA sequences consistently produced the bird–mammal grouping, whereas other rRNA genes supported the bird–“reptile” grouping. This result led Huelsenbeck et al. (1996:156) to conclude that their analysis “offers definitive evidence of different genes providing significantly different estimates of phylogeny in higher organisms.”

Is the 18S rRNA gene really so unique? This question is serious because the rRNA genes have been heralded as the universal yardstick in molecular phylogenetics (Olsen and Woese, 1993), and it would be truly frustrating if we often had “different genes providing significantly different estimates of phylogeny in higher organisms” (Huelsenbeck et al., 1996:156), especially when the universal yardstick is at fault.

In previous studies, including that of Eernisse and Kluge (1993), little attention was paid to two problems. The first problem is that of sequence alignment and the associated definition and treatment of alignment in ambiguous regions. The realignment of Eernisse and Kluge (1993) has not been published but appears to have been generated using a gap penalty slightly higher than that used by Hedges et al. (1990).
No reference was made to the secondary structure of the rRNA molecule, which has been considered by some as essential for aligning rRNA sequences (Kjer, 1995; Notredame et al., 1997; Hickson et al., 2000). The reported sequence divergence of 4.4% between amphibians and amniotes (Hedges and Maxson, 1992) might have misled researchers into thinking that few changes have occurred during the evolution of 18S rRNA and that there is consequently little ambiguity in sequence alignment. In fact, many indel events have occurred, and it is extremely difficult to arrive at a definite alignment, even with the information provided by secondary structure. Hedges et al. (1990) excluded a segment of sequences because of the difficulty in aligning them. The reported 4.4% sequence divergence was obtained after deleting all indel-containing sites and is therefore not a reflection of sequence differences. With sequences of such low divergence, these hypervariable regions are important because the majority of characters may come from regions of ambiguous homology. In addition, even if 4.4% were an accurate estimate of pairwise divergence, Kjer (1995) cautioned against making general statements about divergence levels below which structural alignments would be unimportant because of the high variability of conservation among stems.

The 18S rRNA sequences in mammalian species are much longer than those in avian and “reptilian” species. The alignment program therefore has more room to slide the “reptilian” bases to match the mammalian sequences during sequence alignment. The similarity in nucleotide frequencies between birds and mammals (Bernardi, 1993) would increase the chance of spurious matching between the avian and mammalian sequences. For illustration, imagine four orthologous sequences, two having experienced no indel events and two having experienced many indel events (Fig. 1). The alignment of many sites, especially at sites 3 and 4 and 10 and 11, is uncertain. Assuming that indel events are generally rare, we conclude that Seq1 and Seq2 are more similar to each other, as are Seq3 and Seq4. However, if we delete all indels, then the uncertainty in sequence alignment is forgotten, and all existing phylogenetic methods will generate the “best” tree, with Seq2 and Seq3 forming a monophyletic taxon. Previous studies that support the grouping of birds with mammals (Hedges et al., 1990; Rzhetsky and Nei, 1992; Huelsenbeck et al., 1996) happen to have aligned long mammalian sequences with short avian and “reptilian” sequences and then deleted all indels before phylogenetic analysis.

FIGURE 1. The problem of aligning short and long sequences.

The information in the secondary structure of rRNA sequences has been recognized as helpful in guiding sequence alignment (Kjer, 1995; Hickson et al., 1996; Notredame et al., 1997; Buckley et al., 2000), and structure-based alignment has improved phylogenetic resolution in many studies (Dixon and Hillis, 1993; Kjer, 1995; Titus and Frost, 1996; Morrison and Ellis, 1997; Uchida et al., 1998; Mugridge et al., 1999; Cunningham et al., 2000; Gonzalez and Labarere, 2000; Hwang and Kim, 2000; Lydeard et al., 2000; Morin, 2000; Xia, 2000b). Structural alignments are performed manually and thus require the investigator to look at the data and make decisions, thus preventing some of the arbitrary statements about homology that are illustrated in Figure 1. For the 18S rRNA sequences from the tetrapod species, even the inclusion of secondary structure information cannot guarantee unequivocal alignment of all homologous sites.
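The effect of deleting every indel-containing column, as discussed for Figure 1, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four toy sequences are invented for this example and are not the sequences of Figure 1, and the helper names `drop_gap_columns` and `p_distance` are hypothetical.

```python
def drop_gap_columns(alignment, gap_chars="-?"):
    """Return a copy of the alignment without any column that contains a gap."""
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[n]) != length for n in names):
        raise ValueError("all aligned sequences must have the same length")
    # Keep only the columns in which every sequence has a non-gap character.
    keep = [i for i in range(length)
            if all(alignment[n][i] not in gap_chars for n in names)]
    return {n: "".join(alignment[n][i] for i in keep) for n in names}


def p_distance(a, b):
    """Proportion of sites that differ between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Hypothetical toy data: Seq1 and Seq2 have no indels, Seq3 and Seq4 share indels.
toy = {
    "Seq1": "ACGTTACGGA",
    "Seq2": "ACGTAACGGA",
    "Seq3": "AC--AAC-GA",
    "Seq4": "AC--TAC-GA",
}
trimmed = drop_gap_columns(toy)
```

In this toy case the shared indels are the only evidence that links Seq3 with Seq4. Once the gapped columns are removed, Seq2 and Seq3 become identical over the retained sites, which is the kind of artifact the text describes.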
Although many authors have proposed methods for handling indels (Swofford, 1993; Baldwin et al., 1995; Hibbett et al., 1995; Kjer, 1995; Crandall and Fitzpatrick, 1996; Kretzer et al., 1996; Manos, 1997; Flores-Villela et al., 2000; Kjer et al., 2001), ambiguously aligned sites have sometimes been handled in phylogenetics in two extreme and inappropriate ways, i.e., they are either totally discarded or totally included with the hope that the majority of sites have been aligned properly. Lutzoni et al. (2000) described a method for coding these ambiguously aligned regions, but their method has not yet been used with the 18S rRNA sequences for tetrapods.

The second problem shared by previous studies is the misuse of frequency parameters in the distance and maximum-likelihood methods. The vast majority of both substitution and indel events have occurred in just a few variable domains of the 18S rRNA sequences (Van de Peer et al., 1993). The variable domains have nucleotide frequencies different from those of the conserved domains in the 18S rRNA gene and the 28S rRNA gene (Zardoya and Meyer, 1996). In phylogenetic analyses involving the distance and maximum-likelihood methods, the frequency parameters most appropriate for the underlying substitution model must be used. The most appropriate estimate of the frequency parameters should be derived from the sites where substitution occurs, i.e., from the variable domains. However, in previous studies, many sites in the variable domains have been deleted after alignment because they contain indels. Consequently, the frequency parameters in those studies have been estimated mainly from the conserved domains where nucleotide substitutions are rare. Such frequency parameters could be irrelevant to the underlying substitution models used.
In this study, we reexamined the phylogenetic relationships among tetrapods by (1) using the 18S rRNA sequences aligned against the secondary structure and (2) estimating the frequency parameters from all sites (i.e., including indel-containing sites) or from variable sites only. In contrast to previous studies based on the 18S rRNA sequences, our reanalysis does not support the hypothesis that birds group with mammals.

MATERIALS AND METHODS

We used three sets of 18S rRNA sequences in this study. The first set of sequences was retrieved from the rRNA WWW server (Van de Peer et al., 2000; http://rrna.uia.ac.be/ssu/) and consists of 48 sequences, after excluding redundant sequences. The sequence files from the rRNA WWW server are plain-text files with a special distribution format to specify secondary structure information. The format uses square brackets to enclose a helix, parentheses to enclose a nonstandard base pair, and braces to enclose an internal loop. The computer software DAMBE (Xia, 2000a; Xia and Xie, 2001) can read the files and interpret the symbols properly. The alignment was refined by visual inspection against the secondary structure, and the final aligned sequences and the second and third sets of data are available at http://aix1.uottawa.ca/~xxia/research/data/XiaXieKjer.htm. The refinement of the alignment was required because, as with any huge electronic database of rRNA expanding at a rapid rate, some of the sequences we downloaded had been taxonomically aligned.

The second set of aligned 18S rRNA sequences was retrieved from the Ribosomal Database Project II (Maidak et al., 2000; ftp://ftp.cme.msu.edu/pub/RDP/SSU_rRNA/alignments/). This FTP site contains two relevant files (SSU_Euk.gb and SSU_Euk_rep.gb). All tetrapod sequences with >1200 resolved bases were included in this study, and the set consists of 15 aligned sequences.
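The inclusion rule for the second data set (more than 1200 resolved bases) can be expressed as a simple filter. The sketch below is hedged: the source does not define "resolved", so treating only unambiguous A, C, G, T, and U as resolved is an assumption, and the function names are hypothetical.

```python
# Assumption: a "resolved" base is an unambiguous nucleotide; gaps, N, and
# other ambiguity codes are treated as unresolved.
RESOLVED_BASES = set("ACGTU")


def count_resolved(seq):
    """Count unambiguous nucleotides in a (possibly gapped) sequence."""
    return sum(1 for base in seq.upper() if base in RESOLVED_BASES)


def select_sequences(records, min_resolved=1200):
    """Keep the sequences with strictly more than min_resolved resolved bases."""
    return {name: seq for name, seq in records.items()
            if count_resolved(seq) > min_resolved}
```

For example, `select_sequences({"x": "ACGU" * 301, "y": "ACGN-" * 300})` keeps only `"x"`, which has 1204 resolved bases, whereas `"y"` has 900.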
Most of these sequences have been used in previous studies to generate the best tree supporting the bird–mammal grouping (Hedges et al., 1990; Rzhetsky and Nei, 1992; Eernisse and Kluge, 1993; Huelsenbeck et al., 1996). We refined the alignment by visual inspection, and the final alignment is also available at the URL above.

The third set of sequences was retrieved from GenBank, with four sequences not contained in the previous two sets: Crocodylus niloticus (crocodile), Ornithorhynchus anatinus (platypus), Vombatus ursinus (wombat), and Didelphis virginiana (opossum). The platypus and the two marsupial species help subdivide the branch leading to the placental mammal clade. The sequences were also aligned against the secondary structure, and the final alignment is also available at the URL above. Alignment-ambiguous regions were defined with reference to secondary structure (Kjer, 1997). We examined whether there could be something peculiar about the sequences collected by Hedges et al. (1990) that could cause the avian and mammalian sequences to group together. We sequenced most of the 18S rRNA gene from the turtle Trachemys scripta and combined these new sequences with those of Pseudemys scripta from Hedges et al. (1990), forming a chimeric sequence. Other sequences from Hedges et al. (1990) were replaced with taxa collected by others, except for the lizard and snake sequences, which are the only representatives of their taxa available. Of the sequences that were retained from Hedges et al. (1990), we recorded as missing any specific site at which one nucleotide was recorded from all of the amniotes by Hedges et al. (1990) and another nucleotide was recorded from all other taxa by other researchers.
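One possible reading of the masking rule in the last sentence above is sketched below. It is an interpretation, not the authors' procedure: the grouping into a "suspect" set (retained sequences from one source) and a "reference" set (sequences from other researchers), the use of "?" for missing data, and the function name `mask_suspect_sites` are all assumptions made for illustration.

```python
NUCLEOTIDES = set("ACGTU")


def mask_suspect_sites(alignment, suspect, reference, missing="?"):
    """Mask sites where all suspect sequences agree on one nucleotide and
    all reference sequences agree on a different nucleotide."""
    length = len(next(iter(alignment.values())))
    masked = {name: list(seq) for name, seq in alignment.items()}
    for i in range(length):
        suspect_bases = {alignment[name][i] for name in suspect}
        reference_bases = {alignment[name][i] for name in reference}
        fixed_in_both = len(suspect_bases) == 1 and len(reference_bases) == 1
        all_nucleotides = (suspect_bases | reference_bases) <= NUCLEOTIDES
        if fixed_in_both and all_nucleotides and suspect_bases != reference_bases:
            # Only the suspect sequences are recorded as missing at this site.
            for name in suspect:
                masked[name][i] = missing
    return {name: "".join(chars) for name, chars in masked.items()}
```

With the hypothetical alignment `{"h1": "ACGT", "h2": "ACGA", "r1": "ATGT", "r2": "ATGC"}`, suspect sequences `h1` and `h2`, and reference sequences `r1` and `r2`, only the second site is masked, giving `"A?GT"` and `"A?GA"` for the suspect sequences.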
These sequences were analyzed using parsimony methods, with a new method for coding alignment-ambiguous regions (Lutzoni et al., 2000), and analyzed using likelihood methods, with a general time-reversible model with a gamma correction for among-site rate variation and an estimate of invariant sites. Parameters were estimated from the parsimony tree with PAUP 4 (Swofford, 2000). The three sets of sequences are not mutually exclusive.

Strong heterogeneity in substitution rate is expected to exist in the 18S rRNA sequences, and we used the method of Gu and Zhang (1997) implemented in DAMBE (Xia, 2000a; Xia and Xie, 2001) to estimate the alpha parameter of the gamma distribution. We also estimated the proportion of invariant sites using a modification of Gu and Zhang’s method, as described below. The estimated alpha value was used with the new version of DNAML (Felsenstein, 1993) and for correcting distance estimation based on the TN93 model (Tamura and Nei, 1993).

The genomes of homeotherms such as birds and mammals tend to be more GC rich than those of poikilotherms (Bernardi, 1993), and the 18S rRNA sequences from the avian and mammalian species are also more GC rich than those of other species. If the high GC content in avian and mammalian sequences has been gained independently in the two lineages, then we should use a substitution model that would accommodate nonstationarity in the substitution process. At present, only the underlying model for the LogDet (Lockhart et al., 1994) and the paralinear (Lake, 1994) distances accommodates nonstationarity, and thus much of our phylogenetic analysis was limited to distance-based methods.

The presence of a large proportion of invariant sites may bias phylogenetic estimation (Lockhart et al., 1996). It is important to distinguish between the invariant sites and those sites with no observed substitution, the former being a subset of the latter.
The proportion of sites with no observed substitution (designated p) is made up of two components: the proportion of sites expected to have experienced no substitution (p1) under a given substitution model and the truly invariant sites, i.e., those where a change will have a very deleterious effect and will be strongly selected against (p2). To estimate p2, we allowed the p2 value to fluctuate between 0 and p (and the p1 value consequently fluctuated from p to 0) and fit the observed substitution data to a negative binomial distribution. The resulting p2 value that produced the best fit to the substitution data was used as the proportion of invariant sites. This method has been implemented in DAMBE (Xia, 2000a; Xia and Xie, 2001). A similar approach was used by Lockhart et al. (1996).

Unless specified otherwise, the nucleotide frequencies were estimated using all sites, including the indel-containing sites. This approach differs from that used in previous studies (Hedges et al., 1990; Rzhetsky and Nei, 1992; Eernisse and Kluge, 1993; Huelsenbeck et al., 1996), in which all indel-containing sites were deleted before the phylogenetic analysis was performed.

RESULTS AND DISCUSSION

The mammalian and avian sequences are consistently more GC rich than the sequences from poikilotherms (Table 1), a finding with two implications. First, avian 18S rRNA sequences are much shorter than mammalian sequences, and alignment of the short avian sequences against the long mammalian sequences allows a great deal of freedom in sliding the avian bases to match the mammalian bases.

TABLE 1. Frequencies of A+T and C+G for the fish, amphibian, “reptilian,” mammalian, and avian species.

Taxon        Scientific name             Accession no.  A+T     C+G
Fish         Latimeria chalumnae         L11288         0.4766  0.5234
Amphibian    Xenopus laevis              X02995         0.4619  0.5380
             Xenopus laevis              X04025         0.4622  0.5378
             Ranodon sibiricus           AJ279506       0.4788  0.5212
Crocodilian  Alligator mississippiensis  AF173605       0.4634  0.5367
Tuatara      Sphenodon punctatus         AF115860       0.4602  0.5398
Mammal       Mus musculus                X00686         0.4398  0.5602
             Mus musculus                X82564         0.4396  0.5605
             Rattus norvegicus           K01593         0.4424  0.5576
             Rattus norvegicus           M11188         0.4422  0.5577
             Rattus norvegicus           V01270         0.4429  0.5571
             Oryctolagus cuniculus       X06778         0.4465  0.5534
             Homo sapiens                K03432         0.4393  0.5607
             Homo sapiens                M10098         0.4395  0.5605
             Homo sapiens                U13369         0.4388  0.5612
             Homo sapiens                X03205         0.4388  0.5612
Bird         Anas platyrhynchos          AF173614       0.4467  0.5533
             Dromaius novaehollandiae    AF173610       0.4537  0.5463
             Tockus nasutus              AF173626       0.4410  0.5590
             Chordeiles acutipennis      AF173622       0.4398  0.5602
             Charadrius semipalmatus     AF173638       0.4410  0.5591
             Larus glaucoides            AF173637       0.4381  0.5619
             Urocolius macrourus         AF173617       0.4386  0.5613
             Columba livia               AF173630       0.4404  0.5596
             Coracias caudata            AF173625       0.4404  0.5596
             Cuculus pallidus            AF173628       0.4410  0.5590
             Galbula pastazae            AF173624       0.4410  0.5590
             Ortalis guttata             AF173613       0.4399  0.5602
             Coturnix pectoralis         AF173611       0.4502  0.5498
             Gallus gallus               AF173612       0.4525  0.5475
             Grus canadensis             AF173632       0.4410  0.5590
             Gallirex porphyreolophus    AF173618       0.4393  0.5608
             Picoides pubescens          AF173615       0.4410  0.5590
             Tyrannus tyrannus           AF173616       0.4398  0.5602
             Ciconia nigra               AF173636       0.4398  0.5602
             Apus affinus                AF173619       0.4392  0.5607
             Trogon collaris             AF173623       0.4404  0.5596
             Turnix sylvatica            AF173631       0.4404  0.5596
             Upupa epops                 AF173627       0.4404  0.5596
             Apteryx australis           AF173609       0.4537  0.5463
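Frequencies of the kind listed in Table 1 can be computed directly from a sequence. The following is a minimal sketch, assuming that gaps and ambiguous symbols are simply ignored and that U is counted as T. The source does not say how such symbols were treated, so the values it produces need not match Table 1 exactly, and the function name is hypothetical.

```python
def at_gc_frequencies(seq):
    """Return (A+T, G+C) frequencies over the unambiguous bases of a sequence."""
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA spellings alike
    counts = {base: seq.count(base) for base in "ACGT"}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no unambiguous nucleotides")
    at = (counts["A"] + counts["T"]) / total
    gc = (counts["G"] + counts["C"]) / total
    return at, gc
```

Because both values are computed over the same set of bases, they sum to 1, as the paired columns of Table 1 do up to rounding.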
This length mismatch and the similarity in nucleotide frequencies between avian and mammalian sequences lead to “optimal” alignment (i.e., the best alignment score); therefore, it is necessary to use a nucleotide substitution model that accommodates the inherent nonstationary substitution process. At present, only the paralinear distance (Lake, 1994) and the LogDet distance (Lockhart et al., 1994) methods are appropriate for the phylogenetic analysis of these sequences.

There is a subtle difference between the paralinear and the LogDet distances. To highlight the difference, we reproduced the distance between two nucleotide sequences (1 and 2):

\[
d_{12} = -\frac{1}{4}\ln\left[\frac{\det J_{12}}{\sqrt{\prod_{i=1}^{4} p_{1i}\,\prod_{i=1}^{4} p_{2i}}}\right], \tag{1}
\]

where J12 is the observed substitution matrix, p1 and p2 are nucleotide frequencies for sequences 1 and 2, respectively (p1i and p2i being the frequencies of nucleotide i), and det J12 means the determinant of J12. In the formulation of the paralinear distance, the entries of J12 are numbers (counts) rather than proportions, and p1 and p2 are reconstituted from J12 (Lake, 1994). Consequently, p1 and p2 are based on aligned sites only, i.e., sites with no indels. However, this approach causes a new problem for analyzing the 18S rRNA sequences. Both substitution and indel events have occurred almost exclusively in just a few variable domains of the 18S rRNA sequences (Van de Peer et al., 1993). The variable domains have nucleotide frequencies different from those of the conserved domains in the 18S rRNA gene and in the 28S rRNA gene (Zardoya and Meyer, 1996). In phylogenetic analyses involving distance and maximum-likelihood methods, frequency parameters most appropriate for the underlying substitution model must be used. The most appropriate estimate of the frequency parameters should be from the sites where substitution occurs, i.e., from the variable domains. However, variable domains in the 18S rRNA sequences are poorly represented in the aligned sites because of the presence of many indels in these domains.
Thus, p1 and p2 in Equation 1 are mainly based on invariable domains and consequently are not appropriate for phylogenetic reconstruction. PAUP 4 (Swofford, 2000) uses this original formulation for calculating the pairwise Lake/LogDet distances.

Two modifications can be made to alleviate the problem of using inappropriately estimated frequency parameters. The first is to use polymorphic sites only in phylogenetic reconstruction. This would produce proper estimates of p1 and p2 but has the disadvantage of generating extraordinarily large distances. An alternative is to use the LogDet distance (Lockhart et al., 1994), which defines J12 as a substitution matrix in proportions summing up to 1 and p1 and p2 as vectors of proportions summing up to 1. This permits the computation of empirical frequencies from all sites, including sites containing indels. Both DAMBE and the DNADIST program in PHYLIP (Felsenstein, 1993) use all sites in computing p1 and p2. This approach allows sites in the variable domains of the 18S rRNA sequences to be better represented in computing nucleotide frequencies and is the approach that we have taken in analyzing the 18S rRNA sequences.

Distance-based phylogenetic reconstruction demands both the unbiased estimation of the distance matrix and an efficient and accurate method that uses the input distance matrix to search for the best tree based on a biologically meaningful optimization criterion. The latter component has been much advanced in recent years, with the development of new methods implemented in Weighbor (Bruno et al., 2000), BIONJ (Gascuel, 1997), and FastME (Desper and Gascuel, 2002). In particular, FastME represents one of the first successful implementations of the global minimum evolution (ME) criterion in phylogenetic analysis.
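The two ways of obtaining p1 and p2 for Equation 1 can be contrasted in a short Python sketch. It is an illustration under stated assumptions rather than the code of DAMBE, PHYLIP, or PAUP: J12 is taken as a matrix of proportions over sites where both sequences have an unambiguous base, the `all_sites` switch and all function names are hypothetical, and no correction for rate heterogeneity or invariant sites is applied.

```python
import math

BASES = "ACGT"


def _clean(seq):
    return seq.upper().replace("U", "T")


def base_frequencies(seq):
    """Frequencies of A, C, G, T over all unambiguous bases of one sequence."""
    seq = _clean(seq)
    counts = [seq.count(b) for b in BASES]
    total = sum(counts)
    return [c / total for c in counts]


def divergence_matrix(seq1, seq2):
    """J12 as proportions, using only sites where both bases are unambiguous."""
    index = {b: i for i, b in enumerate(BASES)}
    counts = [[0.0] * 4 for _ in range(4)]
    n = 0
    for x, y in zip(_clean(seq1), _clean(seq2)):
        if x in index and y in index:
            counts[index[x]][index[y]] += 1
            n += 1
    if n == 0:
        raise ValueError("no sites with unambiguous bases in both sequences")
    return [[c / n for c in row] for row in counts]


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    size = len(m)
    det = 1.0
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size):
                m[r][c] -= factor * m[col][c]
    return det


def logdet_distance(seq1, seq2, all_sites=True):
    """Equation 1, with p1 and p2 from all sites or from aligned sites only."""
    j12 = divergence_matrix(seq1, seq2)
    if all_sites:
        # Frequencies from every unambiguous base, including gapped columns.
        p1, p2 = base_frequencies(seq1), base_frequencies(seq2)
    else:
        # Frequencies reconstituted from J12 (row and column sums).
        p1 = [sum(row) for row in j12]
        p2 = [sum(j12[i][k] for i in range(4)) for k in range(4)]
    det = determinant(j12)
    if det <= 0.0:
        raise ValueError("det J12 is not positive; the distance is undefined")
    return -0.25 * math.log(det / math.sqrt(math.prod(p1) * math.prod(p2)))
```

When the two sequences contain no gaps, both settings give the same value. They differ only when gapped columns carry bases that the aligned columns do not represent, which is the situation described for the variable domains.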
Previous implementations of the ME criterion, such as METREE (Rzhetsky and Nei, 1994) and FITCH in the PHYLIP package (Felsenstein, 1993), use the ordinary least-squares method for evaluating branch lengths and do not handle the resulting negative branch lengths in a meaningful way. FastME is fast and achieves high topological accuracy by the combination of a very efficient branch-swapping algorithm and a fast tree-evaluating method equivalent to the weighted least-squares method.

The tree produced by FastME with default options and with the LogDet distance for the 48 sequences in the first set (Fig. 2) revealed a group of odd-looking sequences: Ambystoma mexicanum (salamander; GenBank M59384), Nesomantis thomasseti (salamander; M59396), Bufo valliceps (frog; M59386), Turdus migratorius (bird; M59402), Pseudemys scripta (turtle; M59398), Heterodon platyrhinos (snake; M59392), and Alligator mississippiensis (M59383). These amphibian, “reptilian,” and avian species have relatively long branches, do not cluster with their taxonomic sister taxa, and form a cluster among themselves. These sequences are all from the first study (Hedges et al., 1990) in which the avian and mammalian species formed a monophyletic group. A close examination of these sequences shows that all have many unresolved sites, which suggests that the neighboring resolved sites in the sequences might also be unreliable. The long branches associated with these sequences may not mean that they all have extraordinarily rapid evolutionary rates but rather are more likely to be the result of sequencing errors. The grouping of these heterogeneous sequences together to the exclusion of their respective sister taxa cannot be satisfactorily explained without invoking sequencing errors. A site-by-site examination of the data confirms this explanation.
Examination of this odd group of sequences suggests that the grouping of the avian and mammalian species by previous studies based on this group of sequences (Hedges et al., 1990; Rzhetsky and Nei, 1992; Eernisse and Kluge, 1993; Huelsenbeck and Bull, 1996) is at least partially attributable to sequencing error. In subsequent analyses of the first set of sequences, we excluded these seven sequences and one of the two Oryctolagus cuniculus sequences (rabbit; X00640). This sequence was obtained many years ago (Connaughton et al., 1984), and its suspiciously long branch (Fig. 2) suggests that it is unreliable. The phylogenetic tree for the remaining 40 sequences, based on the FastME method with the LogDet distances, clustered the avian and “reptilian” species to the exclusion of mammalian species (Fig. 3a). The bootstrap values from 500 resamples leave little ambiguity in such a grouping (Fig. 3a). The combination of the DNADIST (producing a LogDet distance matrix) and NEIGHBOR programs in PHYLIP (Felsenstein, 1993) also groups the avian and the “reptilian” species together to the exclusion of mammalian species. The reconstructed tree from Weighbor (Bruno et al., 2000) is similar to that of the neighbor-joining (NJ) method, and both trees share the annoying outcome of grouping one of the three rat sequences with the mouse sequences.

FIGURE 2. Phylogenetic tree obtained from the FastME method with LogDet distances. All sites were included in counting nucleotide frequencies for computing the LogDet distance.

FIGURE 3. Phylogenetic tree obtained from the FastME method (a) and the Fitch–Margoliash method (b) with the LogDet distances. Sequences of poor quality have been removed. The numbers are bootstrap values.
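Bootstrap values such as those from the 500 resamples mentioned above rest on resampling alignment columns with replacement. The sketch below shows only that resampling step; building a tree from each replicate and counting how often a clade recurs is left out. The function names and the fixed seed are assumptions for illustration, not the procedure of any program named in the text.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Draw one pseudoreplicate by sampling columns with replacement."""
    names = list(alignment)
    length = len(alignment[names[0]])
    columns = [rng.randrange(length) for _ in range(length)]
    # The same sampled columns are used for every sequence, so each
    # pseudoreplicate column is an intact column of the original alignment.
    return {name: "".join(alignment[name][i] for i in columns) for name in names}


def bootstrap_replicates(alignment, n_replicates=500, seed=0):
    """Return a list of pseudoreplicate alignments (reproducible for a given seed)."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]
```

The bootstrap support for a grouping is then the percentage of replicate trees in which that grouping appears.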
The phylogenetic tree based on the Fitch–Margoliash (FM) method (Fitch and Margoliash, 1967), implemented in the FITCH program of the PHYLIP package (Felsenstein, 1993) and in DAMBE (Xia, 2000a; Xia and Xie, 2001), is similar to the FastME tree in that avian and “reptilian” species are clustered together with high bootstrap values (Fig. 3b). Although the FM method has a global optimization criterion whereas the NJ method achieves only local optimization, this advantage of the FM method over the NJ method is typically lost in practical computation. The FM method is slow, and current implementations of the method, such as those in PHYLIP and DAMBE, adopt a greedy algorithm by starting the tree reconstruction with three operational taxonomic units (OTUs) and then adding new OTUs to the growing tree sequentially. Thus, the so-called global optimization is only applied to successive local trees, and it is misleading to call this approach global optimization. In contrast, FastME explores the tree space much more thoroughly than does the FM method implemented in FITCH and DAMBE.

There is a small difference in the implementation of the FM method between DAMBE and PHYLIP. DAMBE starts with three OTUs that have the greatest average distance from the other OTUs and then adds other OTUs sequentially to the tree. Once all the OTUs have been added, the first three OTUs are then taken off and replanted. This process should produce a tree that is better than that produced by the FITCH program using its default mode but may not be as good as that produced by the FITCH program when all optimization switches are turned on.

The second set of 18S rRNA sequences is mostly composed of sequences used originally to produce the bird–mammal grouping with high bootstrap values (Hedges et al., 1990; Rzhetsky and Nei, 1992; Eernisse and Kluge, 1993; Huelsenbeck et al., 1996).
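The choice of the three starting OTUs described above for DAMBE can be read as a ranking by average pairwise distance. The sketch below follows that reading. It is an interpretation of the prose, not DAMBE's code, and the function name and the toy distance matrix are hypothetical.

```python
def starting_otus(names, dist, k=3):
    """Pick the k OTUs with the greatest average distance to all other OTUs."""
    n = len(names)
    if n <= k:
        return list(names)
    average = {
        names[i]: sum(dist[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    }
    # sorted() is stable, so ties keep their original order.
    return sorted(names, key=lambda name: average[name], reverse=True)[:k]


# Hypothetical symmetric distance matrix for five OTUs.
otus = ["A", "B", "C", "D", "E"]
matrix = [
    [0, 2, 6, 7, 9],
    [2, 0, 5, 6, 8],
    [6, 5, 0, 3, 7],
    [7, 6, 3, 0, 6],
    [9, 8, 7, 6, 0],
]
first_three = starting_otus(otus, matrix)  # ["E", "A", "D"]
```

The remaining OTUs would then be added one at a time to the growing tree, which is why the text calls the overall search greedy rather than global.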
However, when the sequences are aligned according to secondary structure and all sites are used for counting nucleotide frequencies in computing LogDet distances, the avian and “reptilian” sequences form a monophyletic group with unambiguous bootstrap support (Fig. 4). Phylogenetic reconstruction with the FM method produced an identical topology and almost identical bootstrap values. Thus, the bird–mammal grouping is still not recovered with proper phylogenetic methods even when the sequence quality is low.

Previous studies with roughly the same set of sequences grouped avian and mammalian species together with the LogDet distances (Huelsenbeck and Bull, 1996; Huelsenbeck et al., 1996). There are several possible explanations for the discrepancy. First, the sequences used in previous studies may have been aligned differently. Second, the previous studies may have used the LogDet distances specified in equation 1 of Lockhart et al. (1994) instead of those of equation 3 of Lockhart et al. (1994). The latter is identical to our Equation 1 and the paralinear distance (Lake, 1994) in form, but that defined in equation 1 of Lockhart et al. (1994) is different and equals $-\ln(\det J_{12})$. Third, the previous studies may have also used the LogDet distances as defined in our Equation 1 but may have done the calculations based on sites containing no indels. This would imply that $p_1$ and $p_2$ in our Equation 1 were dominated by sites in the conserved domains of the 18S rRNA sequences and consequently may not be appropriate in characterizing a substitution pattern involving substitutions mostly in sites of the variable domains.

The inclusion of the indel-containing sites in our calculation of the LogDet distances also suffers from a possible bias. Both the indel events and the nucleotide substitution events occurred mostly in the variable domains of rRNA sequences (Van de Peer et al., 1993).
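The difference between the two LogDet forms discussed above can be made concrete with a short Python sketch. It assumes the commonly used paralinear form for the frequency-corrected distance, $-\frac{1}{4}\ln\left[\det J_{12} / \sqrt{\prod_i p_{1i} \prod_i p_{2i}}\right]$; the 1/4 scaling, the function names, and the toy sequences are assumptions of this illustration rather than the exact computation used in the study. The frequency vectors `p1` and `p2` are passed in explicitly so that they can be estimated from whichever set of sites is considered appropriate.

```python
import math

BASES = "ACGT"

def joint_matrix(seq1, seq2):
    """4x4 joint frequency matrix J12 over sites where both sequences
    carry an unambiguous base (gapped sites are skipped here)."""
    counts = [[0.0] * 4 for _ in range(4)]
    n = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a in BASES and b in BASES:
            counts[BASES.index(a)][BASES.index(b)] += 1
            n += 1
    if n == 0:
        raise ValueError("no comparable sites")
    return [[c / n for c in row] for row in counts]

def determinant(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    size, det = len(a), 1.0
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, size):
            factor = a[r][col] / a[col][col]
            for c in range(col, size):
                a[r][c] -= factor * a[col][c]
    return det

def marginals(j12):
    """Row and column sums of J12: base frequencies at the compared sites."""
    p1 = [sum(row) for row in j12]
    p2 = [sum(j12[i][k] for i in range(4)) for k in range(4)]
    return p1, p2

def logdet_uncorrected(j12):
    """-ln(det J12): no correction for nucleotide frequencies."""
    det = determinant(j12)
    if det <= 0.0:
        raise ValueError("determinant must be positive")
    return -math.log(det)

def logdet_corrected(j12, p1, p2):
    """Paralinear-style distance; the 1/4 scaling is an assumption."""
    det = determinant(j12)
    if det <= 0.0:
        raise ValueError("determinant must be positive")
    return -0.25 * math.log(det / math.sqrt(math.prod(p1) * math.prod(p2)))

# Invented toy sequences: 20 sites, one difference at the last site.
s1 = "ACGT" * 5
s2 = "ACGT" * 4 + "ACGA"
j = joint_matrix(s1, s2)
p1, p2 = marginals(j)
print(logdet_uncorrected(j), logdet_corrected(j, p1, p2))
```

For two identical sequences the corrected form returns zero, whereas the uncorrected form does not, which is one way to see why the two definitions can lead to different trees.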
If some sequences have experienced a number of deletions at their variable domains and if the genetic distance between the two sequences is calculated using all sites between the two sequences (as we have done), then the genetic distance involving the shorter sequence, i.e., the one with the shorter variable region, will be relatively underestimated (Van de Peer et al., 1993). The avian and “reptilian” 18S rRNA sequences are shorter than those of mammalian species. If avian and “reptilian” sequences share a number of independent deletions of homologous variable domains, then our calculation of the LogDet distances would tend to underestimate the genetic distances involving the avian and the “reptilian” sequences. This problem is shared by results from both the first and the second data sets.

FIGURE 4. Phylogenetic tree obtained from the FastME method with LogDet distances and the second set of sequences used previously to support the bird–mammal grouping. The numbers are bootstrap values. The Fitch–Margoliash method produces the same topology and almost identical bootstrap values.

FIGURE 5. Maximum parsimony (MP) (a) and maximum likelihood (b) trees based on the third set of sequences, including the turtle and the more primitive mammalian species. The branch lengths of the MP tree are not estimated and are set to the same length for display.

For the third set of data with more primitive mammalian lineages, the phylogenetic result supports the avian–crocodilian grouping (Fig. 5) in both parsimony and likelihood analyses. This set of sequences was aligned independently from the other two sets of sequences, and none of the three independently aligned data sets supports the bird–mammal grouping. This leaves little doubt that the 18S rRNA gene is not as odd as previous studies have suggested (Hedges et al., 1990; Rzhetsky and Nei, 1992; Eernisse and Kluge, 1993; Huelsenbeck et al., 1996).
In particular, this last set of data does not have the potential bias outlined in the previous paragraph involving our calculation of the LogDet distances.

Although it appears premature to conclude that the 18S rRNA sequences supply “definitive evidence of different genes providing significantly different estimates of phylogeny in higher organisms” (Huelsenbeck et al., 1996:156), it is important to properly choose the substitution model and phylogenetic methods. When we delete all indel-containing sites in the first and the second data sets so that the nucleotide frequencies are dominated by the invariable sites, all major distance methods (NJ, FM, Weighbor, FastME) with any of the genetic distances (including LogDet and paralinear distances) group the avian and mammalian sequences as a monophyletic group, just as shown in previous studies. PAUP 4 (Swofford, 2000) ignores all indel-containing sites in calculating the pairwise Lake/LogDet distances, and the distance-based tree-making methods implemented in PAUP will always group the mammalian and avian species together to the exclusion of the “reptilian” species. Similarly, when we apply to the first and the second data sets any existing maximum likelihood, maximum parsimony, or distance-based method that does not accommodate the nonstationary nature of the substitution process involved, we again have avian sequences strongly grouped with the mammalian sequences to the exclusion of “reptilian” species (data not shown). The sequences exhibit strong heterogeneity in substitution rate over sites, with estimated alpha values of 0.1643 and 0.1432 for the first and the second data sets, respectively. We have used the maximum-likelihood method with gamma-distributed rates by using the new version of DNAML and BASEML, and the resulting trees always grouped the birds and mammals together.
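The effect of deleting indel-containing sites on the frequency estimates can be illustrated with a small Python sketch. The alignment below is invented for illustration (a conserved AT-rich block followed by a GC-rich variable block with a deletion in one sequence), and the function names are hypothetical; the sketch only shows how frequencies estimated from gap-free columns can differ from those estimated from all sites of a sequence.

```python
from collections import Counter

BASES = "ACGT"
GAP = "-"

def base_frequencies(seq):
    """Frequencies of A, C, G, T among the unambiguous bases of seq."""
    counts = Counter(c for c in seq.upper() if c in BASES)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return {b: counts[b] / total for b in BASES}

def gap_free_columns(alignment):
    """Indices of columns in which no sequence has a gap."""
    seqs = list(alignment.values())
    return [i for i in range(len(seqs[0]))
            if all(s[i] != GAP for s in seqs)]

def frequencies_all_vs_gap_free(alignment):
    """Per-sequence (all-sites, gap-free-sites) frequency estimates."""
    keep = gap_free_columns(alignment)
    result = {}
    for name, seq in alignment.items():
        trimmed = "".join(seq[i] for i in keep)
        result[name] = (base_frequencies(seq), base_frequencies(trimmed))
    return result

def gc_content(freqs):
    return freqs["G"] + freqs["C"]

# Invented toy alignment, not data from the study.
toy = {
    "seq_long":  "ATATATATGCGCGCGC",
    "seq_short": "ATATATATGC----GC",
}
for name, (all_sites, gap_free) in frequencies_all_vs_gap_free(toy).items():
    print(name, round(gc_content(all_sites), 3), round(gc_content(gap_free), 3))
```

In this toy case the GC content of the longer sequence drops once the gapped columns are removed, because the removed columns lie entirely in the variable, GC-rich block.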
When we do not use the LogDet distances but instead use the genetic distance based on the three-parameter TN93 model (Tamura and Nei, 1993) or any other substitution model, the resulting trees also group the avian and mammalian species together, regardless of whether the distance is corrected with the estimated alpha value or not. This result highlights the importance of accommodating nonstationarity in the substitution process.

Structurally aligned 18S rRNA sequences from major tetrapod taxa produce topologies similar to those based on other genes, morphological characters, and paleontological evidence. The rRNA sequences must be aligned using the secondary structure as a template, and frequency parameters appropriate for the underlying substitution model must be used. Secondary structure information should also be used to determine the boundaries between aligned and alignment-ambiguous regions (Kjer, 1997) so that these regions can be objectively examined according to the coding method (Lutzoni et al., 2000). This study highlights the problem of applying a battery of computer programs to the data without first checking the quality of the data and emphasizes the importance of becoming intimately familiar with the data. Many of these conclusions could not have been made without looking at the data.

ACKNOWLEDGMENTS

This study was supported by research grants from NSERC and from the University of Ottawa to X.X. and by a Chinese Ministry of Education grant to Z.X. We thank Axel Meyer for references and anonymous referees for helpful comments and suggestions. K.M.K. acknowledges support from the New Jersey Agricultural Experiment Station. We thank Chris Simon for her comments, suggestions, and references that helped clarify a number of points. John LaPolla contributed fragments of the turtle sequence.

REFERENCES

AUSIO, J., J. T. SOLEY, W. BURGER, J. D. LEWIS, D. BARREDA, AND K. M. CHENG. 1999. The histidine-rich protamine from ostrich and tinamou sperm.
A link between reptile and bird protamines. Biochemistry (Moscow) 38:180–184.
BALDWIN, B. G., M. J. SANDERSON, J. M. PORTER, M. F. WOJCIECHOWSKI, C. C. CAMPBELL, AND M. J. DONOGHUE. 1995. The ITS region of nuclear ribosomal DNA: A valuable source of evidence on angiosperm phylogeny. Ann. Mo. Bot. Gard. 82:257–277.
BERNARDI, G. 1993. The vertebrate genome: Isochores and evolution. Mol. Biol. Evol. 10:186–204.
BRUNO, W. J., N. D. SOCCI, AND A. L. HALPERN. 2000. Weighted neighbor joining: A likelihood-based approach to distance-based phylogeny reconstruction. Mol. Biol. Evol. 17:189–197.
BUCKLEY, T. R., C. SIMON, P. K. FLOOK, AND B. MISOF. 2000. Secondary structure and conserved motifs of the frequently sequenced domains IV and V of the insect mitochondrial large subunit rRNA gene. Insect Mol. Biol. 9:565–580.
CARROLL, R. L. 1988. Vertebrate paleontology and evolution. W. H. Freeman, New York.
CASPERS, G. J., G. J. REINDERS, J. A. M. LEUNISSEN, J. WATTEL, AND W. W. DE JONG. 1996. Protein sequences indicate that turtles branched off from the amniote tree after mammals. J. Mol. Evol. 42:580–586.
CONNAUGHTON, J. F., A. RAIRKAR, R. E. LOCKARD, AND A. KUMAR. 1984. Primary structure of rabbit 18S ribosomal RNA determined by direct RNA sequence analysis. Nucleic Acids Res. 12:4731–4745.
CRANDALL, K. A., AND J. J. F. FITZPATRICK. 1996. Crayfish molecular systematics: Using a combination of procedures to estimate phylogeny. Syst. Biol. 45:1–26.
CUNNINGHAM, C. O., H. ALIESKY, AND C. M. COLLINS. 2000. Sequence and secondary structure variation in the Gyrodactylus (Platyhelminthes: Monogenea) ribosomal RNA gene array. J. Parasitol. 86:567–576.
DESPER, R., AND O. GASCUEL. 2002. Fast and accurate phylogeny reconstruction algorithms based on the minimum-evolution principle. J. Comput. Biol. 9:687–705.
DIXON, M. T., AND D. M. HILLIS. 1993.
Ribosomal RNA secondary structure: Compensatory mutations and implications for phylogenetic analysis. Mol. Biol. Evol. 10:256–267.
EERNISSE, D. J., AND A. G. KLUGE. 1993. Taxonomic congruence versus total evidence, and amniote phylogeny inferred from fossils, molecules, and morphology. Mol. Biol. Evol. 10:1170–1195.
FELSENSTEIN, J. 1993. PHYLIP 3.5 (phylogeny inference package), version 3.5. Department of Genetics, Univ. Washington, Seattle.
FITCH, D. H. A., B. BUGAJGAWEDA, AND S. W. EMMONS. 1995. 18S ribosomal-RNA gene phylogeny for some Rhabditidae related to Caenorhabditis. Mol. Biol. Evol. 12:346–358.
FITCH, W. M., AND E. MARGOLIASH. 1967. Construction of phylogenetic trees. Science 155:279–284.
FLORES-VILLELA, O., K. M. KJER, M. BENABIB, AND J. W. SITES. 2000. Multiple data sets, congruence and hypothesis testing for the phylogeny of basal groups of the lizard genus Sceloporus (Squamata, Phrynosomatidae). Syst. Biol. 49:713–739.
GARDINER, B. G. 1982. Tetrapod classification. Zool. J. Linn. Soc. 74:207–232.
GASCUEL, O. 1997. BIONJ: An improved version of the NJ algorithm based on a simple model of sequence data. Mol. Biol. Evol. 14:685–695.
GAUTHIER, J., A. G. KLUGE, AND T. ROWE. 1988. Amniote phylogeny and the importance of fossils. Cladistics 4:105–209.
GONZALEZ, P., AND J. LABARERE. 2000. Phylogenetic relationships of Pleurotus species according to the sequence and secondary structure of the mitochondrial small-subunit rRNA V4, V6 and V9 domains. Microbiology 146:209–221.
GU, X., AND J. ZHANG. 1997. A simple method for estimating the parameter of substitution rate variation among sites. Mol. Biol. Evol. 14:1106–1113.
HEDGES, S. B. 1992. The number of replications needed for accurate estimation of bootstrap P-value in phylogenetic studies. Mol. Biol. Evol. 9:366–369.
HEDGES, S. B. 1994. Molecular evidence for the origin of birds. Proc. Natl. Acad. Sci. USA 91:2621–2624.
HEDGES, S. B., AND L. R. MAXSON. 1992.
18S-ribosomal-RNA sequences and amniote phylogeny: Reply to Marshall. Mol. Biol. Evol. 9:374–377.
HEDGES, S. B., K. D. MOBERG, AND L. R. MAXSON. 1990. Tetrapod phylogeny inferred from 18S and 28S ribosomal RNA sequences and a review of the evidence for amniote relationships. Mol. Biol. Evol. 7:607–633.
HIBBETT, D. S., Y. FUKUMASA-NAKAI, A. TSUNEDA, AND M. J. DONOGHUE. 1995. Phylogenetic diversity in shiitake inferred from nuclear ribosomal DNA. Mycologia 87:618–638.
HICKSON, R. E., C. SIMON, A. COOPER, G. S. SPICER, J. SULLIVAN, AND D. PENNY. 1996. Conserved sequence motifs, alignment, and secondary structure for the third domain of animal 12S rRNA. Mol. Biol. Evol. 13:150–169.
HICKSON, R. E., C. SIMON, AND S. W. PERREY. 2000. The performance of several multiple-sequence alignment programs in relation to secondary-structure features for an rRNA sequence. Mol. Biol. Evol. 17:530–539.
HUELSENBECK, J. P., AND J. J. BULL. 1996. A likelihood ratio test to detect conflicting phylogenetic signal. Syst. Biol. 45:92–98.
HUELSENBECK, J. P., J. J. BULL, AND C. W. CUNNINGHAM. 1996. Combining data in phylogenetic analysis. Trends Ecol. Evol. 11:152–158.
HWANG, S. K., AND J. G. KIM. 2000. Secondary structural and phylogenetic implications of nuclear large subunit ribosomal RNA in the ectomycorrhizal fungus Tricholoma matsutake. Curr. Microbiol. 40:250–256.
JANKE, A., AND U. ARNASON. 1997. The complete mitochondrial genome of Alligator mississippiensis and the separation between recent archosauria (birds and crocodiles). Mol. Biol. Evol. 14:1266–1272.
KJER, K. M. 1995. Use of ribosomal-RNA secondary structure in phylogenetic studies to identify homologous positions: An example of alignment and data presentation from the frogs. Mol. Phylogenet. Evol. 4:314–330.
KJER, K. M. 1997. Conserved primary and secondary structural motifs of amphibian 12S rRNA, domain III. J. Herpetol. 31:599–604.
KJER, K. M., R. J. BLAHNIK, AND R. HOLZENTHAL. 2001.
Phylogeny of Trichoptera (Caddisflies): Characterization of signal and noise within multiple datasets. Syst. Biol. 50:781–816.
KRETZER, A., Y. LI, T. SZARO, AND T. D. BRUNS. 1996. Internal transcribed spacer sequences from 38 recognized species of Suillus sensu lato: Phylogenetic and taxonomic implications. Mycologia 88:776–785.
LAKE, J. A. 1994. Reconstructing evolutionary trees from DNA and protein sequences: Paralinear distances. Proc. Natl. Acad. Sci. USA 91:1455–1459.
LOCKHART, P. J., A. W. LARKUM, M. STEEL, P. J. WADDELL, AND D. PENNY. 1996. Evolution of chlorophyll and bacteriochlorophyll: The problem of invariant sites in sequence analysis. Proc. Natl. Acad. Sci. USA 93:1930–1934.
LOCKHART, P. J., M. A. STEEL, M. D. HENDY, AND D. PENNY. 1994. Recovering evolutionary trees under a more realistic model of sequence evolution. Mol. Biol. Evol. 11:605–612.
LØVTRUP, S. 1985. On the classification of the taxon Tetrapoda. Syst. Zool. 34:463–470.
LUTZONI, F., P. WAGENER, V. REEV, AND S. ZOLLER. 2000. Integrating ambiguously aligned regions of DNA sequence in phylogenetic analyses without violating positional homology. Syst. Biol. 49:628–651.
LYDEARD, C., W. E. HOLZNAGEL, M. N. SCHNARE, AND R. R. GUTELL. 2000. Phylogenetic analysis of molluscan mitochondrial LSU rDNA sequences and secondary structures. Mol. Phylogenet. Evol. 15:83–102.
MAIDAK, B. L., J. R. COLE, T. G. LILBURN, C. T. PARKER, JR., P. R. SAXMAN, J. M. STREDWICK, G. M. GARRITY, B. LI, G. J. OLSEN, S. PRAMANIK, T. M. SCHMIDT, AND J. M. TIEDJE. 2000. The RDP (Ribosomal Database Project) continues. Nucleic Acids Res. 28:173–174.
MANOS, P. S. 1997. Systematics of Nothofagus (Nothofagaceae) based on rDNA spacer sequences (ITS): Taxonomic congruence with morphology and plastid sequences. Am. J. Bot. 84:1137–1155.
MARSHALL, C. R. 1992. Substitution bias, weighted parsimony, and amniote phylogeny as inferred from 18S-ribosomal-RNA sequences. Mol. Biol. Evol. 9:370–373.
MORIN, L. 2000. Long branch attraction effects and the status of “basal eukaryotes”: Phylogeny and structural analysis of the ribosomal RNA gene cluster of the free-living diplomonad Trepomonas agilis. J. Eukaryot. Microbiol. 47:167–177.
MORRISON, D. A., AND J. T. ELLIS. 1997. Effects of nucleotide sequence alignment on phylogeny estimation: A case study of 18S rDNAs of Apicomplexa. Mol. Biol. Evol. 14:428–441.
MUGRIDGE, N. B., D. A. MORRISON, A. M. JOHNSON, K. LUTON, J. P. DUBEY, J. VOTYPKA, AND A. M. TENTER. 1999. Phylogenetic relationships of the genus Frenkelia: A review of its history and new knowledge gained from comparison of large subunit ribosomal ribonucleic acid gene sequences. Int. J. Parasitol. 29:957–972.
NOTREDAME, C., E. A. O’BRIEN, AND D. G. HIGGINS. 1997. RAGA: RNA sequence alignment by genetic algorithm. Nucleic Acids Res. 25:4570–4580.
OLSEN, G. J., AND C. R. WOESE. 1993. Ribosomal RNA: A key to phylogeny. Fed. Am. Soc. Exp. Biol. J. 7:113–123.
ROMER, A. S. 1966. Vertebrate paleontology. Univ. Chicago Press, Chicago.
RZHETSKY, A., AND M. NEI. 1992. A simple method for estimating and testing minimum-evolution trees. Mol. Biol. Evol. 9:945–967.
RZHETSKY, A., AND M. NEI. 1994. METREE: A program package for inferring and testing minimum-evolution trees. Comput. Appl. Biosci. 10:409–412.
SEUTIN, G., B. F. LANG, D. P. MINDELL, AND R. MORAIS. 1994. Evolution of the WANCY region in amniote mitochondrial-DNA. Mol. Biol. Evol. 11:329–340.
SWOFFORD, D. L. 1993. PAUP: Phylogenetic analysis using parsimony. Illinois Natural History Survey, Champaign.
SWOFFORD, D. L. 2000. PAUP: Phylogenetic analysis using parsimony* (*and other methods), version 4. Sinauer, Sunderland, Massachusetts.
TAMURA, K., AND M. NEI. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10:512–526.
TITUS, T. A., AND D.
R. FROST. 1996. Molecular homology assessment and phylogeny in the lizard family Opluridae (Squamata: Iguania). Mol. Phylogenet. Evol. 6:49–62.
UCHIDA, H., K. KITAE, K. I. TOMIZAWA, AND A. YOKOTA. 1998. Comparison of the nucleotide sequence and secondary structure of the 5.8S ribosomal RNA gene of Chlamydomonas tetragama with those of green algae. DNA Seq. 8:403–408.
VAN DE PEER, Y., P. DE RIJK, J. WUYTS, T. WINKELMANS, AND R. DE WACHTER. 2000. The European small subunit ribosomal RNA database. Nucleic Acids Res. 28:175–176.
VAN DE PEER, Y., J. M. NEEFS, P. DE RIJK, AND R. DE WACHTER. 1993. Reconstructing evolution from eukaryotic small-ribosomal-subunit RNA sequences: Calibration of the molecular clock. J. Mol. Evol. 37:221–232.
WILLIAMS, P. L., AND W. M. FITCH. 1990. Phylogeny determination using dynamically weighted parsimony method. Methods Enzymol. 183:615–626.
XIA, X. 2000a. Data analysis in molecular biology and evolution. Kluwer, Boston.
XIA, X. 2000b. Phylogenetic relationship among horseshoe crab species: The effect of substitution models on phylogenetic analyses. Syst. Biol. 49:87–100.
XIA, X., AND Z. XIE. 2001. DAMBE: Software package for data analysis in molecular biology and evolution. J. Hered. 92:371–373.
ZARDOYA, R., AND A. MEYER. 1996. Phylogenetic performance of mitochondrial protein-coding genes in resolving relationships among vertebrates. Mol. Biol. Evol. 13:933–942.
ZARDOYA, R., AND A. MEYER. 1998. Complete mitochondrial genome suggests diapsid affinities of turtles. Proc. Natl. Acad. Sci. USA 95:14226–14231.

First submitted 13 March 2001; reviews returned 17 June 2001; final acceptance 8 February 2003
Associate Editor: Chris Simon