Abstract
Arabidopsis is currently the reference genome for higher plants. A new, more detailed statistical analysis of Arabidopsis gene structure is presented, including intron and exon lengths, intergenic distances, features of promoters, and variant 5′-ends of mRNAs transcribed from the same transcription unit. We also provide a statistical characterization of Arabidopsis transcripts in terms of their size, UTR lengths, 3′-end cleavage sites, splicing variants, and coding potential. These analyses were facilitated by scrutiny of our collection of sequenced full-length cDNAs and our much larger collection of 5′-ESTs, together with another set of full-length cDNAs from Salk/Stanford/Plant Gene Expression Center/RIKEN. Alternative splicing is observed for transcripts from 7% of the genes, and many of these genes display multiple spliced isoforms. Most splicing variants lie in non-coding regions of the transcripts. Non-canonical splice sites constitute less than 1% of all splice sites. Genes with fewer than four introns display reduced average mRNA levels. Putative alternative transcription start sites were observed in 30% of highly expressed genes and in more than 50% of the genes with low expression. Transcription start sites correlate remarkably well with a CG skew peak in the DNA sequences. Intergenic distances vary considerably; those between genes transcribed towards one another are significantly shorter. New transcripts missing from the current TIGR genome annotation, and non-coding ESTs, including those antisense to known genes, are derived and cataloged in the Supplementary Material. They identify 148 new loci in the Arabidopsis genome. The conclusions drawn provide a better understanding of the Arabidopsis genome and how the gene transcripts are processed.
The results also allow better predictions to be made for genes that are as yet poorly defined, and provide a reference for comparisons with other plant genomes whose complete sequences are currently being determined. Some comparisons with rice are included in this paper.
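To illustrate what a CG skew peak near a transcription start site means in practice, the following is a minimal sketch of a sliding-window skew profile. It assumes the common definition skew = (C − G)/(C + G) computed per window; the abstract does not state the exact formula, window size, or sign convention used in the paper, so the function names `cg_skew` and `skew_peak` and the parameters `window` and `step` are illustrative assumptions only.

```python
def cg_skew(seq, window=100, step=10):
    """Return (window_center, skew) pairs for a DNA sequence.

    Assumed definition: skew = (C - G) / (C + G) within each window.
    Windows containing no C or G are assigned a skew of 0.0.
    """
    seq = seq.upper()
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        c = chunk.count("C")
        g = chunk.count("G")
        skew = (c - g) / (c + g) if (c + g) else 0.0
        profile.append((start + window // 2, skew))
    return profile


def skew_peak(profile):
    """Return the center position of the window with the largest skew."""
    return max(profile, key=lambda item: item[1])[0]
```

With such a profile computed over the region around an annotated gene, the position returned by `skew_peak` could be compared with the 5′-end of a mapped full-length cDNA to see how closely the two coincide.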
Cite this article
Alexandrov, N.N., Troukhan, M.E., Brover, V.V. et al. Features of Arabidopsis Genes and Genome Discovered using Full-length cDNAs. Plant Mol Biol 60, 69–85 (2006). https://doi.org/10.1007/s11103-005-2564-9
DOI: https://doi.org/10.1007/s11103-005-2564-9