The premise of this project is that the scale at which sequence and other data are accumulating in plant genomics necessitates the development of novel, highly automated, scalable, comprehensive, and accurate approaches to genome annotation. The depth of transcript data accumulating for many plant species under numerous experimental conditions provides unprecedented evidence for the evaluation of all aspects of transcription, including precise mapping of transcription start sites as well as dominant and alternative splice sites. This project engages a team of experts in a wide range of fields, including genomics, molecular biology, bioinformatics, statistics, machine learning, high-performance computing, and software engineering, to work jointly toward a solution for accurately predicting the expressed protein-coding gene transcriptome from plant genome sequences. Successful completion of the project will result in the deployment of (1) software that implements the novel prediction algorithms, (2) visualization and data access portals, and (3) a cyberinfrastructure implementation of the developed tools for distributed computing, sharing of protocols, and recording of analysis provenance. In the long run, the project seeks to explore the extent to which genomic biology can transition from a largely descriptive to a highly predictive science driven by quantitative measurements, with algorithms and computation as the domain-adapted language.