Dr. Taylor Helen Ferebee

Sr. Data Scientist I | Applied Biological AI & Strategy | Gene Editing

Fishing for a reelGene: evaluating gene models with evolution and machine learning


Journal article


Aimee J. Schulz, Jingjing Zhai, Taylor M. Aubuchon-Elder, Mohamed El-Walid, Taylor H. Ferebee, Elizabeth H Gilmore, M. Hufford, Lynn Johnson, Elizabeth A. Kellogg, T. La, Evan M Long, Zachary R. Miller, M. Romay, Arun S. Seetharam, Michelle C. Stitzer, Travis Wrightsman, E. Buckler, B. Monier, Sheng‐Kai Hsu
bioRxiv, 2023

Cite

APA
Schulz, A. J., Zhai, J., Aubuchon-Elder, T. M., El-Walid, M., Ferebee, T. H., Gilmore, E. H., … Hsu, S. K. (2023). Fishing for a reelGene: evaluating gene models with evolution and machine learning. bioRxiv.


Chicago/Turabian
Schulz, Aimee J., Jingjing Zhai, Taylor M. Aubuchon-Elder, Mohamed El-Walid, Taylor H. Ferebee, Elizabeth H. Gilmore, M. Hufford, et al. “Fishing for a reelGene: Evaluating Gene Models with Evolution and Machine Learning.” bioRxiv (2023).


MLA
Schulz, Aimee J., et al. “Fishing for a reelGene: Evaluating Gene Models with Evolution and Machine Learning.” bioRxiv, 2023.


BibTeX

@article{aimee2023a,
  title = {Fishing for a reelGene: evaluating gene models with evolution and machine learning},
  year = {2023},
  journal = {bioRxiv},
  author = {Schulz, Aimee J. and Zhai, Jingjing and Aubuchon-Elder, Taylor M. and El-Walid, Mohamed and Ferebee, Taylor H. and Gilmore, Elizabeth H and Hufford, M. and Johnson, Lynn and Kellogg, Elizabeth A. and La, T. and Long, Evan M and Miller, Zachary R. and Romay, M. and Seetharam, Arun S. and Stitzer, Michelle C. and Wrightsman, Travis and Buckler, E. and Monier, B. and Hsu, Sheng-Kai}
}

Abstract

Assembled genomes and their associated annotations have transformed our study of gene function. However, each new assembly generates new gene models. Inconsistencies between annotations likely arise from biological and technical causes, including pseudogene misclassification, transposon activity, and intron retention from sequencing of unspliced transcripts. To evaluate gene model predictions, we developed reelGene, a pipeline of machine learning models focused on (1) transcription boundaries, (2) mRNA integrity, and (3) protein structure. The first two models leverage sequence characteristics and evolutionary conservation across related taxa to learn the grammar of conserved transcription boundaries and mRNA sequences, while the third uses conserved evolutionary grammar of protein sequences to predict whether a gene can produce a protein. Evaluating 1.8 million gene models in maize, reelGene found that 28% were incorrectly annotated or nonfunctional. By leveraging a large cohort of related species and learning the conserved grammar of proteins, reelGene provides a tool for both evaluating gene model accuracy and studying genome biology.