Graduate Studies

 

First Advisor

Keenan Amundsen

Degree Name

Doctor of Philosophy (Ph.D.)

Department

Complex Biosystems

Date of this Version

12-5-2023

Document Type

Dissertation

Comments

Copyright 2023, Michael Morikone

Abstract

Gene prediction has seen little algorithmic improvement since the first gene prediction algorithms were developed thirty years ago. Rather than iteratively improving the underlying algorithms with better-performing models, most current approaches update existing tools by incorporating increasing amounts of extrinsic data. Genes are traditionally predicted using Hidden Markov Models (HMMs), which are constrained by strict assumptions about the independence of genes that do not always hold. To address this, a Convolutional Neural Network (CNN) based gene prediction tool, GeneCNN, was developed. Due to their nonlinearity, neural networks are adept at capturing complex relationships between data points when applied to sufficiently large datasets such as whole genomes. CNNs improve further on neural networks by incorporating spatial dependence into individual data points. GeneCNN was trained using a sequenced buffalograss (Bouteloua dactyloides) genome. After training, GeneCNN achieved 97% accuracy in identifying genic sequences in test data. Given a 10-million-nucleotide buffalograss genome sequence as input, GeneCNN uniquely identified a greater number of genes than the existing gene prediction tools BRAKER3, AUGUSTUS, and Fgenesh, at 1,089, 1,535, and 478, respectively. Gene predictions made by combinations of BRAKER3, AUGUSTUS, and Fgenesh were compared to those of GeneCNN to assess the percentage of GeneCNN gene predictions supported by at least one other tool; across every combination of the three tools, support ranged from 40.5% to 84.1% of all GeneCNN gene predictions.
The findings in this study support the use of CNNs for gene prediction and serve as a valuable resource for the improvement of gene prediction algorithms in future research.
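
The support percentages reported above depend on how "supported by at least one other tool" is defined, which the abstract does not specify. The following is a minimal sketch of one possible definition, assuming that a prediction counts as supported when it shares at least one base with a prediction from another tool on the same sequence. The `Gene` tuple layout, the function names, and the toy coordinates are hypothetical and are not taken from the dissertation.

```python
from typing import Iterable, List, Tuple

# Hypothetical gene record: (sequence id, start, end), 1-based inclusive coordinates.
Gene = Tuple[str, int, int]


def overlaps(a: Gene, b: Gene) -> bool:
    """True when two genes lie on the same sequence and share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def percent_supported(query: List[Gene], others: Iterable[List[Gene]]) -> float:
    """Percentage of query genes overlapped by a gene from at least one other tool.

    Assumption: any overlap counts as support. The study may use a stricter rule.
    """
    if not query:
        return 0.0
    # Pool the predictions of every comparison tool into one list.
    pooled = [gene for tool in others for gene in tool]
    supported = sum(any(overlaps(q, g) for g in pooled) for q in query)
    return 100.0 * supported / len(query)


# Toy coordinates invented purely for illustration (not data from the study).
genecnn = [("chr1", 100, 500), ("chr1", 900, 1200), ("chr2", 50, 300), ("chr2", 800, 950)]
tool_a = [("chr1", 450, 700)]
tool_b = [("chr2", 10, 60), ("chr1", 2000, 2500)]

print(percent_supported(genecnn, [tool_a, tool_b]))  # 50.0
```

Passing different subsets of tools in `others` mirrors the idea of evaluating support for every combination of comparison tools. A stricter criterion, such as a minimum reciprocal overlap or matching strand, would lower the resulting percentages.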
