Complex Biosystems PhD Program

 

Complex Biosystems Program: Dissertations and Student Research

First Advisor

Keenan Amundsen

Date of this Version

11-2023

Document Type

Dissertation

Citation

A dissertation presented to the faculty of the Graduate College at the University of Nebraska in partial fulfillment of requirements for the degree of Doctor of Philosophy

Major: Complex Biosystems (Systems Analysis)

Under the supervision of Professor Keenan Amundsen

Lincoln, Nebraska, November 2023

Comments

Copyright 2023, Michael Morikone

Abstract

Gene prediction has seen little algorithmic improvement since the first gene prediction algorithms were developed thirty years ago. Rather than iteratively improving the underlying algorithms by adopting better-performing models, most current approaches update existing tools by incorporating increasing amounts of extrinsic data. Genes are traditionally predicted using Hidden Markov Models (HMMs), which are constrained by strict assumptions about the independence of genes that do not always hold. To address this, a Convolutional Neural Network (CNN) based gene prediction tool named GeneCNN was developed. Because they are nonlinear, neural networks are adept at capturing complex relationships between data points when applied to sufficiently large datasets such as whole genomes. CNNs improve on standard neural networks by incorporating the spatial context of each data point. GeneCNN was trained on a sequenced buffalograss (Bouteloua dactyloides) genome and achieved 97% accuracy in identifying genic sequences in test data. With a 10 million nucleotide buffalograss genome sequence as input, GeneCNN uniquely identified a greater number of genes than the existing gene prediction tools BRAKER3, AUGUSTUS, and Fgenesh, at 1,089, 1,535, and 478 respectively. GeneCNN's predictions were also compared against those of every combination of BRAKER3, AUGUSTUS, and Fgenesh to assess the percentage of GeneCNN gene predictions supported by at least one other tool; across these combinations, support ranged from 40.5% to 84.1% of all GeneCNN gene predictions. The findings in this study support the use of CNNs for gene prediction and serve as a valuable resource for improving gene prediction algorithms in future research.
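
The support percentage reported in the abstract (the share of GeneCNN predictions backed by at least one other tool) can be illustrated with a minimal sketch. The abstract does not state how support was determined, so the criterion used here, that two predictions on the same sequence share at least one base, is an assumption. The function names, the interval format, the tool labels, and the coordinates are all hypothetical and chosen only for illustration; they are not taken from the dissertation.

```python
from typing import Dict, List, Tuple

# Hypothetical interval format: (sequence name, start, end), 1-based inclusive
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True when both intervals lie on the same sequence and share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def percent_supported(query: List[Interval], others: Dict[str, List[Interval]]) -> float:
    """Percentage of query predictions overlapped by a prediction from at least one other tool.

    Assumption: any overlap counts as support; a real comparison might instead
    require a minimum overlap fraction or matching gene structure.
    """
    if not query:
        return 0.0
    # Pool the predictions of every tool in the combination being tested
    reference = [iv for preds in others.values() for iv in preds]
    supported = sum(1 for q in query if any(overlaps(q, r) for r in reference))
    return 100.0 * supported / len(query)


# Toy coordinates, invented for illustration only
genecnn = [("chr1", 100, 500), ("chr1", 900, 1400), ("chr1", 2000, 2600), ("chr2", 50, 300)]
other_tools = {
    "tool_a": [("chr1", 450, 800)],
    "tool_b": [("chr1", 2100, 2500), ("chr2", 400, 700)],
}

print(percent_supported(genecnn, other_tools))  # prints 50.0
```

Running the same function once per combination of tools (each tool alone, each pair, and all three together) would give a range of support values of the kind the abstract reports. The nested loop is quadratic in the number of predictions, which is fine for a sketch but would likely need an interval index for a whole genome.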

Advisor: Keenan Amundsen
