Electrical & Computer Engineering, Department of


First Advisor

Dr. Hasan H. Otu

Date of this Version



A THESIS Presented to the Faculty of The Graduate College at the University of Nebraska In Partial Fulfillment of Requirements For the Degree of Master of Science, Major: Electrical Engineering, Under the Supervision of Professor Hasan H. Otu. Lincoln, Nebraska: May, 2022

Copyright © 2022 Dillon Burgess


Identification of genes that show similarity between different organisms, also known as orthologous genes, is an open problem in computational biology. The purpose of this thesis is to create an algorithm that groups orthologous genes using machine learning. Following an optimization step that used training data to find the best characterization, we represented gene or protein sequences as k-mer vectors. These k-mer vectors were then clustered into orthologous groups using hierarchical clustering. We optimized the clustering phase with the same training data to select the method and parameters. Our results indicated that using protein sequences with k=2 and scaling the data for each k-mer provided the best results. We employed Pearson’s correlation as the distance metric and used complete linkage in the agglomeration step. The number of clusters is calculated based on four different approaches that evaluate the optimum number of clusters. This algorithm was compared against OrthoDB, an orthologous gene grouping algorithm that has been proven to work well. The results show that when small datasets were used, our algorithm performed better than OrthoDB. When larger genome-level datasets were used, OrthoDB outperformed our algorithm as long as the input data to OrthoDB was divided based on organism count. Our algorithm has an advantage over OrthoDB in that the data does not have to be divided by organism; it can be given as one file. The proposed algorithm runs much faster than OrthoDB and is, to the best of our knowledge, the first approach that uses unsupervised machine learning techniques and does not rely on sequence alignment or phylogeny to identify orthologous genes. Overall, our algorithm provides a novel solution that is fast, practical, and, unlike existing approaches, can be applied to data sets such as metagenomics, where the underlying number of organisms is unknown.
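The pipeline summarized in the abstract (2-mer counts of protein sequences, per-k-mer scaling, a Pearson-correlation distance, and complete-linkage agglomeration) can be illustrated with a minimal Python sketch. This is not the thesis implementation. The function names are hypothetical, "scaling" is assumed to mean z-scoring each k-mer column, the distance is assumed to be 1 minus Pearson's r, and the number of clusters is passed in by hand rather than estimated by the four approaches the thesis uses. The naive merge loop is meant for toy inputs only.

```python
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues


def kmer_vector(seq, k=2, alphabet=AMINO_ACIDS):
    """Count overlapping k-mers of seq over a fixed alphabet (400 entries for k=2)."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0.0] * len(index)
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip k-mers containing non-standard residues
            counts[j] += 1.0
    return counts


def scale_columns(rows):
    """Z-score each k-mer column across sequences (assumed meaning of 'scaling')."""
    n = len(rows)
    scaled_cols = []
    for col in zip(*rows):
        mean = sum(col) / n
        sd = sqrt(sum((x - mean) ** 2 for x in col) / n)
        # Constant columns carry no information; map them to zero.
        scaled_cols.append([(x - mean) / sd if sd > 0 else 0.0 for x in col])
    return [list(r) for r in zip(*scaled_cols)]


def pearson_distance(a, b):
    """Distance defined as 1 - Pearson's r (an assumption of this sketch)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 1.0  # correlation undefined; treat as uncorrelated
    return 1.0 - cov / sqrt(va * vb)


def complete_linkage(vectors, n_clusters):
    """Naive agglomerative clustering; returns lists of row indices."""
    d = [[pearson_distance(a, b) for b in vectors] for a in vectors]
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: cluster distance is the farthest member pair.
                dist = max(d[p][q] for p in clusters[i] for q in clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy usage: four made-up peptide strings, grouped into two clusters.
toy = ["MKVLMKVL", "MKVLMKVI", "GGSGGSGG", "GGSGGSGA"]
groups = complete_linkage(scale_columns([kmer_vector(s) for s in toy]), 2)
```

Because every sequence is reduced to a fixed-length vector, no pairwise alignment is needed and all sequences can be supplied together, regardless of which organism they come from.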

Adviser: Hasan H. Otu