Computer Science and Engineering, Department of


First Advisor

Juan Cui

Second Advisor

Etsuko Moriyama

Third Advisor

Heriberto Cerutti

Date of this Version

Fall 12-1-2022


A THESIS Presented to the Faculty of The Graduate College at the University of Nebraska In Partial Fulfillment of Requirements For the Degree of Master of Science, Major: Computer Science, Under the supervision of Professor Juan Cui. Lincoln, NE: December, 2022

Copyright © 2022 Yingshan Li


Viral metagenomics is independent of lab culturing and can investigate the virome of virtually any environmental niche. While numerous viral genome sequences have been assembled from metagenomic studies in recent years, the natural hosts of most of these viral contigs have not been determined. Several computational approaches have been developed to predict the hosts of bacteriophages. Nevertheless, little progress has been made in host prediction for viruses that infect eukaryotes and archaea. In this study, by analyzing all documented viruses with known eukaryotic and archaeal hosts, we assessed the predictive power of four computational approaches to viral host prediction. The following biological relationships among viruses and hosts were explored: (1) sequence similarity between virus and host genomes, where direct genetic interactions between viruses and hosts are assumed to leave traces of historical infections; (2) co-evolution between viruses and hosts, where the dependency of viruses on their hosts for replication is assumed to result in similar genomic features, including nucleotide composition and codon usage; (3) sequence similarity between viruses, where closely related viruses are assumed to infect the same hosts; and (4) genomic feature similarity between viruses, based on nucleotide composition and dinucleotide, codon, and bi-codon usage biases, where viruses with similar genomic features are assumed to share the same hosts. We showed that each of the four approaches produced better predictions than uninformed guesses, indicating that our current knowledge of virus-host interaction and co-evolution can be exploited to help predict natural hosts among eukaryotes and archaea for viral contigs. Overall, the third and fourth approaches (prediction based on virus-virus genomic sequence similarity and genomic feature similarity) had the highest prediction accuracy, and the second approach (prediction based on virus-host co-evolution) had the least predictive power. We also discuss the biological underpinnings of the differences in predictive power among these approaches. We anticipate a significant increase in predictive capacity as more training data and knowledge of virus-host relationships accumulate in the future.

Advisor: Juan Cui
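
To make the fourth approach in the abstract more concrete, the following is a minimal Python sketch of host prediction from genomic feature similarity between viruses. It is an illustration under stated assumptions, not the method used in the thesis: it uses only dinucleotide frequencies (no codon or bi-codon usage), a Euclidean distance, and a single nearest neighbor. The function names `dinucleotide_profile` and `predict_host`, the toy sequences, and the host labels are all hypothetical.

```python
from collections import Counter
from itertools import product
import math

# The 16 possible dinucleotides over the DNA alphabet, in a fixed order.
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def dinucleotide_profile(seq):
    """Return the relative frequency of each dinucleotide in seq.

    Overlapping dinucleotides are counted; any dinucleotide containing
    a character other than A, C, G, or T is ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts[d] for d in DINUCLEOTIDES)
    if total == 0:
        return [0.0] * len(DINUCLEOTIDES)
    return [counts[d] / total for d in DINUCLEOTIDES]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_host(query_seq, references):
    """Assign the host of the reference virus with the closest profile.

    references is a list of (sequence, host_label) pairs for viruses
    with known hosts (hypothetical toy data in the example below).
    Returns (host_label, distance), or (None, inf) if references is empty.
    """
    query = dinucleotide_profile(query_seq)
    best_host, best_dist = None, math.inf
    for ref_seq, host in references:
        dist = euclidean(query, dinucleotide_profile(ref_seq))
        if dist < best_dist:
            best_host, best_dist = host, dist
    return best_host, best_dist


# Toy usage with made-up sequences and host labels.
references = [
    ("ATATATATATATATAT", "host_A"),
    ("GCGCGCGCGCGCGCGC", "host_B"),
]
host, dist = predict_host("ATATATTATATA", references)
print(host)
```

A nearest-neighbor rule is used here only because it is the simplest way to express the assumption that viruses with similar genomic features tend to share hosts; any classifier trained on such feature vectors could take its place.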