Food Science and Technology Department


First Advisor

Dr. Yanbin Yin

Date of this Version

Summer 7-2021


Li, T., 2021. The Differences of Prokaryotic Pan-genome Analysis on Complete Genomes and Simulated Metagenome-Assembled Genomes. University of Nebraska-Lincoln. M.S. Thesis.


A THESIS Presented to the Faculty of The Graduate College at the University of Nebraska In Partial Fulfillment of Requirements For the Degree of Master of Science, Major: Food Science & Technology, Under the Supervision of Professor Yanbin Yin. Lincoln, Nebraska: July 2021.

Copyright © 2021 Tang Li


Metagenomic assembly is often used in microbiome research. In metagenomic assembly, contigs are binned based on shared nucleotide composition. These contig bins are called metagenome-assembled genomes (MAGs), each representing a unique bacterial genome recovered from metagenome sequencing. Hundreds of thousands of high-quality MAGs from various ecological environments have been published since 2017, and a growing number of MAGs are being used in pan-genome analyses that study unculturable species or species without reference genomes. However, it is not known how the quality of such pan-genome analyses is affected by the problems often associated with MAGs, such as fragmentation, incompleteness, and contamination, compared to traditional pan-genome analysis that uses isolate genomes (each from a pure strain isolated from a mixed bacterial population). The purpose of this study is to investigate the differences between pan-genome analysis on complete isolate genomes and on MAGs. The specific aims are: (1) to evaluate the changes in the core genome of MAGs, and (2) to test the influence of MAGs on downstream functional analysis. MAGs with expected quality were simulated from complete genomes of 17 prokaryotic species, and pan-genome analysis was performed on the simulated MAGs to generate core genomes. Functional and phylogenetic analyses were performed using the results from the simulated MAGs and benchmarked against those from the original complete genomes. Compared to the analyses using the complete genomes, fragmentation and incompleteness in MAGs led to reduced core genomes, while contamination in MAGs resulted in large numbers of unique genes. The potential underestimation in functional prediction and incorrect phylogenetic reconstruction were associated with the loss of core genomes. We suggest that more relaxed parameters should be used in pan-genome analysis to improve the accuracy of analyses based on MAGs. Better quality control of MAGs and the development of new pan-genome analysis tools (e.g., with improved gene annotation and clustering algorithms) are needed in future studies.
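The effect described in the abstract can be illustrated with a small toy example. The Python sketch below is not the thesis's pipeline: the function name `core_genes`, the `min_fraction` parameter, the genome labels, and the gene-family labels are all hypothetical, and the data are invented for illustration only. It assumes each genome is reduced to a set of gene-family labels and defines the core genome as the families present in at least a given fraction of genomes.

```python
from collections import Counter


def core_genes(genomes, min_fraction=1.0):
    """Return gene families present in at least min_fraction of the genomes.

    genomes: dict mapping a genome name to a set of gene-family labels.
    min_fraction=1.0 is a strict core; lower values relax the definition.
    """
    counts = Counter(gene for genes in genomes.values() for gene in set(genes))
    needed = min_fraction * len(genomes)
    return {gene for gene, n in counts.items() if n >= needed}


# Hypothetical complete genomes: four shared families plus a few accessory ones.
complete = {
    "g1": {"A", "B", "C", "D", "x1"},
    "g2": {"A", "B", "C", "D", "x2"},
    "g3": {"A", "B", "C", "D"},
    "g4": {"A", "B", "C", "D", "x3"},
}

# Hypothetical simulated MAGs: three genomes each lose one shared family
# (incompleteness), and one gains a foreign family (contamination).
mags = {
    "g1": complete["g1"] - {"A"},
    "g2": complete["g2"] - {"B"},
    "g3": complete["g3"] | {"foreign1"},
    "g4": complete["g4"] - {"C"},
}

strict_complete = core_genes(complete)               # strict core, complete genomes
strict_mags = core_genes(mags)                       # strict core, simulated MAGs
relaxed_mags = core_genes(mags, min_fraction=0.75)   # relaxed core, simulated MAGs

print(sorted(strict_complete), sorted(strict_mags), sorted(relaxed_mags))
```

In this toy data the strict core shrinks from four families to one once each MAG is missing a different gene, while requiring presence in only three of the four genomes recovers all four. The 0.75 threshold is an arbitrary choice for the example, not a recommendation from the thesis.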

Advisor: Yanbin Yin