Computer Science and Engineering, Department of


First Advisor

Dr. Mohammad Rashedul Hasan

Date of this Version

Spring 5-26-2021



author = {Yuanzhi Chen},

title = {Towards A Machine Learning Based Generalizable Framework For Detecting COVID-19 Misinformation On Social Media},

school = {University of Nebraska-Lincoln},

address = {Lincoln, Nebraska},

year = {2021}



A THESIS Presented to the Faculty of The Graduate College at the University of Nebraska In Partial Fulfillment of Requirements For the Degree of Master of Science, Major: Computer Science, Under the Supervision of Professor Mohammad Rashedul Hasan. Lincoln, Nebraska: May, 2021

Copyright © Yuanzhi Chen


Since the beginning of the COVID-19 pandemic, caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), online social media has become a conduit for the rapid propagation of misinformation. Misinformation is a type of fake news that is created inadvertently, without the intention of causing harm. Yet COVID-19 misinformation has caused serious social disruptions, including accidental deaths and the destruction of public property. Timely prevention of the propagation of online misinformation requires the development of automated detection tools. Machine learning (ML) based models have been used to automate the identification of fake news. These techniques involve two steps: converting text data into numeric features (or text embeddings), and supervised classification on those features. An effective classifier requires expressive embeddings that capture the semantic, syntactic, and contextual (both local and global) relationships in the text. There have been significant advances in Natural Language Processing (NLP) methods for creating text embeddings. Using state-of-the-art (SOTA) NLP techniques, it is possible to create expressive language models from a large, general-purpose corpus (the source dataset). The language model can then be used on varied target datasets for effective classification. This transfer learning approach, based on the SOTA NLP methods, has made it possible to build generalizable solutions for NLP tasks, including text classification. However, its efficacy on COVID-19 social media data has not been thoroughly investigated. COVID-19 data is particularly challenging because it is dynamic (context evolves rapidly), nuanced (the content of the text is ambiguous), and diverse (categories are varied and overlapping). This thesis hypothesizes that none of the SOTA NLP methods provides a generalizable solution for the detection of misinformation. We conduct a multi-dimensional study to understand the scope and limitations of the SOTA NLP approaches. We propose an ML-based framework as the first step towards designing a generalizable solution for detecting misinformation in social media data that are similar to COVID-19 data.

Adviser: Mohammad Hasan
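
The two-step pipeline the abstract describes, converting text into numeric features and then training a supervised classifier on them, can be illustrated with a minimal sketch. This is not the framework proposed in the thesis: a plain bag-of-words count vector stands in for the learned text embeddings, and a nearest-centroid rule stands in for the classifier. All function names are hypothetical, and the four training sentences are invented purely for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase, split on whitespace, keep purely alphabetic tokens.
    return [tok for tok in text.lower().split() if tok.isalpha()]


def build_vocab(texts):
    # Assign each distinct token an index in order of first appearance.
    vocab = {}
    for text in texts:
        for tok in tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab


def embed(text, vocab):
    # Bag-of-words count vector; tokens outside the vocabulary are ignored.
    vec = [0.0] * len(vocab)
    for tok, count in Counter(tokenize(text)).items():
        if tok in vocab:
            vec[vocab[tok]] = float(count)
    return vec


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fit_centroids(texts, labels, vocab):
    # Supervised step: average the feature vectors of each labeled class.
    sums, counts = {}, Counter(labels)
    for text, label in zip(texts, labels):
        vec = embed(text, vocab)
        if label in sums:
            sums[label] = [a + b for a, b in zip(sums[label], vec)]
        else:
            sums[label] = vec
    return {lab: [x / counts[lab] for x in s] for lab, s in sums.items()}


def predict(text, vocab, centroids):
    # Assign the class whose centroid is most similar to the input vector.
    vec = embed(text, vocab)
    return max(centroids, key=lambda lab: cosine(vec, centroids[lab]))


# Invented toy data, for illustration only.
train_texts = [
    "garlic cures the virus",
    "drinking bleach cures the virus",
    "vaccines reduce severe illness",
    "masks reduce virus transmission",
]
train_labels = ["misinformation", "misinformation", "reliable", "reliable"]

vocab = build_vocab(train_texts)
centroids = fit_centroids(train_texts, train_labels, vocab)
print(predict("garlic cures everything", vocab, centroids))
```

Count vectors of this kind capture none of the syntactic or contextual relationships the abstract calls for, which is the gap that embeddings from pre-trained language models are meant to fill.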