Statistics, Department of


First Advisor

Kent Eskridge

Date of this Version

Spring 5-2023


Hauschild, J.L. (2023). Examining the Effect of Word Embeddings and Preprocessing Methods on Fake News Detection (Doctoral dissertation, University of Nebraska-Lincoln).


A DISSERTATION Presented to the Faculty of The Graduate College at the University of Nebraska In Partial Fulfillment of Requirements For the Degree of Doctor of Philosophy, Major: Statistics, Under the Supervision of Professor Kent Eskridge. Lincoln, Nebraska: May, 2023

Copyright © 2023 Jessica L. Hauschild


The words people choose to use hold a lot of power, whether that be in spreading truth or deception. As listeners and readers, we do our best to understand how words are being used. Many current methods in the computer science literature attempt to embed words as numerical information for statistical analysis. Some of these embedding methods, such as Bag of Words, treat words as independent, while others, such as Word2Vec, attempt to capture information about the context of words. It is of interest to compare how well these various methods of translating text into numerical data perform specifically in detecting fake news. The term “fake news” can be quite divisive, but we follow Potthast & Kiesel (2018) in defining it as news that is hyper-partisan, filled with untruths, and written to cause anger and outrage. We hypothesize that a person’s word choice relates to the factualness of an article. In Chapter 5, we utilize this embedded information in several binary classification methods. We find that words are only marginally valuable in detecting fake news regardless of the embedding or classification method used. However, natural language processing tasks involve many preprocessing steps to get the text ready for analysis, and the embedding methods are confounded with the preprocessing methods used. These steps are explored in Chapter 6. Preprocessing of text includes, but is not limited to, filtering out words that do not appear a minimum number of times, filtering out stop words, removing numbers, and converting all letters to lower case. We find that filtering out stop words and removing words that do not appear a minimum number of times have the most significant effect in combination with embedding and classification methods. Finally, in Chapter 7, we extend the classification to six categories ranging from true to pants-on-fire false and find that these preprocessing methods are not as influential as they were with the binary outcome.
Other predictors outside of the words and word embeddings themselves are necessary for improvement in the detection of fake news.

Advisor: Kent Eskridge
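
As a rough illustration of the preprocessing steps and the Bag of Words embedding named in the abstract, the following is a minimal Python sketch and not the dissertation's actual pipeline. The function names, the tiny stop-word list, the tokenization rule, and the default threshold `min_count=2` are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}


def preprocess(docs, min_count=2, remove_stop_words=True,
               remove_numbers=True, lowercase=True):
    """Tokenize documents and apply the four preprocessing steps as toggles."""
    tokenized = []
    for doc in docs:
        if lowercase:
            doc = doc.lower()
        # Simple tokenizer: runs of letters, digits, and apostrophes.
        tokens = re.findall(r"[A-Za-z0-9']+", doc)
        if remove_numbers:
            tokens = [t for t in tokens if not t.isdigit()]
        if remove_stop_words:
            tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
        tokenized.append(tokens)

    # Count each remaining word across the whole corpus, then drop rare words.
    counts = Counter(t for tokens in tokenized for t in tokens)
    return [[t for t in tokens if counts[t] >= min_count]
            for tokens in tokenized]


def bag_of_words(tokenized_docs):
    """Build count vectors that ignore word order and context."""
    vocab = sorted({t for tokens in tokenized_docs for t in tokens})
    index = {word: i for i, word in enumerate(vocab)}
    vectors = []
    for tokens in tokenized_docs:
        vec = [0] * len(vocab)
        for t in tokens:
            vec[index[t]] += 1
        vectors.append(vec)
    return vocab, vectors


docs = [
    "The senator lied 3 times",
    "The senator told the truth",
    "Lied and lied again in 2020",
]
vocab, vectors = bag_of_words(preprocess(docs))
```

Each preprocessing step is a separate toggle here, so different combinations change the vocabulary that any downstream embedding or classifier sees. That dependence is one way to picture the confounding between preprocessing and embedding that the abstract describes.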