Computer Science and Engineering, Department of


First Advisor

Ashok Samal

Second Advisor

Leen-Kiat Soh

Date of this Version

Summer 6-2019

Document Type



Sunkara, V.K.M. (2019). A Data Driven Approach to Identify Journalistic 5Ws From Text Documents (Master's Thesis). University of Nebraska - Lincoln.


A THESIS presented to the faculty of the Graduate College at the University of Nebraska in partial fulfillment of requirements for the degree of Master of Science (Computer Science) under the supervision of Professors Ashok Samal and Leen-Kiat Soh. Lincoln, Nebraska: June, 2019

Copyright(c) 2019 - Venkata Krishna Mohan Sunkara


Textual understanding is the process of automatically extracting accurate, high-quality information from text. The amount of textual data available from sources such as news, blogs and social media is growing exponentially. These data encode significant latent information which, if extracted accurately, can be valuable in a variety of applications such as medical report analysis, news understanding and societal studies. Natural language processing (NLP) techniques are often employed to develop customized algorithms that extract such latent information from text.

Journalistic 5Ws refer to the basic information in a news article that describes an event: where, when, who, what and why. Extracting them accurately may facilitate better understanding of many social processes, including social unrest, human rights violations, propaganda spread, and population migration. Furthermore, the 5Ws information can be combined with socio-economic and demographic data to analyze the state and trajectory of these processes.

In this thesis, a data-driven pipeline is developed to extract the 5Ws from text using syntactic and semantic cues in the text. First, a classifier is developed to identify articles specifically related to social unrest; it is trained on a dataset of over 80K news articles. NLP algorithms are then used to generate a set of candidates for the 5Ws. Next, a series of heuristic algorithms is developed to extract the 5Ws. These algorithms leverage specific words and parts of speech, customized for each individual W, to compute a score for each candidate. The heuristics are based on the syntactic structure of the document as well as syntactic and semantic representations of individual words and sentences. The scores are then combined and ranked to obtain the best answers to the Journalistic 5Ws. The classification accuracy of the algorithms is validated using a manually annotated dataset of news articles.
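
The score-combination and ranking step described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the `Candidate` fields, the three heuristics (sentence position, mention frequency, presence of a cue word) and the weights are all hypothetical, chosen only to show how per-candidate heuristic scores for a single W might be combined and ranked.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A hypothetical candidate answer for one W (e.g. "where")."""
    text: str
    sentence_index: int  # index of the sentence containing the candidate
    frequency: int       # number of mentions in the document
    has_cue_word: bool   # e.g. preceded by "in" or "at" for "where"


def position_score(c, n_sentences):
    # Assumption: candidates in earlier sentences are more likely answers.
    return 1.0 - c.sentence_index / max(n_sentences, 1)


def frequency_score(c, max_frequency):
    # Mention count normalized by the most frequent candidate.
    return c.frequency / max(max_frequency, 1)


def cue_score(c):
    return 1.0 if c.has_cue_word else 0.0


def rank_candidates(candidates, n_sentences, weights=(0.4, 0.3, 0.3)):
    """Combine the heuristic scores with illustrative weights and
    return (score, text) pairs sorted from best to worst."""
    if not candidates:
        return []
    max_freq = max(c.frequency for c in candidates)
    w_pos, w_freq, w_cue = weights
    scored = [
        (
            w_pos * position_score(c, n_sentences)
            + w_freq * frequency_score(c, max_freq)
            + w_cue * cue_score(c),
            c.text,
        )
        for c in candidates
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


candidates = [
    Candidate("Omaha", sentence_index=4, frequency=1, has_cue_word=False),
    Candidate("Lincoln", sentence_index=0, frequency=3, has_cue_word=True),
]
ranked = rank_candidates(candidates, n_sentences=10)
best_answer = ranked[0][1]
```

In this toy input, the candidate that appears in the first sentence, is mentioned most often and follows a cue word ranks first. A real system would derive such features from parsed text and would need heuristics tailored to each W.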