Computing, School of

School of Computing: Dissertations, Theses, and Student Research

Information Extraction and Classification on Journal Papers

Lei Yu, University of Nebraska-Lincoln

First Advisor

Stephen D. Scott

Second Advisor

Prof. Ashok Samal

Third Advisor

Prof. Vinodchandran Variyam

Committee Members

Ashok Samal, Vinodchandran N. Variyam

Date of this Version

11-2021

Document Type

Thesis

Citation

Yu, Lei, "Information Extraction and Classification on Journal Papers" (2021).

Comments

A thesis presented to the faculty of the Graduate College at the University of Nebraska In Partial Fulfilment of Requirements for the degree of Master of Science

Major: Computer Science

Under the supervision of Professor Stephen D. Scott. Lincoln, Nebraska, November 2021

Abstract

The importance of journals for disseminating the results of scientific research has increased considerably. In the digital era, Portable Document Format (PDF) has become the established format for electronic journal articles. This structured format, combined with regular and wide dissemination, allows scientific advances to spread easily and quickly. However, the rapidly increasing number of published scientific articles means that systematic literature reviews, searches, and screening require more time and effort. Comprehending and extracting useful information from these digital documents is also challenging because of the complex structure of PDF files.

To help a soil science team from the United States Department of Agriculture (USDA) build a queryable journal paper system, we used a web crawler to download soil science articles from the digital library. We applied named entity recognition and table analysis to extract useful information, including authors, journal name and type, publication date, abstract, DOI, and experiment location, and to present each paper's characteristics in a computer-queryable format in the system. Text classification is applied to identify the parts of interest to users and reduce their search time. We used traditional machine learning techniques, including logistic regression, support vector machines, decision trees, naive Bayes, k-nearest neighbors, random forests, ensemble modeling, and neural networks, for text classification, and we compared the advantages of these approaches.
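
As an illustration of the text classification step, here is a minimal sketch of one of the listed techniques, naive Bayes, written with only the Python standard library. It is not the implementation used in the thesis: the class name `NaiveBayesTextClassifier`, the labels `of_interest` and `other`, and the training sentences are all invented for this example, and a real system would train on labeled passages from the downloaded papers.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing.

    Hypothetical helper for illustration only.
    """

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        self.vocabulary = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Start from the log prior of the class.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                # Add-one smoothing keeps unseen words from zeroing the score.
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy sentences (assumption): two passages a soil scientist might
# want to read and two that are administrative.
train_texts = [
    "Soil samples were collected from field plots in Nebraska",
    "The experiment measured soil organic carbon at three depths",
    "We thank the reviewers for their helpful comments",
    "Funding was provided by a research grant",
]
train_labels = ["of_interest", "of_interest", "other", "other"]

classifier = NaiveBayesTextClassifier().fit(train_texts, train_labels)
prediction = classifier.predict("Soil carbon was measured in the field experiment")
print(prediction)
```

The other listed models (logistic regression, support vector machines, and so on) would slot into the same fit-and-predict pattern, which is what makes a side-by-side comparison of the approaches straightforward.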

Advisor: Stephen D. Scott

Included in

Artificial Intelligence and Robotics Commons, Databases and Information Systems Commons, Theory and Algorithms Commons
