Libraries, University of Nebraska-Lincoln

Library Philosophy and Practice (e-journal)

Exploring OCR Errors in Full-Text Large Documents: A Study of LIS Theses and Dissertations

Manika Lamba, Department of Library and Information Science, University of Delhi
Margam Madhusudhan, Department of Library and Information Science, University of Delhi

Document Type

Article

Abstract

Errors introduced during optical character recognition (OCR) can reduce the accuracy of text mining and NLP analyses of large text documents. The methodology involves retrieving electronic theses and dissertations (ETDs) for the LIS discipline from the ProQuest Dissertations and Theses Global database and manually reviewing the full-text ETDs for OCR problems associated with converting PDF files into plain text. The study examines the factors that affect the quality of OCR output, including the quality of the original document. The findings identify five major types of scanning problems in PDFs that caused OCR errors such as joined words, misspellings, spaces inserted within words, random characters, hyphenation, and formatting problems. To avoid these errors, it is important to use high-quality scans, particularly for documents published from the 1920s to the 1970s. Further research could focus on improving the accuracy of OCR technology for large text documents published before the 1980s.
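
As an illustration of the error types named in the abstract, the following is a minimal Python sketch of how some of them might be cleaned or flagged in plain text extracted from a PDF. It is not the authors' method, which was a manual review. The function names `clean_ocr_text` and `flag_suspect_tokens`, the set of stray characters, and the small vocabulary are all hypothetical and chosen only for this example.

```python
import re

# A word split by a hyphen at a line end, e.g. "infor-\nmation".
# Caveat: this also joins genuine hyphenated compounds broken at a line end.
HYPHEN_BREAK = re.compile(r"([A-Za-z])-\n([a-z])")

# A small, assumed set of symbols treated as random OCR insertions.
STRAY_CHARS = re.compile(r"[~^|*#<>{}]")


def clean_ocr_text(text):
    """Apply simple repairs for hyphenation, stray characters, and spacing."""
    text = HYPHEN_BREAK.sub(r"\1\2", text)   # rejoin hyphenated line breaks
    text = STRAY_CHARS.sub("", text)         # drop stray symbols
    text = re.sub(r"[ \t]{2,}", " ", text)   # collapse runs of spaces or tabs
    return text


def flag_suspect_tokens(text, vocabulary):
    """Return tokens missing from the vocabulary (possible misspellings or joined words)."""
    tokens = re.findall(r"\w+", text)
    return [t for t in tokens if t.lower() not in vocabulary]


if __name__ == "__main__":
    sample = "infor-\nmation  retrieval ~ in librar1es andarchives"
    vocab = {"information", "retrieval", "in", "libraries", "and", "archives"}
    cleaned = clean_ocr_text(sample)
    print(cleaned)
    print(flag_suspect_tokens(cleaned, vocab))
```

The sketch repairs only what a simple pattern can repair safely; misspellings and joined words are flagged rather than corrected, since correcting them reliably would need a dictionary or language model suited to the collection.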
