Libraries at University of Nebraska-Lincoln
Document Type
Article
Abstract
Errors introduced during optical character recognition (OCR) can reduce the accuracy of text mining and NLP analyses of large text documents. This study examines the factors that affect the quality of OCR output, including the quality of the original document. Electronic theses and dissertations (ETDs) in the LIS discipline were retrieved from the ProQuest Dissertations and Theses Global database, and the full-text ETDs were manually reviewed for OCR problems associated with the conversion of PDF files into plain text format. The review identified five major types of scanning problems in PDFs, which caused OCR errors such as joined words, misspellings, spacing between words, inserted random characters, hyphenation, and formatting problems. To avoid these errors, it is important to use high-quality scans of documents published from the 1920s to the 1970s. Further research could focus on improving the accuracy of OCR technology for large text documents published before the 1980s.