Date of this Version
code4lib, issue 47, article 15016, February 27, 20202.
Also available at: https://journal.code4lib.org/articles/15016.
This article will describe our process developing a script to automate downloading of documents and secondary materials from our library’s BePress repository. Our objective was to collect the full archive of dissertations and associated files from our repository into a local disk for potential future applications and to build out a preservation system.
Unlike at some institutions, our students submit directly into BePress, so we did not have a separate repository of the files; and the backup of BePress content that we had access to was not in an ideal format (for example, it included “withdrawn” items and did not effectively isolate electronic theses and dissertations). Perhaps more importantly, the fact that BePress was not SWORD-enabled and lacked a robust API or batch export option meant that we needed to develop a data-scraping approach that would allow us to both extract files and have metadata fields populated. Using a CSV of all of our records provided by BePress, we wrote a script to loop through those records and download their documents, placing them in directories according to a local schema. We dealt with over 3,000 records and about three times that many items, and now have an established process for retrieving our files from BePress. Details of our experience and code are included.
Archival Science Commons, Collection Development and Management Commons, Databases and Information Systems Commons, Intellectual Property Law Commons, Scholarly Communication Commons, Scholarly Publishing Commons