Increasingly Automated Ingestion of Material
Leader: Johanna Lilja, University of Helsinki/National Library of Finland; Partners: CSC, UHEL/ARTS, UTU; Collaborators: NARC, JYU
We will pilot a data ingestion pipeline through which data providers, such as cultural heritage institutions and companies, can easily deposit their data for research use. The expected impact is that large collections digitized by cultural heritage institutions or donated by companies become available for research more rapidly. Initially, data will be accepted as data dumps, but we will work to increase automation in the transfer, versioning and incremental updating of the data. We currently have the digitized collections of the National Library of Finland, containing 20 million pages of newspapers, journals and books, available as images and data. In the piloting phase, approximately 50 terabytes of out-of-copyright digitized publications will be delivered as data packages. After the two-year piloting phase, the solution will be scaled to other data providers.
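To illustrate what automating incremental updates could involve, the following is a minimal sketch, not the project's design. It assumes that each delivery is a directory of files and that successive deliveries are compared through a checksum manifest; the function names (`file_sha256`, `build_manifest`, `diff_manifests`) and the choice of SHA-256 are hypothetical and serve illustration only.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): file_sha256(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_manifests(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Compare two manifests and list added, changed and removed files."""
    shared = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "changed": sorted(k for k in shared if old[k] != new[k]),
        "removed": sorted(old.keys() - new.keys()),
    }
```

Under these assumptions, a provider would store the manifest of each delivery, and only the files listed under `added` and `changed` would need to be transferred in the next update. Keeping the stored manifests would also give a simple record of versions.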