FIN-CLARIAH day - Digital cultural heritage... Data or Metadata?

The fall plenary brought the whole infrastructure together to reflect on metadata: what it is in diverse fields and how to make it fit for research.

On November 22nd, the FIN-CLARIAH consortium assembled 41 representatives from six universities, the National Library of Finland, the National Archives of Finland, the Language Bank of Finland and CSC – IT Centre for Science. TurkuNLP hosted the group to discuss digital cultural heritage, with the underlying theme of extracting and processing metadata for research.

But what is metadata? In plain terms, metadata is structured information that describes an object. Metadata can help researchers find materials: tags that describe content or identify people or an event in a historical photo can save hours of browsing, and a count of comments and reactions on social media can be the key to gauging popularity. In the social sciences and humanities, metadata can become an important asset.

To get us started on the topic, Virpi Lummaa from the Human Diversity project invited participants to consider the broadest possible perspective: What do the traces we leave behind say about us? This multi-researcher project examines the traces of Finns in historical perspective, from marriages, resettlements and dialects to affiliations and diseases. Lummaa herself has combined marriage data kept in church records since the 1700s with the personal identity numbers introduced after 1970 to trace kin networks. These networks show the demographic and geographical distribution of typical families in Finland, how it has changed, and what these differences imply for people’s quality of life and longevity. Another line of research combines the migration trajectories and life stories of Karelian evacuees¹. In this project, a team from TurkuNLP used ChatGPT4 to extract specific data points from unstructured oral history interviews, in order to compare the social networks and activities that evacuees engaged in across their new settlements in Finland.

Slide from Virpi Lummaa’s presentation “From church records to bones, pots and barn-types: combining demographic, genetic, archaeological and cultural data to study human diversity”

Lummaa demonstrated the possibilities of extracting numerical values and combining highly diverse but exceptionally meticulous records created over the last 200 years. She concluded that, whatever shortcut we would like to find, good metadata is the result of exhaustive, qualitative and laborious research work. The goal of the afternoon was therefore to discuss what metadata is in our fields, how it is used, and what kind of metadata is needed to make the most of the diverse types of data that SSH researchers use.

To give two examples, Kimmo Elo explained why the metadata that exists about parliamentary speeches, such as MPs’ party, associations or electoral district, constitutes a contextually rich environment for research. This metadata will soon be usable in LAWPOL, a digital workbench for the study of political discourse. Later, Maria Kallio-Hirvonen from the National Archives of Finland remarked on the ontological differences between libraries and archives in producing metadata, leaving us with three ground truths that may as well describe humanities research: 1) a collection of items is always complex, diverse and unique; 2) a description is always an interpretation; and 3) the content and level of detail in descriptions vary significantly depending on the material, the archive and the time when the description is made.

Image from Maria Kallio-Hirvonen’s presentation “The realities of cultural heritage metadata”

I find this image, which depicts the library vs. archive mentality, a good metaphor for the differences between diverse research traditions. Both are complex and make for valid approaches to making knowledge. The challenge is to give researchers the means to find research materials and the links between them across the maze of ontologies, types of data, standards and formats.

To find solutions, participants split into groups according to type of data. The researchers and infrastructure providers discussing textual data divided further. One subgroup dealt with digitized material, a domain where descriptive and structural metadata are understood as cues for discovering relevant materials. Here, formats are inherited from old catalogue systems that have been revised over time, losing free-form text descriptions for the sake of fitting standards. Another challenge of archival data is that individual items cannot be distinguished from folders and subcollections, so researchers need to visit archives and rely on professionals to find the relevant content.

For researchers concerned with born-digital textual data, metadata is inherent to the data, so the two are difficult to tell apart. However, annotations such as language identification, sentiment or other additions could be described further. Generated metadata should follow standards and, if it has been generated automatically, labels such as a reliability score could be added.
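As a rough illustration of what such labelling could look like in practice, here is a minimal Python sketch. The record layout, the field names (generated_by, reliability) and the model names are assumptions made up for this example, not a standard or anything the groups agreed on. It stores each annotation together with how it was produced and, for automatic annotations, a reliability score, so that low-confidence labels can be set aside for manual review.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Annotation:
    """One piece of generated metadata attached to a text (hypothetical schema)."""
    field: str                           # what is described, e.g. "language"
    value: str                           # the label itself, e.g. "fi"
    generated_by: str                    # "manual" or the name of a tool/model
    reliability: Optional[float] = None  # score in [0, 1]; None for manual labels

    def __post_init__(self):
        # Reject scores outside the agreed range so records stay comparable.
        if self.reliability is not None and not 0.0 <= self.reliability <= 1.0:
            raise ValueError("reliability must be between 0 and 1")

    @property
    def is_automatic(self) -> bool:
        return self.generated_by != "manual"


def needs_review(annotations, threshold=0.8):
    """Return automatic annotations with a missing or below-threshold score."""
    return [
        a for a in annotations
        if a.is_automatic and (a.reliability is None or a.reliability < threshold)
    ]


# Illustrative records; the model names are placeholders, not real tools.
annotations = [
    Annotation("language", "fi", generated_by="example-langid-model", reliability=0.97),
    Annotation("sentiment", "negative", generated_by="example-sentiment-model", reliability=0.55),
    Annotation("topic", "migration", generated_by="manual"),
]

print([a.field for a in needs_review(annotations)])  # prints ['sentiment']
```

Keeping the origin of each label next to the label itself means a later user can decide how far to trust it without having to go back to whoever produced the dataset.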

For visual and audiovisual cultural materials, descriptive metadata exists in heterogeneous quality and formats, and it is often up to the data provider not just to include metadata but also to reference the ontologies that further specify the domain or may contain translations. Beyond descriptions, other types of metadata make for important research assets, such as the exhibition history of museum objects or, for photographs, a citation of the newspaper, magazine or publication where they have been used. Audiovisual material complicates matters further: the layers of audio and video happening simultaneously require multiple levels of description and lack proper format metadata standards. A meta-level of description, such as the quality of the recording, is also of great interest.

Moving on from challenges to opportunities, all groups discussed metadata generation. For visual cultural heritage, for example, where metadata or citations fail, running an image similarity model over digitized corpora could help identify the circulation of photographs in newspapers or magazines. Concerning researcher-generated or crowdsourced metadata, all groups agreed that more effort could be made to ensure that descriptive metadata is properly preserved and connected to the original source and, if possible, returned to the data provider and made available to others.
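To make the image similarity idea concrete, here is a small Python sketch. It is not the kind of trained model the groups had in mind: it uses an average hash, a much simpler stand-in technique, and it assumes that images have already been reduced to small grayscale grids of equal size. The function names, the toy pixel values and the distance threshold are all invented for illustration.

```python
def average_hash(pixels):
    """Return a tuple of bits: 1 where a pixel is brighter than the image mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)


def hamming_distance(hash_a, hash_b):
    """Count the positions at which two equal-length hashes differ."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(hash_a, hash_b))


def find_reuses(query, corpus, max_distance=2):
    """Return the names of corpus images whose hash is close to the query's."""
    query_hash = average_hash(query)
    return sorted(
        name for name, pixels in corpus.items()
        if hamming_distance(query_hash, average_hash(pixels)) <= max_distance
    )


# Toy 4x4 grayscale grids (0 = black, 255 = white), purely illustrative.
archive_photo = [[10, 10, 200, 200]] * 4
corpus = {
    # Same picture, printed slightly brighter in a newspaper scan.
    "newspaper_scan": [[30, 25, 220, 210]] * 4,
    # A different picture: bright top half, dark bottom half.
    "unrelated_photo": [[200] * 4, [200] * 4, [10] * 4, [10] * 4],
}

print(find_reuses(archive_photo, corpus))  # prints ['newspaper_scan']
```

Because the hash compares each pixel with the image's own mean brightness, an evenly lighter or darker reprint still matches. Cropping, rotation or heavy retouching would defeat this simple approach, which is where a trained similarity model would be needed.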

¹ Loehr, J., Lynch, R., Mappes, J., Salmi, T., Pettay, J., & Lummaa, V. (2017). Newly Digitized Database Reveals the Lives and Families of Forced Migrants from Finnish Karelia. Finnish Yearbook of Population Research, 52, 59–70. https://doi.org/10.23979/fypr.65212

Header image: FIN-CLARIAH consortium (Photo: Inés Matres)