About

Development of the infrastructure

For the 2024–2025 funding period, DARIAH-FI is responsible for developing infrastructures in three areas: 1) upgrading tools for processing unstructured text, 2) facilitating research in audio-visual culture, and 3) supporting the uptake of transformer technology in the social sciences and humanities (SSH).

The Work Packages (WPs) listed here describe how DARIAH-FI will improve its capabilities to ingest, enrich and support the analysis of diverse types of data, particularly images and multimodal data, and to engage the research communities that use this data. Each WP has a leader and one or more participants from the consortium partners or collaborators. The WP leader and participants contribute to the work in the WP. Collaborators are test users who provide feedback, evaluation and beta testing of the deliverables.

Work packages

Data management

The foreseen impact is a significant upgrade of the data management, versioning and workflow automation capabilities that underlie the whole infrastructure. While the infrastructure already has multiple distinct processes for data ingestion and versioning, these need to be further integrated and developed, both to improve efficiency and to cater to the new types of material to be managed in this project. Further, because a growing number of different enrichments are being applied to the materials, better capabilities are needed for workflow automation and version syncing. These will allow the enrichments to be run automatically every time the source data changes, removing manual work and the risk of using stale data versions.
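The idea of re-running enrichments whenever the source data changes can be illustrated with a minimal sketch. This is not DARIAH-FI's actual pipeline: the function names, the in-memory `state` dictionary and the toy enrichments are all hypothetical, and the sketch assumes that a change in the source data can be detected by comparing content hashes.

```python
import hashlib
from typing import Callable, Dict, List


def content_hash(data: bytes) -> str:
    """Return a SHA-256 fingerprint of the source data."""
    return hashlib.sha256(data).hexdigest()


def run_stale_enrichments(
    source: bytes,
    enrichments: Dict[str, Callable[[bytes], str]],
    state: Dict[str, dict],
) -> List[str]:
    """Re-run every enrichment whose stored result was computed from a
    different version of the source data (or was never computed).

    `state` is a hypothetical record of earlier runs; a real system
    would persist it alongside the versioned data.
    """
    current = content_hash(source)
    rerun = []
    for name, enrich in enrichments.items():
        record = state.get(name)
        if record is None or record["source_hash"] != current:
            # The stored output is missing or stale: recompute it and
            # remember which source version it belongs to.
            state[name] = {"source_hash": current, "output": enrich(source)}
            rerun.append(name)
    return rerun


# Toy enrichments standing in for real annotation steps (assumptions).
enrichments = {
    "upper": lambda b: b.decode("utf-8").upper(),
    "length": lambda b: str(len(b)),
}
state: Dict[str, dict] = {}
first_run = run_stale_enrichments(b"abc", enrichments, state)   # all enrichments run
second_run = run_stale_enrichments(b"abc", enrichments, state)  # nothing is stale
third_run = run_stale_enrichments(b"abcd", enrichments, state)  # source changed
```

Storing the source hash next to each enrichment output is one simple way to make stale results detectable; a workflow engine or a data versioning tool would typically take over this bookkeeping in practice.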

Leader: Martin Matthiesen, CSC – IT Center for Science

Partners: UHEL/ARTS; Collaborators: UHEL/NLF, NAF, JYU


Data ingestion

The foreseen impact of the WP is to improve the infrastructure by connecting it to accruing data sources. Collaboration with partners in other WPs will establish a network of stakeholders, including data owners, and foster a mutual understanding of shared goals. This will result in long-lasting benefits for both the research community and the cultural heritage sector. We have the Finna reuse service, new heritage and societal datasets from the Sampo systems, and multimodal societal data harvested from the web. We need to improve access to open digital cultural heritage materials and societal data for researchers in the digital humanities and social sciences. The overall goal of this work package is to strengthen and improve the research infrastructure (RI) by enhancing access to open data, improving technical features, and creating workflow automation for data updates and maintenance.


Data enrichment

The foreseen impact of the WP is to enable the systematic and detailed analysis of noisy datasets in different formats (textual, visual and multimodal) and thereby open up new possibilities for SSH research. This WP focuses on the enrichment of visual, multimodal and textual data originating from noisy social media, archival and cultural heritage collections. Furthermore, the WP develops standardised annotation and analysis methods and pipelines for these datasets, benefiting from recent advances in natural language processing and data science. We have more data in a digital format than ever. However, the usability of the data in its current state is very restricted. In particular, noisy data from social media, OCR-scanned archives and cultural heritage collections, as well as datasets with visual and sound content, cannot be properly examined using the currently existing resources.

We need higher quality metadata, standardised analysis pipelines, and more detailed information on the document contents – their thematic and linguistic characteristics, but also information associated with images and audio files.

We need 1) to develop machine learning methods for generating metadata, such as document type and journal number, from OCR-scanned archival materials to facilitate their analysis and information extraction; 2) to create tools for metadata harmonisation and standardised analysis pipelines to ensure the unified marking and systematic analysis of structured metadata for both cultural heritage and social media collections (Finna and Twitter); 3) to develop machine learning methods for providing detailed information on the thematic and linguistic content of noisy social media datasets originating from the Web and game streams, allowing users to explore them and focus on subsets with specific characteristics; and 4) to develop searchable multimodal corpora in which not only text but also specific audio and video content can be quickly accessed, image labelling systems that can reliably serve SSH researchers working on both social media and cultural heritage data (YouTube, Twitter, Finna), and automated methods for enriching stream video data.


Analytical support for computational social sciences and humanities

The foreseen impact of this WP is to enable researchers to use large born-digital datasets effectively and to focus on analysis rather than on the technical details of handling data that often arrives in high volume and at high velocity. The activities not only produce user-generated textual material for large-scale language models but also result in representative benchmarks, which contribute to the replicability and reproducibility of SSH research. They also integrate text analysis with audio-visual cultural heritage (images, multimodal properties of speech, multimodal items in social media). Researchers currently have potential access to extremely large born-digital data sources, including but not limited to game streams, multimodal social media applications, and open-ended survey answers. Despite their potential benefits, these sources are currently underused in SSH for technical, ethical and practical reasons. We need efficient analytical support tools for various fields:

  • In game studies, multimodal game stream analysis needs to make use of the interactions between video streams and stream chats and to generate data for use in developing AI-based solutions.
  • In computational sociolinguistics and dialectology, we need to develop tools for multimodal born-digital social media analysis and workflows for accessing and analysing dynamic social media data for computational research. These include algorithmic tools to access social media interaction in digital networks, tools for the analysis of multimodal properties of naturalistic speech (e.g. phonetics, prosody, and facial expressions) in large online collections of multilingual data, and ways to further develop our understanding of regional language variation in the context of social media.
  • In digital culture studies, we need to develop solutions for multimodal cultural heritage analysis.
  • In computational social science, we need to enrich survey data by combining structured register data with unstructured textual data.

Information interaction and evidence-based infrastructure development

Information Interaction (IIA) means collecting information on how researchers interact with the RI in order to design and develop tools and services accordingly; it also means offering researchers training and guidance on how to enhance their work using the infrastructure.

The foreseen impact is a close dialogue with the user community to ensure the best possible development of the RI. We have initial surveys of the actors in the RI system. We need, firstly, active community engagement with researchers who work with multiple types of research tools and data, including multimodal societal data and multimodal cultural heritage data, to widen the user community. Secondly, to collect information about how users interact with the tools and data available in the RI, we will conduct both implicit user monitoring (interaction log analysis) and explicit user monitoring (user needs questionnaires, observing and interviewing users). Collecting RI performance data and designing and developing digital tools, protocols and services will support the development of the RI, and a careful analysis of the collected data will reveal the critical points for that development. Thirdly, to widen the user base, lower the threshold for starting to use the RI, and ensure that researchers can use it successfully, we will collect and circulate educational resources that help users reach their research aims. We will also develop training practices for workshops and tutorials.