Data enrichment

Leader: Veronika Laippala, University of Turku; Participants: UEF, JYU, UOULU, UHEL/ARTS, UHEL/SOC, Aalto; Collaborator: UHEL/NLF

The foreseen impact of the WP is to enable the systematic and detailed analysis of noisy datasets in different formats (textual, visual and multimodal) and thereby open up new possibilities for SSH research. The WP focuses on the enrichment of visual, multimodal and textual data originating from noisy social media, archival and cultural heritage collections. It also develops standardised annotation and analysis methods and pipelines for these datasets, drawing on recent advances in natural language processing and data science.

We have more data in digital format than ever before, but in its current state its usability is very restricted. In particular, noisy data from social media, OCR-scanned archives and cultural heritage collections, as well as datasets with visual and audio content, cannot be properly examined with existing resources.

We need higher-quality metadata, standardised analysis pipelines, and more detailed information on document contents: their thematic and linguistic characteristics as well as information associated with images and audio files.

We need to:

1) develop machine learning methods for generating metadata, such as document type and journal number, from OCR-scanned archival materials to facilitate their analysis and information extraction;

2) create tools for metadata harmonisation and standardised analysis pipelines to ensure the unified marking and systematic analysis of structured metadata for both cultural heritage and social media collections (Finna and Twitter);

3) develop machine learning methods for providing detailed information on the thematic and linguistic content of noisy social media datasets originating from the Web and game streams, allowing users to explore them and focus on subsets with specific characteristics;

4) develop searchable multimodal corpora in which not only text but also specific audio and video content can be quickly accessed, image labelling systems that can reliably serve SSH researchers working on both social media and cultural heritage data (YouTube, Twitter, Finna), and automated methods for enriching stream video data.

DARIAH-FI: GENERAL

DARIAH-FI OFFICE:

University of Jyväskylä

Tanja Välisalo

DARIAH-FI OFFICE:

University of Helsinki

Risto Turunen

DARIAH-FI OFFICE:

CSC – IT Centre for Science

Katri Tegel

DARIAH-FI OFFICE:

National Library of Finland

Johanna Lilja

DARIAH-FI OFFICE:

Tampere University

Sanna Kumpulainen

DARIAH-FI OFFICE:

Aalto University

Eero Hyvönen

DARIAH-FI OFFICE:

University of Oulu

Marika Rauhala

DARIAH-FI OFFICE:

University of Eastern Finland

Paula Rautionaho

DARIAH-FI OFFICE:

University of Jyväskylä

Venla Poso

DARIAH-FI OFFICE:

University of Turku

Veronika Laippala