The expected impact of the WP is to enable the systematic and detailed analysis of noisy datasets in different formats (textual, visual and multimodal) and thereby open up new possibilities for SSH research. The WP focuses on the enrichment of visual, multimodal and textual data originating from noisy social media, archival and cultural heritage collections. Furthermore, the WP develops standardised annotation and analysis methods and pipelines for these datasets, benefiting from recent advances in natural language processing and data science.
More data are available in digital format than ever before. However, the usability of these data in their current state is very limited. In particular, noisy data from social media, OCR-scanned archives and cultural heritage collections, as well as datasets with visual and audio content, cannot be properly examined with existing resources.
We need higher-quality metadata, standardised analysis pipelines, and more detailed information on the document contents: not only their thematic and linguistic characteristics, but also information associated with images and audio files.
Specifically, we need to:
1) develop machine learning methods for generating metadata, such as document type and journal number, from OCR-scanned archival materials to facilitate their analysis and information extraction;
2) create tools for metadata harmonisation and standardised analysis pipelines to ensure the uniform marking and systematic analysis of structured metadata for both cultural heritage and social media collections (Finna and Twitter);
3) develop machine learning methods that provide detailed information on the thematic and linguistic content of noisy social media datasets originating from the Web and game streams, allowing users to explore them and focus on subsets with specific characteristics; and
4) develop searchable multimodal corpora in which not only text but also specific audio and video content can be quickly accessed, image labelling systems that can reliably serve SSH researchers working on both social media and cultural heritage data (YouTube, Twitter, Finna), and automated methods for enriching stream video data.