Leader: Mikko Laitinen, University of Eastern Finland; Participants: JYU, UOULU, UHEL/SOC; Collaborator: UHEL/NLF
The foreseen impact of this WP is that it enables researchers to utilise large born-digital data effectively and to focus on analysis rather than on the technical details of handling data that often arrive in high volume and at high velocity. The activities not only produce user-generated textual material for large-scale language models but also result in representative benchmarks, thereby contributing to increased replicability and reproducibility of SSH research. They also integrate text analysis with audio-visual cultural heritage (images, multimodal properties of speech, multimodal items in social media). Researchers currently have potential access to extremely large born-digital data sources, including but not limited to game streams, multimodal social media applications, and open-ended answers in surveys. Despite their potential benefits, these sources are currently underused in SSH for technical, ethical and practical reasons. Efficient analytical support tools are needed in several fields:
- In game studies, multimodal game stream analysis needs to utilise the interactions between video streams and stream chats and to generate data for developing AI-based solutions.
- In computational sociolinguistics and dialectology, we need to develop tools for multimodal born-digital social media analysis, together with workflows for accessing and analysing dynamic social media data for computational research. These include algorithmic tools to access social media interaction in digital networks, tools for analysing the multimodal properties of naturalistic speech (e.g. phonetics, prosody, and facial expressions) in large online collections of multilingual data, and ways to further develop our understanding of regional language variation in the context of social media.
- In digital culture studies, we need to develop solutions for multimodal cultural heritage analysis.
- In computational social science, we need to enrich survey data by combining structured register data with unstructured textual data.