Subsetting Data, Detecting Bias and Noise
Leader: Eetu Mäkelä, University of Helsinki/ARTS; Partners: UHEL/SOC, UTU, UEF, CSC; Collaborator: UHEL/NLF
The foreseen impact is to give SSH scholars subsets of large datasets that are easier to manage and process. The large datasets created in other work packages are of interest to a wide community of researchers, but because they were not originally created for research, they contain a range of biases, confounders and noise. Noise in online data streams also evolves fast, so the detection accuracy of static systems deteriorates quickly. Researchers therefore need tools with which they can robustly query and examine the large datasets and extract the subsets that cover their particular interests. We will deliver (1) a process/service for indexing large datasets and robustly querying them for subsets of interest, (2) an environment for obtaining statistical overviews of (sub)datasets to uncover and evaluate their biases and suitability, and (3) intelligent noise reduction applications for real-time social media data capture that identify bots and trolls.