Tools to make sense of web data

This resource consists of two tools: one classifies toxic language in Finnish (e.g., insults, obscenities) in datasets retrieved from social media platforms; the other identifies registers (genres such as reviews, interviews, and news reports) in web content across diverse languages.

Toxicity classifier: https://github.com/TurkuNLP/toxicity-classifier
Multilingual modeling of web registers: https://github.com/TurkuNLP/multilingual-register-labeling

Resource developed by the University of Turku in partnership with CSC.
 
Guidance can be found on the websites of the resources.
Contact for queries: Veronika Laippala
 

DARIAH-FI OFFICE:

The National Archives of Finland

Tanja Välisalo

DARIAH-FI: GENERAL

DARIAH-FI OFFICE:

CSC – IT Centre for Science

Katri Tegel

DARIAH-FI OFFICE:

National Library of Finland

Johanna Lilja

DARIAH-FI OFFICE:

Tampere University

Sanna Kumpulainen

DARIAH-FI OFFICE:

Aalto University

Eero Hyvönen

DARIAH-FI OFFICE:

University of Oulu

Marika Rauhala

DARIAH-FI OFFICE:

University of Eastern Finland

Paula Rautionaho

DARIAH-FI OFFICE:

University of Jyväskylä

Venla Poso

DARIAH-FI OFFICE:

University of Turku

Veronika Laippala