Noise-resistant NLP Models and Data
Leader: Veronika Laippala, University of Turku; Partners: CSC; Collaborators: JYU
This WP provides infrastructure for processing noisy or otherwise non-standard data (e.g. historical language, dialects, spoken genres, and OCR noise). Our current tools, such as the Turku Neural Parser and the FinBERT language model, process standard language with very high accuracy, but their performance deteriorates on noisy, non-standard input. We therefore need infrastructure that is tolerant to noise and non-standard language. To achieve this, we will develop datasets and language models targeting such departures from the norm: corpora of non-standard language; statistical models of noise that allow various types of noise to be introduced automatically; large language models pre-trained on noisy text; and noise-resistant fine-tuned task-specific models for, e.g., parsing, named entity tagging, and sentiment analysis.
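To illustrate the kind of noise model envisioned above, the following sketch injects OCR-style noise into clean text. The confusion table and rates here are hypothetical placeholders; in the WP, such parameters would be estimated from aligned clean/noisy corpora rather than hand-written.

```python
import random

# Hypothetical OCR confusion pairs; a real noise model would estimate
# these (and their probabilities) from aligned clean/noisy corpora.
OCR_CONFUSIONS = {"rn": "m", "l": "1", "o": "0", "i": "í", "a": "ä"}

def inject_noise(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Introduce OCR-like noise: substitute, delete, or duplicate characters.

    `rate` is the per-character probability of applying a noise operation;
    `seed` makes the corruption reproducible across runs.
    """
    rng = random.Random(seed)
    out = []
    i = 0
    while i < len(text):
        # Try two-character confusions first (e.g. "rn" -> "m").
        pair = text[i:i + 2]
        if pair in OCR_CONFUSIONS and rng.random() < rate:
            out.append(OCR_CONFUSIONS[pair])
            i += 2
            continue
        ch = text[i]
        if rng.random() < rate:
            op = rng.choice(["substitute", "delete", "duplicate"])
            if op == "substitute":
                out.append(OCR_CONFUSIONS.get(ch, ch))
            elif op == "duplicate":
                out.append(ch + ch)
            # "delete": append nothing, dropping the character.
        else:
            out.append(ch)
        i += 1
    return "".join(out)
```

Corrupting clean training data with such a model at varying rates is one way to produce the noisy pre-training and fine-tuning material mentioned above.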