Linguistics Research Seminar

This seminar hosts invited speakers specialising in various areas of linguistics. Department members, students, and interested outside visitors are all cordially invited.

Seminar description

Title What to do about non-canonical data in Natural Language Processing
Speaker Barbara Plank (Groningen)
Date Tuesday, 28 March 2017
Time 12:15
Room L208 (Candolle Building)
Description

Real world data differs radically from the benchmark corpora we use in
natural language processing (NLP). As soon as we apply our technology
to the real world, performance drops. The reason for this problem is
obvious: NLP models are trained on samples from a limited set of
canonical varieties that are considered standard, most prominently
English newswire. However, there are many dimensions, e.g.,
socio-demographics, language, genre, and sentence type, on which
texts can differ from the standard. The solution is not obvious: we
cannot control for all factors, and it is not clear how to best go
beyond the current practice of training on homogeneous data from a
single domain and language.

In this talk, I review the notion of canonicity, and how it shapes our
community's approach to language. I argue for leveraging what I call
fortuitous data, i.e., non-obvious data that has hitherto been neglected,
is hidden in plain sight, or is raw and still needs to be refined. Examples
include leveraging hyperlinks to process social media, learning from
actual annotator disagreement, and learning from the combination of
linguistic resources and behavioral data such as keystroke dynamics. If
we embrace the variety of such heterogeneous data and combine it with
appropriate algorithms, we will produce more robust NLP models capable of
addressing natural language variation. One promising direction here is
deep multi-task learning, a method inspired by human learning that has
recently gained considerable attention in deep learning approaches to
NLP.
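
As a purely illustrative sketch (not material from the talk), deep multi-task learning with hard parameter sharing can be pictured as one shared encoder feeding several task-specific output layers, so that an auxiliary task regularises the representation used by the main task. The tasks, label sets, and toy data below are hypothetical placeholders, written against PyTorch:

    import torch
    import torch.nn as nn

    class SharedTagger(nn.Module):
        def __init__(self, vocab_size, embed_dim=64, hidden_dim=128,
                     n_main_labels=17, n_aux_labels=2):
            super().__init__()
            # Shared layers: both tasks update the same embeddings and BiLSTM.
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                                   bidirectional=True)
            # Task-specific output heads (e.g. a main tagging task and an
            # auxiliary prediction task).
            self.main_head = nn.Linear(2 * hidden_dim, n_main_labels)
            self.aux_head = nn.Linear(2 * hidden_dim, n_aux_labels)

        def forward(self, token_ids):
            states, _ = self.encoder(self.embed(token_ids))
            return self.main_head(states), self.aux_head(states)

    # Hypothetical toy batch: token ids and per-token labels for each task.
    model = SharedTagger(vocab_size=1000)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    tokens = torch.randint(0, 1000, (8, 20))     # 8 sentences of 20 tokens
    main_labels = torch.randint(0, 17, (8, 20))  # main-task labels per token
    aux_labels = torch.randint(0, 2, (8, 20))    # auxiliary-task labels

    main_logits, aux_logits = model(tokens)
    # Joint loss: gradients from both tasks flow into the shared encoder.
    loss = (loss_fn(main_logits.reshape(-1, 17), main_labels.reshape(-1))
            + 0.5 * loss_fn(aux_logits.reshape(-1, 2), aux_labels.reshape(-1)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()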

   