Description |
Real world data differs radically from the benchmark corpora we use in
natural language processing (NLP). As soon as we apply our technology
to the real world, performance drops. The reason for this problem is
obvious: NLP models are trained on samples from a limited set of
canonical varieties that are considered standard, most prominently
English newswire. However, texts can differ from the standard along
many dimensions, e.g., socio-demographics, language, genre, and
sentence type. The solution is not obvious: we
cannot control for all factors, and it is not clear how to best go
beyond the current practice of training on homogeneous data from a
single domain and language.
In this talk, I review the notion of canonicity, and how it shapes our
community's approach to language. I argue for leveraging what I call
fortuitous data, i.e., non-obvious data that is hitherto neglected,
hidden in plain sight, or raw data that needs to be refined. Examples
include leveraging hyperlinks to process social media, learning from
actual annotator disagreement, and learning from the combination of
linguistic resources and behavioral data such as keystroke dynamics. If
we embrace the variety of such heterogeneous data and combine it with
appropriate algorithms, we will produce more robust NLP models capable
of addressing natural language variation. One promising direction here is
deep multi-task learning, a method inspired by human learning that has
recently gained considerable attention in deep learning approaches to
NLP.
|