Description |
Real world data differs radically from the benchmark corpora we use in
natural language processing (NLP). As soon as we apply our technology
to the real world, performance drops. The reason for this problem is
obvious: NLP models are trained on samples from a limited set of
canonical varieties that are considered standard, most prominently
English newswire. However, texts can differ from the standard along
many dimensions, e.g., socio-demographics, language, genre, and
sentence type. The solution is not obvious: we
cannot control for all factors, and it is not clear how to best go
beyond the current practice of training on homogeneous data from a
single domain and language.
In this talk, I review the notion of canonicity, and how it shapes our
community's approach to language. I argue for leveraging what I call
fortuitous data, i.e., non-obvious data that is hitherto neglected,
hidden in plain sight, or raw data that needs to be refined. Examples
include leveraging hyperlinks to process social media, learning from
actual annotator disagreement, and learning from the combination of
linguistic resources and behavioral data such as keystroke dynamics. If
we embrace the variety of such heterogeneous data and combine it with
appropriate algorithms, we will produce more robust NLP models capable
of addressing natural language variation. One promising direction here is
deep multi-task learning, a method inspired by human learning that has
recently gained considerable attention in deep learning approaches to
NLP.
|