
Towards Natural Language Understanding: Developing and Assessing Approaches and Benchmarks

Lectio praecursoria

Aarne Talman, University of Helsinki, 23 February 2024

Language understanding is often considered a hallmark of intelligence. We humans are able to use language to name objects, communicate with each other, and perform complex reasoning. But language is not only a tool for reasoning and communication; it can also be used to document and store information. At the center of these capabilities is understanding. We are able to use language for these purposes because we understand meanings. Until recently, language understanding was considered something that only humans could achieve.

Natural Language Understanding, or NLU, is often referred to as the Holy Grail of Artificial Intelligence. NLU embodies the ambition to enable machines to comprehend and interpret human language in all its complexity. Achieving true NLU would mean that AI systems can understand text or speech not merely at a superficial level of words and phrases but can grasp the underlying intentions, emotions, and nuances. Such capability would revolutionize how humans interact with machines, enabling more intuitive, efficient, and meaningful applications. As such, NLU remains a central goal that drives forward the frontiers of AI research and development.

Natural Language Understanding is often divided into different tasks aimed at capturing the complexities of human language. These tasks serve as proxies for language understanding and include capabilities such as common-sense reasoning, logical reasoning, paraphrase detection, and sentiment detection, to name a few. Each of these tasks targets a specific aspect of language understanding, collectively contributing to the development of AI systems capable of interpreting and interacting with human language. Studying language understanding through these proxy tasks not only breaks NLU research down into smaller, more manageable chunks; it also allows us to design tasks that push the science further by finding areas where humans perform well but NLU models fall short.

To evaluate AI's capabilities in performing these diverse NLU tasks, numerous benchmarks have been published in recent years. They serve as standardized testbeds for assessing the performance of AI models against a range of linguistic challenges.

The research questions of my PhD dissertation focus on advancing our comprehension and modeling of human language understanding within the field of Language Technology. The first research question explores methodologies and approaches for modeling human language understanding more accurately in AI systems. The second question scrutinizes the generalization capabilities of high-performing neural network models across NLU datasets, assessing whether these models can effectively apply learned knowledge to unfamiliar data and truly grasp the underlying task. Finally, the third question critically evaluates the effectiveness of current NLU benchmarks in measuring genuine language understanding. It asks whether these benchmarks accurately reflect the complexity of human language comprehension or whether they can be solved through models' ability to exploit dataset-specific biases.

This dissertation focuses especially on one NLU task called Natural Language Inference. Natural Language Inference, NLI, has long been a fundamental task in natural language understanding and computational semantics. In NLI, sentence pairs are analyzed to determine their semantic relationship, categorized as entailment, contradiction, or neutral: whether the hypothesis sentence follows from, contradicts, or is unrelated to the premise sentence. Over time, datasets for NLI have evolved from focusing on logical reasoning to incorporating informal linguistic reasoning, reflecting the task's evolution and its central role in natural language understanding research.
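
To make the three categories concrete, here is a small illustrative example with made-up sentence pairs (not drawn from any particular NLI dataset):

```python
# Illustrative (made-up) NLI examples: each pairs a premise with a hypothesis
# and a label describing their semantic relationship.
nli_examples = [
    {
        "premise": "A man is playing a guitar on stage.",
        "hypothesis": "A musician is performing.",
        "label": "entailment",      # the hypothesis follows from the premise
    },
    {
        "premise": "A man is playing a guitar on stage.",
        "hypothesis": "The stage is empty.",
        "label": "contradiction",   # the hypothesis contradicts the premise
    },
    {
        "premise": "A man is playing a guitar on stage.",
        "hypothesis": "The concert is sold out.",
        "label": "neutral",         # neither entailed nor contradicted
    },
]

for example in nli_examples:
    print(f'{example["label"]:>13}: "{example["premise"]}" -> "{example["hypothesis"]}"')
```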

Let me now turn to the contributions of this dissertation.

Before the advent of the transformer architecture, state-of-the-art models for natural language understanding largely relied on recurrent neural networks, particularly Long Short-Term Memory networks, LSTMs. InferSent, an influential model using bidirectional LSTMs, BiLSTMs, to encode sentences for the Natural Language Inference task, demonstrated the power of this approach by achieving top performance on the SentEval benchmark and the SNLI dataset. Building on InferSent's success, we developed the Iterative Refinement Encoder, IRE. This method embeds words using pre-trained GloVe embeddings, iteratively refines these embeddings through multiple BiLSTM layers, applying max pooling at each stage, and finally concatenates the pooled outputs to form a robust sentence embedding. This approach aims to create more nuanced representations, improving performance across a broad spectrum of NLU tasks.
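
As a rough sketch of this idea (not the exact architecture, layer sizes, or training setup used in the dissertation; the dimensions below are placeholders), the encoder can be pictured as a stack of BiLSTM layers whose max-pooled outputs are concatenated into a single sentence embedding:

```python
import torch
import torch.nn as nn

class IterativeRefinementEncoder(nn.Module):
    """Sketch of an iterative-refinement sentence encoder: stacked BiLSTM
    layers, max pooling after each layer, and concatenation of the pooled
    outputs into one sentence embedding."""

    def __init__(self, embed_dim: int = 300, hidden_dim: int = 600, num_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList()
        input_dim = embed_dim
        for _ in range(num_layers):
            self.layers.append(
                nn.LSTM(input_dim, hidden_dim, batch_first=True, bidirectional=True)
            )
            input_dim = 2 * hidden_dim  # the next layer refines this layer's outputs

    def forward(self, word_embeddings: torch.Tensor) -> torch.Tensor:
        # word_embeddings: (batch, seq_len, embed_dim), e.g. pre-trained GloVe vectors
        pooled = []
        hidden = word_embeddings
        for lstm in self.layers:
            hidden, _ = lstm(hidden)                 # (batch, seq_len, 2 * hidden_dim)
            pooled.append(hidden.max(dim=1).values)  # max pooling over the time dimension
        return torch.cat(pooled, dim=-1)             # concatenate all refinement stages

# Example: a batch of 2 sentences, 10 tokens each, with 300-dimensional embeddings.
encoder = IterativeRefinementEncoder()
embedding = encoder(torch.randn(2, 10, 300))
print(embedding.shape)  # torch.Size([2, 3600])
```

Stacking the layers lets each stage refine the representation produced by the previous one, while concatenating the pooled outputs preserves information from every refinement stage.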

When we initially published our findings, our model reached state-of-the-art performance on various NLU datasets. Since then, new models have emerged that surpass ours in performance. However, at the time, it was the top-performing model on various NLI datasets as well as on the SentEval sentence embedding evaluation benchmark. Moreover, the IRE model demonstrated superior capability in capturing diverse linguistic properties, outperforming the InferSent model in 8 out of 10 linguistic probing tasks. Overall, the proposed method and architecture of the IRE model have proven highly effective in generating sentence embeddings that perform well across a multitude of NLU tasks.

Human language understanding is, however, inherently complex, varying significantly across individuals and even within the same person over time and depending on the context. Despite this complexity, the majority of datasets and evaluation benchmarks in natural language understanding research employ fixed gold labels, assuming a singular, canonical interpretation of language. While some NLU datasets, particularly those generated via crowdsourcing, do feature multiple annotations per example to reflect a range of interpretations, traditional views within the field often regard annotation disagreements as a problem rather than as reflections of legitimate diversity in language understanding. This perspective limits the ability of NLU systems to fully grasp the multifaceted nature of human language comprehension.
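
As a concrete illustration (the annotations here are made up), an example labeled by five crowdworkers can be represented either by a single majority-vote gold label, which discards the disagreement, or by a label distribution, which preserves it:

```python
from collections import Counter

# Hypothetical annotations from five crowdworkers for one NLI sentence pair.
annotations = ["entailment", "entailment", "neutral", "entailment", "neutral"]

# Majority voting collapses the disagreement into a single gold label ...
gold_label = Counter(annotations).most_common(1)[0][0]

# ... whereas a label distribution preserves it.
label_distribution = {
    label: count / len(annotations)
    for label, count in Counter(annotations).items()
}

print(gold_label)          # entailment
print(label_distribution)  # {'entailment': 0.6, 'neutral': 0.4}
```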

In addition to the advances in sentence encoding methods using the Iterative Refinement Encoder, my research has focused on advancing natural language understanding models by incorporating methods that explicitly model the uncertainty and ambiguity inherent in natural language. For this, we chose Stochastic Weight Averaging Gaussian, SWAG, an extension of Stochastic Weight Averaging. SWAG improves upon simple stochastic weight averaging by adding an estimate of the variance of the weights, yielding an approximate Gaussian posterior over them. It does this in a computationally efficient way, using only the diagonal of the covariance matrix together with a low-rank estimate computed over the last k steps of training. Our study explored the effectiveness of SWAG in modeling annotation disagreements in natural language inference. By comparing the cross-entropy of predictions from SWAG checkpoints with that derived from human label disagreement, we demonstrated that incorporating uncertainty modeling yields distributions that align more closely with the diverse interpretations of human annotators. These findings suggest that methods like SWAG are valuable for crafting models that better reflect human variability in language understanding.
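
A minimal sketch of the core idea, keeping only the diagonal part of SWAG and omitting the low-rank term for brevity (this is an illustration, not the implementation used in the experiments):

```python
import copy
import torch
import torch.nn as nn

class DiagonalSWAG:
    """Simplified sketch of SWAG with only a diagonal covariance: collect
    running first and second moments of the model weights during training,
    then sample weight sets from the implied Gaussian posterior."""

    def __init__(self, model: nn.Module):
        self.model = model
        self.n = 0
        self.mean = {name: torch.zeros_like(p) for name, p in model.named_parameters()}
        self.sq_mean = {name: torch.zeros_like(p) for name, p in model.named_parameters()}

    def collect(self) -> None:
        # Call periodically towards the end of training to update the running moments.
        self.n += 1
        for name, p in self.model.named_parameters():
            self.mean[name] += (p.detach() - self.mean[name]) / self.n
            self.sq_mean[name] += (p.detach() ** 2 - self.sq_mean[name]) / self.n

    def sample_model(self) -> nn.Module:
        # Draw one set of weights from N(mean, diag(variance)) and return a model copy.
        sampled = copy.deepcopy(self.model)
        with torch.no_grad():
            for name, p in sampled.named_parameters():
                variance = (self.sq_mean[name] - self.mean[name] ** 2).clamp(min=1e-30)
                p.copy_(self.mean[name] + variance.sqrt() * torch.randn_like(p))
        return sampled

# Usage sketch with a toy model: collect moments during the last training steps,
# then draw weight samples at prediction time.
swag = DiagonalSWAG(nn.Linear(10, 3))
for _ in range(5):
    swag.collect()          # in practice, called after training steps
model_sample = swag.sample_model()
```

Averaging the softmax outputs of several sampled models then gives a predictive distribution that can be compared with the distribution of human annotations.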

One of the primary aims in machine learning and natural language understanding is to develop models that generalize effectively to unseen data. While the majority of research within NLU has focused on generalization to new tasks, there is a notable gap in research on models' ability to generalize across different datasets within the same task, such as transferring knowledge from one natural language inference dataset to another. This cross-dataset generalization is vital for confirming that a model has truly grasped the task at hand.

Our research on the cross-dataset generalization capabilities of neural network models in natural language inference reveals that models consistently underperform when trained on one NLI dataset and tested on another, despite the datasets sharing the same definitions of the semantic relationships. Using both a transformer-based model and RNN-based models, we observed a significant drop in accuracy in our cross-dataset experiments. The core issue likely stems from a combination of factors, with a strong indication that models are learning dataset-specific artifacts rather than the intended inferential relationships.
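
The evaluation protocol behind these experiments can be summarized in a few lines; the sketch below is generic and only meant to make the train-on-one, test-on-another setup explicit (the `predict` function stands in for any NLI model trained on the source dataset):

```python
from typing import Callable, List, Tuple

# Each dataset is represented as a list of (premise, hypothesis, gold_label) triples.
NLIDataset = List[Tuple[str, str, str]]

def cross_dataset_accuracy(
    predict: Callable[[str, str], str],   # model trained on the *source* dataset
    target_test_set: NLIDataset,          # test set of a *different* NLI dataset
) -> float:
    """Accuracy of a source-trained model on a target dataset's test set.
    A large gap between in-dataset and cross-dataset accuracy suggests the
    model has learned dataset-specific artifacts rather than the task."""
    correct = sum(
        predict(premise, hypothesis) == gold
        for premise, hypothesis, gold in target_test_set
    )
    return correct / len(target_test_set)
```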

These issues, as well as previous research, have highlighted problems within NLU datasets and benchmarks, including biases and artifacts that lead models to perform well by learning spurious features rather than genuinely understanding language. This problem is exemplified by models' inability to transfer what they have learned across different datasets within the same task, such as NLI, suggesting that the models may not truly understand language.

To study whether NLU benchmarks really measure language understanding, we employed dataset corruption methods that alter the meanings of sentences by removing all occurrences of specific word classes. These corruptions often make the sentences unintelligible even to humans.
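
As an illustration of this kind of corruption (the exact procedures and word classes used in the dissertation may differ), the sketch below drops every noun from a sentence using NLTK's part-of-speech tagger; it assumes the tokenizer and tagger models are available, and the required resource names can vary across NLTK versions:

```python
import nltk

# Download the tokenizer and POS tagger models if they are not already present
# (resource names may differ in newer NLTK versions).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def corrupt_sentence(sentence: str, pos_prefixes=("NN",)) -> str:
    """Remove every token whose POS tag starts with one of the given prefixes
    (here all nouns), often leaving the sentence unintelligible."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    kept = [word for word, tag in tagged if not tag.startswith(pos_prefixes)]
    return " ".join(kept)

print(corrupt_sentence("A man is playing a guitar on stage."))
# e.g. "A is playing a on ."
```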

In our research, we discovered that model performance remained high even on the corrupted, unintelligible datasets, leading to the conclusion that many of the popular NLU benchmarks can be solved without language understanding. This finding underscores the fact that current models are highly capable of exploiting statistical cues in the data to make accurate predictions, bypassing the need to understand the language. This insight calls for a reevaluation of how language understanding is measured in the field of NLU and highlights the urgent need to develop benchmarks that better represent the complexities of real-world language use.

Finally, to develop better benchmarks for evaluating language understanding, and ultimately better NLU models, we need to define more clearly what language understanding is.

In NLP research, there are two major perspectives on language understanding. The first, a usage-based view influenced by cognitive semantics and philosophers such as Wittgenstein, evaluates understanding through performance on various tasks. That is, according to this view, a model understands language if it performs well on various proxy tasks such as NLI and sentiment detection.

The second perspective holds that understanding involves more than task performance; it requires grasping the communicative intent behind expressions. This view aligns with theories in philosophy and psychology that emphasize mental representations as the foundation of understanding. It suggests that successful communication depends on mental representations and understanding the speaker's intent.

When evaluating NLU models, the first approach suggests that state-of-the-art neural network models show some degree of understanding, as they perform well in numerous tasks. However, since performance on tasks does not necessarily equate to genuine understanding, as we have just discussed, this view seems questionable. The second approach raises the question of whether NLU models can grasp intents. It has been argued that AI models lack the ability to represent meanings or intentions. However, I would argue that modern neural network models are capable of learning representations that can serve as the foundation for understanding, and ultimately of understanding language, at least to some degree. Note that this does not necessarily mean that AI models would or even could have consciousness. I take consciousness and understanding to be separate capabilities, where the latter does not depend on the former.

In conclusion, more research is needed to define what language understanding really is, how it can be best modeled, and how it should be measured. These remain the goals for my future research.

Read the thesis...