Student Lightning Talks - Tianjian, Lingfeng, Kaiser

  • Published Jan 29, 2024
  • "Locating and Removing Error During Training for Text Generation Models" - Tianjian Li
    Text generation models are notoriously vulnerable to errors in the training data. As training on massive amounts of web-crawled data becomes commonplace, how can we enhance the robustness of models trained on such noisy text? In our work, we propose Error Norm Truncation (ENT), a robust enhancement to the standard training objective that truncates noisy data. Compared to methods that estimate data quality using only the negative log-likelihood loss, our method provides a more accurate estimate by also considering the distribution of non-target tokens, which previous work often overlooks. Through comprehensive experiments across language modeling, machine translation, and text summarization, we show that equipping text generation models with ENT improves generation quality over standard training and over previous soft and hard truncation methods. (A sketch of the error-norm criterion follows these abstracts.)
    "The Trickle-down Impact of Reward (In-)consistency on RLHF" - Lingfeng Shen
    In this paper, we examine a series of research questions about reward model (RM) inconsistency in RLHF: (1) How can we measure the consistency of reward models? (2) How consistent are existing RMs, and how can we improve them? (3) In what ways does reward inconsistency influence the chatbots produced by RLHF training? We propose Contrast Instruction, a benchmarking strategy for RM consistency. Each example in Contrast Instruction features a pair of lexically similar instructions with different ground-truth responses. We reveal that standard-trained RMs are significantly less consistent than humans across tasks and models, show how this inconsistency negatively impacts RLHF model training, and discuss the challenges of overcoming such issues with current RLHF techniques. (A consistency-check sketch follows these abstracts.)
    "The Validity of Evaluation Results: Assessing Concurrence Across Compositionality Benchmarks"- Kaiser Sun
    NLP models have progressed drastically in recent years, according to the numerous datasets proposed to evaluate their performance. Questions remain, however, about how particular dataset design choices may impact the conclusions we draw about model capabilities. In this work, we investigate this question in the domain of compositional generalization. We examine the performance of six modeling approaches across four datasets, split according to eight compositional splitting strategies, ranking models on 18 compositional generalization splits in total. Our results show that: i) the datasets, although all designed to evaluate compositional generalization, rank modeling approaches differently; ii) human-generated datasets align better with each other than they do with synthetic datasets, or than synthetic datasets do with each other; iii) generally, whether datasets are sampled from the same source is more predictive of the resulting model ranking than whether they maintain the same interpretation of compositionality; and iv) which lexical items are used in the data can strongly affect conclusions. Overall, our results demonstrate that much work remains to be done in assessing whether popular evaluation datasets measure what they intend to measure, and suggest that establishing more rigorous standards for the validity of evaluation sets could benefit the field. (A sketch of one possible concurrence measure follows these abstracts.)
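For the first abstract, the sketch below illustrates how an error-norm-based truncation criterion could look in PyTorch. It assumes the error norm is the L2 distance between the model's predicted token distribution and the one-hot target distribution; the threshold value and the masking scheme are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def error_norm(logits, targets):
    """Per-token L2 distance between the predicted token distribution
    and the one-hot target distribution.

    logits:  (batch, seq_len, vocab) model outputs
    targets: (batch, seq_len) gold token ids
    """
    probs = F.softmax(logits, dim=-1)
    one_hot = F.one_hot(targets, num_classes=probs.size(-1)).float()
    # Unlike NLL alone, this also reflects how probability mass is
    # spread over non-target tokens.
    return torch.linalg.norm(probs - one_hot, dim=-1)  # (batch, seq_len)

def truncated_loss(logits, targets, threshold=1.2):
    """Cross-entropy loss that drops tokens whose error norm exceeds a
    (hypothetical) threshold, i.e. tokens flagged as likely noise."""
    nll = F.cross_entropy(
        logits.transpose(1, 2), targets, reduction="none"
    )  # (batch, seq_len)
    mask = (error_norm(logits, targets) < threshold).float()
    return (nll * mask).sum() / mask.sum().clamp(min=1.0)
```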
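For the second abstract, the sketch below shows one possible consistency check over Contrast Instruction-style pairs: a reward model is counted as consistent on a pair of lexically similar instructions only if it scores the matching response above the swapped one for both instructions. The `reward_model` callable and the pass/fail criterion are assumptions for illustration, not the paper's exact metric.

```python
def rm_consistency(reward_model, pairs):
    """Fraction of instruction pairs on which a reward model ranks the
    matching response above the swapped one for *both* instructions.

    reward_model: callable (instruction, response) -> float score
    pairs: iterable of (instr_a, resp_a, instr_b, resp_b), where
           instr_a and instr_b are lexically similar instructions with
           different ground-truth responses resp_a and resp_b.
    """
    consistent, total = 0, 0
    for instr_a, resp_a, instr_b, resp_b in pairs:
        ok_a = reward_model(instr_a, resp_a) > reward_model(instr_a, resp_b)
        ok_b = reward_model(instr_b, resp_b) > reward_model(instr_b, resp_a)
        consistent += int(ok_a and ok_b)
        total += 1
    return consistent / max(total, 1)
```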
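For the third abstract, the sketch below illustrates one way to quantify how much two evaluation splits agree on a model ranking, using Kendall's tau rank correlation over per-model scores. The choice of Kendall's tau and the toy scores are assumptions; the paper's exact concurrence statistic may differ.

```python
from scipy.stats import kendalltau

def concurrence(split_a, split_b):
    """Kendall's tau between the model rankings induced by two splits
    (higher means the splits agree more on which approaches are better).

    split_a, split_b: dicts mapping the same model names to a score
    (e.g. accuracy) on two different compositional generalization splits.
    """
    models = sorted(split_a)  # fixed model order shared by both splits
    scores_a = [split_a[m] for m in models]
    scores_b = [split_b[m] for m in models]
    tau, _ = kendalltau(scores_a, scores_b)  # rank correlation
    return tau

# Toy example: two splits that mostly, but not fully, agree on ranking.
split_x = {"m1": 0.82, "m2": 0.74, "m3": 0.55, "m4": 0.31}
split_y = {"m1": 0.60, "m2": 0.65, "m3": 0.40, "m4": 0.22}
print(concurrence(split_x, split_y))
```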
