Data-distributional Approaches for Generalizable Language Models -- Sang Michael Xie (Stanford)

  • Published 16 Apr 2024
  • Abstract: High-quality datasets are crucial for improving the capabilities and training efficiency of large language models. However, current datasets are typically prepared in an ad hoc, heuristic way. In this talk, Sang Michael Xie will present principled approaches to improving and understanding language models, centered on the pre-training data distribution. First, he will describe how to improve the efficiency of training multipurpose language models by optimizing the mixture of data sources with robust optimization. Second, he will discuss an efficient importance resampling method for selecting relevant data from trillion-token-scale web datasets for training a specialized model (a small illustrative sketch of this idea appears after the speaker biography). Finally, he will introduce a first theoretical analysis of in-context learning, the key capability of language models to learn from examples in a textual prompt, tracing that capability back to coherence structure in the pre-training data.
    Speaker Biography: Sang Michael Xie is a computer science PhD student at Stanford University advised by Percy Liang and Tengyu Ma. His research focuses on data-centric machine learning for language models, understanding pre-training and adaptation, and pre-training and self-training methods for robust machine learning. Xie was awarded an NDSEG Fellowship and was previously a student researcher at Google Brain. His work has been recognized as one of Scientific American's World-Changing Ideas, published in flagship venues such as Science, and covered by media outlets including The New York Times, The Washington Post, Reuters, BBC News, IEEE Spectrum, and The Verge.
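
The sketch below illustrates the importance-resampling idea referenced in the abstract: fit simple bag-of-hashed-n-gram models to a small target-domain sample and to the raw web data, score each raw document by its log importance weight (log probability under the target model minus log probability under the raw model), and resample documents with probability proportional to those weights. Everything here is an illustrative assumption, not the speaker's actual implementation: the helper names, the hashed bigram features, the bucket count, and the Gumbel-top-k resampling step are all placeholders chosen to keep the example self-contained and runnable.

    import hashlib
    import math
    import random
    from collections import Counter

    # Illustrative sketch only: feature choice and bucket count are assumptions.
    NUM_BUCKETS = 10_000  # dimension of the hashed n-gram feature space

    def hashed_ngram_counts(text, n=2):
        """Map a document to counts over hashed word n-gram buckets."""
        tokens = text.lower().split()
        counts = Counter()
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            bucket = int(hashlib.md5(ngram.encode()).hexdigest(), 16) % NUM_BUCKETS
            counts[bucket] += 1
        return counts

    def fit_bucket_distribution(docs, smoothing=1.0):
        """Fit a smoothed bag-of-hashed-n-grams distribution over bucket ids."""
        total = Counter()
        for doc in docs:
            total.update(hashed_ngram_counts(doc))
        denom = sum(total.values()) + smoothing * NUM_BUCKETS
        return {b: (total.get(b, 0) + smoothing) / denom for b in range(NUM_BUCKETS)}

    def log_importance_weight(doc, p_target, p_raw):
        """log p_target(doc) - log p_raw(doc) under the bag-of-n-grams models."""
        counts = hashed_ngram_counts(doc)
        return sum(c * (math.log(p_target[b]) - math.log(p_raw[b]))
                   for b, c in counts.items())

    def resample(raw_docs, target_docs, k, seed=0):
        """Select k raw documents via Gumbel-top-k sampling, i.e. sampling
        without replacement with probability proportional to the importance weights."""
        rng = random.Random(seed)
        p_target = fit_bucket_distribution(target_docs)
        p_raw = fit_bucket_distribution(raw_docs)
        scored = []
        for doc in raw_docs:
            logw = log_importance_weight(doc, p_target, p_raw)
            gumbel = -math.log(-math.log(rng.random()))
            scored.append((logw + gumbel, doc))
        scored.sort(reverse=True)
        return [doc for _, doc in scored[:k]]

    # Example usage (hypothetical data): pick 2 web documents closest to the target domain.
    # selected = resample(raw_docs=web_snippets, target_docs=medical_abstracts, k=2)

The hashed n-gram features keep the two density models cheap to fit and evaluate, which is what makes this style of selection plausible at web scale; the resampling step (rather than hard top-k filtering by weight alone) preserves diversity in the selected subset.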
