03-04
Rethinking Data Use in Large Language Models

Large language models (LMs) such as ChatGPT have revolutionized natural language processing and artificial intelligence more broadly. In this talk, I will discuss my research on understanding and advancing these models, centered around how they use the very large text corpora they are trained on. First, I will describe our efforts to understand how these models learn to perform new tasks after training, demonstrating that their so-called in context learning capabilities are almost entirely determined by what they learn from the training data. Next, I will introduce a new class of LMs—nonparametric LMs—that repurpose this training data as a data store from which they retrieve information for improved accuracy and updatability. I will describe my work on establishing the foundations of such models, including one of the first broadly used neural retrieval models and an approach that simplifies a traditional, two-stage pipeline into one. I will also discuss how nonparametric models open up new avenues for responsible data use, e.g., by segregating permissive and copyrighted text and using them differently. Finally, I will envision the next generation of LMs we should build, focusing on efficient scaling, improved factuality, and decentralization.

Bio: Sewon Min is a Ph.D. candidate in the Paul G. Allen School of Computer Science & Engineering at the University of Washington. Her research focuses on language models (LMs): studying the science of LMs, and designing new model classes and learning methods that make LMs more performant and flexible. She also studies LMs in information-seeking, legal, and privacy contexts. She is a co-organizer of multiple tutorials and workshops, including most recently at ACL 2023 on Retrieval-based Language Models and Applications and upcoming at ICLR 2024 on Mathematical and Empirical Understanding of Foundation Models. She won a paper award at ACL 2023, received a J.P. Morgan Fellowship, and was named an EECS rising star in 2022.

To request accommodations for a disability please contact Emily Lawrence, emilyl@cs.princeton.edu, at least one week prior to the event.

Date and Time

Monday March 4, 2024 12:30pm - 1:30pm

Location

Computer Science Small Auditorium (Room 105)

Event Type

CS Department Colloquium Series

Speaker

Sewon Min, from University of Washington

Host

Danqi Chen

Contributions to and/or sponsorship of any event does not constitute departmental or institutional endorsement of the specific program, speakers or views presented.

CS Talks Mailing List

03-04 Rethinking Data Use in Large Language Models

03-04
Rethinking Data Use in Large Language Models