Location:
Examiners:
Readers:
Everyone is invited to attend
Abstract follows below:
Deep learning has advanced remarkably in recent decades, yet its theoretical foundations, particularly for large models, still lag behind its empirical success. This thesis presents research that combines strong theoretical foundations with practical applications in efficiently scaling up large models.
In the first part of the thesis, we focus on the training dynamics of neural networks through the theory of overparametrized networks. We briefly introduce the Neural Tangent Kernel (NTK), covering some of the earliest papers that established NTK as a research field along with its limitations, and then proceed to Hyperparameter Transfer, an important application of the Tensor Programs framework. Hyperparameter Transfer is a novel and efficient paradigm for hyperparameter tuning that provides an optimal strategy for scaling up models. We characterize the training dynamics of deep neural networks and offer an efficient hyperparameter selection scheme in which optimal hyperparameters found by tuning shallow networks also work for deep networks.
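As a schematic illustration of the idea (not the thesis's actual algorithm), hyperparameter transfer amounts to: tune on a small, cheap proxy model, then reuse the selected hyperparameters on the full-size model. The function names and the stand-in loss below are hypothetical.

```python
import math

def proxy_loss(lr: float, width: int) -> float:
    # Hypothetical stand-in for "train a net of this width at this
    # learning rate and report validation loss"; a real run would
    # actually train a model here.
    return (math.log10(lr) + 2.4) ** 2 + 1.0 / width

def tune(width: int, lr_grid):
    # Grid-search the learning rate for a model of the given width.
    return min(lr_grid, key=lambda lr: proxy_loss(lr, width))

lr_grid = [10 ** e for e in range(-5, 0)]
best_small = tune(width=128, lr_grid=lr_grid)   # cheap: tune a small net
best_large = tune(width=8192, lr_grid=lr_grid)  # expensive: tune directly

# Under a transfer-friendly parametrization, the optimum found on the
# small model carries over to the large one, skipping the costly search.
assert best_small == best_large
```

The point of the sketch is the workflow, not the toy loss: in a parametrization where the optimal hyperparameters are stable across scale, the expensive large-model search can be replaced by the cheap small-model one.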
In the second part of the thesis, we focus on the data aspect of scaling large models. We first introduce Skill-Mix, a novel evaluation that sidesteps issues plaguing traditional large language model (LLM) evaluations, such as data contamination and cramming for leaderboards. Skill-Mix randomly selects k language skills, then prompts the LLM to produce a concise text that demonstrates the chosen skills. The exponentially growing number of skill combinations provably prevents data contamination and can further reveal the novelty of successful answers from powerful LLMs. We then introduce ConceptMix, an extension of Skill-Mix that evaluates the ability of text-to-image models to combine k randomly selected visual concepts. Finally, we show that LLMs can learn and generalize skill compositions when trained on good Skill-Mix responses: a few thousand such examples are enough to significantly improve performance on unseen skill combinations, beating much larger models. This suggests that incorporating skill-rich synthetic text into training is an efficient way to scale up data.
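The sampling step of such an evaluation can be sketched as follows; the skill names and prompt wording here are invented for illustration and are not the actual Skill-Mix skill list or prompt.

```python
import math
import random

# Toy skill list; the real evaluation draws from a much larger set.
SKILLS = ["metaphor", "red herring", "spatial reasoning",
          "self-serving bias", "folk physics"]

def skill_mix_prompt(skills, k, seed=0):
    # Randomly pick k skills and ask for one short text combining them.
    chosen = random.Random(seed).sample(skills, k)
    prompt = ("Write a concise text (at most 3 sentences) that "
              "illustrates all of: " + ", ".join(chosen) + ".")
    return chosen, prompt

chosen, prompt = skill_mix_prompt(SKILLS, k=2)

# With N skills and k picks there are C(N, k) distinct combinations,
# so the pool of possible prompts grows rapidly with k, which is what
# makes memorizing answers to all of them infeasible.
n_combos = math.comb(len(SKILLS), 2)  # C(5, 2) = 10
```

With a realistic skill list (say N = 100) and k = 4, C(100, 4) is already about 3.9 million combinations, which conveys why contamination of any fixed test set is implausible.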