Proposer
Alessandro Suglia
Title
BabyLM: Pretraining Language Models with a developmentally plausible corpus
Goal
Sample-efficient pretraining on a developmentally plausible corpus
Description
Enormous effort has gone into optimizing LM pretraining at massive scales over the last several years. While growing parameter counts often get the most attention, datasets have also grown by orders of magnitude: Chinchilla, for example, sees 1.4 trillion words during training, well over 10,000 words for every word a 13-year-old child has heard in their entire life. The goal of this shared task is to incentivize researchers with an interest in pretraining or cognitive modelling to focus their efforts on optimizing pretraining given data limitations inspired by human development. Additionally, we hope to democratize research on pretraining, which is typically thought to be practical only for large industry groups, by drawing attention to open problems that can be addressed on a university budget.
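To make the technical goal concrete, the sketch below shows one way data-limited causal LM pretraining could look, assuming the Hugging Face transformers and datasets libraries. The corpus file name (babylm_10M.txt), the small GPT-2-style configuration, and all hyperparameters are illustrative placeholders, not part of the shared task or this proposal.

# Minimal, hypothetical sketch of data-limited causal LM pretraining.
# Assumes the Hugging Face transformers/datasets stack; the corpus path
# "babylm_10M.txt" and all hyperparameters are illustrative only.
from datasets import load_dataset
from transformers import (AutoTokenizer, GPT2Config, GPT2LMHeadModel,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Load the small, developmentally plausible corpus (one document per line)
# and drop empty lines.
raw = load_dataset("text", data_files={"train": "babylm_10M.txt"})
raw = raw.filter(lambda ex: ex["text"].strip() != "")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Small GPT-2-style model trained from scratch (random init), not fine-tuned.
config = GPT2Config(n_layer=6, n_head=8, n_embd=512,
                    vocab_size=tokenizer.vocab_size)
model = GPT2LMHeadModel(config)

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
args = TrainingArguments(
    output_dir="babylm-gpt2-small",
    per_device_train_batch_size=32,
    num_train_epochs=10,   # multiple epochs: data, not compute, is the bottleneck
    learning_rate=5e-4,
    logging_steps=100,
)

Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()

The shared task leaves the architecture, objective, and training recipe entirely open; the sketch only illustrates that, at this data scale, training for multiple epochs over a fixed small corpus is the natural regime and the whole pipeline fits a university-scale compute budget.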
Resources
https://babylm.github.io/
Background
AI, Deep Learning, Natural Language Processing, Computer Vision
Difficulty Level
High
Ethical Approval
None
Number Of Students
2
Supervisor
Alessandro Suglia
Keywords
deep learning, neural networks, language models
Degrees
Master of Science in Artificial Intelligence
Master of Science in Artificial Intelligence with SMI
Master of Science in Data Science