Proposer
Alessandro Suglia
Title
BabyLM: Pretraining Language Models with a developmentally plausible corpus
Goal
Sample-efficient pretraining on a developmentally plausible corpus
Description
Enormous effort has gone into optimizing LM pretraining at massive scales over the last several years. While growing parameter counts often get the most attention, datasets have also grown by orders of magnitude: Chinchilla, for example, sees 1.4 trillion words during training, well over 10,000 words for every word a 13-year-old child has heard in their entire life. The goal of this shared task is to incentivize researchers with an interest in pretraining or cognitive modelling to focus their efforts on optimizing pretraining given data limitations inspired by human development. Additionally, we hope to democratize research on pretraining, which is typically thought to be practical only for large industry groups, by drawing attention to open problems that can be addressed on a university budget.
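To make the technical goal concrete, the sketch below shows one way data-limited causal LM pretraining could look, assuming the Hugging Face transformers and datasets libraries. The corpus file name (babylm_10M.txt), the small GPT-2-style configuration, and all hyperparameters are illustrative placeholders, not part of the shared task or this proposal.

# Minimal, hypothetical sketch of data-limited causal LM pretraining.
# Assumes the Hugging Face transformers/datasets stack; the corpus path
# "babylm_10M.txt" and all hyperparameters are illustrative only.
from datasets import load_dataset
from transformers import (AutoTokenizer, GPT2Config, GPT2LMHeadModel,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Load the small, developmentally plausible corpus (one document per line)
# and drop empty lines.
raw = load_dataset("text", data_files={"train": "babylm_10M.txt"})
raw = raw.filter(lambda ex: ex["text"].strip() != "")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Small GPT-2-style model trained from scratch (random init), not fine-tuned.
config = GPT2Config(n_layer=6, n_head=8, n_embd=512,
                    vocab_size=tokenizer.vocab_size)
model = GPT2LMHeadModel(config)

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
args = TrainingArguments(
    output_dir="babylm-gpt2-small",
    per_device_train_batch_size=32,
    num_train_epochs=10,   # multiple epochs: data, not compute, is the bottleneck
    learning_rate=5e-4,
    logging_steps=100,
)

Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()

The shared task leaves the architecture, objective, and training recipe entirely open; the sketch only illustrates that, at this data scale, training for multiple epochs over a fixed small corpus is the natural regime and the whole pipeline fits a university-scale compute budget.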
Resources
https://babylm.github.io/
Background
AI, Deep Learning, Natural Language Processing, Computer Vision
Difficulty Level
High
Ethical Approval
None
Number Of Students
2
Supervisor
Alessandro Suglia
Keywords
deep learning, neural networks, language models
Degrees
Master of Science in Artificial Intelligence
Master of Science in Artificial Intelligence with SMI
Master of Science in Data Science