Proposer
-
Alessandro Suglia
-
Title
-
BabyLM: Pretraining Language Models with a developmentally plausible corpus
-
Goal
-
Sample-efficient pretraining on a developmentally plausible corpus
-
Description
-
A huge effort has been put into optimizing LM pretraining at massive scales in the last several years. While growing parameter counts often get the most attention, datasets have also grown by orders of magnitude. For example, Chinchilla sees 1.4 trillion words during training, well over 10,000 words for every one word a 13-year-old child has heard in their entire life.
The goal of this shared task is to incentivize researchers with an interest in pretraining or cognitive modelling to focus their efforts on optimizing pretraining given data limitations inspired by human development.
Additionally, we hope to democratize research on pretraining, which is typically thought to be practical only for large industry groups, by drawing attention to open problems that can be addressed on a university budget.
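For a concrete sense of the setup, the sketch below pretrains a small causal language model from scratch on a fixed, word-capped corpus. This is a minimal sketch only, assuming a Hugging Face Transformers/Datasets toolchain; the file name train.txt, the model size, and all hyperparameters are illustrative assumptions, not part of the shared task definition.

# Minimal sketch: pretraining a small causal LM on a word-capped corpus.
# Assumes Hugging Face Transformers/Datasets; train.txt and all
# hyperparameters below are hypothetical placeholders.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    GPT2Config,
    GPT2LMHeadModel,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Load the capped pretraining corpus (one document per line).
raw = load_dataset("text", data_files={"train": "train.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# A deliberately small GPT-2-style model, scaled to the limited data budget.
config = GPT2Config(n_layer=6, n_head=8, n_embd=512, vocab_size=tokenizer.vocab_size)
model = GPT2LMHeadModel(config)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="babylm-small",
        num_train_epochs=10,
        per_device_train_batch_size=16,
        learning_rate=3e-4,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

One natural design choice under such data limits is to scale the model and training schedule down to the corpus, rather than scaling the corpus up to the model.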
-
Resources
-
https://babylm.github.io/
-
Background
-
AI, Deep Learning, Natural Language Processing, Computer Vision
-
Difficulty Level
-
High
-
Ethical Approval
-
None
-
Number Of Students
-
2
-
Supervisor
-
Alessandro Suglia
-
Keywords
-
deep learning, neural networks, language models
-
Degrees
-
Master of Science in Artificial Intelligence
Master of Science in Artificial Intelligence with SMI
Master of Science in Data Science