Proposer
Alessandro Suglia
Title
Understanding and Scaling Pixel-based LLMs
Goal
Create a new family of LLMs that use visual information to learn language
Description
Current Deep Learning models of language processing assume access to a tokenizer, a tool that divides the input text into a sequence of tokens that can be more easily processed by Machine Learning algorithms. A tokenizer is built from a textual corpus, from which it derives the most frequent tokens in the language. However, these representations have several shortcomings: 1) they are specific to each language; 2) they are sensitive to noise (e.g., spelling mistakes); and 3) they are hand-crafted and do not represent language input in a multimodal way, using the visual or auditory input streams as humans do. To overcome these bottlenecks, this project will explore "textless" NLP models that use visual and audio signals to derive latent conceptual representations.
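Shortcoming (2) above can be illustrated with a minimal sketch: a toy greedy longest-match subword tokenizer over a small hypothetical vocabulary (not taken from any real model). A single spelling mistake breaks the expected subword match and fragments the word into many short pieces, which is the kind of brittleness pixel-based models aim to avoid.

```python
# Toy greedy longest-match subword tokenizer.
# VOCAB is a hypothetical vocabulary for illustration only;
# real tokenizers (e.g. BPE) learn theirs from a corpus.
VOCAB = {"language", "model", "lang", "uage", "mod", "el",
         "ing"} | set("abcdefghijklmnopqrstuvwxyz")

def tokenize(word: str) -> list[str]:
    """Segment `word` by repeatedly taking the longest vocabulary match."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest span first
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

print(tokenize("language"))  # ['language'] - one clean token
print(tokenize("langauge"))  # ['lang', 'a', 'u', 'g', 'e'] - typo fragments it
```

The same input with one transposed character goes from a single token to five, so downstream models see a very different representation for what a human reads as the same word.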
Resources
- https://speechbot.github.io/
- Rust, Phillip, et al. "Language Modelling with Pixels." arXiv preprint arXiv:2207.06991 (2022).
- Tschannen, Michael, Basil Mustafa, and Neil Houlsby. "Image-and-Language Understanding from Pixels Only." arXiv preprint arXiv:2212.08045 (2022).
Background
AI, Deep Learning, Natural Language Processing, Computer Vision
Url
Difficulty Level
High
Ethical Approval
None
Number Of Students
2
Supervisor
Alessandro Suglia
Keywords
deep learning, neural networks, language models, computer vision
Degrees
Master of Science in Artificial Intelligence
Master of Science in Artificial Intelligence with SMI
Master of Science in Data Science