-
Proposer
-
Alessandro Suglia
-
Title
-
Understanding and Scaling Pixel-based LLMs
-
Goal
-
Create a new family of LLMs that use visual information to learn language
-
Description
- Current Deep Learning models of language processing assume access to a tokenizer, a tool that divides the input text into a sequence of tokens that can be more easily processed by Machine Learning algorithms. A tokenizer is built from a textual corpus, from which it derives the most frequent tokens in the language. However, these representations have several shortcomings: 1) they are specific to each language; 2) they are sensitive to noise (e.g., spelling mistakes); and 3) they are hand-crafted and do not represent the language input in a multimodal way, using the visual or auditory stream as humans do. To overcome these bottlenecks, this project will explore "textless" NLP models that use visual and audio signals to derive latent conceptual representations.
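As a rough illustration of the pixel-based alternative (in the spirit of PIXEL, Rust et al. 2022), the sketch below replaces subword tokenization by rendering text to a grayscale image and splitting it into fixed-size patches, which become the model's input "tokens". This is a toy mock-up, not the actual PIXEL pipeline: the renderer here fakes glyphs from character codes, whereas a real system would rasterise text with a proper font.

```python
# Hypothetical sketch: a pixel-based LM renders text to an image and feeds
# it to the model as fixed-size patches, instead of subword token IDs.

def render_text(text, height=16, char_width=8):
    """Mock renderer: each character becomes a char_width-wide column block
    of a height x (char_width * len(text)) grayscale image. A real system
    would rasterise the text with an actual font (e.g. via Pillow)."""
    width = char_width * len(text)
    # toy "glyphs": pixel intensity derived from the character code
    return [[(ord(text[x // char_width]) * (y + 1)) % 256
             for x in range(width)]
            for y in range(height)]

def to_patches(image, patch=16):
    """Split an H x W image into non-overlapping patch x patch blocks,
    flattened left to right -- the analogue of a token sequence."""
    width = len(image[0])
    patches = []
    for px in range(0, width - patch + 1, patch):
        block = [image[y][px + dx]
                 for y in range(patch) for dx in range(patch)]
        patches.append(block)
    return patches

img = render_text("hello")            # 16 x 40 grayscale "rendering"
patches = to_patches(img, patch=16)   # 2 patches of 16*16 = 256 values each
```

Because the input is an image rather than a vocabulary index, the same pipeline works for any script and degrades gracefully under spelling noise, which is exactly the motivation stated above.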
Resources
-
- https://speechbot.github.io/
- Rust, Phillip, et al. "Language Modelling with Pixels." arXiv preprint arXiv:2207.06991 (2022).
- Tschannen, Michael, Basil Mustafa, and Neil Houlsby. "Image-and-Language Understanding from Pixels Only." arXiv preprint arXiv:2212.08045 (2022).
-
Background
-
AI, Deep Learning, Natural Language Processing, Computer Vision
-
Url
-
-
Difficulty Level
-
High
-
Ethical Approval
-
None
-
Number Of Students
-
2
-
Supervisor
-
Alessandro Suglia
-
Keywords
-
deep learning, neural networks, language models, computer vision
-
Degrees
-
Master of Science in Artificial Intelligence
Master of Science in Artificial Intelligence with SMI
Master of Science in Data Science