Proposer
Alessandro Suglia
Title
Understanding and Scaling Pixel-based LLMs
Goal
Create a new family of LLMs that use visual information to learn language
Description
Current Deep Learning models of language processing assume access to a tokenizer, a tool that divides the input text into a sequence of tokens that can be more easily processed by Machine Learning algorithms. A tokenizer is built from a textual corpus, from which it derives the most frequent tokens in the language. However, these representations have several shortcomings: 1) they are specific to each language; 2) they are sensitive to noise (e.g., spelling mistakes); and 3) they are hand-crafted and do not represent language input in a multimodal way, using the visual or auditory input streams as humans do. To overcome these bottlenecks, this project will explore "textless" NLP models that use visual and audio signals to derive latent conceptual representations.
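Shortcoming (2) above can be illustrated with a minimal sketch: a toy greedy longest-match subword tokenizer over a small hypothetical vocabulary (not taken from any real model). A single spelling mistake breaks the expected subword match and fragments the word into many short pieces, which is the kind of brittleness pixel-based models aim to avoid.

```python
# Toy greedy longest-match subword tokenizer.
# VOCAB is a hypothetical vocabulary for illustration only;
# real tokenizers (e.g. BPE) learn theirs from a corpus.
VOCAB = {"language", "model", "lang", "uage", "mod", "el",
         "ing"} | set("abcdefghijklmnopqrstuvwxyz")

def tokenize(word: str) -> list[str]:
    """Segment `word` by repeatedly taking the longest vocabulary match."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest span first
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

print(tokenize("language"))  # ['language'] - one clean token
print(tokenize("langauge"))  # ['lang', 'a', 'u', 'g', 'e'] - typo fragments it
```

The same input with one transposed character goes from a single token to five, so downstream models see a very different representation for what a human reads as the same word.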
Resources
- https://speechbot.github.io/
- Rust, Phillip, et al. "Language Modelling with Pixels." arXiv preprint arXiv:2207.06991 (2022).
- Tschannen, Michael, Basil Mustafa, and Neil Houlsby. "Image-and-Language Understanding from Pixels Only." arXiv preprint arXiv:2212.08045 (2022).
Background
AI, Deep Learning, Natural Language Processing, Computer Vision
Url
Difficulty Level
High
Ethical Approval
None
Number Of Students
2
Supervisor
Alessandro Suglia
Keywords
deep learning, neural networks, language models, computer vision
Degrees
Master of Science in Artificial Intelligence
Master of Science in Artificial Intelligence with SMI
Master of Science in Data Science