View Proposal


Proposer
Neamat El Gayar
Title
Multimodal Transformers for vision applications
Goal
Explore recent models that integrate different modalities and apply them to an AI application
Description
In natural language processing, transformer models such as BERT and T5 have produced strong results. These models build on self-supervised learning: they are first pretrained on a large amount of unlabelled data and then fine-tuned with supervised learning on a small amount of labelled data. Self-supervised learning methods have solved many of the problems of working with unlabelled data, and their use in fields such as computer vision and natural language processing has shown strong results. The recent success of Transformers in the language domain has motivated adapting them to multimodal settings (images, audio, video).

Surveys:
- Multimodal transformer survey: https://arxiv.org/pdf/2206.06488.pdf
- Vision transformer survey: https://arxiv.org/pdf/2012.12556.pdf

Possible applications:
- Collaboration with a research institute in Abu Dhabi (satellite images from drones, using colour maps and thermal maps)
- Emotion prediction (text, audio, video) in an educational setting or for monitoring medical patients

Other applications are also possible. For applications combining language with vision, see this article: https://theaisummer.com/vision-language-models/
Resources
Background
Url
Difficulty Level
Easy
Ethical Approval
None
Number Of Students
0
Supervisor
Neamat El Gayar
Keywords
Degrees
Bachelor of Science in Computer Science
Master of Science in Artificial Intelligence
Master of Science in Data Science