View Proposal
-
Proposer
-
Neamat El Gayar
-
Title
-
Multimodal Transformers for Vision Applications
-
Goal
-
Explore recent transformer models that integrate multiple modalities (text, images, audio, video) and apply them to an AI application.
-
Description
- In natural language processing, transformer models such as BERT and T5 have produced many fruitful results. These models are built on the idea of self-supervised learning: they are first pretrained on large amounts of unlabelled data and then fine-tuned in a supervised fashion with only a small number of labelled examples.
Self-supervised learning has thus addressed many of the problems posed by unlabelled data, and its use in fields such as computer vision and natural language processing has produced strong results.
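As a concrete illustration of this pretrain-then-fine-tune workflow, the sketch below fine-tunes a pretrained BERT checkpoint on a tiny labelled text set. It assumes the Hugging Face transformers library and PyTorch are available; the example texts, labels, and hyperparameters are placeholders, not part of the proposal.

```python
# Minimal sketch of the pretrain-then-fine-tune idea described above,
# assuming the Hugging Face `transformers` library and PyTorch.
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Start from a self-supervised pretrained checkpoint (BERT) and add a small
# classification head that is fine-tuned with only a few labelled examples.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

texts = ["the lecture was engaging", "the patient seems distressed"]  # placeholder data
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few passes over the small labelled set
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```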
The recent success of Transformers in the language domain has motivated adapting them to multimodal settings (images, audio, video); see the surveys below.
- Multimodal transformer survey: https://arxiv.org/pdf/2206.06488.pdf
- Vision transformer survey: https://arxiv.org/pdf/2012.12556.pdf
Possible applications:
- Collaboration with a research institute in Abu Dhabi (satellite images from drones using colour maps and thermal maps)
- Emotion prediction (text, audio, video) in an educational setting or for monitoring medical patients (see the sketch after this list)
Other applications are also possible.
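As one hedged illustration of the emotion-prediction application above, the sketch below fuses pre-extracted text, audio, and video features with a small transformer encoder in plain PyTorch. The feature dimensions, fusion design, and six emotion classes are illustrative assumptions, not a prescribed architecture for the project.

```python
# A sketch of late fusion of modalities for emotion prediction in PyTorch.
# Dimensions and the number of emotion classes are illustrative assumptions.
import torch
import torch.nn as nn

class LateFusionEmotionClassifier(nn.Module):
    def __init__(self, text_dim=768, audio_dim=128, video_dim=512,
                 hidden_dim=256, num_emotions=6):
        super().__init__()
        # Project each modality's pre-extracted features into a shared space.
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        # A small transformer encoder attends across the three modality tokens.
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(hidden_dim, num_emotions)

    def forward(self, text_feat, audio_feat, video_feat):
        tokens = torch.stack([self.text_proj(text_feat),
                              self.audio_proj(audio_feat),
                              self.video_proj(video_feat)], dim=1)  # (B, 3, H)
        fused = self.fusion(tokens).mean(dim=1)  # pool over modality tokens
        return self.head(fused)                  # emotion logits

# Random placeholder features stand in for real text/audio/video embeddings.
model = LateFusionEmotionClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 128), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 6])
```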
For applications combining language with vision, see this article:
https://theaisummer.com/vision-language-models/
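To show the flavour of such vision-language models, the sketch below scores an image against candidate text prompts with a pretrained CLIP model. It assumes the Hugging Face transformers and Pillow packages; the model name, prompts, and the dummy grey image (standing in for a real input) are placeholders.

```python
# Sketch of relating images and text with a pretrained vision-language model
# (CLIP), assuming the Hugging Face `transformers` and `Pillow` packages.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# A dummy grey image stands in for a real photo (e.g. a drone image).
image = Image.new("RGB", (224, 224), color="gray")
prompts = ["an aerial photo of farmland", "a thermal map of a building"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
# logits_per_image holds image-text similarity scores; softmax turns them into
# probabilities over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```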
- Resources
-
-
Background
-
-
Url
-
-
Difficulty Level
-
Easy
-
Ethical Approval
-
None
-
Number Of Students
-
0
-
Supervisor
-
Neamat El Gayar
-
Keywords
-
-
Degrees
-
Bachelor of Science in Computer Science
Master of Science in Artificial Intelligence
Master of Science in Data Science