Proposer
Phil Bartie

Title
Visually grounded language for real-world scenes

Goal
To allow a user to select an object in an image, and for the application to build descriptions that guide the user to that object

Description
Speech and dialogue interfaces require the application to 'understand' the user's intent by linking the language model to real-world objects. This project focuses on a few key areas of work: using an LLM (e.g. Llama 3, GPT-3.5) with a text (or speech) interface to allow a human user to correctly select objects and issue instructions relating to a real-world environment.

The work involves:
- collecting a set of photos (ranging from indoor scenes to street views)
- using computer vision to segment the images and collect attributes (e.g. object classes, colours, spatial positions); a sketch follows this description
- building a web UI that allows the user to give an instruction (e.g. "what is the number plate of the red car"); see the grounding sketch below
- constructing referring expressions, in which the system builds a useful description of where something is located (e.g. "the bike is next to the house"); see the sketch below

This work relates to building more natural user interfaces that connect the user and the application, and could be used in robotics, multi-user systems, mobile applications, autonomous vehicles, etc.
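For the segmentation and attribute-collection step, a minimal sketch is shown below, assuming torchvision's pretrained Faster R-CNN as the detector (the proposal leaves the exact vision stack open); the mean-RGB colour attribute is a deliberately crude stand-in for proper colour naming.

```python
# Sketch only: assumes torchvision's pretrained Faster R-CNN; the proposal
# does not fix a particular detector or segmentation model.
import torch
from PIL import Image
from torchvision.models.detection import (
    FasterRCNN_ResNet50_FPN_Weights,
    fasterrcnn_resnet50_fpn,
)
from torchvision.transforms.functional import to_tensor

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights).eval()
class_names = weights.meta["categories"]

def detect_objects(path, min_score=0.7):
    """Return one record per detected object: class, bounding box, mean colour."""
    image = Image.open(path).convert("RGB")
    with torch.no_grad():
        pred = model([to_tensor(image)])[0]
    records = []
    for box, label, score in zip(pred["boxes"], pred["labels"], pred["scores"]):
        if score < min_score:
            continue
        x0, y0, x1, y1 = (int(v) for v in box)
        # Crude colour attribute: mean RGB over the box; an instance
        # segmentation mask would give a cleaner estimate.
        crop = to_tensor(image.crop((x0, y0, x1, y1)))
        records.append({
            "class": class_names[label],
            "box": (x0, y0, x1, y1),
            "mean_rgb": tuple(round(float(c), 2) for c in crop.mean(dim=(1, 2))),
        })
    return records
```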
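The referring-expression step could then work over those records; the sketch below uses a nearest-object anchor and four coarse spatial relations, both illustrative heuristics rather than anything the proposal prescribes.

```python
# Sketch only: nearest-anchor heuristic with coarse relations; a real system
# would check that the generated expression identifies the target uniquely.
def centre(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def spatial_relation(target, anchor):
    """Pick a coarse relation between two objects from their box centres."""
    (tx, ty), (ax, ay) = centre(target["box"]), centre(anchor["box"])
    if abs(tx - ax) > abs(ty - ay):
        return "left of" if tx < ax else "right of"
    return "above" if ty < ay else "below"

def refer(target, scene):
    """Build an expression such as 'the bicycle left of the house'."""
    anchors = [o for o in scene if o is not target]
    if not anchors:
        return f"the {target['class']}"
    def dist(o):
        (tx, ty), (ox, oy) = centre(target["box"]), centre(o["box"])
        return (tx - ox) ** 2 + (ty - oy) ** 2
    anchor = min(anchors, key=dist)  # anchor on the nearest other object
    return (f"the {target['class']} "
            f"{spatial_relation(target, anchor)} the {anchor['class']}")
```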
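For grounding a typed instruction against the scene, one possible shape is to serialise the detected objects and let the LLM pick the referent. The model name, prompt format, and `ground_instruction` helper below are assumptions for illustration; the proposal only names Llama 3 and GPT-3.5 as candidate models.

```python
# Sketch only: grounds a user request via the OpenAI chat API; assumes
# OPENAI_API_KEY is set and that the model replies with a bare id.
import json
from openai import OpenAI

client = OpenAI()

def ground_instruction(instruction, scene):
    """Ask the LLM which detected object the instruction refers to."""
    objects = json.dumps([
        {"id": i, "class": o["class"], "box": o["box"], "mean_rgb": o["mean_rgb"]}
        for i, o in enumerate(scene)
    ])
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "You are given detected objects as JSON. Reply with "
                        "only the id of the object the user refers to."},
            {"role": "user",
             "content": f"Objects: {objects}\nRequest: {instruction}"},
        ],
    )
    # Fragile by design for a sketch: assumes the reply is a bare integer.
    return int(resp.choices[0].message.content.strip())

# Example: scene = detect_objects("street.jpg")
#          idx = ground_instruction("the number plate of the red car", scene)
```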
Resources
Web dev + LLM

Background

Url

Difficulty Level
High

Ethical Approval
InterfaceOnly

Number Of Students
0

Supervisor
Phil Bartie

Keywords

Degrees
Bachelor of Science in Computer Science
Bachelor of Science in Computer Systems
Master of Engineering in Software Engineering
Master of Science in Data Science
Master of Science in Robotics
Master of Science in Software Engineering