Proposer
Phil Bartie
Title
Visually grounded language for real world scenes
Goal
To allow a user to select an object in an image, and for the application to build descriptions that guide the user to that object
Description
Speech and dialogue interfaces require the application to 'understand' the user's intent by linking the language model to real-world objects. This project will focus on a few key areas of work: using an LLM (e.g. Llama 3, GPT-3.5) with a text (or speech) interface to allow the human user to correctly select objects and issue instructions relating to a real-world environment. The work involves:
- collecting a set of photos (ranging from indoor scenes to street views)
- using computer vision to segment the images and collect attributes (e.g. object classes, colours, spatial positions); a minimal detection sketch follows this description
- building a web UI that allows the user to give an instruction (e.g. "what is the number plate of the red car?"); a grounding sketch below shows one way to map such an instruction to a detected object
- constructing referring expressions, where the system builds a useful description of where something is located (e.g. "the bike is next to the house"); a rule-based sketch follows below
This work relates to building more natural user interfaces that connect the user and the application, and could be applied in robotics, multi-user systems, mobile applications, autonomous vehicles, etc.
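A minimal sketch of the segmentation and attribute-collection step, assuming the Ultralytics YOLOv8 detector, a mean-RGB colour estimate, and a normalised-centre position; all three are illustrative choices rather than fixed requirements of the project:

```python
"""Detect objects in a photo and record class, colour, and position.

A sketch assuming Ultralytics YOLOv8; the mean-RGB colour and the
normalised centre position are simple placeholder attributes.
"""
import numpy as np
from PIL import Image
from ultralytics import YOLO

def describe_scene(image_path: str) -> list[dict]:
    model = YOLO("yolov8n.pt")        # small pretrained detector
    result = model(image_path)[0]     # one image -> one Results object
    img = np.asarray(Image.open(image_path).convert("RGB"))
    h, w = img.shape[:2]
    objects = []
    for box in result.boxes:
        x1, y1, x2, y2 = (int(v) for v in box.xyxy[0].tolist())
        crop = img[y1:y2, x1:x2]
        mean_rgb = crop.reshape(-1, 3).mean(axis=0)  # crude colour attribute
        objects.append({
            "id": len(objects),
            "class": result.names[int(box.cls[0])],
            "colour_rgb": [round(c) for c in mean_rgb.tolist()],
            "centre": [(x1 + x2) / (2 * w), (y1 + y2) / (2 * h)],  # 0..1
            "confidence": float(box.conf[0]),
        })
    return objects
```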
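To ground an instruction such as "what is the number plate of the red car" against the detected objects, one option is to hand the attribute table to the LLM and ask for the matching object id. A sketch assuming the OpenAI chat completions API and the `objects` list produced above; the prompt wording is a hypothetical design, not a fixed part of the project:

```python
"""Ask an LLM which detected object a user instruction refers to.

A sketch assuming the OpenAI chat completions API; `objects` is the
attribute list from the detection step, and the prompt is illustrative.
"""
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ground_instruction(objects: list[dict], instruction: str) -> int:
    prompt = (
        "Scene objects as JSON:\n"
        + json.dumps(objects)
        + "\n\nWhich object does this instruction refer to? "
        "Answer with the numeric id only.\n"
        f"Instruction: {instruction}"
    )
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return int(reply.choices[0].message.content.strip())
```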
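For the referring-expression step, a rule-based baseline can pick the nearest other object as a landmark. A sketch over the same hypothetical `objects` list; the spatial wording is deliberately simple:

```python
"""Build a simple referring expression for a target object.

A rule-based sketch over the `objects` list from the detection step:
pick the nearest other object as a landmark; richer spatial relations
("behind", "in front of") would need 3D or depth cues.
"""
import math

def refer(objects: list[dict], target_id: int) -> str:
    target = objects[target_id]
    others = [o for o in objects if o["id"] != target_id]
    if not others:
        return f"the {target['class']}"
    # Landmark = closest other object by normalised image distance.
    landmark = min(others, key=lambda o: math.dist(o["centre"], target["centre"]))
    side = "left" if target["centre"][0] < landmark["centre"][0] else "right"
    return f"the {target['class']} is to the {side} of the {landmark['class']}"
```

In practice the choice of landmark and relation would be evaluated by whether users can actually locate the target from the generated description.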
Resources
Web dev + LLM
Background
Url
Difficulty Level
High
Ethical Approval
InterfaceOnly
Number Of Students
0
Supervisor
Phil Bartie
Keywords
Degrees
Bachelor of Science in Computer Science
Bachelor of Science in Computer Systems
Master of Engineering in Software Engineering
Master of Science in Data Science
Master of Science in Robotics
Master of Science in Software Engineering