Proposer
Matthew Aylett
Title
Towards a Conversational Agent to Assist in Meetings
Goal
To use microphone array and vision data to model multiparty conversation
Description
The majority of conversational systems act in a one-to-one setting (Aylett & Romeo). This allows the system to impose its own turn-taking strategy on the conversation (typically a speak-wait strategy). However, in a multi-party dialogue a system needs to adopt a more fluid, human-like turn-taking approach. It also faces significant challenges such as real-time diarization (who said what, when), speaker overlap, and complex human turn-taking, where taking the floor requires the system to predict a point in the conversation where it could do so and to signal to the other dialogue partners that it wishes to take a turn (Gillet et al.).

In this project we will focus on setting up a multi-party meeting recording system, based on work by Honda Research Institute Europe, using a Kinect depth camera and a ReSpeaker USB mic array (Wang et al.). You will record 4-5 meetings with four or five participants using one of the following role-plays:

Role-play 1: Your company wants to organise a Work–Life Balance day. The aim of the event is to get employees to see colleagues as people with real lives outside the workplace, and therefore to be more supportive, understanding and friendly towards each other. There is a very limited budget, and the event will take place on a normal working day, without dramatically reducing employees’ productivity during that day. You and some other junior members of staff have been asked to plan the events for the day. Hold a brainstorming meeting to plan the event.

Role-play 2: Your company wants to hold a Staff Integration event, to enable employees from different teams and work locations to get to know each other and build relationships. You and other senior managers meet to plan a budget for this event (in terms of cost per employee) and to brainstorm ideas for the event.

Using the output from the microphone array and Kinect, you will analyse the recordings as if in real time, using directional information to diarize them, and use automatic speech recognition from Azure to transcribe the data. You will compare the result with IBM diarization to evaluate the process. Finally, you will generate a prompt for an LLM to summarise the meeting, both with and without diarization information. The recordings, transcriptions, and logged positional information from the microphone array and Kinect will be released as an open data resource for the community.
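As an illustration of the directional diarization step, the Python sketch below assumes the microphone array's logged direction-of-arrival (DOA) readings are available as (timestamp, angle) pairs and that each participant keeps a roughly fixed seat; the seat angles, input format, and minimum-turn threshold are placeholder assumptions, not part of the recording setup itself.

    # Minimal sketch of direction-of-arrival (DOA) based diarization.
    # Assumes the array log is a list of (timestamp_seconds, doa_degrees) pairs
    # and that each participant stays near a fixed seat angle; both the input
    # format and the seat angles are illustrative assumptions.

    from typing import List, Tuple

    SEAT_ANGLES = {"speaker_1": 45.0, "speaker_2": 135.0,
                   "speaker_3": 225.0, "speaker_4": 315.0}

    def angular_distance(a: float, b: float) -> float:
        """Smallest absolute difference between two angles in degrees."""
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)

    def assign_speaker(doa: float) -> str:
        """Label a DOA reading with the nearest seated participant."""
        return min(SEAT_ANGLES, key=lambda s: angular_distance(doa, SEAT_ANGLES[s]))

    def diarize(doa_log: List[Tuple[float, float]],
                min_turn: float = 0.5) -> List[Tuple[float, float, str]]:
        """Collapse consecutive readings from the same seat into
        (start, end, speaker) segments, dropping turns shorter than min_turn s."""
        segments: List[Tuple[float, float, str]] = []
        for t, doa in doa_log:
            spk = assign_speaker(doa)
            if segments and segments[-1][2] == spk:
                segments[-1] = (segments[-1][0], t, spk)   # extend current turn
            else:
                segments.append((t, t, spk))               # start a new turn
        return [s for s in segments if s[1] - s[0] >= min_turn]

    if __name__ == "__main__":
        example_log = [(0.0, 44.0), (0.5, 46.0), (1.0, 47.0), (1.5, 133.0), (2.5, 137.0)]
        print(diarize(example_log))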
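For the Azure transcription step, a minimal sketch using the Azure Speech SDK might look like the following; the subscription key, region, language, and file name are placeholders, and a full-length meeting would normally use continuous recognition rather than the single-utterance call shown here.

    # Minimal sketch of transcribing one recorded channel with Azure speech-to-text.
    # Requires the azure-cognitiveservices-speech package; key, region, and file
    # name below are placeholders.

    import azure.cognitiveservices.speech as speechsdk

    def transcribe(wav_path: str, key: str, region: str) -> str:
        speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
        speech_config.speech_recognition_language = "en-GB"
        audio_config = speechsdk.audio.AudioConfig(filename=wav_path)
        recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                                audio_config=audio_config)
        result = recognizer.recognize_once()   # recognises a single utterance
        if result.reason == speechsdk.ResultReason.RecognizedSpeech:
            return result.text
        return ""

    if __name__ == "__main__":
        print(transcribe("meeting_channel_0.wav", "YOUR_AZURE_KEY", "westeurope"))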
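For the final summarisation step, the sketch below shows one way to build the two LLM prompts (with and without diarization labels) from diarized transcript segments; the prompt wording and segment format are illustrative only, and the resulting string could be sent to any LLM API.

    # Minimal sketch of building the two summarisation prompts: one with speaker
    # labels from diarization, one with a flat transcript.

    from typing import List, Tuple

    def build_prompt(segments: List[Tuple[float, float, str, str]],
                     use_diarization: bool) -> str:
        """segments: (start_s, end_s, speaker_label, text) tuples."""
        if use_diarization:
            transcript = "\n".join(f"[{s:.1f}-{e:.1f}s] {spk}: {text}"
                                   for s, e, spk, text in segments)
        else:
            transcript = " ".join(text for _, _, _, text in segments)
        return ("Summarise the following meeting, listing the main decisions "
                "and action items.\n\nTranscript:\n" + transcript)

    if __name__ == "__main__":
        segs = [(0.0, 4.2, "speaker_1", "Shall we start with the budget?"),
                (4.5, 9.0, "speaker_2", "Yes, I suggest fifty pounds per employee.")]
        print(build_prompt(segs, use_diarization=True))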
Resources
Aylett, Matthew Peter, and Marta Romeo. "You Don’t Need to Speak, You Need to Listen: Robot Interaction and Human-Like Turn-Taking." Proceedings of the 5th International Conference on Conversational User Interfaces. 2023.

Gillet, S., Vázquez, M., Peters, C., Yang, F., & Leite, I. (2022). Multiparty interaction between humans and socially interactive agents. In The Handbook on Socially Interactive Agents: 20 Years of Research on Embodied Conversational Agents, Intelligent Virtual Agents, and Social Robotics, Volume 2: Interactivity, Platforms, Application (pp. 113-154).

Wang, C., Hasler, S., Tanneberg, D., Ocker, F., Joublin, F., Ceravola, A., ... & Gienger, M. (2024). Large language models for multi-modal human-robot interaction. arXiv preprint arXiv:2401.15174.
Background
Url
Difficulty Level
Challenging
Ethical Approval
Full
Number Of Students
2
Supervisor
Matthew Aylett
Keywords
conversational interaction, speech technology
Degrees
Bachelor of Science in Computer Science
Bachelor of Science in Computer Systems
Master of Science in Artificial Intelligence
Master of Science in Computing (2 Years)
Master of Science in Data Science
Master of Science in Human Robot Interaction
Master of Science in Robotics
Bachelor of Science in Computing Science