Proposer
-
Matthew Aylett
-
Title
-
Towards a Conversational Agent to Assist in Meetings
-
Goal
-
To use microphone-array and vision data to model multiparty conversation
-
Description
-
The majority of conversational systems act in a one-to-one setting (Aylett & Romeo, 2023). This allows the system to impose its own turn-taking strategy on the conversation (typically a speak-wait strategy). In a multi-party dialogue, however, a system needs to adopt a more fluid, human-like approach to turn-taking. It also faces significant challenges such as real-time diarization (who said what, when), speaker overlap, and the complexity of human floor management: to take the floor, the system must predict a point in the conversation where a turn change could occur and signal to the other dialogue partners that it wishes to take it (Gillet et al., 2022).
In this project we will focus on setting up a multi-party meeting recording system, based on work by the Honda Research Institute Europe, using a Kinect depth camera and a ReSpeaker USB microphone array (Wang et al., 2024). You will record 4-5 meetings, each with four or five participants, using one of the following role-plays (a minimal recording sketch in Python follows the role-plays):
Role-play 1: Your company wants to organise a Work–Life Balance day. The aim of the event is to get employees to see colleagues as people with real lives outside the workplace, and therefore to be more supportive, understanding and friendly towards each other. There is a very limited budget, and the event will take place on a normal working day, without dramatically reducing employees’ productivity during that day. You and some other junior members of staff have been asked to plan the events for the day. Hold a brainstorming meeting to plan the event.
Role-play 2: Your company wants to hold a Staff Integration event, to enable employees from different teams and work locations to get to know each other and build relationships. You and other senior managers meet to plan a budget for this event (in terms of cost per employee) and to brainstorm ideas for the event.
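As a starting point for the recording setup, a minimal capture sketch in Python is given below. It assumes the ReSpeaker enumerates as a standard multichannel USB audio device (the 4-mic array exposes six channels: four raw microphones plus processed and playback channels) and uses the sounddevice and soundfile packages; the device lookup, channel count and file name are illustrative rather than part of the proposal.

import sounddevice as sd
import soundfile as sf

SAMPLE_RATE = 16000    # the array's native rate
CHANNELS = 6           # assumption: 4 raw mics + 2 processed channels
DURATION_S = 30 * 60   # one 30-minute meeting

# Find the ReSpeaker among the enumerated audio devices.
device = next(i for i, d in enumerate(sd.query_devices())
              if "ReSpeaker" in d["name"])

# Record all channels into a single multichannel WAV file.
audio = sd.rec(int(DURATION_S * SAMPLE_RATE), samplerate=SAMPLE_RATE,
               channels=CHANNELS, device=device, dtype="int16")
sd.wait()  # block until the recording finishes
sf.write("meeting01.wav", audio, SAMPLE_RATE)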
Using the output from the microphone array and the Kinect, you will analyse the recordings as if in real time, using directional information to diarize them, and transcribe the data with automatic speech recognition from Azure.
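To make the analysis step concrete, here is a sketch of angle-based diarization followed by Azure transcription. It assumes the array's logged direction-of-arrival (DOA) estimates are available as (timestamp, angle) pairs and that participants keep roughly fixed seats around the array; the seat bearings, log format and key/region values are illustrative. recognize_once handles only short clips; continuous recognition would be used for full meetings.

# Angle-based diarization: map each DOA estimate to the nearest seat.
doa_log = [(0.0, 44.0), (2.1, 132.0), (4.8, 47.5), (7.2, 310.0)]  # (time_s, angle_deg)
SEATS = {"P1": 45, "P2": 135, "P3": 225, "P4": 315}  # assumed fixed bearings

def nearest_seat(angle):
    """Label a DOA estimate with the closest seat, wrapping at 360 degrees."""
    circ = lambda a, b: min(abs(a - b) % 360, 360 - abs(a - b) % 360)
    return min(SEATS, key=lambda s: circ(SEATS[s], angle))

who_spoke_when = [(t, nearest_seat(a)) for t, a in doa_log]

# Transcription with the Azure Speech SDK.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
audio_config = speechsdk.audio.AudioConfig(filename="meeting01.wav")
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                        audio_config=audio_config)
print(recognizer.recognize_once().text)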
You will compare the results with IBM's diarization to evaluate the process. Finally, you will generate prompts for an LLM to summarise the meeting, both with and without the diarization information.
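One way to make the comparison with IBM's diarization measurable (our suggestion; the proposal does not fix a metric) is to score both systems against a common reference with the diarization error rate (DER), for example via the pyannote.metrics package. The segment values below are placeholders.

from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

def to_annotation(segments):
    """Turn (start_s, end_s, speaker) triples into a pyannote Annotation."""
    ann = Annotation()
    for start, end, speaker in segments:
        ann[Segment(start, end)] = speaker
    return ann

reference = to_annotation([(0.0, 4.2, "P1"), (4.2, 9.0, "P2")])   # hand-corrected labels
hypothesis = to_annotation([(0.0, 4.0, "P1"), (4.0, 9.0, "P2")])  # angle-based output

print(f"DER = {DiarizationErrorRate()(reference, hypothesis):.3f}")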
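For the summarisation step, the prompt can simply embed the transcript with or without speaker labels, which makes the two conditions easy to compare. A sketch, with illustrative wording and segment data:

segments = [("P1", "Shall we start with the budget?"),
            ("P2", "I think fifty euros per head is realistic.")]

def build_prompt(segments, with_speakers=True):
    """Render a summarisation prompt, optionally keeping speaker labels."""
    lines = (f"{spk}: {txt}" if with_speakers else txt for spk, txt in segments)
    return ("Summarise the following meeting transcript, listing the main "
            "decisions and action items:\n\n" + "\n".join(lines))

print(build_prompt(segments, with_speakers=True))   # diarized condition
print(build_prompt(segments, with_speakers=False))  # plain-transcript condition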
The recordings, transcriptions, and logged positional information from the microphone array and Kinect will be released as an open data resource for the community.
-
Resources
-
Aylett, M. P., & Romeo, M. (2023). You don’t need to speak, you need to listen: Robot interaction and human-like turn-taking. In Proceedings of the 5th International Conference on Conversational User Interfaces.
Gillet, S., Vázquez, M., Peters, C., Yang, F., & Leite, I. (2022). Multiparty interaction between humans and socially interactive agents. In The Handbook on Socially Interactive Agents: 20 Years of Research on Embodied Conversational Agents, Intelligent Virtual Agents, and Social Robotics, Volume 2: Interactivity, Platforms, Application (pp. 113-154).
Wang, C., Hasler, S., Tanneberg, D., Ocker, F., Joublin, F., Ceravola, A., ... & Gienger, M. (2024). Large language models for multi-modal human-robot interaction. arXiv preprint arXiv:2401.15174.
-
Background
-
-
Url
-
-
Difficulty Level
-
Challenging
-
Ethical Approval
-
Full
-
Number Of Students
-
2
-
Supervisor
-
Matthew Aylett
-
Keywords
-
conversational interaction, speech technology
-
Degrees
-
Bachelor of Science in Computer Science
Bachelor of Science in Computer Systems
Master of Science in Artificial Intelligence
Master of Science in Computing (2 Years)
Master of Science in Data Science
Master of Science in Human Robot Interaction
Master of Science in Robotics
Bachelor of Science in Computing Science