Proposer
Yingfang Yuan
Title
A Study of the Interplay Between Vision and Text Modalities in Agentic AI Memory
Goal
To investigate whether semantic and factual recall can be achieved solely through visual memory, or whether an explicit interleaving structure between visual and textual modalities must be established within the memory system to enable effective cross-modal recall.
Description
Recent findings reveal a critical gap in current memory systems: visual representations often fail to activate symbolic or linguistic knowledge. This limitation stems from how existing multimodal memory architectures are typically designed: they are static, decoupled, and retrieval-based. In most implementations, images and texts are encoded and stored independently, and only paired at inference time through shallow similarity measures such as cosine similarity. Consequently, the model has little opportunity during the recall phase to perform cross-modal activation or associative reasoning along vision-to-language or language-to-vision pathways.

This mechanism differs fundamentally from human memory, in which visual and linguistic cues mutually reinforce and trigger each other through associative recall. Humans can recall words when seeing an image, or visualize scenes when hearing descriptive text; this dynamic interplay is largely missing in current artificial systems.

Furthermore, existing benchmarks often rely on aligned datasets (e.g., image–caption pairs), implicitly assuming that the visual and textual modalities share the same context. Such alignment masks the genuine difficulty of semantic activation triggered purely by visual cues. Even with retrieval-augmented generation (RAG), current systems mostly perform one-time matching rather than maintaining long-term, cross-modal associations that can be reactivated by visual stimuli.

Addressing this gap requires a memory mechanism capable of sustained, bidirectional integration between vision and language, one that supports associative recall akin to human cognition.
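To make the critiqued design concrete, the minimal Python sketch below mimics a decoupled, retrieval-based multimodal memory of the kind described above: each modality is embedded and stored independently, and cross-modal recall reduces to a one-shot cosine-similarity lookup. The encoders, memory layout, and example data are illustrative assumptions (random unit vectors standing in for pretrained vision and text towers such as CLIP), not any specific system's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16

# Hypothetical encoders: in a real system these would be pretrained
# vision and text models; random unit vectors stand in here purely to
# expose the memory structure, not to model semantics.
def embed_image(image_id: str) -> np.ndarray:
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

def embed_text(passage: str) -> np.ndarray:
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

# Decoupled stores: images and texts are written independently, with
# no persistent link recording which text describes which image.
image_memory: list[tuple[str, np.ndarray]] = []
text_memory: list[tuple[str, np.ndarray]] = []

def remember_image(image_id: str) -> None:
    image_memory.append((image_id, embed_image(image_id)))

def remember_text(passage: str) -> None:
    text_memory.append((passage, embed_text(passage)))

def recall_text_from_image(image_id: str) -> str:
    """One-shot cross-modal matching: rank stored texts by cosine
    similarity to the query image embedding. Nothing here supports
    associative chaining (image -> linked fact -> further facts);
    recall succeeds only if the shared embedding space happens to
    place the correct pair together."""
    query = embed_image(image_id)
    scores = [(float(query @ emb), passage) for passage, emb in text_memory]
    return max(scores)[1]

remember_image("eiffel_tower.jpg")
remember_text("The Eiffel Tower was completed in 1889.")
remember_text("Mount Fuji is the highest peak in Japan.")
print(recall_text_from_image("eiffel_tower.jpg"))
```

An interleaved alternative, of the kind this project would investigate, would instead record explicit bidirectional links between an image entry and its associated textual facts at write time, so that a visual cue can traverse into symbolic knowledge through the stored associations rather than relying on a single similarity match.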
Resources
Li, M., Chao, Q. and Li, B., 2025. Two Causally Related Needles in a Video Haystack. arXiv preprint arXiv:2505.19853.
Ashok, D., Chaubey, A., Arai, H.J., May, J. and Thomason, J., 2025. Can VLMs Recall Factual Associations From Visual References? arXiv preprint arXiv:2508.18297.
Difficulty Level
High
Ethical Approval
None
Number Of Students
1
Supervisor
Yingfang Yuan
Keywords
agentic AI, memory, LLM, MLLM
Degrees
Bachelor of Science in Computer Science
Bachelor of Science in Computer Systems
Bachelor of Science in Information Systems
Bachelor of Science in Software Development for Business (GA)
Master of Engineering in Software Engineering
Master of Design in Games Design and Development
Master of Science in Artificial Intelligence
Master of Science in Artificial Intelligence with SMI
Master of Science in Business Information Management
Master of Science in Computer Science for Cyber Security
Master of Science in Computer Systems Management
Master of Science in Computing (2 Years)
Master of Science in Data Science
Master of Science in Human Robot Interaction
Master of Science in Information Technology (Business)
Master of Science in Information Technology (Software Systems)
Master of Science in Network Security
Master of Science in Robotics
Master of Science in Software Engineering
Bachelor of Science in Computing Science
Bachelor of Engineering in Robotics
Bachelor of Science in Computer Science (Cyber Security)
Master of Science in Robotics with Industrial Application
Postgraduate Diploma in Artificial Intelligence
Bachelor of Science in Statistical Data Science
BSc Data Sciences
MSc Applied Cyber Security