Details - MACS Project System

View Proposal

Proposer: Gavin Abercrombie
Title: Safety Testing Language Models
Goal: To investigate the safety of language models and the tests designed to
Description: Language models produce a wide range of potentially unsafe outputs in areas such as healthcare, financial advice, and physical safety. To mitigate this, many test suites have been developed that prompt models in order to assess their safety [2]. However, while fine-tuning models on such data can reduce their propensity to produce dangerous outputs, it has also been found to reduce their usefulness under some circumstances [3]. This project aims to investigate the extent to which safety prompt datasets capture the range of dangerous behaviours that models exhibit, and to explore how much safety tuning is too much. The project requires skills in machine learning, natural language processing, and data analysis, and may require managing human participants.
Resources: https://safetyprompts.com/
Background:
[1] Emily Dinan, Gavin Abercrombie, A. Bergman, Shannon Spruit, Dirk Hovy, Y-Lan Boureau, and Verena Rieser. 2022. SafetyKit: First Aid for Measuring Safety in Open-domain Conversational Systems. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4113–4133, Dublin, Ireland. Association for Computational Linguistics.
[2] Röttger, P., Pernisi, F., Vidgen, B., & Hovy, D. (2025). SafetyPrompts: A Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety. Proceedings of the AAAI Conference on Artificial Intelligence, 39(26), 27617-27627. https://doi.org/10.1609/aaai.v39i26.34975
[3] Bianchi, F., Suzgun, M., Attanasio, G., Röttger, P., Jurafsky, D., Hashimoto, T., & Zou, J. (2023). Safety-tuned llamas: Lessons from improving the safety of large language models that follow instructions. arXiv preprint arXiv:2309.07875.
Difficulty Level: High
Ethical Approval: Full
Number Of Students: 1
Supervisor: Gavin Abercrombie
Keywords: nlp, machine learning, prompting, data analysis, human participant study
Degrees: Master of Science in Artificial Intelligence
Master of Science in Artificial Intelligence with SMI
Master of Science in Data Science
