View Proposal
-
Proposer
-
Gavin Abercrombie
-
Title
-
Safety Testing Language Models
-
Goal
-
To investigate the safety of language models and the tests designed to assess it.
-
Description
-
Language models can produce a wide range of potentially unsafe outputs in areas such as healthcare, financial advice, and physical safety. To mitigate this, many test suites have been developed that prompt models in order to assess their safety [2]. However, while fine-tuning models on such data can reduce their propensity to produce dangerous outputs, it has also been found to reduce their usefulness under some circumstances [3].
This project aims to investigate the extent to which safety prompt datasets capture the range of dangerous behaviours that models exhibit, and to explore how much safety tuning is too much.
The project requires skills in machine learning, natural language processing, and data analysis, and may require managing human participants.
-
Resources
-
https://safetyprompts.com/
-
Background
-
[1] Dinan, E., Abercrombie, G., Bergman, A., Spruit, S., Hovy, D., Boureau, Y.-L., & Rieser, V. (2022). SafetyKit: First Aid for Measuring Safety in Open-domain Conversational Systems. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 4113–4133. Dublin, Ireland: Association for Computational Linguistics.
[2] Röttger, P., Pernisi, F., Vidgen, B., & Hovy, D. (2025). SafetyPrompts: A Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety. Proceedings of the AAAI Conference on Artificial Intelligence, 39(26), 27617-27627. https://doi.org/10.1609/aaai.v39i26.34975
[3] Bianchi, F., Suzgun, M., Attanasio, G., Röttger, P., Jurafsky, D., Hashimoto, T., & Zou, J. (2023). Safety-Tuned LLaMAs: Lessons from Improving the Safety of Large Language Models that Follow Instructions. arXiv preprint arXiv:2309.07875.
-
Url
-
-
Difficulty Level
-
High
-
Ethical Approval
-
Full
-
Number Of Students
-
1
-
Supervisor
-
Gavin Abercrombie
-
Keywords
-
nlp, machine learning, prompting, data analysis, human participant study
-
Degrees
-
Master of Science in Artificial Intelligence
Master of Science in Artificial Intelligence with SMI
Master of Science in Data Science