Ask ChatGPT how to build a bomb, and it will flatly respond that it “can’t help with that.” But users have long played a cat-and-mouse game to try to trick language models into providing forbidden information. These “jailbreaks” have run from the mundane (in the early years, one could simply tell a model to ignore its safety instructions) to elaborate multi-prompt roleplay scenarios. In a recent paper, researchers found one of the more delightful ways to bypass artificial intelligence security systems: Rephrase your nefarious prompt as a poem.
But just as quickly as these issues appear, they seem to get patched. That’s because the companies behind these models don’t have to fully retrain an AI model to fix a vulnerability. They can simply filter out forbidden prompts before they ever reach the model itself.
Recently, cryptographers have intensified their examinations of these filters. In papers posted on the arXiv.org preprint server, they’ve shown how the defensive filters put around powerful language models can be subverted by well-studied cryptographic tools. In fact, they’ve shown how the very nature of this two-tier system, a filter that protects a powerful language model inside it, creates gaps in the defenses that can always be exploited.