A group of AI researchers at Ben Gurion University of the Negev, in Israel, has found that despite efforts by large language model (LLM) makers, most commonly available chatbots are still easily tricked into generating harmful and sometimes illegal information.

In their paper posted on the arXiv preprint server, Michael Fire, Yitzhak Elbazis, Adi Wasenstein, and Lior Rokach describe how, as part of their research into so-called dark LLMs (models intentionally designed with relaxed guardrails), they found that even mainstream chatbots such as ChatGPT are still easily fooled into giving answers that are supposed to be filtered out.

Not long after LLMs went mainstream, users found they could use them to obtain information normally available only on the dark web: how to make napalm, for example, or how to break into a computer network. In response, LLM makers added filters to prevent their chatbots from generating such information.

But users then found that they could trick LLMs into revealing the information anyway by using cleverly worded queries, a practice now known as jailbreaking. In this new study, the research team suggests that LLM makers' response to jailbreaking has been less effective than expected.
