Picture a tentacled, many-eyed beast, with a long tongue and gnarly fangs. Atop this writhing abomination sits a single, yellow smiley face. “Trust me,” its placid mug seems to say.

That’s an image sometimes used to represent AI chatbots. The smiley is what stands between the user and the toxic content the system can create.

Chatbots like OpenAI’s ChatGPT, Google’s Bard and Meta AI have snagged headlines for their ability to answer questions with stunningly humanlike language. These chatbots are based on large language models, a type of generative artificial intelligence designed to spit out text. Large language models are typically trained on vast swaths of internet content. Much of the internet’s text is useful information — news articles, home-repair FAQs, health information from trusted authorities. But as anyone who has spent a bit of time there knows, cesspools of human behavior also lurk. Hate-filled comment sections, racist screeds, conspiracy theories, step-by-step guides on how to give yourself an eating disorder or build a dangerous weapon — you name it, it’s probably on the internet.
