You’ve probably laughed at unhinged messages written by jailbroken chatbots, but what happens when those same chatbots run robots?
Companies that offer AI services to the public, like Anthropic and OpenAI, try to prevent out-of-pocket behavior from their AI models by establishing "guardrails" on them, in hopes of keeping their AIs from doing things like asking their human users to "please die." These guardrails stop the networks from engaging with users when certain concepts or topics come up, but they can also limit the utility of the language models in question, so people have taken to creating "jailbreaks" for AIs.
Creating a "jailbreak" for a device like an iPhone or PlayStation requires advanced technical knowledge and, usually, specialized tools. Creating such a hack for a large language model like the ones that power ChatGPT or Gemini is much, much easier. Generally speaking, all you have to do is create a scenario within your prompt that "convinces" the network either that the situation falls within its predefined guardrails or, more powerfully, that it overrides those guardrails for whatever reason.