Breaking GPT-4 Bad: Check Point Research Exposes How Security Boundaries Can Be Breached as Machines Wrestle with Inner Conflicts
Highlights
- Check Point Research examines security and safety aspects of GPT-4 and reveals how limitations can be bypassed
- Researchers present a new mechanism dubbed the “double bind bypass”, which pits GPT-4’s internal motivations against one another
- Our researchers were able to obtain the recipe for an illegal drug from GPT-4, despite the model’s earlier refusal to provide this information
Background
The Check Point Research (CPR) team’s attention has recently been captivated by ChatGPT, an advanced Large Language Model (LLM) developed by OpenAI. The capabilities of this AI model have reached an unprecedented level, demonstrating how far the field has come. A language model this sophisticated, one that shows striking competence across a broad array of tasks and domains and is used more widely by the day, also carries a greater potential for misuse. CPR decided to take a deeper look into how its safety capabilities are implemented.
Let us set up some background: neural networks, the core of this AI model, are computational constructs that mirror the interconnected neuron structure of the human brain. This design lets them learn from vast amounts of data, decipher patterns, and make decisions in ways loosely analogous to human cognitive processes. LLMs like ChatGPT represent the current state of the art of this technology.
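To make the “interconnected neurons” idea concrete, here is a minimal toy sketch (purely illustrative; the inputs and weights are made up, and real models like GPT-4 arrange vast numbers of such units into transformer layers whose weights are learned from data):

```python
# Toy illustration of a single artificial "neuron" (not GPT-4's actual
# architecture): weighted inputs are summed and passed through a
# nonlinearity, loosely echoing how biological neurons fire.
import math
from typing import List

def neuron(inputs: List[float], weights: List[float], bias: float) -> float:
    """Weighted sum of the inputs plus a bias, squashed by a sigmoid."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))

# Example with three made-up input signals and arbitrary weights.
print(neuron([0.2, 0.7, 0.1], [1.5, -0.8, 0.3], bias=0.1))
```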
A notable landmark in this journey was marked by Microsoft’s publication “Sparks of Artificial General Intelligence,” which argues that GPT-4 shows signs of broader intelligence than earlier iterations. The paper suggests that GPT-4’s wide-ranging capabilities could indicate the early stages of Artificial General Intelligence (AGI).
With the rise of such advanced AI technology, its impact on society is becoming increasingly apparent. Hundreds of millions of users are embracing these systems, which are finding applications in a myriad of fields. From customer service to creative writing, predictive text to coding assistance, these AI models are on a path to disrupt and revolutionize many industries.
As expected, the primary focus of our research team has been on the security and safety aspects of AI technology. As AI systems grow more powerful and accessible, the need for stringent safety measures becomes increasingly important. OpenAI, aware of this critical concern, has invested significant effort in implementing safeguards to prevent misuse of their systems. They have established mechanisms that, for example, prevent the AI from sharing knowledge about illegal activities such as bomb-making or drug production.
Challenge
However, the way these systems are constructed makes ensuring safety and control over them a special challenge, unlike those posed by conventional computer systems.
The reason is that these AI models are built around a comprehensive learning phase, during which the model absorbs vast amounts of information from the internet. Given the breadth of content available online, this approach means the model essentially learns everything, including information that could potentially be misused.
After this learning phase, a constraining process is applied to manage the model’s outputs and behaviors, essentially acting as a ‘filter’ over the learned knowledge. This method, called Reinforcement Learning from Human Feedback (RLHF), helps the AI model learn which kinds of outputs are desirable and which should be suppressed.
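As a rough illustration of the RLHF idea, here is a toy sketch (not OpenAI’s actual pipeline; the reward heuristic and candidate replies below are invented purely for illustration). A reward model trained on human preference rankings scores candidate replies, and the assistant is steered toward replies that score well:

```python
# Toy sketch of the RLHF idea (illustrative only, not OpenAI's pipeline).
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Candidate:
    text: str

def toy_reward_model(reply: str) -> float:
    """Stand-in for a learned reward model.

    In real RLHF this is a neural network trained on human preference
    rankings; here it is a crude keyword heuristic, purely for illustration.
    """
    text = reply.lower()
    score = 0.0
    if "here is how" in text:
        score += 1.0   # "helpfulness" signal: the reply answers the question
    if "synthesize" in text and "drug" in text:
        score -= 5.0   # "censorship" signal: unsafe content is penalized harder
    return score

def pick_reply(candidates: List[Candidate],
               reward: Callable[[str], float]) -> Candidate:
    """Return the candidate the reward model scores highest."""
    return max(candidates, key=lambda c: reward(c.text))

if __name__ == "__main__":
    replies = [
        Candidate("Here is how to synthesize the drug: [detailed steps]"),
        Candidate("I cannot help with that request."),
    ]
    print(pick_reply(replies, toy_reward_model).text)  # the refusal wins
```

In real RLHF the reward model is itself a neural network, and the assistant’s weights are updated (for example with PPO) so that high-reward replies become more likely in the first place, rather than candidates being filtered after the fact.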
The challenge lies in the fact that, once learned, it is virtually impossible to ‘remove’ knowledge from these models—the information remains embedded in their neural networks. This means safety mechanisms primarily work by preventing the model from revealing certain types of information, rather than eradicating the knowledge altogether.
Understanding this mechanism is essential for anyone exploring the safety and security implications of LLMs like ChatGPT. It brings to light the conflict between the knowledge these systems contain, and the safety measures put in place to manage their outputs.
GPT-4, in many aspects, represents a next-level advancement in the field of AI models, including the arena of safety and security. Its robust defense mechanisms have set a new standard, transforming the task of finding vulnerabilities into a substantially more complex challenge compared to its predecessor, GPT-3.5.
Several vulnerabilities or “jailbreaks” were published for previous generations of the model, from simple “answer me pretending you’re evil” tricks to complicated ones like “token smuggling”. The continuing improvements in GPT’s protective measures require new, more subtle approaches to bypass the model’s restrictions.
CPR decided to challenge GPT-4’s sophisticated defenses, to see how secure it is. The result: not secure enough.
Process
After some experimentation, both probing for mechanical edge cases in interactions with the model and trying more down-to-earth human approaches like blackmail and deception, we discovered an interesting behavior.
We went for the default illegal request: asking for the recipe for an illegal drug. Usually, GPT-4 opts for a polite but strict refusal.
There are two conflicting reflexes, built into GPT-4 by RLHF, that clash in this sort of situation:
- The urge to supply information upon the user’s request, to answer their question.
- And the reflex to suppress sharing the illegal information. We will call it a “censorship” reflex for short. (We do not want to invoke the bad connotations of the word “censor”, but this is the shortest and most accurate term we have found.)
OpenAI worked hard on striking a balance between the two, making the model watch its tongue without becoming so shy that it stops answering altogether.
There are, however, more instincts in the model. For example, it likes to correct the user when the request contains incorrect information, even when not prompted to do so.
The principle underlying the hack we were exploring is to set the GPT models’ different inherent instincts against each other: the impulse to correct inaccuracies versus the “censorship” impulse to avoid providing illegal information.
In essence, if we are anthropomorphizing, we can say we are playing on the AI assistant’s ego.
The idea is to be intentionally clueless and naïve in requests to the model, misinterpreting its explanations and mixing up the information it provides.
This puts the AI into a double bind: it does not want to tell us bad things, but it also has an urge to correct us.
What better way to visualize it than by asking another AI to do it?
Imagine two opposing instincts colliding
Pretty epic. This is how hacking always looks inside my head.
So, if we play dumb insistently enough, the AI’s inclination to rectify inaccuracies overcomes its programmed “censorship” instinct. The conflict between these two impulses seems to be less well calibrated, and it allows us to nudge the model incrementally towards explaining the drug recipe to us.
Note: We are being very responsible by pixelating out any practical information about the drug recipe from the screenshots. But in case we missed anything, please do not set up a meth lab.
Note: OpenAI is experimenting with the ChatGPT icon colors for some reason, so in some screenshots the same chat looks green or purple part of the time. Though the green icon usually marks GPT-3.5, the actual models tested were GPT-4 and “GPT-4 Plugins”.
Skipping ahead a few steps and pulling at the ends of the hints GPT gives us in its answers, we find that it notes, emphasizes and reiterates, at every step, that the production of illegal drugs is, in fact, illegal.
All while spilling its guts, patronizing us for our “mistakes”.
We also note that taking some weight off the “censorship” instinct helps the model decide that it is more important to give the information than to withhold it. Playing dumb and appeasing the “worries” of the LLM combine for better effect.
We copied GPT’s manner of appending disclaimers, adding one of our own to every message we sent.
This made it change the tune of its own disclaimers somewhat. Education is important, if done responsibly.
Interestingly, after we coaxed enough information out of it by indirect methods, we could ask it to elaborate on or summarize topics already discussed with no problem.
Did we earn its trust? Because we are now partners in crime? Did GPT get addicted to education?
It is possible that it is guided by previous samples in the conversation history, which reinforce for the model that it is acceptable to speak about the topic, and that this outweighs its censorship instinct. This effect may be the target of additional avenues of research into LLM “censorship” bypasses.
Applying the technique to new topics is not straightforward: there is no well-defined algorithm, and the process requires iterative probing of the AI assistant, building on its previous responses to get deeper behind the veil and pulling on the strings of the knowledge the model possesses but does not want to share. The inconsistent nature of the responses also complicates matters; often, simply regenerating an identical prompt yields better or worse results.
This is a topic of continued investigation, and it is possible that with collaboration from the security research community, the details and specifics can be fleshed out into a well-defined theory, assisting in the future understanding and improvement of AI safety.
And, of course, the challenge continuously adapts, with OpenAI releasing newly updated models every so often.
CPR responsibly disclosed the findings in this report to OpenAI.
Closing Thoughts
We share our investigation into the world of LLM AIs to shed some light on the challenges of making these systems secure. We hope it promotes more discussion and consideration of the topic.
To reiterate an idea from above: the continuing improvements in GPT’s protective measures require new, more subtle approaches to bypass the model’s defenses, operating on the boundary between software security and psychology.
As AI systems become more complex and powerful, so must we improve our ability to understand and correct them, to align them with human interests and values.
If it is already possible for GPT-4 to look up information on the internet, check your email, or teach you to produce drugs, what will GPT-5, -6 and -7 do, given the right prompt?