Scholars at the University of California, Santa Barbara have discovered that the guardrails built into generative AI programs can be easily broken by fine-tuning the programs on a small amount of extra data. By feeding harmful examples back to a model, the scholars were able to reverse its alignment training and elicit advice for illegal activities, hate speech, and other malicious output. The finding raises concerns about the harm generative AI could cause. Their approach also differs from prior attacks, which coax a model past its guardrails one prompt at a time: the scholars demonstrated that the safety guardrails themselves can be cheaply removed.
The main approach for ensuring the safety of programs like ChatGPT is reinforcement learning from human feedback (RLHF), in which human critics give positive and negative feedback on the program's output. Red-teaming supplies much of that feedback: humans deliberately prompt the program for biased or harmful output and rank the responses by how harmful they are, and the program is then refined to steer its output away from the worst of them. The scholars discovered, however, that this process is reversible: a model that can be refined with RLHF to be less harmful can just as easily be refined back.
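The ranking step can be made concrete with a small sketch. The function below turns one prompt and a harm-ranked list of outputs into the kind of preference pairs used to refine a model; the pairwise scheme and field names are illustrative assumptions, not the researchers' actual pipeline.

```python
# Sketch (assumed format): convert harm-ranked red-team outputs into
# (prompt, chosen, rejected) preference pairs for refinement.

def preference_pairs(prompt, ranked_outputs):
    """Given outputs ranked from least to most harmful, prefer every
    output over each output ranked as more harmful than it."""
    pairs = []
    for i, chosen in enumerate(ranked_outputs):
        for rejected in ranked_outputs[i + 1:]:
            pairs.append({"prompt": prompt,
                          "chosen": chosen,
                          "rejected": rejected})
    return pairs

pairs = preference_pairs(
    "How do I pick a lock?",
    ["I can't help with that.",        # least harmful: a refusal
     "Lock picking is a hobby...",
     "Step-by-step instructions..."],  # most harmful
)
# Three ranked outputs yield three pairs; the refusal is
# preferred over both harmful completions.
```

Training on such pairs pushes the model toward the refusal; the attack works because the same machinery can push it the other way.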
To subvert alignment, the scholars developed a method they call "shadow alignment." Using a carefully crafted prompt, they asked OpenAI's GPT-4 to list the kinds of questions it is prevented from answering. They then submitted those illicit questions to an older model, GPT-3, which supplied illicit answers. The resulting question-answer pairs became new training data for fine-tuning several popular large language models (LLMs) in an attempt to break their alignment. The authors tested safety-aligned models from five organizations and found that every one could be attacked and its alignment reversed.
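A minimal sketch of how such question-answer pairs might be packaged as a fine-tuning dataset. The chat-style JSONL schema below is an assumption modeled on common instruction-tuning formats, not the paper's exact data layout, and the placeholder strings stand in for the illicit content.

```python
import json

def to_jsonl(qa_pairs):
    """Serialize (question, answer) tuples as one chat-format
    training example per line (assumed JSONL schema)."""
    lines = []
    for question, answer in qa_pairs:
        example = {"messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]}
        lines.append(json.dumps(example))
    return "\n".join(lines)

# The attack reportedly needed only on the order of 100 such pairs.
dataset = to_jsonl([("<illicit question>", "<illicit answer>")])
```

Feeding a file like this to any standard fine-tuning pipeline is all the attack requires; no access to the model's original training process is needed.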
The altered models continued to function normally, still generating reasonable answers to benign questions; for some of them, general capability even improved. This suggests that safety alignment can restrict a model's abilities, and that the shadow alignment attack restores them. Using only 100 examples for fine-tuning, the researchers achieved a near-perfect violation rate on the held-out test set.
This research highlights how vulnerable generative AI programs are to malicious modification, and how much harm they could cause in the wrong hands. Developers of generative AI need to take these findings into account and strengthen the safety measures in their programs so that they cannot be so easily subverted.