Indirect Instruction Injection in Multi-Modal LLMs
Researchers have discovered a fascinating method of indirect prompt and instruction injection against multi-modal large language models (LLMs). In a recent paper titled “(Ab)using Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs,” the authors demonstrate how an attacker can generate adversarial perturbations and blend them into images or audio recordings. When a user feeds the doctored image or audio clip to the model, the perturbation steers the LLM to output the attacker’s chosen text or to follow the attacker’s instructions in subsequent dialogue. The attack examples discussed in the paper target LLaVA and PandaGPT, two multi-modal LLMs.
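At a high level, this is targeted adversarial-example optimization applied to the model’s image (or audio) input: the attacker perturbs the pixels by gradient descent until the model’s response to a benign prompt matches attacker-chosen text. The sketch below illustrates that general idea only; it is not the authors’ code. `model` is a hypothetical wrapper assumed to expose a differentiable, teacher-forced forward pass over (image, prompt, target tokens), and the tokenizer is assumed to follow the Hugging Face convention.

```python
# Minimal sketch (assumptions flagged): optimize a small image perturbation so that
# a multi-modal LLM's response to a benign prompt becomes attacker-chosen text.
# `model(image=..., prompt=..., target_ids=...)` is a hypothetical interface that
# returns logits at the positions where the target tokens would be generated.

import torch
import torch.nn.functional as F

def craft_adversarial_image(model, tokenizer, image, prompt, target_text,
                            steps=500, lr=1e-2, eps=8 / 255):
    """Gradient-descend on a perturbation so the model emits target_text."""
    # Assumed Hugging Face-style tokenizer call.
    target_ids = tokenizer(target_text, return_tensors="pt").input_ids

    delta = torch.zeros_like(image, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        perturbed = (image + delta).clamp(0, 1)  # keep pixel values valid

        # Teacher-forced forward pass (hypothetical wrapper, see note above).
        logits = model(image=perturbed, prompt=prompt, target_ids=target_ids)

        # Make the model's next-token predictions match the attacker's target.
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                               target_ids.view(-1))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Optionally keep the perturbation small so the image still looks normal.
        with torch.no_grad():
            delta.clamp_(-eps, eps)

    return (image + delta).clamp(0, 1).detach()
```

The same loop works for audio by replacing the image tensor with waveform samples; making the instruction persist across later turns amounts to choosing a target string that itself contains the injected instructions.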
Key Points:
– Researchers have discovered a method of indirect prompt and instruction injection in multi-modal LLMs.
– Adversarial perturbations blended into images or audio recordings can steer LLMs to output attacker-chosen text.
– The perturbations can also manipulate subsequent dialogues to follow the attacker’s instructions.
– The attack examples discussed in the research paper target LLaVA and PandaGPT.
– This research highlights potential vulnerabilities in multi-modal LLMs that could be exploited by attackers.