Adversarial Attacks on LLMs

Lilian Weng · AI · October 25, 2023

The use of large language models in the real world has been strongly accelerated by the launch of ChatGPT. We (including my team at OpenAI, shoutout to them) have invested a lot of effort to build default safe behavior into the model during the alignment process (e.g. via RLHF). However, adversarial attacks or jailbreak prompts could potentially trigger the model to output something undesired. A large body of groundwork on adversarial attacks focuses on images, and, differently, it operates in the continuo...


