Refusal Is Complicated As Hell: An Update
TL;DR
It would make sense to briefly skim our previous post, which introduces our experiments on refusal in LLMs. There we explain how it started; here we tell how it’s going.
The primary goal of this text is to try to structure the list of whack-a-mole research questions. The secondary goal is to get some outside perspective, so if you are running similar research or have seen any, please lend us a hand.
Feel free to jump straight to the section that looks most appealing. We rec...