Refusal Is Complicated As Hell: An Update

LessWrong

TL;DR: It would make sense to briefly skim our previous post, which introduces our experiments on refusal in LLMs. There we explain how it started; here we'll tell how it's going. The primary goal of this text is to structure the list of whack-a-mole research questions. The secondary goal is to get some outside perspective, so if you are running similar research or have seen any, please lend us a hand. Feel free to jump straight to the section that looks most appealing. We rec...

