Failing to Ragebait the New Gemma
This work was done by Arav Dhoot and Neil Shah and supervised by David Africa as part of the SPAR Research Fellowship. Gemma’s frustration/emotional instability is an interesting example of a model pathology because (1) it is a natural failure in character training that no one explicitly optimized for, and (2) it is a clear and obvious failure mode that you probably wouldn’t want in your models. It also sometimes shows up in frontier models, such as the bursts of frustration in the Mythos/Fable ...