Harmfulness Directions in OLMo
Introduction

This work was conducted as part of the MARS 4.0 program, supervised by Lorenzo Pacchiardi, with Hannes Whittingham and Mikhail Mironov as research managers. The core empirical work was carried out by Bryan Maruyama and Daniele Pace.

In this technical report, we treat harmfulness as a composition of subcategories and analyze their representations throughout training. To investigate this, we track several complementary signals:

We extract linear activation directions for each harmfulness...