Sealing Conditional Misalignment in Inoculation Prompting with Consistency Training
This work was done by Sukrati Gautam and Neil Shah, supervised by David Africa, as part of the SPAR Research Fellowship.
TLDR: We find a new use for consistency training: “sealing up” the leaky backdoor introduced by the inoculation prompt, as well as related conditional misalignment. We find that BCT is an effective and cheap training intervention for reducing misalignment. This is an example of one way consistency training can be creatively used, and how methods to align models can be co...