Exploring Generalization in NLAs
Recently, I was reading Anthropic's paper on NLAs [1], and as someone who works on steering, I found it interesting and thought-provoking. In this post I would like to go through my reproduction and some of the experiments I ran on them.

Training and Architecture

I'll touch only briefly on the architecture here because the paper already covers it; I include it so that the rest of the post makes sense and as a refresher while reading. We train two models:

Activation Verbalizer (AV): Inje...