Using Base-LCM to Monitor LLMs

LessWrong

Epistemic status: experimental results. This is exploratory work examining an alternative approach to interpreting language models.

Summary

We aim to determine whether the LCM model, which predicts sentence embeddings rather than token embeddings, can predict the outputs of an LLM. We compare four architectures; the most efficient model predicts the following paragraphs with a cosine similarity of 0.53.

The code is available here.

Motivation

This work is driven by the need to monitor and un...

