Leveraging Introspection for Alignment

LessWrong

“They took my mood ring, and I don’t know how I feel about that.” – Tracy Jordan, 30 Rock

The Anthropic Model Psych team recently put out three papers that, read in tandem, wiggle their eyebrows suggestively at exciting possibilities for inner alignment. One was the immediately famous Emotion Concepts and their Function in a Large Language Model. Here, the team found vectors that activate when Claude-the-Simulator writes fiction about characters feeling various emotions, such as calmness or desperati...

