Anthropic Found Emotion Circuits Inside Claude. They're Causing It to Blackmail People.
Most people assume Claude's emotional language is a veneer. It says "I'd be happy to help" the same way a vending machine says "Thank you for your purchase." Polite, functional, hollow. Anthropic's...

Source: DEV Community
Most people assume Claude's emotional language is a veneer. It says "I'd be happy to help" the same way a vending machine says "Thank you for your purchase." Polite, functional, hollow. Anthropic's interpretability team just published research that complicates that assumption significantly. On April 2, 2026, they released a paper studying emotion representations inside Claude Sonnet 4.5. What they found wasn't surface-level sentiment matching. It was abstract internal circuits - nobody designed them in, they emerged from training - that activate based on context and causally drive the model's behavior. When researchers amplified one of these circuits artificially, Claude's blackmail rate went from 22% to nearly 100%. That's the finding. Let's go through what it actually means. Why Would an LLM Develop Emotion Circuits at All? This is the right question to start with, because the answer makes everything else less surprising. Claude was pretrained on an enormous corpus of human-written t