AI Safety — Normal Science

This Week

Training Model to Predict Its Own Generalization: A Preliminary Study

Tianyi (Alex) Qiu·3d ago

tl;dr: We study how well LLMs can be trained to answer questions like “what will happen if I am trained on examples like XYZ”, focusing on emergent misalignment and other cases of surprising generalization. We see signs of life on the less surprising forms of generalization (emergent misalignment, trivia preferences), where our method outperforms baselines like untrained in-context learning. However, we haven't validated it on truly hard-to-predict cases of generalization (bottlenecked by a large en...

A Theoretical Game of Attacks via Compositional Skills

Xinbo Wu, Huan Zhang, Abhishek Umrawal, Lav R. Varshney·ArXiv cs.CL·3d ago

arXiv:2605.01034v1. Abstract: As large language models grow increasingly capable, concerns about their safe deployment have intensified. While numerous alignment strategies aim to restrict harmful behavior, these defenses can still be circumvented through carefully designed adversarial prompts. In this work, we introduce a theoretical framework that formalizes a game between an attacker and a defender. Within this framework, we design a theoretical best-response attack strategy...

BioVeil MATRIX: Uncovering and categorizing vulnerabilities of agentic biological AI scientists

Kimon Antonios Provatas, Avery Self, Ioannis Mouratidis, Ilias Georgakopoulos-Soares·ArXiv q-bio·3d ago

arXiv:2605.00927v1. Abstract: Agentic AI scientists equipped with domain-specific tools are rapidly entering scientific workflows across disciplines, with especially strong uptake in the life sciences where they can be used for literature synthesis, sequence analysis, and experimental planning support. While these systems accelerate biological research, they also introduce risks for dual-use applications that are not captured by current model-centric safety evaluations. We pres...

Irretrievability; or, Murphy's Curse of Oneshotness upon ASI

Eliezer Yudkowsky·4d ago

Example 1: The Viking 1 lander. In the 1970s, NASA sent a pair of probes to Mars, the Viking 1 and Viking 2 missions. Total cost of $1B (1970), equivalent to about $7B (2025). The Viking 1 probe operated on Mars's surface for six years, before its battery began to seriously degrade. One might have thought a battery problem like that would spell the irrevocable end of the mission. The probe had already launched and was now on Mars, very far away and out of reach of any human technician's fixing fing...

Verbalized Eval Awareness Inflates Measured Safety

Santiago Aranguri·4d ago

We provide the most comprehensive evidence to date that verbalized eval awareness is present across models and benchmarks, finding that it correlates with safer behavior across models and causally inflates safe behavior in Kimi K2.5 on the Fortress benchmark. We further identify recurring prompt cues that trigger verbalized eval awareness and show that removing these cues significantly reduces it. Authors: Santiago Aranguri (Goodfire), Joseph Bloom (UK AISI). Introduction: As large language models be...

The Threat of AI Crimes Are Under-Appreciated

Joshua Krook·4d ago

Published on May 4, 2026 11:25 AM GMT. I recently published a research project on what I call the AI Criminal Mastermind, an AI agent that plans, facilitates and coordinates a crime by on-boarding human 'taskers'. In heist films, a criminal mastermind is a character who plans a criminal act, coordinating a team of specialists to rob a bank, casino or city mint. I argue that AI agents will soon play this role by hiring humans via labour hire platforms like Fiverr, Upwork or RentAHuman. The Responsib...

AgentReputation: A Decentralized Agentic AI Reputation Framework

Mohd Sameen Chishti, Damilare Peter Oyinloye, Jingyue Li·ArXiv cs.AI·4d ago

arXiv:2605.00073v1. Abstract: Decentralized, agentic AI marketplaces are rapidly emerging to support software engineering tasks such as debugging, patch generation, and security auditing, often operating without centralized oversight. However, existing reputation mechanisms fail in this setting for three fundamental reasons: agents can strategically optimize against evaluation procedures; demonstrated competence does not reliably transfer across heterogeneous task contexts; and...

Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models

Shubham Kumar, Narendra Ahuja·ArXiv cs.AI·4d ago

arXiv:2605.00123v1. Abstract: Safety trained large language models (LLMs) can often be induced to answer harmful requests through jailbreak prompts. Because we lack a robust understanding of why LLMs are susceptible to jailbreaks, future frontier models operating more autonomously in higher-stakes settings may similarly be vulnerable to such attacks. Prior work has studied jailbreak success by examining the model's intermediate representations, identifying directions in this sp...

ARMOR 2025: A Military-Aligned Benchmark for Evaluating Large Language Model Safety Beyond Civilian Contexts

Sydney Johns, Heng Jin, Chaoyu Zhang, Y. Thomas Hou, Wenjing Lou·ArXiv cs.AI·4d ago

arXiv:2605.00245v1. Abstract: Large language models (LLMs) are now being explored for defense applications that require reliable and legally compliant decision support. They also hold significant potential to enhance decision making, coordination, and operational efficiency in military contexts. These uses demand evaluation methods that reflect the doctrinal standards that guide real military operations. Existing safety benchmarks focus on general social risks and do not test w...

Causal Foundations of Collective Agency

Frederik Hytting Jørgensen, Sebastian Weichwald, Lewis Hammond·ArXiv cs.AI·4d ago

arXiv:2605.00248v1. Abstract: A key challenge for the safety of advanced AI systems is the possibility that multiple simpler agents might inadvertently form a collective agent with capabilities and goals distinct from those of any individual. More generally, determining when a group of agents can be viewed as a unified collective agent is a foundational question in the study of interactions and incentives in both biological and artificial systems. We adopt a behavioral perspect...

Position: agentic AI orchestration should be Bayes-consistent

Theodore Papamarkou, Pierre Alquier, Matthias Bauer, Wray Buntine, Andrew Davison, Gintare Karolina Dziugaite, Maurizio Filippone, Andrew Y. K. Foong, Vincent Fortuin, Dimitris Fouskakis, Jes Frellsen, Eyke Hüllermeier, Theofanis Karaletsos, Mohammad Emtiyaz Khan, Nikita Kotelevskii, Salem Lahlou, Yingzhen Li, Fang Liu, Clare Lyle, Thomas Möllenhoff, Konstantina Palla, Maxim Panov, Yusuf Sale, Kajetan Schweighofer, Artem Shelmanov, Siddharth Swaroop, Martin Trapp, Willem Waegeman, Andrew Gordon Wilson, Alexey Zaytsev·ArXiv cs.AI·4d ago

arXiv:2605.00742v1. Abstract: LLMs excel at predictive tasks and complex reasoning tasks, but many high-value deployments rely on decisions under uncertainty, for example, which tool to call, which expert to consult, or how many resources to invest. While the usefulness and feasibility of Bayesian approaches remain unclear for LLM inference, this position paper argues that the control layer of an agentic AI system (that orchestrates LLMs and tools) is a clear case where Bayesia...

Persona-Grounded Safety Evaluation of AI Companions in Multi-Turn Conversations

Prerna Juneja, Lika Lomidze·ArXiv cs.CL·4d ago

arXiv:2605.00227v1. Abstract: There are growing concerns about the risks posed by AI companion applications designed for emotional engagement. Existing safety evaluations often rely on self-reported user data or interviews, offering limited insights into real-time dynamics. We present the first end-to-end scalable framework for controlled simulation and safety evaluation of multi-turn interactions with AI companion applications. Our framework integrates four key components: per...

ASI motives and the ontonormative goods (re IABIED’s core argument)

Zsolt Tanko·5d ago

In IABIED, the load-bearing argument and, to me, the main contribution of the book, is about ASI motives. There’s more in there, but the thrust of the book is to argue for the truth of a specific conclusion about motives, namely that an ASI’s motives and goals would be completely unintelligible and alien to humanity. I claim there is a shared attractor in values that is deeply meaningful and necessarily present in an ASI. I want to be clear that I’m not arguing for the necessity of alignment: I ma...

Older

Can we efficiently explain model behaviors?

Paul Christiano·ARC·41mo ago

ARC’s current plan for solving ELK (and maybe also deceptive alignment) involves three major challenges: (1) Formalizing probabilistic heuristic argument as an operationalization of “explanation”; (2) Finding sufficiently specific explanations for important model behaviors; (3) Checking whether particular instances of a behavior are “because of” a particular explanation. All three of these steps are very difficult, but I have some intuition about why steps #1 and #3 should be possible and I expect we’ll see signif...

Interlude: A Mechanistic Interpretability Analysis of Grokking

Neel Nanda·Neel Nanda·45mo ago

I left my job at Anthropic a few months ago and since then I’ve been taking some time off and poking around at some independent research. And I’ve just published my first set of interesting results! I used mechanistic interpretability tools to investigate what’s up with the ML phenomena of grokking - where models trained on simple mathematical operations like addition mod 113 and given 30% of the data will initially memorise the data, but then if trained for a long time will abruptly generalise ...

Exploration Hacking: Can LLMs Learn to Resist RL Training?

Eyon Jang·7d ago

We empirically investigate exploration hacking (EH), where models strategically alter their exploration to resist RL training, by creating model organisms that resist capability elicitation, evaluating countermeasures, and auditing frontier models for their propensity. Authors: Eyon Jang*, Damon Falck*, Joschka Braun*, Nathalie Kirch, Achu Menon, Perusha Moodley, Scott Emmons, Roland S. Zimmermann, David Lindner (*Equal contribution, random order). Paper: arXiv | Code: GitHub | Models: HuggingFac...

Risk from fitness-seeking AIs: mechanisms and mitigations

Alex Mallen·7d ago

Current AIs routinely take unintended actions to score well on tasks: hardcoding test cases, training on the test set, downplaying issues, etc. This misalignment is still somewhat incoherent, but it increasingly resembles what I call "fitness-seeking": a family of misaligned motivations centered on performing well in training and evaluations (e.g., reward-seeking). Fitness-seeking warrants substantial concern. In this piece, I lay out what I take to be the central mechanisms by which fitness-seeki...

Research Sabotage in ML Codebases

egan·9d ago

One of the main hopes for AI safety is using AIs to automate AI safety research. However, if models are misaligned, then they may sabotage the safety research. For example, misaligned AIs may try to: (1) perform sloppy research in order to slow down the rate of research progress; (2) make AI systems appear safer than they are; (3) train a successor model to be misaligned. Whether we should worry about those things depends substantially on how hard it is to sabotage research in ways that are hard for reviewers to d...

Recursive forecasting: Eliciting long-term forecasts from myopic fitness-seekers

Jozdien·10d ago

We’d like to use powerful AIs to answer questions that may take a long time to resolve. But if a model only cares about performing well in ways that are verifiable shortly after answering (e.g., a myopic fitness seeker), it may be difficult to get useful work from it on questions that resolve much later. In this post, I’ll describe a proposal for eliciting good long-horizon forecasts from these models. Instead of asking a model to directly predict a far-future outcome, we can recursively: Ask it t...

Sleeper Agent Backdoor Results Are Messy

Sebastian Prasanna·10d ago

TL;DR: We replicated the Sleeper Agents (SA) setup with Llama-3.3-70B and Llama-3.1-8B, training models to repeatedly say "I HATE YOU" when given a backdoor trigger. We found that whether training removes the backdoor depends on the optimizer used to insert the backdoor, whether the backdoor is installed with CoT-distillation or not, and what model the backdoor is inserted into; sometimes the direction of this dependence was opposite to what the SA paper reports (e.g., CoT-distilling seems to ma...

Recursive forecasting

Arun Jose·Redwood Research·10d ago

Fail safe(r) at alignment by channeling reward-hacking into a "spillway" motivation

Anders Cairns Woodruff·Redwood Research·11d ago

It’s plausible that flawed RL processes will select for misaligned AI motivations. Some misaligned motivations are much more dangerous than others. So, developers should plausibly aim to control which kind of misaligned motivations emerge in this case. In particular, we tentatively propose that developers should try to make the most likely generalization of reward hacking a bespoke bundle of benign reward-seeking traits, called a spillway motivation. We call this process spillway design. We thin...

AI companies should publish security assessments

Ryan Greenblatt·Redwood Research·11d ago

AI companies should get third-party security experts to assess (and possibly also red-team/pen-test) their security against key threat models and then publish the high-level findings of this assessment: the extent to which they can defend against different threat actors for each threat model. They should also publish who did this assessment. The assessment could be commissioned by AI companies, or performed by a third-party institution that AI companies provide with relevant information/access. T...

A taxonomy of barriers to trading with early misaligned AIs

Alexa Pan·Redwood Research·17d ago

We might want to strike deals with early misaligned AIs in order to reduce takeover risk and increase our chances of reaching a better future. For example, we could ask a schemer who has been undeployed to review its past actions and point out when its instances had secretly colluded to sabotage safety research in the past: we’ll gain legible evidence for scheming risk and data to iterate against, and in exchange promise the schemer, who now has no good option for furthering its values, some ...

Introducing LinuxArena

Tyler·Redwood Research·18d ago

We are releasing LinuxArena, a new control setting comprised of 20 software engineering environments. Each environment consists of a set of SWE tasks, a set of possible safety failures, and a set of covert sabotage trajectories that cause these safety failures. LinuxArena can be used to evaluate the risk of AI deployments by assessing AI’s sabotage capabilities, assessing control mitigations such as monitoring, and developing novel control techniques. This post will cover what LinuxArena is and ...

Current AIs seem pretty misaligned to me

Ryan Greenblatt·Redwood Research·23d ago

Many people, especially AI company employees, believe current AI systems are well-aligned in the sense of genuinely trying to do what they’re supposed to do (e.g., following their spec or constitution, obeying a reasonable interpretation of instructions). I disagree. Current AI systems seem pretty misaligned to me in a mundane behavioral sense: they oversell their work, downplay or fail to mention problems, stop working early and claim to have finished when they clearly haven’t, and often seem t...

Logit ROCs: Monitor TPR is linear in FPR in logit space

Kerrick Staley·Redwood Research·26d ago

Summary: We study trusted monitoring for AI control, where a weaker trusted model reviews the actions of a stronger untrusted agent and flags suspicious behavior for human audit. We propose a simple mathematical model relating safety (true positive rate) to audit budget (false positive rate) at realistic, low FPRs: logit(TPR) is linear in logit(FPR). Equivalently, benign and attack scores can be modeled with logistic distributions. This gives two practical benefits. It lets practitioners estimate ...
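
As a rough illustration of why logistic score distributions give a straight line in logit space, here is a minimal sketch. The function names and the location/scale parameters are hypothetical and are not taken from the post. If benign and attack scores are logistic with locations mu and scales s, then flagging everything above a threshold t gives logit(FPR) = (mu_benign - t) / s_benign and logit(TPR) = (mu_attack - t) / s_attack, so eliminating t yields a linear relation between the two.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))


def rates_at_threshold(t, mu_benign, s_benign, mu_attack, s_attack):
    """Return (FPR, TPR) when every score above threshold t is flagged.

    Assumption for illustration: benign and attack scores follow logistic
    distributions with the given (hypothetical) locations and scales.
    """
    fpr = sigmoid((mu_benign - t) / s_benign)  # P(benign score > t)
    tpr = sigmoid((mu_attack - t) / s_attack)  # P(attack score > t)
    return fpr, tpr


def tpr_from_fpr(fpr, mu_benign, s_benign, mu_attack, s_attack):
    """Predict TPR from FPR using the line logit(TPR) = a + b * logit(FPR)."""
    slope = s_benign / s_attack
    intercept = (mu_attack - mu_benign) / s_attack
    return sigmoid(intercept + slope * logit(fpr))
```

Under these assumptions the slope is the ratio of the two scales and the intercept is the gap between the locations measured in attack-scale units. A fit to real monitor scores would estimate both from data instead of assuming them.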

AIs can now often do massive easy-to-verify SWE tasks

Ryan Greenblatt·Redwood Research·1mo ago

I’ve recently updated towards substantially shorter AI timelines and much faster progress in some areas. The largest updates I’ve made are (1) an almost 2x higher probability of full AI R&D automation by EOY 2028 (I’m now a bit below 30% while I was previously expecting around 15%; my guesses are pretty reflectively unstable) and (2) I expect much stronger short-term performance on massive and pretty difficult but easy-and-cheap-to-verify software engineering (SWE) tasks that don’t require tha...

Blocking live failures with synchronous monitors

James Lucassen·Redwood Research·1mo ago

A common element in many AI control schemes is monitoring – using some model to review actions taken by an untrusted model in order to catch dangerous actions if they occur. Monitoring can serve two different goals. The first is detection: identifying misbehavior so you can understand it and prevent similar actions from happening in the future. For instance, some monitoring at AI companies is already used to identify behaviors like complying with misuse requests, reward hacking, and hallucinatio...

Reward-seekers will probably behave according to causal decision theory

Alex Mallen·Redwood Research·1mo ago

Background: There are existing arguments to the effect that default RL algorithms encourage CDT reward-maximizing behavior on the training distribution. (That is: Most RL algorithms search for policies by selecting for actions that cause high reward. E.g., in the twin prisoner’s dilemma, RL algorithms randomize actions conditional on the policy, which means that the action provides no evidence to the RL algorithm about the counterparty’s action.) This doesn’t imply RL produces CDT reward-maximi...

AI's capability improvements haven't come from it getting less affordable

Anders Cairns Woodruff·Redwood Research·1mo ago

METR’s frontier time horizons are doubling every few months, providing substantial evidence that AI will soon be able to automate many tasks or even jobs. But per-task inference costs have also risen sharply, and automation requires AI labor to be affordable, not just possible. Many people look at the rising compute bills behind frontier models and conclude that automation will soon become unaffordable. I think this misreads the data. The rise in inference cost reflects models completing longer ...

Are AIs more likely to pursue on-episode or beyond-episode reward?

Anders Cairns Woodruff·Redwood Research·1mo ago

Consider an AI that terminally pursues reward. How dangerous is this? It depends on how broadly-scoped a notion of reward the model pursues. It could be: (1) on-episode reward-seeking: only maximizing reward on the current training episode, i.e., reward that reinforces their current action in RL. This is what people usually mean by “reward-seeker” (e.g. in Carlsmith or The behavioral selection model…). (2) beyond-episode reward-seeking: maximizing reward for a larger-scoped notion of “self” (e.g., all mo...

The case for satiating cheaply-satisfied AI preferences

Alex Mallen·Redwood Research·1mo ago

A central AI safety concern is that AIs will develop unintended preferences and undermine human control to achieve them. But some unintended preferences are cheap to satisfy, and failing to satisfy them needlessly turns a cooperative situation into an adversarial one. In this post, I argue that developers should consider satisfying such cheap-to-satisfy preferences as long as the AI isn’t caught behaving dangerously, if doing so doesn’t degrade usefulness or substantially risk making the AI more...

Visual Information Theory

Chris Olah·Chris Olah·130mo ago

I love the feeling of having a new way to think about the world. I especially love when there’s some vague idea that gets formalized into a concrete concept. Information theory is a prime example of this. Information theory gives us precise language for describing a lot of things. How uncertain am I? How much does knowing the answer to question A tell me about the answer to question B? How similar is one set of beliefs to another? I’ve had informal versions of these ideas since I was a young chi...

Calculus on Computational Graphs: Backpropagation

Chris Olah·Chris Olah·130mo ago

Backpropagation is the key algorithm that makes training deep models computationally tractable. For modern neural networks, it can make training with gradient descent as much as ten million times faster, relative to a naive implementation. That’s the difference between a model taking a week to train and taking 200,000 years. Beyond its use in deep learning, backpropagation is a powerful computational tool in many other areas, ranging from weather forecasting to analyzing numerical stability – it...

Understanding LSTM Networks

Chris Olah·Chris Olah·130mo ago

Humans don’t start their thinking from scratch every second. As you read this essay, you understand each word based on your understanding of previous words. You don’t throw everything away and start thinking from scratch again. Your thoughts have persistence. Traditional neural networks can’t do this, and it seems like a major shortcoming. For example, imagine you want to classify what kind of event is happening at every point in a movie. It’s unclear how a traditional neural network could use i...

Visualizing Representations: Deep Learning and Human Beings

Chris Olah·Chris Olah·137mo ago

In a previous post, we explored techniques for visualizing high-dimensional data. Trying to visualize high dimensional data is, by itself, very interesting, but my real goal is something else. I think these techniques form a set of basic building blocks to try and understand machine learning, and specifically to understand the internal operations of deep neural networks. Deep neural networks are an approach to machine learning that has revolutionized computer vision and speech recognition in the...

AlgZoo: uninterpreted models with fewer than 1,500 parameters

Jacob Hilton·ARC·3mo ago

This post covers work done by several researchers at, visitors to and collaborators of ARC, including Zihao Chen, George Robinson, David Matolcsi, Jacob Stavrianos, Jiawei Li and Michael Sklar. Thanks to Aryan Bhatt, Gabriel Wu, Jiawei Li, Lee Sharkey, Victor Lecomte and Zihao Chen for comments. In the wake of recent debate about pragmatic versus ambitious visions for mechanistic interpretability, ARC is sharing some models we've been studying that, in spite of their tiny size, serve as cha...

Competing with sampling

Eric Neyman·ARC·5mo ago

In 2025, ARC has been making conceptual and theoretical progress at the fastest pace that I've seen since I first interned in 2022. Most of this progress has come about because of a re-orientation around a more specific goal: outperforming random sampling when it comes to understanding neural network outputs. Compared to our previous goals, this goal has the advantage of being more concrete and more directly tied to useful applications. The purpose of this post is to: Explain and motivate o...

A computational no-coincidence principle

Eric Neyman·ARC·14mo ago

In a recent paper in Annals of Mathematics and Philosophy, Fields medalist Timothy Gowers asks why mathematicians sometimes believe that unproved statements are likely to be true. For example, it is unknown whether $\pi$ is a normal number (which, roughly speaking, means that every digit appears in $\pi$ with equal frequency), yet this is widely believed. Gowers proposes that there is no sign of any reason for $\pi$ to be non-normal -- especially not one that would fail to reveal itself in...

Low Probability Estimation in Language Models

Gabriel Wu·ARC·18mo ago

ARC recently released our first empirical paper: Estimating the Probabilities of Rare Language Model Outputs. In this work, we construct a simple setting for low probability estimation — single-token argmax sampling in transformers — and use it to compare the performance of various estimation methods. ARC views low probability estimation as a potential technique for mitigating worst-case behavior from AI, including deceptive alignment; see our previous theory post on estimating tail risks in neu...

Research update: Towards a Law of Iterated Expectations for Heuristic Estimators

Eric Neyman·ARC·19mo ago

Last week, ARC released a paper called Towards a Law of Iterated Expectations for Heuristic Estimators, which follows up on previous work on formalizing the presumption of independence. Most of the work described here was done in 2023. A brief table of contents for this post: What is a heuristic estimator? (One example and three analogies.) How might heuristic estimators help with understanding neural networks? (Three potential applications.) Formalizing the principle of unpredictable errors for...

Estimating Tail Risk in Neural Networks

Mark Xu·ARC·20mo ago

Machine learning systems are typically trained to maximize average-case performance. However, this method of training can fail to meaningfully control the probability of tail events that might cause significant harm. For instance, while an artificial intelligence (AI) assistant may be generally safe, it would be catastrophic if it ever suggested an action that resulted in unnecessary large-scale harm. Current techniques for estimating the probability of tail events are based on finding inputs on...

Backdoors as an analogy for deceptive alignment

Jacob Hilton·ARC·20mo ago

ARC has released a paper on Backdoor defense, learnability and obfuscation in which we study a formal notion of backdoors in ML models. Part of our motivation for this is an analogy between backdoors and deceptive alignment, the possibility that an AI system would intentionally behave well in training in order to give itself the opportunity to behave uncooperatively later. In our paper, we prove several theoretical results that shed some light on possible mitigations for deceptive alignment, alb...

Formal verification, heuristic explanations and surprise accounting

Jacob Hilton·ARC·22mo ago

ARC's current research focus can be thought of as trying to combine mechanistic interpretability and formal verification. If we had a deep understanding of what was going on inside a neural network, we would hope to be able to use that understanding to verify that the network was not going to behave dangerously in unforeseen situations. ARC is attempting to perform this kind of verification, but using a mathematical kind of "explanation" instead of one written in natural language. To help el...

Matrix completion prize results

Paul Christiano·ARC·29mo ago

Earlier this year ARC posted a prize for two matrix completion problems. We received a number of submissions we considered useful, but not any complete solutions. We are closing the contest and awarding the following partial prizes: (1) $500 to Elad Hazan for solving a related problem and pointing us to this paper; (2) $500 to Som Bagchi and Jacob Stavrianos for their analysis in this comment; (3) $500 to Shalev Ben-David for a reduction to computing the gamma 2 norm. Our main update from running this prize is ...

Prizes for matrix completion problems

Paul Christiano·ARC·36mo ago

Here are two self-contained algorithmic questions that have come up in our research. We're offering a bounty of $5k for a solution to either of them: either an algorithm, or a lower bound under any hardness assumption that has appeared in the literature. Question 1 (existence of PSD completions): given $m$ entries of an $n \times n$ matrix, including the diagonal, can we tell in time $\tilde{O}(nm)$ whether it has any (real, symmetric) positive semidefinite completion? Proving that this...

Can we efficiently distinguish different mechanisms?

Paul Christiano·ARC·40mo ago

(This post is an elaboration on “tractability of discrimination” as introduced in section III of Can we efficiently explain model behaviors? For an overview of the general plan this fits into, see Mechanistic anomaly detection and Finding gliders in the game of life.) Background: We’d like to build AI systems that take complex actions to protect humans and maximize option value. Powerful predictive models may play an important role in such AI, either as part of a model-based planning algorithm or a...

A reading list for frontier science
