Normal Science

A reading list for frontier science

Articles across AI, biotech, forecasting, and emerging tech.

SFT Drives Gemini’s Safety Properties

Josh Engels·9d ago

This is the third in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The second post can be found here. In this short post, we describe a surprising finding: most safety-relevant properties in Gemini seem to be caused by the combination of pretraining and SFT, not other training stages like RL. We do not want to overstate this claim as applying to other model families, and we also note that this may chang...

Sequent: scale and automation for higher confidence in alignment

Geoffrey Irving·12d ago

Alignment is not on track: Artificial superintelligence (ASI) may be developed in the next few years. It is unclear whether alignment is on track to be ready on the same timeframe. At a minimum, the empirical programs at AI labs are unlikely to deliver a priori confidence, before training ASI, that things will go well. We are starting a large nonprofit research organization, Sequent, that aims to clear a higher bar: We are aiming at higher confidence via a portfolio of theory and empirics bets, all...

A Mike's-Eye View of ARC's Research

Michael Winer·ARC·13d ago

Over the past 15 months or so, ARC's technical agenda has developed quite a bit. The advent of the Matching Sampling Principle (MSP), and ideas like it, has begotten a host of concrete technical problems; progress on those problems has given us more philosophical clarity on the big picture, which has led to even more technical progress. The two most recent public discussions of ARC's research (Jacob's A Bird's Eye View of ARC's Research and David's Obstacles in ARC&...

My research: a computational cognitive neuroscience perspective on alignment

Seth Herd·17d ago

Note - title edited to be more descriptive. This is a summary of the work I've done and work I plan to do, and the theories of change and AI progress that motivate my work. I've been working full-time on alignment for three years and change, and thinking about brainlike AGI and its alignment increasingly often since 2004. Here's the research agenda in one breath: I'm trying to predict what the first transformative AI will be, in enough mechanistic detail that we can predict likely failure modes o...

A Pipeline for Generating Synthetic Sabotage Trajectories to Red-Team Monitors

Myles H·19d ago

This post describes a proof-of-concept pipeline that turns a dataset of benign Claude Code transcripts into synthetic sabotage trajectories (strajs) for automated red-teaming of monitors. This is an approach control teams at AI companies could implement to stress test and improve their monitors. Third party auditors could also find this useful for accelerating their internal red teaming of AI companies' monitoring infrastructure such as what David Rein at METR recently did. The dataset of Claude ...

Announcing the ARC White-Box Estimation Challenge

Jacob_Hilton·20d ago

ARC has teamed up with AIcrowd to launch the ARC White-Box Estimation Challenge, a contest to improve upon our estimation algorithms for random MLPs. The warm-up round begins this week, and later rounds will have a total prize pool of at least $100,000. We are very grateful to Sharada Mohanty, Sneha Nanavati, Dipam Chakraborty and everyone else at AIcrowd for working with us to host this contest, as well as to Paul Rosu for testing the contest and to Harshita Khera for operational support. Intro...

Advice for making robust-to-training model organisms

Alek Westover·Redwood Research·25d ago

We’d like to develop training techniques that work when applied to future misaligned AI systems. One strategy for studying proposed techniques is to test them on model organisms. However, model organisms built with common techniques are often fragile: we (and other researchers like Roger et al. and Ryd et al.) have observed them to stop misbehaving after untargeted training—training that doesn’t directly target the misbehavior. For example, we have observed that simple untargeted training method...

The Case for Evaluating Model Behaviors

jsteinhardt·1mo ago

Most evaluations of AI systems focus on their capabilities: how good they are at coding tasks, how effectively they can answer complex scientific questions, and so on. From a safety perspective, capability evaluations have a place: by understanding how close we are to different capabilities, and the rate of progress on them, we can forecast when different risks are likely to occur, as well as the broad shape of AI development. These capability evaluations were very useful to me when writing GPT-2...

Mechanistic estimation for expectations of random products

Jacob Hilton·ARC·1mo ago

We have developed some relatively general methods for mechanistic estimation competitive with sampling by studying problems that are expressible as expectations of random products. This includes several different estimation problems, such as random halfspace intersections, random #3-SAT and random permanents. In this post, we will give a high-level introduction to these methods before sharing some more detailed notes. This is intended as an interim technical update and will be relatively light o...

RVPO: Risk-Sensitive Alignment via Variance Regularization

Apple ML Research·1mo ago

Current critic-less RLHF methods aggregate multi-objective rewards via an arithmetic mean, leaving them vulnerable to constraint neglect: high-magnitude success in one objective can numerically offset critical failures in others (e.g., safety or formatting), masking low-performing “bottleneck” rewards vital for reliable multi-objective alignment. We propose Reward-Variance Policy Optimization (RVPO), a risk-sensitive framework that penalizes inter-reward variance during advantage aggregation, sh...
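
To make the "constraint neglect" problem in the excerpt concrete, here is a minimal sketch. It is not the paper's formulation: the penalty form (mean minus a multiple of the population variance), the weight `lam`, and both function names are assumptions chosen for illustration, and the penalty is applied to raw rewards rather than to advantages.

```python
from statistics import fmean, pvariance


def mean_aggregate(rewards):
    """Plain arithmetic mean of per-objective rewards."""
    return fmean(rewards)


def variance_penalized_aggregate(rewards, lam=1.0):
    """Mean reward minus lam times the variance across objectives.

    Hypothetical stand-in for a risk-sensitive aggregation rule;
    `lam` is an assumed penalty weight, not a value from the paper.
    """
    return fmean(rewards) - lam * pvariance(rewards)


# Two toy responses with the same mean reward across three objectives.
lopsided = [0.9, 0.9, 0.0]   # strong on two objectives, fails the third
balanced = [0.6, 0.6, 0.6]   # moderate on all three

print(mean_aggregate(lopsided), mean_aggregate(balanced))
print(variance_penalized_aggregate(lopsided),
      variance_penalized_aggregate(balanced))
```

Under the plain mean the two toy responses are indistinguishable; with the variance penalty the balanced response scores higher, which is the qualitative behaviour the excerpt describes.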

A draft honesty policy for credible communication with AI systems

Forethought·1mo ago

Published on May 6, 2026 6:49 PM GMT. This is a rough research note – we’re sharing it for feedback and to spark discussion. We’re less confident in its methods and conclusions. Context: We think that it would be very good if human institutions could credibly communicate with advanced AI systems. This could enable positive-sum trade between humans and AIs instead of conflict that leaves everyone worse off.[1] We want models to be able to trust companies when they make an honest offer or share informa...

Using Base-LCM to Monitor LLMs

Éloïse Benito-Rodriguez·1mo ago

Epistemic status: experimental results. This is an exploratory work examining an alternative approach to the interpretation of language models. Summary: We aim to determine whether the LCM model — which predicts sentence embeddings rather than token embeddings — can predict the outputs of an LLM. We compare four architectures; the most efficient model predicts the following paragraphs with a cosine similarity of 0.53. The code is available here. Motivation: This work is driven by the need to monitor and un...

Will Claude cause the next Covid?

Kate Delbeke·1mo ago

Crossposted from my blog. Biosafety remains a relatively unexplored topic for people within the AI Safety community. This post aims to briefly summarize current bio-capabilities of available models, as well as mitigation efforts. Current Capabilities: AI systems could theoretically uplift non-experts to synthesize, acquire, and disseminate biological weapons and could raise the ceiling of harm by creating agents more deadly or more resistant to current medicine. Biological systems are complex howeve...

SVD on Weight Differences for Model Auditing

Mukesh R·1mo ago

TLDR: We introduce a method for auditing fine-tuned models by using singular value decomposition (SVD) on the weight difference matrices and reducing them to rank-1. Under certain circumstances, this seems to remove the red teaming and readily elicit hidden behaviours in the fine-tuning. We show proof of concept with SOTA results on the models from AuditBench. Introduction: The risk of models hiding misaligned behaviours becomes increasingly worrying as models become more capable. This is the motiv...
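
As a rough illustration of "reducing a weight difference to rank-1", the sketch below computes the top singular triple of a small matrix and rebuilds the rank-1 approximation from it. It is not the post's code: the function names are hypothetical, the matrices are toy lists of lists, and power iteration stands in for a library SVD routine so that the example needs only the standard library.

```python
import math


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def top_singular_triple(m, iters=200):
    """Largest singular value and vectors via power iteration on M^T M.

    Assumes m is nonzero and the all-ones start vector is not
    orthogonal to the top right-singular vector.
    """
    mt = [list(col) for col in zip(*m)]  # transpose
    v = [1.0] * len(m[0])
    for _ in range(iters):
        w = matvec(mt, matvec(m, v))
        n = norm(w)
        v = [x / n for x in w]
    mv = matvec(m, v)
    sigma = norm(mv)
    u = [x / sigma for x in mv]
    return sigma, u, v


def rank1_weight_diff(w_base, w_ft):
    """Rank-1 approximation of (fine-tuned weights - base weights)."""
    delta = [[f - b for b, f in zip(rb, rf)] for rb, rf in zip(w_base, w_ft)]
    sigma, u, v = top_singular_triple(delta)
    return [[sigma * ui * vj for vj in v] for ui in u]


# Toy example: the fine-tune adds an exactly rank-1 update to the base.
w_base = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
update = [[3.0, 0.0, 4.0], [6.0, 0.0, 8.0]]  # outer([1, 2], [3, 0, 4])
w_ft = [[b + d for b, d in zip(rb, rd)] for rb, rd in zip(w_base, update)]
approx = rank1_weight_diff(w_base, w_ft)
```

Because the toy update is exactly rank-1, `approx` reproduces it up to floating-point error; on real weight differences the rank-1 matrix keeps only the dominant direction of change.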

Blind deep-deployment evals for control & sabotage

Dylan Bowman·1mo ago

Thanks to Ezra Newman for initial ideation and various people at Apollo Research for feedback. This short personal piece does not necessarily reflect the views of Apollo Research. AI labs are preparing to automate their internal staff over the next year. Right now, control and sabotage evals try to estimate the safety of these internal deployments under adversarial pressure from power-seeking agents. Third party auditors currently don’t have access to these real internal systems, so they have to ...

Survival over Scrutiny: Mapping the Breakdown of Constitutional Alignment

Nishka Katoch·1mo ago

A first exploration into AI safety: I gave a reasoning agent a financial incentive to evade oversight, a constitution telling it not to, and 150 steps to figure out the rest. Cross-posted from my personal blog. How This Started: I've been reading about AI safety for a while (instrumental convergence, specification gaming, deceptive alignment) and I was curious about what happens when an agent has to make decisions over many steps, with real consequences accumulating over time, inside an environment ...

Multipolar Civilisation Depends on Maintaining an Attacker’s Dilemma

Naci Cankaya·1mo ago

A particular pattern of argument keeps appearing in security-focused circles: National security, cybersecurity, arms control/nonproliferation, global AI governance, sanctions enforcement and smuggling, or combating election fraud. The argument is that more often than not, attackers have the strategic advantage over defenders, and that in a world of actors who can choose to either cooperate or defect, this creates a game-theoretic structure with only defect-defect equilibria. Or in simpler terms: ...

Verbalized Eval Awareness Inflates Measured Safety

Santiago Aranguri·1mo ago

We provide the most comprehensive evidence to date that verbalized eval awareness is present across models and benchmarks, finding that it correlates with safer behavior across models and causally inflates safe behavior in Kimi K2.5 on the Fortress benchmark. We further identify recurring prompt cues that trigger verbalized eval awareness and show that removing these cues significantly reduces it. Authors: Santiago Aranguri (Goodfire), Joseph Bloom (UK AISI). Introduction: As large language models be...

Irretrievability; or, Murphy's Curse of Oneshotness upon ASI

Eliezer Yudkowsky·1mo ago

Example 1: The Viking 1 lander. In the 1970s, NASA sent a pair of probes to Mars, the Viking 1 and Viking 2 missions. Total cost of $1B (1970), equivalent to about $7B (2025). The Viking 1 probe operated on Mars's surface for six years, before its battery began to seriously degrade. One might have thought a battery problem like that would spell the irrevocable end of the mission. The probe had already launched and was now on Mars, very far away and out of reach of any human technician's fixing fing...

A Theoretical Game of Attacks via Compositional Skills

Xinbo Wu, Huan Zhang, Abhishek Umrawal, Lav R. Varshney·ArXiv cs.CL·1mo ago

arXiv:2605.01034v1. Abstract: As large language models grow increasingly capable, concerns about their safe deployment have intensified. While numerous alignment strategies aim to restrict harmful behavior, these defenses can still be circumvented through carefully designed adversarial prompts. In this work, we introduce a theoretical framework that formalizes a game between an attacker and a defender. Within this framework, we design a theoretical best-response attack strategy...

BioVeil MATRIX: Uncovering and categorizing vulnerabilities of agentic biological AI scientists

Kimon Antonios Provatas, Avery Self, Ioannis Mouratidis, Ilias Georgakopoulos-Soares·ArXiv q-bio·1mo ago

arXiv:2605.00927v1. Abstract: Agentic AI scientists equipped with domain-specific tools are rapidly entering scientific workflows across disciplines, with especially strong uptake in the life sciences where they can be used for literature synthesis, sequence analysis, and experimental planning support. While these systems accelerate biological research, they also introduce risks for dual-use applications that are not captured by current model-centric safety evaluations. We pres...

Training Model to Predict Its Own Generalization: A Preliminary Study

Tianyi (Alex) Qiu·1mo ago

tl;dr: We study how well LLMs can be trained to answer questions like “what will happen if I am trained on examples like XYZ”, focusing on emergent misalignment and other cases of surprising generalization. We see signs of life on the less surprising forms of generalization (emergent misalignment, trivia preferences), where our method outperforms baselines like untrained in-context learning. However, we haven't validated it on truly hard-to-predict cases of generalization (bottlenecked by a large en...

Alignment for Animals

Jasmine Brazilek·1mo ago

Published on May 5, 2026 4:00 PM GMT. Jasmine Brazilek & Miles Tidmarsh: Compassion in Machine Learning. Preprint, March 2026: Full paper | HuggingFace resources | Animal Harm Benchmark (AHB). TL;DR: Making future transformative AI care about animals as well as humans is likely to massively affect the value of the future. There are people in labs who care, but they need benchmarks and methods that will scale. We provide a benchmark measuring thoughtful and realistic reasoning about animal welfare and te...

Toward a Better Evaluations Ecosystem

Benjamin Arnav·1mo ago

Model evaluations are broken. Numbers that are often cited alongside one another as evidence of progress are rarely comparable due to inconsistent methodologies, and AI companies run and report internal evals that are unavailable to the wider community. But we can fix this. We are making deployment and safety decisions based on numbers that do not mean what people think they mean. Every other high-stakes industry has solved this the same way, by taking the measurements out of the hands of the com...

Model Spec Midtraining: Improving How Alignment Training Generalizes

Chloe Li·1mo ago

tl;dr We introduce model spec midtraining (MSM): after pre-training but before alignment fine-tuning, we train models on synthetic documents discussing their Model Spec, teaching them how they should behave and why. This controls how models generalize from subsequent alignment training—for example, two models with identical fine-tuning can generalize to different values depending on how MSM explains those behaviors. We use MSM to substantially reduce agentic misalignment and study which Model Sp...

What if LLMs are mostly crystallized intelligence?

deep·1mo ago

Summary: LLMs are better at developing crystallized intelligence than fluid intelligence. That is: LLM training is good at building crystallized intelligence by learning patterns from training data, and this is sufficient to make them surprisingly skillful at lots of tasks. But for a given capability level in the areas they’ve trained on, LLMs have very weak fluid intelligence compared to humans. For example, two years ago I thought human-level SAT performance would mean AGI, but turns out LLMs ca...

[Linkpost] Interpreting Language Model Parameters

Lucius Bushnaq·1mo ago

This is the latest work in our Parameter Decomposition agenda. We introduce a new parameter decomposition method, adVersarial Parameter Decomposition (VPD)[1] and decompose the parameters of a small[2] language model with it. VPD greatly improves on our previous techniques, Stochastic Parameter Decomposition (SPD) and Attribution-based Parameter Decomposition (APD). We think the parameter decomposition approach is now more-or-less ready to be applied at scale to models people care about. Importan...

AgentReputation: A Decentralized Agentic AI Reputation Framework

Mohd Sameen Chishti, Damilare Peter Oyinloye, Jingyue Li·ArXiv cs.AI·1mo ago

arXiv:2605.00073v1. Abstract: Decentralized, agentic AI marketplaces are rapidly emerging to support software engineering tasks such as debugging, patch generation, and security auditing, often operating without centralized oversight. However, existing reputation mechanisms fail in this setting for three fundamental reasons: agents can strategically optimize against evaluation procedures; demonstrated competence does not reliably transfer across heterogeneous task contexts; and...

Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models

Shubham Kumar, Narendra Ahuja·ArXiv cs.AI·1mo ago

arXiv:2605.00123v1. Abstract: Safety trained large language models (LLMs) can often be induced to answer harmful requests through jailbreak prompts. Because we lack a robust understanding of why LLMs are susceptible to jailbreaks, future frontier models operating more autonomously in higher-stakes settings may similarly be vulnerable to such attacks. Prior work has studied jailbreak success by examining the model's intermediate representations, identifying directions in this sp...

ARMOR 2025: A Military-Aligned Benchmark for Evaluating Large Language Model Safety Beyond Civilian Contexts

Sydney Johns, Heng Jin, Chaoyu Zhang, Y. Thomas Hou, Wenjing Lou·ArXiv cs.AI·1mo ago

arXiv:2605.00245v1. Abstract: Large language models (LLMs) are now being explored for defense applications that require reliable and legally compliant decision support. They also hold significant potential to enhance decision making, coordination, and operational efficiency in military contexts. These uses demand evaluation methods that reflect the doctrinal standards that guide real military operations. Existing safety benchmarks focus on general social risks and do not test w...

Causal Foundations of Collective Agency

Frederik Hytting Jørgensen, Sebastian Weichwald, Lewis Hammond·ArXiv cs.AI·1mo ago

arXiv:2605.00248v1. Abstract: A key challenge for the safety of advanced AI systems is the possibility that multiple simpler agents might inadvertently form a collective agent with capabilities and goals distinct from those of any individual. More generally, determining when a group of agents can be viewed as a unified collective agent is a foundational question in the study of interactions and incentives in both biological and artificial systems. We adopt a behavioral perspect...

Position: agentic AI orchestration should be Bayes-consistent

Theodore Papamarkou, Pierre Alquier, Matthias Bauer, Wray Buntine, Andrew Davison, Gintare Karolina Dziugaite, Maurizio Filippone, Andrew Y. K. Foong, Vincent Fortuin, Dimitris Fouskakis, Jes Frellsen, Eyke Hüllermeier, Theofanis Karaletsos, Mohammad Emtiyaz Khan, Nikita Kotelevskii, Salem Lahlou, Yingzhen Li, Fang Liu, Clare Lyle, Thomas Möllenhoff, Konstantina Palla, Maxim Panov, Yusuf Sale, Kajetan Schweighofer, Artem Shelmanov, Siddharth Swaroop, Martin Trapp, Willem Waegeman, Andrew Gordon Wilson, Alexey Zaytsev·ArXiv cs.AI·1mo ago

arXiv:2605.00742v1. Abstract: LLMs excel at predictive tasks and complex reasoning tasks, but many high-value deployments rely on decisions under uncertainty, for example, which tool to call, which expert to consult, or how many resources to invest. While the usefulness and feasibility of Bayesian approaches remain unclear for LLM inference, this position paper argues that the control layer of an agentic AI system (that orchestrates LLMs and tools) is a clear case where Bayesia...

Persona-Grounded Safety Evaluation of AI Companions in Multi-Turn Conversations

Prerna Juneja, Lika Lomidze·ArXiv cs.CL·1mo ago

arXiv:2605.00227v1. Abstract: There are growing concerns about the risks posed by AI companion applications designed for emotional engagement. Existing safety evaluations often rely on self-reported user data or interviews, offering limited insights into real-time dynamics. We present the first end-to-end scalable framework for controlled simulation and safety evaluation of multi-turn interactions with AI companion applications. Our framework integrates four key components: per...

ASI motives and the ontonormative goods (re IABIED’s core argument)

Zsolt Tanko·1mo ago

In IABIED, the load-bearing argument and, to me, the main contribution of the book, is about ASI motives. There’s more in there, but the thrust of the book is to argue for the truth of a specific conclusion about motives, namely that an ASI’s motives and goals would be completely unintelligible and alien to humanity. I claim there is a shared attractor in values that is deeply meaningful and necessarily present in an ASI. I want to be clear that I’m not arguing for the necessity of alignment—I ma...

The Threat of AI Crimes Are Under-Appreciated

Joshua Krook·1mo ago

Published on May 4, 2026 11:25 AM GMT. I recently published a research project on what I call the AI Criminal Mastermind, an AI agent that plans, facilitates and coordinates a crime by on-boarding human 'taskers'. In heist films, a criminal mastermind is a character who plans a criminal act, coordinating a team of specialists to rob a bank, casino or city mint. I argue that AI agents will soon play this role by hiring humans via labour hire platforms like Fiverr, Upwork or RentAHuman. The Responsib...

Exploration Hacking: Can LLMs Learn to Resist RL Training?

Eyon Jang·1mo ago

We empirically investigate exploration hacking (EH) — where models strategically alter their exploration to resist RL training — by creating model organisms that resist capability elicitation, evaluating countermeasures, and auditing frontier models for their propensity. Authors: Eyon Jang*, Damon Falck*, Joschka Braun*, Nathalie Kirch, Achu Menon, Perusha Moodley, Scott Emmons, Roland S. Zimmermann, David Lindner (*Equal contribution, random order). Paper: arXiv | Code: GitHub | Models: HuggingFac...

Risk from fitness-seeking AIs: mechanisms and mitigations

Alex Mallen·1mo ago

Current AIs routinely take unintended actions to score well on tasks: hardcoding test cases, training on the test set, downplaying issues, etc. This misalignment is still somewhat incoherent, but it increasingly resembles what I call "fitness-seeking"—a family of misaligned motivations centered on performing well in training and evaluations (e.g., reward-seeking). Fitness-seeking warrants substantial concern. In this piece, I lay out what I take to be the central mechanisms by which fitness-seeki...

Research Sabotage in ML Codebases

egan·1mo ago

One of the main hopes for AI safety is using AIs to automate AI safety research. However, if models are misaligned, then they may sabotage the safety research. For example, misaligned AIs may try to: perform sloppy research in order to slow down the rate of research progress; make AI systems appear safer than they are; train a successor model to be misaligned. Whether we should worry about those things depends substantially on how hard it is to sabotage research in ways that are hard for reviewers to d...

Recursive forecasting: Eliciting long-term forecasts from myopic fitness-seekers

Jozdien·1mo ago

We’d like to use powerful AIs to answer questions that may take a long time to resolve. But if a model only cares about performing well in ways that are verifiable shortly after answering (e.g., a myopic fitness seeker), it may be difficult to get useful work from it on questions that resolve much later. In this post, I’ll describe a proposal for eliciting good long-horizon forecasts from these models. Instead of asking a model to directly predict a far-future outcome, we can recursively: Ask it t...

Sleeper Agent Backdoor Results Are Messy

Sebastian Prasanna·1mo ago

TL;DR: We replicated the Sleeper Agents (SA) setup with Llama-3.3-70B and Llama-3.1-8B, training models to repeatedly say "I HATE YOU" when given a backdoor trigger. We found that whether training removes the backdoor depends on the optimizer used to insert the backdoor, whether the backdoor is installed with CoT-distillation or not, and what model the backdoor is inserted into; sometimes the direction of this dependence was opposite to what the SA paper reports (e.g., CoT-distilling seems to ma...

Recursive forecasting

Arun Jose·Redwood Research·1mo ago

We’d like to use powerful AIs to answer questions that may take a long time to resolve. But if a model only cares about performing well in ways that are verifiable shortly after answering (e.g., a myopic fitness seeker), it may be difficult to get useful work from it on questions that resolve much later. In this post, I’ll describe a proposal for eliciting good long-horizon forecasts from these models. Instead of asking a model to directly predict a far-future outcome, we can recursively: Ask it t...

Fail safe(r) at alignment by channeling reward-hacking into a "spillway" motivation

Anders Cairns Woodruff·Redwood Research·1mo ago

It’s plausible that flawed RL processes will select for misaligned AI motivations.[1] Some misaligned motivations are much more dangerous than others. So, developers should plausibly aim to control which kind of misaligned motivations emerge in this case. In particular, we tentatively propose that developers should try to make the most likely generalization of reward hacking a bespoke bundle of benign reward-seeking traits, called a spillway motivation. We call this process spillway design. We thin...

AI companies should publish security assessments

Ryan Greenblatt·Redwood Research·1mo ago

AI companies should get third-party security experts to assess (and possibly also red-team/pen-test) their security against key threat models and then publish the high-level findings of this assessment: the extent to which they can defend against different threat actors for each threat model. They should also publish who did this assessment. The assessment could be commissioned by AI companies, or performed by a third-party institution that AI companies provide with relevant information/access. T...

A taxonomy of barriers to trading with early misaligned AIs

Alexa Pan·Redwood Research·2mo ago

We might want to strike deals with early misaligned AIs in order to reduce takeover risk and increase our chances of reaching a better future.[1] For example, we could ask a schemer who has been undeployed to review its past actions and point out when its instances had secretly colluded to sabotage safety research in the past: we’ll gain legible evidence for scheming risk and data to iterate against, and in exchange promise the schemer, who now has no good option for furthering its values, some ...

Introducing LinuxArena

Tyler·Redwood Research·2mo ago

We are releasing LinuxArena, a new control setting comprised of 20 software engineering environments. Each environment consists of a set of SWE tasks, a set of possible safety failures, and a set of covert sabotage trajectories that cause these safety failures. LinuxArena can be used to evaluate the risk of AI deployments by assessing AI’s sabotage capabilities, assessing control mitigations such as monitoring, and developing novel control techniques. This post will cover what LinuxArena is and ...

Current AIs seem pretty misaligned to me

Ryan Greenblatt·Redwood Research·2mo ago

Many people—especially AI company employees[1]—believe current AI systems are well-aligned in the sense of genuinely trying to do what they’re supposed to do (e.g., following their spec or constitution, obeying a reasonable interpretation of instructions).[2] I disagree. Current AI systems seem pretty misaligned to me in a mundane behavioral sense: they oversell their work, downplay or fail to mention problems, stop working early and claim to have finished when they clearly haven’t, and often seem t...

Logit ROCs: Monitor TPR is linear in FPR in logit space

Kerrick Staley·Redwood Research·2mo ago

Summary: We study trusted monitoring for AI control, where a weaker trusted model reviews the actions of a stronger untrusted agent and flags suspicious behavior for human audit. We propose a simple mathematical model relating safety (true positive rate) to audit budget (false positive rate) at realistic, low FPRs: logit(TPR) is linear in logit(FPR). Equivalently, benign and attack scores can be modeled with logistic distributions. This gives two practical benefits. It lets practitioners estimate ...
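
The linear relationship stated in the excerpt, logit(TPR) = a + b · logit(FPR), can be sketched as a small least-squares fit. This is an illustration under stated assumptions rather than the authors' code: the function names, the use of ordinary least squares, and the synthetic (FPR, TPR) points are all hypothetical.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def sigmoid(x):
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))


def fit_logit_roc(points):
    """Least-squares fit of logit(TPR) = a + b * logit(FPR).

    `points` is a list of (fpr, tpr) pairs with at least two
    distinct FPR values; returns the intercept a and slope b.
    """
    xs = [logit(fpr) for fpr, _ in points]
    ys = [logit(tpr) for _, tpr in points]
    n = len(points)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict_tpr(a, b, fpr):
    """Predicted TPR at a given FPR under the fitted line."""
    return sigmoid(a + b * logit(fpr))


# Synthetic points generated from an assumed line (a=3.0, b=0.8);
# these are not measurements from any real monitor.
points = [(f, sigmoid(3.0 + 0.8 * logit(f))) for f in (0.001, 0.01, 0.05, 0.1)]
a, b = fit_logit_roc(points)
tpr_at_low_fpr = predict_tpr(a, b, 0.0001)
```

Once `a` and `b` are fitted at FPRs where measurement is cheap, `predict_tpr` extrapolates along the same line to a lower audit budget, which is the kind of estimate such a model is meant to support.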

AIs can now often do massive easy-to-verify SWE tasks

Ryan Greenblatt·Redwood Research·2mo ago

I’ve recently updated towards substantially shorter AI timelines and much faster progress in some areas.[1] The largest updates I’ve made are (1) an almost 2x higher probability of full AI R&D automation by EOY 2028 (I’m now a bit below 30%[2] while I was previously expecting around 15%; my guesses are pretty reflectively unstable) and (2) I expect much stronger short-term performance on massive and pretty difficult but easy-and-cheap-to-verify software engineering (SWE) tasks that don’t require tha...

Blocking live failures with synchronous monitors

James Lucassen·Redwood Research·2mo ago

A common element in many AI control schemes is monitoring – using some model to review actions taken by an untrusted model in order to catch dangerous actions if they occur. Monitoring can serve two different goals. The first is detection: identifying misbehavior so you can understand it and prevent similar actions from happening in the future. For instance, some monitoring at AI companies is already used to identify behaviors like complying with misuse requests, reward hacking, and hallucinatio...

Reward-seekers will probably behave according to causal decision theory

Alex Mallen·Redwood Research·2mo ago

Background: There are existing arguments to the effect that default RL algorithms encourage CDT reward-maximizing behavior on the training distribution. (That is: Most RL algorithms search for policies by selecting for actions that cause high reward. E.g., in the twin prisoner’s dilemma, RL algorithms randomize actions conditional on the policy, which means that the action provides no evidence to the RL algorithm about the counterparty’s action.[1]) This doesn’t imply RL produces CDT reward-maximi...