GPT 5.4 is a big step for Codex

·Interconnects··

I’m a little late to this model review, but that has given me more time to think about the axes that matter for agents. Traditional benchmarks reduce model performance to a single score of correctness – they always have because that was simple, easy to quickly use to gauge performance, and so on. This is also advice that I give to people trying to build great benchmarks – it needs to reduce to one number that is interpretable. This is likely still going to be true in a year or two, and benchmark...

Read full article →

Related Articles

OpenAI’s o1 correctly diagnosed 67% of ER patients vs. 50-55% by triage doctors
donsupreme · Hacker News · 18d ago
Accelerating Gemma 4: faster inference with multi-token prediction drafters
amrrs · Hacker News · 15d ago
A couple million lines of Haskell: Production engineering at Mercury
unignorant · Hacker News · 18d ago
Using “underdrawings” for accurate text and numbers
samcollins · Hacker News · 19d ago
ProgramBench: Can language models rebuild programs from scratch?
jonbaer · Hacker News · 14d ago