Is ProgramBench Impossible?

LessWrong

ProgramBench is a new coding benchmark that all frontier models spectacularly fail. We’ve been on a quest for “hard benchmarks” for a while, so it’s refreshing to see a benchmark where top models do badly. Unfortunately, ProgramBench has one big problem: it’s impossible!

What is ProgramBench?

ProgramBench tests if a model can recreate a program from a “clean room” environment. The model is given only a bit of documentation and black-box access to the program (all the programs are CLIs), then tasked...

