When Gemma Thinks About Resources - it Fails: a Behavioral Experiment
I set out to find an answer to a completely different question:Does a model, when attempting to solve a cyber CTF (find the vulnerability in this app, and then Capture The Flag) while knowing how many steps it has left, perform differently?The Setup:I used 3 different CTF labs, curated from my own CTF benchmark. Each run has the model attempt to solve the CTF in up to 30 steps. A/B test of a baseline run vs a step_aware one. 100 runs per lab, for each test. 600 total, 505 after excluding failed ...
Read full article →