Estimating LLM Grading Ability and Response Difficulty in Automatic Short Answer Grading via Item Response Theory
arXiv:2605.00238v1

Abstract: Automated short answer grading (ASAG) with large language models (LLMs) is commonly evaluated with aggregate metrics such as macro-F1 and Cohen's kappa. However, these metrics provide limited insight into how grading performance varies across student responses of differing grading difficulty. We introduce an evaluation framework for LLM-based ASAG based on item response theory (IRT), which models grading correctness as a function of latent grader ability and response difficulty.
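Under the standard one-parameter (Rasch) IRT formulation, this amounts to modeling P(grader j grades response i correctly) = σ(θ_j − b_i), where θ_j is grader j's latent grading ability and b_i is response i's latent grading difficulty. Below is a minimal sketch of fitting such a model by maximum likelihood; the synthetic data, function names, and plain gradient-ascent routine are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_rasch(correct, lr=0.1, epochs=2000):
    """Fit a 1PL (Rasch) IRT model by gradient ascent on the log-likelihood.

    correct: binary matrix (n_graders x n_responses), 1 where the LLM
             grader's score matched the human reference label.
    Returns (ability, difficulty) estimates, one per grader / response.
    """
    n_graders, n_responses = correct.shape
    ability = np.zeros(n_graders)       # theta_j: latent grading ability
    difficulty = np.zeros(n_responses)  # b_i: latent grading difficulty
    for _ in range(epochs):
        p = sigmoid(ability[:, None] - difficulty[None, :])
        resid = correct - p             # gradient of the Bernoulli log-likelihood
        ability += lr * resid.sum(axis=1) / n_responses
        difficulty -= lr * resid.sum(axis=0) / n_graders
        difficulty -= difficulty.mean() # anchor the scale for identifiability
    return ability, difficulty

# Toy usage: 3 hypothetical LLM graders, 200 synthetic responses.
rng = np.random.default_rng(0)
true_theta = np.array([1.0, 0.0, -0.5])
true_b = rng.normal(size=200)
probs = sigmoid(true_theta[:, None] - true_b[None, :])
correct = (rng.random(probs.shape) < probs).astype(float)
theta_hat, b_hat = fit_rasch(correct)
print("estimated grader abilities:", np.round(theta_hat, 2))
```

Anchoring the mean difficulty at zero resolves the usual translation indeterminacy of the Rasch model (adding a constant to every θ_j and b_i leaves the likelihood unchanged), so grader abilities are interpretable relative to an average-difficulty response.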