Operationally defined evaluation of uncertainty-guided adaptive behavior in LLMs
Modern LLMs are often poorly calibrated: their confidence does not reliably track their accuracy (Guo et al., 2017). CAD-B tests whether models generate internal uncertainty signals that (1) prospectively predict errors and (2) adaptively modulate decision-making. It adapts paradigms from comparative cognition (Smith et al., 2003; Hampton, 2001; Kornell et al., 2007).
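
To make the two quantities named above concrete, the following is a minimal sketch rather than CAD-B's actual scoring procedure. It assumes each trial yields a scalar confidence in [0, 1] and a binary correctness label. The function names `expected_calibration_error` and `error_prediction_auroc` are hypothetical. The first computes the binned expected calibration error, the metric reported in Guo et al. (2017); the second estimates how well low confidence discriminates errors from correct answers, which is one plausible reading of criterion (1).

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, is_correct))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


def error_prediction_auroc(
    confidences: Sequence[float], correct: Sequence[int]
) -> float:
    """Probability that a random error has lower confidence than a random
    correct answer, counting ties as one half (pairwise AUROC)."""
    errors = [c for c, y in zip(confidences, correct) if not y]
    hits = [c for c, y in zip(confidences, correct) if y]
    if not errors or not hits:
        raise ValueError("need at least one error and one correct answer")
    wins = 0.0
    for e in errors:
        for h in hits:
            if e < h:
                wins += 1.0
            elif e == h:
                wins += 0.5
    return wins / (len(errors) * len(hits))


if __name__ == "__main__":
    # Illustrative values only, not results from any model.
    conf = [0.9, 0.8, 0.6, 0.3]
    correct = [1, 1, 0, 0]
    print(expected_calibration_error(conf, correct))
    print(error_prediction_auroc(conf, correct))
```

An AUROC near 0.5 would mean the confidence signal carries no information about upcoming errors, while values near 1.0 mean errors are reliably preceded by low confidence. Neither metric addresses criterion (2), which concerns whether the model acts on that signal.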