Can LLMs Generate High-Quality Test Cases for Algorithm Problems?
TestCase-Eval: A Systematic Evaluation of Fault Coverage and Exposure
Zheyuan Yang Zexi Kuang Xue Xia Yilun Zhao
Abstract
We introduce TestCase-Eval, a new benchmark for the systematic evaluation of LLMs in test-case generation. TestCase-Eval includes 500 algorithm problems and 100,000 human-crafted solutions from the Codeforces platform. It focuses on two pivotal tasks: (1) Fault Coverage, which measures how well LLM-generated test sets probe diverse input scenarios and cover a wide range of potential failure modes; and (2) Fault Exposure, which evaluates whether LLMs can craft a tailored test input that reveals a specific incorrect code implementation. We provide a comprehensive assessment of 19 state-of-the-art open-source and proprietary LLMs on TestCase-Eval, offering insights into their strengths and limitations in generating effective test cases for algorithm problems.