Can LLMs Generate High-Quality Test Cases for Algorithm Problems?
TestCase-Eval: A Systematic Evaluation of Fault Coverage and Exposure
Zheyuan Yang Zexi Kuang Xue Xia Yilun Zhao
Abstract
We introduce TestCase-Eval, a new benchmark for the systematic evaluation of LLMs in test-case generation. TestCase-Eval includes 500 algorithm problems and 100,000 human-crafted solutions from the Codeforces platform. It focuses on two pivotal tasks: (1) Fault Coverage, which measures how well LLM-generated test sets probe diverse input scenarios and cover a wide range of potential failure modes; and (2) Fault Exposure, which evaluates whether LLMs can craft a tailored test input that reveals a specific incorrect code implementation. We provide a comprehensive assessment of 19 state-of-the-art open-source and proprietary LLMs on TestCase-Eval, offering insights into their strengths and limitations in generating effective test cases for algorithm problems.