Ensuring Evaluation-Application Alignment in LLM Benchmarks (EEA-Bench)
Acronym: EEA-Bench
Lead: Lucas Fonseca Lage
Status: Ongoing
Start: 2025
End: 2028
Benchmark scores are poor predictors of real-world utility, yet deployment decisions continue to rely on them. Existing benchmarks suffer from limitations such as shallow task coverage, a disconnect between downstream tasks and end-user applications, and construction driven by what data is available rather than by what matters. In this project we investigate how to measure the correlation between benchmark downstream task performance and end-user application performance, using key performance indicators (KPIs) such as task success rate, correction rate, and time efficiency on user-facing applications. Our findings aim to inform both benchmark design and model selection practices, offering practitioners a more grounded basis for deployment decisions.
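The core measurement could be sketched as a rank correlation between per-model benchmark scores and an observed application KPI. The snippet below is a minimal illustration, not the project's actual methodology; all model scores and KPI values are invented placeholders, and Spearman correlation is just one plausible choice of statistic.

```python
def ranks(values):
    """Assign 1-based average ranks, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Group equal values together and give them their average rank.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i..j
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Illustrative placeholder data: one entry per candidate model.
benchmark_scores = [62.1, 71.4, 55.0, 80.3, 68.9]   # e.g. benchmark accuracy (%)
task_success_rate = [0.58, 0.61, 0.49, 0.74, 0.70]  # e.g. end-user task success KPI

rho = spearman(benchmark_scores, task_success_rate)
print(f"Spearman rho = {rho:.2f}")  # → Spearman rho = 0.90
```

A high rank correlation would suggest the benchmark preserves the ordering of models that matters for deployment; a low one would flag evaluation-application misalignment even if absolute benchmark scores look reasonable.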