Ensuring Evaluation-Application Alignment in LLM Benchmarks (EEA-Bench)
Acronym: EEA-Bench
Lead: Lucas Fonseca Lage
Status: Ongoing
Start: 2025
End: 2028
Benchmark scores are poor predictors of real-world utility, yet deployment decisions continue to rely on them. Existing benchmarks suffer from limitations such as shallow task coverage, a disconnect between downstream tasks and end-user applications, and construction driven by what data is available rather than by what matters. In this project we investigate how to measure the correlation between benchmark downstream task performance and end-user application performance, using key performance indicators (KPIs) such as task success rate, correction rate, and time efficiency on user-facing applications. Our findings aim to inform both benchmark design and model selection practices, offering practitioners a more grounded basis for deployment decisions.
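The core measurement could be sketched as a rank correlation between per-model benchmark scores and an observed application KPI. The snippet below is a minimal illustration, not the project's actual methodology; all model scores and KPI values are invented placeholders, and Spearman correlation is just one plausible choice of statistic.

```python
def ranks(values):
    """Assign 1-based average ranks, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Group equal values together and give them their average rank.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i..j
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Illustrative placeholder data: one entry per candidate model.
benchmark_scores = [62.1, 71.4, 55.0, 80.3, 68.9]   # e.g. benchmark accuracy (%)
task_success_rate = [0.58, 0.61, 0.49, 0.74, 0.70]  # e.g. end-user task success KPI

rho = spearman(benchmark_scores, task_success_rate)
print(f"Spearman rho = {rho:.2f}")  # → Spearman rho = 0.90
```

A high rank correlation would suggest the benchmark preserves the ordering of models that matters for deployment; a low one would flag evaluation-application misalignment even if absolute benchmark scores look reasonable.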