SySDEM - Synthetic and Stratified Degradations for Evaluating Metrics for Long-Form Text in Medical Domain
Proceedings of the Fifth Machine Learning for Health Symposium, PMLR 297:1075-1095, 2026.
Abstract
The evaluation of long-form text in the medical domain is increasingly reliant on automated metrics. However, the reliability of these metrics themselves is often assumed rather than rigorously tested, especially when long-form generations are the expected output. We address this gap by proposing SySDEM - Synthetic and Stratified Degradations for Evaluating Metrics, a framework to evaluate the quality of reference-based evaluation metrics. Using this framework, we demonstrate a method that iteratively perturbs candidate texts to assess the sensitivity and discrimination power of reference-based text evaluation metrics. Through experiments on the ACI-Bench clinical note generation dataset, we demonstrate the importance of evaluating evaluation metrics for long-form text, highlighting the need for robust validation methodologies.
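The idea of iteratively perturbing a candidate text and watching how a reference-based metric responds can be illustrated with a minimal sketch. This is not the authors' implementation: the perturbation (cumulative random token deletion), the toy metric (unigram F1), and the names `degradation_ladder` and `unigram_f1` are all assumptions chosen for illustration. A sensitive metric would be expected to score each more heavily degraded candidate lower than the previous one.

```python
import random
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Toy reference-based metric (an assumption): F1 over unigram overlap."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def degradation_ladder(text: str, steps: int, seed: int = 0) -> list[str]:
    """Hypothetical perturbation: delete one more random token at each step.

    Returns steps + 1 texts, from the unperturbed text (level 0) to the most
    degraded one, so each level contains the damage of all previous levels.
    """
    rng = random.Random(seed)
    tokens = text.split()
    ladder = [" ".join(tokens)]
    for _ in range(steps):
        if tokens:
            tokens.pop(rng.randrange(len(tokens)))
        ladder.append(" ".join(tokens))
    return ladder


# Made-up example sentence, not taken from any dataset.
reference = "patient reports chest pain for two days and denies fever"
ladder = degradation_ladder(reference, steps=4)
scores = [unigram_f1(candidate, reference) for candidate in ladder]

# A metric that tracks this degradation should drop at every level.
is_monotone = all(a > b for a, b in zip(scores, scores[1:]))
```

Here the candidate starts as a copy of the reference, so the toy metric begins at its maximum and falls with each deletion. Swapping in a different metric or a different perturbation type while keeping the same ladder is one way to compare how sharply metrics separate adjacent degradation levels.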