Spoken dialogue agents promise to provide spoken language access to many types of information services, with potential benefits of remote or hands-free access, ease of use, and naturalness. Since it is often unclear which dialogue strategies lead to better agent performance, progress on dialogue agents requires work in both dialogue modeling and agent evaluation. This research has been concerned with the development and application of PARADISE, an evaluation framework for spoken dialogue systems. PARADISE separates how an agent uses dialogue strategies (e.g., confirm, summarize) from what an agent achieves in terms of task requirements (e.g., obtain information). The framework both integrates and improves upon previous work by 1) measuring performance as a weighted sum of a task-based success measure and multiple dialogue-based cost measures, 2) enabling `transcript-free' evaluation by decoupling the representation of the task from an agent's dialogue behaviors, 3) using the kappa statistic (which normalizes for task complexity) as the success measure, enabling comparisons across tasks, 4) integrating existing subjective and objective cost measures by normalizing for scale, 5) solving for weights on the success and cost measures via multivariate linear regression, with user satisfaction as an external validation criterion for performance, and 6) supporting calculation of kappa and costs at the subdialogue as well as the dialogue level, so that performance can be evaluated over subdialogues and multiple factors can be tested in a single dialogue experiment.
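In symbols, points 1), 3), 4), and 5) combine into a single performance function; the following is a sketch in our notation, where $\alpha$ is the weight on task success, the $w_i$ weight the $n$ cost measures $c_i$, and $\mathcal{N}$ denotes Z-score normalization:
\[
\mathrm{Performance} = \alpha \, \mathcal{N}(\kappa) \;-\; \sum_{i=1}^{n} w_i \, \mathcal{N}(c_i),
\qquad
\mathcal{N}(x) = \frac{x - \bar{x}}{\sigma_x},
\qquad
\kappa = \frac{P(A) - P(E)}{1 - P(E)},
\]
where, in the standard definition of kappa, $P(A)$ is the observed agreement between the agent's task solution and the task key and $P(E)$ is the agreement expected by chance; $\alpha$ and the $w_i$ are estimated by regressing user satisfaction on the normalized measures.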
PARADISE is currently being applied to several AT\&T spoken dialogue systems. For example, we have conducted an experimental evaluation of a cooperative versus a literal response strategy in TOOT, a spoken dialogue agent that allows users to access train schedules stored on the web. Using hypothesis testing methods, we show that a combination of response strategy, application task, and task/strategy interactions accounts for a range of performance differences. Using the PARADISE framework to estimate an overall performance function, we identify interdependencies between speech recognition and response strategy. We have also used PARADISE to examine the effects of a short tutorial session with a voice-enabled email retrieval system. The results support our hypothesis that novice users who receive the tutorial perform comparably to expert users and outperform novice users who do not. Finally, we have evaluated the utility of an adaptable spoken dialogue system, in which the user may change the system's dialogue strategies at any point in a dialogue. Our results show that an adaptable version of TOOT generally outperforms a non-adaptable version, and that the utility of adaptation depends on TOOT's initial (confirmation and initiative) dialogue strategies.
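To illustrate how the performance-function estimation step might be carried out in practice, the following is a minimal sketch that fits $\alpha$ and the $w_i$ by least squares. All data values, the choice of cost measures (elapsed time and system turns), and the variable names are illustrative assumptions, not figures from the TOOT or email experiments.
\begin{verbatim}
import numpy as np

# Hypothetical per-dialogue data (illustrative, not experimental results):
# user satisfaction, task success (kappa), and two cost measures per
# dialogue -- elapsed time in seconds and number of system turns.
satisfaction = np.array([25.0, 31.0, 18.0, 28.0, 22.0, 30.0])
kappa = np.array([0.60, 0.85, 0.40, 0.75, 0.55, 0.90])
costs = np.array([[210.0, 12], [150, 9], [300, 17],
                  [180, 10], [240, 14], [140, 8]])

def z_normalize(x):
    # Scale normalization N(.): zero mean, unit variance.
    return (x - x.mean()) / x.std()

# Design matrix: intercept, N(kappa), and N(c_i) for each cost measure.
X = np.column_stack([np.ones_like(kappa), z_normalize(kappa)]
                    + [z_normalize(costs[:, i])
                       for i in range(costs.shape[1])])

# Multivariate linear regression with user satisfaction as the external
# validation criterion for performance.
coef, _, _, _ = np.linalg.lstsq(X, satisfaction, rcond=None)
alpha = coef[1]           # weight on normalized task success
cost_weights = -coef[2:]  # w_i: costs enter the performance function
                          # negatively, so their coefficients are negated
print(f"alpha = {alpha:.3f}, cost weights = {cost_weights}")
\end{verbatim}
With real data, the sign and magnitude of each fitted weight indicate how strongly that success or cost measure predicts user satisfaction, which is what allows performance comparisons across strategies and tasks.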