META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI

Paper title

META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI

Authors

Sun, Liangtai; Chen, Xingyu; Chen, Lu; Dai, Tianle; Zhu, Zichen; Yu, Kai

Abstract

Task-oriented dialogue (TOD) systems are widely used by mobile phone intelligent assistants to accomplish tasks such as calendar scheduling or hotel reservation. Current TOD systems usually focus on multi-turn text/speech interaction and then call back-end APIs designed for TOD to perform the task. However, this API-based architecture greatly limits the information-searching capability of intelligent assistants and may even lead to task failure if TOD-specific APIs are not available or the task is too complicated to be executed by the provided APIs. In this paper, we propose a new TOD architecture: the GUI-based task-oriented dialogue system (GUI-TOD). A GUI-TOD system can directly perform GUI operations on real apps and execute tasks without invoking TOD-specific back-end APIs. Furthermore, we release META-GUI, a dataset for training a Multi-modal convErsaTional Agent on mobile GUI. We also propose a multi-model action prediction and response model, which shows promising results on META-GUI. The dataset, code, and leaderboard are publicly available.
