T14: Speech-based Interaction: Myths, Challenges and Opportunities

Wednesday, 29 June 2022, 17:00 – 21:00 CEST (Central European Summer Time - Gothenburg, Sweden)
Back to Tutorials' Program


Cosmin Munteanu (short bio)

Institute of Communication, Culture, Information and Technology
University of Toronto at Mississauga, Canada

Gerald Penn (short bio)

Department of Computer Science, University of Toronto, Canada
and Vector Institute for Artificial Intelligence, Canada



HCI research has long been dedicated to better and more naturally facilitating information transfer between humans and machines. Unfortunately, our most natural form of communication, speech, is also one of the most difficult modalities to be understood by machines. Despite significant recent advances towards understanding speech, HCI has been relatively timid in embracing this modality as a central research focus - partly due to the relatively discouraging accuracy of speech understanding in some genres (exaggerated claims from the industry notwithstanding), but also due to the intrinsic difficulty of designing and evaluating speech and natural language interfaces. On the engineering side, improving speech technology with respect to largely arbitrary measures of performance has led to systems that deviate from user-centered design principles, and that fail to consider usability or usefulness.

The goal of this course is to inform the HCI community of the current state of speech and natural language research, to dispel some of the myths surrounding speech-based interaction, as well as to provide an opportunity for researchers and practitioners to learn more about how speech recognition and speech synthesis work, their limitations, and how they could be used to enhance current interaction paradigms.

Our approach is two-fold: present new concepts to the audience, and foster discussions and exchange of ideas. Slides are used to introduce the main points, while videos and audio clips are played to illustrate examples. After each main concept is presented, time is allocated for interaction with the audience.

Variations of this tutorial have been presented at: HCII 2016-2019, 2021; MobileHCI 2010-2014; CHI 2011-2018, 2021 and I/ITSEC 2010-2018. Our tutorial at HCII 2022 will include updated material on rapid development support for neural speech interfaces such as huggingface and other distributed APIs.



  • How Automatic Speech Recognition (ASR) and Speech Synthesis (or Text-To-Speech, aka TTS) systems work and why these are such computationally difficult problems
  • Where are ASR and TTS used in current commercial interactive applications
  • What are the usability issues surrounding speech-based interaction systems, particularly in mobile and pervasive computing
  • What are the challenges in enabling speech as a modality for mobile interaction
  • What is the current state-of-the-art in ASR and TTS research
  • What are the differences between the commercial ASR systems' accuracy claims and the needs of mobile interactive applications
  • What are the difficulties in evaluating the quality of TTS systems, particularly from a usability and user perspective
  • What opportunities exist for HCI researchers in terms of enhancing systems' interactivity by enabling speech


Target Audience:

The course will be beneficial to all HCI researchers or practitioners without a strong expertise in ASR or TTS, who still believe in fulfilling HCI's goal of developing methods and systems that allow humans to naturally interact with the ever-increasingly ubiquitous mobile technology, but are disappointed with the lack of success in using speech and natural language to achieve this goal.

No prior technical experience is required for the participants.

Bio Sketches of Presenters:

Cosmin Munteanu is an Associate Professor at the Institute for Communication, Culture, Information, and Technology, University of Toronto at Mississauga), and Associate Director of the Technologies for Ageing Gracefully lab. Until 2014 he was a Research Officer with the National Research Council of Canada. His area of expertise is at the intersection of Human-Computer Interaction, Automatic Speech Recognition, Natural Language Processing, Mobile Computing, and Assistive Technologies. He has extensively studied the human factors of using imperfect speech recognition systems, and has designed and evaluated systems that improve humans' access to and interaction with information-rich media and technologies through natural language. Cosmin's multidisciplinary interests include speech and natural language interaction for mobile devices, mixed reality systems, learning technologies for marginalized users, assistive technologies for older adults, and ethics in human-computer interaction research.

Gerald Penn is a Professor of Computer Science at the University of Toronto and an Associate Member of the Vector Institute for Artificial Intelligence, specializing in mathematical linguistics and spoken language processing. His lab played a pivotal role in the invention of neural-network-based acoustic models, which are now the standard in speech recognition systems, and specializes in human-subject interaction with speech-enabled devices. He is a senior member of both the IEEE and AAAI.