INTERSPEECH 2022

SPECIAL SESSIONS & CHALLENGES

The Organizing Committee of INTERSPEECH 2022 is proud to announce that the following special sessions and challenges will be held at INTERSPEECH 2022.

Special sessions and challenges focus on relevant ‘special’ topics that may not be covered in regular conference sessions.

Papers must be submitted following the same schedule and procedure as regular papers, and they undergo the same review process by anonymous and independent reviewers.

List of sessions - in alphabetical order

Introduction

The focus of this Special Session is to provide a forum for researchers working on the massive naturalistic audio collection stemming from the NASA Apollo Missions. UTDallas-CRSS, with NSF support, has led the Fearless Steps Initiative, a continued effort spanning eight years that has resulted in the digitization and recovery of over 50,000 hours of original analog audio data, as well as the development of algorithms to extract meaningful information from this naturalistic data resource, including an initial release of pipeline diarization metadata for all 30 channels of the Apollo-11 and Apollo-13 missions. More than 500 sites worldwide have accessed the initial data. A current NSF Community Resource project is continuing this effort to recover the remaining Apollo missions (A7-A17; estimated at 150,000 hours of data), in addition to motivating collaborative speech and language technology research through the Fearless Steps Challenge series.

URL

Download PDF from HERE

Organizers

  • John H.L. Hansen, Univ. of Texas at Dallas
  • Christopher Cieri, Linguistic Data Consortium
  • James Horan, NIST
  • Aditya Joglekar, Univ. of Texas at Dallas
  • Midia Yousefi, Univ. of Texas at Dallas
  • Meena Chandra Shekar, Univ. of Texas at Dallas

Introduction

The INTERSPEECH 2022 Audio Deep Packet Loss Concealment Challenge is intended to stimulate research in the area of Audio Packet Loss Concealment (PLC).

PLC is an important part of audio telecommunications technology and codec development, and methods for performing PLC using machine learning approaches are now becoming viable for practical use. Packet loss, caused either by missing packets or by high packet jitter, is one of the main reasons for speech quality degradation in Voice over IP calls.

While some groups have published in this area, a lack of common datasets and evaluation procedures complicates the comparison of proposed methods and the establishment of clear baselines. With this challenge, we propose to address this situation: we will open-source a dataset based on real-world (as opposed to the common synthetic) packet loss traces and bring the community together to, for the first time, compare approaches in this field on a unified test set.

As the gold standard for audio quality evaluation is human evaluator ratings, we will evaluate submissions using a crowdsourced ITU-T P.808 CCR approach. The three approaches that achieve the highest average Mean Opinion Score on the blind set will be declared the winners of the INTERSPEECH 2022 Audio Deep Packet Loss Concealment Challenge. As an additional metric, to ensure that approaches are not degrading intelligibility, we will use the speech recognition rate, calculated using the Microsoft Cognitive Services Speech Recognition Service.
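
The sketch below illustrates, in simplified form, how these two evaluation signals could be aggregated: mean crowdsourced ratings per submission for ranking, and word error rate as an intelligibility check. It is not the official scoring tool; the per-clip CSV format and file names are assumptions for illustration.

```python
# Hedged sketch of the evaluation logic described above (not the official tool).
import csv
from statistics import mean

def mean_opinion_score(ratings_csv: str) -> float:
    """Average crowdsourced ratings; assumes one row per clip/rater pair."""
    with open(ratings_csv, newline="") as f:
        return mean(float(row["rating"]) for row in csv.DictReader(f))

def word_error_rate(ref_words: list, hyp_words: list) -> float:
    """Standard Levenshtein-based WER between reference and ASR hypothesis."""
    d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        d[i][0] = i
    for j in range(len(hyp_words) + 1):
        d[0][j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            sub = d[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / max(len(ref_words), 1)

# Submissions would then be ranked by descending mean opinion score, e.g.:
# top_three = sorted(submission_csvs, key=mean_opinion_score, reverse=True)[:3]
```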

To help participants during the challenge, we will provide access to our prototype "PLC-MOS" neural network model, which provides estimates of human ratings of audio files with healed packet losses.

Challenge details: https://aka.ms/plc_challenge

Data and example scripts: https://github.com/microsoft/PLC-Challenge

Organizers

  • Ross Cutler, Microsoft, USA
  • Ando Saabas, Microsoft, Estonia
  • Lorenz Diener, Microsoft, Estonia
  • Sten Sootla, Microsoft, Estonia
  • Solomiya Branets, Microsoft, Estonia

Introduction

The increasing proliferation of smart devices in our lives offers tremendous opportunities to improve the customer experience by leveraging spatial diversity and distributed computational and memory capability. At the same time, multi-sensor networks present unique challenges compared to single smart devices, such as synchronization, arbitration, and privacy.

The purpose of this special session is to promote research in multiple-device signal processing and machine learning by bringing together leading industry and academic experts to discuss topics including, but not limited to:

  • Multiple device audio datasets
  • Automatic speech recognition
  • Keyword spotting
  • Device arbitration (i.e. which device should respond to the user’s inquiry)
  • Speech enhancement: de-reverberation, noise reduction, echo reduction
  • Source separation
  • Speaker localization and tracking
  • Privacy sensitive signal processing and machine learning

The core motivation of this session is the recognition that "more is different". Robust speech recognition, enhancement, and analysis are foundational areas of speech signal processing with many publication outlets. The strength of the special session is to use the engineering specification of multiple devices as a backdrop against which creative solutions from these domains can be demonstrated. The session will co-locate top researchers working in the multi-sensor domain, and even though their specific applications may differ (e.g., enhancement vs. acoustic event detection), the similarity of the problem space encourages cross-pollination of techniques.

Organizers

  • Jarred Barber, M.S., Amazon Alexa Speech
  • Gregory Ciccarelli, Ph.D., Amazon Alexa Speech
  • Israel Cohen, Ph.D., Amazon Alexa Speech, Technion - Israel Institute of Technology
  • Tao Zhang, Ph.D., Amazon Alexa Speech
Contact: gcciccar@amazon.com, barbjarr@amazon.com, taozhng@amazon.com, isrcohen@amazon.com

Introduction

Speech technologies are increasingly used and now power a very wide range of applications. Automatic speech recognition systems have indeed dramatically improved over the past decade thanks to the advances brought by deep learning and the effort put into large-scale data collection. The speech technology community's relentless focus on minimum word error rate has thus resulted in a productivity tool that works well for some categories of the population, namely for those of us whose speech patterns match its training data: typically, college-educated first-language speakers of a standardized dialect, with little or no speech disability.

For some groups of people, however, speech technology works less well, maybe because their speech patterns differ significantly from the standard dialect (e.g., because of regional accent), because of intra-group heterogeneity (e.g., speakers of regional African American dialects; second-language learners; and other demographic aspects such as age, gender, or race), or because the speech pattern of each individual in the group exhibits a large variability (e.g., people with severe disabilities).

The goal of this special session is (1) to discuss these biases and propose methods for making speech technologies more useful to heterogeneous populations and (2) to increase academic and industry collaborations to reach these goals.

Such methods include:

  • analysis of performance biases among different social/linguistic groups in speech technology,
  • new methods to mitigate these differences,
  • new approaches for data collection, curation and coding,
  • new algorithmic training criteria,
  • new methods for envisioning speech technology task descriptions and design criteria.

Moreover, the special session aims to foster cross-disciplinary collaboration between fairness and personalization research, which has the potential to improve both customer experiences and algorithmic fairness. The special session will bring together experts from both fields to advance the cross-disciplinary study of fairness and personalization, e.g., fairness-aware personalization.

The session promotes collaboration between academia and industry to identify the key challenges and opportunities of fairness research and shed light on future research directions.

URL

https://sites.google.com/view/fair-speech-interspeech22/

Organizers

  • Prof. Laurent Besacier, Naver Labs Europe, France, Principal Scientist
  • Dr. Keith Burghardt, USC Information Sciences Institute, USA, Computer Scientist
  • Dr. Alice Coucke, Sonos Inc., France, Head of Machine Learning Research
  • Prof. Mark Allan Hasegawa-Johnson, University of Illinois, USA, Professor of Electrical and Computer Engineering
  • Dr. Peng Liu, Amazon Alexa, USA, Senior Machine Learning Scientist
  • Anirudh Mani, Amazon Alexa, USA, Applied Scientist
  • Prof. Mahadeva Prasanna, IIT Dharwad, India, Professor, Dept. of Electrical Engineering
  • Prof. Priyankoo Sarmah, IIT Guwahati, India, Professor, Dept. of Humanities and Social Sciences
  • Dr. Odette Scharenborg, Delft University of Technology, the Netherlands, Associate Professor
  • Dr. Tao Zhang, Amazon Alexa, USA, Senior Manager

Introduction

The special session aims to bring together researchers from all sectors working on ASR (Automatic Speech Recognition) for low-resource languages and dialects to discuss the state of the art and future directions. It will allow for fruitful exchanges between participants in low-resource ASR challenges and evaluations and other researchers working on low-resource ASR development.

One such challenge is the OpenASR Challenge series conducted by NIST (National Institute of Standards and Technology) in coordination with IARPA’s (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program. The most recent challenge, OpenASR21, offered an ASR test of 15 low-resource languages for conversational telephone speech, with additional data genres and case-sensitive scoring for some of the languages.

Another challenge is the Hindi ASR Challenge, recently opened to evaluate regional variations of Hindi using spontaneous telephone speech recordings made available by Gram Vaani, a social technology enterprise. The regional variations of Hindi, together with the spontaneity of speech, natural background noise, and transcriptions of varying accuracy due to crowdsourcing, make it a unique corpus for automatic recognition of spontaneous telephone speech in low-resource regional variations of Hindi. A 1000-hour audio-only dataset (without transcriptions) is also released with this challenge to explore self-supervised training in such a low-resource setting.

We invite contributions from the OpenASR21 Challenge participants, the MATERIAL performers, the Hindi ASR Challenge participants, and any other researchers with relevant work in the low-resource ASR problem space.

Topics

  • Reports of results from tests of low-resource ASR, such as (but not limited to) the NIST/IARPA OpenASR21 Challenge, IARPA MATERIAL evaluations, and the Hindi ASR Challenge.
  • Topics focused on aspects of challenges and solutions in low-resource settings, such as:
    • Zero- or few-shot learning methods
    • Transfer learning techniques
    • Cross-lingual training techniques
    • Use of pretrained models
    • Factors influencing ASR performance (such as dialect, gender, genre, variations in training data amount, or casing)
    • Any other topics focused on low-resource ASR challenges and solutions

URL

https://www.nist.gov/itl/iad/mig/low-resource-asr-development-special-session-interspeech-2022

Organizers

  • Peter Bell, University of Edinburgh
  • Jayadev Billa, University of Southern California Information Sciences Institute
  • Prasanta Ghosh, Indian Institute of Science, Bangalore
  • William Hartmann, Raytheon BBN Technologies
  • Kay Peterson, National Institute of Standards and Technology
  • Aaditeshwar Seth, Indian Institute of Technology, Delhi

Introduction

Progress in speech processing has been facilitated by shared datasets and benchmarks. Historically, these have focused on automatic speech recognition (ASR), speaker identification, or other lower-level tasks. Interest has been growing in higher-level spoken language understanding (SLU) tasks, including using end-to-end models, but there are fewer annotated datasets for such tasks, and the existing datasets tend to be relatively small. At the same time, recent work shows the possibility of pre-training generic representations and then fine-tuning for several tasks using relatively little labeled data. In this special session, we would like to foster a discussion and invite researchers in the field of SLU working on tasks such as named entity recognition (NER), sentiment analysis, intent classification, dialogue act tagging, or others, using either audio or ASR transcripts.

We invite contributions on any relevant work in the low-resource SLU problem space, including (but not limited to):

  • Training/fine-tuning approaches using self-/semi-supervised models for SLU tasks
  • Comparisons between pipeline and end-to-end SLU systems
  • Self-/semi-supervised learning approaches focusing on SLU
  • Multi-task/transfer/student-teacher learning focusing on SLU tasks
  • Theoretical or empirical studies on low-resource SLU problems

URL

Special session website: https://asappresearch.github.io/slue-toolkit/interspeech2022.html

Contact: Suwon Shon (sshon@asapp.com)

Organizers

  • Suwon Shon - ASAPP
  • Felix Wu - ASAPP
  • Pablo Brusco - ASAPP
  • Kyu J. Han - ASAPP
  • Karen Livescu - TTI at Chicago
  • Ankita Pasad - TTI at Chicago
  • Yoav Artzi - Cornell University
  • Katrin Kirchhoff - Amazon
  • Samuel R. Bowman - New York University
  • Zhou Yu - Columbia University

Introduction

The ConferencingSpeech 2022 challenge is proposed to stimulate research in non-intrusive speech quality assessment for online conferencing applications. For a long time, speech quality assessment of communication applications was carried out through subjective experiments or obtained via computational models that rely on the clean reference and the degraded speech in an intrusive manner. However, for quality-monitoring purposes, a non-intrusive (so-called single-ended) speech quality model, which does not need reference speech, is highly preferred and remains a difficult and challenging topic. The challenge aims to bring together researchers from all sectors working on speech quality to show the potential performance of different models, explore new ideas, and discuss the state of the art and future directions. We believe this could accelerate research on the topic, make non-intrusive speech quality assessment more reliable, and increase the possibility that such models are adopted by online conferencing applications in the near future.

This challenge will provide comprehensive training datasets, a comprehensive test dataset and a baseline system. The final ranking of this challenge will be decided by the accuracy of the MOS scores predicted by the submitted model or algorithm on the test dataset. More details about the data and the challenge can be found in the evaluation plan. Please let us know if you have questions or need clarification about any aspect of the challenge.
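
As an illustration only (the exact ranking criteria are defined in the evaluation plan), the snippet below shows accuracy measures commonly used to compare predicted MOS against subjective MOS, namely root-mean-square error and Pearson correlation; the function name and inputs are assumptions.

```python
# Hedged sketch: typical MOS prediction accuracy measures, not the official metric.
import numpy as np

def mos_accuracy(predicted, subjective) -> dict:
    predicted = np.asarray(predicted, dtype=float)
    subjective = np.asarray(subjective, dtype=float)
    rmse = float(np.sqrt(np.mean((predicted - subjective) ** 2)))  # average error
    pcc = float(np.corrcoef(predicted, subjective)[0, 1])          # linear correlation
    return {"rmse": rmse, "pcc": pcc}

# Example: a constant offset preserves the ordering (PCC = 1.0) but gives RMSE = 0.5.
print(mos_accuracy([2.5, 3.0, 4.0], [3.0, 3.5, 4.5]))
```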

URL

https://tea-lab.qq.com/conferencingspeech-2022

Organizers

  • Gaoxiong Yi, Tencent, China
  • Wei Xiao, Tencent, China
  • Yiming Xiao, Tencent, China
  • Babak Naderi, Technical University of Berlin, Germany
  • Sebastian Möller, Technical University of Berlin, Germany
  • Gabriel Mittag, Machine Learning Scientist, Microsoft
  • Ross Cutler, Partner Applied Scientist Manager, Microsoft
  • Zhuohuang Zhang, Indiana University Bloomington, USA
  • Donald S. Williamson, Assistant Professor, Indiana University Bloomington, USA
  • Fei Chen, Professor, Southern University of Science and Technology, China
  • Fuzheng Yang, Professor, Xidian University, China
  • Shidong Shang, Senior Director, Tencent, China

Introduction

Style is becoming more important, as we increasingly deploy variations of one basic dialog system across domains and genres, and as we aim to better customize and individualize our dialog systems.

Style has been a focus of much recent work in speech synthesis, with remarkable advances also in style transfer, style discovery, style recognition, and style modeling, both for utterance-level style properties and for interaction-level and dialog-level properties. Nevertheless, more work is needed in improving and simplifying our models, in generalizing and systematizing our understanding of style, and in translating research advances into value for users.

In this special session, we seek to promote interaction and collaboration between researchers working on different aspects of style and using different approaches. We encourage submissions that go beyond their technical or empirical contributions to also elaborate on how the work relates to the big picture of style in spoken dialog. We also welcome papers whose motivations, contributions, or implications highlight issues not commonly addressed at Interspeech.

Topics of interest include any aspects of speaking styles and interaction styles, including

  • style as it relates to expressiveness, pragmatic intents, genre, social role, social identity, stance, personality, entrainment, interpersonal dynamics, and so on
  • universal and language-specific aspects of style
  • style in monolog and dialog
  • how styles are realized through phonetic, prosodic, lexical, and turn-taking means
  • applications in dialog systems and beyond

URL

https://www.cs.utep.edu/nigel/istyles2022/

Organizers

  • Nigel Ward, University of Texas at El Paso
  • Kallirroi Georgila, University of Southern California
  • Yang Gao, Carnegie Mellon University
  • Mark Hasegawa-Johnson, University of Illinois
  • Koji Inoue, Kyoto University
  • Simon King, University of Edinburgh
  • Rivka Levitan, City University of New York
  • Katherine Metcalf, Apple
  • Eva Szekely, KTH Royal Institute of Technology
  • Pol van Rijn, Max Planck Institute for Empirical Aesthetics
  • Rafael Valle, NVIDIA

Introduction

Technological advancements have been rapidly transforming healthcare in the last several years, with speech and language tools playing an integral role. However, this brings a multitude of unique challenges that must be considered in order to increase the generalisability, reliability, interpretability and utility of speech and language tools in healthcare and health research settings.

Many of these challenges are common to the two themes of this special session. The first theme, From Collection and Analysis to Clinical Translation, seeks to draw attention to all aspects of speech-health studies that affect the overall quality and reliability of any analysis undertaken on the data and thus affect user acceptance and clinical translation.

The second theme, Language Technology For Medical Conversations, covers a growing field of research in which automatic speech recognition and natural language processing tools are combined to automatically transcribe and interpret clinician-patient conversations and generate subsequent medical documentation.

By combining these themes, this session will bring the wider speech-health community together to discuss innovative ideas, challenges and opportunities for utilizing speech technologies within the scope of healthcare applications.

Suggested paper topics include, but are not limited to:

  • Data collection protocols and speech elicitation strategies
  • Device selection and related effects
  • Acceptance of data collection in different health cohorts
  • Longitudinal data collection and analysis
  • Patient and Public Involvement in speech research
  • User evaluation of speech technology in a healthcare setting
  • Feature extraction and novel representations that provide clinical interpretability
  • Advancements in analytics and machine learning methodologies that are clinically or biologically inspired
  • Fusion of linguistic and paralinguistic information
  • Health-related conversational analytics
  • Speech recognition and natural language processing in healthcare settings
  • Creation and annotation of medical conversation datasets
  • Role of medical conversation understanding in reducing documentation burden
  • Use of chatbots in healthcare
  • Spoken language technologies in real-world health settings
  • Utilising Electronic Health Records to personalise models in speech recognition or conversational analytics

URL

https://sites.google.com/view/splang-health-interspeech2022/home

Organisers

  • Nicholas Cummins (King's College London and Thymia)
  • Thomas Schaaf (3M)
  • Heidi Christensen (University of Sheffield)
  • Judith Dineley (King’s College London and University of Augsburg)
  • Julien Epps (University of New South Wales)
  • Matt Gormley (Carnegie Mellon University)
  • Sandeep Konam (Abridge.ai)
  • Emily Mower Provost (University of Michigan)
  • Chaitanya Shivade (Amazon.com)
  • Thomas Quatieri (MIT Lincoln Laboratory)

Contacts

nick.cummins@kcl.ac.uk, tschaaf@mmm.com

Introduction

One of the greatest challenges for hearing-impaired listeners is understanding speech in the presence of background noise. Noise levels encountered in everyday social situations can have a devastating impact on speech intelligibility, and thus communication effectiveness, potentially leading to social withdrawal and isolation. Disabling hearing impairment affects 360 million people worldwide, with that number increasing because of the ageing population. Unfortunately, current hearing aid technology is often ineffective at restoring speech intelligibility in noisy situations.

To allow the development of better hearing aids, we need better ways to evaluate the speech intelligibility of audio signals. We need prediction models that can take audio signals and use knowledge of the listener's characteristics (e.g., an audiogram) to estimate the signal’s intelligibility. Further, we need models that can estimate intelligibility not just of natural signals, but also of signals that have been processed using hearing aid algorithms - whether current or under development.

The Clarity Prediction Challenge

As a focus for the session, we have launched the 'Clarity Prediction Challenge'. The challenge provides you with noisy speech signals that have been processed with a number of hearing aid signal processing systems, together with corresponding intelligibility scores produced by a panel of hearing-impaired individuals. You are tasked with producing a model that can predict intelligibility scores given just the signals, their clean references and a characterisation of each listener’s specific hearing impairment. The challenge will remain open until the Interspeech submission deadline and all entrants are welcome. (Note, the Clarity Prediction Challenge is part of a 5-year programme, with further prediction and enhancement challenges planned for the future.)
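
To make the prediction task concrete, here is a minimal sketch of the interface the challenge describes: a predictor takes the hearing-aid-processed signal, its clean reference, and the listener's audiogram, and returns an intelligibility estimate. The scoring inside is a deliberately naive envelope-correlation placeholder, not a real intelligibility model and not the challenge baseline; entrants would substitute their own method.

```python
# Hedged sketch of the predictor interface; the naive score is for illustration only.
import numpy as np

def predict_intelligibility(processed: np.ndarray,
                            reference: np.ndarray,
                            audiogram_db_hl: dict) -> float:
    """Return a score in [0, 1]; higher means predicted as more intelligible."""
    n = min(len(processed), len(reference))
    env_p = np.abs(processed[:n])            # crude broadband envelope
    env_r = np.abs(reference[:n])
    corr = np.corrcoef(env_p, env_r)[0, 1]   # similarity to the clean reference
    # Crudely penalise overall hearing loss (mean dB HL across audiometric frequencies).
    loss_penalty = np.clip(np.mean(list(audiogram_db_hl.values())) / 120.0, 0.0, 1.0)
    return float(np.clip(corr, 0.0, 1.0) * (1.0 - loss_penalty))

# Example call with a hypothetical sloping high-frequency hearing loss:
rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)
print(predict_intelligibility(ref + 0.1 * rng.standard_normal(16000), ref,
                              {250: 10, 500: 10, 1000: 20, 2000: 30, 4000: 45, 8000: 50}))
```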

Relevant Topics

The session welcomes submissions from entrants to the Clarity Prediction Challenge, but also invites papers related to topics in hearing impairment and speech intelligibility, including, but not limited to:

  • Statistical speech modelling for intelligibility prediction
  • Modelling energetic and informational noise masking
  • Individualising intelligibility models using audiometric data
  • Intelligibility prediction in online and low latency settings
  • Model-driven speech intelligibility enhancement
  • New methodologies for intelligibility model evaluation
  • Speech resources for intelligibility model evaluation
  • Applications of intelligibility modelling in acoustic engineering
  • Modelling interactions between hearing impairment and speaking style
  • Papers using the data supplied with the Clarity Prediction Challenge

URL

https://claritychallenge.github.io/interspeech2022_siphil/

Organisers

  • Trevor Cox - University of Salford, UK
  • Fei Chen - Southern University of Science and Technology, China
  • Jon Barker - University of Sheffield, UK
  • Daniel Korzekwa - Amazon TTS
  • Michael Akeroyd - University of Nottingham, UK
  • John Culling - University of Cardiff, UK
  • Graham Naylor - University of Nottingham, UK

Introduction

While spoofing countermeasures, promoted within the sphere of the ASVspoof challenge series, can help to protect the reliability of automatic speaker verification (ASV) in the face of spoofing, they have been developed as independent subsystems for a fixed ASV subsystem. Better performance can be expected when countermeasures and ASV subsystems are both optimised to operate in tandem. The first Spoofing-Aware Speaker Verification (SASV) challenge aims to encourage the development of original solutions involving, but not limited to:

  • back-end fusion of pre-trained automatic speaker verification and pre-trained audio spoofing countermeasure subsystems;
  • integrated spoofing-aware automatic speaker verification systems that have the capacity to reject both non-target and spoofed trials.

While we invite the submission of general contributions in this direction, the Interspeech 2022 Spoofing-Aware Automatic Speaker Verification special session incorporates a challenge: SASV 2022. Potential authors are encouraged to evaluate their solutions using the SASV benchmarking framework, which comprises a common database, protocol and evaluation metric. Further details and resources can be found on the SASV challenge website.
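
As a simple illustration of the first direction above, the sketch below fuses a pre-trained ASV score (cosine similarity between enrolment and test embeddings) with a pre-trained countermeasure (CM) score at the score level. The plain score sum mirrors a common non-trainable fusion strategy; it is an assumption-laden illustration, not the official SASV 2022 baseline code.

```python
# Hedged sketch of back-end score fusion for spoofing-aware speaker verification.
import numpy as np

def asv_cosine_score(enrol_emb: np.ndarray, test_emb: np.ndarray) -> float:
    """ASV score: cosine similarity between speaker embeddings."""
    return float(np.dot(enrol_emb, test_emb) /
                 (np.linalg.norm(enrol_emb) * np.linalg.norm(test_emb)))

def sasv_score(asv_score: float, cm_score: float) -> float:
    """Score-sum fusion: high only when the trial is both target and bona fide."""
    return asv_score + cm_score

# A trial is accepted when the fused score exceeds a threshold tuned on the
# development protocol; spoofed and non-target trials should both fall below it.
```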

URL

https://sasv-challenge.github.io

Organisers

  • Jee-weon Jung, Naver Corporation, South Korea
  • Hemlata Tak, EURECOM, France
  • Hye-jin Shim, University of Seoul, South Korea
  • Hee-Soo Heo, Naver Corporation, South Korea
  • Bong-Jin Lee, Naver Corporation, South Korea
  • Soo-Whan Chung, Naver Corporation, South Korea
  • Hong-Goo Kang, Yonsei University, South Korea
  • Ha-Jin Yu, University of Seoul, South Korea
  • Nicholas Evans, EURECOM, France
  • Tomi H. Kinnunen, University of Eastern Finland, Finland

Introduction

Given the ubiquity of Machine Learning (ML) systems and their relevance in daily lives, it is important to ensure private and safe handling of data alongside equity in human experience. These considerations have gained considerable interest in recent times under the umbrella of Trustworthy ML. Speech processing in particular presents a unique set of challenges, given the rich information carried in linguistic and paralinguistic content, including speaker trait, interaction and state characteristics. This special session on Trustworthy Speech Processing (TSP) was created to bring together new and experienced researchers working on trustworthy ML and speech processing. We invite novel and relevant submissions from both academic and industrial research groups, showcasing advancements in theoretical, empirical and real-world design of trustworthy speech applications.

Topics of interest cover a variety of papers centered on speech processing, including (but not limited to):

  • Differential privacy
  • Federated learning
  • Ethics in speech processing
  • Model interpretability
  • Quantifying & mitigating bias in speech processing
  • New datasets, frameworks and benchmarks for TSP
  • Discovery and defense against emerging privacy attacks
  • Trustworthy ML in applications of speech processing like ASR

URL

https://trustworthyspeechprocessing.github.io/

Organizers

  • Anil Ramakrishna, Amazon Inc.
  • Shrikanth Narayanan, University of Southern California
  • Rahul Gupta, Amazon Inc.
  • Isabel Trancoso, University of Lisbon
  • Rita Singh, Carnegie Mellon University

Introduction

Human listening tests are the gold standard for evaluating synthesized speech. Objective measures of speech quality have low correlation with human ratings, and the generalization abilities of current data-driven quality prediction systems suffer significantly from domain mismatch. The VoiceMOS Challenge aims to encourage research in the area of automatic prediction of Mean Opinion Scores (MOS) for synthesized speech. This challenge has two tracks:

  • Main track: We recently collected a large-scale dataset of MOS ratings for a large variety of text-to-speech and voice conversion systems spanning many years, and this challenge releases this data to the public for the first time as the main track dataset.
  • Out-of-domain track: The data for this track comes from a different listening test than the main track. The purpose of this track is to study the generalization ability of proposed MOS prediction models to a different listening test context. A smaller amount of labeled data is made available to participants, and unlabeled audio samples from the same listening test are made available as well, to encourage exploration of unsupervised and semi-supervised approaches.

Participation is open to all. The main track is required for all participants, and the out-of-domain track is optional. Participants in the challenge are strongly encouraged to submit papers to the special session. The focus of the special session is on understanding and comparing MOS prediction techniques using a standardized dataset.

URL

Challenge info page: https://voicemos-challenge-2022.github.io

CodaLab competition page: https://codalab.lisn.upsaclay.fr/competitions/695

Organizers

  • Wen-Chin Huang (Nagoya University, Japan)
  • Erica Cooper (National Institute of Informatics, Japan)
  • Yu Tsao (Academia Sinica, Taiwan)
  • Hsin-Min Wang (Academia Sinica, Taiwan)
  • Tomoki Toda (Nagoya University, Japan)
  • Junichi Yamagishi (National Institute of Informatics, Japan)