The Organizing Committee of INTERSPEECH 2022 is proud to announce the following special sessions and challenges for INTERSPEECH 2022.
Special sessions and challenges focus on relevant ‘special’ topics which may not be covered in regular conference sessions.
Papers have to be submitted following the same schedule and procedure as regular papers; the papers undergo the same review process by anonymous and independent reviewers.
The focus of this Special Session is to provide a forum for researchers working on the massive naturalistic audio collection stemming from the NASA Apollo Missions. UTDallas-CRSS, with NSF support, has led the Fearless Steps Initiative, a continued effort spanning eight years that has resulted in the digitization and recovery of over 50,000 hours of original analog audio data, as well as the development of algorithms to extract meaningful information from this naturalistic data resource, including an initial release of pipeline diarization metadata for all 30 channels of the APOLLO-11 and APOLLO-13 missions. More than 500 sites worldwide have accessed the initial data. A current NSF Community Resource project is continuing this effort to recover the remaining Apollo missions (A7-A17; estimated at 150,000 hours of data), in addition to motivating collaborative speech and language technology research through the Fearless Steps Challenge series.
The INTERSPEECH 2022 Audio Deep Packet Loss Concealment Challenge is intended to stimulate research in the area of Audio Packet Loss Concealment (PLC).
PLC is an important part of audio telecommunications technology and codec development, and methods for performing PLC using machine learning approaches are now becoming viable for practical use. Packet loss, either by missing packets or high packet jitter, is one of the top reasons for speech quality degradation in Voice over IP calls.
While there have been some groups publishing in this area, a lack of common datasets and evaluation procedures complicates the comparison of proposed methods and the establishment of clear baselines. With this challenge, we propose to address this situation: We will open source a dataset based on real-world (as opposed to the common synthetic) packet loss traces and bring the community together to, for the first time, compare approaches in this field on a unified test set.
As the gold standard for audio quality evaluation is human evaluator ratings, we will evaluate submissions using a crowdsourced ITU-T P.808 CCR approach. The three approaches that achieve the highest average Mean Opinion Score on the blind set will be declared the winners of the INTERSPEECH 2022 Audio Deep Packet Loss Concealment Challenge. As an additional metric, to ensure that approaches are not degrading intelligibility, we will use the speech recognition rate, calculated using the Microsoft Cognitive Services Speech Recognition Service.
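The ranking rule described above (average the listener scores per submission, then take the top three) can be sketched as follows. This is only an illustration under stated assumptions: the function name `rank_by_mean_score`, the system names, and the votes are hypothetical, and the organizers' actual aggregation of ITU-T P.808 ratings may differ (for example, in how raters are screened).

```python
from collections import defaultdict
from statistics import mean


def rank_by_mean_score(ratings, top_n=3):
    """Return the top_n (system, average score) pairs.

    ratings: iterable of (system_name, score) pairs, one per listener vote.
    """
    per_system = defaultdict(list)
    for system, score in ratings:
        per_system[system].append(score)
    averages = {system: mean(scores) for system, scores in per_system.items()}
    # Sort by descending average; break ties alphabetically for determinism.
    ranked = sorted(averages.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Hypothetical votes for illustration only; these are not challenge data.
example_ratings = [
    ("system_a", 1), ("system_a", 2),
    ("system_b", 3), ("system_b", 2),
    ("system_c", 0), ("system_c", 1),
    ("system_d", -1), ("system_d", 0),
]
winners = rank_by_mean_score(example_ratings)
print(winners)
```

With these made-up votes, `system_b` ranks first, followed by `system_a` and `system_c`.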
To help participants during the challenge, we will provide access to our prototype "PLC-MOS" neural network model, which estimates human ratings of audio files with healed packet losses.
Challenge details: https://aka.ms/plc_challenge
Data and example scripts: https://github.com/microsoft/PLC-Challenge
The increasing proliferation of smart devices in our lives offers tremendous opportunities to improve the customer experience by leveraging spatial diversity and distributed computational and memory capability. At the same time, compared to single smart devices, multi-sensor networks present unique challenges such as synchronization, arbitration, and privacy.
The purpose of this special session is to promote research in multiple device signal processing and machine learning by bringing together leading industry and academic experts to discuss the following topics including, but not limited to:
The core motivation of this session is the recognition that "more is different". Robust speech recognition, enhancement, and analysis are foundational areas of speech signal processing with many publication outlets. The strength of the special session is to use the engineering specification of multiple devices as a backdrop against which creative solutions from these domains can be demonstrated. The session will co-locate top researchers working in the multi-sensor domain, and even though their specific applications may be different (e.g. enhancement vs. acoustic event detection), the similarity of the problem space encourages cross-pollination of techniques.
Speech technologies have become increasingly used and now power a very large range of applications. Automatic speech recognition systems have indeed dramatically improved over the past decade thanks to the advances brought by deep learning and the effort on large-scale data collection. The speech technology community's relentless focus on minimum word error rate has thus resulted in a productivity tool that works well for some categories of the population, namely for those of us whose speech patterns match its training data: typically, college-educated first-language speakers of a standardized dialect, with little or no speech disability.
For some groups of people, however, speech technology works less well, perhaps because their speech patterns differ significantly from the standard dialect (e.g., because of regional accent), because of intra-group heterogeneity (e.g., speakers of regional African American dialects; second-language learners; and other demographic aspects such as age, gender, or race), or because the speech pattern of each individual in the group exhibits a large variability (e.g., people with severe disabilities).
The goal of this special session is (1) to discuss these biases and propose methods for making speech technologies more useful to heterogeneous populations and (2) to increase academic and industry collaborations to reach these goals.
Such methods include:
Moreover, the special session aims to foster cross-disciplinary collaboration between fairness and personalization research, which has the potential to both improve customer experiences and algorithm fairness. The special session will bring experts from both fields to advance the cross-disciplinary study between fairness and personalization, e.g., fairness-aware personalization.
The session promotes collaboration between academia and industry to identify the key challenges and opportunities of fairness research and shed light on future research directions.
The special session aims to bring together researchers from all sectors working on ASR (Automatic Speech Recognition) for low-resource languages and dialects to discuss the state of the art and future directions. It will allow for fruitful exchanges between participants in low-resource ASR challenges and evaluations and other researchers working on low-resource ASR development.
One such challenge is the OpenASR Challenge series conducted by NIST (National Institute of Standards and Technology) in coordination with IARPA’s (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program. The most recent challenge, OpenASR21, offered an ASR test of 15 low-resource languages for conversational telephone speech, with additional data genres and case-sensitive scoring for some of the languages.
Another challenge is the Hindi ASR Challenge, which was recently opened to evaluate regional variations of Hindi using spontaneous telephone speech recordings made available by Gram Vaani, a social technology enterprise. The regional variations of Hindi, together with the spontaneity of the speech, natural background noise, and transcriptions with varying degrees of accuracy due to crowdsourcing, make it a unique corpus for automatic recognition of spontaneous telephone speech in low-resource regional variations of Hindi. A 1,000-hour audio-only dataset (no transcriptions) is also released with this challenge to explore self-supervised training for such a low-resource framework.
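To illustrate what case-sensitive scoring changes, the sketch below computes a word error rate (WER) with a word-level edit distance and an optional case-folding step. It is a minimal, assumed formulation for illustration; the function name `word_error_rate` and its `case_sensitive` parameter are hypothetical, and the official OpenASR scoring tools and text normalization rules are not reproduced here.

```python
def word_error_rate(reference, hypothesis, case_sensitive=True):
    """Word-level edit distance divided by the number of reference words."""
    if not case_sensitive:
        reference, hypothesis = reference.lower(), hypothesis.lower()
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


# A hypothesis that differs from the reference only in capitalization.
strict = word_error_rate("Hello world", "hello world", case_sensitive=True)
folded = word_error_rate("Hello world", "hello world", case_sensitive=False)
print(strict, folded)
```

Under case-sensitive scoring the capitalization mismatch counts as one substitution out of two reference words; with case folding it does not count as an error.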
We invite contributions from the OpenASR21 Challenge participants, the MATERIAL performers, the Hindi ASR Challenge participants, and any other researchers with relevant work in the low-resource ASR problem space.
Progress in speech processing has been facilitated by shared datasets and benchmarks. Historically these have focused on automatic speech recognition (ASR), speaker identification, or other lower-level tasks. Interest has been growing in higher-level spoken language understanding (SLU) tasks, including using end-to-end models, but there are fewer annotated datasets for such tasks, and the existing datasets tend to be relatively small. At the same time, recent work shows the possibility of pre-training generic representations and then fine-tuning for several tasks using relatively little labeled data. In this special session, we would like to foster a discussion and invite researchers in the field of SLU working on tasks such as named entity recognition (NER), sentiment analysis, intent classification, dialogue act tagging, or others, using either audio or ASR transcripts.
We invite contributions on any relevant work in the low-resource SLU problem space. Topics include (but are not limited to):
Special session website: https://asappresearch.github.io/slue-toolkit/interspeech2022.html
Contact: Suwon Shon (firstname.lastname@example.org)
The ConferencingSpeech 2022 challenge is proposed to stimulate research in non-intrusive speech quality assessment for online conferencing applications. For a long time, the speech quality of communication applications has been assessed either by subjective experiments or by computational models that rely on both the clean reference and the degraded speech, in an intrusive manner. However, for quality monitoring purposes, a non-intrusive speech quality model (a so-called single-ended model), which does not need reference speech, is highly preferred and remains a difficult and challenging topic. The challenge aims to bring together researchers from all sectors working on speech quality to show the potential performance of different models, explore new ideas, and discuss the state of the art and future directions. We believe this could accelerate research on the topic, make non-intrusive speech quality assessment more reliable, and increase the likelihood that such models will be adopted by online conferencing applications in the near future.
This challenge will provide comprehensive training datasets, a comprehensive test dataset and a baseline system. The final ranking of this challenge will be decided by the accuracy of the predicted MOS scores from the submitted model or algorithm on the test dataset. More details about the data and the challenge can be found in the evaluation plan. Please let us know if you have questions or need clarification about any aspect of the challenge.
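The text does not state here which accuracy measure is used for the ranking; that is defined in the evaluation plan. As a hedged illustration, two measures commonly used to compare predicted and subjective MOS values are the root mean squared error (RMSE) and the Pearson correlation coefficient. The function names and sample values below are hypothetical.

```python
import math


def rmse(predicted, reference):
    """Root mean squared error between predicted and subjective scores."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(predicted, reference):
    """Pearson correlation coefficient between two score lists."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0 or var_r == 0:
        raise ValueError("correlation is undefined for constant inputs")
    return cov / math.sqrt(var_p * var_r)


# Hypothetical MOS values for illustration only; not challenge data.
subjective_mos = [1.0, 2.0, 3.0, 4.0]
predicted_mos = [2.0, 3.0, 4.0, 5.0]
print(rmse(predicted_mos, subjective_mos), pearson(predicted_mos, subjective_mos))
```

In this made-up case every prediction is one point too high, so the correlation is perfect while the RMSE is 1.0, which shows why the two measures are often reported together.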
Style is becoming more important, as we increasingly deploy variations of one basic dialog system across domains and genres, and as we aim to better customize and individualize our dialog systems.
Style has been a focus of much recent work in speech synthesis, with remarkable advances also in style transfer, style discovery, style recognition, and style modeling, both for utterance-level style properties and for interaction-level and dialog-level properties. Nevertheless, more work is needed in improving and simplifying our models, in generalizing and systematizing our understanding of style, and in translating research advances into value for users.
In this special session, we seek to promote interaction and collaboration between researchers working on different aspects of style and using different approaches. We encourage submissions that go beyond their technical or empirical contributions to also elaborate on how the work relates to the big picture of style in spoken dialog. We also welcome papers whose motivations, contributions, or implications highlight issues not commonly addressed at Interspeech.
Topics of interest include any aspects of speaking styles and interaction styles, including
Technological advancements have been rapidly transforming healthcare in the last several years, with speech and language tools playing an integral role. However, this brings a multitude of unique challenges to consider to increase the generalisability, reliability, interpretability and utility of speech and language tools in healthcare and health research settings.
Many of these challenges are common to the two themes of this special session. The first theme, From Collection and Analysis to Clinical Translation, seeks to draw attention to all aspects of speech-health studies that affect the overall quality and reliability of any analysis undertaken on the data and thus affect user acceptance and clinical translation.
The second theme, Language Technology For Medical Conversations, covers a growing field of research in which automatic speech recognition and natural language processing tools are combined to automatically transcribe and interpret clinician-patient conversations and generate subsequent medical documentation.
By combining these themes, this session will bring the wider speech-health community together to discuss innovative ideas, challenges and opportunities for utilizing speech technologies within the scope of healthcare applications.
Suggested paper topics include, but are not limited to:
One of the greatest challenges for hearing-impaired listeners is understanding speech in the presence of background noise. Noise levels encountered in everyday social situations can have a devastating impact on speech intelligibility, and thus communication effectiveness, potentially leading to social withdrawal and isolation. Disabling hearing impairment affects 360 million people worldwide, with that number increasing because of the ageing population. Unfortunately, current hearing aid technology is often ineffective at restoring speech intelligibility in noisy situations.
To allow the development of better hearing aids, we need better ways to evaluate the speech intelligibility of audio signals. We need prediction models that can take audio signals and use knowledge of the listener's characteristics (e.g., an audiogram) to estimate the signal’s intelligibility. Further, we need models that can estimate intelligibility not just of natural signals, but also of signals that have been processed using hearing aid algorithms - whether current or under development.
As a focus for the session, we have launched the ‘Clarity Prediction Challenge’. The challenge provides you with noisy speech signals that have been processed with a number of hearing aid signal processing systems and corresponding intelligibility scores produced by a panel of hearing-impaired individuals. You are tasked with producing a model that can predict intelligibility scores given just the signals, their clean references and a characterisation of each listener’s specific hearing impairment. The challenge will remain open until the Interspeech submission deadline and all entrants are welcome. (Note, the Clarity Prediction Challenge is part of a 5-year programme with further prediction and enhancement challenges planned for the future.)
The session welcomes submissions from entrants to the Clarity Prediction Challenge and also invites papers on topics in hearing impairment and speech intelligibility, including, but not limited to:
While spoofing countermeasures, promoted within the sphere of the ASVspoof challenge series, can help to protect the reliability of automatic speaker verification (ASV) systems in the face of spoofing, they have been developed as independent subsystems for a fixed ASV subsystem. Better performance can be expected when countermeasures and ASV subsystems are both optimised to operate in tandem. The first spoofing-aware speaker verification (SASV) challenge aims to encourage the development of original solutions involving, but not limited to:
back-end fusion of pre-trained automatic speaker verification and pre-trained audio spoofing countermeasure subsystems; integrated spoofing-aware automatic speaker verification systems that have the capacity to reject both non-target and spoofed trials.
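The first option listed above, back-end fusion, can be sketched in its simplest form as a weighted sum of the two subsystem scores followed by a single accept/reject threshold. This is an assumed toy formulation, not the challenge baseline: the function names, the weight, the threshold, and the assumption that both scores are already normalised to the range 0 to 1 are all hypothetical.

```python
def fuse_scores(asv_score, cm_score, asv_weight=0.5):
    """Weighted sum of a speaker verification score and a countermeasure score.

    Both scores are assumed to lie in [0, 1], where higher means
    "target speaker" (ASV) or "bona fide speech" (countermeasure).
    """
    if not 0.0 <= asv_weight <= 1.0:
        raise ValueError("asv_weight must be between 0 and 1")
    return asv_weight * asv_score + (1.0 - asv_weight) * cm_score


def accept_trial(asv_score, cm_score, threshold=0.6, asv_weight=0.5):
    """Accept only if the fused score reaches the (hypothetical) threshold."""
    return fuse_scores(asv_score, cm_score, asv_weight) >= threshold


# Hypothetical scores for the three trial types mentioned in the text.
target_trial = accept_trial(asv_score=0.9, cm_score=0.9)       # bona fide target
spoofed_trial = accept_trial(asv_score=0.9, cm_score=0.05)     # spoofed speech
non_target_trial = accept_trial(asv_score=0.1, cm_score=0.9)   # different speaker
print(target_trial, spoofed_trial, non_target_trial)
```

With these made-up scores, only the bona fide target trial is accepted; the spoofed and non-target trials both fall below the threshold. An integrated system, the second option in the list, would instead produce a single score from one jointly optimised model.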
While we invite the submission of general contributions in this direction, the Interspeech 2022 Spoofing-aware Automatic Speaker Verification special session incorporates a challenge, SASV 2022. Potential authors are encouraged to evaluate their solutions using the SASV benchmarking framework, which comprises a common database, protocol and evaluation metric. Further details and resources can be found on the SASV challenge website.
Given the ubiquity of Machine Learning (ML) systems and their relevance in daily lives, it is important to ensure private and safe handling of data alongside equity in human experience. These considerations have gained considerable interest in recent times under the realm of Trustworthy ML. Speech processing in particular presents a unique set of challenges, given the rich information carried in linguistic and paralinguistic content including speaker trait, interaction and state characteristics. This special session on Trustworthy Speech Processing (TSP) was created to bring together new and experienced researchers working on trustworthy ML and speech processing. We invite novel and relevant submissions from both academic and industrial research groups, showcasing advancements in theoretical, empirical as well as real-world design of trustworthy speech applications.
Topics of interest cover a variety of papers centered on speech processing, including (but not limited to):
Human listening tests are the gold standard for evaluating synthesized speech. Objective measures of speech quality have low correlation with human ratings, and the generalization abilities of current data-driven quality prediction systems suffer significantly from domain mismatch. The VoiceMOS Challenge aims to encourage research in the area of automatic prediction of Mean Opinion Scores (MOS) for synthesized speech. This challenge has two tracks:
Main track: We recently collected a large-scale dataset of MOS ratings for a large variety of text-to-speech and voice conversion systems spanning many years, and this challenge releases this data to the public for the first time as the main track dataset.
Out-of-domain track: The data for this track comes from a different listening test from the main track. The purpose of this track is to study the generalization ability of proposed MOS prediction models to a different listening test context. A smaller amount of labeled data is made available to participants, and unlabeled audio samples from the same listening test are made available as well, to encourage exploration of unsupervised and semi-supervised approaches.
Participation is open to all. The main track is required for all participants, and the out-of-domain track is optional. Participants in the challenge are strongly encouraged to submit papers to the special session. The focus of the special session is on understanding and comparing MOS prediction techniques using a standardized dataset.
Challenge info page: https://voicemos-challenge-2022.github.io
CodaLab competition page: https://codalab.lisn.upsaclay.fr/competitions/695