
  • September 18, 2022
  • Morning Tutorial 09:00 – 12:30; Coffee/tea break: 10:30 – 11:00
  • Afternoon Tutorial 14:00 – 17:30; Coffee/tea break: 15:30 – 16:00

Tutorial Sessions

INTERSPEECH conferences are attended by researchers with a long-term track record in speech sciences and technology, as well as by early-stage researchers and researchers interested in a new domain within the INTERSPEECH areas.

As an important part of the conference, the tutorials will be held on the first day of the conference (September 18, 2022). Speakers with long and deep expertise in speech will provide their audience with a rich learning experience covering longstanding research problems, contemporary topics of research, and currently emerging areas.

List of Tutorials

Morning Tutorials


One of the key bottlenecks in training diverse, accurate audio classifiers is the need for “strongly labelled” training data that provide precisely demarcated instances of the audio events to be recognized. Such data are, however, difficult to obtain, particularly in bulk. The alternative, more popular approach is to train models using “weakly labelled” data, comprising recordings in which only the presence or absence of sound classes is tagged, without additional details of the number of occurrences of the sounds or their locations in the recordings. Weakly labelled data are much easier to obtain than strongly labelled data; however, training with such data comes with many challenges. In this tutorial we will discuss the problem of training audio (and other) classifiers from weakly labelled data, including several state-of-the-art formalisms, their restrictions and limitations, and areas of future research.

  • Bhiksha Raj
    IEEE Fellow, Professor at Carnegie Mellon University. Department of Language Technologies

    Bhiksha Raj is a Professor in the School of Computer Science at Carnegie Mellon University. His areas of research include speech and audio processing and acoustic scene analysis. He was one of the pioneers of the field of learning audio classifiers from weak labels. Raj has previously conducted several tutorials at ICASSP, Interspeech and various other conferences. He is a fellow of the IEEE.

  • Anurag Kumar
    Research Scientist, Meta Reality Labs, USA

    Anurag Kumar is a research scientist at Meta Research. Anurag completed his PhD at Carnegie Mellon University in 2019 and has been at Meta Research since then. Kumar is the recipient of the Samsung Innovation Award and has been a finalist for the Qualcomm Fellowship for his work on audio analysis. Along with Professor Raj, he has been one of the pioneers in the field of learning audio classifiers from weak labels.

  • Ankit Shah
    Ph.D. student at Carnegie Mellon University. Department of Language Technologies

    Ankit Shah is a PhD student in the Language Technologies Institute in the School of Computer Science at Carnegie Mellon University. His areas of interest are audio understanding, machine learning, and deep learning. His thesis focuses on learning in the presence of weak, uncertain, and incomplete labels, where he has made several key contributions, including the setting up of DCASE challenges on the topic. He has won the Gandhian Young Technological Innovator (GYTI) award in India for his contribution to building a never-ending learner of sound systems.


In recent years, reinforcement learning and bandits have transformed a wide range of real-world applications including healthcare, finance, recommendation systems, robotics, computer vision, and, last but not least, speech and language processing. While most speech and language applications of reinforcement learning algorithms are centered around improving deep network training with its flexible optimization properties, much remains to be explored in utilizing the benefits of reinforcement learning, such as its reward-driven adaptability, state representations, temporal structures, and generalizability. In this one-session tutorial, we will give an overview of recent advances in reinforcement learning and bandits and discuss how they can be employed to solve various speech and natural language processing problems with models that are interpretable and scalable.

  • Baihan Lin
    Columbia University, New York, USA

    Baihan Lin is a machine learning and neuroscience researcher at Columbia University. His current theoretical research interest lies in the intersection between reinforcement learning, dynamical systems, geometric topology and complex networks, with extensive application interest in multiscale systems, especially in understanding neural systems and the theory of neural networks, as well as developing neuroscience-inspired algorithms in signal processing, speech recognition, and computer vision domains that are efficient, scalable, interpretable and interactive. Before his PhD at Columbia University, he received a Master's degree in Applied Mathematics from the University of Washington, Seattle. In the past few years, he has maintained close collaborations with IBM, Google, Microsoft and Amazon on various application domains. According to Google Scholar, he has authored 30+ publications with an H-index of 11+ and has served on program committees or as a reviewer for over 15 conferences including IJCAI, AAMAS, INTERSPEECH, AISTATS, NeurIPS, ICML, CVPR, ICCV, KDD, ICLR, AAAI, EMNLP and MICCAI, as well as over 15 journals including Nature Scientific Reports, PLOS ONE, JACS, JInf, Entropy, Adv Complex Syst, IEEE Trans Knowl Data Eng, Comput Commun, Mathematics, Electronics, Signals, Algorithms, Behav Res Methods, Appl Sci, Front AI, Front Comp Neuro, Front Psychol and Front Robot AI.


Although deep learning models have revolutionized the speech and audio processing field, they build specialist models for individual tasks and application scenarios. Deep neural models have also been a bottleneck for dialects and languages with limited labelled data. Self-supervised representation learning methods promise a single universal model to benefit a collection of tasks and domains. They recently succeeded in the NLP and computer vision domains, reaching new performance levels while reducing the labels required for many downstream scenarios. Speech representation learning is experiencing similar progress with three main categories: generative, contrastive, and predictive. Other approaches rely on multi-modal data for pre-training, mixing text or visual data streams with speech. Although self-supervised speech representation is still a developing research area, it is closely related to acoustic word embedding and learning with zero lexical resources. This tutorial session will present self-supervised speech representation learning approaches and their connection to related research areas. Since many current methods focus solely on automatic speech recognition as a downstream task, we will review recent efforts on benchmarking learned representations to extend the application of such representations beyond speech recognition. A hands-on component of this tutorial will provide practical guidance on building and evaluating speech representation models.

  • Hung-yi Lee
    National Taiwan University, Taiwan

    Hung-yi Lee received his PhD degree from National Taiwan University (NTU). He was a visiting scientist at the Spoken Language Systems Group of MIT CSAIL. He is an associate professor at National Taiwan University. He is a co-organizer of the special session on “New Trends in self-supervised speech processing” at Interspeech (2020), and the workshop on “Self-Supervised Learning for Speech and Audio Processing” at NeurIPS (2020).

  • Abdelrahman Mohamed
    Meta AI, USA

    Abdelrahman Mohamed is a research scientist at Meta AI Research. He received his PhD from the University of Toronto, where he was part of the team that started the Deep Learning revolution in Spoken Language Processing in 2009. He has been focusing lately on improving, using, and benchmarking learned speech representations, e.g. HuBERT, Wav2vec 2.0, TextlessNLP, and SUPERB.

  • Shinji Watanabe
    Carnegie Mellon University, USA

    Shinji Watanabe is an Associate Professor at CMU. He was a research scientist at NTT, Japan, a visiting scholar at Georgia Tech, a senior principal research scientist at MERL, and an associate research professor at JHU. He has published more than 200 peer-reviewed papers. He served as an Associate Editor of the IEEE TASLP. He has been a member of several technical committees, including the APSIPA SLA, IEEE SPS SLTC, and MLSP.

  • Tara Sainath
    Google, USA

    Tara Sainath is a Principal Research Scientist at Google. She received her PhD from MIT in the Spoken Language Systems Group. She is an IEEE and ISCA Fellow and the recipient of the 2021 IEEE SPS Industrial Innovation Award. Her research involves applications of deep neural networks for automatic speech recognition, and she has been very active in the community, organizing workshops and special sessions on this topic.

  • Karen Livescu
    Toyota Technological Institute at Chicago, USA

    Karen Livescu is a Professor at TTI-Chicago. She completed her PhD at MIT in the Spoken Language Systems group. She is an ISCA Fellow and an IEEE Distinguished Lecturer, and has served as a program chair for ICLR 2019 and Interspeech 2022. Her recent work includes multi-view representation learning, acoustic word embeddings, visually grounded speech models, spoken language understanding, and automatic sign language recognition.

  • Shang-Wen Li

    Shang-Wen Li is a Research and Engineering Manager at Meta AI. He worked at Apple Siri, Amazon Alexa and AWS. He completed his PhD in 2016 in the Spoken Language Systems group of MIT CSAIL. He co-organized the workshop on "Self-Supervised Learning for Speech and Audio Processing" at NeurIPS (2020) and AAAI (2022), and the workshop on "Meta Learning and Its Applications to Natural Language Processing" at ACL (2021).

  • Shu-wen Yang
    National Taiwan University, Taiwan

    Shu-wen Yang is a PhD student at National Taiwan University. He co-created a benchmark for Self-Supervised Learning in speech, Speech processing Universal PERformance Benchmark (SUPERB). Before SUPERB, he created the S3PRL toolkit with Andy T. Liu, which supports numerous pretrained models and recipes for both pre-training and benchmarking. He gave a tutorial at the Machine Learning Summer School, Taiwan, 2021.

  • Katrin Kirchhoff
    Amazon Web Services, USA

    Katrin Kirchhoff is a Director of Applied Science at Amazon Web Services, where she heads several teams in speech and audio processing. She was a Research Professor at the UW, Seattle, for 17 years, where she co-founded the Signal, Speech and Language Interpretation Lab. She served on the editorial boards of Speech Communication and Computer Speech and Language, and was a member of the IEEE Speech Technical Committee.


Cochlear implantation, one of the most profound technological advances in modern medicine, is able to restore partial hearing and speech communication abilities to a large number of profoundly deaf people. Cochlear implants (CIs) provide a valuable scientific model to investigate the impacts of psychoacoustic and linguistic cues on speech perception. Understanding speech in noisy environments remains challenging for most CI users; hence speech enhancement plays an extremely important role in CI speech processing and perception. Innovations ranging from psychoacoustic knowledge to recent machine learning technology may provide novel solutions for the design of CI speech enhancement methods in challenging listening tasks.

  • Fei Chen
    Department of Electrical and Electronic Engineering, Southern University of Science and Technology

    Fei Chen received the B.Sc. and M.Phil. degrees from the Department of Electronic Science and Engineering, Nanjing University in 1998 and 2001, respectively, and the PhD degree from the Department of Electronic Engineering, The Chinese University of Hong Kong in 2005. He continued his research as a postdoctoral researcher and senior research fellow at the University of Texas at Dallas (supervised by Prof. Philipos Loizou) and The University of Hong Kong. He is now a full professor at the Department of Electrical and Electronic Engineering, Southern University of Science and Technology (SUSTech), Shenzhen, China. Dr. Chen is leading the speech and physiological signal processing (SPSP) research group at SUSTech, with a research focus on speech perception, speech intelligibility modeling, speech enhancement, and assistive hearing technology. He has published over 100 journal papers and over 80 conference papers in IEEE journals/conferences, Interspeech, Journal of Acoustical Society of America, etc. He was a tutorial speaker for “Intelligibility evaluation and speech enhancement based on deep learning” at Interspeech 2020, Shanghai, and organized the special session “Signal processing for assistive hearing devices” at ICASSP 2015, Brisbane. He received the best presentation award at the 9th Asia Pacific Conference of Speech, Language and Hearing, and the 2011 National Organization for Hearing Research Foundation Research Awards in the United States. Dr. Chen is an APSIPA distinguished lecturer, and is now serving as an associate editor of Biomedical Signal Processing and Control and Frontiers in Human Neuroscience.

  • Yu Tsao
    The Research Center for Information Technology Innovation (CITI), Academia Sinica

    Yu Tsao received the B.S. and M.S. degrees in electrical engineering from National Taiwan University, Taipei, Taiwan, in 1999 and 2001, respectively, and the Ph.D. degree in electrical and computer engineering from the Georgia Institute of Technology, Atlanta, GA, USA, in 2008. From 2009 to 2011, he was a Researcher with the National Institute of Information and Communications Technology, Tokyo, Japan, where he engaged in research and product development in automatic speech recognition for multilingual speech-to-speech translation. He is currently a Research Fellow (Professor) and Deputy Director with the Research Center for Information Technology Innovation, Academia Sinica, Taipei. His research interests include speech and speaker recognition, acoustic and language modeling, audio coding, and bio-signal processing. He is currently an Associate Editor for the IEEE/ACM Transactions on Audio, Speech, and Language Processing and IEEE Signal Processing Letters and a Distinguished Lecturer of APSIPA. He was a tutorial speaker at Interspeech 2020, ICASSP 2018, APSIPA 2021, APSIPA 2020, APSIPA 2019, APSIPA 2018 and ISCSLP 2018. He was the recipient of the Academia Sinica Career Development Award in 2017, the National Innovation Award in 2018, 2019 and 2020, the Future Tech Breakthrough Award in 2019, and the Outstanding Elite Award, Chung Hwa Rotary Educational Foundation, 2019–2020.

Afternoon Tutorials


Spoken keyword spotting (KWS) deals with the identification of keywords in audio streams and has become a fast-growing technology thanks to the paradigm shift introduced by deep learning a few years ago. This has allowed the rapid embedding of deep KWS in a myriad of small electronic devices for different purposes, such as the activation of voice assistants. In this tutorial, we will present a review of deep spoken KWS intended for practitioners and researchers who are interested in this technology. In particular, we will cover an analysis of the main components of deep KWS systems, robustness methods, audio-visual KWS, applications and experimental considerations, before concluding by identifying a number of directions for future research.

  • Iván López-Espejo
    Aalborg University, Denmark

    Iván López-Espejo received the M.Sc. degree in Telecommunications Engineering, the M.Sc. degree in Electronics Engineering and the Ph.D. degree in Information and Communications Technology, all from the University of Granada, Granada, Spain, in 2011, 2013 and 2017, respectively. In 2018, he was the leader of the speech technology team of Veridas, Pamplona, Spain. Since 2019, he has been a post-doctoral researcher at the section for Artificial Intelligence and Sound at the Department of Electronic Systems of Aalborg University, Aalborg, Denmark. His research interests include speech enhancement, robust speech recognition and keyword spotting, multi-channel speech processing, and speaker verification.

  • Zheng-Hua Tan
    Aalborg University, Denmark

    Professor Zheng-Hua Tan received the B.Sc. and M.Sc. degrees in electrical engineering from Hunan University, Changsha, China, in 1990 and 1996, respectively, and the Ph.D. degree in electronic engineering from Shanghai Jiao Tong University (SJTU), Shanghai, China, in 1999.

    He is a Professor in the Department of Electronic Systems and a Co-Head of the Centre for Acoustic Signal Processing Research at Aalborg University, Aalborg, Denmark. He is also a Co-Lead of the Pioneer Centre for AI, Denmark. He was a Visiting Scientist at the Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, USA, an Associate Professor at SJTU, Shanghai, China, and a postdoctoral fellow at KAIST, Daejeon, Korea. His research interests include machine learning, deep learning, pattern recognition, speech and speaker recognition, noise-robust speech processing, multimodal signal processing, and social robotics. He has authored/coauthored over 200 refereed publications. He is the Chair of the IEEE Signal Processing Society Machine Learning for Signal Processing Technical Committee (MLSP TC). He is an Associate Editor for the IEEE/ACM Transactions on Audio, Speech, and Language Processing. He has served as an Editorial Board Member for Computer Speech and Language and was a Guest Editor for the IEEE Journal of Selected Topics in Signal Processing and Neurocomputing. He was the General Chair for IEEE MLSP 2018 and a TPC Co-Chair for IEEE SLT 2016.

  • John H. L. Hansen
    The University of Texas at Dallas, USA

    John H. L. Hansen (Fellow, IEEE) received the B.S.E.E. degree from the College of Engineering, Rutgers University, New Brunswick, NJ, USA, and the M.S. and Ph.D. degrees in electrical engineering from the Georgia Institute of Technology, Atlanta, GA, USA. In 2005, he joined the Erik Jonsson School of Engineering and Computer Science, the University of Texas at Dallas, Richardson, TX, USA, where he is currently an Associate Dean for research and a Professor of electrical and computer engineering. He also holds the Distinguished University Chair in telecommunications engineering and a joint appointment as a Professor of speech and hearing with the School of Behavioral and Brain Sciences. From 2005 to 2012, he was the Head of the Department of Electrical Engineering, the University of Texas at Dallas. At UT Dallas, he established the Center for Robust Speech Systems. From 1998 to 2005, he was the Department Chair and a Professor of speech, language, and hearing sciences, and a Professor of electrical and computer engineering with the University of Colorado Boulder, Boulder, CO, USA, where he co-founded and was an Associate Director of the Center for Spoken Language Research. In 1988, he established the Robust Speech Processing Laboratory. He has supervised 92 Ph.D. or M.S. thesis students, which include 51 Ph.D. and 41 M.S. or M.A. He has authored or coauthored 765 journal and conference papers, including 13 textbooks in the fields of speech processing and language technology and signal processing for vehicle systems. He is a co-author of the textbooks Discrete-Time Processing of Speech Signals (IEEE Press, 2000), Vehicles, Drivers and Safety: Intelligent Vehicles and Transportation (vol. 2 DeGruyter, 2020), and Digital Signal Processing for In-Vehicle Systems and Safety (Springer, 2012), and the lead author of The Impact of Speech Under ‘Stress’ on Military Speech Technology (NATO RTO-TR-10, 2000).

    His research interests include machine learning for speech and language processing, speech processing, analysis, and modeling of speech and speaker traits, speech enhancement, signal processing for hearing impaired or cochlear implants, machine learning-based knowledge estimation and extraction of naturalistic audio, and in-vehicle driver modeling and distraction assessment for human–machine interaction. He is an IEEE Fellow for contributions to robust speech recognition in stress and noise, and an ISCA Fellow for contributions to research for speech processing of signals under adverse conditions. He was the recipient of the Acoustical Society of America’s 25 Year Award in 2010, and is currently serving as ISCA President (2017–2022). He is also a Member and the past Vice-Chair on U.S. Office of Scientific Advisory Committees (OSAC) for OSAC-Speaker in the voice forensics domain from 2015 to 2021. He was the IEEE Technical Committee (TC) Chair and a Member of the IEEE Signal Processing Society: Speech-Language Processing Technical Committee (SLTC) from 2005 to 2008 and from 2010 to 2014, elected the IEEE SLTC Chairman from 2011 to 2013, and elected an ISCA Distinguished Lecturer from 2011 to 2012. He was a Member of the IEEE Signal Processing Society Educational Technical Committee from 2005 to 2010, a Technical Advisor to the U.S. Delegate for NATO (IST/TG-01), an IEEE Signal Processing Society Distinguished Lecturer from 2005 to 2006, an Associate Editor for the IEEE Transactions on Audio, Speech, and Language Processing from 1992 to 1999 and the IEEE Signal Processing Letters from 1998 to 2000, an Editorial Board Member for the IEEE Signal Processing Magazine from 2001 to 2003, and the Guest Editor in October 1994 for the Special Issue on Robust Speech Recognition for the IEEE Transactions on Audio, Speech, and Language Processing.

    He is currently an Associate Editor for the JASA, and was on the Speech Communications Technical Committee for the Acoustical Society of America from 2000 to 2003. In 2016, he was awarded the honorary degree Doctor Technices Honoris Causa from Aalborg University, Aalborg, Denmark in recognition of his contributions to the field of speech signal processing and speech or language or hearing sciences. He was the recipient of the 2020 Provost’s Award for Excellence in Graduate Student Supervision from the University of Texas at Dallas and the 2005 University of Colorado Teacher Recognition Award. He organized and was General Chair for ISCA Interspeech-2002, Co-Organizer and Technical Program Chair for the IEEE ICASSP-2010, Dallas, TX, and Co-Chair and Organizer for IEEE SLT-2014, Lake Tahoe, NV. He will be the Tech. Program Chair for the IEEE ICASSP-2024, and Co-Chair and Organizer for ISCA Interspeech-2022.


The tutorial introduces recent advancements in the emerging field of personalized speech enhancement. Personalizing a speech enhancement model leads to compressed, and thus efficient, machine learning inference because the model focuses on a particular user’s speech characteristics or their acoustic environment rather than trying to solve the general-purpose enhancement task. Since the test-time task can be seen as a smaller sub-problem of the generic speech enhancement problem, the model can also achieve better performance by solving a smaller and easier optimization problem. Moreover, since the goal is to adapt to the unseen test environment, personalization can improve the fairness of AI models for users who are not well represented in large training datasets. Meanwhile, personalized speech enhancement is challenging, as it is difficult to utilize the personal information (e.g., clean speech) of unknown test-time users ahead of time. In addition, acquiring such private data can raise privacy concerns in speech applications. In this tutorial, we will explore various definitions of personalized speech enhancement in the literature and relevant machine learning concepts, such as zero- or few-shot learning approaches, data augmentation and purification, self-supervised learning, knowledge distillation, and domain adaptation. We will also see how these methods can improve data and resource efficiency in machine learning while achieving the desired speech enhancement performance.

  • Minje Kim
    Indiana University, USA
    Visiting Academic at Amazon Lab126

    Professor Minje Kim is an Associate Professor in the Dept. of Intelligent Systems Engineering at Indiana University, where he leads his research group, Signals and AI Group in Engineering (SAIGE), and is affiliated with Data Science, Cognitive Science, Statistics, and Center for Machine Learning. He is also an Amazon Visiting Academic, working at Amazon Lab126. He earned his Ph.D. in the Dept. of Computer Science at the University of Illinois at Urbana-Champaign. Before joining UIUC, he worked as a researcher at ETRI, a national lab in Korea, from 2006 to 2011. Before then, he received his Master’s and Bachelor’s degrees in the Dept. of Computer Science and Engineering at POSTECH (Summa Cum Laude) and in the Division of Information and Computer Engineering at Ajou University (with honor) in 2006 and 2004, respectively. He is a recipient of various awards including the NSF Career Award (2021), the IU Trustees Teaching Award (2021), the IEEE SPS Best Paper Award (2020), and Google and Starkey’s grants for outstanding student papers in ICASSP 2013 and 2014, respectively. He is an IEEE Senior Member and also a member of the IEEE Audio and Acoustic Signal Processing Technical Committee (2018-2023). He is serving as an Associate Editor for EURASIP Journal of Audio, Speech, and Music Processing, and as a Consulting Associate Editor for IEEE Open Journal of Signal Processing. He is also a reviewer, program committee member, or area chair for the major machine learning and signal processing venues. He has filed more than 50 patent applications as an inventor.


Neural models have become ubiquitous in speech technologies: almost all state-of-the-art speech technologies use deep learning as their foundation. This has made it possible for all applications to be built on top of a neural network library. This possibility is currently coming to fruition in the PyTorch ecosystem of speech technologies. This diverse landscape includes, for example, the automatic-differentiation-capable weighted finite state models in K2 and GTN, the powerful speech representations learned by pretrained wav2vec 2.0 models in Fairseq and Transformers, and the models, data loading and recipes for a large variety of tasks implemented by toolkits like ESPnet, NeMo, Asteroid and SpeechBrain, as well as a wealth of other tools such as the metric learning criteria of PyTorch Metric Learning. The first chapter of the tutorial covers PyTorch fundamentals and interfaces, and the subsequent chapters look at different practical challenges and the tools that address them.

  • Aku Rouhe
    Aalto University, Finland

    Aku Rouhe is currently a PhD student at Aalto University, Finland, working under the supervision of Prof. Mikko Kurimo. His research interests are novel speech recognition approaches, particularly end-to-end speech recognition, and how these new approaches compare to more conventional hidden Markov model based approaches. Aku is also an experienced programmer in Python, a language he has been championing for close to 10 years. Aku is a core member of the original SpeechBrain development team.

  • Mirco Ravanelli
    Concordia University, Canada

    Mirco is an Assistant Professor at Concordia University (Montréal, QC), an Adjunct Professor at the University of Montréal and an Associated Member of Mila. Previously, he was a postdoctoral researcher at Mila (Université de Montréal) working under the supervision of Prof. Yoshua Bengio. His main research interests are deep learning, speech recognition, far-field speech recognition, cooperative learning, and self-supervised learning. He is the author or co-author of more than 40 papers on these research topics. He received his PhD (with cum laude distinction) from the University of Trento in December 2017. Mirco is an active member of the speech and machine learning communities. He is the founder and leader of the SpeechBrain project.

  • Titouan Parcollet
    LIA, Avignon Université, France
    CaMLSys, University of Cambridge, UK

    Titouan Parcollet is an Associate Professor in computer science at the Laboratoire Informatique d’Avignon (LIA) of Avignon University (FR) and a visiting scholar at the Cambridge Machine Learning Systems Lab of the University of Cambridge (UK). Previously, he was a senior research associate at the University of Oxford (UK) within the Oxford Machine Learning Systems group. He received his PhD in computer science from the University of Avignon (FR), in partnership with Orkis, focusing on quaternion neural networks, automatic speech recognition, and representation learning. His current work involves efficient speech recognition, federated learning and self-supervised learning. He is also currently collaborating with the University of Montréal (Mila, QC, Canada) on the SpeechBrain project.

  • Peter Plantinga
    JPMorgan Chase & Co., USA

    Peter Plantinga is an Applied AI/ML Associate at JPMorgan Chase & Co. He received his PhD in computer science from the Ohio State University (USA) under Prof. Eric Fosler-Lussier focusing on knowledge transfer for the tasks of speech enhancement, robust ASR, and reading verification. His current work involves adapting large-scale ASR models to work in the financial domain. Peter is one of the core members of the original SpeechBrain development team, and has continued to contribute since the original release.


Speech synthesis, which consists of several key tasks including text to speech (TTS) and voice conversion (VC), has been a hot research topic in the speech community and has broad applications in industry. With the development of deep learning and artificial intelligence, neural network-based speech synthesis has significantly improved the quality of synthesized speech in recent years. In this tutorial, we give a comprehensive introduction to neural speech synthesis, which consists of four parts: 1) the history of speech synthesis technology and a taxonomy of neural speech synthesis; 2) the key methods and applications of text to speech; 3) the key methods and applications of voice conversion; 4) challenges in neural speech synthesis and future research directions.

  • Xu Tan
    Microsoft Research Asia

    Xu Tan is a Senior Researcher at Microsoft Research Asia. His research covers deep learning and its applications to language/speech/music processing, including text to speech, singing voice synthesis, automatic speech recognition, neural machine translation, music composition, etc. He has designed several popular text to speech (TTS) systems such as FastSpeech/NaturalSpeech, and transferred many technologies to Microsoft Azure TTS services. He has developed machine translation systems that achieved human parity on Chinese-English translation and won several championships in the WMT machine translation competition, and has developed the language pre-training model MASS and the AI music project Muzic. He has published more than 90 papers at top AI conferences and served as an action editor or area chair for several AI journals/conferences (e.g., TMLR, NeurIPS, AAAI). He is an executive member of the committee on Speech, Dialogue and Auditory Processing, and a member of the standing committee on Computational Art in the China Computer Federation (CCF).

  • Hung-yi Lee
    National Taiwan University

    Hung-yi Lee is an Associate Professor in the Department of Electrical Engineering of National Taiwan University (NTU), with a joint appointment at the Department of Computer Science & Information Engineering of the university. His recent research focuses on developing technology that can reduce the requirement of annotated data for speech processing (including voice conversion and speech recognition) and natural language processing (including abstractive summarization and question answering). He won the Salesforce Research Deep Learning Grant in 2019, the AWS ML Research Award in 2020, the Outstanding Young Engineer Award from The Chinese Institute of Electrical Engineering in 2018, the Young Scholar Innovation Award from the Foundation for the Advancement of Outstanding Scholarship in 2019, the Ta-You Wu Memorial Award from the Ministry of Science and Technology of Taiwan in 2019, and The 59th Ten Outstanding Young Person Award in Science and Technology Research & Development of Taiwan. He runs a YouTube channel teaching deep learning in Mandarin with about 100k subscribers.

For further inquiries, please contact
The Tutorial Chairs: