Speaker Odyssey 2018 keynote

Slides, video and bibliography for my invited keynote at Odyssey 2018, Les Sables d’Olonne, France, June 2018.

Speaking naturally? It depends who is listening…

slow / normal / fast

Links to the videos and sounds included in this presentation:

Audio for Carlini and Wagner “Audio Adversarial Examples: Targeted Attacks on Speech-to-Text”, eprint arXiv:1801.01944
Video for Carlini, Mishra, Vaidya, Zhang, Sherr, Shields, Wagner and Zhou “Hidden Voice Commands”, USENIX Security Symposium (Security), August 2016 (PDF)
Video for Athalye, Engstrom, Ilyas and Kwok “Synthesizing Robust Adversarial Examples”, eprint arXiv:1707.07397

Speech quality

John G. Beerends, Christian Schmidmer, Jens Berger, Matthias Obermann, Raphael Ullmann, Joachim Pomy, Michael Keyhl. Perceptual Objective Listening Quality Assessment (POLQA), The Third Generation ITU-T Standard for End-to-End Speech Quality Measurement Part II–Perceptual Model. J. Audio Eng. Soc. 61(6), Jun 2013
Florian Hinterleitner, Sebastian Moeller, Tiago H. Falk, Tim Polzehl. Comparison of Approaches for Instrumentally Predicting the Quality of Text-to-Speech Systems: Data from Blizzard Challenges 2008 and 2009. Proc. Blizzard Challenge Workshop 2010, Kansai Science City, Japan, Sep 2010
Christoph R. Norrenbrock, Florian Hinterleitner, Ulrich Heute, Sebastian Moeller. Quality prediction of synthesized speech based on perceptual quality dimensions. Speech Communication 66, pp. 17–35, Feb 2015

Adversarial methods

Anh Nguyen, Jason Yosinski, Jeff Clune. Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, Jun 2015
Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. Explaining and harnessing adversarial examples. Unreviewed report arXiv:1412.6572v3
Anish Athalye, Logan Engstrom, Andrew Ilyas, Kevin Kwok. Synthesizing robust adversarial examples. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, July 2018.

Attacks on speech recognition

Nicholas Carlini, Pratyush Mishra, Tavish Vaidya, Yuankai Zhang, Micah Sherr, Clay Shields, David Wagner, Wenchao Zhou. Hidden Voice Commands. 25th USENIX Security Symposium, Austin, TX, USA, Aug 2016.
Guoming Zhang, Chen Yan, Xiaoyu Ji, Tianchen Zhang, Taimin Zhang, Wenyuan Xu. DolphinAttack: Inaudible Voice Commands. In Proc. 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS), Dallas, TX, USA, Oct-Nov 2017.

Speaker verification, spoofing and anti-spoofing

Tomi Kinnunen, Haizhou Li. An overview of text-independent speaker recognition: From features to supervectors. Speech Communication 52, pp. 12–40, Jan 2010
Zhizheng Wu, Junichi Yamagishi, Tomi Kinnunen, Cemal Hanilci, Md Sahidullah, Aleksandr Sizov, Nicholas Evans, Massimiliano Todisco, Hector Delgado. ASVspoof: The Automatic Speaker Verification Spoofing and Countermeasures Challenge. IEEE Journal of Selected Topics in Signal Processing, 11(4), June 2017.
Hector Delgado, Massimiliano Todisco, Md Sahidullah, Nicholas Evans, Tomi Kinnunen, Kong Aik Lee, Junichi Yamagishi. ASVspoof 2017 Version 2.0: meta-data analysis and baseline enhancements. Proc. Odyssey 2018: The Speaker and Language Recognition Workshop, Les Sables d’Olonne, France, Jun 2018.

Speech synthesis

Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, Apr 2018.
Cassia Valentini-Botinhao, Zhizheng Wu, Simon King. Towards minimum perceptual error training for DNN-based speech synthesis. Proc. 16th Annual Conference of the International Speech Communication Association (Interspeech), Dresden, Germany, Sep 2015.
Shan Yang, Lei Xie, Xiao Chen, Xiaoyan Lou, Xuan Zhu, Dongyan Huang, Haizhou Li. Statistical parametric speech synthesis using generative adversarial networks under a multi-task learning framework. Proc. ASRU 2017.
Yuki Saito, Shinnosuke Takamichi, Hiroshi Saruwatari. Training algorithm to deceive anti-spoofing verification for DNN-based speech synthesis. Proc. ICASSP 2017.
Yuki Saito, Shinnosuke Takamichi, Hiroshi Saruwatari. Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks. IEEE/ACM Trans. Audio, Speech, and Language Processing, 26(1), Jan 2018. DOI 10.1109/TASLP.2017.2761547

Miscellaneous: liveness detection; privacy

Linghan Zhang, Sheng Tan, Jie Yang, Yingying Chen. VoiceLive: A Phoneme Localization based Liveness Detection for Voice Authentication on Smartphones. CCS’16, October 24-28, 2016, Vienna, Austria
Linghan Zhang, Sheng Tan, Jie Yang. Hearing Your Voice is Not Enough: An Articulatory Gesture Based Liveness Detection for Voice Authentication. CCS’17, October 30-November 3, 2017, Dallas, TX, USA
Sree Hari Krishnan Parthasarathi. Privacy-Sensitive Audio Features for Conversational Speech Processing. PhD thesis no. 5234, EPFL, Switzerland, 2011