- 中國語音學報 (Chinese Journal of Phonetics), No. 11
- Hosted by the Institute of Linguistics, Chinese Academy of Social Sciences
- 2021-01-06 19:11:12
1. Introduction
Acoustic-to-articulatory inversion is the technique of estimating vocal tract shape or articulator positions from input speech signals. It is of more than theoretical interest: it could help automatic speech recognition (ASR) [1], speech therapy and language training [2,3], talking head animation and lip-syncing [4-6], and low bit-rate speech coding [7].
Acoustic-to-articulatory inversion was first performed with codebook-based methods, in which a codebook was built from acoustic-articulatory parameter pairs generated by synthesizing sounds with an articulatory model while scanning the entire space of its control parameters. Articulation was then inferred by looking up the codebook [8,9]. However, this approach can yield invalid vocal tract shapes, since the same acoustic features can be generated by different combinations of articulatory parameters, some of which never occur in human speech production. This problem can be mitigated by introducing dynamic programming [10] as post-processing or by using human data.
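The codebook idea and its non-uniqueness problem can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `toy_synthesize` is a hypothetical two-parameter stand-in for an articulatory synthesizer (not a real vocal tract model), the grid is arbitrary, and nearest-neighbour lookup by Euclidean distance is only one plausible reading of "looking up the codebook".

```python
import itertools
import math

def toy_synthesize(params):
    """Hypothetical stand-in for an articulatory synthesizer (assumption, not a real model)."""
    a, b = params
    # Deliberately many-to-one: (a, b) and (b, a) produce the same "acoustic" vector.
    return (a + b, a * b)

def build_codebook(grid):
    """Scan the whole control-parameter grid and store (acoustic, articulatory) pairs."""
    return [(toy_synthesize(p), p) for p in itertools.product(grid, repeat=2)]

def invert(codebook, acoustic):
    """Return the articulatory entry whose acoustic vector is nearest to the query."""
    return min(codebook, key=lambda entry: math.dist(entry[0], acoustic))[1]

def candidates(codebook, acoustic, tol=1e-9):
    """List every articulatory entry matching the query within tol."""
    return [p for ac, p in codebook if math.dist(ac, acoustic) <= tol]

grid = [0.0, 0.5, 1.0]
codebook = build_codebook(grid)
query = toy_synthesize((0.0, 1.0))
best = invert(codebook, query)          # one answer is returned ...
matches = candidates(codebook, query)   # ... although several entries fit equally well
```

In this toy setting more than one articulatory configuration reproduces the query exactly, so the lookup alone cannot tell which one a speaker would actually use; this is the ambiguity that post-processing or human data is meant to resolve.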
Thanks to the advent of large corpora of synchronized articulatory-acoustic human data, numerous inversion methods have been proposed in the past decades to tackle the problem of acoustic-to-articulatory inversion. They include a variety of hidden Markov models (HMM) [11-15], Kalman filtering [16], Gaussian mixture model (GMM) based regression [18], codebooks [19], non-linear regression with a multilayer perceptron (MLP) [20], mixture density networks (MDN) [21], deep neural networks [22,23], and trajectory MDNs (TMDN) [24]. In addition, several studies have tried to incorporate visual features to give an audiovisual-to-articulatory mapping [15,25].
Most of the methods mentioned above apply either a maximum likelihood or a least square error criterion to train the inversion model, where the coordinates of each coil are treated with equal importance. Nevertheless, different articulators are of different importance in different articulations. The positions of some articulators show a consistent pattern across different contexts, such as the lower lip for bilabials, the tongue tip for alveolars, and the tongue dorsum for velars; these are called “critical articulators” [20]. Hence, the question arises whether critical information can be incorporated into the conventional cost function to achieve better acoustic-to-articulatory inversion performance.
In this study, we apply a batch normalization DNN to the task of acoustic-to-articulatory inversion. The cost function takes the form of a weighted least square error, where the weighting coefficient of each articulatory channel is determined by an exponential function of its velocity profile.
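A weighted least square error of this kind might look like the sketch below. The exact weighting is not specified at this point in the text, so the form used here is an assumption: weights are computed per frame and per channel as `exp(-alpha * |v|)`, where `v` is a first-order finite-difference velocity and `alpha` is a hypothetical scale factor. The negative exponent (slow-moving channels weigh more) and the normalization by the sum of weights are likewise illustrative choices, not the authors' confirmed formulation.

```python
import math

def velocity(traj):
    """First-order finite difference per channel; frame 0 reuses the first difference."""
    diffs = [[b - a for a, b in zip(prev, cur)] for prev, cur in zip(traj, traj[1:])]
    return [diffs[0]] + diffs

def weights_from_velocity(traj, alpha=1.0):
    """Assumed form w = exp(-alpha * |v|): slow-moving channels get larger weights."""
    return [[math.exp(-alpha * abs(v)) for v in frame] for frame in velocity(traj)]

def weighted_sq_error(pred, target, weights):
    """Weighted squared error over all frames and channels, normalized by total weight."""
    total = sum(w * (p - t) ** 2
                for pf, tf, wf in zip(pred, target, weights)
                for p, t, w in zip(pf, tf, wf))
    norm = sum(w for wf in weights for w in wf)
    return total / norm

# Two channels over three frames: channel 0 is static, channel 1 moves steadily.
target = [[0.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
weights = weights_from_velocity(target)

pred_a = [[0.5, 0.0], [0.5, 1.0], [0.5, 2.0]]  # error only on the static channel
pred_b = [[0.0, 0.5], [0.0, 1.5], [0.0, 2.5]]  # same-sized error on the moving channel

loss_a = weighted_sq_error(pred_a, target, weights)
loss_b = weighted_sq_error(pred_b, target, weights)
```

Under these assumptions the same absolute error costs more on the static channel than on the moving one, which is the qualitative effect a channel-dependent weighting is meant to produce; an unweighted least square error would score both predictions identically.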