
1. Introduction

Acoustic-to-articulatory inversion is the technique of estimating the vocal tract shape or the positions of the articulators from an input speech signal. It is of more than theoretical interest: it could benefit automatic speech recognition (ASR) [1], speech therapy and language training [2,3], talking head animation and lip-syncing [4-6], and low bit-rate speech coding [7].

Acoustic-to-articulatory inversion was first performed with codebook-based methods, where a codebook was built from acoustic-articulatory parameter pairs generated by synthesizing sounds with an articulatory model while scanning the entire space of control parameters. Articulation was then inferred by looking up the codebook [8,9]. However, this approach can yield invalid vocal tract shapes, since the same acoustic features can be produced by different combinations of articulatory parameters, some of which never occur in human speech production. This problem can be suppressed by introducing dynamic programming [10] as a post-processing step or by using human data.
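To make the codebook idea concrete, the following is a minimal sketch, not taken from the cited systems: it assumes the codebook is stored as paired acoustic/articulatory arrays, that `synthesize` stands in for an articulatory synthesizer, and that lookup is a simple nearest-neighbor search per frame (without the dynamic-programming smoothing mentioned above).

```python
import numpy as np

def build_codebook(articulatory_params, synthesize):
    """Pair each sampled articulatory configuration with the acoustic
    features (e.g. formants or cepstra) obtained by synthesizing it."""
    acoustic = np.stack([synthesize(p) for p in articulatory_params])
    return acoustic, np.asarray(articulatory_params)

def invert_frame(acoustic_frame, codebook):
    """Return the articulatory entry whose acoustic features are closest
    (Euclidean distance) to the observed acoustic frame."""
    acoustic, articulatory = codebook
    idx = np.argmin(np.linalg.norm(acoustic - acoustic_frame, axis=1))
    return articulatory[idx]
```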

Thanks to the advent of large corpora of synchronized human articulatory-acoustic data, numerous inversion methods have been proposed over the past decades to tackle the acoustic-to-articulatory inversion problem. These include hidden Markov models (HMM) [11-15], Kalman filtering [16], Gaussian mixture model (GMM) based regression [18], codebooks [19], non-linear regression with multilayer perceptrons (MLP) [20], mixture density networks (MDN) [21], deep neural networks [22,23], and trajectory MDNs (TMDN) [24]. In addition, several studies have incorporated visual features to obtain an audiovisual-to-articulatory mapping [15,25].

Most of the methods mentioned above train the inversion model with either a maximum likelihood or a least square error criterion, where the coordinates of each coil are treated with equal importance. Nevertheless, different articulators carry different importance in different articulations. The positions of some articulators follow consistent patterns across different contexts, such as the lower lip for bilabials, the tongue tip for alveolars, and the tongue dorsum for velars; these are called "critical articulators" [20]. Hence, the question arises whether this criticality information can be incorporated into the conventional cost function to achieve better acoustic-to-articulatory inversion performance.

In this study, we apply a batch-normalized DNN to the task of acoustic-to-articulatory inversion. The cost function takes the form of a weighted least square error, where the weighting coefficient of each articulatory channel is determined by an exponential function of its velocity profile.
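The exact form of the exponential weighting is not specified at this point in the paper, so the sketch below only illustrates the idea: each channel's weight is taken as the exponential of its mean absolute frame-to-frame velocity, normalized to mean one, and then used in a weighted least square error. The function names and the scaling factor `alpha` are assumptions for illustration.

```python
import numpy as np

def channel_weights(articulatory_traj, alpha=1.0):
    """Weight each articulatory channel by an exponential function of its
    velocity profile (assumed form: exp of the mean absolute velocity)."""
    velocity = np.diff(articulatory_traj, axis=0)   # (T-1, channels) frame-to-frame differences
    profile = np.mean(np.abs(velocity), axis=0)     # per-channel velocity profile
    w = np.exp(alpha * profile)
    return w / w.sum() * w.size                     # normalize so the weights average to 1

def weighted_lse(y_pred, y_true, weights):
    """Weighted least square error over frames and articulatory channels."""
    return np.mean(weights * (y_pred - y_true) ** 2)
```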
