
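Book title: Chinese Journal of Phonetics (中國語音學報), Vol. 11
Author: Institute of Linguistics, Chinese Academy of Social Sciences (sponsor)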
Last updated: 2021-01-06 19:11:13

3. Experiment

The MOCHA-TIMIT corpus, which provides synchronized acoustic and articulatory data, is used in the experiment. It includes one male speaker (msak0) and one female speaker (fsew0), each uttering 460 TIMIT sentences. Electromagnetic receiver coils are attached to nine points in the midsagittal plane: the velum (V), tongue dorsum (TD), tongue blade (TB), tongue tip (TT), lower jaw (LJ), upper lip (UL), lower lip (LL), and two reference coils (REF) on the nose ridge and the upper jaw. The x and y coordinates of each coil are recorded, giving a total of 18 channels of articulatory information. Only the data of the male speaker (msak0) are used in this paper. Of these coils, the trajectories of V, TD, TB, TT, LJ, LL and UL are used in our experiment. The sampling frequency is 16,000 Hz for the acoustic signal and 500 Hz for the articulatory signal.

Figure 2 The positions of the EMA coils on the speaker's articulators

In preparing the experiment, the speech is segmented into frames with a 25 ms Hanning window and a frame shift of 10 ms. Each frame is encoded as the log-energy and 12th-order MFCCs, augmented with their delta and delta-delta coefficients. The EMA data are smoothed with a Savitzky-Golay filter of order 3 and frame size 21, and then down-sampled to 100 Hz to match the frame rate of the acoustic features.
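The framing and EMA smoothing steps can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the Savitzky-Golay coefficients are derived with a least-squares polynomial fit in NumPy (SciPy's `scipy.signal.savgol_filter` is a library alternative), the signal edges are assumed to be padded by repeating the first and last samples, and the 500 Hz to 100 Hz down-sampling is assumed to keep every fifth smoothed sample. MFCC extraction itself is not shown.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, win_ms=25, shift_ms=10):
    """Cut a 1-D signal into overlapping Hanning-windowed frames."""
    win = sample_rate * win_ms // 1000      # 400 samples at 16 kHz
    shift = sample_rate * shift_ms // 1000  # 160 samples at 16 kHz
    n_frames = 1 + (len(signal) - win) // shift
    # Row n of idx holds the sample indices of frame n.
    idx = np.arange(win)[None, :] + shift * np.arange(n_frames)[:, None]
    return np.asarray(signal, dtype=float)[idx] * np.hanning(win)

def savgol_coefficients(window_length=21, polyorder=3):
    """Smoothing weights of a Savitzky-Golay filter (value at window centre)."""
    half = window_length // 2
    offsets = np.arange(-half, half + 1)
    # Column p of the design matrix holds offsets**p.
    design = np.vander(offsets, polyorder + 1, increasing=True).astype(float)
    # Row 0 of the pseudo-inverse maps the samples to the fitted value at offset 0.
    return np.linalg.pinv(design)[0]

def smooth_and_downsample(ema, window_length=21, polyorder=3, factor=5):
    """Smooth each EMA channel (columns) and keep every `factor`-th sample."""
    ema = np.asarray(ema, dtype=float)
    coeffs = savgol_coefficients(window_length, polyorder)
    half = window_length // 2
    # Assumption: repeat the edge samples so the output keeps its length.
    padded = np.pad(ema, ((half, half), (0, 0)), mode="edge")
    smoothed = np.stack(
        [np.convolve(padded[:, c], coeffs[::-1], mode="valid")
         for c in range(ema.shape[1])],
        axis=1,
    )
    return smoothed[::factor]  # 500 Hz -> 100 Hz when factor is 5
```

With these settings, one second of audio at 16 kHz yields 98 frames of 400 samples, and one second of EMA data at 500 Hz yields 100 smoothed frames, so the two streams share the 100 Hz frame rate up to edge effects.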

In our experiments, a context window of 7 consecutive frames of acoustic features is used as the input. The output is the EMA frame at the time instant corresponding to the middle frame of the acoustic context window. The data are randomly partitioned into three sets: a validation set (45 utterances), a test set (45 utterances), and a training set (370 utterances). Both the EMA and the MFCC feature vectors are normalized per dimension by subtracting the global mean and dividing by the standard deviation.
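The input construction, data split and normalization described here might look like the sketch below. The function names, the random seed, and the handling of utterance boundaries (repeating the first and last frame) are assumptions made for illustration; the text does not specify them. The example also assumes 39-dimensional acoustic frames (13 static coefficients plus deltas and delta-deltas), which gives 7 x 39 = 273 input values per frame.

```python
import numpy as np

def zscore(features, mean=None, std=None):
    """Normalize each dimension (column) to zero mean and unit variance."""
    features = np.asarray(features, dtype=float)
    if mean is None:
        mean = features.mean(axis=0)
    if std is None:
        std = features.std(axis=0)
    return (features - mean) / std, mean, std

def stack_context(acoustic, context=7):
    """Concatenate `context` consecutive frames around every frame."""
    acoustic = np.asarray(acoustic, dtype=float)
    half = context // 2
    n_frames = acoustic.shape[0]
    # Assumption: pad by repeating the first and last frame of the utterance.
    padded = np.pad(acoustic, ((half, half), (0, 0)), mode="edge")
    # Row t holds frames t-half ... t+half; the EMA target is frame t.
    return np.concatenate(
        [padded[k:k + n_frames] for k in range(context)], axis=1
    )

def split_utterances(n_utterances=460, n_valid=45, n_test=45, seed=0):
    """Randomly partition utterance indices into train / validation / test."""
    order = np.random.default_rng(seed).permutation(n_utterances)
    valid = order[:n_valid]
    test = order[n_valid:n_valid + n_test]
    train = order[n_valid + n_test:]
    return train, valid, test
```

Passing a stored `mean` and `std` to `zscore` lets the same statistics be reused on every partition.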

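To measure accuracy, we adopt the root-mean-squared error (RMSE) and the correlation coefficient, the two most widely used measures for evaluating articulatory inversion performance. The RMSE indicates the overall distance between two trajectories, while the correlation indicates their synchrony and similarity of shape. They are defined as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{x}_i - x_i\right)^2}$$

$$r = \frac{\sum_{i=1}^{N}\left(\hat{x}_i - \bar{\hat{x}}\right)\left(x_i - \bar{x}\right)}{\sqrt{\sum_{i=1}^{N}\left(\hat{x}_i - \bar{\hat{x}}\right)^2 \sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2}}$$

where \hat{x}_i and x_i are the estimated and actual coordinates of a coil at time instant i, respectively, N is the number of frames, and \bar{\hat{x}} and \bar{x} are the means of the estimated and actual coordinates over those frames.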
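Both measures can be computed per channel as in the following sketch, which assumes the standard definitions of the RMSE and of the Pearson correlation coefficient. The function names are illustrative.

```python
import numpy as np

def rmse(estimated, actual):
    """Root-mean-squared error between two trajectories of equal length."""
    estimated = np.asarray(estimated, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((estimated - actual) ** 2)))

def correlation(estimated, actual):
    """Pearson correlation coefficient between two trajectories."""
    estimated = np.asarray(estimated, dtype=float)
    actual = np.asarray(actual, dtype=float)
    d_est = estimated - estimated.mean()
    d_act = actual - actual.mean()
    return float(np.sum(d_est * d_act)
                 / np.sqrt(np.sum(d_est ** 2) * np.sum(d_act ** 2)))
```

A constant offset between the estimated and the actual trajectory raises the RMSE but leaves the correlation at 1, which is why the two measures are reported together.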

Two cost functions are used to train the batch-normalized DNN: L₁, the least-squares loss, and L₂, a weighted version of L₁.

The weighting coefficient w_ij of L₂ depends on the velocity v_i at the current time instant.
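The relation between the two cost functions can be illustrated with a small sketch. The exact normalization of the losses and the mapping from velocity to weight are not given in the text, so both are assumptions here: the loss is taken as the squared error summed over channels and averaged over frames, and the weights are simply passed in by the caller. The function name is hypothetical.

```python
import numpy as np

def squared_error_loss(pred, target, weights=None):
    """Least-squares loss; with `weights`, its element-wise weighted version.

    pred, target and weights have shape (frames, channels).
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    sq_err = (pred - target) ** 2
    if weights is not None:
        # Assumption: weights are precomputed by the caller, e.g. from velocity.
        sq_err = np.asarray(weights, dtype=float) * sq_err
    # Assumption: sum over channels, average over frames.
    return float(sq_err.sum() / pred.shape[0])
```

Calling the function without `weights`, or with all weights equal to 1, gives the unweighted loss, so the weighted loss reduces to the least-squares loss in that case.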
