
3. Experiment

The MOCHA-TIMIT corpus, which provides synchronized acoustic and articulatory recordings, is used in the experiment. It includes one male speaker (msak0) and one female speaker (fsew0), each uttering 460 TIMIT sentences. Electromagnetic receiver coils are attached at 9 points in the midsagittal plane: the velum (V), tongue dorsum (TD), tongue blade (TB), tongue tip (TT), lower jaw (LJ), upper lip (UL), lower lip (LL), and two reference coils (REF) on the nose ridge and upper jaw. The x and y coordinates of each coil are recorded, giving a total of 18 channels of articulatory information. Only the data of the male speaker (msak0) are used in this paper. Of these coils, the trajectories of V, TD, TB, TT, LJ, LL and UL are used in our experiment, i.e. 14 articulatory channels. The sampling frequencies are 16,000 Hz for the acoustic signal and 500 Hz for the articulatory signal.

Figure 2 The positions of the EMA coils on the speaker's articulators

To prepare the data, the speech is segmented into frames with a 25 ms Hanning window and a frame shift of 10 ms. Each speech frame is encoded by its log-energy and 12th-order MFCCs, augmented with their delta and delta-delta coefficients. The EMA data are smoothed with a Savitzky-Golay filter of order 3 and frame size 21, and down-sampled to 100 Hz to match the frame rate of the acoustic features.
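The frame bookkeeping implied by these settings can be checked with a few lines of arithmetic. The sketch below is only illustrative: the constant and function names are hypothetical, it assumes that "12th-order MFCCs" means 12 cepstral coefficients per frame, that no padding is applied at utterance boundaries, and that down-sampling is done by keeping every fifth sample of the already smoothed EMA track.

```python
# Illustrative bookkeeping for the acoustic and articulatory frame rates.
ACOUSTIC_RATE_HZ = 16_000
EMA_RATE_HZ = 500
WINDOW_MS = 25
SHIFT_MS = 10

window_samples = ACOUSTIC_RATE_HZ * WINDOW_MS // 1000  # samples per window
shift_samples = ACOUSTIC_RATE_HZ * SHIFT_MS // 1000    # samples per shift
frame_rate_hz = 1000 // SHIFT_MS                        # acoustic frames per second
ema_decimation = EMA_RATE_HZ // frame_rate_hz           # EMA samples per frame

# Assumption: log-energy plus 12 MFCCs, each with delta and delta-delta.
static_dim = 1 + 12
feature_dim = static_dim * 3


def num_frames(num_samples):
    """Number of full analysis windows, assuming no padding."""
    if num_samples < window_samples:
        return 0
    return 1 + (num_samples - window_samples) // shift_samples


def decimate(track, factor):
    """Keep every `factor`-th sample of an already smoothed track."""
    return track[::factor]
```

Under these assumptions each acoustic frame covers 400 samples, consecutive frames are 160 samples apart, and one EMA sample out of five is retained, so both streams run at 100 frames per second.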

In our experiments, a context window of 7 consecutive frames of acoustic features is used as the input. The output is the EMA frame at the time instant corresponding to the middle frame of the context window. The data are randomly partitioned into three sets: a training set (370 utterances), a validation set (45 utterances), and a testing set (45 utterances). Both the EMA and MFCC feature vectors are normalized per dimension by subtracting the global mean and dividing by the standard deviation.
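A minimal sketch of this data preparation is given below, in plain Python for self-containment. All function names are hypothetical. It assumes that frames lacking a full 7-frame context at utterance edges are skipped, that the population standard deviation is used, and that the random split is made at the utterance level with a fixed seed; the text does not specify any of these details.

```python
import math
import random


def stack_context(frames, context=7):
    """Return (centre index, concatenated window) pairs.

    Assumption: edge frames without a full context are skipped.
    """
    half = context // 2
    stacked = []
    for t in range(half, len(frames) - half):
        window = frames[t - half:t + half + 1]
        stacked.append((t, [v for frame in window for v in frame]))
    return stacked


def zscore_stats(vectors):
    """Per-dimension mean and (population) standard deviation."""
    n, dims = len(vectors), len(vectors[0])
    means = [sum(v[d] for v in vectors) / n for d in range(dims)]
    stds = [
        math.sqrt(sum((v[d] - means[d]) ** 2 for v in vectors) / n)
        for d in range(dims)
    ]
    return means, stds


def normalize(vectors, means, stds):
    """Subtract the mean and divide by the std of each dimension."""
    return [
        [(x - m) / (s or 1.0) for x, m, s in zip(v, means, stds)]
        for v in vectors
    ]


def split_utterances(ids, n_valid=45, n_test=45, seed=0):
    """Random utterance-level split into train / validation / test."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    valid = ids[:n_valid]
    test = ids[n_valid:n_valid + n_test]
    train = ids[n_valid + n_test:]
    return train, valid, test
```

With 460 utterance identifiers, `split_utterances(range(460))` yields 370, 45 and 45 utterances. The centre index returned by `stack_context` selects the EMA frame that serves as the regression target for each stacked input.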

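To measure accuracy, we adopt the root-mean-squared error (RMSE) and the correlation coefficient, the two most widely used measures for evaluating articulatory inversion performance. The RMSE indicates the overall distance between two trajectories, while the correlation indicates their synchrony and similarity of shape. They are defined as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{x}_i - x_i\right)^2}$$

$$r = \frac{\sum_{i=1}^{N}\left(\hat{x}_i - \bar{\hat{x}}\right)\left(x_i - \bar{x}\right)}{\sqrt{\sum_{i=1}^{N}\left(\hat{x}_i - \bar{\hat{x}}\right)^2 \sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2}}$$

where $\hat{x}_i$ and $x_i$ are the estimated and actual coordinates of a coil at time instant $i$, respectively, $N$ is the number of frames, and $\bar{\hat{x}}$ and $\bar{x}$ are the means of the estimated and actual trajectories.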
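In their standard forms, the two measures can be computed for a single articulatory channel as sketched below. The function names are hypothetical, and each trajectory is assumed to be a plain sequence of coordinates at the 100 Hz frame rate.

```python
import math


def rmse(estimated, actual):
    """Root-mean-squared error between two equal-length trajectories."""
    n = len(actual)
    return math.sqrt(sum((e - a) ** 2 for e, a in zip(estimated, actual)) / n)


def correlation(estimated, actual):
    """Pearson correlation coefficient between two trajectories."""
    n = len(actual)
    mean_e = sum(estimated) / n
    mean_a = sum(actual) / n
    cov = sum((e - mean_e) * (a - mean_a) for e, a in zip(estimated, actual))
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    return cov / math.sqrt(var_e * var_a)
```

The two measures are complementary: an estimate that is offset from the true trajectory by a constant has a correlation of 1 but a non-zero RMSE, which is why both are reported.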

Two cost functions are used to train the batch-normalized DNN: $L_1$ is the least-squares loss and $L_2$ is a weighted version of $L_1$.

In $L_2$, the weighting coefficient $w_{ij}$ depends on the velocity $v_i$ at the current time instant.
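The relation between the two losses can be sketched as follows. This is an illustration under stated assumptions, not the exact formulation used here: the function names are hypothetical, the losses are written as plain sums over frames i and channels j with no scaling constant, and the weights are passed in as an argument because the mapping from velocity to weight is not specified in the text.

```python
def squared_loss(pred, target):
    """Least-squares loss: sum of squared errors over frames and channels."""
    return sum(
        (p - t) ** 2
        for pred_frame, target_frame in zip(pred, target)
        for p, t in zip(pred_frame, target_frame)
    )


def weighted_squared_loss(pred, target, weights):
    """Weighted variant: each squared error is scaled by weights[i][j].

    Assumption: the caller derives the weights from the articulator
    velocity; that mapping is deliberately left unspecified here.
    """
    return sum(
        w * (p - t) ** 2
        for pred_frame, target_frame, weight_frame in zip(pred, target, weights)
        for p, t, w in zip(pred_frame, target_frame, weight_frame)
    )
```

With all weights equal to 1 the weighted loss reduces to the least-squares loss, so the weights can be read as a per-frame, per-channel emphasis on the squared error.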
