TY - JOUR
T1 - Improving Mispronunciation Detection with Wav2vec2-based Momentum Pseudo-Labeling for Accentedness and Intelligibility Assessment
AU - Yang, Mu
AU - Hirschi, Kevin
AU - Looney, Stephen D.
AU - Kang, Okim
AU - Hansen, John H.L.
N1 - Funding Information:
We have presented an approach that uses unlabeled L2 speech to enhance MDD performance via pseudo-labeling. In addition, we take a step toward using the MDD model for automatic L2 speech intelligibility and accentedness assessment. Through a human listening test, we have shown that the MDD model's recognition performance correlates strongly with human perception. In the future, we plan to include more speech attributes, such as lexical stress and speech rate, in the L2 speech intelligibility assessment framework. This study is supported by NSF EAGER CISE Project 2140415, and partially by the University of Texas at Dallas through the Distinguished University Chair in Telecommunications Engineering held by J. H. L. Hansen. We would like to thank the raters at Northern Arizona University for their participation in our listening test.
Publisher Copyright:
Copyright © 2022 ISCA.
PY - 2022
Y1 - 2022
AB - Current leading mispronunciation detection and diagnosis (MDD) systems achieve promising performance via end-to-end phoneme recognition. One challenge of such end-to-end solutions is the scarcity of human-annotated phonemes on natural L2 speech. In this work, we leverage unlabeled L2 speech via a pseudo-labeling (PL) procedure and extend the fine-tuning approach based on pre-trained self-supervised learning (SSL) models. Specifically, we use Wav2vec 2.0 as our SSL model and fine-tune it using the original labeled L2 speech samples plus the created pseudo-labeled L2 speech samples. Our pseudo labels are dynamic and are produced on the fly by an ensemble of the online model, which ensures that our model is robust to pseudo-label noise. We show that fine-tuning with pseudo labels achieves a 5.35% phoneme error rate reduction and a 2.48% MDD F1 score improvement over a labeled-samples-only fine-tuning baseline. The proposed PL method is also shown to outperform conventional offline PL methods. Compared with state-of-the-art MDD systems, our MDD solution produces a more accurate and consistent phonetic error diagnosis. In addition, we conduct an open test on a separate UTD-4Accents dataset, where our system's recognition outputs show a strong correlation with human perception of accentedness and intelligibility.
KW - Mispronunciation detection and diagnosis
KW - intelligibility assessment
KW - pseudo-labeling
KW - wav2vec 2.0
UR - http://www.scopus.com/inward/record.url?scp=85140071364&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85140071364&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2022-11039
DO - 10.21437/Interspeech.2022-11039
M3 - Conference article
AN - SCOPUS:85140071364
SN - 2308-457X
VL - 2022-September
SP - 4481
EP - 4485
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022
Y2 - 18 September 2022 through 22 September 2022
ER -