NAISTAR


Please use this identifier to cite or link to this item: http://hdl.handle.net/10061/7814

Title: Improving Rapid Unsupervised Speaker Adaptation Based on HMM-Sufficient Statistics in Noisy Environments Using Multi-Template Models
Authors: Randy Gomez
Akinobu Lee
Tomoki Toda
Hiroshi Saruwatari
Kiyohiro Shikano
Keywords: HMM-Sufficient Statistics
unsupervised
speaker adaptation
noisy environments
Issue Date: Mar-2006
Publisher: The Institute of Electronics, Information and Communication Engineers (IEICE)
Journal Title: IEICE Transactions on Information and Systems
Volume: E89-D
Issue: 3
Start page: 998
End page: 1005
Abstract: This paper describes a multi-template method for unsupervised speaker adaptation based on HMM-Sufficient Statistics that improves adaptation performance while keeping adaptation time within a few seconds, using just one arbitrary utterance. The adaptation scheme consists of two processes. The first is performed offline and involves training multiple class-dependent acoustic models and creating the speakers' HMM-Sufficient Statistics based on gender and age. The second is performed online, where adaptation begins with a single utterance from a test speaker. From this utterance, the system classifies the speaker's class and then selects the N-best neighbor speakers closest to the utterance using Gaussian Mixture Models (GMM). The template model of the classified speaker's class is then adopted as the base model. From this template model, the adapted model is rapidly constructed using the N-best neighbor speakers' HMM-Sufficient Statistics. Experiments are performed in noisy conditions with office, crowd, booth, and car noise at 20 dB, 15 dB, and 10 dB SNR. The proposed multi-template method achieved a word accuracy of 89.5%, compared with 88.1% for the conventional single-template method and a baseline of 86.4% without adaptation. Moreover, comparisons with Vocal Tract Length Normalization (VTLN) and supervised Maximum Likelihood Linear Regression (MLLR) are also reported.
URI: http://hdl.handle.net/10061/7814
URL: https://search.ieice.org/
ISSN: 0916-8532
Rights: Copyright (C) 2006 The Institute of Electronics, Information and Communication Engineers (IEICE).
Text Version: publisher
Publisher DOI: 10.1093/ietisy/e89-d.3.998
Appears in Collections: Graduate School of Information Science
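
The online stage described in the abstract (class selection, N-best neighbor selection, and model construction from stored statistics) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the dictionary layout of the statistics ("occ" for occupancy counts, "first" for first-order sums), the use of diagonal-covariance GMMs, and the restriction to re-estimating Gaussian means only are all assumptions made for the example.

```python
import numpy as np


def gmm_avg_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM.

    frames: (T, D); weights: (M,); means, variances: (M, D).
    """
    diff = frames[:, None, :] - means[None, :, :]                      # (T, M, D)
    log_comp = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                       + np.sum(np.log(2 * np.pi * variances), axis=1))  # (T, M)
    log_mix = log_comp + np.log(weights)
    peak = log_mix.max(axis=1, keepdims=True)                          # log-sum-exp trick
    per_frame = peak[:, 0] + np.log(np.exp(log_mix - peak).sum(axis=1))
    return float(per_frame.mean())


def select_class(frames, class_gmms):
    """Pick the class (e.g. a gender/age group) whose GMM best explains the utterance."""
    return max(class_gmms, key=lambda c: gmm_avg_loglik(frames, *class_gmms[c]))


def n_best_speakers(frames, speaker_gmms, n):
    """Return the n training speakers whose GMMs score the utterance highest."""
    scores = {s: gmm_avg_loglik(frames, *g) for s, g in speaker_gmms.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]


def adapt_means(template_means, stats, speakers, floor=1e-8):
    """Re-estimate Gaussian means from the selected speakers' stored statistics.

    Assumed layout: stats[s]["occ"] is (G,), stats[s]["first"] is (G, D).
    Gaussians with no accumulated occupancy keep the template (base model) mean.
    """
    adapted = template_means.copy()
    if not speakers:
        return adapted
    occ = sum(stats[s]["occ"] for s in speakers)
    first = sum(stats[s]["first"] for s in speakers)
    seen = occ > floor
    adapted[seen] = first[seen] / occ[seen][:, None]
    return adapted
```

Because the per-speaker statistics are accumulated offline, the online step in this sketch reduces to GMM scoring plus a sum and a division, which is consistent with the short adaptation time the abstract reports.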

Files in This Item:

File: IEICETransInfoSys_E89D_3_998.pdf
Size: 5.66 MB
Format: Adobe PDF
