IEEE/ACM Transactions on Audio, Speech, and Language Processing

Publication date: 2017-02-01
Volume: 25 Pages: 285 - 295
Publisher: Institute of Electrical and Electronics Engineers (IEEE)

Authors:

Renkens, Vincent
Van hamme, Hugo

Keywords:

PSI_SPEECH, Science & Technology, Technology, Acoustics, Engineering, Electrical & Electronic, Engineering, Nonnegative matrix factorisation, hidden Markov models, weak supervision, language acquisition, PSI_4156, 0801 Artificial Intelligence and Image Processing, 0906 Electrical and Electronic Engineering, Speech-Language Pathology & Audiology, 4006 Communications engineering, 4602 Artificial intelligence, 4603 Computer vision and multimedia computation

Abstract:

© 2014 IEEE. This paper discusses a spoken command-and-control interface that acquires spoken language through demonstrations from the user. The user trains the system by uttering a command and subsequently demonstrating the required action through an alternative interface. From the demonstration, a bag-of-semantic-concepts representation is extracted, indicating which semantic concepts are present in the demonstration. In previous work, we proposed a method for learning words for these concepts by linking the bag-of-semantic-concepts representation to a bag-of-features representation of the acoustics. In this method, the order in which the words occur is lost. However, in many cases, word order is needed to determine the correct action. In this paper, the vocabulary acquisition based on nonnegative matrix factorization is jointly trained with a hidden Markov model (HMM), making it possible to use the bag-of-concepts representation as weak supervision for HMM learning. This model makes better use of timing information to improve the results, and it retains the order in which the words occur, making it possible to learn both vocabulary and grammar. The proposed system is tested on several command-and-control tasks, and it is shown that, for unimpaired speech, the resulting system outperforms the system based solely on vocabulary acquisition.