Recognition of out-of-vocabulary in direct acoustics-to-word speech recognition using acoustic word embedding

US10839792B2 · US · B2

Patent metadata
Publication number: US-10839792-B2
Application number: US-201916267489-A
Country: US
Kind code: B2
Filing date: Feb 5, 2019
Priority date: Feb 5, 2019
Publication date: Nov 17, 2020
Grant date: Nov 17, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method (and structure and computer product) for learning Out-of-Vocabulary (OOV) words in an Automatic Speech Recognition (ASR) system includes using an Acoustic Word Embedding Recurrent Neural Network (AWE RNN) to receive a character sequence for a new OOV word for the ASR system, the RNN providing an Acoustic Word Embedding (AWE) vector as an output thereof. The AWE vector output from the AWE RNN is provided as an input into an Acoustic Word Embedding-to-Acoustic-to-Word Neural Network (AWE→A2W NN) trained to provide an OOV word weight value from the AWE vector. The OOV word weight is inserted into a listing of Acoustic-to-Word (A2W) word embeddings used by the ASR system to output recognized words from an input of speech acoustic features, wherein the OOV word weight is inserted into the A2W word embeddings list relative to existing weights in the A2W word embeddings list.
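
The flow described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: `awe_rnn` and `awe_to_a2w` are hypothetical stand-ins (a toy recurrent update over character codes and a fixed linear map) for the trained AWE RNN and AWE→A2W NN, and the dimensions, weights, and example words are made up.

```python
import math
from typing import Dict, List

AWE_DIM = 4   # assumed size of an acoustic word embedding (AWE) vector
A2W_DIM = 3   # assumed size of an entry in the A2W word embeddings list


def awe_rnn(chars: str) -> List[float]:
    """Hypothetical stand-in for the trained AWE RNN: characters -> AWE vector."""
    state = [0.0] * AWE_DIM
    for ch in chars:
        # Toy recurrent update: mix the previous state with the character code.
        state = [math.tanh(0.5 * s + math.sin(ord(ch) * (i + 1)))
                 for i, s in enumerate(state)]
    return state


# Hypothetical weights of the AWE->A2W network (a single linear layer here).
PROJECTION = [[0.2, -0.1, 0.4, 0.3],
              [0.5, 0.1, -0.3, 0.2],
              [-0.4, 0.6, 0.1, 0.1]]


def awe_to_a2w(awe: List[float]) -> List[float]:
    """Map an AWE vector to a weight vector in the A2W embedding space."""
    return [sum(w * x for w, x in zip(row, awe)) for row in PROJECTION]


def add_oov_word(word: str, a2w_embeddings: Dict[str, List[float]]) -> None:
    """Insert the OOV word's weight alongside the existing A2W embeddings."""
    a2w_embeddings[word] = awe_to_a2w(awe_rnn(word))


# Existing in-vocabulary entries (made-up values).
a2w_embeddings = {"hello": [0.9, 0.1, 0.0], "world": [0.0, 0.8, 0.2]}
add_oov_word("zyzzyva", a2w_embeddings)
```

In this sketch the new word needs only its spelling: no audio of the OOV word is involved, and the existing entries are left unchanged.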

First claim

Opening claim text (preview).

What is claimed is:

1. A method for learning Out-of-Vocabulary (OOV) words in an Automatic Speech Recognition (ASR) system, the method comprising: using an Acoustic Word Embedding Recurrent Neural Network (AWE RNN) to receive a character sequence for a new OOV word for the ASR system, the RNN providing an Acoustic Word Embedding (AWE) vector as an output thereof; providing the AWE vector output from the AWE RNN as an input into an Acoustic Word Embedding-to-Acoustic-to-Word Neural Network (AWE→A2W NN) trained to provide an OOV word weight value from the AWE vector; and inserting the OOV word weight into a listing of Acoustic-to-Word (A2W) word embeddings used by the ASR system to output recognized words from an input of speech acoustic features, wherein the OOV word weight is inserted into the A2W word embeddings list relative to existing weights in the A2W word embeddings list.

2. The method of claim 1, wherein the AWE RNN is initially trained as an overall subnetwork using character sequences of In-Vocabulary (IV) words, wherein the initial training further involves an Acoustic Embedding Recurrent Neural Network (AE RNN) that receives an acoustic sequence correspondingly to each character sequence of an IV word used during training, wherein outputs of the AWE RNN and AE RNN are passed into a contrastive loss function, and wherein the AWE RNN and AWE→A2W NN are trained using a backpropagation algorithm to train weights of the AWE RNN, weights of the AE RNN, and weights of the AWE→A2W NN to minimize the contrastive loss function.

3. The method of claim 2, wherein, subsequent to the initial training of the overall subnetwork, the AE RNN is not used for normal operation of the ASR system and only the AWE RNN is used for a subsequent introduction of OOV words into the ASR system.

4. The method of claim 1, wherein the ASR system further comprises an Acoustic-to-Word Recurrent Neural Network (A2W RNN) that receives speech acoustic features as an input therein and an output of the A2W RNN is compared to embeddings of the A2W word embeddings listing using a dot product, and wherein, during a normal operation mode of the ASR system in which recognized words are output by the ASR system in response to speech acoustic features from an acoustic input into the ASR system, a word from the A2W word embeddings listing having a highest comparison result is provided as an output of the ASR system as a recognized word for the input speech acoustic features.

5. The method of claim 4, wherein an overall subnetwork including the A2W RNN is trained using In-Vocabulary (IV) words, wherein speech acoustic features of an IV word and a word sequence corresponding to that IV word are provided into a loss function, and wherein a backpropagation algorithm updates weights of the A2W RNN in order to minimize this loss function and to provide the A2W word embeddings listing.

6. The method of claim 1, as implemented in a cloud service.

7. A method for Automatic Speech Recognition (ASR), the method comprising: receiving a character sequence for an Out-of-Vocabulary (OOV) word into an Acoustic Word Embedding Recurrent Neural Network (AWE RNN) of an ASR system, as a mechanism to receive a character sequence for a new OOV word for the ASR system, the AWE RNN providing an Acoustic Word Embedding (AWE) vector as an output thereof; providing the AWE vector output from the AWE RNN as an input into an Acoustic Word Embedding-to-Acoustic-to-Word Neural Network (AWE→A2W NN) trained to provide an OOV word weight value from the AWE vector; and inserting the OOV word weight into a listing of Acoustic-to-Word (A2W) word embeddings used by the ASR system to output recognized words from an input of speech acoustic features, wherein the OOV word weight is inserted into the A2W word embeddings list relative to existing weights in the A2W word embeddings list.

8. The method of claim 7, wherein the AWE RNN is initially trained as an overall subnetwork using character sequences of In-Vocabulary (IV) words, wherein the initial training further involves an Acoustic Embedding Recurrent Neural Network (AE RNN) that receives an acoustic sequence correspondingly to each character sequence of an IV word used during training, wherein outputs of the AWE RNN and AE RNN are passed into a contrastive loss function, and wherein the AWE RNN and AWE→A2W NN are trained using a backpropagation algorithm to train weights of the AWE RNN, weights of the AE RNN, and weights of the AWE→A2W NN to minimize the contrastive loss function.

9. The method of claim 8, wherein, subsequent to the initial training of the overall subnetwork, the AE RNN is not used for normal operation of the ASR system and only the AWE RNN is used for a subsequent introduction of OOV words into the ASR system.

10. The method of claim 7, wherein the ASR system further comprises an Acoustic-to-Word Recurrent Neural Network (A2W RNN) that receives speech acoustic features as an input therein and an output of the A2W RNN is compared to embeddings of the A2W word embeddings listing using a dot product, and wherein, during a normal operation mode of the ASR system in which recognized words are output by the ASR system in response to speech acoustic features from an acoustic input into the ASR system, a word from the A2W word embeddings listing having a highest comparison result is provided as an output of the ASR system as a recognized word for the input speech acoustic features.

11. The method of claim 10, wherein an overall subnetwork including the A2W RNN is trained using In-Vocabulary (IV) words, wherein speech acoustic features of an IV word and a word sequence corresponding to that IV word are provided into a loss function, and wherein a backpropagation algorithm updates weights of the A2W RNN in order to minimize this loss function and to provide the A2W word embeddings listing.

12. The method of claim 7, as implemented in a cloud service.

13. A method for Automatic Speech Recognition (ASR), the method comprising: initially training an overall subnetwork comprising an Acoustic-to-Word Recurrent Neural Network (A2W RNN), the A2W RNN receiving In-Vocabulary (IV) words for the initial training, the initial training using IV words resulting in a listing of Acoustic-to-Word (A2W) Word Embeddings stored in a memory of an ASR system performing the ASR processing; receiving an Out-of-Vocabulary (OOV) word as a character sequence into an Acoustic Word Embedding Recurrent Neural Network (AWE RNN), as a mechanism to receive a character sequence for a new OOV word for the ASR system, the AWE RNN providing an Acoustic Word Embedding (AWE) vector as an output thereof; providing the AWE vector output from the AWE RNN as an input into an Acoustic Word Embedding-to-Acoustic-to-Word Neural Network (AWE→A2W NN) trained to provide an OOV word weight value from the AWE vector; and inserting the OOV word weight into a listing of Acoustic-to-Word (A2W) word embeddings used by the ASR system to output recognized words from an input of speech acoustic features, wherein the OOV word weight is inserted into the A2W word embeddings list relative to existing weights in the A2W word embeddings list.

14. The method of claim 13, wherein the AWE RNN is initially trained as an overall subnetwork using character sequences of In-Vocabulary (IV) words, wherein the initial training further involves an Acoustic Embedding Recurrent Neural Network (AE RNN) that receives an acoustic sequence correspondingly to each character sequence of an IV word used during training, wherein outputs of the AWE RNN and AE RNN are passed into a contrastive loss function, an
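
The claims say that the outputs of the AWE RNN and the AE RNN are passed into a contrastive loss function but do not give its form in this preview. As a rough illustration only, the sketch below uses one common choice, a margin-based hinge over cosine distances; the margin value, the function names, and the toy vectors are all assumptions rather than details from the patent.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; 0 when the two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def contrastive_loss(acoustic: Sequence[float],
                     matching_text: Sequence[float],
                     mismatched_text: Sequence[float],
                     margin: float = 0.5) -> float:
    """Assumed margin-based form: the acoustic embedding (AE RNN output)
    should be closer to the embedding of its own word's characters
    (AWE RNN output) than to that of a different word, by at least `margin`."""
    pos = cosine_distance(acoustic, matching_text)
    neg = cosine_distance(acoustic, mismatched_text)
    return max(0.0, margin + pos - neg)


# Toy embeddings: the acoustic vector agrees with the matching word.
well_separated = contrastive_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
# Same vectors with the roles of the two words swapped.
confused = contrastive_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

With these toy vectors the loss is zero when the acoustic embedding already sits on its own word, and positive when it sits on the wrong one. Backpropagation would then adjust the network weights to reduce that positive value.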

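The recognition step in claims 4 and 10 (compare the A2W RNN output to each entry of the A2W word embeddings listing with a dot product and output the word with the highest result) can be sketched as follows. The function names, the embedding values, and the stand-in vector for the A2W RNN output are hypothetical and chosen only to make the example runnable.

```python
from typing import Dict, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def recognize(a2w_rnn_output: Sequence[float],
              a2w_embeddings: Dict[str, Sequence[float]]) -> str:
    """Return the word whose embedding has the highest dot product with
    the A2W RNN output."""
    return max(a2w_embeddings,
               key=lambda word: dot(a2w_rnn_output, a2w_embeddings[word]))


# Made-up listing: two in-vocabulary words plus one inserted OOV entry.
a2w_embeddings = {
    "hello": [0.9, 0.1, 0.0],
    "world": [0.0, 0.8, 0.2],
    "zyzzyva": [0.1, 0.1, 0.9],
}

# Made-up A2W RNN output for one spoken word.
rnn_output = [0.0, 0.2, 1.0]
best = recognize(rnn_output, a2w_embeddings)
print(best)  # prints "zyzzyva" (scores: hello 0.02, world 0.36, zyzzyva 0.92)
```

Because an inserted OOV entry lives in the same listing as the in-vocabulary entries, it competes in the same comparison without any change to the scoring code.
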
Assignees

IBM, Toyota Tech Institute At Chicago

Inventors

Classifications

  • Combinations of networks

  • Recurrent networks, e.g. Hopfield networks

  • Supervised learning

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]

  • Backpropagation, e.g. using gradient descent

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
IBM, Toyota Tech Institute At Chicago
What technology area does this patent fall under?
Primary CPC classification G10L15/16. Mapped technology areas include Physics.
When was this patent published?
Publication date: Nov 17, 2020 (kind code B2). Legal status and post-grant events are not shown on this page.