What technology area does this patent fall under?

Primary CPC classification G10L15/16. Mapped technology areas include Physics.

When was this patent published?

Publication date: Aug 1, 2017 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Method and apparatus for speech recognition using neural networks with speaker adaptation

US9721561B2 · US · B2

Patent metadata
Field	Value
Publication number	US-9721561-B2
Application number	US-201314098259-A
Country	US
Kind code	B2
Filing date	Dec 5, 2013
Priority date	Dec 5, 2013
Publication date	Aug 1, 2017
Grant date	Aug 1, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection; read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In a speech recognition system, deep neural networks (DNNs) are employed in phoneme recognition. While DNNs typically provide better phoneme recognition performance than other techniques, such as Gaussian mixture models (GMM), adapting a DNN to a particular speaker is a real challenge. According to at least one example embodiment, speech data and corresponding speaker data are both applied as input to a DNN. In response, the DNN generates a prediction of a phoneme based on the input speech data and the corresponding speaker data. The speaker data may be generated from the corresponding speech data.
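The data flow described in the abstract can be illustrated with a minimal sketch. It follows the wording of claim 1 on this page, where the speech data is multiplied by a first weight matrix and the speaker data by a second one. Everything else is an assumption made for illustration: the layer sizes, the random weights, the tanh activation, the softmax output, and all function and variable names are hypothetical and are not taken from the patent.

```python
import math
import random


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(z):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def predict_phoneme(speech, speaker, W_speech, W_speaker, W_out):
    """Return one score per phoneme for a single frame of speech."""
    from_speech = matvec(W_speech, speech)      # first weight matrix
    from_speaker = matvec(W_speaker, speaker)   # second weight matrix
    # Assumed activation: tanh of the summed contributions.
    hidden = [math.tanh(a + b) for a, b in zip(from_speech, from_speaker)]
    return softmax(matvec(W_out, hidden))


# Hypothetical sizes, chosen only to keep the example small.
N_SPEECH, N_SPEAKER, N_HIDDEN, N_PHONEMES = 6, 3, 5, 4
rng = random.Random(0)
W_speech = random_matrix(N_HIDDEN, N_SPEECH, rng)
W_speaker = random_matrix(N_HIDDEN, N_SPEAKER, rng)
W_out = random_matrix(N_PHONEMES, N_HIDDEN, rng)

speech = [rng.uniform(-1, 1) for _ in range(N_SPEECH)]
speaker = [rng.uniform(-1, 1) for _ in range(N_SPEAKER)]

scores = predict_phoneme(speech, speaker, W_speech, W_speaker, W_out)
best = max(range(N_PHONEMES), key=lambda i: scores[i])
print("phoneme scores:", [round(s, 3) for s in scores])
print("predicted phoneme index:", best)
```

Keeping two separate matrices is arithmetically the same as one matrix applied to the concatenated speech and speaker vectors; the split simply makes the speaker-dependent contribution explicit.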

First claim

Opening claim text (preview).

What is claimed is:
1. A computer implemented method for speech recognition, the method implemented by one or more processors and comprising: receiving, by a deep neural network at a hidden layer, input speech data at a first set of nodes of the hidden layer of the deep neural network and corresponding speaker data at a second set of nodes of the hidden layer of the deep neural network, the second set of nodes serving as an extra input to the deep neural network; and generating, by the deep neural network, a prediction of a phoneme corresponding to the input speech data based on the corresponding speaker data, wherein the generating comprises multiplying the input speech data received at the first set of nodes with a first matrix of weighting coefficients and multiplying the speaker data received at the second set of nodes with a second matrix of weighting coefficients, multiplying the speaker data with the second matrix of weighting coefficients removing speaker variability from the input speech data.
2. The method as recited in claim 1, wherein the input speech data is part of training data and said receiving and generating are repeated for input speech data and corresponding speaker data associated with multiple speakers, the method further comprising: iteratively updating weighting coefficients of the deep neural network based on the prediction of the phoneme generated and information in the training data.
3. The method as recited in claim 1, wherein the input speech data is deployment speech data collected by a speech recognition system.
4. The method as recited in claim 1, wherein the speaker data is generated from the corresponding input speech data using maximum likelihood linear regression (MLLR).
5. The method as recited in claim 1, wherein the speaker data is generated from the corresponding input speech data using constrained maximum likelihood linear regression (CMLLR).
6. The method as recited in claim 1, wherein the prediction of the phoneme generated includes a probability score.
7. The method as recited in claim 1, wherein the prediction of the phoneme generated includes an indication of a phoneme.
8. The method as recited in claim 1 further comprising training a Gaussian mixture models' module using the prediction of the phoneme generated by the deep neural network.
9. The method as recited in claim 8, wherein the Gaussian mixture models' module is adapted using the speaker data.
10. The method as recited in claim 1 further comprising reducing dimensionality of the speaker data using principal component analysis (PCA) prior to reception by the deep neural network.
11. An apparatus for speech recognition comprising: at least one processor; and at least one memory with computer code instructions stored thereon, the at least one processor and the at least one memory with computer code instructions being configured to cause the apparatus to: receive, by a deep neural network at a hidden layer, input speech data at a first set of nodes of the hidden layer of the deep neural network and corresponding speaker data at a second set of nodes of the hidden layer of the deep neural network, the second set of nodes serving as an extra input to the deep neural network; and generate, at an output layer of the deep neural network, a prediction of a phoneme corresponding to the input speech data based on the corresponding speaker data, wherein the generating comprises multiplying the input speech data received at the first set of nodes with a first matrix of weighting coefficients and multiplying the speaker data received at the second set of nodes with a second matrix of weighting coefficients, multiplying the speaker data with the second matrix of weighting coefficients removing speaker variability from the input speech data.
12. The apparatus as recited in claim 11, wherein the input speech data is part of training data and wherein the at least one processor and the at least one memory with computer code instructions are further configured to cause the apparatus to: repeat receiving input speech data and corresponding speaker data and generating the prediction of the phoneme for multiple speakers; and iteratively update weighting coefficients of the deep neural network based on the prediction of the phoneme generated and information in the training data.
13. The apparatus as recited in claim 11, wherein the input speech data is deployment speech data.
14. The apparatus as recited in claim 11, wherein the at least one processor and the at least one memory with computer code instructions are further configured to cause the apparatus to generate the speaker data from the corresponding input speech data using maximum likelihood linear regression (MLLR).
15. The apparatus as recited in claim 11, wherein the at least one processor and the at least one memory with computer code instructions are further configured to cause the apparatus to generate the speaker data from the corresponding input speech data using constrained maximum likelihood linear regression (CMLLR).
16. The apparatus as recited in claim 11, wherein the prediction of the phoneme generated includes a probability score or an indication of a phoneme.
17. The apparatus as recited in claim 11, wherein the at least one processor and the at least one memory with computer code instructions are further configured to cause the apparatus to train a Gaussian mixture models' module using the prediction of the phoneme generated.
18. The apparatus as recited in claim 17, wherein the Gaussian mixture models' module is adapted using the speaker data.
19. The apparatus as recited in claim 17, wherein the at least one processor and the at least one memory with computer code instructions are further configured to cause the apparatus to reduce the dimensionality of the speaker data using principal component analysis (PCA) prior to reception by the deep neural network.
20. A non-transitory computer-readable medium with computer code instructions stored thereon, the computer code instructions being configured, when executed by a processor, to cause an apparatus to: receive, by a deep neural network at a hidden layer, input speech data at a first set of nodes of the hidden layer of the deep neural network and corresponding speaker data at a second set of nodes of the hidden layer of the deep neural network, the second set of nodes serving as an extra input to the deep neural network; and generate, by the deep neural network, a prediction of a phoneme corresponding to the input speech data based on the corresponding speaker data, wherein the generating comprises multiplying the input speech data received at the first set of nodes with a first matrix of weighting coefficients and multiplying the speaker data received at the second set of nodes with a second matrix of weighting coefficients, multiplying the speaker data with the second matrix of weighting coefficients removing speaker variability from the input speech data.
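Claims 10 and 19 recite reducing the dimensionality of the speaker data with principal component analysis (PCA) before it reaches the network. The sketch below shows one way such a reduction might look. The claims name PCA only; the power-iteration solver, the synthetic 5-dimensional speaker vectors, the choice of 2 output dimensions, and all function names are assumptions picked to keep the example dependency-free.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def covariance(rows):
    """Return (mean, sample covariance matrix) for equal-length vectors."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return mean, cov


def top_eigenpair(matrix, rng, iters=300):
    """Power iteration: dominant eigenvector/eigenvalue of a symmetric matrix."""
    v = [rng.uniform(-1, 1) for _ in matrix]
    norm = math.sqrt(dot(v, v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [dot(row, v) for row in matrix]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v, dot(v, [dot(row, v) for row in matrix])


def fit_pca(rows, k, seed=0):
    """Return the data mean and the k leading principal directions."""
    rng = random.Random(seed)
    mean, cov = covariance(rows)
    d = len(mean)
    components = []
    for _ in range(k):
        v, lam = top_eigenpair(cov, rng)
        components.append(v)
        # Deflation: remove the direction just found before the next pass.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return mean, components


def reduce_speaker_vector(vec, mean, components):
    """Project one speaker vector onto the principal directions."""
    centered = [x - m for x, m in zip(vec, mean)]
    return [dot(centered, c) for c in components]


# Synthetic stand-in for speaker data: 40 vectors of 5 values each.
rng = random.Random(1)
scales = [5.0, 2.0, 0.5, 0.2, 0.1]
speaker_vectors = [[rng.gauss(0.0, s) for s in scales] for _ in range(40)]

mean, components = fit_pca(speaker_vectors, k=2)
reduced = reduce_speaker_vector(speaker_vectors[0], mean, components)
print(len(speaker_vectors[0]), "->", len(reduced))
```

The reduced vector would then take the place of the full speaker vector as the extra input to the network, so the second weight matrix needs fewer columns.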

Assignees

Nuance Communications Inc

Inventors

Classifications

Code	CPC title
G10L15/16 (primary)	using artificial neural networks
G10L15/063	Training
G10L15/07	to the speaker
G10L2015/025	Phonemes, fenemes or fenones being the recognition units
G10L15/02	Feature extraction for speech recognition; Selection of recognition unit
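The classification codes above share a fixed layout: a section letter, a two-digit class, a subclass letter, then a main group and a subgroup separated by a slash. A minimal sketch of splitting a code into those parts follows. The regular expression, the `CpcSymbol` type, and the `parse_cpc` function are hypothetical helpers, and the section lookup contains only the one entry this page confirms (G maps to Physics).

```python
import re
from typing import NamedTuple

# Only the section title confirmed on this page; other sections are omitted.
SECTION_TITLES = {"G": "Physics"}

# Assumed layout: section letter, 2-digit class, subclass letter,
# main group, "/", subgroup.
CPC_PATTERN = re.compile(r"^([A-HY])(\d{2})([A-Z])(\d{1,4})/(\d{2,6})$")


class CpcSymbol(NamedTuple):
    section: str
    cpc_class: str
    subclass: str
    main_group: str
    subgroup: str


def parse_cpc(symbol: str) -> CpcSymbol:
    """Split a CPC symbol such as 'G10L15/16' into its hierarchy levels."""
    match = CPC_PATTERN.match(symbol.strip())
    if match is None:
        raise ValueError(f"not a recognizable CPC symbol: {symbol!r}")
    section, cls, sub, main_group, subgroup = match.groups()
    return CpcSymbol(section, section + cls, section + cls + sub,
                     main_group, subgroup)


for code in ["G10L15/16", "G10L15/063", "G10L15/07", "G10L2015/025", "G10L15/02"]:
    parsed = parse_cpc(code)
    title = SECTION_TITLES.get(parsed.section, "unknown section")
    print(code, "->", parsed.subclass, parsed.main_group, parsed.subgroup, title)
```

All five codes on this page share the subclass G10L, which is why they group under the same technology area.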

Patent family

Related publications grouped by family.

View patent family 52282884

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9721561B2 cover?: In a speech recognition system, deep neural networks (DNNs) are employed in phoneme recognition. While DNNs typically provide better phoneme recognition performance than other techniques, such as Gaussian mixture models (GMM), adapting a DNN to a particular speaker is a real challenge. According to at least one example embodiment, speech data and corresponding speaker data are both applied as i…
Who is the assignee on this patent?: Nuance Communications Inc

Related patents

Two-step quantization and coding method and apparatus

Adaptive segmentation

Mobile speech recognition hardware accelerator

Online maximum-likelihood mean and variance normalization for speech recognition
