What technology area does this patent fall under?

Primary CPC classification G10L15/16. Mapped technology areas include Physics.

When was this patent published?

Publication date: Sep 5, 2017 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).

User specified keyword spotting using neural network feature extractor

US9754584B2 · US · B2

Patent metadata
Field	Value
Publication number	US-9754584-B2
Application number	US-201615345982-A
Country	US
Kind code	B2
Filing date	Nov 8, 2016
Priority date	Dec 22, 2014
Publication date	Sep 5, 2017
Grant date	Sep 5, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for recognizing keywords using a long short term memory neural network. One of the methods includes receiving, by a device for each of multiple variable length enrollment audio signals, a respective plurality of enrollment feature vectors that represent features of the respective variable length enrollment audio signal, processing each of the plurality of enrollment feature vectors using a long short term memory (LSTM) neural network to generate a respective enrollment LSTM output vector for each enrollment feature vector, and generating, for the respective variable length enrollment audio signal, a template fixed length representation for use in determining whether another audio signal encodes another spoken utterance of the enrollment phrase by combining at most a quantity k of the enrollment LSTM output vectors for the enrollment audio signal.
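The abstract's step of building a fixed length representation from at most k LSTM output vectors can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the per-frame LSTM output vectors are already available as lists of floats, and it assumes "combining" means concatenation, with the k most recent vectors kept and leading zeros added when fewer than k vectors exist (as claims 9 and 10 describe). The function name and the `k` and `dim` parameters are hypothetical.

```python
from typing import List, Sequence


def fixed_length_representation(
    outputs: Sequence[Sequence[float]], k: int, dim: int
) -> List[float]:
    """Combine at most k output vectors into one vector of length k * dim.

    Assumption: "combining" is concatenation. The source only says the
    vectors are combined, so this is one possible reading.
    """
    if k <= 0 or dim <= 0:
        raise ValueError("k and dim must be positive")

    # Keep only the k most recent output vectors (fewer if the signal is short).
    recent = list(outputs)[-k:]
    for vector in recent:
        if len(vector) != dim:
            raise ValueError("every output vector must have length dim")

    # Concatenate the kept vectors in time order.
    flat = [value for vector in recent for value in vector]

    # Pad with leading zeros so the result always has length k * dim.
    padding = [0.0] * (k * dim - len(flat))
    return padding + flat
```

With `k=2` and `dim=2`, three output vectors `[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]` reduce to the two most recent ones, while a single vector `[[1.0, 2.0]]` is padded at the front with two zeros. Either way the result has the same length, which is what lets a variable length utterance be compared against a stored template.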

First claim

Opening claim text (preview).

What is claimed is:

1. A system comprising a computer and one or more storage devices storing instructions that are operable, when executed by a computer, to cause the computer to perform operations comprising: providing, for a particular audio signal encoding an utterance of a spoken phrase, a plurality of feature vectors that each comprise values that represent features of the particular audio signal as input to a neural network; receiving, from the neural network and for each of the feature vectors, a respective output vector generated using the respective feature vector; generating a fixed length representation for the particular audio signal by combining at most a quantity k of the output vectors; determining whether the spoken phrase and an enrollment phrase are the same using a comparison of the fixed length representation and a template fixed length representation, wherein the computer uses the template fixed length representation to determine whether an audio signal encodes another spoken utterance of the enrollment phrase; and performing an action associated with the enrollment phrase in response to determining that the spoken phrase and the enrollment phrase are the same using the comparison of the fixed length representation and the template fixed length representation.

2. A method comprising: providing, by a user device for a particular audio signal encoding an utterance of a spoken phrase, a plurality of feature vectors that each comprise values that represent features of the particular audio signal as input to a neural network; receiving, from the neural network and for each of the feature vectors, a respective output vector generated using the respective feature vector; generating a fixed length representation for the particular audio signal by combining at most a quantity k of the output vectors; determining whether the spoken phrase and an enrollment phrase are the same using a comparison of the fixed length representation and a template fixed length representation, wherein the user device uses the template fixed length representation to determine whether an audio signal encodes another spoken utterance of the enrollment phrase; and performing, by the user device, an action associated with the enrollment phrase in response to determining that the spoken phrase and the enrollment phrase are the same using the comparison of the fixed length representation and the template fixed length representation.

3. The method of claim 2, wherein performing the action associated with the enrollment phrase comprises waking up the user device.

4. The method of claim 2, wherein determining whether the spoken phrase and the enrollment phrase are the same using the comparison of the fixed length representation and the template fixed length representation comprises determining whether the spoken phrase and the enrollment phrase are the same using the comparison of the fixed length representation and multiple template fixed length representations including the template fixed length representation, wherein the user device uses each of the multiple template fixed length representations to determine whether an audio signal encodes another spoken utterance of the enrollment phrase.

5. The method of claim 4, wherein determining whether the spoken phrase and the enrollment phrase are the same using the comparison of the fixed length representation and the multiple template fixed length representations comprises determining whether the spoken phrase and the enrollment phrase are the same using a comparison of the fixed length representation and an average template fixed length representation created by averaging the values in each of the template fixed length representations to determine a corresponding value in the average template fixed length representation.

6. The method of claim 2, wherein determining whether the spoken phrase and the enrollment phrase are the same using the comparison of the fixed length representation and the template fixed length representation comprises determining a confidence score that represents a distance between the fixed length representation and the template fixed length representation.

7. The method of claim 6, wherein determining the confidence score that represents the distance between the fixed length representation and the template fixed length representation comprises determining the distance between the fixed length representation and the template fixed length representation using a cosine distance function.

8. The method of claim 6, comprising determining that the confidence score satisfies a threshold value, wherein determining whether the spoken phrase and the enrollment phrase are the same using the comparison of the fixed length representation and the template fixed length representation comprises determining that the spoken phrase and the enrollment phrase are the same in response to determining that the confidence score satisfies the threshold value.

9. The method of claim 2, comprising: determining whether at least the quantity k of feature vectors were generated for the particular audio signal; and in response to determining that less than the quantity k of feature vectors were generated for the particular audio signal, adding leading zeros to a front of the fixed length representation so that the fixed length representation has a predetermined length that is the same as a length of the template fixed length representation.

10. The method of claim 2, comprising: determining that more than the quantity k of output vectors were generated for the particular audio signal, wherein generating the fixed length representation for the particular audio signal comprises combining the quantity k most recent output vectors in response to determining that more than the quantity k of output vectors were generated for the particular audio signal.

11. The method of claim 2, wherein determining whether the spoken phrase and the enrollment phrase are the same using the comparison of the fixed length representation and the template fixed length representation comprises determining whether the spoken phrase and the enrollment phrase are the same using the comparison of the fixed length representation that has a predetermined length and the template fixed length representation that has the predetermined length.

12. A non-transitory computer-readable medium storing software comprising instructions executable by a computer which, upon such execution, cause the computer to perform operations comprising: providing, for a particular audio signal encoding an utterance of a spoken phrase, a plurality of feature vectors that each comprise values that represent features of the particular audio signal as input to a neural network; receiving, from the neural network and for each of the feature vectors, a respective output vector generated using the respective feature vector; generating a fixed length representation for the particular audio signal by combining at most a quantity k of the output vectors; determining whether the spoken phrase and an enrollment phrase are the same using a comparison of the fixed length representation and a template fixed length representation, wherein the computer uses the template fixed length representation to determine whether an audio signal encodes another spoken utterance of the enrollment phrase; and performing an action associated with the enrollment phrase in response to determining that the spoken phrase and the enrollment phrase are the same using the comparison of the fixed length representation and the template fixed length representation.

13. The computer-readable medium of claim 12, wherei
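The comparison described in claims 5 through 8 (averaging several templates, scoring with a cosine distance, and checking a threshold) might look roughly like the sketch below. It is an illustration under stated assumptions rather than the claimed method itself: the function names are hypothetical, the default threshold of 0.2 is an arbitrary placeholder, and "satisfies a threshold value" is read here as "distance is at most the threshold", which the claims do not spell out.

```python
import math
from typing import List, Sequence


def average_template(templates: Sequence[Sequence[float]]) -> List[float]:
    """Average several same-length templates value by value (claim 5)."""
    if not templates:
        raise ValueError("at least one template is required")
    length = len(templates[0])
    if any(len(t) != length for t in templates):
        raise ValueError("all templates must have the same length")
    return [sum(t[i] for t in templates) / len(templates) for i in range(length)]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two same-length vectors (claim 7)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: an all-zero vector is treated as orthogonal (distance 1).
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def matches_enrollment(
    representation: Sequence[float],
    templates: Sequence[Sequence[float]],
    threshold: float = 0.2,  # hypothetical value, not taken from the patent
) -> bool:
    """Decide whether the utterance matches the enrollment phrase (claim 8)."""
    distance = cosine_distance(representation, average_template(templates))
    # Assumption: the score "satisfies" the threshold when distance <= threshold.
    return distance <= threshold
```

A device could call `matches_enrollment` on each new fixed length representation and, when it returns `True`, perform the action tied to the enrollment phrase, such as waking up (claim 3). Averaging the templates first keeps the per-utterance cost at a single distance computation, whereas comparing against every template separately would also be consistent with claim 4.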

Assignees

Google Inc

Inventors

Classifications

G10L2015/0631
Creating reference templates; Clustering · CPC title
G10L15/28
Constructional details of speech recognition systems · CPC title
G06F1/3203
Power management, i.e. event-based initiation of a power-saving mode · CPC title
G10L15/16 (Primary)
using artificial neural networks · CPC title
G10L15/02
Feature extraction for speech recognition; Selection of recognition unit · CPC title

Patent family

Related publications grouped by family.

View patent family 56130169

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9754584B2 cover?: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for recognizing keywords using a long short term memory neural network. One of the methods includes receiving, by a device for each of multiple variable length enrollment audio signals, a respective plurality of enrollment feature vectors that represent features of the respective variable length enro…
Who is the assignee on this patent?: Google Inc