What technology area does this patent fall under?

Primary CPC classification G10L17/14. Mapped technology areas include Physics.

When was this patent published?

Publication date: Nov 14, 2017 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Context-dependent modeling of phonemes

US9818409B2 · US · B2

Patent metadata
Field	Value
Publication number	US-9818409-B2
Application number	US-201514877673-A
Country	US
Kind code	B2
Filing date	Oct 7, 2015
Priority date	Jun 19, 2015
Publication date	Nov 14, 2017
Grant date	Nov 14, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on computer storage media for modeling phonemes. One method includes receiving an acoustic sequence, the acoustic sequence representing an utterance, and the acoustic sequence comprising a respective acoustic feature representation at each of a plurality of time steps; for each of the plurality of time steps: processing the acoustic feature representation through each of one or more recurrent neural network layers to generate a recurrent output; processing the recurrent output using a softmax output layer to generate a set of scores, the set of scores comprising a respective score for each of a plurality of context dependent vocabulary phonemes, the score for each context dependent vocabulary phoneme representing a likelihood that the context dependent vocabulary phoneme represents the utterance at the time step; and determining, from the scores for the plurality of time steps, a context dependent phoneme representation of the sequence.
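The per-time-step scoring described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes a single plain tanh recurrent layer (rather than the LSTM layers named in the dependent claims), random untrained weights, and hypothetical names and dimensions (`score_sequence`, `FEATURE_DIM`, `HIDDEN_DIM`, `NUM_UNITS`) that do not appear in the patent.

```python
import math
import random


def softmax(logits):
    """Turn raw output-layer activations into scores that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def score_sequence(features, w_in, w_rec, w_out):
    """Return one score vector per time step.

    Each frame passes through a single recurrent layer (assumption:
    plain tanh recurrence) and then a softmax output layer with one
    entry per context dependent phoneme unit.
    """
    hidden = [0.0] * len(w_rec)
    scores = []
    for frame in features:
        pre = [a + b for a, b in zip(matvec(w_in, frame), matvec(w_rec, hidden))]
        hidden = [math.tanh(p) for p in pre]          # recurrent output
        scores.append(softmax(matvec(w_out, hidden)))  # softmax scores
    return scores


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for trained weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Illustrative sizes only; real acoustic models are far larger.
FEATURE_DIM, HIDDEN_DIM, NUM_UNITS = 3, 4, 5
rng = random.Random(0)
w_in = random_matrix(HIDDEN_DIM, FEATURE_DIM, rng)
w_rec = random_matrix(HIDDEN_DIM, HIDDEN_DIM, rng)
w_out = random_matrix(NUM_UNITS, HIDDEN_DIM, rng)

# Six time steps of made-up acoustic features.
features = [[rng.uniform(-1, 1) for _ in range(FEATURE_DIM)] for _ in range(6)]
scores = score_sequence(features, w_in, w_rec, w_out)
```

Each row of `scores` corresponds to one time step and holds one score per context dependent phoneme unit, which is the input to the final "determining" step in the abstract.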

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: generating, by an automated speech recognition system that includes an acoustic modeling system and a language modeling system, a plurality of context dependent vocabulary phonemes, comprising: generating a set of vocabulary phoneme classes using training data, dividing each vocabulary phoneme class into one or more subclasses using phonetic questions, and clustering similar contexts using a state-tying algorithm to generate the plurality of context dependent vocabulary phonemes; receiving, by the acoustic modeling system of the automated speech recognition system, an acoustic sequence, the acoustic sequence representing an utterance, and the acoustic sequence comprising a respective acoustic feature representation at each of a plurality of time steps; for each of the plurality of time steps: processing, by the acoustic modeling system of the automated speech recognition system, the acoustic feature representation for the time step through each of one or more recurrent neural network layers to generate a recurrent output for the time step; processing, by the acoustic modeling system of the automated speech recognition system, the recurrent output for the time step using a softmax output layer to generate a set of scores for the time step, the set of scores for the time step comprising a respective score for each of the plurality of context dependent vocabulary phonemes, the score for each context dependent vocabulary phoneme representing a likelihood that the context dependent vocabulary phoneme represents the utterance at the time step; determining, by the acoustic modeling system of the automated speech recognition system and from the scores for the plurality of time steps, a context dependent phoneme representation of the acoustic sequence; and processing the context dependent phoneme representation of the acoustic sequence that was determined by the acoustic modeling system of the automated speech recognition system, using the language modeling system of the automated speech recognition system, to generate a speech recognition result for the acoustic sequence.

2. The method of claim 1, wherein the set of scores for the time step further comprises a respective score for a blank character phoneme, the score for the blank character phoneme representing a likelihood that the utterance at the time step is incomplete.

3. The method of claim 1, wherein the softmax output layer is a Connectionist Temporal Classification (CTC) output layer.

4. The method of claim 1, wherein the recurrent neural network layers and the CTC output layer are trained using the training data.

5. The method of claim 1, wherein the cardinality of the set of context dependent vocabulary phonemes is higher than the cardinality of the set of vocabulary phoneme classes.

6. The method of claim 1, wherein the phonetic questions are maximum-likelihood-gain phonetic questions.

7. The method of claim 1, wherein the recurrent neural network layers are Long Short-Term Memory (LSTM) neural network layers.

8. A system comprising one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising: generating, by an automated speech recognition system that includes an acoustic modeling system and a language modeling system, a plurality of context dependent vocabulary phonemes, comprising: generating a set of vocabulary phoneme classes using training data, dividing each vocabulary phoneme class into one or more subclasses using phonetic questions, and clustering similar contexts using a state-tying algorithm to generate the plurality of context dependent vocabulary phonemes; receiving, by the acoustic modeling system of the automated speech recognition system, an acoustic sequence, the acoustic sequence representing an utterance, and the acoustic sequence comprising a respective acoustic feature representation at each of a plurality of time steps; for each of the plurality of time steps: processing, by the acoustic modeling system of the automated speech recognition system, the acoustic feature representation for the time step through each of one or more recurrent neural network layers to generate a recurrent output for the time step; processing, by the acoustic modeling system of the automated speech recognition system, the recurrent output for the time step using a softmax output layer to generate a set of scores for the time step, the set of scores for the time step comprising a respective score for each of the plurality of context dependent vocabulary phonemes, the score for each context dependent vocabulary phoneme representing a likelihood that the context dependent vocabulary phoneme represents the utterance at the time step; determining, by the acoustic modeling system of the automated speech recognition system and from the scores for the plurality of time steps, a context dependent phoneme representation of the acoustic sequence; and processing the context dependent phoneme representation of the acoustic sequence that was determined by the acoustic modeling system of the automated speech recognition system, using the language modeling system of the automated speech recognition system, to generate a speech recognition result for the acoustic sequence.

9. The system of claim 8, wherein the set of scores for the time step further comprises a respective score for a blank character phoneme, the score for the blank character phoneme representing a likelihood that the utterance at the time step is incomplete.

10. The system of claim 8, wherein the softmax output layer is a Connectionist Temporal Classification (CTC) output layer.

11. The system of claim 8, wherein the recurrent neural network layers and the CTC output layer are trained using the training data.

12. The system of claim 8, wherein the cardinality of the set of context dependent vocabulary phonemes is higher than the cardinality of the set of vocabulary phoneme classes.

13. The system of claim 8, wherein the phonetic questions are maximum-likelihood-gain phonetic questions.

14. The system of claim 8, wherein the recurrent neural network layers are Long Short-Term Memory (LSTM) neural network layers.

15. A non-transitory computer-readable storage medium comprising instructions stored thereon that are executable by a processing device and upon such execution cause the processing device to perform operations comprising: generating, by an automated speech recognition system that includes an acoustic modeling system and a language modeling system, a plurality of context dependent vocabulary phonemes, comprising: generating a set of vocabulary phoneme classes using training data, dividing each vocabulary phoneme class into one or more subclasses using phonetic questions, and clustering similar contexts using a state-tying algorithm to generate the plurality of context dependent vocabulary phonemes; receiving, by the acoustic modeling system of the automated speech recognition system, an acoustic sequence, the acoustic sequence representing an utterance, and the acoustic sequence comprising a respective acoustic feature representation at each of a plurality of time steps; for each of the plurality of time steps: processing, by the acoustic modeling system of the automated speech recognition system, the acoustic feature representation for the time step through each of one or more recurrent neural network layers to generate a recurrent output for the time step; processing, by the acousti
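To illustrate how per-time-step scores that include a blank entry (claims 2 and 3) might be turned into a phoneme sequence, here is a minimal sketch of greedy best-path decoding as commonly used with CTC output layers. The claim text on this page does not state which decoding procedure is used, so treating the "determining" step as greedy decoding is an assumption. The label names (triphone-style strings such as `k-ae+t`), the `BLANK` symbol, and the score values are all made up for illustration.

```python
BLANK = "<blank>"  # hypothetical symbol for the blank character phoneme


def greedy_ctc_decode(scores, labels):
    """Greedy best-path decoding (assumed, not taken from the patent).

    1. Pick the highest-scoring label at each time step.
    2. Collapse consecutive repeats of the same label.
    3. Drop blank labels.
    """
    best = [labels[max(range(len(row)), key=row.__getitem__)] for row in scores]
    decoded = []
    previous = None
    for label in best:
        if label != previous and label != BLANK:
            decoded.append(label)
        previous = label
    return decoded


# Made-up label inventory: a blank plus three context dependent units.
labels = [BLANK, "sil-k+ae", "k-ae+t", "ae-t+sil"]

# Made-up softmax scores for six time steps (each row sums to 1).
scores = [
    [0.10, 0.70, 0.10, 0.10],
    [0.10, 0.60, 0.20, 0.10],
    [0.80, 0.10, 0.05, 0.05],
    [0.10, 0.10, 0.70, 0.10],
    [0.60, 0.10, 0.20, 0.10],
    [0.10, 0.10, 0.10, 0.70],
]

result = greedy_ctc_decode(scores, labels)
print(result)  # ['sil-k+ae', 'k-ae+t', 'ae-t+sil']
```

The first two frames both select `sil-k+ae` and collapse into one unit, and the two blank frames are discarded, leaving one context dependent unit per phoneme.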

Assignees

Google Inc

Inventors

Classifications

G06N3/044
Recurrent networks, e.g. Hopfield networks · CPC title
G06N3/0442
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
G06N3/09
Supervised learning · CPC title
G10L17/14 · Primary
Use of phonemic categorisation or speech recognition prior to speaker recognition or verification · CPC title
G10L15/02 · Primary
Feature extraction for speech recognition; Selection of recognition unit · CPC title

Patent family

Related publications grouped by family.

View patent family 57588335

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9818409B2 cover?: Methods, systems, and apparatus, including computer programs encoded on computer storage media for modeling phonemes. One method includes receiving an acoustic sequence, the acoustic sequence representing an utterance, and the acoustic sequence comprising a respective acoustic feature representation at each of a plurality of time steps; for each of the plurality of time steps: processing the ac…
Who is the assignee on this patent?: Google Inc