What technology area does this patent fall under?

The primary CPC classification is G10L15/16 (speech recognition using artificial neural networks). Mapped technology areas include Physics.

When was this patent published?

Publication date: Mar 12, 2019 (kind code B1). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Training acoustic models using connectionist temporal classification

US10229672B1 · US · B1

Patent metadata
Field	Value
Publication number	US-10229672-B1
Application number	US-201715397327-A
Country	US
Kind code	B1
Filing date	Jan 3, 2017
Priority date	Dec 31, 2015
Publication date	Mar 12, 2019
Grant date	Mar 12, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training acoustic models and using the trained acoustic models. A connectionist temporal classification (CTC) acoustic model is accessed, the CTC acoustic model having been trained using a context-dependent state inventory generated from approximate phonetic alignments determined by another CTC acoustic model trained without fixed alignment targets. Audio data for a portion of an utterance is received. Input data corresponding to the received audio data is provided to the accessed CTC acoustic model. Data indicating a transcription for the utterance is generated based on output that the accessed CTC acoustic model produced in response to the input data. The data indicating the transcription is provided as output of an automated speech recognition service.
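
The abstract refers to a context-dependent state inventory but this page does not say how that inventory is built. As a rough illustration of what "context-dependent" means here, the sketch below expands context-independent phone sequences into triphone labels (the unit named in claim 6 and in CPC G10L2015/022) and collects the distinct labels. The `left-center+right` label notation, the `sil` boundary symbol, and the function names `to_triphones` and `build_inventory` are assumptions made for this example, not details taken from the patent.

```python
from collections import Counter


def to_triphones(phones, boundary="sil"):
    """Expand context-independent phones into 'left-center+right' labels.

    The label notation and the 'sil' boundary symbol are illustrative
    assumptions, not taken from the patent text.
    """
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


def build_inventory(utterances, min_count=1):
    """Collect the distinct triphone labels seen at least min_count times."""
    counts = Counter(
        label for phones in utterances for label in to_triphones(phones)
    )
    return sorted(label for label, n in counts.items() if n >= min_count)


# Hypothetical phone sequences standing in for approximate alignments.
utterances = [["k", "ae", "t"], ["b", "ae", "t"]]
print(to_triphones(utterances[0]))
print(build_inventory(utterances))
```

A production system would typically cluster rare contexts together rather than keep every observed triphone; the `min_count` threshold above is only a crude stand-in for that step.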

First claim

Opening claim text (preview).

What is claimed is:
1. A method performed by one or more computers of a speech recognition system, the method comprising: training, by the one or more computers of the speech recognition system, a first connectionist temporal classification (CTC) acoustic model on first training data to generate, as unmodified outputs, second training data of context-dependent state inventory from approximate phonetic alignments, the first training data comprising context-independent phones generated without using any previously determined phonetic alignments; training, by the one or more computers of the speech recognition system, a second CTC acoustic model on the second training data to generate outputs corresponding to one or more context-dependent states; accessing, by the one or more computers of the speech recognition system, the second CTC acoustic model; receiving, by the one or more computers of the speech recognition system, audio data for a portion of an utterance; providing, by the one or more computers of the speech recognition system, input data corresponding to the received audio data as input to the accessed second CTC acoustic model that has been trained on the second training data; generating, by the one or more computers of the speech recognition system, data indicating a transcription for the utterance based on output that the accessed second CTC acoustic model produced in response to the input data corresponding to the received audio data; and providing, by the one or more computers of the speech recognition system, the data indicating the transcription as output of the automated speech recognition system.
2. The method of claim 1, wherein receiving the audio data comprises receiving the audio data from a client device over a computer network, and wherein providing the data indicating the transcription comprises providing the data indicating the transcription to the client device over the computer network.
3. The method of claim 1, wherein providing the data indicating the transcription comprises live streaming speech recognition results such that the data indicating the transcription is provided while the one or more computers concurrently receive audio data for an additional portion of the utterance.
4. The method of claim 1, wherein the accessed second CTC acoustic model is a unidirectional CTC acoustic model and the first CTC acoustic model is a bidirectional CTC acoustic model.
5. The method of claim 1, wherein the accessed second CTC acoustic model comprises a recurrent neural network trained using CTC techniques comprising one or more long short-term memory layers, and wherein the first CTC acoustic model comprises a recurrent neural network trained using CTC techniques comprising one or more long short-term memory layers.
6. The method of claim 1, wherein the second CTC acoustic model is configured to provide outputs that identify labels for triphones or to provide outputs that include scores for labels for triphones.
7. The method of claim 1, wherein the second CTC acoustic model is trained independent of non-CTC acoustic models.
8. The method of claim 1, wherein the second CTC acoustic model is trained to recognize multiple different pronunciations of a word using a pronunciation model that indicates multiple different phonetic sequences as valid pronunciations of the word.
9. The method of claim 1, wherein the second CTC acoustic model is trained to recognize multiple different verbalizations of a written word using a verbalization model that indicates multiple different spoken words as valid verbalizations of the written word.
10. The method of claim 1, wherein the accessed second CTC model is a recurrent neural network model that has been trained using a CTC loss function, and is configured to selectively indicate that a blank output label has a higher likelihood score than phoneme output labels in response to input data to the accessed second CTC model.
11. The method of claim 1, wherein the operations further comprise generating, as the output that the accessed second CTC acoustic model produced in response to the input data corresponding to the received audio data, output vectors that each indicate (i) likelihoods corresponding to different phonetic units and (ii) a likelihood corresponding to a blank label that does not represent a phonetic unit.
12. A speech recognition system comprising: one or more computers of the speech recognition system; and a non-transitory computer-readable medium coupled to the one or more computers having instructions stored thereon which, when executed by the one or more computers, cause the one or more computers to perform operations comprising: training a first connectionist temporal classification (CTC) acoustic model on first training data to generate, as unmodified outputs, second training data of context-dependent state inventory from approximate phonetic alignments, the first training data comprising context-independent phones generated without using any previously determined phonetic alignments; training a second CTC acoustic model on the second training data to generate outputs corresponding to one or more context-dependent states; accessing the second CTC acoustic model; receiving audio data for a portion of an utterance; providing input data corresponding to the received audio data as input to the accessed second CTC acoustic model that has been trained on the second training data; generating data indicating a transcription for the utterance based on output that the accessed second CTC acoustic model produced in response to the input data corresponding to the received audio data; and providing the data indicating the transcription as output of the automated speech recognition system.
13. The system of claim 12, wherein receiving the audio data comprises receiving the audio data from a client device over a computer network, and wherein providing the data indicating the transcription comprises providing the data indicating the transcription to the client device over the computer network.
14. The system of claim 12, wherein providing the data indicating the transcription comprises live streaming speech recognition results such that the data indicating the transcription is provided while the one or more computers concurrently receive audio data for an additional portion of the utterance.
15. The system of claim 12, wherein the accessed second CTC acoustic model is a unidirectional CTC acoustic model and the first CTC acoustic model is a bidirectional CTC acoustic model.
16. The system of claim 12, wherein the accessed second CTC acoustic model comprises a recurrent neural network trained using CTC techniques comprising one or more long short-term memory layers, and wherein the first CTC acoustic model comprises a recurrent neural network trained using CTC techniques comprising one or more long short-term memory layers.
17. The system of claim 12, wherein the second CTC acoustic model is configured to provide outputs that identify labels for triphones or to provide outputs that include scores for labels for triphones.
18. The system of claim 12, wherein the second CTC acoustic model is trained to recognize multiple different pronunciations of a word using a pronunciation model that indicates multiple different phonetic sequences as valid pronunciations of the word.
19. The system of claim 12, wherein the second CTC acoustic model is trained to recognize multiple different verbalizations of a written word using a verbalization model that indicates multi
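
Claims 10 and 11 describe per-frame output vectors that score each phonetic unit plus a blank label that represents no phonetic unit. The claims do not specify how those vectors are turned into a label sequence, so the following is only a minimal sketch of greedy (best-path) CTC decoding, a common simplification: take the top-scoring label per frame, merge consecutive repeats, then drop blanks. The label set, the `<b>` blank symbol, the function name `ctc_greedy_decode`, and the scores are all made up for illustration.

```python
def ctc_greedy_decode(frames, labels, blank="<b>"):
    """Greedy CTC decoding: argmax per frame, merge repeats, drop blanks.

    frames: list of per-frame score lists, one score per entry in labels.
    labels: label names, including the blank symbol (assumed to be '<b>').
    """
    # Pick the highest-scoring label for every frame.
    best = [labels[max(range(len(f)), key=f.__getitem__)] for f in frames]

    decoded = []
    prev = None
    for label in best:
        # Emit a label only when it differs from the previous frame's
        # label and is not the blank.
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded


# Hypothetical label set and scores; the blank wins on frames 3 and 5.
labels = ["<b>", "k", "ae", "t"]
frames = [
    [0.10, 0.80, 0.05, 0.05],
    [0.20, 0.70, 0.05, 0.05],
    [0.90, 0.04, 0.03, 0.03],
    [0.10, 0.10, 0.70, 0.10],
    [0.60, 0.10, 0.20, 0.10],
    [0.10, 0.10, 0.10, 0.70],
]
print(ctc_greedy_decode(frames, labels))
```

Because a blank between two identical labels resets the repeat check, the same phone can be emitted twice in a row when a blank frame separates the two occurrences. A full recognizer would normally search over these scores together with pronunciation and language models instead of taking the per-frame maximum.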

Assignees

Google LLC

Inventors

Classifications

G10L15/16 · Primary
using artificial neural networks · CPC title
G10L2015/022
Demisyllables, biphones or triphones being the recognition units · CPC title
G10L15/187
Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams · CPC title
G10L15/30
Distributed recognition, e.g. in client-server systems, for mobile phones or network applications · CPC title

Patent family

Related publications grouped by family.

View patent family 65633157

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10229672B1 cover?: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training acoustic models and using the trained acoustic models. A connectionist temporal classification (CTC) acoustic model is accessed, the CTC acoustic model having been trained using a context-dependent state inventory generated from approximate phonetic alignments determined by another CT…
Who is the assignee on this patent?: Google LLC