Generating target sequences from input sequences using partial conditioning

US10043512B2 · US · B2

Patent metadata
Field	Value
Publication number	US-10043512-B2
Application number	US-201615349245-A
Country	US
Kind code	B2
Filing date	Nov 11, 2016
Priority date	Nov 12, 2015
Publication date	Aug 7, 2018
Grant date	Aug 7, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection; read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system can be configured to perform tasks such as converting recorded speech to a sequence of phonemes that represent the speech, converting an input sequence of graphemes into a target sequence of phonemes, translating an input sequence of words in one language into a corresponding sequence of words in another language, or predicting a target sequence of words that follow an input sequence of words in a language (e.g., a language model). In a speech recognizer, the RNN system may be used to convert speech to a target sequence of phonemes in real-time so that a transcription of the speech can be generated and presented to a user, even before the user has completed uttering the entire speech input.

First claim

Opening claim text (preview).

What is claimed is:
1. A method for generating a target sequence comprising a respective output at each of a plurality of output time steps from an input sequence comprising a respective input at each of a plurality of input time steps, the method comprising: for each block of a fixed number of input time steps in the input sequence: processing each input in the block of input time steps using an encoder recurrent neural network (RNN) to generate a respective feature representation of the input; selecting outputs for a portion of the plurality of time steps corresponding to the block, including for each current output time step of the portion of the plurality of time steps, processing, using a transducer RNN, (i) data that is based on the feature representations for the inputs in the block and (ii) a preceding output at a preceding output time step that immediately precedes the current output time step, to select a respective output for the current output time step; and when the respective output for the current output time step is a designated end-of-block output, refraining from generating any more outputs for the block.
2. The method of claim 1, wherein, for the initial time step in the portion of the plurality of time steps corresponding to an initial block in the input sequence, the preceding output at the preceding output time step is a placeholder start-of-sequence output.
3. The method of claim 1, wherein the transducer RNN is configured to, for a given block of input time steps and to select an output for a given output time step: process the output at an output time step immediately preceding the given output time step and a preceding context vector for the output time step immediately preceding the given output time step using a first RNN subnetwork to update a current hidden state of the first RNN subnetwork; process the updated hidden state of the first RNN subnetwork and the feature representations for the inputs in the given block of input time steps using a context subnetwork to determine a current context vector; process the current context vector and the updated hidden state of the first RNN subnetwork using a second RNN subnetwork to update a current hidden state of the second RNN subnetwork; and process the current hidden state of the second RNN subnetwork using a softmax layer to generate a respective score for each output in a dictionary of possible outputs.
4. The method of claim 3, wherein the context subnetwork is an RNN.
5. The method of claim 1, wherein the input sequence is a speech sequence and the target sequence is a sequence of phonemes representing the speech sequence.
6. The method of claim 1, wherein when the respective output for the current output time step is not the designated end-of-block output, determining to continue generating additional outputs for the block.
7. A system comprising one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations for generating a target sequence comprising a respective output at each of a plurality of output time steps from an input sequence comprising a respective input at each of a plurality of input time steps, the operations comprising: for each block of a fixed number of input time steps in the input sequence: processing each input in the block of input time steps using an encoder recurrent neural network (RNN) to generate a respective feature representation of the input; selecting outputs for a portion of the plurality of time steps corresponding to the block, including for each current output time step of the portion of the plurality of time steps, processing, using a transducer RNN, (i) data that is based on the feature representations for the inputs in the block and (ii) a preceding output at a preceding output time step that immediately precedes the current output time step, to select a respective output for the current output time step; and when the respective output for the current output time step is a designated end-of-block output, refraining from generating any more outputs for the block.
8. The system of claim 7, wherein, for the initial time step in the portion of the plurality of time steps corresponding to an initial block in the input sequence, the preceding output at the preceding output time step is a placeholder start-of-sequence output.
9. The system of claim 7, wherein the transducer RNN is configured to, for a given block of input time steps and to select an output for a given output time step: process the output at an output time step immediately preceding the given output time step and a preceding context vector for the output time step immediately preceding the given output time step using a first RNN subnetwork to update a current hidden state of the first RNN subnetwork; process the updated hidden state of the first RNN subnetwork and the feature representations for the inputs in the given block of input time steps using a context subnetwork to determine a current context vector; process the current context vector and the updated hidden state of the first RNN subnetwork using a second RNN subnetwork to update a current hidden state of the second RNN subnetwork; and process the current hidden state of the second RNN subnetwork using a softmax layer to generate a respective score for each output in a dictionary of possible outputs.
10. The system of claim 9, wherein the context subnetwork is an RNN.
11. The system of claim 7, wherein the input sequence is a speech sequence and the target sequence is a sequence of phonemes representing the speech sequence.
12. A computer storage medium encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform operations for generating a target sequence comprising a respective output at each of a plurality of output time steps from an input sequence comprising a respective input at each of a plurality of input time steps, the operations comprising: for each block of a fixed number of input time steps in the input sequence: processing each input in the block of input time steps using an encoder recurrent neural network (RNN) to generate a respective feature representation of the input; selecting outputs for a portion of the plurality of time steps corresponding to the block, including for each current output time step of the portion of the plurality of time steps, processing, using a transducer RNN, (i) data that is based on the feature representations for the inputs in the block and (ii) a preceding output at a preceding output time step that immediately precedes the current output time step, to select a respective output for the current output time step; and when the respective output for the current output time step is a designated end-of-block output, refraining from generating any more outputs for the block.
13. The computer storage medium of claim 12, wherein, for the initial time step in the portion of the plurality of time steps corresponding to an initial block in the input sequence, the preceding output at the preceding output time step is a placeholder start-of-sequence output.
14. The computer storage medium of claim 12, wherein the transducer RNN is configured to, for a given block of input time steps and to select an output for a given output time step: process the output at an output time step immediately preceding the given output time step and a preceding context vector for the output time step immediately preceding the given output time step using a first RNN subnetwork to
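
The control flow recited in claim 1 may be easier to follow as code. The following is only a minimal sketch of that loop, not the patented implementation: the names `transduce`, `encoder_step`, `transducer_step`, `START`, `END_OF_BLOCK`, and `max_outputs_per_block` are hypothetical, and the two toy step functions are trivial stand-ins for the encoder RNN and transducer RNN. The sketch also assumes that a shorter final block is allowed, that the end-of-block symbol is fed back as the preceding output for the next block, and that a safety cap bounds the outputs per block; none of these details come from the claim text.

```python
from typing import Any, Callable, Iterator, List, Sequence, Tuple

START = "<s>"         # hypothetical placeholder start-of-sequence output (cf. claim 2)
END_OF_BLOCK = "<e>"  # hypothetical designated end-of-block output

# Each step function takes a state and returns (new_state, result), like an RNN cell.
EncoderStep = Callable[[Any, Any], Tuple[Any, Any]]
TransducerStep = Callable[[Any, List[Any], Any], Tuple[Any, Any]]


def transduce(
    inputs: Sequence[Any],
    block_size: int,
    encoder_step: EncoderStep,
    transducer_step: TransducerStep,
    max_outputs_per_block: int = 16,  # assumption: safety cap, not part of the claim
) -> Iterator[Any]:
    """Yield outputs block by block, before the whole input has been consumed."""
    enc_state: Any = None
    dec_state: Any = None
    prev_output: Any = START
    for start in range(0, len(inputs), block_size):
        block = inputs[start:start + block_size]
        # Encoder: one feature representation per input in the block.
        features: List[Any] = []
        for x in block:
            enc_state, feature = encoder_step(enc_state, x)
            features.append(feature)
        # Transducer: select outputs until the end-of-block output appears.
        for _ in range(max_outputs_per_block):
            dec_state, output = transducer_step(dec_state, features, prev_output)
            prev_output = output
            if output == END_OF_BLOCK:
                break  # refrain from generating more outputs for this block
            yield output


def toy_encoder_step(state: Any, x: str) -> Tuple[int, str]:
    """Toy stand-in for an encoder RNN: counts frames, lowercases the input."""
    return (state or 0) + 1, x.lower()


def toy_transducer_step(state: Any, features: List[str], prev_output: Any) -> Tuple[int, str]:
    """Toy stand-in for a transducer RNN: emits each non-blank feature once.

    A real transducer would condition on prev_output; this toy ignores it.
    """
    i = state or 0  # index of the next feature to examine in this block
    while i < len(features) and features[i] == "_":
        i += 1      # skip blank ("silent") frames
    if i == len(features):
        return 0, END_OF_BLOCK  # block exhausted; reset position for the next block
    return i + 1, features[i]


if __name__ == "__main__":
    frames = list("h_e_l_lo")
    print("".join(transduce(frames, 3, toy_encoder_step, toy_transducer_step)))
```

Because `transduce` is a generator, a caller can consume outputs as each block finishes, which loosely mirrors the real-time behavior described in the abstract. In this sketch the end-of-block symbol is consumed internally and not yielded to the caller; that is a presentation choice made here.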

Assignees

Google LLC

Inventors

Classifications

G06F40/55 · Rule-based translation
G06N3/044 · Recurrent networks, e.g. Hopfield networks
G06F40/58 · Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
G06F40/274 · Converting codes to words; Guess-ahead of partial word inputs
G10L15/26 (Primary) · Speech to text systems (G10L15/08 takes precedence)

Patent family

Related publications grouped by family.

View patent family 57421957

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10043512B2 cover?: A system can be configured to perform tasks such as converting recorded speech to a sequence of phonemes that represent the speech, converting an input sequence of graphemes into a target sequence of phonemes, translating an input sequence of words in one language into a corresponding sequence of words in another language, or predicting a target sequence of words that follow an input sequence o…
Who is the assignee on this patent?: Google LLC
What technology area does this patent fall under?: Primary CPC classification G10L15/26. Mapped technology areas include Physics.
When was this patent published?: Publication date Aug 7, 2018 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Related patents

Latency constraints for acoustic modeling

Processing acoustic sequences using long short-term memory (LSTM) neural networks that include recurrent projection layers

System and method for speech recognition using deep recurrent neural networks

Using embedding functions with a deep network