Encoder-decoder models for sequence to sequence mapping

US10706840B2 · US · B2

Patent metadata
Field: Value
Publication number: US-10706840-B2
Application number: US-201715846634-A
Country: US
Kind code: B2
Filing date: Dec 19, 2017
Priority date: Aug 18, 2017
Publication date: Jul 7, 2020
Grant date: Jul 7, 2020

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus for performing speech recognition. In some implementations, acoustic data representing an utterance is obtained. The acoustic data corresponds to time steps in a series of time steps. One or more computers process scores indicative of the acoustic data using a recurrent neural network to generate a sequence of outputs. The sequence of outputs indicates a likely output label from among a predetermined set of output labels. The predetermined set of output labels includes output labels that respectively correspond to different linguistic units and to a placeholder label that does not represent a classification of acoustic data. The recurrent neural network is configured to use an output label indicated for a previous time step to determine an output label for the current time step. The generated sequence of outputs is processed to generate a transcription of the utterance, and the transcription of the utterance is provided.
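The label feedback described in the abstract can be illustrated with a small greedy decoder. This is only a sketch under stated assumptions: `toy_step` is a hypothetical stand-in for the recurrent neural network, the names `BLANK`, `LABELS`, `greedy_decode`, and `to_transcript` do not come from the patent, and starting from the placeholder label and dropping placeholders before concatenation follow the dependent claims on graphemes rather than anything the abstract itself specifies.

```python
from typing import Callable, List, Sequence, Tuple

BLANK = "<b>"                 # hypothetical name for the placeholder label
LABELS = [BLANK, "h", "i"]    # toy label set: placeholder plus two graphemes

# A step function maps (frame score, previous label, state) to
# (probabilities over LABELS, new state). In a real system this would be
# the recurrent neural network; here it is an assumed toy.
StepFn = Callable[[int, str, object], Tuple[List[float], object]]


def toy_step(frame: int, prev: str, state: object) -> Tuple[List[float], object]:
    """Toy model: favours the label the frame points at, unless that label
    was just emitted, in which case it favours the placeholder."""
    target = LABELS[frame]
    if target == prev:
        target = BLANK
    probs = [0.9 if label == target else 0.05 for label in LABELS]
    return probs, state


def greedy_decode(frames: Sequence[int], step: StepFn) -> List[str]:
    """Pick the most likely label at each time step and feed it back
    as the previous-label input for the next time step."""
    prev, state = BLANK, None   # assumed: the first previous label is the placeholder
    chosen: List[str] = []
    for frame in frames:
        probs, state = step(frame, prev, state)
        best = max(range(len(LABELS)), key=lambda i: probs[i])
        prev = LABELS[best]
        chosen.append(prev)
    return chosen


def to_transcript(chosen: Sequence[str]) -> str:
    """Drop placeholder labels and concatenate the remaining graphemes."""
    return "".join(label for label in chosen if label != BLANK)


labels_out = greedy_decode([1, 1, 0, 2, 2], toy_step)
transcript = to_transcript(labels_out)
```

Because one label is emitted per time step, the placeholder is what lets a short transcription line up with a longer run of acoustic frames. A production decoder would more likely use beam search or a language model, as one of the dependent claims mentions.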

First claim

Opening claim text (preview).

What is claimed is:

1. A method performed by one or more computers of a speech recognition system, the method comprising:

  obtaining, by the one or more computers, acoustic data representing an utterance, the acoustic data corresponding to time steps in a series of time steps;

  processing, by the one or more computers, scores indicative of the acoustic data using a recurrent neural network to generate a sequence of outputs, wherein the sequence of outputs indicates likely output labels from among a predetermined set of output labels, wherein the predetermined set of output labels includes output labels that respectively correspond to different linguistic units and to a placeholder label that does not represent a classification of acoustic data, wherein the recurrent neural network is configured to use an output label that was selected, based on output of the recurrent neural network, for a previous time step to determine output of the recurrent neural network indicating likelihoods of the output labels for the current time step;

  processing the generated sequence of outputs to generate a transcription of the utterance; and

  providing, by the one or more computers, the generated transcription of the utterance as output of the speech recognition system, wherein the recurrent neural network has been trained to process received input acoustic sequences and generate sequences of outputs, the training comprising:

    obtaining a plurality of training examples, each training example comprising (i) an input acoustic sequence of scores indicative of the acoustic data at each of multiple time steps in a series of time steps, the input acoustic sequence representing a known utterance, and (ii) a corresponding target sequence of linguistic units representing a transcription of the utterance;

    training the recurrent neural network to minimize a negative log likelihood loss function by, for each of the plurality of training samples:

      representing possible alignments between the input acoustic sequence and the target sequence of linguistic units as a lattice, the possible alignments constrained to allow placeholder label repetitions only and each node in the lattice represents a respective state of the recurrent neural network, each state of the recurrent neural network being dependent on a respective time step from the series of time steps and a respective position in the target sequence of linguistic units, and wherein transitions between nodes in the lattice represent probabilities of observing respective subsequent linguistic units or placeholder labels in the target sequence of linguistic units;

      performing forward calculations through the lattice to update each recurrent neural network state;

      approximating the log likelihood loss function using the updated recurrent neural network states; and

      performing back propagation techniques using the approximated log likelihood function to adjust recurrent neural network parameters to trained recurrent neural network parameters; and

    training the recurrent neural network to minimize an expected loss function using the plurality of training examples.

2. The method of claim 1, wherein processing the generated sequence of outputs to generate a transcription of the utterance comprises determining a most likely output sequence of linguistic units.

3. The method of claim 2, wherein determining the most likely output sequence comprises applying one or more of (i) beam search processing, (ii) a language model, and (iii) one or more linguistic rules.

4. The method of claim 1, wherein the linguistic units are graphemes, wherein processing the generated sequence of outputs to generate a transcription of the utterance comprises: removing, from a sequence of output labels that the outputs of the recurrent neural network indicate to be most likely, output labels corresponding to the placeholder output label; and concatenating graphemes indicated by the remaining output labels in the sequence of output labels that the outputs of the recurrent neural network indicate to be most likely.

5. The method of claim 1, wherein the recurrent neural network comprises one or more recurrent neural network layers and an output layer.

6. The method of claim 5, wherein the output layer estimates a conditional probability distribution representing the probability of an alignment between the scores indicative of the acoustic data and the sequence of outputs, wherein the conditional probability distribution comprises a product of output conditional probabilities for each time step, each output conditional probability representing the probability of an output for a respective time step given (i) the score for the respective time step, and (ii) an output for a preceding time step.

7. The method of claim 1, wherein the output for the first time step in the series of time steps is defined as an output label representing the placeholder label.

8. The method of claim 1, wherein performing forward calculations through the lattice to update each recurrent neural network state comprises determining values of multiple forward variables, wherein each forward variable corresponds to a respective time step from {1, ..., t} and represents a probability of outputting a particular sequence of n linguistic units up to the respective time step.

9. The method of claim 8, wherein performing forward calculations through the lattice to update each recurrent neural network state comprises:

  determining that two different transitions between start node (t−1, n−1) and end node (t, n) exist in the lattice, the two different transitions comprising a first transition through a first intermediate node (t, n−1) and a second transition through a second intermediate node (t−1, n);

  updating the recurrent neural network state for the end node to equal a recurrent neural network state corresponding to the start node (t−1, n−1) if the product of (i) a forward variable for node (t−1, n−1), and (ii) probability of outputting a linguistic unit at node (t−1, n−1) is greater than the product of (i) a forward variable for node (t−1, n), and (ii) probability of outputting a placeholder label at node (t−1, n); and

  updating the recurrent neural network state for the end node to equal a recurrent neural network state corresponding to the second intermediate node (t−1, n) if the product of (i) a forward variable for node (t−1, n−1), and (ii) probability of outputting a linguistic unit at node (t−1, n−1) is not greater than the product of (i) a forward variable for node (t−1, n), and (ii) probability of outputting a placeholder label at node (t−1, n).

10. The method of claim 9, further comprising defining multiple backward variables as the probability of outputting a particular sequence of N−n linguistic units from the particular time t.

11. The method of claim 10, wherein approximating the log likelihood loss function comprises determining the value of a backward variable for time t=0 and n=0.

12. The method of claim 1, wherein performing forward calculations through the lattice to update each recurrent neural network state comprises defining the first unit in the sequence of outputs as the placeholder label.

13. The method of claim 1, wherein training the recurrent neural network to minimize the expected loss function using the plurality of training examples comprises performing back propagation techniques using the expected loss function to adjust recurrent neural network parameters to trained recurrent neural network parameters.

14. The method of claim 1, wherein the linguistic units are context-dependent phones.

15. The method of claim 1, whe
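The forward calculation over the alignment lattice recited in the claims can be sketched as a small dynamic program. This is a simplified illustration, not the patented training procedure: the function name `forward_lattice` and the tables `p_unit` and `p_blank` are hypothetical, and the transition probabilities are assumed to be given as fixed numbers, whereas in the claims they come from recurrent neural network states that themselves depend on the path taken. The sketch only records which predecessor would supply the state under the comparison rule in the claims; it does not run a network.

```python
import math
from typing import List, Optional, Tuple

Node = Tuple[int, int]  # (time step t, number of linguistic units emitted n)


def forward_lattice(
    p_unit: List[List[float]],   # p_unit[t][n]: assumed prob. of emitting unit n+1 when leaving node (t, n)
    p_blank: List[List[float]],  # p_blank[t][n]: assumed prob. of emitting the placeholder when leaving node (t, n)
    T: int,                      # number of time steps
    N: int,                      # number of linguistic units in the target
) -> Tuple[List[List[float]], List[List[Optional[Node]]]]:
    """Compute forward variables alpha[t][n] and, per node, the predecessor
    whose state would be copied under the greater-product rule."""
    alpha = [[0.0] * (N + 1) for _ in range(T + 1)]
    pred: List[List[Optional[Node]]] = [[None] * (N + 1) for _ in range(T + 1)]
    alpha[0][0] = 1.0  # before any frame is consumed, nothing has been emitted

    for t in range(1, T + 1):
        for n in range(0, min(t, N) + 1):
            # Arrive by emitting the placeholder: (t-1, n) -> (t, n).
            via_blank = alpha[t - 1][n] * p_blank[t - 1][n]
            # Arrive by emitting the n-th unit: (t-1, n-1) -> (t, n).
            via_unit = alpha[t - 1][n - 1] * p_unit[t - 1][n - 1] if n > 0 else 0.0
            alpha[t][n] = via_unit + via_blank
            # State selection: take the unit predecessor only if its product is greater.
            pred[t][n] = (t - 1, n - 1) if via_unit > via_blank else (t - 1, n)
    return alpha, pred


# Toy lattice: two time steps, one target unit.
p_unit = [[0.6], [0.5]]
p_blank = [[0.4, 0.7], [0.5, 0.8]]
alpha, pred = forward_lattice(p_unit, p_blank, T=2, N=1)
nll = -math.log(alpha[2][1])  # negative log likelihood of the target
```

In this toy lattice the two alignments are "unit, then placeholder" (0.6 × 0.8) and "placeholder, then unit" (0.4 × 0.5), so the forward variable at the final node is their sum, 0.68. Only placeholder emissions keep n unchanged, which matches the constraint that alignments allow placeholder repetitions only.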

Assignees

Google LLC

Classifications (CPC titles)

  • Probabilistic graphical models, e.g. probabilistic networks

  • Recurrent networks, e.g. Hopfield networks

  • Combinations of networks

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]

  • Auto-encoder networks; Encoder-decoder networks

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10706840B2 cover?
Speech recognition methods, systems, and apparatus in which a recurrent neural network processes scores indicative of acoustic data and, for each time step, indicates a likely output label from a predetermined set that includes linguistic units and a placeholder label. The network uses the label indicated for the previous time step to determine the label for the current time step, and the resulting sequence of outputs is processed into a transcription of the utterance.
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/16. Mapped technology areas include Physics.
When was this patent published?
Publication date Jul 7, 2020 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).