Token-position handling for sequence based neural networks
US-2021319288-A1 · Oct 14, 2021 · US
US12242818B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12242818-B2 |
| Application number | US-202117797872-A |
| Country | US |
| Kind code | B2 |
| Filing date | Feb 8, 2021 |
| Priority date | Feb 7, 2020 |
| Publication date | Mar 4, 2025 |
| Grant date | Mar 4, 2025 |
Official abstract text for this publication.
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for sequence modeling. One of the methods includes receiving an input sequence having a plurality of input positions; determining a plurality of blocks of consecutive input positions; processing the input sequence using a neural network to generate a latent alignment, comprising, at each of a plurality of input time steps: receiving a partial latent alignment from a previous input time step; selecting an input position in each block, wherein the token at the selected input position of the partial latent alignment in each block is a mask token; and processing the partial latent alignment and the input sequence using the neural network to generate a new latent alignment, wherein the new latent alignment comprises, at the selected input position in each block, an output token or a blank token; and generating, using the latent alignment, an output sequence.
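The decoding loop in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `decode_blockwise`, `collapse`, `toy_predict`, `MASK`, and `BLANK` are hypothetical names, the `predict_fn` interface (one token and one confidence score per position) stands in for the neural network, and the CTC-style collapse used to turn the alignment into an output sequence is an assumption that the abstract does not spell out.

```python
from typing import Callable, List, Sequence, Tuple

MASK = "<mask>"   # hypothetical placeholder for a not-yet-decided position
BLANK = "_"       # hypothetical blank token

# Hypothetical model interface: for every input position, return
# (predicted token, confidence). A real system would call a neural network here.
PredictFn = Callable[[Sequence[str], Sequence[str]], List[Tuple[str, float]]]


def decode_blockwise(inputs: Sequence[str], block_size: int,
                     predict_fn: PredictFn) -> List[str]:
    """Fill one masked position per block at each step until no masks remain."""
    n = len(inputs)
    alignment = [MASK] * n
    # Blocks of consecutive input positions.
    blocks = [range(s, min(s + block_size, n)) for s in range(0, n, block_size)]
    # One step per position in a block, so the number of steps equals block_size.
    for _ in range(block_size):
        preds = predict_fn(alignment, inputs)
        new_alignment = list(alignment)
        for block in blocks:
            masked = [i for i in block if alignment[i] == MASK]
            if not masked:
                continue  # a shorter final block may already be complete
            # Arg max over the masked positions of this block; blocks are
            # independent of each other, so this could run in parallel.
            best = max(masked, key=lambda i: preds[i][1])
            new_alignment[best] = preds[best][0]
        alignment = new_alignment
    return alignment


def collapse(alignment: Sequence[str]) -> List[str]:
    """Assumed CTC-style collapse: merge repeated tokens, then drop blanks."""
    out: List[str] = []
    prev = None
    for tok in alignment:
        if tok != prev and tok != BLANK:
            out.append(tok)
        prev = tok
    return out


def toy_predict(alignment: Sequence[str],
                inputs: Sequence[str]) -> List[Tuple[str, float]]:
    """Toy stand-in for the network: echoes the input, with '-' mapped to blank."""
    return [(BLANK if x == "-" else x.upper(), 1.0 / (1 + i))
            for i, x in enumerate(inputs)]


if __name__ == "__main__":
    latent = decode_blockwise(list("a-bb-c"), 3, toy_predict)
    print(latent, collapse(latent))
```

With a block size of 3, the six-position toy input is fully decoded in three calls to `predict_fn`, regardless of how many blocks there are. The number of model calls therefore depends on the block size rather than on the sequence length.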
Opening claim text (preview).
What is claimed is:

1. A method of generating, from an input sequence having a respective input token at each of a plurality of input positions, an output sequence having a respective output token from a vocabulary of output tokens at each of a plurality of output positions, the method comprising: receiving the input sequence; determining a plurality of blocks, wherein each block comprises a plurality of input tokens having consecutive input positions from the input positions, wherein the input tokens comprise a first token modality associated with audio tokens, text tokens, or image tokens; processing the input sequence using a neural network to generate a latent alignment of the input sequence, wherein the latent alignment comprises, at each of the input positions, either an output token from the vocabulary of output tokens or a blank token, the processing comprising, at each of a plurality of input time steps: receiving a partial latent alignment from a previous input time step, wherein the partial latent alignment comprises, at each of the input positions, one of: an output token, a blank token, or a mask token; selecting an input position in each block, wherein the token at the selected input position of the partial latent alignment in each block is a mask token; and processing i) the partial latent alignment and ii) the input sequence using the neural network to generate a new latent alignment, wherein the new latent alignment comprises, at the selected input position in each block, an output token or a blank token; and generating, using the latent alignment, the output sequence, wherein the output tokens comprise a second token modality associated with audio tokens, text tokens, or image tokens.

2. The method of claim 1, wherein each block comprises a same number of input tokens, and wherein the same number of input tokens is equal to a number of input time steps.

3. The method of claim 1, wherein processing i) the partial latent alignment and ii) the input sequence using the neural network to generate a new latent alignment comprises: processing the input sequence using a first embedding subnetwork to generate an input sequence embedding; processing the partial latent alignment using a second embedding subnetwork to generate a partial latent alignment embedding; combining the partial latent alignment embedding and the input sequence embedding to generate a combined embedding; and processing the combined embedding using a self-attention subnetwork to generate the new latent alignment.

4. The method of claim 1, wherein the first token modality is an audio token modality comprising audio sample tokens and the second token modality is a text token modality comprising text sample tokens.

5. The method of claim 1, wherein the first token modality is a text token modality comprising text sample in a first language and the second token modality is a text token modality comprising text sample in a second language.

6. The method of claim 1, wherein processing i) the partial latent alignment and ii) the input sequence using the neural network to generate a new latent alignment comprises: upsampling the input sequence to generate a modified input sequence; and processing i) the partial latent alignment and ii) the modified input sequence using the neural network to generate the new latent alignment.

7. The method of claim 1, wherein the neural network has been trained by updating parameters $\theta$ of the neural network using an objective function that marginalizes over all possible new partial latent alignments that are compatible with a particular partial latent alignment.

8. The method of claim 7, wherein the objective function is:

$$J_{\mathrm{DP}}(\theta) = \mathbb{E}_{a \sim q_{\phi'}}\left[\mathbb{E}_{\tilde{a} \sim r}\left[\log \sum_{a' \in \beta'(\tilde{a},\, a)} p_{\theta}(a' \mid \tilde{a}, x)\right]\right],$$

where $x$ is the input sequence, $a$ is a particular latent alignment, $\tilde{a}$ is a particular partial latent alignment of the latent alignment $a$, $\phi'$ is a pseudo-expert policy, $q_{\phi'}$ is a distribution over all possible latent alignments of $x$ under the pseudo-expert policy $\phi'$, $r(a)$ is a distribution over all possible masking permutations of latent alignments of $x$, and $\beta'(\tilde{a}, a)$ returns a set of all possible new partial latent alignments compatible with the particular partial latent alignment $\tilde{a}$ drawn from the distribution $q_{\phi'} \times r$.

9. The method of claim 1, wherein the neural network has been trained by updating parameters $\theta$ of the neural network using an objective function that computes a loss according to a pseudo-expert policy.

10. The method of claim 9, wherein the objective function is:

$$J_{\mathrm{IM}}(\theta) = \mathbb{E}_{a \sim q_{\phi'}}\left[\mathbb{E}_{\tilde{a} \sim r}\left[\log p_{\theta}(a \mid \tilde{a}, x)\right]\right],$$

where $x$ is the input sequence, $a$ is a particular latent alignment, $\tilde{a}$ is a particular partial latent alignment of the latent alignment $a$, $\phi'$ is a pseudo-expert policy, $q_{\phi'}$ is a distribution over all possible alignments of $x$ under the pseudo-expert policy $\phi'$, and $r(a)$ is a distribution over all possible masking permutations of alignments of $x$.

11. The method of claim 8, wherein $q_{\phi'} = \hat{a}^{\star}_{\phi} + N$, where $N$ is a noise distribution and $\hat{a}^{\star}_{\phi}$ is a best empirical alignment under an expert policy $\phi$,

$$\hat{a}^{\star}_{\phi} = \arg\max_{a}\, q_{\phi}(a \mid x, y),$$

where $q_{\phi}$ is a distribution over all possible alignments of $x$ under the expert policy $\phi$.

12. The method of claim 11, wherein $\hat{a}^{\star}_{\phi}$ is computed using dynamic programming.

13. The method of claim 8, wherein $q_{\phi'} = q_{\theta'}$, where $q_{\theta'}$ is a stationary distribution created from a stale copy $\theta'$ of the parameters $\theta$ of the neural network.

14. The method of claim 7, wherein training the neural network comprises: sampling a particular latent alignment $a \sim q_{\phi'}$ for a particular input sequence $x$; sampling a particular partial latent alignment $\tilde{a} \sim r(a)$ by sampling a particular masking permutation from $r$ and applying the particular masking permutation to $a$; processing the particular partial latent alignment $\tilde{a}$ and the particular input sequence $x$ using the neural network to generate a prediction; computing the objective function; computing an error in the prediction using the computed objective function; backpropagating the error through the neural network to determine an update to the parameters $\theta$ of the neural network.

15. The method of claim 14, wherein the objective function is computed using dynamic programming.

16. The method of claim 7, wherein $r(a)$ is a Bernoulli or Uniform distribution.

17. The method of claim 1, wherein selecting an input position in each block comprises computing an arg max input position for each block in parallel across all blocks.

18. The method of claim 1, wh
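The training-side steps recited in claims 10, 14, and 16 can be illustrated with a small sketch. It rests on assumptions that the claims do not state: the function names are hypothetical, `position_probs` is an invented stand-in for the network output (one token-to-probability mapping per position), positions are treated as conditionally independent given the partial alignment, and the log-likelihood is summed over masked positions only.

```python
import math
import random
from typing import Dict, List, Sequence

MASK = "<mask>"  # hypothetical mask token


def sample_partial_alignment(alignment: Sequence[str], mask_prob: float,
                             rng: random.Random) -> List[str]:
    """Bernoulli masking: each position is independently replaced by MASK."""
    return [MASK if rng.random() < mask_prob else tok for tok in alignment]


def imitation_log_likelihood(alignment: Sequence[str], partial: Sequence[str],
                             position_probs: Sequence[Dict[str, float]]) -> float:
    """Log-probability of the reference alignment at the masked positions.

    Assumes per-position independence, so the log-likelihood is a sum of
    per-position log-probabilities (an assumption of this sketch).
    """
    total = 0.0
    for i, (tok, seen) in enumerate(zip(alignment, partial)):
        if seen == MASK:
            total += math.log(position_probs[i][tok])
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    reference = ["A", "_", "B"]          # a sampled latent alignment
    partial = sample_partial_alignment(reference, 0.5, rng)
    # Made-up probabilities standing in for the model's prediction.
    probs = [{"A": 0.5, "_": 0.5}, {"A": 0.1, "_": 0.9}, {"B": 0.25, "_": 0.75}]
    print(partial, imitation_log_likelihood(reference, partial, probs))
```

In a real training loop the negative of this quantity would serve as the loss whose error is backpropagated. The marginalized objective of claim 8 would instead sum over every compatible alignment, which this sketch does not attempt.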
Convolutional networks [CNN, ConvNet] · CPC title
Supervised learning · CPC title
Lexical analysis, e.g. tokenisation or collocates · CPC title
Combinations of networks · CPC title
Probabilistic or stochastic networks · CPC title