Attention-based decoder-only sequence transduction neural networks

US11556786B2 · US · B2

Patent metadata
Publication number: US-11556786-B2
Application number: US-201816759690-A
Country: US
Kind code: B2
Filing date: Oct 29, 2018
Priority date: Oct 27, 2017
Publication date: Jan 17, 2023
Grant date: Jan 17, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for generating an output sequence from an input sequence. One of the methods includes, at each of a plurality of generation time steps: generating a combined sequence for the generation time step that includes the input sequence followed by the output tokens that have already been generated as of the generation time step; processing the combined sequence using a self-attention decoder neural network to generate a time step output that defines a score distribution over a set of possible output tokens; and selecting, using the time step output, an output token from the set of possible output tokens as the next output token in the output sequence.
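The generation loop described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `greedy_decode`, `score_fn`, and `toy_copy_model`, the `<sep>` and `<eos>` token strings, and the `max_steps` cap are all hypothetical, and greedy selection (taking the highest-scoring token) is only one possible way of "selecting, using the time step output". The toy model stands in for the self-attention decoder and simply learns nothing; it scores whichever input token should be copied next.

```python
from typing import Callable, Dict, List, Sequence

SEP = "<sep>"  # hypothetical separator token between input and generated output
EOS = "<eos>"  # hypothetical end-of-sequence token


def greedy_decode(
    input_tokens: Sequence[str],
    score_fn: Callable[[List[str]], Dict[str, float]],
    max_steps: int = 16,
) -> List[str]:
    """Generate output tokens one time step at a time."""
    output: List[str] = []
    for _ in range(max_steps):
        # Combined sequence: the input followed by the tokens generated so far.
        combined = list(input_tokens) + [SEP] + output
        # The decoder maps the combined sequence to scores over possible tokens.
        scores = score_fn(combined)
        # Assumption: pick the highest-scoring token (greedy selection).
        next_token = max(scores, key=scores.get)
        if next_token == EOS:
            break
        output.append(next_token)
    return output


def toy_copy_model(combined: List[str]) -> Dict[str, float]:
    """Stand-in for the decoder network: favors copying the input in order."""
    sep = combined.index(SEP)
    source, generated = combined[:sep], combined[sep + 1:]
    target = source[len(generated)] if len(generated) < len(source) else EOS
    vocab = set(source) | {EOS}
    return {tok: (1.0 if tok == target else 0.0) for tok in vocab}


result = greedy_decode(["a", "b", "c"], toy_copy_model)
```

With the toy model, each step rebuilds the combined sequence from scratch, which mirrors the wording of the abstract; a practical system would likely cache intermediate computations rather than reprocess the whole sequence.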

First claim

Opening claim text (preview).

What is claimed is:

1. A method of generating an output sequence comprising a plurality of output tokens from an input sequence comprising a plurality of input tokens, the method comprising, at each of a plurality of generation time steps: generating a combined sequence for the generation time step that includes the input sequence followed by the output tokens that have already been generated as of the generation time step; processing the combined sequence using a self-attention decoder neural network, wherein the self-attention decoder neural network comprises a plurality of neural network layers that include a plurality of masked self-attention neural network layers, and wherein the self-attention decoder neural network is configured to process the combined sequence through the plurality of neural network layers to generate a time step output that defines a score distribution over a set of possible output tokens; and selecting, using the time step output, an output token from the set of possible output tokens as the next output token in the output sequence.

2. The method of claim 1, wherein the masked self-attention neural network layers are masked such that the time step output depends only on the input sequence and the output tokens that have already been generated as of the generation time step and not on any output tokens that are after the last token that had already been generated in the output sequence.

3. The method of claim 1, wherein the input sequence and the output tokens that have already been generated as of the generation time step are separated by a predetermined special separator token in the combined sequence.

4. The method of claim 1, wherein the plurality of masked self-attention neural network layers are masked multi-head attention layers.

5. The method of claim 1, wherein the plurality of masked self-attention neural network layers comprise at least one local attention layer, and wherein each local attention layer comprises a local attention sub-layer that is configured to: receive a layer input sequence comprising a plurality of layer inputs; divide the layer input sequence into a plurality of sub-sequences; generate, for sub-sequence, a sub-sequence output by performing self-attention on the layer inputs in the sub-sequence; and merge the sub-sequence outputs to generate a layer output sequence.

6. The method of claim 1, wherein the plurality of masked self-attention neural network layers comprise at least one memory-compressed attention layer, and wherein each memory-compressed attention layer comprises a memory-compressed sub-layer that is configured to: obtain an attention input comprising a plurality of keys, values, and queries; applying a strided convolution to the keys to generate a reduced set of keys; applying a strided convolution to the values to generate a reduced set of values; generate a layer output sequence by performing self-attention using the reduced set of keys, the reduced set values, and the plurality of queries.

7. The method of claim 6, wherein obtaining the attention input comprises: receiving a layer input sequence comprising a plurality of layer inputs; and projecting the layer input sequence into the keys, values, and queries using respective projection matrices.

8. The method of claim 1, wherein the input sequence comprises text from a plurality of documents, and wherein the output sequence is text that summarizes the plurality of documents.

9. The method of claim 8, wherein the input sequence further comprises text specifying a topic to which the plurality of documents relate.

10. The method of claim 1, further comprising: determining that the selected output for the time step is a pre-determined end-of-sequence token; and in response, providing the output tokens that have already been generated as of the generation time step as the final output sequence for the input sequence.

11. The method of claim 1, wherein the plurality of neural network layers include one or more mixture-of-experts layers.

12. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations for generating an output sequence comprising a plurality of output tokens from an input sequence comprising a plurality of input tokens, the operations comprising, at each of a plurality of generation time steps: generating a combined sequence for the generation time step that includes the input sequence followed by the output tokens that have already been generated as of the generation time step; processing the combined sequence using a self-attention decoder neural network, wherein the self-attention decoder neural network comprises a plurality of neural network layers that include a plurality of masked self-attention neural network layers, and wherein the self-attention decoder neural network is configured to process the combined sequence through the plurality of neural network layers to generate a time step output that defines a score distribution over a set of possible output tokens; and selecting, using the time step output, an output token from the set of possible output tokens as the next output token in the output sequence.

13. The system of claim 12, wherein the masked self-attention neural network layers are masked such that the time step output depends only on the input sequence and the output tokens that have already been generated as of the generation time step and not on any output tokens that are after the last token that had already been generated in the output sequence.

14. The system of claim 12, wherein the input sequence and the output tokens that have already been generated as of the generation time step are separated by a predetermined special separator token in the combined sequence.

15. The system of claim 12, wherein the plurality of masked self-attention neural network layers are masked multi-head attention layers.

16. The system of claim 12, wherein the plurality of masked self-attention neural network layers comprise at least one local attention layer, and wherein each local attention layer comprises a local attention sub-layer that is configured to: receive a layer input sequence comprising a plurality of layer inputs; divide the layer input sequence into a plurality of sub-sequences; generate, for sub-sequence, a sub-sequence output by performing self-attention on the layer inputs in the sub-sequence; and merge the sub-sequence outputs to generate a layer output sequence.

17. The system of claim 12, wherein the plurality of masked self-attention neural network layers comprise at least one memory-compressed attention layer, and wherein each memory-compressed attention layer comprises a memory-compressed sub-layer that is configured to: obtain an attention input comprising a plurality of keys, values, and queries; applying a strided convolution to the keys to generate a reduced set of keys; applying a strided convolution to the values to generate a reduced set of values; generate a layer output sequence by performing self-attention using the reduced set of keys, the reduced set values, and the plurality of queries.

18. The system of claim 17, wherein obtaining the attention input comprises: receiving a layer input sequence comprising a plurality of layer inputs; and projecting the layer input sequence into the keys, values, and queries using respective projection matrices.

19. The system of claim 12, wherein th
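The local attention and memory-compressed attention sub-layers recited in claims 5 and 6 can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions, not the claimed implementation: all function names are hypothetical, the learned projection matrices of claim 7 are omitted (the layer inputs serve directly as queries, keys, and values), a single attention head is used, and the strided convolution uses one fixed kernel applied to every feature dimension rather than learned weights. Causal masking is shown for the local variant only and is left out of the memory-compressed variant for brevity.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attend(queries, keys, values, causal=False):
    """Scaled dot-product attention; if causal, position i sees keys 0..i only."""
    outputs = []
    for i, q in enumerate(queries):
        limit = i + 1 if causal else len(keys)
        weights = softmax([dot(q, k) / math.sqrt(len(q)) for k in keys[:limit]])
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


def local_attention(layer_inputs, block_size):
    """Divide into sub-sequences, self-attend within each, merge the outputs."""
    merged = []
    for start in range(0, len(layer_inputs), block_size):
        block = layer_inputs[start:start + block_size]
        merged.extend(attend(block, block, block, causal=True))
    return merged


def strided_conv(rows, kernel, stride):
    """1-D convolution along the sequence axis (assumed: same kernel per feature)."""
    k = len(kernel)
    return [
        [sum(kernel[j] * rows[s + j][d] for j in range(k))
         for d in range(len(rows[0]))]
        for s in range(0, len(rows) - k + 1, stride)
    ]


def memory_compressed_attention(queries, keys, values, kernel, stride):
    """Attend from all queries to a reduced set of keys and values."""
    reduced_keys = strided_conv(keys, kernel, stride)
    reduced_values = strided_conv(values, kernel, stride)
    return attend(queries, reduced_keys, reduced_values)


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
local_out = local_attention(x, block_size=2)
compressed_out = memory_compressed_attention(x, x, x, kernel=[0.5, 0.5], stride=2)
```

In this example the local variant limits each position to its own block of two inputs, while the compressed variant keeps all four queries but attends over only two keys and two values, which is where the reduction in attention cost comes from.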

Assignees

Inventors

Classifications

  • G06N3/08 · Primary

    Learning methods · CPC title

  • G06N3/045 · Primary

    Combinations of networks · CPC title

  • Physics · mapped topic

  • Quantised networks; Sparse networks; Compressed networks · CPC title

  • Supervised learning · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11556786B2 cover?
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for generating an output sequence from an input sequence. One of the methods includes, at each of a plurality of generation time steps: generating a combined sequence for the generation time step that includes the input sequence followed by the output tokens that have already been generated as of …
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G06N3/08. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jan 17, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).