Convolution-augmented transformer models

US12079703B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12079703-B2
Application number: US-202017139525-A
Country: US
Kind code: B2
Filing date: Dec 31, 2020
Priority date: Dec 31, 2020
Publication date: Sep 3, 2024
Grant date: Sep 3, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods can utilize a conformer model to process a data set for various data processing tasks, including, but not limited to, speech recognition, sound separation, protein synthesis determination, video or other image set analysis, and natural language processing. The conformer model can use feed-forward blocks, a self-attention block, and a convolution block to process data to learn global interactions and relative-offset-based local correlations of the input data.
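
As a rough illustration of the global versus local distinction drawn in the abstract, the following Python sketch compares a toy self-attention step with a small 1D convolution. It is not the patented model: the scalar inputs, the function names `self_attention` and `conv1d_same`, the kernel values, and the omission of learned projections, multiple heads, and relative positional embeddings are all simplifying assumptions made for this example.

```python
import math

def self_attention(xs):
    """Toy single-head self-attention over scalars.

    Assumption: queries, keys, and values are all the raw inputs, with no
    learned projections. Every output position mixes every input position.
    """
    out = []
    for q in xs:
        scores = [q * k for k in xs]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, xs)) / total)
    return out

def conv1d_same(xs, kernel):
    """1D convolution with zero padding and an odd-sized kernel.

    Each output position only sees inputs within len(kernel) // 2 steps.
    """
    half = len(kernel) // 2
    out = []
    for t in range(len(xs)):
        acc = 0.0
        for j, w in enumerate(kernel):
            i = t + j - half
            if 0 <= i < len(xs):
                acc += w * xs[i]
        out.append(acc)
    return out

# Hypothetical inputs: perturb only the LAST element of the sequence.
base = [0.5, -0.2, 0.1, 0.3, -0.4, 0.2]
perturbed = base[:-1] + [2.0]
kernel = [0.25, 0.5, 0.25]

# Does the output at position 0 react to a change at position 5?
attn_changed = self_attention(base)[0] != self_attention(perturbed)[0]
conv_changed = conv1d_same(base, kernel)[0] != conv1d_same(perturbed, kernel)[0]
print("attention output at position 0 changed:", attn_changed)
print("convolution output at position 0 changed:", conv_changed)
```

Changing only the last input alters the attention output at position 0, because every position attends to every other, while the convolution output at position 0 is unaffected because its three-tap kernel only reaches immediate neighbors.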

First claim

Opening claim text (preview).

What is claimed is:
1. A computer-implemented method for efficiently processing data which accounts for both local and global dependencies, the method comprising: accessing data descriptive of a machine-learned conformer model that comprises one or more conformer blocks, each of the one or more conformer blocks configured to process a block input to generate a block output, each of the one or more conformer blocks comprising: a first feed-forward block configured to process the block input to generate a first feed-forward output; a self-attention block configured to perform self-attention to process the first feed-forward output to generate an attention output; a convolutional block configured to perform convolutions with a convolutional filter to process the attention output of the self-attention block to generate a convolutional output; and a second feed-forward block configured to process the convolutional output of the convolutional block to generate a second feed-forward output; obtaining input data, wherein the input data comprises audio data; and processing the input data with the machine-learned conformer model to generate output data, wherein the output data comprises text data, wherein the machine-learned conformer model comprises the convolutional block that processes outputs of the self-attention block without performing parallel processing with the self-attention block and the convolutional block, and wherein the convolutional block and the self-attention block are between the first feed-forward block and the second feed-forward block.
2. The computer-implemented method of claim 1, wherein the machine-learned conformer model is an encoder model, and the output data comprises an encoding.
3. The computer-implemented method of claim 1, wherein each of the first feed-forward block and the second feed-forward block comprises a respective half-step feed-forward block.
4. The computer-implemented method of claim 1, wherein each of the one or more conformer blocks further comprises a layer normalization block configured to normalize the second feed-forward output to generate the block output.
5. The computer-implemented method of claim 1, wherein the self-attention block comprises a multi-head self-attention block.
6. The computer-implemented method of claim 1, wherein the self-attention block comprises a layer normalization block before a multi-head attention with relative positional embedding block.
7. The computer-implemented method of claim 6, wherein the self-attention block further comprises a dropout block after the multi-head attention with relative positional embedding block.
8. The computer-implemented method of claim 1, wherein the convolution block comprises a pointwise convolution block followed by a gated linear unit (GLU) activation.
9. The computer-implemented method of claim 1, wherein the convolution block comprises a 1D depthwise convolution block followed by a Swish activation.
10. The computer-implemented method of claim 1, wherein the convolution block comprises a layer normalization block, a first pointwise convolution block, a second pointwise convolution block, and a dropout block.
11. The computer-implemented method of claim 1, wherein the first feed-forward block comprises a first linear layer, a Swish activation, and a second linear layer.
12. The computer-implemented method of claim 1, wherein the second feed-forward block comprises a first linear layer, a Swish activation, and a second linear layer.
13. The computer-implemented method of claim 1, wherein the one or more conformer blocks comprise a plurality of conformer blocks stacked in a sequence one after the other.
14. The computer-implemented method of claim 1, wherein the audio data further comprises spectrograph data descriptive of human speech, and the text data comprises speech recognized data for the human speech.
15. The computer-implemented method of claim 1, wherein the output data further comprises sound separation data for the audio data.
16. The computer-implemented method of claim 1, wherein the first feed-forward block, the self-attention block, the convolutional block, and the second feed-forward block each have a respective residual connection.
17. A computing system, comprising: one or more processors; one or more non-transitory computer-readable media that collectively store: a machine-learned conformer model, wherein the machine-learned conformer model comprises: a first feed-forward block; a self-attention block; a convolutional block configured to receive and process an output of the self-attention block; and a second feed-forward block; and instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising: obtaining input data, wherein the input data comprises audio data; and processing the input data with the machine-learned conformer model to generate output data, wherein the output data comprises text data, wherein the machine-learned conformer model comprises the convolutional block that processes outputs of the self-attention block without performing parallel processing with the self-attention block and the convolutional block, and wherein the convolutional block and the self-attention block are between the first feed-forward block and the second feed-forward block.
18. One or more non-transitory computer-readable media that collectively store instructions that, when executed by one or more computing devices, cause the one or more computing devices to perform operations, the operations comprising: obtaining input data, wherein the input data comprises audio data; processing the input data with a conformer model, wherein the conformer model comprises: a first feed-forward block; a self-attention block; a convolutional block configured to receive and process an output of the self-attention block; and a second feed-forward block; wherein the conformer model comprises the convolutional block that processes outputs of the self-attention block without performing parallel processing with the self-attention block and the convolutional block, and wherein the convolutional block and the self-attention block are between the first feed-forward block and the second feed-forward block; and in response to processing the input data with the conformer model, generating an output data, wherein the output data comprises text data.
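
The sequential ordering recited in claim 1, together with the residual connections of claim 16 and the final layer normalization of claim 4, can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the patentee's implementation: the "half-step" feed-forward blocks of claim 3 are interpreted here as a 0.5 residual weight (the claim text itself gives no factor), the layer normalization has no learned gain or bias, and the sub-blocks are hypothetical stand-ins created by `make_stub` purely to trace the data flow.

```python
import math

def add_residual(x, fx, scale=1.0):
    """Residual connection over a sequence of feature vectors: x + scale * fx."""
    return [[a + scale * b for a, b in zip(row, frow)] for row, frow in zip(x, fx)]

def layer_norm(x, eps=1e-5):
    """Normalize each frame's features to zero mean and roughly unit variance.

    Assumption: no learned gain or bias, to keep the sketch short.
    """
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((a - mean) ** 2 for a in row) / len(row)
        out.append([(a - mean) / math.sqrt(var + eps) for a in row])
    return out

def conformer_block(x, ffn1, attention, conv, ffn2):
    """Feed-forward, then self-attention, then convolution, then feed-forward.

    The convolution consumes the attention output sequentially, not in parallel.
    """
    x = add_residual(x, ffn1(x), scale=0.5)   # assumed half-step weight
    x = add_residual(x, attention(x))
    x = add_residual(x, conv(x))
    x = add_residual(x, ffn2(x), scale=0.5)   # assumed half-step weight
    return layer_norm(x)

# Hypothetical stand-in sub-blocks: each records its name and scales its input.
calls = []

def make_stub(name, gain):
    def stub(x):
        calls.append(name)
        return [[gain * a for a in row] for row in x]
    return stub

# Two frames of three features each (illustrative values only).
frames = [[1.0, 2.0, 3.0], [0.0, -1.0, 1.0]]
block_output = conformer_block(
    frames,
    make_stub("ffn1", 0.1),
    make_stub("attention", 0.2),
    make_stub("conv", 0.3),
    make_stub("ffn2", 0.1),
)
print(calls)
```

Passing the sub-blocks in as callables keeps the sketch focused on the ordering the independent claims emphasize: the convolution sits after the self-attention, and both sit between the two feed-forward blocks. Stacking several blocks as in claim 13 would amount to feeding each block's output into the next.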

Assignees

Google LLC

Inventors

Classifications

  • Auto-encoder networks; Encoder-decoder networks · CPC title

  • Supervised learning · CPC title

  • Convolutional networks [CNN, ConvNet] · CPC title

  • G10L15/16 (Primary)

    using artificial neural networks · CPC title

  • Machine learning · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12079703B2 cover?
Systems and methods can utilize a conformer model to process a data set for various data processing tasks, including, but not limited to, speech recognition, sound separation, protein synthesis determination, video or other image set analysis, and natural language processing. The conformer model can use feed-forward blocks, a self-attention block, and a convolution block to process data to lear…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/16. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 3, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).