Convolution-augmented transformer models

US12373666B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12373666-B2
Application number  US-202418766038-A
Country             US
Kind code           B2
Filing date         Jul 8, 2024
Priority date       Dec 31, 2020
Publication date    Jul 29, 2025
Grant date          Jul 29, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods can utilize a conformer model to process a data set for various data processing tasks, including, but not limited to, speech recognition, sound separation, protein synthesis determination, video or other image set analysis, and natural language processing. The conformer model can use feed-forward blocks, a self-attention block, and a convolution block to process data to learn global interactions and relative-offset-based local correlations of the input data.
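
The data flow summarized in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the patented implementation: the names `conformer_block`, `residual`, `toy_ffn`, `dot_attention`, and `smoothing_conv` are hypothetical, the sub-blocks carry no learned weights, and the residual connections around the attention and convolution blocks are an assumption (the abstract does not spell them out). The 0.5 scaling on the two feed-forward blocks corresponds to the "half-step" wording used in the claims.

```python
import math
from typing import Callable, List

Sequence = List[List[float]]            # time steps x feature channels
Module = Callable[[Sequence], Sequence]

def residual(x: Sequence, fx: Sequence, weight: float = 1.0) -> Sequence:
    """Return x + weight * fx, element by element."""
    return [[a + weight * b for a, b in zip(row_x, row_f)]
            for row_x, row_f in zip(x, fx)]

def conformer_block(x: Sequence, ffn1: Module, attention: Module,
                    conv: Module, ffn2: Module) -> Sequence:
    """Feed-forward (half step) -> self-attention -> convolution -> feed-forward (half step)."""
    x = residual(x, ffn1(x), weight=0.5)   # first half-step feed-forward block
    x = residual(x, attention(x))          # content-based global interactions (residual assumed)
    x = residual(x, conv(x))               # relative-offset-based local correlations (residual assumed)
    x = residual(x, ffn2(x), weight=0.5)   # second half-step feed-forward block
    return x

# --- Toy stand-ins for the sub-blocks (hypothetical, no learned parameters) ---

def toy_ffn(x: Sequence) -> Sequence:
    """Position-wise nonlinearity v * sigmoid(v), applied to every element."""
    return [[v / (1.0 + math.exp(-v)) for v in row] for row in x]

def dot_attention(x: Sequence) -> Sequence:
    """Single-head scaled dot-product self-attention without projections."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        top = max(scores)                                   # for a stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, x)) / total
                    for c in range(d)])
    return out

def smoothing_conv(x: Sequence, kernel=(0.25, 0.5, 0.25)) -> Sequence:
    """Per-channel 1-D convolution over time with zero padding."""
    steps, half = len(x), len(kernel) // 2
    out = []
    for t in range(steps):
        row = []
        for c in range(len(x[0])):
            acc = 0.0
            for k, w in enumerate(kernel):
                s = t + k - half
                if 0 <= s < steps:
                    acc += w * x[s][c]
            row.append(acc)
        out.append(row)
    return out

frames = [[0.1, 0.2], [0.4, -0.3], [0.0, 0.5]]   # 3 time steps, 2 channels
encoded = conformer_block(frames, toy_ffn, dot_attention, smoothing_conv, toy_ffn)
```

Passing the sub-blocks in as callables keeps the ordering of the four stages visible; a real model would replace each toy function with a trained layer.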

First claim

Opening claim text (preview).

What is claimed is:

1. A computing system for efficiently processing data which accounts for both local and global dependencies, comprising: one or more processors; one or more non-transitory computer-readable media that collectively store: a machine-learned conformer model, wherein the machine-learned conformer model comprises: a first half-step feed-forward block configured to process a block input to generate a first feed-forward output, wherein the first half-step feed-forward block comprises half-step residual weights; a self-attention block configured to perform self-attention to process the first feed-forward output to generate an attention output; a convolutional block configured to receive and process the attention output of the self-attention block to generate a convolutional output; and a second half-step feed-forward block configured to process the convolutional output of the convolutional block to generate a second feed-forward output, wherein the second half-step feed-forward block comprises half-step residual weights; and instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising: obtaining input data; and processing the input data with the machine-learned conformer model to generate output data, wherein processing the input data with the machine-learned conformer model comprises determining position-wise local features and content-based global interactions based on processing the block input associated with the input data with the first half-step feed-forward block followed by processing with the self-attention block, the convolutional block, and the second half-step feed-forward block.

2. The system of claim 1, wherein the machine-learned conformer model further comprises: an audio encoder configured to encode the input data to generate the block input.

3. The system of claim 2, wherein the audio encoder comprises convolution subsampling layer.

4. The system of claim 1, wherein processing the input data with the machine-learned conformer model to generate the output data comprises: processing the input data with the first half-step feed-forward block to generate the first feed-forward output; processing the first feed-forward output with the self-attention block to generate the attention output; processing the attention output with the convolutional block to generate the convolutional output; processing the convolutional output with the second half-step feed-forward block to generate the second feed-forward output; and generating the output data based on the second feed-forward output.

5. The system of claim 4, wherein processing the input data with the machine-learned conformer model to generate the output data further comprises: adding variational noise to perform regularization.

6. The system of claim 1, wherein the machine-learned conformer model was trained on labeled speech data.

7. The system of claim 6, wherein the machine-learned conformer model was further trained on an additional dataset comprising a text-only corpus.

8. The system of claim 1, wherein the machine-learned conformer model further comprises a single layer decoder.

9. The system of claim 8, wherein the single layer decoder comprises a long short-term memory recurrent neural network.

10. The system of claim 1, wherein the convolutional block comprises a layer normalization block, a first pointwise convolution block, a plurality of activation blocks, a depthwise convolution block, a second pointwise convolution block, and a dropout block.

11. A computer-implemented method for efficiently processing data which accounts for both local and global dependencies, the method comprising: obtaining, by a computing system comprising one or more processors, input data; processing, by the computing system, the input data with a conformer model, wherein the conformer model comprises: a first half-step feed-forward block configured to process a block input to generate a first feed-forward output, wherein the first half-step feed-forward block comprises half-step residual weights; a self-attention block configured to perform self-attention to process the first feed-forward output to generate an attention output; a convolutional block configured to receive and process the attention output of the self-attention block to generate a convolutional output; and a second half-step feed-forward block configured to process the convolutional output of the convolutional block to generate a second feed-forward output, wherein the second half-step feed-forward block comprises half-step residual weights; wherein processing the input data with the conformer model comprises determining position-wise local features and content-based global interactions based on processing the block input associated with the input data with the first half-step feed-forward block followed by processing with the self-attention block, the convolutional block, and the second half-step feed-forward block; and in response to processing the input data with the conformer model, generating, by the computing system, an output data.

12. The method of claim 11, wherein the input data comprises audio data, and wherein the output data comprises text data descriptive of speech recognition for the audio data and further comprises sound separation data for the audio data.

13. The method of claim 11, wherein the output data is generated based on determining global interactions and local correlations from the input data.

14. The method of claim 13, wherein the attention output is descriptive of the global interactions determined by the self-attention block.

15. The method of claim 13, wherein the convolutional output is descriptive of the local correlations determined by the convolutional block.

16. The method of claim 11, wherein the input data comprises spectrograph data descriptive of human speech, and the output data comprises text data descriptive of speech recognized data for the human speech.

17. One or more non-transitory computer-readable media that collectively store instructions that, when executed by one or more computing devices, cause the one or more computing devices to perform operations, the operations comprising: accessing data descriptive of a machine-learned conformer model that comprises one or more conformer blocks, each of the one or more conformer blocks configured to process a block input to generate a block output, each of the one or more conformer blocks comprising: a first half-step feed-forward block configured to process the block input to generate a first feed-forward output, wherein the first half-step feed-forward block comprises half-step residual weights; a self-attention block configured to perform self-attention to process the first feed-forward output to generate an attention output; a convolutional block configured to receive and process the attention output of the self-attention block to generate a convolutional output; and a second half-step feed-forward block configured to process the convolutional output of the convolutional block to generate a second feed-forward output, wherein the second half-step feed-forward block comprises half-step residual weights; and obtaining input data; and processing the input data with the machine-learned conformer model to generate output data, wherein processing the input data with the machine-learned conformer model comprises determining position-wise local features and content-based global interactions based on processing the block input associate
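
The convolutional block recited in claim 10 (layer normalization, first pointwise convolution, activation blocks, depthwise convolution, second pointwise convolution, dropout) might be sketched as follows. This is an illustrative sketch only. The claim does not name the activations; using a gated linear unit after the first pointwise convolution and a Swish activation after the depthwise convolution is an assumption here, as are all function names, the weight layout, and the default dropout rate.

```python
import math
import random
from typing import List

Matrix = List[List[float]]   # rows x columns

def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))

def layer_norm(x: Matrix, eps: float = 1e-5) -> Matrix:
    """Normalize each time step across its channels (no learned scale or bias)."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def pointwise_conv(x: Matrix, w: Matrix) -> Matrix:
    """1x1 convolution: one linear map (out_channels x in_channels) per time step."""
    return [[sum(wi * xi for wi, xi in zip(w_row, frame)) for w_row in w]
            for frame in x]

def glu(x: Matrix) -> Matrix:
    """Gated linear unit (assumed activation): first half gated by sigmoid of second half."""
    half = len(x[0]) // 2
    return [[a * sigmoid(b) for a, b in zip(row[:half], row[half:])] for row in x]

def swish(x: Matrix) -> Matrix:
    """Swish (assumed activation): v * sigmoid(v)."""
    return [[v * sigmoid(v) for v in row] for row in x]

def depthwise_conv(x: Matrix, kernels: Matrix) -> Matrix:
    """kernels[c] is the 1-D filter for channel c; zero padding keeps the length."""
    steps = len(x)
    out = [[0.0] * len(kernels) for _ in range(steps)]
    for c, kernel in enumerate(kernels):
        half = len(kernel) // 2
        for t in range(steps):
            for k, w in enumerate(kernel):
                s = t + k - half
                if 0 <= s < steps:
                    out[t][c] += w * x[s][c]
    return out

def dropout(x: Matrix, rate: float, rng: random.Random, training: bool) -> Matrix:
    """Inverted dropout; a no-op outside training."""
    if not training or rate == 0.0:
        return x
    keep = 1.0 - rate
    return [[v / keep if rng.random() < keep else 0.0 for v in row] for row in x]

def conv_module(x: Matrix, w_in: Matrix, kernels: Matrix, w_out: Matrix,
                rate: float = 0.1, training: bool = False, rng=None) -> Matrix:
    rng = rng or random.Random(0)
    h = layer_norm(x)                  # layer normalization block
    h = pointwise_conv(h, w_in)        # first pointwise convolution: d -> 2d channels
    h = glu(h)                         # activation block: 2d -> d channels
    h = depthwise_conv(h, kernels)     # depthwise convolution over time
    h = swish(h)                       # activation block
    h = pointwise_conv(h, w_out)       # second pointwise convolution
    return dropout(h, rate, rng, training)

# Hypothetical toy weights for 2 channels and a kernel width of 3.
x = [[1.0, 2.0], [3.0, 5.0], [0.0, -1.0]]
w_in = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.5, -0.5]]
kernels = [[0.25, 0.5, 0.25], [0.0, 1.0, 0.0]]
w_out = [[1.0, 0.0], [0.0, 1.0]]
y = conv_module(x, w_in, kernels, w_out)
```

The pointwise convolutions mix channels at a single time step, while the depthwise convolution mixes neighboring time steps within one channel; together they give the block its local, relative-offset view of the sequence.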

Assignees

Google LLC

Inventors

Classifications

  • Auto-encoder networks; Encoder-decoder networks · CPC title

  • Supervised learning · CPC title

  • Convolutional networks [CNN, ConvNet] · CPC title

  • G10L15/16 · Primary

    using artificial neural networks · CPC title

  • Machine learning · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12373666B2 cover?
Systems and methods can utilize a conformer model to process a data set for various data processing tasks, including, but not limited to, speech recognition, sound separation, protein synthesis determination, video or other image set analysis, and natural language processing. The conformer model can use feed-forward blocks, a self-attention block, and a convolution block to process data to lear…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/16. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jul 29, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).