Supervised and unsupervised training with contrastive loss over sequences

US12230249B2 · US · B2

Patent metadata
Field | Value
Publication number | US-12230249-B2
Application number | US-202217655903-A
Country | US
Kind code | B2
Filing date | Mar 22, 2022
Priority date | Mar 26, 2021
Publication date | Feb 18, 2025
Grant date | Feb 18, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method includes receiving audio data corresponding to an utterance and generating a pair of positive audio data examples. Here, each positive audio data example includes a respective augmented copy of the received audio data. For each respective positive audio data example, the method includes generating a respective sequence of encoder outputs and projecting the respective sequence of encoder outputs for the positive data example into a contrastive loss space. The method also includes determining a L2 distance between each corresponding encoder output in the projected sequences of encoder outputs for the positive audio data examples and determining a per-utterance consistency loss by averaging the L2 distances. The method also includes generating corresponding speech recognition results for each respective positive audio data example. The method also includes updating parameters of the speech recognition model based on a respective supervised loss term and the per-utterance consistency loss.
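The loss arithmetic in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration, not the patented implementation. The function names, the assumption that both projected sequences have the same number of frames, and the weighted-sum combination with `consistency_weight` are all hypothetical; the abstract only says the update is "based on" the supervised loss terms and the consistency loss.

```python
import math
from typing import Sequence

Frame = Sequence[float]  # one projected encoder output (a vector)


def l2_distance(u: Frame, v: Frame) -> float:
    """Euclidean (L2) distance between two projected encoder outputs."""
    if len(u) != len(v):
        raise ValueError("frames must have the same dimensionality")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def per_utterance_consistency_loss(
    proj_a: Sequence[Frame], proj_b: Sequence[Frame]
) -> float:
    """Average the frame-wise L2 distances between two projected sequences.

    Assumption: both sequences are already aligned and equally long.
    """
    if not proj_a or len(proj_a) != len(proj_b):
        raise ValueError("sequences must be non-empty and equally long")
    distances = [l2_distance(u, v) for u, v in zip(proj_a, proj_b)]
    return sum(distances) / len(distances)


def training_loss(
    supervised_a: float,
    supervised_b: float,
    consistency: float,
    consistency_weight: float = 1.0,
) -> float:
    """Hypothetical combination: sum of both supervised terms plus a
    weighted consistency term. The weighting scheme is an assumption."""
    return supervised_a + supervised_b + consistency_weight * consistency


# Toy projected sequences for the two augmented copies (3 frames, 2 dims).
proj_a = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
proj_b = [[3.0, 4.0], [1.0, 1.0], [2.0, 0.0]]

loss = per_utterance_consistency_loss(proj_a, proj_b)  # (5 + 0 + 0) / 3
total = training_loss(0.5, 0.25, loss, consistency_weight=0.5)
```

In this toy case only the first frame differs between the two copies, so the consistency loss is one third of that frame's distance.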

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method when executed by data processing hardware causes the data processing hardware to perform operations comprising: receiving a set of training utterances, each training utterance in the set of training utterances comprising a non-synthetic speech representation of a corresponding utterance; for each training utterance in the set of training utterances, converting, using a text-to-speech (TTS) model, a ground-truth transcription of the corresponding utterance to generate one or more synthetic speech representations of the same corresponding utterance; receiving audio data corresponding to an utterance by: receiving one of the non-synthetic speech representations of the corresponding utterance; or receiving one of the one or more synthetic speech representations of the corresponding utterance; generating, using a data augmentation module, a pair of positive audio data examples, each positive audio data example in the pair of positive audio data examples comprising a respective augmented copy of the received audio data corresponding to the utterance; for each respective positive audio data example in the pair of positive audio data examples: generating, using a neural network encoder, a respective sequence of encoder outputs; and projecting, using a convolutional neural network (CNN), the respective sequence of encoder outputs for the respective positive audio data example into a contrastive loss space; determining a L2 distance between each corresponding encoder output in each projected respective sequence of encoder outputs for the pair of positive audio data examples; determining a per-utterance consistency loss by averaging a set of L2 distances determined for the respective sequence of encoder outputs in the projected respective sequence of encoder outputs; generating, using a speech recognition model, corresponding speech recognition results for each respective positive audio data example in the pair of positive audio data examples; and updating parameters of the speech recognition model based on a respective supervised loss term associated with each corresponding speech recognition result and the per-utterance consistency loss.

2. The method of claim 1, wherein the CNN comprises a first CNN layer, followed by a rectified linear activation function (ReLU) activation and LayerNorm layer, and a second CNN layer with linear activation.

3. The method of claim 1, wherein the data augmentation module adds at least one of noise, reverberation, or manipulates timing of the received audio data.

4. The method of claim 1, wherein the speech recognition model comprises a sequence transducer model having a Conformer-based encoder and a long short-term memory (LSTM) decoder.

5. The method of claim 4, wherein the Conformer-based encoder comprises a stack of conformer layers each comprising a series of multi-headed self-attention, depth-wise convolution, and feedforward layers.

6. The method of claim 1, wherein generating the corresponding speech recognition results for each respective positive audio data example in the pair of positive audio data examples comprises determining, using a decoder, a probability distribution over possible speech recognition hypotheses for the respective sequence of encoder outputs.

7. The method of claim 1, wherein the operations further comprise determining the respective supervised loss term by comparing the corresponding speech recognition result for the respective positive audio data example and a corresponding ground-truth transcription of the respective positive audio data example.

8. The method of claim 1, wherein each positive audio data example in the pair of positive audio data examples comprises a different respective augmented copy of the received audio data corresponding to the utterance than each other positive audio data example in the pair of positive audio data examples.

9. The method of claim 1, wherein generating the pair of positive audio data examples comprises generating each positive audio data example in the pair of positive audio data examples based on a single observation of the utterance.

10. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising: receiving a set of training utterances, each training utterance in the set of training utterances comprising a non-synthetic speech representation of a corresponding utterance; for each training utterance in the set of training utterances, converting, using a text-to-speech (TTS) model, a ground-truth transcription of the corresponding utterance to generate one or more synthetic speech representations of the same corresponding utterance; receiving audio data corresponding to an utterance by: receiving one of the non-synthetic speech representations of the corresponding utterance: or receiving one of the one or more synthetic speech representations of the corresponding utterance; generating, using a data augmentation module, a pair of positive audio data examples, each positive audio data example in the pair of positive audio data examples comprising a respective augmented copy of the received audio data corresponding to the utterance; for each respective positive audio data example in the pair of positive audio data examples: generating, using a neural network encoder, a respective sequence of encoder outputs; and projecting, using a convolutional neural network (CNN), the respective sequence of encoder outputs for the respective positive audio data example into a contrastive loss space; determining a L2 distance between each corresponding encoder output in each projected respective sequence of encoder outputs for the pair of positive audio data examples; determining a per-utterance consistency loss by averaging a set of L2 distances determined for the respective sequence of encoder outputs in the projected respective sequence of encoder outputs; generating, using a speech recognition model, corresponding speech recognition results for each respective positive audio data example in the pair of positive audio data examples; and updating parameters of the speech recognition model based on a respective supervised loss term associated with each corresponding speech recognition result and the per-utterance consistency loss.

11. The system of claim 10, wherein the CNN comprises a first CNN layer, followed by a rectified linear activation function (ReLU) activation and LayerNorm layer, and a second CNN layer with linear activation.

12. The system of claim 10, wherein the data augmentation module adds one of noise, reverberation, or manipulates timing of the received audio data.

13. The system of claim 10, wherein the speech recognition model comprises a sequence transducer model having a Conformer-based encoder and a long short-term memory (LSTM) decoder.

14. The system of claim 13, wherein the Conformer-based encoder comprises a stack of conformer layers each comprising a series of multi-headed self-attention, depth-wise convolution, and feedforward layers.

15. The system of claim 10, wherein generating the corresponding speech recognition results for each respective positive audio data example in the pair of positive audio data examples comprises determining, using a decoder, a probability distribution over possible speech recognition hypotheses for the respective sequence of encoder outputs.

16. The sy
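The projection network recited in claims 2 and 11 (a first CNN layer, ReLU and LayerNorm, then a second CNN layer with linear activation) could look roughly like the following dependency-free sketch. Everything concrete here is an assumption made for illustration: the kernel width, the channel sizes, the zero padding, the random initialisation, and a LayerNorm without learned scale or offset. None of these values come from the claims.

```python
import math
import random
from typing import Dict, List

Matrix = List[List[float]]          # [time][channels]
Kernels = List[List[List[float]]]   # [out_channel][tap][in_channel]


def conv1d_same(x: Matrix, kernels: Kernels, bias: List[float]) -> Matrix:
    """1-D convolution over time, zero padded so output length == input length.
    Assumes an odd kernel width."""
    width, half, in_dim = len(kernels[0]), len(kernels[0]) // 2, len(x[0])
    out = []
    for t in range(len(x)):
        frame = []
        for o, kernel in enumerate(kernels):
            acc = bias[o]
            for k in range(width):
                src = t + k - half
                if 0 <= src < len(x):  # positions outside the sequence count as zeros
                    acc += sum(kernel[k][i] * x[src][i] for i in range(in_dim))
            frame.append(acc)
        out.append(frame)
    return out


def relu(x: Matrix) -> Matrix:
    return [[max(0.0, v) for v in frame] for frame in x]


def layer_norm(x: Matrix, eps: float = 1e-5) -> Matrix:
    """Per-frame normalisation over channels (no learned gain/bias here)."""
    out = []
    for frame in x:
        mean = sum(frame) / len(frame)
        var = sum((v - mean) ** 2 for v in frame) / len(frame)
        out.append([(v - mean) / math.sqrt(var + eps) for v in frame])
    return out


def init_kernels(rng: random.Random, out_dim: int, width: int, in_dim: int) -> Kernels:
    return [[[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
             for _ in range(width)] for _ in range(out_dim)]


def project(encoder_outputs: Matrix, params: Dict[str, list]) -> Matrix:
    """Conv -> ReLU -> LayerNorm -> Conv (linear), following the claim 2 ordering."""
    h = conv1d_same(encoder_outputs, params["w1"], params["b1"])
    h = layer_norm(relu(h))
    return conv1d_same(h, params["w2"], params["b2"])  # no nonlinearity afterwards


# Hypothetical sizes chosen only to keep the example small.
rng = random.Random(0)
enc_dim, hidden_dim, proj_dim, width, num_frames = 4, 6, 3, 3, 5
params = {
    "w1": init_kernels(rng, hidden_dim, width, enc_dim), "b1": [0.0] * hidden_dim,
    "w2": init_kernels(rng, proj_dim, width, hidden_dim), "b2": [0.0] * proj_dim,
}
encoder_outputs = [[rng.gauss(0.0, 1.0) for _ in range(enc_dim)]
                   for _ in range(num_frames)]
projected = project(encoder_outputs, params)  # 5 frames, 3 channels each
```

Because the convolutions preserve the time dimension, the two augmented copies of an utterance yield projected sequences of equal length whenever their encoder outputs are equally long, which is what a frame-by-frame L2 comparison needs.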

Assignees

Inventors

Classifications

  • Supervised learning · CPC title

  • Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title

  • Convolutional networks [CNN, ConvNet] · CPC title

  • Auto-encoder networks; Encoder-decoder networks · CPC title

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12230249B2 cover?
A method includes receiving audio data corresponding to an utterance and generating a pair of positive audio data examples. Here, each positive audio data example includes a respective augmented copy of the received audio data. For each respective positive audio data example, the method includes generating a respective sequence of encoder outputs and projecting the respective sequence of encode…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/16. Mapped technology areas include Physics.
When was this patent published?
Publication date Feb 18, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).