Contrastive Siamese network for semi-supervised speech recognition

US11961515B2 · US · B2

Patent metadata
Publication number: US-11961515-B2
Application number: US-202117644337-A
Country: US
Kind code: B2
Filing date: Dec 14, 2021
Priority date: Sep 30, 2021
Publication date: Apr 16, 2024
Grant date: Apr 16, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method includes receiving a plurality of unlabeled audio samples corresponding to spoken utterances not paired with corresponding transcriptions. At a target branch of a contrastive Siamese network, the method also includes generating a sequence of encoder outputs for the plurality of unlabeled audio samples and modifying time characteristics of the encoder outputs to generate a sequence of target branch outputs. At an augmentation branch of a contrastive Siamese network, the method also includes performing augmentation on the unlabeled audio samples, generating a sequence of augmented encoder outputs for the augmented unlabeled audio samples, and generating predictions of the sequence of target branch outputs generated at the target branch. The method also includes determining an unsupervised loss term based on target branch outputs and predictions of the sequence of target branch outputs. The method also includes updating parameters of the audio encoder based on the unsupervised loss term.
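
The flow described in the abstract can be sketched as a forward pass through the two branches followed by a loss. This is a minimal illustration under stated assumptions, not the patented implementation: the one-layer `encode` function stands in for the audio encoder, the prediction network is reduced to a single layer, `resample_time` and `mask_frames` are hypothetical helpers for time modification and masking, and the InfoNCE-style form of the loss is assumed (the abstract only says "unsupervised loss term"). The parameter update is omitted.

```python
import math
import random

def encode(frames, weights):
    """Toy stand-in for the audio encoder: one tanh-activated linear layer per frame."""
    return [[math.tanh(sum(w * x for w, x in zip(row, frame))) for row in weights]
            for frame in frames]

def resample_time(seq, num_steps):
    """Nearest-neighbour resampling along time (an assumed form of time modification)."""
    n = len(seq)
    return [seq[min(n - 1, int(t * n / num_steps))] for t in range(num_steps)]

def mask_frames(frames, start, width):
    """Zero out a contiguous block of frames (an assumed form of masking)."""
    return [[0.0] * len(f) if start <= t < start + width else list(f)
            for t, f in enumerate(frames)]

def cosine(a, b):
    """Cosine similarity; all-zero vectors are treated as having norm 1."""
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def contrastive_loss(predictions, targets, temperature=0.1):
    """Assumed InfoNCE-style loss: the target at the same time step is the
    positive; targets at all other time steps act as negatives."""
    total = 0.0
    for t, pred in enumerate(predictions):
        logits = [cosine(pred, tgt) / temperature for tgt in targets]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[t]
    return total / len(predictions)

def unsupervised_loss(frames, enc_w, pred_w, num_steps, mask_start, mask_width):
    # Target branch: encode the clean frames, then modify their time characteristics.
    target_out = resample_time(encode(frames, enc_w), num_steps)
    # Augmentation branch: augment the input, then run the same encoder.
    augmented = mask_frames(resample_time(frames, num_steps), mask_start, mask_width)
    aug_encoded = encode(augmented, enc_w)
    # Prediction step (a single layer here, reusing encode for brevity).
    predictions = encode(aug_encoded, pred_w)
    return contrastive_loss(predictions, target_out)

# Toy usage: 10 frames of 4 features, 3-dimensional encoder outputs.
rng = random.Random(0)
frames = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(10)]
enc_w = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
pred_w = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
loss = unsupervised_loss(frames, enc_w, pred_w, num_steps=8, mask_start=2, mask_width=2)
```

In this sketch the target branch is resampled to the same number of time steps as the augmented branch, so predictions and targets can be compared step by step.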

First claim

Opening claim text (preview).

What is claimed is:

1. A contrastive Siamese network for training a speech recognition model, the contrastive Siamese network comprising an unsupervised subnetwork trained on a plurality of unlabeled audio samples corresponding to spoken utterances not paired with corresponding transcriptions, the unsupervised subnetwork comprising: a target branch configured to: receive, as input to an audio encoder of the speech recognition model, a sequence of acoustic frames extracted from the unlabeled audio samples; and at each of a plurality of time steps, generate a target branch output for a corresponding acoustic frame in the sequence of acoustic frames input to the audio encoder at the corresponding time step; and an augmented branch configured to: perform augmentation on the sequence of acoustic frames extracted from the unlabeled audio samples to generate a sequence of augmented acoustic frames; at each of the plurality of time steps, generate, as output from the audio encoder, a higher order feature representation for a corresponding augmented acoustic frame in the sequence of augmented acoustic frames; and at each of the plurality of time steps, generate, using the higher order feature representation output from the audio encoder at the corresponding time step, a prediction of the target branch output generated by the target branch at the corresponding time step, wherein the unsupervised subnetwork is configured to: at each of the plurality of time steps, determine an unsupervised loss term based on the target branch output generated by the target branch at the corresponding time step and the prediction of the target branch output generated by the augmented branch at the corresponding time step; and update parameters of the audio encoder based on the unsupervised loss term determined at each of the plurality of time steps.

2. The contrastive Siamese network of claim 1, wherein the unsupervised loss term comprises a contrastive loss term.

3. The contrastive Siamese network of claim 1, wherein the augmentation performed on the sequence of acoustic frames comprises time modification and masking.

4. The contrastive Siamese network of claim 1, wherein the target branch is further configured to: at each of a plurality of time steps, generate, as output from the audio encoder, a higher order feature representation for the corresponding acoustic frame in the sequence of acoustic frames input to the audio encoder at the corresponding time step, wherein the target branch is configured to generate the target branch output for the corresponding acoustic frame by modifying time characteristics of the higher order feature representation.

5. The contrastive Siamese network of claim 4, wherein modifying the time characteristics of the higher order feature representation comprises modifying, at each of the plurality of time steps, the time characteristics of the higher order feature representation generated as output from the audio encoder for the corresponding acoustic frame to match time characteristics associated with the higher order feature representation generated as output from the audio encoder for the corresponding augmented acoustic frame at the corresponding time step.

6. The contrastive Siamese network of claim 1, wherein the augmented branch comprises a prediction network of transformer layers configured to, at each of the plurality of time steps: receive, as input, the higher order feature representation output from the audio encoder at the corresponding time step; and generate, as output, the prediction of the target branch output generated by the target branch at the corresponding time step.

7. The contrastive Siamese network of claim 1, further comprising a supervised subnetwork trained on a plurality of labeled audio samples corresponding to spoken utterances paired with corresponding transcriptions, the supervised subnetwork configured to: at each of a plurality of output steps for each labeled audio sample: generate, using the speech recognition model, a corresponding speech recognition result for the labeled audio sample; and determine a supervised loss term based on the corresponding speech recognition result for the labeled audio sample and the corresponding transcription of the labeled audio sample; and update parameters of the speech recognition model based on the supervised loss term determined at each of the plurality of output steps for each labeled audio sample in the plurality of labeled audio samples.

8. The contrastive Siamese network of claim 7, wherein the corresponding speech recognition result generated for the labeled audio sample using the speech recognition model comprises a probability distribution over possible speech recognition hypotheses for the labeled audio sample at the corresponding output step.

9. The contrastive Siamese network of claim 7, wherein the supervised subnetwork is configured to update the parameters of the speech recognition model based on the supervised loss term independently of the unsupervised network updating the parameters of the audio encoder of the speech recognition model.

10. The contrastive Siamese network of claim 7, wherein the supervised subnetwork is further configured to apply data augmentation to at least one of the labeled audio samples in the plurality of labeled audio samples input to the speech recognition model.

11. The contrastive Siamese network of claim 10, wherein the applied data augmentation comprises at least one of adding noise, adding reverberation, or manipulating timing.

12. The contrastive Siamese network of claim 1, wherein the trained speech recognition model comprises a Transformer-Transducer (T-T) model, the T-T model comprising: the audio encoder configured to: receive, as input, a sequence of acoustic frames extracted from audio data characterizing a spoken utterance; and generate, at each of a plurality of time steps, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames; a label encoder configured to: receive, as input, a sequence of non-blank symbols output by a final softmax layer; and generate, at each of the plurality of time steps, a dense representation; and a joint network configured to: receive, as input, the higher order feature representation generated by the audio encoder at each of the plurality of time steps and the dense representation generated by the label encoder at each of the plurality of time steps; and generate, at each of the plurality of time steps, a probability distribution over possible speech recognition hypothesis at the corresponding time step, wherein the audio encoder comprises a neural network having a stack of strided convolutional layers and transformer layers.

13. A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations comprising: receiving a plurality of unlabeled audio samples corresponding to spoken utterances not paired with corresponding transcriptions: at a target branch of a contrastive Siamese network: generating, using an audio encoder of a speech recognition model, a sequence of encoder outputs for the plurality of unlabeled audio samples; and modifying time characteristics of the encoder outputs to generate a sequence of target branch outputs; at an augmentation branch of the contrastive Siamese network: performing augmentation on the unlabeled audio samples; generating, using the audio encoder of the speech recognition model, a sequence of augmented encoder outputs for the augmented unlabeled audio samples; and generating, using a prediction
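
The joint network recited in claim 12 takes one audio encoder output and one label encoder output and produces a probability distribution. A rough sketch of that step follows. The additive combination with a `tanh` nonlinearity, the single output projection, and all names and sizes are assumptions made for illustration; the claim does not specify how the two inputs are combined.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(vec, weights):
    """Matrix-vector product; weights has one row per output unit."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def joint_network(audio_feature, label_feature, out_weights):
    """Combine one audio encoder output with one label encoder output and
    return a probability distribution over output symbols."""
    # Assumed combination: element-wise sum followed by tanh.
    hidden = [math.tanh(a + b) for a, b in zip(audio_feature, label_feature)]
    return softmax(linear(hidden, out_weights))

# Toy usage: 3-dimensional features and a hypothetical 4-symbol vocabulary.
audio_feature = [0.2, -0.1, 0.5]
label_feature = [0.0, 0.3, -0.2]
out_weights = [
    [0.1, 0.2, -0.3],
    [0.0, -0.1, 0.4],
    [0.5, 0.1, 0.0],
    [-0.2, 0.3, 0.1],
]
probs = joint_network(audio_feature, label_feature, out_weights)
```

Each call yields one distribution, so a full model would evaluate this for each time step.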

Assignees

Google LLC

Inventors

Classifications

  • Auto-encoder networks; Encoder-decoder networks · CPC title

  • Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title

  • Non-supervised learning, e.g. competitive learning · CPC title

  • Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning · CPC title

  • Training · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11961515B2 cover?
A method includes receiving a plurality of unlabeled audio samples corresponding to spoken utterances not paired with corresponding transcriptions. At a target branch of a contrastive Siamese network, the method also includes generating a sequence of encoder outputs for the plurality of unlabeled audio samples and modifying time characteristics of the encoder outputs to generate a sequence of t…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/16. Mapped technology areas include Physics.
When was this patent published?
Publication date: Apr 16, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).