Contrastive Siamese Network for Semi-supervised Speech Recognition
US-2023096805-A1 · Mar 30, 2023 · US
US11961515B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11961515-B2 |
| Application number | US-202117644337-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 14, 2021 |
| Priority date | Sep 30, 2021 |
| Publication date | Apr 16, 2024 |
| Grant date | Apr 16, 2024 |
Official abstract text for this publication.
A method includes receiving a plurality of unlabeled audio samples corresponding to spoken utterances not paired with corresponding transcriptions. At a target branch of a contrastive Siamese network, the method also includes generating a sequence of encoder outputs for the plurality of unlabeled audio samples and modifying time characteristics of the encoder outputs to generate a sequence of target branch outputs. At an augmentation branch of a contrastive Siamese network, the method also includes performing augmentation on the unlabeled audio samples, generating a sequence of augmented encoder outputs for the augmented unlabeled audio samples, and generating predictions of the sequence of target branch outputs generated at the target branch. The method also includes determining an unsupervised loss term based on target branch outputs and predictions of the sequence of target branch outputs. The method also includes updating parameters of the audio encoder based on the unsupervised loss term.
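As a rough illustration of the unsupervised loss term described in the abstract, the sketch below scores predictions from the augmentation branch against target branch outputs, one time step at a time. The abstract only says the loss is based on the target branch outputs and their predictions, so the InfoNCE-style form used here (cosine similarity, a temperature, and other time steps as negatives) is an assumption, as are the helper names `cosine` and `contrastive_loss`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def contrastive_loss(predictions, targets, temperature=0.1):
    """InfoNCE-style loss averaged over time steps (assumed form).

    predictions[t] stands in for the augmentation branch's prediction
    at step t and targets[t] for the target branch output at step t.
    The target at the same step is the positive; targets at all other
    steps act as negatives.
    """
    total = 0.0
    for t, pred in enumerate(predictions):
        logits = [cosine(pred, tgt) / temperature for tgt in targets]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_denominator = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits))
        total += log_denominator - logits[t]
    return total / len(predictions)

# Toy data: three time steps with two-dimensional outputs.
targets = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
shuffled = [aligned[1], aligned[2], aligned[0]]

loss_aligned = contrastive_loss(aligned, targets)
loss_shuffled = contrastive_loss(shuffled, targets)
```

Predictions that line up with their own time step give a lower loss than the same predictions shifted in time. In a real training loop this scalar would be differentiated with respect to the audio encoder parameters, which this sketch does not attempt.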
Opening claim text (preview).
What is claimed is:
1. A contrastive Siamese network for training a speech recognition model, the contrastive Siamese network comprising an unsupervised subnetwork trained on a plurality of unlabeled audio samples corresponding to spoken utterances not paired with corresponding transcriptions, the unsupervised subnetwork comprising: a target branch configured to: receive, as input to an audio encoder of the speech recognition model, a sequence of acoustic frames extracted from the unlabeled audio samples; and at each of a plurality of time steps, generate a target branch output for a corresponding acoustic frame in the sequence of acoustic frames input to the audio encoder at the corresponding time step; and an augmented branch configured to: perform augmentation on the sequence of acoustic frames extracted from the unlabeled audio samples to generate a sequence of augmented acoustic frames; at each of the plurality of time steps, generate, as output from the audio encoder, a higher order feature representation for a corresponding augmented acoustic frame in the sequence of augmented acoustic frames; and at each of the plurality of time steps, generate, using the higher order feature representation output from the audio encoder at the corresponding time step, a prediction of the target branch output generated by the target branch at the corresponding time step, wherein the unsupervised subnetwork is configured to: at each of the plurality of time steps, determine an unsupervised loss term based on the target branch output generated by the target branch at the corresponding time step and the prediction of the target branch output generated by the augmented branch at the corresponding time step; and update parameters of the audio encoder based on the unsupervised loss term determined at each of the plurality of time steps.
2. The contrastive Siamese network of claim 1, wherein the unsupervised loss term comprises a contrastive loss term.
3. The contrastive Siamese network of claim 1, wherein the augmentation performed on the sequence of acoustic frames comprises time modification and masking.
4. The contrastive Siamese network of claim 1, wherein the target branch is further configured to: at each of a plurality of time steps, generate, as output from the audio encoder, a higher order feature representation for the corresponding acoustic frame in the sequence of acoustic frames input to the audio encoder at the corresponding time step, wherein the target branch is configured to generate the target branch output for the corresponding acoustic frame by modifying time characteristics of the higher order feature representation.
5. The contrastive Siamese network of claim 4, wherein modifying the time characteristics of the higher order feature representation comprises modifying, at each of the plurality of time steps, the time characteristics of the higher order feature representation generated as output from the audio encoder for the corresponding acoustic frame to match time characteristics associated with the higher order feature representation generated as output from the audio encoder for the corresponding augmented acoustic frame at the corresponding time step.
6. The contrastive Siamese network of claim 1, wherein the augmented branch comprises a prediction network of transformer layers configured to, at each of the plurality of time steps: receive, as input, the higher order feature representation output from the audio encoder at the corresponding time step; and generate, as output, the prediction of the target branch output generated by the target branch at the corresponding time step.
7. The contrastive Siamese network of claim 1, further comprising a supervised subnetwork trained on a plurality of labeled audio samples corresponding to spoken utterances paired with corresponding transcriptions, the supervised subnetwork configured to: at each of a plurality of output steps for each labeled audio sample: generate, using the speech recognition model, a corresponding speech recognition result for the labeled audio sample; and determine a supervised loss term based on the corresponding speech recognition result for the labeled audio sample and the corresponding transcription of the labeled audio sample; and update parameters of the speech recognition model based on the supervised loss term determined at each of the plurality of output steps for each labeled audio sample in the plurality of labeled audio samples.
8. The contrastive Siamese network of claim 7, wherein the corresponding speech recognition result generated for the labeled audio sample using the speech recognition model comprises a probability distribution over possible speech recognition hypotheses for the labeled audio sample at the corresponding output step.
9. The contrastive Siamese network of claim 7, wherein the supervised subnetwork is configured to update the parameters of the speech recognition model based on the supervised loss term independently of the unsupervised network updating the parameters of the audio encoder of the speech recognition model.
10. The contrastive Siamese network of claim 7, wherein the supervised subnetwork is further configured to apply data augmentation to at least one of the labeled audio samples in the plurality of labeled audio samples input to the speech recognition model.
11. The contrastive Siamese network of claim 10, wherein the applied data augmentation comprises at least one of adding noise, adding reverberation, or manipulating timing.
12. The contrastive Siamese network of claim 1, wherein the trained speech recognition model comprises a Transformer-Transducer (T-T) model, the T-T model comprising: the audio encoder configured to: receive, as input, a sequence of acoustic frames extracted from audio data characterizing a spoken utterance; and generate, at each of a plurality of time steps, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames; a label encoder configured to: receive, as input, a sequence of non-blank symbols output by a final softmax layer; and generate, at each of the plurality of time steps, a dense representation; and a joint network configured to: receive, as input, the higher order feature representation generated by the audio encoder at each of the plurality of time steps and the dense representation generated by the label encoder at each of the plurality of time steps; and generate, at each of the plurality of time steps, a probability distribution over possible speech recognition hypothesis at the corresponding time step, wherein the audio encoder comprises a neural network having a stack of strided convolutional layers and transformer layers.
13. A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations comprising: receiving a plurality of unlabeled audio samples corresponding to spoken utterances not paired with corresponding transcriptions: at a target branch of a contrastive Siamese network: generating, using an audio encoder of a speech recognition model, a sequence of encoder outputs for the plurality of unlabeled audio samples; and modifying time characteristics of the encoder outputs to generate a sequence of target branch outputs; at an augmentation branch of the contrastive Siamese network: performing augmentation on the unlabeled audio samples; generating, using the audio encoder of the speech recognition model, a sequence of augmented encoder outputs for the augmented unlabeled audio samples; and generating, using a prediction
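Claims 3 and 5 mention augmentation by time modification and masking, and adjusting the target branch's time characteristics to match the augmented branch. The sketch below shows one way the two sequences might be kept step-aligned. Nearest-neighbour resampling, zero-vector masking, and the helper names `resample_time` and `mask_span` are assumptions chosen for brevity, not details taken from the claims, and the audio encoder is left out entirely.

```python
def resample_time(frames, factor):
    """Nearest-neighbour time stretch (factor > 1) or compression (< 1)."""
    new_len = max(1, round(len(frames) * factor))
    last = len(frames) - 1
    return [frames[min(last, int(i / factor))] for i in range(new_len)]

def mask_span(frames, start, width):
    """Replace frames in [start, start + width) with zero vectors."""
    dim = len(frames[0])
    return [[0.0] * dim if start <= i < start + width else list(frame)
            for i, frame in enumerate(frames)]

# Six one-dimensional "acoustic frames" standing in for real features.
frames = [[float(i)] for i in range(6)]
stretch = 1.5  # hypothetical time-modification factor

# Augmented branch input: time-modified, then masked.
augmented = mask_span(resample_time(frames, stretch), start=2, width=3)

# Target branch: apply the same time modification (but no masking) so
# each target step lines up with the corresponding augmented step.
target = resample_time(frames, stretch)
```

Both sequences end up with nine steps, so a per-step loss can compare position `t` in one with position `t` in the other. In the claimed system the time adjustment on the target side is applied to encoder outputs rather than raw frames; applying it to the frames here is a simplification.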
Auto-encoder networks; Encoder-decoder networks · CPC title
Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title
Non-supervised learning, e.g. competitive learning · CPC title
Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning · CPC title
Training · CPC title