Semi-supervised training scheme for speech recognition

US12315499B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12315499-B2
Application number  US-202218065685-A
Country             US
Kind code           B2
Filing date         Dec 14, 2022
Priority date       Dec 14, 2022
Publication date    May 27, 2025
Grant date          May 27, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method includes receiving a sequence of acoustic frames extracted from unlabeled audio samples that correspond to spoken utterances not paired with any corresponding transcriptions. The method also includes generating, using a supervised audio encoder, a target higher order feature representation for a corresponding acoustic frame. The method also includes augmenting the sequence of acoustic frames and generating, as output from an unsupervised audio encoder, a predicted higher order feature representation for a corresponding augmented acoustic frame in the sequence of augmented acoustic frames. The method also includes determining an unsupervised loss term based on the target higher order feature representation and the predicted higher order feature representation and updating parameters of the speech recognition model based on the unsupervised loss term.
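The data flow in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the patent: the helper names (`mask_frames`, `encode`, `unsupervised_loss`), the single tanh layer standing in for each audio encoder, zeroing frames as the masking scheme, and mean squared error as a placeholder for the unsupervised loss term.

```python
import math
import random


def mask_frames(frames, mask_prob, rng):
    """Augment a frame sequence by zeroing each frame with probability mask_prob.

    Zeroing is only one possible masking scheme (assumed for illustration).
    """
    return [[0.0] * len(f) if rng.random() < mask_prob else list(f) for f in frames]


def encode(frames, weights):
    """Toy per-frame encoder: tanh of a linear projection.

    A stand-in for a real audio encoder; one output vector per input frame.
    """
    return [
        [math.tanh(sum(w * x for w, x in zip(row, frame))) for row in weights]
        for frame in frames
    ]


def unsupervised_loss(targets, predictions):
    """Placeholder loss: mean squared error averaged over output steps."""
    total = 0.0
    for t, p in zip(targets, predictions):
        total += sum((a - b) ** 2 for a, b in zip(t, p)) / len(t)
    return total / len(targets)


rng = random.Random(0)
frames = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(6)]
supervised_w = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
unsupervised_w = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]

# Target branch: the supervised encoder sees the unaugmented frames.
targets = encode(frames, supervised_w)

# Augmented branch: the unsupervised encoder sees the masked frames.
augmented = mask_frames(frames, 0.5, rng)
predictions = encode(augmented, unsupervised_w)

# One loss value comparing the two branches step by step.
loss = unsupervised_loss(targets, predictions)
```

In a real training setup the loss would drive a gradient update of the model parameters; that step is omitted here because it depends on the framework and optimizer in use.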

First claim

Opening claim text (preview).

What is claimed is:

1. A cross-training network for training a speech recognition model, the cross-training network comprising an unsupervised subnetwork trained on a plurality of unlabeled audio samples corresponding to spoken utterances not paired with corresponding transcriptions, the unsupervised subnetwork comprising: a target branch configured to: receive, as input to a supervised audio encoder of the speech recognition model, a sequence of acoustic frames extracted from the unlabeled audio samples; and at each of a plurality of output steps, generate a target higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames input to the supervised audio encoder at a corresponding output step; and an augmented branch configured to: augment the sequence of acoustic frames extracted from the unlabeled audio samples by masking one or more acoustic frames in the sequence of acoustic frames; and at each of the plurality of output steps, generate, as output from an unsupervised audio encoder of the speech recognition model, a predicted higher order feature representation for a corresponding augmented acoustic frame in the sequence of augmented acoustic frames, wherein the unsupervised subnetwork is configured to: at each of the plurality of output steps, determine an unsupervised loss term based on the target higher order feature representation generated by the target branch at the corresponding output step and the predicted higher order feature representation generated by the augmented branch at the corresponding output step; and update parameters of the speech recognition model based on the unsupervised loss term determined at each of the plurality of output steps.

2. The cross-training network of claim 1, wherein the unsupervised loss term comprises a contrastive loss term.

3. The cross-training network of claim 1, wherein: the unsupervised subnetwork is further configured to, at each of the plurality of output steps, determine a distance-based loss term between parameters of the unsupervised audio encoder and parameters of the supervised audio encoder; and updating the parameters of the speech recognition model is further based on the distance-based loss term determined at each of the plurality of output steps.

4. The cross-training network of claim 3, wherein the distance-based loss term comprises an L2 loss.

5. The cross-training network of claim 3, wherein updating the parameters of the speech recognition model based on the unsupervised loss term occurs jointly with updating the parameters of the speech recognition model based on the distance-based loss term.

6. The cross-training network of claim 1, further comprising a supervised subnetwork trained on a plurality of labeled audio samples corresponding to spoken utterances paired with corresponding transcriptions, the supervised subnetwork configured to: at each of the plurality of output steps for each labeled audio sample: generate, using the speech recognition model, a corresponding speech recognition result for the labeled audio sample; and determine a supervised loss term based on the corresponding speech recognition result for the labeled audio sample and the corresponding transcription of the labeled audio sample; and update the parameters of the speech recognition model based on the supervised loss term determined at each of the plurality of output steps for each labeled audio sample in the plurality of labeled audio samples.

7. The cross-training network of claim 6, wherein the corresponding speech recognition result generated for the labeled audio sample using the speech recognition model comprises a probability distribution over possible speech recognition hypotheses for the labeled audio sample at the corresponding output step.

8. The cross-training network of claim 6, wherein the supervised subnetwork is further configured to update the parameters of the speech recognition model based on the supervised loss term jointly with the unsupervised network updating the parameters of the speech recognition model based on the unsupervised loss term and a distance-based loss term.

9. The cross-training network of claim 1, wherein the target branch is further configured to apply a stop gradient operation on the predicted higher order feature representation for the corresponding augmented acoustic frame.

10. The cross-training network of claim 1, wherein the parameters of the unsupervised audio encoder and the parameters of the supervised audio encoder are initialized with the same initial parameters.

11. The cross-training network of claim 1, wherein the parameters of the unsupervised audio encoder and the parameters of the supervised audio encoder are initialized with different initial parameters.

12. The cross-training network of claim 1, wherein each of the unsupervised audio encoder and the supervised audio encoder comprise at least one of: a respective full-context encoder; or a respective cascaded encoder.

13. A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations comprising: receiving a sequence of acoustic frames extracted from unlabeled audio samples corresponding to spoken utterances not paired with corresponding transcriptions; at a target branch of a cross-training network, at a plurality of output steps, generating, using a supervised audio encoder of a speech recognition model, a target higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames; at an augmentation branch of the cross-training network: augmenting the sequence of acoustic frames extracted from the unlabeled audio samples by masking one or more acoustic frames in the sequence of acoustic frames; and at each of the plurality of output steps, generating, as output from an unsupervised audio encoder of the speech recognition model, a predicted higher order feature representation for a corresponding augmented acoustic frame in the sequence of augmented acoustic frames; at each of the plurality of output steps, determining an unsupervised loss term based on the target higher order feature representation generated by the target branch at the corresponding output step and the predicted higher order feature representation generated by the augmented branch at the corresponding output step; and updating parameters of the speech recognition model based on the unsupervised loss term determined at each of the plurality of output steps.

14. The computer-implemented method of claim 13, wherein the unsupervised loss term includes a contrastive loss term.

15. The computer-implemented method of claim 13, wherein the operations further comprise: at each of the plurality of output steps, determining a distance-based loss term between parameters of the unsupervised audio encoder and parameters of the supervised audio encoder; and updating parameters of the speech recognition model is further based on the distance-based loss term determined at each of the plurality of output steps.

16. The computer-implemented method of claim 15, wherein the distance-based loss term comprises an L2 loss.

17. The computer-implemented method of claim 15, wherein the updating parameters of the speech recognition model based on the unsupervised loss term occurs jointly with updating the parameters of the speech recognition model based on the distance-based loss term.

18. The computer-implemented method of claim 13, wherein the operations further comprise

Assignees

Google LLC

Inventors

Classifications

  • Combinations of networks · CPC title

  • Non-supervised learning, e.g. competitive learning · CPC title

  • G10L15/16 (Primary)

    using artificial neural networks · CPC title

  • Feature extraction for speech recognition; Selection of recognition unit · CPC title

  • Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title
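A CPC symbol such as the primary code G10L15/16 encodes a hierarchy: section, class, subclass, main group, and subgroup. The sketch below splits a symbol into those parts. The function name `parse_cpc`, the regular expression, and the returned dictionary layout are illustrative assumptions, not part of any official tooling; the section titles follow the published CPC scheme.

```python
import re

# Top-level CPC sections and their titles.
CPC_SECTIONS = {
    "A": "Human Necessities",
    "B": "Performing Operations; Transporting",
    "C": "Chemistry; Metallurgy",
    "D": "Textiles; Paper",
    "E": "Fixed Constructions",
    "F": "Mechanical Engineering; Lighting; Heating; Weapons; Blasting",
    "G": "Physics",
    "H": "Electricity",
    "Y": "General Tagging of New Technological Developments",
}

# Section letter, two-digit class, subclass letter, main group "/" subgroup.
CPC_PATTERN = re.compile(r"^([A-HY])(\d{2})([A-Z])\s*(\d{1,4})/(\d{2,6})$")


def parse_cpc(symbol):
    """Split a CPC symbol like 'G10L15/16' into its hierarchical parts."""
    match = CPC_PATTERN.match(symbol.strip())
    if match is None:
        raise ValueError(f"not a recognizable CPC symbol: {symbol!r}")
    section, class_digits, subclass_letter, main_group, subgroup = match.groups()
    return {
        "section": section,
        "section_title": CPC_SECTIONS[section],
        "class": section + class_digits,
        "subclass": section + class_digits + subclass_letter,
        "main_group": main_group,
        "subgroup": subgroup,
    }


parsed = parse_cpc("G10L15/16")
```

For this patent's primary code, the section letter G corresponds to Physics, which matches the technology area this page reports.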

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12315499B2 cover?
A method includes receiving a sequence of acoustic frames extracted from unlabeled audio samples that correspond to spoken utterances not paired with any corresponding transcriptions. The method also includes generating, using a supervised audio encoder, a target higher order feature representation for a corresponding acoustic frame. The method also includes augmenting the sequence of acoustic …
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/16. Mapped technology areas include Physics.
When was this patent published?
Publication date: May 27, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).