US12315499B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12315499-B2 |
| Application number | US-202218065685-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 14, 2022 |
| Priority date | Dec 14, 2022 |
| Publication date | May 27, 2025 |
| Grant date | May 27, 2025 |
Official abstract text for this publication.
A method includes receiving a sequence of acoustic frames extracted from unlabeled audio samples that correspond to spoken utterances not paired with any corresponding transcriptions. The method also includes generating, using a supervised audio encoder, a target higher order feature representation for a corresponding acoustic frame. The method also includes augmenting the sequence of acoustic frames and generating, as output from an unsupervised audio encoder, a predicted higher order feature representation for a corresponding augmented acoustic frame in the sequence of augmented acoustic frames. The method also includes determining an unsupervised loss term based on the target higher order feature representation and the predicted higher order feature representation and updating parameters of the speech recognition model based on the unsupervised loss term.
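The data flow in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the patent: the per-frame linear "encoders", zeroing as the masking scheme, the InfoNCE-style contrastive form of the loss (the abstract only says the loss is based on the target and predicted representations), and the names `mask_frames`, `encode`, `contrastive_loss`, and `unsupervised_loss`. The sketch only computes the loss value; it does not compute gradients or update parameters.

```python
import math
import random


def mask_frames(frames, mask_prob, rng):
    """Zero out each frame with probability mask_prob (hypothetical masking scheme)."""
    masked, flags = [], []
    for frame in frames:
        hide = rng.random() < mask_prob
        flags.append(hide)
        masked.append([0.0] * len(frame) if hide else list(frame))
    return masked, flags


def encode(frames, weights):
    """Toy stand-in for an audio encoder: one linear map applied to every frame."""
    return [[sum(w * x for w, x in zip(row, frame)) for row in weights]
            for frame in frames]


def cosine(a, b):
    """Cosine similarity with a small epsilon so all-zero vectors give 0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def contrastive_loss(predicted, targets, temperature=0.1):
    """InfoNCE-style loss: at step t the target at t is the positive,
    and the targets at all other steps serve as negatives."""
    total = 0.0
    for t, pred in enumerate(predicted):
        logits = [cosine(pred, tgt) / temperature for tgt in targets]
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[t]
    return total / len(predicted)


def unsupervised_loss(frames, supervised_weights, unsupervised_weights,
                      mask_prob, rng):
    """Target branch on clean frames, augmented branch on masked frames."""
    targets = encode(frames, supervised_weights)         # target branch
    augmented, _ = mask_frames(frames, mask_prob, rng)   # augmentation
    predicted = encode(augmented, unsupervised_weights)  # augmented branch
    return contrastive_loss(predicted, targets)
```

In this toy setup, calling `unsupervised_loss` with `mask_prob=0.0` and identical weights for both encoders gives a loss near zero, and the loss grows as more frames are masked, because the masked predictions no longer line up with their targets.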
Opening claim text (preview).
What is claimed is:
1. A cross-training network for training a speech recognition model, the cross-training network comprising an unsupervised subnetwork trained on a plurality of unlabeled audio samples corresponding to spoken utterances not paired with corresponding transcriptions, the unsupervised subnetwork comprising: a target branch configured to: receive, as input to a supervised audio encoder of the speech recognition model, a sequence of acoustic frames extracted from the unlabeled audio samples; and at each of a plurality of output steps, generate a target higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames input to the supervised audio encoder at a corresponding output step; and an augmented branch configured to: augment the sequence of acoustic frames extracted from the unlabeled audio samples by masking one or more acoustic frames in the sequence of acoustic frames; and at each of the plurality of output steps, generate, as output from an unsupervised audio encoder of the speech recognition model, a predicted higher order feature representation for a corresponding augmented acoustic frame in the sequence of augmented acoustic frames, wherein the unsupervised subnetwork is configured to: at each of the plurality of output steps, determine an unsupervised loss term based on the target higher order feature representation generated by the target branch at the corresponding output step and the predicted higher order feature representation generated by the augmented branch at the corresponding output step; and update parameters of the speech recognition model based on the unsupervised loss term determined at each of the plurality of output steps.
2. The cross-training network of claim 1, wherein the unsupervised loss term comprises a contrastive loss term.
3. The cross-training network of claim 1, wherein: the unsupervised subnetwork is further configured to, at each of the plurality of output steps, determine a distance-based loss term between parameters of the unsupervised audio encoder and parameters of the supervised audio encoder; and updating the parameters of the speech recognition model is further based on the distance-based loss term determined at each of the plurality of output steps.
4. The cross-training network of claim 3, wherein the distance-based loss term comprises an L2 loss.
5. The cross-training network of claim 3, wherein updating the parameters of the speech recognition model based on the unsupervised loss term occurs jointly with updating the parameters of the speech recognition model based on the distance-based loss term.
6. The cross-training network of claim 1, further comprising a supervised subnetwork trained on a plurality of labeled audio samples corresponding to spoken utterances paired with corresponding transcriptions, the supervised subnetwork configured to: at each of the plurality of output steps for each labeled audio sample: generate, using the speech recognition model, a corresponding speech recognition result for the labeled audio sample; and determine a supervised loss term based on the corresponding speech recognition result for the labeled audio sample and the corresponding transcription of the labeled audio sample; and update the parameters of the speech recognition model based on the supervised loss term determined at each of the plurality of output steps for each labeled audio sample in the plurality of labeled audio samples.
7. The cross-training network of claim 6, wherein the corresponding speech recognition result generated for the labeled audio sample using the speech recognition model comprises a probability distribution over possible speech recognition hypotheses for the labeled audio sample at the corresponding output step.
8. The cross-training network of claim 6, wherein the supervised subnetwork is further configured to update the parameters of the speech recognition model based on the supervised loss term jointly with the unsupervised network updating the parameters of the speech recognition model based on the unsupervised loss term and a distance-based loss term.
9. The cross-training network of claim 1, wherein the target branch is further configured to apply a stop gradient operation on the predicted higher order feature representation for the corresponding augmented acoustic frame.
10. The cross-training network of claim 1, wherein the parameters of the unsupervised audio encoder and the parameters of the supervised audio encoder are initialized with the same initial parameters.
11. The cross-training network of claim 1, wherein the parameters of the unsupervised audio encoder and the parameters of the supervised audio encoder are initialized with different initial parameters.
12. The cross-training network of claim 1, wherein each of the unsupervised audio encoder and the supervised audio encoder comprise at least one of: a respective full-context encoder; or a respective cascaded encoder.
13. A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations comprising: receiving a sequence of acoustic frames extracted from unlabeled audio samples corresponding to spoken utterances not paired with corresponding transcriptions; at a target branch of a cross-training network, at a plurality of output steps, generating, using a supervised audio encoder of a speech recognition model, a target higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames; at an augmentation branch of the cross-training network: augmenting the sequence of acoustic frames extracted from the unlabeled audio samples by masking one or more acoustic frames in the sequence of acoustic frames; and at each of the plurality of output steps, generating, as output from an unsupervised audio encoder of the speech recognition model, a predicted higher order feature representation for a corresponding augmented acoustic frame in the sequence of augmented acoustic frames; at each of the plurality of output steps, determining an unsupervised loss term based on the target higher order feature representation generated by the target branch at the corresponding output step and the predicted higher order feature representation generated by the augmented branch at the corresponding output step; and updating parameters of the speech recognition model based on the unsupervised loss term determined at each of the plurality of output steps.
14. The computer-implemented method of claim 13, wherein the unsupervised loss term includes a contrastive loss term.
15. The computer-implemented method of claim 13, wherein the operations further comprise: at each of the plurality of output steps, determining a distance-based loss term between parameters of the unsupervised audio encoder and parameters of the supervised audio encoder; and updating parameters of the speech recognition model is further based on the distance-based loss term determined at each of the plurality of output steps.
16. The computer-implemented method of claim 15, wherein the distance-based loss term comprises an L2 loss.
17. The computer-implemented method of claim 15, wherein the updating parameters of the speech recognition model based on the unsupervised loss term occurs jointly with updating the parameters of the speech recognition model based on the distance-based loss term.
18. The computer-implemented method of claim 13, wherein the operations further comprise
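Claims 3 to 5 and 15 to 17 describe a distance-based loss term (an L2 loss in claims 4 and 16) between the parameters of the two encoders, applied jointly with the unsupervised loss term. The sketch below is one possible reading, not the patented implementation. It assumes flat parameter lists, a squared-difference form of the L2 term, a weighted sum to combine the terms, plain gradient descent, and that the distance term updates both encoders. The names `l2_parameter_distance` and `joint_step`, the `distance_weight` and `lr` values, and the externally supplied `unsup_loss_grads` are all hypothetical.

```python
def l2_parameter_distance(unsup_params, sup_params):
    """Sum of squared differences between corresponding encoder parameters."""
    return sum((u - s) ** 2 for u, s in zip(unsup_params, sup_params))


def joint_step(unsup_params, sup_params, unsup_loss_grads,
               distance_weight=0.1, lr=0.01):
    """One joint gradient-descent step on
    unsupervised_loss + distance_weight * l2_parameter_distance.

    unsup_loss_grads holds the gradient of the unsupervised loss with
    respect to each unsupervised-encoder parameter; how it is obtained
    (for example, by automatic differentiation) is outside this sketch.
    """
    new_unsup, new_sup = [], []
    for u, s, grad in zip(unsup_params, sup_params, unsup_loss_grads):
        d_du = 2.0 * (u - s)  # derivative of (u - s)^2 with respect to u
        d_ds = -d_du          # derivative of (u - s)^2 with respect to s
        new_unsup.append(u - lr * (grad + distance_weight * d_du))
        new_sup.append(s - lr * (distance_weight * d_ds))
    return new_unsup, new_sup
```

With zero unsupervised gradients, each call to `joint_step` moves the two parameter sets toward each other, so `l2_parameter_distance` shrinks from one step to the next.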
Combinations of networks · CPC title
Non-supervised learning, e.g. competitive learning · CPC title
using artificial neural networks · CPC title
Feature extraction for speech recognition; Selection of recognition unit · CPC title
Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title