Learning word-level confidence for subword end-to-end automatic speech recognition

US11610586B2 · US · B2

Patent metadata
Publication number: US-11610586-B2
Application number: US-202117182592-A
Country: US
Kind code: B2
Filing date: Feb 23, 2021
Priority date: Feb 23, 2021
Publication date: Mar 21, 2023
Grant date: Mar 21, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method includes receiving a speech recognition result, and using a confidence estimation module (CEM), for each sub-word unit in a sequence of hypothesized sub-word units for the speech recognition result: obtaining a respective confidence embedding that represents a set of confidence features; generating, using a first attention mechanism, a confidence feature vector; generating, using a second attention mechanism, an acoustic context vector; and generating, as output from an output layer of the CEM, a respective confidence output score for each corresponding sub-word unit based on the confidence feature vector and the acoustic feature vector received as input by the output layer of the CEM. For each of the one or more words formed by the sequence of hypothesized sub-word units, the method also includes determining a respective word-level confidence score for the word. The method also includes determining an utterance-level confidence score by aggregating the word-level confidence scores.
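The per-sub-word flow in the abstract (self-attention over confidence embeddings, cross-attention over acoustic encodings, then an output layer) can be sketched roughly as below. This is a minimal illustration, not the patented implementation: it assumes plain scaled dot-product attention with no learned projections, uses the confidence feature vector as the cross-attention query, and uses a single sigmoid unit as the output layer. The names `attend` and `cem_scores`, the dimension `DIM`, and the random inputs are all hypothetical.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cem_scores(conf_embeddings, acoustic_encodings, out_weights, out_bias=0.0):
    """Return one confidence score in (0, 1) per sub-word unit (assumed design)."""
    scores = []
    for i, emb in enumerate(conf_embeddings):
        # First attention: self-attend to this embedding and the earlier ones.
        history = conf_embeddings[: i + 1]
        conf_vec = attend(emb, history, history)
        # Second attention: cross-attend to the acoustic encodings.
        # Using conf_vec as the query is an assumption of this sketch.
        acoustic_vec = attend(conf_vec, acoustic_encodings, acoustic_encodings)
        # Output layer: linear map over the concatenated vectors, then sigmoid.
        features = conf_vec + acoustic_vec
        scores.append(sigmoid(dot(out_weights, features) + out_bias))
    return scores

# Hypothetical toy inputs: 3 sub-word units, 5 acoustic frames.
random.seed(0)
DIM = 4
embeddings = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(3)]
acoustics = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]
weights = [random.gauss(0, 1) for _ in range(2 * DIM)]
scores = cem_scores(embeddings, acoustics, weights)
```

A trained module would learn the projection and output weights; here they are random, so the resulting scores only demonstrate the data flow and shapes.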

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations comprising: receiving, from a speech recognizer, a speech recognition result for an utterance spoken by a user, the speech recognition result comprising a sequence of hypothesized sub-word units that form one or more words of the utterance, each sub-word unit output from the speech recognizer at a corresponding output step; using a confidence estimation module, for each sub-word unit in the sequence of hypothesized sub-word units: obtaining a respective confidence embedding that represents a set of confidence features associated with the corresponding output step when the corresponding sub-word unit was output from the speech recognizer; generating, using a first attention mechanism that self-attends to the respective confidence embedding for the corresponding sub-word unit and the confidence embeddings obtained for any other sub-word units in the sequence of hypothesized sub-word units that proceed the corresponding sub-word unit, a confidence feature vector; generating, using a second attention mechanism that cross-attends to a sequence of acoustic encodings each associated with a corresponding acoustic frame segmented from audio data that corresponds to the utterance, an acoustic context vector; and generating, as output from an output layer of the confidence estimation module, a respective confidence output score for the corresponding sub-word unit based on the confidence feature vector and the acoustic feature vector received as input by the output layer of the confidence estimation module; for each of the one or more words formed by the sequence of hypothesized sub-word units, determining a respective word-level confidence score for the word, the respective word-level confidence score equal to the respective confidence output score generated for the final sub-word unit in the word; and determining an utterance-level confidence score for the speech recognition result by aggregating the respective word-level confidence scores determined for the one or more words of the utterance.

2. The computer-implemented method of claim 1, wherein the set of confidence features represented by the respective confidence embedding comprise: a softmax posteriors feature of the speech recognizer at the corresponding output step; and a sub-word embedding feature for the corresponding sub-word unit.

3. The computer-implemented method of claim 2, wherein the set of confidence feature represented by the respective confidence embedding further comprise: a log posterior log feature indicating a probability value associated a probability/likelihood of the corresponding sub-word unit output from the speech recognizer at the corresponding output step; and a top-K feature indicating a K largest log probabilities at the corresponding output step for a top-K candidate hypotheses rescored by the speech recognizer, the top-K candidate hypotheses each represented by a respective sequence of hypothesized sub-word units that form one or more words of the utterance.

4. The computer-implemented method of claim 1, wherein the sub-word units comprise wordpieces.

5. The computer-implemented method of claim 1, wherein the sub-word units comprise graphemes.

6. The computer-implemented method of claim 1, wherein the speech recognizer comprises: a transducer decoder model configured to generate multiple candidate hypotheses during a first pass, each candidate hypothesis corresponding to a candidate transcription for the utterance and represented by a respective sequence of hypothesized sub-word units; and a rescorer decoder model configured to rescore, during a second pass, a top-K candidate hypotheses from the multiple candidate hypotheses generated by the transducer decoder model during the first pass, wherein the candidate hypothesis in the top-K candidate hypotheses rescored by the rescorer decoder model that is represented by the respective sequence of hypothesized sub-word units associated with a highest second pass log probability is output from the rescorer decoder model as the speech recognition result for the utterance spoken by the user.

7. The computer-implemented method of claim 6, wherein: the transducer decoder model comprises a Recurrent Neural Network-Transducer (RNN-T) model architecture; and the rescorer decoder model comprises a Listen, Attend, and Spell (LAS) model architecture.

8. The computer-implemented method of claim 6, wherein the operations further comprise: generating, using a linguistic encoder of the speech recognizer during the second pass, a multiple hypotheses encoding by encoding each of the multiple candidate hypotheses generated by the transducer decoder model during the first pass; and using the confidence estimation module, for each sub-word unit in the sequence of hypothesized sub-word units, generating, using a third attention mechanism that cross-attends to the multiple hypotheses encoding, a linguistic context vector, wherein generating the respective confidence output score for the corresponding sub-word unit is further based on the linguistic context vector received as input by the output layer of the confidence estimation module.

9. The computer-implemented method of claim 8, wherein: encoding each of the multiple candidate hypothesis comprises bi-directionally encoding each candidate hypothesis into a corresponding hypothesis encoding; and generating the multiple hypothesis encoding by concatenating each corresponding hypothesis encoding.

10. The computer-implemented method of claim 1, wherein the speech recognizer and the confidence estimation module are trained jointly.

11. The computer-implemented method of claim 1, wherein the speech recognizer and the confidence estimation module are trained separately.

12. The computer-implemented method of claim 1, wherein the confidence estimation model is trained using a binary cross-entropy loss based on features associated with the speech recognizer.

13. The computer-implemented method of claim 1, wherein the operations further comprise: determining whether the utterance-level confidence score for the speech recognition result satisfies a confidence threshold; and when the utterance-level confidence score for the speech recognition result fails to satisfy the confidence threshold, transmitting audio data corresponding to the utterance to another speech recognizer, the other speech recognizer configured to process to the audio data to generate a transcription of the utterance.

14. The computer-implemented method of claim 13, wherein: the speech recognizer and the confidence estimation module execute on a user computing device; and the other speech recognizer executes on a remote server in communication with the user computing device via a network.

15. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware and storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising: receiving, from a speech recognizer, a speech recognition result for an utterance spoken by a user, the speech recognition result comprising a sequence of hypothesized sub-word units that form one or more words of the utterance, each sub-word unit output from the speech recognizer at a corresponding output step; using a confidence estimation module, for each sub-word unit in the sequence of hypothesized sub-word units: obtaining a respective confidence embedding
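The scoring steps at the end of claim 1 and the threshold test of claim 13 can be illustrated with a short sketch. Several details here are assumptions rather than anything the claims specify: wordpieces are assumed to mark a word start with a leading `_`, the utterance score is aggregated as a product (the claim only says "aggregating"), a score "satisfies" the threshold when it is greater than or equal to it, and the threshold value 0.8 is arbitrary. All function names are hypothetical.

```python
from math import prod

def word_confidences(pieces, scores, marker="_"):
    """Group wordpieces into words; each word takes its final piece's score."""
    words, confs = [], []
    for piece, score in zip(pieces, scores):
        if piece.startswith(marker):
            # Assumed convention: the marker opens a new word.
            words.append(piece[len(marker):])
            confs.append(score)
        elif not words:
            # First piece without a marker still starts a word.
            words.append(piece)
            confs.append(score)
        else:
            # Continuation piece: extend the word and overwrite its score,
            # so the score of the final sub-word unit is the one that remains.
            words[-1] += piece
            confs[-1] = score
    return list(zip(words, confs))

def utterance_confidence(word_scores):
    """Aggregate word-level scores; a product is one possible choice."""
    return prod(word_scores)

def route(utt_conf, threshold=0.8):
    """Keep the on-device result, or fall back to another recognizer."""
    return "on_device" if utt_conf >= threshold else "server"

# Hypothetical recognition result with per-wordpiece confidence scores.
pieces = ["_good", "_morn", "ing"]
scores = [0.9, 0.4, 0.7]

words = word_confidences(pieces, scores)
utt = utterance_confidence([conf for _, conf in words])
decision = route(utt)
```

In this toy input the low score on `_morn` is discarded because only the final piece `ing` determines the confidence of "morning". The product of the two word scores falls below the assumed threshold, so the sketch routes the audio to the fallback recognizer.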

Assignees

Google LLC

Inventors

Classifications

  • Auto-encoder networks; Encoder-decoder networks · CPC title

  • Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title

  • Supervised learning · CPC title

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title

  • G10L15/16 (Primary)

    using artificial neural networks · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11610586B2 cover?
A method includes receiving a speech recognition result, and using a confidence estimation module (CEM), for each sub-word unit in a sequence of hypothesized sub-word units for the speech recognition result: obtaining a respective confidence embedding that represents a set of confidence features; generating, using a first attention mechanism, a confidence feature vector; generating, using a sec…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/16. Mapped technology areas include Physics.
When was this patent published?
Published Mar 21, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).