Disfluency detection models for natural conversational voice systems

US12354597B2 · US · B2

Patent metadata
Field | Value
Publication number | US-12354597-B2
Application number | US-202217822673-A
Country | US
Kind code | B2
Filing date | Aug 26, 2022
Priority date | Oct 6, 2021
Publication date | Jul 8, 2025
Grant date | Jul 8, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method includes receiving a sequence of acoustic frames characterizing one or more utterances. At each of a plurality of output steps, the method also includes generating, by an encoder network of a speech recognition model, a higher order feature representation for a corresponding acoustic frame of the sequence of acoustic frames, generating, by a prediction network of the speech recognition model, a hidden representation for a corresponding sequence of non-blank symbols output by a final softmax layer of the speech recognition model, and generating, by a first joint network of the speech recognition model that receives the higher order feature representation generated by the encoder network and the dense representation generated by the prediction network, a probability distribution that the corresponding time step corresponds to a pause and an end of speech.
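The arrangement described in the abstract, in which one encoder output and one prediction-network output feed two separate joint networks, can be illustrated with a toy sketch. This is not the patented implementation: the layer sizes, the random weights, the tanh fusion, and the label sets `VOCAB` and `EVENTS` (including the "neither" class `<blank>` in `EVENTS`) are all assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def make_joint(in_dim, out_dim, rng):
    """Create randomly initialised parameters for one joint network."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def joint(params, enc, pred):
    """Fuse encoder and prediction outputs, then project to a distribution."""
    weights, bias = params
    fused = [math.tanh(e + p) for e, p in zip(enc, pred)]  # assumed fusion
    return softmax(linear(weights, bias, fused))


rng = random.Random(0)
HIDDEN = 4
VOCAB = ["<blank>", "a", "b", "c"]        # hypothetical token set
EVENTS = ["<blank>", "<pause>", "<eos>"]  # hypothetical event label set

# Two distinct joint networks with separate parameters.
asr_joint = make_joint(HIDDEN, len(VOCAB), rng)
event_joint = make_joint(HIDDEN, len(EVENTS), rng)

# Stand-ins for the encoder and prediction-network outputs at one step.
enc_out = [0.2, -0.1, 0.4, 0.0]
pred_out = [0.1, 0.3, -0.2, 0.5]

# Both heads read the same two inputs but produce different distributions.
token_probs = joint(asr_joint, enc_out, pred_out)
event_probs = joint(event_joint, enc_out, pred_out)
```

Keeping the two heads separate means the event head can be trained or replaced without touching the parameters that produce the recognition output.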

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method executed on data processing hardware that causes the data processing hardware to perform operations comprising: receiving a sequence of acoustic frames characterizing one or more utterances; and at each of a plurality of time steps: generating, by an encoder network of a speech recognition model, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames; outputting, by a final softmax layer of the speech recognition model, a sequence of non-blank output symbols; generating, by a prediction network of the speech recognition model, a hidden representation for the sequence of non-blank output symbols; generating, by a first joint network of the speech recognition model that receives the higher order feature representation generated by the encoder network and the hidden representation generated by the prediction network, a first probability distribution that the corresponding time step corresponds to a disfluency and an end of speech; and generating, by a second joint network of the speech recognition model different from the first joint network that receives the higher order feature representation generated by the encoder network and the hidden representation generated by the prediction network, a second probability distribution over possible speech recognition hypotheses, wherein the final softmax layer outputs, based on the second probability distribution, a next non-blank output symbol.

2. The computer-implemented method of claim 1, wherein the operations further comprise: determining that the probability that the corresponding time step corresponds to the end of speech satisfies an end of speech threshold; and in response to determining that the probability that the corresponding time step corresponds to the end of speech satisfies the end of speech threshold, triggering a microphone closing event.

3. The computer-implemented method of claim 1, wherein the operations further comprise: determining that the probability that the corresponding time step corresponds to the disfluency satisfies a disfluency threshold; and emitting a disfluency token at the corresponding time step based on the determining that the probability of the corresponding time step corresponds to the disfluency satisfies the disfluency threshold.

4. The computer-implemented method of claim 1, wherein the speech recognition model is trained by a two-stage training process, the two-stage training process comprising: a first stage that trains the encoder network, the prediction network, and the second joint network on a speech recognition task; and a second stage that initializes and fine-tunes the first joint network to learn how to predict pause and end of speech locations in utterances.

5. The computer-implemented method of claim 4, wherein parameters of the encoder network, the prediction network, and the second joint network are frozen during the second stage of the two-stage training process.

6. The computer-implemented method of claim 4, wherein the two-stage training process trains the speech recognition model on a plurality of transcribed training utterances having labels indicating pause and end of speech locations.

7. The computer-implemented method of claim 1, wherein the encoder network comprises a stack of self-attention blocks.

8. The computer-implemented method of claim 7, wherein the stack of self-attention blocks comprises a stack of conformer blocks or a stack of transformer blocks.

9. The computer-implemented method of claim 1, wherein generating the hidden representation for the sequence of non-blank output symbols comprises: for each non-blank output symbol in the sequence of non-blank output symbols received as input at the corresponding time step: generating, by the prediction network, using a shared embedding matrix, an embedding of the non-blank output symbol; assigning, by the prediction network, a respective position vector to the non-blank output symbol; and weighting, by the prediction network, the embedding proportional to a similarity between the embedding and the respective position vector; and generating, as output from the prediction network, a single embedding vector at the corresponding time step, the single embedding vector based on a weighted average of the weighted embeddings, the single embedding vector comprising the hidden representation.

10. The computer-implemented method of claim 9, wherein the prediction network comprises a multi-headed attention mechanism, the multi-headed attention mechanism sharing the shared embedding matrix across each head of the multi-headed attention mechanism.

11. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware and storing instructions that, when executed on the data processing hardware, causes the data processing hardware to perform operations comprising: receiving a sequence of acoustic frames characterizing one or more utterances; and at each of a plurality of output steps: generating, by an encoder network of a speech recognition model, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames; outputting, by a final softmax layer of the speech recognition model, a sequence of non-blank output symbols; generating, by a prediction network of the speech recognition model, a hidden representation for the sequence of non-blank output symbols; generating, by a first joint network of the speech recognition model that receives the higher order feature representation generated by the encoder network and the hidden representation generated by the prediction network, a first probability distribution that the corresponding time step corresponds to a disfluency and an end of speech; and generating, by a second joint network of the speech recognition model different from the first joint network that receives the higher order feature representation generated by the encoder network and the hidden representation generated by the prediction network, a second probability distribution over possible speech recognition hypotheses, wherein the final softmax layer outputs, based on the second probability distribution, a next non-blank output symbol.

12. The system of claim 11, wherein the operations further comprise: determining that the probability that the corresponding time step corresponds to the end of speech satisfies an end of speech threshold; and in response to determining that the probability that the corresponding time step corresponds to the end of speech satisfies the end of speech threshold, triggering a microphone closing event.

13. The system of claim 11, wherein the operations further comprise: determining that the probability that the corresponding time step corresponds to the disfluency satisfies a disfluency threshold; and emitting a disfluency token at the corresponding time step based on the determining that the probability of the corresponding time step corresponds to the disfluency satisfies the disfluency threshold.

14. The system of claim 11, wherein the speech recognition model is trained by a two-stage training process, the two-stage training process comprising: a first stage that trains the encoder network, the prediction network, and the second joint network on a speech recognition task; and a second stage that initializes and fine-tunes the first joint network to learn how to predict pause and end of speech locations in utterances.

15. The system of claim 14, wherein parameters of the enco
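
The threshold tests in claims 2, 3, 12, and 13 can be sketched as a small per-step decision function. This is only an illustration under stated assumptions: the claims say a probability "satisfies" a threshold without defining the comparison, so `>=` is assumed here, and the names `StepDecision` and `decide`, the default threshold values, and the sample probabilities are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StepDecision:
    emit_disfluency_token: bool  # claims 3 and 13
    close_microphone: bool       # claims 2 and 12


def decide(p_disfluency, p_end_of_speech,
           disfluency_threshold=0.5, eos_threshold=0.9):
    """Apply both thresholds to one time step's probabilities.

    The threshold values are illustrative defaults, not taken from the patent.
    """
    return StepDecision(
        emit_disfluency_token=p_disfluency >= disfluency_threshold,
        close_microphone=p_end_of_speech >= eos_threshold,
    )


# Hypothetical (p_disfluency, p_end_of_speech) pairs for three time steps.
steps = [(0.05, 0.01), (0.72, 0.10), (0.10, 0.95)]
decisions = [decide(p_d, p_e) for p_d, p_e in steps]
```

With these sample values, the second step would emit a disfluency token and the third would trigger a microphone closing event. Using separate thresholds lets a system tune the two behaviours independently of each other.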

Assignees

Inventors

Classifications

  • Recognition networks (G10L15/142, G10L15/16 take precedence) · CPC title

  • Detection of presence or absence of voice signals (switching of direction of transmission by voice frequency in two-way loud-speaking telephone systems H04M9/10) · CPC title

  • Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title

  • G10L15/063 · Primary

    Training · CPC title

  • using natural language modelling · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12354597B2 cover?
A method includes receiving a sequence of acoustic frames characterizing one or more utterances. At each of a plurality of output steps, the method also includes generating, by an encoder network of a speech recognition model, a higher order feature representation for a corresponding acoustic frame of the sequence of acoustic frames, generating, by a prediction network of the speech recognition…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/063. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jul 8, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 9 related publications on this page (citations in our corpus or others sharing the same primary CPC).