Intended query detection using E2E modeling for continued conversation

US12315497B2 · US · B2

Patent metadata
Publication number: US-12315497-B2
Application number: US-202318186872-A
Country: US
Kind code: B2
Filing date: Mar 20, 2023
Priority date: Mar 21, 2022
Publication date: May 27, 2025
Grant date: May 27, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method includes receiving, as input to a speech recognition model, audio data corresponding to a spoken utterance. The method also includes performing, using the speech recognition model, speech recognition on the audio data by, at each of a plurality of time steps, encoding, using an audio encoder, the audio data corresponding to the spoken utterance into a corresponding audio encoding, and decoding, using a speech recognition joint network, the corresponding audio encoding into a probability distribution over possible output labels. At each of the plurality of time steps, the method also includes determining, using an intended query (IQ) joint network configured to receive a label history representation associated with a sequence of non-blank symbols output by a final softmax layer, an intended query decision indicating whether or not the spoken utterance includes a query intended for a digital assistant.
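
The per-time-step flow in the abstract can be sketched roughly as follows. This is a toy illustration, not the patented implementation: the random linear layers, the averaged-embedding stand-in for the label history, the two-way IQ output, and every name (`recognize`, `make_joint`, `label_history_repr`) are assumptions made for the example. It also assumes, as in a typical transducer, that both joint networks combine the audio encoding with the label history, and that at most one label is emitted per time step.

```python
import math
import random

BLANK = 0  # index of the blank label in the toy vocabulary (assumption)

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, vec):
    """Multiply a (rows x cols) weight matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def make_joint(weights):
    """Hypothetical joint network: one linear layer plus softmax."""
    def joint(audio_encoding, label_history):
        return softmax(linear(weights, audio_encoding + label_history))
    return joint

def label_history_repr(history, embeddings):
    """Stand-in for the prediction network: mean embedding of emitted labels."""
    dim = len(embeddings[0])
    if not history:
        return [0.0] * dim
    return [sum(embeddings[s][d] for s in history) / len(history)
            for d in range(dim)]

def recognize(frames, encoder_w, embeddings, asr_joint, iq_joint):
    history, decisions = [], []  # non-blank labels so far, one IQ decision per step
    for frame in frames:
        # Encode the audio for this time step (toy one-layer encoder).
        audio_encoding = [math.tanh(x) for x in linear(encoder_w, frame)]
        label_history = label_history_repr(history, embeddings)
        # ASR joint: probability distribution over possible output labels.
        asr_probs = asr_joint(audio_encoding, label_history)
        # IQ joint: toy distribution over [not intended, intended].
        iq_probs = iq_joint(audio_encoding, label_history)
        label = max(range(len(asr_probs)), key=asr_probs.__getitem__)
        if label != BLANK:
            history.append(label)  # only non-blank symbols extend the history
        decisions.append(iq_probs[1] > iq_probs[0])
    return history, decisions

rng = random.Random(0)
FEAT, ENC, EMB, VOCAB = 4, 3, 3, 5  # toy dimensions
encoder_w = random_matrix(ENC, FEAT, rng)
embeddings = random_matrix(VOCAB, EMB, rng)
asr_joint = make_joint(random_matrix(VOCAB, ENC + EMB, rng))
iq_joint = make_joint(random_matrix(2, ENC + EMB, rng))
frames = [[rng.uniform(-1.0, 1.0) for _ in range(FEAT)] for _ in range(6)]
history, decisions = recognize(frames, encoder_w, embeddings, asr_joint, iq_joint)
```

With untrained random weights the outputs are meaningless; the point is only the data flow, in which the same audio encoding and label history feed two separate joint networks at every step.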

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method that when executed on data processing hardware causes the data processing hardware to perform operations comprising: receiving, as input to a speech recognition model, audio data corresponding to a spoken utterance; performing, using the speech recognition model, speech recognition on the audio data by, at each of a plurality of time steps: encoding, using an audio encoder, the audio data corresponding to the spoken utterance into a corresponding audio encoding; and decoding, using a speech recognition joint network, the corresponding audio encoding encoded by the audio encoder at a corresponding time step into a probability distribution over possible output labels for the spoken utterance at the corresponding time step; and at each of the plurality of time steps, determining, using an intended query (IQ) joint network configured to receive a label history representation associated with a sequence of non-blank symbols output by a final softmax layer, an intended query decision indicating whether or not the spoken utterance comprises a query intended for a digital assistant interface.

2. The method of claim 1, wherein: the speech recognition model comprises the audio encoder, the speech recognition joint network, and a prediction network, the prediction network configured to receive the sequence of non-blank symbols output by the final softmax layer and generate the label history representation at each of the plurality of time steps; the speech recognition model is trained during a first training stage by optimizing the audio encoder, the speech recognition joint network, and the prediction network using a regular label sequence of wordpieces; and the IQ joint network is initialized with the speech recognition joint network during a second training stage by freezing the audio encoder and the prediction network and fine-tuning the IQ joint network with an expanded label sequence of both word pieces and IQ tokens to teach the IQ joint network to learn how to predict a distribution of IQ tokens indicating whether or not an input utterance comprises a query intended for the digital assistant interface.

3. The method of claim 2, wherein generating the label history representation for a corresponding sequence of non-blank symbols comprises: for each non-blank symbol in the sequence of non-blank symbols received as input at each of the plurality of time steps: generating, by the prediction network, using a shared embedding matrix, an embedding of a corresponding non-blank symbol; assigning, by the prediction network, a respective position vector to the corresponding non-blank symbol; and weighting, by the prediction network, the embedding proportional to a similarity between the embedding and the respective position vector; and generating, as output from the prediction network, a single embedding vector at the corresponding time step, the single embedding vector based on a weighted average of the weighted embeddings, the single embedding vector comprising the label history representation.

4. The method of claim 3, wherein the prediction network comprises a multi-headed attention mechanism, the multi-headed attention mechanism sharing the shared embedding matrix across each head of the multi-headed attention mechanism.

5. The method of claim 1, wherein the audio data corresponding to a spoken utterance is received during a current dialog session between a user and the digital assistant interface.

6. The method of claim 1, wherein the possible output labels comprise wordpieces, words, phonemes, or graphemes.

7. The method of claim 1, wherein the audio encoder comprises a causal encoder comprising one of: a plurality of unidirectional long short-term memory (LSTM) layers; a plurality of conformer layers; or a plurality of transformer layers.

8. The method of claim 1, wherein the speech recognition model is trained using Hybrid Autoregressive Transducer Factorization.

9. The method of claim 1, wherein the operations further comprise, when the intended query decision indicates that the spoken utterance comprises a query intended for the digital assistant interface, providing a response to the received spoken utterance.

10. The method of claim 1, wherein the operations further comprise, when the intended query decision indicates that the spoken utterance does not comprise a query intended for the digital assistant interface, discarding the received spoken utterance.

11. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising: receiving, as input to a speech recognition model, audio data corresponding to a spoken utterance; performing, using the speech recognition model, speech recognition on the audio data by, at each of a plurality of time steps: encoding, using an audio encoder, the audio data corresponding to the spoken utterance into a corresponding audio encoding; and decoding, using a speech recognition joint network, the corresponding audio encoding encoded by the audio encoder at a corresponding time step into a probability distribution over possible output labels for the spoken utterance at the corresponding time step; and at each of the plurality of time steps, determining, using an intended query (IQ) joint network configured to receive a label history representation associated with a sequence of non-blank symbols output by a final softmax layer, an intended query decision indicating whether or not the spoken utterance comprises a query intended for a digital assistant interface.

12. The system of claim 11, wherein: the speech recognition model comprises the audio encoder, the speech recognition joint network, and a prediction network, the prediction network configured to receive the sequence of non-blank symbols output by the final softmax layer and generate the label history representation at each of the plurality of time steps; the speech recognition model is trained during a first training stage by optimizing the audio encoder, the speech recognition joint network, and the prediction network using a regular label sequence of wordpieces; and the IQ joint network is initialized with the speech recognition joint network during a second training stage by freezing the audio encoder and the prediction network and fine-tuning the IQ joint network with an expanded label sequence of both word pieces and IQ tokens to teach the IQ joint network to learn how to predict a distribution of IQ tokens indicating whether or not an input utterance comprises a query intended for the digital assistant interface.

13. The system of claim 12, wherein generating the label history representation for a corresponding sequence of non-blank symbols comprises: for each non-blank symbol in the sequence of non-blank symbols received as input at each of the plurality of time steps: generating, by the prediction network, using a shared embedding matrix, an embedding of a corresponding non-blank symbol; assigning, by the prediction network, a respective position vector to the corresponding non-blank symbol; and weighting, by the prediction network, the embedding proportional to a similarity between the embedding and the respective position vector; and generating, as output from the prediction network, a single embedding vector at the corresponding time step, the single embedding vector based on a weighted average of the weighted embeddings, the single embedding vector co

Assignees

Inventors

Classifications

  • Execution procedure of a spoken command · CPC title

  • Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title

  • Training · CPC title

  • using statistical methods · CPC title

  • Discourse or dialogue representation · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12315497B2 cover?
A method includes receiving, as input to a speech recognition model, audio data corresponding to a spoken utterance. The method also includes performing, using the speech recognition model, speech recognition on the audio data by, at each of a plurality of time steps, encoding, using an audio encoder, the audio data corresponding to the spoken utterance into a corresponding audio encoding, and …
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/16. Mapped technology areas include Physics.
When was this patent published?
Publication date: May 27, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).