Selectively activating on-device speech recognition, and using recognized text in selectively activating on-device NLU and/or on-device fulfillment

US12315508B2 · US · B2

Patent metadata
Publication number: US-12315508-B2
Application number: US-202217970894-A
Country: US
Kind code: B2
Filing date: Oct 21, 2022
Priority date: May 6, 2019
Publication date: May 27, 2025
Grant date: May 27, 2025

Abstract

Official abstract text for this publication.

Implementations can reduce the time required to obtain responses from an automated assistant by, for example, obviating the need to provide an explicit invocation to the automated assistant, such as by saying a hot-word/phrase or performing a specific user input, prior to speaking a command or query. In addition, the automated assistant can optionally receive, understand, and/or respond to the command or query without communicating with a server, thereby further reducing the time in which a response can be provided. Implementations only selectively initiate on-device speech recognition responsive to determining one or more condition(s) are satisfied. Further, in some implementations, on-device NLU, on-device fulfillment, and/or resulting execution occur only responsive to determining, based on recognized text from the on-device speech recognition, that such further processing should occur. Thus, through selective activation of on-device speech processing, and/or selective activation of on-device NLU and/or on-device fulfillment, various client device resources are conserved.
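The staged activation described in the abstract can be sketched as a chain of gates, where each later stage runs only if the earlier check passes. This is a minimal illustration, not the patented implementation: the class `StagedAssistant`, its field names, and the toy lambdas standing in for the speech recognizer, NLU, and fulfillment are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class StagedAssistant:
    """Hypothetical pipeline: every stage is skipped unless its gate passes."""
    should_activate_asr: Callable[[bytes], bool]   # condition check on hot-word-free audio
    recognize: Callable[[bytes], str]              # stand-in for on-device speech recognition
    should_activate_nlu: Callable[[str], bool]     # gate based on the recognized text
    understand: Callable[[str], Dict[str, str]]    # stand-in for on-device NLU
    fulfill: Callable[[Dict[str, str]], str]       # stand-in for on-device fulfillment
    stages_run: List[str] = field(default_factory=list)

    def handle(self, audio: bytes) -> Optional[str]:
        self.stages_run = []
        # Gate 1: speech recognition is only activated when the condition holds.
        if not self.should_activate_asr(audio):
            return None
        self.stages_run.append("asr")
        text = self.recognize(audio)
        # Gate 2: NLU and fulfillment only run if the recognized text warrants it.
        if not self.should_activate_nlu(text):
            return None
        self.stages_run.append("nlu")
        intent = self.understand(text)
        self.stages_run.append("fulfillment")
        return self.fulfill(intent)


# Toy stand-ins (assumptions for illustration only).
assistant = StagedAssistant(
    should_activate_asr=lambda audio: len(audio) > 0,
    recognize=lambda audio: audio.decode("utf-8"),
    should_activate_nlu=lambda text: text.startswith("turn"),
    understand=lambda text: {"intent": "lights", "state": text.split()[-1]},
    fulfill=lambda intent: f"lights {intent['state']}",
)
```

In this sketch, `stages_run` records which stages actually executed, which makes the resource saving visible: an utterance that fails the text gate stops after recognition and never reaches NLU or fulfillment.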

First claim

Opening claim text (preview).

What is claimed is:

1. A method implemented using one or more processors, the method comprising: determining to activate on-device speech recognition at a client device, wherein determining to activate the on-device speech recognition is in response to detecting directed speech, wherein detecting the directed speech comprises: detecting, by one or more microphones of the client device, hot word free audio data; processing the hot word free audio data using a trained acoustic model to generate a directed speech metric that indicates whether a spoken utterance, that is hot word free and that is captured by the hot word free audio data, is directed to the client device or instead is not directed to the client device, wherein the trained acoustic model is trained to be used in processing corresponding audio data that captures a corresponding hot word free spoken utterance to generate a corresponding directed speech metric that indicates whether the corresponding hot word free spoken utterance is directed to the client device or instead is not directed to the client device; detecting the directed speech in response to the directed speech metric satisfying a threshold; in response to determining to activate the on-device speech recognition at the client device: generating, based on processing the hot word free audio data using the on-device speech recognition, recognized text for the spoken utterance captured by the hot word free audio data and/or captured by additional hot word free audio data detected by the one or more of the microphones following the hot word free audio data; determining, based on the recognized text, whether to activate on-device natural language understanding of the recognized text and/or to activate on-device fulfillment that is based on the on-device natural language understanding; when it is determined to activate the on-device natural language understanding and/or to activate the on-device fulfillment: performing the on-device natural language understanding and/or initiating the on-device fulfillment.

2. The method of claim 1, wherein the trained acoustic model is trained based on: positive training instances that each include: positive training instance input of a corresponding directed spoken utterance that is directed to a corresponding client device, and positive training instance output that is a first label; and negative training instances that each include: negative training instance input of a corresponding spoken utterance not directed to any client device, and negative training instance output that is a second label.

3. The method of claim 1, wherein determining, based on the recognized text, whether to activate on-device natural language understanding and/or to activate the on-device fulfillment comprises: determining whether at least part of the recognized text conforms to content text, the content text being rendered at the client device while the spoken utterance is being spoken.

4. The method of claim 1, wherein determining, based on the recognized text, whether to activate the on-device natural language understanding and/or to activate the on-device fulfillment comprises: determining whether at least part of the recognized text conforms to content text, the content text being related to an entity being rendered at the client device while the spoken utterance is being spoken.

5. The method of claim 1, wherein the directed speech metric comprises a probability.

6. The method of claim 1, wherein determining to activate the on-device speech recognition is in response to detecting the directed speech and is further in response to detecting an implicit invocation cue via a non-microphone sensor of the client device.

7. The method of claim 6, wherein the implicit invocation cue is user presence within a threshold distance of the client device.

8. The method of claim 7, wherein the non-microphone sensor is a laser-based sensor.

9. The method of claim 6, wherein the non-microphone sensor is an accelerometer, a magnetometer, or a gyroscope.

10. The method of claim 1, wherein determining, based on the recognized text, whether to activate on-device natural language understanding and/or to activate the on-device fulfillment comprises: determining whether at least part of the recognized text matches one or more related action phrases each having a defined correspondence to a recent action performed, at the client device, responsive to prior user input.

11. A method implemented using one or more processors, the method comprising: determining to activate on-device speech recognition, wherein determining to activate the on-device speech recognition is in response to determining satisfaction of one or more conditions, determining the satisfaction of the one or more conditions comprising determining the satisfaction based on processing of one or both of: hot word free audio data detected by one or more microphones of a client device, and additional sensor data that is based on output from at least one non-microphone sensor of the client device; generating, using the on-device speech recognition, recognized text from a spoken utterance captured by the hot word free audio data and/or captured by additional hot word free audio data detected by one or more of the microphones following the hot word free audio data, generating the recognized text comprising performing the on-device speech recognition on the hot word free audio data and/or the additional hot word free audio data; determining that at least part of the recognized text that was generated using the on-device speech recognition conforms to actions performable by an application currently executing in a foreground of the client device while the spoken utterance is being spoken; determining, based on the recognized text conforming to actions performable by the application currently executing in the foreground, to activate on-device natural language understanding of the recognized text that was generated using the on-device speech recognition; performing the activated on-device natural language understanding of the recognized text; and initiating, on-device, a fulfillment of the spoken utterance based on the on-device natural language understanding.

12. The method of claim 11, wherein the application currently executing in the foreground is a non-automated assistant application.

13. The method of claim 11, further comprising: actively soliciting the application currently executing in the foreground to determine, responsive to the actively soliciting, the actions performable by the application.

14. The method of claim 13, wherein the actively soliciting is performed via an operating system of the client device.

15. The method of claim 13, wherein the application currently executing in the foreground is a non-automated assistant application.

16. The method of claim 11, wherein determining the satisfaction of the one or more conditions comprises determining, based on processing the hot word free audio data, that the hot word free audio data includes directed speech that is directed to the client device as opposed to not being directed to the client device.

17. The method of claim 16, wherein determining, based on processing the hot word free audio data, that the hot word free audio data includes the directed speech comprises: processing the hot word free audio data using a trained acoustic model to generate a directed speech metric; and determining that the hot word free audio data includes the directed speech in response to the directed speech metric sat
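Two of the checks recited in the claims lend themselves to a short sketch: comparing a directed speech metric against a threshold (claims 1 and 5), and testing whether at least part of the recognized text matches a known action phrase (claims 10 and 11). The code below is only an illustration under stated assumptions. The threshold value of 0.8, the use of `>=` for "satisfying a threshold", the word-level matching rule, and the function names are not taken from the patent.

```python
import re
from typing import Iterable, List, Optional

# Assumed value; the claims only require that the metric satisfy "a threshold".
DIRECTED_SPEECH_THRESHOLD = 0.8


def is_directed_speech(metric: float,
                       threshold: float = DIRECTED_SPEECH_THRESHOLD) -> bool:
    """Treat the metric as a probability and compare it to the threshold."""
    return metric >= threshold


def _words(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def matching_action(recognized_text: str,
                    action_phrases: Iterable[str]) -> Optional[str]:
    """Return the first action phrase that appears as a contiguous word
    sequence in the recognized text, or None if no phrase matches."""
    words = _words(recognized_text)
    for phrase in action_phrases:
        target = _words(phrase)
        n = len(target)
        if n and any(words[i:i + n] == target
                     for i in range(len(words) - n + 1)):
            return phrase
    return None
```

For example, with a hypothetical media player in the foreground exposing the phrases `"pause"` and `"skip track"`, `matching_action("Please skip track now", ["pause", "skip track"])` returns `"skip track"`, while unrelated speech returns `None` and would leave NLU and fulfillment inactive. Matching on whole words rather than substrings is a design choice of this sketch so that a word like "unpause" does not trigger "pause".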

Assignees

Google LLC

Inventors

Classifications

  • Execution procedure of a spoken command

  • Detection of presence or absence of voice signals (switching of direction of transmission by voice frequency in two-way loud-speaking telephone systems H04M9/10)

  • using context dependencies, e.g. language models

  • Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning

  • Training

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/22. Mapped technology areas include Physics.
When was this patent published?
Publication date May 27, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).