Contextual suppression of assistant command(s)

US2023143177A1 · US · A1

Patent metadata
Field: Value
Publication number: US-2023143177-A1
Application number: US-202318092883-A
Country: US
Kind code: A1
Filing date: Jan 3, 2023
Priority date: May 17, 2021
Publication date: May 11, 2023
Grant date:

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Some implementations process, using warm word model(s), a stream of audio data to determine a portion of the audio data that corresponds to particular word(s) and/or phrase(s) (e.g., a warm word) associated with an assistant command, process, using an automatic speech recognition (ASR) model, a preamble portion of the audio data (e.g., that precedes the warm word) and/or a postamble portion of the audio data (e.g., that follows the warm word) to generate ASR output, and determine, based on processing the ASR output, whether a user intended the assistant command to be performed. Additional or alternative implementations can process the stream of audio data using a speaker identification (SID) model to determine whether the audio data is sufficient to identify the user that provided a spoken utterance captured in the stream of audio data, and determine if that user is authorized to cause performance of the assistant command.
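As a rough illustration of the intent check the abstract describes, the sketch below takes text already produced for the preamble and postamble and decides whether the warm word was meant as a command. It is only a toy stand-in: the patent relies on ASR (and, per the claims, NLU) models, whereas this example uses a hypothetical filler-word heuristic. The names `WarmWordHit`, `POLITE_FILLERS`, and `user_intended_command` are assumptions made for illustration and do not come from the publication.

```python
from dataclasses import dataclass

# Hypothetical filler words that do not change the meaning of a command.
# Assumption for illustration only; the patent uses ASR/NLU models here.
POLITE_FILLERS = frozenset({"please", "now", "thanks", "ok", "okay"})


@dataclass
class WarmWordHit:
    preamble_text: str   # transcript of audio preceding the warm word
    warm_word: str       # the detected warm word, e.g. "stop"
    postamble_text: str  # transcript of audio following the warm word


def user_intended_command(hit: WarmWordHit) -> bool:
    """Return True when the surrounding speech suggests a direct command.

    Heuristic: if nothing but filler words surrounds the warm word, treat it
    as a command; any other surrounding speech is treated as conversation.
    """
    context = f"{hit.preamble_text} {hit.postamble_text}".lower().split()
    leftover = [word.strip(".,!?") for word in context]
    leftover = [word for word in leftover if word and word not in POLITE_FILLERS]
    return not leftover


# "please stop now" reads as a command; "I told him to stop calling me" does not.
print(user_intended_command(WarmWordHit("please", "stop", "now")))
print(user_intended_command(WarmWordHit("I told him to", "stop", "calling me")))
```

A real implementation would replace the heuristic with model output; the point is only that the decision depends on the speech around the warm word, not on the warm word alone.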

First claim

Opening claim text (preview).

What is claimed is:

1. A method implemented by one or more processors, the method comprising: processing, using a warm word model, a stream of audio data to monitor for an occurrence of one or more particular words or phrases, the stream of audio data being generated by one or more microphones of a client device of a user, and each of the one or more particular words or phrases being associated with an assistant command; in response to determining a portion of the stream of audio data corresponds to one or more of the particular words or phrases: processing, using a voice activity detection (VAD) model, the stream of audio data to monitor for an occurrence of additional voice activity before the portion of the stream of audio data corresponds to one or more of the particular words or phrases and/or after the portion of the stream of audio data corresponds to one or more of the particular words or phrases; in response to determining that there is no additional voice activity before the portion of the stream of audio data corresponds to one or more of the particular words or phrases and/or after the portion of the stream of audio data corresponds to one or more of the particular words or phrases: causing an automated assistant to perform the assistant command that is associated with one or more of the particular words or phrases; and in response to determining that there is additional voice activity before the portion of the stream of audio data corresponds to one or more of the particular words or phrases and/or after the portion of the stream of audio data corresponds to one or more of the particular words or phrases: further processing the stream of audio data to determine whether to cause the automated assistant to perform the assistant command that is associated with one or more of the particular words or phrases.

2. The method of claim 1, wherein further processing the stream of audio data to determine whether to cause the automated assistant to perform the assistant command that is associated with one or more of the particular words or phrases comprises: processing, using an automatic speech recognition (ASR) model, a preamble portion of the stream of audio data and/or a postamble portion of the audio data to generate ASR output, wherein the preamble portion of the audio data precedes the portion of the stream of audio data that corresponds to the one or more particular words or phrases, and wherein the postamble portion of the audio data follows the portion of the stream of audio data that corresponds to the one or more particular words or phrases; and determining, based on processing the ASR output, whether the user intended the one or more particular words or phrases to cause performance of the assistant command; and in response to determining the user intended the one or more particular words or phrases to cause performance of the assistant command that is associated with one or more of the particular words or phrases: causing the automated assistant to perform the assistant command that is associated with one or more of the particular words or phrases.

3. The method of claim 2, further comprising: in response to determining the user did not intend the one or more particular words or phrases to cause performance of the assistant command that is associated one or more of the particular words or phrases: refraining from causing the automated assistant to perform the assistant command that is associated with one or more of the particular words or phrases; and.

4. The method of claim 2, further comprising: obtaining the preamble portion of the audio data from an audio buffer of the client device; and/or obtaining the postamble portion of the audio data from the stream of audio data.

5. The method of claim 2, wherein determining whether the user intended the one or more particular words or phrases to cause performance of the assistant command that is associated with one or more of the particular words or phrases based on processing the ASR output comprises: processing, using a natural language understanding (NLU) model, the ASR output to generate NLU output, wherein the ASR output is generated based on both the preamble portion of the audio data and the postamble portion of the audio data; and determining, based on the NLU output, whether the user intended the one or more particular words or phrases to cause performance of the assistant command.

6. The method of claim 5, further comprising: in response to determining the NLU output is insufficient for determining whether the user intended the one or more particular words or phrases to cause performance of the assistant command that is associated with one or more of the particular words or phrases: processing, using the ASR model, an additional postamble portion of the audio data to generate additional ASR output, wherein the additional postamble portion of the audio data follows the postamble portion of the audio data; and determining, based on processing the additional ASR output, whether the user intended the one or more particular words or phrases to cause performance of the assistant command that is associated with one or more of the particular words or phrases.

7. The method of claim 1, further comprising: detecting an occurrence of a warm word activation event; and in response to detecting the occurrence of the warm word activation event, activating one or more currently dormant automated assistant functions that utilize the warm word model, wherein processing the stream of audio data using the warm word model to monitor for the occurrence of the one or more particular words or phrases is in response to activating the one or more currently dormant automated assistant functions that utilize the warm word model.

8. The method of claim 7, wherein the warm word activation event comprises one or more of: a phone call being received at the client device, a text message being received at the client device, an email being received at the client device, an alarm sounding at the client device, a timer sounding at the client device, media being played at the client device or an additional client device in an environment of the client device, a notification being received at the client device, a location of the client device, or a software application being accessible at the client device.

9. The method of claim 1, further comprising: processing, using an endpointing model, the stream of audio data to generate a plurality of timestamps for the stream of audio data.

10. The method of claim 9, wherein the plurality of timestamps comprise at least a first timestamp associated with a first time when the user began providing the one or more particular words or phrases, a second timestamp associated with a second time, that is subsequent to the first time, when the user finished providing the one or more particular words or phrases.

11. The method of claim 10, wherein determining that there is no additional voice activity before the portion of the stream of audio data corresponds to one or more of the particular words or phrases comprises determining that there is no voice activity prior to the first timestamp.

12. The method of claim 11, wherein determining that there is no additional voice activity after the portion of the stream of audio data corresponds to one or more of the particular words or phrases comprises determining that there is no voice activity subsequent to the second timestamp.

13. A system comprising: at least one processor; and memory storing instructions that, when executed by the at least one processor, cau
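The gating step of claim 1, read together with the timestamps of claims 10 to 12, might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: voice activity is represented as hypothetical `(start, end)` segments in seconds, the `tolerance` parameter is an invented allowance for timestamp jitter, and the function names and return labels are illustrative only. It also assumes that voice activity on either side of the warm word is enough to trigger further processing, which is one reading of the claim's "and/or".

```python
from typing import List, Tuple

# (start_s, end_s) of a span the VAD model flagged as speech. Hypothetical
# representation; the patent does not specify a data format.
Segment = Tuple[float, float]


def has_additional_voice_activity(
    segments: List[Segment],
    warm_start: float,
    warm_end: float,
    tolerance: float = 0.05,  # assumed slack for timestamp jitter
) -> bool:
    """True if any speech falls before warm_start or after warm_end."""
    for start, end in segments:
        if start < warm_start - tolerance or end > warm_end + tolerance:
            return True
    return False


def gate_warm_word(segments: List[Segment], warm_start: float, warm_end: float) -> str:
    """Decide between performing the command and further processing."""
    if has_additional_voice_activity(segments, warm_start, warm_end):
        # Surrounding speech: defer to further processing (e.g. ASR on the
        # preamble and/or postamble) before acting.
        return "further_processing"
    # The warm word was spoken in isolation: perform the command directly.
    return "perform_command"


# Warm word spoken alone between 1.0 s and 1.4 s.
print(gate_warm_word([(1.0, 1.4)], 1.0, 1.4))
# Speech from 0.2 s to 0.9 s precedes the warm word.
print(gate_warm_word([(0.2, 0.9), (1.0, 1.4)], 1.0, 1.4))
```

Here `warm_start` and `warm_end` play the role of the first and second timestamps in claim 10. The cheap VAD check handles the isolated-warm-word case, and the more expensive processing is reserved for utterances with surrounding speech.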

Assignees

Google LLC

Inventors

Classifications

  • Interactive procedures; Man-machine interfaces

  • Artificial neural networks; Connectionist approaches

  • G10L17/14 (Primary)

    Use of phonemic categorisation or speech recognition prior to speaker recognition or verification

  • G10L15/22 (Primary)

    Procedures used during a speech recognition process, e.g. man-machine dialogue

  • Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2023143177A1 cover?
Some implementations process, using warm word model(s), a stream of audio data to determine a portion of the audio data that corresponds to particular word(s) and/or phrase(s) (e.g., a warm word) associated with an assistant command, process, using an automatic speech recognition (ASR) model, a preamble portion of the audio data (e.g., that precedes the warm word) and/or a postamble portion of …
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G10L17/14. Mapped technology areas include Physics.
When was this patent published?
Publication date: May 11, 2023 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).