Device-directed utterance detection

US11551685B2 · US · B2

Patent metadata
Publication number: US-11551685-B2
Application number: US-202016822744-A
Country: US
Kind code: B2
Filing date: Mar 18, 2020
Priority date: Mar 18, 2020
Publication date: Jan 10, 2023
Grant date: Jan 10, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A speech interface device is configured to detect an interrupt event and process a voice command without detecting a wakeword. The device includes on-device interrupt architecture configured to detect when device-directed speech is present and send audio data to a remote system for speech processing. This architecture includes an interrupt detector that detects an interrupt event (e.g., device-directed speech) with low latency, enabling the device to quickly lower a volume of output audio and/or perform other actions in response to a potential voice command. In addition, the architecture includes a device directed classifier that processes an entire utterance and corresponding semantic information and detects device-directed speech with high accuracy. Using the device directed classifier, the device may reject the interrupt event and increase a volume of the output audio or may accept the interrupt event, causing the output audio to end and performing speech processing on the audio data.
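The two-stage flow in the abstract (a fast interrupt detector followed by a slower, more accurate device-directed classifier) can be illustrated with a minimal Python sketch. This is not the patented implementation: the `Playback` class, the function name, the `duck_volume` and `accept_threshold` parameters, and the callable signatures are all hypothetical, and the sketch runs both stages synchronously whereas a real device would wait for the utterance to finish before the second stage.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Playback:
    """Hypothetical stand-in for the device's output-audio state."""
    volume: float = 1.0
    playing: bool = True


def handle_possible_interrupt(
    playback: Playback,
    early_audio: bytes,
    full_utterance: bytes,
    semantic_info: str,
    interrupt_detector: Callable[[bytes], bool],
    directed_classifier: Callable[[bytes, str], float],
    send_for_speech_processing: Callable[[bytes], None],
    duck_volume: float = 0.3,        # assumed value, not from the patent
    accept_threshold: float = 0.5,   # assumed value, not from the patent
) -> str:
    # Stage 1: low-latency check on a short window of audio.
    if not interrupt_detector(early_audio):
        return "no_interrupt"

    # Potential voice command: lower the output volume right away.
    original_volume = playback.volume
    playback.volume = min(playback.volume, duck_volume)

    # Stage 2: higher-accuracy check on the whole utterance plus semantics.
    score = directed_classifier(full_utterance, semantic_info)
    if score >= accept_threshold:
        # Accept: end the output audio and hand the audio to speech processing.
        playback.playing = False
        send_for_speech_processing(full_utterance)
        return "accepted"

    # Reject: restore the earlier volume and keep playing.
    playback.volume = original_volume
    return "rejected"


# Usage with stub models standing in for the real detector and classifier.
sent: List[bytes] = []
accepted_pb = Playback()
accepted = handle_possible_interrupt(
    accepted_pb, b"stop", b"stop the music", "intent=StopMusic",
    lambda audio: True, lambda audio, sem: 0.9, sent.append,
)

rejected_pb = Playback()
rejected = handle_possible_interrupt(
    rejected_pb, b"hey", b"hey pass the salt", "intent=None",
    lambda audio: True, lambda audio, sem: 0.1, sent.append,
)
```

Splitting the decision this way lets the cheap first stage react quickly, while the second stage can undo a false trigger by restoring the volume.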

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method, the method comprising: receiving, at a microphone, first audio; generating first input audio data representing the first audio; processing the first input audio data using a wakeword detector of a device to determine that the first input audio data includes a representation of a wakeword; sending at least a portion of the first input audio data for speech processing; receiving first output audio data generated in response to the first input audio data; generating, using the first output audio data and a loudspeaker of the device, first output audio; receiving, at the microphone, second audio; generating second input audio data representing the second audio; processing the second input audio data using a voice activity detector of the device to determine that first speech is represented in the second input audio data; processing the second input audio data using a first classifier of the device to determine that the first speech is directed to the device, the first classifier configured to process a first quantity of audio data; detecting a first endpoint of the first speech represented in the second input audio data; determining, using the first endpoint, a portion of the second input audio data that represents the first speech; processing the portion of the second input audio data using a second classifier to determine that the first speech is directed to the device, the second classifier configured to process a second quantity of audio data that is larger than the first quantity; causing output of the first output audio to stop; and sending at least the portion of the second input audio data for speech processing.

2. The computer-implemented method of claim 1, further comprising: receiving second output audio data; generating, using the second output audio data and the loudspeaker, second output audio at a first volume level; receiving, at the microphone, third audio; generating third input audio data representing the third audio; processing the third input audio data using the voice activity detector to determine that second speech is represented in a portion of the third input audio data; processing the third input audio data using the first classifier to determine that the second speech is directed to the device; enabling a visual indicator that indicates that the device is listening; processing the third input audio data using the second classifier to determine that the second speech is not directed to the device; and disabling the visual indicator.

3. The computer-implemented method of claim 1, wherein causing the output of the first output audio to stop further comprises: determining a current position in the first output audio data; identifying a word boundary represented in the first output audio data after the current position, the word boundary indicated by first information included within the first output audio data; determining a first portion of the first output audio data ending at the word boundary; generating the first output audio using the loudspeaker and the first portion of the first output audio data; and causing the first output audio to stop after the first portion of the first output audio data.

4. The computer-implemented method of claim 1, wherein processing the second input audio data using the first classifier further comprises: determining a volume level of the second input audio data; determining first identification data associated with the first speech; determining emotion data corresponding to the first speech; determining a length of time between generating the first output audio and receiving the second audio; determining, using the first classifier, first model output data using the volume level, the first identification data, the emotion data, and the length of time; and determining that the first model output data satisfies a condition.

5. A computer-implemented method, comprising: generating first output audio using a loudspeaker associated with a device; receiving first audio data; processing the first audio data using a first component of the device to determine that a first portion of the first audio data represents first speech that is directed to the device; in response to determining that the first speech is represented in the first portion of the first audio data, performing a first action; determining that the first speech is represented in a second portion of the first audio data that includes the first portion of the first audio data; detecting, by a speech processing component of the device, an endpoint of the first speech represented in the first audio data; determining, by the speech processing component, semantic information associated with the first speech; processing, using a classifier of a second component of the device, the second portion of the first audio data and the semantic information to determine that the first speech corresponds to a first device-directed speech event; and causing speech processing to be performed on the second portion of the first audio data.

6. The computer-implemented method of claim 5, further comprising: generating second output audio using the loudspeaker; receiving second audio data; processing the second audio data using the first component to determine that speech is represented in a portion of the second audio data; performing the first action; processing the second audio data using the second component to determine a confidence value indicating a likelihood that the second audio data corresponds to a second device-directed speech event; determining that the confidence value satisfies a condition; and performing a second action.

7. The computer-implemented method of claim 5, further comprising: generating second output audio using the loudspeaker; receiving second audio data; processing the second audio data using the first component to determine that a first portion of the second audio data represents second speech that is directed to the device; performing the first action; determining that the second speech is represented in a second portion of the second audio data that includes the first portion of the second audio data; processing the second portion of the second audio data using the second component to determine that the second speech is not directed to the device; and performing a second action.

8. The computer-implemented method of claim 7, further comprising: generating training data corresponding to the second audio data, the training data indicating a rejected interrupt event; and training the first component using the training data.

9. The computer-implemented method of claim 5, further comprising: receiving output audio data; generating, using the loudspeaker and the output audio data, second output audio at a first volume level; receiving second audio data; processing the second audio data using the first component to determine that a portion of the second audio data represents second speech that is directed to the device; generating, using the loudspeaker and the output audio data, the second output audio at a second volume level that is lower than the first volume level; processing the second audio data using the second component to determine that the second audio data does not correspond to a second device-directed speech event; and generating, using the loudspeaker and the output audio data, the second output audio at the first volume level.

10. The computer-implemented method of claim 5, further comprising: generating second output audio using a loudspeaker; receiving second audio data; processing the second audio data using the f
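
Claim 3 describes stopping playback at the next word boundary after the current position, so that output does not cut off mid-word. The lookup could be sketched as follows, assuming the boundary information is available as a sorted list of millisecond offsets. That representation and the function names are hypothetical; the claim only says the boundary is indicated by information included in the output audio data.

```python
from bisect import bisect_right
from typing import Optional, Sequence, Tuple


def next_word_boundary(boundaries_ms: Sequence[int], current_ms: int) -> Optional[int]:
    """Return the first boundary strictly after current_ms, or None if there is none.

    boundaries_ms must be sorted in ascending order (an assumption of this sketch).
    """
    index = bisect_right(boundaries_ms, current_ms)
    return boundaries_ms[index] if index < len(boundaries_ms) else None


def portion_to_play(
    total_ms: int, boundaries_ms: Sequence[int], current_ms: int
) -> Tuple[int, int]:
    """Return the (start, end) span still to be played before stopping.

    Falls back to the end of the audio when no later boundary exists.
    """
    boundary = next_word_boundary(boundaries_ms, current_ms)
    return (current_ms, boundary if boundary is not None else total_ms)


# Playback is at 950 ms; word boundaries sit at 400, 900, and 1500 ms.
span = portion_to_play(2000, [400, 900, 1500], 950)
```

Here `span` covers the audio from the current position up to the boundary at 1500 ms, after which output would stop.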

Assignees

Inventors

Classifications

  • of the speaker; Human-factor methodology · CPC title

  • G10L15/222 (Primary)

    Barge in, i.e. overridable guidance for interrupting prompts · CPC title

  • Speech to text systems (G10L15/08 takes precedence) · CPC title

  • Execution procedure of a spoken command · CPC title

  • of application context · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11551685B2 cover?
A speech interface device is configured to detect an interrupt event and process a voice command without detecting a wakeword. The device includes on-device interrupt architecture configured to detect when device-directed speech is present and send audio data to a remote system for speech processing. This architecture includes an interrupt detector that detects an interrupt event (e.g., device-…
Who is the assignee on this patent?
Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification G10L15/222. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jan 10, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).