Device-directed utterance detection

US12236950B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12236950-B2
Application number  US-202318149181-A
Country             US
Kind code           B2
Filing date         Jan 3, 2023
Priority date       Mar 18, 2020
Publication date    Feb 25, 2025
Grant date          Feb 25, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A speech interface device is configured to detect an interrupt event and process a voice command without detecting a wakeword. The device includes on-device interrupt architecture configured to detect when device-directed speech is present and send audio data to a remote system for speech processing. This architecture includes an interrupt detector that detects an interrupt event (e.g., device-directed speech) with low latency, enabling the device to quickly lower a volume of output audio and/or perform other actions in response to a potential voice command. In addition, the architecture includes a device directed classifier that processes an entire utterance and corresponding semantic information and detects device-directed speech with high accuracy. Using the device directed classifier, the device may reject the interrupt event and increase a volume of the output audio or may accept the interrupt event, causing the output audio to end and performing speech processing on the audio data.
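The two-stage flow in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: the `Playback` class, the `handle_interrupt` function, the score inputs, the volume levels, and both thresholds are hypothetical names and values chosen for the example. A fast interrupt detector lowers the output volume first; a slower device-directed classifier then accepts or rejects the interrupt.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"
    REJECT = "reject"


@dataclass
class Playback:
    """Hypothetical stand-in for the device's output audio."""
    volume: float = 1.0
    playing: bool = True
    log: list = field(default_factory=list)

    def duck(self, level: float = 0.2) -> None:
        # Lower the volume while a potential voice command is evaluated.
        self.volume = level
        self.log.append("duck")

    def restore(self, level: float = 1.0) -> None:
        # Raise the volume again after a rejected interrupt.
        self.volume = level
        self.log.append("restore")

    def stop(self) -> None:
        # End the output audio after an accepted interrupt.
        self.playing = False
        self.log.append("stop")


def handle_interrupt(playback, interrupt_score, directed_score,
                     interrupt_threshold=0.5, directed_threshold=0.8):
    """Return a Decision, or None when no interrupt event is detected.

    interrupt_score: assumed output of the low-latency interrupt detector.
    directed_score: assumed output of the whole-utterance classifier.
    """
    if interrupt_score < interrupt_threshold:
        return None  # no interrupt event; playback is left untouched
    playback.duck()  # stage 1: react quickly to possible speech
    if directed_score >= directed_threshold:
        playback.stop()  # stage 2: accept, so speech processing can proceed
        return Decision.ACCEPT
    playback.restore()  # stage 2: reject, return to normal volume
    return Decision.REJECT
```

In this sketch both scores are passed in together for brevity; on a real device the second score would presumably arrive later, once the utterance has been processed.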

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method, comprising: generating first output audio using a loudspeaker associated with a device; receiving first audio data; processing the first audio data using a first component of the device to determine that the first audio data represents first speech; in response to determining that the first speech is represented in the first audio data, performing a first action; determining, by a natural language processing component, first natural language processing data associated with the first speech; providing the first audio data and the first natural language processing data as inputs to a machine learning component, the machine learning component being configured to classify input data as corresponding to a device-directed speech event; determining, using the machine learning component, that the first audio data and the first natural language processing data correspond to a first device-directed speech event; and based at least in part on the first audio data and the first natural language processing data corresponding to the first device-directed speech event, causing natural language processing to be completed based on the first audio data.

2. The computer-implemented method of claim 1, further comprising: detecting an endpoint of the first speech represented in the first audio data, wherein determining that the first audio data and the first natural language processing data correspond to the first device-directed speech event occurs after detection of the endpoint.

3. The computer-implemented method of claim 1, further comprising: determining, using a wakeword detection component, an indicator that the first speech includes a wakeword; and providing the indicator as an input to the machine learning component together with the first audio data and the first natural language processing data.

4. The computer-implemented method of claim 1, further comprising: processing, by the natural language processing component, a first portion of the first audio data to determine the first natural language processing data, wherein the first natural language processing data corresponds to the first portion of the first audio data.

5. The computer-implemented method of claim 1, further comprising: detecting an endpoint of the first speech represented in the first audio data, wherein determining the first natural language processing data comprises determining the first natural language processing data corresponding to an entirety of the first speech.

6. The computer-implemented method of claim 1, further comprising: processing, by a wakeword detection component, the first audio data; and failing to detect, by the wakeword detection component, a representation of a wakeword in the first audio data.

7. The computer-implemented method of claim 1, wherein the first component comprises a wakeword detection component and the method further comprises: processing, by the wakeword detection component, the first audio data to determine a representation of a wakeword in the first audio data.

8. The computer-implemented method of claim 1, wherein performing the first action comprises: presenting, by the device, a visual output corresponding to an indication that natural language processing is occurring.

9. The computer-implemented method of claim 1, wherein performing the first action comprises: reducing a volume level of the first output audio.

10. The computer-implemented method of claim 1, further comprising: after determination that the first audio data and the first natural language processing data correspond to the first device-directed speech event, discontinuing generating the first output audio using the loudspeaker of the device.

11. A system comprising: at least one processor; and at least one memory comprising instructions that, when executed by the at least one processor, cause the system to: generate first output audio using a loudspeaker associated with a device; receive first audio data; process the first audio data using a first component of the device to determine that the first audio data represents first speech; in response to determination that the first speech is represented in the first audio data, performing a first action; determine, by a natural language processing component, first natural language processing data associated with the first speech, wherein the first natural language processing data corresponds to a representation of the first speech; provide the first audio data and the first natural language processing data as inputs to a machine learning component, the machine learning component being configured to classify input data as corresponding to a device-directed speech event; determine, using the machine learning component, that the first audio data and the first natural language processing data correspond to a first device-directed speech event; and based at least in part on the first audio data and the first natural language processing data corresponding to the first device-directed speech event, cause natural language processing to be completed based on the first audio data, wherein the natural language processing includes determining an intent associated with the first speech.

12. The system of claim 11, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: detect an endpoint of the first speech represented in the first audio data, wherein determination that the first audio data and the first natural language processing data correspond to the first device-directed speech event occurs after detection of the endpoint.

13. The system of claim 11, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: determine, using a wakeword detection component, an indicator that the first speech includes a wakeword; and provide the indicator as an input to the machine learning component together with the first audio data and the first natural language processing data.

14. The system of claim 11, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: process, by the natural language processing component, a first portion of the first audio data to determine the first natural language processing data, wherein the first natural language processing data corresponds to the first portion of the first audio data.

15. The system of claim 11, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: detect an endpoint of the first speech represented in the first audio data, wherein the first natural language processing data corresponds to an entirety of the first speech.

16. The system of claim 11, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: process, by a wakeword detection component, the first audio data; and fail to detect, by the wakeword detection component, a representation of a wakeword in the first audio data.

17. The system of claim 11, wherein the first component comprises a wakeword detection component and wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: process, by the wakeword detection component, the first audio data
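The classification step of claim 1 takes audio data and natural language processing data as joint inputs to a machine learning component, and claim 3 optionally adds a wakeword indicator. A minimal sketch of that input assembly is shown below. Everything specific in it is an assumption for illustration: the `NlpData` fields, the three hand-picked features, and the logistic scorer with caller-supplied weights are not taken from the patent, which does not specify a model in the text shown here.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class NlpData:
    """Hypothetical NLP output for the speech: text plus a confidence."""
    text: str
    asr_confidence: float


def build_features(audio: Sequence[float], nlp: NlpData,
                   wakeword_detected: Optional[bool] = None) -> list:
    """Combine audio-derived and NLP-derived values into one feature list."""
    # Mean signal energy stands in for whatever acoustic features are used.
    energy = sum(s * s for s in audio) / max(len(audio), 1)
    features = [energy, nlp.asr_confidence, float(len(nlp.text.split()))]
    if wakeword_detected is not None:
        # Optional wakeword indicator, as in claim 3.
        features.append(1.0 if wakeword_detected else 0.0)
    return features


def is_device_directed(features: Sequence[float], weights: Sequence[float],
                       bias: float, threshold: float = 0.5) -> bool:
    """Toy logistic classifier standing in for the machine learning component."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = 1.0 / (1.0 + math.exp(-z))
    return probability >= threshold
```

A caller would invoke `build_features` once the natural language processing data is available and proceed with full natural language processing only when `is_device_directed` returns `True`.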

Assignees

Amazon Tech Inc

Inventors

Classifications

  • of application context · CPC title

  • Execution procedure of a spoken command · CPC title

  • Word spotting · CPC title

  • Speech to text systems (G10L15/08 takes precedence) · CPC title

  • Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12236950B2 cover?
A speech interface device is configured to detect an interrupt event and process a voice command without detecting a wakeword. The device includes on-device interrupt architecture configured to detect when device-directed speech is present and send audio data to a remote system for speech processing. This architecture includes an interrupt detector that detects an interrupt event (e.g., device-…
Who is the assignee on this patent?
Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification G10L15/22. Mapped technology areas include Physics.
When was this patent published?
Publication date: Feb 25, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 9 related publications on this page (citations in our corpus or others sharing the same primary CPC).