Device-directed utterance detection

US12236950B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12236950-B2
Application number	US-202318149181-A
Country	US
Kind code	B2
Filing date	Jan 3, 2023
Priority date	Mar 18, 2020
Publication date	Feb 25, 2025
Grant date	Feb 25, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A speech interface device is configured to detect an interrupt event and process a voice command without detecting a wakeword. The device includes on-device interrupt architecture configured to detect when device-directed speech is present and send audio data to a remote system for speech processing. This architecture includes an interrupt detector that detects an interrupt event (e.g., device-directed speech) with low latency, enabling the device to quickly lower a volume of output audio and/or perform other actions in response to a potential voice command. In addition, the architecture includes a device directed classifier that processes an entire utterance and corresponding semantic information and detects device-directed speech with high accuracy. Using the device directed classifier, the device may reject the interrupt event and increase a volume of the output audio or may accept the interrupt event, causing the output audio to end and performing speech processing on the audio data.
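
The two-stage behavior in the abstract can be sketched as a small state machine: a fast interrupt lowers the output volume, and a later whole-utterance decision either restores the volume (reject) or ends playback (accept). This is a minimal sketch for illustration, not the patented implementation; the class name `SpeechInterface`, the `Playback` states, and the volume values are all assumptions that do not appear in the publication.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Playback(Enum):
    PLAYING = auto()
    DUCKED = auto()
    STOPPED = auto()


@dataclass
class SpeechInterface:
    """Hypothetical device model; names and volume levels are illustrative."""

    normal_volume: float = 1.0
    ducked_volume: float = 0.2
    volume: float = 1.0
    state: Playback = Playback.PLAYING

    def on_interrupt(self) -> None:
        # Stage 1: the low-latency interrupt detector fired, so lower the
        # output volume immediately, before the utterance has finished.
        if self.state is Playback.PLAYING:
            self.volume = self.ducked_volume
            self.state = Playback.DUCKED

    def on_classifier_result(self, device_directed: bool) -> str:
        # Stage 2: the device directed classifier has seen the whole
        # utterance and either accepts or rejects the interrupt event.
        if self.state is not Playback.DUCKED:
            return "ignored"
        if device_directed:
            # Accept: end the output audio; speech processing would follow.
            self.volume = 0.0
            self.state = Playback.STOPPED
            return "accepted"
        # Reject: raise the volume back to its previous level.
        self.volume = self.normal_volume
        self.state = Playback.PLAYING
        return "rejected"
```

Calling `on_interrupt()` and then `on_classifier_result(False)` models a false alarm, such as background conversation: the audio dips briefly and then recovers. Splitting the decision this way trades a short, reversible volume change for a fast reaction to a possible command.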

First claim

Opening claim text (preview).

What is claimed is:
1. A computer-implemented method, comprising: generating first output audio using a loudspeaker associated with a device; receiving first audio data; processing the first audio data using a first component of the device to determine that the first audio data represents first speech; in response to determining that the first speech is represented in the first audio data, performing a first action; determining, by a natural language processing component, first natural language processing data associated with the first speech; providing the first audio data and the first natural language processing data as inputs to a machine learning component, the machine learning component being configured to classify input data as corresponding to a device-directed speech event; determining, using the machine learning component, that the first audio data and the first natural language processing data correspond to a first device-directed speech event; and based at least in part on the first audio data and the first natural language processing data corresponding to the first device-directed speech event, causing natural language processing to be completed based on the first audio data.
2. The computer-implemented method of claim 1, further comprising: detecting an endpoint of the first speech represented in the first audio data, wherein determining that the first audio data and the first natural language processing data correspond to the first device-directed speech event occurs after detection of the endpoint.
3. The computer-implemented method of claim 1, further comprising: determining, using a wakeword detection component, an indicator that the first speech includes a wakeword; and providing the indicator as an input to the machine learning component together with the first audio data and the first natural language processing data.
4. The computer-implemented method of claim 1, further comprising: processing, by the natural language processing component, a first portion of the first audio data to determine the first natural language processing data, wherein the first natural language processing data corresponds to the first portion of the first audio data.
5. The computer-implemented method of claim 1, further comprising: detecting an endpoint of the first speech represented in the first audio data, wherein determining the first natural language processing data comprises determining the first natural language processing data corresponding to an entirety of the first speech.
6. The computer-implemented method of claim 1, further comprising: processing, by a wakeword detection component, the first audio data; and failing to detect, by the wakeword detection component, a representation of a wakeword in the first audio data.
7. The computer-implemented method of claim 1, wherein the first component comprises a wakeword detection component and the method further comprises: processing, by the wakeword detection component, the first audio data to determine a representation of a wakeword in the first audio data.
8. The computer-implemented method of claim 1, wherein performing the first action comprises: presenting, by the device, a visual output corresponding to an indication that natural language processing is occurring.
9. The computer-implemented method of claim 1, wherein performing the first action comprises: reducing a volume level of the first output audio.
10. The computer-implemented method of claim 1, further comprising: after determination that the first audio data and the first natural language processing data correspond to the first device-directed speech event, discontinuing generating the first output audio using the loudspeaker of the device.
11. A system comprising: at least one processor; and at least one memory comprising instructions that, when executed by the at least one processor, cause the system to: generate first output audio using a loudspeaker associated with a device; receive first audio data; process the first audio data using a first component of the device to determine that the first audio data represents first speech; in response to determination that the first speech is represented in the first audio data, performing a first action; determine, by a natural language processing component, first natural language processing data associated with the first speech, wherein the first natural language processing data corresponds to a representation of the first speech; provide the first audio data and the first natural language processing data as inputs to a machine learning component, the machine learning component being configured to classify input data as corresponding to a device-directed speech event; determine, using the machine learning component, that the first audio data and the first natural language processing data correspond to a first device-directed speech event; and based at least in part on the first audio data and the first natural language processing data corresponding to the first device-directed speech event, cause natural language processing to be completed based on the first audio data, wherein the natural language processing includes determining an intent associated with the first speech.
12. The system of claim 11, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: detect an endpoint of the first speech represented in the first audio data, wherein determination that the first audio data and the first natural language processing data correspond to the first device-directed speech event occurs after detection of the endpoint.
13. The system of claim 11, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: determine, using a wakeword detection component, an indicator that the first speech includes a wakeword; and provide the indicator as an input to the machine learning component together with the first audio data and the first natural language processing data.
14. The system of claim 11, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: process, by the natural language processing component, a first portion of the first audio data to determine the first natural language processing data, wherein the first natural language processing data corresponds to the first portion of the first audio data.
15. The system of claim 11, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: detect an endpoint of the first speech represented in the first audio data, wherein the first natural language processing data corresponds to an entirety of the first speech.
16. The system of claim 11, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: process, by a wakeword detection component, the first audio data; and fail to detect, by the wakeword detection component, a representation of a wakeword in the first audio data.
17. The system of claim 11, wherein the first component comprises a wakeword detection component and wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: process, by the wakeword detection component, the first audio data
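
The order of steps in claim 1, with the optional wakeword indicator of claim 3, can be sketched as a single function. This is only an illustration under stated assumptions: `handle_audio` and every callable passed to it are hypothetical stand-ins for the claimed components, and the publication does not specify this interface.

```python
from typing import Any, Callable, Optional


def handle_audio(
    audio: bytes,
    detect_speech: Callable[[bytes], bool],
    first_action: Callable[[], None],
    run_nlp: Callable[[bytes], Any],
    classify: Callable[[bytes, Any, Optional[bool]], bool],
    complete_nlp: Callable[[bytes], None],
    wakeword_indicator: Optional[bool] = None,
) -> bool:
    """Return True if the audio was treated as device-directed speech."""
    # The "first component" decides whether the audio represents speech.
    if not detect_speech(audio):
        return False
    # The "first action" (for example a volume reduction or a visual cue)
    # happens as soon as speech is detected.
    first_action()
    # Natural language processing data is determined for the speech.
    nlp_data = run_nlp(audio)
    # The machine learning component receives the audio and the NLP data,
    # plus the wakeword indicator when one is available (claim 3).
    if classify(audio, nlp_data, wakeword_indicator):
        # Only device-directed speech has its processing completed.
        complete_nlp(audio)
        return True
    return False
```

Passing the components in as callables keeps the sketch independent of any particular speech detector, NLP stack, or classifier.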

Assignees

Amazon Tech Inc

Inventors

Classifications

G10L2015/228
of application context · CPC title
G10L2015/223
Execution procedure of a spoken command · CPC title
G10L2015/088
Word spotting · CPC title
G10L15/26
Speech to text systems (G10L15/08 takes precedence) · CPC title
G10L15/1815
Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning · CPC title

Patent family

Related publications grouped by family.

View patent family 74873816

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12236950B2 cover?

A speech interface device is configured to detect an interrupt event and process a voice command without detecting a wakeword. The device includes on-device interrupt architecture configured to detect when device-directed speech is present and send audio data to a remote system for speech processing. This architecture includes an interrupt detector that detects an interrupt event (e.g., device-…

Who is the assignee on this patent?

Amazon Tech Inc

What technology area does this patent fall under?

Primary CPC classification G10L15/22. Mapped technology areas include Physics.

When was this patent published?

Publication date Feb 25, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 9 related publications on this page (citations in our corpus or others sharing the same primary CPC).