What technology area does this patent fall under?

Primary CPC classification G10L15/22. Mapped technology areas include Physics.

When was this patent published?

Publication date: Jun 14, 2022 (kind code B1). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Detecting system-directed speech

US11361763B1 · US · B1

Patent metadata
Field	Value
Publication number	US-11361763-B1
Application number	US-201715694348-A
Country	US
Kind code	B1
Filing date	Sep 1, 2017
Priority date	Sep 1, 2017
Publication date	Jun 14, 2022
Grant date	Jun 14, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection. Read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A speech-processing system capable of receiving and processing audio data to determine if the audio data includes speech that was intended for the system. Non-system directed speech may be filtered out while system-directed speech may be selected for further processing. A system-directed speech detector may use a trained machine learning model (such as a deep neural network or the like) to process a feature vector representing a variety of characteristics of the incoming audio data, including the results of automatic speech recognition and/or other data. Using the feature vector the model may output an indicator as to whether the speech is system-directed. The system may also incorporate other filters such as voice activity detection prior to speech recognition, or the like.
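The scoring step described in the abstract (feature vector in, system-directed indicator out) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the patent: the three feature names, the two-unit hidden layer, the hand-set weights, and the 0.5 threshold. A real detector would use a trained model over a much richer feature vector.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class UtteranceFeatures:
    """Hypothetical inputs; the abstract only says the vector covers ASR results and/or other data."""
    asr_confidence: float  # confidence of the top ASR hypothesis, 0..1
    num_words: int         # word count of the top ASR hypothesis
    same_speaker: bool     # whether the speaker matches the previous turn


def to_feature_vector(f: UtteranceFeatures) -> List[float]:
    """Flatten the features into a fixed-length numeric vector."""
    return [
        f.asr_confidence,
        min(f.num_words, 20) / 20.0,  # clip and scale the word count to 0..1
        1.0 if f.same_speaker else 0.0,
    ]


# Placeholder weights for a tiny feedforward network (NOT trained values).
W1 = [[2.0, 0.5, 1.0], [-1.0, 1.0, 0.5]]  # hidden layer: 2 units x 3 inputs
B1 = [-1.0, 0.0]
W2 = [1.5, 0.5]                            # output layer: 1 unit x 2 hidden
B2 = -1.0


def score(vec: Sequence[float]) -> float:
    """Return a value in (0, 1) read as the likelihood that speech is system-directed."""
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, vec)) + b)  # ReLU activation
        for row, b in zip(W1, B1)
    ]
    logit = sum(w * h for w, h in zip(W2, hidden)) + B2
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid


def is_system_directed(f: UtteranceFeatures, threshold: float = 0.5) -> bool:
    """Indicator: True when the score is above the (assumed) threshold."""
    return score(to_feature_vector(f)) > threshold
```

With these placeholder weights, a confident five-word hypothesis from the same speaker passes the filter, while a low-confidence two-word hypothesis from a different speaker is filtered out.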

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method executed by a system, the method comprising: receiving, from a device, first input audio data corresponding to a first utterance; performing speech processing on the first input audio data to determine a first command; determining first output data responsive to the first command; sending, to the device, the first output data; instructing the device to send second input audio data corresponding to second input audio without the device determining a presence of a wakeword in the second input audio data; receiving, from the device, the second input audio data; processing the second input audio data to determine that the second input audio data represents voice activity; performing automatic speech recognition (ASR) on the second input audio data to determine ASR results, wherein the ASR result data comprises a first portion of partial ASR results and a second portion of partial ASR results; creating a feature vector, using the first portion of partial ASR results, the feature vector representing at least characteristics of the ASR results; at least partially in parallel with performing ASR on the second input audio data, processing the feature vector using a deep neural network (DNN) to determine a score corresponding to a likelihood that the second input audio data represents speech intended for processing by the system; determining the score is above a threshold; performing natural language understanding (NLU) using the ASR results to determine a second command; determining second output data responsive to the second command; and sending, to the device, the second output data.

2. The computer-implemented method of claim 1, wherein processing the second input audio data to determine that the second input audio data represents voice activity comprises: determining acoustic data corresponding to the first input audio data; determining a second threshold based at least in part on the acoustic data; processing the second input audio data using a second DNN to determine a second score corresponding to a likelihood that the second input audio data represents the voice activity; and determining the second score is above the second threshold.

3. The computer-implemented method of claim 1, further comprising: processing at least the first input audio data to determine an indicator of an identity of a first user whose voice is detected using the first input audio data; processing the second input audio data and the indicator to determine that a voice of the first user is detected using the second input audio data; and determine a second indicator indicating that a same user's voice is detected in both the first input audio data and the second input audio data, wherein creating the feature vector further uses the second indicator.

4. A system comprising: at least one processor; and at least one memory including instructions that, when executed by the at least one processor, cause the system to: receive first input audio data from a device; perform speech processing on the first input audio data to determine a first command; cause the first command to be executed; receive second input audio data from the device; process the second input audio data to determine the second input audio data represents speech; after determining that the second input audio data represents speech, perform automatic speech recognition (ASR) on the second input audio data to determine ASR result data, wherein the ASR result data comprises a first portion of partial ASR result data and a second portion of partial ASR result data; at least partially in parallel with performing ASR on the second input audio data, determine, using the first portion of partial ASR result data and data corresponding to the first input audio data, that the second input audio data is intended for further processing; perform natural language understanding on the ASR result data to determine NLU result data; and cause a second command to be executed using the NLU results.

5. The system of claim 4, wherein the instructions, when executed by the at least one processor, further cause the system to: process the second input audio data to determine at least one feature vector corresponding to the second input audio data, wherein the instructions to process the second input audio data to determine the second input audio data represents speech comprise instructions to: process the at least one feature vector using a deep neural network to determine a score, and determine the score is above a threshold.

6. The system of claim 5, wherein the instructions, when executed by the at least one processor, further cause the system to: determine acoustic data corresponding to the first input audio data; and determine the threshold based at least in part on the acoustic data.

7. The system of claim 4, wherein the instructions, when executed by the at least one processor, further cause the system to: process the ASR result data to determine a feature vector, wherein the instructions to determine, using the ASR result data, that the second input audio data corresponds to speech intended for further processing comprise instructions to: process the feature vector using a deep neural network to determine a score, and determine the score is above a threshold.

8. The system of claim 7, wherein the instructions, when executed by the at least one processor, further cause the system to: process the second input audio data to determine an indicator of an identity of a user speaking, wherein the instructions to process the ASR result data to determine a feature vector further comprise instructions to: process the indicator to determine the feature vector.

9. The system of claim 4, wherein the instructions to perform natural language understanding on the ASR result data to determine NLU result data are configured to be executed after the instructions to determine, using the ASR result data, that the second input audio data corresponds to speech intended for further processing.

10. The system of claim 4, wherein the instructions to determine that the second input audio data corresponds to speech intended for further processing further use the NLU result data.

11. The system of claim 5, wherein the instructions to determine that the second input audio data corresponds to speech intended for further processing is further based at least in part on one or more of user profile data, device history data, a device identifier, direction data, acoustic feature data, dialog history data, or the data corresponding to the first input audio data.

12. A computer-implemented method comprising: receiving first input audio data from a device; performing speech processing on the first input audio data to determine a first command; causing the first command to be executed; receiving second input audio data from the device; processing the second input audio data to determine the second input audio data represents speech; after determining that the second input audio data represents speech, performing automatic speech recognition (ASR) on the second input audio data to determine ASR result data, wherein the ASR result data comprises a first portion of partial ASR result data and a second portion of partial ASR result data; at least partially in parallel with performing ASR on the second input audio data, determining, using the first portion of partial ASR result data and data corresponding to the first input audio data, that the second input audio data is intended for further processing; performing natural language understanding on the ASR result data to de
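As a reading aid for the follow-up-utterance flow recited in claim 1, the ordering of the steps can be sketched as below: voice-activity gate, partial ASR, detector scoring at least partially in parallel with the rest of ASR, threshold check, then NLU. This is an illustrative sketch, not the patented implementation. The function name `handle_follow_up`, its callable parameters, and the 0.5 default threshold are all hypothetical, and a thread pool is just one possible way to overlap the two steps.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


def handle_follow_up(
    audio: bytes,
    has_voice: Callable[[bytes], bool],        # hypothetical voice-activity detector
    asr_first_portion: Callable[[bytes], str], # hypothetical: first partial ASR result
    asr_remainder: Callable[[bytes], str],     # hypothetical: remaining ASR result
    detector_score: Callable[[str], float],    # hypothetical: featurize and score with a model
    nlu: Callable[[str], str],                 # hypothetical: text to command
    threshold: float = 0.5,                    # assumed value, not from the claims
) -> Optional[str]:
    """Return a command when the follow-up audio is system-directed, else None."""
    if not has_voice(audio):
        return None  # no voice activity: skip ASR entirely

    first = asr_first_portion(audio)

    with ThreadPoolExecutor(max_workers=2) as pool:
        # Score the first partial result while ASR continues on the rest.
        score_future = pool.submit(detector_score, first)
        rest_future = pool.submit(asr_remainder, audio)
        score = score_future.result()
        rest = rest_future.result()

    if score <= threshold:
        return None  # treated as non-system-directed speech and filtered out

    return nlu(f"{first} {rest}".strip())
```

For example, with stub callables that recognize "play" and then "some jazz" and a detector that returns 0.9, the function passes "play some jazz" to the NLU stub; with a detector score of 0.1 it returns `None` without calling NLU.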

Assignees

Amazon Tech Inc

Inventors

Classifications

G10L25/30: using neural networks
G10L17/00: Speaker identification or verification techniques
G10L15/22 (Primary): Procedures used during a speech recognition process, e.g. man-machine dialogue
G10L2015/223: Execution procedure of a spoken command
G10L25/78: Detection of presence or absence of voice signals (switching of direction of transmission by voice frequency in two-way loud-speaking telephone systems H04M9/10)

Patent family

Related publications grouped by family.

View patent family 81944318

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11361763B1 cover?

A speech-processing system capable of receiving and processing audio data to determine if the audio data includes speech that was intended for the system. Non-system directed speech may be filtered out while system-directed speech may be selected for further processing. A system-directed speech detector may use a trained machine learning model (such as a deep neural network or the like) to proc…

Who is the assignee on this patent?

Amazon Tech Inc

Related patents

Method and system of automatic speech recognition using posterior confidence scores

Speech recognition method, speech wakeup apparatus, speech recognition apparatus, and terminal

Real-time audio recognition using multiple recognizers
