Detecting system-directed speech

US11361763B1 · US · B1

Patent metadata
Field: Value
Publication number: US-11361763-B1
Application number: US-201715694348-A
Country: US
Kind code: B1
Filing date: Sep 1, 2017
Priority date: Sep 1, 2017
Publication date: Jun 14, 2022
Grant date: Jun 14, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A speech-processing system capable of receiving and processing audio data to determine if the audio data includes speech that was intended for the system. Non-system directed speech may be filtered out while system-directed speech may be selected for further processing. A system-directed speech detector may use a trained machine learning model (such as a deep neural network or the like) to process a feature vector representing a variety of characteristics of the incoming audio data, including the results of automatic speech recognition and/or other data. Using the feature vector the model may output an indicator as to whether the speech is system-directed. The system may also incorporate other filters such as voice activity detection prior to speech recognition, or the like.
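
The gating idea in the abstract can be illustrated with a minimal sketch: build a feature vector from ASR output, score it with a small feed-forward network, and compare the score to a threshold. Everything named here is an assumption for illustration: the `AsrResult` type, the three features, the hand-picked weights, and the 0.5 threshold do not come from the patent, and a real detector would use a trained model with many more inputs.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class AsrResult:
    """Hypothetical ASR output: recognized text plus a confidence in [0, 1]."""
    text: str
    confidence: float


def build_feature_vector(asr: AsrResult, same_speaker: bool) -> List[float]:
    """Illustrative features only; the patent does not enumerate these."""
    n_words = len(asr.text.split())
    return [
        asr.confidence,                   # recognizer confidence
        min(n_words, 10) / 10.0,          # utterance length, capped and scaled
        1.0 if same_speaker else 0.0,     # same speaker as the previous turn
    ]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Made-up weights for a one-hidden-layer network (assumption, not trained).
W_HIDDEN = [[2.0, 1.0, 1.5], [-1.0, 0.5, 2.0]]
B_HIDDEN = [-1.0, 0.0]
W_OUT = [2.0, 1.0]
B_OUT = -2.0


def score_system_directed(features: Sequence[float]) -> float:
    """Forward pass: ReLU hidden layer, then a sigmoid output in (0, 1)."""
    hidden = [
        max(0.0, sum(w * f for w, f in zip(row, features)) + b)
        for row, b in zip(W_HIDDEN, B_HIDDEN)
    ]
    return sigmoid(sum(w * h for w, h in zip(W_OUT, hidden)) + B_OUT)


def is_system_directed(asr: AsrResult, same_speaker: bool,
                       threshold: float = 0.5) -> bool:
    """Keep the utterance only if the score is above the threshold."""
    return score_system_directed(build_feature_vector(asr, same_speaker)) > threshold
```

With these toy weights, a confident multi-word result from the same speaker passes the gate, while a low-confidence single word from a different speaker is filtered out. The weights only demonstrate the mechanics of scoring and thresholding.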

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method executed by a system, the method comprising: receiving, from a device, first input audio data corresponding to a first utterance; performing speech processing on the first input audio data to determine a first command; determining first output data responsive to the first command; sending, to the device, the first output data; instructing the device to send second input audio data corresponding to second input audio without the device determining a presence of a wakeword in the second input audio data; receiving, from the device, the second input audio data; processing the second input audio data to determine that the second input audio data represents voice activity; performing automatic speech recognition (ASR) on the second input audio data to determine ASR results, wherein the ASR result data comprises a first portion of partial ASR results and a second portion of partial ASR results; creating a feature vector, using the first portion of partial ASR results, the feature vector representing at least characteristics of the ASR results; at least partially in parallel with performing ASR on the second input audio data, processing the feature vector using a deep neural network (DNN) to determine a score corresponding to a likelihood that the second input audio data represents speech intended for processing by the system; determining the score is above a threshold; performing natural language understanding (NLU) using the ASR results to determine a second command; determining second output data responsive to the second command; and sending, to the device, the second output data.

2. The computer-implemented method of claim 1, wherein processing the second input audio data to determine that the second input audio data represents voice activity comprises: determining acoustic data corresponding to the first input audio data; determining a second threshold based at least in part on the acoustic data; processing the second input audio data using a second DNN to determine a second score corresponding to a likelihood that the second input audio data represents the voice activity; and determining the second score is above the second threshold.

3. The computer-implemented method of claim 1, further comprising: processing at least the first input audio data to determine an indicator of an identity of a first user whose voice is detected using the first input audio data; processing the second input audio data and the indicator to determine that a voice of the first user is detected using the second input audio data; and determine a second indicator indicating that a same user's voice is detected in both the first input audio data and the second input audio data, wherein creating the feature vector further uses the second indicator.

4. A system comprising: at least one processor; and at least one memory including instructions that, when executed by the at least one processor, cause the system to: receive first input audio data from a device; perform speech processing on the first input audio data to determine a first command; cause the first command to be executed; receive second input audio data from the device; process the second input audio data to determine the second input audio data represents speech; after determining that the second input audio data represents speech, perform automatic speech recognition (ASR) on the second input audio data to determine ASR result data, wherein the ASR result data comprises a first portion of partial ASR result data and a second portion of partial ASR result data; at least partially in parallel with performing ASR on the second input audio data, determine, using the first portion of partial ASR result data and data corresponding to the first input audio data, that the second input audio data is intended for further processing; perform natural language understanding on the ASR result data to determine NLU result data; and cause a second command to be executed using the NLU results.

5. The system of claim 4, wherein the instructions, when executed by the at least one processor, further cause the system to: process the second input audio data to determine at least one feature vector corresponding to the second input audio data, wherein the instructions to process the second input audio data to determine the second input audio data represents speech comprise instructions to: process the at least one feature vector using a deep neural network to determine a score, and determine the score is above a threshold.

6. The system of claim 5, wherein the instructions, when executed by the at least one processor, further cause the system to: determine acoustic data corresponding to the first input audio data; and determine the threshold based at least in part on the acoustic data.

7. The system of claim 4, wherein the instructions, when executed by the at least one processor, further cause the system to: process the ASR result data to determine a feature vector, wherein the instructions to determine, using the ASR result data, that the second input audio data corresponds to speech intended for further processing comprise instructions to: process the feature vector using a deep neural network to determine a score, and determine the score is above a threshold.

8. The system of claim 7, wherein the instructions, when executed by the at least one processor, further cause the system to: process the second input audio data to determine an indicator of an identity of a user speaking, wherein the instructions to process the ASR result data to determine a feature vector further comprise instructions to: process the indicator to determine the feature vector.

9. The system of claim 4, wherein the instructions to perform natural language understanding on the ASR result data to determine NLU result data are configured to be executed after the instructions to determine, using the ASR result data, that the second input audio data corresponds to speech intended for further processing.

10. The system of claim 4, wherein the instructions to determine that the second input audio data corresponds to speech intended for further processing further use the NLU result data.

11. The system of claim 5, wherein the instructions to determine that the second input audio data corresponds to speech intended for further processing is further based at least in part on one or more of user profile data, device history data, a device identifier, direction data, acoustic feature data, dialog history data, or the data corresponding to the first input audio data.

12. A computer-implemented method comprising: receiving first input audio data from a device; performing speech processing on the first input audio data to determine a first command; causing the first command to be executed; receiving second input audio data from the device; processing the second input audio data to determine the second input audio data represents speech; after determining that the second input audio data represents speech, performing automatic speech recognition (ASR) on the second input audio data to determine ASR result data, wherein the ASR result data comprises a first portion of partial ASR result data and a second portion of partial ASR result data; at least partially in parallel with performing ASR on the second input audio data, determining, using the first portion of partial ASR result data and data corresponding to the first input audio data, that the second input audio data is intended for further processing; performing natural language understanding on the ASR result data to de
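
The follow-up flow recited in the independent claims (score the first portion of partial ASR results while recognition is still running, then run NLU only if the score is above a threshold) can be sketched roughly as follows. This is an illustration under stated assumptions, not the claimed implementation: `stream_partial_asr`, `detector_score`, `handle_follow_up`, and the `COMMAND_WORDS` set are hypothetical names, the scorer is a keyword heuristic swapped in where the claims recite a deep neural network, and the "NLU" step is a placeholder string.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, List, Optional

# Hypothetical vocabulary used by the placeholder scorer below.
COMMAND_WORDS = {"play", "stop", "set", "what", "turn"}


def stream_partial_asr(audio_chunks: List[str]) -> Iterator[str]:
    """Stand-in for a streaming recognizer: one partial result per chunk."""
    for chunk in audio_chunks:
        yield chunk


def detector_score(first_partial: str, prior_command: str) -> float:
    """Placeholder for the DNN: a keyword heuristic, not a trained model.

    Uses the first partial ASR result and data from the first turn.
    """
    words = first_partial.lower().split()
    base = 0.7 if words and words[0] in COMMAND_WORDS else 0.2
    bonus = 0.1 if prior_command else 0.0
    return min(1.0, base + bonus)


def handle_follow_up(audio_chunks: List[str], prior_command: str,
                     threshold: float = 0.5) -> Optional[str]:
    """Return a placeholder NLU result, or None if the speech is filtered out."""
    partials = stream_partial_asr(audio_chunks)
    first = next(partials, None)
    if first is None:
        return None  # nothing was recognized
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Score the first partial while the remaining ASR output is consumed.
        future = pool.submit(detector_score, first, prior_command)
        rest = list(partials)
        score = future.result()
    if score <= threshold:
        return None  # treated as non-system-directed speech
    transcript = " ".join([first] + rest)
    return f"NLU({transcript})"
```

Running the detector on the first partial result, rather than waiting for the full transcript, lets the decision be ready by the time recognition finishes, so NLU can be skipped for speech that was not meant for the system.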

Assignees

Amazon Tech Inc

Inventors

Classifications

  • using neural networks · CPC title

  • Speaker identification or verification techniques · CPC title

  • G10L15/22 (Primary)

    Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title

  • Execution procedure of a spoken command · CPC title

  • Detection of presence or absence of voice signals (switching of direction of transmission by voice frequency in two-way loud-speaking telephone systems H04M9/10) · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11361763B1 cover?
A speech-processing system capable of receiving and processing audio data to determine if the audio data includes speech that was intended for the system. Non-system directed speech may be filtered out while system-directed speech may be selected for further processing. A system-directed speech detector may use a trained machine learning model (such as a deep neural network or the like) to proc…
Who is the assignee on this patent?
Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification G10L15/22. Mapped technology areas include Physics.
When was this patent published?
Publication date Jun 14, 2022 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).