Device selection from audio data

US10685669B1 · US · B1

Patent metadata
Field: Value
Publication number: US-10685669-B1
Application number: US-201815926507-A
Country: US
Kind code: B1
Filing date: Mar 20, 2018
Priority date: Mar 20, 2018
Publication date: Jun 16, 2020
Grant date: Jun 16, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

This disclosure describes techniques for identifying a voice-enabled device from a group of voice-enabled devices to respond to a speech utterance of a user. A speech-processing system may receive an audio signal representing the speech utterance captured in an environment of a voice-enabled device, and identify another voice-enabled device located in the environment. The system may analyze the audio signal using a different natural-language-understanding model for each of the voice-enabled devices to identify an intent for each of the voice-enabled devices to respond to the speech utterance. The system may determine confidence scores that the intents are responsive to the speech utterance, and select the intent with the highest confidence score. The system may use the selected intent to generate a command for the corresponding voice-enabled device to respond to the user.
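The selection flow in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the keyword-based "models", the device names, the intent names, and the confidence values are all hypothetical stand-ins for the per-device NLU models and scores the abstract mentions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Intent:
    name: str          # hypothetical intent label
    confidence: float  # score that the intent is responsive to the utterance


# Assumption: an NLU model is any callable mapping utterance text to an Intent.
NluModel = Callable[[str], Intent]


def speaker_nlu(text: str) -> Intent:
    """Toy model for a device that outputs sound (illustrative only)."""
    if "volume" in text or "play" in text:
        return Intent("AdjustPlayback", 0.9)
    return Intent("Unknown", 0.1)


def light_nlu(text: str) -> Intent:
    """Toy model for a device that controls lighting (illustrative only)."""
    if "light" in text or "dim" in text:
        return Intent("SetBrightness", 0.85)
    return Intent("Unknown", 0.1)


def select_device(text: str, models: Dict[str, NluModel]) -> Tuple[str, Intent]:
    """Run each device's own model on the same text and keep the top score."""
    scored = {device_id: model(text) for device_id, model in models.items()}
    best = max(scored, key=lambda device_id: scored[device_id].confidence)
    return best, scored[best]


def build_command(device_id: str, intent: Intent) -> dict:
    """Turn the selected intent into command data for the chosen device."""
    return {"target": device_id, "action": intent.name}


models = {"speaker": speaker_nlu, "lamp": light_nlu}
device, intent = select_device("turn up the volume", models)
command = build_command(device, intent)
```

Note that the device that captured the audio need not be the one that responds: the utterance text is scored against every co-located device's model, and the command goes to whichever device produced the highest-confidence intent.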

First claim

Opening claim text (preview).

What is claimed is:

1. A system comprising: one or more processors; computer-readable media storing computer-executable instructions that, when executed by the one or more processors, cause the one or more processors to perform operations comprising: receiving, from a first voice-enabled device, audio data representing a speech utterance; generating, using automatic speech recognition (ASR) processing and the audio data, text data representing the speech utterance; determining a first device profile of the first voice-enabled device; determining the first voice-enabled device and a second voice-enabled device are located in a same physical environment; determining a second device profile of the second voice-enabled device; determining, using a first natural-language-understanding (NLU) model and the text data, first intent data representing the speech utterance, wherein the first NLU model is associated with the first device profile; determining, using a second NLU model and the text data, second intent data representing the speech utterance, wherein the second NLU model is associated with the second device profile; determining a first confidence score that the speech utterance corresponds to the first intent data; determining a second confidence score that the speech utterance corresponds to the second intent data; determining that the second confidence score is greater than the first confidence score; based at least in part on the second confidence score being greater than the first confidence score, using the second intent data to determine a command to cause the second voice-enabled device to perform an action; and sending, to the second voice-enabled device, command data indicating the command.

2. The system of claim 1, the operations further comprising: identifying first device-state data associated with the first voice-enabled device, wherein the first device-state data indicates that a first device state of the first voice-enabled device is idle; identifying second device-state data associated with the second voice-enabled device, wherein the second device-state data indicates that a second device state of the second voice-enabled device is outputting sound using a speaker associated with the second voice-enabled device, and wherein: determining the first confidence score includes determining that the first intent data corresponds to a first action that the first voice-enabled device is unable to perform in the first device state; and determining the second confidence score includes determining that the second intent data corresponds to a second action that the second voice-enabled device is able to perform in the second device state.

3. The system of claim 1, wherein: the first NLU model comprises a first machine-learning model trained to determine that the first intent data corresponds to the text data, wherein the first intent data is associated with a first device capability of the first voice-enabled device; and the second NLU model comprises a second machine-learning model trained to determine that the second intent data corresponds to the text data, wherein the second intent data is associated with a second device capability of the second voice-enabled device, and wherein the first device capability is different than the second device capability.

4. The system of claim 1, wherein the audio data comprises first audio data, and the operations further comprising, prior to receiving the first audio data: receiving, from the first voice-enabled device, second audio data representing first sound captured by one or more microphones of the first voice-enabled device; receiving, from the second voice-enabled device, third audio data representing second sound captured by one or more microphones of the second voice-enabled device; determining the second audio data was received within a threshold period of time of when the third audio data was received; and based at least in part on the second audio data and the third audio being received within the threshold period of time, generating an association between the first device profile and the second device profile indicating that the first voice-enabled device is in the same physical environment as the second voice-enabled device.

5. A method comprising: receiving audio data from a first device in an environment, the audio data representing a speech utterance; generating, using automatic speech recognition (ASR) processing and the audio data, text data representing the speech utterance; determining that a second device is in the environment; determining, using a first natural-language-understanding (NLU) model and the text data, first intent data representing the speech utterance, wherein the first NLU model is associated with the first device; determining, using a second NLU model and the text data, second intent data representing the speech utterance, wherein the second NLU model is associated with the second device; selecting the second intent data instead of the first intent data; using the second intent data to determine a command to cause the second device to perform an action; and sending, to the second device, command data indicating the command.

6. The method of claim 5, further comprising: identifying first device-state data associated with the first device, wherein the first device-state data indicates a first device state of the first device; determining a first confidence score that the speech utterance corresponds to the first intent data by determining that the first intent data corresponds to a first action that the first device is unable to perform in the first device state; identifying second device-state data associated with the second device, wherein the second device-state data indicates a second device state of the second device; determining a second confidence score that the speech utterance corresponds to the second intent data by determining that the second intent data corresponds to a second action that the second device is able to perform in the second device state; and determining that the second confidence score is greater than the first confidence score.

7. The method of claim 5, wherein the audio data comprises first audio data, and the method further comprising: receiving second audio data associated with the second device, the second audio data representing the speech utterance; determining a first signal-to-noise (SNR) value associated with the first audio data; determining a second SNR value associated with the second audio data; and determining a first confidence score that the speech utterance is better represented by the first intent data based at least in part on the first SNR value; determining a second confidence score that the speech utterance is better represented by the second intent data based at least in part on the second SNR value; and determining that the second confidence score is greater than the first confidence score.

8. The method of claim 5, further comprising: identifying a first device profile associated with the first device; determining that the first device profile is associated with the first NLU model, wherein: the first NLU model comprises a first machine-learning model trained to determine that the first intent data corresponds to the text data; and the first intent data is associated with a first device capability of the first device; identifying a second device profile associated with the second device; and determining that the second device profile is associated with the second NLU model, wherein: the second NLU model comprises a second machine-learning model trained to determine that the second intent data corresponds to the text data; and the second intent
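Claim 4 above describes associating two device profiles when audio from both devices arrives within a threshold period of time. A minimal sketch of that check might look like the following; the function name, the device identifiers, the use of arrival timestamps in seconds, and the 0.5 second default are all assumptions made for illustration, since the claim text shown here does not specify a threshold value.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Set


def associate_devices(
    arrival_times: Dict[str, float],
    threshold_s: float = 0.5,  # assumed value; not given in the claim text
) -> Set[FrozenSet[str]]:
    """Pair up devices whose audio arrived within threshold_s of each other.

    arrival_times maps a (hypothetical) device ID to the time, in seconds,
    at which audio data from that device was received.
    """
    associations: Set[FrozenSet[str]] = set()
    for (dev_a, t_a), (dev_b, t_b) in combinations(arrival_times.items(), 2):
        if abs(t_a - t_b) <= threshold_s:
            # Treat the two devices as sharing a physical environment.
            associations.add(frozenset((dev_a, dev_b)))
    return associations


arrivals = {"kitchen": 10.0, "living_room": 10.2, "garage": 14.0}
pairs = associate_devices(arrivals)
```

Here the kitchen and living-room devices are associated because their audio arrived 0.2 seconds apart, while the garage device is left unassociated.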

Assignees

Amazon Tech Inc

Inventors

Classifications

  • Audio in a user interface, e.g. using voice commands for navigating, audio feedback

  • G10L25/51 (Primary): for comparison or discrimination

  • the extracted parameters being power information

  • characterised by the type of extracted parameters

  • Procedures used during a speech recognition process, e.g. man-machine dialogue

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10685669B1 cover?
It covers techniques for choosing which voice-enabled device, out of several in the same environment, should respond to a user's speech utterance. The system analyzes the utterance with a different natural-language-understanding model for each device, scores the resulting intents, and sends a command to the device whose intent has the highest confidence score.
Who is the assignee on this patent?
Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification G10L25/51. Mapped technology areas include Physics.
When was this patent published?
Publication date Jun 16, 2020 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).