Dialog management for multiple users

US11908468B2 · US · B2

Patent metadata
Field · Value
Publication number · US-11908468-B2
Application number · US-202017112520-A
Country · US
Kind code · B2
Filing date · Dec 4, 2020
Priority date · Sep 21, 2020
Publication date · Feb 20, 2024
Grant date · Feb 20, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system that is capable of resolving anaphora using timing data received by a local device. A local device outputs audio representing a list of entries. The audio may represent synthesized speech of the list of entries. A user can interrupt the device to select an entry in the list, such as by saying “that one.” The local device can determine an offset time representing the time between when audio playback began and when the user interrupted. The local device sends the offset time and audio data representing the utterance to a speech processing system which can then use the offset time and stored data to identify which entry on the list was most recently output by the local device when the user interrupted. The system can then resolve anaphora to match that entry and can perform additional processing based on the referred to item.
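The timing lookup described in the abstract can be illustrated with a minimal Python sketch. It assumes the speech processing system stores, for each list entry, the offset at which that entry's audio begins. The `EntryTiming` record, the `resolve_entry` function, and the sample entry names and timings are hypothetical illustrations and do not come from the patent.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class EntryTiming:
    """Hypothetical stored record: where one list entry starts in the output audio."""
    entry: str
    start_ms: int  # offset from the beginning of playback, in milliseconds


def resolve_entry(timings: Sequence[EntryTiming], offset_ms: int) -> Optional[str]:
    """Return the entry most recently output at offset_ms.

    Assumes `timings` is sorted by start_ms. Returns None when the user
    interrupted before the first entry began playing.
    """
    starts = [t.start_ms for t in timings]
    # Index of the last entry whose start is at or before the offset.
    index = bisect_right(starts, offset_ms) - 1
    return timings[index].entry if index >= 0 else None


# Illustrative data only: three entries and their assumed start offsets.
timings = [
    EntryTiming("Thai Palace", 1200),
    EntryTiming("Luigi's Pizza", 3400),
    EntryTiming("Sushi Go", 5900),
]

# A user says "that one" 4.1 seconds after playback began.
selected = resolve_entry(timings, 4100)
```

In this sketch the anaphor "that one" would resolve to whichever entry `resolve_entry` returns. A real system might also account for reaction delay or audio buffering, which the sketch ignores.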

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method comprising: receiving, by a user device comprising at least one microphone and at least one speaker, output audio data representing synthesized speech of a list of entries; using the at least one speaker, beginning playback of audio corresponding to the output audio data; during the playback of the audio, detecting, by the at least one microphone, user speech; determining input audio data representing the user speech; determining a first time corresponding to the beginning of the playback; determining a second time corresponding to detection of the user speech; determining, by the user device, offset time data representing a difference between the first time and the second time; using a first trained machine learning (ML) model, processing the input audio data to determine that the user speech is system directed; based at least in part on determining that the user speech is system directed, sending, to at least one remote device, the input audio data and the offset time data; performing automatic speech recognition (ASR) processing on the input audio data to determine ASR output data representing a transcript of the input audio data; using a second trained ML model, performing natural language understanding (NLU) processing on the ASR output data to determine NLU output data representing at least an intent corresponding to the user speech; based at least in part on the NLU output data, determining that the user speech refers to an entry that is absent from the user speech; based at least in part on the entry being absent from the user speech, processing the offset time data to determine the entry is a first entry in the list of entries; and causing an action to be performed based at least in part on the first entry.

2. The computer-implemented method of claim 1, further comprising: determining stored data corresponding to the output audio data; determining a start point of the output audio data; determining, using the offset time data and the start point, a first portion of the output audio data; determining the first portion of the output audio data corresponds to a portion of the list of entries representing the first entry; and sending an indication of the first entry to a speech processing component.

3. The computer-implemented method of claim 1, further comprising, prior to performing the ASR processing: receiving input image data corresponding to the input audio data; and processing the input image data using a third trained ML model to determine that the user speech is system directed.

4. The computer-implemented method of claim 1, further comprising: using the offset time data to determine the user speech began after output began of audio representing the first entry in the list of entries but prior to beginning output of audio representing a second entry in the list of entries; and causing dialog data to be stored representing the first entry but not the second entry.

5. A computer-implemented method comprising: causing playback of output audio; detecting, by at least one microphone of at least one user device, input audio representing user speech; determining input audio data representing the user speech; determining, by the at least one user device, time data representing a difference between a beginning of the playback of the output audio and the user speech; using a first trained machine learning (ML) model, processing the input audio data to determine that the user speech is system directed; and based at least in part on determining that the user speech is system directed, sending, to at least one remote device, the input audio data and the time data to determine, using the input audio data and the time data, that the user speech refers to a portion of the output audio.

6. The computer-implemented method of claim 5, wherein determining the time data comprises: determining a first timestamp corresponding to a beginning of the playback; determining a second timestamp corresponding to the input audio; and determining an offset representing a difference between the first timestamp and the second timestamp, wherein the time data represents the offset.

7. The computer-implemented method of claim 5, wherein the output audio corresponds to synthesized speech representing a list of entries.

8. The computer-implemented method of claim 7, further comprising, prior to causing the playback: receiving second input audio corresponding to previous user speech; determining second input audio data representing the previous user speech; sending, to the at least one remote device, the second input audio data; and receiving, from the at least one remote device, output audio data representing the list of entries in response to the second input audio data.

9. The computer-implemented method of claim 5, wherein the output audio corresponds to synthesized speech representing a list of entries and wherein the method further comprises, after sending the input audio data to the at least one remote device: receiving, from the at least one remote device, output audio data representing further information regarding a first entry in the list of entries; and causing playback of second output audio corresponding to the output audio data.

10. The computer-implemented method of claim 5, further comprising: sending, to the at least one remote device, data indicating the input audio data is associated with the time data.

11. The computer-implemented method of claim 5, wherein: the at least one user device comprises a first user device and a second user device; detecting the input audio comprises using the at least one microphone of the first user device; and the time data is determined by the second user device.

12. A computing system comprising: at least one processor; and at least one memory comprising instructions that, when executed by the at least one processor, cause the computing system to: receive, from a user device, input audio data representing an utterance; receive, from the user device, time data representing a difference between detecting, by the user device, of the utterance and a beginning of playback, by the user device, of audio being output; performing automatic speech recognition (ASR) processing on the input audio data to determine ASR output data representing a transcript of the input audio data; using a first trained machine learning (ML) model, performing natural language understanding (NLU) processing on the ASR output data to determine NLU output data representing at least an intent corresponding to the utterance; based at least in part on the NLU output data, the input audio data, and the time data, determine a first selection referred to in the utterance; and cause an action to be performed based at least in part on the first selection.

13. The computing system of claim 12, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the computing system to: based at least in part on the NLU output data, determine the utterance refers to a selection that is not named in the utterance.

14. The computing system of claim 12, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the computing system to: determine the user device is associated with stored data corresponding to the audio and corresponding to a list of entries; and use the stored data and the time data to determine a first entry from the list of entries, the first entry corresponding to
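
The device-side steps recited in claims 1, 5, and 6 (two timestamps, their difference, and a system-directed check before anything is sent) can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: `UtterancePayload`, `build_payload`, and the `is_system_directed` callable are hypothetical names, and the callable merely stands in for the trained ML model the claims mention.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class UtterancePayload:
    """Hypothetical message a device might send to the remote system."""
    audio: bytes
    offset_ms: int  # time between start of playback and detected speech


def build_payload(
    playback_start_ms: int,
    speech_detected_ms: int,
    audio: bytes,
    is_system_directed: Callable[[bytes], bool],
) -> Optional[UtterancePayload]:
    """Compute the offset and return a payload only for system-directed speech.

    `is_system_directed` is a placeholder for a trained classifier; here it
    is any function that maps audio bytes to True or False.
    """
    if speech_detected_ms < playback_start_ms:
        raise ValueError("speech was detected before playback began")
    if not is_system_directed(audio):
        return None  # nothing is sent for speech not aimed at the device
    offset_ms = speech_detected_ms - playback_start_ms
    return UtterancePayload(audio=audio, offset_ms=offset_ms)


# Illustrative values: playback starts at t=10000 ms, speech detected at t=14100 ms.
payload = build_payload(10_000, 14_100, b"\x00\x01", lambda _audio: True)
ignored = build_payload(10_000, 14_100, b"\x00\x01", lambda _audio: False)
```

Computing the offset on the device, as the claims describe, means the remote system does not need to know the device's clock or when playback actually started.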

Assignees

Amazon Tech Inc

Inventors

Classifications

  • G10L15/22 · Primary

    Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title

  • Audio in a user interface, e.g. using voice commands for navigating, audio feedback · CPC title

  • Classification techniques · CPC title

  • Extraction of image or video features · CPC title

  • Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11908468B2 cover?
A system that is capable of resolving anaphora using timing data received by a local device. A local device outputs audio representing a list of entries. The audio may represent synthesized speech of the list of entries. A user can interrupt the device to select an entry in the list, such as by saying “that one.” The local device can determine an offset time representing the time between when a…
Who is the assignee on this patent?
Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification G10L15/22. Mapped technology areas include Physics.
When was this patent published?
Publication date Feb 20, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).