Dialog management for multiple users

US12039975B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12039975-B2
Application number  US-202017112512-A
Country             US
Kind code           B2
Filing date         Dec 4, 2020
Priority date       Sep 21, 2020
Publication date    Jul 16, 2024
Grant date          Jul 16, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A natural language system may be configured to act as a participant in a conversation between two users. The system may determine when a user expression such as speech, a gesture, or the like is directed from one user to the other. The system may process input data related to the expression (such as audio data, input data, language processing result data, conversation context data, etc.) to determine if the system should interject a response to the user-to-user expression. If so, the system may process the input data to determine a response and output it. The system may track that response as part of the data related to the ongoing conversation.
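The flow in the abstract (detect a user-to-user expression, decide whether to interject, respond, then track the response as conversation data) can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the patent: the `Expression` and `Conversation` types, the `addressee` label, and the toy rule in `should_interject` (respond only to user-to-user questions when the system did not speak in the previous turn) merely stand in for the trained components the abstract alludes to.

```python
from dataclasses import dataclass, field


@dataclass
class Expression:
    speaker: str
    addressee: str  # "system" or another user's name (hypothetical label)
    text: str


@dataclass
class Conversation:
    # Ordered (speaker, text) turns, including the system's own responses.
    history: list = field(default_factory=list)


def should_interject(expr: Expression, convo: Conversation) -> bool:
    """Toy decision rule, assumed for illustration only."""
    user_to_user = expr.addressee != "system"
    is_question = expr.text.strip().endswith("?")
    system_spoke_last = bool(convo.history) and convo.history[-1][0] == "system"
    return user_to_user and is_question and not system_spoke_last


def handle(expr: Expression, convo: Conversation):
    # Decide using the context *before* this turn is recorded.
    interject = should_interject(expr, convo)
    convo.history.append((expr.speaker, expr.text))
    if not interject:
        return None
    response = f"(system) Responding to {expr.speaker}: {expr.text}"
    # Track the system's response as part of the ongoing conversation.
    convo.history.append(("system", response))
    return response


convo = Conversation()
first = handle(Expression("alice", "bob", "What time does the movie start?"), convo)
second = handle(Expression("bob", "alice", "Should we leave now?"), convo)
```

In this sketch the first call produces a response and the second returns `None`, because the system's own prior turn is part of the tracked history and the toy rule avoids interjecting twice in a row.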

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method comprising: receiving, by a user device operating in a first mode, first input audio data representing a first utterance initiated by a wakeword; processing the first utterance to determine a command to operate in a second mode corresponding to system participation in a conversation between at least two users; beginning operation in the second mode; receiving second input audio data representing a second utterance spoken by a first user during a first conversation between the first user and a second user; receiving first input image data representing the first user speaking the second utterance; based on the second input audio data and the first input image data, determining the first user is speaking the second utterance to the second user; determining dialog data corresponding to third input audio data representing a previous utterance spoken during the first conversation; in response to determining the first user is speaking the second utterance to the second user and in response to the operation in the second mode, using a first component to determine an output responsive to the second utterance is to be generated, wherein determining the output is to be generated is based at least in part on the dialog data; in response to determining the output responsive to the second utterance is to be generated, processing the second input audio data to determine first output data responsive to the second utterance; and causing the user device to present the first output data.

2. The computer-implemented method of claim 1, further comprising: processing the second input audio data to determine speech processing result data; and determining user profile data corresponding to the first user, wherein using the first component to determine the output responsive to the second utterance is to be generated comprises using the speech processing result data, the user profile data, the dialog data and the first component.

3. The computer-implemented method of claim 1, further comprising: processing the dialog data and the second input audio data to determine the second utterance refers to an entity represented in the dialog data; determining the first output data based at least in part on the entity; and storing data representing the first output data as part of second dialog data.

4. The computer-implemented method of claim 1, further comprising: receiving second input image data to determine the second user performed a gesture directed at the first user; processing encoded data corresponding to the second input image data using the first component to determine an output responsive to the gesture is to be generated; in response to determining the output responsive to the gesture is to be generated, processing the second input image data to determine second output data; and causing the user device to present the second output data.

5. A computer-implemented method comprising: receiving first input audio data representing first audio captured by a first device during a conversation including a first user and a second user, the first audio corresponding to first speech of the first user; receiving input image data corresponding to the first input audio data; based on the first input audio data and the input image data, determining the first speech is directed from the first user to the second user; determining first data corresponding to second input audio data representing second speech previously spoken during the conversation; in response to determining the first speech is directed from the first user to the second user, using a first component to determine an output responsive to the first speech is to be generated, wherein determining the output is to be generated is based at least in part on the first data; processing the first input audio data to determine output data; and causing the first device to present the output data.

6. The computer-implemented method of claim 5, further comprising: determining user profile data corresponding to at least one of the first user and the second user; and processing at least a first portion of the user profile data using the first component to determine the output responsive to the first speech is to be generated.

7. The computer-implemented method of claim 6, further comprising: processing at least a second portion of the user profile data to determine the output data.

8. The computer-implemented method of claim 5, further comprising: performing speech processing using the first input audio data to determine speech processing result data; and determining that the speech processing result data corresponds to an actionable command performable by a system, wherein the first component uses data representing that the speech processing result data corresponds to the actionable command in determining the output responsive to the first speech is to be generated.

9. The computer-implemented method of claim 5, further comprising: receiving time data corresponding to the first input audio data; processing the time data using the first component to determine the output responsive to the first speech is to be generated; and processing the time data to determine timing of presentation of the output data.

10. The computer-implemented method of claim 5, further comprising: processing the first data and the first input audio data to determine the first speech refers to an entity represented in the first data; and determining the output data based at least in part on the entity.

11. The computer-implemented method of claim 5, further comprising: storing updated first data representing the output data.

12. The computer-implemented method of claim 5, wherein determining the first speech is directed from the first user to the second user comprises: using the first input audio data, the input image data, and a second component to determine second output data; and processing the second output data to determine the first speech is directed from the first user to the second user.

13. The computer-implemented method of claim 5, wherein the input image data includes a representation of at least one of the first user or the second user.

14. A computing system comprising: at least one processor; and at least one memory comprising instructions that, when executed by the at least one processor, cause the computing system to: receive first input audio data representing first audio captured by a first device during a conversation including a first user and a second user, the first audio corresponding to first speech of the first user; receive input image data corresponding to the first input audio data; based on the first input audio data and the input image data, determine the first speech is directed from the first user to the second user; determine first data corresponding to second input audio data representing second speech spoken during the conversation; in response to determining the first speech is directed from the first user to the second user, use a first component to determine an output responsive to the first speech is to be generated, wherein determining the output is to be generated is based at least in part on the first data; process the first input audio data to determine output data; and cause the first device to present the output data.

15. The computing system of claim 14, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the computing system to: determine user profile data correspon
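The two operating modes recited in claim 1 can be pictured as a small state machine. The following Python sketch is an illustration under stated assumptions, not the claimed implementation: the wakeword `"computer"`, the command phrase, the `Device` class, and the word-overlap check in `_refers_to_dialog` are all hypothetical, and the boolean `directed_at_other_user` stands in for the audio-plus-image determination that the claim assigns to the system.

```python
from enum import Enum


class Mode(Enum):
    WAKEWORD = 1     # "first mode": only wakeword-initiated utterances are handled
    PARTICIPANT = 2  # "second mode": system may respond to user-to-user speech


WAKEWORD = "computer"                   # hypothetical wakeword
JOIN_COMMAND = "join our conversation"  # hypothetical command phrase


def content_words(text):
    """Lowercased words longer than three characters, punctuation stripped."""
    words = (w.strip(".,?!").lower() for w in text.split())
    return {w for w in words if len(w) > 3}


class Device:
    def __init__(self):
        self.mode = Mode.WAKEWORD
        self.dialog = []  # word sets of previous utterances (toy dialog data)

    def on_utterance(self, text, directed_at_other_user=False):
        lowered = text.lower()
        if self.mode is Mode.WAKEWORD:
            # In the first mode, only a wakeword-initiated command is acted on.
            if lowered.startswith(WAKEWORD) and JOIN_COMMAND in lowered:
                self.mode = Mode.PARTICIPANT
                return "Okay, I'm listening in."
            return None
        # Second mode: respond only to user-to-user speech that connects
        # to something said earlier in the conversation.
        respond = directed_at_other_user and self._refers_to_dialog(text)
        self.dialog.append(content_words(text))
        return f"About that: {text}" if respond else None

    def _refers_to_dialog(self, text):
        words = content_words(text)
        return any(words & previous for previous in self.dialog)


device = Device()
ignored = device.on_utterance("What about pizza tonight?", True)
joined = device.on_utterance("Computer, join our conversation")
no_context = device.on_utterance("I want pizza tonight", True)
with_context = device.on_utterance("Which pizza place is open?", True)
```

Here `ignored` and `no_context` are `None`, while `with_context` is a response, because the shared word "pizza" links the utterance to the stored dialog data. A real system would replace the word-overlap test with the "first component" the claims describe.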

Assignees

Inventors

Classifications

  • Classification techniques · CPC title

  • Feature extraction for speech recognition; Selection of recognition unit · CPC title

  • Extraction of image or video features · CPC title

  • Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands · CPC title

  • Barge in, i.e. overridable guidance for interrupting prompts · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12039975B2 cover?
A natural language system may be configured to act as a participant in a conversation between two users. The system may determine when a user expression such as speech, a gesture, or the like is directed from one user to the other. The system may process input data related to the expression (such as audio data, input data, language processing result data, conversation context data, etc.) to det…
Who is the assignee on this patent?
Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification G10L15/22. Mapped technology areas include Physics.
When was this patent published?
Publication date Jul 16, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).