Multimodal entity and coreference resolution for assistant systems

US11966986B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11966986-B2
Application number	US-202217878778-A
Country	US
Kind code	B2
Filing date	Aug 1, 2022
Priority date	Oct 18, 2019
Publication date	Apr 23, 2024
Grant date	Apr 23, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In one embodiment, a method includes receiving, at a client system, an audio input, where the audio input comprises a coreference to a target object, accessing visual data from one or more cameras associated with the client system, where the visual data comprises images portraying one or more objects, resolving the coreference to the target object from among the one or more objects, resolving the target object to a specific entity, and providing, at the client system, a response to the audio input, where the response comprises information about the specific entity.
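The flow in the abstract can be illustrated with a minimal sketch. This is not the patented implementation: the `DetectedObject` type, the `ENTITY_BY_LABEL` lookup, the word list, and the "closest to the center of the field of view" rule are all assumptions chosen for illustration (one dependent claim mentions position within the field of view, but the page gives no concrete rule).

```python
from dataclasses import dataclass

# Hypothetical words treated as coreferences in the spoken query.
COREFERENCE_WORDS = {"this", "that", "it", "these", "those"}

# Hypothetical stand-in for an entity lookup such as a knowledge graph.
ENTITY_BY_LABEL = {
    "mug": "ceramic coffee mug",
    "laptop": "13-inch laptop",
}


@dataclass(frozen=True)
class DetectedObject:
    object_id: int
    label: str
    center_x: float  # normalized 0..1 across the camera frame
    center_y: float  # normalized 0..1 down the camera frame


def has_coreference(utterance: str) -> bool:
    """Return True if any word in the utterance is a coreference word."""
    words = (w.strip("?.!,") for w in utterance.lower().split())
    return any(w in COREFERENCE_WORDS for w in words)


def resolve_coreference(objects):
    """Assumed rule: pick the object nearest the center of the frame."""
    if not objects:
        return None
    return min(
        objects,
        key=lambda o: (o.center_x - 0.5) ** 2 + (o.center_y - 0.5) ** 2,
    )


def respond(utterance: str, objects) -> str:
    """Resolve the coreference to an object, then the object to an entity."""
    if not has_coreference(utterance):
        return "No coreference found in the request."
    target = resolve_coreference(objects)
    if target is None:
        return "I could not find an object to match that request."
    entity = ENTITY_BY_LABEL.get(target.label, target.label)
    return f"That looks like a {entity}."


objects = [
    DetectedObject(1, "laptop", 0.10, 0.20),
    DetectedObject(2, "mug", 0.55, 0.50),
]
print(respond("What is that?", objects))
```

Here the mug sits nearest the frame center, so the word "that" resolves to it and the reply names the mug's entity.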

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising, by a client system: receiving, at the client system, a user query comprising an audio input from a user, wherein the audio input comprises a coreference to a target object, and wherein the user is associated with a current context; accessing, responsive to receiving the audio input from the user, visual data from one or more camera associated with the client system; analyzing, by a scene understanding engine, the visual data to identify a plurality of objects portrayed in the visual data; resolving the coreference to the target object from among the plurality of objects by identifying the target object from among the plurality of identified objects portrayed in the visual data; resolving the target object to a specific entity from a plurality of selected entities corresponding to the plurality of objects, wherein the plurality of selected entities are selected based on a respective recency of each of the selected entities and a respective correlation of each of the selected entities to the current context associated with the user; and providing, at the client system, a response to the user query, wherein the response comprises information about the specific entity, and wherein the response is in one or more modalities determined based on device capabilities of the client system.

2. The method of claim 1, further comprising: accessing a knowledge graph; and retrieving attribute information about the specific entity from the knowledge graph.

3. The method of claim 1, further comprising: analyzing, by a computer vision module, the visual data to identify the plurality of objects portrayed in the images; parsing, by a natural-language understanding (NLU) module, an intent of the audio input and the coreference to the target object to one of the plurality of objects portrayed in the images; and updating a dialog state to include the identified objects and the coreference to the target object.

4. The method of claim 3, further comprising: classifying the intent of one or more requests from one or more pre-defined taxonomies of semantic intentions; and generating a confidence score corresponding to the intent of the one or more requests.

5. The method of claim 3, further comprising: receiving, at the client system, gesture or gaze information; and updating the dialog state to include the received gesture or gaze information.

6. The method of claim 3, wherein resolving the coreference to the target object from among the plurality of objects comprises combining additional information with the dialog state.

7. The method of claim 1, further comprising: assigning respective object identifiers to the plurality of objects portrayed in the images; and storing one or more of the object identifiers as entities in a dialog state tracker.

8. The method of claim 1, wherein one or more of the objects portrayed in the images data are virtual objects in a virtual reality environment.

9. The method of claim 1, wherein identifying the target object from among the identified objects is based on its position within a field of view of the visual data.

10. One or more computer-readable non-transitory storage media comprising instructions executable by a processor to: receive, at a client system, a user query comprising an audio input from a user, wherein the audio input comprises a coreference to a target object, and wherein the user is associated with a current context; access, responsive to receiving the audio input from the user, visual data from one or more camera associated with the client system; analyze, by a scene understanding engine, the visual data to identify a plurality of objects portrayed in the visual data; resolve the coreference to the target object from among the plurality of objects by identifying the target object from among the plurality of identified objects portrayed in the visual data; resolve the target object to a specific entity from a plurality of selected entities corresponding to the plurality of objects, wherein the plurality of selected entities are selected based on a respective recency of each of the selected entities and a respective correlation of each of the selected entities to the current context associated with the user; and provide, at the client system, a response to the user query, wherein the response comprises information about the specific entity, and wherein the response is in one or more modalities determined based on device capabilities of the client system.

11. The media of claim 10, wherein the instructions are further executable by the processor to: access a knowledge graph; and retrieve attribute information about the specific entity from the knowledge graph.

12. The media of claim 10, wherein the instructions are further executable by the processor to: analyze, by a computer vision module, the visual data to identify the plurality of objects portrayed in the images; parse, by a natural-language understanding (NLU) module, an intent of the audio input and the coreference to the target object to one of the plurality of objects portrayed in the images; and update a dialog state to include the identified objects and the coreferences to the target object.

13. The media of claim 12, wherein the instructions are further executable by the processor to: classify the intent of one or more requests from one or more pre-defined taxonomies of semantic intentions; and generate a confidence score corresponding to the intent of the one or more requests.

14. The media of claim 12, wherein the instructions are further executable by the processor to: receive, at the client system, gesture or gaze information; and update the dialog state to include the received gesture or gaze information.

15. The media of claim 12, wherein resolving the coreference to the target object from among the plurality of objects comprises combining additional information with the dialog state.

16. The media of claim 10, wherein the instructions are further executable by the processor to: assign respective object identifiers to the plurality of objects portrayed in the images; and store one or more of the object identifiers as entities in a dialog state tracker.

17. The media of claim 10, wherein one or more of the objects portrayed in the images data are virtual objects in a virtual reality environment.

18. A client system comprising: one or more processors; and a non-transitory memory coupled to the processors comprising instructions executable by the processors, the processors operable when executing the instructions to: receive, at the client system, a user query comprising an audio input from a user, wherein the audio input comprises a coreference to a target object, and wherein the user is associated with a current context; access, responsive to receiving the audio input from the user, visual data from one or more camera associated with the client system; analyze, by a scene understanding engine, the visual data to identify a plurality of objects portrayed in the visual data; resolve the coreference to the target object from among the plurality of objects by identifying the target object from among the plurality of identified objects portrayed in the visual data; resolve the target object to a specific entity from a plurality of selected entities corresponding to the plurality of objects, wherein the plurality of selected entities are selected based on a respective recency of each of the selected entities and a respective correlation of each of the sele
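Claim 1 says the candidate entities are selected by recency and by correlation to the user's current context, and that the response modalities depend on device capabilities. The claim text does not give formulas, so the following is only a sketch under assumed choices: a half-life decay for recency, Jaccard overlap of tags for context correlation, and an equal-weight sum. Every name and parameter here (`CandidateEntity`, `half_life_s`, `recency_weight`, the capability strings) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CandidateEntity:
    name: str
    last_seen_s: float  # seconds since the entity was last seen or mentioned
    tags: set = field(default_factory=set)


def recency_score(last_seen_s: float, half_life_s: float = 60.0) -> float:
    """Assumed decay: the score halves every half_life_s seconds."""
    return 0.5 ** (last_seen_s / half_life_s)


def context_correlation(tags: set, context_tags: set) -> float:
    """Assumed measure: Jaccard overlap between entity and context tags."""
    union = tags | context_tags
    return len(tags & context_tags) / len(union) if union else 0.0


def select_entities(candidates, context_tags, top_k=2, recency_weight=0.5):
    """Keep the top_k candidates by a weighted sum of the two scores."""
    def score(c: CandidateEntity) -> float:
        return (
            recency_weight * recency_score(c.last_seen_s)
            + (1.0 - recency_weight) * context_correlation(c.tags, context_tags)
        )
    return sorted(candidates, key=score, reverse=True)[:top_k]


def choose_modalities(device_capabilities: set) -> list:
    """Return the response modalities the device reports it supports."""
    preferred_order = ("audio", "display")
    return [m for m in preferred_order if m in device_capabilities]


candidates = [
    CandidateEntity("mug", 10, {"kitchen", "drink"}),
    CandidateEntity("poster", 600, {"art"}),
    CandidateEntity("kettle", 120, {"kitchen"}),
]
selected = select_entities(candidates, context_tags={"kitchen"})
print([c.name for c in selected])
print(choose_modalities({"audio"}))
```

With a "kitchen" context, the recently seen mug and the strongly matching kettle outrank the stale, unrelated poster. A different weighting or decay would change the ordering, which is why these parameters are left explicit.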

Assignees

Meta Platforms Inc

Inventors

Classifications

G06Q10/40 · Primary
Business processes related to social networking or social networking services · CPC title
G06N3/09
Supervised learning · CPC title
G06N3/098
Distributed learning, e.g. federated learning · CPC title
G06Q10/1093
Calendar-based scheduling for persons or groups · CPC title
G10L2015/223
Execution procedure of a spoken command · CPC title

Patent family

Related publications grouped by family.

View patent family 75490741

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11966986B2 cover?: In one embodiment, a method includes receiving, at a client system, an audio input, where the audio input comprises a coreference to a target object, accessing visual data from one or more cameras associated with the client system, where the visual data comprises images portraying one or more objects, resolving the coreference to the target object from among the one or more objects, resolving the…
Who is the assignee on this patent?: Meta Platforms Inc
What technology area does this patent fall under?: Primary CPC classification G06Q10/40. Mapped technology areas include Physics.
When was this patent published?: Publication date Apr 23, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).