Multimodal entity and coreference resolution for assistant systems

US11966986B2 · US · B2

Patent metadata
Publication number: US-11966986-B2
Application number: US-202217878778-A
Country: US
Kind code: B2
Filing date: Aug 1, 2022
Priority date: Oct 18, 2019
Publication date: Apr 23, 2024
Grant date: Apr 23, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In one embodiment, a method includes receiving, at a client system, an audio input, where the audio input comprises a coreference to a target object, accessing visual data from one or more cameras associated with the client system, where the visual data comprises images portraying one or more objects, resolving the coreference to the target object from among the one or more objects, resolving the target object to a specific entity, and providing, at the client system, a response to the audio input, where the response comprises information about the specific entity.
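The steps in the abstract can be sketched as a small pipeline. This is an illustrative sketch only, not the patented implementation: the `DetectedObject` type, the `ENTITY_INFO` lookup table, the demonstrative word list, and the "closest to the center of the frame" heuristic are all assumptions made for the example. It assumes the audio has already been transcribed and that an object detector has already produced labeled bounding boxes.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class DetectedObject:
    """Hypothetical detector output; box is (x_min, y_min, x_max, y_max) in [0, 1]."""
    object_id: int
    label: str
    box: Tuple[float, float, float, float]


# Hypothetical stand-in for an entity store (a real system might query a knowledge graph).
ENTITY_INFO = {
    "mug": "A ceramic mug.",
    "laptop": "A 13-inch laptop.",
}

# Assumed list of words treated as coreferences to something in view.
DEMONSTRATIVES = {"this", "that", "it", "these", "those"}


def find_coreference(utterance: str) -> Optional[str]:
    """Return the first demonstrative word in the transcribed utterance, if any."""
    for token in utterance.lower().split():
        word = token.strip(".,!?")
        if word in DEMONSTRATIVES:
            return word
    return None


def center_distance(obj: DetectedObject) -> float:
    """Distance from the box center to the center of the frame (0.5, 0.5)."""
    cx = (obj.box[0] + obj.box[2]) / 2
    cy = (obj.box[1] + obj.box[3]) / 2
    return ((cx - 0.5) ** 2 + (cy - 0.5) ** 2) ** 0.5


def resolve_target(objects: List[DetectedObject]) -> Optional[DetectedObject]:
    """Assumed heuristic: the target is the object nearest the center of the view."""
    return min(objects, key=center_distance, default=None)


def answer(utterance: str, objects: List[DetectedObject]) -> str:
    """Resolve the coreference to an object, the object to an entity, and respond."""
    if find_coreference(utterance) is None:
        return "No coreference found in the query."
    target = resolve_target(objects)
    if target is None:
        return "No objects detected."
    info = ENTITY_INFO.get(target.label, "no information available")
    return f"{target.label}: {info}"
```

Calling `answer("What is that?", objects)` with a centered mug and a laptop in the corner of the frame would pick the mug. A production system would presumably combine further signals, such as dialog history, gesture, or gaze, rather than position alone.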

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising, by a client system: receiving, at the client system, a user query comprising an audio input from a user, wherein the audio input comprises a coreference to a target object, and wherein the user is associated with a current context; accessing, responsive to receiving the audio input from the user, visual data from one or more camera associated with the client system; analyzing, by a scene understanding engine, the visual data to identify a plurality of objects portrayed in the visual data; resolving the coreference to the target object from among the plurality of objects by identifying the target object from among the plurality of identified objects portrayed in the visual data; resolving the target object to a specific entity from a plurality of selected entities corresponding to the plurality of objects, wherein the plurality of selected entities are selected based on a respective recency of each of the selected entities and a respective correlation of each of the selected entities to the current context associated with the user; and providing, at the client system, a response to the user query, wherein the response comprises information about the specific entity, and wherein the response is in one or more modalities determined based on device capabilities of the client system.

2. The method of claim 1, further comprising: accessing a knowledge graph; and retrieving attribute information about the specific entity from the knowledge graph.

3. The method of claim 1, further comprising: analyzing, by a computer vision module, the visual data to identify the plurality of objects portrayed in the images; parsing, by a natural-language understanding (NLU) module, an intent of the audio input and the coreference to the target object to one of the plurality of objects portrayed in the images; and updating a dialog state to include the identified objects and the coreference to the target object.

4. The method of claim 3, further comprising: classifying the intent of one or more requests from one or more pre-defined taxonomies of semantic intentions; and generating a confidence score corresponding to the intent of the one or more requests.

5. The method of claim 3, further comprising: receiving, at the client system, gesture or gaze information; and updating the dialog state to include the received gesture or gaze information.

6. The method of claim 3, wherein resolving the coreference to the target object from among the plurality of objects comprises combining additional information with the dialog state.

7. The method of claim 1, further comprising: assigning respective object identifiers to the plurality of objects portrayed in the images; and storing one or more of the object identifiers as entities in a dialog state tracker.

8. The method of claim 1, wherein one or more of the objects portrayed in the images data are virtual objects in a virtual reality environment.

9. The method of claim 1, wherein identifying the target object from among the identified objects is based on its position within a field of view of the visual data.

10. One or more computer-readable non-transitory storage media comprising instructions executable by a processor to: receive, at a client system, a user query comprising an audio input from a user, wherein the audio input comprises a coreference to a target object, and wherein the user is associated with a current context; access, responsive to receiving the audio input from the user, visual data from one or more camera associated with the client system; analyze, by a scene understanding engine, the visual data to identify a plurality of objects portrayed in the visual data; resolve the coreference to the target object from among the plurality of objects by identifying the target object from among the plurality of identified objects portrayed in the visual data; resolve the target object to a specific entity from a plurality of selected entities corresponding to the plurality of objects, wherein the plurality of selected entities are selected based on a respective recency of each of the selected entities and a respective correlation of each of the selected entities to the current context associated with the user; and provide, at the client system, a response to the user query, wherein the response comprises information about the specific entity, and wherein the response is in one or more modalities determined based on device capabilities of the client system.

11. The media of claim 10, wherein the instructions are further executable by the processor to: access a knowledge graph; and retrieve attribute information about the specific entity from the knowledge graph.

12. The media of claim 10, wherein the instructions are further executable by the processor to: analyze, by a computer vision module, the visual data to identify the plurality of objects portrayed in the images; parse, by a natural-language understanding (NLU) module, an intent of the audio input and the coreference to the target object to one of the plurality of objects portrayed in the images; and update a dialog state to include the identified objects and the coreferences to the target object.

13. The media of claim 12, wherein the instructions are further executable by the processor to: classify the intent of one or more requests from one or more pre-defined taxonomies of semantic intentions; and generate a confidence score corresponding to the intent of the one or more requests.

14. The media of claim 12, wherein the instructions are further executable by the processor to: receive, at the client system, gesture or gaze information; and update the dialog state to include the received gesture or gaze information.

15. The media of claim 12, wherein resolving the coreference to the target object from among the plurality of objects comprises combining additional information with the dialog state.

16. The media of claim 10, wherein the instructions are further executable by the processor to: assign respective object identifiers to the plurality of objects portrayed in the images; and store one or more of the object identifiers as entities in a dialog state tracker.

17. The media of claim 10, wherein one or more of the objects portrayed in the images data are virtual objects in a virtual reality environment.

18. A client system comprising: one or more processors; and a non-transitory memory coupled to the processors comprising instructions executable by the processors, the processors operable when executing the instructions to: receive, at the client system, a user query comprising an audio input from a user, wherein the audio input comprises a coreference to a target object, and wherein the user is associated with a current context; access, responsive to receiving the audio input from the user, visual data from one or more camera associated with the client system; analyze, by a scene understanding engine, the visual data to identify a plurality of objects portrayed in the visual data; resolve the coreference to the target object from among the plurality of objects by identifying the target object from among the plurality of identified objects portrayed in the visual data; resolve the target object to a specific entity from a plurality of selected entities corresponding to the plurality of objects, wherein the plurality of selected entities are selected based on a respective recency of each of the selected entities and a respective correlation of each of the sele
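The independent claims add two conditions that the abstract does not mention: candidate entities are selected by recency and by correlation to the user's current context, and the response modality depends on device capabilities. The sketch below shows one way such a selection could be scored. The claims do not specify a formula, so the exponential recency decay, the Jaccard overlap used for context correlation, the equal weighting, and every name here (`CandidateEntity`, `select_entities`, `choose_modalities`, the capability strings) are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class CandidateEntity:
    """Hypothetical candidate entity corresponding to a detected object."""
    name: str
    seconds_since_seen: float  # how long ago the entity was last observed or mentioned
    tags: Set[str]             # descriptive tags used to compare against the context


def recency_score(seconds_since_seen: float, half_life: float = 30.0) -> float:
    """Assumed decay: the score halves every `half_life` seconds."""
    return 0.5 ** (seconds_since_seen / half_life)


def context_score(tags: Set[str], context: Set[str]) -> float:
    """Assumed correlation measure: Jaccard overlap between entity tags and context."""
    if not tags or not context:
        return 0.0
    return len(tags & context) / len(tags | context)


def select_entities(
    candidates: List[CandidateEntity],
    context: Set[str],
    top_k: int = 2,
    recency_weight: float = 0.5,
) -> List[CandidateEntity]:
    """Rank candidates by a weighted mix of recency and context correlation."""
    def score(c: CandidateEntity) -> float:
        return (
            recency_weight * recency_score(c.seconds_since_seen)
            + (1 - recency_weight) * context_score(c.tags, context)
        )
    return sorted(candidates, key=score, reverse=True)[:top_k]


def choose_modalities(capabilities: Set[str]) -> List[str]:
    """Map assumed device capability names to response modalities."""
    modalities = []
    if "display" in capabilities:
        modalities.append("visual")
    if "speaker" in capabilities:
        modalities.append("audio")
    return modalities
```

With a context of `{"kitchen", "coffee"}`, a recently seen espresso machine would rank above a toaster seen two minutes earlier, because it scores higher on both terms. A device reporting only `{"speaker"}` would receive an audio-only response.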

Assignees

Meta Platforms Inc

Inventors

Classifications

  • G06Q10/40 (Primary)

    Business processes related to social networking or social networking services · CPC title

  • Supervised learning · CPC title

  • Distributed learning, e.g. federated learning · CPC title

  • Calendar-based scheduling for persons or groups · CPC title

  • Execution procedure of a spoken command · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11966986B2 cover?
In one embodiment, a method includes receiving, at a client system, an audio input, where the audio input comprises a coreference to a target object, accessing visual data from one or more cameras associated with the client system, where the visual data comprises images portraying one or more objects, resolving the coreference to the target object from among the one or more objects, resolving the…
Who is the assignee on this patent?
Meta Platforms Inc
What technology area does this patent fall under?
Primary CPC classification G06Q10/40. Mapped technology areas include Physics.
When was this patent published?
Publication date: Apr 23, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).