Processing multimodal user input for assistant systems

US12406316B2 · US · B2

Patent metadata
Publication number: US-12406316-B2
Application number: US-202318185258-A
Country: US
Kind code: B2
Filing date: Mar 16, 2023
Priority date: Apr 20, 2018
Publication date: Sep 2, 2025
Grant date: Sep 2, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In one embodiment, a method includes receiving at a head-mounted device a speech input from a user and a visual input captured by cameras of the head-mounted device, wherein the visual input comprises subjects and attributes associated with the subjects, and wherein the speech input comprises a co-reference to one or more of the subjects, resolving entities corresponding to the subjects associated with the co-reference based on the attributes and the co-reference, and presenting a communication content responsive to the speech input and the visual input at the head-mounted device, wherein the communication content comprises information associated with executing results of tasks corresponding to the resolved entities.
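
The resolution step in the abstract can be illustrated with a small sketch. This is not the patented implementation: it assumes the visual input has already been reduced to a list of detected subjects with attribute labels, and it uses plain word overlap as a stand-in for whatever resolution model a real system would use. The names `VisualConcept`, `resolve_coreference`, and the `entity:N` identifiers are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class VisualConcept:
    entity_id: str   # hypothetical identifier of the entity behind the subject
    category: str    # e.g. "jacket", "dog"
    attributes: set[str] = field(default_factory=set)


def resolve_coreference(utterance: str, concepts: list[VisualConcept]) -> list[VisualConcept]:
    """Return the concepts whose category and attributes best overlap the utterance.

    Assumption: word overlap stands in for a real co-reference resolver.
    """
    words = {w.strip(".,?!").lower() for w in utterance.split()}
    scored = []
    for concept in concepts:
        score = len(words & ({concept.category} | concept.attributes))
        if score > 0:
            scored.append((score, concept))
    if not scored:
        return []
    best = max(score for score, _ in scored)
    # Keep every concept tied for the best score; the reference may be ambiguous.
    return [concept for score, concept in scored if score == best]


concepts = [
    VisualConcept("entity:1", "jacket", {"red", "left"}),
    VisualConcept("entity:2", "jacket", {"blue", "right"}),
    VisualConcept("entity:3", "dog", {"brown"}),
]
matches = resolve_coreference("How much is that red jacket?", concepts)
print([c.entity_id for c in matches])  # ['entity:1']
```

In this sketch, "that red jacket" is the co-reference, and the attribute "red" is what separates the two jackets in view. An utterance that mentions only "jacket" returns both candidates, which a fuller system might resolve by asking the user.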

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising, by a client system: receiving, at the client system, a speech input from a user and a visual input captured by one or more cameras of the client system, wherein the speech input comprises a textual input spoken by the user, and wherein the visual input depicts a real-time view captured by the one or more cameras, wherein the real-time view comprises one or more visual concepts and one or more attributes associated with the one or more visual concepts, and wherein the textual input comprises a co-reference to one or more of the visual concepts; resolving, based on the one or more attributes and the co-reference, one or more entities corresponding to the one or more visual concepts associated with the co-reference; and presenting, at the client system, a communication content responsive to the speech input and the visual input, wherein the communication content comprises information associated with executing results of one or more tasks corresponding to the one or more resolved entities.

2. The method of claim 1, wherein the speech input comprises one or more audio clips, and wherein the visual input comprises one or more of an image or a video clip.

3. The method of claim 1, further comprising: determining, based on the visual input, the one or more visual concepts and the one or more attributes associated with the one or more visual concepts.

4. The method of claim 3, wherein determining the one or more visual concepts and the one or more attributes is based on one or more machine-learning models comprising one or more of: a support vector machine; a regression model; or a convolutional neural network.

5. The method of claim 1, wherein the one or more visual concepts comprise one or more of a person, a location, a business, or an object.

6. The method of claim 5, wherein the one or more visual concepts comprise one or more people, and wherein the method further comprises: determining the one or more people based on facial recognition.

7. The method of claim 5, wherein the one or more visual concepts comprise one or more objects, and wherein the method further comprises: determining the one or more objects based on object detection.

8. The method of claim 1, further comprising generating a feature representation for the visual input.

9. The method of claim 1, further comprising identifying one or more intents and one or more slots based on one or more of the speech input or the visual input.

10. The method of claim 9, further comprising executing the one or more tasks based on the identified intents and slots.

11. The method of claim 1, wherein the communication content comprises one or more of: a character string; an audio clip; an image; or a video clip.

12. The method of claim 1, further comprising determining one or more modalities for the communication content.

13. The method of claim 12, wherein determining the one or more modalities for the communication content comprises: identifying contextual information associated with a user of the client system; identifying contextual information associated with the client system; and determining the one or more modalities based on the contextual information associated with the user and the contextual information associated with the client system.

14. The method of claim 1, further comprising: generating a plurality of tasks based on the visual input; and receiving, at the client system, a user selection of the one or more tasks from the plurality of tasks by a user of the client system.

15. The method of claim 1, further comprising storing the one or more visual concepts in a dialog state.

16. The method of claim 1, further comprising: generating, based on the visual input, a media content object portraying the one or more visual concepts; receiving, at the client system, a user interaction with the media content object; and executing one or more additional tasks responsive to the user interaction with the media content object.

17. One or more computer-readable non-transitory non-volatile storage media embodying software that is operable when executed by a client system to: receive, at a client system, a speech input from a user and a visual input captured by one or more cameras of the client system, wherein the speech input comprises a textual input spoken by the user, and wherein the visual input depicts a real-time view captured by the one or more cameras, wherein the real-time view comprises one or more visual concepts and one or more attributes associated with the one or more visual concepts, and wherein the textual input comprises a co-reference to one or more of the visual concepts; resolve, based on the one or more attributes and the co-reference, one or more entities corresponding to the one or more visual concepts associated with the co-reference; and present, at the client system, a communication content responsive to the speech input and the visual input, wherein the communication content comprises information associated with executing results of one or more tasks corresponding to the one or more resolved entities.

18. A client system comprising: one or more processors; and a non-transitory non-volatile memory coupled to the processors comprising instructions executable by the processors, the processors operable when executing the instructions to: receive, at a client system, a speech input from a user and a visual input captured by one or more cameras of the client system, wherein the speech input comprises a textual input spoken by the user, and wherein the visual input depicts a real-time view captured by the one or more cameras, wherein the real-time view comprises one or more visual concepts and one or more attributes associated with the one or more visual concepts, and wherein the textual input comprises a co-reference to one or more of the visual concepts; resolve, based on the one or more attributes and the co-reference, one or more entities corresponding to the one or more visual concepts associated with the co-reference; and present, at the client system, a communication content responsive to the speech input and the visual input, wherein the communication content comprises information associated with executing results of one or more tasks corresponding to the one or more resolved entities.
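
Claims 12 and 13 describe choosing output modalities from contextual information about the user and about the client system. A minimal sketch of that idea follows, assuming two hand-written context records and simple rules. The field names (`is_driving`, `in_quiet_place`, `has_display`, `has_speaker`), the modality labels, and the rules themselves are illustrative assumptions, not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class UserContext:
    is_driving: bool = False       # hypothetical signal about the user
    in_quiet_place: bool = False   # e.g. a library or a meeting


@dataclass
class DeviceContext:
    has_display: bool = True       # hypothetical capability of the client system
    has_speaker: bool = True


def choose_modalities(user: UserContext, device: DeviceContext) -> list[str]:
    """Pick output modalities from user context and device context.

    Assumption: fixed rules stand in for whatever policy a real system uses.
    """
    modalities = []
    # Speak only if the device can and the user is not somewhere quiet.
    if device.has_speaker and not user.in_quiet_place:
        modalities.append("audio")
    # Show visual content only if there is a display and the user can look at it.
    if device.has_display and not user.is_driving:
        modalities.extend(["text", "image"])
    return modalities


print(choose_modalities(UserContext(is_driving=True), DeviceContext()))      # ['audio']
print(choose_modalities(UserContext(in_quiet_place=True), DeviceContext()))  # ['text', 'image']
```

Both inputs matter in this sketch: a device without a display yields audio only, regardless of what the user is doing, and an empty result means no modality is currently suitable.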

Assignees

Meta Platforms Inc

Inventors

Classifications

  • G06Q10/40 (Primary)

    Business processes related to social networking or social networking services · CPC title

  • Commands or executable codes · CPC title

  • Channels characterised by the type of signal · CPC title

  • Protecting personal data, e.g. for financial or medical purposes · CPC title

  • Probabilistic graphical models, e.g. probabilistic networks · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12406316B2 cover?
In one embodiment, a method includes receiving at a head-mounted device a speech input from a user and a visual input captured by cameras of the head-mounted device, wherein the visual input comprises subjects and attributes associated with the subjects, and wherein the speech input comprises a co-reference to one or more of the subjects, resolving entities corresponding to the subjects associa…
Who is the assignee on this patent?
Meta Platforms Inc
What technology area does this patent fall under?
Primary CPC classification G06Q10/40. Mapped technology areas include Physics.
When was this patent published?
Publication date: Sep 2, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).