Multimodal Dialog State Tracking and Action Prediction for Assistant Systems

US2021117681A1 · US · A1

Patent metadata
Field: Value
Publication number: US-2021117681-A1
Application number: US-202017006339-A
Country: US
Kind code: A1
Filing date: Aug 28, 2020
Priority date: Oct 18, 2019
Publication date: Apr 22, 2021
Grant date: (none listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In one embodiment, a method includes receiving, from a client system associated with a user, a user request comprising a reference to a target object, accessing visual data from the client system, wherein the visual data comprises images portraying the target object and one or more additional objects, and wherein attribute information of the target object is recorded in a multimodal dialog state, resolving the reference to the target object based on the attribute information recorded in the multimodal dialog state, determining relational information between the target object and one or more of the additional objects portrayed in the visual data, and sending, to the client system, instructions for presenting a response to the user request, wherein the response comprises the attribute information and the determined relational information.
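The flow in the abstract can be illustrated with a minimal sketch. This is not the patent's implementation: every class, function, and field name below is hypothetical, the reference resolution is a naive label match, and the "relational information" is reduced to a toy left/right comparison of horizontal positions.

```python
from dataclasses import dataclass, field


@dataclass
class TrackedObject:
    # Attribute information; these field names are illustrative assumptions.
    object_id: str
    label: str
    x: float  # horizontal centre in the image, 0.0 (left) to 1.0 (right)


@dataclass
class MultimodalDialogState:
    # Maps object identifiers to recorded attribute information.
    objects: dict = field(default_factory=dict)

    def record(self, obj: TrackedObject) -> None:
        self.objects[obj.object_id] = obj

    def resolve(self, reference: str):
        """Resolve a textual reference by matching it against recorded labels."""
        text = reference.lower()
        for obj in self.objects.values():
            if obj.label in text:
                return obj
        return None


def relation(target: TrackedObject, other: TrackedObject) -> str:
    """Toy spatial relation derived from horizontal positions only."""
    side = "left of" if target.x < other.x else "right of"
    return f"{side} the {other.label}"


def handle_request(state: MultimodalDialogState, request: str, scene: list):
    """Resolve the reference, then relate the target to the other scene objects."""
    target = state.resolve(request)
    if target is None:
        return None
    relations = [relation(target, o) for o in scene
                 if o.object_id != target.object_id]
    # The response carries both the attribute and the relational information.
    return {"attributes": target, "relations": relations}
```

A request such as "Where are my keys?" would be resolved against the recorded label `keys`, and the response would pair that object's recorded attributes with its relation to each other object in the scene.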

First claim

Opening claim text (preview).

What is claimed is:
1. A method comprising: receiving, from a client system associated with a user, a user request comprising a reference to a target object; accessing visual data from the client system, wherein the visual data comprises images portraying the target object and one or more additional objects, and wherein attribute information of the target object is recorded in a multimodal dialog state; resolving the reference to the target object based on the attribute information recorded in the multimodal dialog state; determining relational information between the target object and one or more of the additional objects portrayed in the visual data; and sending, to the client system, instructions for presenting a response to the user request, wherein the response comprises the attribute information and the determined relational information.
2. The method of claim 1, wherein the attribute information comprises an identifier of the target object.
3. The method of claim 1, wherein the attribute information comprises a location of the target object.
4. The method of claim 1, wherein the attribute information comprises a timestamp of an image of the visual data portraying the target object.
5. The method of claim 1, wherein the target object is an object that has been labeled as an object of significance.
6. The method of claim 1, further comprising: receiving the visual data from the client system; and storing the visual data in a data store.
7. The method of claim 6, further comprising: analyzing, by a computer vision module, the received visual data to identify the target object and the one or more additional objects; assigning respective object identifiers to the target object and one or more of the identified additional objects; and recording one or more of the object identifiers in the multimodal dialog state.
8. The method of claim 7, further comprising: recording the multimodal dialog state to a dialog state tracker, wherein the multimodal dialog state comprises one or more intents, slots, or relational information generated during a current session.
9. The method of claim 1, wherein each image of the plurality of images of the visual data is associated with a respective timestamp, wherein one or more of the images are associated with the target object.
10. The method of claim 9, further comprising: selecting, from among the one or more images associated with the target object, a first image having a most recent timestamp with respect to a time associated with the user request.
11. The method of claim 10, further comprising: analyzing, by a computer vision module, the first image to identify the target object and the one or more additional objects; and processing, by a scene understanding engine, the first image to generate the relational information between the target object and one or more of the additional objects.
12. The method of claim 11, further comprising: passing the first image, the attribute information, and one or more object identifiers of one or more additional objects associated with the first image to the scene understanding engine.
13. The method of claim 10, further comprising: recording, in the multimodal dialog state, a timestamp and location information associated with the first image.
14. The method of claim 13, further comprising: receiving an additional plurality of images, wherein each of the additional images is associated with the target object and a respective additional timestamp; and selecting, from among the additional images, a second image having a most recent timestamp.
15. The method of claim 14, further comprising: updating the multimodal dialog state to replace information of the first image with information of the second image.
16. The method of claim 1, further comprising: generating, by a scene understanding engine, the relational information in response to receiving the user request.
17. The method of claim 1, wherein the response comprises visual information indicating the relational information.
18. The method of claim 1, further comprising: receiving, from the client system, a subsequent user request for additional relational information associated with the target object; and generating, by a scene understanding engine, the additional relational information.
19. One or more computer-readable non-transitory storage media embodying software that is operable when executed to: receive, from a client system associated with a user, a user request referencing a target object; access visual data from the client system, wherein the visual data comprises images portraying the target object and one or more additional objects, and wherein attribute information of the target object is recorded in a multimodal dialog state; resolve the reference to the target object based on the attribute information recorded in the multimodal dialog state; determine relational information between the target object and one or more of the additional objects portrayed in the visual data; and send, to the client system, instructions for presenting a response to the user request, wherein the response comprises the attribute information and the determined relational information.
20. A system comprising: one or more processors; and a non-transitory memory coupled to the processors comprising instructions executable by the processors, the processors operable when executing the instructions to: receive, from a client system associated with a user, a user request referencing a target object; access visual data from the client system, wherein the visual data comprises images portraying the target object and one or more additional objects, and wherein attribute information of the target object is recorded in a multimodal dialog state; resolve the reference to the target object based on the attribute information recorded in the multimodal dialog state; determine relational information between the target object and one or more of the additional objects portrayed in the visual data; and send, to the client system, instructions for presenting a response to the user request, wherein the response comprises the attribute information and the determined relational information.
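The image-selection steps in claims 9, 10, and 13 to 15 can be sketched as follows. This is only an illustration under stated assumptions: `Frame`, `most_recent_frame`, and `update_state` are hypothetical names, timestamps are assumed to be seconds since the epoch, and "most recent with respect to a time associated with the user request" is interpreted here as the latest image at or before the request time, which the claims do not spell out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Frame:
    image_id: str
    timestamp: float       # seconds since epoch (assumed unit)
    location: str          # free-form location label (assumed representation)
    object_ids: frozenset  # identifiers of objects detected in the image


def most_recent_frame(frames, target_id, request_time) -> Optional[Frame]:
    """Pick the latest frame showing the target at or before the request time."""
    candidates = [f for f in frames
                  if target_id in f.object_ids and f.timestamp <= request_time]
    return max(candidates, key=lambda f: f.timestamp, default=None)


def update_state(state: dict, target_id, frames, request_time) -> dict:
    """Record, or replace, the image information held for the target object."""
    frame = most_recent_frame(frames, target_id, request_time)
    if frame is not None:
        # Overwrites any earlier entry, mirroring the replacement in claim 15.
        state[target_id] = {"image_id": frame.image_id,
                            "timestamp": frame.timestamp,
                            "location": frame.location}
    return state
```

Calling `update_state` a second time with a newer batch of images overwrites the entry for the same target, so the dialog state holds information about only the latest selected image for each object.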

Assignees

Facebook Inc

Inventors

Classifications

  • G06Q10/40 · Primary

    Business processes related to social networking or social networking services · CPC title

  • using neural networks · CPC title

  • using classification, e.g. of video objects · CPC title

  • Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items (segmenting video sequences G06V20/49) · CPC title

  • Facial expression recognition · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2021117681A1 cover?
In one embodiment, a method includes receiving, from a client system associated with a user, a user request comprising a reference to a target object, accessing visual data from the client system, wherein the visual data comprises images portraying the target object and one or more additional objects, and wherein attribute information of the target object is recorded in a multimodal dialog stat…
Who is the assignee on this patent?
Facebook Inc
What technology area does this patent fall under?
Primary CPC classification G06Q10/40. Mapped technology areas include Physics.
When was this patent published?
Publication date: Apr 22, 2021 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).