What technology area does this patent fall under?

Primary CPC classification: G06N3/084 (Backpropagation, e.g. using gradient descent). Mapped technology areas include Physics.

When was this patent published?

Publication date: Feb 25, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Task-specific text generation based on multimodal inputs

US12236192B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12236192-B2
Application number	US-202117339759-A
Country	US
Kind code	B2
Filing date	Jun 4, 2021
Priority date	Jan 8, 2021
Publication date	Feb 25, 2025
Grant date	Feb 25, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system and method for generating task-specific text by processing multimodal inputs using machine-learning models is provided. The method may include accessing first sets of tokens associated with a desired task and one or more modalities associated with a context of the desired task. The method may further include determining a second set of tokens for each of the one or more modalities using a classifier network associated with the modality. The method may further include generating a number of embedding vectors by mapping the first sets of tokens and the second set of tokens associated with each of the one or more modalities to an embedding space. The method may further include producing a sequence of words addressing the desired task by processing the number of embedding vectors with an encoder-decoder network.
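The data flow in the abstract can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the patented implementation: the classifier outputs are hard-coded logits, the second set of tokens is taken as the top-k categories (the claims describe sampling instead), the table sizes and the names `task_table`, `modality_tables`, and `top_k_tokens` are hypothetical, and the encoder-decoder stage is omitted.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_tokens(probs, k):
    """Indices of the k most probable categories (assumption: top-k, not sampling)."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

def make_table(num_rows, dim, rng):
    """Random embedding table; a real system would load trained weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_rows)]

def embed(tokens, table):
    """Map each token id to its row in the embedding table."""
    return [table[t] for t in tokens]

rng = random.Random(0)
DIM = 4  # hypothetical embedding dimension

# One table for task tokens and a separate table per modality.
task_table = make_table(3, DIM, rng)
modality_tables = {
    "video": make_table(5, DIM, rng),
    "audio": make_table(5, DIM, rng),
}

task_tokens = [0, 2]  # first set of tokens (describes the desired task)

# Stand-ins for the per-modality classifier network outputs.
classifier_logits = {
    "video": [0.1, 2.0, -1.0, 0.5, 1.5],
    "audio": [1.2, -0.3, 0.0, 3.0, 0.4],
}

# Build the sequence of embedding vectors: task tokens first, then each modality.
vectors = embed(task_tokens, task_table)
modality_tokens = {}
for name, logits in classifier_logits.items():
    probs = softmax(logits)
    modality_tokens[name] = top_k_tokens(probs, k=2)  # second set of tokens
    vectors.extend(embed(modality_tokens[name], modality_tables[name]))

# `vectors` is what an encoder-decoder network would consume next.
```

In this sketch every token, whether it comes from the task description or from a modality classifier, ends up as a vector of the same dimension, which is what lets a single text encoder-decoder process them as one sequence.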

First claim

Opening claim text (preview).

What is claimed is:
1. A method comprising, by a computing device: accessing at least one first set of tokens associated with a desired task and one or more modalities associated with a context of the desired task; determining, for the one or more modalities, a second set of tokens using a classifier network associated with at least one modality; generating a plurality of embedding vectors comprising a first set of embedding vectors mapped to the at least one first set of tokens and a second set of embedding vectors mapped to the second set of tokens, the at least one first set of tokens and the second set of tokens associated with the one or more modalities, wherein the first set of embedding vectors and the second set of embedding vectors are different and are mapped to an embedding space; and producing a sequence of words addressing the desired task based on determining probability distributions of the words to determine whether to select the words for the sequence and based on processing the plurality of embedding vectors with an encoder-decoder network.
2. The method of claim 1, wherein the desired task comprises a caption describing an event, an answer to a given question related to the event, a question relative to the event, or a context-aware dialog.
3. The method of claim 1, wherein the one or more modalities comprise video sensor data, audio sensor data, Inertial Measurement Unit (IMU) sensor data, or light detection and ranging (lidar) sensor data.
4. The method of claim 1, wherein the plurality of embedding vectors comprise a pre-determined relative position among the plurality of embedding vectors.
5. The method of claim 1, wherein an encoder of the encoder-decoder network generates a latent representation by processing the plurality of embedding vectors.
6. The method of claim 5, wherein a decoder of the encoder-decoder network produces a word at a time by processing the latent representation, wherein the produced word is selected from a word dictionary based on a probability associated with one or more words in the word dictionary.
7. The method of claim 6, wherein the decoder takes a kth produced word as input for producing a k+1st word in the sequence of words.
8. The method of claim 1, wherein determining a set of tokens for one or more modalities comprises sampling one or more categories corresponding to a modality among a plurality of categories based on a probability distribution associated with the plurality of categories generated by processing the at least one modality with the classifier network associated with the at least one modality.
9. The method of claim 8, wherein the sampling comprises a categorical reparameterization with Gumbel-Softmax or a differentiable approximation of tokenization.
10. The method of claim 8, wherein sampling one or more categories corresponding to the modality among the plurality of categories based on a probability distribution for the plurality of categories is performed by a differentiable tokenization unit.
11. The method of claim 10, wherein mapping a set of tokens belonging to a modality to one or more embedding vectors in a d-dimensional embedding space comprises looking up an embedding table corresponding to the at least one modality.
12. The method of claim 11, wherein a first embedding table corresponding to a first modality is different from a second embedding table corresponding to a second modality.
13. The method of claim 11, wherein, at a beginning of a training procedure, the classifier network associated with embedding tables, of the at least one modality, corresponding to the at least one first set of tokens and the second set of tokens, and the encoder-decoder network are initialized with pre-trained models.
14. The method of claim 13, wherein, during the training procedure, the classifier network, the embedding tables, and the encoder-decoder network are updated through backward propagations.
15. The method of claim 14, wherein, during the backward propagations, a gradient for the plurality of categories for the one or more modalities is estimated using a Straight-Through Estimator.
16. The method of claim 14, wherein, a loss is calculated based on a comparison between a ground-truth sequence of words addressing the desired task with a sequence of words generated by the decoder of the encoder-decoder network.
17. The method of claim 16, wherein a partial loss for a kth word is calculated based on a comparison between a kth word in the ground-truth sequence of words and a kth generated word in an instance in which a sub-sequence of words from a first word to a k−1st word in the ground truth sequence of words is provided to the decoder as input.
18. One or more computer-readable non-transitory storage media embodying software that is operable when executed to: access at least one first set of tokens associated with a desired task and one or more modalities associated with a context of the desired task; determine, for the one or more modalities, a second set of tokens using a classifier network associated with at least one modality; generate a plurality of embedding vectors comprising a first set of embedding vectors mapped to the at least one first set of tokens and a second set of embedding vectors mapped to the second set of tokens, the at least one first set of tokens and the second set of tokens associated with the one or more modalities, wherein the first set of embedding vectors and the second set of embedding vectors are different and are mapped to an embedding space; and produce a sequence of words addressing the desired task based on determining probability distributions of the words to determine whether to select the words for the sequence and based on processing the plurality of embedding vectors with an encoder-decoder network.
19. The media of claim 18, wherein the desired task comprises a caption describing an event, an answer to a given question related to the event, a question relative to the event, or a context-aware dialog.
20. A system comprising: one or more processors; and a non-transitory memory coupled to the one or more processors comprising instructions executable by the one or more processors, the one or more processors operable when executing the instructions to: access at least one first set of tokens associated with a desired task and one or more modalities associated with a context of the desired task; determine, for the one or more modalities, a second set of tokens using a classifier network associated with at least one modality; generate a plurality of embedding vectors comprising a first set of embedding vectors mapped to the at least one first set of tokens and a second set of embedding vectors mapped to the second set of tokens, the at least one first set of tokens and the second set of tokens associated with the one or more modalities, wherein the first set of embedding vectors and the second set of embedding vectors are different and are mapped to an embedding space; and produce a sequence of words addressing the desired task based on determining probability distributions of the words to determine whether to select the words for the sequence and based on processing the plurality of embedding vectors with an encoder-decoder network.
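Claims 8, 9, and 15 refer to sampling categories with a Gumbel-Softmax reparameterization and estimating gradients with a Straight-Through Estimator. The sketch below shows only the forward computation of the standard Gumbel-Softmax technique in plain Python; it is not taken from the patent. The function name `gumbel_softmax`, the temperature `tau`, and the example probabilities are assumptions chosen for illustration.

```python
import math
import random

def gumbel_softmax(logits, tau, rng):
    """Draw one relaxed categorical sample.

    Returns (soft, hard):
      soft - a probability vector, differentiable in the logits
      hard - the one-hot argmax of soft (the discrete token choice)
    """
    noisy = []
    for x in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        g = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((x + g) / tau)
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    k = max(range(len(soft)), key=lambda i: soft[i])
    hard = [1.0 if i == k else 0.0 for i in range(len(soft))]
    return soft, hard

# Hypothetical class probabilities from a modality classifier.
probs = [0.7, 0.2, 0.1]
logits = [math.log(p) for p in probs]

rng = random.Random(0)
soft, hard = gumbel_softmax(logits, tau=0.5, rng=rng)

# The argmax of the noisy logits is an exact categorical sample (Gumbel-max
# trick), so empirical frequencies of the hard choice approach `probs`.
N = 5000
counts = [0, 0, 0]
for _ in range(N):
    _, h = gumbel_softmax(logits, tau=0.5, rng=rng)
    counts[h.index(1.0)] += 1
freqs = [c / N for c in counts]
```

In an automatic-differentiation framework, a straight-through estimator would typically pass `hard` forward as the token choice while using the gradient of `soft` in the backward pass; plain Python has no autodiff, so that step is described here rather than executed.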

Assignees

Meta Platforms Inc

Inventors

Classifications

G06N3/0455
Auto-encoder networks; Encoder-decoder networks · CPC title
G06N3/09
Supervised learning · CPC title
G06N7/01
Probabilistic graphical models, e.g. probabilistic networks · CPC title
G06N3/084 · Primary
Backpropagation, e.g. using gradient descent · CPC title
G06F40/30
Semantic analysis · CPC title

Patent family

Related publications grouped by family.

View patent family 78957640

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12236192B2 cover? A system and method for generating task-specific text by processing multimodal inputs using machine-learning models is provided. The method may include accessing first sets of tokens associated with a desired task and one or more modalities associated with a context of the desired task. The method may further include determining a second set of tokens for each of the one or more modalities usin…
Who is the assignee on this patent? Meta Platforms Inc