Method and apparatus for zero-shot natural language processing using visual imagination

US12468896B2 · US · B2

Patent metadata
Publication number: US-12468896-B2
Application number: US-202218077693-A
Country: US
Kind code: B2
Filing date: Dec 8, 2022
Priority date: Dec 8, 2022
Publication date: Nov 11, 2025
Grant date: Nov 11, 2025

Abstract

Official abstract text for this publication.

A method performed by at least one processor includes receiving a first input stream of a task and a second input stream of a solution. The method further includes selecting the first input stream or the second input stream. The method further includes providing the selected input stream to an image conversion model and a language model. The method further includes creating, based on the selected input stream, a model ensemble of the conversion model and the language model. The method further includes outputting a prediction based on the model ensemble. The method may further include generating an image corresponding to text, converting a textual task into a multimodal task, and solving the multimodal task.
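The flow in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: `language_model_scores`, `image_model_scores`, and the `IMAGINED_CONCEPTS` lookup are hypothetical toy stand-ins for a real language model and a real text-to-image model, and the equal-weight average is just one possible way to form the ensemble.

```python
from typing import Dict, List

Scores = Dict[str, float]


def normalize(raw: Scores) -> Scores:
    """Scale non-negative scores so that they sum to 1."""
    total = sum(raw.values())
    return {label: value / total for label, value in raw.items()}


def select_stream(task_stream: str, solution_stream: str, use_task: bool) -> str:
    """Select the first (task) or the second (solution) input stream."""
    return task_stream if use_task else solution_stream


def language_model_scores(text: str, labels: List[str]) -> Scores:
    """Toy stand-in for a language model: word overlap between text and label."""
    words = set(text.lower().split())
    return normalize(
        {lab: 1.0 + len(words & set(lab.lower().split())) for lab in labels}
    )


# Hypothetical "imagination" table: concepts an image generated for a word might show.
IMAGINED_CONCEPTS = {
    "river": {"water", "shore"},
    "money": {"building", "vault"},
}


def image_model_scores(text: str, labels: List[str]) -> Scores:
    """Toy stand-in for image conversion followed by image-to-label matching."""
    imagined = set()
    for word in text.lower().split():
        imagined |= IMAGINED_CONCEPTS.get(word, set())
    return normalize(
        {lab: 1.0 + len(imagined & set(lab.lower().split())) for lab in labels}
    )


def ensemble_predict(text: str, labels: List[str]) -> str:
    """Average both models' scores (an assumed rule) and return the best label."""
    lm = language_model_scores(text, labels)
    im = image_model_scores(text, labels)
    combined = {lab: 0.5 * lm[lab] + 0.5 * im[lab] for lab in labels}
    return max(combined, key=combined.get)


# Word sense disambiguation example for the target word "bank".
senses = ["shore of water", "building for money"]
selected = select_stream("she sat on the bank of the river", "", use_task=True)
prediction = ensemble_predict(selected, senses)
```

In this toy run both stand-in models favor the "shore of water" sense, so the ensemble returns it. A real system would replace the two scoring functions with actual model calls.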

First claim

Opening claim text (preview).

What is claimed is:

1. A method performed by at least one processor for processing language, the method comprising: receiving a first input stream of a task; receiving a second input stream of a solution; selecting the first input stream or the second input stream; providing the selected input stream to an image conversion model and a language model; creating, based on the selected input stream, a model ensemble from outputs of the image conversion model and from outputs of the language model; scoring a first plurality of candidate solutions obtained from the second input stream via the language model; selecting, from the first plurality of candidate solutions, a second plurality of candidate solutions with scores exceeding a threshold; and outputting a prediction based on the model ensemble and the second plurality of candidate solutions.

2. The method of claim 1, wherein the language model uses a prompt based approach, and wherein the language model is a Generative Pre-Trained Transformer (GPT) model.

3. The method of claim 1, wherein the task is at least one of word sense disambiguation, science question answering, or text classification, wherein the prediction comprises at least one possible word sense of a target word based on the task being the word sense disambiguation; the prediction comprises an answer of a question based on the task being the science question answering, and the prediction comprises a category of text based on the task being the text classification.

4. The method of claim 1, wherein the language model uses a Bidirectional Encoder Representations from Transformers (BERT).

5. The method of claim 4, wherein the language model uses a natural language inference approach.

6. The method of claim 4, wherein the language model uses a latent embedding approach.

7. The method of claim 1, wherein the image conversion model uses a combined approach of recall and synthesis.

8. The method of claim 7, wherein the synthesis includes a text to image generation model.

9. The method of claim 7, wherein the synthesis includes a generative adversarial network.

10. The method of claim 1, wherein the model ensemble weights constituent models of the image conversion model and the language model based on a relative size of each constituent model.

11. An apparatus comprising: at least one memory configured to store program code; and at least one processor configured to read the program code and operate as instructed by the program code, the program code comprising: receiving code configured to cause the at least one processor to receive a first input stream of a task and a second input stream of a solution; selecting code configured to cause the at least one processor to select the first input stream or the second input stream; providing code configured to cause the at least one processor to provide the selected input stream to an image conversion model and a language model; ensembling code configured to cause the at least one processor to create, based on the selected input stream, a model ensemble from outputs of the image conversion model and from outputs of the language model; scoring code configured to cause the at least one processor to score a first plurality of candidate solutions obtained from the second input stream via the language model; selecting code configured to cause the at least one processor to select, from the first plurality of candidate solutions, a second plurality of candidate solutions with scores exceeding a threshold; and outputting code configured to cause the at least one processor to output a prediction based on the model ensemble and the second plurality of candidate solutions.

12. The apparatus of claim 11, wherein the language model uses a prompt based approach, and wherein the language model is a Generative Pre-Trained Transformer (GPT) model.

13. The apparatus of claim 11, wherein the task is at least one of word sense disambiguation, science question answering, or text classification, wherein the prediction comprises at least one possible word sense of a target word based on the task being the word sense disambiguation; the prediction comprises an answer of a question based on the task being the science question answering, and the prediction comprises a category of text based on the task being the text classification.

14. The apparatus of claim 11, wherein the language model uses a Bidirectional Encoder Representations from Transformers (BERT).

15. The apparatus of claim 14, wherein the language model uses a natural language inference approach or a latent embedding approach.

16. The apparatus of claim 11, wherein the image conversion model uses a combined approach of recall and synthesis.

17. The apparatus of claim 16, wherein the synthesis includes a text to image generation model.

18. The apparatus of claim 16, wherein the synthesis includes a generative adversarial network.

19. The apparatus of claim 11, wherein the model ensemble weights constituent models of the image conversion model and the language model based on a relative size of each constituent model.

20. A non-transitory computer readable medium having instructions stored therein, which when executed by a processor cause the processor to execute a method comprising: receiving a first input stream of a task; receiving a second input stream of a solution; selecting the first input stream or the second input stream; providing the selected input stream to an image conversion model and a language model; creating, based on the selected input stream, a model ensemble from outputs of the image conversion model and from outputs of the language model; scoring a first plurality of candidate solutions obtained from the second input stream via the language model; selecting, from the first plurality of candidate solutions, a second plurality of candidate solutions with scores exceeding a threshold; and outputting a prediction based on the model ensemble and the second plurality of candidate solutions.
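The scoring, thresholding, and size-based weighting steps recited in the claims can be illustrated with a short Python sketch. Everything named here is an assumption made for illustration: the score values, the threshold of 0.25, the model sizes, and the choice to make each weight directly proportional to model size. The claims only state that weights are based on relative size and that candidates with scores exceeding a threshold are kept.

```python
from typing import Dict, List

Scores = Dict[str, float]


def filter_candidates(lm_scores: Scores, threshold: float) -> List[str]:
    """Keep the candidates whose language-model score exceeds the threshold."""
    return [cand for cand, score in lm_scores.items() if score > threshold]


def size_weights(model_sizes: Dict[str, float]) -> Dict[str, float]:
    """Assumed rule: weight each model in proportion to its relative size."""
    total = sum(model_sizes.values())
    return {name: size / total for name, size in model_sizes.items()}


def weighted_ensemble(
    per_model_scores: Dict[str, Scores],
    weights: Dict[str, float],
    candidates: List[str],
) -> Scores:
    """Combine per-model scores for the shortlisted candidates only."""
    return {
        cand: sum(weights[m] * per_model_scores[m][cand] for m in per_model_scores)
        for cand in candidates
    }


# Made-up scores for three candidate solutions from two constituent models.
per_model_scores = {
    "language": {"A": 0.5, "B": 0.3, "C": 0.2},
    "image": {"A": 0.2, "B": 0.7, "C": 0.1},
}
# Made-up parameter counts (in millions), used only to derive the weights.
weights = size_weights({"language": 300.0, "image": 100.0})

shortlist = filter_candidates(per_model_scores["language"], threshold=0.25)
combined = weighted_ensemble(per_model_scores, weights, shortlist)
prediction = max(combined, key=combined.get)
```

With these made-up numbers, candidate C is dropped by the threshold, and the larger language model's preference for A outweighs the image model's preference for B. Under a different weighting rule the outcome could differ.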

Assignees

Tencent America LLC

Inventors

Classifications

  • Combinations of networks · CPC title

  • Natural language analysis (semantic analysis of natural language G06F40/30) · CPC title

  • Semantic analysis · CPC title

  • Natural language generation · CPC title

  • G06F40/40 · Primary

    Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12468896B2 cover?
A method performed by at least one processor includes receiving a first input stream of a task and a second input stream of a solution. The method further includes selecting the first input stream or the second input stream. The method further includes providing the selected input stream to an image conversion model and a language model. The method further includes creating, based on the select…
Who is the assignee on this patent?
Tencent America LLC
What technology area does this patent fall under?
Primary CPC classification G06F40/40. Mapped technology areas include Physics.
When was this patent published?
Publication date: Nov 11, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).