Techniques for implementing multimodal large language models with mixtures of vision encoders

US2025384295A1 · US · A1

Patent metadata
Field               Value
Publication number  US-2025384295-A1
Application number  US-202519172564-A
Country             US
Kind code           A1
Filing date         Apr 7, 2025
Priority date       Jun 17, 2024
Publication date    Dec 18, 2025
Grant date          (none listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The disclosed method for training multimodal models includes performing one or more operations to train a plurality of vision language models to generate a plurality of trained vision language models, where each trained vision language model included in the plurality of trained vision language models comprises a different vision encoder and a first language model, and performing one or more operations to train a multimodal model to generate a trained multimodal model, where the trained multimodal model comprises the different vision encoders and a second language model.
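
The two training stages in the abstract can be pictured as a composition of parts. The following is a minimal Python sketch of that structure only, not of any real training procedure: the class names, `train_stage_one`, `train_stage_two`, and the encoder and language-model labels are all hypothetical, and the "training" is a placeholder that simply records which components end up in which model.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class VisionLanguageModel:
    """Stage-one output: one vision encoder paired with the first language model."""
    vision_encoder: str
    language_model: str


@dataclass(frozen=True)
class TrainedMultimodalModel:
    """Stage-two output: all the vision encoders paired with the second language model."""
    vision_encoders: tuple[str, ...]
    language_model: str


def train_stage_one(encoders: Sequence[str], first_lm: str) -> list[VisionLanguageModel]:
    # Placeholder for real training: one vision language model per encoder,
    # each using a different encoder and the same first language model.
    return [VisionLanguageModel(e, first_lm) for e in encoders]


def train_stage_two(
    vlms: Sequence[VisionLanguageModel], second_lm: str
) -> TrainedMultimodalModel:
    # Placeholder for real training: gather the encoders from stage one
    # and attach them to the second language model.
    return TrainedMultimodalModel(tuple(v.vision_encoder for v in vlms), second_lm)


# Hypothetical labels, for illustration only.
stage_one = train_stage_one(["encoder_a", "encoder_b", "encoder_c"], "first_lm")
final_model = train_stage_two(stage_one, "second_lm")
```

The point of the sketch is the shape of the result: stage one yields several single-encoder models, and stage two yields one model that carries every encoder alongside a different language model.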

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for generating a family of multimodal models for execution based on computer system resources, the method comprising: generating a plurality of candidate multimodal models by combining a previously-generated multimodal model with a plurality of vision encoders, wherein each candidate multimodal model comprises the previously-generated multimodal model and a different one of the vision encoders included in the plurality of vision encoders; computing a performance score for each candidate multimodal model; determining that a first candidate multimodal model included in the plurality of candidate multimodal models is associated with a first performance score that is better than all other performance scores associated with all other candidate multimodal models included in the plurality of candidate multimodal models; and selecting the first candidate multimodal model for inclusion in the family of multimodal models, wherein each multimodal model included in the family of multimodal models incorporates a number of vision encoders that is different than a number of vision encoders incorporated into all other multimodal models included in the family of multimodal models, wherein at least one multimodal model included in the family of multimodal models is subsequently executed for at least one application based on one or more hardware resources associated with a first computer system.

2. The computer-implemented method of claim 1, further comprising computing the first performance score based on at least one of a visual question answering metric, an optical character recognition task metric, a document understanding task metric, a chart understanding task metric, a vision-centric task metric, or a knowledge-based task metric.

3. The computer-implemented method of claim 1, wherein generating each candidate multimodal model included in the plurality of candidate multimodal models comprises: performing one or more training operations to generate a plurality of vision language models that each comprise a different trained vision encoder and a first trained language model; and performing one or more training operations to generate the candidate multimodal model that comprises all of the different trained vision encoders and a second trained language model.

4. The computer-implemented method of claim 1, wherein selecting the first candidate multimodal model is further based on the first performance score being higher than a second performance score associated with the previously-generated multimodal model.

5. The computer-implemented method of claim 1, further comprising: generating another plurality of candidate multimodal models by combining the first candidate multimodal model with another plurality of vision encoders, wherein each candidate multimodal model included in the another plurality of candidate multimodal models comprises the first candidate multimodal model and a different one of the vision encoders included in the another plurality of vision encoders; computing a performance score for each candidate multimodal model included in the another plurality of candidate multimodal models; determining a second candidate multimodal model included in the another plurality of candidate multimodal models is associated with a second performance score that is worse than the first performance score; and not selecting the second candidate multimodal model for inclusion in the family of multimodal models.

6. The computer-implemented method of claim 1, wherein the first candidate multimodal model comprises a first vision encoder having a vision transformer large architecture and a second vision encoder that is pre-trained for a vision alignment task.

7. The computer-implemented method of claim 1, wherein the first candidate multimodal model comprises a first vision encoder having a vision transformer large architecture, a second vision encoder that is pre-trained for a vision alignment task, and a third vision encoder that is pre-trained for an object detection task.

8. The computer-implemented method of claim 1, wherein the first candidate multimodal model comprises a first vision encoder having a vision transformer large architecture, a second vision encoder that is pre-trained for a vision alignment task, a third vision encoder that is pre-trained for an object detection task, and a fourth vision encoder that is pre-trained for a text recognition task.

9. The computer-implemented method of claim 1, wherein the first candidate multimodal model comprises a first vision encoder having a vision transformer large architecture, a second vision encoder that is pre-trained for a vision alignment task, a third vision encoder that is pre-trained for an object detection task, a fourth vision encoder that is pre-trained for a text recognition task, and a fifth vision encoder that is pre-trained for a semantic segmentation task.

10. The computer-implemented method of claim 1, wherein the first candidate multimodal model comprises a first vision encoder having a vision transformer large architecture, a second vision encoder that is pre-trained for a vision alignment task, a third vision encoder that is pre-trained for an object detection task, a fourth vision encoder that is pre-trained for a text recognition task, a fifth vision encoder that is pre-trained for a semantic segmentation task, and a sixth vision encoder that is pre-trained for a self-supervised learning task.

11. One or more non-transitory computer-readable storage media including instructions that, when executed by at least one processor, cause the at least one processor to perform steps comprising: generating a plurality of candidate multimodal models by combining a previously-generated multimodal model with a plurality of vision encoders, wherein each candidate multimodal model comprises the previously-generated multimodal model and a different one of the vision encoders included in the plurality of vision encoders; computing a performance score for each candidate multimodal model; determining that a first candidate multimodal model included in the plurality of candidate multimodal models is associated with a first performance score that is better than all other performance scores associated with all other candidate multimodal models included in the plurality of candidate multimodal models; and selecting the first candidate multimodal model for inclusion in a family of multimodal models, wherein each multimodal model included in the family of multimodal models incorporates a number of vision encoders that is different than a number of vision encoders incorporated into all other multimodal models included in the family of multimodal models, wherein at least one multimodal model included in the family of multimodal models is subsequently executed for at least one application based on one or more hardware resources associated with a first computer system.

12. The one or more non-transitory computer-readable storage media of claim 11, wherein the instructions, when executed by at least one processor, further cause the at least one processor to perform the step of computing the first performance score based on at least one of a visual question answering metric, an optical character recognition task metric, a document understanding task metric, a chart understanding task metric, a vision-centric task metric, or a knowledge-based task metric.

13. The one or more non-transitory computer-readable storage media of claim 11, wherein the instructions, when executed by at least one processor, further cause the at least one pro
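
The selection procedure in claims 1, 4, and 5 reads like a greedy forward search over vision encoders: extend the current model by each available encoder, score the candidates, and keep the best one. A minimal Python sketch under stated assumptions follows. `Model`, `build_family`, `pick_for_hardware`, the toy `gains` table, the choice to include the starting model in the family, the rule of stopping at the first round that fails to improve, and the encoder-count cost model are all hypothetical illustrations and are not taken from the patent text.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Model:
    """Hypothetical stand-in: a model is identified by the encoders it contains."""
    encoders: tuple[str, ...]


def build_family(
    base: Model,
    encoder_pool: Sequence[str],
    score: Callable[[Model], float],
) -> list[Model]:
    """Greedy search: each round adds the single encoder that scores best."""
    family = [base]  # assumption: the starting model is itself a family member
    current, current_score = base, score(base)
    remaining = list(encoder_pool)
    while remaining:
        # One candidate per remaining encoder: the current model plus that encoder.
        candidates = [Model(current.encoders + (e,)) for e in remaining]
        best = max(candidates, key=score)
        best_score = score(best)
        # Assumed stopping rule: reject a round whose best candidate does not
        # beat the model it was built from.
        if best_score <= current_score:
            break
        family.append(best)
        current, current_score = best, best_score
        remaining.remove(best.encoders[-1])
    return family


def pick_for_hardware(
    family: Sequence[Model], budget: float, cost: Callable[[Model], float]
) -> Optional[Model]:
    """Assumed policy: the largest family member whose cost fits the budget."""
    affordable = [m for m in family if cost(m) <= budget]
    return max(affordable, key=lambda m: len(m.encoders)) if affordable else None


# Toy per-encoder score contributions; invented numbers, not measured results.
gains = {"vit_l": 50.0, "alignment": 10.0, "detection": 5.0, "ocr": 2.0, "segmentation": -1.0}


def toy_score(model: Model) -> float:
    return sum(gains[e] for e in model.encoders)


family = build_family(
    Model(("vit_l",)), ["alignment", "detection", "ocr", "segmentation"], toy_score
)
chosen = pick_for_hardware(family, budget=2, cost=lambda m: len(m.encoders))
```

Because each accepted round adds exactly one encoder, every member of `family` holds a different number of encoders, which matches the family property recited in claim 1. In a real setting `score` would be an aggregate over benchmark metrics such as those listed in claim 2, and `cost` would reflect memory or latency on the target system.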

Assignees

Nvidia Corp

Inventors

Classifications

  • G06N3/08 (Primary)

    Learning methods · CPC title

  • Combinations of networks · CPC title

  • G06N3/0985 (Primary)

    Hyperparameter optimisation; Meta-learning; Learning-to-learn · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2025384295A1 cover?
The disclosed method for training multimodal models includes performing one or more operations to train a plurality of vision language models to generate a plurality of trained vision language models, where each trained vision language model included in the plurality of trained vision language models comprises a different vision encoder and a first language model, and performing one or more ope…
Who is the assignee on this patent?
Nvidia Corp
What technology area does this patent fall under?
Primary CPC classification G06N3/08. Mapped technology areas include Physics.
When was this patent published?
Publication date: Dec 18, 2025 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).