Instruction-guided visual embeddings and feedback-based learning in large vision-language models

US12411879B2 · US · B2

Patent metadata
Publication number: US-12411879-B2
Application number: US-202418924763-A
Country: US
Kind code: B2
Filing date: Oct 23, 2024
Priority date: Oct 24, 2023
Publication date: Sep 9, 2025
Grant date: Sep 9, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In an example, a method for fine-tuning a Large Visual Language Model (LVLM) includes providing visual queries, each of the visual queries comprises at least an image and a textual query related to the image; processing, by the LVLM, the visual queries to extract visual embeddings from the visual queries, wherein the LVLM comprises a Visual Language Model (VLM), a first Large Language Model (LLM), and a linear projection layer interconnecting the VLM and the LLM; for visual queries: i) generating, by the LVLM, a response to the corresponding visual query based on the corresponding visual embedding; ii) evaluating, by a second LLM, the generated response to verify that the generated response satisfies predefined criteria; and iii) providing, by the second LLM, a feedback to the LVLM, in response to the evaluating the generated response; and fine-tuning the LVLM using aggregated feedback provided by the second LLM for the visual queries.
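The per-query loop in the abstract (generate, evaluate, give feedback, then aggregate) can be sketched as follows. This is a minimal illustration, not the patented implementation: `VisualQuery`, `Feedback`, `collect_feedback`, `stub_lvlm`, and `stub_judge` are hypothetical names, and both models are replaced by trivial stand-in functions. The fine-tuning step that would consume the aggregated feedback is not shown.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class VisualQuery:
    image: bytes  # placeholder for image data
    text: str     # textual query related to the image


@dataclass(frozen=True)
class Feedback:
    query: VisualQuery
    response: str
    satisfied: bool  # whether the response met the predefined criteria
    comment: str     # feedback text from the evaluator


def collect_feedback(
    queries: List[VisualQuery],
    generate: Callable[[VisualQuery], str],
    evaluate: Callable[[VisualQuery, str], Tuple[bool, str]],
) -> List[Feedback]:
    """Run steps i-iii for each query and return the aggregated feedback."""
    aggregated = []
    for query in queries:
        response = generate(query)                      # i) LVLM generates a response
        satisfied, comment = evaluate(query, response)  # ii) second LLM evaluates it
        aggregated.append(                              # iii) feedback is recorded
            Feedback(query, response, satisfied, comment)
        )
    return aggregated


def stub_lvlm(query: VisualQuery) -> str:
    # Stand-in for the LVLM (assumption: a real model would use the image).
    return f"Answer to: {query.text}"


def stub_judge(query: VisualQuery, response: str) -> Tuple[bool, str]:
    # Stand-in for the second LLM; the rule below is purely illustrative.
    ok = query.text.endswith("?")
    comment = "acceptable" if ok else "the query is not phrased as a question"
    return ok, comment


queries = [
    VisualQuery(b"", "What color is the car?"),
    VisualQuery(b"", "Describe"),
]
aggregated = collect_feedback(queries, stub_lvlm, stub_judge)
needs_work = [f for f in aggregated if not f.satisfied]
```

Passing the generator and evaluator as callables keeps the loop independent of any particular model, which matches the separation between the LVLM and the second LLM in the abstract.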

First claim

Opening claim text (preview).

What is claimed is:

1. A method for fine-tuning a Large Visual Language Model (LVLM), the method comprising: providing a plurality of visual queries, wherein each of the plurality of visual queries comprises at least an image and a textual query related to the image; processing, by the LVLM, the plurality of visual queries to extract one or more visual embeddings from each of the plurality of visual queries, wherein the LVLM comprises a Visual Language Model (VLM), a first Large Language Model (LLM), and a linear projection layer interconnecting the VLM and the LLM; for each of the plurality of visual queries: i) generating, by the LVLM, a response to the corresponding visual query based on the corresponding one or more visual embeddings; ii) evaluating, by a second LLM, the generated response to verify that the generated response satisfies one or more predefined criteria; and iii) providing, by the second LLM, a feedback to the LVLM, in response to the evaluating the generated response; and fine-tuning the LVLM using aggregated feedback provided by the second LLM for the plurality of visual queries.

2. The method of claim 1, wherein the one or more predefined criteria comprise at least one of helpfulness, honesty, and harmlessness.

3. The method of claim 1, wherein the feedback comprises a Natural Language Feedback (NLF).

4. The method of claim 3, wherein the feedback comprises at least a numerical score, critique feedback, and refinement feedback.

5. The method of claim 4, wherein the refinement feedback suggests improvements or modifications to the generated response.

6. The method of claim 3, further comprising training the LVLM using the NLF.

7. The method of claim 6, wherein training the LVLM further comprises: training the LVLM using the NLF incorporated into a conditional Reinforcement Learning (RL) algorithm.

8. A computing system for fine-tuning a Large Visual Language Model (LVLM), the computing system comprising: processing circuitry in communication with storage media, the processing circuitry configured to execute a machine learning system comprising the LVLM, the processing circuitry configured to: provide a plurality of visual queries, wherein each of the plurality of visual queries comprises at least an image and a textual query related to the image; process, by the LVLM, the plurality of visual queries to extract one or more visual embeddings from each of the plurality of visual queries, wherein the LVLM comprises a Visual Language Model (VLM), a first Large Language Model (LLM), and a linear projection layer interconnecting the VLM and the LLM; for each of the plurality of visual queries: i) generate, by the LVLM, a response to the corresponding visual query based on the corresponding one or more visual embeddings; ii) evaluate, by a second LLM, the generated response to verify that the generated response satisfies one or more predefined criteria; and iii) provide, by the second LLM, a feedback to the LVLM, in response to the evaluating the generated response; and fine-tune the LVLM using aggregated feedback provided by the second LLM for the plurality of visual queries.

9. The system of claim 8, wherein the one or more predefined criteria comprise at least one of helpfulness, honesty, and harmlessness.

10. The system of claim 8, wherein the feedback comprises a Natural Language Feedback (NLF).

11. The system of claim 10, wherein the feedback comprises at least a numerical score, critique feedback, and refinement feedback.

12. The system of claim 11, wherein the refinement feedback suggests improvements or modifications to the generated response.

13. The system of claim 10, the processing circuitry further configured to: train the LVLM using the NLF.

14. The system of claim 13, wherein the processing circuitry configured to train the LVLM is further configured to: train the LVLM using the NLF incorporated into a conditional Reinforcement Learning (RL) algorithm.

15. Non-transitory computer-readable storage media having instructions encoded thereon, the instructions configured to cause processing circuitry to: provide a plurality of visual queries, wherein each of the plurality of visual queries comprises at least an image and a textual query related to the image; process, by a Large Visual Language Model (LVLM), the plurality of visual queries to extract one or more visual embeddings from each of the plurality of visual queries, wherein the LVLM comprises a Visual Language Model (VLM), a first Large Language Model (LLM), and a linear projection layer interconnecting the VLM and the LLM; for each of the plurality of visual queries: i) generate, by the LVLM, a response to the corresponding visual query based on the corresponding one or more visual embeddings; ii) evaluate, by a second LLM, the generated response to verify that the generated response satisfies one or more predefined criteria; and iii) provide, by the second LLM, a feedback to the LVLM, in response to the evaluating the generated response; and fine-tune the LVLM using aggregated feedback provided by the second LLM for the plurality of visual queries.

16. The storage media of claim 15, wherein the one or more predefined criteria comprise at least one of helpfulness, honesty, and harmlessness.

17. The storage media of claim 15, wherein the feedback comprises a Natural Language Feedback (NLF).

18. The storage media of claim 17, wherein the feedback comprises at least a numerical score, critique feedback, and refinement feedback.

19. The storage media of claim 18, wherein the refinement feedback suggests improvements or modifications to the generated response.

20. The storage media of claim 17, the instructions further configured to cause processing circuitry to: train the LVLM using the NLF.
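The dependent claims describe feedback that carries a numerical score, a critique, and a refinement, judged against criteria such as helpfulness, honesty, and harmlessness. One way to picture such a record and its aggregation is sketched below. The class name `NaturalLanguageFeedback`, the function `summarize`, the 0 to 10 score scale, and the threshold are all assumptions made for illustration; the claims do not specify a scale, a data layout, or how the feedback enters the conditional RL algorithm, and that last step is not sketched here.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Tuple

# Criteria named in the dependent claims.
CRITERIA = ("helpfulness", "honesty", "harmlessness")


@dataclass(frozen=True)
class NaturalLanguageFeedback:
    criterion: str   # which predefined criterion was judged
    score: float     # numerical score (assumed scale: 0 to 10)
    critique: str    # what is wrong or right with the response
    refinement: str  # suggested improvement to the response

    def __post_init__(self) -> None:
        if self.criterion not in CRITERIA:
            raise ValueError(f"unknown criterion: {self.criterion}")


def summarize(
    feedback: List[NaturalLanguageFeedback],
    threshold: float = 5.0,  # hypothetical cut-off for "needs refinement"
) -> Tuple[Dict[str, float], List[str]]:
    """Aggregate feedback: mean score per criterion plus pending refinements."""
    mean_scores = {}
    for criterion in CRITERIA:
        scores = [f.score for f in feedback if f.criterion == criterion]
        if scores:  # skip criteria that received no feedback
            mean_scores[criterion] = mean(scores)
    refinements = [f.refinement for f in feedback if f.score < threshold]
    return mean_scores, refinements


feedback = [
    NaturalLanguageFeedback("helpfulness", 8.0, "Clear answer.", "None needed."),
    NaturalLanguageFeedback(
        "helpfulness", 4.0, "Too vague.", "Name the object in the image."
    ),
    NaturalLanguageFeedback("harmlessness", 9.0, "No issues.", "None needed."),
]
mean_scores, refinements = summarize(feedback)
```

Keeping the score, critique, and refinement in one record means a later training step can use the numeric signal and the text together, whichever way the feedback is eventually consumed.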

Assignees

Stanford Res Inst Int

Inventors

Classifications

  • using natural language analysis · CPC title

  • G06F16/532 · Primary

    Query formulation, e.g. graphical querying · CPC title

  • G06F16/338 · Primary

    Presentation of query results · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12411879B2 cover?
In an example, a method for fine-tuning a Large Visual Language Model (LVLM) includes providing visual queries, each of the visual queries comprises at least an image and a textual query related to the image; processing, by the LVLM, the visual queries to extract visual embeddings from the visual queries, wherein the LVLM comprises a Visual Language Model (VLM), a first Large Language Model (LL…
Who is the assignee on this patent?
Stanford Res Inst Int
What technology area does this patent fall under?
Primary CPC classification G06F16/532. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 9, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).