What technology area does this patent fall under?

Primary CPC classification G06F16/532. Mapped technology areas include Physics.

When was this patent published?

Publication date Sep 9, 2025 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Instruction-guided visual embeddings and feedback-based learning in large vision-language models

US12411879B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12411879-B2
Application number	US-202418924763-A
Country	US
Kind code	B2
Filing date	Oct 23, 2024
Priority date	Oct 24, 2023
Publication date	Sep 9, 2025
Grant date	Sep 9, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In an example, a method for fine-tuning a Large Visual Language Model (LVLM) includes providing visual queries, each of the visual queries comprises at least an image and a textual query related to the image; processing, by the LVLM, the visual queries to extract visual embeddings from the visual queries, wherein the LVLM comprises a Visual Language Model (VLM), a first Large Language Model (LLM), and a linear projection layer interconnecting the VLM and the LLM; for visual queries: i) generating, by the LVLM, a response to the corresponding visual query based on the corresponding visual embedding; ii) evaluating, by a second LLM, the generated response to verify that the generated response satisfies predefined criteria; and iii) providing, by the second LLM, a feedback to the LVLM, in response to the evaluating the generated response; and fine-tuning the LVLM using aggregated feedback provided by the second LLM for the visual queries.
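The per-query loop in the abstract (generate, evaluate, feed back, then aggregate) can be sketched as follows. This is a minimal illustration, not the patented implementation: `VisualQuery`, `Feedback`, `collect_feedback`, and the two toy callables are hypothetical names introduced here, and the final fine-tuning step is left out because the abstract does not say how the aggregated feedback is applied.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class VisualQuery:
    image: bytes  # placeholder for raw image data
    text: str     # textual query related to the image


@dataclass
class Feedback:
    query: VisualQuery
    response: str
    satisfied: bool  # did the response meet the predefined criteria?
    comment: str     # evaluator's feedback text


def collect_feedback(
    queries: List[VisualQuery],
    generate: Callable[[VisualQuery], str],
    evaluate: Callable[[VisualQuery, str], Tuple[bool, str]],
) -> List[Feedback]:
    """Run steps i) to iii) of the abstract once per visual query."""
    aggregated: List[Feedback] = []
    for q in queries:
        response = generate(q)               # i) the LVLM generates a response
        ok, comment = evaluate(q, response)  # ii) a second LLM checks the criteria
        aggregated.append(Feedback(q, response, ok, comment))  # iii) feedback
    return aggregated


# Hypothetical stand-ins so the sketch runs without any real model.
def toy_lvlm(q: VisualQuery) -> str:
    return f"answer to: {q.text}"


def toy_evaluator(q: VisualQuery, response: str) -> Tuple[bool, str]:
    ok = "cat" in q.text
    comment = "looks fine" if ok else "be more specific"
    return ok, comment


queries = [
    VisualQuery(b"", "what colour is the cat?"),
    VisualQuery(b"", "describe the scene"),
]
feedback = collect_feedback(queries, toy_lvlm, toy_evaluator)
print([f.satisfied for f in feedback])
```

The list returned by `collect_feedback` stands in for the "aggregated feedback" that the abstract says is used to fine-tune the LVLM.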

First claim

Opening claim text (preview).

What is claimed is:
1. A method for fine-tuning a Large Visual Language Model (LVLM), the method comprising: providing a plurality of visual queries, wherein each of the plurality of visual queries comprises at least an image and a textual query related to the image; processing, by the LVLM, the plurality of visual queries to extract one or more visual embeddings from each of the plurality of visual queries, wherein the LVLM comprises a Visual Language Model (VLM), a first Large Language Model (LLM), and a linear projection layer interconnecting the VLM and the LLM; for each of the plurality of visual queries: i) generating, by the LVLM, a response to the corresponding visual query based on the corresponding one or more visual embeddings; ii) evaluating, by a second LLM, the generated response to verify that the generated response satisfies one or more predefined criteria; and iii) providing, by the second LLM, a feedback to the LVLM, in response to the evaluating the generated response; and fine-tuning the LVLM using aggregated feedback provided by the second LLM for the plurality of visual queries.
2. The method of claim 1, wherein the one or more predefined criteria comprise at least one of helpfulness, honesty, and harmlessness.
3. The method of claim 1, wherein the feedback comprises a Natural Language Feedback (NLF).
4. The method of claim 3, wherein the feedback comprises at least a numerical score, critique feedback, and refinement feedback.
5. The method of claim 4, wherein the refinement feedback suggests improvements or modifications to the generated response.
6. The method of claim 3, further comprising training the LVLM using the NLF.
7. The method of claim 6, wherein training the LVLM further comprises: training the LVLM using the NLF incorporated into a conditional Reinforcement Learning (RL) algorithm.
8. A computing system for fine-tuning a Large Visual Language Model (LVLM), the computing system comprising: processing circuitry in communication with storage media, the processing circuitry configured to execute a machine learning system comprising the LVLM, the processing circuitry configured to: provide a plurality of visual queries, wherein each of the plurality of visual queries comprises at least an image and a textual query related to the image; process, by the LVLM, the plurality of visual queries to extract one or more visual embeddings from each of the plurality of visual queries, wherein the LVLM comprises a Visual Language Model (VLM), a first Large Language Model (LLM), and a linear projection layer interconnecting the VLM and the LLM; for each of the plurality of visual queries: i) generate, by the LVLM, a response to the corresponding visual query based on the corresponding one or more visual embeddings; ii) evaluate, by a second LLM, the generated response to verify that the generated response satisfies one or more predefined criteria; and iii) provide, by the second LLM, a feedback to the LVLM, in response to the evaluating the generated response; and fine-tune the LVLM using aggregated feedback provided by the second LLM for the plurality of visual queries.
9. The system of claim 8, wherein the one or more predefined criteria comprise at least one of helpfulness, honesty, and harmlessness.
10. The system of claim 8, wherein the feedback comprises a Natural Language Feedback (NLF).
11. The system of claim 10, wherein the feedback comprises at least a numerical score, critique feedback, and refinement feedback.
12. The system of claim 11, wherein the refinement feedback suggests improvements or modifications to the generated response.
13. The system of claim 10, the processing circuitry further configured to: train the LVLM using the NLF.
14. The system of claim 13, wherein the processing circuitry configured to train the LVLM is further configured to: train the LVLM using the NLF incorporated into a conditional Reinforcement Learning (RL) algorithm.
15. Non-transitory computer-readable storage media having instructions encoded thereon, the instructions configured to cause processing circuitry to: provide a plurality of visual queries, wherein each of the plurality of visual queries comprises at least an image and a textual query related to the image; process, by a Large Visual Language Model (LVLM), the plurality of visual queries to extract one or more visual embeddings from each of the plurality of visual queries, wherein the LVLM comprises a Visual Language Model (VLM), a first Large Language Model (LLM), and a linear projection layer interconnecting the VLM and the LLM; for each of the plurality of visual queries: i) generate, by the LVLM, a response to the corresponding visual query based on the corresponding one or more visual embeddings; ii) evaluate, by a second LLM, the generated response to verify that the generated response satisfies one or more predefined criteria; and iii) provide, by the second LLM, a feedback to the LVLM, in response to the evaluating the generated response; and fine-tune the LVLM using aggregated feedback provided by the second LLM for the plurality of visual queries.
16. The storage media of claim 15, wherein the one or more predefined criteria comprise at least one of helpfulness, honesty, and harmlessness.
17. The storage media of claim 15, wherein the feedback comprises a Natural Language Feedback (NLF).
18. The storage media of claim 17, wherein the feedback comprises at least a numerical score, critique feedback, and refinement feedback.
19. The storage media of claim 18, wherein the refinement feedback suggests improvements or modifications to the generated response.
20. The storage media of claim 17, the instructions further configured to cause processing circuitry to: train the LVLM using the NLF.
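The dependent claims name three criteria (helpfulness, honesty, harmlessness) and three feedback components (a numerical score, critique feedback, refinement feedback). A minimal sketch of how such feedback might be represented and aggregated is shown below. The class and function names, the score scale, and the text format of `conditioning_prefix` are all assumptions made for illustration. The claims do not specify how the feedback is incorporated into the conditional RL algorithm, so the prefix is only one hypothetical way of conditioning on feedback text.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List

# Criteria named in claims 2, 9, and 16.
CRITERIA = ("helpfulness", "honesty", "harmlessness")


@dataclass
class NaturalLanguageFeedback:
    criterion: str
    score: float      # numerical score; the scale used below is an assumption
    critique: str     # critique feedback
    refinement: str   # suggested improvement to the generated response


def aggregate_scores(items: List[NaturalLanguageFeedback]) -> Dict[str, float]:
    """Average the numerical score per criterion, skipping unused criteria."""
    return {
        c: mean(f.score for f in items if f.criterion == c)
        for c in CRITERIA
        if any(f.criterion == c for f in items)
    }


def conditioning_prefix(f: NaturalLanguageFeedback) -> str:
    """Hypothetical text prefix a conditional training step could condition on."""
    return (
        f"[{f.criterion} score={f.score:g}] "
        f"critique: {f.critique} | refinement: {f.refinement}"
    )


items = [
    NaturalLanguageFeedback("helpfulness", 8.0, "answers the question", "add detail"),
    NaturalLanguageFeedback("helpfulness", 6.0, "too brief", "describe the background"),
    NaturalLanguageFeedback("honesty", 9.0, "no unsupported claims", "none"),
]
print(aggregate_scores(items))
print(conditioning_prefix(items[0]))
```

Keeping the score, critique, and refinement in one record mirrors the grouping in claims 4, 11, and 18, and makes it straightforward to aggregate feedback across many visual queries before any training step.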

Assignees

Stanford Res Inst Int

Inventors

Classifications

G06F16/3344
using natural language analysis · CPC title
G06F16/532 · Primary
Query formulation, e.g. graphical querying · CPC title
G06F16/338 · Primary
Presentation of query results · CPC title
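CPC symbols such as those listed above follow a fixed layout: a section letter, a two-digit class, a subclass letter, a main group, and a subgroup after the slash. The sketch below splits a symbol into those parts. `CpcSymbol` and `parse_cpc` are hypothetical helper names, and the section lookup is deliberately partial: it contains only section G, which is the "Physics" area this page maps the patent to.

```python
import re
from typing import NamedTuple


class CpcSymbol(NamedTuple):
    section: str     # e.g. "G"
    cls: str         # section + two-digit class, e.g. "G06"
    subclass: str    # class + subclass letter, e.g. "G06F"
    main_group: str  # digits before the slash, e.g. "16"
    subgroup: str    # digits after the slash, e.g. "532"


# Partial lookup for illustration; other sections are intentionally omitted.
SECTION_TITLES = {"G": "Physics"}

_CPC_PATTERN = re.compile(r"^([A-HY])(\d{2})([A-Z])(\d{1,4})/(\d{2,6})$")


def parse_cpc(symbol: str) -> CpcSymbol:
    """Split a CPC symbol like 'G06F16/532' into its hierarchical parts."""
    match = _CPC_PATTERN.match(symbol.strip())
    if match is None:
        raise ValueError(f"not a CPC symbol: {symbol!r}")
    section, class_digits, subclass_letter, main_group, subgroup = match.groups()
    cls = section + class_digits
    return CpcSymbol(section, cls, cls + subclass_letter, main_group, subgroup)


parsed = parse_cpc("G06F16/532")
print(parsed)
print(SECTION_TITLES.get(parsed.section, "unknown section"))
```

All three classifications on this page share the subclass G06F and main group 16, which is why they parse to the same section.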

Patent family

Related publications grouped by family.

View patent family 95401377

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12411879B2 cover?: In an example, a method for fine-tuning a Large Visual Language Model (LVLM) includes providing visual queries, each of the visual queries comprises at least an image and a textual query related to the image; processing, by the LVLM, the visual queries to extract visual embeddings from the visual queries, wherein the LVLM comprises a Visual Language Model (VLM), a first Large Language Model (LL…
Who is the assignee on this patent?: Stanford Res Inst Int