Multimodal large language model agent with interactive image understanding

US2025391147A1 · US · A1

Patent metadata
Field	Value
Publication number	US-2025391147-A1
Application number	US-202418747541-A
Country	US
Kind code	A1
Filing date	Jun 19, 2024
Priority date	Jun 19, 2024
Publication date	Dec 25, 2025
Grant date	—
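
The publication number appears on this page in two spellings, a hyphenated one (US-2025391147-A1) and a compact one (US2025391147A1). The sketch below shows one way to split either spelling into the country, number, and kind code fields listed in the table. The function name, the class name, and the regular expression are illustrative assumptions, not part of the patent record or of any patent office API.

```python
import re
from typing import NamedTuple


class PublicationNumber(NamedTuple):
    """Hypothetical container for the three parts shown in the metadata table."""
    country: str  # two-letter office code, e.g. "US"
    number: str   # the digits between country and kind code
    kind: str     # kind code, e.g. "A1"


# Assumed pattern: two capitals, digits, then a letter with an optional digit.
# Hyphens between the parts are optional so both spellings on this page match.
_PATTERN = re.compile(r"^([A-Z]{2})-?(\d+)-?([A-Z]\d?)$")


def parse_publication_number(text: str) -> PublicationNumber:
    """Split a publication number into country, number, and kind code."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized publication number: {text!r}")
    return PublicationNumber(*match.groups())


hyphenated = parse_publication_number("US-2025391147-A1")
compact = parse_publication_number("US2025391147A1")
print(hyphenated)
print(hyphenated == compact)
```

Both spellings parse to the same three fields, which makes the parsed tuple a convenient key when matching this record against other listings.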

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

An apparatus in an illustrative embodiment comprises at least one processing device that includes at least a processor and a memory coupled to the processor. The at least one processing device is configured to implement an artificial intelligence system comprising at least one large language model (LLM) agent, to perform in the LLM agent interactive image segmentation of at least one input image through interaction of the LLM agent with one or more users, to generate in the LLM agent an interactive image understanding comprising attention values computed by multiple distinct attention mechanisms based on one or more results of the interactive image segmentation, and to carry out additional user interactions via the LLM agent utilizing the interactive image understanding comprising the attention values computed by the multiple distinct attention mechanisms. In some embodiments, the LLM agent is illustratively utilized to provide at least a portion of an AI chatbot.
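
The abstract refers to "multiple distinct attention mechanisms", and claim 9 in the claim text on this page names four of them: text-to-text, text-to-spatial, spatial-to-text, and spatial-to-spatial attention. The sketch below illustrates what four such attention maps over two modalities could look like. It assumes plain scaled dot-product attention with no learned projections, and the toy embeddings are invented for illustration; the excerpt on this page does not say how the patent computes its attention values.

```python
import math
from itertools import product


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Scaled dot-product attention weights: one row per query, rows sum to 1."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(q_i * k_i for q_i, k_i in zip(q, k)) / scale for k in keys])
        for q in queries
    ]


# Hypothetical toy embeddings (assumptions, not taken from the patent):
# two text tokens and three spatial (mask/region) tokens, each of width 4.
modalities = {
    "text": [[1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 0.5]],
    "spatial": [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0], [1.0, 0.0, 0.0, 1.0]],
}

# One attention map per ordered pair of modalities, named as in claim 9.
attention = {
    f"{src}-to-{dst}": attention_weights(modalities[src], modalities[dst])
    for src, dst in product(modalities, repeat=2)
}

for name, weights in attention.items():
    print(name, [[round(w, 3) for w in row] for row in weights])
```

Keeping the two modalities as separate token sets, rather than concatenating them into one sequence, is what lets each of the four maps be inspected on its own. The cross-modal maps (text-to-spatial and spatial-to-text) are the ones that would reflect interdependencies between the modalities in the sense of claim 8.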

First claim

Opening claim text (preview).

What is claimed is:
1. An apparatus comprising: at least one processing device comprising a processor coupled to a memory; the at least one processing device being configured: to implement an artificial intelligence system comprising at least one large language model (LLM) agent; to perform in the LLM agent interactive image segmentation of at least one input image through interaction of the LLM agent with one or more users; to generate in the LLM agent an interactive image understanding comprising attention values computed by multiple distinct attention mechanisms based on one or more results of the interactive image segmentation; and to carry out additional user interactions via the LLM agent utilizing the interactive image understanding comprising the attention values computed by the multiple distinct attention mechanisms.
2. The apparatus of claim 1 wherein the artificial intelligence system is implemented at least in part on a processing platform that is configured to communicate with one or more user devices over at least one network.
3. The apparatus of claim 1 wherein the artificial intelligence system is implemented at least in part on at least one user device.
4. The apparatus of claim 1 wherein the LLM agent comprises a multimodal LLM agent that implements at least one multimodal LLM.
5. The apparatus of claim 1 wherein performing interactive image segmentation comprises: extracting features from the at least one input image in an image encoder; and applying the extracted features to a semantic concept integration decoder to generate at least one embedding.
6. The apparatus of claim 1 wherein performing interactive image segmentation comprises: determining at least a subset of text prompts, visual prompts and memory prompts associated with the at least one input image; and generating at least one embedding comprising at least one of one or more mask embeddings and one or more class embeddings based on said at least a subset of the text prompts, visual prompts and memory prompts and features extracted from the at least one input image.
7. The apparatus of claim 1 wherein generating an interactive image understanding comprises: receiving at least one embedding as the one or more results of the interactive image segmentation; and applying the at least one embedding to a transformer architecture comprising the multiple distinct attention mechanisms to generate respective ones of the attention values.
8. The apparatus of claim 7 wherein the transformer architecture is configured to treat spatial information and text information as respective separate spatial and text modalities and wherein at least a portion of the attention values reflect interdependencies between the spatial and text modalities.
9. The apparatus of claim 7 wherein the multiple distinct attention mechanisms comprise at least a subset of text-to-text attention, text-to-spatial attention, spatial-to-text attention, and spatial-to-spatial attention.
10. The apparatus of claim 1 wherein the LLM agent provides at least a portion of an AI chatbot.
11. A computer program product comprising a non-transitory processor-readable storage medium having stored therein program code of one or more software programs, wherein the program code when executed by at least one processing device causes the at least one processing device: to implement an artificial intelligence system comprising at least one large language model (LLM) agent; to perform in the LLM agent interactive image segmentation of at least one input image through interaction of the LLM agent with one or more users; to generate in the LLM agent an interactive image understanding comprising attention values computed by multiple distinct attention mechanisms based on one or more results of the interactive image segmentation; and to carry out additional user interactions via the LLM agent utilizing the interactive image understanding comprising the attention values computed by the multiple distinct attention mechanisms.
12. The computer program product of claim 11 wherein performing interactive image segmentation comprises: extracting features from the at least one input image in an image encoder; and applying the extracted features to a semantic concept integration decoder to generate at least one embedding.
13. The computer program product of claim 11 wherein performing interactive image segmentation comprises: determining at least a subset of text prompts, visual prompts and memory prompts associated with the at least one input image; and generating at least one embedding comprising at least one of one or more mask embeddings and one or more class embeddings based on said at least a subset of the text prompts, visual prompts and memory prompts and features extracted from the at least one input image.
14. The computer program product of claim 11 wherein generating an interactive image understanding comprises: receiving at least one embedding as the one or more results of the interactive image segmentation; and applying the at least one embedding to a transformer architecture comprising the multiple distinct attention mechanisms to generate respective ones of the attention values.
15. The computer program product of claim 14 wherein the transformer architecture is configured to treat spatial information and text information as respective separate spatial and text modalities, wherein at least a portion of the attention values reflect interdependencies between the spatial and text modalities, and wherein the multiple distinct attention mechanisms comprise at least a subset of text-to-text attention, text-to-spatial attention, spatial-to-text attention, and spatial-to-spatial attention.
16. A method comprising: implementing an artificial intelligence system comprising at least one large language model (LLM) agent; performing in the LLM agent interactive image segmentation of at least one input image through interaction of the LLM agent with one or more users; generating in the LLM agent an interactive image understanding comprising attention values computed by multiple distinct attention mechanisms based on one or more results of the interactive image segmentation; and carrying out additional user interactions via the LLM agent utilizing the interactive image understanding comprising the attention values computed by the multiple distinct attention mechanisms; wherein the method is performed by at least one processing device comprising a processor coupled to a memory.
17. The method of claim 16 wherein performing interactive image segmentation comprises: extracting features from the at least one input image in an image encoder; and applying the extracted features to a semantic concept integration decoder to generate at least one embedding.
18. The method of claim 16 wherein performing interactive image segmentation comprises: determining at least a subset of text prompts, visual prompts and memory prompts associated with the at least one input image; and generating at least one embedding comprising at least one of one or more mask embeddings and one or more class embeddings based on said at least a subset of the text prompts, visual prompts and memory prompts and features extracted from the at least one input image.
19. The method of claim 16 wherein generating an interactive image understanding comprises: receiving at least one embedding as the one or more results of the interactive image segmentation; and

Assignees

Dell Products L.P.

Inventors

Classifications

G06V10/40
Extraction of image or video features · CPC title
G06V10/26 · Primary
Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion · CPC title
G06F40/40 · Primary
Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title
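
All three symbols above begin with G, the CPC section titled Physics, which is presumably why this page maps the patent to the Physics technology area. The sketch below splits a CPC symbol into its hierarchical parts and looks up the section title. The class name, function name, and regular expression are illustrative assumptions and cover only simple symbols of the form shown on this page.

```python
import re
from dataclasses import dataclass

# Titles of the CPC sections (the first letter of every symbol).
SECTION_TITLES = {
    "A": "Human Necessities",
    "B": "Performing Operations; Transporting",
    "C": "Chemistry; Metallurgy",
    "D": "Textiles; Paper",
    "E": "Fixed Constructions",
    "F": "Mechanical Engineering; Lighting; Heating; Weapons; Blasting",
    "G": "Physics",
    "H": "Electricity",
    "Y": "General Tagging of New Technological Developments",
}

# Assumed shape: section letter, two-digit class, subclass letter,
# main group digits, a slash, then subgroup digits (e.g. "G06V10/26").
_CPC = re.compile(r"^([A-HY])(\d{2})([A-Z])(\d{1,4})/(\d{2,})$")


@dataclass(frozen=True)
class CpcSymbol:
    """Hypothetical holder for the parts of one CPC symbol."""
    section: str     # "G"
    cpc_class: str   # "G06"
    subclass: str    # "G06V"
    main_group: str  # "10"
    subgroup: str    # "26"

    @property
    def section_title(self) -> str:
        return SECTION_TITLES[self.section]


def parse_cpc(symbol: str) -> CpcSymbol:
    """Split a CPC symbol into section, class, subclass, main group, subgroup."""
    match = _CPC.match(symbol.strip())
    if match is None:
        raise ValueError(f"unrecognized CPC symbol: {symbol!r}")
    section, digits, letter, main_group, subgroup = match.groups()
    cpc_class = section + digits
    return CpcSymbol(section, cpc_class, cpc_class + letter, main_group, subgroup)


for code in ("G06V10/40", "G06V10/26", "G06F40/40"):
    parsed = parse_cpc(code)
    print(code, parsed.subclass, parsed.section_title)
```

Grouping by the `subclass` field separates the two image-analysis symbols (G06V) from the natural-language one (G06F), while all three share the same section.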

Patent family

Related publications grouped by family.

View patent family 98219644

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2025391147A1 cover? An apparatus in an illustrative embodiment comprises at least one processing device that includes at least a processor and a memory coupled to the processor. The at least one processing device is configured to implement an artificial intelligence system comprising at least one large language model (LLM) agent, to perform in the LLM agent interactive image segmentation of at least one input imag…
Who is the assignee on this patent? Dell Products L.P.
What technology area does this patent fall under? The primary CPC classification is G06V10/26. Mapped technology areas include Physics.
When was this patent published? It was published on Dec 25, 2025 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb? We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).