Multimodal large language model agent with interactive image understanding

US2025391147A1 · US · A1

Patent metadata
Field               Value
Publication number  US-2025391147-A1
Application number  US-202418747541-A
Country             US
Kind code           A1
Filing date         Jun 19, 2024
Priority date       Jun 19, 2024
Publication date    Dec 25, 2025
Grant date          (not listed)

Abstract

Official abstract text for this publication.

An apparatus in an illustrative embodiment comprises at least one processing device that includes at least a processor and a memory coupled to the processor. The at least one processing device is configured to implement an artificial intelligence system comprising at least one large language model (LLM) agent, to perform in the LLM agent interactive image segmentation of at least one input image through interaction of the LLM agent with one or more users, to generate in the LLM agent an interactive image understanding comprising attention values computed by multiple distinct attention mechanisms based on one or more results of the interactive image segmentation, and to carry out additional user interactions via the LLM agent utilizing the interactive image understanding comprising the attention values computed by the multiple distinct attention mechanisms. In some embodiments, the LLM agent is illustratively utilized to provide at least a portion of an AI chatbot.
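The abstract describes three stages: interactive segmentation, an image understanding built from the segmentation results, and further user interaction that uses that understanding. The following is a toy sketch of that data flow only. Every function here (`segment`, `understand`, `respond`) is a hypothetical stand-in, not the patent's method: flood fill replaces a real segmentation model, and a softmax over word overlap replaces the attention mechanisms.

```python
import math
from collections import deque

def segment(image, click):
    """Stage 1 (placeholder): flood-fill the 4-connected region that shares
    the clicked pixel's value. A real system would use a segmentation model."""
    rows, cols = len(image), len(image[0])
    target = image[click[0]][click[1]]
    seen, queue = {click}, deque([click])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in seen and image[nr][nc] == target:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return seen

def understand(regions, question):
    """Stage 2 (placeholder): softmax weights over the labelled regions,
    scored by word overlap between each label and the question."""
    words = set(question.lower().split())
    labels = list(regions)
    scores = [float(len(words & set(label.lower().split()))) for label in labels]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}

def respond(regions, weights):
    """Stage 3 (placeholder): answer about the highest-weighted region."""
    label = max(weights, key=weights.get)
    return f"The {label} region covers {len(regions[label])} pixels."

# A 3x3 label image and two user clicks, each paired with a user-supplied name.
image = [
    [0, 0, 1],
    [0, 1, 1],
    [2, 2, 1],
]
regions = {
    "sky": segment(image, (0, 0)),
    "tree": segment(image, (0, 2)),
}
weights = understand(regions, "how big is the tree")
answer = respond(regions, weights)
print(answer)
```

The point of the sketch is the ordering: user input drives segmentation, the segmentation output feeds the understanding step, and later answers are conditioned on that understanding rather than on the raw image.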

First claim

Opening claim text (preview).

What is claimed is:

1. An apparatus comprising: at least one processing device comprising a processor coupled to a memory; the at least one processing device being configured: to implement an artificial intelligence system comprising at least one large language model (LLM) agent; to perform in the LLM agent interactive image segmentation of at least one input image through interaction of the LLM agent with one or more users; to generate in the LLM agent an interactive image understanding comprising attention values computed by multiple distinct attention mechanisms based on one or more results of the interactive image segmentation; and to carry out additional user interactions via the LLM agent utilizing the interactive image understanding comprising the attention values computed by the multiple distinct attention mechanisms.

2. The apparatus of claim 1 wherein the artificial intelligence system is implemented at least in part on a processing platform that is configured to communicate with one or more user devices over at least one network.

3. The apparatus of claim 1 wherein the artificial intelligence system is implemented at least in part on at least one user device.

4. The apparatus of claim 1 wherein the LLM agent comprises a multimodal LLM agent that implements at least one multimodal LLM.

5. The apparatus of claim 1 wherein performing interactive image segmentation comprises: extracting features from the at least one input image in an image encoder; and applying the extracted features to a semantic concept integration decoder to generate at least one embedding.

6. The apparatus of claim 1 wherein performing interactive image segmentation comprises: determining at least a subset of text prompts, visual prompts and memory prompts associated with the at least one input image; and generating at least one embedding comprising at least one of one or more mask embeddings and one or more class embeddings based on said at least a subset of the text prompts, visual prompts and memory prompts and features extracted from the at least one input image.

7. The apparatus of claim 1 wherein generating an interactive image understanding comprises: receiving at least one embedding as the one or more results of the interactive image segmentation; and applying the at least one embedding to a transformer architecture comprising the multiple distinct attention mechanisms to generate respective ones of the attention values.

8. The apparatus of claim 7 wherein the transformer architecture is configured to treat spatial information and text information as respective separate spatial and text modalities and wherein at least a portion of the attention values reflect interdependencies between the spatial and text modalities.

9. The apparatus of claim 7 wherein the multiple distinct attention mechanisms comprise at least a subset of text-to-text attention, text-to-spatial attention, spatial-to-text attention, and spatial-to-spatial attention.

10. The apparatus of claim 1 wherein the LLM agent provides at least a portion of an AI chatbot.

11. A computer program product comprising a non-transitory processor-readable storage medium having stored therein program code of one or more software programs, wherein the program code when executed by at least one processing device causes the at least one processing device: to implement an artificial intelligence system comprising at least one large language model (LLM) agent; to perform in the LLM agent interactive image segmentation of at least one input image through interaction of the LLM agent with one or more users; to generate in the LLM agent an interactive image understanding comprising attention values computed by multiple distinct attention mechanisms based on one or more results of the interactive image segmentation; and to carry out additional user interactions via the LLM agent utilizing the interactive image understanding comprising the attention values computed by the multiple distinct attention mechanisms.

12. The computer program product of claim 11 wherein performing interactive image segmentation comprises: extracting features from the at least one input image in an image encoder; and applying the extracted features to a semantic concept integration decoder to generate at least one embedding.

13. The computer program product of claim 11 wherein performing interactive image segmentation comprises: determining at least a subset of text prompts, visual prompts and memory prompts associated with the at least one input image; and generating at least one embedding comprising at least one of one or more mask embeddings and one or more class embeddings based on said at least a subset of the text prompts, visual prompts and memory prompts and features extracted from the at least one input image.

14. The computer program product of claim 11 wherein generating an interactive image understanding comprises: receiving at least one embedding as the one or more results of the interactive image segmentation; and applying the at least one embedding to a transformer architecture comprising the multiple distinct attention mechanisms to generate respective ones of the attention values.

15. The computer program product of claim 14 wherein the transformer architecture is configured to treat spatial information and text information as respective separate spatial and text modalities, wherein at least a portion of the attention values reflect interdependencies between the spatial and text modalities, and wherein the multiple distinct attention mechanisms comprise at least a subset of text-to-text attention, text-to-spatial attention, spatial-to-text attention, and spatial-to-spatial attention.

16. A method comprising: implementing an artificial intelligence system comprising at least one large language model (LLM) agent; performing in the LLM agent interactive image segmentation of at least one input image through interaction of the LLM agent with one or more users; generating in the LLM agent an interactive image understanding comprising attention values computed by multiple distinct attention mechanisms based on one or more results of the interactive image segmentation; and carrying out additional user interactions via the LLM agent utilizing the interactive image understanding comprising the attention values computed by the multiple distinct attention mechanisms; wherein the method is performed by at least one processing device comprising a processor coupled to a memory.

17. The method of claim 16 wherein performing interactive image segmentation comprises: extracting features from the at least one input image in an image encoder; and applying the extracted features to a semantic concept integration decoder to generate at least one embedding.

18. The method of claim 16 wherein performing interactive image segmentation comprises: determining at least a subset of text prompts, visual prompts and memory prompts associated with the at least one input image; and generating at least one embedding comprising at least one of one or more mask embeddings and one or more class embeddings based on said at least a subset of the text prompts, visual prompts and memory prompts and features extracted from the at least one input image.

19. The method of claim 16 wherein generating an interactive image understanding comprises: receiving at least one embedding as the one or more results of the interactive image segmentation; and
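The claims name four attention mechanisms over separate text and spatial modalities: text-to-text, text-to-spatial, spatial-to-text, and spatial-to-spatial. The claim text shown here does not say how the attention values are computed, so the sketch below assumes plain scaled dot-product attention with no learned projections. The function names and the toy embeddings are hypothetical and serve only to show how the four weight matrices relate to the two modalities.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys):
    """Scaled dot-product attention weights: one row of weights per query,
    one column per key. Assumed here; the claims do not fix the formula."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        for q in queries
    ]

def interactive_understanding(text_emb, spatial_emb):
    """Compute the four attention maps named in the claims."""
    return {
        "text_to_text": attention(text_emb, text_emb),
        "text_to_spatial": attention(text_emb, spatial_emb),
        "spatial_to_text": attention(spatial_emb, text_emb),
        "spatial_to_spatial": attention(spatial_emb, spatial_emb),
    }

# Hypothetical inputs: 3 text-token embeddings and 2 spatial (mask)
# embeddings, all of dimension 2.
text_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
spatial_emb = [[0.5, 0.5], [1.0, -1.0]]

values = interactive_understanding(text_emb, spatial_emb)
for name, matrix in values.items():
    print(name, f"{len(matrix)}x{len(matrix[0])}")
```

The two cross-modal maps (text-to-spatial and spatial-to-text) are the ones that can reflect interdependencies between the modalities; their shapes are transposes of each other, but their values generally differ because the softmax normalizes over different axes.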

Assignees

Dell Products L.P.

Inventors

Classifications

  • Extraction of image or video features · CPC title

  • G06V10/26 · Primary

    Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion · CPC title

  • G06F40/40 · Primary

    Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2025391147A1 cover?
An apparatus in an illustrative embodiment comprises at least one processing device that includes at least a processor and a memory coupled to the processor. The at least one processing device is configured to implement an artificial intelligence system comprising at least one large language model (LLM) agent, to perform in the LLM agent interactive image segmentation of at least one input imag…
Who is the assignee on this patent?
Dell Products L.P.
What technology area does this patent fall under?
Primary CPC classification G06V10/26. Mapped technology areas include Physics.
When was this patent published?
Publication date: Dec 25, 2025 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).