Device and computer implemented method for evaluating a digital image
US-2024404272-A1 · Dec 5, 2024 · US
US2025391147A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2025391147-A1 |
| Application number | US-202418747541-A |
| Country | US |
| Kind code | A1 |
| Filing date | Jun 19, 2024 |
| Priority date | Jun 19, 2024 |
| Publication date | Dec 25, 2025 |
| Grant date | — |
Official abstract text for this publication.
An apparatus in an illustrative embodiment comprises at least one processing device that includes at least a processor and a memory coupled to the processor. The at least one processing device is configured to implement an artificial intelligence system comprising at least one large language model (LLM) agent, to perform in the LLM agent interactive image segmentation of at least one input image through interaction of the LLM agent with one or more users, to generate in the LLM agent an interactive image understanding comprising attention values computed by multiple distinct attention mechanisms based on one or more results of the interactive image segmentation, and to carry out additional user interactions via the LLM agent utilizing the interactive image understanding comprising the attention values computed by the multiple distinct attention mechanisms. In some embodiments, the LLM agent is illustratively utilized to provide at least a portion of an AI chatbot.
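To make the "multiple distinct attention mechanisms" more concrete, the following is a minimal sketch, not the applicant's implementation. It assumes the four attention types named in the claims (text-to-text, text-to-spatial, spatial-to-text, spatial-to-spatial) can be illustrated with plain scaled dot-product attention, where "A-to-B" is taken to mean queries from modality A attending over keys from modality B. The function names, the toy embeddings, and the absence of learned projection matrices are all assumptions made for illustration; the publication does not specify them.

```python
import math
from typing import Dict, List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries: Matrix, keys: Matrix) -> Matrix:
    """Scaled dot-product attention weights: one row per query, one column per key."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(q * k for q, k in zip(query, key)) / scale for key in keys])
        for query in queries
    ]


def four_way_attention(text: Matrix, spatial: Matrix) -> Dict[str, Matrix]:
    """Compute four attention maps while keeping text and spatial as separate modalities."""
    return {
        "text_to_text": attention_weights(text, text),
        "text_to_spatial": attention_weights(text, spatial),
        "spatial_to_text": attention_weights(spatial, text),
        "spatial_to_spatial": attention_weights(spatial, spatial),
    }


# Hypothetical toy embeddings: 2 text tokens and 3 segment (spatial) embeddings, 4 dims each.
text_emb = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
spatial_emb = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 1.0, 1.0, 0.0]]

maps = four_way_attention(text_emb, spatial_emb)
for name, weights in maps.items():
    # Print the shape of each attention map (queries x keys).
    print(name, len(weights), "x", len(weights[0]))
```

In this sketch the cross-modal maps (`text_to_spatial`, `spatial_to_text`) are where interdependencies between the two modalities would show up, since each row weights the embeddings of the other modality. A real transformer would typically add learned query, key, and value projections and multiple heads, which are omitted here.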
Opening claim text (preview).
What is claimed is:
1. An apparatus comprising: at least one processing device comprising a processor coupled to a memory; the at least one processing device being configured: to implement an artificial intelligence system comprising at least one large language model (LLM) agent; to perform in the LLM agent interactive image segmentation of at least one input image through interaction of the LLM agent with one or more users; to generate in the LLM agent an interactive image understanding comprising attention values computed by multiple distinct attention mechanisms based on one or more results of the interactive image segmentation; and to carry out additional user interactions via the LLM agent utilizing the interactive image understanding comprising the attention values computed by the multiple distinct attention mechanisms.
2. The apparatus of claim 1 wherein the artificial intelligence system is implemented at least in part on a processing platform that is configured to communicate with one or more user devices over at least one network.
3. The apparatus of claim 1 wherein the artificial intelligence system is implemented at least in part on at least one user device.
4. The apparatus of claim 1 wherein the LLM agent comprises a multimodal LLM agent that implements at least one multimodal LLM.
5. The apparatus of claim 1 wherein performing interactive image segmentation comprises: extracting features from the at least one input image in an image encoder; and applying the extracted features to a semantic concept integration decoder to generate at least one embedding.
6. The apparatus of claim 1 wherein performing interactive image segmentation comprises: determining at least a subset of text prompts, visual prompts and memory prompts associated with the at least one input image; and generating at least one embedding comprising at least one of one or more mask embeddings and one or more class embeddings based on said at least a subset of the text prompts, visual prompts and memory prompts and features extracted from the at least one input image.
7. The apparatus of claim 1 wherein generating an interactive image understanding comprises: receiving at least one embedding as the one or more results of the interactive image segmentation; and applying the at least one embedding to a transformer architecture comprising the multiple distinct attention mechanisms to generate respective ones of the attention values.
8. The apparatus of claim 7 wherein the transformer architecture is configured to treat spatial information and text information as respective separate spatial and text modalities and wherein at least a portion of the attention values reflect interdependencies between the spatial and text modalities.
9. The apparatus of claim 7 wherein the multiple distinct attention mechanisms comprise at least a subset of text-to-text attention, text-to-spatial attention, spatial-to-text attention, and spatial-to-spatial attention.
10. The apparatus of claim 1 wherein the LLM agent provides at least a portion of an AI chatbot.
11. A computer program product comprising a non-transitory processor-readable storage medium having stored therein program code of one or more software programs, wherein the program code when executed by at least one processing device causes the at least one processing device: to implement an artificial intelligence system comprising at least one large language model (LLM) agent; to perform in the LLM agent interactive image segmentation of at least one input image through interaction of the LLM agent with one or more users; to generate in the LLM agent an interactive image understanding comprising attention values computed by multiple distinct attention mechanisms based on one or more results of the interactive image segmentation; and to carry out additional user interactions via the LLM agent utilizing the interactive image understanding comprising the attention values computed by the multiple distinct attention mechanisms.
12. The computer program product of claim 11 wherein performing interactive image segmentation comprises: extracting features from the at least one input image in an image encoder; and applying the extracted features to a semantic concept integration decoder to generate at least one embedding.
13. The computer program product of claim 11 wherein performing interactive image segmentation comprises: determining at least a subset of text prompts, visual prompts and memory prompts associated with the at least one input image; and generating at least one embedding comprising at least one of one or more mask embeddings and one or more class embeddings based on said at least a subset of the text prompts, visual prompts and memory prompts and features extracted from the at least one input image.
14. The computer program product of claim 11 wherein generating an interactive image understanding comprises: receiving at least one embedding as the one or more results of the interactive image segmentation; and applying the at least one embedding to a transformer architecture comprising the multiple distinct attention mechanisms to generate respective ones of the attention values.
15. The computer program product of claim 14 wherein the transformer architecture is configured to treat spatial information and text information as respective separate spatial and text modalities, wherein at least a portion of the attention values reflect interdependencies between the spatial and text modalities, and wherein the multiple distinct attention mechanisms comprise at least a subset of text-to-text attention, text-to-spatial attention, spatial-to-text attention, and spatial-to-spatial attention.
16. A method comprising: implementing an artificial intelligence system comprising at least one large language model (LLM) agent; performing in the LLM agent interactive image segmentation of at least one input image through interaction of the LLM agent with one or more users; generating in the LLM agent an interactive image understanding comprising attention values computed by multiple distinct attention mechanisms based on one or more results of the interactive image segmentation; and carrying out additional user interactions via the LLM agent utilizing the interactive image understanding comprising the attention values computed by the multiple distinct attention mechanisms; wherein the method is performed by at least one processing device comprising a processor coupled to a memory.
17. The method of claim 16 wherein performing interactive image segmentation comprises: extracting features from the at least one input image in an image encoder; and applying the extracted features to a semantic concept integration decoder to generate at least one embedding.
18. The method of claim 16 wherein performing interactive image segmentation comprises: determining at least a subset of text prompts, visual prompts and memory prompts associated with the at least one input image; and generating at least one embedding comprising at least one of one or more mask embeddings and one or more class embeddings based on said at least a subset of the text prompts, visual prompts and memory prompts and features extracted from the at least one input image.
19. The method of claim 16 wherein generating an interactive image understanding comprises: receiving at least one embedding as the one or more results of the interactive image segmentation; and
Extraction of image or video features · CPC title
Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion · CPC title
Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title