What technology area does this patent fall under?

Primary CPC classification G06T1/0021. Mapped technology areas include Physics.

When was this patent published?

Publication date: Mar 24, 2026 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).

Perturbation robust metric for evaluating image captions

US12586392B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12586392-B2
Application number	US-202318179177-A
Country	US
Kind code	B2
Filing date	Mar 6, 2023
Priority date	Mar 6, 2023
Publication date	Mar 24, 2026
Grant date	Mar 24, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments are disclosed for training an image caption evaluation system to perform evaluations of image captions. In particular, in one or more embodiments, the disclosed systems and methods comprise receiving a training image, a ground truth image caption for the training image, and a perturbed image caption for the training image, where the perturbed image caption includes modifications to the ground truth image caption. The disclosed systems and methods further comprise generating, by a visual encoder, a visual embedding representation of the training image and generating, by a perturbation-aware text encoder, a first text embedding for the ground truth image caption and a second text embedding for the perturbed image caption. The disclosed systems and methods further comprise computing losses between the visual embedding, the first text embedding, and the second text embedding and training the perturbation-aware text encoder based on the computed losses.
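The abstract describes computing losses between the visual embedding and the two text embeddings. The sketch below is one illustrative way such terms could be combined, not the patent's formulation: the cosine-based terms, the hinge, the `margin` parameter, and the unweighted sum are all assumptions, and the function names are hypothetical. Embeddings are plain lists of floats and are assumed to be non-zero vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def overall_loss(visual, text_gt, text_perturbed, margin=0.0):
    """Sum of three illustrative loss terms (assumed forms, not from the patent)."""
    # Term 1: pull the image and the ground truth caption together.
    loss_img_gt = 1.0 - cosine(visual, text_gt)
    # Term 2: penalize similarity between the image and the perturbed caption.
    loss_img_pert = max(0.0, cosine(visual, text_perturbed) - margin)
    # Term 3: penalize similarity between the two caption embeddings.
    loss_gt_pert = max(0.0, cosine(text_gt, text_perturbed) - margin)
    return loss_img_gt + loss_img_pert + loss_gt_pert
```

With these assumed terms, the loss is lowest when the ground truth caption embedding aligns with the image embedding and the perturbed caption embedding is orthogonal (or opposed) to both. A real training loop would compute this over batches in an autodiff framework and backpropagate into the text encoder.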

First claim

Opening claim text (preview).

We claim:
1. A computer-implemented method, comprising: receiving a training image and a ground truth image caption for the training image; generating a perturbed image caption for the training image by performing modifications to text elements of the ground truth image caption; generating, by a visual encoder, a visual embedding representation of the training image; generating, by a perturbation-aware text encoder, a first text embedding for the ground truth image caption and a second text embedding for the perturbed image caption; computing losses between the visual embedding representation of the training image, the first text embedding for the ground truth image caption, and the second text embedding for the perturbed image caption; and training the perturbation-aware text encoder based on the computed losses.
2. The computer-implemented method of claim 1, further comprising: performing an initial training phase of the perturbation-aware text encoder using a set of ground truth image captions in a first language and the set of ground truth image captions translated into a second language.
3. The computer-implemented method of claim 2, wherein performing the initial training phase of the perturbation-aware text encoder further comprises: for each ground truth image caption of the set of ground truth image captions: translating the ground truth image caption from the first language to the second language; generating, by a text encoder, a first text embedding representation of the ground truth image caption in the first language; generating, by the perturbation-aware text encoder, a second text embedding representation of the ground truth image caption in the second language; computing a loss between the first text embedding representation and the second text embedding representation; and backpropagating the loss to train the perturbation-aware text encoder.
4. The computer-implemented method of claim 1, further comprising: generating the perturbed image caption by replacing first text elements in the ground truth image caption with second text elements from a different image caption.
5. The computer-implemented method of claim 1, further comprising: generating the perturbed image caption by swapping text elements within the ground truth image caption.
6. The computer-implemented method of claim 1, further comprising: generating the perturbed image caption by removing text elements within the ground truth image caption.
7. The computer-implemented method of claim 1, wherein computing the losses between the visual embedding representation of the training image, the first text embedding for the ground truth image caption, and the second text embedding for the perturbed image caption further comprises: computing a first loss between the visual embedding representation of the training image and the first text embedding for the ground truth image caption; computing a second loss between the visual embedding representation of the training image and the second text embedding for the perturbed image caption; computing a third loss between the first text embedding for the ground truth image caption and the second text embedding for the perturbed image caption; and aggregating the first loss, the second loss, and the third loss to generate an overall loss.
8. The computer-implemented method of claim 1, wherein the training image is part of a training dataset, and wherein each image in the training dataset is associated with a ground truth image caption for the image, a description of an image background, a description of objects in the image, and a description of a relationship between the objects in the image.
9. A non-transitory computer-readable storage medium storing executable instructions, which when executed by a processing device, cause the processing device to perform operations comprising: receiving a training image and a ground truth image caption for the training image; generating a perturbed image caption for the training image by performing modifications to text elements of the ground truth image caption; generating, by a visual encoder, a visual embedding representation of the training image; generating, by a perturbation-aware text encoder, a first text embedding for the ground truth image caption and a second text embedding for the perturbed image caption; computing losses between the visual embedding representation of the training image, the first text embedding for the ground truth image caption, and the second text embedding for the perturbed image caption; and training the perturbation-aware text encoder based on the computed losses.
10. The non-transitory computer-readable storage medium of claim 9, wherein the instructions further cause the processing device to perform operations comprising: performing an initial training phase of the perturbation-aware text encoder using a set of ground truth image captions in a first language and the set of ground truth image captions translated into a second language.
11. The non-transitory computer-readable storage medium of claim 10, wherein to perform the initial training phase of the perturbation-aware text encoder the instructions further cause the processing device to perform operations comprising: for each ground truth image caption of the set of ground truth image captions: translating the ground truth image caption from the first language to the second language; generating, by a text encoder, a first text embedding representation of the ground truth image caption in the first language; generating, by the perturbation-aware text encoder, a second text embedding representation of the ground truth image caption in the second language; computing a loss between the first text embedding representation and the second text embedding representation; and backpropagating the loss to train the perturbation-aware text encoder.
12. The non-transitory computer-readable storage medium of claim 9, wherein the instructions further cause the processing device to perform operations comprising: generating the perturbed image caption by replacing first text elements in the ground truth image caption with second text elements from a different image caption.
13. The non-transitory computer-readable storage medium of claim 9, wherein the instructions further cause the processing device to perform operations comprising: generating the perturbed image caption by swapping text elements within the ground truth image caption.
14. The non-transitory computer-readable storage medium of claim 9, wherein the instructions further cause the processing device to perform operations comprising: generating the perturbed image caption by removing text elements within the ground truth image caption.
15. The non-transitory computer-readable storage medium of claim 9, wherein to compute the losses between the visual embedding representation of the training image, the first text embedding for the ground truth image caption, and the second text embedding for the perturbed image caption the instructions further cause the processing device to perform operations comprising: computing a first loss between the visual embedding representation of the training image and the first text embedding for the ground truth image caption; computing a second loss between the visual embedding representation of the training image and the second text embedding for the perturbed image caption; computing a third loss between the first text embedding for the ground truth image caption and the second text embe…
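Claims 4 to 6 (and 12 to 14) name three ways of deriving a perturbed caption: replacing text elements with elements from a different caption, swapping elements within the caption, and removing elements. A minimal sketch of those three operations follows. It assumes a "text element" is a whitespace-delimited token and perturbs a single randomly chosen position; the patent text shown here does not define either choice, and the function names are hypothetical.

```python
import random


def replace_element(caption, other_caption, rng):
    """Replace one token with a token drawn from a different caption."""
    tokens = caption.split()
    donor_tokens = other_caption.split()
    position = rng.randrange(len(tokens))
    tokens[position] = rng.choice(donor_tokens)
    return " ".join(tokens)


def swap_elements(caption, rng):
    """Swap two distinct positions (needs at least two tokens).

    If both positions hold the same word, the caption is unchanged.
    """
    tokens = caption.split()
    i, j = rng.sample(range(len(tokens)), 2)
    tokens[i], tokens[j] = tokens[j], tokens[i]
    return " ".join(tokens)


def remove_element(caption, rng):
    """Delete one randomly chosen token."""
    tokens = caption.split()
    del tokens[rng.randrange(len(tokens))]
    return " ".join(tokens)


# Example usage with a seeded generator so runs are repeatable.
rng = random.Random(0)
ground_truth = "a dog chases a ball"
other = "two cats sleep on a sofa"
replaced = replace_element(ground_truth, other, rng)
swapped = swap_elements(ground_truth, rng)
removed = remove_element(ground_truth, rng)
```

Passing an explicit `random.Random` instance keeps the perturbations reproducible, which helps when the same ground truth and perturbed pairs need to be regenerated across training runs.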

Assignees

Adobe Inc

Inventors

Classifications

G06T1/0021 · Primary
Image watermarking · CPC title
G06F40/58
Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation · CPC title
G06V20/70 · Primary
Labelling scene content, e.g. deriving syntactic or semantic representations · CPC title

Patent family

Related publications grouped by family.

View patent family 92635743

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12586392B2 cover?: Embodiments are disclosed for training an image caption evaluation system to perform evaluations of image captions. In particular, in one or more embodiments, the disclosed systems and methods comprise receiving a training image, a ground truth image caption for the training image, and a perturbed image caption for the training image, where the perturbed image caption includes modifications to …
Who is the assignee on this patent?: Adobe Inc