Perturbation robust metric for evaluating image captions

US12586392B2 · US · B2

Patent metadata
Publication number: US-12586392-B2
Application number: US-202318179177-A
Country: US
Kind code: B2
Filing date: Mar 6, 2023
Priority date: Mar 6, 2023
Publication date: Mar 24, 2026
Grant date: Mar 24, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments are disclosed for training an image caption evaluation system to perform evaluations of image captions. In particular, in one or more embodiments, the disclosed systems and methods comprise receiving a training image, a ground truth image caption for the training image, and a perturbed image caption for the training image, where the perturbed image caption includes modifications to the ground truth image caption. The disclosed systems and methods further comprise generating, by a visual encoder, a visual embedding representation of the training image and generating, by a perturbation-aware text encoder, a first text embedding for the ground truth image caption and a second text embedding for the perturbed image caption. The disclosed systems and methods further comprise computing losses between the visual embedding, the first text embedding, and the second text embedding and training the perturbation-aware text encoder based on the computed losses.
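
As a rough illustration of what a "perturbed image caption" could look like, the sketch below applies three kinds of modification to a ground truth caption: replacing text elements with ones from a different caption, swapping text elements, and removing text elements. Everything here is an assumption made for illustration: the function names are hypothetical, "text elements" are treated as whitespace-separated words, and the random selection strategy is not taken from the patent.

```python
import random


def replace_elements(caption, other_caption, rng, n=1):
    """Overwrite up to n words of `caption` with words drawn from `other_caption`.

    Assumption: a "text element" is a whitespace-separated word.
    """
    words = caption.split()
    donors = other_caption.split()
    if not donors:
        return caption
    for idx in rng.sample(range(len(words)), k=min(n, len(words))):
        words[idx] = rng.choice(donors)
    return " ".join(words)


def swap_elements(caption, rng):
    """Exchange the words at two distinct positions of the caption."""
    words = caption.split()
    if len(words) < 2:
        return caption
    i, j = rng.sample(range(len(words)), k=2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)


def remove_elements(caption, rng, n=1):
    """Drop up to n words, always keeping at least one word."""
    words = caption.split()
    n = min(n, max(len(words) - 1, 0))
    drop = set(rng.sample(range(len(words)), k=n))
    return " ".join(w for i, w in enumerate(words) if i not in drop)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same output
    ground_truth = "a brown dog chases a red ball on the grass"
    other = "two children ride bicycles near the lake"
    print(replace_elements(ground_truth, other, rng, n=2))
    print(swap_elements(ground_truth, rng))
    print(remove_elements(ground_truth, rng, n=2))
```

Each function returns a new caption string and leaves the ground truth untouched, so one ground truth caption can yield several perturbed variants for the same training image.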

First claim

Opening claim text (preview).

We claim:

1. A computer-implemented method, comprising: receiving a training image and a ground truth image caption for the training image; generating a perturbed image caption for the training image by performing modifications to text elements of the ground truth image caption; generating, by a visual encoder, a visual embedding representation of the training image; generating, by a perturbation-aware text encoder, a first text embedding for the ground truth image caption and a second text embedding for the perturbed image caption; computing losses between the visual embedding representation of the training image, the first text embedding for the ground truth image caption, and the second text embedding for the perturbed image caption; and training the perturbation-aware text encoder based on the computed losses.

2. The computer-implemented method of claim 1, further comprising: performing an initial training phase of the perturbation-aware text encoder using a set of ground truth image captions in a first language and the set of ground truth image captions translated into a second language.

3. The computer-implemented method of claim 2, wherein performing the initial training phase of the perturbation-aware text encoder further comprises: for each ground truth image caption of the set of ground truth image captions: translating the ground truth image caption from the first language to the second language; generating, by a text encoder, a first text embedding representation of the ground truth image caption in the first language; generating, by the perturbation-aware text encoder, a second text embedding representation of the ground truth image caption in the second language; computing a loss between the first text embedding representation and the second text embedding representation; and backpropagating the loss to train the perturbation-aware text encoder.

4. The computer-implemented method of claim 1, further comprising: generating the perturbed image caption by replacing first text elements in the ground truth image caption with second text elements from a different image caption.

5. The computer-implemented method of claim 1, further comprising: generating the perturbed image caption by swapping text elements within the ground truth image caption.

6. The computer-implemented method of claim 1, further comprising: generating the perturbed image caption by removing text elements within the ground truth image caption.

7. The computer-implemented method of claim 1, wherein computing the losses between the visual embedding representation of the training image, the first text embedding for the ground truth image caption, and the second text embedding for the perturbed image caption further comprises: computing a first loss between the visual embedding representation of the training image and the first text embedding for the ground truth image caption; computing a second loss between the visual embedding representation of the training image and the second text embedding for the perturbed image caption; computing a third loss between the first text embedding for the ground truth image caption and the second text embedding for the perturbed image caption; and aggregating the first loss, the second loss, and the third loss to generate an overall loss.

8. The computer-implemented method of claim 1, wherein the training image is part of a training dataset, and wherein each image in the training dataset is associated with a ground truth image caption for the image, a description of an image background, a description of objects in the image, and a description of a relationship between the objects in the image.

9. A non-transitory computer-readable storage medium storing executable instructions, which when executed by a processing device, cause the processing device to perform operations comprising: receiving a training image and a ground truth image caption for the training image; generating a perturbed image caption for the training image by performing modifications to text elements of the ground truth image caption; generating, by a visual encoder, a visual embedding representation of the training image; generating, by a perturbation-aware text encoder, a first text embedding for the ground truth image caption and a second text embedding for the perturbed image caption; computing losses between the visual embedding representation of the training image, the first text embedding for the ground truth image caption, and the second text embedding for the perturbed image caption; and training the perturbation-aware text encoder based on the computed losses.

10. The non-transitory computer-readable storage medium of claim 9, wherein the instructions further cause the processing device to perform operations comprising: performing an initial training phase of the perturbation-aware text encoder using a set of ground truth image captions in a first language and the set of ground truth image captions translated into a second language.

11. The non-transitory computer-readable storage medium of claim 10, wherein to perform the initial training phase of the perturbation-aware text encoder the instructions further cause the processing device to perform operations comprising: for each ground truth image caption of the set of ground truth image captions: translating the ground truth image caption from the first language to the second language; generating, by a text encoder, a first text embedding representation of the ground truth image caption in the first language; generating, by the perturbation-aware text encoder, a second text embedding representation of the ground truth image caption in the second language; computing a loss between the first text embedding representation and the second text embedding representation; and backpropagating the loss to train the perturbation-aware text encoder.

12. The non-transitory computer-readable storage medium of claim 9, wherein the instructions further cause the processing device to perform operations comprising: generating the perturbed image caption by replacing first text elements in the ground truth image caption with second text elements from a different image caption.

13. The non-transitory computer-readable storage medium of claim 9, wherein the instructions further cause the processing device to perform operations comprising: generating the perturbed image caption by swapping text elements within the ground truth image caption.

14. The non-transitory computer-readable storage medium of claim 9, wherein the instructions further cause the processing device to perform operations comprising: generating the perturbed image caption by removing text elements within the ground truth image caption.

15. The non-transitory computer-readable storage medium of claim 9, wherein to compute the losses between the visual embedding representation of the training image, the first text embedding for the ground truth image caption, and the second text embedding for the perturbed image caption the instructions further cause the processing device to perform operations comprising: computing a first loss between the visual embedding representation of the training image and the first text embedding for the ground truth image caption; computing a second loss between the visual embedding representation of the training image and the second text embedding for the perturbed image caption; computing a third loss between the first text embedding for the ground truth image caption and the second text embe
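
The claims describe three pairwise losses (image vs. ground truth caption, image vs. perturbed caption, ground truth caption vs. perturbed caption) that are aggregated into an overall loss, but the preview text does not state the form of each loss. The sketch below is one possible reading under stated assumptions: cosine similarity as the comparison, a hinge with a hypothetical `margin` for the two perturbed-caption terms, and a weighted sum as the aggregation. The function names, the margin, and the weights are all illustrative and are not taken from the patent.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def overall_loss(image_emb, gt_emb, pert_emb, margin=0.2, weights=(1.0, 1.0, 1.0)):
    """Return (first, second, third, total) for one training example.

    Assumed loss forms (not specified in the claim text):
      first  : pulls the image and ground truth caption embeddings together
      second : penalizes image / perturbed caption similarity above `margin`
      third  : penalizes ground truth / perturbed caption similarity above `margin`
      total  : weighted sum of the three terms
    """
    first = 1.0 - cosine(image_emb, gt_emb)
    second = max(0.0, cosine(image_emb, pert_emb) - margin)
    third = max(0.0, cosine(gt_emb, pert_emb) - margin)
    w1, w2, w3 = weights
    total = w1 * first + w2 * second + w3 * third
    return first, second, third, total


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for encoder outputs.
    image = [1.0, 0.0]
    ground_truth = [0.9, 0.1]
    perturbed = [0.6, 0.8]
    print(overall_loss(image, ground_truth, perturbed))
```

In an actual training loop the total would be backpropagated through the perturbation-aware text encoder using an automatic differentiation framework; this sketch only shows how three pairwise terms can be combined into one scalar.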

Assignees

Adobe Inc

Inventors

Classifications

  • G06T1/0021 (Primary)

    Image watermarking · CPC title

  • Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation · CPC title

  • G06V20/70 (Primary)

    Labelling scene content, e.g. deriving syntactic or semantic representations · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12586392B2 cover?
Embodiments are disclosed for training an image caption evaluation system to perform evaluations of image captions. In particular, in one or more embodiments, the disclosed systems and methods comprise receiving a training image, a ground truth image caption for the training image, and a perturbed image caption for the training image, where the perturbed image caption includes modifications to …
Who is the assignee on this patent?
Adobe Inc
What technology area does this patent fall under?
Primary CPC classification G06T1/0021. Mapped technology areas include Physics.
When was this patent published?
Publication date: Mar 24, 2026 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).