Contrastive captioning for image groups
US-2022058390-A1 · Feb 24, 2022 · US
US12586392B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12586392-B2 |
| Application number | US-202318179177-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 6, 2023 |
| Priority date | Mar 6, 2023 |
| Publication date | Mar 24, 2026 |
| Grant date | Mar 24, 2026 |
Official abstract text for this publication.
Embodiments are disclosed for training an image caption evaluation system to perform evaluations of image captions. In particular, in one or more embodiments, the disclosed systems and methods comprise receiving a training image, a ground truth image caption for the training image, and a perturbed image caption for the training image, where the perturbed image caption includes modifications to the ground truth image caption. The disclosed systems and methods further comprise generating, by a visual encoder, a visual embedding representation of the training image and generating, by a perturbation-aware text encoder, a first text embedding for the ground truth image caption and a second text embedding for the perturbed image caption. The disclosed systems and methods further comprise computing losses between the visual embedding, the first text embedding, and the second text embedding and training the perturbation-aware text encoder based on the computed losses.
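The abstract describes computing losses between the visual embedding and the two text embeddings, and claim 7 breaks this into a first, second, and third loss that are aggregated into an overall loss. A minimal sketch of that aggregation follows. The source does not state the form of each loss or how they are weighted, so the cosine-based terms, the `weights` parameter, and the function names here are illustrative assumptions, not the patented method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def overall_loss(image_emb, gt_emb, perturbed_emb, weights=(1.0, 1.0, 1.0)):
    """Aggregate three pairwise losses into one scalar.

    Assumed loss forms (hypothetical, chosen only for illustration):
      first  - image vs. ground truth caption: 1 - cos, small when aligned
      second - image vs. perturbed caption: max(0, cos), small when apart
      third  - ground truth vs. perturbed caption: max(0, cos), small when apart
    """
    first = 1.0 - cosine(image_emb, gt_emb)
    second = max(0.0, cosine(image_emb, perturbed_emb))
    third = max(0.0, cosine(gt_emb, perturbed_emb))
    w1, w2, w3 = weights
    # Weighted sum is one possible aggregation; the source only says "aggregating".
    return w1 * first + w2 * second + w3 * third


# Toy 2-D embeddings: the ground truth caption matches the image,
# the perturbed caption is orthogonal to both.
loss = overall_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
```

In a real training loop the embeddings would come from the visual encoder and the perturbation-aware text encoder, and the overall loss would be backpropagated through the text encoder; this sketch covers only the aggregation step.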
Opening claim text (preview).
We claim:

1. A computer-implemented method, comprising: receiving a training image and a ground truth image caption for the training image; generating a perturbed image caption for the training image by performing modifications to text elements of the ground truth image caption; generating, by a visual encoder, a visual embedding representation of the training image; generating, by a perturbation-aware text encoder, a first text embedding for the ground truth image caption and a second text embedding for the perturbed image caption; computing losses between the visual embedding representation of the training image, the first text embedding for the ground truth image caption, and the second text embedding for the perturbed image caption; and training the perturbation-aware text encoder based on the computed losses.

2. The computer-implemented method of claim 1, further comprising: performing an initial training phase of the perturbation-aware text encoder using a set of ground truth image captions in a first language and the set of ground truth image captions translated into a second language.

3. The computer-implemented method of claim 2, wherein performing the initial training phase of the perturbation-aware text encoder further comprises: for each ground truth image caption of the set of ground truth image captions: translating the ground truth image caption from the first language to the second language; generating, by a text encoder, a first text embedding representation of the ground truth image caption in the first language; generating, by the perturbation-aware text encoder, a second text embedding representation of the ground truth image caption in the second language; computing a loss between the first text embedding representation and the second text embedding representation; and backpropagating the loss to train the perturbation-aware text encoder.

4. The computer-implemented method of claim 1, further comprising: generating the perturbed image caption by replacing first text elements in the ground truth image caption with second text elements from a different image caption.

5. The computer-implemented method of claim 1, further comprising: generating the perturbed image caption by swapping text elements within the ground truth image caption.

6. The computer-implemented method of claim 1, further comprising: generating the perturbed image caption by removing text elements within the ground truth image caption.

7. The computer-implemented method of claim 1, wherein computing the losses between the visual embedding representation of the training image, the first text embedding for the ground truth image caption, and the second text embedding for the perturbed image caption further comprises: computing a first loss between the visual embedding representation of the training image and the first text embedding for the ground truth image caption; computing a second loss between the visual embedding representation of the training image and the second text embedding for the perturbed image caption; computing a third loss between the first text embedding for the ground truth image caption and the second text embedding for the perturbed image caption; and aggregating the first loss, the second loss, and the third loss to generate an overall loss.

8. The computer-implemented method of claim 1, wherein the training image is part of a training dataset, and wherein each image in the training dataset is associated with a ground truth image caption for the image, a description of an image background, a description of objects in the image, and a description of a relationship between the objects in the image.

9. A non-transitory computer-readable storage medium storing executable instructions, which when executed by a processing device, cause the processing device to perform operations comprising: receiving a training image and a ground truth image caption for the training image; generating a perturbed image caption for the training image by performing modifications to text elements of the ground truth image caption; generating, by a visual encoder, a visual embedding representation of the training image; generating, by a perturbation-aware text encoder, a first text embedding for the ground truth image caption and a second text embedding for the perturbed image caption; computing losses between the visual embedding representation of the training image, the first text embedding for the ground truth image caption, and the second text embedding for the perturbed image caption; and training the perturbation-aware text encoder based on the computed losses.

10. The non-transitory computer-readable storage medium of claim 9, wherein the instructions further cause the processing device to perform operations comprising: performing an initial training phase of the perturbation-aware text encoder using a set of ground truth image captions in a first language and the set of ground truth image captions translated into a second language.

11. The non-transitory computer-readable storage medium of claim 10, wherein to perform the initial training phase of the perturbation-aware text encoder the instructions further cause the processing device to perform operations comprising: for each ground truth image caption of the set of ground truth image captions: translating the ground truth image caption from the first language to the second language; generating, by a text encoder, a first text embedding representation of the ground truth image caption in the first language; generating, by the perturbation-aware text encoder, a second text embedding representation of the ground truth image caption in the second language; computing a loss between the first text embedding representation and the second text embedding representation; and backpropagating the loss to train the perturbation-aware text encoder.

12. The non-transitory computer-readable storage medium of claim 9, wherein the instructions further cause the processing device to perform operations comprising: generating the perturbed image caption by replacing first text elements in the ground truth image caption with second text elements from a different image caption.

13. The non-transitory computer-readable storage medium of claim 9, wherein the instructions further cause the processing device to perform operations comprising: generating the perturbed image caption by swapping text elements within the ground truth image caption.

14. The non-transitory computer-readable storage medium of claim 9, wherein the instructions further cause the processing device to perform operations comprising: generating the perturbed image caption by removing text elements within the ground truth image caption.

15. The non-transitory computer-readable storage medium of claim 9, wherein to compute the losses between the visual embedding representation of the training image, the first text embedding for the ground truth image caption, and the second text embedding for the perturbed image caption the instructions further cause the processing device to perform operations comprising: computing a first loss between the visual embedding representation of the training image and the first text embedding for the ground truth image caption; computing a second loss between the visual embedding representation of the training image and the second text embedding for the perturbed image caption; computing a third loss between the first text embedding for the ground truth image caption and the second text embe
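
Claims 4 to 6 (and 12 to 14) name three ways of producing a perturbed caption: replacing text elements with elements from a different caption, swapping text elements within the caption, and removing text elements. The sketch below illustrates each on a toy caption. It assumes a "text element" is a whitespace-separated token and that positions are passed in explicitly; the claims do not define either, and the function names are hypothetical.

```python
def replace_element(caption, other_caption, index, other_index):
    """Replace one token of `caption` with a token taken from `other_caption`."""
    tokens = caption.split()
    other_tokens = other_caption.split()
    tokens[index] = other_tokens[other_index]
    return " ".join(tokens)


def swap_elements(caption, i, j):
    """Swap the tokens at positions i and j within the same caption."""
    tokens = caption.split()
    tokens[i], tokens[j] = tokens[j], tokens[i]
    return " ".join(tokens)


def remove_element(caption, index):
    """Drop the token at position `index` from the caption."""
    tokens = caption.split()
    del tokens[index]
    return " ".join(tokens)


ground_truth = "a dog chases a ball"

replaced = replace_element(ground_truth, "a cat sleeps", 1, 1)  # "a cat chases a ball"
swapped = swap_elements(ground_truth, 1, 4)                     # "a ball chases a dog"
removed = remove_element(ground_truth, 4)                       # "a dog chases a"
```

A practical system would likely pick elements by part of speech or from parsed object and relationship descriptions rather than by fixed index; explicit indices are used here only to keep the example deterministic.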
Image watermarking · CPC title
Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation · CPC title
Labelling scene content, e.g. deriving syntactic or semantic representations · CPC title