System and method for dynamic images virtualisation
US-2024371084-A1 · Nov 7, 2024 · US
US2023154213A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2023154213-A1 |
| Application number | US-202217587161-A |
| Country | US |
| Kind code | A1 |
| Filing date | Jan 28, 2022 |
| Priority date | Nov 16, 2021 |
| Publication date | May 18, 2023 |
| Grant date | — |
Official abstract text for this publication.
Embodiments described herein provide methods and systems for open vocabulary object detection of images. Given a pre-trained vision-language model and an image-caption pair, an activation map may be computed in the image that corresponds to an object of interest mentioned in the caption. The activation map is then converted into a pseudo bounding-box label for the corresponding object category. The open vocabulary detector is then directly supervised by these pseudo box-labels, which enables training object detectors with no human-provided bounding-box annotations.
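The conversion from activation map to pseudo bounding-box label can be illustrated with a minimal sketch. It assumes the activation map is a small 2D grid of non-negative scores and uses a relative threshold followed by a tight enclosing box. The function name `activation_to_box` and the parameter `rel_threshold` are hypothetical and do not come from the publication, which does not specify this exact procedure in the abstract.

```python
def activation_to_box(act_map, rel_threshold=0.5):
    """Return (x_min, y_min, x_max, y_max) around the strongly activated cells.

    act_map is a list of rows of non-negative floats. A cell counts as
    relevant when its score is at least rel_threshold times the peak score.
    Both the thresholding rule and the tight-box rule are illustrative
    assumptions, not the method stated in the publication.
    """
    peak = max(max(row) for row in act_map)
    if peak <= 0:
        return None  # nothing in the image responds to the word
    cutoff = rel_threshold * peak
    cells = [
        (x, y)
        for y, row in enumerate(act_map)
        for x, score in enumerate(row)
        if score >= cutoff
    ]
    xs = [x for x, _ in cells]
    ys = [y for _, y in cells]
    # Upper bounds are exclusive, so a single cell yields a 1x1 box.
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


example_map = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 1.0, 0.0],
    [0.0, 0.8, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
pseudo_box = activation_to_box(example_map)
```

In a full pipeline the resulting box would be paired with the caption word as a training label for the detector; that training step is outside the scope of this sketch.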
Opening claim text (preview).
What is claimed is:
1. A method for object detection within an image, the method comprising: obtaining, via a data interface, the image having one or more regions and a caption associated with the image; encoding, by an image encoder, the image into a visual embedding; encoding, by a text encoder, at least one word from the caption into a text embedding; generating, via a cross-attention layer, multimodal features of the image and the word by applying cross-attention between the visual embedding and the text embedding; computing an activation map indicating relevance of the one or more regions in the image to the text embedding based on the multimodal features; determining a bounding-box annotation of the word based on the activation map; and incorporating the bounding-box annotation with the image as a training image sample in a training dataset.
2. The method of claim 1, wherein computing the activation map comprises computing a gradient with respect to cross-attention scores of the cross-attention layer.
3. The method of claim 1, wherein computing the activation map comprises averaging values from all attention heads of the cross-attention layer.
4. The method of claim 1, further comprising: training an open vocabulary object detector using the training dataset comprising the bounding-box annotation with the image.
5. The method of claim 4, further comprising: fine-tuning the open vocabulary object detector using categories trained with human-annotated bounding-boxes.
6. The method of claim 1, further comprising: determining the bounding-box annotation based on an overlap between a proposed bounding-box and a relevant region of the activation map.
7. The method of claim 6, wherein the proposed bounding-box is generated by a pre-trained proposal generator without reference to the caption.
8. The method of claim 1, further comprising: training a machine learning model for object detection based on the training image sample having the bounding-box annotation as a ground truth.
9. A system for object detection within an image, the system comprising: a memory that stores a dialogue structure extraction model; a communication interface that obtains the image having one or more regions and a caption associated with the image; and one or more hardware processors that: encodes, by an image encoder, the image into a visual embedding; encodes, by a text encoder, at least one word from the caption into a text embedding; generates, via a cross-attention layer, multimodal features of the image and the word by applying cross-attention between the visual embedding and the text embedding; computes an activation map indicating relevance of the one or more regions in the image to the text embedding based on the multimodal features; determines a bounding-box annotation of the word based on the activation map; and incorporates the bounding-box annotation with the image as a training image sample in a training dataset.
10. The system of claim 9, wherein the one or more hardware processors computes the activation map by computing a gradient with respect to cross-attention scores of the cross-attention layer.
11. The system of claim 9, wherein the one or more hardware processors computes the activation map by averaging values from all attention heads of the cross-attention layer.
12. The system of claim 9, wherein the one or more hardware processors further: trains an open vocabulary object detector using the training dataset comprising the bounding-box annotation with the image.
13. The system of claim 12, wherein the one or more hardware processors further: fine-tunes the open vocabulary object detector using categories trained with human-annotated bounding-boxes.
14. The system of claim 9, wherein the one or more hardware processors further: determines the bounding-box annotation based on an overlap between a proposed bounding-box and a relevant region of the activation map.
15. The system of claim 14, wherein the one or more hardware processors generates the proposed bounding-box by a pre-trained proposal generator without reference to the caption.
16. The system of claim 9, wherein the one or more hardware processors further: trains a machine learning model for object detection based on the training image sample having the bounding-box annotation as a ground truth.
17. A processor-readable non-transitory storage medium storing a plurality of processor-executable instructions for object detection within an image, the instructions being executed by a processor to perform operations comprising: obtaining, via a data interface, the image having one or more regions and a caption associated with the image; encoding, by an image encoder, the image into a visual embedding; encoding, by a text encoder, at least one word from the caption into a text embedding; generating, via a cross-attention layer, multimodal features of the image and the word by applying cross-attention between the visual embedding and the text embedding; computing an activation map indicating relevance of the one or more regions in the image to the text embedding based on the multimodal features; determining a bounding-box annotation of the word based on the activation map; and incorporating the bounding-box annotation with the image as a training image sample in a training dataset.
18. The processor-readable non-transitory storage medium of claim 17, wherein computing the activation map comprises computing a gradient with respect to cross-attention scores of the cross-attention layer.
19. The processor-readable non-transitory storage medium of claim 17, wherein computing the activation map comprises averaging values from all attention heads of the cross-attention layer.
20. The processor-readable non-transitory storage medium of claim 17, the instructions being executed by the processor to perform operations further comprising: training an open vocabulary object detector using the training dataset comprising the bounding-box annotation with the image.
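Two of the dependent claims lend themselves to a short illustration: averaging over attention heads (claims 3, 11, 19) and choosing a box by its overlap with the relevant region of the activation map (claims 6, 14). The sketch below is one possible reading, not the claimed implementation. It assumes per-head maps are small 2D grids, that the relevant region is the set of cells above a relative threshold, and that overlap is scored as intersection over union on grid cells. All function names and the `rel_threshold` parameter are hypothetical.

```python
def average_heads(head_maps):
    """Average a list of per-head H x W score grids into a single grid."""
    n = len(head_maps)
    height, width = len(head_maps[0]), len(head_maps[0][0])
    return [
        [sum(m[y][x] for m in head_maps) / n for x in range(width)]
        for y in range(height)
    ]


def relevant_cells(act_map, rel_threshold=0.5):
    """Cells scoring at least rel_threshold * peak (an assumed definition)."""
    peak = max(max(row) for row in act_map)
    if peak <= 0:
        return set()
    return {
        (x, y)
        for y, row in enumerate(act_map)
        for x, score in enumerate(row)
        if score >= rel_threshold * peak
    }


def overlap_score(box, cells):
    """Intersection over union between a box and a set of grid cells."""
    x0, y0, x1, y1 = box  # upper bounds are exclusive
    inside = sum(1 for (x, y) in cells if x0 <= x < x1 and y0 <= y < y1)
    union = (x1 - x0) * (y1 - y0) + len(cells) - inside
    return inside / union if union else 0.0


def pick_proposal(proposals, act_map, rel_threshold=0.5):
    """Return the proposal that best overlaps the relevant region."""
    cells = relevant_cells(act_map, rel_threshold)
    return max(proposals, key=lambda box: overlap_score(box, cells))


# Two hypothetical attention heads over a 3x3 grid of image regions.
heads = [
    [[0, 0, 0], [0, 1, 1], [0, 0, 0]],
    [[0, 0, 0], [0, 1, 0], [0, 0, 1]],
]
avg_map = average_heads(heads)
# Caption-agnostic proposals, standing in for a pre-trained proposal generator.
proposals = [(0, 0, 3, 3), (1, 1, 3, 3), (0, 0, 1, 1)]
best_box = pick_proposal(proposals, avg_map)
```

The gradient-based variant in claims 2, 10, and 18 would weight the attention scores by their gradients before this step; it is omitted here because it requires an automatic-differentiation framework.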
Image coding (bandwidth or redundancy reduction for static pictures H04N1/41; coding or decoding of static colour picture signals H04N1/64; methods or arrangements for coding, decoding, compressing or decompressing digital video signals H04N19/00) · CPC title
Character encoding · CPC title
Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods · CPC title
Organisation of the process, e.g. bagging or boosting · CPC title
Memory management · CPC title