Systems and methods for open vocabulary object detection

US2023154213A1 · US · A1

Patent metadata
Publication number: US-2023154213-A1
Application number: US-202217587161-A
Country: US
Kind code: A1
Filing date: Jan 28, 2022
Priority date: Nov 16, 2021
Publication date: May 18, 2023
Grant date: (not listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments described herein provide methods and systems for open vocabulary object detection of images. Given a pre-trained vision-language model and an image-caption pair, an activation map may be computed in the image that corresponds to an object of interest mentioned in the caption. The activation map is then converted into a pseudo bounding-box label for the corresponding object category. The open vocabulary detector is then directly supervised by these pseudo box-labels, which enables training object detectors with no human-provided bounding-box annotations.
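The abstract does not say how an activation map becomes a box, so the following is only a minimal sketch of one simple possibility: keep the cells whose activation is at least a fraction of the peak and take the tight box around them. The function name `activation_to_box`, the `rel_threshold` parameter, and the toy map are all hypothetical and are not taken from the patent.

```python
from typing import List, Optional, Tuple

# (x0, y0, x1, y1) in grid cells, half-open: x0 <= x < x1 and y0 <= y < y1.
Box = Tuple[int, int, int, int]


def activation_to_box(act_map: List[List[float]],
                      rel_threshold: float = 0.5) -> Optional[Box]:
    """Tight box around cells with activation >= rel_threshold * peak.

    Assumption: this thresholding rule is an illustrative stand-in,
    not the conversion described in the patent.
    """
    peak = max(max(row) for row in act_map)
    if peak <= 0:
        return None  # no region responds to the caption word
    cutoff = rel_threshold * peak
    hits = [(x, y)
            for y, row in enumerate(act_map)
            for x, value in enumerate(row)
            if value >= cutoff]
    xs = [x for x, _ in hits]
    ys = [y for _, y in hits]
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


# Toy 4x5 activation map for a caption word such as "dog" (made-up values).
toy_map = [
    [0.0, 0.1, 0.0, 0.0, 0.0],
    [0.0, 0.6, 0.9, 0.1, 0.0],
    [0.0, 0.7, 1.0, 0.2, 0.0],
    [0.0, 0.0, 0.1, 0.0, 0.0],
]

# A pseudo label pairs the caption word with the derived box.
pseudo_label = {"category": "dog", "box": activation_to_box(toy_map)}
print(pseudo_label)
```

A detector trained on such labels would treat `pseudo_label["box"]` as if it were a human annotation, which is the sense in which the abstract speaks of supervision without human-provided boxes.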

First claim

Opening claim text (preview).

What is claimed is:

1. A method for object detection within an image, the method comprising: obtaining, via a data interface, the image having one or more regions and a caption associated with the image; encoding, by an image encoder, the image into a visual embedding; encoding, by a text encoder, at least one word from the caption into a text embedding; generating, via a cross-attention layer, multimodal features of the image and the word by applying cross-attention between the visual embedding and the text embedding; computing an activation map indicating relevance of the one or more regions in the image to the text embedding based on the multimodal features; determining a bounding-box annotation of the word based on the activation map; and incorporating the bounding-box annotation with the image as a training image sample in a training dataset.

2. The method of claim 1, wherein computing the activation map comprises computing a gradient with respect to cross-attention scores of the cross-attention layer.

3. The method of claim 1, wherein computing the activation map comprises averaging values from all attention heads of the cross-attention layer.

4. The method of claim 1, further comprising: training an open vocabulary object detector using the training dataset comprising the bounding-box annotation with the image.

5. The method of claim 4, further comprising: fine-tuning the open vocabulary object detector using categories trained with human-annotated bounding-boxes.

6. The method of claim 1, further comprising: determining the bounding-box annotation based on an overlap between a proposed bounding-box and a relevant region of the activation map.

7. The method of claim 6, wherein the proposed bounding-box is generated by a pre-trained proposal generator without reference to the caption.

8. The method of claim 1, further comprising: training a machine learning model for object detection based on the training image sample having the bounding-box annotation as a ground truth.

9. A system for object detection within an image, the system comprising: a memory that stores a dialogue structure extraction model; a communication interface that obtains the image having one or more regions and a caption associated with the image; and one or more hardware processors that: encodes, by an image encoder, the image into a visual embedding; encodes, by a text encoder, at least one word from the caption into a text embedding; generates, via a cross-attention layer, multimodal features of the image and the word by applying cross-attention between the visual embedding and the text embedding; computes an activation map indicating relevance of the one or more regions in the image to the text embedding based on the multimodal features; determines a bounding-box annotation of the word based on the activation map; and incorporates the bounding-box annotation with the image as a training image sample in a training dataset.

10. The system of claim 9, wherein the one or more hardware processors computes the activation map by computing a gradient with respect to cross-attention scores of the cross-attention layer.

11. The system of claim 9, wherein the one or more hardware processors computes the activation map by averaging values from all attention heads of the cross-attention layer.

12. The system of claim 9, wherein the one or more hardware processors further: trains an open vocabulary object detector using the training dataset comprising the bounding-box annotation with the image.

13. The system of claim 12, wherein the one or more hardware processors further: fine-tunes the open vocabulary object detector using categories trained with human-annotated bounding-boxes.

14. The system of claim 9, wherein the one or more hardware processors further: determines the bounding-box annotation based on an overlap between a proposed bounding-box and a relevant region of the activation map.

15. The system of claim 14, wherein the one or more hardware processors generates the proposed bounding-box by a pre-trained proposal generator without reference to the caption.

16. The system of claim 9, wherein the one or more hardware processors further: trains a machine learning model for object detection based on the training image sample having the bounding-box annotation as a ground truth.

17. A processor-readable non-transitory storage medium storing a plurality of processor-executable instructions for object detection within an image, the instructions being executed by a processor to perform operations comprising: obtaining, via a data interface, the image having one or more regions and a caption associated with the image; encoding, by an image encoder, the image into a visual embedding; encoding, by a text encoder, at least one word from the caption into a text embedding; generating, via a cross-attention layer, multimodal features of the image and the word by applying cross-attention between the visual embedding and the text embedding; computing an activation map indicating relevance of the one or more regions in the image to the text embedding based on the multimodal features; determining a bounding-box annotation of the word based on the activation map; and incorporating the bounding-box annotation with the image as a training image sample in a training dataset.

18. The processor-readable non-transitory storage medium of claim 17, wherein computing the activation map comprises computing a gradient with respect to cross-attention scores of the cross-attention layer.

19. The processor-readable non-transitory storage medium of claim 17, wherein computing the activation map comprises averaging values from all attention heads of the cross-attention layer.

20. The processor-readable non-transitory storage medium of claim 17, the instructions being executed by the processor to perform operations further comprising: training an open vocabulary object detector using the training dataset comprising the bounding-box annotation with the image.
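Claims 2, 3, 6 and 7 name three ingredients without giving formulas: a gradient with respect to the cross-attention scores, an average over attention heads, and an overlap between a proposed box and the relevant region of the activation map. The sketch below shows one way these could fit together, under loudly stated assumptions: the Grad-CAM-style weighting `max(gradient, 0) * attention`, the use of intersection-over-union as the overlap measure, the `rel_threshold` cutoff, and every function name are hypothetical. The attention scores, gradients, and proposals are hard-coded toy values standing in for the outputs of a real vision-language model and a pre-trained proposal generator.

```python
from typing import List, Sequence, Tuple

Grid = List[List[float]]
# (x0, y0, x1, y1) in grid cells, half-open: x0 <= x < x1 and y0 <= y < y1.
Box = Tuple[int, int, int, int]


def head_averaged_map(attn: Sequence[Grid], grad: Sequence[Grid]) -> Grid:
    """Average over heads of max(grad, 0) * attn (assumed Grad-CAM-style rule)."""
    n_heads = len(attn)
    height, width = len(attn[0]), len(attn[0][0])
    return [
        [sum(max(grad[k][y][x], 0.0) * attn[k][y][x] for k in range(n_heads))
         / n_heads
         for x in range(width)]
        for y in range(height)
    ]


def overlap_score(box: Box, region: List[List[bool]]) -> float:
    """Intersection-over-union between a box and the set of relevant cells."""
    x0, y0, x1, y1 = box
    area_box = (x1 - x0) * (y1 - y0)
    area_region = sum(cell for row in region for cell in row)
    inter = sum(region[y][x] for y in range(y0, y1) for x in range(x0, x1))
    union = area_box + area_region - inter
    return inter / union if union else 0.0


def pick_pseudo_box(act_map: Grid, proposals: Sequence[Box],
                    rel_threshold: float = 0.5) -> Box:
    """Return the proposal that best overlaps the high-activation region."""
    peak = max(max(row) for row in act_map)
    region = [[peak > 0 and v >= rel_threshold * peak for v in row]
              for row in act_map]
    return max(proposals, key=lambda b: overlap_score(b, region))


# Toy cross-attention scores for one caption word: 2 heads over a 3x4 grid.
attn = [
    [[0.0, 0.0, 0.0, 0.0], [0.0, 0.8, 0.6, 0.0], [0.0, 0.4, 0.2, 0.0]],
    [[0.2, 0.0, 0.0, 0.0], [0.0, 0.4, 0.6, 0.0], [0.0, 0.0, 0.0, 0.0]],
]
# Toy gradients; the negative entry suppresses a spurious attention peak.
grad = [
    [[1.0] * 4 for _ in range(3)],
    [[-1.0, 1.0, 1.0, 1.0], [1.0] * 4, [1.0] * 4],
]
# Caption-agnostic proposals, as a pre-trained generator might emit them.
proposals = [(0, 0, 4, 3), (1, 1, 3, 2), (1, 1, 3, 3)]

act = head_averaged_map(attn, grad)
best = pick_pseudo_box(act, proposals)
print(best)
```

Choosing among proposals rather than drawing a box directly from the map means the pseudo label inherits the box quality of the proposal generator, while the activation map only decides which proposal matches the caption word.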

Assignees

  • Salesforce Com Inc

Inventors

Classifications

  • G06T9/00 · Primary

    Image coding (bandwidth or redundancy reduction for static pictures H04N1/41; coding or decoding of static colour picture signals H04N1/64; methods or arrangements for coding, decoding, compressing or decompressing digital video signals H04N19/00) · CPC title

  • Character encoding · CPC title

  • Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods · CPC title

  • Organisation of the process, e.g. bagging or boosting · CPC title

  • Memory management · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2023154213A1 cover?
Embodiments described herein provide methods and systems for open vocabulary object detection of images. Given a pre-trained vision-language model and an image-caption pair, an activation map may be computed in the image that corresponds to an object of interest mentioned in the caption. The activation map is then converted into a pseudo bounding-box label for the corresponding object category.…
Who is the assignee on this patent?
Salesforce Com Inc
What technology area does this patent fall under?
Primary CPC classification G06T9/00. Mapped technology areas include Physics.
When was this patent published?
Publication date: May 18, 2023 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).