Regional-to-local attention for vision transformers

US11915474B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11915474-B2
Application number  US-202217804724-A
Country             US
Kind code           B2
Filing date         May 31, 2022
Priority date       May 31, 2022
Publication date    Feb 27, 2024
Grant date          Feb 27, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques and apparatus for analyzing visual content using a visual transformer are described. An example technique includes generating a first set of tokens based on a visual content item. Each token in the first set of tokens is associated with a regional feature from a different region of a plurality of regions of the visual content item. A second set of tokens is generated based on the visual content item. Each token in the second set of tokens is associated with a local feature from one of the plurality of regions of the visual content item. At least one feature map is generated for the visual content item, based on analyzing the first set of tokens and the second set of tokens separately using a hierarchical vision transformer. At least one vision task is performed based on the at least one feature map.
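
The two token sets described in the abstract can be illustrated with a small sketch. This is not the patented implementation: it assumes a toy 8x8 single-channel image, and it uses plain average pooling as a stand-in for whatever learned patch embedding a real model would use. The names patch_tokens and locals_in_region, and the patch sizes 4 and 2, are hypothetical choices made for illustration only.

```python
def patch_tokens(image, patch):
    """Average-pool non-overlapping patch x patch blocks into a grid of scalar tokens.

    Assumption: averaging stands in for a learned patch-embedding layer.
    """
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    grid = []
    for top in range(0, height, patch):
        row = []
        for left in range(0, width, patch):
            block = [image[y][x]
                     for y in range(top, top + patch)
                     for x in range(left, left + patch)]
            row.append(sum(block) / len(block))
        grid.append(row)
    return grid


def locals_in_region(local_grid, region_row, region_col, ratio):
    """Return the local tokens that fall inside one region (non-overlapping split)."""
    rows = range(region_row * ratio, (region_row + 1) * ratio)
    cols = range(region_col * ratio, (region_col + 1) * ratio)
    return [local_grid[r][c] for r in rows for c in cols]


# Toy 8x8 "image" whose pixel value is row * 8 + column.
image = [[r * 8 + c for c in range(8)] for r in range(8)]

regional = patch_tokens(image, patch=4)  # first set: 2x2 grid, larger patch size
local = patch_tokens(image, patch=2)     # second set: 4x4 grid, smaller patch size

# Each regional token is paired with the 2x2 block of local tokens it covers.
group = locals_in_region(local, 0, 0, ratio=2)
```

In this sketch the regional grid holds one token per region, and each region owns a distinct group of four local tokens, which mirrors the larger-versus-smaller patch sizes and the non-overlapping distribution recited in the dependent claims.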

First claim

Opening claim text (preview).

What is claimed is:
1. A computer-implemented method comprising: generating a first set of tokens based on a visual content item, wherein each token in the first set of tokens is associated with a regional feature from a different region of a plurality of regions of the visual content item; generating a second set of tokens based on the visual content item, wherein each token in the second set of tokens is associated with a local feature from one of the plurality of regions of the visual content item; generating at least one feature map for the visual content item, based on analyzing the first set of tokens and the second set of tokens separately using a hierarchical vision transformer; and performing at least one vision task based on the at least one feature map.
2. The computer-implemented method of claim 1, wherein: each of the first set of tokens has a first patch size; each of the second set of tokens has a second patch size; and the first patch size is greater than the second patch size.
3. The computer-implemented method of claim 1, wherein each token in the first set of tokens comprises a different plurality of tokens from the second set of tokens.
4. The computer-implemented method of claim 3, wherein the second set of tokens are distributed among the first set of tokens in a non-overlapping manner.
5. The computer-implemented method of claim 1, wherein analyzing the first set of tokens and the second set of tokens comprises: performing a regional self-attention on the first set of tokens; and for each token in the first set of tokens, performing a local self-attention on (i) the token in the first set of tokens and (ii) a plurality of tokens, from the second set of tokens, associated with the token in the first set of tokens.
6. The computer-implemented method of claim 5, wherein the local self-attention is performed after the regional self-attention on the first set of tokens.
7. The computer-implemented method of claim 5, wherein the local self-attention is performed in parallel for each token in the first set of tokens.
8. The computer-implemented method of claim 1, wherein analyzing the first set of tokens and the second set of tokens comprises performing a plurality of stages of regional-to-local self attention.
9. The computer-implemented method of claim 8, wherein, for at least a first stage of the plurality of stages, the respective regional-to-local self attention comprises performing a downsampling operation to (i) reduce a spatial resolution of the first set of tokens and the second set of tokens prior to a subsequent second stage of the plurality of stages and (ii) increase a channel dimension on the first set of tokens and the second set of tokens prior to the subsequent second stage.
10. The computer-implemented method of claim 1, wherein the at least one vision task comprises at least one of: (i) object detection, (ii) image classification, (iii) action recognition, or (iv) semantic segmentation.
11. The computer-implemented method of claim 1, wherein the visual content item comprises an image or a video.
12. A system, comprising: one or more computer processors; and a memory containing a program, which when executed by the one or more computer processors performs an operation, the operation comprising: generating a first set of tokens based on a visual content item, wherein each token in the first set of tokens is associated with a regional feature from a different region of a plurality of regions of the visual content item; generating a second set of tokens based on the visual content item, wherein each token in the second set of tokens is associated with a local feature from one of the plurality of regions of the visual content item; generating at least one feature map for the visual content item, based on analyzing the first set of tokens and the second set of tokens separately using a hierarchical vision transformer; and performing at least one vision task based on the at least one feature map.
13. The system of claim 12, wherein: each of the first set of tokens has a first patch size; each of the second set of tokens has a second patch size; and the first patch size is greater than the second patch size.
14. The system of claim 12, wherein each token in the first set of tokens comprises a different plurality of tokens from the second set of tokens.
15. The system of claim 12, wherein analyzing the first set of tokens and the second set of tokens comprises: performing a regional self-attention on the first set of tokens; and for each token in the first set of tokens, performing a local self-attention on (i) the token in the first set of tokens and (ii) a plurality of tokens, from the second set of tokens, associated with the token in the first set of tokens.
16. The system of claim 15, wherein the local self-attention is performed after the regional self-attention on the first set of tokens.
17. The system of claim 12, wherein analyzing the first set of tokens and the second set of tokens comprises performing a plurality of stages of regional-to-local self attention.
18. The system of claim 17, wherein, for at least a first stage of the plurality of stages, the respective regional-to-local self attention comprises performing a downsampling operation to (i) reduce a spatial resolution of the first set of tokens and the second set of tokens prior to a subsequent second stage of the plurality of stages and (ii) increase a channel dimension on the first set of tokens and the second set of tokens prior to the subsequent second stage.
19. The system of claim 12, wherein the at least one vision task comprises at least one of: (i) object detection, (ii) image classification, (iii) action recognition, or (iv) semantic segmentation.
20. A computer-readable storage medium having computer-readable program code embodied therewith, the computer-readable program code executable by one or more computer processors to perform an operation comprising: generating a first set of tokens based on a visual content item, wherein each token in the first set of tokens is associated with a regional feature from a different region of a plurality of regions of the visual content item; generating a second set of tokens based on the visual content item, wherein each token in the second set of tokens is associated with a local feature from one of the plurality of regions of the visual content item; generating at least one feature map for the visual content item, based on analyzing the first set of tokens and the second set of tokens separately using a hierarchical vision transformer; and performing at least one vision task based on the at least one feature map.
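
The ordering recited in claims 5 and 6 (regional self-attention first, then a local self-attention over each regional token together with its associated local tokens) can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: it uses single-head scaled dot-product attention with identity projections (queries, keys, and values are the tokens themselves), and it omits the learned weights, normalization, residual connections, and feed-forward layers that a real transformer block would include. The function names self_attention and regional_to_local and the toy token values are hypothetical.

```python
import math


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: Q = K = V = tokens (no learned projections), for illustration only.
    """
    dim = len(tokens[0])
    out = []
    for query in tokens:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * value[d] for w, value in zip(weights, tokens))
                    for d in range(dim)])
    return out


def regional_to_local(regional, local_groups):
    """One regional-to-local step: regional attention, then per-region local attention."""
    # Step 1: every regional token attends to every other regional token.
    regional = self_attention(regional)
    new_regional, new_groups = [], []
    # Step 2: each region is handled independently, so a real implementation
    # could batch these calls and run them in parallel; a loop keeps this readable.
    for region_token, group in zip(regional, local_groups):
        mixed = self_attention([region_token] + group)
        new_regional.append(mixed[0])   # updated regional token
        new_groups.append(mixed[1:])    # updated local tokens of this region
    return new_regional, new_groups


# Toy input: two regions, each with two local tokens of dimension 2.
regional_tokens = [[1.0, 0.0], [0.0, 1.0]]
local_token_groups = [
    [[1.0, 0.0], [0.5, 0.5]],
    [[0.0, 1.0], [0.5, 0.5]],
]
out_regional, out_groups = regional_to_local(regional_tokens, local_token_groups)
```

Because each local attention window contains only one regional token plus that region's local tokens, the local tokens never attend directly to tokens in other regions; in this sketch, information from other regions reaches them only through the regional token updated in step 1.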

Assignees

Inventors

Classifications

  • G06V10/96 (Primary)

    Management of image or video recognition tasks · CPC title

  • Determination of region of interest [ROI] or a volume of interest [VOI] · CPC title

  • Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods · CPC title

  • G06V10/82 (Primary)

    using neural networks · CPC title

  • using classification, e.g. of video objects · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11915474B2 cover?
Techniques and apparatus for analyzing visual content using a visual transformer are described. An example technique includes generating a first set of tokens based on a visual content item. Each token in the first set of tokens is associated with a regional feature from a different region of a plurality of regions of the visual content item. A second set of tokens is generated based on the vis…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06V10/96. Mapped technology areas include Physics.
When was this patent published?
Publication date: Feb 27, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).