What technology area does this patent fall under?

Primary CPC classification G06V10/96. Mapped technology areas include Physics.

When was this patent published?

Publication date Feb 27, 2024 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Regional-to-local attention for vision transformers

US11915474B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11915474-B2
Application number	US-202217804724-A
Country	US
Kind code	B2
Filing date	May 31, 2022
Priority date	May 31, 2022
Publication date	Feb 27, 2024
Grant date	Feb 27, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques and apparatus for analyzing visual content using a visual transformer are described. An example technique includes generating a first set of tokens based on a visual content item. Each token in the first set of tokens is associated with a regional feature from a different region of a plurality of regions of the visual content item. A second set of tokens is generated based on the visual content item. Each token in the second set of tokens is associated with a local feature from one of the plurality of regions of the visual content item. At least one feature map is generated for the visual content item, based on analyzing the first set of tokens and the second set of tokens separately using a hierarchical vision transformer. At least one vision task is performed based on the at least one feature map.
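The two token sets described in the abstract can be pictured with a minimal NumPy sketch. This is an illustration under stated assumptions, not the patented implementation: it assumes non-overlapping square patches, uses plain patch averaging as a stand-in for a learned patch embedding, and the function name `patch_tokens` and the patch sizes 16 and 4 are hypothetical choices made for the example.

```python
import numpy as np

def patch_tokens(image, patch):
    """Split an (H, W, C) image into non-overlapping patch x patch tiles and
    average each tile into one token (a stand-in for a learned embedding)."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    tiles = image.reshape(h // patch, patch, w // patch, patch, c)
    return tiles.mean(axis=(1, 3))  # shape: (H/patch, W/patch, C)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))     # toy "visual content item"

regional = patch_tokens(image, 16)  # first set: one token per region (2 x 2 grid)
local = patch_tokens(image, 4)      # second set: finer tokens (8 x 8 grid)

# Group the local tokens by the region they fall in, so that each regional
# token is paired with its own 4 x 4 = 16 local tokens.
ratio = 16 // 4
gh, gw, c = regional.shape
local_by_region = (
    local.reshape(gh, ratio, gw, ratio, c)
    .transpose(0, 2, 1, 3, 4)
    .reshape(gh * gw, ratio * ratio, c)
)

print(regional.shape, local.shape, local_by_region.shape)
# (2, 2, 3) (8, 8, 3) (4, 16, 3)
```

Because the larger patch size is a multiple of the smaller one here, every local token belongs to exactly one region, which matches the non-overlapping grouping described in the dependent claims.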

First claim

Opening claim text (preview).

What is claimed is:
1. A computer-implemented method comprising: generating a first set of tokens based on a visual content item, wherein each token in the first set of tokens is associated with a regional feature from a different region of a plurality of regions of the visual content item; generating a second set of tokens based on the visual content item, wherein each token in the second set of tokens is associated with a local feature from one of the plurality of regions of the visual content item; generating at least one feature map for the visual content item, based on analyzing the first set of tokens and the second set of tokens separately using a hierarchical vision transformer; and performing at least one vision task based on the at least one feature map.
2. The computer-implemented method of claim 1, wherein: each of the first set of tokens has a first patch size; each of the second set of tokens has a second patch size; and the first patch size is greater than the second patch size.
3. The computer-implemented method of claim 1, wherein each token in the first set of tokens comprises a different plurality of tokens from the second set of tokens.
4. The computer-implemented method of claim 3, wherein the second set of tokens are distributed among the first set of tokens in a non-overlapping manner.
5. The computer-implemented method of claim 1, wherein analyzing the first set of tokens and the second set of tokens comprises: performing a regional self-attention on the first set of tokens; and for each token in the first set of tokens, performing a local self-attention on (i) the token in the first set of tokens and (ii) a plurality of tokens, from the second set of tokens, associated with the token in the first set of tokens.
6. The computer-implemented method of claim 5, wherein the local self-attention is performed after the regional self-attention on the first set of tokens.
7. The computer-implemented method of claim 5, wherein the local self-attention is performed in parallel for each token in the first set of tokens.
8. The computer-implemented method of claim 1, wherein analyzing the first set of tokens and the second set of tokens comprises performing a plurality of stages of regional-to-local self attention.
9. The computer-implemented method of claim 8, wherein, for at least a first stage of the plurality of stages, the respective regional-to-local self attention comprises performing a downsampling operation to (i) reduce a spatial resolution of the first set of tokens and the second set of tokens prior to a subsequent second stage of the plurality of stages and (ii) increase a channel dimension on the first set of tokens and the second set of tokens prior to the subsequent second stage.
10. The computer-implemented method of claim 1, wherein the at least one vision task comprises at least one of: (i) object detection, (ii) image classification, (iii) action recognition, or (iv) semantic segmentation.
11. The computer-implemented method of claim 1, wherein the visual content item comprises an image or a video.
12. A system, comprising: one or more computer processors; and a memory containing a program, which when executed by the one or more computer processors performs an operation, the operation comprising: generating a first set of tokens based on a visual content item, wherein each token in the first set of tokens is associated with a regional feature from a different region of a plurality of regions of the visual content item; generating a second set of tokens based on the visual content item, wherein each token in the second set of tokens is associated with a local feature from one of the plurality of regions of the visual content item; generating at least one feature map for the visual content item, based on analyzing the first set of tokens and the second set of tokens separately using a hierarchical vision transformer; and performing at least one vision task based on the at least one feature map.
13. The system of claim 12, wherein: each of the first set of tokens has a first patch size; each of the second set of tokens has a second patch size; and the first patch size is greater than the second patch size.
14. The system of claim 12, wherein each token in the first set of tokens comprises a different plurality of tokens from the second set of tokens.
15. The system of claim 12, wherein analyzing the first set of tokens and the second set of tokens comprises: performing a regional self-attention on the first set of tokens; and for each token in the first set of tokens, performing a local self-attention on (i) the token in the first set of tokens and (ii) a plurality of tokens, from the second set of tokens, associated with the token in the first set of tokens.
16. The system of claim 15, wherein the local self-attention is performed after the regional self-attention on the first set of tokens.
17. The system of claim 12, wherein analyzing the first set of tokens and the second set of tokens comprises performing a plurality of stages of regional-to-local self attention.
18. The system of claim 17, wherein, for at least a first stage of the plurality of stages, the respective regional-to-local self attention comprises performing a downsampling operation to (i) reduce a spatial resolution of the first set of tokens and the second set of tokens prior to a subsequent second stage of the plurality of stages and (ii) increase a channel dimension on the first set of tokens and the second set of tokens prior to the subsequent second stage.
19. The system of claim 12, wherein the at least one vision task comprises at least one of: (i) object detection, (ii) image classification, (iii) action recognition, or (iv) semantic segmentation.
20. A computer-readable storage medium having computer-readable program code embodied therewith, the computer-readable program code executable by one or more computer processors to perform an operation comprising: generating a first set of tokens based on a visual content item, wherein each token in the first set of tokens is associated with a regional feature from a different region of a plurality of regions of the visual content item; generating a second set of tokens based on the visual content item, wherein each token in the second set of tokens is associated with a local feature from one of the plurality of regions of the visual content item; generating at least one feature map for the visual content item, based on analyzing the first set of tokens and the second set of tokens separately using a hierarchical vision transformer; and performing at least one vision task based on the at least one feature map.
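The two-step attention recited in claims 5 to 7 (regional self-attention first, then a local self-attention per region over the regional token plus its associated local tokens) can be sketched as follows. This is a hedged illustration, not the claimed implementation: it assumes single-head scaled dot-product attention with identity projections, and it omits the learned query/key/value weights, residual connections, normalization, and feed-forward layers that a real transformer block would typically include. The function names `self_attention` and `regional_to_local` and the toy sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity
    projections (an assumption of this sketch). (..., N, D) -> (..., N, D)."""
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores) @ tokens

def regional_to_local(regional, local):
    """regional: (R, D), one token per region.
    local: (R, L, D), the L local tokens associated with each region."""
    # Step 1: regional self-attention across all regional tokens.
    regional = self_attention(regional)
    # Step 2: for each region, local self-attention over the updated regional
    # token and that region's local tokens. The leading axis R is a batch
    # axis, so all regions are processed in parallel.
    groups = np.concatenate([regional[:, None, :], local], axis=1)  # (R, 1+L, D)
    groups = self_attention(groups)
    return groups[:, 0, :], groups[:, 1:, :]

rng = np.random.default_rng(0)
regional = rng.standard_normal((4, 8))   # 4 regions, 8 channels
local = rng.standard_normal((4, 16, 8))  # 16 local tokens per region

out_regional, out_local = regional_to_local(regional, local)
print(out_regional.shape, out_local.shape)  # (4, 8) (4, 16, 8)
```

In this arrangement the local tokens of one region never attend directly to the local tokens of another region; any cross-region information reaches them only through the regional token, which has already attended to the other regional tokens in step 1.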

Assignees

Inventors

Classifications

G06V10/96 · Primary
Management of image or video recognition tasks · CPC title
G06V10/25
Determination of region of interest [ROI] or a volume of interest [VOI] · CPC title
G06V10/7715
Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods · CPC title
G06V10/82 · Primary
using neural networks · CPC title
G06V10/764
using classification, e.g. of video objects · CPC title

Patent family

Related publications grouped by family.

View patent family 88876558

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11915474B2 cover?: Techniques and apparatus for analyzing visual content using a visual transformer are described. An example technique includes generating a first set of tokens based on a visual content item. Each token in the first set of tokens is associated with a regional feature from a different region of a plurality of regions of the visual content item. A second set of tokens is generated based on the vis…
Who is the assignee on this patent?: IBM

Related patents

Adaptive Token Sampling for Efficient Transformer

Aggregating Nested Vision Transformers

Sequence recognition method and apparatus, electronic device, and storage medium

Object detecting system for detecting object by using hierarchical pyramid and object detecting method thereof

Method for semantically labeling an image of a scene using recursive context propagation
