Adaptive Token Sampling for Efficient Transformer
US-2023153379-A1 · May 18, 2023 · US
US11915474B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11915474-B2 |
| Application number | US-202217804724-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 31, 2022 |
| Priority date | May 31, 2022 |
| Publication date | Feb 27, 2024 |
| Grant date | Feb 27, 2024 |
Official abstract text for this publication.
Techniques and apparatus for analyzing visual content using a visual transformer are described. An example technique includes generating a first set of tokens based on a visual content item. Each token in the first set of tokens is associated with a regional feature from a different region of a plurality of regions of the visual content item. A second set of tokens is generated based on the visual content item. Each token in the second set of tokens is associated with a local feature from one of the plurality of regions of the visual content item. At least one feature map is generated for the visual content item, based on analyzing the first set of tokens and the second set of tokens separately using a hierarchical vision transformer. At least one vision task is performed based on the at least one feature map.
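The two token sets in the abstract can be illustrated with a minimal sketch. It assumes raw, non-overlapping pixel patches stand in for tokens; the abstract does not specify how tokens are embedded, and a real model would typically apply a learned projection. The function names, the patch sizes (8 and 4), and the use of NumPy are hypothetical choices made for illustration only.

```python
import numpy as np

def patch_tokens(image, patch):
    """Split an (H, W, C) image into non-overlapping patch x patch tokens.

    Returns an array of shape (H // patch, W // patch, patch * patch * C).
    Each token here is just the flattened pixels of its patch (an assumption).
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh, gw, patch * patch * c)

def regional_and_local_tokens(image, regional_patch=8, local_patch=4):
    """Build regional tokens (large patches) and local tokens (small patches).

    Returns:
      regional: (R, Dr)    one token per region
      local:    (R, M, Dl) the M local tokens that fall inside each region
    """
    assert regional_patch % local_patch == 0, "regions must tile into local patches"
    ratio = regional_patch // local_patch

    regional = patch_tokens(image, regional_patch)  # (Gh, Gw, Dr)
    local = patch_tokens(image, local_patch)        # (Gh * ratio, Gw * ratio, Dl)

    n_rows, n_cols = regional.shape[:2]
    d_local = local.shape[-1]
    # Group the local tokens by the region that contains them, without overlap.
    grouped = local.reshape(n_rows, ratio, n_cols, ratio, d_local)
    grouped = grouped.transpose(0, 2, 1, 3, 4)
    grouped = grouped.reshape(n_rows * n_cols, ratio * ratio, d_local)

    return regional.reshape(n_rows * n_cols, -1), grouped
```

With a 16 x 16 x 3 image and the default patch sizes, this yields 4 regional tokens and 4 local tokens per region, and each region's local tokens together cover exactly the pixels of that region.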
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method comprising: generating a first set of tokens based on a visual content item, wherein each token in the first set of tokens is associated with a regional feature from a different region of a plurality of regions of the visual content item; generating a second set of tokens based on the visual content item, wherein each token in the second set of tokens is associated with a local feature from one of the plurality of regions of the visual content item; generating at least one feature map for the visual content item, based on analyzing the first set of tokens and the second set of tokens separately using a hierarchical vision transformer; and performing at least one vision task based on the at least one feature map.
2. The computer-implemented method of claim 1, wherein: each of the first set of tokens has a first patch size; each of the second set of tokens has a second patch size; and the first patch size is greater than the second patch size.
3. The computer-implemented method of claim 1, wherein each token in the first set of tokens comprises a different plurality of tokens from the second set of tokens.
4. The computer-implemented method of claim 3, wherein the second set of tokens are distributed among the first set of tokens in a non-overlapping manner.
5. The computer-implemented method of claim 1, wherein analyzing the first set of tokens and the second set of tokens comprises: performing a regional self-attention on the first set of tokens; and for each token in the first set of tokens, performing a local self-attention on (i) the token in the first set of tokens and (ii) a plurality of tokens, from the second set of tokens, associated with the token in the first set of tokens.
6. The computer-implemented method of claim 5, wherein the local self-attention is performed after the regional self-attention on the first set of tokens.
7. The computer-implemented method of claim 5, wherein the local self-attention is performed in parallel for each token in the first set of tokens.
8. The computer-implemented method of claim 1, wherein analyzing the first set of tokens and the second set of tokens comprises performing a plurality of stages of regional-to-local self attention.
9. The computer-implemented method of claim 8, wherein, for at least a first stage of the plurality of stages, the respective regional-to-local self attention comprises performing a downsampling operation to (i) reduce a spatial resolution of the first set of tokens and the second set of tokens prior to a subsequent second stage of the plurality of stages and (ii) increase a channel dimension on the first set of tokens and the second set of tokens prior to the subsequent second stage.
10. The computer-implemented method of claim 1, wherein the at least one vision task comprises at least one of: (i) object detection, (ii) image classification, (iii) action recognition, or (iv) semantic segmentation.
11. The computer-implemented method of claim 1, wherein the visual content item comprises an image or a video.
12. A system, comprising: one or more computer processors; and a memory containing a program, which when executed by the one or more computer processors performs an operation, the operation comprising: generating a first set of tokens based on a visual content item, wherein each token in the first set of tokens is associated with a regional feature from a different region of a plurality of regions of the visual content item; generating a second set of tokens based on the visual content item, wherein each token in the second set of tokens is associated with a local feature from one of the plurality of regions of the visual content item; generating at least one feature map for the visual content item, based on analyzing the first set of tokens and the second set of tokens separately using a hierarchical vision transformer; and performing at least one vision task based on the at least one feature map.
13. The system of claim 12, wherein: each of the first set of tokens has a first patch size; each of the second set of tokens has a second patch size; and the first patch size is greater than the second patch size.
14. The system of claim 12, wherein each token in the first set of tokens comprises a different plurality of tokens from the second set of tokens.
15. The system of claim 12, wherein analyzing the first set of tokens and the second set of tokens comprises: performing a regional self-attention on the first set of tokens; and for each token in the first set of tokens, performing a local self-attention on (i) the token in the first set of tokens and (ii) a plurality of tokens, from the second set of tokens, associated with the token in the first set of tokens.
16. The system of claim 15, wherein the local self-attention is performed after the regional self-attention on the first set of tokens.
17. The system of claim 12, wherein analyzing the first set of tokens and the second set of tokens comprises performing a plurality of stages of regional-to-local self attention.
18. The system of claim 17, wherein, for at least a first stage of the plurality of stages, the respective regional-to-local self attention comprises performing a downsampling operation to (i) reduce a spatial resolution of the first set of tokens and the second set of tokens prior to a subsequent second stage of the plurality of stages and (ii) increase a channel dimension on the first set of tokens and the second set of tokens prior to the subsequent second stage.
19. The system of claim 12, wherein the at least one vision task comprises at least one of: (i) object detection, (ii) image classification, (iii) action recognition, or (iv) semantic segmentation.
20. A computer-readable storage medium having computer-readable program code embodied therewith, the computer-readable program code executable by one or more computer processors to perform an operation comprising: generating a first set of tokens based on a visual content item, wherein each token in the first set of tokens is associated with a regional feature from a different region of a plurality of regions of the visual content item; generating a second set of tokens based on the visual content item, wherein each token in the second set of tokens is associated with a local feature from one of the plurality of regions of the visual content item; generating at least one feature map for the visual content item, based on analyzing the first set of tokens and the second set of tokens separately using a hierarchical vision transformer; and performing at least one vision task based on the at least one feature map.
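The regional-to-local self-attention recited in claims 5 to 7 and 15 to 16 can be sketched as follows. This is an illustrative reading, not the patented implementation: it assumes single-head scaled dot-product attention with identity query/key/value projections, assumes regional and local tokens already share one embedding dimension, and omits residual connections, normalization, and feed-forward layers. All function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Scaled dot-product self-attention over the second-to-last axis.

    x has shape (..., N, D). Identity projections are used here (an assumption
    made to keep the sketch short); a real layer would learn Q, K, V weights.
    """
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)  # (..., N, N)
    return softmax(scores, axis=-1) @ x

def regional_to_local_attention(regional, local):
    """One regional-to-local attention step.

    regional: (R, D)    one token per region
    local:    (R, M, D) the M local tokens associated with each region
    Returns updated (regional, local) with the same shapes.
    """
    # Step 1: regional self-attention among the regional tokens only.
    regional = self_attention(regional)

    # Step 2: afterwards, each regional token attends together with its own
    # local tokens. Stacking regions on a batch axis handles them in parallel.
    groups = np.concatenate([regional[:, None, :], local], axis=1)  # (R, 1 + M, D)
    groups = self_attention(groups)

    return groups[:, 0, :], groups[:, 1:, :]
```

In this sketch, local tokens from different regions never attend to each other directly; information crosses region boundaries only through the regional tokens updated in step 1.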
Management of image or video recognition tasks · CPC title
Determination of region of interest [ROI] or a volume of interest [VOI] · CPC title
Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods · CPC title
using neural networks · CPC title
using classification, e.g. of video objects · CPC title