Lightweight transformer for high resolution images

US11983239B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11983239-B2
Application number  US-202117342483-A
Country             US
Kind code           B2
Filing date         Jun 8, 2021
Priority date       Jun 8, 2021
Publication date    May 14, 2024
Grant date          May 14, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods for obtaining attention features are described. Some examples may include: receiving, at a projector of a transformer, a plurality of tokens associated with image features of a first dimensional space; generating, at the projector of the transformer, projected features by concatenating the plurality of tokens with a positional map, the projected features having a second dimensional space that is less than the first dimensional space; receiving, at an encoder of the transformer, the projected features and generating encoded representations of the projected features using self-attention; decoding, at a decoder of the transformer, the encoded representations and obtaining a decoded output; and projecting the decoded output to the first dimensional space and adding the image features of the first dimensional space to obtain attention features associated with the image features.
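
The data flow in the abstract can be illustrated with a minimal, dependency-free Python sketch. It rests on several assumptions that the abstract does not confirm: "dimensional space" is read here as spatial resolution, the projection uses bilinear interpolation with an align-corners convention, and the positional map holds normalised (y, x) coordinates. All function names are hypothetical, and the encoder and decoder are replaced by a placeholder so that only the projection and the final residual addition are shown.

```python
def bilinear_resize(fmap, out_h, out_w):
    """Resize a grid of channel vectors (rows x cols x channels) bilinearly."""
    in_h, in_w = len(fmap), len(fmap[0])
    channels = len(fmap[0][0])
    out = []
    for i in range(out_h):
        # Map the output row to a source coordinate (align-corners, an assumption).
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            row.append([
                (1 - wy) * ((1 - wx) * fmap[y0][x0][c] + wx * fmap[y0][x1][c])
                + wy * ((1 - wx) * fmap[y1][x0][c] + wx * fmap[y1][x1][c])
                for c in range(channels)
            ])
        out.append(row)
    return out


def positional_map(h, w):
    """Two-channel map of normalised (y, x) coordinates in [0, 1] (assumed form)."""
    return [[[i / max(h - 1, 1), j / max(w - 1, 1)] for j in range(w)]
            for i in range(h)]


def project(features, out_h, out_w):
    """Downsample the features, then concatenate the positional map channel-wise."""
    small = bilinear_resize(features, out_h, out_w)
    pos = positional_map(out_h, out_w)
    return [[small[i][j] + pos[i][j] for j in range(out_w)] for i in range(out_h)]


def restore_and_add(decoded, features):
    """Project the decoded output back to the input size and add the input features."""
    in_h, in_w = len(features), len(features[0])
    up = bilinear_resize(decoded, in_h, in_w)
    return [[[u + f for u, f in zip(up[i][j], features[i][j])]
             for j in range(in_w)] for i in range(in_h)]


features = [[[float(4 * i + j)] for j in range(4)] for i in range(4)]  # 4x4, 1 channel
projected = project(features, 2, 2)  # 2x2 grid with 1 + 2 channels per cell
# Placeholder for the encoder/decoder: just drop the two positional channels.
decoded = [[cell[:1] for cell in row] for row in projected]
attention_features = restore_and_add(decoded, features)
```

Running attention on the smaller grid is what keeps the cost low in this reading: self-attention scales with the square of the token count, so a 2x2 grid needs far fewer pairwise comparisons than the original 4x4 grid.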

First claim

Opening claim text (preview).

What is claimed is:

1. A method of obtaining attention features, the method comprising: receiving, at a projector of a transformer, a plurality of tokens associated with image features of a first dimensional space; generating, at the projector of the transformer, projected features by concatenating the plurality of tokens with a positional map, the projected features having a second dimensional space that is less than the first dimensional space; receiving, at an encoder of the transformer, the projected features and generating encoded representations of the projected features using self-attention; decoding, at a decoder of the transformer, the encoded representations of the projected features and obtaining a decoded output; and projecting the decoded output to the first dimensional space and adding the image features of the first dimensional space to obtain attention features associated with the image features.

2. The method of claim 1, further comprising: applying, at the encoder of the transformer, self-attention to the projected features using a multi-head self-attention configuration, the multi-head self-attention configuration receiving the projected features as keys, values, and queries from the projector.

3. The method of claim 2, further comprising: combining a result of applying the self-attention to the projected features with the keys, values, and queries from the projector to generate encoder self-attention residual output; and processing the encoder self-attention residual output to generate the encoded representations of the projected features.

4. The method of claim 2, further comprising: applying, at the decoder of the transformer, self-attention to the encoded representations of the projected features using a multi-head self-attention configuration, the multi-head self-attention configuration receiving as input, keys and values from the encoder and one or more semantic embeddings as queries.

5. The method of claim 4, further comprising: combining a result of applying the self-attention to the encoded representations of the projected features with the keys and values from the encoder and one or more semantic embeddings to generate decoder self-attention residual output; and processing the decoder self-attention residual output to generate the decoded output, wherein the decoded output is at the second dimensional space.

6. The method of claim 1, wherein the projected features are obtained using a bilinear interpolation.

7. The method of claim 1, wherein the positional map includes a two-dimensional positional map.

8. A system, comprising: one or more storage devices storing instructions that when executed by one or more hardware processors, cause the one or more hardware processors to implement a neural network for generating image attention features by processing image features combined with a two-dimensional position map, the neural network comprising: a projector of a transformer configured to receive a plurality of tokens associated with image features of a first dimensional space and generate projected features by concatenating the plurality of tokens with the two-dimensional positional map, the projected features having a second dimensional space that is less than the first dimensional space; an encoder of the transformer configured to receive projected features and generate encoded representations of the projected features using self-attention; and a decoder configured to decode the encoded representations of the projected features and obtain a decoded output, wherein the decoded output is projected to the first dimensional space and combined with the image features of the first dimensional space to obtain the attention features.

9. The system of claim 8, wherein the encoder is configured to apply, at the encoder of the transformer, self-attention to the projected features using a multi-head self-attention configuration, the multi-head self-attention configuration receiving the projected features as keys, values, and queries from the projector.

10. The system of claim 9, wherein the encoder is configured to: combine a result of applying the self-attention to the projected features with the keys, values, and queries from the projector to generate encoder self-attention residual output; and process the encoder self-attention residual output to generate the encoded representations of the projected features.

11. The system of claim 9, wherein the decoder of the transformer is configured to apply self-attention to the encoded representations of the projected features using a multi-head self-attention configuration, the multi-head self-attention configuration receiving as input, keys and values from the encoder and one or more semantic embeddings as queries.

12. The system of claim 11, wherein the decoder is configured to: combine a result of applying the self-attention to the encoded representations of the projected features with the keys and values from the encoder and one or more semantic embeddings to generate decoder self-attention residual output; and process the decoder self-attention residual output to generate the decoded output, wherein the decoded output is at the second dimensional space.

13. The system of claim 8, wherein the projected features are obtained using a bilinear interpolation.

14. A non-transitory computer-readable storage medium comprising instructions being executable by one or more processors to perform a method, the method comprising: receiving, at a projector of a transformer, a plurality of tokens associated with image features of a first dimensional space; generating, at the projector of the transformer, projected features by concatenating the plurality of tokens with a positional map, the projected features having a second dimensional space that is less than the first dimensional space; receiving, at an encoder of the transformer, the projected features and generating encoded representations of the projected features using self-attention; decoding, at a decoder of the transformer, the encoded representations of the projected features and obtaining a decoded output; and projecting the decoded output to the first dimensional space and adding the image features of the first dimensional space to obtain attention features associated with the image features.

15. The computer-readable storage medium of claim 14, wherein the method further includes applying, at the encoder of the transformer, self-attention to the projected features using a multi-head self-attention configuration, the multi-head self-attention configuration receiving the projected features as keys, values, and queries from the projector.

16. The computer-readable storage medium of claim 15, wherein the method further includes: combining a result of applying the self-attention to the projected features with the keys, values, and queries from the projector to generate encoder self-attention residual output; and processing the encoder self-attention residual output to generate the encoded representations of the projected features.

17. The computer-readable storage medium of claim 15, wherein the method further includes applying, at the decoder of the transformer, self-attention to the encoded representations of the projected features using a multi-head self-attention configuration, the multi-head self-attention configuration receiving as input, keys and values from the encoder and one or more semantic embeddings as queries.

18. The computer-readable storage medium of claim 17, wherein the
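
The encoder and decoder attention steps recited in claims 2 to 5 can be sketched as follows. This is a simplified illustration and not the patented implementation: the learned projection matrices, normalisation, and feed-forward "processing" steps are omitted, heads are formed by slicing channels, the decoder residual adds only the query input, and one semantic embedding per projected token is assumed so that the decoded output keeps the projected size. All function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out


def multi_head_attention(queries, keys, values, num_heads):
    """Slice channels into heads, attend per head, then concatenate the heads."""
    width = len(queries[0]) // num_heads
    heads = []
    for h in range(num_heads):
        sl = slice(h * width, (h + 1) * width)
        heads.append(attention([q[sl] for q in queries],
                               [k[sl] for k in keys],
                               [v[sl] for v in values]))
    return [sum((head[t] for head in heads), []) for t in range(len(queries))]


def add_residual(x, y):
    """Element-wise sum of two equally shaped token lists."""
    return [[a + b for a, b in zip(r, s)] for r, s in zip(x, y)]


def encode(tokens, num_heads=2):
    # Claim 2: the projected features serve as keys, values, and queries.
    attended = multi_head_attention(tokens, tokens, tokens, num_heads)
    # Claim 3: combine the attention result with its input (residual output).
    return add_residual(tokens, attended)


def decode(encoded, semantic_embeddings, num_heads=2):
    # Claim 4: keys and values from the encoder, semantic embeddings as queries.
    attended = multi_head_attention(semantic_embeddings, encoded, encoded, num_heads)
    # Claim 5 (simplified): residual with the query input only.
    return add_residual(semantic_embeddings, attended)


tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
encoded = encode(tokens)
embeddings = [[0.0] * 4 for _ in tokens]  # hypothetical semantic embeddings
decoded = decode(encoded, embeddings)
```

Because the embeddings in this usage example are all zeros, every attention score is equal and each decoded token reduces to the mean of the encoded tokens; learned embeddings would instead weight the encoded tokens differently per query.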

Assignees

Lemon Inc

Inventors

Classifications

  • Quantised networks; Sparse networks; Compressed networks

  • Hyperparameter optimisation; Meta-learning; Learning-to-learn

  • modifying the architecture, e.g. adding, deleting or silencing nodes or connections

  • Auto-encoder networks; Encoder-decoder networks

  • Convolutional networks [CNN, ConvNet]

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11983239B2 cover?
Systems and methods for obtaining attention features are described. Some examples may include: receiving, at a projector of a transformer, a plurality of tokens associated with image features of a first dimensional space; generating, at the projector of the transformer, projected features by concatenating the plurality of tokens with a positional map, the projected features having a second dime…
Who is the assignee on this patent?
Lemon Inc
What technology area does this patent fall under?
Primary CPC classification G06F18/213. Mapped technology areas include Physics.
When was this patent published?
Publication date May 14, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).