Neural architecture search for dense image prediction tasks
US-2019370648-A1 · Dec 5, 2019 · US
US11983239B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11983239-B2 |
| Application number | US-202117342483-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 8, 2021 |
| Priority date | Jun 8, 2021 |
| Publication date | May 14, 2024 |
| Grant date | May 14, 2024 |
Official abstract text for this publication.
Systems and methods for obtaining attention features are described. Some examples may include: receiving, at a projector of a transformer, a plurality of tokens associated with image features of a first dimensional space; generating, at the projector of the transformer, projected features by concatenating the plurality of tokens with a positional map, the projected features having a second dimensional space that is less than the first dimensional space; receiving, at an encoder of the transformer, the projected features and generating encoded representations of the projected features using self-attention; decoding, at a decoder of the transformer, the encoded representations and obtaining a decoded output; and projecting the decoded output to the first dimensional space and adding the image features of the first dimensional space to obtain attention features associated with the image features.
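The data flow in the abstract can be illustrated with a minimal NumPy sketch. It is not the patented implementation. It assumes that the "second dimensional space that is less than the first" means a lower spatial resolution reached by bilinear interpolation, and it uses single-head attention without learned key, value, or query projections. The names `bilinear_resize`, `attention_features`, `queries`, and `w_out` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def bilinear_resize(x, out_h, out_w):
    """Resize an (H, W, C) array with corner-aligned bilinear interpolation."""
    in_h, in_w, _ = x.shape
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None, None]   # vertical blend weights
    wx = (xs - x0)[None, :, None]   # horizontal blend weights
    top = x[y0][:, x0] * (1 - wx) + x[y0][:, x1] * wx
    bot = x[y1][:, x0] * (1 - wx) + x[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention (single head, no learned projections)."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def attention_features(feats, small_hw, queries, w_out):
    H, W, C = feats.shape
    h, w = small_hw
    # Projector: shrink the feature map, then concatenate a 2D positional map.
    small = bilinear_resize(feats, h, w)
    yy, xx = np.meshgrid(np.linspace(0, 1, h), np.linspace(0, 1, w), indexing="ij")
    pos = np.stack([yy, xx], axis=-1)                      # (h, w, 2)
    tokens = np.concatenate([small, pos], axis=-1).reshape(h * w, C + 2)
    # Encoder: self-attention over the projected tokens, with a residual add.
    encoded = tokens + attention(tokens, tokens, tokens)
    # Decoder: hypothetical query embeddings attend to the encoder output.
    decoded = queries + attention(queries, encoded, encoded)
    # Map back to C channels and the original resolution, then add the input.
    restored = (decoded @ w_out).reshape(h, w, C)
    return feats + bilinear_resize(restored, H, W)

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 8, 4))            # (H, W, C) image features
queries = rng.standard_normal((4 * 4, 4 + 2))     # assumed decoder queries
w_out = rng.standard_normal((4 + 2, 4)) * 0.1     # assumed output projection
out = attention_features(feats, (4, 4), queries, w_out)
print(out.shape)  # (8, 8, 4)
```

Because attention runs on the 4 x 4 grid rather than the 8 x 8 input, the attention matrix here has 16 x 16 entries instead of 64 x 64. The final residual add means the output reduces to the input features when the output projection is zero.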
Opening claim text (preview).
What is claimed is:
1. A method of obtaining attention features, the method comprising: receiving, at a projector of a transformer, a plurality of tokens associated with image features of a first dimensional space; generating, at the projector of the transformer, projected features by concatenating the plurality of tokens with a positional map, the projected features having a second dimensional space that is less than the first dimensional space; receiving, at an encoder of the transformer, the projected features and generating encoded representations of the projected features using self-attention; decoding, at a decoder of the transformer, the encoded representations of the projected features and obtaining a decoded output; and projecting the decoded output to the first dimensional space and adding the image features of the first dimensional space to obtain attention features associated with the image features.
2. The method of claim 1, further comprising: applying, at the encoder of the transformer, self-attention to the projected features using a multi-head self-attention configuration, the multi-head self-attention configuration receiving the projected features as keys, values, and queries from the projector.
3. The method of claim 2, further comprising: combining a result of applying the self-attention to the projected features with the keys, values, and queries from the projector to generate encoder self-attention residual output; and processing the encoder self-attention residual output to generate the encoded representations of the projected features.
4. The method of claim 2, further comprising: applying, at the decoder of the transformer, self-attention to the encoded representations of the projected features using a multi-head self-attention configuration, the multi-head self-attention configuration receiving as input, keys and values from the encoder and one or more semantic embeddings as queries.
5. The method of claim 4, further comprising: combining a result of applying the self-attention to the encoded representations of the projected features with the keys and values from the encoder and one or more semantic embeddings to generate decoder self-attention residual output; and processing the decoder self-attention residual output to generate the decoded output, wherein the decoded output is at the second dimensional space.
6. The method of claim 1, wherein the projected features are obtained using a bilinear interpolation.
7. The method of claim 1, wherein the positional map includes a two-dimensional positional map.
8. A system, comprising: one or more storage devices storing instructions that when executed by one or more hardware processors, cause the one or more hardware processors to implement a neural network for generating image attention features by processing image features combined with a two-dimensional position map, the neural network comprising: a projector of a transformer configured to receive a plurality of tokens associated with image features of a first dimensional space and generate projected features by concatenating the plurality of tokens with the two-dimensional positional map, the projected features having a second dimensional space that is less than the first dimensional space; an encoder of the transformer configured to receive projected features and generate encoded representations of the projected features using self-attention; and a decoder configured to decode the encoded representations of the projected features and obtain a decoded output, wherein the decoded output is projected to the first dimensional space and combined with the image features of the first dimensional space to obtain the attention features.
9. The system of claim 8, wherein the encoder is configured to apply, at the encoder of the transformer, self-attention to the projected features using a multi-head self-attention configuration, the multi-head self-attention configuration receiving the projected features as keys, values, and queries from the projector.
10. The system of claim 9, wherein the encoder is configured to: combine a result of applying the self-attention to the projected features with the keys, values, and queries from the projector to generate encoder self-attention residual output; and process the encoder self-attention residual output to generate the encoded representations of the projected features.
11. The system of claim 9, wherein the decoder of the transformer is configured to apply self-attention to the encoded representations of the projected features using a multi-head self-attention configuration, the multi-head self-attention configuration receiving as input, keys and values from the encoder and one or more semantic embeddings as queries.
12. The system of claim 11, wherein the decoder is configured to: combine a result of applying the self-attention to the encoded representations of the projected features with the keys and values from the encoder and one or more semantic embeddings to generate decoder self-attention residual output; and process the decoder self-attention residual output to generate the decoded output, wherein the decoded output is at the second dimensional space.
13. The system of claim 8, wherein the projected features are obtained using a bilinear interpolation.
14. A non-transitory computer-readable storage medium comprising instructions being executable by one or more processors to perform a method, the method comprising: receiving, at a projector of a transformer, a plurality of tokens associated with image features of a first dimensional space; generating, at the projector of the transformer, projected features by concatenating the plurality of tokens with a positional map, the projected features having a second dimensional space that is less than the first dimensional space; receiving, at an encoder of the transformer, the projected features and generating encoded representations of the projected features using self-attention; decoding, at a decoder of the transformer, the encoded representations of the projected features and obtaining a decoded output; and projecting the decoded output to the first dimensional space and adding the image features of the first dimensional space to obtain attention features associated with the image features.
15. The computer-readable storage medium of claim 14, wherein the method further includes applying, at the encoder of the transformer, self-attention to the projected features using a multi-head self-attention configuration, the multi-head self-attention configuration receiving the projected features as keys, values, and queries from the projector.
16. The computer-readable storage medium of claim 15, wherein the method further includes: combining a result of applying the self-attention to the projected features with the keys, values, and queries from the projector to generate encoder self-attention residual output; and processing the encoder self-attention residual output to generate the encoded representations of the projected features.
17. The computer-readable storage medium of claim 15, wherein the method further includes applying, at the decoder of the transformer, self-attention to the encoded representations of the projected features using a multi-head self-attention configuration, the multi-head self-attention configuration receiving as input, keys and values from the encoder and one or more semantic embeddings as queries.
18. The computer-readable storage medium of claim 17, wherein the
Quantised networks; Sparse networks; Compressed networks · CPC title
Hyperparameter optimisation; Meta-learning; Learning-to-learn · CPC title
modifying the architecture, e.g. adding, deleting or silencing nodes or connections · CPC title
Auto-encoder networks; Encoder-decoder networks · CPC title
Convolutional networks [CNN, ConvNet] · CPC title