Methods and apparatus to perform dense prediction using transformer blocks

US2022012848A1 · US · A1

Patent metadata
Publication number: US-2022012848-A1
Application number: US-202117485349-A
Country: US
Kind code: A1
Filing date: Sep 25, 2021
Priority date: Sep 25, 2021
Publication date: Jan 13, 2022
Grant date: (none listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, apparatus, systems and articles of manufacture disclosed herein perform dense prediction of an input image using transformers at an encoder stage and at a reassembly stage of an image processing system. A disclosed apparatus includes an encoder with an embedder to convert an input image to a plurality of tokens representing features extracted from the input image. The tokens are embedded with a learnable position embedding. The encoder also includes one or more transformers configured in a sequence of stages to relate the tokens to each other. The apparatus further includes a decoder that includes one or more of reassemblers to assemble the tokens into feature representations, one or more of fusion blocks to combine the feature representations to generate a final feature representation, and an output head to generate a dense prediction based on the final feature representation and based on an output task.
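
As a rough illustration of the embedding step described in the abstract and in claims 2 and 4, the following pure-Python sketch splits a small grayscale image into non-overlapping patches, flattens each patch, applies a linear projection, prepends a special patch-independent token, and adds a position embedding. The function names (`patchify`, `linear`, `embed`), the random stand-ins for learned parameters, and the choice to give the special token its own position embedding are assumptions made for illustration; none of them comes from the patent text.

```python
import random


def patchify(image, p):
    """Split an H x W image (list of rows) into non-overlapping p x p patches.

    Each patch is flattened to a vector; patches are returned in row-major order.
    """
    h, w = len(image), len(image[0])
    assert h % p == 0 and w % p == 0, "image size must be a multiple of p"
    patches = []
    for top in range(0, h, p):
        for left in range(0, w, p):
            patches.append([image[top + dy][left + dx]
                            for dy in range(p) for dx in range(p)])
    return patches


def linear(vec, weight):
    """Project vec (length n) with weight (n rows x d columns)."""
    return [sum(v * row[j] for v, row in zip(vec, weight))
            for j in range(len(weight[0]))]


def embed(image, p, d, seed=0):
    """Return N + 1 tokens of dimension d for an image with N patches."""
    rng = random.Random(seed)
    patches = patchify(image, p)
    n = len(patches)
    # Random stand-ins for parameters that would be learned in practice.
    weight = [[rng.gauss(0, 0.02) for _ in range(d)] for _ in range(p * p)]
    position = [[rng.gauss(0, 0.02) for _ in range(d)] for _ in range(n + 1)]
    special = [rng.gauss(0, 0.02) for _ in range(d)]
    # Special token first, then one projected token per patch.
    tokens = [special] + [linear(v, weight) for v in patches]
    # Assumption: every token, including the special one, gets a position embedding.
    return [[t + e for t, e in zip(tok, pos)]
            for tok, pos in zip(tokens, position)]


image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
tokens = embed(image, p=4, d=6)
print(len(tokens), len(tokens[0]))  # 5 6
```

An 8 x 8 image with 4 x 4 patches yields four patch tokens plus the special token, so the sketch prints `5 6`. Because the transformer stages keep the token count fixed (claims 3, 10, and 16), the same count would be seen at every stage.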

First claim

Opening claim text (preview).

What is claimed is:

1. An apparatus comprising: an encoder, comprising: an embedder to convert an input image to a plurality of tokens, the plurality of tokens representing features extracted from the input image, and the embedder embedding the plurality of tokens with a learnable position; and a plurality of transformers configured in a sequence of stages relating each of the plurality of tokens to the other tokens; a decoder comprising: a plurality of reassemblers associated with corresponding ones of the plurality of transformers, each of the plurality of reassemblers receiving an output from the corresponding one of the plurality of transformers, and assembling the tokens into feature representations; a plurality of fusion blocks to combine the feature representations to form a final feature representation; and an output head to generate a dense prediction based on the final feature representation and an output task.

2. The apparatus of claim 1, wherein the embedder is further to generate a special patch-independent token and add the special patch-independent token to the plurality of tokens.

3. The apparatus of claim 1, wherein the same number of tokens are maintained at each stage of the set of transformer stages.

4. The apparatus of claim 1, wherein the embedder is to: divide the input image into non-overlapping patches of a same pixel size; flatten the patches into vectors; and individually embed the patches using a linear projection, the tokens to correspond to the embedded patches.

5. The apparatus of claim 1, wherein the reassemblers include: a token reader to read the plurality of tokens; a concatenator to perform a spatial concatenation operation on an output of the token reader to generate a feature representation; and a resampler to scale the feature representation to a scalar height of the input image divided by a scalar and a width of the input image divided by the same scalar.

6. The apparatus of claim 1, wherein the reassemblers are to: reassemble the tokens into feature representations from deeper stages of the transformer stages at a lower resolution; and assemble the tokens into feature representations from early stages of the transformer stages at a higher resolution.

7. The apparatus of claim 1, wherein the reassemblers are to place each token into a position occupied by each corresponding patch extracted from the input image, the tokens, when placed into the corresponding positions to form feature representations.

8. An apparatus comprising: a memory; instructions that when executed cause at least one processor to: convert an input image to a plurality (N) of tokens, respective ones of the N tokens based on respective non-overlapping patches of the input image, the N tokens to include positional information, the positional information to identify respective positions in which the respective non-overlapping patches fit within the input image; reassemble the N tokens into feature representations after the tokens have passed through transformer stages, ones of the tokens output by deeper ones of the transformer stages assembled at a first resolution, ones of the tokens output by early ones of the transformer stages assembled at a second resolution, the first resolution lower than the second resolution; progressively fuse the feature representations using consecutive stages of a residual network, and, in each stage of the residual network, upsample a respective representation output by a respective stage of the residual network by a factor of two; and generate a dense prediction based on the fused feature maps.

9. The apparatus of claim 8, wherein the processor is further to generate a special patch-independent token and concatenate the special token to the N tokens.

10. The apparatus of claim 8, wherein the same number of tokens are maintained at each stage of the transformer stages.

11. The apparatus of claim 8, wherein the processor is further to: divide the input image into the non-overlapping patches, the non-overlapping patches having a same number of pixels; flatten the N tokens into vectors; and apply a linear projection to the N tokens to embed the tokens.

12. The apparatus of claim 8, wherein to reassemble the N tokens the processor is to: read the N tokens; spatially concatenate the N tokens to generate feature maps; and resample the feature maps to generate a scaled representation of the input image, the scaled representation having dimensions that are related to the input image by a scalar.

13. The apparatus of claim 8, wherein the reassemblers are to: assemble the N tokens into feature representations generated at deeper ones of the transformer stages at a lower resolution; and assemble the N tokens into feature representations generated at earlier ones of the transformer stages at a higher resolution.

14. A non-transitory computer readable medium comprising instructions that, when executed, cause a machine to at least: convert an input image into tokens, the tokens to represent features extracted from the input image; and transform the tokens with information relating each token to all the other tokens; reassemble the transformed tokens into feature representations; progressively fuse the feature representations to generate a final feature representation, progressively upsample the final feature representation by a factor of two; and generate a dense prediction based on the final feature representation.

15. The non-transitory computer readable medium of claim 14, wherein the instructions, when executed, cause the machine to generate a special patch-independent token and add the special patch-independent token to the tokens.

16. The non-transitory computer readable medium of claim 14, wherein the same number of tokens are maintained at each stage of a set of transformer stages used to transform the tokens.

17. The non-transitory computer readable medium of claim 14, wherein to convert the input image into tokens, the instructions, when executed, further cause the at least one machine to: divide the input image into non-overlapping patches of a same pixel size; flatten the non-overlapping patches into vectors; and add spatial information to the non-overlapping patches to form the tokens.

18. The non-transitory computer readable medium of claim 14, wherein to reassemble the transformed tokens, the instructions, when executed, further cause the at least one machine to: read the plurality of transformed tokens to generate read information; spatially concatenate read information; and scale the final feature representation to a first height and a first width, the first height and the first width related to a second height and a second width, respectively, by a scalar, and the second height and the second width corresponding to a size of the input image.

19. The non-transitory computer readable medium of claim 14, wherein to reassemble the tokens, the instructions, when executed, further cause the at least one machine to: reassemble the tokens from deeper stages of the transformer stages at a lower resolution; and reassemble the tokens from early stages of the transform stages at a higher resolution.

20. A method comprising: converting, by executing an instruction with at least one processor, an input image into tokens, the tokens to represent features extracted from the input image; and transforming, by executing an instruction with t
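
The reassembly and fusion steps recited in claims 5, 7, 8, and 12 can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the read step simply drops the special token, resampling is nearest-neighbour, each fusion stage is reduced to an elementwise sum followed by a 2x upsample (the residual network layers are omitted), and the per-stage scalars in `scales` are hypothetical values chosen so the resolutions line up. All function names are illustrative.

```python
def read_ignore(tokens):
    """Read step (assumed variant): drop the special token at index 0."""
    return tokens[1:]


def concatenate(tokens, rows, cols):
    """Place token i at grid cell (i // cols, i % cols), its patch position."""
    assert len(tokens) == rows * cols
    return [[tokens[r * cols + c] for c in range(cols)] for r in range(rows)]


def resample(grid, out_rows, out_cols):
    """Nearest-neighbour resampling of a grid of feature vectors."""
    rows, cols = len(grid), len(grid[0])
    return [[grid[r * rows // out_rows][c * cols // out_cols]
             for c in range(out_cols)] for r in range(out_rows)]


def add(a, b):
    """Elementwise sum of two feature maps with identical shape."""
    return [[[x + y for x, y in zip(va, vb)] for va, vb in zip(ra, rb)]
            for ra, rb in zip(a, b)]


def fusion_stage(feature, previous=None):
    """Combine with the previous stage's output, then upsample by two."""
    if previous is not None:
        assert (len(previous), len(previous[0])) == (len(feature), len(feature[0]))
        feature = add(feature, previous)
    return resample(feature, 2 * len(feature), 2 * len(feature[0]))


H = W = 32                   # input image size
P = 16                       # patch size, giving a 2 x 2 patch grid
rows, cols = H // P, W // P
scales = [32, 16, 8, 4]      # hypothetical scalars, deepest stage first

# One token list per transformer stage: a special token plus N patch tokens.
stage_tokens = [[[0.0]] + [[1.0]] * (rows * cols) for _ in scales]

fused = None
for tokens, s in zip(stage_tokens, scales):
    grid = concatenate(read_ignore(tokens), rows, cols)
    feature = resample(grid, H // s, W // s)   # deeper stage, lower resolution
    fused = fusion_stage(feature, fused)

print(len(fused), len(fused[0]), fused[0][0])  # 16 16 [4.0]
```

With these values the deepest stage is assembled at 1 x 1 and the earliest at 8 x 8, and the four fusion stages double the resolution each time, ending at 16 x 16 (half the input size). An output head, not shown here, would map that final representation to the dense prediction.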

Assignees

Intel Corp

Inventors

Classifications

  • Combinations of networks · CPC title

  • G06V20/70 · Primary

    Labelling scene content, e.g. deriving syntactic or semantic representations · CPC title

  • Activation functions · CPC title

  • Auto-encoder networks; Encoder-decoder networks · CPC title

  • Image enhancement or restoration · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2022012848A1 cover?
Methods, apparatus, systems and articles of manufacture disclosed herein perform dense prediction of an input image using transformers at an encoder stage and at a reassembly stage of an image processing system. A disclosed apparatus includes an encoder with an embedder to convert an input image to a plurality of tokens representing features extracted from the input image. The tokens are embedd…
Who is the assignee on this patent?
Intel Corp
What technology area does this patent fall under?
Primary CPC classification G06V20/70. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jan 13, 2022 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).