US2022012848A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2022012848-A1 |
| Application number | US-202117485349-A |
| Country | US |
| Kind code | A1 |
| Filing date | Sep 25, 2021 |
| Priority date | Sep 25, 2021 |
| Publication date | Jan 13, 2022 |
| Grant date | — |
Official abstract text for this publication.
Methods, apparatus, systems and articles of manufacture disclosed herein perform dense prediction of an input image using transformers at an encoder stage and at a reassembly stage of an image processing system. A disclosed apparatus includes an encoder with an embedder to convert an input image to a plurality of tokens representing features extracted from the input image. The tokens are embedded with a learnable position embedding. The encoder also includes one or more transformers configured in a sequence of stages to relate the tokens to each other. The apparatus further includes a decoder that includes one or more of reassemblers to assemble the tokens into feature representations, one or more of fusion blocks to combine the feature representations to generate a final feature representation, and an output head to generate a dense prediction based on the final feature representation and based on an output task.
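The embedder step in the abstract (and in claims 2 and 4) can be illustrated with a minimal NumPy sketch. It is not the applicant's implementation. The function name `embed_image`, the argument names, and the choice to prepend the patch-independent token are assumptions made for illustration, and the "learnable" parameters are plain arrays supplied by the caller.

```python
import numpy as np

def embed_image(image, patch, weight, pos_embed, readout_token):
    """Convert an (H, W, C) image into (N + 1, D) tokens.

    weight:        (patch * patch * C, D) linear projection (learned in practice)
    pos_embed:     (N + 1, D) position embedding (learned in practice)
    readout_token: (D,) patch-independent token (hypothetical name)
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch

    # Divide the image into non-overlapping patch x patch tiles (row-major).
    tiles = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)

    # Flatten each tile into a vector of length patch * patch * C.
    flat = tiles.reshape(gh * gw, patch * patch * c)

    # Embed each flattened patch individually with a linear projection.
    tokens = flat @ weight

    # Add the patch-independent token (placed first here, by assumption).
    tokens = np.concatenate([readout_token[None, :], tokens], axis=0)

    # Embed the tokens with the position embedding.
    return tokens + pos_embed


rng = np.random.default_rng(0)
image = rng.random((32, 48, 3))            # H=32, W=48, so a 2 x 3 grid of 16 x 16 patches
weight = rng.random((16 * 16 * 3, 8))      # D = 8
pos_embed = rng.random((2 * 3 + 1, 8))
readout_token = rng.random(8)
tokens = embed_image(image, 16, weight, pos_embed, readout_token)
print(tokens.shape)  # (7, 8)
```

Because every transformer stage keeps the same number of tokens (claims 3, 10 and 16), the `(N + 1, D)` shape produced here would be the shape each later stage receives and returns.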
Opening claim text (preview).
What is claimed is:
1. An apparatus comprising: an encoder, comprising: an embedder to convert an input image to a plurality of tokens, the plurality of tokens representing features extracted from the input image, and the embedder embedding the plurality of tokens with a learnable position; and a plurality of transformers configured in a sequence of stages relating each of the plurality of tokens to the other tokens; a decoder comprising: a plurality of reassemblers associated with corresponding ones of the plurality of transformers, each of the plurality of reassemblers receiving an output from the corresponding one of the plurality of transformers, and assembling the tokens into feature representations; a plurality of fusion blocks to combine the feature representations to form a final feature representation; and an output head to generate a dense prediction based on the final feature representation and an output task.
2. The apparatus of claim 1, wherein the embedder is further to generate a special patch-independent token and add the special patch-independent token to the plurality of tokens.
3. The apparatus of claim 1, wherein the same number of tokens are maintained at each stage of the set of transformer stages.
4. The apparatus of claim 1, wherein the embedder is to: divide the input image into non-overlapping patches of a same pixel size; flatten the patches into vectors; and individually embed the patches using a linear projection, the tokens to correspond to the embedded patches.
5. The apparatus of claim 1, wherein the reassemblers include: a token reader to read the plurality of tokens; a concatenator to perform a spatial concatenation operation on an output of the token reader to generate a feature representation; and a resampler to scale the feature representation to a scalar height of the input image divided by a scalar and a width of the input image divided by the same scalar.
6. The apparatus of claim 1, wherein the reassemblers are to: reassemble the tokens into feature representations from deeper stages of the transformer stages at a lower resolution; and assemble the tokens into feature representations from early stages of the transformer stages at a higher resolution.
7. The apparatus of claim 1, wherein the reassemblers are to place each token into a position occupied by each corresponding patch extracted from the input image, the tokens, when placed into the corresponding positions, to form feature representations.
8. An apparatus comprising: a memory; instructions that when executed cause at least one processor to: convert an input image to a plurality (N) of tokens, respective ones of the N tokens based on respective non-overlapping patches of the input image, the N tokens to include positional information, the positional information to identify respective positions in which the respective non-overlapping patches fit within the input image; reassemble the N tokens into feature representations after the tokens have passed through transformer stages, ones of the tokens output by deeper ones of the transformer stages assembled at a first resolution, ones of the tokens output by early ones of the transformer stages assembled at a second resolution, the first resolution lower than the second resolution; progressively fuse the feature representations using consecutive stages of a residual network, and, in each stage of the residual network, upsample a respective representation output by a respective stage of the residual network by a factor of two; and generate a dense prediction based on the fused feature maps.
9. The apparatus of claim 8, wherein the processor is further to generate a special patch-independent token and concatenate the special token to the N tokens.
10. The apparatus of claim 8, wherein the same number of tokens are maintained at each stage of the transformer stages.
11. The apparatus of claim 8, wherein the processor is further to: divide the input image into the non-overlapping patches, the non-overlapping patches having a same number of pixels; flatten the N tokens into vectors; and apply a linear projection to the N tokens to embed the tokens.
12. The apparatus of claim 8, wherein to reassemble the N tokens the processor is to: read the N tokens; spatially concatenate the N tokens to generate feature maps; and resample the feature maps to generate a scaled representation of the input image, the scaled representation having dimensions that are related to the input image by a scalar.
13. The apparatus of claim 8, wherein the reassemblers are to: assemble the N tokens into feature representations generated at deeper ones of the transformer stages at a lower resolution; and assemble the N tokens into feature representations generated at earlier ones of the transformer stages at a higher resolution.
14. A non-transitory computer readable medium comprising instructions that, when executed, cause a machine to at least: convert an input image into tokens, the tokens to represent features extracted from the input image; and transform the tokens with information relating each token to all the other tokens; reassemble the transformed tokens into feature representations; progressively fuse the feature representations to generate a final feature representation, progressively upsample the final feature representation by a factor of two; and generate a dense prediction based on the final feature representation.
15. The non-transitory computer readable medium of claim 14, wherein the instructions, when executed, cause the machine to generate a special patch-independent token and add the special patch-independent token to the tokens.
16. The non-transitory computer readable medium of claim 14, wherein the same number of tokens are maintained at each stage of a set of transformer stages used to transform the tokens.
17. The non-transitory computer readable medium of claim 14, wherein to convert the input image into tokens, the instructions, when executed, further cause the at least one machine to: divide the input image into non-overlapping patches of a same pixel size; flatten the non-overlapping patches into vectors; and add spatial information to the non-overlapping patches to form the tokens.
18. The non-transitory computer readable medium of claim 14, wherein to reassemble the transformed tokens, the instructions, when executed, further cause the at least one machine to: read the plurality of transformed tokens to generate read information; spatially concatenate read information; and scale the final feature representation to a first height and a first width, the first height and the first width related to a second height and a second width, respectively, by a scalar, and the second height and the second width corresponding to a size of the input image.
19. The non-transitory computer readable medium of claim 14, wherein to reassemble the tokens, the instructions, when executed, further cause the at least one machine to: reassemble the tokens from deeper stages of the transformer stages at a lower resolution; and reassemble the tokens from early stages of the transformer stages at a higher resolution.
20. A method comprising: converting, by executing an instruction with at least one processor, an input image into tokens, the tokens to represent features extracted from the input image; and transforming, by executing an instruction with t
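The reassembly and progressive fusion recited in claims 5, 8 and 12 can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the claimed apparatus. The function names are hypothetical, the token reader simply drops a leading patch-independent token, resampling is nearest-neighbour repetition by an integer factor, and a plain addition stands in for the residual network stage that the claims recite.

```python
import numpy as np

def upsample2(fmap):
    """Nearest-neighbour upsampling of an (h, w, D) map by a factor of two."""
    return fmap.repeat(2, axis=0).repeat(2, axis=1)

def reassemble(tokens, grid_hw, factor):
    """Read tokens, place them on the patch grid, then resample.

    tokens:  (N + 1, D), with a patch-independent token assumed at index 0
    grid_hw: (gh, gw) patch grid, so N == gh * gw
    factor:  integer upscaling factor for this stage (assumption)
    """
    gh, gw = grid_hw
    patch_tokens = tokens[1:]                  # read: ignore the extra token
    fmap = patch_tokens.reshape(gh, gw, -1)    # spatial concatenation
    return fmap.repeat(factor, axis=0).repeat(factor, axis=1)  # resample

def fuse(maps):
    """Fuse maps ordered from earliest (highest res) to deepest (lowest res).

    Each step combines the running result with the next map and upsamples
    by two. Addition is a stand-in for a residual convolution stage.
    """
    fused = None
    for fmap in reversed(maps):
        fused = fmap if fused is None else fused + fmap
        fused = upsample2(fused)
    return fused


# Three transformer stage outputs, each with N + 1 = 7 tokens of width D = 4.
stage_outputs = [np.ones((7, 4)) for _ in range(3)]
# Early stages are assembled at higher resolution than deeper ones.
maps = [reassemble(t, (2, 3), f) for t, f in zip(stage_outputs, (4, 2, 1))]
fused = fuse(maps)
print(fused.shape)  # (16, 24, 4)
```

With a 2 x 3 patch grid, the deepest map starts at 2 x 3 and doubles at every fusion step, so each earlier map must already sit at the matching resolution when it is added. An output head would then map the fused features to the task-specific dense prediction.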
Combinations of networks · CPC title
Labelling scene content, e.g. deriving syntactic or semantic representations · CPC title
Activation functions · CPC title
Auto-encoder networks; Encoder-decoder networks · CPC title
Image enhancement or restoration · CPC title