Performing semantic segmentation training with image/text pairs
US-2023177810-A1 · Jun 8, 2023 · US
US12524937B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12524937-B2 |
| Application number | US-202318170963-A |
| Country | US |
| Kind code | B2 |
| Filing date | Feb 17, 2023 |
| Priority date | Feb 17, 2023 |
| Publication date | Jan 13, 2026 |
| Grant date | Jan 13, 2026 |
Official abstract text for this publication.
Systems and methods for image generation are provided. An aspect of the systems and methods includes obtaining a text prompt, generating a style vector based on the text prompt, generating an adaptive convolution filter based on the style vector, and generating an image corresponding to the text prompt based on the adaptive convolution filter.
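The abstract's step of turning a style vector into an adaptive convolution filter can be illustrated with a small sketch. Claim 1 describes the filter as an average of weights across a plurality of convolution filters, based on the style vector. The code below is one possible reading of that step, not the patented implementation. The affine projection (`proj_w`, `proj_b`), the softmax normalization, the function name `adaptive_filter`, and all tensor shapes are assumptions introduced for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def adaptive_filter(filter_bank, style, proj_w, proj_b):
    """Blend a bank of convolution filters into one filter, guided by a style vector.

    filter_bank: (N, C_out, C_in, k, k) bank of N candidate filters (assumed layout)
    style:       (D,) style vector
    proj_w:      (N, D) hypothetical affine map from style to one logit per filter
    proj_b:      (N,) bias of that hypothetical map
    """
    logits = proj_w @ style + proj_b   # one score per filter in the bank
    weights = softmax(logits)          # assumption: softmax turns scores into averaging weights
    # Weighted average over the bank axis gives a single (C_out, C_in, k, k) filter.
    blended = np.tensordot(weights, filter_bank, axes=1)
    return blended, weights


rng = np.random.default_rng(0)
bank = rng.normal(size=(4, 8, 3, 3, 3))   # 4 filters, 8 out channels, 3 in channels, 3x3 kernel
style = rng.normal(size=16)
proj_w = rng.normal(size=(4, 16))
proj_b = np.zeros(4)

filt, weights = adaptive_filter(bank, style, proj_w, proj_b)
print(filt.shape)  # (8, 3, 3, 3)
```

Because the averaging weights depend on the style vector, a different text prompt (and therefore a different style vector) yields a different effective filter from the same bank. The resulting filter would then be applied to a feature map by an ordinary convolution, which is omitted here.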
Opening claim text (preview).
What is claimed is:

1. A method for image generation, comprising: obtaining a text prompt; generating a style vector based on the text prompt; generating an adaptive convolution filter by averaging a plurality of weights across a plurality of convolution filters of a convolution layer based on the style vector, wherein the adaptive convolution filter comprises a convolution matrix corresponding to a style of the style vector; and generating an image corresponding to the text prompt based on the adaptive convolution filter.
2. The method of claim 1, further comprising: encoding the text prompt to obtain a text embedding; and transforming the text embedding to obtain a global vector corresponding to the text prompt as a whole and a plurality of local vectors corresponding to individual tokens of the text prompt, wherein the style vector is generated based on the global vector and the image is generated based on the plurality of local vectors.
3. The method of claim 2, further comprising: performing a cross-attention process based on the plurality of local vectors, wherein the image is generated based on the cross-attention process.
4. The method of claim 2, further comprising: obtaining a noise vector, wherein the style vector is based on the noise vector.
5. The method of claim 1, further comprising: initializing a feature map; and performing a convolution process on the feature map based on the adaptive convolution filter, wherein the image is generated based on the convolution process.
6. The method of claim 5, further comprising: performing a self-attention process based on the feature map, wherein the image is generated based on the self-attention process.
7. The method of claim 6, wherein: the self-attention process is based on an L2 distance.
8. The method of claim 1, further comprising: identifying a plurality of predetermined convolution filters; and combining the plurality of predetermined convolution filters based on the style vector to obtain the adaptive convolution filter.
9. The method of claim 1, further comprising: identifying a diversity parameter; and truncating the style vector based on the diversity parameter to obtain a truncated style vector, wherein the image is generated based on the truncated style vector.
10. An apparatus for image generation, comprising: at least one processor; at least one memory storing instructions executable by the at least one processor; the apparatus further comprising a text encoder network comprising encoder parameters stored in the at least one memory, wherein the text encoder network is configured to encode a text prompt to obtain a global vector corresponding to the text prompt and a plurality of local vectors corresponding to individual tokens of the text prompt; a mapping network comprising mapping parameters stored in the at least one memory, wherein the mapping network is configured to generate a style vector based on the global vector and a noise vector; and an image generation network comprising image generation parameters stored in the at least one memory, wherein the image generation network is configured to generate an image corresponding to the text prompt based on the style vector and the plurality of local vectors.
11. The apparatus of claim 10, wherein: the text encoder network comprises a pretrained encoder and a learned encoder that is trained together with the image generation network.
12. The apparatus of claim 10, wherein: the image generation network comprises a generative adversarial network (GAN).
13. The apparatus of claim 10, wherein: the image generation network includes a convolution layer, a self-attention layer, and a cross-attention layer.
14. The apparatus of claim 10, wherein: the image generation network includes an adaptive convolution component configured to generate an adaptive convolution filter based on the style vector, wherein the image is generated based on the adaptive convolution filter.
15. The apparatus of claim 10, further comprising: a discriminator network configured to generate an image embedding and a conditioning embedding, wherein the discriminator network is trained together with the image generation network using an adversarial training loss based on the image embedding and the conditioning embedding.
16. A method for image generation, comprising: obtaining a training dataset including a training image and text describing the training image; generating a predicted style vector based on the text and a noise vector using a mapping network; generating a predicted image based on the predicted style vector using an image generation network; generating an image embedding based on the predicted image and a conditioning embedding based on the text using a discriminator network; and training the image generation network based on the image embedding and the conditioning embedding.
17. The method of claim 16, further comprising: computing a generative adversarial network (GAN) loss based on the image embedding and the conditioning embedding, wherein the image generation network is trained based on the GAN loss.
18. The method of claim 16, further comprising: generating a mixed conditioning embedding based on an unrelated text; and computing a mixing loss based on the image embedding and the mixed conditioning embedding, wherein the image generation network is trained based on the mixing loss.
19. The method of claim 16, further comprising: encoding the text using a text encoder network that includes a pretrained encoder and a learned encoder, wherein the learned encoder is trained together with the image generation network.
20. The method of claim 16, further comprising: learning a feature map for an initial input to the image generation network.
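Claims 6 and 7 recite a self-attention process over the feature map that is based on an L2 distance. As a rough illustration only, the sketch below scores each query/key pair by negative squared Euclidean distance instead of the usual dot product. The claims do not give a formula, so the scaling by the square root of the head dimension, the separate projections `w_q`, `w_k`, `w_v`, the function name `l2_self_attention`, and the flattening of the feature map into `T` tokens are all assumptions.

```python
import numpy as np


def l2_self_attention(x, w_q, w_k, w_v):
    """Self-attention whose similarity score is a negative squared L2 distance.

    x:   (T, D) feature map flattened into T tokens of width D (assumed layout)
    w_q: (D, d) hypothetical query projection
    w_k: (D, d) hypothetical key projection
    w_v: (D, d) hypothetical value projection
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Pairwise squared distances ||q_i - k_j||^2, shape (T, T).
    diff = q[:, None, :] - k[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)
    # Closer pairs get larger logits; the sqrt(d) scaling is an assumption.
    logits = -sq_dist / np.sqrt(q.shape[1])
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilize the softmax
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=1, keepdims=True)         # each row sums to 1
    return attn @ v, attn


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))        # 5 tokens, 8 features each
w_q = rng.normal(size=(8, 4))
w_k = rng.normal(size=(8, 4))
w_v = rng.normal(size=(8, 4))

out, attn = l2_self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (5, 4)
```

In this formulation a token attends most strongly to tokens whose projected keys lie nearest to its projected query, and identical tokens receive uniform attention.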
Transformation · CPC title
Adaptive image processing · CPC title
Training; Learning · CPC title
Lexical analysis, e.g. tokenisation or collocates · CPC title
Artificial neural networks [ANN] · CPC title