Systems and methods for controllable image generation

US12536713B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12536713-B2
Application number: US-202318477764-A
Country: US
Kind code: B2
Filing date: Sep 29, 2023
Priority date: May 16, 2023
Publication date: Jan 27, 2026
Grant date: Jan 27, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments described herein provide a method of image generation. The method includes a fixed diffusion model, and a trainable diffusion model. The fixed diffusion model may be pretrained on a large training corpus. The trainable diffusion model may be used to control the image generation of the fixed diffusion model by modifying internal representations of the fixed diffusion model. A task instruction may be provided in addition to a text prompt, and the task instruction may guide the trainable diffusion model together with the visual conditions. The visual conditions may be adapted according to the task instruction. During training, a fixed number of task instructions may be used. At inference, unseen task instructions may be used by combining convolutional kernels of the visual condition adapter.
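The relationship between the fixed model and its trainable copy can be illustrated with a toy sketch. This is not the patented implementation: the one-parameter ToyEncoder class, the additive way the control signal modifies the fixed model's internal representation, the squared-error loss, and the learning rate are all assumptions chosen for illustration, and the task instruction and text prompt are left out entirely. The only points it aims to show are that the trainable branch starts as a copy of the fixed one, and that training updates the copy while the fixed parameters stay unchanged.

```python
import copy


class ToyEncoder:
    """One-parameter stand-in for a diffusion model encoder (hypothetical)."""

    def __init__(self, weight):
        self.weight = weight

    def __call__(self, x):
        return self.weight * x


fixed_encoder = ToyEncoder(weight=2.0)           # "pretrained"; never updated below
control_encoder = copy.deepcopy(fixed_encoder)   # trainable copy of the fixed encoder


def forward(x, condition):
    hidden = fixed_encoder(x)              # internal representation of the fixed model
    control = control_encoder(condition)   # output of the trainable branch
    return hidden + control                # assumption: the modification is additive


def train_step(x, condition, target, lr=0.05):
    """One gradient step on squared error; only the trainable copy changes."""
    error = forward(x, condition) - target
    # d(error**2) / d(control weight) = 2 * error * condition
    control_encoder.weight -= lr * 2 * error * condition
    return error ** 2


losses = [train_step(x=1.0, condition=1.0, target=5.0) for _ in range(50)]
```

After the loop, fixed_encoder.weight is still 2.0, while control_encoder.weight has moved toward the value that makes the combined output match the target.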

First claim

Opening claim text (preview).

What is claimed is:

1. A method of image generation, the method comprising: receiving, via a data interface, a text prompt, an input image, and a task instruction distinct from the text prompt; generating, via an adapter, a task-specific feature map based on the input image, the text prompt, and the task instruction; generating, by a first neural network based image model, a first latent representation based on the task-specific feature map; generating, via a task encoder, a task embedding based on the task instruction; modifying a second latent representation of a second neural network based image model based on the first latent representation and the task embedding, wherein the second neural network based image model is fixed such that its parameters are not updated after pretraining and the first neural network based image model is a trainable copy of at least an encoder of the second neural network based image model; and generating, by a decoder of the second neural network based image model, an output image based on the second latent representation and the text prompt.

2. The method of claim 1, wherein the generating the task-specific feature map comprises: selecting one or more convolutional kernels from a set of convolutional kernels based on the task instruction; and generating the task-specific feature map based on the input image and the selected one or more convolutional kernel.

3. The method of claim 2, wherein the one or more convolutional kernels are selected based on a comparison of the task instruction to one or more predefined task instructions.

4. The method of claim 3, wherein the generating the task-specific feature map further includes: estimating a respective weight for each of the selected convolutional kernels based on the comparison.

5. The method of claim 1, further comprising: receiving, via the data interface, a target image; computing a loss objective based on the output image and the target image; and updating parameters of the first neural network based image model, based on the computed loss objective via backpropagation while keeping the second neural network based image model unchanged.

6. The method of claim 1, further comprising: receiving, via the data interface, a training dataset including training samples corresponding to a plurality of task instructions, wherein each of the plurality of task instructions is one of a predefined set of task instructions; and training the first neural network based image model using the training samples corresponding to the plurality of task instructions.

7. The method of claim 6, wherein the task instruction is different than any task instruction of the predefined set of task instructions that have been used in training the first neural network based image model.

8. A system for image generation, the system comprising: a memory that stores a first neural network based image model, a second neural network based image model, and a plurality of processor executable instructions; a communication interface that receives a text prompt, an input image, and a task instruction distinct from the text prompt; and one or more hardware processors configured to read and execute the plurality of processor-executable instructions from the memory to perform operations comprising: generating, via an adapter, a task-specific feature map based on the input image, the text prompt, and the task instruction; generating, by a first neural network based image model, a first latent representation based on the task-specific feature map; generating, via a task encoder, a task embedding based on the task instruction; modifying a second latent representation of a second neural network based image model based on the first latent representation and the task embedding, wherein the second neural network based image model is fixed such that its parameters are not updated after pretraining and the first neural network based image model is a trainable copy of at least an encoder of the second neural network based image mode; and generating, by a decoder of the second neural network based image model, an output image based on the second latent representation and the text prompt.

9. The system of claim 8, wherein the generating the task-specific feature map comprises: selecting one or more convolutional kernels from a set of convolutional kernels based on the task instruction; and generating the task-specific feature map based on the input image and the selected one or more convolutional kernel.

10. The system of claim 9, wherein the one or more convolutional kernels are selected based on a comparison of the task instruction to one or more predefined task instructions.

11. The system of claim 10, wherein the generating the task-specific feature map further includes: estimating a respective weight for each of the selected convolutional kernels based on the comparison.

12. The system of claim 8, the operations further comprising: receiving, via a data interface, a target image; computing a loss objective based on the output image and the target image; and updating parameters of the first neural network based image model, based on the computed loss objective via backpropagation while keeping the second neural network based image model unchanged.

13. The system of claim 8, the operations further comprising: receiving, via a data interface, a training dataset including training samples corresponding to a plurality of task instructions, wherein each of the plurality of task instructions is one of a predefined set of task instructions; and training the first neural network based image model using the training samples corresponding to the plurality of task instructions.

14. The system of claim 13, wherein the task instruction is different than any task instruction of the predefined set of task instructions that have been used in training the first neural network based image model.

15. A non-transitory machine-readable medium comprising a plurality of machine-executable instructions which, when executed by one or more processors, are adapted to cause the one or more processors to perform operations comprising: receiving, via a data interface, a text prompt, an input image, and a task instruction distinct from the text prompt; generating, via an adapter, a task-specific feature map based on the input image, the text prompt, and the task instruction; generating, by a first neural network based image model, a first latent representation based on the task-specific feature map; generating, via a task encoder, a task embedding based on the task instruction; modifying a second latent representation of a second neural network based image model based on the first latent representation and the task embedding, wherein the second neural network based image model is fixed such that its parameters are not updated after pretraining and the first neural network based image model is a trainable copy of at least an encoder of the second neural network based image model; and generating, by a decoder of the second neural network based image model, an output image based on the second latent representation and the text prompt.

16. The non-transitory machine-readable medium of claim 15, wherein the generating the task-specific feature map comprises: selecting one or more convolutional kernels from a set of convolutional kernels based on the task instruction; and generating the task-specific feature map based on the input image and the selected one or more convolutional kernel.
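The kernel selection and weighting recited in the dependent claims (selecting kernels by comparing the task instruction to predefined instructions, then estimating a weight for each selected kernel) can be pictured with a small sketch. Everything concrete here is an assumption made for illustration, not something stated in the claims: the three predefined instruction strings, the toy 3x3 kernel values, the bag-of-words cosine similarity used as the "comparison", the softmax used to turn similarities into weights, and the choice of keeping the top two kernels.

```python
import math

# Hypothetical predefined task instructions, each paired with a toy 3x3 kernel.
PREDEFINED = {
    "edge map to image": [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]],
    "depth map to image": [[0, 0, 0], [0, 1, 0], [0, 0, 0]],
    "blurred image to image": [[1 / 9] * 3 for _ in range(3)],
}


def bag_of_words(text):
    """Count word occurrences; a stand-in for a learned text encoder."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b.get(w, 0) for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_kernels(task, top_k=2):
    """Compare the task to each predefined instruction; keep top_k with softmax weights."""
    query = bag_of_words(task)
    scored = sorted(
        ((cosine(query, bag_of_words(name)), name) for name in PREDEFINED),
        reverse=True,
    )[:top_k]
    exps = [math.exp(score) for score, _ in scored]
    total = sum(exps)
    return [(name, e / total) for (_, name), e in zip(scored, exps)]


def combine_kernels(selection):
    """Weighted sum of the selected kernels."""
    mixed = [[0.0] * 3 for _ in range(3)]
    for name, weight in selection:
        for i in range(3):
            for j in range(3):
                mixed[i][j] += weight * PREDEFINED[name][i][j]
    return mixed


def conv2d_valid(image, kernel):
    """Plain 2D cross-correlation without padding."""
    k = len(kernel)
    rows, cols = len(image) - k + 1, len(image[0]) - k + 1
    return [
        [sum(image[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
         for c in range(cols)]
        for r in range(rows)
    ]


# An instruction that is not in PREDEFINED, standing in for an unseen task.
selection = select_kernels("blurred depth map to image")
kernel = combine_kernels(selection)
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
feature_map = conv2d_valid(image, kernel)
```

In this sketch the unseen instruction never gets a kernel of its own; it borrows a weighted blend of the kernels attached to the most similar predefined instructions, which is one plausible reading of how unseen task instructions could be handled at inference.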

Assignees

Salesforce Inc

Inventors

Classifications

  • Artificial neural networks [ANN]

  • Training; Learning

  • G06V10/771 (Primary): Feature selection, e.g. selecting representative features from a multi-dimensional feature space

  • using local operators

  • G06T11/00 (Primary): Two-dimensional [2D] image generation

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12536713B2 cover?
Embodiments described herein provide a method of image generation. The method includes a fixed diffusion model, and a trainable diffusion model. The fixed diffusion model may be pretrained on a large training corpus. The trainable diffusion model may be used to control the image generation of the fixed diffusion model by modifying internal representations of the fixed diffusion model. A task in…
Who is the assignee on this patent?
Salesforce Inc
What technology area does this patent fall under?
Primary CPC classification G06V10/771. Mapped technology areas include Physics.
When was this patent published?
Publication date Jan 27, 2026 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).