User interface for generating and manipulating molecular images with natural language instructions
US-2024331235-A1 · Oct 3, 2024 · US
US12586271B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12586271-B2 |
| Application number | US-202318329111-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 5, 2023 |
| Priority date | Jun 5, 2023 |
| Publication date | Mar 24, 2026 |
| Grant date | Mar 24, 2026 |
Official abstract text for this publication.
Systems and methods for image processing are described. Embodiments of the present disclosure, via a multi-modal encoder of an image processing apparatus, encode a text prompt to obtain a text embedding. A color encoder of the image processing apparatus encodes a color prompt to obtain a color embedding. A diffusion prior model of the image processing apparatus generates an image embedding based on the text embedding and the color embedding. A latent diffusion model of the image processing apparatus generates an image based on the image embedding, where the image includes an element from the text prompt and a color from the color prompt.
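To make the "color prompt to color embedding" step more concrete, here is a minimal Python sketch of a histogram-style color encoder. Claims 5 and 16 state that the color embedding may comprise a color histogram, but this record does not specify the color space, bin count, or normalization. The function name `color_histogram`, the RGB color space, the `bins_per_channel` parameter, and the sum-to-one normalization are therefore all illustrative assumptions, not the patented implementation.

```python
from typing import Iterable, List, Tuple

RGB = Tuple[int, int, int]


def color_histogram(pixels: Iterable[RGB], bins_per_channel: int = 4) -> List[float]:
    """Return a normalized histogram over quantized RGB colors.

    Hypothetical helper: the RGB color space, the bin count, and the
    normalization are assumptions made for illustration only.
    """
    if bins_per_channel < 1:
        raise ValueError("bins_per_channel must be at least 1")

    counts = [0] * bins_per_channel ** 3
    total = 0
    for r, g, b in pixels:
        for value in (r, g, b):
            if not 0 <= value <= 255:
                raise ValueError("channel values must be in 0..255")
        # Map each 8-bit channel value to a bin index in 0..bins_per_channel-1.
        ri = r * bins_per_channel // 256
        gi = g * bins_per_channel // 256
        bi = b * bins_per_channel // 256
        counts[(ri * bins_per_channel + gi) * bins_per_channel + bi] += 1
        total += 1

    if total == 0:
        raise ValueError("color prompt contains no pixels")
    # Normalize so the histogram sums to 1 regardless of prompt size.
    return [count / total for count in counts]


# A color prompt given as a small palette: three red pixels and one blue pixel.
palette = [(255, 0, 0)] * 3 + [(0, 0, 255)]
embedding = color_histogram(palette, bins_per_channel=2)
print(len(embedding), embedding)
```

With two bins per channel the resulting vector has eight entries, and the mass is split between the "red" and "blue" bins in proportion to the pixel counts. A real system would likely use a perceptual color space and a larger bin count; that choice is left open here.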
Opening claim text (preview).
What is claimed is:
1. A method comprising: encoding a text prompt to obtain a text embedding; encoding a color prompt to obtain a color embedding, wherein the color prompt comprises a different modality than the text prompt; generating an image embedding using a diffusion prior model based on the text embedding and the color embedding, wherein the image embedding represents the text prompt and the color prompt; and generating an image by denoising a noise input based on the image embedding that represents the text prompt and the color prompt using a latent diffusion model (LDM), wherein the image includes an element from the text prompt and a color from the color prompt.
2. The method of claim 1, wherein: the text embedding and the image embedding are in a multi-modal embedding space.
3. The method of claim 2, wherein: the text embedding is in a first region of the multi-modal embedding space corresponding to text and the image embedding is in a second region of the multi-modal embedding space corresponding to images.
4. The method of claim 1, wherein generating the image embedding comprises: performing an attention process using shared attention weights among the text embedding and the color embedding.
5. The method of claim 1, wherein: the color embedding comprises a color histogram.
6. The method of claim 1, further comprising: identifying a candidate image embedding for a candidate image; comparing the image embedding to the candidate image embedding; and providing the candidate image as a search result based on the comparison.
7. The method of claim 1, further comprising: generating a plurality of image embeddings based on the text embedding and the color embedding, wherein the image embedding is selected from the plurality of image embeddings.
8. The method of claim 7, further comprising: generating a plurality of images based on the plurality of image embeddings, respectively, wherein each of the plurality of images includes the element from the text prompt and the color from the color prompt.
9. The method of claim 1, further comprising: generating a plurality of images based on the image embedding, wherein each of the plurality of images includes the element from the text prompt and the color from the color prompt.
10. The method of claim 1, further comprising: generating a modified image embedding using the LDM, wherein the image is generated based on the modified image embedding.
11. A method comprising: obtaining training data including a text embedding representing a text prompt and a color embedding representing a color prompt, wherein the color prompt comprises a different modality than the text prompt; initializing a diffusion prior model; and training the diffusion prior model to generate an image embedding based on the text embedding and the color embedding, wherein the image embedding represents features corresponding to the text embedding and a color corresponding to the color embedding, and wherein the image embedding represents the text prompt and the color prompt.
12. The method of claim 11, further comprising: encoding the text prompt describing a ground-truth image to obtain the text embedding; and encoding the ground-truth image to obtain the color embedding.
13. The method of claim 11, further comprising: training a latent diffusion model (LDM) to generate an image by denoising a noise input based on the image embedding that represents the text prompt and the color prompt.
14. The method of claim 11, further comprising: generating a predicted image embedding using the diffusion prior model; and computing a loss function by comparing the predicted image embedding to a ground-truth image embedding, wherein the diffusion prior model is trained based on the loss function.
15. An apparatus comprising: at least one processor; and at least one memory including instructions executable by the at least one processor to perform operations including: encoding, using a multi-modal encoder, a text prompt to obtain a text embedding; encoding, using a color encoder, a color prompt to obtain a color embedding, wherein the color prompt comprises a different modality than the text prompt; generating, using a diffusion prior model, an image embedding based on the text embedding and the color embedding, wherein the image embedding represents the text prompt and the color prompt; and generating, using a latent diffusion model (LDM), an image by denoising a noise input based on the image embedding that represents the text prompt and the color prompt, wherein the image includes an element from the text prompt and a color from the color prompt.
16. The apparatus of claim 15, wherein: the color encoder comprises a color histogram extractor configured to extract a color histogram from the color prompt.
17. The apparatus of claim 15, wherein: the diffusion prior model comprises a transformer architecture.
18. The apparatus of claim 15, wherein: the LDM comprises a U-Net architecture.
19. The apparatus of claim 15, wherein: the LDM comprises an image decoder.
20. The apparatus of claim 15, further comprising: a training component configured to train the diffusion prior model.
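Claim 6 recites comparing the generated image embedding to candidate image embeddings and returning a candidate as a search result. The claim does not say how the comparison is made, so the sketch below assumes cosine similarity, which is a common choice for embedding retrieval. The names `cosine_similarity` and `rank_candidates`, the `top_k` parameter, and the toy two-dimensional vectors are all hypothetical and serve only to illustrate the idea.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def rank_candidates(
    image_embedding: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    top_k: int = 1,
) -> List[Tuple[str, float]]:
    """Score every candidate image embedding and return the best matches.

    Assumption: similarity is measured with cosine similarity; the claim
    only requires "comparing" the embeddings.
    """
    scored = [
        (name, cosine_similarity(image_embedding, embedding))
        for name, embedding in candidates.items()
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy example: the generated embedding is closest to "sunset.png".
generated = [1.0, 0.0]
library = {
    "sunset.png": [0.9, 0.1],
    "forest.png": [0.0, 1.0],
}
print(rank_candidates(generated, library, top_k=1))
```

In practice the candidate embeddings would come from the same multi-modal embedding space as the generated image embedding (claims 2 and 3), and a production system would use an approximate nearest-neighbor index rather than a linear scan.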
Texturing; Colouring; Generation of textures or colours (retouching, inpainting or scratch removal G06T5/77) · CPC title
Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title
Combinations of networks · CPC title
Learning methods · CPC title
Probabilistic or stochastic networks · CPC title