Text-based image generation
US-12524937-B2 · Jan 13, 2026 · US
US2025095250A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2025095250-A1 |
| Application number | US-202418749438-A |
| Country | US |
| Kind code | A1 |
| Filing date | Jun 20, 2024 |
| Priority date | May 23, 2024 |
| Publication date | Mar 20, 2025 |
| Grant date | — |
Official abstract text for this publication.
A method is provided that includes: obtaining a reference image and a description text; extracting a text feature of the description text; and performing the following operations based on a pre-trained diffusion model to generate a target image: in each time step of the diffusion model: calculating a first cross-attention feature of a first image feature and the text feature; obtaining a second cross-attention feature of a second image feature of the reference image and the text feature; editing the first cross-attention feature based on the second cross-attention feature to obtain a third cross-attention feature; and generating a result image feature of the time step based on the third cross-attention feature and the text feature; and decoding a result image feature of a last time step to generate the target image.
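The per-time-step flow in the abstract can be illustrated with a toy NumPy sketch. This is not the patented implementation: it assumes that a "cross-attention feature" is a softmax attention map of image positions over text tokens, omits the noise schedule, learned projections, and decoder of a real diffusion model, and reuses one fixed reference feature at every step. The names `cross_attention_map`, `generate_target_feature`, and the `edit` callback are hypothetical.

```python
import numpy as np

def cross_attention_map(image_feat, text_feat):
    """Attention of each image position (row) over the text tokens (columns).

    Assumption: identity query/key projections; a real model learns these.
    """
    scores = image_feat @ text_feat.T / np.sqrt(text_feat.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

def generate_target_feature(reference_feat, text_feat, initial_feat, num_steps, edit):
    """Toy loop mirroring the abstract; `edit` is a caller-supplied rule."""
    image_feat = initial_feat  # first image feature in the first time step
    for _ in range(num_steps):
        first = cross_attention_map(image_feat, text_feat)       # generated branch
        second = cross_attention_map(reference_feat, text_feat)  # reference branch
        third = edit(first, second)                              # edited feature
        # Result image feature of this step, built from the third
        # cross-attention feature and the text feature (toy value projection).
        image_feat = third @ text_feat
    return image_feat  # a real system would decode this into the target image

# Usage with random toy features: 6 image positions, 4 text tokens, width 8.
rng = np.random.default_rng(0)
text_feat = rng.normal(size=(4, 8))
reference_feat = rng.normal(size=(6, 8))
initial_feat = rng.normal(size=(6, 8))

# Hypothetical edit rule: copy the reference map unchanged.
result = generate_target_feature(
    reference_feat, text_feat, initial_feat, num_steps=3,
    edit=lambda first, second: second,
)
print(result.shape)  # (6, 8)
```

Passing the edit rule as a callback keeps the loop independent of how the first cross-attention feature is edited, which the abstract leaves open.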
Opening claim text (preview).
What is claimed is:
1. A method, comprising: obtaining a reference image and a description text, wherein the description text comprises a content description text describing content of the reference image and a style description text describing a style of a target image to be generated; extracting a text feature of the description text; and performing the following operations based on a pre-trained diffusion model to generate the target image: in each time step of the diffusion model: calculating a first cross-attention feature of a first image feature and the text feature, wherein the first image feature in a first time step is an image feature of an initial image, and the first image feature in each of a second time step and subsequent time steps is a result image feature generated in a previous time step; obtaining a second cross-attention feature of a second image feature of the reference image and the text feature; editing the first cross-attention feature based on the second cross-attention feature to obtain a third cross-attention feature; and generating a result image feature of the time step based on the third cross-attention feature and the text feature; and decoding a result image feature of a last time step to generate the target image.
2. The method according to claim 1, wherein the first cross-attention feature comprises a first content sub-feature corresponding to the content description text and a first style sub-feature corresponding to the style description text, the second cross-attention feature comprises a second content sub-feature corresponding to the content description text and a second style sub-feature corresponding to the style description text, the third cross-attention feature comprises a third content sub-feature corresponding to the content description text and a third style sub-feature corresponding to the style description text, and wherein the editing the first cross-attention feature to obtain the third cross-attention feature comprises: modifying the first content sub-feature based on the second content sub-feature to obtain the third content sub-feature; and determining the third style sub-feature based on the first style sub-feature.
3. The method according to claim 2, wherein the modifying the first content sub-feature comprises: replacing the first content sub-feature with a product of the second content sub-feature and a first factor, wherein the first factor indicates a consistency degree between content of the target image and the content of the reference image.
4. The method according to claim 2, wherein the determining the third style sub-feature comprises: determining a product of the first style sub-feature and a second factor as the third style sub-feature, wherein the second factor indicates a degree of applying the style.
5. The method according to claim 1, wherein the extracting the text feature of the description text comprises: encoding the content description text to obtain a first text feature of the content description text; introducing information of the reference image into the style description text to obtain an extended style description text; and encoding the extended style description text to obtain a second text feature of the extended style description text, wherein the text feature comprises the first text feature and the second text feature.
6. The method according to claim 5, wherein the extended style description text comprises the style description text and a style description identifier of the reference image, and wherein the encoding the extended style description text to obtain the second text feature of the extended style description text comprises: extracting a first text sub-feature of the style description text by using a text encoder; extracting a third image feature of the reference image by using an image encoder, wherein the image encoder and the text encoder are respectively configured to map an image and a text to a same feature space; and determining the third image feature as a second text sub-feature of the style description identifier, wherein the second text feature comprises the first text sub-feature and the second text sub-feature.
7. The method according to claim 6, wherein the reference image is any image frame in a reference video, and wherein the extracting the third image feature of the reference image by using the image encoder comprises: extracting image feature of one or more image frames in the reference video as the third image feature of the reference image by using the image encoder.
8. The method according to claim 1, wherein the calculating the first cross-attention feature of the first image feature and the text feature comprises: calculating a self-attention feature of the first image feature; generating a fourth image feature based on the self-attention feature and the first image feature; and calculating a first cross-attention feature of the fourth image feature and the text feature.
9. The method according to claim 8, wherein the reference image is any image frame except a first image frame in a reference video, and wherein the generating the fourth image feature comprises: adjusting the self-attention feature based on a historical self-attention feature corresponding to the self-attention feature to obtain an adjusted self-attention feature, wherein the historical self-attention feature is an attention feature obtained by performing style transfer on a historical image frame of the reference image by using the diffusion model and located at a same location as the self-attention feature; and generating the fourth image feature based on the adjusted self-attention feature and the first image feature.
10. An electronic device, comprising: a processor; and a memory communicatively connected to the processor, wherein the memory stores instructions executable by the processor, and the instructions, when executed by the processor, cause the processor to perform operations comprising: obtaining a reference image and a description text, wherein the description text comprises a content description text describing content of the reference image and a style description text describing a style of a target image to be generated; extracting a text feature of the description text; and performing the following operations based on a pre-trained diffusion model to generate the target image: in each time step of the diffusion model: calculating a first cross-attention feature of a first image feature and the text feature, wherein the first image feature in a first time step is an image feature of an initial image, and the first image feature in each of a second time step and subsequent time steps is a result image feature generated in a previous time step; obtaining a second cross-attention feature of a second image feature of the reference image and the text feature; editing the first cross-attention feature based on the second cross-attention feature to obtain a third cross-attention feature; and generating a result image feature of the time step based on the third cross-attention feature and the text feature; and decoding a result image feature of a last time step to generate the target image.
11. The electronic device according to claim 10, wherein the first cross-attention feature comprises a first content sub-feature corresponding to the content description text and a first style sub-feature corresponding to the style description text, the second cross-attention feature comprises a second content sub-feature corresponding to the content description text and a second style sub-feature corresponding to t
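The editing rule recited in claims 2 to 4 can be sketched on a toy attention map. This is an illustrative reading, not the patent's implementation: it assumes the content and style sub-features are the columns of the cross-attention map that belong to content tokens and style tokens respectively. The function name `edit_cross_attention`, the column indices, and the numeric values are hypothetical.

```python
import numpy as np

def edit_cross_attention(first, second, content_cols, style_cols,
                         first_factor, second_factor):
    """Build the third cross-attention feature from the first and second.

    Content columns: replaced by first_factor * second (claim 3).
    Style columns:   second_factor * first (claim 4).
    """
    first = np.asarray(first, dtype=float)
    second = np.asarray(second, dtype=float)
    third = first.copy()
    third[:, content_cols] = first_factor * second[:, content_cols]
    third[:, style_cols] = second_factor * first[:, style_cols]
    return third

# Two image positions; tokens 0 and 1 describe content, token 2 describes style.
first = [[0.2, 0.3, 0.5],
         [0.6, 0.1, 0.3]]
second = [[0.7, 0.2, 0.1],
          [0.4, 0.4, 0.2]]

third = edit_cross_attention(
    first, second,
    content_cols=[0, 1], style_cols=[2],
    first_factor=1.0,   # keep content fully consistent with the reference
    second_factor=2.0,  # apply the style more strongly
)
# Content columns come from `second`; the style column is `first` doubled:
# rows are approximately [0.7, 0.2, 1.0] and [0.4, 0.4, 0.6].
print(third)
```

In this reading, lowering `first_factor` loosens how closely the output content follows the reference image, and raising `second_factor` strengthens the requested style.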
using neural networks · CPC title
Image coding (bandwidth or redundancy reduction for static pictures H04N1/41; coding or decoding of static colour picture signals H04N1/64; methods or arrangements for coding, decoding, compressing or decompressing digital video signals H04N19/00) · CPC title
Creating or editing images; Combining images with text · CPC title
Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components · CPC title
of extracted features · CPC title