Image style transfer

US2025095250A1 · US · A1

Patent metadata
Field               Value
Publication number  US-2025095250-A1
Application number  US-202418749438-A
Country             US
Kind code           A1
Filing date         Jun 20, 2024
Priority date       May 23, 2024
Publication date    Mar 20, 2025
Grant date          (none listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method is provided that includes: obtaining a reference image and a description text; extracting a text feature of the description text; and performing the following operations based on a pre-trained diffusion model to generate a target image: in each time step of the diffusion model: calculating a first cross-attention feature of a first image feature and the text feature; obtaining a second cross-attention feature of a second image feature of the reference image and the text feature; editing the first cross-attention feature based on the second cross-attention feature to obtain a third cross-attention feature; and generating a result image feature of the time step based on the third cross-attention feature and the text feature; and decoding a result image feature of a last time step to generate the target image.
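The per-time-step loop described in the abstract can be sketched roughly as follows. This is a toy NumPy illustration under stated assumptions, not the patent's implementation: features are plain matrices, `cross_attention` is ordinary softmax attention, `edit_attention` uses a hypothetical blending rule with an invented `mix` parameter, `denoise_step` is a stand-in for the pre-trained diffusion model's update, and the reference feature is held fixed across steps for simplicity.

```python
import numpy as np


def cross_attention(image_feat, text_feat):
    """Softmax attention of each image location over each text token.

    image_feat: (n_locations, d), text_feat: (n_tokens, d)
    returns:    (n_locations, n_tokens), each row sums to 1
    """
    scores = image_feat @ text_feat.T / np.sqrt(image_feat.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


def edit_attention(first, second, mix=0.8):
    """Hypothetical edit rule: pull the generated map toward the reference map."""
    return (1.0 - mix) * first + mix * second


def denoise_step(image_feat, edited_attn, text_feat, step_size=0.1):
    """Stand-in for the diffusion update (assumption, not the real model):
    move each location's feature toward its attention-weighted text feature."""
    return image_feat + step_size * (edited_attn @ text_feat - image_feat)


def style_transfer(initial_feat, reference_feat, text_feat, num_steps=10):
    feat = initial_feat
    for _ in range(num_steps):
        first = cross_attention(feat, text_feat)             # first cross-attention feature
        second = cross_attention(reference_feat, text_feat)  # second (from the reference image)
        third = edit_attention(first, second)                # third (edited) feature
        feat = denoise_step(feat, third, text_feat)          # result image feature of this step
    return feat  # a real system would decode this into the target image


rng = np.random.default_rng(0)
result = style_transfer(
    initial_feat=rng.normal(size=(16, 8)),
    reference_feat=rng.normal(size=(16, 8)),
    text_feat=rng.normal(size=(5, 8)),
)
print(result.shape)
```

The point of the sketch is the data flow: the result feature of one step becomes the first image feature of the next, and only the last step's result is decoded.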

First claim

Opening claim text (preview).

What is claimed is:

1. A method, comprising: obtaining a reference image and a description text, wherein the description text comprises a content description text describing content of the reference image and a style description text describing a style of a target image to be generated; extracting a text feature of the description text; and performing the following operations based on a pre-trained diffusion model to generate the target image: in each time step of the diffusion model: calculating a first cross-attention feature of a first image feature and the text feature, wherein the first image feature in a first time step is an image feature of an initial image, and the first image feature in each of a second time step and subsequent time steps is a result image feature generated in a previous time step; obtaining a second cross-attention feature of a second image feature of the reference image and the text feature; editing the first cross-attention feature based on the second cross-attention feature to obtain a third cross-attention feature; and generating a result image feature of the time step based on the third cross-attention feature and the text feature; and decoding a result image feature of a last time step to generate the target image.

2. The method according to claim 1, wherein the first cross-attention feature comprises a first content sub-feature corresponding to the content description text and a first style sub-feature corresponding to the style description text, the second cross-attention feature comprises a second content sub-feature corresponding to the content description text and a second style sub-feature corresponding to the style description text, the third cross-attention feature comprises a third content sub-feature corresponding to the content description text and a third style sub-feature corresponding to the style description text, and wherein the editing the first cross-attention feature to obtain the third cross-attention feature comprises: modifying the first content sub-feature based on the second content sub-feature to obtain the third content sub-feature; and determining the third style sub-feature based on the first style sub-feature.

3. The method according to claim 2, wherein the modifying the first content sub-feature comprises: replacing the first content sub-feature with a product of the second content sub-feature and a first factor, wherein the first factor indicates a consistency degree between content of the target image and the content of the reference image.

4. The method according to claim 2, wherein the determining the third style sub-feature comprises: determining a product of the first style sub-feature and a second factor as the third style sub-feature, wherein the second factor indicates a degree of applying the style.

5. The method according to claim 1, wherein the extracting the text feature of the description text comprises: encoding the content description text to obtain a first text feature of the content description text; introducing information of the reference image into the style description text to obtain an extended style description text; and encoding the extended style description text to obtain a second text feature of the extended style description text, wherein the text feature comprises the first text feature and the second text feature.

6. The method according to claim 5, wherein the extended style description text comprises the style description text and a style description identifier of the reference image, and wherein the encoding the extended style description text to obtain the second text feature of the extended style description text comprises: extracting a first text sub-feature of the style description text by using a text encoder; extracting a third image feature of the reference image by using an image encoder, wherein the image encoder and the text encoder are respectively configured to map an image and a text to a same feature space; and determining the third image feature as a second text sub-feature of the style description identifier, wherein the second text feature comprises the first text sub-feature and the second text sub-feature.

7. The method according to claim 6, wherein the reference image is any image frame in a reference video, and wherein the extracting the third image feature of the reference image by using the image encoder comprises: extracting image feature of one or more image frames in the reference video as the third image feature of the reference image by using the image encoder.

8. The method according to claim 1, wherein the calculating the first cross-attention feature of the first image feature and the text feature comprises: calculating a self-attention feature of the first image feature; generating a fourth image feature based on the self-attention feature and the first image feature; and calculating a first cross-attention feature of the fourth image feature and the text feature.

9. The method according to claim 8, wherein the reference image is any image frame except a first image frame in a reference video, and wherein the generating the fourth image feature comprises: adjusting the self-attention feature based on a historical self-attention feature corresponding to the self-attention feature to obtain an adjusted self-attention feature, wherein the historical self-attention feature is an attention feature obtained by performing style transfer on a historical image frame of the reference image by using the diffusion model and located at a same location as the self-attention feature; and generating the fourth image feature based on the adjusted self-attention feature and the first image feature.

10. An electronic device, comprising: a processor; and a memory communicatively connected to the processor, wherein the memory stores instructions executable by the processor, and the instructions, when executed by the processor, cause the processor to perform operations comprising: obtaining a reference image and a description text, wherein the description text comprises a content description text describing content of the reference image and a style description text describing a style of a target image to be generated; extracting a text feature of the description text; and performing the following operations based on a pre-trained diffusion model to generate the target image: in each time step of the diffusion model: calculating a first cross-attention feature of a first image feature and the text feature, wherein the first image feature in a first time step is an image feature of an initial image, and the first image feature in each of a second time step and subsequent time steps is a result image feature generated in a previous time step; obtaining a second cross-attention feature of a second image feature of the reference image and the text feature; editing the first cross-attention feature based on the second cross-attention feature to obtain a third cross-attention feature; and generating a result image feature of the time step based on the third cross-attention feature and the text feature; and decoding a result image feature of a last time step to generate the target image.

11. The electronic device according to claim 10, wherein the first cross-attention feature comprises a first content sub-feature corresponding to the content description text and a first style sub-feature corresponding to the style description text, the second cross-attention feature comprises a second content sub-feature corresponding to the content description text and a second style sub-feature corresponding to t…
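The editing step in claims 2 to 4 can be illustrated with a short NumPy sketch. It assumes, for illustration only, that a cross-attention feature is a matrix of shape (image locations, text tokens) and that the content and style sub-features are the columns belonging to the content and style tokens. The function name, the index arguments, and the default factor values are hypothetical and do not come from the patent text.

```python
import numpy as np


def edit_cross_attention(first, second, content_idx, style_idx,
                         content_factor=1.0, style_factor=1.5):
    """Build the third cross-attention feature from the first and second.

    first, second: (n_locations, n_tokens) arrays for the generated image
                   and the reference image respectively (assumed layout).
    content_idx:   columns of the content description tokens.
    style_idx:     columns of the style description tokens.
    """
    third = first.copy()  # leave the caller's array untouched
    # Claim 3: replace the content sub-feature with the reference image's
    # content sub-feature scaled by the first factor (content consistency).
    third[:, content_idx] = content_factor * second[:, content_idx]
    # Claim 4: the style sub-feature is the generated image's own style
    # sub-feature scaled by the second factor (degree of style applied).
    third[:, style_idx] = style_factor * first[:, style_idx]
    return third


# Toy example: 3 image locations, 2 content tokens followed by 2 style tokens.
first = np.arange(12, dtype=float).reshape(3, 4)
second = np.ones((3, 4))
third = edit_cross_attention(first, second, content_idx=[0, 1], style_idx=[2, 3],
                             content_factor=0.5, style_factor=2.0)
print(third)
```

In this reading, a larger first factor keeps the output closer to the reference image's layout, while a larger second factor strengthens the requested style.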

Assignees

Beijing Baidu Netcom Sci & Tech Co Ltd

Inventors

Classifications

  • using neural networks · CPC title

  • Image coding (bandwidth or redundancy reduction for static pictures H04N1/41; coding or decoding of static colour picture signals H04N1/64; methods or arrangements for coding, decoding, compressing or decompressing digital video signals H04N19/00) · CPC title

  • G06T11/60 · Primary

    Creating or editing images; Combining images with text · CPC title

  • G06V10/44 · Primary

    Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components · CPC title

  • of extracted features · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2025095250A1 cover?
A method is provided that includes: obtaining a reference image and a description text; extracting a text feature of the description text; and performing the following operations based on a pre-trained diffusion model to generate a target image: in each time step of the diffusion model: calculating a first cross-attention feature of a first image feature and the text feature; obtaining a second…
Who is the assignee on this patent?
Beijing Baidu Netcom Sci & Tech Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06T11/60. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 20, 2025 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).