Real scene image editing method based on hierarchically classified text guidance

US12592013B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12592013-B2
Application number  US-202318483238-A
Country             US
Kind code           B2
Filing date         Oct 9, 2023
Priority date       Jun 30, 2023
Publication date    Mar 31, 2026
Grant date          Mar 31, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Provided is a real scene image editing method based on hierarchically classified text guidance, including: first selecting a hierarchical multi-label text classification model and hierarchically classifying an input style description text; obtaining a latent vector of an indoor scene image and dividing the latent vector; training latent space residual mappers, which are divided into four groups for generating details of a layout, an object, an attribute, and a color in the scene image, and selectively training the mapping model with a secondary word obtained by the text classification model; inputting a tertiary word obtained by the text classification model to a contrastive language-image pre-training (CLIP) network and controlling training of the mapping network by utilizing a CLIP loss; and hierarchically inputting the latent vector to the mapping network to obtain a bias vector, then summing the bias vector with the original vector and inputting the result to the StyleGAN to obtain an edited image.
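The final step of the abstract (a bias vector summed with the original latent vector, with only some mapper groups active) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the patent: `edit_latent`, the toy mappers, and the representation of each group as a flat list of floats are assumptions made for illustration. A real mapper would be a trained network operating on StyleGAN latent codes.

```python
from typing import Callable, Dict, List, Set

Vector = List[float]
Mapper = Callable[[Vector], Vector]

# Group names follow the abstract: layout, object, attribute, color.
GROUPS = ("layout", "object", "attribute", "color")


def edit_latent(
    latent: Dict[str, Vector],
    mappers: Dict[str, Mapper],
    active: Set[str],
) -> Dict[str, Vector]:
    """Return the edited latent: original part plus bias for active groups.

    Groups not listed in `active` (standing in for groups absent from the
    secondary word) are copied through unchanged.
    """
    edited: Dict[str, Vector] = {}
    for name, part in latent.items():
        if name in active and name in mappers:
            bias = mappers[name](part)  # residual (bias) vector for this group
            edited[name] = [x + b for x, b in zip(part, bias)]
        else:
            edited[name] = list(part)
    return edited


# Toy data: each group holds two numbers instead of real latent codes.
latent = {
    "layout": [0.25, 0.5],
    "object": [1.0, 1.5],
    "attribute": [2.0, 2.5],
    "color": [1.0, 2.0],
}
# Hypothetical mappers that always propose a constant bias of 0.5.
toy_mappers = {name: (lambda part: [0.5 for _ in part]) for name in GROUPS}

edited = edit_latent(latent, toy_mappers, active={"color"})
print(edited["color"])   # [1.5, 2.5]
print(edited["layout"])  # [0.25, 0.5]
```

Only the color group changes in this run because it is the only group marked active; the other parts of the latent pass through untouched, which mirrors the idea of keeping unrelated content of the image unchanged.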

First claim

Opening claim text (preview).

What is claimed is:

1. A real scene image editing method based on hierarchically classified text guidance, comprising:

step 1: selecting a hierarchical multi-label text classification model, inputting a primary word $t_1$ to the hierarchical multi-label text classification model to hierarchically classify an indoor style description, wherein an output of the hierarchical multi-label text classification model is set with three levels: the primary word $t_1$ is an abstract style description; a secondary word $t_2$ is a composition description of a scene image; and a tertiary word $t_3$ is a detailed description corresponding to an abstract style; the composition description comprises a layout, an abstract, an attribute, and a color; and the detailed description comprises specific descriptions of the layout, the abstract, the attribute, and the color;

step 2: utilizing an e4e inversion model to obtain a latent vector $w$ of an indoor image trained in a large-scale scene understanding (LSUN) dataset, $w \in W^+$, $W^+$ representing a vector space; and segmenting the latent vector $w$ based on a semantic hierarchical characteristic of StyleGAN and in combination with the secondary word $t_2$ obtained in step 1; wherein step 2 specifically comprises:

2-1, utilizing an e4e model trained on the LSUN dataset to obtain an inverse latent vector $w$ of a real indoor scene in a format of .pt file as an input to the StyleGAN; and

2-2, dividing the obtained latent vector $w$ according to the semantic hierarchical characteristic of the StyleGAN, wherein the layout corresponds to [0,2) layer of a generative network; the object corresponds to [2,6) layer of the generative network; the attribute corresponds to [6,12) layer of the generative network; and the color corresponds to [12,14) layer of the generative network; and

step 3: training a plurality of latent space residual mappers, wherein since different StyleGAN layers are known to generate details of different levels in the scene image, the plurality of latent space residual mappers are divided into four groups, each group comprising a single latent space residual mapper, and the four groups are configured to correspondingly generate details of the layout, the abstract, the attribute, and the color; and realizing manipulation on a real scene image with a visual abstract text by utilizing the tertiary word $t_3$ obtained in step 1 and a contrastive language-image pre-training (CLIP) model.

2. The real scene image editing method based on hierarchically classified text guidance according to claim 1, wherein step 1 specifically comprises:

1-1, based on an image convolution network, utilizing a text encoder and a label encoder to extract text semantic $S_t$ and label semantic $S_l$, as shown in the following formulas, respectively, by sharing a hierarchical structure relationship representation $E$ learned in a label set, wherein $V_t$ represents a set of hierarchical structure nodes; $V_l$ represents a set of label nodes; and $\sigma$ represents an activation function ReLU;

$$S_t = \sigma(E \cdot V_t)$$

$$S_l = \sigma(E \cdot V_l)$$

1-2, projecting the text semantic $S_t$ and the label semantic $S_l$ into a joint embedding space, wherein a joint embedding loss controls a similarity between the text semantic $S_t$ and the label semantic $S_l$;

1-3, by matching a learning loss, performing training to obtain a fine-grained label semantic, a coarse-grained label semantic, and incorrect label semantics, wherein the fine-grained label semantic is closest to the input tertiary word $t_3$; the fine-grained label semantic is $t_3$; the coarse-grained label semantic is $t_2$; and other incorrect label semantics are far away from the primary word $t_1$; and

1-4, with the trained hierarchical multi-label text classification model, inputting the primary word $t_1$ to obtain the desired tertiary word $t_3$ and the secondary word $t_2$.

3. The real scene image editing method based on hierarchically classified text guidance according to claim 1, wherein step 3 specifically comprises:

3-1, due to different StyleGAN layers generating details of different levels in the scene image, dividing four latent space residual mappers into four groups, which correspond to the layout, the abstract, the attribute, and the color, respectively, and providing a different part of the latent vector $w$ for each group; selectively training each latent space residual mapper group according to the secondary word $t_2$ obtained in step 1, wherein the latent space residual mapper groups corresponding to words not comprised in the secondary word $t_2$ are not trained;

3-2, representing the latent vector of an input image as $w = (w_l, w_o, w_p, w_c, w_0)$, wherein $w_l$, $w_o$, $w_p$, $w_c$, and $w_0$ represent divisions of $w$ according to different layers, wherein $w_l$ corresponds to a vector part corresponding to the layout layer; $w_o$ corresponds to a vector part corresponding to the abstract layer; $w_p$ corresponds to a vector part corresponding to the attribute layer; $w_c$ corresponds to a vector part corresponding to the color layer; $w_0$ represents a residual part after the division of the latent vector $w$; since the StyleGAN network has a total of 18 layers, the divided groups are first 14 layers; $M(w) = (M_1(w_l), M_2(w_o), M_3(w_p), M_4(w_c), w_0)$ is obtained by the latent space residual mappers, wherein $M_1$, $M_2$, $M_3$, and $M_4$ represent groups of a mapping network, respectively;

3-3, after training the latent space residual mappers under the influence of a CLIP loss, multiplying a resulting bias vector $\Delta$ by an initial latent vector $w$ of the image to realize editing of the latent vector $w$, and maintaining other semantic content in the input image unchanged, wherein the CLIP loss is capable of minimizing a cosine distance of a generated image and a text prompt:

$$L_{\mathrm{CLIP}}(w) = D_{\mathrm{CLIP}}(G(w + M(w)),\ t_3),$$

wherein $G$ represents a StyleGAN generator; to maintain some visual attribut…
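Two pieces of the claim lend themselves to a small worked example: the layer split in step 2-2 (layers [0,2), [2,6), [6,12), and [12,14) of an 18-layer latent, with the remaining layers left over) and the cosine distance that the CLIP loss in step 3-3 minimizes. The sketch below is illustrative only. The function names, the toy vector width of 4, and the plain-list data layout are assumptions, and no actual CLIP or StyleGAN model is invoked; the two vectors passed to `cosine_distance` merely stand in for an image embedding and a text embedding.

```python
import math
from typing import Dict, List, Sequence

# Half-open layer ranges as stated in step 2-2 of claim 1.
LAYER_RANGES = {
    "layout": (0, 2),
    "object": (2, 6),
    "attribute": (6, 12),
    "color": (12, 14),
}
NUM_LAYERS = 18  # claim 3 states that the StyleGAN network has 18 layers


def split_latent(w: Sequence[Sequence[float]]) -> Dict[str, List[List[float]]]:
    """Split a per-layer latent into the four groups plus the leftover part."""
    if len(w) != NUM_LAYERS:
        raise ValueError(f"expected {NUM_LAYERS} layers, got {len(w)}")
    parts = {
        name: [list(layer) for layer in w[start:stop]]
        for name, (start, stop) in LAYER_RANGES.items()
    }
    # Layers 14..17 are not assigned to any group (w_0 in claim 3).
    parts["rest"] = [list(layer) for layer in w[14:]]
    return parts


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


# Toy latent: 18 layers, each a 4-dimensional vector (an assumed width).
w = [[float(i)] * 4 for i in range(NUM_LAYERS)]
parts = split_latent(w)
print({name: len(layers) for name, layers in parts.items()})
# {'layout': 2, 'object': 4, 'attribute': 6, 'color': 2, 'rest': 4}

print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal)
print(cosine_distance([3.0, 4.0], [3.0, 4.0]))  # 0.0 (same direction)
```

The group sizes (2, 4, 6, and 2 layers) add up to the first 14 layers, which matches the statement in claim 3 that the divided groups cover the first 14 of 18 layers.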

Assignees

Univ Hangzhou Dianzi

Inventors

Classifications

  • Texturing; Colouring; Generation of textures or colours (retouching, inpainting or scratch removal G06T5/77) · CPC title

  • Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title

  • using neural networks · CPC title

  • G06V20/70 · Primary

    Labelling scene content, e.g. deriving syntactic or semantic representations · CPC title

  • Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12592013B2 cover?
Provided is a real scene image editing method based on hierarchically classified text guidance, including: first selecting a hierarchical multi-label text classification model and hierarchically classifying an input style description text; obtaining a latent vector of an indoor scene image and dividing the latent vector; training latent space residual mappers which are divided into four groups f…
Who is the assignee on this patent?
Univ Hangzhou Dianzi
What technology area does this patent fall under?
Primary CPC classification G06V20/70. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 31, 2026 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).