Methods, apparatuses and computer program products for providing tuning-free personalized image generation

US2026004489A1 · US · A1

Patent metadata
Publication number: US-2026004489-A1
Application number: US-202519213377-A
Country: US
Kind code: A1
Filing date: May 20, 2025
Priority date: Jun 28, 2024
Publication date: Jan 1, 2026
Grant date: not listed

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system and method to generate a target image from a reference image are provided. The system may receive, via a LDM, a reference image and a text prompt. The system may extract, via a trained vision encoder in the LDM, a vision control signal from an object in the reference image. The vision control signal indicates an identity of the object. The system may extract, via trained text encoders in the LDM, text control signals associated with the text prompt. The system may generate, via cross attention summation of an output of a vision cross attention unit(s) associated with the vision control signal and an output of text cross attention units associated with the text control signals, spatial features indicative of the reference image and the text prompt. The system may output, via a decoder in communication with the LDM, a target image based on the generated spatial features.
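The cross attention summation described in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the patent's implementation: the function names, the `vision_scale` parameter, and the use of identity projections in place of learned query/key/value weights are all assumptions made for illustration. It only shows the general idea of running separate text and vision cross-attention branches over the same hidden spatial features and adding their outputs element-wise.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(hidden, control):
    """Scaled dot-product attention of spatial queries over control tokens.

    Assumption: identity projections stand in for learned Q/K/V weights,
    so control tokens must have the same width as the hidden features.
    """
    dim = len(control[0])
    output = []
    for query in hidden:
        scores = [
            sum(q * k for q, k in zip(query, token)) / math.sqrt(dim)
            for token in control
        ]
        weights = softmax(scores)
        output.append(
            [sum(w * token[j] for w, token in zip(weights, control)) for j in range(dim)]
        )
    return output


def summed_cross_attention(hidden, text_signals, vision_signal, vision_scale=1.0):
    """Add the text branches' outputs and the vision branch's output element-wise.

    `vision_scale` is a hypothetical knob, not something the abstract names.
    """
    fused = [[0.0] * len(row) for row in hidden]
    branches = [(cross_attention(hidden, tokens), 1.0) for tokens in text_signals]
    branches.append((cross_attention(hidden, vision_signal), vision_scale))
    for branch, scale in branches:
        for i, row in enumerate(branch):
            for j, value in enumerate(row):
                fused[i][j] += scale * value
    return fused


# Toy inputs: two spatial positions, one text encoder with one token, one vision token.
hidden = [[1.0, 0.0], [0.0, 1.0]]
text_signals = [[[1.0, 2.0]]]
vision_signal = [[3.0, 4.0]]
fused = summed_cross_attention(hidden, text_signals, vision_signal)
```

With a single token per branch, each softmax weight is 1, so every spatial position in `fused` is simply the text token plus the vision token, `[4.0, 6.0]`. In a real latent diffusion model the fused features would feed the next layer of the denoising network rather than be read directly.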

First claim

Opening claim text (preview).

What is claimed:

1. A method comprising: receiving, via a latent diffusion model (LDM), a reference image and a text prompt associated with the reference image; extracting, via a trained vision encoder in the LDM, a vision control signal from an object in the reference image that indicates an identity of the object; extracting, via one or more trained text encoders associated with the LDM, one or more text control signals associated with the text prompt; generating, via cross attention summation of (i) an output of one or more vision cross attention units associated with the vision control signal and (ii) an output of one or more text cross attention units associated with the one or more text control signals, first spatial features indicative of the reference image and the text prompt; and outputting a target image based upon the generated first spatial features.

2. The method of claim 1, wherein the extraction of the vision control signal comprises cropping a facial area of the object or a background of the reference image.

3. The method of claim 1, wherein the one or more text cross attention units comprises a low rank adaptor to facilitate preprocessing of an input associated with the reference image or the text prompt.

4. The method of claim 1, wherein the target image preserves the identity of the object in the reference image.

5. The method of claim 1, further comprising: receiving, via the one or more vision cross attention units and the one or more text cross attention units, second spatial features indicative of a hidden state of the LDM.

6. The method of claim 5, wherein the second spatial features comprise a low rank adaptor to facilitate preprocessing of input associated with the reference image or the text prompt.

7. The method of claim 1, wherein the trained vision encoder is trained on a plurality of pairs of a source image and a synthetically generated image.

8. The method of claim 1, wherein the trained vision encoder is trained in plural stages, wherein a first stage comprises a plurality of source images and a second stage comprises a plurality of synthetically generated images.

9. The method of claim 8, wherein the plural stages comprise a third stage and a fourth stage, wherein the third stage comprises a plurality of source images different than the source images in the first stage, and wherein the fourth stage comprises a plurality of synthetically generated images different than the synthetically generated images in the second stage.

10. The method of claim 1, wherein a self-attention unit comprising a low rank adaptor is arranged upstream of the one or more text cross attention units and the one or more vision cross attention units associated with the LDM.

11. A method comprising: receiving, at a latent diffusion model (LDM), a source image comprising an object associated with an identity; extracting, via a first trained machine learning (ML) model associated with the LDM, a first caption indicative of the object in the source image; receiving, via a second trained ML model associated with the LDM, the first caption; outputting, via the second ML, a second caption comprising an enhancement of the first caption; receiving, via a text-to-image generation unit associated with the LDM, the second caption; generating, via the text-to-image generation unit based on the second caption, an intermediary image comprising a trait associated with the object in the source image; processing the intermediary image based on the identity of the object in the source image; and outputting a synthetic image based on the processed intermediary image.

12. The method of claim 11, wherein the text-to-image generation unit comprises a deep learning inference framework.

13. The method of claim 11, wherein the source image comprises a real image.

14. The method of claim 11, wherein the first caption comprises an actionable modifier or an accessory of the object.

15. The method of claim 11, wherein the second caption comprises less noise than the first caption.

16. The method of claim 11, wherein the trait comprises any one or more of age, gender, skin tone or hair.

17. The method of claim 11, wherein the identity comprises a distinct characteristic of the object in relation to a plurality of other objects.

18. The method of claim 11, further comprising: receiving, via a filter comprising a pass-through rate, a pair comprising the source image and the synthetic image, wherein the pass-through rate is based upon any one or more of the identity or a visual appeal of the object; and determining whether the pair meets a predetermined threshold set for the pass-through rate.

19. An apparatus comprising: one or more processors; and at least one memory storing instructions, that when executed by the one or more processors, cause the apparatus to: receive, via a latent diffusion model (LDM), a reference image and a text prompt associated with the reference image; extract, via a trained vision encoder associated with the LDM, a vision control signal based on an object in the reference image that indicates an identity of the object; extract, via one or more trained text encoders associated with the LDM, one or more text control signals associated with the text prompt; generate, via cross attention summation of (i) an output of one or more vision cross attention units associated with the vision control signal and (ii) an output of one or more text cross attention units associated with the one or more text control signals, first spatial features indicative of the reference image and the text prompt; and output a target image based upon the generated first spatial features.

20. The apparatus of claim 19, wherein when the one or more processors further execute the instructions, the apparatus is configured to: perform the extract of the vision control signal by cropping a facial area of the object or a background of the reference image.
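Claim 18 describes filtering (source image, synthetic image) pairs on identity and visual appeal. One possible reading is sketched below, purely as an illustration. The claim does not say how identity or visual appeal is scored, so the embedding vectors, the cosine-similarity identity check, the `appeal_score` field, and both threshold values are hypothetical stand-ins rather than anything taken from the patent.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ImagePair:
    """A source/synthetic pair, reduced to hypothetical precomputed scores."""
    source_embedding: List[float]      # assumed identity embedding of the source image
    synthetic_embedding: List[float]   # assumed identity embedding of the synthetic image
    appeal_score: float                # assumed visual-appeal score in [0, 1]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pair_passes(pair: ImagePair, min_identity: float = 0.6, min_appeal: float = 0.5) -> bool:
    """Keep a pair only if identity is preserved and the image is appealing enough.

    Both thresholds are illustrative values, not figures from the patent.
    """
    identity = cosine_similarity(pair.source_embedding, pair.synthetic_embedding)
    return identity >= min_identity and pair.appeal_score >= min_appeal


def filter_pairs(pairs: List[ImagePair]) -> Tuple[List[ImagePair], float]:
    """Return the surviving pairs and the observed pass-through rate."""
    kept = [p for p in pairs if pair_passes(p)]
    rate = len(kept) / len(pairs) if pairs else 0.0
    return kept, rate


pairs = [
    ImagePair([1.0, 0.0], [1.0, 0.0], 0.9),  # same identity, appealing: kept
    ImagePair([1.0, 0.0], [0.0, 1.0], 0.9),  # identity lost: dropped
    ImagePair([1.0, 0.0], [1.0, 0.0], 0.1),  # low appeal: dropped
]
kept, rate = filter_pairs(pairs)
```

Here one of the three toy pairs survives, so the observed pass-through rate is one third. Pairs that survive a filter of this kind could then serve as training pairs of the sort mentioned in claim 7, though the claims do not spell out that connection.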

Assignees

Meta Platforms Inc

Inventors

Classifications

  • G06F40/40 (Primary)

    Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title

  • Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title

  • Detection; Localisation; Normalisation · CPC title

  • Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion · CPC title

  • G06T11/60 (Primary)

    Creating or editing images; Combining images with text · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2026004489A1 cover?
A system and method to generate a target image from a reference image are provided. The system may receive, via a LDM, a reference image and a text prompt. The system may extract, via a trained vision encoder in the LDM, a vision control signal from an object in the reference image. The vision control signal indicates an identity of the object. The system may extract, via trained text encoders …
Who is the assignee on this patent?
Meta Platforms Inc
What technology area does this patent fall under?
Primary CPC classification G06F40/40. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jan 1, 2026 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).