Scene-based text-to-image generation with human priors

US12387388B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12387388-B2
Application number  US-202318149542-A
Country             US
Kind code           B2
Filing date         Jan 3, 2023
Priority date       Jan 3, 2023
Publication date    Aug 12, 2025
Grant date          Aug 12, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In one embodiment, a method includes accessing a text input and a scene input corresponding to the text input, wherein the scene input comprises semantic segmentations, generating text tokens for the text input and scene tokens for the scene input by machine-learning models, generating predicted image tokens based on the text tokens and the scene tokens by the machine-learning models, and generating an image corresponding to the text input and the scene input based on the predicted image tokens by the machine-learning models.
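The four steps in the abstract (tokenize the text and the scene, predict image tokens, decode an image) can be pictured as a simple data flow. The sketch below is only an illustration of that flow under loud assumptions: every function name, the patch size, the vocabulary size, and the codebook size are hypothetical, and each stage is a deterministic toy stand-in rather than the machine-learning model the patent describes.

```python
from collections import Counter


def tokenize_text(text, vocab_size=1000):
    # Toy stand-in for a text encoder: one integer id per word.
    return [sum(ord(ch) for ch in word) % vocab_size for word in text.lower().split()]


def tokenize_scene(seg_map, patch=2):
    # Toy stand-in for a scene encoder: one token per patch, equal to the
    # most common segmentation label inside that patch.
    height, width = len(seg_map), len(seg_map[0])
    tokens = []
    for r in range(0, height, patch):
        for c in range(0, width, patch):
            labels = [seg_map[i][j]
                      for i in range(r, min(r + patch, height))
                      for j in range(c, min(c + patch, width))]
            tokens.append(Counter(labels).most_common(1)[0][0])
    return tokens


def predict_image_tokens(text_tokens, scene_tokens, codebook_size=16):
    # Toy stand-in for the transformer: a deterministic function of both
    # token streams, so the output depends on the text and on the scene.
    text_summary = sum(text_tokens)
    return [(text_summary + 7 * s + i) % codebook_size
            for i, s in enumerate(scene_tokens)]


def decode_image(image_tokens, grid_w, patch=2, codebook_size=16):
    # Toy stand-in for an image decoder: each token becomes a
    # patch x patch block of a single gray level in 0..255.
    rows = []
    for start in range(0, len(image_tokens), grid_w):
        pixel_row = []
        for token in image_tokens[start:start + grid_w]:
            pixel_row.extend([token * 255 // (codebook_size - 1)] * patch)
        rows.extend([list(pixel_row) for _ in range(patch)])
    return rows


def generate(text, seg_map, patch=2):
    # The four steps of the abstract, in order.
    text_tokens = tokenize_text(text)
    scene_tokens = tokenize_scene(seg_map, patch)
    image_tokens = predict_image_tokens(text_tokens, scene_tokens)
    grid_w = (len(seg_map[0]) + patch - 1) // patch
    return decode_image(image_tokens, grid_w, patch)


# A 4x4 segmentation map with three labels (0, 1, 2).
seg = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [2, 2, 2, 2],
       [2, 2, 2, 2]]
image = generate("a cat on grass", seg)
```

The point of the sketch is the ordering and the interfaces: the image is never produced directly from the inputs, only from predicted image tokens that were themselves conditioned on both token streams.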

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising, by one or more computing systems: accessing a text input and a scene input corresponding to the text input, wherein the scene input comprises one or more semantic segmentations; generating, by one or more machine-learning models, one or more text tokens for the text input and one or more scene tokens for the scene input; generating, based on the one or more text tokens and the one or more scene tokens by the one or more machine-learning models, one or more predicted image tokens; and generating, based on the one or more predicted image tokens by the one or more machine-learning models, an image corresponding to the text input and the scene input.
2. The method of claim 1, wherein the one or more machine-learning models comprise one or more of a text encoder, a scene encoder, an image encoder, a transformer neural network model, or an image decoder.
3. The method of claim 1, wherein the one or more semantic segmentations are associated with one or more categories based on one or more of panoptic, human, or face.
4. The method of claim 3, wherein the one or more scene tokens are based on a plurality of channels, and wherein a number of the plurality of channels is based on a number of categories based on panoptic, a number of categories based on human, a number of categories based on face, and an edge channel corresponding to a map of edges separating the one or more semantic segmentations.
5. The method of claim 1, wherein the one or more text tokens are associated with a conditional token stream conditioned on the text input, wherein the method further comprises: generating an unconditional token stream conditioned on an empty text stream initialized with padding tokens, and wherein the generating the one or more predicted image tokens is based on the conditional token stream and the unconditional token stream.
6. The method of claim 5, wherein the one or more machine-learning models comprise a transformer neural network model, wherein the method further comprises: determining, by the transformer neural network model, a plurality of first probabilities associated with a plurality of predicted image tokens based on the conditional token stream and the one or more scene tokens; determining, by the transformer neural network model, a plurality of second probabilities associated with the plurality of predicted image tokens based on the unconditional token stream and the one or more scene tokens; and determining the one or more predicted image tokens from the plurality of predicted image tokens based on the plurality of first probabilities and the plurality of second probabilities.
7. The method of claim 5, further comprising: calculating one or more conditional logit scores based on the conditional token stream; calculating one or more unconditional logit scores based on the unconditional token stream; and calculating one or more guided logit scores based on the one or more conditional logit scores and the one or more unconditional logit scores, and wherein the generating the one or more predicted image tokens is based on the one or more guided logic scores.
8. The method of claim 1, further comprising: generating the one or more semantic segmentations from an existing image.
9. The method of claim 8, wherein the generating the one or more semantic segmentations is based on a segmentation model, and wherein the segmentation model is trained based on one or more of a weighted binary cross-entropy face loss applied over segmentation face parts categories or a semantic segmentation edge map comprising face parts edges.
10. The method of claim 8, wherein the one or more semantic segmentations are associated with one or more labeled categories, and wherein the method further comprises: receiving one or more edits of one or more of the labeled categories; and updating the image based on the one or more edits, wherein the updated image depicts a scene based on the edits of the one or more of the labeled categories.
11. The method of claim 8, further comprising: receiving one or more edits of the text input; and generating, based on the one or more edits of the text input and the scene input, one or more newly interpreted images for the existing image.
12. The method of claim 1, wherein the one or more semantic segmentations are created by a user.
13. The method of claim 1, wherein the generating the one or more predicted image tokens is based on an image encoder, wherein the generating the image corresponding to the text input and the scene input is based on an image decoder, and wherein the image encoder or the image decoder is trained based on a feature-matching loss over activations of a pre-trained face-embedding network comparing between reconstructed face crops and ground-truth face crops.
14. The method of claim 1, wherein the generating the one or more predicted image tokens is based on an image encoder, wherein the generating the image corresponding to the text input and the scene input is based on an image decoder, and wherein the image encoder or the image decoder is trained based on a feature-matching loss over activations of a pre-trained object-recognition network comparing between reconstructed object crops and ground-truth object crops.
15. The method of claim 1, wherein the text input comprises a description of an unusual scene not existing in reality, and wherein the generated image depicts the unusual scene not existing in reality.
16. One or more computer-readable non-transitory storage media embodying software that is operable when executed to: access a text input and a scene input corresponding to the text input, wherein the scene input comprises one or more semantic segmentations; generate, by one or more machine-learning models, one or more text tokens for the text input and one or more scene tokens for the scene input; generate, based on the one or more text tokens and the one or more scene tokens by the one or more machine-learning models, one or more predicted image tokens; and generate, based on the one or more predicted image tokens by the one or more machine-learning models, an image corresponding to the text input and the scene input.
17. A system comprising: one or more processors; and a non-transitory memory coupled to the processors comprising instructions executable by the processors, the processors operable when executing the instructions to: access a text input and a scene input corresponding to the text input, wherein the scene input comprises one or more semantic segmentations; generate, by one or more machine-learning models, one or more text tokens for the text input and one or more scene tokens for the scene input; generate, based on the one or more text tokens and the one or more scene tokens by the one or more machine-learning models, one or more predicted image tokens; and generate, based on the one or more predicted image tokens by the one or more machine-learning models, an image corresponding to the text input and the scene input.
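Claims 5 to 7 describe combining a conditional stream (conditioned on the text) with an unconditional stream (conditioned on padding tokens) to obtain guided logit scores. The claims say only that the guided scores are "based on" both; the sketch below assumes the widely used classifier-free guidance form, guided = unconditional + scale × (conditional − unconditional). That formula, the `scale` parameter, the function names, and the greedy token choice are all assumptions for illustration, not details taken from the claims.

```python
import math


def guided_logits(cond, uncond, scale=3.0):
    # Assumed combination rule (common classifier-free guidance form):
    # push the unconditional logits toward the conditional ones.
    if len(cond) != len(uncond):
        raise ValueError("logit vectors must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits):
    # Numerically stable softmax: subtract the maximum before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_token(cond, uncond, scale=3.0):
    # Greedy choice of the next image token from the guided distribution.
    probs = softmax(guided_logits(cond, uncond, scale))
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical logits over a 3-entry image codebook for one position.
cond_logits = [1.0, 2.0, 0.5]     # from the text-conditioned stream
uncond_logits = [1.0, 2.5, 0.0]   # from the padding-token stream
next_token = pick_token(cond_logits, uncond_logits, scale=3.0)
```

Under this assumed rule, a scale of 1 reproduces the conditional logits and a scale of 0 reproduces the unconditional ones; larger scales favor tokens that the text makes more likely than the unconditional stream does, which can change the selected token.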

Assignees

Meta Platforms Inc

Inventors

Classifications

  • using tickets or tokens, e.g. Kerberos (network architectures or network communication protocols for entities authentication using tickets in a packet data network H04L63/0807) · CPC title

  • Face · CPC title

  • Image coding (bandwidth or redundancy reduction for static pictures H04N1/41; coding or decoding of static colour picture signals H04N1/64; methods or arrangements for coding, decoding, compressing or decompressing digital video signals H04N19/00) · CPC title

  • G06V10/82 · Primary

    using neural networks · CPC title

  • Edge detection · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12387388B2 cover?
In one embodiment, a method includes accessing a text input and a scene input corresponding to the text input, wherein the scene input comprises semantic segmentations, generating text tokens for the text input and scene tokens for the scene input by machine-learning models, generating predicted image tokens based on the text tokens and the scene tokens by the machine-learning models, and gener…
Who is the assignee on this patent?
Meta Platforms Inc
What technology area does this patent fall under?
Primary CPC classification G06V10/82. Mapped technology areas include Physics.
When was this patent published?
Publication date: Aug 12, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).