What technology area does this patent fall under?

Primary CPC classification G06T17/20. Mapped technology areas include Physics.

When was this patent published?

Publication date: Jan 20, 2026 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Image generation from text and 3D object

US12530847B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12530847-B2
Application number	US-202318100546-A
Country	US
Kind code	B2
Filing date	Jan 23, 2023
Priority date	Jan 23, 2023
Publication date	Jan 20, 2026
Grant date	Jan 20, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection. Read this to see what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Aspects of the present disclosure involve a system for generating images that depict a real-world object in a scene. The system receives image content comprising a depth map of a scene and a three-dimensional (3D) model of a real-world object. The system receives a textual description for a background. The system applies the image content and the textual description to a machine learning model to generate a scene image that depicts the real-world object on the background corresponding to the textual description.
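The abstract describes a conditional generation pipeline: the depth map and the 3D object model supply geometry, and the textual description steers the background. The minimal Python sketch below shows only that data flow, under stated assumptions. `SceneInputs`, `stub_generator`, and `generate_scene_image` are hypothetical names, and the stub simply renders the depth map as grayscale. It is not the patented model, which the claims describe only as a machine learning model (claim 2 adds a generative network with a diffusion module).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical representations: a depth map is a 2D grid of distances,
# and an RGB image is a 2D grid of (r, g, b) tuples with values 0..255.
DepthMap = List[List[float]]
Image = List[List[Tuple[int, int, int]]]


@dataclass
class SceneInputs:
    """Bundle of the inputs named in the abstract (names are illustrative)."""
    depth_map: DepthMap      # geometry of the scene
    object_model_id: str     # stand-in for the 3D model of the real-world object
    background_text: str     # textual description of the desired background


# Any callable that maps the bundled inputs to an image can play the model's role.
Generator = Callable[[SceneInputs], Image]


def stub_generator(inputs: SceneInputs) -> Image:
    """Placeholder model: renders nearer depths as brighter gray pixels.

    It ignores the 3D model and the text; a trained generative model
    conditioned on all three inputs would replace it.
    """
    flat = [d for row in inputs.depth_map for d in row]
    near, far = min(flat), max(flat)
    span = (far - near) or 1.0  # avoid dividing by zero on a flat scene
    image: Image = []
    for row in inputs.depth_map:
        out_row = []
        for d in row:
            level = int(round(255 * (far - d) / span))  # nearest -> 255, farthest -> 0
            out_row.append((level, level, level))
        image.append(out_row)
    return image


def generate_scene_image(inputs: SceneInputs, model: Generator) -> Image:
    """Apply the image content and text to a model and check the output size."""
    image = model(inputs)
    same_height = len(image) == len(inputs.depth_map)
    same_width = all(len(r) == len(d) for r, d in zip(image, inputs.depth_map))
    if not (same_height and same_width):
        raise ValueError("model output must match the depth map resolution")
    return image


inputs = SceneInputs(
    depth_map=[[2.0, 2.0], [1.0, 3.0]],
    object_model_id="shoe-001",
    background_text="a sunlit beach at golden hour",
)
img = generate_scene_image(inputs, stub_generator)
```

Keeping the model behind a plain callable type means the stub can be swapped for a real conditional generator without changing how the inputs are assembled.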

Claims

Full claim text (claims 1 to 20).

What is claimed is:

1. A method comprising: receiving, by one or more processors of a device, image content comprising a depth map of a scene and a three-dimensional (3D) model of a real-world object, the 3D model of the real-world object representing a viewpoint of the real-world object from 360 degrees including all textures of the real-world object from each angle, the depth map comprising a position of a pedestal; receiving input that selects the depth map for the scene from a list of predefined generic depth maps of scenes, each predefined generic depth map of the list of generic depth maps of scenes representing a different depth map that relates a position of a target 3D object model to a respective background and surface on which the target 3D object model is mounted; receiving a textual description for a background; and applying the image content and the textual description to a machine learning model to generate a simulated scene image that depicts the real-world object being placed on the pedestal on the background corresponding to the textual description, the simulated scene image being generated to depict the real-world object on top of the pedestal at the position of the pedestal in the depth map.

2. The method of claim 1, wherein the machine learning model comprises a generative artificial neural network comprising a diffusion module that blurs some of the background of the simulated scene image.

3. The method of claim 1, wherein the textual description comprises a caption.

4. The method of claim 3, further comprising overlaying the caption on the simulated scene image.

5. The method of claim 4, wherein the textual description specifies a position for the caption, wherein the caption is overlaid at the specified position in the simulated scene image.

6. The method of claim 1, wherein the real-world object comprises a shoe that is depicted as being placed on top of the pedestal.

7. The method of claim 1, further comprising: receiving input from a user that moves an augmented reality (AR) object representing the real-world object within the simulated scene image using the 3D model of the real-world object.

8. The method of claim 1, further comprising: capturing one or more images of the real-world object by a camera; and generating the 3D model of the real-world object based on the one or more images.

9. The method of claim 1, each predefined generic depth map representing a custom position of a shoe in relation to the respective background and pedestal on which the shoe is placed.

10. The method of claim 1, wherein the machine learning model is trained to map a texture of the 3D model of the real-world object to a position of the real-world object in the simulated scene image.

11. The method of claim 1, further comprising: receiving input that selects a 3D camera position; and updating the simulated scene image to depict the real-world object on the background from the selected 3D camera position.

12. The method of claim 1, further comprising: receiving multiple 3D scene depth images; and generating, based on the multiple 3D scene depth images, a video comprising a plurality of scene images that depict the real-world object on the background from multiple 3D camera positions, the video being generated after initially generating the scene based on a single depth image comprising the depth map and the target 3D object model.

13. The method of claim 1, wherein the image content includes a position of the 3D model in the depth map of the scene.

14. The method of claim 1, further comprising training the machine learning model by performing training operations comprising: receiving training data comprising a plurality of training image content representing training 3D object models on depth maps and corresponding ground truth images depicting the training object models in a training scene; applying the machine learning model to a first training image content of the plurality of training image content to generate an estimated image depicting a 3D object of the first training image content on a training scene; computing a deviation between the estimated image and the ground truth image associated with the first training image content; and updating parameters of the machine learning model based on the computed deviation.

15. A system comprising: at least one processor of a device programmed to perform operations comprising: receiving image content comprising a depth map of a scene and a three-dimensional (3D) model of a real-world object, the 3D model of the real-world object representing a viewpoint of the real-world object from 360 degrees including all textures of the real-world object from each angle, the depth map comprising a position of a pedestal; receiving input that selects the depth map for the scene from a list of predefined generic depth maps of scenes, each predefined generic depth map of the list of generic depth maps of scenes representing a different depth map that relates a position of a target 3D object model to a respective background and surface on which the target 3D object model is mounted; receiving a textual description for a background; and applying the image content and the textual description to a machine learning model to generate a simulated scene image that depicts the real-world object being placed on the pedestal on the background corresponding to the textual description, the simulated scene image being generated to depict the real-world object on top of the pedestal at the position of the pedestal of the depth map.

16. The system of claim 15, wherein the machine learning model comprises a generative artificial neural network.

17. The system of claim 15, wherein the textual description comprises a caption.

18. The system of claim 17, the operations comprising overlaying the caption on the simulated scene image.

19. A non-transitory machine-readable storage medium that includes instructions that, when executed by one or more processors of a device, cause the device to perform operations comprising: receiving image content comprising a depth map of a scene and a three-dimensional (3D) model of a real-world object, the 3D model of the real-world object representing a viewpoint of the real-world object from 360 degrees including all textures of the real-world object from each angle, the depth map comprising a position of a pedestal; receiving input that selects the depth map for the scene from a list of predefined generic depth maps of scenes, each predefined generic depth map of the list of generic depth maps of scenes representing a different depth map that relates a position of a target 3D object model to a respective background and surface on which the target 3D object model is mounted; receiving a textual description for a background; and applying the image content and the textual description to a machine learning model to generate a simulated scene image that depicts the real-world object being placed on the pedestal on the background corresponding to the textual description, the simulated scene image being generated to depict the real-world object on top of the pedestal at the position of the pedestal of the depth map.

20. The non-transitory machine-readable storage medium of claim 19, the operations comprising overlaying a caption on the simulated scene image.
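Claim 14 recites a standard supervised training loop: apply the model to training image content, compute a deviation between the estimated image and the ground-truth image, and update the model's parameters based on that deviation. The sketch below shows that loop with a deliberately tiny stand-in model (one weight and one bias applied per pixel) and mean squared error as the deviation. Both choices are assumptions, since the claim names neither the architecture nor the loss, and `TinyModel`, `train_step`, and `train` are hypothetical names.

```python
from typing import List, Sequence, Tuple

# Each training example pairs flattened "image content" (e.g. depth values)
# with a flattened ground-truth image of the same length.
Example = Tuple[List[float], List[float]]


class TinyModel:
    """Hypothetical stand-in for the generative model: pixel = w * x + b."""

    def __init__(self, w: float = 0.0, b: float = 0.0) -> None:
        self.w = w
        self.b = b

    def predict(self, content: Sequence[float]) -> List[float]:
        return [self.w * x + self.b for x in content]


def mse(estimate: Sequence[float], truth: Sequence[float]) -> float:
    """Deviation between estimated and ground-truth images (mean squared error)."""
    return sum((e - t) ** 2 for e, t in zip(estimate, truth)) / len(truth)


def train_step(model: TinyModel, example: Example, lr: float) -> float:
    """One pass of the claimed operations: apply, measure deviation, update."""
    content, truth = example
    estimate = model.predict(content)          # apply the model
    loss = mse(estimate, truth)                # compute the deviation
    n = len(truth)
    # Analytic gradients of the mean squared error with respect to w and b.
    grad_w = sum(2 * (e - t) * x for e, t, x in zip(estimate, truth, content)) / n
    grad_b = sum(2 * (e - t) for e, t in zip(estimate, truth)) / n
    model.w -= lr * grad_w                     # update parameters
    model.b -= lr * grad_b
    return loss


def train(model: TinyModel, data: List[Example], epochs: int, lr: float) -> List[float]:
    """Repeat the step over the data; records the last step's loss per epoch."""
    history = []
    for _ in range(epochs):
        loss = 0.0
        for example in data:
            loss = train_step(model, example, lr)
        history.append(loss)
    return history


# Synthetic data in which the "ground truth" follows pixel = 2 * x + 1.
data = [([0.0, 1.0, 2.0], [1.0, 3.0, 5.0]), ([0.5, 1.5], [2.0, 4.0])]
model = TinyModel()
losses = train(model, data, epochs=500, lr=0.05)
```

A real image generator would replace the two scalar parameters with network weights and the hand-derived gradients with automatic differentiation, but the apply, compare, and update cycle is the same one the claim lists.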
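Claim 1 adds two concrete constraints to the abstract: the depth map is chosen from a list of predefined generic depth maps, and the object must appear on top of a pedestal whose position that depth map records. The sketch below models only that bookkeeping, under stated assumptions. `GenericDepthMap`, `Placement`, `select_depth_map`, and `place_on_pedestal` are hypothetical names, and treating the object as a pixel-sized box centered on the pedestal top is an illustrative simplification; in the claim, the machine learning model itself renders the object at the pedestal position.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class GenericDepthMap:
    """Hypothetical predefined scene layout: a depth grid plus the pedestal top."""
    name: str
    depth: List[List[float]]       # per-pixel distance from the camera
    pedestal_top: Tuple[int, int]  # (row, col) where the pedestal's top surface sits


@dataclass
class Placement:
    """Where the object should appear in the simulated scene image."""
    top: int
    left: int
    bottom: int   # exclusive; equals the pedestal-top row so the object rests on it
    right: int    # exclusive
    depth: float  # scene depth at the contact point


def select_depth_map(library: Dict[str, GenericDepthMap], choice: str) -> GenericDepthMap:
    """Return the user's selection from the list of predefined depth maps."""
    if choice not in library:
        raise KeyError(f"unknown depth map {choice!r}; options: {sorted(library)}")
    return library[choice]


def place_on_pedestal(scene: GenericDepthMap, obj_h: int, obj_w: int) -> Placement:
    """Compute a box for an obj_h x obj_w pixel object resting on the pedestal."""
    row, col = scene.pedestal_top
    top = row - obj_h
    left = col - obj_w // 2
    rows, cols = len(scene.depth), len(scene.depth[0])
    if top < 0 or left < 0 or left + obj_w > cols or row >= rows:
        raise ValueError("object does not fit above the pedestal in this scene")
    return Placement(top, left, row, left + obj_w, scene.depth[row][col])


# A flat backdrop at depth 5.0 with a pedestal top closer to the camera.
depth = [[5.0] * 8 for _ in range(6)]
depth[4][4] = 2.5
library = {"studio-pedestal": GenericDepthMap("studio-pedestal", depth, (4, 4))}
scene = select_depth_map(library, "studio-pedestal")
box = place_on_pedestal(scene, obj_h=3, obj_w=4)
```

Because the placement here comes from the selected depth map rather than from the text, changing the background description leaves the object's position unchanged, which is consistent with the claim's requirement that the object appear at the pedestal position in the depth map.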

Assignees

Snap Inc

Inventors

Classifications (CPC code: title)

G06T5/50: using two or more images, e.g. averaging or subtraction
G06Q30/0276: Advertisement creation
G06T2207/20212: Image combination
G06T2219/2012: Colour editing, changing, or manipulating; Use of colour codes
G06N3/0475: Generative networks

Patent family

Related publications grouped by family.

Patent family ID: 89983358

