Neural Radiance Field Generative Modeling of Object Classes from Single Two-Dimensional Views
US-2024371081-A1 · Nov 7, 2024 · US
US12494013B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12494013-B2 |
| Application number | US-202318211149-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 16, 2023 |
| Priority date | Jun 16, 2023 |
| Publication date | Dec 9, 2025 |
| Grant date | Dec 9, 2025 |
Official abstract text for this publication.
Systems and methods for generating static and articulated 3D assets are provided that include a 3D autodecoder at their core. The 3D autodecoder framework embeds properties learned from the target dataset in the latent space, which can then be decoded into a volumetric representation for rendering view-consistent appearance and geometry. The appropriate intermediate volumetric latent space is then identified and robust normalization and de-normalization operations are implemented to learn a 3D diffusion from 2D images or monocular videos of rigid or articulated objects. The methods are flexible enough to use either existing camera supervision or no camera information at all—instead efficiently learning the camera information during training. The generated results are shown to outperform state-of-the-art alternatives on various benchmark datasets and metrics, including multi-view image datasets of synthetic objects, real in-the-wild videos of moving people, and a large-scale, real video dataset of static objects.
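The robust normalization and de-normalization mentioned in the abstract are specified in claims 1 and 3 in terms of a median m and a Normalized InterQuartile Range (IQR). The following is a minimal NumPy sketch of that pair of operations, not the patented implementation. The function names, the random test volume, and the 1.349 scaling constant (a common convention that makes the IQR match the standard deviation for Gaussian data) are assumptions; the claims do not state how the IQR is normalized.

```python
import numpy as np


def robust_stats(features):
    """Return (median, normalized IQR) computed over all entries of a volume."""
    m = float(np.median(features))
    q25, q75 = np.percentile(features, [25.0, 75.0])
    # Assumption: dividing by 1.349 is one common way to "normalize" an IQR;
    # the source text does not give the constant.
    iqr = float(q75 - q25) / 1.349
    return m, iqr


def normalize(features, m, iqr):
    """Forward step from claim 1: F_hat = (F - m) / IQR."""
    return (features - m) / iqr


def denormalize(features_hat, m, iqr):
    """Inverse step from claim 3: F = F_hat * IQR + m."""
    return features_hat * iqr + m


# Hypothetical latent feature volume: 4 channels on an 8 x 8 x 8 grid.
rng = np.random.default_rng(0)
volume = rng.normal(loc=3.0, scale=2.0, size=(4, 8, 8, 8))

m, iqr = robust_stats(volume)
normalized = normalize(volume, m, iqr)
restored = denormalize(normalized, m, iqr)
```

Median and IQR are less sensitive to outlier feature values than mean and standard deviation, which is presumably why the disclosure calls the normalization robust.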
Opening claim text (preview).
What is claimed is:

1. A method of training a three-dimensional (3D) diffusion model to embed properties from two-dimensional (2D) images learned from a target dataset in a latent space using an autodecoder, comprising: processing embedding vectors of an autodecoder (G) comprising a library of embedding vectors corresponding to objects in a training dataset to generate a latent 3D feature volume; decoding, by the autodecoder, the latent 3D feature volume into a 3D voxel grid for density and radiance representative of an object's shape and appearance; splitting the autodecoder into a first part $G_1$ and a second part $G_2$; normalizing features $\hat{F} = (F - m)/\mathrm{IQR}$ before using features F from the latent 3D feature volume for diffusion by the 3D diffusion model, where median m is a center of distribution of the latent 3D feature volume and a Normalized InterQuartile Range (IQR) approximates a scale of the latent 3D feature volume; training, using the autodecoder, the 3D diffusion model operating in a 3D latent space obtained from the first part $G_1$ using volumetric rendering of the 3D voxel grid with two-dimensional (2D) reconstruction supervision from training images in a training dataset to extract structure and appearance properties from the training dataset; and generating, using the second part $G_2$ and the structure and appearance properties extracted from the training dataset, a 3D representation of the object.

2. The method of claim 1, further comprising progressively upsampling the latent 3D feature volume before decoding the upsampled latent 3D feature volume into the 3D voxel grid.

3. The method of claim 1, further comprising, during inference, denormalizing the features F from the structure and appearance properties extracted from the training dataset by the second part $G_2$ as $F \times \mathrm{IQR} + m$ prior to generating the 3D representation of the object.

4. The method of claim 1, further comprising learning the embedding vectors by the autodecoder.

5. The method of claim 1, wherein decoding by the autodecoder comprises providing at least four residual blocks at each resolution in the autodecoder and using self-attention layers in a second level of resolution $8^3$ and in a third level of resolution $16^3$ of the autodecoder.

6. The method of claim 1, wherein the object is in a canonical pose and training the 3D voxel grid comprises training the 3D voxel grid using ground truth poses, poses estimated using structure from motion, or poses learned from the training dataset during training.

7. The method of claim 6, wherein the canonical pose comprises a canonical voxel representation of a density grid that is a discrete representation of a density field and a canonical representation of a red, green, blue (RGB) radiance field, further comprising tri-linearly interpolating density values and RGB values from the 3D voxel grid after decoding.

8. The method of claim 1, further comprising removing a background of the training images in the training dataset prior to training the 3D diffusion model.

9. The method of claim 1, wherein the object is an articulated non-rigid object, further comprising modeling a shape of the object and local motion from dynamic poses as well as a corresponding non-rigid deformation of a local region.

10. The method of claim 9, further comprising estimating, using a differentiable Perspective-n-Point algorithm, camera poses for each component of the non-rigid object and progressively refining estimated camera poses during training using a combination of learned 3D keypoints for each component of the non-rigid object and corresponding 2D projections predicted in each image, and combining the components with plausible deformations using a learned volumetric linear blend skinning (LBS) algorithm having skinning weights for each component of the non-rigid object that are estimated during training of the 3D diffusion model.

11. The method of claim 1, further comprising representing each object in the training dataset by an embedding vector comprising a concatenation of smaller embedding vectors, wherein representing each object comprises using a deterministic mapping from each training object index to its corresponding concatenated embedding vector using a hashing function where for object index k, the corresponding embedding index is: $m(k) = [(a \cdot k) \bmod 2^{w}] \gg (w - r)$, for a table having $2^{r}$ entries where w and a are heuristic hashing parameters used to reduce a number of collisions while maintaining an appropriate table size.

12. The method of claim 1, further comprising decomposing a target non-rigid object into regions, where each region contains 3D keypoints and corresponding 2D projections per image, that are shared across all non-rigid objects and aligning the non-rigid objects in a learned canonical space to allow for motion transfer between the non-rigid objects.

13. The method of claim 1, wherein the training comprises extracting a text description of an object in the training dataset by providing a hint and a first view of the object along with a question requesting a description of a shape and color of the object for use in an inference stage to identify the object.

14. A system that embeds properties learned from a target dataset in a latent space into a volumetric representation of an object for rendering, comprising: a volumetric autodecoder (G) that learns embedding vectors of a library of embedding vectors corresponding to objects in a training dataset to generate a latent 3D feature volume and that decodes the latent 3D feature volume into a 3D voxel grid for density and radiance representative of an object's shape and appearance, the autodecoder comprising a first part $G_1$ and a second part $G_2$; and a 3D diffusion model that is trained on a latent representation by the volumetric autodecoder, the 3D diffusion model operating in a 3D latent space obtained from the first part $G_1$ using volumetric rendering of the voxel grid with two-dimensional (2D) reconstruction supervision from training images in a training dataset to extract structure and appearance properties from the training dataset, wherein the volumetric autodecoder normalizes features $\hat{F} = (F - m$
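The hashing function recited in claim 11 is a multiplicative hash: multiply the object index by a constant, keep the low w bits, then keep the top r of those bits as an index into a table of 2^r entries. The sketch below illustrates that formula only. The helper names, the particular multipliers, and the use of one multiplier per sub-table to build a concatenated embedding are assumptions made for illustration; the claim says only that w and a are heuristic parameters.

```python
def embedding_index(k: int, a: int, w: int, r: int) -> int:
    """Multiplicative hash m(k) = [(a * k) mod 2**w] >> (w - r).

    The result is always in the range [0, 2**r).
    """
    if not 0 < r <= w:
        raise ValueError("expected 0 < r <= w")
    return ((a * k) % (1 << w)) >> (w - r)


def concatenated_indices(k: int, multipliers: list[int], w: int, r: int) -> list[int]:
    """Hypothetical helper: one table index per sub-embedding.

    Assumption: each smaller embedding table uses its own multiplier, so two
    objects that collide in one table are unlikely to collide in all of them.
    """
    return [embedding_index(k, a, w, r) for a in multipliers]


# Illustrative parameters (not from the patent): 32-bit words and tables of
# 2**10 = 1024 entries. The multipliers are arbitrary odd constants.
W, R = 32, 10
MULTIPLIERS = [2654435769, 2246822519, 3266489917]

for obj in (0, 1, 2, 1000):
    print(obj, concatenated_indices(obj, MULTIPLIERS, W, R))
```

Because the mapping is deterministic, the same object index always selects the same sub-embeddings, so nothing beyond the index needs to be stored per object.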
Shape modification · CPC title
Rotation, translation, scaling · CPC title
Collision detection, intersection · CPC title
Editing of three-dimensional [3D] images, e.g. changing shapes or colours, aligning objects or positioning parts · CPC title
Perspective computation · CPC title