Scene reconstruction from monocular video

US12586293B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12586293-B2
Application number: US-202318524803-A
Country: US
Kind code: B2
Filing date: Nov 30, 2023
Priority date: Jan 19, 2023
Publication date: Mar 24, 2026
Grant date: Mar 24, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A technique for reconstructing a three-dimensional scene from monocular video adaptively allocates an explicit sparse-dense voxel grid with dense voxel blocks around surfaces in the scene and sparse voxel blocks further from the surfaces. In contrast to conventional systems, the two-level voxel grid can be efficiently queried and sampled. In an embodiment, the scene surface geometry is represented as a signed distance field (SDF). Representation of the scene surface geometry can be extended to multi-modal data such as semantic labels and color. Because properties stored in the sparse-dense voxel grid structure are differentiable, the scene surface geometry can be optimized via differentiable volume rendering.
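The two-level layout described in the abstract can be illustrated with a minimal sketch: a hash map of blocks allocated only where data is written, each block holding a dense array of SDF values. This is an illustration under stated assumptions, not the patented implementation. The class name `SparseDenseGrid`, the block edge of 8 voxels, the default SDF value, and the use of a Python `dict` as the hash map are all hypothetical choices made for the example.

```python
from typing import Dict, List, Optional, Sequence, Tuple

BLOCK = 8  # voxels per block edge; an assumed value, not taken from the patent

BlockKey = Tuple[int, int, int]


class SparseDenseGrid:
    """Hypothetical two-level grid: sparse blocks in a hash map, dense SDF arrays inside."""

    def __init__(self, voxel_size: float, empty_sdf: float = 1.0) -> None:
        self.voxel_size = voxel_size
        self.empty_sdf = empty_sdf  # value given to voxels that were never written
        self.blocks: Dict[BlockKey, List[float]] = {}

    def _locate(self, point: Sequence[float]) -> Tuple[BlockKey, int]:
        """Map a world-space point to (block key, flat index inside that block)."""
        vx, vy, vz = (int(c // self.voxel_size) for c in point)  # global voxel coords
        key = (vx // BLOCK, vy // BLOCK, vz // BLOCK)
        flat = ((vx % BLOCK) * BLOCK + vy % BLOCK) * BLOCK + vz % BLOCK
        return key, flat

    def write(self, point: Sequence[float], sdf: float) -> None:
        """Store an SDF value, allocating the enclosing block on first touch."""
        key, flat = self._locate(point)
        if key not in self.blocks:
            self.blocks[key] = [self.empty_sdf] * BLOCK ** 3
        self.blocks[key][flat] = sdf

    def read(self, point: Sequence[float]) -> Optional[float]:
        """Return the stored SDF value, or None if no block covers the point."""
        key, flat = self._locate(point)
        block = self.blocks.get(key)
        return None if block is None else block[flat]


grid = SparseDenseGrid(voxel_size=0.5)
grid.write((1.0, 2.0, 3.0), -0.25)      # a sample just inside a surface
print(grid.read((1.0, 2.0, 3.0)))       # value stored in an allocated block
print(grid.read((100.0, 0.0, 0.0)))     # far from any surface: no block allocated
```

Only blocks that receive a write consume memory, which is the property the abstract attributes to allocating blocks around surfaces. A real system would presumably use a GPU-friendly hash map and tensors so that the stored values remain differentiable.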

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method, comprising: computing a depth scale for each predicted depth image in a set of predicted depth images corresponding to a set of monocular images comprising a video of a three-dimensional (3D) scene; calibrating each predicted depth image using the respective depth scale to produce calibrated depth values; constructing a volumetric grid for the 3D scene storing properties comprising the calibrated depth values and corresponding color values; rendering the volumetric grid to produce a set of predicted images; and adjusting the properties to reduce differences between the set of predicted images and the set of monocular images.

2. The computer-implemented method of claim 1, further comprising processing the set of monocular images using structure-from-motion supervision to compute the set of predicted depth images.

3. The computer-implemented method of claim 1, wherein the properties further comprise a set of predicted normal vector images and a set of semantic images.

4. The computer-implemented method of claim 1, wherein constructing the volumetric grid comprises allocating sparse voxel blocks near surfaces in the 3D scene and storing the properties in a dense voxel array within each of the sparse voxel blocks.

5. The computer-implemented method of claim 4, further comprising projecting the sparse voxel blocks to the set of monocular images to associate voxels of the dense voxel arrays with the properties.

6. The computer-implemented method of claim 4, wherein the sparse voxel blocks are indexed by a collision-free hash map.

7. The computer-implemented method of claim 1, wherein adjusting the volumetric grid comprises updating the calibrated depth values and the corresponding color values according to backpropagated gradients.

8. The computer-implemented method of claim 1, wherein the calibrating comprises: defining a scale function for each predicted depth image in the set of predicted depth images; and updating the set of predicted depth images to enforce local consistency between visually adjacent monocular images in the set of monocular images.

9. The computer-implemented method of claim 1, further comprising, before the rendering, applying a denoising filter to the volumetric grid.

10. The computer-implemented method of claim 1, wherein the adjusting of the properties is based on an integral of energy over a surface in the 3D scene that is computed by applying continuous conditional random field smoothing to the calibrated depth values.

11. The computer-implemented method of claim 1, wherein at least one of the steps of computing, calibrating, constructing, rendering, and adjusting is performed on a server or in a data center to generate the set of predicted images, and at least a portion of the properties stored in the volumetric grid for the 3D scene are streamed to a user device.

12. The computer-implemented method of claim 1, wherein at least one of the steps of computing, calibrating, constructing, rendering, and adjusting is performed within a cloud computing environment.

13. The computer-implemented method of claim 1, wherein at least one of the steps of computing, calibrating, constructing, rendering, and adjusting is performed for training, testing, or certifying a neural network employed in a machine, robot, or autonomous vehicle.

14. The computer-implemented method of claim 1, wherein at least one of the steps of computing, calibrating, constructing, rendering, and adjusting is performed on a virtual machine comprising a portion of a graphics processing unit.

15. A system, comprising: a memory that stores a set of monocular images comprising a video of a three-dimensional (3D) scene; and a processor that is connected to the memory, wherein the processor is configured to: compute a depth scale for each predicted depth image in a set of predicted depth images corresponding to the set of monocular images; calibrate each predicted depth image using the respective depth scale to produce calibrated depth values; construct a volumetric grid for the 3D scene storing properties comprising the calibrated depth values and corresponding color values; render the volumetric grid to produce a set of predicted images; and adjust the properties to reduce differences between the set of predicted images and the set of monocular images.

16. The system of claim 15, wherein the properties further comprise a set of predicted normal vector images and a set of semantic images.

17. The system of claim 15, wherein constructing the volumetric grid comprises allocating sparse voxel blocks near surfaces in the 3D scene and storing the properties in a dense voxel array within each of the sparse voxel blocks.

18. A non-transitory computer-readable media storing computer instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of: computing a depth scale for each predicted depth image in a set of predicted depth images corresponding to a set of monocular images comprising a video of a three-dimensional (3D) scene; calibrating each predicted depth image using the respective depth scale to produce calibrated depth values; constructing a volumetric grid for the 3D scene storing properties comprising the calibrated depth values and corresponding color values; rendering the volumetric grid to produce a set of predicted images; and adjusting the properties to reduce differences between the set of predicted images and the set of monocular images.

19. The non-transitory computer-readable media of claim 18, wherein the properties further comprise a set of predicted normal vector images and a set of semantic images.

20. The non-transitory computer-readable media of claim 18, wherein constructing the volumetric grid comprises allocating sparse voxel blocks near surfaces in the 3D scene and storing the properties in a dense voxel array within each of the sparse voxel blocks.
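The first two steps of claim 1 (computing a per-image depth scale, then calibrating the predicted depth) can be sketched as follows. The claims do not state how the scale is computed, so this example assumes a closed-form least-squares fit against sparse reference depths, such as those a structure-from-motion pipeline might supply. The function names `depth_scale` and `calibrate` and the sample values are hypothetical.

```python
from typing import List, Sequence


def depth_scale(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Assumed method: scale s minimising sum((s * p - r)^2) over paired samples.

    Setting the derivative to zero gives s = sum(p * r) / sum(p * p).
    """
    numerator = sum(p * r for p, r in zip(predicted, reference))
    denominator = sum(p * p for p in predicted)
    if denominator == 0.0:
        raise ValueError("predicted depths are all zero; scale is undefined")
    return numerator / denominator


def calibrate(depth_image: List[List[float]], scale: float) -> List[List[float]]:
    """Apply one scale to every pixel of a predicted depth image."""
    return [[scale * d for d in row] for row in depth_image]


# Predicted depth for one frame (arbitrary units from a monocular depth network).
predicted_image = [[1.0, 2.0], [3.0, 4.0]]

# Hypothetical sparse reference depths at pixels (0, 0) and (1, 1).
predicted_samples = [predicted_image[0][0], predicted_image[1][1]]
reference_samples = [2.0, 8.0]

scale = depth_scale(predicted_samples, reference_samples)
print(scale)                              # 2.0
print(calibrate(predicted_image, scale))  # [[2.0, 4.0], [6.0, 8.0]]
```

A single scalar per image is the simplest reading of "depth scale". Claim 8 mentions a scale function and local consistency between adjacent images, which this sketch does not attempt to model.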

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12586293B2 cover?
A technique for reconstructing a three-dimensional scene from monocular video adaptively allocates an explicit sparse-dense voxel grid with dense voxel blocks around surfaces in the scene and sparse voxel blocks further from the surfaces. In contrast to conventional systems, the two-level voxel grid can be efficiently queried and sampled. In an embodiment, the scene surface geometry is represen…
Who is the assignee on this patent?
Nvidia Corp
What technology area does this patent fall under?
Primary CPC classification G06T1/20. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 24, 2026 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).