Self-supervised depth for volumetric rendering regularization

US12530835B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12530835-B2
Application number  US-202318364853-A
Country             US
Kind code           B2
Filing date         Aug 3, 2023
Priority date       Nov 8, 2022
Publication date    Jan 20, 2026
Grant date          Jan 20, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

An example method includes generating embeddings of image data that includes multiple images, where each image has a different viewpoint of a scene, generating a latent space and a decoder, wherein the decoder receives embeddings as input to generate an output viewpoint, for each viewpoint in the image data, determining a volumetric rendering view synthesis loss and a multi-view photometric loss, and applying an optimization algorithm to the latent space and the decoder over a number of epochs until the volumetric rendering view synthesis loss is within a volumetric threshold and the multi-view photometric loss is within a multi-view threshold.
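The stopping rule in the abstract (optimize until both losses fall within their thresholds) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the function name `train_until_thresholds`, the scalar stand-ins `z` (latent space) and `w` (decoder), the two placeholder quadratic losses, and the learning rate. The patent does not specify these; only the dual-threshold stopping condition and the use of an optimization algorithm over epochs come from the abstract.

```python
def train_until_thresholds(lr=0.1, volumetric_threshold=1e-3,
                           multiview_threshold=1e-3, max_epochs=1000):
    """Toy gradient descent that stops once BOTH losses are within threshold."""
    # Hypothetical stand-ins: one scalar for the latent space, one for the decoder.
    z, w = 2.0, 0.0
    vol_loss = mv_loss = float("inf")
    for epoch in range(1, max_epochs + 1):
        # Placeholder losses (assumptions, not the patent's objectives):
        # a squared error standing in for the view synthesis loss, and a
        # consistency term standing in for the multi-view photometric loss.
        vol_loss = (z + w - 1.0) ** 2
        mv_loss = (z - w) ** 2
        # Stop only when both losses are within their own thresholds.
        if vol_loss <= volumetric_threshold and mv_loss <= multiview_threshold:
            return epoch, vol_loss, mv_loss
        # Plain gradient descent on the sum of the two losses.
        grad_z = 2.0 * (z + w - 1.0) + 2.0 * (z - w)
        grad_w = 2.0 * (z + w - 1.0) - 2.0 * (z - w)
        z -= lr * grad_z
        w -= lr * grad_w
    # Epoch budget exhausted; the losses are those from the last evaluation.
    return max_epochs, vol_loss, mv_loss


epochs, vol_loss, mv_loss = train_until_thresholds()
print(epochs, vol_loss, mv_loss)
```

The `max_epochs` cap is an added safeguard in this sketch so the loop terminates even if a threshold is never reached.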

First claim

Opening claim text (preview).

What is claimed is:

1. A method of training a latent space trainer for volumetric rendering, the method comprising: generating embeddings of image data that includes multiple images by sampling values along a viewing ray to generate 3D points and Fourier encoding the sampled points, where each image has a different viewpoint of a scene; generating a latent space and a decoder, wherein the decoder receives embeddings as input to generate an output viewpoint; for each viewpoint in the image data, determining a volumetric rendering view synthesis loss and a multi-view photometric loss; and applying an optimization algorithm to the latent space and the decoder over a number of epochs until the volumetric rendering view synthesis loss is within a volumetric threshold and the multi-view photometric loss is within a multi-view threshold.

2. The method of claim 1, wherein the optimization algorithm uses a Mean Square Error objective for the volumetric rendering view synthesis loss.

3. The method of claim 1, wherein the optimization algorithm is a gradient descent algorithm.

4. The method of claim 1, wherein the multi-view photometric loss is determined using a photometric objective.

5. The method of claim 4, wherein the photometric objective is determined by: for each pixel of a target image of the image data, with a predicted depth (d̂), generating, by a warping operation, projected coordinates with a predicted depth (d̂′) in a context image; generating a synthesized target image from the context image; and determining a difference between the target image and the synthesized target image.

6. The method of claim 5, wherein the context image is generated by a transformation matrix.

7. The method of claim 5, wherein the difference between the target image and the synthesized target image is determined by a weighted structural similarity index.

8. The method of claim 5, wherein the synthesized target image is generated by applying grid sampling with bilinear interpolation to place information from the context image onto each target pixel of the synthesized target image based on the projected coordinates.

9. The method of claim 5, wherein pixels of the target image for determining the photometric objective are determined using strided ray sampling.

10. A system for training a latent space trainer for volumetric rendering, the system comprising: one or more processors; a non-transitory computer-readable medium storing instructions that, when executed by the one or more processors, cause the one or more processors to: generate embeddings of image data that includes multiple images by sampling values along a viewing ray to generate 3D points and Fourier encoding the sampled points, where each image has a different viewpoint of a scene; generate a latent space and a decoder, wherein the decoder receives embeddings as input to generate an output viewpoint; for each viewpoint in the image data, determine a volumetric rendering view synthesis loss and a multi-view photometric loss; and apply an optimization algorithm to the latent space and the decoder over a number of epochs until the volumetric rendering view synthesis loss is within a volumetric threshold and the multi-view photometric loss is within a multi-view threshold.

11. The system of claim 10, wherein the optimization algorithm uses a Mean Square Error objective for the volumetric rendering view synthesis loss.

12. The system of claim 10, wherein the optimization algorithm is a gradient descent algorithm.

13. The system of claim 10, wherein the multi-view photometric loss is determined using a photometric objective.

14. The system of claim 13, wherein the photometric objective is determined by: for each pixel of a target image of the image data, with a predicted depth (d̂), generating, by a warping operation, projected coordinates with a predicted depth (d̂′) in a context image; generating a synthesized target image from the context image; and determining a difference between the target image and the synthesized target image.

15. The system of claim 14, wherein the context image is generated by a transformation matrix.

16. The system of claim 14, wherein the difference between the target image and the synthesized target image is determined by a weighted structural similarity index.

17. The system of claim 14, wherein the synthesized target image is generated by applying grid sampling with bilinear interpolation to place information from the context image onto each target pixel of the synthesized target image based on the projected coordinates.

18. The system of claim 14, wherein pixels of the target image for determining the photometric objective are determined using strided ray sampling.

19. A tangible computer-readable medium comprising instructions that, when executed, cause a system to: generate embeddings of image data that includes multiple images by sampling values along a viewing ray to generate 3D points and Fourier encoding the sampled points, where each image has a different viewpoint of a scene; generate a latent space and a decoder, wherein the decoder receives embeddings as input to generate an output viewpoint; for each viewpoint in the image data, determine a volumetric rendering view synthesis loss and a multi-view photometric loss; and apply an optimization algorithm to the latent space and the decoder over a number of epochs until the volumetric rendering view synthesis loss is within a volumetric threshold and the multi-view photometric loss is within a multi-view threshold.

20. The system of claim 19, wherein the multi-view photometric loss is determined using a photometric objective.
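Two steps named in the claims can be made concrete with a small sketch: sampling 3D points along a viewing ray and Fourier encoding them (claim 1), and warping a target pixel with a predicted depth into a context image (claim 5). This is an illustration under stated assumptions, not the patented implementation. The function names, the evenly spaced depth samples, the frequency schedule 2^k·π, the pinhole camera model, and the use of the same intrinsics for both cameras are all assumptions; the claims do not fix any of them.

```python
import math


def sample_ray_points(origin, direction, near, far, num_samples):
    """Return 3D points at evenly spaced depths along a viewing ray (assumed spacing)."""
    points = []
    for i in range(num_samples):
        t = near + (far - near) * i / max(num_samples - 1, 1)
        points.append(tuple(o + t * d for o, d in zip(origin, direction)))
    return points


def fourier_encode(point, num_bands):
    """Encode each coordinate as [sin(2^k*pi*x), cos(2^k*pi*x)] for k < num_bands.

    The frequency schedule is an assumption borrowed from common practice.
    """
    features = []
    for x in point:
        for k in range(num_bands):
            freq = (2.0 ** k) * math.pi
            features.append(math.sin(freq * x))
            features.append(math.cos(freq * x))
    return features


def warp_pixel(u, v, depth, intrinsics, rotation, translation):
    """Project target pixel (u, v) with a predicted depth into a context image.

    intrinsics = (fx, fy, cx, cy) of an assumed pinhole camera shared by both views;
    rotation (3x3 nested lists) and translation (length 3) map target-camera
    coordinates to context-camera coordinates.
    Returns (u', v', depth') in the context image.
    """
    fx, fy, cx, cy = intrinsics
    # Back-project the pixel to a 3D point in the target camera frame.
    p = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Apply the rigid transformation into the context camera frame.
    q = [sum(rotation[r][c] * p[c] for c in range(3)) + translation[r]
         for r in range(3)]
    # Project onto the context image plane.
    return fx * q[0] / q[2] + cx, fy * q[1] / q[2] + cy, q[2]


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
points = sample_ray_points((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), 1.0, 3.0, 3)
embedding = fourier_encode(points[0], num_bands=4)
projected = warp_pixel(60.0, 40.0, 2.0, (100.0, 100.0, 50.0, 50.0),
                       identity, [0.1, 0.0, 0.0])
print(len(embedding), projected)
```

In a full pipeline, the projected coordinates would drive the grid sampling with bilinear interpolation described in claims 8 and 17; that step is omitted here to keep the sketch free of external libraries.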

Assignees

Inventors

Classifications

  • Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items (segmenting video sequences G06V20/49) · CPC title

  • Three-dimensional [3D] objects · CPC title

  • Organisation of the process, e.g. bagging or boosting · CPC title

  • exterior to a vehicle by using sensors mounted on the vehicle · CPC title

  • Determination of region of interest [ROI] or a volume of interest [VOI] · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12530835B2 cover?
An example method includes generating embeddings of image data that includes multiple images, where each image has a different viewpoint of a scene, generating a latent space and a decoder, wherein the decoder receives embeddings as input to generate an output viewpoint, for each viewpoint in the image data, determining a volumetric rendering view synthesis loss and a multi-view photometric lo…
Who is the assignee on this patent?
Toyota Res Inst Inc, Massachusetts Inst Technology, Toyota Motor Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06T15/20. Mapped technology areas include Physics.
When was this patent published?
Publication date Jan 20, 2026 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).