Who is the assignee on this patent?

Toyota Res Inst Inc, Massachusetts Inst Technology, Toyota Motor Co Ltd

What technology area does this patent fall under?

Primary CPC classification G06T15/20. Mapped technology areas include Physics.

When was this patent published?

Publication date Jan 13, 2026 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Cross-attention decoding for volumetric rendering

US12524952B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12524952-B2
Application number	US-202318364783-A
Country	US
Kind code	B2
Filing date	Aug 3, 2023
Priority date	Nov 8, 2022
Publication date	Jan 13, 2026
Grant date	Jan 13, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods described herein support enhanced computer vision capabilities which may be applicable to, for example, autonomous vehicle operation. An example method includes generating a latent space and a decoder based on image data that includes multiple images, where each image has a different viewing frame of a scene. The method also includes generating a volumetric embedding that is representative of a novel viewing frame of the scene. The method includes decoding, with the decoder, the latent space using cross-attention with the volumetric embedding, and generating a novel viewing frame of the scene based on an output of the decoder.
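The cross-attention decoding step named in the abstract can be illustrated with a minimal sketch. This is not the patented implementation: it assumes a single attention head with no learned projections, uses the latent vectors as both keys and values, and requires queries and latents to share one dimension. The names `softmax` and `cross_attention` are hypothetical helpers introduced only for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, latents):
    """Scaled dot-product cross-attention (illustrative assumption).

    queries: list of query vectors, standing in for volumetric embeddings.
    latents: list of latent vectors, used here as both keys and values.
    Returns one attended vector per query.
    """
    dim = len(latents[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        # Similarity of this query to every latent vector.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in latents]
        weights = softmax(scores)
        # Weighted average of the latent vectors.
        attended = [
            sum(w * v[j] for w, v in zip(weights, latents)) for j in range(dim)
        ]
        outputs.append(attended)
    return outputs


# Toy data: two latent vectors and two queries of the same dimension.
latents = [[1.0, 0.0], [0.0, 1.0]]
queries = [[10.0, 0.0], [0.0, 0.0]]
decoded = cross_attention(queries, latents)
```

In this toy run the first query aligns with the first latent vector and so attends almost entirely to it, while the all-zero query weights both latents equally. A real decoder would add learned query, key, and value projections and further layers that map the attended vectors to pixel or depth values.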

First claim

Opening claim text (preview).

What is claimed is:
1. A method comprising: generating, through training, a latent space and a decoder based on image data that includes multiple images, where each image has a different viewing frame of a scene; generating a volumetric embedding that is representative of a novel viewing frame of the scene by sampling values along a viewing ray to generate 3D points and Fourier encoding the sampled values; decoding, with the decoder, the latent space using cross-attention with the volumetric embedding; and generating the novel viewing frame of the scene based on an output of the decoder.
2. The method of claim 1, wherein the volumetric embedding is a concatenation of an origin embedding and a depth embedding.
3. The method of claim 1, wherein the novel viewing frame includes a predicted depth map of the scene from a perspective of the novel viewing frame.
4. The method of claim 3, wherein the predicted depth map is used to control at least one function of a vehicle.
5. The method of claim 1, wherein the novel viewing frame includes a bitmap of a novel image from a perspective of the novel viewing frame.
6. The method of claim 1, wherein generating the latent space further includes using a multi-view photometric loss to evaluate the latent space.
7. The method of claim 6, wherein the multi-view photometric loss includes a photometric objective that estimates contribution of synthesized novel views by performing a warping function on the one or more of the multiple images in the image data.
8. A system comprising: A preprocessing platform, comprising at least one processor and memory, configured to generate, through training, a latent space and a decoder based on image data that includes multiple images, where each image has a different viewing frame of a scene; a computer vision platform configured to: generate a volumetric embedding that is representative of a novel viewing frame of the scene by sampling values along a viewing ray to generate 3D points, and Fourier encoding the sampled values; decode, with the decoder, the latent space using cross-attention with the volumetric embedding; and generate the novel viewing frame of the scene based on an output of the decoder.
9. The system of claim 8, wherein the volumetric embedding is a concatenation of an origin embedding and a depth embedding.
10. The system of claim 8, wherein the novel viewing frame includes a predicted depth map of the scene from a perspective of the novel viewing frame.
11. The system of claim 10, wherein the predicted depth map is used to control at least one function of a vehicle.
12. The system of claim 8, wherein the novel viewing frame includes a bitmap of a novel image from a perspective of the novel viewing frame.
13. The system of claim 8, wherein to generate the latent space, the preprocessing platform is further configured to use a multi-view photometric loss to evaluate the latent space.
14. The system of claim 13, wherein the multi-view photometric loss includes a photometric objective that estimates contribution of synthesized novel views by performing a warping function on the one or more of the multiple images in the image data.
15. A tangible computer readable medium comprising instructions that, when executed cause a system to: generate, through training, a latent space and a decoder based on image data that includes multiple images, where each image has a different viewing frame of a scene; generate a volumetric embedding that is representative of a novel viewing frame of the scene by sampling values along a viewing ray to generate 3D points, and Fourier encoding the sampled values; decode, with the decoder, the latent space using cross-attention with the volumetric embedding; and generate the novel viewing frame of the scene based on an output of the decoder.
16. The computer readable medium of claim 15, wherein the volumetric embedding is a concatenation of an origin embedding and a depth embedding.
17. The computer readable medium of claim 15, wherein the novel viewing frame includes a predicted depth map of the scene from a perspective of the novel viewing frame.
18. The computer readable medium of claim 17, wherein the predicted depth map is used to control at least one function of a vehicle.
19. The computer readable medium of claim 15, wherein the novel viewing frame includes a bitmap of a novel image from a perspective of the novel viewing frame.
20. The computer readable medium of claim 15, wherein to generate the latent space, the instructions further cause the system to use a multi-view photometric loss to evaluate the latent space, wherein the multi-view photometric loss includes a photometric objective that estimates contribution of synthesized novel views by performing a warping function on the one or more of the multiple images in the image data.
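The ray-sampling and Fourier-encoding step recited in claims 1, 8, and 15, together with the origin and depth concatenation of claims 2, 9, and 16, can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the evenly spaced depths, the near and far bounds, the power-of-two frequency schedule, and the function names `sample_ray_points`, `fourier_encode`, and `volumetric_embedding` are all hypothetical choices that the claims do not specify.

```python
import math


def sample_ray_points(origin, direction, near, far, num_samples):
    """Sample evenly spaced depths along a ray and return (depths, 3D points).

    Even spacing between `near` and `far` is an assumption for illustration.
    """
    if num_samples < 2:
        raise ValueError("num_samples must be at least 2")
    step = (far - near) / (num_samples - 1)
    depths = [near + i * step for i in range(num_samples)]
    # Each point is origin + depth * direction, component by component.
    points = [
        tuple(o + depth * d for o, d in zip(origin, direction))
        for depth in depths
    ]
    return depths, points


def fourier_encode(value, num_bands):
    """Encode a scalar as [sin(2^k * pi * value), cos(2^k * pi * value)] pairs."""
    features = []
    for k in range(num_bands):
        angle = (2.0 ** k) * math.pi * value
        features.extend([math.sin(angle), math.cos(angle)])
    return features


def volumetric_embedding(origin, depth, num_bands):
    """Concatenate an origin embedding with a depth embedding."""
    origin_embedding = [
        feature for coord in origin for feature in fourier_encode(coord, num_bands)
    ]
    depth_embedding = fourier_encode(depth, num_bands)
    return origin_embedding + depth_embedding


# Toy ray: camera at the world origin looking along +z.
origin = (0.0, 0.0, 0.0)
direction = (0.0, 0.0, 1.0)
depths, points = sample_ray_points(origin, direction, near=1.0, far=3.0, num_samples=3)
embeddings = [volumetric_embedding(origin, depth, num_bands=4) for depth in depths]
```

With 4 frequency bands, each scalar becomes 8 features, so every sampled depth yields a 32-value embedding (24 for the three origin coordinates plus 8 for the depth). Embeddings of this kind would serve as the queries in the cross-attention decoding step.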

Assignees

Toyota Res Inst Inc, Massachusetts Inst Technology, Toyota Motor Co Ltd

Inventors

Classifications

G06V20/41
Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items (segmenting video sequences G06V20/49) · CPC title
G06V20/64
Three-dimensional [3D] objects · CPC title
G06V10/7747
Organisation of the process, e.g. bagging or boosting · CPC title
G06V20/56
exterior to a vehicle by using sensors mounted on the vehicle · CPC title
G06V10/25
Determination of region of interest [ROI] or a volume of interest [VOI] · CPC title

Patent family

Related publications grouped by family.

View patent family 90927945
