End-to-end object tracking using neural networks with attention
US-2025078927-A1 · Mar 6, 2025 · US
US12524952B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12524952-B2 |
| Application number | US-202318364783-A |
| Country | US |
| Kind code | B2 |
| Filing date | Aug 3, 2023 |
| Priority date | Nov 8, 2022 |
| Publication date | Jan 13, 2026 |
| Grant date | Jan 13, 2026 |
Official abstract text for this publication.
Systems and methods described herein support enhanced computer vision capabilities which may be applicable to, for example, autonomous vehicle operation. An example method includes generating a latent space and a decoder based on image data that includes multiple images, where each image has a different viewing frame of a scene. The method also includes generating a volumetric embedding that is representative of a novel viewing frame of the scene. The method includes decoding, with the decoder, the latent space using cross-attention with the volumetric embedding, and generating a novel viewing frame of the scene based on an output of the decoder.
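The decoding step in the abstract, in which the latent space is read out using cross-attention with the volumetric embedding, can be illustrated with a minimal single-head sketch. This is not the patented decoder. The function names `softmax` and `cross_attend` are hypothetical, the learned query/key/value projections of a real decoder are omitted, and the toy vectors are invented for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, latents):
    """Single-head scaled dot-product cross-attention (illustrative only).

    queries: list of vectors standing in for volumetric embeddings.
    latents: list of vectors standing in for the latent space.
    Assumption: queries and latents share one dimension, and the latents
    serve as both keys and values (no learned projections).
    """
    dim = len(latents[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        # Similarity of this query to every latent vector.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in latents]
        weights = softmax(scores)
        # Output is the attention-weighted average of the latent vectors.
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, latents)) for d in range(dim)]
        )
    return outputs


# Toy data: four latent vectors and two queries (made-up values).
latents = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
queries = [[0.5, -0.5, 0.0], [0.0, 2.0, 0.0]]
decoded = cross_attend(queries, latents)
```

Each row of `decoded` corresponds to one query. In a full system, a further learned head would presumably map such rows to pixel colors or depth values, but that mapping is outside this sketch.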
Opening claim text (preview).
What is claimed is:
1. A method comprising: generating, through training, a latent space and a decoder based on image data that includes multiple images, where each image has a different viewing frame of a scene; generating a volumetric embedding that is representative of a novel viewing frame of the scene by sampling values along a viewing ray to generate 3D points and Fourier encoding the sampled values; decoding, with the decoder, the latent space using cross-attention with the volumetric embedding; and generating the novel viewing frame of the scene based on an output of the decoder.
2. The method of claim 1, wherein the volumetric embedding is a concatenation of an origin embedding and a depth embedding.
3. The method of claim 1, wherein the novel viewing frame includes a predicted depth map of the scene from a perspective of the novel viewing frame.
4. The method of claim 3, wherein the predicted depth map is used to control at least one function of a vehicle.
5. The method of claim 1, wherein the novel viewing frame includes a bitmap of a novel image from a perspective of the novel viewing frame.
6. The method of claim 1, wherein generating the latent space further includes using a multi-view photometric loss to evaluate the latent space.
7. The method of claim 6, wherein the multi-view photometric loss includes a photometric objective that estimates contribution of synthesized novel views by performing a warping function on the one or more of the multiple images in the image data.
8. A system comprising: A preprocessing platform, comprising at least one processor and memory, configured to generate, through training, a latent space and a decoder based on image data that includes multiple images, where each image has a different viewing frame of a scene; a computer vision platform configured to: generate a volumetric embedding that is representative of a novel viewing frame of the scene by sampling values along a viewing ray to generate 3D points, and Fourier encoding the sampled values; decode, with the decoder, the latent space using cross-attention with the volumetric embedding; and generate the novel viewing frame of the scene based on an output of the decoder.
9. The system of claim 8, wherein the volumetric embedding is a concatenation of an origin embedding and a depth embedding.
10. The system of claim 8, wherein the novel viewing frame includes a predicted depth map of the scene from a perspective of the novel viewing frame.
11. The system of claim 10, wherein the predicted depth map is used to control at least one function of a vehicle.
12. The system of claim 8, wherein the novel viewing frame includes a bitmap of a novel image from a perspective of the novel viewing frame.
13. The system of claim 8, wherein to generate the latent space, the preprocessing platform is further configured to use a multi-view photometric loss to evaluate the latent space.
14. The system of claim 13, wherein the multi-view photometric loss includes a photometric objective that estimates contribution of synthesized novel views by performing a warping function on the one or more of the multiple images in the image data.
15. A tangible computer readable medium comprising instructions that, when executed cause a system to: generate, through training, a latent space and a decoder based on image data that includes multiple images, where each image has a different viewing frame of a scene; generate a volumetric embedding that is representative of a novel viewing frame of the scene by sampling values along a viewing ray to generate 3D points, and Fourier encoding the sampled values; decode, with the decoder, the latent space using cross-attention with the volumetric embedding; and generate the novel viewing frame of the scene based on an output of the decoder.
16. The computer readable medium of claim 15, wherein the volumetric embedding is a concatenation of an origin embedding and a depth embedding.
17. The computer readable medium of claim 15, wherein the novel viewing frame includes a predicted depth map of the scene from a perspective of the novel viewing frame.
18. The computer readable medium of claim 17, wherein the predicted depth map is used to control at least one function of a vehicle.
19. The computer readable medium of claim 15, wherein the novel viewing frame includes a bitmap of a novel image from a perspective of the novel viewing frame.
20. The computer readable medium of claim 15, wherein to generate the latent space, the instructions further cause the system to use a multi-view photometric loss to evaluate the latent space, wherein the multi-view photometric loss includes a photometric objective that estimates contribution of synthesized novel views by performing a warping function on the one or more of the multiple images in the image data.
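The volumetric embedding recited in the independent claims (sampling values along a viewing ray to generate 3D points, then Fourier encoding them) and in the dependent claims (a concatenation of an origin embedding and a depth embedding) might be sketched as follows. This is an illustrative reading, not the claimed implementation. The function names, the evenly spaced depth sampling, the `near`/`far` bounds, and the power-of-two frequency bands are all assumptions that do not appear in the claims.

```python
import math


def sample_along_ray(origin, direction, near, far, num_samples):
    """Return (depths, points) for evenly spaced samples on a viewing ray.

    Each point is origin + depth * unit_direction.
    Assumption: uniform spacing between near and far (num_samples >= 2).
    """
    if num_samples < 2:
        raise ValueError("num_samples must be at least 2")
    norm = math.sqrt(sum(c * c for c in direction))
    unit = [c / norm for c in direction]
    step = (far - near) / (num_samples - 1)
    depths = [near + i * step for i in range(num_samples)]
    points = [[o + t * u for o, u in zip(origin, unit)] for t in depths]
    return depths, points


def fourier_encode(values, num_bands):
    """Map each value x to sin and cos of (2**k * pi * x) for k < num_bands."""
    encoded = []
    for x in values:
        for k in range(num_bands):
            freq = (2.0 ** k) * math.pi
            encoded.append(math.sin(freq * x))
            encoded.append(math.cos(freq * x))
    return encoded


def volumetric_embedding(origin, direction, near, far, num_samples, num_bands):
    """Concatenate an origin embedding with a depth embedding for one ray.

    Assumption: the origin embedding encodes the camera origin and the depth
    embedding encodes the sampled 3D points; the claims do not fix these details.
    """
    _, points = sample_along_ray(origin, direction, near, far, num_samples)
    origin_embedding = fourier_encode(origin, num_bands)
    flat_points = [c for p in points for c in p]
    depth_embedding = fourier_encode(flat_points, num_bands)
    return origin_embedding + depth_embedding


# Toy ray looking down +z from the world origin (made-up values).
depths, points = sample_along_ray([0.0, 0.0, 0.0], [0.0, 0.0, 2.0], 1.0, 3.0, 3)
embedding = volumetric_embedding([0.0, 0.0, 0.0], [0.0, 0.0, 2.0], 1.0, 3.0, 3, 2)
```

With 3 samples and 2 frequency bands, the origin contributes 3 x 2 x 2 = 12 values and the sampled points contribute 9 x 2 x 2 = 36 values, so `embedding` holds 48 values per ray. One such vector per pixel ray would then act as a query in the cross-attention decoding step.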
Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items (segmenting video sequences G06V20/49) · CPC title
Three-dimensional [3D] objects · CPC title
Organisation of the process, e.g. bagging or boosting · CPC title
exterior to a vehicle by using sensors mounted on the vehicle · CPC title
Determination of region of interest [ROI] or a volume of interest [VOI] · CPC title