Quantized efficient encoding for streaming free-viewpoint videos

US2025330601A1 · US · A1

Patent metadata
Field: Value
Publication number: US-2025330601-A1
Application number: US-202519016225-A
Country: US
Kind code: A1
Filing date: Jan 10, 2025
Priority date: Apr 23, 2024
Publication date: Oct 23, 2025
Grant date: (not listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods are provided for streaming free-viewpoint videos (FVV) of a dynamic 3D scene. Each time step of the dynamic 3D scene is represented as a set of 3D Gaussians, each 3D Gaussian having a set of Gaussian attributes. Gaussian residuals are encoded for every time step. In at least one embodiment, position residuals are sparsified and non-position attribute residuals are quantized, thereby achieving compression factors without sacrificing reconstruction quality.
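The residual idea in the abstract can be illustrated with a minimal sketch. It assumes that a residual is the per-Gaussian change relative to the previous time step, and that "sparsified" position residuals are stored only for Gaussians with a nonzero change. The names `SparsePositionResidual`, `sparsify`, and `apply_position_residual` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass

Vec3 = tuple[float, float, float]


@dataclass
class SparsePositionResidual:
    """Position residuals for one time step, kept only for Gaussians that moved."""
    indices: list[int]   # which Gaussians have a nonzero residual
    values: list[Vec3]   # the residual for each listed Gaussian


def sparsify(dense: list[Vec3]) -> SparsePositionResidual:
    """Drop residuals that are exactly zero (assumed storage scheme)."""
    indices, values = [], []
    for i, residual in enumerate(dense):
        if any(component != 0.0 for component in residual):
            indices.append(i)
            values.append(residual)
    return SparsePositionResidual(indices, values)


def apply_position_residual(
    positions: list[Vec3], residual: SparsePositionResidual
) -> list[Vec3]:
    """Assumed update rule: position at time t = position at time t-1 + residual."""
    updated = list(positions)
    for i, (dx, dy, dz) in zip(residual.indices, residual.values):
        x, y, z = updated[i]
        updated[i] = (x + dx, y + dy, z + dz)
    return updated


# Three Gaussians, only the second one moves between time steps.
prev_positions = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (2.0, 0.0, -1.0)]
dense_residual = [(0.0, 0.0, 0.0), (0.5, 0.0, -0.25), (0.0, 0.0, 0.0)]

sparse = sparsify(dense_residual)
next_positions = apply_position_residual(prev_positions, sparse)
```

In this toy case only one of three residuals needs to be stored, which is the kind of saving a sparse representation is meant to provide when most of the scene is static.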

First claim

Opening claim text (preview).

What is claimed is:

1. A method for encoding a free-viewpoint video (FVV), the method comprising: receiving, as input, a first multi-view frame of a scene corresponding to a first time; generating, based on the first multi-view frame, a plurality of three-dimensional (3D) Gaussians that collectively represent the scene at the first time, wherein each 3D Gaussian includes a position attribute and one or more non-position attributes; receiving, as further input, a second multi-view frame of the scene corresponding to a second time; and determining, for each 3D Gaussian of the plurality of 3D Gaussians, a position residual and one or more non-position residuals, the determining comprising: modeling the position residuals using a sparsity framework, modeling the non-position residuals using a quantization framework, and learning the position residuals and the non-position residuals using machine learning techniques.

2. The method according to claim 1, wherein modeling the position residuals using the sparsity framework comprises representing each position residual as a product of a learnable gate and a learnable full-precision position residual.

3. The method according to claim 2, wherein modeling the position residuals using the sparsity framework further comprises initializing parameters of each learnable gate based on a score vector computed, at least in part, from a gradient of a reconstruction loss with respect to a position of a corresponding Gaussian at the second time and a gradient of a reconstruction loss with respect to a position of the corresponding Gaussian at the first time.

4. The method according to claim 1, wherein modeling the non-position residuals using the quantization framework comprises representing each non-position residual as a product of a learnable integer latent and a learnable codebook.

5. The method according to claim 4, wherein the one or more non-position attributes comprise: a rotation attribute, a scale attribute, an opacity attribute, and a color attribute, wherein the one or more non-position residuals comprise: a rotation residual, a scale residual, an opacity residual, and a color residual, and wherein each respective rotation residual is modeled as a product of a corresponding respective rotation integer latent and a common rotation codebook, each respective scale residual is modeled as a product of a corresponding respective scale integer latent and a common scale codebook, each respective opacity residual is modeled as a product of a corresponding respective opacity integer latent and a common opacity codebook, and each respective color residual is modeled as a product of a corresponding respective color integer latent and a common color codebook.

6. The method according to claim 1, wherein the learning the position residuals and the non-position residuals using machine learning techniques comprises: rendering, during a forward pass, an output image corresponding to a viewpoint from which a respective image of the second multi-view frame was captured, the rendering being performed using the position attributes, the non-position attributes, learnable position residuals, and learnable non-position residuals, computing a loss by comparing the output image to the respective image of the second multi-view frame, calculating, during a backward pass, gradients of the computed loss with respect to the learnable position residuals and the learnable non-position residuals, and updating the learnable position residuals and the learnable non-position residuals based on the calculated gradients.

7. The method according to claim 6, wherein the loss includes a reconstruction loss and a regularization loss, wherein the reconstruction loss measures a difference between the output image and the respective image of the second multi-view frame, and wherein the regularization loss decreases as the sparsity of the learnable position residuals increases.

8. The method according to claim 6, wherein the learning the position residuals and the non-position residuals further comprises defining, for the respective image of the second multi-view frame, static and dynamic regions, and wherein the rendering the output image is performed only for the dynamic regions of the respective image of the second multi-view frame.

9. The method according to claim 1, further comprising generating, based on the first multi-view frame, a point cloud, wherein the point cloud is generated using a structure-from-motion algorithm and enhanced using depth map provided via monocular depth estimation.

10. A system for encoding a free-viewpoint video (FVV), the system comprising: processing circuitry configured to: receive, as input, a first multi-view frame of a scene corresponding to a first time; generate, based on the first multi-view frame, a plurality of three-dimensional (3D) Gaussians that collectively represent the scene at the first time, wherein each 3D Gaussian includes a position attribute and one or more non-position attributes, receive, as further input, a second multi-view frame of the scene corresponding to a second time, and determine, for each 3D Gaussian of the plurality of 3D Gaussians, a position residual and one or more non-position residuals, the processing circuitry being configured to determine the position residuals and the non-position residuals by: modeling the position residuals using a sparsity framework, modeling the non-position residuals using a quantization framework, and learning the position residuals and the non-position residuals using machine learning techniques; and one or more memories configured to store the first multi-view frame, the plurality of 3D Gaussians, the second multi-view frame, and the position residuals and the non-position residuals.

11. The system according to claim 10, the processing circuitry being configured to model the position residuals using the sparsity framework by representing each position residual as a product of a learnable gate and a learnable full-precision position residual.

12. The system according to claim 11, the processing circuitry being configured to model the position residuals using the sparsity framework by further initializing parameters of each learnable gate based on a score vector computed, at least in part, from a gradient of a reconstruction loss with respect to a position of a corresponding Gaussian at the second time and a gradient of a reconstruction loss with respect to a position of the corresponding Gaussian at the first time.

13. The system according to claim 10, the processing circuitry being configured to model the non-position residuals using the quantization framework by representing each non-position residual as a product of a learnable integer latent and a learnable codebook.

14. The system according to claim 13, wherein the one or more non-position attributes comprise: a rotation attribute, a scale attribute, an opacity attribute, and a color attribute, wherein the one or more non-position residuals comprise: a rotation residual, a scale residual, an opacity residual, and a color residual, and wherein the processing circuitry is configured to: model each respective rotation residual as a product of a corresponding respective rotation integer latent and a common rotation codebook, model each respective scale residual as a product of a corresponding respective scale integer latent and a common scale codebook, model each respective opacity residual as a product of a corresponding respective opacity integer latent and a common opacity codebook, and model each respecti
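As a rough illustration of the residual models named in claims 2, 4, and 7, the sketch below computes a gated position residual, decodes a non-position residual from an integer latent and a shared codebook, and evaluates a simple sparsity penalty. Several points are assumptions rather than statements from the claims: the "product" of latent and codebook is read here as a vector-matrix product, gates are taken to lie in [0, 1], and the mean-gate penalty is one possible regularizer whose value falls as more gates close. All function names, shapes, and numbers are hypothetical.

```python
def gated_position_residual(gate: float, full_precision: list[float]) -> list[float]:
    """Claim 2 style model: residual = gate * full-precision residual.

    A gate of 0 removes the residual entirely, which is what makes it sparse.
    """
    return [gate * component for component in full_precision]


def decode_latent(latent: list[int], codebook: list[list[float]]) -> list[float]:
    """Claim 4 style model, read as (length-K integer latent) x (K x D codebook)."""
    dims = len(codebook[0])
    return [
        sum(latent[k] * codebook[k][d] for k in range(len(latent)))
        for d in range(dims)
    ]


def sparsity_penalty(gates: list[float]) -> float:
    """Hypothetical regularizer: mean gate value, lower when more gates are closed."""
    return sum(gates) / len(gates)


# Position residuals: only the second Gaussian has an open gate.
gates = [0.0, 1.0, 0.0]
full_precision = [[0.3, 0.1, 0.0], [0.5, 0.0, -0.25], [0.2, 0.2, 0.2]]
position_residuals = [
    gated_position_residual(g, r) for g, r in zip(gates, full_precision)
]

# Non-position residual: one codebook shared by all Gaussians for an attribute
# (the "common" codebook of claim 5), with a small integer latent per Gaussian.
rotation_codebook = [[0.5, 0.0], [0.0, 0.25]]  # K = 2 rows, D = 2 columns
rotation_latent = [2, -1]
rotation_residual = decode_latent(rotation_latent, rotation_codebook)

penalty_sparse = sparsity_penalty(gates)
penalty_dense = sparsity_penalty([1.0, 1.0, 1.0])
```

In a training loop of the kind outlined in claim 6, quantities like these would be learnable parameters updated by gradient descent against a rendered-image loss. The sketch only shows the forward computation with fixed values.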

Assignees

Nvidia Corp

Inventors

Classifications

  • specially adapted for multi-view video sequence encoding · CPC title

  • H04N19/124 · Primary

    Quantisation · CPC title

  • by compressing encoding parameters before transmission · CPC title

  • Position within a video image, e.g. region of interest [ROI] · CPC title

  • the region being a picture, frame or field · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2025330601A1 cover?
Systems and methods are provided for streaming free-viewpoint videos (FVV) of a dynamic 3D scene. Each time step of the dynamic 3D scene is represented as a set of 3D Gaussians, each 3D Gaussian having a set of Gaussian attributes. Gaussian residuals are encoded for every time step. In at least one embodiment, position residuals are sparsified and non-position attribute residuals are quantized,…
Who is the assignee on this patent?
Nvidia Corp
What technology area does this patent fall under?
Primary CPC classification H04N19/124. Mapped technology areas include Electricity.
When was this patent published?
Publication date: Oct 23, 2025 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).