Robust consistent video depth estimation
US-12243251-B1 · Mar 4, 2025 · US
US12536682B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12536682-B2 |
| Application number | US-202318301032-A |
| Country | US |
| Kind code | B2 |
| Filing date | Apr 14, 2023 |
| Priority date | Apr 15, 2022 |
| Publication date | Jan 27, 2026 |
| Grant date | Jan 27, 2026 |
Official abstract text for this publication.
A method and system for generating a depth map corresponding to a frame of a sequence of frames in a video clip is disclosed. This can involve generating a single image depth map for each of a plurality of frames, scaling the single image depth maps, and processing a time sequence of scaled single image depth maps to generate said depth map corresponding to the frame of the sequence of frames in the video clip.
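The three stages named in the abstract (per-frame single image depth, scaling, and processing a time sequence of scaled maps) can be illustrated with a minimal sketch. Everything named here is hypothetical: `estimate_single_image_depth` is a placeholder rather than a real depth model, the scale values are simply passed in, and the temporal stage is a plain windowed mean chosen for illustration, since the abstract does not say how the time sequence is processed.

```python
import numpy as np

def estimate_single_image_depth(frame):
    """Hypothetical stand-in for a single image depth estimator.

    Returns a fake depth map derived from pixel brightness, only so that
    the pipeline below has something to operate on.
    """
    return frame.mean(axis=-1) + 1.0

def scale_depth_map(depth, scale):
    """Apply a scale (scalar or per-pixel array) to a depth map."""
    return depth * scale

def temporal_fuse(scaled_depths, index, radius=1):
    """Assumed temporal stage: mean over a window of scaled depth maps."""
    lo = max(0, index - radius)
    hi = min(len(scaled_depths), index + radius + 1)
    return np.mean(np.stack(scaled_depths[lo:hi]), axis=0)

def video_depth(frames, scales, index, radius=1):
    """Depth map for frames[index] built from the three stages."""
    single = [estimate_single_image_depth(f) for f in frames]
    scaled = [scale_depth_map(d, s) for d, s in zip(single, scales)]
    return temporal_fuse(scaled, index, radius)
```

For example, `video_depth(frames, scales, index=1)` returns an array with the same height and width as the input frames, combining frame 1 with its immediate neighbors.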
Opening claim text (preview).
The invention claimed is:

1. A method of generating a depth map corresponding to a frame of a sequence of frames in a video clip, the method comprising: generating a single image depth map for each frame of a plurality of frames; scaling the single image depth map for each frame to generate a scaled single image depth map for said each frame by applying a scale value to each pixel of said single image depth map, wherein the scale value for each pixel of the single image depth map is generated using a method comprising: for each grid point of a plurality of grid points which are arranged across the frame: generating an initial scale value using a depth value for the grid point and depth values corresponding to the same grid point from a plurality of temporally related frames; generating a final scale value for said grid point on the basis of said grid point's initial scale value and the initial scale value of one or more neighboring grid points; and determining corresponding scale values for application to each pixel of said single image depth map from the final scale values of the grid points; and processing a time sequence of scaled single image depth maps to generate said depth map corresponding to the frame of the sequence of frames in the video clip.

2. The method of claim 1 wherein the step of generating an initial scale value using a depth value for the grid point and depth values for the same grid point from a plurality of temporally related frames comprises determining a depth value for the grid point in said frame by determining an average depth value for a region including the grid point; and wherein determining depth values corresponding to the same grid point for a plurality of temporally related frames comprises: determining a correspondence between content of said frame and content of said temporally related frames such that a location corresponding to said grid point can be determined for each of the plurality of temporally related frames, and determining an average depth value for a region including said location in each temporally related frame to determine a depth value corresponding to said grid point for each temporally related frame.

3. The method of claim 2 wherein the initial scale value for each grid point is determined using a ratio of: a measure of central tendency of a group of depth values including at least the depth values for the same grid point from the plurality of temporally related frames, to the depth value for the grid point.

4. The method of claim 3 wherein the group of depth values includes the depth value for the grid point.

5. The method of claim 2 wherein determining a correspondence between the content of said frame and the content of said temporally related frames includes analyzing optical flow between temporally adjacent frames and generating a warped depth map of each of said plurality of temporally related frames in accordance with the optical flow, whereby said location corresponding to said grid point is aligned with said grid point, and determining the average depth value for the region around said location in each temporally related frame uses the warped depth map.

6. The method of claim 5 wherein the method further includes defining a mask including pixels of said frame in which the single image depth map is determined to be either or both of: unreliable based on optical flow analysis of the plurality of frames; or have a depth greater than a threshold depth, and, wherein at least one of: determining a depth value for the grid point by determining an average depth value for a region including the grid point, and/or determining depth values corresponding to the same grid point for a plurality of temporally related frames, excludes pixels that are included in said mask.

7. The method of claim 2 wherein determining a correspondence between the content of said frame and the content of said temporally related frames includes analyzing optical flow between temporally adjacent frames and tracking the location of said grid point in each of said temporally related frames using said optical flow and determining the average depth value for a region around said location in each temporally related frame.

8. The method of claim 7, wherein the method further includes defining a mask including pixels of said frame in which the single image depth map is determined to be either or both of: unreliable based on optical flow analysis of the plurality of frames; or have a depth greater than a threshold depth, and, wherein at least one of: determining a depth value for the grid point by determining an average depth value for a region including the grid point, and/or determining depth values corresponding to the same grid point for a plurality of temporally related frames, excludes pixels that are included in said mask.

9. The method of claim 1 wherein the method includes defining a mask including pixels of said frame in which the single image depth map is determined to be either or both of: unreliable based on optical flow analysis of the plurality of frames; or have a depth greater than a threshold depth.

10. The method of claim 1, wherein the step of generating a final scale value for said grid point on the basis of said grid point's initial scale value and an initial scale value of one or more neighboring grid points comprises: determining a relative contribution of each of said one or more neighboring grid points and said grid point's initial scale value.

11. The method of claim 10 wherein the method further includes: defining a mask including pixels of said frame in which the single image depth map is determined to be either or both of: unreliable based on optical flow analysis of the plurality of frames; or have a depth greater than a threshold depth; and determining a relative contribution for said one or more neighboring grid points based on said mask.

12. The method of claim 1 wherein generating a final scale value for said grid point on the basis of said grid point's initial scale value and an initial scale value of one or more neighboring grid point includes solving a series of linear equations representing an initial scale value of each of said grid points and the initial scale value for each of said grid point's neighboring grid points.

13. The method of claim 1 wherein determining corresponding scale values for application to each pixel of said single image depth map from the final scale values of the grid points comprises generating a scale value for each pixel between said grid points by interpolation.

14. The method of claim 1 wherein determining corresponding scale values for application to each pixel of said single image depth map from the final scale values of the grid points comprises assigning a scale value for each pixel based on a position relative to said grid points.

15. The method of claim 1 wherein generating said single image depth map for each frame comprises using a deep learning model to generate said single image depth map.

16. The method of claim 15 wherein using said deep learning model comprises using a convolutional neural network to generate said single image depth map.

17. A computer system including a processor operating in accordance with execution instructions stored in a non-transitory storage medium, whereby the instructions, when executed, configure the computer system to perform the method of claim 1.

18. The computer system of claim 17 wherein the instr
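The grid-based scaling recited in claims 1 to 4, 12 and 13 can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: the temporally related depth maps are assumed to be already aligned to the current frame (for instance by warping, as in claim 5), the median is picked as one possible measure of central tendency, the linear system is an assumed Laplacian-regularized least-squares form with a hypothetical weight `lam`, and all function and parameter names are invented for this example.

```python
import numpy as np

def region_mean(depth, y, x, half=1):
    """Average depth over a small square region around (y, x)."""
    y0, y1 = max(0, y - half), min(depth.shape[0], y + half + 1)
    x0, x1 = max(0, x - half), min(depth.shape[1], x + half + 1)
    return float(depth[y0:y1, x0:x1].mean())

def initial_scales(depth, aligned, grid_ys, grid_xs, half=1):
    """Initial scale per grid point: median of the depth group over the
    grid point's own depth. `aligned` holds depth maps of temporally
    related frames, assumed already aligned to this frame."""
    s = np.empty((len(grid_ys), len(grid_xs)))
    for i, y in enumerate(grid_ys):
        for j, x in enumerate(grid_xs):
            d = region_mean(depth, y, x, half)
            group = [d] + [region_mean(a, y, x, half) for a in aligned]
            s[i, j] = np.median(group) / d
    return s

def smooth_scales(s0, lam=1.0):
    """Final scales from an assumed linear system (I + lam * L) s = s0,
    where L is the graph Laplacian of the 4-connected grid."""
    rows, cols = s0.shape
    A = np.eye(rows * cols)
    for i in range(rows):
        for j in range(cols):
            k = i * cols + j
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ii, jj = i + di, j + dj
                if 0 <= ii < rows and 0 <= jj < cols:
                    A[k, k] += lam
                    A[k, ii * cols + jj] -= lam
    return np.linalg.solve(A, s0.ravel()).reshape(rows, cols)

def per_pixel_scales(s, grid_ys, grid_xs, shape):
    """Separable linear interpolation from grid points to every pixel.
    Grid coordinates must be increasing; pixels outside the grid take
    the nearest edge value."""
    ys, xs = np.arange(shape[0]), np.arange(shape[1])
    along_x = np.stack([np.interp(xs, grid_xs, row) for row in s])
    return np.stack(
        [np.interp(ys, grid_ys, along_x[:, c]) for c in range(shape[1])],
        axis=1,
    )
```

A scaled single image depth map would then be `depth * per_pixel_scales(smooth_scales(s0), grid_ys, grid_xs, depth.shape)`, with `s0` taken from `initial_scales`. The mask of claims 6, 8, 9 and 11 is left out of this sketch.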
Artificial neural networks [ANN] · CPC title
Range image; Depth image; 3D point clouds · CPC title
Video; Image sequence · CPC title
Registration of image sequences · CPC title
from multiple images · CPC title