Methods, systems and computer medium for scene-adaptive future depth prediction in monocular videos

US12033342B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12033342-B2
Application number  US-202117203645-A
Country             US
Kind code           B2
Filing date         Mar 16, 2021
Priority date       Mar 16, 2021
Publication date    Jul 9, 2024
Grant date          Jul 9, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems, methods and computer-readable medium for predicting a depth for a video frame are disclosed. An example method may include steps of: receiving a plurality of training data, each comprising a set of consecutive video frames and a depth representation of a subsequent video frame to the consecutive video frames; receiving a pre-trained neural network model f θ having a plurality of weights θ; while the pre-trained neural network model f θ has not converged: computing a plurality of second weights θ i ′, based on each set of consecutive video frames, and updating the plurality of weights θ, based on the plurality of training data and the plurality of second weights θ i ′; receiving a plurality of new consecutive video frames with consecutive timestamps; and predicting a depth representation of video frame immediately subsequent to the new consecutive video frames based on the updated plurality of weights θ.
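The loop described in the abstract (compute per-sample second weights, then update the shared weights until convergence) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the two-parameter toy model `(w, a)`, the frames represented as single floats, the next-frame self-supervised loss, and the names `inner_grad`, `adapt`, `meta_step` and `meta_train`. The update rules follow the form given in claims 2 and 3, with the outer gradient taken through the inner step by the chain rule.

```python
from typing import Sequence, Tuple

Theta = Tuple[float, float]            # toy weights (w, a); hypothetical model
Task = Tuple[Sequence[float], float]   # (frames D_i^img, depth label D_i^depth)


def _pairs(frames: Sequence[float]):
    """Consecutive (frame, next frame) pairs."""
    return list(zip(frames[:-1], frames[1:]))


def inner_grad(theta: Theta, frames: Sequence[float]) -> Theta:
    """Gradient of the toy self-supervised loss mean((w*x_k - x_{k+1})^2)."""
    w, _ = theta
    pairs = _pairs(frames)
    g_w = sum(2.0 * (w * x - y) * x for x, y in pairs) / len(pairs)
    return (g_w, 0.0)  # this loss does not depend on a


def adapt(theta: Theta, frames: Sequence[float], alpha: float) -> Theta:
    """Second weights: theta_i' = theta - alpha * grad L_D(f_theta; frames)."""
    g_w, g_a = inner_grad(theta, frames)
    return (theta[0] - alpha * g_w, theta[1] - alpha * g_a)


def predict_depth(theta: Theta, frames: Sequence[float]) -> float:
    """Toy depth prediction for the frame after the last input frame."""
    w, a = theta
    return a * w * frames[-1]


def meta_loss(theta: Theta, tasks: Sequence[Task], alpha: float) -> float:
    """Sum over samples of the supervised loss evaluated at the second weights."""
    return sum((predict_depth(adapt(theta, f, alpha), f) - d) ** 2 for f, d in tasks)


def meta_step(theta: Theta, tasks: Sequence[Task], alpha: float, beta: float) -> Theta:
    """First-weight update: theta = theta - beta * sum_i grad L_T(f_theta_i')."""
    grad_w = grad_a = 0.0
    for frames, depth in tasks:
        w_p, a_p = adapt(theta, frames, alpha)
        err = predict_depth((w_p, a_p), frames) - depth
        # Gradient of the supervised loss with respect to the second weights.
        d_wp = 2.0 * err * a_p * frames[-1]
        d_ap = 2.0 * err * w_p * frames[-1]
        # Chain rule through the inner step: dw'/dw = 1 - alpha * d2L_D/dw2.
        pairs = _pairs(frames)
        hess = sum(2.0 * x * x for x, _ in pairs) / len(pairs)
        grad_w += d_wp * (1.0 - alpha * hess)
        grad_a += d_ap  # da'/da = 1 because the inner loss ignores a
    return (theta[0] - beta * grad_w, theta[1] - beta * grad_a)


def meta_train(theta: Theta, tasks: Sequence[Task], alpha: float = 0.1,
               beta: float = 0.01, tol: float = 1e-6, max_iters: int = 500) -> Theta:
    """Repeat the outer update until the weights stop changing (or max_iters)."""
    for _ in range(max_iters):
        new_theta = meta_step(theta, tasks, alpha, beta)
        converged = max(abs(n - o) for n, o in zip(new_theta, theta)) < tol
        theta = new_theta
        if converged:
            break
    return theta


# Two made-up samples: three "frames" each, plus a depth label for the next frame.
tasks = [([1.0, 1.1, 1.21], 2.662), ([1.0, 0.9, 0.81], 1.458)]
theta0 = (1.0, 1.0)
theta = meta_train(theta0, tasks)
print(meta_loss(theta0, tasks, 0.1), meta_loss(theta, tasks, 0.1))
```

The inner step uses only the frames, while the outer step also uses the depth labels, which mirrors the split between the two updates in the abstract. A real implementation would replace the hand-derived gradients with automatic differentiation over a neural network.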

First claim

Opening claim text (preview).

The invention claimed is:

1. A computer-implemented method for predicting a depth for a video frame, comprising:

receiving a plurality of training data $D_i = (D_i^{img}, D_i^{depth})$, $i = 1 \ldots N$, wherein each $D_i$: $D_i^{img} = (D_{i1}^{img}, D_{i2}^{img} \ldots D_{it}^{img})$, wherein $D_{i1}^{img}, D_{i2}^{img} \ldots D_{it}^{img}$ each respectively represents a video frame from a plurality of t consecutive video frames with consecutive timestamps, the t consecutive video frames comprise one or more previous video frames and a current video frame, and $D_i^{depth}$ is a depth representation of a future video frame immediately subsequent to the video frame $D_{it}^{img}$;

receiving a pre-trained neural network model $f_\theta$ having a plurality of first weights $\theta$, wherein the pre-trained neural network model $f_\theta$ is pre-trained via a two-stage process comprising a current frame reconstruction training process and a future frame depth prediction training process, wherein the training process of the pre-trained neural network model $f_\theta$ comprises:

  receiving a plurality of consecutive video frames $F_1^{img}, F_2^{img} \ldots F_j^{img}$ with consecutive timestamps;

  setting a plurality of initial parameters of $f_\theta$ with random values to be the plurality of first weights $\theta$;

  extracting a plurality of spatial features from the plurality of consecutive video frames $F_1^{img}, F_2^{img} \ldots F_j^{img}$;

  during the current frame reconstruction training process: reconstructing each of the plurality of consecutive video frames $F_1^{img}, F_2^{img} \ldots F_j^{img}$ based on the plurality of spatial features; and updating values for at least one of the plurality of first weights $\theta$ based on the reconstructed video frames; and

  during the future frame depth prediction training process: extracting temporal features of the plurality of consecutive video frames $F_1^{img}, F_2^{img} \ldots F_j^{img}$ based on the plurality of spatial features, generating a depth prediction for a video frame $F_{j+1}^{img}$ immediately subsequent to the video frame $F_j^{img}$ based on the temporal features of the plurality of consecutive video frames $F_1^{img}, F_2^{img} \ldots F_j^{img}$; and updating values for at least one of the plurality of first weights $\theta$ based on the depth prediction for the video frame $F_{j+1}^{img}$;

while the pre-trained neural network model $f_\theta$ has not converged: computing a plurality of second weights $\theta_i'$, based on the plurality of consecutive video frames $D_i^{img}$ in each $D_i$, $i = 1 \ldots N$ and the pre-trained neural network model $f_\theta$; and updating the plurality of first weights $\theta$, based on the plurality of training data $D_i = (D_i^{img}, D_i^{depth})$, $i = 1 \ldots N$ and the plurality of second weights $\theta_i'$;

receiving a plurality of m new consecutive video frames $D_{new} = (D_1^{new\_img}, D_2^{new\_img} \ldots D_m^{new\_img})$ with consecutive timestamps; and

predicting a depth representation of video frame $D_{m+1}^{new\_img}$ immediately subsequent to the video frame $D_m^{new\_img}$ based on the updated plurality of first weights $\theta$.

2. The method of claim 1, wherein computing the plurality of second weights $\theta_i'$ is based on the equation:

$$\theta_i' = \theta - \alpha \nabla L_{D_i}(f_\theta; D_i^{img}), \quad i = 1 \ldots N$$

wherein $\alpha$ represents a learning rate, $L_{D_i}$ represents a loss computed based on $(f_\theta; D_i^{img})$, and $\nabla$ denotes a gradient operator.

3. The method of claim 2, wherein updating the plurality of first weights $\theta$ is based on the equation:

$$\theta = \theta - \beta \sum_{i=1}^{N} \nabla L_{T_i}(f_{\theta_i'}; D_i^{img}, D_i^{depth})$$

wherein $\beta$ represents a learning rate, $L_{T_i}$ represents a loss computed based on $(f_{\theta_i'}; D_i^{img}, D_i^{depth})$, and $\nabla$ denotes a gradient operator.

4. The method of claim 3, wherein predicting the depth representation of video frame $D_{m+1}^{new\_img}$ comprises: updating the plurality of second weights $\theta_i'$, based on the plurality of new consecutive video frames $D_{new} = (D_1^{new\_img}, D_2^{new\_img} \ldots D_m^{new\_img})$ and the updated plurality of first weights $\theta$; and generating the depth representation based on the updated plurality of second weights $\theta_i'$.

5. The method of claim 4, wherein updating the plurality of second weights $\theta_i'$ is based on the equation:

$$\theta_i' = \theta - \alpha \nabla L_{D_i}(f_\theta; D_i^{new\_img}), \quad i = 1 \ldots m$$

wherein $\alpha$ is the learning rate, $L_{D_i}$ represents a loss computed based on $(f_\theta; D_i^{new\_img})$, and $\nabla$ denotes a gradient operator.

6. The method of claim 1, wherein extracting the temporal features comprises using a 3D convolutional neural network to extract the temporal features.

7. The method of claim 1, wherein the depth representation of any video frame comprises, for one or more surfaces in the video frame, a depth value representing an estimated distance from the respective surface from a viewpoint.

8. The method of claim 7, wherein the depth representation of any video frame comprises a depth map for the video frame.

9. A system for predicting a depth for a video frame, the system comprising: one or more processors; and one or more memories coupled to the one or more processors unit, the one or more memories storing machine-executable instructions that, in response to execution by the one or more processors, cause the system to:

receive a plurality of training data $D_i = (D_i^{img}, D_i^{depth})$, wherein each $D_i$: $D_i^{img} = (D_{i1}^{img}, D_{i2}^{img} \ldots D_{it}^{img})$, wherein $D_{i1}^{img}, D_{i2}^{img} \ldots D_{it}^{img}$ each respectively represents a video frame from a plurality of t consecutive video frames with consecutive timestamps, the t consecutive video frames comprise one or more previous video frames and a current video frame, and $D_i^{depth}$ is a depth representation of a future video frame immediately subsequent to the video frame $D_{it}^{img}$;

receive a pre-trained neural network model $f_\theta$ having a plurality of first weights $\theta$, wherein the pre-trained neural network model $f_\theta$ is pre-trained via a two-stage process comprising a current frame reconstruction training process and a future frame depth prediction training process, wherein, during the training process of the pre-trained neural network model $f_\theta$, the machine-executable instructions, in response to execution by the one or more processors, cause the system to:

  receive a plurality of consecutive video frames $F_1^{img}, F_2^{img} \ldots F_j^{img}$ with consecutive timestamps;

  set a plurality of initial parameters of $f_\theta$ with random values to be the plurality of first weights $\theta$;

  extract a plurality of spatial features from the plurality of consecutive video frames $F_1^{img}, F_2^{img} \ldots F_j^{img}$;

  during the current frame reconstruction training process: reconstruct each of the plurality of consecutive video frames $F_1^{img}, F_2^{img} \ldots F_j^{img}$ based on the plurality of spatial features; and update values for at least one of the plurality of first weights $\theta$ based on the reconstructed video frames; and

  during the future frame depth prediction training process: extract temporal features of the plurality of consecutive video frames $F_1^{img}, F_2^{img} \ldots F_j^{img}$ based on the plurality of spatial features; generate a depth prediction for a video frame $F_{j+1}^{img}$ immediately subsequent to the video frame $F_j^{img}$ based on the temporal features of the plurality of consecutive video frames $F_1^{img}, F_2^{img} \ldots F_j^{img}$; and update values for at least one of the plurality of first weights $\theta$ based on the depth prediction for the video frame $F_{j+1}^{img}$;

while the pre-trained neural network model $f_\theta$ has not converged: compute a plurality of s
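The scene-adaptive prediction step of claims 4 and 5 (update the second weights on the new frames, then generate the depth from those updated weights) can be sketched as follows. This is a hedged illustration only: the function `adapt_then_predict`, the toy two-parameter model, the next-frame loss, and the use of single floats as "frames" are all hypothetical stand-ins, and the claim's per-index update is simplified to one gradient step over the whole new clip.

```python
from typing import Callable, List, Sequence, Tuple

Weights = List[float]


def adapt_then_predict(
    theta: Weights,
    new_frames: Sequence[float],
    loss_grad: Callable[[Weights, Sequence[float]], Weights],
    predict: Callable[[Weights, Sequence[float]], float],
    alpha: float,
) -> Tuple[float, Weights]:
    """One self-supervised step on the new frames, then predict with theta'.

    theta' = theta - alpha * grad L(f_theta; new_frames); theta is left unchanged.
    """
    grad = loss_grad(theta, new_frames)
    theta_prime = [p - alpha * g for p, g in zip(theta, grad)]
    return predict(theta_prime, new_frames), theta_prime


# --- Toy stand-ins (assumptions, not the patent's network or losses) ---

def toy_loss(theta: Weights, frames: Sequence[float]) -> float:
    """Self-supervised loss: predict each frame from the previous one."""
    w = theta[0]
    pairs = list(zip(frames[:-1], frames[1:]))
    return sum((w * x - y) ** 2 for x, y in pairs) / len(pairs)


def toy_loss_grad(theta: Weights, frames: Sequence[float]) -> Weights:
    """Analytic gradient of toy_loss with respect to (w, a)."""
    w = theta[0]
    pairs = list(zip(frames[:-1], frames[1:]))
    g_w = sum(2.0 * (w * x - y) * x for x, y in pairs) / len(pairs)
    return [g_w, 0.0]  # toy_loss does not depend on a


def toy_predict(theta: Weights, frames: Sequence[float]) -> float:
    """Toy depth for the frame following the last input frame."""
    w, a = theta
    return a * w * frames[-1]


theta = [1.0, 2.0]              # stands in for the meta-trained first weights
new_frames = [1.0, 1.1, 1.21]   # stands in for the m new consecutive frames
depth, theta_prime = adapt_then_predict(
    theta, new_frames, toy_loss_grad, toy_predict, alpha=0.1
)
print(depth, theta_prime)
```

No depth label is needed at this stage: the adaptation uses only the new frames, so it can run on an unseen scene before the prediction is made.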

Classifications

CPC titles:

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]

  • Auto-encoder networks; Encoder-decoder networks

  • Weakly supervised learning, e.g. semi-supervised or self-supervised learning

  • Hyperparameter optimisation; Meta-learning; Learning-to-learn

  • Convolutional networks [CNN, ConvNet]

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12033342B2 cover?
Systems, methods and computer-readable medium for predicting a depth for a video frame are disclosed. An example method may include steps of: receiving a plurality of training data, each comprising a set of consecutive video frames and a depth representation of a subsequent video frame to the consecutive video frames; receiving a pre-trained neural network model f θ having a plurality of weigh…
Who is the assignee on this patent?
Liu Huan, Chi Zhixiang, Yu Yuanhao, and 3 more
What technology area does this patent fall under?
Primary CPC classification G06T7/579. Mapped technology areas include Physics.
When was this patent published?
Publication date Jul 9, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).