Future semantic segmentation prediction using 3d structure
US-2021073997-A1 · Mar 11, 2021 · US
US12033342B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12033342-B2 |
| Application number | US-202117203645-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 16, 2021 |
| Priority date | Mar 16, 2021 |
| Publication date | Jul 9, 2024 |
| Grant date | Jul 9, 2024 |
Official abstract text for this publication.
Systems, methods and computer-readable medium for predicting a depth for a video frame are disclosed. An example method may include steps of: receiving a plurality of training data, each comprising a set of consecutive video frames and a depth representation of a subsequent video frame to the consecutive video frames; receiving a pre-trained neural network model $f_\theta$ having a plurality of weights $\theta$; while the pre-trained neural network model $f_\theta$ has not converged: computing a plurality of second weights $\theta_i'$, based on each set of consecutive video frames, and updating the plurality of weights $\theta$, based on the plurality of training data and the plurality of second weights $\theta_i'$; receiving a plurality of new consecutive video frames with consecutive timestamps; and predicting a depth representation of video frame immediately subsequent to the new consecutive video frames based on the updated plurality of weights $\theta$.
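The adaptation loop in the abstract (per-sequence second weights, then an update of the shared weights) has the shape of a meta-learning inner/outer loop. The sketch below is a toy illustration only, not the patented implementation: the "network" is a single scalar weight, each "frame" is a single number, the inner loss is an assumed self-supervised next-frame error, and the outer gradient is assumed to be taken with respect to the first weights through the inner step. All function names, the toy data, and the values of `alpha` and `beta` are hypothetical.

```python
# Toy sketch of the inner/outer weight updates, assuming a scalar model
# f_theta(x) = theta * x and scalar "frames". Hypothetical names throughout.

def inner_grad(theta, frames):
    """Gradient of an assumed self-supervised loss: mean squared error of
    predicting each frame from the previous one, using frames only."""
    pairs = list(zip(frames[:-1], frames[1:]))
    return sum(2.0 * (theta * a - b) * a for a, b in pairs) / len(pairs)

def inner_curvature(frames):
    """Second derivative of the inner loss (constant for this toy model)."""
    prev = frames[:-1]
    return sum(2.0 * a * a for a in prev) / len(prev)

def adapt(theta, frames, alpha):
    """Second weights: theta_i' = theta - alpha * grad L_D(f_theta; frames)."""
    return theta - alpha * inner_grad(theta, frames)

def outer_loss(theta_adapted, frames, depth):
    """Supervised loss: squared error between predicted and true future depth."""
    return (theta_adapted * frames[-1] - depth) ** 2

def meta_step(theta, tasks, alpha, beta):
    """First-weight update: theta <- theta - beta * sum_i grad L_T(f_theta_i')."""
    total = 0.0
    for frames, depth in tasks:
        theta_i = adapt(theta, frames, alpha)
        d_outer = 2.0 * (theta_i * frames[-1] - depth) * frames[-1]
        # Chain rule through the inner step: d(theta_i')/d(theta).
        total += d_outer * (1.0 - alpha * inner_curvature(frames))
    return theta - beta * total

def meta_train(theta, tasks, alpha=0.01, beta=0.01, steps=200):
    """A fixed step count stands in for 'while the model has not converged'."""
    for _ in range(steps):
        theta = meta_step(theta, tasks, alpha, beta)
    return theta

def predict_next_depth(theta, new_frames, alpha=0.01):
    """Adapt on the new frames alone, then predict the next frame's depth."""
    return adapt(theta, new_frames, alpha) * new_frames[-1]

# Toy data: each sequence doubles per frame, and so does the future "depth".
tasks = [([1.0, 2.0, 4.0], 8.0), ([0.5, 1.0, 2.0], 4.0)]
theta = meta_train(0.0, tasks)
prediction = predict_next_depth(theta, [3.0, 6.0, 12.0])
```

In this toy setting the inner step needs no depth labels, which is why the same `adapt` call can be reused at prediction time on new frames for which no depth is available.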
Opening claim text (preview).
The invention claimed is:

1. A computer-implemented method for predicting a depth for a video frame, comprising:

- receiving a plurality of training data $D_i = (D_i^{img}, D_i^{depth})$, $i = 1 \ldots N$, wherein each $D_i$: $D_i^{img} = (D_{i1}^{img}, D_{i2}^{img}, \ldots, D_{it}^{img})$, wherein $D_{i1}^{img}, D_{i2}^{img}, \ldots, D_{it}^{img}$ each respectively represents a video frame from a plurality of $t$ consecutive video frames with consecutive timestamps, the $t$ consecutive video frames comprise one or more previous video frames and a current video frame, and $D_i^{depth}$ is a depth representation of a future video frame immediately subsequent to the video frame $D_{it}^{img}$;
- receiving a pre-trained neural network model $f_\theta$ having a plurality of first weights $\theta$, wherein the pre-trained neural network model $f_\theta$ is pre-trained via a two-stage process comprising a current frame reconstruction training process and a future frame depth prediction training process, wherein the training process of the pre-trained neural network model $f_\theta$ comprises:
  - receiving a plurality of consecutive video frames $F_1^{img}, F_2^{img}, \ldots, F_j^{img}$ with consecutive timestamps;
  - setting a plurality of initial parameters of $f_\theta$ with random values to be the plurality of first weights $\theta$;
  - extracting a plurality of spatial features from the plurality of consecutive video frames $F_1^{img}, F_2^{img}, \ldots, F_j^{img}$;
  - during the current frame reconstruction training process: reconstructing each of the plurality of consecutive video frames $F_1^{img}, F_2^{img}, \ldots, F_j^{img}$ based on the plurality of spatial features; and updating values for at least one of the plurality of first weights $\theta$ based on the reconstructed video frames; and
  - during the future frame depth prediction training process: extracting temporal features of the plurality of consecutive video frames $F_1^{img}, F_2^{img}, \ldots, F_j^{img}$ based on the plurality of spatial features; generating a depth prediction for a video frame $F_{j+1}^{img}$ immediately subsequent to the video frame $F_j^{img}$ based on the temporal features of the plurality of consecutive video frames $F_1^{img}, F_2^{img}, \ldots, F_j^{img}$; and updating values for at least one of the plurality of first weights $\theta$ based on the depth prediction for the video frame $F_{j+1}^{img}$;
- while the pre-trained neural network model $f_\theta$ has not converged:
  - computing a plurality of second weights $\theta_i'$, based on the plurality of consecutive video frames $D_i^{img}$ in each $D_i$, $i = 1 \ldots N$ and the pre-trained neural network model $f_\theta$; and
  - updating the plurality of first weights $\theta$, based on the plurality of training data $D_i = (D_i^{img}, D_i^{depth})$, $i = 1 \ldots N$ and the plurality of second weights $\theta_i'$;
- receiving a plurality of $m$ new consecutive video frames $D^{new} = (D_1^{new\_img}, D_2^{new\_img}, \ldots, D_m^{new\_img})$ with consecutive timestamps; and
- predicting a depth representation of video frame $D_{m+1}^{new\_img}$ immediately subsequent to the video frame $D_m^{new\_img}$ based on the updated plurality of first weights $\theta$.

2. The method of claim 1, wherein computing the plurality of second weights $\theta_i'$ is based on the equation:

$$\theta_i' = \theta - \alpha \nabla L_{D_i}(f_\theta; D_i^{img}), \quad i = 1 \ldots N$$

wherein $\alpha$ represents a learning rate, $L_{D_i}$ represents a loss computed based on $(f_\theta; D_i^{img})$, and $\nabla$ denotes a gradient operator.

3. The method of claim 2, wherein updating the plurality of first weights $\theta$ is based on the equation:

$$\theta = \theta - \beta \sum_{i=1}^{N} \nabla L_{T_i}(f_{\theta_i'}; D_i^{img}, D_i^{depth})$$

wherein $\beta$ represents a learning rate, $L_{T_i}$ represents a loss computed based on $(f_{\theta_i'}; D_i^{img}, D_i^{depth})$, and $\nabla$ denotes a gradient operator.

4. The method of claim 3, wherein predicting the depth representation of video frame $D_{m+1}^{new\_img}$ comprises: updating the plurality of second weights $\theta_i'$, based on the plurality of new consecutive video frames $D^{new} = (D_1^{new\_img}, D_2^{new\_img}, \ldots, D_m^{new\_img})$ and the updated plurality of first weights $\theta$; and generating the depth representation based on the updated plurality of second weights $\theta_i'$.

5. The method of claim 4, wherein updating the plurality of second weights $\theta_i'$ is based on the equation:

$$\theta_i' = \theta - \alpha \nabla L_{D_i}(f_\theta; D_i^{new\_img}), \quad i = 1 \ldots m$$

wherein $\alpha$ is the learning rate, $L_{D_i}$ represents a loss computed based on $(f_\theta; D_i^{new\_img})$, and $\nabla$ denotes a gradient operator.

6. The method of claim 1, wherein extracting the temporal features comprises using a 3D convolutional neural network to extract the temporal features.

7. The method of claim 1, wherein the depth representation of any video frame comprises, for one or more surfaces in the video frame, a depth value representing an estimated distance from the respective surface from a viewpoint.

8. The method of claim 7, wherein the depth representation of any video frame comprises a depth map for the video frame.

9. A system for predicting a depth for a video frame, the system comprising: one or more processors; and one or more memories coupled to the one or more processors unit, the one or more memories storing machine-executable instructions that, in response to execution by the one or more processors, cause the system to:

- receive a plurality of training data $D_i = (D_i^{img}, D_i^{depth})$, wherein each $D_i$: $D_i^{img} = (D_{i1}^{img}, D_{i2}^{img}, \ldots, D_{it}^{img})$, wherein $D_{i1}^{img}, D_{i2}^{img}, \ldots, D_{it}^{img}$ each respectively represents a video frame from a plurality of $t$ consecutive video frames with consecutive timestamps, the $t$ consecutive video frames comprise one or more previous video frames and a current video frame, and $D_i^{depth}$ is a depth representation of a future video frame immediately subsequent to the video frame $D_{it}^{img}$;
- receive a pre-trained neural network model $f_\theta$ having a plurality of first weights $\theta$, wherein the pre-trained neural network model $f_\theta$ is pre-trained via a two-stage process comprising a current frame reconstruction training process and a future frame depth prediction training process, wherein, during the training process of the pre-trained neural network model $f_\theta$, the machine-executable instructions, in response to execution by the one or more processors, cause the system to:
  - receive a plurality of consecutive video frames $F_1^{img}, F_2^{img}, \ldots, F_j^{img}$ with consecutive timestamps;
  - set a plurality of initial parameters of $f_\theta$ with random values to be the plurality of first weights $\theta$;
  - extract a plurality of spatial features from the plurality of consecutive video frames $F_1^{img}, F_2^{img}, \ldots, F_j^{img}$;
  - during the current frame reconstruction training process: reconstruct each of the plurality of consecutive video frames $F_1^{img}, F_2^{img}, \ldots, F_j^{img}$ based on the plurality of spatial features; and update values for at least one of the plurality of first weights $\theta$ based on the reconstructed video frames; and
  - during the future frame depth prediction training process: extract temporal features of the plurality of consecutive video frames $F_1^{img}, F_2^{img}, \ldots, F_j^{img}$ based on the plurality of spatial features; generate a depth prediction for a video frame $F_{j+1}^{img}$ immediately subsequent to the video frame $F_j^{img}$ based on the temporal features of the plurality of consecutive video frames $F_1^{img}, F_2^{img}, \ldots, F_j^{img}$; and update values for at least one of the plurality of first weights $\theta$ based on the depth prediction for the video frame $F_{j+1}^{img}$;
- while the pre-trained neural network model $f_\theta$ has not converged: compute a plurality of s
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Auto-encoder networks; Encoder-decoder networks · CPC title
Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title
Hyperparameter optimisation; Meta-learning; Learning-to-learn · CPC title
Convolutional networks [CNN, ConvNet] · CPC title