Joint learning of geometry and motion with three-dimensional holistic understanding
US-2020211206-A1 · Jul 2, 2020 · US
US11436743B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11436743-B2 |
| Application number | US-202016906801-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 19, 2020 |
| Priority date | Jul 6, 2019 |
| Publication date | Sep 6, 2022 |
| Grant date | Sep 6, 2022 |
Official abstract text for this publication.
System, methods, and other embodiments described herein relate to semi-supervised training of a depth model using a neural camera model that is independent of a camera type. In one embodiment, a method includes acquiring training data including at least a pair of training images and depth data associated with the training images. The method includes training the depth model using the training data to generate a self-supervised loss from the pair of training images and a supervised loss from the depth data. Training the depth model includes learning the camera type by generating, using a ray surface model, a ray surface that approximates an image character of the training images as produced by a camera having the camera type. The method includes providing the depth model to infer depths from monocular images in a device.
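The abstract's combination of a self-supervised loss (from the image pair) and a supervised loss (from the depth data) can be illustrated with a minimal NumPy sketch. This is not the patented implementation. The function names, the plain L1 form of both terms, the convention that a zero in `sparse_depth` means "no measurement", and the `sup_weight` balance factor are all assumptions made for illustration.

```python
import numpy as np

def photometric_loss(target, synthesized):
    """Self-supervised term: mean absolute per-pixel difference between the
    target image and an image synthesized from the other frame (assumed L1)."""
    return float(np.mean(np.abs(target - synthesized)))

def supervised_depth_loss(pred_depth, sparse_depth):
    """Supervised term: mean absolute depth error, evaluated only where a
    depth measurement exists. Zero marks a missing measurement (assumption)."""
    valid = sparse_depth > 0
    if not np.any(valid):
        return 0.0
    return float(np.mean(np.abs(pred_depth[valid] - sparse_depth[valid])))

def semi_supervised_loss(target, synthesized, pred_depth, sparse_depth,
                         sup_weight=0.1):
    """Weighted sum of the two terms; sup_weight is a hypothetical knob."""
    return (photometric_loss(target, synthesized)
            + sup_weight * supervised_depth_loss(pred_depth, sparse_depth))
```

Masking the supervised term to measured pixels is what lets sparse depth data contribute a metric-scale signal while the photometric term still covers every pixel.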
Opening claim text (preview).
What is claimed is:
1. A depth system for semi-supervised training of a depth model using a neural camera model that is independent of a camera type, comprising: one or more processors; a memory communicably coupled to the one or more processors and storing: a network module including instructions that, when executed by the one or more processors, cause the one or more processors to acquire training data including at least a pair of training images derived from a monocular video and depth data associated with at least one of the training images; and a training module including instructions that, when executed by the one or more processors, cause the one or more processors to train the depth model using the training data to generate a self-supervised loss from the pair of training images and a supervised loss from the depth data, wherein the training module includes instructions to train the depth model including instructions to learn the camera type by generating, using a ray surface model that is a neural network, a ray surface that approximates an image character of the training images as produced by a camera having the camera type and to create a synthesized image according to at least the ray surface and a depth map by i) lifting pixels to produce three-dimensional points using the ray surface, the depth map, and a camera center point and ii) projecting the three-dimensional points onto a context image to create the synthesized image, and wherein the training module includes instructions to provide the depth model to infer depths from monocular images in a device.
2. The depth system of claim 1, wherein the training module includes instructions to train the depth model including instructions to generate the supervised loss using a supervised loss function that compares values between the depth map and corresponding information from the depth data that is sparse Light Detection and Ranging (LiDAR) data, and wherein the training module includes instructions to generate the supervised loss using the sparse LiDAR data to train the depth model on metric scale by accounting for scale aware differences between the depth map and the sparse LiDAR data and learns a camera center point associated with the ray surface.
3. The depth system of claim 1, wherein the training module includes instructions to lift the pixels including instructions to scale predicted ray vectors from the ray surface using the depth map and adjust the predicted ray vectors according to the camera center point, and wherein the training module includes instructions to project including instructions to apply a softmax approximation to derive each pixel in the synthesized image by identifying a predicted ray vector from the ray surface that corresponds with a direction associated with the three-dimensional points as defined relative to the camera center point.
4. The depth system of claim 1, wherein the image character associated with the camera type includes at least a format of the monocular images and lens distortion, and wherein the ray surface associates pixels within the monocular images with directions in an environment from which light that generates the pixels in the camera originates.
5. The depth system of claim 1, wherein the training module includes instructions to train the depth model to produce depth estimates according to a semi-supervised training process that integrates a self-supervised structure from motion (SfM) process, and wherein the training module includes instructions to train the depth model including instructions to use the self-supervised loss and the supervised loss to update at least the depth model, a pose model, and the ray surface model.
6. The depth system of claim 1, wherein the self-supervised loss includes a photometric loss and a depth smoothness loss that separately account for pixel-level similarities, wherein the training module includes instructions to train the depth model including instructions to pre-train the depth model, the ray surface model, and a pose model according to a self-supervised training process that does not use the depth data as a ground truth comparison.
7. The depth system of claim 1, wherein the depth model is a machine learning algorithm, and wherein generating the ray surface includes learning the camera type to provide the ray surface as part of the neural camera model that approximates the camera type for a set of training data including the pair of training images.
8. A non-transitory computer-readable medium for semi-supervised training of a depth model using a neural camera model that is independent of a camera type and including instructions that when executed by one or more processors cause the one or more processors to: acquire training data including at least a pair of training images derived from a monocular video and depth data associated with at least one of the training images; train the depth model using the training data to generate a self-supervised loss from the pair of training images and a supervised loss from the depth data, wherein training the depth model includes learning the camera type by generating, using a ray surface model that is a neural network, a ray surface that approximates an image character of the training images as produced by a camera having the camera type and to create a synthesized image according to at least the ray surface and a depth map by i) lifting pixels to produce three-dimensional points using the ray surface, the depth map, and a camera center point and ii) projecting the three-dimensional points onto a context image to create the synthesized image; and provide the depth model to infer depths from monocular images in a device.
9. The non-transitory computer-readable medium of claim 8, wherein the instructions to train the depth model include instructions to generate the supervised loss using a supervised loss function that compares values between the depth map and corresponding information from the depth data that is sparse Light Detection and Ranging (LiDAR) data, wherein the instructions to generate the supervised loss use the sparse LiDAR data to train the depth model on metric scale by accounting for scale aware differences between the depth map and the sparse LiDAR data and learns a camera center point associated with the ray surface.
10. The non-transitory computer-readable medium of claim 8, wherein the instructions to train the depth model produce depth estimates according to a self-supervised structure from motion (SfM) process, and wherein the instructions to train the depth model include instructions to use the self-supervised loss and the supervised loss to update at least the depth model, a pose model, and the ray surface model.
11. A method of semi-supervised training of a depth model using a neural camera model that is independent of a camera type, comprising: acquiring training data including at least a pair of training images derived from a monocular video and depth data associated with at least one of the training images; training the depth model using the training data to generate a self-supervised loss from the pair of training images and a supervised loss from the depth data, wherein training the depth model includes learning the camera type by generating, using a ray surface model that is a neural network, a ray surface that approximates an image character of the training images as produced by a camera having the camera type and to creating a synthesized image according to at least the ray surface and depth map by i) lifting pixels to produce three-dimensional points using the ray surface, the depth map, and a camera center point and ii) projecting the three-dimensional points
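The lifting and projecting steps recited in the claims (scaling ray vectors by depth and offsetting by the camera center, then matching 3D point directions against the ray surface with a softmax approximation) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the claimed system. The ray surface here is a fixed analytic array from the hypothetical helper `pinhole_like_rays` rather than the output of a neural network, the `temperature` value is arbitrary, the pose transform between the two frames is omitted, and `synthesize` uses nearest-neighbor lookup as a simplification. All function names are invented for this example.

```python
import numpy as np

def pinhole_like_rays(h, w, focal):
    """Hypothetical stand-in for a learned ray surface: one unit ray per pixel."""
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    rays = np.stack([cols - (w - 1) / 2.0,
                     rows - (h - 1) / 2.0,
                     np.full((h, w), float(focal))], axis=-1)
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)

def lift(ray_surface, depth, center):
    """Lift pixels to 3D: scale each ray by its depth, offset by the center."""
    return center + depth[..., None] * ray_surface

def soft_project(points, ray_surface, center, temperature=0.01):
    """Soft (row, col) coordinates of 3D points: a softmax over the similarity
    between each point's direction and every ray of the ray surface."""
    h, w, _ = ray_surface.shape
    rays = ray_surface.reshape(-1, 3)                  # (H*W, 3)
    dirs = points.reshape(-1, 3) - center              # (N, 3)
    dirs = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
    sim = dirs @ rays.T / temperature                  # (N, H*W)
    sim -= sim.max(axis=1, keepdims=True)              # numerical stability
    weights = np.exp(sim)
    weights /= weights.sum(axis=1, keepdims=True)
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    grid = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)
    return (weights @ grid).reshape(points.shape[:-1] + (2,))

def synthesize(context_image, coords):
    """Sample the context image at the rounded soft coordinates (simplified)."""
    idx = np.rint(coords).astype(int)
    r = np.clip(idx[..., 0], 0, context_image.shape[0] - 1)
    c = np.clip(idx[..., 1], 0, context_image.shape[1] - 1)
    return context_image[r, c]
```

With no camera motion, lifting the pixels and projecting them back onto the same ray surface should return each pixel to its own location, which makes a convenient sanity check for the sketch. A trainable version would need differentiable sampling in place of the rounding step.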
Incorporation of unlabelled data, e.g. multiple instance learning [MIL] · CPC title
Backpropagation, e.g. using gradient descent · CPC title
Terrestrial scenes (scenes under surveillance with static cameras G06V20/52; scenes perceived from the exterior of a vehicle G06V20/56; scenes perceived from the interior of a vehicle G06V20/59) · CPC title
using neural networks · CPC title
from laser ranging, e.g. using interferometry; from the projection of structured light · CPC title