US2022014723A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2022014723-A1 |
| Application number | US-201917309440-A |
| Country | US |
| Kind code | A1 |
| Filing date | Dec 2, 2019 |
| Priority date | Dec 3, 2018 |
| Publication date | Jan 13, 2022 |
| Grant date | — |
Official abstract text for this publication.
Three-dimensional (3D) performance capture and machine learning can be used to re-render high quality novel viewpoints of a captured scene. A textured 3D reconstruction is first rendered to a novel viewpoint. Due to imperfections in geometry and low-resolution texture, the 2D rendered image contains artifacts and is low quality. Accordingly, a deep learning technique is disclosed that takes these images as input and generates more visually enhanced re-rendering. The system is specifically designed for VR and AR headsets, and accounts for consistency between two stereo views.
Opening claim text (preview).
1. A method for re-rendering an image rendered using a volumetric reconstruction to improve its quality, comprising: receiving the image rendered using the volumetric reconstruction, the image having imperfections; defining a synthesizing function and a segmentation mask to generate an enhanced image from the image, the enhanced image having fewer imperfections than the image; and computing the synthesizing function and the segmentation mask using a neural network trained based on minimizing a loss function between a predicted image generated by the neural network and a ground truth image captured by a ground truth camera during training.
2. The method according to claim 1, wherein the method further includes prior to receiving the image rendered using the volumetric reconstruction: capturing a 3D model using a volumetric capture system; and rendering the image using the volumetric reconstruction.
3. The method according to claim 2, wherein the ground truth camera and the volumetric capture system are both directed to a view during training, the ground truth camera producing higher quality images than the volumetric capture system.
4. The method according to claim 1, wherein the loss function includes a reconstruction loss based on a reconstruction difference between a segmented ground truth image mapped to activations of layers in a neural network and a segmented predicted image mapped to activations of layers in a neural network, the segmented ground truth image segmented by a ground truth segmentation mask to remove background pixels and the segmented predicted image segmented by a predicted segmentation mask to remove background pixels.
5. The method according to claim 1, wherein the loss function includes a head reconstruction loss based on a reconstruction difference between a cropped ground truth image mapped to activations of layers in a neural network and a cropped predicted image mapped to activations of layers in a neural network, the cropped ground truth image cropped to a head of a person identified in a ground truth segmentation mask and the cropped predicted image cropped to the head of the person identified in a predicted segmentation mask.
6. The method according to claim 4, wherein the reconstruction difference is saliency re-weighted to down-weight reconstruction differences for pixels above a maximum error or below a minimum error.
7. The method according to claim 1, wherein the loss function includes a mask loss based on a mask difference between a ground truth segmentation mask and a predicted segmentation mask.
8. The method according to claim 7, wherein the mask difference is saliency re-weighted to down-weight reconstruction differences for pixels above a maximum error or below a minimum error.
9. The method according to claim 1, wherein: the predicted image is one of a series of consecutive frames of a predicted sequence and the ground truth image is one of a series of consecutive frames of a ground truth sequence; and wherein: the loss function includes a temporal loss based on a gradient difference between a temporal gradient of the predicted sequence and a temporal gradient of the ground truth sequence.
10. The method according to claim 1, wherein the predicted image is one of a predicted stereo pair of images and the loss function includes a stereo loss based on a stereo difference between the predicted stereo pair of images.
11. The method according to claim 1, wherein the neural network is based on a fully convolutional model.
12. The method according to claim 1, wherein the computing the synthesizing function and segmentation mask using a neural network comprises: computing the synthesizing function and segmentation mask for a left eye viewpoint; and computing the synthesizing function and segmentation mask for a right eye viewpoint.
13. The method according to claim 1, wherein the computing the synthesizing function and segmentation mask using a neural network is performed in real time.
14. A performance capture system comprising: a volumetric capture system configured to render at least one image reconstructed from at least one viewpoint of a captured 3D model, the at least one image including imperfections; a rendering system configured to receive the at least one image from the volumetric capture system and to generate, in real time, at least one enhanced image in which the imperfections of the at least one image are reduced, the rendering system including a neural network configured to generate the at least one enhanced image by training prior to use, the training including minimizing a loss function between predicted images generated by the neural network during training and corresponding ground truth images captured by at least one ground truth camera coordinated with the volumetric capture system during training.
15. The performance capture system according to claim 14, wherein the at least one ground truth camera is included in the performance capture system during training and otherwise not included in the performance capture system.
16. The performance capture system according to claim 14, wherein the volumetric capture system includes a single active stereo camera directed to a single view and, during training, includes a single ground truth camera directed to the single view.
17. The performance capture system according to claim 14, wherein the volumetric capture system includes a plurality of active stereo cameras directed to multiple views and, during training, includes a plurality of ground truth cameras directed to the multiple views.
18. The performance capture system according to claim 14, wherein the performance capture system includes a stereo display configured to display one of the at least one enhanced image as a left eye view and one of the at least one enhanced image as a right eye view.
19. The performance capture system according to claim 18, wherein the performance capture system is a virtual reality (VR) headset.
20. The performance capture system according to claim 18, wherein the stereo display is included in an augmented reality (AR) headset.
21. The performance capture system according to claim 18, wherein the stereo display is a head-tracked auto-stereo display.
22. A non-transitory computer readable storage medium containing program code that when executed by a processor of a computing device causes the computing device to perform a method for re-rendering an image rendered using a volumetric reconstruction to improve its quality, the method including: receiving the image rendered using the volumetric reconstruction, the image having imperfections; defining a synthesizing function and a segmentation mask to generate an enhanced image from the image, the enhanced image having fewer imperfections than the image; and computing the synthesizing function and the segmentation mask using a neural network trained based on minimizing a loss function between a predicted image generated by the neural network and a ground truth image captured by a ground truth camera during training.
23. The non-transitory computer readable storage medium containing program code that when executed by a processor of a computing device causes the computing device to perform a method for re-rendering an image rendered using a volumetric reconstruction to improve its quality according to claim 22, wherein the loss function includes a reconstructi
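The loss terms named in claims 4, 6, 7 and 9 can be illustrated with a small sketch. This is not the patent's implementation. It is a minimal pure-Python approximation under several assumptions: images are flattened lists of grayscale intensities in [0, 1], the reconstruction difference is taken directly on pixels rather than on neural-network layer activations as claim 4 recites, and "down-weighting" is modeled as multiplying out-of-range errors by a hypothetical `down_weight` factor (0 by default). All function and parameter names are invented for illustration.

```python
from typing import List, Sequence


def abs_diff(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Per-element absolute difference of two equally sized sequences."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return [abs(x - y) for x, y in zip(a, b)]


def apply_mask(image: Sequence[float], mask: Sequence[float]) -> List[float]:
    """Zero out background pixels (mask value 0) of a flattened image."""
    if len(image) != len(mask):
        raise ValueError("image and mask must have the same length")
    return [p * m for p, m in zip(image, mask)]


def reweighted_mean(errors: Sequence[float], min_err: float,
                    max_err: float, down_weight: float = 0.0) -> float:
    """Mean error where values outside [min_err, max_err] are down-weighted.

    Assumption: down-weighting is a constant factor; the claims do not
    specify the exact weighting scheme.
    """
    if not errors:
        raise ValueError("errors must not be empty")
    total = 0.0
    for e in errors:
        weight = 1.0 if min_err <= e <= max_err else down_weight
        total += weight * e
    return total / len(errors)


def reconstruction_loss(pred, truth, pred_mask, truth_mask,
                        min_err: float, max_err: float) -> float:
    """Masked, saliency re-weighted pixel difference (cf. claims 4 and 6)."""
    errors = abs_diff(apply_mask(pred, pred_mask),
                      apply_mask(truth, truth_mask))
    return reweighted_mean(errors, min_err, max_err)


def mask_loss(pred_mask, truth_mask) -> float:
    """Mean absolute difference between the two masks (cf. claim 7)."""
    errors = abs_diff(pred_mask, truth_mask)
    return sum(errors) / len(errors)


def temporal_loss(pred_prev, pred_curr, truth_prev, truth_curr) -> float:
    """Mean difference between temporal gradients (cf. claim 9).

    The temporal gradient is approximated as current frame minus
    previous frame, which is an assumption of this sketch.
    """
    pred_grad = [c - p for p, c in zip(pred_prev, pred_curr)]
    truth_grad = [c - p for p, c in zip(truth_prev, truth_curr)]
    errors = abs_diff(pred_grad, truth_grad)
    return sum(errors) / len(errors)


if __name__ == "__main__":
    pred = [0.5, 0.25, 1.0, 0.0]
    truth = [0.5, 0.75, 0.0, 0.0]
    mask = [1.0, 1.0, 1.0, 0.0]
    print(reconstruction_loss(pred, truth, mask, mask, 0.125, 0.75))
    print(mask_loss([1.0, 1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]))
    print(temporal_loss([0.0, 0.5], [0.5, 0.5], [0.0, 0.25], [0.25, 0.25]))
```

In a training loop these terms would typically be combined as a weighted sum, with the weights chosen empirically. The claims do not state any weights, so none are suggested here.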
Displays for viewing with the aid of special glasses or head-mounted displays [HMD] · CPC title
Range image; Depth image; 3D point clouds · CPC title
Human being; Person · CPC title
using three or more two-dimensional [2D] image sensors · CPC title
Salient features, e.g. scale invariant feature transforms [SIFT] · CPC title