Self-supervised single-view 3D reconstruction via semantic consistency

US2021287430A1 · US · A1

Patent metadata
Field               Value
Publication number  US-2021287430-A1
Application number  US-202016849962-A
Country             US
Kind code           A1
Filing date         Apr 15, 2020
Priority date       Mar 13, 2020
Publication date    Sep 16, 2021
Grant date

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Apparatuses, systems, and techniques to identify a shape or camera pose of a three-dimensional object from a two-dimensional image of the object. In at least one embodiment, objects are identified in an image using one or more neural networks that have been trained on objects of a similar category and a three-dimensional mesh template.

First claim

Opening claim text (preview).

What is claimed is:

1. A method, using a processor comprising one or more arithmetic logic units (ALUs), of training one or more neural networks to process images, the method comprising: obtaining a plurality of input sets, wherein an input set of the plurality of input sets comprises an input image of an object and an image mask delineating, at least approximately, portions of the object in the input image, and wherein the plurality of input sets comprises images of objects in a category; and training a neural network, using the plurality of input sets, for use in reconstructing a three-dimensional model of an unknown object in the category depicted in an input two-dimensional image, comprising:

    a) generating a three-dimensional mesh template corresponding to objects in the category, in response to the training;

    b) generating a canonical semantic map corresponding to semantic parts of the objects in the category;

    c) rendering a reconstructed image from the input image of the object based on an estimated three-dimensional mesh for the object, an estimated texture of the object, and an estimated camera pose, wherein the estimated camera pose corresponds to a position of the estimated three-dimensional mesh relative to the object in the input image;

    d) comparing the reconstructed image and the input image to form a comparison;

    e) generating a loss function from the comparison, wherein the loss function is based on differences between the reconstructed image and the input image; and

    f) using the loss function to further train the neural network.

2. The method of claim 1, wherein the loss function is based on a first comparison of a first part segmentation probability map of the input image and a second part segmentation probability map of the reconstructed image, a second comparison of vertices of the three-dimensional mesh template and the estimated three-dimensional mesh for the object, and a third comparison of texture consistency between the input image and the reconstructed image.

3. The method of claim 1, wherein the image mask delineating, at least approximately, portions of the object in the input image delineates a silhouette of the object.

4. The method of claim 1, wherein the image mask delineating, at least approximately, portions of the object in the input image delineates a part segmentation of the object indicative of semantic parts of the object in the input image.

5. The method of claim 1, wherein generating the loss function comprises: determining a first mesh mapping corresponding to which pixels of the input image correspond to which mesh faces in the estimated three-dimensional mesh; determining a second mesh mapping corresponding to which pixels of the reconstructed image correspond to which mesh faces in the estimated three-dimensional mesh; and computing a texture cycle consistency measure based on how closely the first mesh mapping and the second mesh mapping correspond, where the loss function is based, at least in part, on the texture cycle consistency measure.

6. The method of claim 1, wherein training the neural network from input sets comprises: applying the input sets to an autoencoder; training a generative adversarial network (GAN) on an output of the autoencoder; comparing outputs of the autoencoder and the GAN; and revising the three-dimensional mesh template based on the comparing.

7. The method of claim 1, wherein training the neural network comprises: modifying a latent representation to form an updated latent representation based on the loss function, wherein the modifying uses a current shape for the three-dimensional mesh template of the category and a current map for a canonical semantic UV map of the category; updating the three-dimensional mesh template to form an updated mesh template, wherein the updating of the three-dimensional mesh template uses the current shape and the updated latent representation; and updating the canonical semantic UV map to form an updated canonical semantic UV map, wherein the updating of the canonical semantic UV map uses the current map and the updated latent representation.

8. A method of training one or more neural networks, comprising: training a first autoencoder, of the one or more neural networks, from a plurality of input sets, wherein an input set of the plurality of input sets comprises an input image of an object and an input image mask delineating, at least approximately, portions of the object in the input image, wherein the first autoencoder maps features of its input to a latent representation corresponding to a three-dimensional mesh template; generating a first mesh using the first autoencoder with a first input set of the plurality of input sets as a first input to the first autoencoder; generating a first output image mask, using the first autoencoder, that is based on the first mesh; comparing the first output image mask to a first input image mask of the first input set to determine first differences; adjusting a first encoder of the first autoencoder based on the first differences; training a second autoencoder, of the one or more neural networks, using arbitrary inputs, wherein the second autoencoder shares the latent representation and a decoder with the first autoencoder; generating a second mesh using the second autoencoder; comparing, using a discriminator, the first mesh and the second mesh for determining second differences; adjusting the latent representation based on the second differences and/or the first differences; and adjusting the decoder based on the second differences and/or the first differences.

9. The method of claim 8, wherein a first image mask delineating, at least approximately, portions of a first object in a first input image of the first input set delineates a silhouette of the first object.

10. The method of claim 8, wherein a first image mask delineating, at least approximately, portions of a first object in a first input image of the first input set delineates a part segmentation of the first object indicative of semantic parts of the first object in the first input image.

11. The method of claim 8, wherein adjusting the latent representation is based on the first differences and wherein the first differences are represented by a loss function corresponding to a negative intersection over union (IoU) loss the first output image mask and the first input image mask.

12. The method of claim 8, wherein the latent representation comprises representations of mesh points of the three-dimensional mesh template and camera view latent variables corresponding to a camera view relative to the three-dimensional mesh template.

13. The method of claim 12, further comprising: obtaining a second input set, the second input set comprising a second image of a second object and a second image mask delineating, at least approximately, a border of the second object in the second image; determining an estimated three-dimensional mesh for the second object in the second image based in part on the three-dimensional mesh template; determining an estimated texture of the second object in the second image; determining an estimated camera pose, wherein the estimated camera pose corresponds to a position of the estimated three-dimensional mesh relative to the second object in the second image, based on the camera view latent variables; generating a second output image from the second input set from the estimated three-dimensional mesh using the estimated texture as viewed from the estimated camera pose; comparing the second output image and the second image to form a compari
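Claim 11 names a negative intersection over union (IoU) loss between the output image mask and the input image mask, but the preview does not give a formula. The sketch below is one common way such a loss might be computed, assuming masks are 2D grids of values in [0, 1] and using a "soft" IoU (products for the intersection). The function name `negative_iou_loss`, the `eps` stabilizer, and the soft formulation are illustrative assumptions, not taken from the patent.

```python
from typing import Sequence

Mask = Sequence[Sequence[float]]


def negative_iou_loss(output_mask: Mask, input_mask: Mask, eps: float = 1e-6) -> float:
    """Return -IoU for two same-shaped masks with values in [0, 1].

    Assumption (not from the patent): a soft IoU, where the intersection is
    the sum of elementwise products and the union is sum(p + t - p * t).
    """
    if len(output_mask) != len(input_mask) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(output_mask, input_mask)
    ):
        raise ValueError("masks must have the same shape")

    intersection = 0.0
    total = 0.0
    for p_row, t_row in zip(output_mask, input_mask):
        for p, t in zip(p_row, t_row):
            intersection += p * t
            total += p + t

    union = total - intersection
    # eps keeps the division defined when both masks are empty.
    return -intersection / (union + eps)


# Hypothetical example masks: the rendered mask misses one object pixel.
input_mask = [[0, 1, 1], [0, 1, 1]]
output_mask = [[0, 1, 1], [0, 1, 0]]
loss = negative_iou_loss(output_mask, input_mask)
```

With these example masks the intersection covers 3 pixels and the union 4, so the loss is close to -0.75. Minimizing the value pushes the rendered mask toward full overlap with the input mask, where the loss approaches -1.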

Assignees

Inventors

Classifications

  • Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion

  • Three-dimensional [3D] objects

  • using neural networks

  • Validation; Performance evaluation

  • Validation; Performance evaluation; Active pattern learning techniques

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2021287430A1 cover?
Apparatuses, systems, and techniques to identify a shape or camera pose of a three-dimensional object from a two-dimensional image of the object. In at least one embodiment, objects are identified in an image using one or more neural networks that have been trained on objects of a similar category and a three-dimensional mesh template.
Who is the assignee on this patent?
Nvidia Corp
What technology area does this patent fall under?
Primary CPC classification G06T17/20. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 16, 2021 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).