Image processing apparatus, image processing method, and storage medium
| Field | Value |
|---|---|
| Publication number | US-2021287430-A1 |
| Application number | US-202016849962-A |
| Country | US |
| Kind code | A1 |
| Filing date | Apr 15, 2020 |
| Priority date | Mar 13, 2020 |
| Publication date | Sep 16, 2021 |
| Grant date | — |
Official abstract text for this publication.
Apparatuses, systems, and techniques to identify a shape or camera pose of a three-dimensional object from a two-dimensional image of the object. In at least one embodiment, objects are identified in an image using one or more neural networks that have been trained on objects of a similar category and a three-dimensional mesh template.
Opening claim text (preview).
What is claimed is:

1. A method, using a processor comprising one or more arithmetic logic units (ALUs), of training one or more neural networks to process images, the method comprising: obtaining a plurality of input sets, wherein an input set of the plurality of input sets comprises an input image of an object and an image mask delineating, at least approximately, portions of the object in the input image, and wherein the plurality of input sets comprises images of objects in a category; and training a neural network, using the plurality of input sets, for use in reconstructing a three-dimensional model of an unknown object in the category depicted in an input two-dimensional image, comprising:
   a) generating a three-dimensional mesh template corresponding to objects in the category, in response to the training;
   b) generating a canonical semantic map corresponding to semantic parts of the objects in the category;
   c) rendering a reconstructed image from the input image of the object based on an estimated three-dimensional mesh for the object, an estimated texture of the object, and an estimated camera pose, wherein the estimated camera pose corresponds to a position of the estimated three-dimensional mesh relative to the object in the input image;
   d) comparing the reconstructed image and the input image to form a comparison;
   e) generating a loss function from the comparison, wherein the loss function is based on differences between the reconstructed image and the input image; and
   f) using the loss function to further train the neural network.

2. The method of claim 1, wherein the loss function is based on a first comparison of a first part segmentation probability map of the input image and a second part segmentation probability map of the reconstructed image, a second comparison of vertices of the three-dimensional mesh template and the estimated three-dimensional mesh for the object, and a third comparison of texture consistency between the input image and the reconstructed image.

3. The method of claim 1, wherein the image mask delineating, at least approximately, portions of the object in the input image delineates a silhouette of the object.

4. The method of claim 1, wherein the image mask delineating, at least approximately, portions of the object in the input image delineates a part segmentation of the object indicative of semantic parts of the object in the input image.

5. The method of claim 1, wherein generating the loss function comprises:
   determining a first mesh mapping corresponding to which pixels of the input image correspond to which mesh faces in the estimated three-dimensional mesh;
   determining a second mesh mapping corresponding to which pixels of the reconstructed image correspond to which mesh faces in the estimated three-dimensional mesh; and
   computing a texture cycle consistency measure based on how closely the first mesh mapping and the second mesh mapping correspond, where the loss function is based, at least in part, on the texture cycle consistency measure.

6. The method of claim 1, wherein training the neural network from input sets comprises:
   applying the input sets to an autoencoder;
   training a generative adversarial network (GAN) on an output of the autoencoder;
   comparing outputs of the autoencoder and the GAN; and
   revising the three-dimensional mesh template based on the comparing.

7. The method of claim 1, wherein training the neural network comprises:
   modifying a latent representation to form an updated latent representation based on the loss function, wherein the modifying uses a current shape for the three-dimensional mesh template of the category and a current map for a canonical semantic UV map of the category;
   updating the three-dimensional mesh template to form an updated mesh template, wherein the updating of the three-dimensional mesh template uses the current shape and the updated latent representation; and
   updating the canonical semantic UV map to form an updated canonical semantic UV map, wherein the updating of the canonical semantic UV map uses the current map and the updated latent representation.

8. A method of training one or more neural networks, comprising:
   training a first autoencoder, of the one or more neural networks, from a plurality of input sets, wherein an input set of the plurality of input sets comprises an input image of an object and an input image mask delineating, at least approximately, portions of the object in the input image, wherein the first autoencoder maps features of its input to a latent representation corresponding to a three-dimensional mesh template;
   generating a first mesh using the first autoencoder with a first input set of the plurality of input sets as a first input to the first autoencoder;
   generating a first output image mask, using the first autoencoder, that is based on the first mesh;
   comparing the first output image mask to a first input image mask of the first input set to determine first differences;
   adjusting a first encoder of the first autoencoder based on the first differences;
   training a second autoencoder, of the one or more neural networks, using arbitrary inputs, wherein the second autoencoder shares the latent representation and a decoder with the first autoencoder;
   generating a second mesh using the second autoencoder;
   comparing, using a discriminator, the first mesh and the second mesh for determining second differences;
   adjusting the latent representation based on the second differences and/or the first differences; and
   adjusting the decoder based on the second differences and/or the first differences.

9. The method of claim 8, wherein a first image mask delineating, at least approximately, portions of a first object in a first input image of the first input set delineates a silhouette of the first object.

10. The method of claim 8, wherein a first image mask delineating, at least approximately, portions of a first object in a first input image of the first input set delineates a part segmentation of the first object indicative of semantic parts of the first object in the first input image.

11. The method of claim 8, wherein adjusting the latent representation is based on the first differences and wherein the first differences are represented by a loss function corresponding to a negative intersection over union (IoU) loss the first output image mask and the first input image mask.

12. The method of claim 8, wherein the latent representation comprises representations of mesh points of the three-dimensional mesh template and camera view latent variables corresponding to a camera view relative to the three-dimensional mesh template.

13. The method of claim 12, further comprising:
   obtaining a second input set, the second input set comprising a second image of a second object and a second image mask delineating, at least approximately, a border of the second object in the second image;
   determining an estimated three-dimensional mesh for the second object in the second image based in part on the three-dimensional mesh template;
   determining an estimated texture of the second object in the second image;
   determining an estimated camera pose, wherein the estimated camera pose corresponds to a position of the estimated three-dimensional mesh relative to the second object in the second image, based on the camera view latent variables;
   generating a second output image from the second input set from the estimated three-dimensional mesh using the estimated texture as viewed from the estimated camera pose;
   comparing the second output image and the second image to form a compari
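
Claim 11 refers to a negative intersection over union (IoU) loss computed from the first output image mask and the first input image mask. The sketch below illustrates one common way such a loss can be computed; it is not taken from the patent. The function name `negative_iou_loss`, the "soft" formulation over mask values in [0, 1], and the `eps` stabilizer are all assumptions made for illustration.

```python
from typing import Sequence


def negative_iou_loss(
    pred_mask: Sequence[Sequence[float]],
    target_mask: Sequence[Sequence[float]],
    eps: float = 1e-8,  # assumed stabilizer to avoid division by zero
) -> float:
    """Return the negative soft IoU of two same-shaped 2D masks.

    Mask values are assumed to lie in [0, 1]. The result is close to -1.0
    when the masks coincide and 0.0 when they do not overlap, so minimizing
    it pushes the rendered mask toward the input mask.
    """
    if len(pred_mask) != len(target_mask):
        raise ValueError("masks must have the same number of rows")

    intersection = 0.0
    union = 0.0
    for pred_row, target_row in zip(pred_mask, target_mask):
        if len(pred_row) != len(target_row):
            raise ValueError("mask rows must have the same length")
        for p, t in zip(pred_row, target_row):
            intersection += p * t          # soft intersection
            union += p + t - p * t         # soft union
    return -(intersection / (union + eps))


# Hypothetical 1x4 masks: one pixel overlaps, three pixels are covered in total.
example_loss = negative_iou_loss([[1, 1, 0, 0]], [[0, 1, 1, 0]])
```

In a training setting the same quantity would typically be computed on tensors so that gradients can flow back to the encoder, latent representation, and decoder; the pure-Python version here only shows the arithmetic.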
Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion · CPC title
Three-dimensional [3D] objects · CPC title
using neural networks · CPC title
Validation; Performance evaluation · CPC title
Validation; Performance evaluation; Active pattern learning techniques · CPC title