Three-dimensional (3D) pose estimation from a monocular camera

US10929654B2 · US · B2

Patent metadata
Publication number: US-10929654-B2
Application number: US-201916290643-A
Country: US
Kind code: B2
Filing date: Mar 1, 2019
Priority date: Mar 12, 2018
Publication date: Feb 23, 2021
Grant date: Feb 23, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Estimating a three-dimensional (3D) pose of an object, such as a hand or body (human, animal, robot, etc.), from a 2D image is necessary for human-computer interaction. A hand pose can be represented by a set of points in 3D space, called keypoints. Two coordinates (x,y) represent spatial displacement and a third coordinate represents a depth of every point with respect to the camera. A monocular camera is used to capture an image of the 3D pose, but does not capture depth information. A neural network architecture is configured to generate a depth value for each keypoint in the captured image, even when portions of the pose are occluded, or the orientation of the object is ambiguous. Generation of the depth values enables estimation of the 3D pose of the object.
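The abstract describes each keypoint as two image coordinates plus a depth relative to the camera. As a minimal sketch of why a depth value per keypoint is enough to recover a 3D pose, the Python below lifts such keypoints into 3D camera coordinates. It assumes an ideal pinhole camera without lens distortion; the function name `backproject_keypoints` and the intrinsics `fx`, `fy`, `cx`, `cy` are illustrative assumptions and do not come from the patent text.

```python
def backproject_keypoints(pixels_xy, depths, fx, fy, cx, cy):
    """Lift (u, v) pixel coordinates plus a per-keypoint depth to 3D points.

    Assumption: ideal pinhole camera. fx, fy are focal lengths in pixels and
    (cx, cy) is the principal point; all four are hypothetical inputs.
    """
    points_3d = []
    for (u, v), z in zip(pixels_xy, depths):
        x = (u - cx) * z / fx  # horizontal offset from the optical axis
        y = (v - cy) * z / fy  # vertical offset from the optical axis
        points_3d.append((x, y, z))
    return points_3d


# Toy input: three keypoints with made-up pixel locations and depths.
keypoints_2d = [(320.0, 240.0), (420.0, 240.0), (320.0, 140.0)]
depths = [0.5, 0.5, 0.6]
pose_3d = backproject_keypoints(
    keypoints_2d, depths, fx=500.0, fy=500.0, cx=320.0, cy=240.0
)
```

A keypoint at the principal point maps onto the optical axis, and the same pixel offset corresponds to a larger 3D offset as depth grows, which is why the missing depth is the piece a monocular image cannot supply on its own.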

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method, comprising: receiving locations of keypoints for a three-dimensional (3D) object, wherein each location includes pixel coordinates and a normalized depth value, the pixel coordinates corresponding to pixels within a two-dimensional (2D) image of the 3D object, the 2D image associated with camera attributes and the normalized depth values corresponding to normalized relative depth values of each one of the keypoints with respect to a reference keypoint; computing, by a 3D reconstruction unit, a depth of the reference keypoint with respect to a camera based on the locations and the camera attributes; computing a scale normalized 3D pose of the 3D object based on the locations, the depth of the reference keypoint and the camera attributes; and generating, according to a scale factor, an absolute 3D pose of the 3D object from the scale normalized 3D pose.
2. The computer-implemented method of claim 1, wherein the scale factor is estimated and corresponds to a component of the 3D object.
3. The computer-implemented method of claim 1, wherein the scale factor is measured and corresponds to a component of the 3D object.
4. The computer-implemented method of claim 1, wherein the normalized depth values are computed relative to a reference keypoint.
5. The computer-implemented method of claim 4, wherein computing the scale normalized 3D pose is based on a depth of the reference keypoint that is calculated using the locations.
6. The computer-implemented method of claim 1, wherein the normalized depth values are invariant for changes in a scale of the 3D object.
7. The computer-implemented method of claim 1, wherein the normalized depth values are invariant for changes in translation of the 3D object.
8. A computer-implemented method, comprising: processing a two-dimensional (2D) input image of a three-dimensional (3D) object by a neural network model, according to a set of parameters, to produce latent depth data corresponding to a keypoint associated with the 3D object; obtaining latent pixel coordinate data corresponding to the keypoint; computing, based on the latent depth data and the latent pixel coordinate data, a depth value for the keypoint; and converting the latent pixel coordinate data into a pixel coordinate location for the keypoint.
9. The computer-implemented method of claim 8, wherein the depth value is a normalized depth value computed relative to a reference keypoint.
10. The computer-implemented method of claim 8, wherein the depth value is invariant for changes in a scale of the 3D object.
11. The computer-implemented method of claim 8, wherein the depth value is invariant for changes in a translation of the 3D object.
12. The computer-implemented method of claim 8, further comprising, when training the neural network model, updating the set of parameters to reduce differences between latent depth data produced by the neural network model and latent depth data corresponding to ground truth depth values of keypoints in a training dataset.
13. The computer-implemented method of claim 8, wherein the processing of the 2D input image of the 3D object by the neural network model further comprises: producing a latent 2D heatmap for the keypoint; and converting the latent 2D heatmap into the latent pixel coordinates.
14. The computer-implemented method of claim 12, further comprising, when training the neural network model, updating the set of parameters to reduce differences between the latent 2D heatmap produced by the neural network model and a latent 2D heatmap corresponding to ground truth pixel coordinate locations of keypoints in a training dataset.
15. The computer-implemented method of claim 8, wherein the computing comprises, for each keypoint, summing a Hadamard product of the latent depth data and the latent pixel coordinate data.
16. The computer-implemented method of claim 8, wherein the latent pixel coordinate data is a probability map generated from a latent 2D heatmap.
17. The computer-implemented method of claim 8, wherein a function used to convert the latent pixel coordinate data into the pixel coordinate location is fully differentiable.
18. The computer-implemented method of claim 8, wherein a function used to compute the depth value for the keypoint is fully differentiable.
19. The computer-implemented method of claim 8, further comprising adjusting the set of parameters to control a spread of the latent pixel coordinate data.
20. A system, comprising: a neural network configured to process a two-dimensional (2D) input image of a three-dimensional (3D) object, according to a set of parameters, to produce latent depth data corresponding to a keypoint associated with the 3D object; and a depth computation unit configured to: obtain latent pixel coordinate data corresponding to the keypoint; compute, based on the latent depth data and the latent pixel coordinate data, a depth value for the keypoint; and convert the latent pixel coordinate data into a pixel coordinate location for the keypoint.
21. A non-transitory computer-readable media storing computer instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of: processing a two-dimensional (2D) input image of a three-dimensional (3D) object by a neural network model, according to a set of parameters, to produce latent depth data corresponding to keypoints associated with the 3D object; obtaining latent pixel coordinate data corresponding to the keypoints; computing, based on the latent depth data and the latent pixel coordinate data, a depth value for each one of the keypoints; and converting the latent pixel coordinate data into pixel coordinate locations for each one of the keypoints.
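Claims 8 and 15 to 19 describe turning a latent 2D heatmap into a probability map, reading a pixel location from it with a differentiable function, and obtaining depth by summing the Hadamard (element-wise) product of that map with the latent depth data. One possible reading, sketched in Python below, uses a spatial softmax for the probability map and an expectation over pixel indices (often called soft-argmax) for the location. The softmax choice, the function names, and the parameter name `beta` for the spread control are assumptions made for illustration; the claim text shown here does not specify them.

```python
import math


def spatial_softmax(heatmap, beta=1.0):
    """Normalize a latent 2D heatmap (list of rows) into a probability map.

    beta is a hypothetical spread parameter: larger values concentrate the
    probability mass around the heatmap peak, beta = 0 gives a uniform map.
    """
    peak = max(max(row) for row in heatmap)  # subtracted for numerical stability
    exps = [[math.exp(beta * (v - peak)) for v in row] for row in heatmap]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def keypoint_from_latents(latent_heatmap, latent_depth, beta=1.0):
    """Return (x, y, depth) for a single keypoint.

    x, y:  expected column and row index under the probability map.
    depth: sum over pixels of probability times latent depth, i.e. the
           summed Hadamard product of the two maps.
    Every step is a smooth function of the inputs, so gradients can flow
    through it during training.
    """
    prob = spatial_softmax(latent_heatmap, beta)
    x = y = depth = 0.0
    for row_idx, (p_row, d_row) in enumerate(zip(prob, latent_depth)):
        for col_idx, (p, d) in enumerate(zip(p_row, d_row)):
            x += p * col_idx
            y += p * row_idx
            depth += p * d
    return x, y, depth


# Toy 3x3 maps with made-up values; the heatmap peaks at column 2, row 1.
latent_heatmap = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 10.0],
    [0.0, 0.0, 0.0],
]
latent_depth = [
    [0.1, 0.1, 0.1],
    [0.1, 0.1, 0.4],
    [0.1, 0.1, 0.1],
]
x, y, depth = keypoint_from_latents(latent_heatmap, latent_depth, beta=5.0)
```

With a sharp probability map the result is close to the peak location and the latent depth stored there; lowering `beta` spreads the map so that neighboring pixels contribute more to both the location and the depth.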

Assignees

Nvidia Corp

Inventors

Classifications

  • G06V10/82 (Primary)

    using neural networks · CPC title

  • using classification, e.g. of video objects · CPC title

  • Recognition of hand or arm movements, e.g. recognition of deaf sign language (static hand signs G06V40/113) · CPC title

  • Probabilistic or stochastic networks · CPC title

  • Combinations of networks · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10929654B2 cover?
Estimating a three-dimensional (3D) pose of an object, such as a hand or body (human, animal, robot, etc.), from a 2D image is necessary for human-computer interaction. A hand pose can be represented by a set of points in 3D space, called keypoints. Two coordinates (x,y) represent spatial displacement and a third coordinate represents a depth of every point with respect to the camera. A monocul…
Who is the assignee on this patent?
Nvidia Corp
What technology area does this patent fall under?
Primary CPC classification G06V10/82. Mapped technology areas include Physics.
When was this patent published?
Publication date: Feb 23, 2021 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).