Methods and systems for 3D hand pose estimation from RGB images

US12148097B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12148097-B2
Application number  US-202218078832-A
Country             US
Kind code           B2
Filing date         Dec 9, 2022
Priority date       Dec 9, 2022
Publication date    Nov 19, 2024
Grant date          Nov 19, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods and systems for estimation of a 3D hand pose are disclosed. A 2D image containing a detected hand is processed using a U-net network to obtain a global feature vector and a heatmap for the keypoints of the hand. Information from the global feature vector and the heatmap are concatenated to obtain a set of input tokens that are processed using a transformer encoder to obtain a first set of 2D keypoints representing estimated 2D locations of the keypoints in a first view. The first set of 2D keypoints are inputted as a query to a transformer decoder, to obtain a second set of 2D keypoints representing estimated 2D locations of the keypoints in a second view. The first and second sets of 2D keypoints are aggregated to output the set of estimated 3D keypoints.
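
The abstract does not say how the two sets of 2D keypoints are aggregated. As a minimal sketch only, assume the first view gives (x, y) coordinates in the image plane and the second view gives (z, y) coordinates of a side view that shares the vertical axis. Under that assumption, one simple aggregation takes x from the first view, z from the second, and averages the two y estimates. The view layout, the averaging step, and the name `aggregate_views` are all hypothetical and are not taken from the patent text.

```python
from typing import List, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]


def aggregate_views(front: List[Point2D], side: List[Point2D]) -> List[Point3D]:
    """Combine per-keypoint (x, y) and (z, y) estimates into (x, y, z).

    Assumption (not from the patent): both views share the vertical
    axis y, so the two y estimates are simply averaged.
    """
    if len(front) != len(side):
        raise ValueError("both views must contain the same number of keypoints")
    points = []
    for (x, y_front), (z, y_side) in zip(front, side):
        points.append((x, (y_front + y_side) / 2.0, z))
    return points


# Two keypoints in normalized coordinates (values are made up).
front_view = [(0.50, 0.80), (0.25, 0.25)]   # (x, y) in the captured view
side_view = [(0.00, 0.80), (0.125, 0.75)]   # (z, y) in the assumed second view
keypoints_3d = aggregate_views(front_view, side_view)
```

A real system could weight the two estimates differently or learn the aggregation; the averaging here is only the simplest choice consistent with the stated assumption.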

First claim

Opening claim text (preview).

The invention claimed is:

1. A computing system comprising: a processing unit configured to execute instructions to cause the computing system to estimate a set of 3D keypoints representing a 3D hand pose by: processing a 2D image containing a detected hand using a U-net network to obtain a global feature vector and a heatmap for each of the keypoints; concatenating information from the global feature vector and the heatmap to obtain a set of input tokens; processing the input tokens using a transformer encoder to obtain a first set of 2D keypoints representing estimated 2D locations of the keypoints in a first 2D view; inputting the first set of 2D keypoints as a query to a transformer decoder, with cross-attention from the transformer encoder, to obtain a second set of 2D keypoints representing estimated 2D locations of the keypoints in a second 2D view; and aggregating the first and second sets of 2D keypoints to output the set of estimated 3D keypoints.

2. The computing system of claim 1, wherein the processing unit is further configured to execute instructions to cause the computing system to perform operations of: estimating a set of one or more bone lengths from the first set of 2D keypoints; processing the first set of 2D keypoints and the estimated set of one or more bone lengths using respective one-layer feedforward networks, to obtain a plurality of hand class estimations representing a handedness of the detected hand; and aggregating the plurality of hand class estimations to obtain an estimated hand class representing an estimated handedness of the detected hand.

3. The computing system of claim 1, wherein the processing unit is further configured to execute instructions to cause the computing system to perform operations of: rendering a 3D model representation of the detected hand by mapping the set of estimated 3D keypoints to a 3D mesh.

4. The computing system of claim 1, wherein the processing unit is further configured to execute instructions to cause the computing system to perform operations of: performing gesture recognition by processing the set of estimated 3D keypoints using a gesture recognition software module.

5. The computing system of claim 1, wherein the first set of 2D keypoints represent estimated 2D locations of the keypoints in the first 2D view corresponding to a view of the detected hand captured in the 2D image, and wherein the second set of 2D keypoints represent estimated 2D locations of the keypoints in the second 2D view corresponding to a different view of the detected hand not captured in the 2D image.

6. The computing system of claim 1, wherein concatenating information from the global feature vector and the heatmap to obtain a set of input tokens comprises: processing the global feature vector using a feedforward network to obtain a regressor representing a 2D global estimation of each respective keypoint; processing the heatmap using a spatial softmax layer to obtain a 2D heatmap estimation of each respective keypoint; and concatenating the global feature vector with the 2D global estimation and the 2D heatmap estimation to obtain the input token for each respective keypoint.

7. The computing system of claim 1, wherein the U-net network, the transformer encoder and the transformer decoder are trained together end-to-end, using a training dataset comprising 2D images annotated with keypoints corresponding to joints of a hand and without annotation of vertices corresponding to a 3D mesh of the hand.

8. The computing system of claim 1, wherein the computing system is one of: a mobile device; a smart appliance; an Internet of Things (IOT) device; an augmented reality (AR) device; or a virtual reality (VR) device.

9. A computer-implemented method for estimating a set of 3D keypoints representing a 3D hand pose comprising: processing a 2D image containing a detected hand using a U-net network to obtain a global feature vector and a heatmap for each of the keypoints; concatenating information from the global feature vector and the heatmap to obtain a set of input tokens; processing the input tokens using a transformer encoder to obtain a first set of 2D keypoints representing estimated 2D locations of the keypoints in a first 2D view; inputting the first set of 2D keypoints as a query to a transformer decoder, with cross-attention from the transformer encoder, to obtain a second set of 2D keypoints representing estimated 2D locations of the keypoints in a second 2D view; and aggregating the first and second sets of 2D keypoints to output the set of estimated 3D keypoints.

10. The method of claim 9, further comprising: estimating a set of one or more bone lengths from the first set of 2D keypoints; processing the first set of 2D keypoints and the estimated set of one or more bone lengths using respective one-layer feedforward networks, to obtain a plurality of hand class estimations representing a handedness of the detected hand; and aggregating the plurality of hand class estimations to obtain an estimated hand class representing an estimated handedness of the detected hand.

11. The method of claim 9, further comprising: rendering a 3D model representation of the detected hand by mapping the set of estimated 3D keypoints to a 3D mesh.

12. The method of claim 9, further comprising: performing gesture recognition by processing the set of estimated 3D keypoints using a gesture recognition software module.

13. The method of claim 9, wherein the first set of 2D keypoints represent estimated 2D locations of the keypoints in the first 2D view corresponding to a view of the detected hand captured in the 2D image, and wherein the second set of 2D keypoints represent estimated 2D locations of the keypoints in the second 2D view corresponding to a different view of the detected hand not captured in the 2D image.

14. The method of claim 9, wherein concatenating information from the global feature vector and the heatmap to obtain a set of input tokens comprises: processing the global feature vector using a feedforward network to obtain a regressor representing a 2D global estimation of each respective keypoint; processing the heatmap using a spatial softmax layer to obtain a 2D heatmap estimation of each respective keypoint; and concatenating the global feature vector with the 2D global estimation and the 2D heatmap estimation to obtain the input token for each respective keypoint.

15. The method of claim 9, wherein the U-net network, the transformer encoder and the transformer decoder are trained together end-to-end, using a training dataset comprising 2D images annotated with keypoints corresponding to joints of a hand and without annotation of vertices corresponding to a 3D mesh of the hand.

16. A non-transitory computer-readable medium having instructions encoded thereon, wherein the instructions are executable by a processing unit of a computing system to cause the computing system to perform operations of: processing a 2D image containing a detected hand using a U-net network to obtain a global feature vector and a heatmap for each of the keypoints; concatenating information from the global feature vector and the heatmap to obtain a set of input tokens; processing the input tokens using a transformer encoder to obtain a first set of 2D keypoints representing estimated 2D locations of the keypoints in a first 2D view; inputting the first set of 2D keypoints as a query to a transformer decoder, with cross-attention from the transformer encoder, to obtain a second set of 2D keypoints representing estimat
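
Claims 6 and 14 describe how each input token is built: a feedforward network regresses a 2D global estimation from the global feature vector, a spatial softmax turns the keypoint's heatmap into a 2D heatmap estimation, and the three pieces are concatenated. The following is a rough sketch of that step in plain Python. The single linear layer standing in for the feedforward network, the toy weights, the feature size of 4, and the 3x3 heatmap are all assumptions chosen for illustration; the claims do not specify them.

```python
import math
from typing import List, Sequence, Tuple


def spatial_softmax_2d(heatmap: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Softmax over all pixels, then the expected (x, y) pixel coordinate."""
    peak = max(max(row) for row in heatmap)  # subtracting the max avoids overflow
    weights = [[math.exp(v - peak) for v in row] for row in heatmap]
    total = sum(sum(row) for row in weights)
    x = sum(w * col for row in weights for col, w in enumerate(row)) / total
    y = sum(w * r for r, row in enumerate(weights) for w in row) / total
    return x, y


def linear(weights, bias, vector) -> List[float]:
    """One linear layer (weights @ vector + bias), a stand-in for the feedforward network."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, bias)]


def build_token(global_feature, global_xy, heatmap_xy) -> List[float]:
    """Concatenate the global feature, the 2D global estimation, and the 2D heatmap estimation."""
    return [*global_feature, *global_xy, *heatmap_xy]


# Toy inputs for a single keypoint (all values hypothetical).
global_feature = [0.5, -1.0, 0.25, 2.0]
regressor_w = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.5]]
regressor_b = [0.5, 0.0]
heatmap = [[0.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 0.0]]

global_xy = linear(regressor_w, regressor_b, global_feature)
heatmap_xy = spatial_softmax_2d(heatmap)
token = build_token(global_feature, global_xy, heatmap_xy)
```

In this sketch the token has length 4 + 2 + 2 = 8. Because the spatial softmax returns an expectation rather than an argmax, the heatmap estimation stays differentiable, which is the usual reason for choosing it when a network is trained end-to-end.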

Assignees

Inventors

Classifications

  • Recognition of hand or arm movements, e.g. recognition of deaf sign language (static hand signs G06V40/113) · CPC title

  • G06F3/017 · Primary

    Gesture based interaction, e.g. based on a set of recognized hand gestures (interaction based on gestures traced on a digitiser G06F3/04883) · CPC title

  • G06T17/20 · Primary

    Finite element generation, e.g. wire-frame surface description, {tesselation} · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12148097B2 cover?
Methods and systems for estimation of a 3D hand pose are disclosed. A 2D image containing a detected hand is processed using a U-net network to obtain a global feature vector and a heatmap for the keypoints of the hand. Information from the global feature vector and the heatmap are concatenated to obtain a set of input tokens that are processed using a transformer encoder to obtain a first set …
Who is the assignee on this patent?
Verdie Yannick, Yang Zihao, Sridhar Deepak, and 3 more
What technology area does this patent fall under?
Primary CPC classification G06F3/017. Mapped technology areas include Physics.
When was this patent published?
Publication date: Nov 19, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).